# Learning Abstraction - learning a commuting diagram with a NN

In this notebook we switch from studying abstractions through their definition, properties and implications to considering learning problems. Our attention will focus on how abstractions (as explained in the previous notebooks) may be learned from data. We will start by defining artificial problems and then examine how they can be solved using machine learning models.

In this notebook we review how we could learn an entire commuting diagram starting from data collected from two models (a low-level model and a high-level model) undergoing corresponding interventions. We will solve the learning problem by deploying simple neural networks with internal paths tailored to reflect the shape of our commuting diagrams.

This notebook was developed to exploit the framework introduced in [Rischel2020] for learning. The notebook is structured as follows:
- Import of relevant modules and classes (Section 2)
- Definition of the problem (Section 3)
- First case study (Section 4)
- Second case study (Section 5)

DISCLAIMER 1: the notebook refers to ideas from *causality* and *category theory* for which only a quick definition is offered. Useful references for causality are [Pearl2009,Peters2017]; for category theory, see [Spivak2014,Fong2018].

DISCLAIMER 2: mistakes are in all likelihood due to misunderstandings by the notebook author in reading [Rischel2020] and/or [Rubenstein2017]. Feedback very welcome! :)

# Setup

## Standard libraries
We start by importing standard and custom libraries.

In [1]:
import numpy as np
import networkx as nx
import itertools
import pandas as pd
from scipy.spatial import distance

from pgmpy.models import BayesianNetwork as BN
from pgmpy.factors.discrete import TabularCPD as cpd
from pgmpy.inference import VariableElimination

import torch
import torch.nn as nn

from src.SCMMappings import Abstraction

For reproducibility, and for discussing our results in this notebook, we set the NumPy random seed to $1985$.

In [2]:
np.random.seed(1985)

We also define a number of samples to be collected in simulations.

In [3]:
n_samples = 10**4

# Problem statement

## Assumptions
We assume that two research groups (LOW and HIGH) have developed different models of the same system in the following scenario:
1. The two models represent the same system at different levels of detail;
2. The two groups do not know the analytic form of their model (they can draw the DAG of causal dependencies, but they do not know the form of the structural equations);
3. The two groups can still sample data from the model (they can run a black-box model or collect data from the real-world systems they are modelling);
4. The two groups are interested in a particular interventional distribution (typically they both want to know the distribution of an outcome given an intervention on a treatment);
5. The two groups are aware of how to align their interventions (they know how manipulations of the treatment variables relate in the two models).

## Objective
Let us consider our usual commuting diagram. Given (interventional) data $\mathcal{D}$, we want to learn the functions that make it commute:

$$
\begin{array}{ccc}
\mathcal{\mathcal{M}_{do}}\left[\mathbf{A}\right] & \overset{\mathcal{\mathcal{M}_{do}}\left[\phi_{\mathbf{B}}\right]}{\longrightarrow} & \mathcal{\mathcal{M}_{do}}\left[\mathbf{B}\right]\\
\sideset{}{\alpha_{\mathbf{X}}}\downarrow &  & \sideset{}{\alpha_{\mathbf{Y}}}\downarrow\\
\mathcal{\mathcal{M'}_{do}}\left[\mathbf{X}\right] & \overset{\mathcal{\mathcal{M'}_{do}}\left[\phi_{\mathbf{Y}}\right]}{\longrightarrow} & \mathcal{\mathcal{M'}_{do}}\left[\mathbf{Y}\right]
\end{array}
$$

where:
- $\mathcal{M'}_{do}\left[\mathbf{X}\right]$ and $\mathcal{M'}_{do}\left[\mathbf{Y}\right]$ are disjoint subsets of variables in the high-level model;
- $\mathcal{M}_{do}\left[\mathbf{A}\right]$ and $\mathcal{M}_{do}\left[\mathbf{B}\right]$ are the corresponding subsets of variables in the low-level model;
- $\mathcal{M}_{do}\left[\phi_\mathbf{B}\right]$ and $\mathcal{M'}_{do}\left[\phi_\mathbf{Y}\right]$ are the mechanisms;
- $\alpha_{\mathbf{X}}$ and $\alpha_{\mathbf{Y}}$ are the abstractions.

In other words the two research groups (LOW and HIGH) want to learn at the same time (i) the mechanisms in their models and (ii) the relation of abstraction between them. 

## Data

The two groups are going to sample aligned data. Following Assumption 5, this means that when group LOW performs the intervention of interest ($do(\mathbf{a})$) and collects data ($\mathbf{a},\mathbf{b}$), group HIGH performs the corresponding intervention ($do(\mathbf{x})$) and collects its own data ($\mathbf{x},\mathbf{y}$).

These two pieces of data are then assembled into a single data sample in $\mathcal{D}$ that takes the form of a tuple:
$$ (\mathbf{a}, \mathbf{b}, \mathbf{x}, \mathbf{y}) $$
where:
- $\mathbf{b} = \mathcal{M}_{do}\left[\phi_{\mathbf{B}}\right](\mathbf{a})$
- $\mathbf{x} = \alpha_{\mathbf{X}}(\mathbf{a})$
- $\mathbf{y} = \mathcal{M'}_{do}\left[\phi_{\mathbf{Y}}\right](\mathbf{x}) = \alpha_{\mathbf{Y}}(\mathbf{b})$
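As a minimal illustration of how one aligned tuple is assembled, the sketch below uses small deterministic maps. All four functions are hypothetical stand-ins chosen so that the diagram commutes exactly; they are not the mechanisms or abstractions used in the examples of this notebook.

```python
# Toy illustration only: these deterministic maps are hypothetical stand-ins,
# chosen so that the square commutes exactly.

def phi_B(a):
    """Low-level mechanism: a in {0,1,2,3} -> b in {4,5,6,7}."""
    return a + 4

def alpha_X(a):
    """Abstraction on the intervened variable: a -> x in {0,1}."""
    return a // 2

def phi_Y(x):
    """High-level mechanism: x in {0,1} -> y in {2,3}."""
    return x + 2

def alpha_Y(b):
    """Abstraction on the outcome variable: b -> y in {2,3}."""
    return b // 2

def make_sample(a):
    """Assemble one aligned tuple (a, b, x, y) for the intervention do(a)."""
    b = phi_B(a)      # group LOW observes b under do(a)
    x = alpha_X(a)    # the aligned high-level intervention
    y = phi_Y(x)      # group HIGH observes y under do(x)
    return (a, b, x, y)

dataset = [make_sample(a) for a in range(4)]

# Commutativity: abstracting the low-level outcome gives the high-level outcome.
commutes = all(alpha_Y(b) == y for (_, b, _, y) in dataset)
print(dataset, commutes)
```

In the examples below the mechanisms are stochastic, so the same construction yields samples rather than a single deterministic value for each intervention.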

Notice that the fact that the two groups know how to align interventions does not necessarily mean they know the function $\alpha_{\mathbf{X}}$; their knowledge of this function may be restricted to a small subset of interventions. By learning the whole commuting diagram, knowledge of $\alpha_{\mathbf{X}}$ may be generalized to the whole domain of $\mathbf{A}$.

## Solution
We implement a neural network that takes $\mathbf{a}$ and $\mathbf{x}$ as input and learns $\mathcal{M}_{do}\left[\phi_{\mathbf{B}}\right]$, $\mathcal{M'}_{do}\left[\phi_{\mathbf{Y}}\right]$, $\alpha_{\mathbf{X}}$ and $\alpha_{\mathbf{Y}}$ by predicting $\mathbf{\hat{b}}$, $\mathbf{\hat{y}}$ and $\mathbf{\hat{x}}$, and by matching the results of the upper path and the lower path of the commuting diagram.

Specifically, we define the loss function as:
$$
\mathcal{L} = \ell(\mathbf{\hat{b}},\mathbf{b}) + \ell(\mathbf{\hat{y}},\mathbf{y}) + \ell(\mathbf{\hat{x}},\mathbf{x}) + \ell(\mathbf{\hat{y}}_{up},\mathbf{\hat{y}}_{low})
$$
where:
- $\ell(\mathbf{\hat{b}},\mathbf{b})$ is the prediction error for the low-level mechanism
- $\ell(\mathbf{\hat{y}},\mathbf{y})$ is the prediction error for the high-level mechanism
- $\ell(\mathbf{\hat{x}},\mathbf{x})$ is the prediction error for the abstraction on $\mathbf{A}$, where $\mathbf{\hat{x}}$ is the network estimate of $\alpha_{\mathbf{X}}(\mathbf{a})$
- $\ell(\mathbf{\hat{y}}_{up},\mathbf{\hat{y}}_{low})$ is the prediction error for the commutativity.
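The four-term loss can be sketched in plain NumPy as follows. This is only an illustration with made-up numbers: the helper names `mse` and `diagram_loss` are hypothetical, and the abstraction term is written, as in the training cells below, as the error between the predicted high-level intervention and the observed $\mathbf{x}$.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between two arrays of equal shape."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

def diagram_loss(b_hat, b, y_hat, y, x_hat, x, y_up, y_low):
    """Sum of the four terms: two mechanisms, one abstraction, commutativity."""
    return (mse(b_hat, b)         # low-level mechanism
            + mse(y_hat, y)       # high-level mechanism
            + mse(x_hat, x)       # abstraction on the intervened variable
            + mse(y_up, y_low))   # agreement between the two paths

# Made-up predictions and targets for a batch of two samples.
loss = diagram_loss(b_hat=[0.4, 0.1], b=[1, 0],
                    y_hat=[0.3, 0.1], y=[0, 0],
                    x_hat=[0.9, 0.0], x=[1, 0],
                    y_up=[0.30, 0.10], y_low=[0.25, 0.10])
print(loss)
```

Note that the last term compares two network outputs with each other rather than a prediction with an observed target, so it constrains the learned functions to be mutually consistent.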

# Example 1

## Definition of models and abstraction 

We define the standard *lung cancer scenario* models we have used in the previous notebooks. For a detailed description, see the first notebook *Categorical Abstraction.ipynb*.

In [4]:
M0 = BN([('Smoking','Tar'),('Tar','Cancer')])

cpdS = cpd(variable='Smoking',
          variable_card=2,
          values=[[.8],[.2]],
          evidence=None,
          evidence_card=None)
cpdT = cpd(variable='Tar',
          variable_card=2,
          values=[[1,.2],[0.,.8]],
          evidence=['Smoking'],
          evidence_card=[2])
cpdC = cpd(variable='Cancer',
          variable_card=2,
          values=[[.9,.6],[.1,.4]],
          evidence=['Tar'],
          evidence_card=[2])

M0.add_cpds(cpdS,cpdT,cpdC)
M0.check_model()

True

In [5]:
M1 = BN([('Smoking_','Cancer_')])

cpdS = cpd(variable='Smoking_',
          variable_card=2,
          values=[[.8],[.2]],
          evidence=None,
          evidence_card=None)
cpdC = cpd(variable='Cancer_',
          variable_card=2,
          values=[[.9,.66],[.1,.34]],
          evidence=['Smoking_'],
          evidence_card=[2])

M1.add_cpds(cpdS,cpdC)
M1.check_model()

True

We define the abstraction between the two models

In [6]:
R = ['Smoking','Cancer']

a = {'Smoking': 'Smoking_',
    'Cancer': 'Cancer_'}
alphas = {'Smoking_': np.eye(2),
         'Cancer_': np.eye(2)}

In [7]:
A = Abstraction(M0,M1,R,a,alphas)

We will then pay attention to the interventional distribution of interest $P(C'\vert do(S'))$ and learn the following diagram:
$$
\begin{array}{ccc}
\mathcal{\mathcal{M}_{do}}\left[{S}\right] & \overset{\mathcal{\mathcal{M}_{do}}\left[\phi_{\tilde{C}}\right]}{\longrightarrow} & \mathcal{\mathcal{M}_{do}}\left[C\right]\\
\sideset{}{\alpha_{S'}}\downarrow &  & \sideset{}{\alpha_{C'}}\downarrow\\
\mathcal{\mathcal{M'}_{do}}\left[S'\right] & \overset{\mathcal{\mathcal{M'}_{do}}\left[\phi_{C'}\right]}{\longrightarrow} & \mathcal{\mathcal{M'}_{do}}\left[C'\right]
\end{array}
$$

## Data collection

We generate data from the low-level model for learning. This will provide the first part $(\mathbf{a},\mathbf{b})$ of our datapoints. For readability we change the notation from $(\mathbf{a},\mathbf{b})$ to $(\mathbf{s},\mathbf{c})$ to denote samples of smoking and cancer from the low-level model $\mathcal{M}$.

In [8]:
M0do = M0.do('Smoking')
lowlevel_data = M0do.simulate(n_samples=n_samples, show_progress=False)
print(lowlevel_data)

      Tar  Smoking  Cancer
0       0        0       0
1       0        0       0
2       0        0       0
3       0        0       0
4       0        0       0
...   ...      ...     ...
9995    0        0       0
9996    0        0       0
9997    0        0       0
9998    0        0       0
9999    0        0       0

[10000 rows x 3 columns]


Similarly, we generate data from the high-level model. This is meant to provide the second part $(\mathbf{x},\mathbf{y})$ or $(\mathbf{s'},\mathbf{c'})$ of our datapoints. Notice that we need to align our datapoints correctly, since we want $\mathbf{s'} = \alpha_{S'}(\mathbf{s})$; therefore we generate the datapoints one by one.

**TODO: this generation method is highly inefficient!!!**

In [9]:
M1do = M1.do('Smoking_')

high_level_samples = [M1do.simulate(n_samples=1, evidence={'Smoking_': lowlevel_data.loc[i]['Smoking']}, show_progress=False) for i in range(lowlevel_data.shape[0])]
highlevel_data = pd.concat(high_level_samples)
    
print(highlevel_data)

    Cancer_  Smoking_
0         0         0
0         0         0
0         0         0
0         0         0
0         0         0
..      ...       ...
0         0         0
0         0         0
0         0         0
0         0         0
0         0         0

[10000 rows x 2 columns]
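A possible faster alternative to the one-by-one generation is sketched below. It is an assumption-laden shortcut: it bypasses `pgmpy` and samples directly from the table $P(C' \vert do(S'))$ defined for `M1` above, which is only possible here because we built the model ourselves (the groups in our scenario would only have black-box access). The array `low_S` is a synthetic stand-in for `lowlevel_data['Smoking']`.

```python
import numpy as np

rng = np.random.default_rng(1985)
n = 10_000

# Synthetic stand-in for lowlevel_data['Smoking'] (assumption of this sketch).
low_S = rng.choice([0, 1], size=n, p=[0.8, 0.2])

# P(Cancer_=1 | do(Smoking_=s')) read off the CPD of M1 defined above.
p_cancer_given_S = np.array([0.1, 0.34])

# Alignment: here the abstraction on Smoking is the identity.
high_S = low_S.copy()

# Draw all the high-level outcomes at once, one Bernoulli draw per row.
high_C = (rng.random(n) < p_cancer_given_S[high_S]).astype(int)

print(high_C[high_S == 1].mean(), high_C[high_S == 0].mean())
```

Each row of `high_S` and `high_C` stays aligned with the corresponding row of the low-level sample, which is the property the list comprehension above enforces.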


## Learning model definition

We now define a simple neural network with the modules and loss functions described above.

We start with some hyperparameters.

In [10]:
lr = 0.005
hiddensize = 4
num_epochs = 2000

We specify its architecture. Notice how the layers are structured to mimic the arrows in our commuting diagrams.

In [11]:
class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(1,hiddensize)
        self.sm1 = nn.Sigmoid()
        self.fc2 = nn.Linear(hiddensize,1)
        self.lowmech = nn.Sigmoid()
        
        self.fc3 = nn.Linear(1,hiddensize)
        self.sm2 = nn.Sigmoid()
        self.fc4 = nn.Linear(hiddensize,1)
        self.absS = nn.Sigmoid()
        
        self.fc5 = nn.Linear(1,hiddensize)
        self.sm3 = nn.Sigmoid()
        self.fc6 = nn.Linear(hiddensize,1)
        self.highmech = nn.Sigmoid()
        
        self.fc7 = nn.Linear(1,hiddensize)
        self.sm4 = nn.Sigmoid()
        self.fc8 = nn.Linear(hiddensize,1)
        self.absC = nn.Sigmoid()
        
    def forward(self, input_lowS, input_highS):
        out = self.fc1(input_lowS)
        out = self.sm1(out)
        out = self.fc2(out)
        mech_lowS = self.lowmech(out)
        
        out = self.fc3(input_lowS)
        out = self.sm2(out)
        out = self.fc4(out)
        abs_lowS = self.absS(out)
        
        out = self.fc5(input_highS)
        out = self.sm3(out)
        out = self.fc6(out)
        mech_highS = self.highmech(out)
        
        out = self.fc5(abs_lowS)
        out = self.sm3(out)
        out = self.fc6(out)
        mech_abs_lowS = self.highmech(out)
        
        out = self.fc7(mech_lowS)
        out = self.sm4(out)
        out = self.fc8(out)
        abs_mech_lowS = self.absC(out)
        
        return mech_lowS, abs_lowS, mech_highS, abs_mech_lowS, mech_abs_lowS 

We instantiate the model and its optimizer. For every term of the loss we use the mean squared error (MSE), one of the simplest loss functions provided by *PyTorch*.

In [12]:
model = NeuralNet()

In [13]:
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

loss_lowmech = nn.MSELoss()
loss_highmech = nn.MSELoss()
loss_abstraction = nn.MSELoss()
loss_commutativity = nn.MSELoss()
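One reason why MSE is a sensible choice with binary targets: the constant that minimizes the mean squared error against 0/1 outcomes is their mean, so a network trained this way is pushed towards the probability of the outcome being 1. The sketch below checks this numerically on synthetic data; the rate $0.34$ and the seed are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for this sketch

# Synthetic 0/1 outcomes drawn with probability 0.34 of being 1.
targets = (rng.random(2000) < 0.34).astype(float)

# Evaluate the MSE of every constant prediction on a fine grid in [0, 1].
candidates = np.linspace(0.0, 1.0, 1001)
mse_per_candidate = ((candidates[:, None] - targets[None, :]) ** 2).mean(axis=1)

# The best constant prediction sits at the grid point closest to the sample mean.
best = candidates[np.argmin(mse_per_candidate)]
print(best, targets.mean())
```

This is why, in the tests below, the sigmoid outputs of the mechanism branches are compared with interventional probabilities such as $P(C=1 \vert do(S=1))$.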

## Training

We train the network with our data.

In [14]:
# Build the input and target tensors once: they do not change across epochs.
input_lowS = torch.from_numpy(np.expand_dims(lowlevel_data['Smoking'].to_numpy(dtype=np.float32),axis=1))
input_highS = torch.from_numpy(np.expand_dims(highlevel_data['Smoking_'].to_numpy(dtype=np.float32),axis=1))
target_lowC = torch.from_numpy(np.expand_dims(lowlevel_data['Cancer'].to_numpy(dtype=np.float32),axis=1))
target_highC = torch.from_numpy(np.expand_dims(highlevel_data['Cancer_'].to_numpy(dtype=np.float32),axis=1))

for epoch in range(num_epochs):
    # Forward pass through the mechanisms, the abstractions and the two composed paths.
    mech_lowS, abs_lowS, mech_highS, abs_mech_lowS, mech_abs_lowS = model(input_lowS, input_highS)
    
    l1 = loss_lowmech(mech_lowS,target_lowC)               # low-level mechanism
    l2 = loss_highmech(mech_highS,target_highC)            # high-level mechanism
    l3 = loss_abstraction(abs_lowS,input_highS)            # abstraction on the intervened variable
    l4 = loss_commutativity(abs_mech_lowS,mech_abs_lowS)   # agreement between the two paths
    loss = l1+l2+l3+l4
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if (epoch+1) % 50 == 0:
        print ('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, loss.item()))

Epoch [50/2000], Loss: 0.6090
Epoch [100/2000], Loss: 0.4534
Epoch [150/2000], Loss: 0.4034
Epoch [200/2000], Loss: 0.3788
Epoch [250/2000], Loss: 0.3569
Epoch [300/2000], Loss: 0.3336
Epoch [350/2000], Loss: 0.3112
Epoch [400/2000], Loss: 0.2922
Epoch [450/2000], Loss: 0.2778
Epoch [500/2000], Loss: 0.2676
Epoch [550/2000], Loss: 0.2604
Epoch [600/2000], Loss: 0.2553
Epoch [650/2000], Loss: 0.2517
Epoch [700/2000], Loss: 0.2490
Epoch [750/2000], Loss: 0.2470
Epoch [800/2000], Loss: 0.2454
Epoch [850/2000], Loss: 0.2442
Epoch [900/2000], Loss: 0.2432
Epoch [950/2000], Loss: 0.2424
Epoch [1000/2000], Loss: 0.2417
Epoch [1050/2000], Loss: 0.2411
Epoch [1100/2000], Loss: 0.2406
Epoch [1150/2000], Loss: 0.2401
Epoch [1200/2000], Loss: 0.2397
Epoch [1250/2000], Loss: 0.2393
Epoch [1300/2000], Loss: 0.2390
Epoch [1350/2000], Loss: 0.2386
Epoch [1400/2000], Loss: 0.2383
Epoch [1450/2000], Loss: 0.2379
Epoch [1500/2000], Loss: 0.2376
Epoch [1550/2000], Loss: 0.2373
Epoch [1600/2000], Loss: 0.2

## Testing

We define a test function to output a readable result.

In [15]:
def smalltest(model, test_lowS,test_highS):
    mech_lowS, abs_lowS, mech_highS, abs_mech_lowS, mech_abs_lowS = model(test_lowS,test_highS)
    
    print('-------------------------')
    print('Given I perform the intervention S={0}'.format(test_lowS.item()))
    print('The prediction of the low level mechanism P_M0 (C=1 | S={0}) outputs {1}'.format(test_lowS.item(),mech_lowS.item()))
    
    print('\nThe prediction of the abstraction to high-level alpha_S({0}) outputs {1} against the expected {2}'.format(test_lowS.item(),abs_lowS.item(),test_highS.item()))
    
    print('\nThe prediction of the high level mechanism P_M1 (C=1 | S={0}) outputs {1} using the input {2}'.format(test_highS.item(),mech_highS.item(),test_highS.item()))
    print('The prediction of the high level mechanism P_M1 (C=1 | S={0}) outputs {1} using the transformed input {2}'.format(abs_lowS.item(),mech_abs_lowS.item(),test_lowS.item()))
    
    print('\nThe upper path produces {0}'.format(mech_abs_lowS.item()))
    print('The lower path produces {0}'.format(abs_mech_lowS.item()))
    print('-------------------------')

Let us check what our network produces with an input of $(\mathbf{s}=1,\mathbf{s'}=1)$. Notice that $\mathbf{s'}$ is not needed to traverse the diagram, but we will use it to see how the correct high-level interventions would be processed by the network.

Remember we are trying to learn $\mathcal{M}[\phi_{\tilde{C}}]$, $\mathcal{M'}[\phi_{C'}]$ (which is the same as $\mathcal{M}[\phi_{\tilde{C}}]$), and $\alpha_{S'}$ and $\alpha_{C'}$, which are both identities. As a reference, we first output the value of the mechanism $\mathcal{M}[\phi_{\tilde{C}}] = \mathcal{M'}[\phi_{C'}] = P(Cancer \vert do(Smoking))$ which we are trying to learn.

In [16]:
inference = VariableElimination(M0do)
M0_joint_TS = inference.query(['Cancer','Smoking'],show_progress=False)
M0_joint_S = inference.query(['Smoking'],show_progress=False)
M0_cond_TS = M0_joint_TS / M0_joint_S
M0_cond_TS.values

array([[0.9 , 0.1 ],
       [0.66, 0.34]])

Looking at the above matrix, we know that under intervention $S=1$, $P(C=1 \vert do(S=1)) = 0.34$.
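The same numbers can be recovered by hand from the CPDs of `M0`. Since *Smoking* has no parents, intervening on it leaves the rest of the chain untouched, so $P(C \vert do(S)) = \sum_t P(C \vert T=t)\,P(T=t \vert S)$, which is a matrix product of the two CPD tables. A small sketch (the array names are introduced here only for illustration):

```python
import numpy as np

# CPD tables as defined for M0 above: columns index the parent value.
P_T_given_S = np.array([[1.0, 0.2],    # P(Tar=0 | Smoking=0), P(Tar=0 | Smoking=1)
                        [0.0, 0.8]])   # P(Tar=1 | Smoking=0), P(Tar=1 | Smoking=1)
P_C_given_T = np.array([[0.9, 0.6],    # P(Cancer=0 | Tar=0), P(Cancer=0 | Tar=1)
                        [0.1, 0.4]])   # P(Cancer=1 | Tar=0), P(Cancer=1 | Tar=1)

# Marginalize Tar: rows index Cancer, columns index the intervened Smoking value.
P_C_given_doS = P_C_given_T @ P_T_given_S
print(P_C_given_doS)
```

The result has columns indexed by the value of *Smoking*, so it is the transpose of the array printed above; it also coincides with the CPD of `Cancer_` in the high-level model `M1`.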

In [17]:
test_lowS = torch.from_numpy(np.array([1.],dtype=np.float32))
test_highS = torch.from_numpy(np.array([1.],dtype=np.float32))

smalltest(model, test_lowS,test_highS)

-------------------------
Given I perform the intervention S=1.0
The prediction of the low level mechanism P_M0 (C=1 | S=1.0) outputs 0.37025806307792664

The prediction of the abstraction to high-level alpha_S(1.0) outputs 0.943284273147583 against the expected 1.0

The prediction of the high level mechanism P_M1 (C=1 | S=1.0) outputs 0.3083462417125702 using the input 1.0
The prediction of the high level mechanism P_M1 (C=1 | S=0.943284273147583) outputs 0.29659566283226013 using the transformed input 1.0

The upper path produces 0.29659566283226013
The lower path produces 0.2572880983352661
-------------------------


The network roughly learnt our diagram:
- The prediction $\hat{P}_\mathcal{M}(C=1 \vert do(S=1))$ is close to $0.34$;
- The prediction $\alpha_S(1)$ is close to $1$;
- The prediction $\hat{P}_\mathcal{M'}(C'=1 \vert do(S'=1))$ is roughly $0.34$ when using the input data $\mathbf{s'}$; it is further away when using $\alpha_S(\mathbf{s})$ because of accumulation of approximations.
- The results along the upper and lower path are close.

We could perform the same analysis for the intervention $S=0$, where we expect $P(C=1 \vert do(S=0)) = 0.1$.

In [18]:
test_lowS = torch.from_numpy(np.array([0.],dtype=np.float32))
test_highS = torch.from_numpy(np.array([0.],dtype=np.float32))

smalltest(model, test_lowS,test_highS)

-------------------------
Given I perform the intervention S=0.0
The prediction of the low level mechanism P_M0 (C=1 | S=0.0) outputs 0.09720170497894287

The prediction of the abstraction to high-level alpha_S(0.0) outputs 0.021648824214935303 against the expected 0.0

The prediction of the high level mechanism P_M1 (C=1 | S=0.0) outputs 0.10958739370107651 using the input 0.0
The prediction of the high level mechanism P_M1 (C=1 | S=0.021648824214935303) outputs 0.1124633401632309 using the transformed input 0.0

The upper path produces 0.1124633401632309
The lower path produces 0.1256360113620758
-------------------------


This again confirms learning:
- The prediction $\hat{P}_\mathcal{M}(C=1 \vert do(S=0))$ is roughly $0.1$;
- The prediction $\alpha_S(0)$ is close to $0$;
- The prediction $\hat{P}_\mathcal{M'}(C'=1 \vert do(S'=0))$ is roughly $0.1$ when using the input data $\mathbf{s'}$; it is slightly further away when using $\alpha_S(\mathbf{s})$ because of accumulation of approximations.
- The results along the upper and lower path are close.

# Example 2

## Definition of models and abstraction 

We define an alternative synthetic case with a slightly more complex underlying graph. In particular, we will consider a base model where variables will be collapsed ($B,C \mapsto Y$), where domains in the base and abstract models are different ($\mathcal{M}[A] \neq \mathcal{M'}[X]$) forcing the abstraction to be different from an identity, and where the abstraction has error different from zero.

In [19]:
M0 = BN([('A','B'),('A','C'),('B','D'),('C','D')])

cpdA = cpd(variable='A',
          variable_card=3,
          values=[[.7],[.15],[.15]],
          evidence=None,
          evidence_card=None)
cpdB = cpd(variable='B',
          variable_card=2,
          values=[[.7,.2,.3],[.3,.8,.7]],
          evidence=['A'],
          evidence_card=[3])
cpdC = cpd(variable='C',
          variable_card=2,
          values=[[.9,.2,.25],[.1,.8,.75]],
          evidence=['A'],
          evidence_card=[3])
cpdD = cpd(variable='D',
          variable_card=2,
          values=[[.8,.2,.15,.8],[.2,.8,.85,.2]],
          evidence=['B','C'],
          evidence_card=[2,2])

M0.add_cpds(cpdA,cpdB,cpdC,cpdD)
M0.check_model()

True

In [20]:
M1 = BN([('X','Y'),('Y','Z')])

cpdX = cpd(variable='X',
          variable_card=2,
          values=[[.7],[.3]],
          evidence=None,
          evidence_card=None)
cpdY = cpd(variable='Y',
          variable_card=2,
          values=[[1,0],[0,1]],
          evidence=['X'],
          evidence_card=[2])
cpdZ = cpd(variable='Z',
          variable_card=2,
          values=[[0.4025,0.39],[0.5975,0.61]],
          evidence=['Y'],
          evidence_card=[2])

M1.add_cpds(cpdX,cpdY,cpdZ)
M1.check_model()

True

We define the abstraction between the two models

In [21]:
R = ['A','B','C','D']

a = {'A': 'X',
    'B': 'Y',
    'C': 'Y',
    'D': 'Z'}
alphas = {'X': np.array([[1,0,0],[0,1,1]]),
         'Y': np.array([[1,0,0,1],[0,1,1,0]]),
         'Z': np.eye(2),}

In [22]:
A = Abstraction(M0,M1,R,a,alphas)
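To make the role of the matrices in `alphas` concrete, the sketch below reads `alphas['X']` as a map from the three values of $A$ to the two values of $X$: each column has a single $1$ marking the high-level value assigned to that low-level value. The variable names `x_lookup`, `a_values` and `p_A` are introduced here for illustration only.

```python
import numpy as np

# Same matrix as alphas['X']: rows index values of X, columns index values of A.
alpha_X = np.array([[1, 0, 0],
                    [0, 1, 1]])

# Every column sums to one, so each value of A is sent to exactly one value of X.
assert (alpha_X.sum(axis=0) == 1).all()

# Read the map off the columns: A=0 -> X=0, A=1 -> X=1, A=2 -> X=1.
x_lookup = alpha_X.argmax(axis=0)

# Apply the map to a whole column of low-level interventions at once.
a_values = np.array([0, 1, 2, 2, 0])
x_values = x_lookup[a_values]

# The same matrix pushes a distribution over A forward to a distribution over X.
p_A = np.array([0.7, 0.15, 0.15])
p_X = alpha_X @ p_A
print(x_lookup, x_values, p_X)
```

With the prior of $A$ defined above, the pushed-forward distribution matches the prior chosen for $X$ in the high-level model.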

We will then pay attention to the interventional distribution of interest $P(Z\vert do(X))$ and learn the following diagram:
$$
\begin{array}{ccc}
\mathcal{\mathcal{M}_{do}}\left[{A}\right] & \overset{\mathcal{\mathcal{M}_{do}}\left[\phi_{\tilde{D}}\right]}{\longrightarrow} & \mathcal{\mathcal{M}_{do}}\left[D\right]\\
\sideset{}{\alpha_{X}}\downarrow &  & \sideset{}{\alpha_{Z}}\downarrow\\
\mathcal{\mathcal{M'}_{do}}\left[X\right] & \overset{\mathcal{\mathcal{M'}_{do}}\left[\phi_{Z}\right]}{\longrightarrow} & \mathcal{\mathcal{M'}_{do}}\left[Z\right]
\end{array}
$$

For reference we also compute the abstraction error.

In [23]:
err = A.evaluate_abstraction_error(verbose=True)

- Checking ['X'] -> ['Y']: True
- Checking ['X'] -> ['Z']: True
- Checking ['X'] -> ['Y', 'Z']: True
- Checking ['Y'] -> ['X']: False
---- Checking ['B', 'C'] -> ['A']: False
- Checking ['Y'] -> ['Z']: True
- Checking ['Y'] -> ['X', 'Z']: True
- Checking ['Z'] -> ['X']: False
---- Checking ['D'] -> ['A']: False
- Checking ['Z'] -> ['Y']: False
---- Checking ['D'] -> ['B', 'C']: False
- Checking ['Z'] -> ['X', 'Y']: False
---- Checking ['D'] -> ['A', 'B', 'C']: False
- Checking ['X', 'Y'] -> ['Z']: True
- Checking ['X', 'Z'] -> ['Y']: True
- Checking ['Y', 'Z'] -> ['X']: False
---- Checking ['B', 'C', 'D'] -> ['A']: False

 7 legitimate pairs of sets out of 49 possbile pairs of sets

M1: ['X'] -> ['Y']
M0: ['A'] -> ['B', 'C']
M1 mechanism shape: (2, 2)
M0 mechanism shape: (4, 3)
Alpha_s shape: (2, 3)
Alpha_t shape: (2, 4)
All JS distances: [0.36792454944836245, 0.5723641752371845, 0.523792390695269]

Abstraction error: 0.5723641752371845

M1: ['X'] -> ['Z']
M0: ['A'] -> ['D']
M1 mechani

Notice that the abstraction error for the intervention of interest is $E_\alpha(Z,X) \approx 0.15$
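The figures above can be cross-checked by hand. The sketch below assumes, as the printed output suggests, that the abstraction error of a pair is the largest Jensen-Shannon distance (natural logarithm) between the two paths of the diagram, taken over the low-level interventions. The helper names are hypothetical and this is not the code of `Abstraction.evaluate_abstraction_error`.

```python
import numpy as np

def js_distance(p, q):
    """Jensen-Shannon distance (natural log) between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(u, v):
        mask = u > 0                      # skip zero entries: 0 * log(0) = 0
        return np.sum(u[mask] * np.log(u[mask] / v[mask]))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def max_js(path_one, path_two):
    """Largest JS distance over columns (one column per low-level intervention)."""
    return max(js_distance(path_one[:, j], path_two[:, j])
               for j in range(path_one.shape[1]))

alpha_X = np.array([[1, 0, 0], [0, 1, 1]])
alpha_Y = np.array([[1, 0, 0, 1], [0, 1, 1, 0]])
alpha_Z = np.eye(2)

# Pair X -> Y. B and C are independent given A, so P(B,C | do(A)) is a product;
# rows are ordered (b,c) = (0,0), (0,1), (1,0), (1,1).
P_B = np.array([[.7, .2, .3], [.3, .8, .7]])
P_C = np.array([[.9, .2, .25], [.1, .8, .75]])
P_BC = np.einsum('ba,ca->bca', P_B, P_C).reshape(4, 3)
P_Y_given_X = np.eye(2)                   # CPD of Y in M1
err_XY = max_js(alpha_Y @ P_BC, P_Y_given_X @ alpha_X)

# Pair X -> Z, using the mechanisms P(D | do(A)) and P(Z | do(X)) implied by the CPDs.
P_D_given_A = np.array([[0.5825, 0.6, 0.55125], [0.4175, 0.4, 0.44875]])
P_Z_given_X = np.array([[0.4025, 0.39], [0.5975, 0.61]])
err_XZ = max_js(alpha_Z @ P_D_given_A, P_Z_given_X @ alpha_X)

print(err_XY, err_XZ)
```

Under this assumption the first value agrees with the abstraction error printed for `['X'] -> ['Y']`, and the second is consistent with the value of about $0.15$ quoted for the intervention of interest.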

## Data collection

We generate data from the low-level model for learning. This will provide the first part $(\mathbf{a},\mathbf{d})$ of our datapoints.

In [24]:
M0do = M0.do('A')
lowlevel_data = M0do.simulate(n_samples=n_samples, show_progress=False)
print(lowlevel_data)

      A  B  C  D
0     0  0  0  0
1     0  0  0  0
2     2  0  1  0
3     0  0  1  1
4     0  1  0  1
...  .. .. .. ..
9995  0  0  0  0
9996  0  0  0  0
9997  0  1  1  1
9998  0  0  0  0
9999  0  1  0  1

[10000 rows x 4 columns]


Similarly, we generate data from the high-level model. This is meant to provide the second part $(\mathbf{x},\mathbf{z})$ of our datapoints. Notice that we need to align our datapoints correctly, since we want $\mathbf{x} = \alpha_{X}(\mathbf{a})$; therefore we generate the datapoints one by one.

**TODO: this generation method is highly inefficient!!!**

In [25]:
alpha_X = np.array([[1,0,0],[0,1,1]])

In [26]:
M1do = M1.do('X')

high_level_samples = [M1do.simulate(n_samples=1, evidence={'X': np.where(alphas['X'][:,lowlevel_data.loc[i]['A']]==1)[0][0]}, show_progress=False) for i in range(lowlevel_data.shape[0])]
highlevel_data = pd.concat(high_level_samples)
    
print(highlevel_data)

    Z  X  Y
0   0  0  0
0   1  0  0
0   1  1  1
0   1  0  0
0   1  0  0
.. .. .. ..
0   1  0  0
0   0  0  0
0   0  0  0
0   0  0  0
0   0  0  0

[10000 rows x 3 columns]


## Learning model definition

We define a neural network with the modules and loss functions described above.

We start with some hyperparameters.

In [38]:
lr = 0.005
hiddensize = 4
num_epochs = 5000

In [39]:
class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(1,hiddensize)
        self.sm1 = nn.Sigmoid()
        self.fc2 = nn.Linear(hiddensize,1)
        self.lowmech = nn.Sigmoid()
        
        self.fc3 = nn.Linear(1,hiddensize)
        self.sm2 = nn.Sigmoid()
        self.fc4 = nn.Linear(hiddensize,1)
        self.absS = nn.Sigmoid()
        
        self.fc5 = nn.Linear(1,hiddensize)
        self.sm3 = nn.Sigmoid()
        self.fc6 = nn.Linear(hiddensize,1)
        self.highmech = nn.Sigmoid()
        
        self.fc7 = nn.Linear(1,hiddensize)
        self.sm4 = nn.Sigmoid()
        self.fc8 = nn.Linear(hiddensize,1)
        self.absC = nn.Sigmoid()
        
    def forward(self, input_lowS, input_highS):
        out = self.fc1(input_lowS)
        out = self.sm1(out)
        out = self.fc2(out)
        mech_lowS = self.lowmech(out)
        
        out = self.fc3(input_lowS)
        out = self.sm2(out)
        out = self.fc4(out)
        abs_lowS = self.absS(out)
        
        out = self.fc5(input_highS)
        out = self.sm3(out)
        out = self.fc6(out)
        mech_highS = self.highmech(out)
        
        out = self.fc5(abs_lowS)
        out = self.sm3(out)
        out = self.fc6(out)
        mech_abs_lowS = self.highmech(out)
        
        out = self.fc7(mech_lowS)
        out = self.sm4(out)
        out = self.fc8(out)
        abs_mech_lowS = self.absC(out)
        
        return mech_lowS, abs_lowS, mech_highS, abs_mech_lowS, mech_abs_lowS 

We instantiate the model and its optimizer.

In [40]:
model = NeuralNet()

In [41]:
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

loss_lowmech = nn.MSELoss()
loss_highmech = nn.MSELoss()
loss_abstraction = nn.MSELoss()
loss_commutativity = nn.MSELoss()

## Training

We train the network.

In [42]:
# Build the input and target tensors once: they do not change across epochs.
input_lowS = torch.from_numpy(np.expand_dims(lowlevel_data['A'].to_numpy(dtype=np.float32),axis=1))
input_highS = torch.from_numpy(np.expand_dims(highlevel_data['X'].to_numpy(dtype=np.float32),axis=1))
target_lowC = torch.from_numpy(np.expand_dims(lowlevel_data['D'].to_numpy(dtype=np.float32),axis=1))
target_highC = torch.from_numpy(np.expand_dims(highlevel_data['Z'].to_numpy(dtype=np.float32),axis=1))

for epoch in range(num_epochs):
    mech_lowS, abs_lowS, mech_highS, abs_mech_lowS, mech_abs_lowS = model(input_lowS, input_highS)
    
    l1 = loss_lowmech(mech_lowS,target_lowC)               # low-level mechanism
    l2 = loss_highmech(mech_highS,target_highC)            # high-level mechanism
    l3 = loss_abstraction(abs_lowS,input_highS)            # abstraction on the intervened variable
    l4 = loss_commutativity(abs_mech_lowS,mech_abs_lowS)   # agreement between the two paths
    # Note: l4 is computed but NOT added to the total loss in this run, so the
    # commutativity term does not drive the updates and fc7/fc8 receive no gradient.
    # The losses logged below correspond to l1+l2+l3 only.
    loss = l1+l2+l3
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if (epoch+1) % 500 == 0:
        print ('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, loss.item()))

Epoch [500/5000], Loss: 0.4967
Epoch [1000/5000], Loss: 0.4868
Epoch [1500/5000], Loss: 0.4855
Epoch [2000/5000], Loss: 0.4851
Epoch [2500/5000], Loss: 0.4849
Epoch [3000/5000], Loss: 0.4848
Epoch [3500/5000], Loss: 0.4848
Epoch [4000/5000], Loss: 0.4848
Epoch [4500/5000], Loss: 0.4847
Epoch [5000/5000], Loss: 0.4847


## Testing

We define a new function to print out the result of testing.

In [43]:
def smalltest(model, test_lowA,test_highX):
    mech_lowA, abs_lowA, mech_highX, abs_mech_lowA, mech_abs_lowA = model(test_lowA,test_highX)
    
    print('-------------------------')
    print('Given I perform the intervention A={0}'.format(test_lowA.item()))
    print('The prediction of the low level mechanism P_M0 (D=1 | A={0}) outputs {1}'.format(test_lowA.item(),mech_lowA.item()))
    
    print('\nThe prediction of the abstraction to high-level alpha_X({0}) outputs {1} against the expected {2}'.format(test_lowA.item(),abs_lowA.item(),test_highX.item()))
    
    print('\nThe prediction of the high level mechanism P_M1 (Z=1 | X={0}) outputs {1} using the input {2}'.format(test_highX.item(),mech_highX.item(),test_highX.item()))
    print('The prediction of the high level mechanism P_M1 (Z=1 | X={0}) outputs {1} using the transformed input {2}'.format(abs_lowA.item(),mech_abs_lowA.item(),abs_lowA.item()))
    
    print('\nThe upper path produces {0}'.format(mech_abs_lowA.item()))
    print('The lower path produces {0}'.format(abs_mech_lowA.item()))
    print('-------------------------')

Let us check what our network produces with an input of $(\mathbf{a}=0,\mathbf{x}=0)$. As a reference we output the value of the mechanism $\mathcal{M}[\phi_{{\tilde{D}}}] = P(D|do(A))$ and $\mathcal{M'}[\phi_{Z}] = P(Z|do(X))$ which we are trying to learn.

In [44]:
inferM0 = VariableElimination(M0)
M0_joint_AD = inferM0.query(['A','D'],show_progress=False)
M0_joint_A = inferM0.query(['A'],show_progress=False)
M0_cond_DA = M0_joint_AD / M0_joint_A
M0_cond_DA.values

array([[0.5825 , 0.4175 ],
       [0.6    , 0.4    ],
       [0.55125, 0.44875]])

In [45]:
inferM1 = VariableElimination(M1)
M1_joint_XZ = inferM1.query(['X','Z'],show_progress=False)
M1_joint_X = inferM1.query(['X'],show_progress=False)
M1_cond_ZX = M1_joint_XZ / M1_joint_X
M1_cond_ZX.values

array([[0.4025, 0.5975],
       [0.39  , 0.61  ]])
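Both reference tables can also be derived directly from the CPDs, which makes explicit what the two mechanisms compute: for the low-level model $P(D \vert do(A)) = \sum_{b,c} P(D \vert b,c)\,P(b \vert A)\,P(c \vert A)$, and for the high-level model $P(Z \vert do(X)) = \sum_{y} P(Z \vert y)\,P(y \vert X)$. A small sketch with the CPD values copied from the cells above (array names are introduced here for illustration):

```python
import numpy as np

# Low-level CPDs; columns index the parent value(s).
P_B = np.array([[.7, .2, .3], [.3, .8, .7]])       # P(B | A)
P_C = np.array([[.9, .2, .25], [.1, .8, .75]])     # P(C | A)
# P(D | B, C): columns ordered (b,c) = (0,0), (0,1), (1,0), (1,1),
# reshaped so that the axes are [d, b, c].
P_D = np.array([[.8, .2, .15, .8],
                [.2, .8, .85, .2]]).reshape(2, 2, 2)

# Marginalize B and C for every value of the intervened variable A.
P_D_given_doA = np.einsum('dbc,ba,ca->da', P_D, P_B, P_C)

# High-level CPDs: Y copies X, then Z depends on Y.
P_Y_given_X = np.eye(2)
P_Z_given_Y = np.array([[0.4025, 0.39], [0.5975, 0.61]])
P_Z_given_doX = P_Z_given_Y @ P_Y_given_X

# Transpose so that rows index the intervention, as in the printed arrays.
print(P_D_given_doA.T)
print(P_Z_given_doX.T)
```

The identities used here rely on $A$ and $X$ being root nodes, so that intervening on them coincides with conditioning.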

In [46]:
test_lowA = torch.from_numpy(np.array([0.],dtype=np.float32))
test_highX = torch.from_numpy(np.array([0.],dtype=np.float32))

smalltest(model, test_lowA,test_highX)

-------------------------
Given I perform the intervention A=0.0
The prediction of the low level mechanism P_M0 (D=1 | A=0.0) outputs 0.41887593269348145

The prediction of the abstraction to high-level alpha_X(0.0) outputs 0.005152891390025616 against the expected 0.0

The prediction of the high level mechanism P_M1 (Z=1 | X=0.0) outputs 0.5912103652954102 using the input 0.0
The prediction of the high level mechanism P_M1 (Z=1 | X=0.005152891390025616) outputs 0.5912752151489258 using the transformed input 0.005152891390025616

The upper path produces 0.5912752151489258
The lower path produces 0.630756676197052
-------------------------


The network roughly learnt our diagram:
- The prediction $\hat{P}_\mathcal{M}(D=1 \vert do(A=0))$ is roughly $0.4175$;
- The prediction $\alpha_X(0)$ is close to $0$;
- The prediction $\hat{P}_\mathcal{M'}(Z=1 \vert do(X=0))$ is roughly $0.5975$ when using the input data $\mathbf{x}$; it is essentially unchanged when using $\alpha_X(\mathbf{a})$, since the learned abstraction is very close to $0$.
- The results along the upper and lower path are only roughly close ($0.59$ against $0.63$).

In [47]:
test_lowA = torch.from_numpy(np.array([1.],dtype=np.float32))
test_highX = torch.from_numpy(np.array([1.],dtype=np.float32))

smalltest(model, test_lowA,test_highX)

-------------------------
Given I perform the intervention A=1.0
The prediction of the low level mechanism P_M0 (D=1 | A=1.0) outputs 0.4125082492828369

The prediction of the abstraction to high-level alpha_X(1.0) outputs 0.9892100691795349 against the expected 1.0

The prediction of the high level mechanism P_M1 (Z=1 | X=1.0) outputs 0.6058823466300964 using the input 1.0
The prediction of the high level mechanism P_M1 (Z=1 | X=0.9892100691795349) outputs 0.6057053208351135 using the transformed input 0.9892100691795349

The upper path produces 0.6057053208351135
The lower path produces 0.6308521032333374
-------------------------


In [48]:
test_lowA = torch.from_numpy(np.array([2.],dtype=np.float32))
test_highX = torch.from_numpy(np.array([1.],dtype=np.float32))

smalltest(model, test_lowA,test_highX)

-------------------------
Given I perform the intervention A=2.0
The prediction of the low level mechanism P_M0 (D=1 | A=2.0) outputs 0.4495827555656433

The prediction of the abstraction to high-level alpha_X(2.0) outputs 0.9981436729431152 against the expected 1.0

The prediction of the high level mechanism P_M1 (Z=1 | X=1.0) outputs 0.6058823466300964 using the input 1.0
The prediction of the high level mechanism P_M1 (Z=1 | X=0.9981436729431152) outputs 0.6058518886566162 using the transformed input 0.9981436729431152

The upper path produces 0.6058518886566162
The lower path produces 0.6302937269210815
-------------------------


# TODO: (i) review architecture and loss of NN; (ii) consider other problems.