In [6]:
import numpy as np
from numpy.random import randn
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression

n_samples=10000

We will focus on the Structural Causal Model: 

$\begin{cases}
X_1 \leftarrow 3 \cdot X_2 + \epsilon_{X_1}\\
X_2 \leftarrow \epsilon_{X_2}\\
X_3 \leftarrow 5 \cdot X_1 + 4 \cdot X_2 +\epsilon_{X_3}\\
X_4 \leftarrow 6 \cdot X_3 + \epsilon_{X_4}\\
\end{cases}$

$\epsilon_{X_1},\epsilon_{X_2},\epsilon_{X_3}, \epsilon_{X_4} \sim \mathcal{N}(0,1) \ \mathrm{ and } \  \forall i\neq j: \epsilon_{X_i} \perp\!\!\perp \epsilon_{X_j}$

We will start by estimating the total average causal effect of X1 on X4, $E[X_4|do(X_1=1)] - E[X_4|do(X_1=0)]$, which in this case we can compute analytically with the path coefficient method as $5 \times 6=30$. Here we instead estimate it numerically by simulating two related SCMs in which we set $X_1$ to 1 and 0, respectively:

In [7]:
mu, sigma = 0, 1 # mean and standard deviation

x2_1 = np.random.normal(mu, sigma, n_samples)
x1_1 =  1
x3_1 = 5 * x1_1 + 4 * x2_1 + np.random.normal(mu, sigma, n_samples)
x4_1 = 6 * x3_1 + np.random.normal(mu, sigma, n_samples)

x2_0 = np.random.normal(mu, sigma, n_samples)
x1_0 =  0
x3_0 = 5 * x1_0 + 4 * x2_0 + np.random.normal(mu, sigma, n_samples)
x4_0 = 6 * x3_0 + np.random.normal(mu, sigma, n_samples)
diff = np.mean(x4_1) - np.mean(x4_0)
print(diff)

29.805828417171806
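
The two hand-written simulations above can also be folded into one helper that takes the intervention as an argument. The following is only a sketch: the function name `simulate`, its `do` argument and the fixed seed are hypothetical choices for illustration, and it uses `np.random.default_rng` instead of the global random state used in the rest of the notebook.

```python
import numpy as np

def simulate(n, rng, do=None):
    """Draw n samples from the SCM; `do` fixes variables, e.g. {"x1": 1}."""
    do = do or {}
    noise = lambda: rng.standard_normal(n)
    # Each variable is either clamped to its intervened value
    # or generated from its structural equation.
    x2 = np.full(n, float(do["x2"])) if "x2" in do else noise()
    x1 = np.full(n, float(do["x1"])) if "x1" in do else 3 * x2 + noise()
    x3 = np.full(n, float(do["x3"])) if "x3" in do else 5 * x1 + 4 * x2 + noise()
    x4 = np.full(n, float(do["x4"])) if "x4" in do else 6 * x3 + noise()
    return {"x1": x1, "x2": x2, "x3": x3, "x4": x4}

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 200_000                     # assumed sample size, larger than n_samples
effect_x1_on_x4 = (simulate(n, rng, do={"x1": 1})["x4"].mean()
                   - simulate(n, rng, do={"x1": 0})["x4"].mean())
print(effect_x1_on_x4)
```

The printed value should land close to 30, with the remaining gap due to sampling noise.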


We can also check whether the total average causal effect of X2 on X4 predicted by the path method, $E[X_4|do(X_2 = 1)] - E[X_4|do(X_2 = 0)]= 3 \times 5 \times 6 + 4 \times 6 = 114$, also holds in the numerical simulation:

In [8]:
x2_1 = 1
x1_1 =  3 * x2_1 + randn(n_samples) 
x3_1 = 5 * x1_1 + 4 * x2_1 + randn(n_samples)
x4_1 = 6 * x3_1 + randn(n_samples) 

x2_0 = 0
x1_0 = 3 * x2_0 + randn(n_samples) 
x3_0 = 5 * x1_0 + 4 * x2_0 + randn(n_samples)
x4_0 = 6 * x3_0 + randn(n_samples)
diff = np.mean(x4_1) - np.mean(x4_0)
print(diff)

113.45410315092538
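
The analytical values 30 and 114 can be double-checked without sampling. In a linear acyclic SCM, the total effect of one variable on another is the sum, over all directed paths between them, of the product of the coefficients along each path. A minimal sketch, assuming we collect the direct effects in a matrix `B` (a name introduced here for illustration), reads these sums off $(I - B)^{-1}$:

```python
import numpy as np

# B[i, j] is the direct effect of X_{j+1} on X_{i+1} in the SCM above.
B = np.zeros((4, 4))
B[0, 1] = 3  # X2 -> X1
B[2, 0] = 5  # X1 -> X3
B[2, 1] = 4  # X2 -> X3
B[3, 2] = 6  # X3 -> X4

# For an acyclic graph, (I - B)^{-1} = I + B + B^2 + B^3, and entry (i, j)
# sums the coefficient products over all directed paths from j to i.
total = np.linalg.inv(np.eye(4) - B)

effect_x1_on_x4 = total[3, 0]  # single path X1 -> X3 -> X4
effect_x2_on_x4 = total[3, 1]  # paths X2 -> X1 -> X3 -> X4 and X2 -> X3 -> X4
print(f"{effect_x1_on_x4:.1f} {effect_x2_on_x4:.1f}")
```

Up to floating-point error, the two entries equal $5 \times 6 = 30$ and $3 \times 5 \times 6 + 4 \times 6 = 114$.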


We now show that we can also learn the effect of X1 on X4 empirically, without simulating interventions. We simulate observational samples from the SCM above:

In [12]:
x2 = randn(n_samples) 
x1 = 3 * x2 + randn(n_samples) 
x3 = 5 * x1 + 4 * x2 + randn(n_samples)
x4 = 6 * x3 + randn(n_samples) 

df = pd.DataFrame({"x2": x2, "x1": x1, "x3": x3, "x4": x4})
Y = df[["x4"]].values               # target: X4
X1 = df[["x1"]].values              # covariates: X1
X21 = df[["x2", "x1"]].values       # covariates: X2, X1 (X1 is column 1)
X13 = df[["x1", "x3"]].values       # covariates: X1, X3 (X1 is column 0)
X = df[["x2", "x1", "x3"]].values   # covariates: X2, X1, X3 (X1 is column 1)
print(df)

            x2        x1         x3          x4
0     0.233087  0.556705   3.620552   21.469438
1     0.975011  4.067085  25.222893  151.832783
2     0.211004  0.684294   4.113545   26.270381
3     0.822919  1.822309  13.525527   79.689974
4     0.893344  2.356225  15.725063   92.442871
...        ...       ...        ...         ...
9995  2.203807  6.995353  45.071340  269.622416
9996 -0.357482 -0.054502  -1.949514  -12.014570
9997  0.748787  2.319182  13.416964   79.812848
9998  1.210459  5.581404  32.458151  194.945715
9999  0.703387  1.980747  12.584568   73.859804

[10000 rows x 4 columns]


We start by regressing X4 on X1 alone and checking the linear coefficient:

In [19]:
linear_regressor = LinearRegression() 
linear_regressor.fit(X1, Y)
linear_regressor.coef_

array([[37.17373998]])

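The linear coefficient is far from the causal effect of X1 on X4 computed analytically ($5 \times 6 = 30$), because X2 is a common cause of X1 and X4 and the regression does not adjust for it.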
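
The size of the gap can be worked out from the SCM itself. Substituting the structural equations gives $X_4 = 30 \cdot X_1 + 24 \cdot X_2 + 6 \cdot \epsilon_{X_3} + \epsilon_{X_4}$, where $\epsilon_{X_3}$ and $\epsilon_{X_4}$ are independent of $X_1$. With $\mathrm{Cov}(X_1, X_2) = 3$ and $\mathrm{Var}(X_1) = 3^2 + 1 = 10$, the population coefficient of the regression of X4 on X1 alone is:

$\begin{aligned}
\frac{\mathrm{Cov}(X_1, X_4)}{\mathrm{Var}(X_1)} &= 30 + 24 \cdot \frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)}\\
&= 30 + 24 \cdot \frac{3}{10} = 37.2
\end{aligned}$

The extra $7.2$ is the contribution of the path through X2, and the estimate of about 37.17 above agrees with this value up to sampling noise.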

We try something different and consider using both X1 and X3 as covariates in the regression for predicting X4:

In [18]:
linear_regressorX13 = LinearRegression()
linear_regressorX13.fit(X13, Y)
linear_regressorX13.coef_[:,0]

array([-0.00680472])

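The result for X1 is even worse: the coefficient is close to 0. X3 lies on the only directed path from X1 to X4, so including it in the regression blocks the very effect we want to measure.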

We now consider X1 and X2 as covariates in the regression for predicting X4:

In [20]:
linear_regressorX12 = LinearRegression() 
linear_regressorX12.fit(X21, Y)
linear_regressorX12.coef_[:,1]

array([29.95265029])

The result is now similar to the coefficient we have computed analytically. 

Finally, we consider all the other variables X1, X2 and X3 in the regression for predicting X4:

In [21]:
linear_regressorX123 = LinearRegression() 
linear_regressorX123.fit(X, Y)
linear_regressorX123.coef_[:,1]

array([-0.00916755])

The only correct estimate of the causal effect of X1 on X4 is the one obtained when we use X1 and X2 as covariates in the regression, which matches the adjustment formula we will see in the next slides.
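
The four regressions can also be compared with their large-sample limits, which removes sampling noise from the picture. The sketch below is an assumption-laden illustration rather than part of the original analysis: it derives the population covariance of the variables from the SCM and solves the normal equations with `np.linalg.solve` instead of fitting `LinearRegression` on samples. The names `B`, `cov` and `x1_coefficient` are hypothetical.

```python
import numpy as np

# Direct effects: B[i, j] is the coefficient of X_{j+1} in the equation for X_{i+1}.
B = np.zeros((4, 4))
B[0, 1] = 3  # X2 -> X1
B[2, 0] = 5  # X1 -> X3
B[2, 1] = 4  # X2 -> X3
B[3, 2] = 6  # X3 -> X4

# X = A @ eps with A = (I - B)^{-1}. The noises are independent N(0, 1),
# so the population covariance of (X1, X2, X3, X4) is A @ A.T.
A = np.linalg.inv(np.eye(4) - B)
cov = A @ A.T

def x1_coefficient(covariates):
    """Population OLS coefficient of X1 when regressing X4 on the
    variables with the given 0-based indices (0 = X1, 1 = X2, 2 = X3)."""
    idx = list(covariates)
    beta = np.linalg.solve(cov[np.ix_(idx, idx)], cov[idx, 3])
    return beta[idx.index(0)]

results = {
    "X1": x1_coefficient([0]),
    "X1, X3": x1_coefficient([0, 2]),
    "X1, X2": x1_coefficient([0, 1]),
    "X1, X2, X3": x1_coefficient([0, 1, 2]),
}
for name, value in results.items():
    print(f"{name}: {value:.2f}")
```

Only the covariate set {X1, X2} yields a coefficient of 30. Every set that contains X3 yields a coefficient of essentially 0 for X1.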