## Theory
This notebook tests some basic concepts of causal relationships and statistical modelling, as laid out in Pearl & Mackenzie (2018). For example, controlling for a "collider" introduces bias where there was none. Subsequent notebooks will test how well different meta-learners work, and what the practical implications are of accounting for or ignoring causal relationships in predictive models.

### Approach
I will use simulated datasets so that the "true" causal relationships are known by construction. For simplicity, all models are linear regressions. As a running example, I will try to estimate the effect of years of education on wages under various confounding causal structures.
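
As a minimal sketch of this approach, the snippet below simulates the simplest possible case (education affects wage and nothing else interferes) and recovers the known coefficient with least squares. The seed, sample size, and variable names are illustrative assumptions; the means, spreads, and effect size mirror the global parameters used in this notebook.

```python
import numpy as np

# Illustrative assumptions: seed and sample size are arbitrary choices.
rng = np.random.default_rng(0)
n = 10_000
true_effect = 1000  # wage gained per extra year of education

# Simulate data where the "true" relationship is known by construction.
education = 12 + 3 * rng.standard_normal(n)
wage = 36000 + true_effect * (education - 12) + 3000 * rng.standard_normal(n)

# Least-squares fit of wage on [1, education]; the second coefficient is the slope.
design = np.column_stack([np.ones(n), education])
intercept, slope = np.linalg.lstsq(design, wage, rcond=None)[0]

print(f"estimated effect of education: {slope:.0f} (true value {true_effect})")
```

With no confounding, the estimate should land close to the true value; the sections that follow break this in different ways.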

### Contents
1) Basic confounding - treatment and effect share a common cause
2) Collider
3) Mediator
4) Backdoor path with collider ("M-bias")

In [157]:
import numpy as np
import pandas as pd 
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

In [158]:
# global parameters

education_mean = 12
education_std = 3

wage_mean = 36000
wage_std = 3000

education_effect_wage = 1000

## 1. Basic confounding

education + family_wealth ----> wage

family_wealth ----> education

------------------------------------------

This is a standard omitted variable bias problem.

Intuition: 
- Suppose that family_wealth contributes positively to an individual's level of education, and positively to wages. 
- If we estimate the effect of education on wages and ignore family_wealth, then we will attribute some of the contribution stemming from family_wealth to education and over-estimate the effect of education.
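
The size of the over-estimate can be sketched with the textbook omitted-variable-bias formula: the naive slope is roughly the true effect plus (effect of family_wealth on wage) × Cov(education, family_wealth) / Var(education). The snippet below checks this on simulated data. The seed, sample size, and coefficient names (`beta_education`, `gamma_wealth`, `delta`) are assumptions made for illustration; the values mirror the defaults used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed (assumption)
n = 200_000                     # large sample so the comparison is tight

# Assumed structural coefficients, mirroring the notebook's defaults.
beta_education = 1000   # true effect of education on wage
gamma_wealth = 5000     # effect of family_wealth on wage
delta = 1               # effect of family_wealth on education

family_wealth = rng.standard_normal(n)
education = 12 + delta * family_wealth + np.sqrt(3) * rng.standard_normal(n)
wage = (
    36000
    + beta_education * (education - 12)
    + gamma_wealth * family_wealth
    + 3000 * rng.standard_normal(n)
)

# Omitted-variable-bias formula using sample moments:
# naive slope ~ beta + gamma * Cov(education, wealth) / Var(education)
cov = np.cov(education, family_wealth)
expected_naive = beta_education + gamma_wealth * cov[0, 1] / cov[0, 0]

# Slope from regressing wage on education alone (family_wealth omitted).
naive_slope = np.polyfit(education, wage, 1)[0]

print(f"formula: {expected_naive:.0f}, naive regression: {naive_slope:.0f}")
```

Under these assumptions Var(education) = 1 + 3 = 4 and Cov(education, family_wealth) = 1, so the formula predicts a naive slope of about 1000 + 5000 × 1/4 = 2250, more than double the true effect.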

In [159]:
def basic_confounding_dataset(
        sample_size: int = 1000
        , sample_education_mean:int = education_mean 
        , sample_wage_mean:int = wage_mean 
        , sample_education_effect_wage:int = education_effect_wage
        , sample_family_wealth_effect_wage:int = 5000
        , sample_family_wealth_effect_education:int = 1
        , **kwargs
):
    
    # family_wealth is the common cause: it shifts both education and wage
    _family_wealth = np.random.standard_normal(sample_size)
    # education noise has std sqrt(3), so Var(education) = 1 + 3 = 4 with the default effect of 1
    _education = (
        sample_education_mean
        + (sample_family_wealth_effect_education * _family_wealth)
        + (np.sqrt(3) * np.random.standard_normal(sample_size))
    )
    _wage = (
        sample_wage_mean
        + (sample_education_effect_wage * (_education - sample_education_mean))
        + (sample_family_wealth_effect_wage * _family_wealth)
        + (3000 * np.random.standard_normal(sample_size))
    )

    _df = pd.DataFrame({'family_wealth': _family_wealth, 'education': _education, 'wage': _wage})
    
    return _df



In [160]:
# create data
df1 = basic_confounding_dataset()
df1.describe()

| | family_wealth | education | wage |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | -0.02652 | 11.86289 | 35801.067557 |
| std | 0.964483 | 1.957089 | 6707.908043 |
| min | -2.819431 | 5.257256 | 15195.50457 |
| 25% | -0.675305 | 10.460085 | 31067.87216 |
| 50% | -0.027038 | 11.955455 | 35717.144386 |
| 75% | 0.610021 | 13.224268 | 40247.499072 |
| max | 3.087075 | 17.079326 | 54516.626349 |


In [161]:
# omitted variable bias
X = sm.add_constant(df1['education'])
y = df1['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.379
Model:                            OLS   Adj. R-squared:                  0.378
Method:                 Least Squares   F-statistic:                     608.8
Date:                Sat, 13 Jul 2024   Prob (F-statistic):          2.49e-105
Time:                        22:03:25   Log-Likelihood:                -9991.3
No. Observations:                1000   AIC:                         1.999e+04
Df Residuals:                     998   BIC:                         2.000e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.077e+04   1028.028     10.479      0.0

In [162]:
# fully specified

X = sm.add_constant(df1[['education', 'family_wealth']])
y = df1['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.798
Model:                            OLS   Adj. R-squared:                  0.798
Method:                 Least Squares   F-statistic:                     1968.
Date:                Sat, 13 Jul 2024   Prob (F-statistic):               0.00
Time:                        22:03:25   Log-Likelihood:                -9429.9
No. Observations:                1000   AIC:                         1.887e+04
Df Residuals:                     997   BIC:                         1.888e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const          2.471e+04    661.886     37.327

## 2. Collider

education -----> liberal_support <----- wage

------------------------------------------

In this example education does not affect wage. Support for liberal politics is not on a causal path from education to wage; instead it is caused by both, which makes it a collider. Controlling for liberal_support therefore introduces a spurious relationship between education and wage.

Intuition: 
- Suppose that higher education causes more support for liberal politics, while higher wages reduce support. Further suppose that education does not affect wages.
- If we select a group of people with medium liberal_support (i.e. control for liberal_support), those with high education (+ liberal_support) must tend to have high wages (- liberal_support) to balance it out, while those with low education (- liberal_support) must tend to have low wages (+ liberal_support). Within that group, education therefore looks positively correlated with wages.
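
The selection argument above can be sketched directly: generate education and wage independently, build liberal_support from both, then compare the correlation in the full sample with the correlation inside a "medium support" group. The seed, sample size, use of standardised variables, and the band width of 0.25 are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed (assumption)
n = 100_000

# Standardised education and wage, generated independently (no causal link).
education = rng.standard_normal(n)
wage = rng.standard_normal(n)

# Collider: raised by education, lowered by wage, plus noise.
liberal_support = education - wage + rng.standard_normal(n)

# Correlation in the full sample: close to zero by construction.
corr_all = np.corrcoef(education, wage)[0, 1]

# "Medium" support: a narrow band around zero (band width is an arbitrary choice).
medium = np.abs(liberal_support) < 0.25
corr_medium = np.corrcoef(education[medium], wage[medium])[0, 1]

print(f"all respondents: {corr_all:+.3f}")
print(f"medium support:  {corr_medium:+.3f}")
```

Selecting on the collider is the same mechanism as including it as a regressor: in both cases the comparison is made between people with similar liberal_support, which is what creates the spurious positive association.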

In [163]:
def collider_dataset(
        sample_size: int = 1000
        , sample_education_mean:int = education_mean 
        , sample_education_std:int = education_std
        , sample_wage_mean:int = wage_mean 
        , sample_education_effect_wage:int = 0
        , sample_education_effect_liberal_support:int = 1
        , sample_wage_effect_liberal_support:int = -1
        , **kwargs
):
    
    # create randomised data    
    _education = sample_education_mean + (sample_education_std * np.random.standard_normal(sample_size))
    _wage = (
        sample_wage_mean + (sample_education_effect_wage * (_education - sample_education_mean)) + (3000 * np.random.standard_normal(sample_size))
    )

    _education_normalised = (_education - np.mean(_education))/ np.std(_education)
    _wage_normalised = (_wage - np.mean(_wage))/ np.std(_wage)

    _liberal_support = (
        (sample_education_effect_liberal_support * _education_normalised) + (sample_wage_effect_liberal_support * _wage_normalised) + np.random.standard_normal(sample_size)
    )

    _df = pd.DataFrame({'liberal_support': _liberal_support, 'education': _education, 'wage': _wage})
    
    return _df

In [164]:
df2 = collider_dataset()
df2.describe()

| | liberal_support | education | wage |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | -0.029008 | 11.986435 | 36112.585634 |
| std | 1.706554 | 2.999839 | 2965.730695 |
| min | -6.12217 | 1.211014 | 26747.31422 |
| 25% | -1.1716 | 9.826096 | 34203.486736 |
| 50% | -0.0384 | 12.015935 | 36152.979113 |
| 75% | 1.120477 | 14.188342 | 38003.706698 |
| max | 4.789026 | 20.30125 | 46343.922723 |


In [165]:
# "correct" regression - ignore collider

X = sm.add_constant(df2['education'])
y = df2['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.1118
Date:                Sat, 13 Jul 2024   Prob (F-statistic):              0.738
Time:                        22:03:25   Log-Likelihood:                -9413.3
No. Observations:                1000   AIC:                         1.883e+04
Df Residuals:                     998   BIC:                         1.884e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       3.599e+04    386.646     93.075      0.0

In [166]:
# incorrect - control for collider

X = sm.add_constant(df2[['education', 'liberal_support']])
y = df2['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.519
Model:                            OLS   Adj. R-squared:                  0.518
Method:                 Least Squares   F-statistic:                     537.7
Date:                Sat, 13 Jul 2024   Prob (F-statistic):          3.95e-159
Time:                        22:03:26   Log-Likelihood:                -9047.5
No. Observations:                1000   AIC:                         1.810e+04
Df Residuals:                     997   BIC:                         1.812e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const            3.035e+04    318.714     

## 3. Mediator

education ---> skills ---> wage

------------------------------------------

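In this example education affects wage only through skills. Controlling for skills blocks this causal path, so the regression attributes the effect to skills and the estimated effect of education shrinks towards zero. To estimate the total effect of education on wages, skills should not be controlled for, as the example below illustrates.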
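
A minimal sketch of the distinction between the total effect and the direct effect, assuming a fully mediated chain with illustrative coefficients (0.2 skill points per year of education, 5000 wage units per skill point, so a total effect of 0.2 × 5000 = 1000). The helper `ols`, the seed, and the sample size are hypothetical choices for this sketch, and skill is centred here purely for readability.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed (assumption)
n = 100_000

# education -> skill -> wage, with no direct education -> wage arrow.
education = 12 + 3 * rng.standard_normal(n)
skill = 5 + 0.2 * (education - 12) + rng.standard_normal(n)
wage = 36000 + 5000 * (skill - 5) + 3000 * rng.standard_normal(n)


def ols(y, *regressors):
    """Least-squares coefficients of y on an intercept plus the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Total effect: regress wage on education only.
total_effect = ols(wage, education)[1]

# Direct effect: regress wage on education while holding skill fixed.
direct_effect, skill_effect = ols(wage, education, skill)[1:]

print(f"total effect of education:   {total_effect:8.1f}")
print(f"direct effect (given skill): {direct_effect:8.1f}")
print(f"effect of skill:             {skill_effect:8.1f}")
```

The second regression is not wrong in itself; it answers a different question (what education does to wages once skill is held fixed), which here is nothing.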

In [167]:
def mediator_dataset(
        sample_size: int = 1000
        , sample_education_mean:int = education_mean 
        , sample_education_std:int = education_std
        , sample_wage_mean:int = wage_mean 
        , sample_education_effect_wage: int = education_effect_wage
        , sample_skill_mean:int = 5
        , sample_education_effect_skill: float = 0.2
        , **kwargs
):
    
    # create randomised data    
    _education = sample_education_mean + (sample_education_std * np.random.standard_normal(sample_size))

    _skill = (
        sample_skill_mean + (sample_education_effect_skill * (_education - sample_education_mean)) + np.random.standard_normal(sample_size)
    )

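    # choose the skill -> wage effect so that the total effect of education
    # (education -> skill -> wage) equals sample_education_effect_wage
    _implied_skill_effect_on_wage = sample_education_effect_wage / sample_education_effect_skill

    # note: _skill is not centred, so the mean wage is shifted up by
    # _implied_skill_effect_on_wage * sample_skill_mean (25,000 with the defaults)
    _wage = (
        sample_wage_mean + (_implied_skill_effect_on_wage * _skill) + (3000 * np.random.standard_normal(sample_size))
    )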


    _df = pd.DataFrame({'skill': _skill, 'education': _education, 'wage': _wage})
    
    return _df

In [168]:
df3 = mediator_dataset()
df3.describe()

| | skill | education | wage |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 5.010547 | 11.995044 | 61253.445497 |
| std | 1.159337 | 2.984807 | 6500.377421 |
| min | 1.30401 | 3.410228 | 39888.973131 |
| 25% | 4.216676 | 9.974872 | 56807.910398 |
| 50% | 5.002921 | 12.034082 | 61301.588186 |
| 75% | 5.796471 | 13.875829 | 65682.065896 |
| max | 8.470522 | 21.145952 | 78835.614782 |


In [169]:
# correct - do not control for mediator

X = sm.add_constant(df3['education'])
y = df3['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.182
Model:                            OLS   Adj. R-squared:                  0.181
Method:                 Least Squares   F-statistic:                     222.2
Date:                Sat, 13 Jul 2024   Prob (F-statistic):           1.62e-45
Time:                        22:03:26   Log-Likelihood:                -10098.
No. Observations:                1000   AIC:                         2.020e+04
Df Residuals:                     998   BIC:                         2.021e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       5.011e+04    770.629     65.020      0.0

In [170]:
# incorrect - control for mediator

X = sm.add_constant(df3[['education', 'skill']])
y = df3['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.793
Model:                            OLS   Adj. R-squared:                  0.792
Method:                 Least Squares   F-statistic:                     1904.
Date:                Sat, 13 Jul 2024   Prob (F-statistic):               0.00
Time:                        22:03:26   Log-Likelihood:                -9411.7
No. Observations:                1000   AIC:                         1.883e+04
Df Residuals:                     997   BIC:                         1.884e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       3.662e+04    461.293     79.386      0.0

## 4. Backdoor / "M-bias"

This example illustrates the "M-bias" structure laid out in the book. There is a backdoor path from education to wage which contains a collider.

education -> wage

education <- parents_education -> books_read_per_year <- ambition -> wage

---------------------------------------------------------------------

Combining what we know about colliders: the backdoor path is already blocked by the collider books_read_per_year (books_read in the code below), so leaving it out of the regression keeps the backdoor closed and the estimate unbiased. Controlling for the collider opens the backdoor instead.

If the collider is controlled for, bias can still be avoided by also controlling for another variable on the path (parents_education or ambition), which blocks the path again. In practice such variables may be unobserved (ambition, for example, is difficult to measure), so the better option is to keep the backdoor closed.
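
The effect of different adjustment sets can be sketched side by side. The snippet below assumes the M-shaped structure described above with illustrative coefficients (centred variables, parents' education with standard deviation 2, a true education effect of 1000); the helper `education_coef`, the seed, and the sample size are hypothetical choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed (assumption)
n = 200_000

# M-structure: education <- parents_education -> books_read <- ambition -> wage
parents_education = 2 * rng.standard_normal(n)
ambition = rng.standard_normal(n)
education = 0.5 * parents_education + 3 * rng.standard_normal(n)
wage = 1000 * education + 5000 * ambition + 3000 * rng.standard_normal(n)
books_read = parents_education + ambition + rng.standard_normal(n)


def education_coef(*controls):
    """Coefficient on education when regressing wage on education plus controls."""
    X = np.column_stack([np.ones(n), education, *controls])
    return np.linalg.lstsq(X, wage, rcond=None)[0][1]


estimates = {
    "no controls": education_coef(),
    "books_read": education_coef(books_read),
    "books_read + ambition": education_coef(books_read, ambition),
    "books_read + parents_education": education_coef(books_read, parents_education),
}

for name, value in estimates.items():
    print(f"{name:32s} {value:8.1f}")
```

Under these assumed coefficients, only the regression that controls for books_read alone should be biased (downward, to roughly 820 in the population by my calculation), while the other three adjustment sets should recover a value close to 1000.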

In [171]:
def m_bias_dataset(
        sample_size: int = 1000
        , sample_education_mean:int = education_mean 
        , sample_education_std:int = education_std
        , sample_wage_mean:int = wage_mean 
        , sample_education_effect_wage: int = education_effect_wage
        , sample_parents_effect_education: float = 0.5
        , sample_ambition_effect_wage: int = 5000
        , sample_books_read_mean: int = 5
        , parents_education_effect_books_read: int = 1
        , ambition_effect_books_read: int = 1
        , **kwargs
):
    
    # create randomised data    
    _parents_education_mean = sample_education_mean - 3
    _parents_education = _parents_education_mean + ((sample_education_std - 1) * np.random.standard_normal(sample_size))
    _ambition = np.random.standard_normal(sample_size)

    _education = (
        sample_education_mean 
        + (sample_parents_effect_education * (_parents_education - _parents_education_mean)) 
        + (sample_education_std * np.random.standard_normal(sample_size))
    )
    
    _wage = (
        sample_wage_mean 
        + (sample_education_effect_wage * (_education - sample_education_mean)) 
        + (sample_ambition_effect_wage * _ambition) + (3000 * np.random.standard_normal(sample_size))
    )

    _books_read = (
        sample_books_read_mean 
        + (parents_education_effect_books_read * (_parents_education - _parents_education_mean)) 
        + (ambition_effect_books_read * _ambition)
        + np.random.standard_normal(sample_size)
    )
    

    _df = pd.DataFrame({
        'education': _education, 'wage': _wage
        , 'parents_education': _parents_education
        , 'ambition': _ambition
        , 'books_read': _books_read
        })
    
    return _df

In [172]:
df4 = m_bias_dataset(sample_education_effect_wage=1000)
df4.describe()

| | education | wage | parents_education | ambition | books_read |
|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 12.061811 | 35834.689996 | 9.001196 | -0.062688 | 4.91604 |
| std | 3.064842 | 6550.24756 | 2.029202 | 0.996868 | 2.505337 |
| min | 2.307747 | 16726.596429 | 1.966496 | -2.824509 | -2.616313 |
| 25% | 10.073428 | 31759.129229 | 7.682993 | -0.74808 | 3.15139 |
| 50% | 11.94243 | 35607.988394 | 8.934267 | -0.038399 | 4.92019 |
| 75% | 14.147599 | 40388.576749 | 10.371557 | 0.604378 | 6.593122 |
| max | 21.948083 | 55677.011516 | 16.285957 | 3.620373 | 12.30985 |


In [173]:
# correct - no controls needed, as the backdoor path is already blocked by the collider

X = sm.add_constant(df4['education'])
y = df4['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.225
Model:                            OLS   Adj. R-squared:                  0.224
Method:                 Least Squares   F-statistic:                     289.0
Date:                Sat, 13 Jul 2024   Prob (F-statistic):           4.11e-57
Time:                        22:03:26   Log-Likelihood:                -10079.
No. Observations:                1000   AIC:                         2.016e+04
Df Residuals:                     998   BIC:                         2.017e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.362e+04    741.392     31.858      0.0

In [174]:
# correct - control for parents' education and ambition, but do not control for collider (books read)

X = sm.add_constant(df4[['education', 'parents_education', 'ambition']])
y = df4['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.791
Model:                            OLS   Adj. R-squared:                  0.791
Method:                 Least Squares   F-statistic:                     1258.
Date:                Sat, 13 Jul 2024   Prob (F-statistic):               0.00
Time:                        22:03:26   Log-Likelihood:                -9422.5
No. Observations:                1000   AIC:                         1.885e+04
Df Residuals:                     996   BIC:                         1.887e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              2.463e+04    507.81

In [175]:
# incorrect - control for collider (books read), thereby opening the backdoor

X = sm.add_constant(df4[['education','books_read']])
y = df4['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.327
Model:                            OLS   Adj. R-squared:                  0.325
Method:                 Least Squares   F-statistic:                     241.9
Date:                Sat, 13 Jul 2024   Prob (F-statistic):           2.20e-86
Time:                        22:03:26   Log-Likelihood:                -10008.
No. Observations:                1000   AIC:                         2.002e+04
Df Residuals:                     997   BIC:                         2.004e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.159e+04    710.601     30.379      0.0

In [176]:
# also unbiased - control for collider (books read) but close the backdoor path again by controlling for ambition

X = sm.add_constant(df4[['education','books_read', 'ambition']])
y = df4['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.791
Model:                            OLS   Adj. R-squared:                  0.791
Method:                 Least Squares   F-statistic:                     1258.
Date:                Sat, 13 Jul 2024   Prob (F-statistic):               0.00
Time:                        22:03:26   Log-Likelihood:                -9422.6
No. Observations:                1000   AIC:                         1.885e+04
Df Residuals:                     996   BIC:                         1.887e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.482e+04    401.914     61.761      0.0