
# Week 10 - Recitation - Instrumental variables

author: Judith Abécassis and Élise Dumas with some inspiration from [Matheus Facure Alves](https://matheusfacure.github.io/python-causality-handbook/landing-page.html), [Card (1995)](https://www.nber.org/papers/w4483) and [Verbeek (2004)](https://thenigerianprofessionalaccountant.files.wordpress.com/2013/04/modern-econometrics.pdf).

In today's recitation, we will use an instrumental variable to estimate the Average Treatment Effect (ATE) of schooling on earnings. We will use data from the US National Longitudinal Survey of Young Men. In this panel survey, a group of 3010 men was followed from 1966 (when they were aged 14–24) to 1976 (by which time all of them had started working).

We will stay with a very simple model here, but more complex (and accurate) models have been used and published on this dataset by [Card (1995)](https://www.nber.org/papers/w4483) and [Verbeek (2004)](https://thenigerianprofessionalaccountant.files.wordpress.com/2013/04/modern-econometrics.pdf). You are welcome to read the publications if you are interested in the subject.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import scipy.stats as sps
import warnings

!pip install linearmodels
import linearmodels as lm
from linearmodels.iv import IV2SLS

warnings.filterwarnings(action='once')
rg = np.random.default_rng(2907)

sns.set_context('poster')


# Exercise 1: dataset preparation

In [None]:
#Load dataset
df = pd.read_csv("data_card.csv")
df.describe()

| | nearc2 | wage | educ | south | black |
|---|---|---|---|---|---|
| count | 3010.0 | 3010.0 | 3010.0 | 3010.0 | 3010.0 |
| mean | 0.440864 | 577.282392 | 13.263455 | 0.403654 | 0.233555 |
| std | 0.496573 | 262.958302 | 2.676913 | 0.490711 | 0.423162 |
| min | 0.0 | 100.0 | 1.0 | 0.0 | 0.0 |
| 25% | 0.0 | 394.25 | 12.0 | 0.0 | 0.0 |
| 50% | 0.0 | 537.5 | 13.0 | 0.0 | 0.0 |
| 75% | 1.0 | 708.75 | 16.0 | 1.0 | 0.0 |
| max | 1.0 | 2404.0 | 18.0 | 1.0 | 1.0 |


Here is the list of available variables:

- `nearc2`: indicator for whether the subject grew up near a two-year college (1: yes, 0: no)
- `wage`: subject's wage in cents per hour in 1976
- `educ`: subject's years of education in 1976
- `south`: indicator for whether the subject lived in the South
- `black`: indicator for whether the subject's race is black

### 1. What is the treatment? The outcome?

**Answer**

The treatment is the number of years of education (`educ`); the outcome is earnings (`wage`).

### 2. Would you say that SUTVA holds? Strong ignorability? Conditional ignorability? Positivity?

**Answer**

SUTVA: it is reasonable to think that the "no interaction" part holds (the fact that another individual decided to quit or continue college is unlikely to affect your own future earnings). The assumption that there is only one version of the treatment is less likely to be true (different majors/minors, different grades, study years completed or not, etc.), but we will assume it for the rest of the recitation.

Strong ignorability: no, it is not reasonable to think that strong ignorability holds. The experiment is not randomized and several confounding variables may exist, including "baseline ability" and the parents' socioeconomic situation (low-income parents are less likely to afford college, and also less likely to provide you with the tools to pass job interviews).

Conditional ignorability: no, it is not reasonable to think that conditional ignorability holds, because important confounders such as "baseline ability" or the parents' socioeconomic situation are not available in the dataset. This implies that we cannot use the estimators we have used so far for observational data (stratification, matching, S-learner, T-learner, IPW, ...).

Positivity: it seems reasonable to assume positivity, since there is no factor leading an individual to attend or not attend college with probability 1.

### 3. Among the other variables available in the data, can you think of a potential instrumental variable? Does it satisfy relevance? Exclusion restriction? Instrumental unconfoundedness?

**Answer**

Whether or not the individual grew up near a two-year college (`nearc2`) can be used as an instrument.

It is probably linked with the number of years of schooling, since living close to a college avoids large costs for accommodation near the college or for transportation; this supports relevance.

We can argue that it does not directly alter future earnings other than through schooling (there is no direct reason why living near a college would give you a better or worse wage); this supports the exclusion restriction.

Instrumental unconfoundedness is harder to assume, since the instrument and the outcome may share confounding factors: for instance living in the South, where the density of colleges is lower and the density of companies paying high average wages is lower as well.
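
Of the three assumptions, relevance is the only one that can be checked directly against data. The sketch below does this on simulated data rather than on `data_card.csv`: the helper `first_stage_f` is a hypothetical name introduced for illustration, and the simulated coefficients are arbitrary assumptions. It regresses the treatment on a single instrument and returns the first-stage slope and F-statistic.

```python
import numpy as np

def first_stage_f(t, z):
    """Slope and F-statistic of the first-stage regression t ~ 1 + z (one instrument)."""
    t = np.asarray(t, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(t)
    slope = np.cov(t, z)[0, 1] / np.var(z, ddof=1)
    intercept = t.mean() - slope * z.mean()
    resid = t - intercept - slope * z
    rss = np.sum(resid ** 2)                 # residual sum of squares
    tss = np.sum((t - t.mean()) ** 2)        # total sum of squares
    # With a single regressor: F = explained sum of squares / residual variance
    f_stat = (tss - rss) / (rss / (n - 2))
    return slope, f_stat

# Simulated data (assumed values, not the Card dataset)
rng = np.random.default_rng(0)
n = 5000
z = rng.integers(0, 2, size=n)               # binary instrument
t = 12 + 0.5 * z + rng.normal(0, 2, n)       # years of schooling
slope, f_stat = first_stage_f(t, z)
print(f"first-stage slope: {slope:.3f}, F-statistic: {f_stat:.1f}")
```

A commonly quoted rule of thumb treats a first-stage F-statistic below about 10 as a sign of a weak instrument; the same helper could be called with `df.educ` and `df.nearc2`.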

# Exercise 2: years of schooling as a binary variable

For this exercise, we will assume that relevance, exclusion restriction and instrumental unconfoundedness all hold. We will also assume that we are in a linear setting (see lectures). Finally, we will convert the treatment into a binary variable to simplify the estimators.

### 1. Data preprocessing: convert the treatment into a binary variable (1 if at or above the median, 0 otherwise).

In [None]:
#Create educ_bin : 1 if educ is at or above the median and 0 otherwise.
df_binary = df.assign(educ_bin = np.where(df.educ>= np.median(df.educ), 1,0))
df_binary.educ_bin.value_counts()

1    1521
0    1489
Name: educ_bin, dtype: int64

### 2. Compute the difference in means. What can you conclude?

In [None]:
diff_mean = np.mean(df_binary.wage[df_binary.educ_bin == 1])- np.mean(df_binary.wage[df_binary.educ_bin == 0])
print(f"The difference in means is {diff_mean.round(2)}.")

The difference in means is 110.06.


What can you conclude?

**Answer**

Longer schooling seems to be associated with an increase in future earnings (about 1.10 dollars per hour). But this estimate is not causal (because strong ignorability does not hold).
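
To illustrate why a difference in means is not causal without ignorability, here is a small simulation with an unobserved confounder. Everything in it is an assumption made for illustration (the variable `ability`, the effect sizes, the sample size), and the bias is upward by construction; in real data its direction is not known in advance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
ability = rng.normal(0, 1, n)                        # unobserved confounder
# Higher ability makes long schooling more likely
educ_bin = (ability + rng.normal(0, 1, n) > 0).astype(int)
true_effect = 100.0                                  # cents per hour, fixed by construction
wage = 500 + true_effect * educ_bin + 80 * ability + rng.normal(0, 50, n)

# Naive comparison of treated and untreated units
diff_mean = wage[educ_bin == 1].mean() - wage[educ_bin == 0].mean()
print(f"true effect: {true_effect:.0f}, difference in means: {diff_mean:.1f}")
```

The naive difference mixes the effect of schooling with the effect of `ability`, which differs between the two groups.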

### 3. Derive the Wald estimator for ATE using your instrumental variable. What can you conclude?

In [None]:
#Compute numerator of Wald estimator. 
numerator = np.mean(df_binary.wage[df_binary.nearc2 == 1]) - np.mean(df_binary.wage[df_binary.nearc2 == 0])
#Compute denominator of Wald estimator. 
denominator = np.mean(df_binary.educ_bin[df_binary.nearc2 == 1]) - np.mean(df_binary.educ_bin[df_binary.nearc2 == 0])
#Print result
wald_estimator = numerator/denominator
print(f"Wald estimator is {wald_estimator.round(2)}.")

Wald estimator is 1282.4.


**Answer**

Our estimate of the ATE is about 12.8 dollars per hour (much more than the 1.10 dollars found with the difference in means). It seems that the difference in means tended to underestimate the true effect due to confounding.
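
For reference, the numerator and denominator computed above correspond to the Wald ratio for a binary instrument. The notation is an assumption and may differ from the lectures: $Z$ is the instrument (`nearc2`), $T$ the treatment (`educ_bin`), $Y$ the outcome (`wage`), and $\hat{E}$ denotes a sample mean.

$$
\hat{\tau}_{\text{Wald}} = \frac{\hat{E}[Y \mid Z = 1] - \hat{E}[Y \mid Z = 0]}{\hat{E}[T \mid Z = 1] - \hat{E}[T \mid Z = 0]}
$$

Because the denominator is a difference of proportions, a small difference in schooling between the two instrument groups scales the numerator up considerably.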

### 4. Derive the two-stage least squares estimator of ATE. What do you see?

In [None]:
##First stage : regress treatment on instrument.
reg1 = LinearRegression().fit(df_binary[["nearc2"]], df_binary["educ_bin"])

##Second stage : regress outcome on the proxy for treatment derived at first stage.

#Create a new column containing the proxy for treatment inferred from reg1
df_binary = df_binary.assign(t_hat = reg1.predict(df_binary[["nearc2"]]))

#Regress outcome on t_hat
reg2 = LinearRegression().fit(df_binary[["t_hat"]], df_binary["wage"])

#Print the coefficient in front of t_hat
print(f"2SLS estimator is {reg2.coef_[0].round(2)}.")

2SLS estimator is 1282.4.


**Answer**

We get exactly the same estimate for ATE!
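
This equality is algebraic rather than a coincidence of this dataset. A minimal check on simulated data (all coefficients below are assumed values) compares the Wald ratio with a two-stage fit done with `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3000
z = rng.integers(0, 2, n).astype(float)                        # binary instrument
u = rng.normal(0, 1, n)                                        # unobserved confounder
t = (0.8 * z + u + rng.normal(0, 1, n) > 0.5).astype(float)    # binary treatment
y = 300 + 120 * t + 60 * u + rng.normal(0, 40, n)              # outcome

# Wald ratio: difference in mean outcome over difference in mean treatment
wald = (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())

# Two-stage least squares with plain OLS fits (np.polyfit returns slope, then intercept)
b1, a1 = np.polyfit(z, t, 1)        # first stage: t ~ z
t_hat = a1 + b1 * z
b2, _ = np.polyfit(t_hat, y, 1)     # second stage: y ~ t_hat

print(f"Wald: {wald:.4f}, 2SLS: {b2:.4f}")
```

The second-stage slope is Cov(Y, T_hat) / Var(T_hat). Since T_hat is a linear function of Z with slope `b1`, this simplifies to Cov(Y, Z) / Cov(T, Z), which for a binary Z is the Wald ratio.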

### 5. Compare your results with `IV2SLS` from the `linearmodels` package. What do you conclude?

In [None]:
#We add a constant to the dataset (this is going to be the intercept term of the model)
df_binary = df_binary.assign(const = 1)
IV2SLS(dependent = df_binary.wage, #outcome
       endog = df_binary.educ_bin, #treatment
       exog = df_binary.const, #we just add a constant as covariate
       instruments = df_binary.nearc2 #Instrument
      ).fit(cov_type = "unadjusted")

| | | | |
|---|---|---|---|
| Dep. Variable: | wage | R-squared: | -4.9264 |
| Estimator: | IV-2SLS | Adj. R-squared: | -4.9284 |
| No. Observations: | 3010 | F-statistic: | 6.7980 |
| Date: | Fri, Nov 04 2022 | P-value (F-stat) | 0.0091 |
| Time: | 15:48:40 | Distribution: | chi2(1) |
| Cov. Estimator: | unadjusted | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| const | -70.736 | 248.81 | -0.2843 | 0.7762 | -558.40 | 416.93 |
| educ_bin | 1282.4 | 491.85 | 2.6073 | 0.0091 | 318.39 | 2246.4 |


**Answer**

We get exactly the same estimate! This way we also have access to a p-value and a confidence interval (deriving them by hand is more involved). Our estimate for the ATE is statistically significant at the 5% level (*p* = 0.0091 < 0.05).
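
For the simple case used here (one instrument, one treatment, an intercept), the homoskedastic standard error has a compact form. The sketch below is an illustration on simulated data, not the `linearmodels` implementation: `iv_slope_and_se` is a hypothetical helper, and the residual variance is divided by `n`, which is an assumption about the degrees-of-freedom convention. The key point is that residuals are computed with the observed treatment, not with the first-stage prediction.

```python
import numpy as np

def iv_slope_and_se(y, t, z):
    """Just-identified IV with an intercept: slope and homoskedastic standard error."""
    y, t, z = (np.asarray(a, dtype=float) for a in (y, t, z))
    n = len(y)
    slope = np.cov(y, z)[0, 1] / np.cov(t, z)[0, 1]
    intercept = y.mean() - slope * t.mean()
    # Residuals use the observed treatment t, not the fitted t_hat
    resid = y - intercept - slope * t
    sigma2 = np.sum(resid ** 2) / n
    # First-stage fitted values of the treatment
    b1 = np.cov(t, z)[0, 1] / np.var(z, ddof=1)
    t_hat = t.mean() + b1 * (z - z.mean())
    se = np.sqrt(sigma2 / np.sum((t_hat - t_hat.mean()) ** 2))
    return slope, se

# Simulated data with a true effect of 120 (assumed values)
rng = np.random.default_rng(3)
n = 5000
z = rng.integers(0, 2, n)
u = rng.normal(0, 1, n)                              # unobserved confounder
t = 12 + 0.7 * z + u + rng.normal(0, 1, n)
y = 300 + 120 * t + 60 * u + rng.normal(0, 40, n)

slope, se = iv_slope_and_se(y, t, z)
print(f"IV slope: {slope:.2f}, standard error: {se:.2f}")
print(f"approximate 95% CI: [{slope - 1.96 * se:.2f}, {slope + 1.96 * se:.2f}]")
```

Reading the standard error off the second-stage OLS fit instead would be incorrect, because that regression computes its residuals with `t_hat`.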

# Exercise 3 : years of schooling as a continuous variable

For this exercise, we will assume that relevance, exclusion restriction and instrumental unconfoundedness all hold. We will also assume that we are in a linear setting (see lectures). We will use the continuous version of the treatment, as available in the original dataset (years of schooling).

### 1. Derive the Wald estimator for a continuous treatment. What do you conclude?

In [None]:
#Compute the empirical covariance between Y and Z
cov_YZ = np.cov(df.wage, df.nearc2)[1,0]
print(cov_YZ)
#Compute the empirical covariance between T and Z
cov_TZ = np.cov(df.educ, df.nearc2)[1,0]
print(cov_TZ)
#Compute Wald estimator
wald_estimator = cov_YZ/cov_TZ
print(f"Wald estimator is {wald_estimator.round(2)}.")

15.106768288710828
0.06294284367274704
Wald estimator is 240.01.


**Answer**

Our ATE estimate for the continuous treatment is 240 cents, meaning that on average each additional year of schooling increases future earnings by 2.4 dollars per hour. This is consistent with our previous results.
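
The covariance ratio used here is the same quantity as the ratio of mean differences from Exercise 2 whenever the instrument is binary; only the treatment has changed. A quick check on simulated data (assumed coefficients, not the Card dataset):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4000
z = rng.integers(0, 2, n)                        # binary instrument
u = rng.normal(0, 1, n)                          # unobserved confounder
t = 12 + 0.6 * z + u + rng.normal(0, 1.5, n)     # continuous treatment
y = 200 + 30 * t + 50 * u + rng.normal(0, 60, n)

# Covariance form of the Wald estimator
cov_ratio = np.cov(y, z)[0, 1] / np.cov(t, z)[0, 1]
# Difference-in-means form, valid because z takes only the values 0 and 1
mean_ratio = (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())

print(f"covariance ratio: {cov_ratio:.4f}, ratio of mean differences: {mean_ratio:.4f}")
```

For a binary Z, the sample covariance with any variable is the difference in group means times a factor that depends only on the group sizes, so the factor cancels in the ratio.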

### 2. Derive the two-stage least squares estimator (2SLS). What can you conclude?

In [None]:
##First stage : regress treatment on instrument.
reg1 = LinearRegression().fit(df_binary[["nearc2"]], df_binary["educ"])

##Second stage : regress outcome on the proxy for treatment derived at first stage.

#Create a new column containing the proxy for treatment inferred from reg1
df_binary = df_binary.assign(t_hat = reg1.predict(df_binary[["nearc2"]]))

#Regress outcome on t_hat
reg2 = LinearRegression().fit(df_binary[["t_hat"]], df_binary["wage"])

#Print the coefficient in front of t_hat
print(f"2SLS estimator is {reg2.coef_[0].round(2)}.")

2SLS estimator is 240.01.


**Answer**

We get exactly the same estimate!

### 3. Compare your results with `IV2SLS` from the `linearmodels` package. What can you conclude?

In [None]:
#We add a constant to the dataset (this is going to be the intercept term of the model)
df = df.assign(const = 1)
IV2SLS(dependent = df.wage, #outcome
       endog = df.educ, #treatment
       exog = df.const, #we just add a constant as covariate
       instruments = df.nearc2 #Instrument
      ).fit(cov_type = "unadjusted")

| | | | |
|---|---|---|---|
| Dep. Variable: | wage | R-squared: | -4.4944 |
| Estimator: | IV-2SLS | Adj. R-squared: | -4.4962 |
| No. Observations: | 3010 | F-statistic: | 7.3325 |
| Date: | Fri, Nov 04 2022 | P-value (F-stat) | 0.0068 |
| Time: | 15:56:04 | Distribution: | chi2(1) |
| Cov. Estimator: | unadjusted | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| const | -2606.0 | 1175.6 | -2.2167 | 0.0266 | -4910.3 | -301.83 |
| educ | 240.01 | 88.634 | 2.7079 | 0.0068 | 66.288 | 413.73 |


**Answer**

We get exactly the same estimate. `IV2SLS` also provides a confidence interval and a p-value, suggesting that the effect is significant (*p* = 0.0068 < 0.05).

# Exercise 4 : adding covariates to the model

For this exercise, we no longer assume instrumental unconfoundedness, but conditional instrumental unconfoundedness. We will account for the role of living in the South (variable `south`) and of race (variable `black`) in our models. These variables should be added to both models (the model for the treatment and the model for the outcome).

### 1. Adapt 2SLS to account for the covariates. What do you conclude?

In [None]:
##First stage : regress treatment on instrument and covariates.
reg1 = LinearRegression().fit(df_binary[["nearc2","south","black"]], df_binary["educ"])

##Second stage : regress outcome on the covariates and on the proxy for treatment derived at first stage.
#Create a new column containing the proxy for treatment inferred from reg1
df_binary = df_binary.assign(t_hat = reg1.predict(df_binary[["nearc2","south","black"]]))
#Regress outcome on the covariates and t_hat
reg2 = LinearRegression().fit(df_binary[["south","black", "t_hat"]], df_binary["wage"])
#Print the coefficients in the order south, black, t_hat (the last one is the 2SLS estimate)

print(f"2SLS estimator is {reg2.coef_.round(2)}.")

2SLS estimator is [ 89.6  270.69 278.48].


**Answer**

The estimate we get for the ATE is slightly higher (by about 38 cents/hour) when accounting for the covariates, suggesting that the confounding bias tended to lead to underestimation of the true effect.
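
With covariates, the two stages can also be written in matrix form. The sketch below uses simulated data with a single covariate `x` (all names and coefficients are assumptions for illustration) and checks that the explicit two-stage fit agrees with the closed-form solution available when there is exactly one instrument for one treatment.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
x = rng.integers(0, 2, n).astype(float)              # observed covariate (e.g. a region dummy)
z = (rng.random(n) < 0.3 + 0.3 * x).astype(float)    # instrument, correlated with the covariate
u = rng.normal(0, 1, n)                              # unobserved confounder
t = 12 + 0.8 * z - 1.0 * x + u + rng.normal(0, 1, n)
y = 100 + 40 * t - 70 * x + 50 * u + rng.normal(0, 30, n)

ones = np.ones(n)
W = np.column_stack([ones, x, z])    # first-stage regressors: constant, covariate, instrument
X = np.column_stack([ones, x, t])    # outcome-model regressors: constant, covariate, treatment

# Explicit two stages: predict t from W, then regress y on (constant, covariate, t_hat)
gamma, *_ = np.linalg.lstsq(W, t, rcond=None)
t_hat = W @ gamma
X_hat = np.column_stack([ones, x, t_hat])
beta_two_stage, *_ = np.linalg.lstsq(X_hat, y, rcond=None)

# Closed form for the just-identified case: solve (W'X) beta = W'y
beta_closed = np.linalg.solve(W.T @ X, W.T @ y)

print("two-stage  :", beta_two_stage.round(3))
print("closed form:", beta_closed.round(3))
```

The covariate appears in both stages, as in the exercise; leaving it out of the first stage would generally break the equivalence.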

### 2. Compare your results with `IV2SLS` from the `linearmodels` package. What do you conclude?

In [None]:
df = df.assign(const = 1)
IV2SLS(dependent = df.wage, #Outcome
       endog = df.educ, #Treatment
       exog = df[["const","south","black"]], #Covariates (including constant)
       instruments = df.nearc2 #Instrument
      ).fit(cov_type = "unadjusted")

| | | | |
|---|---|---|---|
| Dep. Variable: | wage | R-squared: | -6.0579 |
| Estimator: | IV-2SLS | Adj. R-squared: | -6.0649 |
| No. Observations: | 3010 | F-statistic: | 47.869 |
| Date: | Fri, Nov 04 2022 | P-value (F-stat) | 0.0000 |
| Time: | 15:58:19 | Distribution: | chi2(3) |
| Cov. Estimator: | unadjusted | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| const | -3215.8 | 1988.6 | -1.6171 | 0.1059 | -7113.3 | 681.80 |
| south | 89.604 | 100.75 | 0.8894 | 0.3738 | -107.87 | 287.07 |
| black | 270.69 | 208.41 | 1.2988 | 0.1940 | -137.80 | 679.17 |
| educ | 278.48 | 143.35 | 1.9427 | 0.0521 | -2.4759 | 559.44 |


**Answer**

We get exactly the same estimates, except that the 95% confidence interval now contains zero (*p* = 0.0521), so we cannot conclude that the effect is significant at the 5% level.