## Welcome

This is material for the **Instrumental Variables** chapter in Scott Cunningham's book, [Causal Inference: The Mixtape](https://mixtape.scunning.com/).

### Packages needed

The first thing you need to do is load a few packages (install any that are missing first) to make sure everything runs:

In [1]:
import pandas as pd
import numpy as np
import plotnine as p

import statsmodels.api as sm
import statsmodels.formula.api as smf
from linearmodels import IV2SLS 

from stargazer.stargazer import Stargazer
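
The imports above assume every package is already installed. A minimal sketch for checking this up front, using only the standard library. The helper name `missing_packages` is hypothetical, and the list of names simply mirrors the imports used in this notebook (including `rpy2`, which is needed later for JIVE):

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

# Names mirror the imports in this notebook; rpy2 is only needed for the JIVE cells.
required = ["pandas", "numpy", "plotnine", "statsmodels",
            "linearmodels", "stargazer", "rpy2"]

missing = missing_packages(required)
if missing:
    # Assumption: the install name matches the import name for these packages.
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages found.")
```

Running this before the import cell gives a single readable message instead of a failure on the first missing import.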

In [2]:
def read_data(file):
    full_path = "https://raw.github.com/scunning1975/mixtape/master/" + file
    
    return pd.read_stata(full_path)

In [3]:
def lm_robust(formula, data, group_col):
    regression = sm.OLS.from_formula(formula, data = data)
    regression = regression.fit(cov_type="cluster",cov_kwds={"groups":data[group_col]})
    return regression

## Card

In [4]:
card = read_data("card.dta")

#OLS
ols_reg = sm.OLS.from_formula("lwage ~ educ + exper + black + south + married + smsa", 
              data = card).fit()

ols_reg.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | lwage | R-squared: | 0.305 |
| Model: | OLS | Adj. R-squared: | 0.304 |
| Method: | Least Squares | F-statistic: | 219.2 |
| Date: | Sun, 07 Mar 2021 | Prob (F-statistic): | 1.97e-232 |
| Time: | 13:31:12 | Log-Likelihood: | -1273.9 |
| No. Observations: | 3003 | AIC: | 2562.0 |
| Df Residuals: | 2996 | BIC: | 2604.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 5.0633 | 0.064 | 79.437 | 0.000 | 4.938 | 5.188 |
| educ | 0.0712 | 0.003 | 20.438 | 0.000 | 0.064 | 0.078 |
| exper | 0.0342 | 0.002 | 15.422 | 0.000 | 0.030 | 0.038 |
| black | -0.1660 | 0.018 | -9.426 | 0.000 | -0.201 | -0.131 |
| south | -0.1316 | 0.015 | -8.788 | 0.000 | -0.161 | -0.102 |
| married | -0.0359 | 0.003 | -10.547 | 0.000 | -0.043 | -0.029 |
| smsa | 0.1758 | 0.015 | 11.372 | 0.000 | 0.145 | 0.206 |

| | | | |
|---|---|---|---|
| Omnibus: | 53.196 | Durbin-Watson: | 1.858 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 69.43 |
| Skew: | -0.231 | Prob(JB): | 8.38e-16 |
| Kurtosis: | 3.584 | Cond. No. | 154.0 |


In [5]:
#2SLS
iv_reg = IV2SLS.from_formula("lwage ~ 1 + exper + black + south + married + smsa + [educ ~ nearc4 ]", card).fit()
iv_reg.summary

Inputs contain missing values. Dropping rows with missing observations.


| | | | |
|---|---|---|---|
| Dep. Variable: | lwage | R-squared: | 0.2513 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.2498 |
| No. Observations: | 3003 | F-statistic: | 892.71 |
| Date: | Sun, Mar 07 2021 | P-value (F-stat) | 0.0000 |
| Time: | 13:31:12 | Distribution: | chi2(6) |
| Cov. Estimator: | robust | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| Intercept | 4.1625 | 0.8349 | 4.9857 | 0.0000 | 2.5262 | 5.7988 |
| exper | 0.0556 | 0.0199 | 2.7980 | 0.0051 | 0.0166 | 0.0945 |
| black | -0.1157 | 0.0496 | -2.3343 | 0.0196 | -0.2128 | -0.0186 |
| south | -0.1132 | 0.0229 | -4.9314 | 0.0000 | -0.1581 | -0.0682 |
| married | -0.0320 | 0.0051 | -6.3037 | 0.0000 | -0.0419 | -0.0220 |
| smsa | 0.1477 | 0.0303 | 4.8721 | 0.0000 | 0.0883 | 0.2071 |
| educ | 0.1242 | 0.0492 | 2.5258 | 0.0115 | 0.0278 | 0.2205 |
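
With a single instrument and no other controls, the 2SLS estimate reduces to the reduced-form covariance divided by the first-stage covariance. The sketch below illustrates that ratio on simulated data using only the standard library. Every number in it (sample size, true effect, strength of the confounder) is an assumption made for illustration, and the direction of the OLS bias is a property of this simulation, not a statement about the Card data:

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

random.seed(0)
n = 20_000
true_effect = 2.0  # assumed for the simulation, unrelated to the Card estimates

z = [random.randint(0, 1) for _ in range(n)]   # binary instrument
u = [random.gauss(0, 1) for _ in range(n)]     # unobserved confounder
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]                      # treatment
y = [true_effect * xi + 3.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]  # outcome

ols_slope = cov(x, y) / cov(x, x)  # picks up the confounder's contribution
iv_slope = cov(z, y) / cov(z, x)   # reduced form divided by first stage

print(f"OLS: {ols_slope:.3f}  IV: {iv_slope:.3f}  truth: {true_effect}")
```

The instrument is generated independently of the confounder, which is exactly the exclusion and independence condition that cannot be verified in real data.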


#### Questions
- Interpret the coefficient on education when we used OLS versus when we used 2SLS.
- How does the estimated effect of education change when instrumenting with proximity to a 4-year college? That is, does the coefficient get larger or smaller compared to OLS?
- If the only source of bias in our OLS regression were omitted heterogeneous ability, would the 2SLS estimate be larger than, smaller than, or the same as the OLS estimate? Why?
- Is the 2SLS estimate of the causal effect of education, compared with the OLS estimate, consistent with ability bias? What else do you think may be going on, and why?
- What sorts of individuals will go to college regardless of whether a college is near them? What sorts of individuals will never go to college even if one is near them? And what sorts of people will go to college if one is near them but won't go if it is not near them?
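
The last question refers to the standard partition of a population by potential treatment status. A minimal sketch of that bookkeeping follows. `d_if_far` and `d_if_near` are hypothetical names for whether a person would attend college without and with a college nearby. Both values are never observed for the same person, so this is purely conceptual and cannot be computed from the data:

```python
def compliance_type(d_if_far, d_if_near):
    """Classify a person by potential college attendance (0 or 1) under each instrument value."""
    types = {
        (1, 1): "always-taker",  # attends either way
        (0, 0): "never-taker",   # attends in neither case
        (0, 1): "complier",      # attends only when a college is near
        (1, 0): "defier",        # attends only when a college is not near
    }
    return types[(d_if_far, d_if_near)]

for pair in [(1, 1), (0, 0), (0, 1), (1, 0)]:
    print(pair, "->", compliance_type(*pair))
```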

## JIVE 

In [6]:
judge = read_data("judge_fe.dta")
# convert the bail date to a numeric count of days since 1970-01-01 so it can be used as a control
judge['bailDate'] = (judge['bailDate'] - pd.to_datetime('1970-01-01')).dt.days.values

# grouped variable names from the data set
judge_pre = "+".join(judge.columns[judge.columns.str.contains('^judge_pre_[1-7]')])
demo = "+".join(['black', 'age', 'male', 'white'])
off = "+".join(['fel', 'mis', 'sum', 'F1', 'F2', 'F3', 'M1', 'M2', 'M3', 'M'])
prior = "+".join(['priorCases', 'priorWI5', 'prior_felChar', 'prior_guilt', 'onePrior', 'threePriors'])
control2 = "+".join(['day', 'day2', 'bailDate', 't1', 't2', 't3', 't4', 't5'])

#formulas used in the OLS
min_formula = "guilt ~ jail3 + " + control2
max_formula = """guilt ~ jail3 + possess + robbery + DUI1st + drugSell + 
                aggAss + {demo} + {prior} + {off} + {control2}""".format(demo=demo,
                                                                        prior=prior,
                                                                        off=off,
                                                                        control2=control2)

#max variables and min variables
min_ols = sm.OLS.from_formula(min_formula, data = judge).fit()
max_ols = sm.OLS.from_formula(max_formula, data = judge).fit()
print("OLS")
Stargazer([min_ols, max_ols])

OLS


Dependent variable: guilt

| | (1) | (2) |
|---|---|---|
| DUI1st | | 0.047\*\*\* |
| | | (0.004) |
| F1 | | 0.016\*\*\* |
| | | (0.003) |
| F2 | | 0.052\*\*\* |
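
The formula strings in the cell above are assembled with plain string operations: a regular expression selects the judge columns, and `join` glues variable names together with `+`. A minimal sketch of the same idea on a short, made-up list of column names. The helper `build_formula` is hypothetical and is not part of the notebook:

```python
import re

def build_formula(outcome, treatment, control_groups):
    """Join an outcome, a treatment, and lists of control names into a formula string."""
    controls = [name for group in control_groups for name in group]
    return f"{outcome} ~ " + " + ".join([treatment] + controls)

# Illustrative column names only; the real data set has many more.
columns = ["guilt", "jail3", "judge_pre_1", "judge_pre_2", "judge_pre_8", "day", "day2"]

# Same pattern as the notebook: names starting with judge_pre_ followed by a digit 1-7.
judge_cols = [c for c in columns if re.search(r"^judge_pre_[1-7]", c)]

print(judge_cols)
print(build_formula("guilt", "jail3", [["day", "day2"]]))
```

Note that the pattern only constrains the first character after the prefix, so a name such as `judge_pre_10` would also match.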


In [7]:
#--- Instrumental Variables Estimations
#-- 2sls main results
#- Min and Max Control formulas
min_formula = "guilt ~ {control2} + [jail3 ~ {judge_pre}]".format(control2=control2, judge_pre=judge_pre)
max_formula = """guilt ~ {demo} + possess + {prior} + robbery + {off} + DUI1st + {control2} + drugSell + aggAss +
                    [jail3 ~ {judge_pre}]""".format(demo=demo,
                                                    prior=prior,
                                                    off=off,
                                                    control2=control2,
                                                   judge_pre=judge_pre)
min_iv = IV2SLS.from_formula(min_formula, data = judge).fit()
max_iv = IV2SLS.from_formula(max_formula, data = judge).fit()


print("IV")
min_iv.summary

IV


| | | | |
|---|---|---|---|
| Dep. Variable: | guilt | R-squared: | 0.4849 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.4849 |
| No. Observations: | 331971 | F-statistic: | 3.199e+05 |
| Date: | Sun, Mar 07 2021 | P-value (F-stat) | 0.0000 |
| Time: | 13:31:29 | Distribution: | chi2(9) |
| Cov. Estimator: | robust | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| day | -7.952e-05 | 2.506e-05 | -3.1730 | 0.0015 | -0.0001 | -3.04e-05 |
| day2 | 1.4e-07 | 5.806e-08 | 2.4111 | 0.0159 | 2.619e-08 | 2.538e-07 |
| bailDate | 3.2e-05 | 1.902e-06 | 16.829 | 0.0000 | 2.828e-05 | 3.573e-05 |
| t1 | -0.0457 | 0.0028 | -16.545 | 0.0000 | -0.0511 | -0.0403 |
| t2 | -0.0543 | 0.0031 | -17.583 | 0.0000 | -0.0604 | -0.0483 |
| t3 | -0.0254 | 0.0045 | -5.6924 | 0.0000 | -0.0342 | -0.0167 |
| t4 | -0.0022 | 0.0039 | -0.5710 | 0.5680 | -0.0098 | 0.0054 |
| t5 | -0.0095 | 0.0040 | -2.3570 | 0.0184 | -0.0173 | -0.0016 |
| jail3 | 0.1493 | 0.0652 | 2.2893 | 0.0221 | 0.0215 | 0.2771 |


In [8]:
max_iv.summary

| | | | |
|---|---|---|---|
| Dep. Variable: | guilt | R-squared: | 0.5428 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.5428 |
| No. Observations: | 331971 | F-statistic: | 4.057e+05 |
| Date: | Sun, Mar 07 2021 | P-value (F-stat) | 0.0000 |
| Time: | 13:31:32 | Distribution: | chi2(34) |
| Cov. Estimator: | robust | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| black | 0.0592 | 0.0054 | 10.967 | 0.0000 | 0.0486 | 0.0697 |
| age | 0.0015 | 0.0001 | 13.463 | 0.0000 | 0.0013 | 0.0017 |
| male | -0.0515 | 0.0070 | -7.3976 | 0.0000 | -0.0652 | -0.0379 |
| white | 0.1016 | 0.0036 | 28.099 | 0.0000 | 0.0946 | 0.1087 |
| possess | -0.0568 | 0.0039 | -14.739 | 0.0000 | -0.0643 | -0.0492 |
| priorCases | -0.0058 | 0.0003 | -22.891 | 0.0000 | -0.0063 | -0.0053 |
| priorWI5 | 0.0267 | 0.0054 | 4.9923 | 0.0000 | 0.0162 | 0.0372 |
| prior_felChar | -0.0073 | 0.0008 | -9.4857 | 0.0000 | -0.0088 | -0.0058 |
| prior_guilt | 0.0237 | 0.0010 | 22.757 | 0.0000 | 0.0217 | 0.0258 |


In [9]:
from rpy2 import robjects
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr
pandas2ri.activate()
SteinIV = importr('SteinIV')

In [10]:
#-- JIVE main results
#- minimum controls
y = judge['guilt']
# .copy() returns independent DataFrames, so adding the intercept column
# does not trigger pandas' SettingWithCopyWarning
X_min = judge[['jail3', 'day', 'day2', 't1', 't2', 't3', 't4', 't5', 'bailDate']].copy()
X_min['intercept'] = 1

Z_min = judge[judge_pre.split('+') + ['day', 'day2', 't1', 't2', 't3', 't4', 't5', 'bailDate']].copy()
Z_min['intercept'] = 1


In [11]:
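# Copy y, X and Z into R's global environment; the chained assignment also
# rebinds each Python name to the value that was passed to R
y = robjects.globalenv['y'] = y
X_min = robjects.globalenv['X_min'] = np.array(X_min)
Z_min = robjects.globalenv['Z_min'] = np.array(Z_min)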

print("JIVE")
SteinIV.jive_est(y = y, X=X_min, Z=Z_min)

JIVE


0,1
est,[RTYPES.REALSXP]
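
The displayed result above does not show what a jackknife IV estimator actually computes. As a rough illustration, the sketch below implements one common variant (leave-one-out first-stage fitted values used as instruments, often called JIVE1) in NumPy on simulated data. It is an assumption that this matches the variant used here, and it is not necessarily identical to what `SteinIV` computes. The function name `jive1` and all simulation parameters are made up for the example:

```python
import numpy as np

def jive1(y, X, Z):
    """Leave-one-out (jackknife) IV estimate; X and Z must share the exogenous columns."""
    y, X, Z = (np.asarray(a, dtype=float) for a in (y, X, Z))
    ZtZ_inv = np.linalg.pinv(Z.T @ Z)
    fitted = Z @ (ZtZ_inv @ (Z.T @ X))            # full-sample first-stage fitted values
    leverage = np.sum((Z @ ZtZ_inv) * Z, axis=1)  # h_i = z_i' (Z'Z)^{-1} z_i
    # First-stage prediction for row i from a fit that excludes row i.
    X_loo = (fitted - leverage[:, None] * X) / (1.0 - leverage[:, None])
    return np.linalg.solve(X_loo.T @ X, X_loo.T @ y)

# Simulated data; every number below is an assumption for illustration.
rng = np.random.default_rng(0)
n, k = 4000, 10
instruments = rng.normal(size=(n, k))
confounder = rng.normal(size=n)
treatment = instruments @ np.full(k, 0.3) + confounder + rng.normal(size=n)
outcome = 1.0 + 2.0 * treatment + 2.0 * confounder + rng.normal(size=n)

X = np.column_stack([treatment, np.ones(n)])
Z = np.column_stack([instruments, np.ones(n)])
beta = jive1(outcome, X, Z)
print(beta)  # first entry is the treatment coefficient, second the intercept
```

Because each observation's own first-stage error is excluded from its fitted value, the fitted treatment is not mechanically correlated with that observation's outcome error, which is the motivation for using a jackknife when there are many instruments such as judge indicators.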


In [12]:
# controls shared by the structural equation (X) and the instrument set (Z)
max_controls = ['white', 'age', 'male', 'black',
         'possess', 'robbery', 
         'prior_guilt', 'onePrior', 'priorWI5', 'prior_felChar', 'priorCases',
         'DUI1st', 'drugSell', 'aggAss', 'fel', 'mis', 'sum',
         'threePriors',
         'F1', 'F2', 'F3',
         'M', 'M1', 'M2', 'M3',
         'day', 'day2', 'bailDate', 
         't1', 't2', 't3', 't4', 't5']

# .copy() returns independent DataFrames, so adding the intercept column
# does not trigger pandas' SettingWithCopyWarning
X_max = judge[['jail3'] + max_controls].copy()
X_max['intercept'] = 1

Z_max = judge[judge_pre.split('+') + max_controls].copy()
Z_max['intercept'] = 1
X_max = robjects.globalenv['X_max'] = np.array(X_max)
Z_max = robjects.globalenv['Z_max'] = np.array(Z_max)


In [13]:
SteinIV.jive_est(y = y, X = X_max, Z = Z_max)

0,1
est,[RTYPES.REALSXP]


#### Questions
- Interpret the coefficient on `jail3` from each of our two IV estimators. How do they compare to our OLS estimate?
- What is your conclusion about the effect that cash bail has on adjudication? Speculate about the channels by which cash bail has this effect.
- Describe the four sub-populations (always takers, never takers, defiers, and compliers) in the context of Stevenson's study.
- Discuss the plausibility of each of the five IV assumptions in Stevenson's case.
- Draw the DAG that must hold for Stevenson's JIVE estimates to be consistent. Which assumptions are contained in this DAG, and which ones are not easily visualized?
- Assume judge A is stricter than judge B. Monotonicity requires that whatever bail amount judge B would set for a given individual, judge A would hypothetically always set a higher amount for that same individual. Provide some examples where you think this may be violated.


