# NETSI Special Topic: Causal Analysis
## Jupyter notebook on regression basics

In [1]:
## Set seed and parameters
library(lfe)
set.seed(5)
N <- 2000

Loading required package: Matrix


### Baseline regression

In [2]:
## Create empty dataframe
df <- data.frame("ID" = 1:N)

## Simulate household wealth
### Let's assume that in this hypothetical survey, we only observe whether HH wealth is > 1M (hhw = 1) or not (hhw = 0).
df$hhw <- floor(runif(N, min=0, max=2))

## Simulate decision to attend private college (completely random for now)
df$private <- floor(runif(N, min=0, max=2))

## Simulate earnings
treatment.effect <- 10000
df$salary <- 50000 + df$private*treatment.effect + df$hhw*10000 + rnorm(N, mean=0, sd=40000)

## Show examples
head(df)

ID,hhw,private,salary
<int>,<dbl>,<dbl>,<dbl>
1,0,0,112160.884
2,1,1,43530.321
3,1,1,74163.727
4,0,1,25743.575
5,0,0,-3212.614
6,1,1,6097.829


In [3]:
## Run regression without/with HHW
short <- felm(salary ~ private, data=df)
long <- felm(salary ~ private + hhw, data=df)

print(summary(short, robust=TRUE))
print(summary(long, robust=TRUE))
### robust=TRUE gives heteroskedasticity-robust standard errors

### Note that both regressions estimate the causal effect without bias, because private is assigned at random.
### The second regression has more "statistical power".
### "Statistical power" = Pr(reject null | null is false).
### Controlling for hhw removes part of the residual variance, so if you repeat this simulation many times,
### the second regression will reject the null hypothesis of no effect more often than the first.
### Feel free to test this. Pick N so that the difference is visible:
### not so small that the null is almost never rejected, and not so large that it is always rejected.


Call:
   felm(formula = salary ~ private, data = df) 

Residuals:
    Min      1Q  Median      3Q     Max 
-141966  -26438     570   27923  151181 

Coefficients:
            Estimate Robust s.e t value Pr(>|t|)    
(Intercept)    56210       1267  44.361  < 2e-16 ***
private         6751       1813   3.723 0.000202 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40600 on 1998 degrees of freedom
Multiple R-squared(full model): 0.006864   Adjusted R-squared: 0.006367 
Multiple R-squared(proj model): 0.006864   Adjusted R-squared: 0.006367 
F-statistic(full model, *iid*):13.81 on 1 and 1998 DF, p-value: 0.0002078 
F-statistic(proj model): 13.86 on 1 and 1998 DF, p-value: 0.0002023 



Call:
   felm(formula = salary ~ private + hhw, data = df) 

Residuals:
    Min      1Q  Median      3Q     Max 
-147286  -26493    -160   27659  146071 

Coefficients:
            Estimate Robust s.e t value Pr(>|t|)    
(Intercept)    51179       1539  33.
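The power claim in the comments above can be checked with a small Monte Carlo. The sketch below is one way to do it, under assumptions that are not in the notebook: it uses base R `lm()` with classical standard errors instead of `felm(..., robust=TRUE)`, and `n.sims`, `n.obs`, and `reject()` are hypothetical names and values chosen for illustration.

```r
## Sketch: compare how often the short and long regressions reject H0: no effect.
## Assumption: classical (non-robust) standard errors from lm(), not felm().
set.seed(1)               # hypothetical seed, not the notebook's
n.sims <- 500             # hypothetical number of replications
n.obs  <- 200             # hypothetical sample size, smaller than N so power is not close to 1
treatment.effect <- 10000

## TRUE if the p-value on `private` is below 0.05
reject <- function(fit) {
  coef(summary(fit))["private", "Pr(>|t|)"] < 0.05
}

results <- replicate(n.sims, {
  hhw     <- floor(runif(n.obs, min=0, max=2))
  private <- floor(runif(n.obs, min=0, max=2))
  salary  <- 50000 + private*treatment.effect + hhw*10000 + rnorm(n.obs, mean=0, sd=40000)
  c(short = reject(lm(salary ~ private)),
    long  = reject(lm(salary ~ private + hhw)))
})

## Share of replications in which each regression rejects the null
power <- rowMeans(results)
print(power)
```

Because `hhw` explains only a small share of the salary variance in this design, the two rejection rates may be close. Increasing the `hhw` coefficient in the simulated salary should widen the gap.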

### Endogenous selection

In [4]:
## Create empty dataframe
df2 <- data.frame("ID" = 1:N)

## Simulate household wealth
### Let's assume that in this hypothetical survey, we only observe whether HH wealth is > 1M (hhw = 1) or not (hhw = 0).
df2$hhw <- floor(runif(N, min=0, max=2))

## Simulate decision to attend private college 
### (Instead of being completely random, attendance is now more likely for students from wealthy HHs.)
df2$private <- 1* (df2$hhw + runif(N, min=-0.8, max=0.8) > 0.5)

## Check that df and df2 have a roughly similar number of private college attendees
print(table(df$private))
print(table(df2$private)) 

## Simulate earnings
treatment.effect <- 10000
df2$salary <- 50000 + df2$private*treatment.effect + df2$hhw*10000 + rnorm(N, mean=0, sd=40000)

## Some sample data
head(df2)


   0    1 
 967 1033 

   0    1 
1012  988 


ID,hhw,private,salary
<int>,<dbl>,<dbl>,<dbl>
1,1,0,24784.281
2,1,1,101131.612
3,0,0,80744.346
4,1,1,23834.099
5,1,0,2135.256
6,1,1,20831.842
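To make the selection rule concrete, the attendance probability it implies for each wealth group can be worked out directly. The following is a minimal sketch; `p.private()` is a hypothetical helper, and the seed and sample size are illustrative rather than the notebook's.

```r
## Attendance rule from the cell above: private = 1 * (hhw + u > 0.5), with u ~ Uniform(-0.8, 0.8).
## So Pr(private = 1 | hhw) = Pr(u > 0.5 - hhw).
p.private <- function(hhw) punif(0.5 - hhw, min=-0.8, max=0.8, lower.tail=FALSE)
print(c(low.wealth = p.private(0), high.wealth = p.private(1)))
## low.wealth = 0.1875, high.wealth = 0.8125

## Simulated check with a large sample
set.seed(123)             # hypothetical seed
n <- 100000               # hypothetical sample size
hhw <- floor(runif(n, min=0, max=2))
private <- 1 * (hhw + runif(n, min=-0.8, max=0.8) > 0.5)
shares <- tapply(private, hhw, mean)
print(shares)
```

Since the two probabilities sum to one and `hhw` is split evenly, about half of all students attend private college, which is why the attendee counts in the two datasets are similar.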


In [5]:
## Run regression without/with HHW
short <- felm(salary ~ private, data=df2)
long <- felm(salary ~ private + hhw, data=df2)

print(summary(short, robust=TRUE))
print(summary(long, robust=TRUE))

### The short regression gives a biased estimate of the causal effect:
### the true coefficient, 10000, lies outside its 95% confidence interval.
### The long regression removes this bias by controlling for hhw.
### In effect, the long regression compares private vs. public among wealthy kids,
### compares private vs. public among poorer kids,
### and then takes a weighted average of the two comparisons.


Call:
   felm(formula = salary ~ private, data = df2) 

Residuals:
    Min      1Q  Median      3Q     Max 
-124278  -26752    -760   27812  138709 

Coefficients:
            Estimate Robust s.e t value Pr(>|t|)    
(Intercept)    50891       1252  40.662   <2e-16 ***
private        15013       1810   8.296   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40450 on 1998 degrees of freedom
Multiple R-squared(full model): 0.03333   Adjusted R-squared: 0.03284 
Multiple R-squared(proj model): 0.03333   Adjusted R-squared: 0.03284 
F-statistic(full model, *iid*):68.88 on 1 and 1998 DF, p-value: < 2.2e-16 
F-statistic(proj model): 68.83 on 1 and 1998 DF, p-value: < 2.2e-16 



Call:
   felm(formula = salary ~ private + hhw, data = df2) 

Residuals:
    Min      1Q  Median      3Q     Max 
-128631  -26789    -719   27360  139860 

Coefficients:
            Estimate Robust s.e t value Pr(>|t|)    
(Intercept)    49740       1330  37.39
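The "match within wealth group, then average" description of the long regression can be verified by hand. The sketch below re-simulates data with the same design (a hypothetical data frame `sim` with its own seed, so the numbers will differ from the output above) and uses base R `lm()` rather than `felm()`. It assumes the standard result that, with a binary control, OLS weights each group's private-vs-public gap by n * p * (1 - p), where p is the group's share of private attendees.

```r
## Re-simulate the endogenous-selection design (hypothetical seed and object names)
set.seed(42)
n <- 2000
sim <- data.frame(hhw = floor(runif(n, min=0, max=2)))
sim$private <- 1 * (sim$hhw + runif(n, min=-0.8, max=0.8) > 0.5)
sim$salary  <- 50000 + sim$private*10000 + sim$hhw*10000 + rnorm(n, mean=0, sd=40000)

## Within each wealth group: private-vs-public salary gap and its OLS weight
strata <- lapply(split(sim, sim$hhw), function(g) {
  p <- mean(g$private)
  c(gap    = mean(g$salary[g$private == 1]) - mean(g$salary[g$private == 0]),
    weight = nrow(g) * p * (1 - p))
})
strata <- do.call(rbind, strata)

## Weighted average of the two gaps vs. the coefficient from the long regression
matched.estimate <- sum(strata[, "gap"] * strata[, "weight"]) / sum(strata[, "weight"])
ols.estimate <- unname(coef(lm(salary ~ private + hhw, data=sim))["private"])

print(strata)
print(c(matched = matched.estimate, ols = ols.estimate))
```

The two printed estimates agree up to floating-point error, which is the sense in which the long regression "matches" students on household wealth.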

In [6]:
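### The OVB (omitted variable bias) formula
### Auxiliary regression of the omitted variable (hhw) on the included regressor (private)
fs <- felm(hhw ~ private, data=df2) ### fs = "first stage"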
print(summary(fs, robust=TRUE))


Call:
   felm(formula = hhw ~ private, data = df2) 

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8360 -0.1897  0.1640  0.1640  0.8103 

Coefficients:
            Estimate Robust s.e t value Pr(>|t|)    
(Intercept)  0.18972    0.01233   15.39   <2e-16 ***
private      0.64631    0.01706   37.89   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3816 on 1998 degrees of freedom
Multiple R-squared(full model): 0.4178   Adjusted R-squared: 0.4175 
Multiple R-squared(proj model): 0.4178   Adjusted R-squared: 0.4175 
F-statistic(full model, *iid*): 1434 on 1 and 1998 DF, p-value: < 2.2e-16 
F-statistic(proj model):  1436 on 1 and 1998 DF, p-value: < 2.2e-16 




In [7]:
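## OVB formula: short estimate = long estimate + bias,
## where bias = (coefficient on private in fs) * (coefficient on hhw in long)
bias <- coef(summary(fs, robust=TRUE))["private",1] * coef(summary(long, robust=TRUE))["hhw",1]
long.estimate <- coef(summary(long, robust=TRUE))["private",1]
short.estimate <- coef(summary(short, robust=TRUE))["private",1]
print(bias)                  ## implied bias
print(bias + long.estimate)  ## should match the short estimate
print(short.estimate)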

[1] 3922.767
[1] 15013.2
[1] 15013.2
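The three printed numbers illustrate the omitted variable bias identity. As a sketch, using notation that is this note's own rather than the notebook's: let $\hat\beta^{short}$ and $\hat\beta^{long}$ be the coefficients on `private` in the short and long regressions, $\hat\gamma$ the coefficient on `hhw` in the long regression, and $\hat\pi$ the coefficient on `private` in the regression `hhw ~ private`.

$$
\hat\beta^{short} = \hat\beta^{long} + \hat\pi\,\hat\gamma,
\qquad
15013.2 \approx 11090.4 + 3922.8
$$

For OLS this identity holds exactly in sample, which is why the last two printed values coincide. The value 11090.4 is backed out as 15013.2 - 3922.767, since the long-regression output above is cut off.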


### Exercise

Show a simulation exercise where regression does not go as intended. Explain why.