# NETSI Special Topic: Causal Analysis
## Jupyter notebook on regression basics

In [2]:
## Set seed and parameters
library(lfe)
set.seed(5)
N <- 3000



### Baseline regression

In [2]:
## Create empty dataframe
df <- data.frame("ID" = 1:N)

## Simulate household wealth
### Assume this hypothetical survey records only whether household (HH) wealth exceeds 1M, so hhw is a 0/1 indicator.
df$hhw <- floor(runif(N, min=0, max=2))

## Simulate decision to attend private college (completely random for now)
df$private <- floor(runif(N, min=0, max=2))

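## Simulate earnings: baseline 50000, plus the treatment effect of private college, plus 10000 for wealthy HH, plus noise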
treatment.effect <- 10000
df$salary <- 50000 + df$private*treatment.effect + df$hhw*10000 + rnorm(N, mean=0, sd=40000)

## Show examples
head(df)

ID,hhw,private,salary
<int>,<dbl>,<dbl>,<dbl>
1,0,1,78973.88
2,1,1,86067.18
3,1,1,37405.97
4,0,1,39449.91
5,0,1,54470.22
6,1,0,106778.78
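
Written out, the data-generating process simulated in the cell above looks as follows. This is only a sketch, and the symbols are notation introduced here rather than names from the notebook: $Y_i$ is `salary`, $P_i$ is `private`, and $H_i$ is `hhw`.

$$
Y_i = 50000 + 10000\,P_i + 10000\,H_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}\left(0,\ 40000^2\right)
$$

Here $P_i$ and $H_i$ are drawn independently, each equal to 0 or 1 with probability 1/2. The coefficient on $P_i$ is `treatment.effect`, the causal effect that the regressions below try to recover.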


In [3]:
## Run regression without/with HHW
short <- felm(salary ~ private, data=df)
long <- felm(salary ~ private + hhw, data=df)

print(summary(short, robust=TRUE))
print(summary(long, robust=TRUE))
### robust=TRUE gives heteroskedasticity-robust standard errors

### Note that both regressions give roughly correct causal estimates, because private is assigned at random.
### The second regression has more "statistical power".
### "Statistical power" = Pr(reject null | null is false).
### If you run this simulation many times, the second regression will reject the null hypothesis
### more often than the first, i.e., it is more likely to detect the true nonzero effect.
### Feel free to test this. Choose N so that the comparison is informative:
### not so small that the null is almost never rejected, and not so large that it is always rejected.


Call:
   felm(formula = salary ~ private, data = df) 

Residuals:
    Min      1Q  Median      3Q     Max 
-124276  -26961    -122   27674  145148 

Coefficients:
            Estimate Robust s.e t value Pr(>|t|)    
(Intercept)    54907       1034  53.112  < 2e-16 ***
private        10437       1484   7.032  2.5e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40640 on 2998 degrees of freedom
Multiple R-squared(full model): 0.01623   Adjusted R-squared: 0.0159 
Multiple R-squared(proj model): 0.01623   Adjusted R-squared: 0.0159 
F-statistic(full model, *iid*):49.46 on 1 and 2998 DF, p-value: 2.503e-12 
F-statistic(proj model): 49.46 on 1 and 2998 DF, p-value: 2.503e-12 



Call:
   felm(formula = salary ~ private + hhw, data = df) 

Residuals:
    Min      1Q  Median      3Q     Max 
-125368  -27140    -482   26912  140300 

Coefficients:
            Estimate Robust s.e t value Pr(>|t|)    
(Intercept)    49998       1237  40.430  <
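
The power comparison suggested in the comments above could be checked with a small Monte Carlo loop along the following lines. This is a minimal sketch under stated assumptions, not the notebook's own code: it uses base R `lm()` with classical (not robust) standard errors to stay self-contained, and the seed, `n_sims`, `n_obs`, and the helper `reject_null()` are hypothetical choices made for illustration.

```r
set.seed(1)        # hypothetical seed, chosen only for reproducibility
n_sims <- 500      # number of simulated datasets (assumption)
n_obs  <- 300      # sample size per dataset (assumption; small enough that power is below 1)
effect <- 10000    # true treatment effect, as in the notebook

# TRUE if the coefficient on `private` is significant at the 5% level
reject_null <- function(fit) {
  summary(fit)$coefficients["private", "Pr(>|t|)"] < 0.05
}

results <- replicate(n_sims, {
  # Same data-generating process as the baseline cell, with random assignment
  hhw     <- rbinom(n_obs, 1, 0.5)
  private <- rbinom(n_obs, 1, 0.5)
  salary  <- 50000 + effect * private + 10000 * hhw + rnorm(n_obs, sd = 40000)
  c(short = reject_null(lm(salary ~ private)),
    long  = reject_null(lm(salary ~ private + hhw)))
})

# Share of simulations in which each regression rejects the null
power <- rowMeans(results)
print(power)
```

Because `hhw` accounts for only a small share of the variation in `salary` in this design, the gap between the two rejection rates is expected to be small; raising the coefficient on `hhw` in the simulation should make the difference easier to see.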

### Endogenous selection

In [4]:
## Create empty dataframe
df2 <- data.frame("ID" = 1:N)

## Simulate household wealth
### Assume this hypothetical survey records only whether household (HH) wealth exceeds 1M, so hhw is a 0/1 indicator.
df2$hhw <- floor(runif(N, min=0, max=2))

## Simulate decision to attend private college 
### Instead of being completely random, attendance is now more likely for kids from wealthy HHs.
df2$private <- 1* (df2$hhw + runif(N, min=-0.8, max=0.8) > 0.5)

## Check that both datasets have a roughly similar number of private college attendees
print(table(df$private))
print(table(df2$private)) 

## Simulate earnings (same equation as in the baseline case)
treatment.effect <- 10000
df2$salary <- 50000 + df2$private*treatment.effect + df2$hhw*10000 + rnorm(N, mean=0, sd=40000)

## Some sample data
head(df2)


   0    1 
1500 1500 

   0    1 
1527 1473 


ID,hhw,private,salary
<int>,<dbl>,<dbl>,<dbl>
1,0,0,52453.27
2,1,1,108266.16
3,1,1,38468.71
4,1,1,85019.5
5,1,1,121545.12
6,0,0,133090.43
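
The selection rule used above, `hhw + runif(N, min=-0.8, max=0.8) > 0.5`, implies specific attendance probabilities for each wealth group. As a rough sketch, they can be computed with `punif()`; the variable names below are hypothetical and do not appear in the notebook.

```r
# P(private = 1 | hhw = 0): the uniform draw must exceed 0.5
p_private_low  <- punif(0.5,  min = -0.8, max = 0.8, lower.tail = FALSE)

# P(private = 1 | hhw = 1): the uniform draw must exceed -0.5
p_private_high <- punif(-0.5, min = -0.8, max = 0.8, lower.tail = FALSE)

# Overall attendance rate, given that hhw is 0 or 1 with equal probability
p_private <- 0.5 * p_private_low + 0.5 * p_private_high

print(c(low = p_private_low, high = p_private_high, overall = p_private))
```

The implied probabilities are 0.1875 for non-wealthy households and 0.8125 for wealthy ones, with an overall rate of 0.5. That overall rate is why the two `table()` outputs show similar counts even though attendance now depends strongly on wealth.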


In [5]:
## Run regression without/with HHW
short <- felm(salary ~ private, data=df2)
long <- felm(salary ~ private + hhw, data=df2)

print(summary(short, robust=TRUE))
print(summary(long, robust=TRUE))

### The short regression gives a biased estimate of the causal effect:
### the true coefficient, 10000, lies outside its 95% confidence interval.
### The long regression brings the estimate back close to the true value.
### In effect, the long regression compares private vs. public among kids from wealthy households,
### does the same among kids from poorer households,
### and then takes a weighted average of the two comparisons.


Call:
   felm(formula = salary ~ private, data = df2) 

Residuals:
    Min      1Q  Median      3Q     Max 
-134856  -27782     542   27345  132682 

Coefficients:
            Estimate Robust s.e t value Pr(>|t|)    
(Intercept)    50866       1053   48.30   <2e-16 ***
private        16506       1486   11.11   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40710 on 2998 degrees of freedom
Multiple R-squared(full model): 0.03949   Adjusted R-squared: 0.03917 
Multiple R-squared(proj model): 0.03949   Adjusted R-squared: 0.03917 
F-statistic(full model, *iid*):123.3 on 1 and 2998 DF, p-value: < 2.2e-16 
F-statistic(proj model): 123.4 on 1 and 2998 DF, p-value: < 2.2e-16 



Call:
   felm(formula = salary ~ private + hhw, data = df2) 

Residuals:
    Min      1Q  Median      3Q     Max 
-136595  -28089     439   27073  130942 

Coefficients:
            Estimate Robust s.e t value Pr(>|t|)    
(Intercept)    49218       1106  44.51
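
The "weighted average" reading of the long regression in the comments above can be checked numerically. The sketch below is an illustration under assumptions, not the notebook's code: it regenerates data with a hypothetical seed, uses base R `lm()` instead of `felm()`, and weights each wealth group by n_s * p_s * (1 - p_s), where p_s is the share attending private college in that group. These are the weights OLS implicitly applies when the control is a single binary variable.

```r
set.seed(2)   # hypothetical seed
n <- 3000
hhw     <- rbinom(n, 1, 0.5)
private <- as.numeric(hhw + runif(n, min = -0.8, max = 0.8) > 0.5)
salary  <- 50000 + 10000 * private + 10000 * hhw + rnorm(n, sd = 40000)

# Private-minus-public difference in mean salary within each wealth group
stratum_diff <- sapply(0:1, function(s) {
  mean(salary[hhw == s & private == 1]) - mean(salary[hhw == s & private == 0])
})

# Group weight: group size times the within-group variance of `private`
stratum_weight <- sapply(0:1, function(s) {
  p <- mean(private[hhw == s])
  sum(hhw == s) * p * (1 - p)
})

weighted_avg <- sum(stratum_weight * stratum_diff) / sum(stratum_weight)
ols_coef     <- unname(coef(lm(salary ~ private + hhw))["private"])

print(c(weighted_avg = weighted_avg, ols = ols_coef))
```

The two printed numbers should agree up to floating-point error. The weights favor groups in which attendance is closer to 50/50, so this is a variance-weighted average rather than a simple population-weighted one.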

In [6]:
### The omitted variable bias (OVB) formula
### fs = "first stage": here, the auxiliary regression of the omitted variable (hhw) on private
fs <- felm(hhw ~ private, data=df2)
print(summary(fs, robust=TRUE))


Call:
   felm(formula = hhw ~ private, data = df2) 

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8153 -0.1749 -0.1749  0.1847  0.8252 

Coefficients:
            Estimate Robust s.e t value Pr(>|t|)    
(Intercept) 0.174853   0.009724   17.98   <2e-16 ***
private     0.640490   0.014030   45.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.384 on 2998 degrees of freedom
Multiple R-squared(full model): 0.4103   Adjusted R-squared: 0.4101 
Multiple R-squared(proj model): 0.4103   Adjusted R-squared: 0.4101 
F-statistic(full model, *iid*): 2086 on 1 and 2998 DF, p-value: < 2.2e-16 
F-statistic(proj model):  2084 on 1 and 2998 DF, p-value: < 2.2e-16 




In [7]:
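### OVB formula: short coefficient = long coefficient + bias, where
### bias = (slope of hhw on private, from fs) * (coefficient on hhw in the long regression).
### Here short and long are the df2 regressions fitted in the previous cell.
bias <- coef(summary(fs, robust=TRUE))["private",1] * coef(summary(long, robust=TRUE))["hhw",1]
long.estimate <- coef(summary(long, robust=TRUE))["private",1]
short.estimate <- coef(summary(short, robust=TRUE))["private",1]
print(bias)                  ### omitted variable bias
print(bias + long.estimate)  ### long estimate plus bias ...
print(short.estimate)        ### ... equals the short estimate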

[1] 6034.682
[1] 16505.95
[1] 16505.95
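
The identity checked in the cell above can be derived in a few lines. This is a sketch; the symbols are notation introduced here rather than names from the notebook, with $Y_i$ for `salary`, $P_i$ for `private`, and $H_i$ for `hhw`.

$$
\begin{aligned}
\text{long:}\quad & Y_i = \alpha + \beta P_i + \gamma H_i + u_i \\
\text{auxiliary (fs):}\quad & H_i = \pi_0 + \pi_1 P_i + v_i \\
\text{substituting:}\quad & Y_i = (\alpha + \gamma \pi_0) + (\beta + \gamma \pi_1)\, P_i + (u_i + \gamma v_i) \\
\text{short:}\quad & \beta^{s} = \beta + \underbrace{\pi_1 \gamma}_{\text{bias}}
\end{aligned}
$$

With OLS estimates this holds exactly in the sample, which is why the last two printed numbers are identical. In the output above the bias is about 6035, so the implied long estimate is roughly 16506 - 6035 = 10471, close to the true effect of 10000.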


### Exercise

Construct a simulation in which the regression does not recover the true treatment effect, and explain why.