## Regression Gone Astray

##### By Matthew Simonson

#### In this notebook, we simulate Judea Pearl's somewhat contrived justification for why we need DAGs: the infamous M-bias!

In [1]:
set.seed(20015)

According to Pearl's setup, $x \to y$ as usual and $x$ and $y$ each have an additional cause: $a \to x$ and $b \to y$. But those proximate causes $a$ and $b$ also cause another variable, $m$: 

$ x \leftarrow a \to m \leftarrow b \to y$

In [2]:
a <- 10*runif(1000)
b <- 10*runif(1000)
x <- a + runif(1000)
y <- x + b + runif(1000)
m <- a + b + runif(1000)

Clearly, a one-unit increase in $x$ should lead to about a one-unit increase in $y$, regardless of whether any other causes of $y$ are accounted for.

In [3]:
reg_1 <- lm(y~x)
reg_2 <- lm(y~x+b)
reg_3 <- lm(y~x+m)


In [4]:
summary(reg_1)


Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3638 -2.4626 -0.0931  2.4398  5.4052 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.59960    0.19284   29.04   <2e-16 ***
x            0.97512    0.03186   30.60   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.905 on 998 degrees of freedom
Multiple R-squared:  0.4841,	Adjusted R-squared:  0.4836 
F-statistic: 936.5 on 1 and 998 DF,  p-value: < 2.2e-16


So far, so good: the estimated coefficient on $x$ is close to 1.

In [5]:
summary(reg_2)


Call:
lm(formula = y ~ x + b)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50879 -0.24705 -0.01157  0.24652  0.50406 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.499110   0.024928   20.02   <2e-16 ***
x           1.001386   0.003153  317.60   <2e-16 ***
b           0.998804   0.003143  317.81   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2873 on 997 degrees of freedom
Multiple R-squared:  0.995,	Adjusted R-squared:  0.9949 
F-statistic: 9.836e+04 on 2 and 997 DF,  p-value: < 2.2e-16


Even better! The estimate for $x$'s coefficient remains close to 1, and the R-squared values improve.

In [27]:
summary(reg_3)


Call:
lm(formula = y ~ x + m)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.35346 -0.34449 -0.00483  0.33773  1.40623 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.532856   0.042282  12.602   <2e-16 ***
x           0.015489   0.007818   1.981   0.0478 *  
m           0.987817   0.005458 180.980   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4994 on 997 degrees of freedom
Multiple R-squared:  0.9856,	Adjusted R-squared:  0.9856 
F-statistic: 3.415e+04 on 2 and 997 DF,  p-value: < 2.2e-16


Uh oh! Looks like we got M-bias! The coefficient on $x$ collapses to about 0.015 while $m$ soaks up nearly all the variation, even though $m$ is not a cause of $y$. By conditioning on a collider variable ($m$), we have inadvertently generated an artificial correlation between $a$ and $b$ where none existed previously, because in order to hold $m$ constant, any increase in $a$ must be matched by a decrease in $b$. Thus, controlling for $m$ opens up a backdoor path from $x$ to $y$ ($x \leftarrow a \leftrightarrow b \rightarrow y$), which siphons off much of the effect $x$ is supposed to be having on $y$ directly.
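The induced correlation between $a$ and $b$ can be checked directly. The following is a minimal sketch, not part of the original notebook: it repeats the simulation with the same seed and draw order, then compares the plain correlation of $a$ and $b$ with their partial correlation given $m$, computed here by correlating the residuals of `a ~ m` and `b ~ m`. The names `cor_marginal` and `cor_given_m` are my own.

```r
# Repeat the notebook's simulation (same seed, same order of draws)
set.seed(20015)
a <- 10 * runif(1000)
b <- 10 * runif(1000)
x <- a + runif(1000)
y <- x + b + runif(1000)
m <- a + b + runif(1000)

# Marginal correlation: a and b are drawn independently,
# so this should be close to zero
cor_marginal <- cor(a, b)

# Partial correlation given m: correlate what is left of a and of b
# after each has been regressed on the collider m
cor_given_m <- cor(resid(lm(a ~ m)), resid(lm(b ~ m)))

print(c(marginal = cor_marginal, given_m = cor_given_m))
```

If the reasoning above holds, the marginal correlation should sit near zero while the partial correlation should be strongly negative, since $m$ is almost exactly $a + b$ in this setup.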

**Donghee's comment: Fantastic example. Note that $a$ and $b$ are unobserved in Pearl's example, so y~x+b is a hypothetical regression.**
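Taking the comment above at face value, an analyst who observes only $x$, $m$, and $y$ is choosing between y~x and y~x+m. As a rough check that the gap between those two is not a fluke of one sample, the sketch below works out the large-sample slopes from the variances implied by the simulation: $100/12$ for $10 \cdot U(0,1)$ and $1/12$ for each $U(0,1)$ noise term. The variable names (`s2`, `t2`, `S`, `s_y`) are mine and purely illustrative.

```r
# Variances implied by the simulation
s2 <- 100 / 12  # Var(a) = Var(b), since each is 10 * U(0, 1)
t2 <- 1 / 12    # variance of each added U(0, 1) noise term

# Population covariance matrix of the observed regressors x and m,
# where x = a + noise and m = a + b + noise
S <- matrix(c(s2 + t2, s2,
              s2,      2 * s2 + t2),
            nrow = 2, byrow = TRUE)

# Covariances of x and of m with y = x + b + noise
s_y <- c(s2 + t2, 2 * s2)

# Slope from y ~ x: Cov(x, y) / Var(x)
slope_x_only <- s_y[1] / S[1, 1]

# Slopes from y ~ x + m: solve the normal equations S %*% beta = s_y
slopes_x_m <- setNames(as.vector(solve(S, s_y)), c("x", "m"))

print(slope_x_only)
print(round(slopes_x_m, 3))
```

Under these assumptions the slope from y~x is exactly 1, while adjusting for $m$ pulls the coefficient on $x$ down to roughly 0.03 and puts roughly 0.98 on $m$, in line with the estimates from `reg_1` and `reg_3`.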