# Omitted Variable Bias

### Simulate dataset with positively correlated predictors

$$y = \beta_0 + \beta_1 X_1 + \beta_2X_2+ \epsilon$$

where $X_1$ and $X_2$ are correlated and the true coefficients are $\beta_0 = \beta_1 = \beta_2 = 1$.

In [1]:
set.seed(1)
n <- 100 #number of observations per sample

b0 <- 1
b1 <- 1
b2 <- 1
eps <- 3*rnorm(n) # noise term: n draws from a normal distribution with sd 3

x1 <- rnorm(n)
x2 <- .5*x1 + .5*rnorm(n) # a second predictor that is positively correlated with x1

y <- b0 + b1*x1 + b2*x2+ eps
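
As a quick sanity check on the setup, the sketch below regenerates the two predictors with the same seed and draw order and compares their sample correlation with the value implied by the construction. The names `r_sample` and `r_theory` are introduced here for illustration only.

```r
# Regenerate the predictors as in the cell above (same seed, same draw order).
set.seed(1)
n <- 100
eps <- 3 * rnorm(n)              # drawn first so that x1 and x2 match the cell above
x1 <- rnorm(n)
x2 <- 0.5 * x1 + 0.5 * rnorm(n)

# Sample correlation between the two predictors
r_sample <- cor(x1, x2)

# Implied population correlation:
# Cov(X1, X2) = 0.5, Var(X1) = 1, Var(X2) = 0.25 + 0.25 = 0.5
r_theory <- 0.5 / sqrt(1 * 0.5)

print(c(sample = r_sample, theory = r_theory))
```

The sample value should be positive and in the neighbourhood of the theoretical one, which is about 0.71.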

### Run regression on both predictors

In [2]:
lm.fit <- lm(y~x1+x2)
summary(lm.fit)


Call:
lm(formula = y ~ x1 + x2)

Residuals:
   Min     1Q Median     3Q    Max 
-6.904 -1.810  0.053  1.755  6.834 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.3252     0.2725   4.863 4.45e-06 ***
x1            0.9519     0.3800   2.505   0.0139 *  
x2            1.0958     0.5296   2.069   0.0412 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.722 on 97 degrees of freedom
Multiple R-squared:  0.241,	Adjusted R-squared:  0.2253 
F-statistic:  15.4 on 2 and 97 DF,  p-value: 1.559e-06


Notice that both slope estimates (0.95 for $X_1$ and 1.10 for $X_2$) are close to their true value of 1.

### Run regression, but omit $X_2$

In [3]:
lm.fit <- lm(y~x1)
summary(lm.fit)


Call:
lm(formula = y ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6888 -1.7516  0.1759  1.5202  7.3525 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.3404     0.2769   4.840 4.83e-06 ***
x1            1.4705     0.2903   5.065 1.92e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.767 on 98 degrees of freedom
Multiple R-squared:  0.2075,	Adjusted R-squared:  0.1994 
F-statistic: 25.65 on 1 and 98 DF,  p-value: 1.92e-06


Now the estimate of $\beta_1$ is inflated (1.47, compared with its true value of 1). With $X_2$ omitted, the coefficient on $X_1$ is also "capturing" some of the effect of $X_2$.

This is because $X_1$ and $X_2$ are correlated.
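
The size of the distortion can be worked out from the standard omitted variable bias formula. As a sketch, write $\hat\beta_1^{\text{short}}$ for the coefficient on $X_1$ when $X_2$ is left out, and $\delta_1$ for the population slope from regressing $X_2$ on $X_1$ (both symbols are introduced here and do not appear elsewhere in this notebook):

$$E[\hat\beta_1^{\text{short}}] = \beta_1 + \beta_2\,\delta_1, \qquad \delta_1 = \frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)}$$

In this simulation $X_2 = 0.5X_1 + 0.5Z$ with $Z$ independent of $X_1$ and $\mathrm{Var}(X_1) = 1$, so $\delta_1 = 0.5$ and

$$E[\hat\beta_1^{\text{short}}] = 1 + 1 \times 0.5 = 1.5$$

which is in line with the estimate of 1.47 reported above.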

### Simulate dataset with negatively correlated predictors

In [4]:
set.seed(1)
n <- 100 #number of observations per sample

b0 <- 1
b1 <- 1
b2 <- 1
eps <- 3*rnorm(n) # noise term: n draws from a normal distribution with sd 3

x1 <- rnorm(n)
x2 <- -.5*x1 + .5*rnorm(n) # generate a second predictor that is negatively correlated with x1

y <- b0 + b1*x1 + b2*x2+ eps

lm.fit <- lm(y~x1)
summary(lm.fit)


Call:
lm(formula = y ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6888 -1.7516  0.1759  1.5202  7.3525 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.3404     0.2769    4.84 4.83e-06 ***
x1            0.4705     0.2903    1.62    0.108    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.767 on 98 degrees of freedom
Multiple R-squared:  0.0261,	Adjusted R-squared:  0.01616 
F-statistic: 2.626 on 1 and 98 DF,  p-value: 0.1083


Now the estimate of $\beta_1$ is biased downward (0.47, compared with its true value of 1).

This is because it is also "capturing" some of the effect of $X_2$ (which is negatively correlated with $X_1$).
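
The "capturing" can be made concrete within a single sample. For ordinary least squares, the slope from the regression that omits $X_2$ equals the full-model coefficient on $X_1$ plus the full-model coefficient on $X_2$ times the slope from regressing $X_2$ on $X_1$. The sketch below checks this on the negatively correlated data; the names `long`, `short`, `aux` and `implied` are illustrative and not part of the original notebook.

```r
# Regenerate the negatively correlated dataset (same seed and draw order as above).
set.seed(1)
n <- 100
eps <- 3 * rnorm(n)
x1 <- rnorm(n)
x2 <- -0.5 * x1 + 0.5 * rnorm(n)
y <- 1 + 1 * x1 + 1 * x2 + eps

long  <- coef(lm(y ~ x1 + x2))   # full model with both predictors
short <- coef(lm(y ~ x1))        # model with x2 omitted
aux   <- coef(lm(x2 ~ x1))       # auxiliary regression of the omitted predictor on x1

# Slope implied by the decomposition: b1 + b2 * (slope of x2 on x1)
implied <- long["x1"] + long["x2"] * aux["x1"]

print(c(short = unname(short["x1"]), implied = unname(implied)))
```

The two printed numbers should agree up to floating-point error. Because the auxiliary slope is negative here, the omitted effect of $X_2$ is subtracted from the coefficient on $X_1$ rather than added to it.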

In fact, it is so small that it is no longer significantly different from 0 ($p = 0.108$). This means that we might erroneously conclude that $X_1$ does not predict $Y$!
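
A single dataset could be unlucky, so it may help to repeat the experiment many times. The sketch below is a small Monte Carlo under the same data-generating process as the negatively correlated case; the number of replications `S`, the helper `sim_once`, and the 5% significance threshold are assumptions made for this illustration.

```r
set.seed(1)
S <- 2000   # number of simulated datasets (arbitrary choice for this sketch)
n <- 100    # observations per dataset, as in the cells above

# Simulate one dataset, fit the model that omits x2, and return the
# estimated slope on x1 together with its p-value.
sim_once <- function(slope_x2_on_x1) {
  eps <- 3 * rnorm(n)
  x1 <- rnorm(n)
  x2 <- slope_x2_on_x1 * x1 + 0.5 * rnorm(n)
  y <- 1 + 1 * x1 + 1 * x2 + eps
  tab <- summary(lm(y ~ x1))$coefficients
  c(estimate = tab["x1", "Estimate"], p_value = tab["x1", "Pr(>|t|)"])
}

res <- replicate(S, sim_once(-0.5))   # 2 x S matrix with rows "estimate" and "p_value"

mean_est <- mean(res["estimate", ])              # average estimate across datasets
share_not_sig <- mean(res["p_value", ] > 0.05)   # share of datasets where x1 looks non-significant

print(c(mean_estimate = mean_est, share_not_significant = share_not_sig))
```

The average estimate should settle near 0.5 rather than the true value of 1, which indicates that the shortfall is systematic and not a quirk of one seed. The second number reports how often a test at the 5% level would fail to detect the effect of $X_1$ under these assumptions.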