# The Effect of Adding Regressors (Bias vs. Variance)

## Independent Predictors, But One Not in True Model

In [2]:
set.seed(2)

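e <- 5*rnorm(200,1) #noise with mean 5 and sd 5, so the fitted intercept will be near 5
x1 <- rnorm(200,1)
x2 <- rnorm(200,1) #independent of x1 and not used to build y

y= 1*x1 + e #true model: only x1 enters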

In [3]:
lm.fit= lm(y~x1) #estimate the true model
summary(lm.fit)


Call:
lm(formula = y ~ x1)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.2442  -4.1644  -0.1621   3.9046  10.6096 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.0502     0.6091   8.291 1.69e-14 ***
x1            0.9561     0.3870   2.470   0.0143 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.384 on 198 degrees of freedom
Multiple R-squared:  0.0299,	Adjusted R-squared:  0.025 
F-statistic: 6.102 on 1 and 198 DF,  p-value: 0.01435


In [4]:
lm.fit= lm(y~x1+x2) #estimate the model with the additional (useless) predictor
summary(lm.fit)


Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.4881  -4.0121  -0.3348   4.0608  10.3054 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.7611     0.7146   6.663 2.62e-10 ***
x1            0.9408     0.3879   2.425   0.0162 *  
x2            0.2926     0.3770   0.776   0.4386    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.389 on 197 degrees of freedom
Multiple R-squared:  0.03285,	Adjusted R-squared:  0.02304 
F-statistic: 3.346 on 2 and 197 DF,  p-value: 0.03723


**Standard errors on $\beta_1$ go up slightly (from 0.3870 to 0.3879). This is because there are now more parameters but the same amount of data.**

**But because the extra regressor is independent of $X_1$, the increase is negligible.**
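
A single sample cannot show a sampling distribution, so the claim is easier to check by repeating the experiment. The sketch below is a minimal Monte Carlo version of the cells above; `n_sims`, `sim_once` and `b1` are illustrative names that are not part of the original notebook, and 500 replications is an arbitrary choice.

```r
set.seed(1)
n <- 200
n_sims <- 500   # assumed number of replications

# One draw from the data-generating process above, returning the x1 slope
# from the short (y ~ x1) and long (y ~ x1 + x2) regressions.
sim_once <- function() {
  e  <- 5 * rnorm(n, 1)
  x1 <- rnorm(n, 1)
  x2 <- rnorm(n, 1)          # independent of x1 and not in the true model
  y  <- 1 * x1 + e
  c(short = unname(coef(lm(y ~ x1))["x1"]),
    long  = unname(coef(lm(y ~ x1 + x2))["x1"]))
}

b1 <- replicate(n_sims, sim_once())   # 2 x n_sims matrix, rows "short" and "long"

rowMeans(b1)        # average estimate of the x1 slope in each fit
apply(b1, 1, sd)    # sampling standard deviation of the x1 slope in each fit
```

Both averages should sit close to the true value of 1, and the two sampling standard deviations should be nearly identical, which is what "not a big deal" means here.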

## Correlated Predictors, But One Not in True Model (Multicollinearity)

In [5]:
set.seed(2)

e <- 5*rnorm(200,1)
x1 <- rnorm(200,1)
x2 <- sqrt(.5)*x1 + sqrt(.5)*rnorm(200,1) #this just creates correlation between x1 and x2

y= x1 + e 

In [6]:
lm.fit= lm(y~x1) #estimate the true model
summary(lm.fit)


Call:
lm(formula = y ~ x1)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.2442  -4.1644  -0.1621   3.9046  10.6096 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.0502     0.6091   8.291 1.69e-14 ***
x1            0.9561     0.3870   2.470   0.0143 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.384 on 198 degrees of freedom
Multiple R-squared:  0.0299,	Adjusted R-squared:  0.025 
F-statistic: 6.102 on 1 and 198 DF,  p-value: 0.01435


**This is the same model, with the same results, as in the previous example, because $x_2$ does not enter this fit.**

In [7]:
lm.fit= lm(y~x1+x2) #estimate the model with the (useless and correlated) additional predictor
summary(lm.fit)


Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.4881  -4.0121  -0.3348   4.0608  10.3054 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.7611     0.7146   6.663 2.62e-10 ***
x1            0.6482     0.5545   1.169    0.244    
x2            0.4138     0.5332   0.776    0.439    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.389 on 197 degrees of freedom
Multiple R-squared:  0.03285,	Adjusted R-squared:  0.02304 
F-statistic: 3.346 on 2 and 197 DF,  p-value: 0.03723


**Adding a correlated regressor that is not in the true model increases the variance of the estimates and adds essentially no explanatory power. RSS decreases a little ($R^2$ rises from 0.0299 to 0.03285), but the standard error on $\beta_1$ rises from 0.3870 to 0.5545 and the estimate is no longer significant.**

This is **multicollinearity**
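
The size of the increase can be tied to the correlation between the two predictors. The sketch below regenerates the data from the cells above and compares the ratio of the two standard errors on `x1` with the square root of the variance inflation factor, $1/(1-r^2)$. The names `r`, `vif`, `se_short` and `se_long` are illustrative and not part of the original notebook.

```r
set.seed(2)
e  <- 5 * rnorm(200, 1)
x1 <- rnorm(200, 1)
x2 <- sqrt(.5) * x1 + sqrt(.5) * rnorm(200, 1)
y  <- x1 + e

r   <- cor(x1, x2)        # sample correlation; the population value is sqrt(0.5)
vif <- 1 / (1 - r^2)      # variance inflation factor for x1 with one other regressor

# Standard error on x1 without and with the correlated regressor
se_short <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_long  <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]

c(r = r, vif = vif, sqrt_vif = sqrt(vif), se_ratio = se_long / se_short)
```

By construction the population correlation is $\sqrt{0.5} \approx 0.71$, so the variance inflation factor is about 2 and the standard error should grow by a factor of about 1.4, in line with the move from 0.3870 to 0.5545 in the output above.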

## Independent Predictors, Both in True Model

In [8]:
set.seed(2)

e <- 5*rnorm(200,1)
x1 <- rnorm(200,1)
x2 <- rnorm(200,1)

y= x1 + 2*x2 + e #both predictors are in the true model

In [9]:
lm.fit= lm(y~x1) #estimate the model which is missing an independent predictor
summary(lm.fit)


Call:
lm(formula = y ~ x1)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.2962  -4.6138   0.0115   3.8305  15.7034 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.0270     0.6628  10.602   <2e-16 ***
x1            1.0607     0.4212   2.519   0.0126 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.858 on 198 degrees of freedom
Multiple R-squared:  0.03104,	Adjusted R-squared:  0.02615 
F-statistic: 6.344 on 1 and 198 DF,  p-value: 0.01257


**This is still an unbiased estimate of $\beta_1$ because the missing predictor is independent of $X_1$.**
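
A short derivation shows where the omitted term goes. It uses only the simulation's own settings: $x_2$ is independent of $x_1$ with mean 1, and $e$ has mean 5.

$$
E[y \mid x_1] = x_1 + 2\,E[x_2 \mid x_1] + E[e] = x_1 + 2 \cdot 1 + 5 = 7 + x_1
$$

The slope on $x_1$ is still 1. The omitted term moves the intercept from 5 to 7 (the fit gives 7.0270), and its variation around the mean is absorbed into the error term.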

In [10]:
lm.fit= lm(y~x1+x2) #estimate the true model with both predictors
summary(lm.fit)


Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.4881  -4.0121  -0.3348   4.0608  10.3054 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.7611     0.7146   6.663 2.62e-10 ***
x1            0.9408     0.3879   2.425   0.0162 *  
x2            2.2926     0.3770   6.081 6.09e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.389 on 197 degrees of freedom
Multiple R-squared:  0.1842,	Adjusted R-squared:  0.1759 
F-statistic: 22.24 on 2 and 197 DF,  p-value: 1.959e-09


**RSS has decreased substantially ($R^2$ rises from 0.031 to 0.184), so the parameters are estimated more precisely. The standard error on $\beta_1$ goes down relative to the one-predictor case, from 0.4212 to 0.3879.**
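
The standard textbook expression for the standard error of an OLS slope makes the mechanism explicit. The notation here is introduced for illustration and does not appear in the original notebook: $\hat\sigma$ is the residual standard error, $\mathrm{SST}_1 = \sum_i (x_{i1} - \bar{x}_1)^2$, and $R_1^2$ is the $R^2$ from regressing $x_1$ on the other regressors.

$$
\widehat{\mathrm{se}}(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{\mathrm{SST}_1\,(1 - R_1^2)}}
$$

With independent predictors $R_1^2 \approx 0$, so the standard error moves almost one for one with $\hat\sigma$. In the two fits above, $\hat\sigma$ falls by a factor of $5.389 / 5.858 \approx 0.920$ and the standard error on $x_1$ falls by $0.3879 / 0.4212 \approx 0.921$.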

## Correlated Predictors, Both in True Model

In [11]:
set.seed(2)

e <- 5*rnorm(200,1)
x1 <- rnorm(200,1)
x2 <- sqrt(.5)*x1 + sqrt(.5)*rnorm(200,1)

y= x1 + 2*x2 + e #both predictors are in the true model

In [12]:
lm.fit= lm(y~x1) #estimate the model which is missing a predictor that is correlated with BOTH the existing predictor and the response
summary(lm.fit)


Call:
lm(formula = y ~ x1)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.4895  -4.4094   0.1754   3.5581  14.1381 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.4480     0.6390   10.09  < 2e-16 ***
x1            2.4443     0.4061    6.02 8.34e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.648 on 198 degrees of freedom
Multiple R-squared:  0.1547,	Adjusted R-squared:  0.1504 
F-statistic: 36.24 on 1 and 198 DF,  p-value: 8.341e-09


**This estimate of $\beta_1$ is biased upward: the fit gives 2.4443 against a true value of 1. This is because we are missing a predictor that is 1) correlated with $X_1$ and 2) correlated with the response.**

This is an example of **omitted variable bias**.
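
The size of the bias follows from the standard omitted-variable formula. Applied to this simulation, where $\beta_1 = 1$, $\beta_2 = 2$, $\mathrm{Var}(x_1) = 1$ and $\mathrm{Cov}(x_1, x_2) = \sqrt{0.5}$ by construction:

$$
\operatorname{plim}\,\hat\beta_1 = \beta_1 + \beta_2\,\frac{\mathrm{Cov}(x_1, x_2)}{\mathrm{Var}(x_1)} = 1 + 2 \cdot \frac{\sqrt{0.5}}{1} \approx 2.41
$$

The fitted value of 2.4443 is within one standard error (0.4061) of this limit, so the estimate is centered on roughly 2.41 rather than on 1.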

In [13]:
lm.fit= lm(y~x1+x2) #estimate the model with both correlated predictors
summary(lm.fit)


Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.4881  -4.0121  -0.3348   4.0608  10.3054 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.7611     0.7146   6.663 2.62e-10 ***
x1            0.6482     0.5545   1.169    0.244    
x2            2.4138     0.5332   4.527 1.03e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.389 on 197 degrees of freedom
Multiple R-squared:  0.2344,	Adjusted R-squared:  0.2266 
F-statistic: 30.15 on 2 and 197 DF,  p-value: 3.772e-12


**The estimate of $\beta_1$ is now unbiased because we have included the omitted variable. However, it has high variance because of the multicollinearity problem (the standard error rises from 0.4061 to 0.5545).**

**So we have reduced bias, but increased variance.**
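
The trade-off can be quantified by repeating the experiment and summarizing both estimators of $\beta_1$. This is a minimal sketch under the same data-generating process as the cells above; `n_sims`, `sim_once`, `b1` and `tradeoff` are illustrative names that are not part of the original notebook, and 500 replications is an arbitrary choice.

```r
set.seed(1)
n <- 200
n_sims <- 500   # assumed number of replications

# One draw with correlated predictors, both in the true model.
# Returns the x1 slope from the short and long regressions.
sim_once <- function() {
  e  <- 5 * rnorm(n, 1)
  x1 <- rnorm(n, 1)
  x2 <- sqrt(.5) * x1 + sqrt(.5) * rnorm(n, 1)
  y  <- x1 + 2 * x2 + e
  c(short = unname(coef(lm(y ~ x1))["x1"]),
    long  = unname(coef(lm(y ~ x1 + x2))["x1"]))
}

b1 <- replicate(n_sims, sim_once())   # 2 x n_sims matrix

# Bias, sampling sd, and root mean squared error around the true value 1
tradeoff <- rbind(bias = rowMeans(b1) - 1,
                  sd   = apply(b1, 1, sd),
                  rmse = sqrt(rowMeans((b1 - 1)^2)))
round(tradeoff, 3)
```

The short regression should show a large bias with a smaller standard deviation, and the long regression a bias near zero with a larger standard deviation. With these particular coefficients the bias is large enough that the long regression should still have the lower root mean squared error; a weaker $\beta_2$ or a stronger correlation could reverse that ranking.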

## With Logistic Model

To make sure you understand the issues above, try repeating the exercise (adding and removing independent or correlated regressors) in the logistic model. Below is an example for independent predictors, with both in the true model.

In [8]:
set.seed(1)
n=1000;

x1 <- rnorm(n,1)
x2 <- rnorm(n,1)

Xb=0 + 1*x1 + 2*x2;


p=1/(1+exp(-Xb))
y=rbinom(n,1,p)

df = data.frame(y=y,x1=x1,x2=x2)

glm.fit= glm(y~x1,family=binomial)
summary(glm.fit)


Call:
glm(formula = y ~ x1, family = binomial)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4861   0.3476   0.5073   0.6522   1.3473  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.98926    0.09949   9.943  < 2e-16 ***
x1           0.71152    0.08936   7.963 1.68e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 936.69  on 999  degrees of freedom
Residual deviance: 865.64  on 998  degrees of freedom
AIC: 869.64

Number of Fisher Scoring iterations: 5


In [9]:
glm.fit= glm(y~x1 + x2,family=binomial)
summary(glm.fit)


Call:
glm(formula = y ~ x1 + x2, family = binomial)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.98837   0.07559   0.23114   0.47188   1.98579  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.2884     0.1471  -1.960     0.05 *  
x1            1.0735     0.1207   8.893   <2e-16 ***
x2            1.8793     0.1488  12.627   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 936.69  on 999  degrees of freedom
Residual deviance: 583.59  on 997  degrees of freedom
AIC: 589.59

Number of Fisher Scoring iterations: 6
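
One way to continue the exercise is the correlated case with both predictors in the true model. The sketch below reuses the construction of `x2` from the linear examples; `glm_short` and `glm_long` are illustrative names that are not part of the original notebook.

```r
set.seed(1)
n <- 1000

x1 <- rnorm(n, 1)
x2 <- sqrt(.5) * x1 + sqrt(.5) * rnorm(n, 1)   # correlated with x1, as in the linear examples

Xb <- 0 + 1 * x1 + 2 * x2                      # both predictors are in the true model
p  <- 1 / (1 + exp(-Xb))
y  <- rbinom(n, 1, p)

glm_short <- glm(y ~ x1, family = binomial)        # omits the correlated predictor
glm_long  <- glm(y ~ x1 + x2, family = binomial)   # true model

# Compare the x1 coefficient and its standard error across the two fits
rbind(short = summary(glm_short)$coefficients["x1", c("Estimate", "Std. Error")],
      long  = summary(glm_long)$coefficients["x1", c("Estimate", "Std. Error")])
```

As in the linear case, the short fit should overstate the coefficient on `x1`, while the long fit should land near the true value of 1. One difference from the linear model is already visible in the independent example above: the short fit gives 0.71152 for `x1` against a true value of 1, because logistic coefficients shrink toward zero when a relevant predictor is omitted, even one that is independent of the included predictor.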
