# Lecture 4a

## Correlation and Omitted Variables (Sales Dataset)

This example demonstrates how correlation among predictor variables can change coefficient estimates.

$$sales=\beta_0 + \beta_1 TV +\beta_2 radio + \beta_3 newspaper + \epsilon$$

In [2]:
Advertising <- read.csv("~/rotman/RSM8512/Data/Advertising.csv") #load data
attach(Advertising) #attach variable names

### Each variable, by itself, is correlated with sales

In [3]:
lm.fit=lm(sales~TV) # estimate a linear model regressing sales on TV
summary(lm.fit) # display results


Call:
lm(formula = sales ~ TV)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3860 -1.9545 -0.1913  2.0671  7.2124 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,	Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16


In [4]:
lm.fit=lm(sales~radio) # estimate a linear model regressing sales on radio
summary(lm.fit) # display results


Call:
lm(formula = sales ~ radio)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.7305  -2.1324   0.7707   2.7775   8.1810 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.31164    0.56290  16.542   <2e-16 ***
radio        0.20250    0.02041   9.921   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.275 on 198 degrees of freedom
Multiple R-squared:  0.332,	Adjusted R-squared:  0.3287 
F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16


In [5]:
lm.fit=lm(sales~newspaper) # estimate a linear model regressing sales on newspaper
summary(lm.fit) # display results


Call:
lm(formula = sales ~ newspaper)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2272  -3.3873  -0.8392   3.5059  12.7751 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
newspaper    0.05469    0.01658    3.30  0.00115 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.092 on 198 degrees of freedom
Multiple R-squared:  0.05212,	Adjusted R-squared:  0.04733 
F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148
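
In a regression with a single predictor, R-squared equals the squared sample correlation between the predictor and the response. As a quick consistency check, the sketch below squares the correlations with sales that `cor(Advertising)` reports later in this section and compares them with the `Multiple R-squared` values printed above. The numbers are hard-coded so the snippet runs without the data file; the variable names `r`, `r2_reported`, `r2_from_cor`, and `max_gap` are illustrative.

```r
# Correlations with sales, copied from the cor(Advertising) output in this section
r <- c(TV = 0.78222442, radio = 0.57622257, newspaper = 0.22829903)

# Multiple R-squared values copied from the three single-predictor summaries
r2_reported <- c(TV = 0.6119, radio = 0.332, newspaper = 0.05212)

# In simple linear regression, R-squared is the squared correlation
r2_from_cor <- r^2
print(round(cbind(r2_from_cor, r2_reported), 4))

# The two columns should agree up to the rounding in the printed summaries
max_gap <- max(abs(r2_from_cor - r2_reported))
```

This identity holds only for one-predictor models; with several predictors, R-squared is no longer the square of any single pairwise correlation.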


### When we include all three variables, however, newspaper is no longer positive and significant

In [6]:
lm.fit=lm(sales~TV+radio+newspaper) # estimate a linear model regressing sales on TV, radio, and newspaper
summary(lm.fit) # display results


Call:
lm(formula = sales ~ TV + radio + newspaper)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8277 -0.8908  0.2418  1.1893  2.8292 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972,	Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16


### We can verify that newspaper adds little explanatory power by dropping it from the model

Notice that the coefficients on TV and radio change little, and the multiple R-squared stays at 0.8972.

In [7]:
lm.fit=lm(sales~TV+radio) # estimate a linear model regressing sales on TV and radio
summary(lm.fit) # display results


Call:
lm(formula = sales ~ TV + radio)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.7977 -0.8752  0.2422  1.1708  2.8328 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.92110    0.29449   9.919   <2e-16 ***
TV           0.04575    0.00139  32.909   <2e-16 ***
radio        0.18799    0.00804  23.382   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.681 on 197 degrees of freedom
Multiple R-squared:  0.8972,	Adjusted R-squared:  0.8962 
F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16
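
Comparing two fits by eye can be formalized with a partial F-test on the nested models via `anova()`. The sketch below uses simulated data, not the Advertising file: the variables `x1`, `x2`, `x3` and their coefficients are made up, and `x3` is built to have no effect of its own, so the block only illustrates the mechanics. When a single predictor is dropped, the F statistic equals the square of that predictor's t value in the larger model.

```r
set.seed(42)                      # simulated data, not the Advertising file
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.4 * x2 + rnorm(n)         # correlated with x2, no effect of its own on y
y  <- 3 + 0.5 * x1 + 2 * x2 + rnorm(n)

fit_full    <- lm(y ~ x1 + x2 + x3)
fit_reduced <- lm(y ~ x1 + x2)

# Partial F-test: does adding x3 reduce the residual sum of squares significantly?
cmp <- anova(fit_reduced, fit_full)
print(cmp)

# With one dropped predictor, F equals the squared t value of x3 in the full model
f_stat <- cmp$F[2]
t_x3   <- summary(fit_full)$coefficients["x3", "t value"]
```

Applied to the output above, newspaper's t value of -0.177 in the three-predictor model corresponds to an F statistic of about 0.031, far too small to justify keeping the variable.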


### However, if we dropped radio instead, then we might erroneously conclude that newspaper spending caused sales!

This is an example of **omitted variable bias**, where an omitted variable (radio) is:

1) correlated with the response variable (sales)

2) correlated with other predictors (newspaper)

In [8]:
lm.fit=lm(sales~TV+newspaper) # estimate a linear model regressing sales on TV and newspaper only
summary(lm.fit) # display results


Call:
lm(formula = sales ~ TV + newspaper)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6231 -1.7346 -0.0948  1.8926  8.4512 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.774948   0.525338  10.993  < 2e-16 ***
TV          0.046901   0.002581  18.173  < 2e-16 ***
newspaper   0.044219   0.010174   4.346 2.22e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.121 on 197 degrees of freedom
Multiple R-squared:  0.6458,	Adjusted R-squared:  0.6422 
F-statistic: 179.6 on 2 and 197 DF,  p-value: < 2.2e-16
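
The size of the bias can be read off the fits above. For least squares, each coefficient in the short model (radio omitted) equals the corresponding full-model coefficient plus the full-model radio coefficient times $\hat\delta$, the slope on that predictor in an auxiliary regression of radio on TV and newspaper. That auxiliary regression is not run in this notebook, so the $\hat\delta$ values below are the ones implied by the printed coefficients (an assumption of this note, not separately estimated output). Tildes mark short-model estimates and hats mark full-model estimates.

$$
\begin{aligned}
\tilde\beta_{newspaper} &= \hat\beta_{newspaper} + \hat\beta_{radio}\,\hat\delta_{newspaper} \\
0.044219 &= -0.001037 + 0.188530\,\hat\delta_{newspaper}
\quad\Rightarrow\quad \hat\delta_{newspaper} \approx 0.240 \\
\tilde\beta_{TV} &= \hat\beta_{TV} + \hat\beta_{radio}\,\hat\delta_{TV} \\
0.046901 &= 0.045765 + 0.188530\,\hat\delta_{TV}
\quad\Rightarrow\quad \hat\delta_{TV} \approx 0.006
\end{aligned}
$$

Almost all of newspaper's apparent effect in the short model is radio's effect routed through the newspaper-radio association, while the TV coefficient barely moves because $\hat\delta_{TV}$ is close to zero.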


In [9]:
cor(Advertising)

| | X | TV | radio | newspaper | sales |
|---|---|---|---|---|---|
| X | 1.0 | 0.01771469 | -0.11068044 | -0.15494414 | -0.05161625 |
| TV | 0.01771469 | 1.0 | 0.05480866 | 0.05664787 | 0.78222442 |
| radio | -0.11068044 | 0.05480866 | 1.0 | 0.35410375 | 0.57622257 |
| newspaper | -0.15494414 | 0.05664787 | 0.35410375 | 1.0 | 0.22829903 |
| sales | -0.05161625 | 0.78222442 | 0.57622257 | 0.22829903 | 1.0 |


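Notice that newspaper and radio are positively correlated (about 0.35), while newspaper and TV are nearly uncorrelated (about 0.06). This is why omitting radio inflates the newspaper coefficient but leaves the TV coefficient almost unchanged. The column `X` is presumably the row index from the CSV file and can be ignored.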
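
To see how the correlation between an included and an omitted predictor drives the bias, here is a minimal simulation sketch. It does not use the Advertising file: the helper `simulate_fits`, the sample size, and the coefficients are hypothetical, and the variable names only mirror the dataset for readability. In the simulated data, newspaper has no true effect on sales, so any nonzero newspaper coefficient in the short model is bias from leaving out radio.

```r
set.seed(1)                        # simulated data, not the Advertising file
n <- 5000

# Fit long and short models for a given correlation rho between the two predictors
simulate_fits <- function(rho) {
  radio     <- rnorm(n)
  newspaper <- rho * radio + sqrt(1 - rho^2) * rnorm(n)  # cor(radio, newspaper) is about rho
  sales     <- 3 + 2 * radio + rnorm(n)                  # newspaper has no true effect
  c(long  = unname(coef(lm(sales ~ radio + newspaper))["newspaper"]),
    short = unname(coef(lm(sales ~ newspaper))["newspaper"]))
}

uncorrelated <- simulate_fits(0)     # omitting radio should do little harm here
correlated   <- simulate_fits(0.35)  # roughly the newspaper-radio correlation above

# Rows: scenario; columns: newspaper coefficient with radio included (long) or omitted (short)
print(rbind(uncorrelated, correlated))
```

With uncorrelated predictors both newspaper coefficients should sit near zero; with a correlation of 0.35 only the long model should, while the short model attributes part of radio's effect to newspaper.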

## Generating Predicted Values

In [7]:
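lm.fit=lm(sales~TV+radio) # refit sales on TV and radio; lm.fit was last overwritten by the TV + newspaper model
predict(lm.fit,data.frame(TV=100,radio=20),interval="prediction") # 95% prediction interval at TV=$100,000 and radio=$20,000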

| | fit | lwr | upr |
|---|---|---|---|
| 1 | 11.25647 | 7.929616 | 14.58332 |


In [6]:
predict(lm.fit,data.frame(TV=100,radio=20),interval="confidence") # 95% confidence interval for mean sales at the same point

| | fit | lwr | upr |
|---|---|---|---|
| 1 | 11.25647 | 10.98525 | 11.52768 |
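
The two intervals share the same center but answer different questions: the confidence interval covers the mean response at the given predictor values, while the prediction interval covers a single new observation and therefore adds the residual variance. The sketch below rebuilds both by hand from `predict(..., se.fit = TRUE)` and compares them with the built-in intervals. It uses simulated data with made-up names (`x1`, `x2`, `new_pt`) and coefficients, not the Advertising file.

```r
set.seed(7)                         # simulated data, not the Advertising file
n  <- 200
x1 <- runif(n, 0, 300)
x2 <- runif(n, 0, 50)
y  <- 3 + 0.05 * x1 + 0.2 * x2 + rnorm(n, sd = 1.7)
fit <- lm(y ~ x1 + x2)

new_pt <- data.frame(x1 = 100, x2 = 20)
p      <- predict(fit, new_pt, se.fit = TRUE)   # list: fit, se.fit, df, residual.scale
t_crit <- qt(0.975, df = p$df)                  # 95% two-sided critical value

# Confidence interval: uncertainty in the estimated mean response only
ci_half <- t_crit * p$se.fit
# Prediction interval: adds the residual variance of a single new observation
pi_half <- t_crit * sqrt(p$se.fit^2 + p$residual.scale^2)

manual  <- c(p$fit - ci_half, p$fit + ci_half, p$fit - pi_half, p$fit + pi_half)
builtin <- c(predict(fit, new_pt, interval = "confidence")[, c("lwr", "upr")],
             predict(fit, new_pt, interval = "prediction")[, c("lwr", "upr")])
```

In the Advertising output above, the confidence interval has a half-width of about 0.27 and the prediction interval about 3.33; the gap comes almost entirely from the residual standard error of 1.681.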


In [11]:
predict(lm.fit,interval="confidence") #fitted values at each datapoint

| | fit | lwr | upr |
|---|---|---|---|
| 1 | 20.555465 | 20.162781 | 20.948148 |
| 2 | 12.345362 | 11.890932 | 12.799792 |
| 3 | 12.337018 | 11.767337 | 12.906699 |
| 4 | 17.617116 | 17.247635 | 17.986597 |
| 5 | 13.223908 | 12.900491 | 13.547326 |
| 6 | 12.512084 | 11.894853 | 13.129316 |
| 7 | 11.718212 | 11.341145 | 12.095280 |
| 8 | 12.105516 | 11.853931 | 12.357100 |
| 9 | 3.709379 | 3.163756 | 4.255002 |
| 10 | 12.551697 | 12.117602 | 12.985792 |
