# Lecture 8: Validation Set and Cross-Validation

We will use the "Auto" dataset from the `ISLR` package.

## Validation Set

In [1]:
library(ISLR)
attach(Auto)

In [148]:
lm.fit=lm(mpg~horsepower, data=Auto)
summary(lm.fit)


Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,	Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16


### Split the sample and estimate on training data

In [160]:
set.seed(1)
train=sample(392,196)

lm.fit=lm(mpg~horsepower, data=Auto,subset=train) #estimate on training dataset only
summary(lm.fit)


Call:
lm(formula = mpg ~ horsepower, data = Auto, subset = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.698  -3.085  -0.216   2.680  16.770 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.340377   1.002269   40.25   <2e-16 ***
horsepower  -0.161701   0.008809  -18.36   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.692 on 194 degrees of freedom
Multiple R-squared:  0.6346,	Adjusted R-squared:  0.6327 
F-statistic: 336.9 on 1 and 194 DF,  p-value: < 2.2e-16


### Calculate $MSE_{test}$

In [161]:
mpg.pred <-predict(lm.fit,Auto)
test.err <-(mpg-mpg.pred)[-train]

In [162]:
mean(test.err^2) #MSE_test
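
For reference, the quantity computed above is $MSE_{test}=\frac{1}{n_{test}}\sum_{i\in test}(y_i-\hat{y}_i)^2$, the average squared prediction error over the held-out observations. The sketch below repeats the same steps without relying on `attach()`, predicting only for the held-out rows. It uses the built-in `mtcars` data as a stand-in for `Auto` (an assumption, made only so the example runs with base R alone), with `hp` playing the role of `horsepower`; the names `test` and `mse.test` are illustrative.

```r
# Validation-set MSE on a stand-in dataset (mtcars ships with R).
set.seed(1)
n <- nrow(mtcars)
train <- sample(n, n / 2)            # row indices of the training half

# Fit on the training rows only.
fit <- lm(mpg ~ hp, data = mtcars, subset = train)

# Predict for the held-out rows and average the squared errors.
test <- mtcars[-train, ]
pred <- predict(fit, newdata = test)
mse.test <- mean((test$mpg - pred)^2)
mse.test
```

Passing the held-out rows through `newdata` gives the same number as predicting for every row and then dropping the training indices, which is what the cell above does.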

You can also do this for models with higher-order polynomial terms.

In [163]:
lm.fit2=lm(mpg~poly(horsepower,2), data=Auto,subset=train) #estimate on training dataset only
mean((mpg-predict(lm.fit2,Auto))[-train]^2)
lm.fit3=lm(mpg~poly(horsepower,3), data=Auto,subset=train) #estimate on training dataset only
mean((mpg-predict(lm.fit3,Auto))[-train]^2)

If we split the sample a different way, we will get different estimates of $MSE_{test}$.

In [164]:
train=sample(392,196)

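lm.fit=lm(mpg~horsepower, data=Auto,subset=train) #linear fit on the new training set
mean((mpg-predict(lm.fit,Auto))[-train]^2)
lm.fit2=lm(mpg~poly(horsepower,2), data=Auto,subset=train) #quadratic fit
mean((mpg-predict(lm.fit2,Auto))[-train]^2)
lm.fit3=lm(mpg~poly(horsepower,3), data=Auto,subset=train) #cubic fit
mean((mpg-predict(lm.fit3,Auto))[-train]^2)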

But the story is consistent across splits: the quadratic model performs better than the linear model, and the cubic model offers little further improvement.
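
One way to see how much the validation-set estimate moves around is to repeat the random split several times and tabulate the test MSE for each polynomial degree. The following is a minimal sketch of that idea, not part of the original lab. It again uses the built-in `mtcars` data (with `hp` in place of `horsepower`) as an assumed stand-in for `Auto`, and `n.splits` is a hypothetical choice.

```r
# Repeat the 50/50 split several times and record test MSE by degree.
set.seed(1)
n <- nrow(mtcars)
n.splits <- 10                        # hypothetical number of repetitions

mse <- matrix(NA, nrow = n.splits, ncol = 3,
              dimnames = list(NULL, paste0("degree", 1:3)))

for (s in 1:n.splits) {
  train <- sample(n, n / 2)           # a fresh random training half
  for (d in 1:3) {
    fit <- lm(mpg ~ poly(hp, d), data = mtcars, subset = train)
    # Squared prediction error on the rows not used for fitting.
    mse[s, d] <- mean((mtcars$mpg - predict(fit, mtcars))[-train]^2)
  }
}

mse            # one row per split, one column per degree
colMeans(mse)  # average over splits
```

Each row of `mse` is one split, so reading down a column shows the split-to-split variability that a single validation set hides.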

## Leave-One-Out Cross Validation

Cross validation can be done using the glm() and cv.glm() functions.

These functions can be used for both linear regression and logistic regression.
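
As an illustration of the logistic case, the sketch below cross-validates a binary-response model with `cv.glm()`. The data and model are assumptions chosen for convenience: the built-in `mtcars` data, with transmission type `am` (0/1) regressed on weight `wt`. The `cost` function, adapted from the example in the `cv.glm()` help page, replaces the default squared-error loss with the misclassification rate at a 0.5 threshold.

```r
library(boot)  # provides cv.glm()

# Logistic regression: is the car a manual (am = 1), given its weight?
logit.fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Misclassification rate: y is the observed 0/1 response,
# p the predicted probability; an error occurs when |y - p| > 0.5.
cost <- function(y, p = 0) mean(abs(y - p) > 0.5)

set.seed(1)
# K = 8 divides the 32 rows of mtcars evenly.
cv.misclass <- cv.glm(mtcars, logit.fit, cost = cost, K = 8)$delta[1]
cv.misclass
```

Without the `cost` argument, `cv.glm()` would report the average squared difference between the 0/1 response and the predicted probability instead.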

In [167]:
#with no "family" argument, glm() defaults to the gaussian family, i.e. ordinary linear regression
glm.fit=glm(mpg~horsepower,data=Auto)
coef(glm.fit)

In [169]:
library(boot) #this library has the cv.glm() function

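Compute the LOOCV estimate $CV_{(n)}=\frac{1}{n}\sum_{i=1}^n\text{MSE}_i$, where $\text{MSE}_i=(y_i-\hat{y}_i)^2$ is the squared prediction error for observation $i$ from the model fit without it.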

In [172]:
cv.err=cv.glm(Auto,glm.fit)
cv.err$delta # two numbers: the raw LOOCV estimate and a bias-adjusted version
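
To make explicit what `cv.glm()` is doing, the sketch below computes LOOCV by hand: refit the model $n$ times, each time leaving out one row, and average the squared prediction errors. It also checks the result against the leverage shortcut $\frac{1}{n}\sum_i\left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2$, which applies to least-squares linear fits only. As an assumption made for self-containment, it uses the built-in `mtcars` data with `hp` in place of `horsepower`; the variable names are illustrative.

```r
library(boot)  # provides cv.glm()

n <- nrow(mtcars)

# Manual LOOCV: n refits, each with one row held out.
sq.err <- numeric(n)
for (i in 1:n) {
  fit.i  <- glm(mpg ~ hp, data = mtcars[-i, ])
  pred.i <- predict(fit.i, newdata = mtcars[i, ])
  sq.err[i] <- (mtcars$mpg[i] - pred.i)^2
}
cv.manual <- mean(sq.err)

# Leverage shortcut: a single least-squares fit, no refitting.
ols <- lm(mpg ~ hp, data = mtcars)
cv.shortcut <- mean((residuals(ols) / (1 - hatvalues(ols)))^2)

# The same quantity from cv.glm() (default K = n, squared-error cost).
cv.boot <- cv.glm(mtcars, glm(mpg ~ hp, data = mtcars))$delta[1]

c(manual = cv.manual, shortcut = cv.shortcut, cv.glm = cv.boot)
```

All three numbers should agree up to floating-point error. `cv.glm()` takes the refitting route, which is why LOOCV on large datasets can be slow.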

In [174]:
cv.error=rep(0,5) #initialize vector to store errors
for (i in 1:5){
    glm.fit=glm(mpg~poly(horsepower,i),data=Auto) #polynomial regression of degree i
    cv.error[i]=cv.glm(Auto,glm.fit)$delta[1] #store the LOOCV estimate
}
cv.error

Again, there is a sharp drop in the estimated test MSE moving from the linear to the quadratic fit, but no clear improvement from higher-degree polynomials.

## K-Fold Cross Validation

In [175]:
set.seed(17)
cv.error.10=rep(0,10)
for (i in 1:10){
    glm.fit=glm(mpg~poly(horsepower,i),data=Auto) #polynomial regression of degree i
    cv.error.10[i]=cv.glm(Auto,glm.fit,K=10)$delta[1] #set K=10 and store the 10-fold CV estimate
}
cv.error.10
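
The mechanics behind the `K` argument can be sketched by hand: assign each row at random to one of $K$ folds, hold out one fold at a time, fit on the rest, and average the $K$ held-out MSEs. The code below is a minimal illustration under stated assumptions, not a reimplementation of `cv.glm()`. It uses the built-in `mtcars` data (with `hp` in place of `horsepower`), a quadratic fit, and `K = 8` so that the 32 rows split evenly. Because the fold assignment differs from the one `cv.glm()` draws internally, the numbers will not match `cv.glm()` exactly.

```r
# Manual K-fold cross-validation for a quadratic fit.
set.seed(17)
K <- 8
n <- nrow(mtcars)

# Random fold label (1..K) for every row, with equal fold sizes.
folds <- sample(rep(1:K, length.out = n))

fold.mse <- numeric(K)
for (k in 1:K) {
  held.out <- which(folds == k)
  fit  <- glm(mpg ~ poly(hp, 2), data = mtcars[-held.out, ])
  pred <- predict(fit, newdata = mtcars[held.out, ])
  fold.mse[k] <- mean((mtcars$mpg[held.out] - pred)^2)
}

cv.kfold <- mean(fold.mse)  # average of the K held-out MSEs
cv.kfold
```

Only $K$ models are fit instead of $n$, which is the main practical advantage over LOOCV. The price is that the estimate now depends on the random fold assignment, hence the call to `set.seed()`.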