#### Introduction to Statistical Learning, Lab 5.1

# The Validation Set Approach

We will use the validation set approach to estimate the test error rates of various linear models fit to the `Auto` data set.

We will use the linear models and tools from the `sklearn` library in this lab.



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.linear_model as skl_lm
from islpy import datasets, utils, lmplots
sns.set()
%matplotlib inline

We first load the data set.

In [None]:
auto = datasets.Auto()
auto.info()

In [None]:
auto.shape


Next we randomly split the data set into training and test samples of equal size.

In [None]:
train = auto.sample(auto.shape[0] // 2, random_state = 1)
test = auto.drop(train.index)

x_train = train[['horsepower']]
y_train = train['mpg']
x_test = test[['horsepower']]
y_test = test['mpg']
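As a sanity check on the `sample()`/`drop()` pattern used above, the sketch below applies it to a small toy frame and confirms that the two halves are disjoint and together cover every row. The frame `df` and its values are hypothetical stand-ins, not the real `Auto` data.

```python
import pandas as pd

# Toy frame standing in for `auto` (hypothetical values, not the real data).
df = pd.DataFrame({"horsepower": range(100, 110), "mpg": range(30, 20, -1)})

# Same pattern as above: sample half the rows, then drop them to get the rest.
train = df.sample(df.shape[0] // 2, random_state=1)
test = df.drop(train.index)

# The two halves should share no rows and together cover the whole frame.
overlap = train.index.intersection(test.index)
print(len(train), len(test), len(overlap))
```

Because `drop()` removes rows by index label, this pattern relies on the index labels being unique.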

We now use `sklearn`'s `LinearRegression()` to fit a linear model with `mpg` as the response and `horsepower` as the predictor. The intercept is automatically added.

In [None]:
lm = skl_lm.LinearRegression().fit(x_train, y_train)

We can now call the model's `predict()` method on the test data and use `sklearn`'s `mean_squared_error()` to calculate the test MSE.

In [None]:
from sklearn.metrics import mean_squared_error

pred = lm.predict(x_test)
mse = mean_squared_error(y_test, pred)

print(mse)
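The MSE is simply the mean of the squared differences between observed and predicted values. A minimal sketch of the same calculation by hand follows; the arrays `y_true` and `y_pred` are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# MSE is the mean of the squared residuals.
mse_manual = np.mean((y_true - y_pred) ** 2)
print(mse_manual)  # 1.3125
```

Applied to `y_test` and `pred`, the same expression should agree with `mean_squared_error()` up to floating-point rounding.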

To create quadratic and cubic features we use `PolynomialFeatures()` from `sklearn`'s `preprocessing` module.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
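To see what the transformer produces, here is a small sketch on a made-up single-column input (the array `x` is hypothetical). With the default settings, a degree-2 expansion of one feature yields the columns 1, x and x².

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One feature, two observations (hypothetical values).
x = np.array([[2.0], [3.0]])

# Default include_bias=True adds a leading column of ones.
expanded = PolynomialFeatures(degree=2).fit_transform(x)
print(expanded)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```

The leading column of ones is redundant when the result is passed to `LinearRegression()`, which fits its own intercept by default. It does not change the predictions, and passing `include_bias=False` would drop it.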

Quadratic:

In [None]:
poly = PolynomialFeatures(degree=2)
# Fit the transformer on the training data only, then reuse it on the test data
x_train2 = poly.fit_transform(x_train)
x_test2 = poly.transform(x_test)
lm = skl_lm.LinearRegression().fit(x_train2, y_train)

print(mean_squared_error(y_test, lm.predict(x_test2)))

Cubic:

In [None]:
poly = PolynomialFeatures(degree=3)
# Fit the transformer on the training data only, then reuse it on the test data
x_train3 = poly.fit_transform(x_train)
x_test3 = poly.transform(x_test)
lm = skl_lm.LinearRegression().fit(x_train3, y_train)

print(mean_squared_error(y_test, lm.predict(x_test3)))

The exact numbers depend on the train/test split of the data set. We can show this by splitting again with a different random seed.

In [None]:
train = auto.sample(auto.shape[0] // 2, random_state = 3)
test = auto.drop(train.index)

x_train = train[['horsepower']]
y_train = train['mpg']
x_test = test[['horsepower']]
y_test = test['mpg']

Linear:

In [None]:
lm = skl_lm.LinearRegression().fit(x_train, y_train)

print(mean_squared_error(y_test, lm.predict(x_test)))

Quadratic:

In [None]:
poly = PolynomialFeatures(degree=2)
# Fit the transformer on the training data only, then reuse it on the test data
x_train2 = poly.fit_transform(x_train)
x_test2 = poly.transform(x_test)
lm = skl_lm.LinearRegression().fit(x_train2, y_train)

print(mean_squared_error(y_test, lm.predict(x_test2)))

Cubic:

In [None]:
poly = PolynomialFeatures(degree=3)
# Fit the transformer on the training data only, then reuse it on the test data
x_train3 = poly.fit_transform(x_train)
x_test3 = poly.transform(x_test)
lm = skl_lm.LinearRegression().fit(x_train3, y_train)

print(mean_squared_error(y_test, lm.predict(x_test3)))
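The split-fit-score steps repeated above can be wrapped in a helper and run over several seeds and degrees at once. The sketch below is one possible way to do this, not part of the original lab: the helper `validation_mse` and the synthetic frame `data` are hypothetical, and the data are simulated so that the block runs without the `Auto` data set. It also uses `make_pipeline()` to chain the feature expansion and the regression, which is a substitution for the manual `fit_transform()`/`transform()` calls.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for `auto` (hypothetical numbers, not the real data).
rng = np.random.default_rng(0)
hp = rng.uniform(50, 200, size=200)
mpg = 60 - 0.45 * hp + 0.0012 * hp**2 + rng.normal(0, 2, size=200)
data = pd.DataFrame({"horsepower": hp, "mpg": mpg})


def validation_mse(df, degree, seed):
    """Split df in half using the given seed and return the test MSE
    of a polynomial regression of mpg on horsepower."""
    train = df.sample(df.shape[0] // 2, random_state=seed)
    test = df.drop(train.index)
    # The pipeline fits the feature expansion on the training half only.
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(train[["horsepower"]], train["mpg"])
    return mean_squared_error(test["mpg"], model.predict(test[["horsepower"]]))


# Rows: polynomial degree; columns: random seed used for the split.
results = pd.DataFrame(
    {seed: [validation_mse(data, d, seed) for d in (1, 2, 3)] for seed in (1, 3, 5)},
    index=["linear", "quadratic", "cubic"],
)
print(results.round(2))
```

Reading across a row shows how much the validation estimate for a single model moves with the split alone. Passing `auto` instead of `data` should reproduce the comparison made in this lab, assuming the frame has `horsepower` and `mpg` columns.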