## Model Validation and Cross-Validation



In this lab, we explore some techniques for model evaluation. Some of the commands in this lab may take a while to run on your computer.
This file is drawn from the labs that accompany the book for the `ISLP` package:

<https://github.com/intro-stat-learning/ISLP_labs/blob/stable/Ch05-resample-lab.ipynb>

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

There are several new imports needed for this lab.

In [None]:
from functools import partial
from sklearn.model_selection import \
     (cross_validate,
      KFold,
      ShuffleSplit)
from sklearn.base import clone
from ISLP.models import sklearn_sm


## The Test Set Approach
We explore the use of the test or validation set approach in order to estimate
the test error rates that result from fitting various linear models on
the  `Auto`  data set.

We use the function `train_test_split()` to split
the data into training and validation sets. As there are 392 observations,
we split into two equal sets of size 196 using the
argument `test_size=196`. It is generally a good idea to set a random seed
when performing operations like this that contain an
element of randomness, so that the results obtained can be reproduced
precisely at a later time. We set the random seed of the splitter
with the argument `random_state=0`. 
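
To see what the seed buys us, here is a small sketch on a synthetic data frame (a made-up stand-in for `Auto`; the name `toy` and its values are assumptions for illustration only). Splitting twice with the same `random_state` gives the same validation rows, while a different seed gives a different split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-in for Auto: 392 rows of made-up numbers
rng = np.random.default_rng(1)
toy = pd.DataFrame({'x': rng.normal(size=392),
                    'y': rng.normal(size=392)})

# same seed twice, then a different seed
train_a, test_a = train_test_split(toy, test_size=196, random_state=0)
train_b, test_b = train_test_split(toy, test_size=196, random_state=0)
train_c, test_c = train_test_split(toy, test_size=196, random_state=3)

print(train_a.shape, test_a.shape)
print(test_a.index.equals(test_b.index))   # same seed: identical split
print(test_a.index.equals(test_c.index))   # different seed: compare the rows chosen
```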

In [None]:
Auto = load_data('Auto')
print(Auto.info())
Auto_train, Auto_test = train_test_split(Auto,
                                         test_size=196,
                                         random_state=0)


Now we can fit a linear regression using only the observations corresponding to the training set `Auto_train`.

In [None]:
hp_mm = MS(['horsepower'])
X_train = hp_mm.fit_transform(Auto_train)
y_train = Auto_train['mpg']
model = sm.OLS(y_train, X_train)
results = model.fit()


We now use the `predict()` method of `results` evaluated on the model matrix for this model
created using the test data set. We also calculate the test MSE of our model.

In [None]:
X_valid = hp_mm.transform(Auto_test)
y_valid = Auto_test['mpg']
valid_pred = results.predict(X_valid)
np.mean((y_valid - valid_pred)**2)


Hence our estimate for the test MSE of  the linear regression
fit is $23.62$.

We can also estimate the test error for
higher-degree polynomial regressions. We first provide a function `evalMSE()` that takes a list of model terms and the name of the response,
as well as a training and test set, and returns the MSE on the test set.

In [None]:
# define a function called evalMSE
def evalMSE(terms,
            response,
            train,
            test):
   # create the matrix needed, mm, based upon the terms in the model
   mm = MS(terms)
   # make training data
   X_train = mm.fit_transform(train)
   y_train = train[response]

   # make test data
   X_test = mm.transform(test)
   y_test = test[response]

   # fit the regression model 
   results = sm.OLS(y_train, X_train).fit()
   # get the predicted values from the model fit above on the test data
   test_pred = results.predict(X_test)
   # return the MSE on the test set
   return np.mean((y_test - test_pred)**2)


Let’s use this function to estimate the test MSE
using linear, quadratic, cubic and quartic fits. We use the `enumerate()`  function
here, which gives both the values and indices of objects as one iterates
over a _for loop_.
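
As a minimal illustration of `enumerate()`, the sketch below loops over the same range of degrees used in this lab and collects the (index, value) pairs. The name `pairs` is introduced here only for the example.

```python
# enumerate() pairs each item with its position, counting from 0
for idx, degree in enumerate(range(1, 5)):
    print(idx, degree)

# the same pairs collected into a list
pairs = list(enumerate(range(1, 5)))
```

The index is what lets us write each result into the matching slot of a preallocated array.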

In [None]:
# make an array of zeros of length 4, one entry per polynomial degree
MSE = np.zeros(4)
# loop over polynomial degrees 1 through 4
for idx, degree in enumerate(range(1, 5)):
    # fit a polynomial of this degree on the training set and store its validation MSE
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_test)
MSE


These error rates are $23.62$, $18.76$, $18.80$, and $18.78$, respectively. If we
choose a different training/validation split instead, then we
can expect somewhat different errors on the validation set.

In [None]:
Auto_train, Auto_test = train_test_split(Auto,
                                          test_size=196,
                                          random_state=3)
MSE = np.zeros(4)
for idx, degree in enumerate(range(1, 5)):
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_test)
MSE

Using this split of the observations into a training set and a validation set,
we find that the validation set error rates for the models with linear, quadratic, cubic, and quartic terms are $20.76$, $16.95$, $16.97$, and $16.90$, respectively.

These results suggest that there is not much advantage to using the cubic or quartic models over the
quadratic model.

## Cross-Validation
In theory, the cross-validation estimate can be computed for any generalized
linear model.
In practice, however, the simplest way to cross-validate in
Python is to use `sklearn`, which has a different interface or API
than `statsmodels`, the code we have been using to fit models.

This is a problem which often confronts data scientists: "I have a function to do task $A$, and need to feed it into something that performs task $B$, so that I can compute $B(A(D))$, where $D$ is my data." When $A$ and $B$ don’t naturally speak to each other, this
requires the use of a *wrapper*.
In the `ISLP` package,
we provide 
a wrapper, `sklearn_sm()`, that enables us to easily use the cross-validation tools of `sklearn` with
models fit by `statsmodels`.
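
To make the idea of a wrapper concrete, here is a minimal sketch. It is not the `sklearn_sm()` implementation; the class `LstsqWrapper` and the toy data are hypothetical. It wraps NumPy's least squares solver (task $A$) in the `fit()`, `predict()`, and `score()` methods that `cross_validate()` (task $B$) expects.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_validate

class LstsqWrapper(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper: exposes np.linalg.lstsq through fit/predict/score."""

    def fit(self, X, y):
        # task A: least squares coefficients from plain NumPy
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X):
        return X @ self.coef_

    def score(self, X, y):
        # report the MSE on the held-out data, so smaller is better here
        return np.mean((y - self.predict(X))**2)

# made-up data: intercept column plus one predictor
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X_toy = np.column_stack([np.ones(100), x])
y_toy = 1 + 2 * x + rng.normal(size=100)

# task B: sklearn's cross-validation driver calls fit() and score() on each fold
toy_cv = cross_validate(LstsqWrapper(), X_toy, y_toy, cv=5)
print(toy_cv['test_score'].mean())
```

Inheriting from `BaseEstimator` supplies the parameter handling that `sklearn` needs in order to clone the estimator for each fold.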

The class `sklearn_sm()` 
has  as its first argument
a model from `statsmodels`. It can take two additional
optional arguments: `model_str` which can be
used to specify a formula, and `model_args` which should
be a dictionary of additional arguments used when fitting
the model. For example, to fit a logistic regression model
we have to specify a `family` argument. This
is passed as `model_args={'family':sm.families.Binomial()}`.

Here is our wrapper in action:

In [None]:
hp_model = sklearn_sm(sm.OLS,
                      MS(['horsepower']))
X, Y = Auto.drop(columns=['mpg']), Auto['mpg']
cv_results = cross_validate(hp_model,
                            X,
                            Y,
                            cv=Auto.shape[0])
cv_err = np.mean(cv_results['test_score'])
cv_err


The arguments to `cross_validate()` are as follows: an
object with the appropriate `fit()`, `predict()`,
and `score()` methods,  an
array of features `X` and a response `Y`. 
We also included an additional argument `cv` to `cross_validate()`; specifying an integer
$K$ results in $K$-fold cross-validation. We have provided a value 
corresponding to the total number of observations, which results in
leave-one-out cross-validation (LOOCV). The `cross_validate()`  function produces a dictionary with several components;
we simply want the cross-validated test score here (MSE), which is estimated to be 24.23.
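
The following sketch checks, on made-up data, that passing an integer `cv` equal to the number of observations behaves like an explicit leave-one-out splitter. It uses `LinearRegression` rather than the `ISLP` wrapper, and it requests `sklearn`'s `neg_mean_squared_error` scorer, which reports the MSE with its sign flipped; the data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, LeaveOneOut

# made-up data: 30 observations, one predictor
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 1))
y_toy = 3 * X_toy[:, 0] + rng.normal(size=30)

n = X_toy.shape[0]
# integer cv equal to n: n folds, each holding out one observation
by_int = cross_validate(LinearRegression(), X_toy, y_toy, cv=n,
                        scoring='neg_mean_squared_error')
# explicit leave-one-out splitter
by_loo = cross_validate(LinearRegression(), X_toy, y_toy, cv=LeaveOneOut(),
                        scoring='neg_mean_squared_error')

print(len(by_int['test_score']))
print(np.allclose(by_int['test_score'], by_loo['test_score']))
# LOOCV estimate of the test MSE (flip the sign of the scorer output)
print(-by_int['test_score'].mean())
```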

We can repeat this procedure for increasingly complex polynomial fits.
To automate the process, we again
use a for loop which iteratively fits polynomial
regressions of degree 1 to 5, computes the
associated cross-validation error, and stores it in the $i^{th}$ element
of the vector `cv_error`. The variable `d` in the _for loop_
corresponds to the degree of the polynomial. We begin by initializing the
vector. This command may take a couple of seconds to run.

In [None]:
cv_error = np.zeros(5)
H = np.array(Auto['horsepower'])
M = sklearn_sm(sm.OLS)
for i, d in enumerate(range(1,6)):
    X = np.power.outer(H, np.arange(d+1))
    M_CV = cross_validate(M,
                          X,
                          Y,
                          cv=Auto.shape[0])
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error


We see a sharp drop in the estimated test MSE between the linear and
quadratic fits, but then no clear improvement from using higher-degree polynomials.

Above we introduced the `outer()`  method of the `np.power()`
function.  The `outer()` method is applied to an operation
that has two arguments, such as `add()`, `minimum()`, or
`power()`.
It has two arrays as arguments, and then forms a larger
array where the operation is applied to each pair of elements of the
two arrays. 

In [None]:
A = np.array([3, 5, 9])
B = np.array([2, 4])
np.add.outer(A, B)
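
The same method with `np.power` is what built the polynomial model matrices in the CV loop. A small sketch with made-up values (the name `H_toy` is an assumption): each row holds one value raised to the powers 0, 1, and 2, so the first column is the intercept.

```python
import numpy as np

H_toy = np.array([2, 3])                       # made-up horsepower-like values
powers = np.power.outer(H_toy, np.arange(3))   # entry [i, j] is H_toy[i] ** j
print(powers)
# [[1 2 4]
#  [1 3 9]]
```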


In the CV example above, we used $K=n$, but of course we can also use $K<n$. The code is very similar
to the above (and is significantly faster). Here we use `KFold()` to partition the data into $K=10$ random groups. We use `random_state` to set a random seed and initialize a vector `cv_error` in which we will store the CV errors corresponding to the
polynomial fits of degrees one to five.
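
To see what the partition looks like, this sketch runs the same `KFold` settings on 392 row positions (the size of `Auto`) and checks that every row is held out exactly once. The variable names and the placeholder array are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=0)
n = 392                                   # number of observations in Auto
placeholder = np.zeros((n, 1))            # only the number of rows matters here

test_counts = np.zeros(n, dtype=int)
fold_sizes = []
for train_idx, test_idx in kf.split(placeholder):
    test_counts[test_idx] += 1            # tally how often each row is held out
    fold_sizes.append(len(test_idx))

print(fold_sizes)
print(np.all(test_counts == 1))
```

Because 392 is not a multiple of 10, two of the folds hold 40 rows and the rest hold 39.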

In [None]:
cv_error = np.zeros(5)
cv = KFold(n_splits=10,
           shuffle=True,
           random_state=0) # use same splits for each degree
for i, d in enumerate(range(1,6)):
    X = np.power.outer(H, np.arange(d+1))
    M_CV = cross_validate(M,
                          X,
                          Y,
                          cv=cv)
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error


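Notice that the computation time is much shorter than that of LOOCV.
(In principle, the computation time for LOOCV for a least squares
linear model should be faster than for $K$-fold CV, because a shortcut formula based on the leverage of each observation gives the LOOCV error from a single fit; however, the generic `cross_validate()` function does not make use of this shortcut.)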
We still see little evidence that using cubic
or higher-degree polynomial terms leads to a lower test error than simply
using a quadratic fit.
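
The remark about least squares rests on a known identity: for a least squares fit, the LOOCV error equals $\frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$, where $\hat{y}_i$ comes from the single fit to all of the data and $h_i$ is the leverage of observation $i$. The sketch below checks this on made-up data with plain NumPy; it is an illustration under these assumptions, not something `cross_validate()` does.

```python
import numpy as np

# made-up data with a quadratic model matrix
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
X_toy = np.column_stack([np.ones(n), x, x**2])
y_toy = 1 + x - 0.5 * x**2 + rng.normal(size=n)

# single fit on all of the data
beta, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)
resid = y_toy - X_toy @ beta
# leverages: diagonal of the hat matrix X (X'X)^{-1} X'
hat = X_toy @ np.linalg.solve(X_toy.T @ X_toy, X_toy.T)
h = np.diag(hat)
loocv_shortcut = np.mean((resid / (1 - h))**2)

# brute force: refit n times, leaving out one observation each time
errs = np.zeros(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X_toy[keep], y_toy[keep], rcond=None)
    errs[i] = (y_toy[i] - X_toy[i] @ b_i)**2
loocv_brute = errs.mean()

print(loocv_shortcut, loocv_brute)
```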

The `cross_validate()` function is flexible and can take
different splitting mechanisms as an argument. For instance, one can use the `ShuffleSplit()` function to implement
the test/validation set approach just as easily as $K$-fold cross-validation.
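
As a quick check of what such a splitter produces, this sketch asks `ShuffleSplit()` for a single split of 392 row positions (the size of `Auto`) and inspects the index sets. The placeholder array and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

splitter = ShuffleSplit(n_splits=1, test_size=196, random_state=0)
# only the number of rows matters for generating the indices
splits = list(splitter.split(np.zeros((392, 1))))
train_idx, test_idx = splits[0]

print(len(splits), len(train_idx), len(test_idx))
```

With `n_splits=1` there is one training set and one validation set, which is exactly the validation set approach.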

In [None]:
validation = ShuffleSplit(n_splits=1,
                          test_size=196,
                          random_state=0)
results = cross_validate(hp_model,
                         Auto.drop(['mpg'], axis=1),
                         Auto['mpg'],
                         cv=validation);
results['test_score']


One can estimate the variability in the test error by running the following:

In [None]:
validation = ShuffleSplit(n_splits=10,
                          test_size=196,
                          random_state=0)
results = cross_validate(hp_model,
                         Auto.drop(['mpg'], axis=1),
                         Auto['mpg'],
                         cv=validation)
results['test_score'].mean(), results['test_score'].std()


Note that this standard deviation is not a valid estimate of the sampling variability of the mean test score or the individual scores, since the randomly-selected training samples overlap and hence introduce correlations. But it does give an idea of the Monte Carlo variation incurred by picking different random folds.

In [None]:
cv = KFold(n_splits=10,
           shuffle=True,
           random_state=0) # use same splits for each degree
results = cross_validate(hp_model,
                         Auto.drop(['mpg'], axis=1),
                         Auto['mpg'],
                         cv=cv);
results['test_score']

In [None]:
cv = ShuffleSplit(n_splits=10,
                          test_size=196,
                          random_state=0)
results = cross_validate(hp_model,
                         Auto.drop(['mpg'], axis=1),
                         Auto['mpg'],
                         cv=cv)
results['test_score']

Now we will apply cross-validation to a regression on the penguins data.


In [None]:
penguins = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2301/Data/penguins.csv", na_values=['NA'])
# remove rows with missing data
penguins.dropna(inplace=True)
penguins.head()
print(penguins.info())

In [None]:
pens_model = sklearn_sm(sm.OLS)
X = penguins.drop(['species','island',
                   'body_mass_g','sex','year'],axis=1)

Y = penguins['body_mass_g'] 


For comparison, we first fit a linear regression to all of the data and compute its RMSE on those same data. We then compare this in-sample RMSE to the RMSE that cross-validation estimates for test data.

In [None]:
X = penguins[['bill_depth_mm', 'flipper_length_mm', 'bill_length_mm']]  
y = penguins['body_mass_g']  


# Create a linear regression model
bluejay_model3 = LinearRegression()

# Fit the model on the  data
bluejay_model3.fit(X, y)

# Make predictions on the  data
y_hat = bluejay_model3.predict(X)

# Evaluate the model
rmse = root_mean_squared_error(y, y_hat)
print('Root Mean Squared Error:', rmse)

In [None]:
# ask cross_validate for the negative MSE on each fold,
# then flip the sign and take the square root to get the RMSE
cv = KFold(n_splits=6,
           shuffle=True,
           random_state=42)
results = cross_validate(pens_model,
                         X,
                         Y,
                         cv=cv,
                         scoring=('neg_mean_squared_error'));
np.sqrt(-1*results['test_score'])


The RMSE for the regression fit to the full data is $390.6$, while the average RMSE across the K-fold splits is about $450$.  The cross-validated error is worse, but it is more likely to reflect what we would observe when applying the model to new data.
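
The sign-and-square-root conversion can be checked on made-up data. The sketch below uses `LinearRegression` and synthetic predictors (assumptions for illustration, not the penguins data). It also requests `sklearn`'s `neg_root_mean_squared_error` scorer, which returns the negated RMSE directly, and computes the in-sample RMSE for comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# made-up data: 120 observations, three predictors
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=120)

cv_toy = KFold(n_splits=6, shuffle=True, random_state=42)
scores = cross_validate(LinearRegression(), X_toy, y_toy, cv=cv_toy,
                        scoring=('neg_mean_squared_error',
                                 'neg_root_mean_squared_error'))
# with several scorers, the result keys are 'test_<scorer name>'
rmse_from_mse = np.sqrt(-scores['test_neg_mean_squared_error'])
rmse_direct = -scores['test_neg_root_mean_squared_error']

# in-sample RMSE from one fit on all of the data, for comparison
fit_all = LinearRegression().fit(X_toy, y_toy)
rmse_in_sample = np.sqrt(np.mean((y_toy - fit_all.predict(X_toy))**2))
print(rmse_in_sample, rmse_direct.mean())
```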

### Tasks

1. Fit a multiple linear regression to *all* of the data that has flipper length and bill length as predictors.  Find the RMSE.  

2. Now do a K-fold cross validation of the model in Task 1 with 6 folds and find the RMSE.  How do your answers in this task compare to the answers in Task 1?  

3. Change the number of folds, *n_splits*, in Task 2 from 6 to 10.  How does that change your RMSE results?

4. Change the random seed, *random_state*, in Tasks 2 and 3 from 42 to 20250217.  How does that change your RMSE results?

5. Why might we want to change the number of splits in our code?  What are the advantages of a large number of folds and what are the advantages of a small number of folds?