# Cross Validation

In the previous notebook, we see that regularization is crucial to training a good model. The strength of regularization is controlled by a **hyperparameter** ```alpha```. How should we pick such hyperparameters?

In [16]:
import pandas as pd # organize data
from sklearn.linear_model import LinearRegression, Ridge, Lasso # regressions
import numpy as np # calculate mean and standard deviation

In [17]:
# Load data. 
auto = pd.read_csv("../Data/auto.csv")

# Check data
auto.head()

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
0,AMC Concord,4099,22,3.0,2.5,11,2930,186,40,121,3.58,Domestic
1,AMC Pacer,4749,17,3.0,3.0,11,3350,173,40,258,2.53,Domestic
2,AMC Spirit,3799,22,,3.0,12,2640,168,35,121,3.08,Domestic
3,Buick Century,4816,20,3.0,4.5,16,3250,196,40,196,2.93,Domestic
4,Buick Electra,7827,15,4.0,4.0,20,4080,222,43,350,2.41,Domestic


### A. A simple out-of-sample test
Let's start with a simple out-of-sample test: we will divide our data into two parts, one for training the model and the other for testing the model's out-of-sample performance. The former is commonly called **training set** while the latter **test set** or **holdout set**.

In [18]:
# Pick variables
y = auto["price"]
x = auto[["mpg","weight"]]

# Use about 3/4 of data for training: 60 samples
train_num = 60

# In-sample data for training model
y_in = y[:train_num]
x_in = x[:train_num]

# Out-of-sample data for testing model
y_out = y[train_num:]
x_out = x[train_num:]

# Train Ridge model and show R-Squared values
ridge = Ridge(alpha=5000000)
ridge.fit(x_in,y_in)
print("in-sample R-squared:",ridge.score(x_in,y_in))
print("Out-of-sample R-squared:",ridge.score(x_out,y_out))

in-sample R-squared: 0.3239678139252299
Out-of-sample R-squared: 0.04190498919752106


As before, let's loop through different values of alpha and see how it affects the model's performance.

In [19]:
# Alphas to go through
alphas = [1,5,50,500,5000,50000,500000,5000000,50000000,500000000]

for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(x_in,y_in)
    print(str(a).ljust(10), #left-justified, width=10
          str(round(ridge.score(x_in,y_in),2)).ljust(8), #left justified, width=5
          str(round(ridge.score(x_out,y_out),2)).rjust(5)) #right justified, width=8

1          0.33     -0.13
5          0.33     -0.13
50         0.33     -0.12
500        0.33     -0.09
5000       0.33     -0.06
50000      0.33     -0.05
500000     0.33     -0.04
5000000    0.32      0.04
50000000   0.21      0.16
500000000  0.04      0.04


While regularization helps us get more consistent performance, our model simply isn't really good. What could be the problem?

### B. Shuffling data

If the data is sorted, splitting the data sequentially would give us unrepresentative sets of data. To deal with that, we can shuffle our data before splitting it up.

In [20]:
# Import function for shuffling
from sklearn.utils import shuffle 

# Shuffle observations
y,x = shuffle(auto["price"],auto[["mpg","weight"]],random_state=1234)

# Use about 3/4 of data for training: 60 samples
train_num = 60

# In-sample data for training model
y_in = y[:train_num]
x_in = x[:train_num]

# Out-of-sample data for testing model
y_out = y[train_num:]
x_out = x[train_num:]

# Train Ridge model with different alphas and show R-Squared values
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(x_in,y_in)
    print(str(a).ljust(10), # left-justified, width=10
          str(round(ridge.score(x_in,y_in),2)).ljust(8), # left justified, width=5
          str(round(ridge.score(x_out,y_out),1)).rjust(5)) # right justified, width=8

1          0.27       0.4
5          0.27       0.4
50         0.27       0.4
500        0.27       0.4
5000       0.27       0.4
50000      0.27       0.4
500000     0.27       0.4
5000000    0.27       0.4
50000000   0.18       0.2
500000000  0.03      -0.0


Should you shuffle your data before splitting it? It depends. For cross-section data it is probably a good idea, but for time-series data it would be a bad idea, since you are introducing *hindsight bias* if you can train with data that is generated after some of your test data.

### C. train_test_split

In practice, you will probably use sckit-learn's ```train_test_split``` method to split the data. ```train_test_split``` shuffles the data by default, so there is no need to call ```shuffle``` separately. The default is a 75/25 split, which you can change by providing a different ```train_size``` or ```test_size```.

In [21]:
from sklearn.model_selection import train_test_split

# Splitting the data
y_in,y_out,x_in,x_out = train_test_split(auto["price"],
                                         auto[["mpg","weight"]],
                                         train_size=0.8)

# Train Ridge model with different alphas and show R-Squared values
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(x_in,y_in)
    print(str(a).ljust(10), # left-justified, width=10
          str(round(ridge.score(x_in,y_in),2)).ljust(8), # left justified, width=5
          str(round(ridge.score(x_out,y_out),1)).rjust(5)) # right justified, width=8

1          0.28       0.3
5          0.28       0.3
50         0.28       0.3
500        0.28       0.3
5000       0.28       0.3
50000      0.28       0.3
500000     0.28       0.3
5000000    0.27       0.3
50000000   0.18       0.2
500000000  0.04       0.0


### D. Validation

So we try out different values of ```alpha``` and pick the one that give us the highest out-of-sample score. Do so is actually problematic: since ```alpha``` is a parameter of our model, we are effectively training our model with the supposingly out-of-sample data, which means the test set no longer gives us truly out-of-sample results. In particular, there is a real chance of overfitting our model to the test set via ```alpha```.


The correct approach is to split the data into three parts: besides the train set and test set, we have an additional **validation set** for picking the model's hyperparameters. It is common to use around 60% of the data for training and 20% each for validation and test.

In [22]:
# 64% for training, 16% for validation and 20% for out-of-sample test
y_in,y_out,x_in,x_out = train_test_split(auto["price"],
                                         auto[["mpg","weight"]],
                                         test_size=0.2)
y_train,y_valid,x_train,x_valid = train_test_split(y_in,
                                                   x_in,
                                         train_size=0.8)

# Train Ridge model with different alphas and show R-Squared values
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(x_in,y_in)
    print(str(a).ljust(10),
          str(round(ridge.score(x_in,y_in),2)).ljust(8), 
          str(round(ridge.score(x_out,y_out),1)).rjust(5)) 

1          0.3        0.2
5          0.3        0.2
50         0.3        0.2
500        0.3        0.2
5000       0.29       0.3
50000      0.29       0.3
500000     0.29       0.3
5000000    0.29       0.3
50000000   0.19       0.2
500000000  0.03       0.0


After picking the best alpha based on validation data, the final step is to test the model's out-of-sample performance with the test set.

In [23]:
a = 500000
ridge = Ridge(alpha=a)
ridge.fit(x_train,y_train)
print(str(a).ljust(10), 
      str(round(ridge.score(x_in,y_in),2)).ljust(8), 
      str(round(ridge.score(x_out,y_out),2)).rjust(5)) 

500000     0.28      0.18


We can automate the process of picking alpha:

In [24]:
alphas = [1,5,50,500,5000,50000,500000,5000000,50000000]

# Loop through alphas and update the best model if needed
best_model = None
best_score = -99

for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(x_train,y_train)
    
    training_score = ridge.score(x_in,y_in)
    valid_score = ridge.score(x_valid,y_valid)
    print(str(a).ljust(10), 
          str(round(training_score,2)).ljust(8), 
          str(round(valid_score,2)).rjust(5)) 
    
    if valid_score > best_score:
        best_score = valid_score
        best_model = ridge

# Check model performance with test data
print("Best alpha value:",best_model.alpha)
print("Test R-Squared:",round(best_model.score(x_out,y_out),2))

1          0.29     -0.35
5          0.29     -0.35
50         0.29     -0.34
500        0.29     -0.31
5000       0.29     -0.28
50000      0.28     -0.27
500000     0.28     -0.26
5000000    0.28     -0.18
50000000   0.18     -0.05
Best alpha value: 50000000
Test R-Squared: 0.19


### E. K-Fold Cross Validation

A problem with dividing the data into three parts is that we are using a lot less data for training. **K-Fold Cross Validation** is a method to overcome that problem: instead of having a separate validation set, we divide our training set into $K$ equal parts. We use $K-1$ parts for training and validate with the remaining part. This process can be repeated for $K$ times, each time using a different part for validation. We then take the average score from these $K$ runs to pick our hyperparameters.

<img src="../Images/cross_validation.png" width="80%">
Source: <a href="https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6">
Towards Data Science</a>

In [25]:
from sklearn.model_selection import cross_val_score

ridge = Ridge(alpha=5000)
scores = cross_val_score(ridge,x_in,y_in,cv=5)

scores 

array([ 0.27499178, -5.69854808,  0.38183644, -0.88662471,  0.16251918])

As before, we can loop through different alphas and pick the one that works best.

In [26]:
train_num = 60
alphas = [1,5,50,500,5000,50000,500000,5000000,50000000,500000000]

score_list = [] #List for saving scores

# Splitting the data
y_in,y_out,x_in,x_out = train_test_split(auto["price"],
                                         auto[["mpg","weight"]],
                                         train_size=0.8)

# Loop through different alphas
best_alpha = None
best_score = -99

for a in alphas:
    ridge = Ridge(alpha=a)
    scores = cross_val_score(ridge,x,y,cv=5)
    avg_score = np.mean(scores)
    print(str(a).ljust(10),
          str(round(avg_score,4)).rjust(5))
    
    if avg_score > best_score:
        best_score = avg_score
        best_alpha = a

# Check model performance with test data
best_model = Ridge(alpha=best_alpha)
best_model.fit(x_in,y_in)
print("Best alpha value:",best_alpha)
print("Test R-Squared:",round(best_model.score(x_out,y_out),2))

1          0.1482
5          0.1483
50         0.1501
500        0.1579
5000       0.1639
50000      0.1649
500000     0.1656
5000000    0.1664
50000000   0.0683
500000000  -0.1123
Best alpha value: 5000000
Test R-Squared: -0.07


K-Fold cross-validation trades data with training time. Having a high number of folds might be worthwhile when data is limited and the model is relatively simple. For models such neural networks that are time-consuming to train, the number of folds will have to be low, perhaps to the point that only the simple train-validation-test split is feasible.

### F. GridSearchCV

In practice, you should use scikit-learn's ```GridSearchCV``` instead of writing your own loop. This is particularly true if the model has multiple hyperparameters to tune.

In [27]:
# GridSearchCV
from sklearn.model_selection import GridSearchCV

# Use a dictionary to specify the parameters we need to go through
parameters = {'alpha':[1,5,50,500,5000,50000,500000,5000000,50000000]}
ridge = Ridge()
gscv = GridSearchCV(ridge,parameters,cv=5)
gscv.fit(x_in, y_in)

GridSearchCV(cv=5, estimator=Ridge(),
             param_grid={'alpha': [1, 5, 50, 500, 5000, 50000, 500000, 5000000,
                                   50000000]})

The best-performing hyperparameter(s) and the best score are recorded in ```best_params_``` and ```best_score_``` respectively:

In [28]:
# Best parameter(s)
gscv.best_params_

{'alpha': 500000}

In [29]:
# Best score
gscv.best_score_

0.17953714455658307

The best_estimator is saved in ```best_estimator_```. We can use that for out-of-sample test or making predictions:

In [30]:
best_model = gscv.best_estimator_
best_model.score(x_out,y_out)

-0.11636418877511123