# Cross-Validation

In the previous notebook, we see that regularization is crucial to training a good model. The strength of regularization is controlled by a **hyperparameter** ```alpha```. How should we pick such hyperparameters?

In [2]:
import pandas as pd #organize data
from sklearn.linear_model import LinearRegression, Ridge, Lasso #regressions
import numpy as np #calculate mean and standard deviation

In [3]:
#Load data. 
auto = pd.read_csv("auto.csv")

#Check data
auto.head()

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
0,AMC Concord,4099,22,3.0,2.5,11,2930,186,40,121,3.58,Domestic
1,AMC Pacer,4749,17,3.0,3.0,11,3350,173,40,258,2.53,Domestic
2,AMC Spirit,3799,22,,3.0,12,2640,168,35,121,3.08,Domestic
3,Buick Century,4816,20,3.0,4.5,16,3250,196,40,196,2.93,Domestic
4,Buick Electra,7827,15,4.0,4.0,20,4080,222,43,350,2.41,Domestic


### A. A simple out-of-sample test
Let's start with a simple out-of-sample test: we will divide our data into two parts, one for training the model and the other for testing the model's out-of-sample performance. The former is commonly called **training set** while the latter **test set** or **holdout set**.

In [4]:
#Pick variables
y = auto["price"]
x = auto[["mpg","weight"]]

#Use about 3/4 of data for training: 60 samples
train_num = 60

#In-sample data for training model
y_in = y[:train_num]
x_in = x[:train_num]

#Out-of-sample data for testing model
y_out = y[train_num:]
x_out = x[train_num:]

#Train OLS model and show R-Squared values
ols = LinearRegression()
ols.fit(x_in,y_in)
print("in-sample R-squared:",ols.score(x_in,y_in))
print("Out-of-sample R-squared:",ols.score(x_out,y_out))

in-sample R-squared: 0.332711250555
Out-of-sample R-squared: -0.13200781595


Now let's try training a Ridge regression.

In [5]:
#Train Ridge model and show R-Squared values
ridge = Ridge(alpha=5000000)
ridge.fit(x_in,y_in)
print("in-sample R-squared:",ridge.score(x_in,y_in))
print("Out-of-sample R-squared:",ridge.score(x_out,y_out))

in-sample R-squared: 0.323967813925
Out-of-sample R-squared: 0.0419049891975


As before, let's loop through different values of alpha and see how it affects the model's performance.

In [6]:
#Alphas to go through
alphas = [1,5,50,500,5000,50000,500000,5000000,50000000,500000000]

for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(x_in,y_in)
    print(str(a).ljust(10), #left-justified, width=10
          str(round(ridge.score(x_in,y_in),2)).ljust(8), #left justified, width=5
          str(round(ridge.score(x_out,y_out),2)).rjust(5)) #right justified, width=8

1          0.33     -0.13
5          0.33     -0.13
50         0.33     -0.12
500        0.33     -0.09
5000       0.33     -0.06
50000      0.33     -0.05
500000     0.33     -0.04
5000000    0.32      0.04
50000000   0.21      0.16
500000000  0.04      0.04


While regularization helps us get more consistent performance, our model simply isn't really good. What could be the problem?

### B. Shuffling data

If the data is sorted, splitting the data sequentially would give us unrepresentative sets of data. To deal with that, we can shuffle our data before splitting it up.

In [7]:
#Import function for shuffling
from sklearn.utils import shuffle 

#Shuffle observations
y,x = shuffle(auto["price"],auto[["mpg","weight"]],random_state=1234)

#Use about 3/4 of data for training: 60 samples
train_num = 60

#In-sample data for training model
y_in = y[:train_num]
x_in = x[:train_num]

#Out-of-sample data for testing model
y_out = y[train_num:]
x_out = x[train_num:]

#Train OLS model and show R-Squared values
ols = LinearRegression()
ols.fit(x_in,y_in)
print("in-sample R-squared:",ols.score(x_in,y_in))
print("Out-of-sample R-squared:",ols.score(x_out,y_out))

in-sample R-squared: 0.273990201463
Out-of-sample R-squared: 0.392093024031


How would the Ridge regression fare in this case?

In [8]:
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(x_in,y_in)
    print(str(a).ljust(10), 
          str(round(ridge.score(x_in,y_in),2)).ljust(8), 
          str(round(ridge.score(x_out,y_out),2)).rjust(5)) 

1          0.27      0.39
5          0.27      0.39
50         0.27      0.39
500        0.27      0.39
5000       0.27      0.39
50000      0.27      0.39
500000     0.27      0.39
5000000    0.27      0.39
50000000   0.18      0.25
500000000  0.03      -0.0


Should you shuffle your data before splitting it? It depends. For cross-section data it is probably a good idea, but for time-series data it would be a bad idea, since you are introducing *hindsight bias* if you can train with data that is generated after some of your test data.

### C. Validation

So we try out different values of ```alpha``` and pick the one that give us the highest out-of-sample score. Do so is actually problematic: since ```alpha``` is a parameter of our model, we are effectively training our model with the supposingly out-of-sample data, which means the test set no longer gives us truly out-of-sample results. In particular, there is a real chance of overfitting our model to the test set via ```alpha```.


The correct approach is to split the data into three parts: besides the train set and test set, we have an additional **validation set** for picking the model's hyperparameters. It is common to use around 60% of the data for training and 20% each for validation and test.

In [9]:
#Cutoffs
train_num = 45 #Number of samples used in training
valid_num = 15 #Number of samples used in picking alpha

#Data for training model
y_in = y[:train_num]
x_in = x[:train_num]

#Data for picking alpha
y_valid = y[train_num:train_num+valid_num]
x_valid = x[train_num:train_num+valid_num]

#Data for testing model
y_test = y[train_num+valid_num:]
x_test = x[train_num+valid_num:]

#Try different alphas
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(x_in,y_in)
    print(str(a).ljust(10), 
          str(round(ridge.score(x_in,y_in),2)).ljust(8), 
          str(round(ridge.score(x_valid,y_valid),2)).rjust(5)) 

1          0.35      0.01
5          0.35      0.01
50         0.35      0.01
500        0.34      0.05
5000       0.33      0.08
50000      0.33      0.08
500000     0.33      0.09
5000000    0.32      0.08
50000000   0.19      0.02
500000000  0.03     -0.06


After picking the best alpha based on validation data, the final step is to test the model's out-of-sample performance with the test set.

In [10]:
a = 500000
ridge = Ridge(alpha=a)
ridge.fit(x_in,y_in)
print(str(a).ljust(10), 
      str(round(ridge.score(x_in,y_in),2)).ljust(8), 
      str(round(ridge.score(x_test,y_test),2)).rjust(5)) 

500000     0.33      0.39


We can automate the process of picking alpha:

In [11]:
train_num = 45
valid_num = 15
alphas = [1,5,50,500,5000,50000,500000,5000000,50000000]

models = [] #List for saving models

#Data for training model
y_in = y[:train_num]
x_in = x[:train_num]

#Data for picking alpha
y_valid = y[train_num:train_num+valid_num]
x_valid = x[train_num:train_num+valid_num]

#Data for testing model
y_test = y[train_num+valid_num:]
x_test = x[train_num+valid_num:]

#Try different alphas
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(x_in,y_in)    
    models.append(ridge) #Save model to list

#Go through saved models and pick the best one based on validation score
best_model = None
best_score = -99

for m in models:
    a = m.alpha
    training_score = m.score(x_in,y_in)
    valid_score = m.score(x_valid,y_valid)
    print(str(a).ljust(10), 
          str(round(training_score,2)).ljust(8), 
          str(round(valid_score,2)).rjust(5)) 
    
    if valid_score > best_score:
        best_score = valid_score
        best_model = m

#Check model performance with test data
print("Best alpha value:",best_model.alpha)
print("Test R-Squared:",round(best_model.score(x_test,y_test),2))

1          0.35      0.01
5          0.35      0.01
50         0.35      0.01
500        0.34      0.05
5000       0.33      0.08
50000      0.33      0.08
500000     0.33      0.09
5000000    0.32      0.08
50000000   0.19      0.02
Best alpha value: 500000
Test R-Squared: 0.39


### C. K-Fold Cross Validation

A problem with dividing the data into three parts is that we are using a lot less data for training. **K-Fold Cross Validation** is a method to overcome that problem: instead of having a separate validation set, we divide our training set into $K$ equal parts. We use $K-1$ parts for training and validate with the remaining part. This process can be repeated for $K$ times, each time using a different part for validation. We then take the average score from these $K$ runs to pick our hyperparameters.

<img src="https://cdn-images-1.medium.com/max/1600/1*J2B_bcbd1-s1kpWOu_FZrg.png" width="80%">
Source: <a href="https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6">
Towards Data Science</a>

In [12]:
from sklearn.model_selection import cross_val_score

ridge = Ridge(alpha=5000)
scores = cross_val_score(ridge,x,y,cv=5)

scores 

array([ 0.2378138 , -0.25374265,  0.36447818,  0.08180748,  0.3891237 ])

As before, we can loop through different alphas and pick the one that works best.

In [13]:
train_num = 60
alphas = [1,5,50,500,5000,50000,500000,5000000,50000000,500000000]

score_list = [] #List for saving scores

#Data for training model
y_in = y[:train_num]
x_in = x[:train_num]

#Data for testing model
y_test = y[train_num:]
x_test = x[train_num:]

#Try different alphas
for a in alphas:
    ridge = Ridge(alpha=a)
    scores = cross_val_score(ridge,x,y,cv=5)
    
    score_list.append([a,np.mean(scores)]) #Save scores to list

#Go through saved models and pick the best one based on validation score
best_alpha = None
best_score = -99

for s in score_list:
    a = s[0]
    avg_score = s[1]
    print(str(a).ljust(10),
          str(round(avg_score,4)).rjust(5))
    
    if avg_score > best_score:
        best_score = avg_score
        best_alpha = a

#Check model performance with test data
print("Best alpha value:",best_alpha)
print("Test R-Squared:",round(best_model.score(x_test,y_test),2))

1          0.1482
5          0.1483
50         0.1501
500        0.1579
5000       0.1639
50000      0.1649
500000     0.1656
5000000    0.1664
50000000   0.0683
500000000  -0.1123
Best alpha value: 5000000


**K-Fold cross-validation** trades a data with training time. This might be a worthwhile tradeoff when data is limited and the model is relatively simple. For models such neural networks that are time-consuming to train, the simple train-validation-test split is often the only feasible way.