# Cross-Validation

In the previous notebook, we saw that regularization is crucial to training a good model. The strength of regularization is controlled by a **hyperparameter**, ```alpha```. How should we pick such hyperparameters?

In [None]:
import pandas as pd #organize data
from sklearn.linear_model import LinearRegression, Ridge, Lasso #regressions
import numpy as np #calculate mean and standard deviation

In [None]:
#Load data. 
auto = pd.read_csv("auto.csv")

#Check data
auto.head()

### A. A simple out-of-sample test
Let's start with a simple out-of-sample test: we will divide our data into two parts, one for training the model and the other for testing the model's out-of-sample performance. The former is commonly called the **training set**, while the latter is called the **test set** or **holdout set**.
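
As a rough sketch of such a split, the example below uses synthetic data rather than the notebook's `auto.csv`. The 80-sample size, the three features and their coefficients are assumptions made purely for illustration; the variable names follow the cells later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real data (assumption): 80 samples, 3 features
rng = np.random.default_rng(0)
x = rng.normal(size=(80, 3))
y = x @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=80)

train_num = 60  # about 3/4 of the 80 samples

# In-sample data for training the model
x_in, y_in = x[:train_num], y[:train_num]

# Out-of-sample data for testing the model
x_out, y_out = x[train_num:], y[train_num:]

# Train OLS on the training set only, then score on both sets
ols = LinearRegression()
ols.fit(x_in, y_in)
r2_in = ols.score(x_in, y_in)
r2_out = ols.score(x_out, y_out)
print("In-sample R-squared:", round(r2_in, 3))
print("Out-of-sample R-squared:", round(r2_out, 3))
```

The model never sees `x_out` or `y_out` during fitting, so the second score is an estimate of how it would do on new data.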

In [None]:
#Pick variables


#Use about 3/4 of data for training: 60 samples


#In-sample data for training model


#Out-of-sample data for testing model


#Train OLS model and show R-Squared values


Now let's try training a Ridge regression.

In [None]:
#Train Ridge model and show R-Squared values


As before, let's loop through different values of alpha and see how it affects the model's performance.

In [None]:
#Alphas to go through
alphas = [1,5,50,500,5000,50000,500000,5000000,50000000,500000000]

#Loop

While regularization helps us get more consistent performance, our model still isn't very good. What could be the problem?

### B. Shuffling data

If the data is sorted, splitting the data sequentially would give us unrepresentative sets of data. To deal with that, we can shuffle our data before splitting it up.
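
To see why sorted data is a problem, here is a small sketch on toy data that is deliberately sorted by the target (an assumption for illustration, not the notebook's data). It uses `sklearn.utils.shuffle`, which shuffles several arrays in the same order so that rows of `x` and `y` stay aligned.

```python
import numpy as np
from sklearn.utils import shuffle

# Toy data sorted by the target (assumption, for illustration only)
x = np.arange(20).reshape(-1, 1)
y = np.arange(20) * 10

# Sequential split: the test set holds only the largest targets
train_num = 15
print("Unshuffled test targets:", y[train_num:])

# Shuffle x and y together; random_state makes the shuffle repeatable
x_s, y_s = shuffle(x, y, random_state=0)
print("Shuffled test targets:  ", y_s[train_num:])
```

Without shuffling, the model would be trained on small targets and tested only on large ones, which says little about its performance on typical data.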

In [None]:
#Import function for shuffling
from sklearn.utils import shuffle 

#Shuffle observations


#Use about 3/4 of data for training: 60 samples


#In-sample data for training model


#Out-of-sample data for testing model


#Train OLS model and show R-Squared values
ols = LinearRegression()
ols.fit(x_in,y_in)
print("In-sample R-squared:",ols.score(x_in,y_in))
print("Out-of-sample R-squared:",ols.score(x_out,y_out))

How would the Ridge regression fare in this case?

In [None]:
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(x_in,y_in)
    print(str(a).ljust(10), #alpha: left-justified, width=10
          str(round(ridge.score(x_in,y_in),2)).ljust(8), #in-sample R-squared: left-justified, width=8
          str(round(ridge.score(x_out,y_out),2)).rjust(5)) #out-of-sample R-squared: right-justified, width=5

Should you shuffle your data before splitting it? It depends. For cross-sectional data it is probably a good idea, but for time-series data it is a bad idea: if the model can train on data generated after some of the test data, you introduce *hindsight bias* (often called look-ahead bias).
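
For time-ordered data, one option is to keep the rows in order and always test on observations that come after the training ones. The sketch below uses scikit-learn's `TimeSeriesSplit`, which this notebook does not otherwise use; the ten-row toy data and the assumption that row index equals time are for illustration only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten observations in time order (assumption: row index = time)
x = np.arange(10).reshape(-1, 1)

# Each split trains on an initial stretch and tests on the rows right after it
tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(x))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

In every split the training indices all precede the test indices, so the model never learns from the future.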

### C. Validation

So we try out different values of ```alpha``` and pick the one that gives us the highest out-of-sample score. Doing so is actually problematic: since ```alpha``` is a parameter of our model, we are effectively training our model with the supposedly out-of-sample data, which means the test set no longer gives us truly out-of-sample results. In particular, there is a real chance of overfitting our model to the test set via ```alpha```.


The correct approach is to split the data into three parts: besides the training set and test set, we have an additional **validation set** for picking the model's hyperparameters. It is common to use around 60% of the data for training and 20% each for validation and test.
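
A 60/20/20 split can be expressed with two cutoffs. The following is a minimal sketch on synthetic data; the 100-sample size and the names `train_end`, `val_end`, `x_val` and `y_val` are assumptions chosen for this illustration.

```python
import numpy as np
from sklearn.utils import shuffle

# Synthetic stand-in for the real data (assumption): 100 samples, 3 features
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = x @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)
x, y = shuffle(x, y, random_state=0)

# Cutoffs at 60% and 80% of the data (integer arithmetic avoids rounding surprises)
n = len(y)
train_end = n * 60 // 100
val_end = n * 80 // 100

# Data for training the model
x_in, y_in = x[:train_end], y[:train_end]

# Data for picking alpha
x_val, y_val = x[train_end:val_end], y[train_end:val_end]

# Data for testing the model
x_out, y_out = x[val_end:], y[val_end:]

print(len(y_in), len(y_val), len(y_out))
```

The three slices do not overlap, so the test set stays untouched while alpha is being chosen.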

In [None]:
#Cutoffs


#Data for training model


#Data for picking alpha


#Data for testing model


#Try different alphas


After picking the best alpha based on validation data, the final step is to test the model's out-of-sample performance with the test set.

In [None]:
#Best alpha here
a = 

#Test set
ridge = Ridge(alpha=a)
ridge.fit(x_in,y_in)
print(str(a).ljust(10), #alpha: left-justified, width=10
      str(round(ridge.score(x_in,y_in),2)).ljust(8), #in-sample R-squared: left-justified, width=8
      str(round(ridge.score(x_out,y_out),2)).rjust(5)) #test R-squared: right-justified, width=5

We can automate the process of picking alpha:
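
One possible way to do this, sketched on synthetic data: fit one model per alpha, keep them all in a list, and let `max` select the one with the highest validation score. The data, the shortened `alphas` list and the names `models`, `x_val` and `y_val` are assumptions for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in for the real data (assumption): 100 samples, 3 features
rng = np.random.default_rng(1)
x = rng.normal(size=(100, 3))
y = x @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

# 60/20/20 split into training, validation and test sets
x_in, y_in = x[:60], y[:60]
x_val, y_val = x[60:80], y[60:80]
x_out, y_out = x[80:], y[80:]

alphas = [1, 5, 50, 500, 5000]

# Loop through alphas and save each fitted model (fit returns the model itself)
models = [Ridge(alpha=a).fit(x_in, y_in) for a in alphas]

# Pick the model with the highest validation R-squared
best_model = max(models, key=lambda m: m.score(x_val, y_val))

# Only now touch the test set
print("Best alpha value:", best_model.alpha)
print("Test R-Squared:", round(best_model.score(x_out, y_out), 2))
```

The test set is used exactly once, after alpha has already been fixed by the validation set.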

In [None]:
#Loop through alphas and save each model


#Go through saved models and pick the best one based on validation score


#Check model performance with test data
print("Best alpha value:",best_model.alpha)
print("Test R-Squared:",round(best_model.score(x_out,y_out),2))

### D. K-Fold Cross Validation

A problem with dividing the data into three parts is that we are using a lot less data for training. **K-Fold Cross Validation** is a method to overcome that problem: instead of having a separate validation set, we divide our training set into $K$ equal parts. We use $K-1$ parts for training and validate with the remaining part. This process is repeated $K$ times, each time using a different part for validation. We then take the average score from these $K$ runs to pick our hyperparameters.

<img src="https://cdn-images-1.medium.com/max/1600/1*J2B_bcbd1-s1kpWOu_FZrg.png">
Source: <a href="https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6">
Towards Data Science</a>
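
To make the mechanics concrete, the sketch below runs the $K$ rounds by hand with scikit-learn's `KFold` and compares the result with `cross_val_score`. The synthetic data, `alpha=1.0` and $K=5$ are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a 60-sample training set (assumption)
rng = np.random.default_rng(2)
x = rng.normal(size=(60, 3))
y = x @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=60)

# Manual K-fold: train on K-1 parts, validate on the remaining part
kf = KFold(n_splits=5)  # K = 5, consecutive folds without shuffling
manual_scores = []
for train_idx, val_idx in kf.split(x):
    model = Ridge(alpha=1.0)
    model.fit(x[train_idx], y[train_idx])
    manual_scores.append(model.score(x[val_idx], y[val_idx]))

# The same procedure in one call; for a regressor, cv=5 uses unshuffled KFold
auto_scores = cross_val_score(Ridge(alpha=1.0), x, y, cv=5)

print("Manual mean R-squared:", round(np.mean(manual_scores), 4))
print("cross_val_score mean: ", round(np.mean(auto_scores), 4))
```

Each of the five scores comes from a model that never saw its validation fold, and the average of the five is the number used to compare hyperparameter values.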

In [None]:
from sklearn.model_selection import cross_val_score

#Cross validation



As before, we can loop through different alphas and pick the one that works best.

In [None]:
train_num = 60
alphas = [1,5,50,500,5000,50000,500000,5000000,50000000,500000000]

score_list = [] #List for saving scores

#Data for training model
y_in = y[:train_num]
x_in = x[:train_num]

#Data for testing model
y_test = y[train_num:]
x_test = x[train_num:]

#Try different alphas
for a in alphas:
    ridge = Ridge(alpha=a)
    scores = cross_val_score(ridge,x_in,y_in,cv=5) #cross-validate on training data only, keeping the test set unseen
    
    score_list.append([a,np.mean(scores)]) #Save scores to list

#Go through saved models and pick the best one based on validation score
best_alpha = None
best_score = -99

for s in score_list:
    a = s[0]
    avg_score = s[1]
    print(str(a).ljust(10), #alpha: left-justified, width=10
          str(round(avg_score,4)).rjust(5)) #average validation score: right-justified, width=5
    
    if avg_score > best_score:
        best_score = avg_score
        best_alpha = a

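#Check model performance with test data
print("Best alpha value:",best_alpha)
best_model = Ridge(alpha=best_alpha)
best_model.fit(x_in,y_in) #refit on the full training set
print("Test R-Squared:",round(best_model.score(x_test,y_test),2))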


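**K-Fold cross-validation** trades training time for data: every observation in the training set gets used for both training and validation, but the model has to be fit $K$ times for each candidate hyperparameter value. This can be a worthwhile tradeoff when data is limited and the model is relatively simple. For models such as neural networks that are time-consuming to train, the simple train-validation-test split is often the only feasible way.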