# Cross Validation

Version: 2022-10-12

In the previous notebook, we see that regularization is crucial to training a good model. The strength of regularization is controlled by a **hyperparameter** ```alpha```. How should we pick such hyperparameters?

In [1]:
import pandas as pd # organize data
from sklearn.linear_model import LinearRegression, Ridge, Lasso # regressions
from sklearn.preprocessing import StandardScaler # Standardize data
from sklearn.pipeline import Pipeline # Pipeline
import numpy as np # calculate mean and standard deviation

In [2]:
# Load data. 
auto = pd.read_csv("../Data/auto.csv")

# Check data
auto.head()

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
0,AMC Concord,4099,22,3.0,2.5,11,2930,186,40,121,3.58,Domestic
1,AMC Pacer,4749,17,3.0,3.0,11,3350,173,40,258,2.53,Domestic
2,AMC Spirit,3799,22,,3.0,12,2640,168,35,121,3.08,Domestic
3,Buick Century,4816,20,3.0,4.5,16,3250,196,40,196,2.93,Domestic
4,Buick Electra,7827,15,4.0,4.0,20,4080,222,43,350,2.41,Domestic


### A. A simple out-of-sample test
Let's start with a simple out-of-sample test: we will divide our data into two parts, one for training the model and the other for testing the model's out-of-sample performance. The former is commonly called **training set** while the latter **test set** or **holdout set**.

In [3]:
# Pick variables
y = auto["price"]
x = auto[["mpg","weight","headroom","displacement"]]

# Use about 3/4 of data for training: 60 samples
train_num = 60

# In-sample data for training model
y_in = y[:train_num]
x_in = x[:train_num]

# Out-of-sample data for testing model
y_out = y[train_num:]
x_out = x[train_num:]

# Train Ridge model and show R-Squared values
scaler = StandardScaler()
lasso = Lasso(alpha=50)
model = Pipeline(steps=[("scaler", scaler),
                       ("lasso", lasso)])
model.fit(x_in,y_in)
print("in-sample R-squared:",model.score(x_in,y_in))
print("Out-of-sample R-squared:",model.score(x_out,y_out))

in-sample R-squared: 0.39042888813645216
Out-of-sample R-squared: -0.24575377887631067


If we enclose the model creation and training process in a loop, 
we can easily try different alpha values:

In [11]:
# Alphas to go through
alphas = [1,5,10,50,100,500,1000,5000]

# Loop through alphas
for a in alphas:
    scaler = StandardScaler()
    lasso = Lasso(alpha=a)
    model = Pipeline(steps=[("scaler", scaler),
                           ("lasso", lasso)])
    model.fit(x_in,y_in)
    
    in_score = model.score(x_in,y_in)
    out_score = model.score(x_out,y_out)
    
    print(a)
    print("in-sample R-squared:",in_score)
    print("Out-of-sample R-squared:",out_score)

1
in-sample R-squared: 0.3203323500131189
Out-of-sample R-squared: 0.26184014397537436
5
in-sample R-squared: 0.3203204009322512
Out-of-sample R-squared: 0.26134629617267846
10
in-sample R-squared: 0.32028314641925504
Out-of-sample R-squared: 0.2606944031523071
50
in-sample R-squared: 0.3190920642530575
Out-of-sample R-squared: 0.25502820863505204
100
in-sample R-squared: 0.31537073839155483
Out-of-sample R-squared: 0.24681120867245976
500
in-sample R-squared: 0.2546073403473279
Out-of-sample R-squared: 0.17824118318444848
1000
in-sample R-squared: 0.14968383869313473
Out-of-sample R-squared: 0.0808903589680322
5000
in-sample R-squared: 0.0
Out-of-sample R-squared: -0.02765171701126068


Since we will be fitting models repeatedly, let us place the code above in a helper function:

In [6]:
# Function for fitting models and printing results
def fit_models(data,alphas=[1]):
    
    y_in,y_out,x_in,x_out = data
    
    for a in alphas:
        scaler = StandardScaler()
        lasso = Lasso(alpha=a)
        model = Pipeline(steps=[("scaler", scaler),
                               ("lasso", lasso)])
        model.fit(x_in,y_in)
        
        in_score = model.score(x_in,y_in)
        out_score = model.score(x_out,y_out)
        
        print(str(a).ljust(10), #left-justified, width=10
              str(round(in_score,2)).ljust(8), #left justified, width=5
              str(round(out_score,2)).rjust(5)) #right justified, width=8

This is how the function works:

In [7]:
# Alphas to go through
alphas = [1,5,10,50,100,500,1000,5000]

fit_models([y_in,y_out,x_in,x_out],alphas)

1          0.39      -0.3
5          0.39     -0.29
10         0.39     -0.29
50         0.39     -0.25
100        0.39      -0.2
500        0.32      0.02
1000       0.23      0.11
5000       0.0       -0.0


While regularization helps us get more consistent performance, our model simply isn't really good. What could be the problem?

### B. Shuffling data

If the data is sorted, splitting the data sequentially would give us unrepresentative sets of data. To deal with that, we can shuffle our data before splitting it up.

In [8]:
# Import function for shuffling
from sklearn.utils import shuffle 

# Shuffle observations
y,x = shuffle(auto["price"],auto[["mpg","weight","headroom","displacement"]],random_state=1234)

# Use about 3/4 of data for training: 60 samples
train_num = 60

# In-sample data for training model
y_in = y[:train_num]
x_in = x[:train_num]

# Out-of-sample data for testing model
y_out = y[train_num:]
x_out = x[train_num:]

# Train Ridge model with different alphas and show R-Squared values
fit_models([y_in,y_out,x_in,x_out],alphas)

1          0.3       0.46
5          0.3       0.46
10         0.3       0.46
50         0.3       0.46
100        0.29      0.46
500        0.24      0.36
1000       0.16      0.22
5000       0.0      -0.07


Should you shuffle your data before splitting it? It depends. For cross-section data it is probably a good idea, but for time-series data it would be a bad idea, since you are introducing *hindsight bias* if you can train with data that is generated after some of your test data.

### C. train_test_split

In practice, you will probably use sckit-learn's ```train_test_split``` method to split the data. ```train_test_split``` shuffles the data by default, so there is no need to call ```shuffle``` separately. The default is a 75/25 split, which you can change by providing a different ```train_size``` or ```test_size```.

In [9]:
from sklearn.model_selection import train_test_split

# Splitting the data
y_in,y_out,x_in,x_out = train_test_split(auto["price"],
                                         auto[["mpg","weight","headroom","displacement"]],
                                         train_size=0.8)

# Train Ridge model with different alphas and show R-Squared values
fit_models([y_in,y_out,x_in,x_out],alphas)

1          0.37      0.06
5          0.37      0.06
10         0.37      0.06
50         0.37      0.08
100        0.37       0.1
500        0.28      0.13
1000       0.18      0.07
5000       0.0      -0.04


### D. Validation

So we try out different values of ```alpha``` and pick the one that give us the highest out-of-sample score. Do so is actually problematic: since ```alpha``` is a parameter of our model, we are effectively training our model with the supposingly out-of-sample data, which means the test set no longer gives us truly out-of-sample results. In particular, there is a real chance of overfitting our model to the test set via ```alpha```.


The correct approach is to split the data into three parts: besides the train set and test set, we have an additional **validation set** for picking the model's hyperparameters. It is common to use around 60% of the data for training and 20% each for validation and test.

In [12]:
# 64% for training, 16% for validation and 20% for out-of-sample test
y_in,y_out,x_in,x_out = train_test_split(auto["price"],
                                         auto[["mpg","weight","headroom","displacement"]],
                                         test_size=0.2)
data_in = train_test_split(y_in,
                           x_in,
                           train_size=0.8)

# Train Ridge model with different alphas and show R-Squared values
fit_models(data_in,alphas)

1          0.31      0.02
5          0.31      0.03
10         0.31      0.04
50         0.3       0.14
100        0.28      0.22
500        0.18      0.21
1000       0.08      0.09
5000       0.0      -0.03


After picking the best alpha based on validation data, the final step is to test the model's out-of-sample performance with the test set.

In [13]:
a = 100
scaler = StandardScaler()
lasso = Lasso(alpha=a)
model = Pipeline(steps=[("scaler", scaler),
                       ("lasso", lasso)])
model.fit(x_in,y_in)
print(str(a).ljust(10), 
      str(round(model.score(x_in,y_in),2)).ljust(8), 
      str(round(model.score(x_out,y_out),2)).rjust(5)) 

100        0.27      0.35


We can automate the process of picking alpha. 
To do so, we need to modify the loop to keep track of the best model.

In [14]:
def find_best_models(data,alphas=[1]):
    
    y_train,y_valid,x_train,x_valid = data

    # Loop through alphas and update the best model if needed
    best_model = None
    best_score = -99

    for a in alphas:
        scaler = StandardScaler()
        lasso = Lasso(alpha=a)
        model = Pipeline(steps=[("scaler", scaler),
                               ("lasso", lasso)])
        model.fit(x_train,y_train)

        training_score = model.score(x_train,y_train)
        valid_score = model.score(x_valid,y_valid)
        print(str(a).ljust(10), 
              str(round(training_score,2)).ljust(8), 
              str(round(valid_score,2)).rjust(5)) 

        if valid_score > best_score:
            best_score = valid_score
            best_model = model

    # Check model performance with test data
    print("Best alpha value:",best_model["lasso"].alpha)
    print("Test R-Squared:",round(best_model.score(x_out,y_out),2))

In [15]:
# Try out the model
find_best_models(data_in,alphas=alphas)

1          0.31      0.02
5          0.31      0.03
10         0.31      0.04
50         0.3       0.14
100        0.28      0.22
500        0.18      0.21
1000       0.08      0.09
5000       0.0      -0.03
Best alpha value: 100
Test R-Squared: 0.36


### E. K-Fold Cross Validation

A problem with dividing the data into three parts is that we are using a lot less data for training. **K-Fold Cross Validation** is a method to overcome that problem: instead of having a separate validation set, we divide our training set into $K$ equal parts. We use $K-1$ parts for training and validate with the remaining part. This process can be repeated for $K$ times, each time using a different part for validation. We then take the average score from these $K$ runs to pick our hyperparameters.

<img src="../Images/cross_validation.png" width="80%">
Source: <a href="https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6">
Towards Data Science</a>

In [16]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model,x_in,y_in,cv=5)

scores 

array([ 0.01982245,  0.03936871,  0.17304933,  0.07723184, -0.25782771])

As before, we can loop through different alphas and pick the one that works best.

In [17]:
train_num = 60
alphas = [1,5,10,50,100,500,1000,5000]

score_list = [] #List for saving scores

# Splitting the data
y_in,y_out,x_in,x_out = train_test_split(auto["price"],
                                         auto[["mpg","weight","headroom","displacement"]],
                                         train_size=0.8)

# Loop through different alphas
best_alpha = None
best_score = -99

for a in alphas:
    scaler = StandardScaler()
    lasso = Lasso(alpha=a)
    model = Pipeline(steps=[("scaler", scaler),
                           ("lasso", lasso)])
    scores = cross_val_score(model,x,y,cv=5)
    avg_score = np.mean(scores)
    print(str(a).ljust(10),
          str(round(avg_score,4)).rjust(5))
    
    if avg_score > best_score:
        best_score = avg_score
        best_alpha = a

# Check model performance with test data
scaler = StandardScaler()
lasso = Lasso(alpha=best_alpha)
best_model = Pipeline(steps=[("scaler", scaler),
                       ("lasso", lasso)])
best_model.fit(x_in,y_in)
print("Best alpha value:",best_alpha)
print("Test R-Squared:",round(best_model.score(x_out,y_out),2))

1          0.0703
5          0.073
10         0.0764
50         0.0939
100        0.1027
500        0.103
1000       0.025
5000       -0.1568
Best alpha value: 500
Test R-Squared: 0.23


K-Fold cross-validation trades data with training time. Having a high number of folds might be worthwhile when data is limited and the model is relatively simple. For models such neural networks that are time-consuming to train, the number of folds will have to be low, perhaps to the point that only the simple train-validation-test split is feasible.

### F. GridSearchCV

In practice, you should use either scikit-learn's `GridSearchCV` or `RandomizedSearchCV` instead of writing your own loop. This is particularly true if the model has multiple hyperparameters to tune.

In [18]:
# GridSearchCV
from sklearn.model_selection import GridSearchCV

# Use a dictionary to specify the parameters we need to go through
parameters = {'lasso__alpha':[1,5,10,50,100,500,1000,5000]}
scaler = StandardScaler()
lasso = Lasso()
model = Pipeline(steps=[("scaler", scaler),
                        ("lasso", lasso)])
gscv = GridSearchCV(model,parameters,cv=5)
gscv.fit(x_in, y_in)

The best-performing hyperparameter(s) and the best score are recorded in ```best_params_``` and ```best_score_``` respectively:

In [19]:
# Best parameter(s)
gscv.best_params_

{'lasso__alpha': 1000}

In [20]:
# Best score
gscv.best_score_

-0.41717993122383346

The `GridSearchCV` object can be used just like any other `scikit-learn` models.
It will use the best model it has found:

In [21]:
gscv.score(x_out,y_out)

0.1546671956360689

You can get the best model directly with ```best_estimator_```:

In [22]:
best_model = gscv.best_estimator_
best_model.score(x_out,y_out)

0.1546671956360689

On powerful computers with many CPU cores, you can speed up the
search by setting `n_jobs` to a number bigger than one. 
This will parallelize the search by the number you specify.
Because parameter search is perfectly parallel, you will see
speedup proportional to `n_jobs`, 
as long as you are not maxing out all available CPU cores.

In [None]:
# Parallel search
gscv = GridSearchCV(model,parameters,cv=5,n_jobs=4)

### Hyperopt
Automatically tries many different models and parameters.

In [24]:
from hpsklearn import HyperoptEstimator,any_preprocessing,any_regressor
from hyperopt import tpe

model = HyperoptEstimator(regressor=any_regressor("my_rego"),
                          preprocessing=any_preprocessing("my_pre"),
                          algo=tpe.suggest,
                          max_evals=20)
model.fit(x_in, y_in)
print(model.score(x_out, y_out))
print(model.best_model())

100%|██████████| 1/1 [00:00<00:00, 12.98trial/s, best loss: 0.5159810118802733]
100%|██████████| 2/2 [00:00<00:00, 15.07trial/s, best loss: 0.4510744726606646]
100%|██████████| 3/3 [00:00<00:00, 10.74trial/s, best loss: 0.4510744726606646]
100%|██████████| 4/4 [00:00<00:00, 19.17trial/s, best loss: 0.4510744726606646]
100%|██████████| 5/5 [00:00<00:00,  2.74trial/s, best loss: 0.4510744726606646]
100%|██████████| 6/6 [00:00<00:00, 12.78trial/s, best loss: 0.4510744726606646]
100%|██████████| 7/7 [00:00<00:00, 18.52trial/s, best loss: 0.4510744726606646]
100%|██████████| 8/8 [00:00<00:00,  1.95trial/s, best loss: 0.32721780663228883]
100%|██████████| 9/9 [00:01<00:00,  1.79s/trial, best loss: 0.32721780663228883]
100%|██████████| 10/10 [00:00<00:00,  5.97trial/s, best loss: 0.32721780663228883]
100%|██████████| 11/11 [00:00<00:00, 17.89trial/s, best loss: 0.32721780663228883]
100%|██████████| 12/12 [00:00<00:00,  2.00trial/s, best loss: 0.32721780663228883]
100%|██████████| 13/13 [00:00




100%|██████████| 20/20 [00:00<00:00, 15.06trial/s, best loss: 0.32721780663228883]
0.28454360626094366
{'learner': AdaBoostRegressor(learning_rate=0.02926615872179483, loss='exponential',
                  n_estimators=497, random_state=4), 'preprocs': (Normalizer(),), 'ex_preprocs': ()}


