## Using Validation/Cross-Validation For Model Selection

This notebook demonstrates two typical workflows for using validation data to select models. It also demonstrates the use of some utility methods like generating **polynomial features**, converting **categorical features to "dummy variable"** binary columns, and **scaling features** when applying regularization.

**Notebook Contents**

> 1. Simple preprocessing and dummy variables
> 2. **Basic validation** method: Train/validation/test
> 3. **Rigorous validation** method: Cross-validation/test
> 4. **Making CV less manual** via scikit-learn

## 1. Preprocessing and Dummy Variables

In [3]:
#Data loading: cars data set (using car characteristics to predict the price)
import pandas as pd
import numpy as np

df=pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data',
               header=None)

columns= ['symboling','normalized-losses','make','fuel-type',
          'aspiration','num-of-doors','body-style','drive-wheels',
          'engine-location','wheel-base','length','width','height',
          'curb-weight','engine-type','num-of-cylinders','engine-size',
          'fuel-system','bore','stroke','compression-ratio','horsepower',
          'peak-rpm','city-mpg','highway-mpg','price']
df.columns=columns

In [4]:
df.head(3)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500


We're going to simplify things a bit by focusing on the numeric columns and a single categorical column, make.

In [5]:
#simple cleaning

# replace '?' with NaN, then drop every row that has a missing value
df = df.replace('?', np.nan).dropna().reset_index() # reset_index() keeps the old row labels as a new 'index' column

df['price'] = df['price'].astype(float)  # price was read in as text because of the '?' entries, so convert it to float

cars = df.select_dtypes(exclude=['object']).copy() # .copy() makes a new DF without changing your current one
# .select_dtypes is a VERY USEFUL method to exclude columns with datatypes you don't want

cars['make'] = df['make']
cars.head(3)

Unnamed: 0,index,symboling,wheel-base,length,width,height,curb-weight,engine-size,compression-ratio,city-mpg,highway-mpg,price,make
0,3,2,99.8,176.6,66.2,54.3,2337,109,10.0,24,30,13950.0,audi
1,4,2,99.4,176.6,66.4,54.3,2824,136,8.0,18,22,17450.0,audi
2,6,1,105.8,192.7,71.4,55.7,2844,136,8.5,19,25,17710.0,audi


A machine learning model has no idea how to interpret something like 'audi'; it only understands numbers! A very common trick for addressing this is the use of **"dummy variables"**.

Dummy variables are binary feature columns corresponding to each category. The value is 1 if the observation is in that category, and 0 if not. We can then just use these columns as features in our regression model.

Pandas makes it very easy for us to make this conversion. Notice that the new dummies df has 18 columns, 1 for each category.

In [6]:
cars['make'].nunique()

18

In [7]:
pd.get_dummies(cars['make']).head(5)  # Another way to make dummy variables instead of using patsy

Unnamed: 0,audi,bmw,chevrolet,dodge,honda,jaguar,mazda,mercedes-benz,mitsubishi,nissan,peugot,plymouth,porsche,saab,subaru,toyota,volkswagen,volvo
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [8]:
cars = pd.get_dummies(cars) #can just apply it to the whole df
cars.head(3)

Unnamed: 0,index,symboling,wheel-base,length,width,height,curb-weight,engine-size,compression-ratio,city-mpg,...,make_mitsubishi,make_nissan,make_peugot,make_plymouth,make_porsche,make_saab,make_subaru,make_toyota,make_volkswagen,make_volvo
0,3,2,99.8,176.6,66.2,54.3,2337,109,10.0,24,...,0,0,0,0,0,0,0,0,0,0
1,4,2,99.4,176.6,66.4,54.3,2824,136,8.0,18,...,0,0,0,0,0,0,0,0,0,0
2,6,1,105.8,192.7,71.4,55.7,2844,136,8.5,19,...,0,0,0,0,0,0,0,0,0,0


Now we're ready to start modeling! We're going to try out the validation process to choose between 3 models: simple linear regression, linear regression with ridge regularization, and linear regression with 2nd degree polynomial features.

## 2. Simple Validation Method: Train / Validation / Test

Here we will break the data into 3 portions: 60% for training, 20% for validation (used to select the model), 20% for final testing evaluation.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge #ordinary linear regression + w/ ridge regularization
from sklearn.preprocessing import StandardScaler, PolynomialFeatures  # StandardScaler = scales the data


# Feature selection (filtering out features that are not useful) can be done based on Linear Regression 
# - checking p-values and t-statistics


X, y = cars.drop('price',axis=1), cars['price']  
# Separate the features (every column except price) from the target (price)

# hold out 20% of the data for final testing
X, X_test, y, y_test = train_test_split(X, y, test_size=.2, random_state=10) # X, X_test, y, y_test

# Set random_state so that the split (and therefore your results) is reproducible

---
**Exercise**: using `train_test_split` and random state 3, further partition X, y into datasets X_train, y_train (60% of original) and X_val, y_val (20%).

Hint: you will need to adjust the `test_size` parameter.

---

In [10]:
# YOUR SOLUTION HERE

# X is already data for training (as defined in the code above)
# y is already data for training 
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size=0.25, random_state=3)

# Therefore, this test_size is further splitting it from the already split 80% data for training above.
# So .25 test_size will give you 80 * 1/4 = 20% of the whole data set.
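
The two-step split arithmetic above can be checked on toy data. This is a minimal sketch, assuming a synthetic 100-row array (`X_demo`, `y_demo` are made-up names, not the cars data) so that the row counts read directly as percentages:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows so the counts double as percentages
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Step 1: hold out 20% of everything for the final test
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10)

# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.80 = 0.20)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=3)

split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)  # (60, 20, 20)
```

On data whose size is not a multiple of 20, the proportions will only be approximately 60/20/20 because the row counts are rounded.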

Now we need some model setup.

**When using regularization, we must standardize** the data so that all features are on the same scale (we subtract the mean of each column and divide by the standard deviation, giving us features with mean 0 and std 1). Since this scaling is part of our model, we need to scale using the training set feature distributions and apply the same scaling to validation and test without refitting the scaler.

Also, we need to get **polynomial features** for the poly model.
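
To make the "fit on train, reuse on validation" idea concrete, here is a small sketch on synthetic data. The arrays and the name `demo_scaler` are illustrative assumptions, not part of the cars workflow. It shows that `transform` simply applies the training mean and standard deviation to new rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales (think weight vs. mpg)
X_tr = rng.normal(loc=[2500.0, 25.0], scale=[500.0, 5.0], size=(80, 2))
X_va = rng.normal(loc=[2500.0, 25.0], scale=[500.0, 5.0], size=(20, 2))

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learns mean/std from train only
X_va_scaled = demo_scaler.transform(X_va)      # reuses the train mean/std

# Training columns end up with mean 0 and std 1 (up to floating-point error)
print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))

# The validation transform is just (x - train_mean) / train_std
manual = (X_va - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(manual, X_va_scaled))
```

The validation columns will generally not have mean exactly 0 or std exactly 1, which is expected: they are measured against the training distribution.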

In [11]:
#set up the 3 models we're choosing from:

# 1: Linear Regression Model
# Setting up Linear Regression Model
# We don't do any scaling on our data because LR does not require scaling
lm = LinearRegression()


# 2: Ridge Model
# Feature scaling for train, val, and test data so that we can run our RIDGE model on each
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train.values)  # Only fit transform the first one
X_val_scaled = scaler.transform(X_val.values)  # The rest you just transform
X_test_scaled = scaler.transform(X_test.values)

lm_reg = Ridge(alpha=1)  # Setting up Ridge Model

# 3: Poly Model
#Feature transforms for train, val, and test so that we can run our poly model on each
poly = PolynomialFeatures(degree=2) 

X_train_poly = poly.fit_transform(X_train.values)
X_val_poly = poly.transform(X_val.values)
X_test_poly = poly.transform(X_test.values)

lm_poly = LinearRegression()  # Setting up Poly Model

Now we can train, validate, and test.

In [12]:
X_train.shape

(95, 29)

In [13]:
X_train_scaled.shape

(95, 29)

In [14]:
X_train_poly.shape

(95, 465)
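
Where does 465 come from? A quick sketch of the count, assuming the default `PolynomialFeatures` settings (bias column included, interactions included): 29 input features produce 1 constant column, 29 linear terms, 29 squared terms, and one product for each pair of distinct features.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 29
# 1 bias column + 29 linear terms + 29 squares + C(29, 2) pairwise products
expected = 1 + n_features + n_features + comb(n_features, 2)

demo_poly = PolynomialFeatures(degree=2)
n_out = demo_poly.fit_transform(np.zeros((1, n_features))).shape[1]
print(expected, n_out)  # 465 465
```

Note that 465 columns is far more than the 95 training rows, which is a setting where an unregularized linear model can usually fit the training data almost perfectly and generalize poorly.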

In [15]:
# train model and validate 

lm.fit(X_train, y_train)  # for normal LR, you can choose to scale or not to scale your x.
print(f'Linear Regression val R^2: {lm.score(X_val, y_val):.3f}')

lm_reg.fit(X_train_scaled, y_train)  # for your ridge regression, your practice is to train on SCALED DATA
print(f'Ridge Regression val R^2: {lm_reg.score(X_val_scaled, y_val):.3f}')

lm_poly.fit(X_train_poly, y_train)  # you have to put in your polynomial features.
print(f'Degree 2 polynomial regression val R^2: {lm_poly.score(X_val_poly, y_val):.3f}')


# R^2 can be negative.

Linear Regression val R^2: 0.835
Ridge Regression val R^2: 0.737
Degree 2 polynomial regression val R^2: -5.136


Check out that negative R^2, some severe overfitting! 
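
As a reminder of why R^2 can go below zero: it is defined as `1 - SS_res / SS_tot`, so a model that predicts worse than simply guessing the mean of `y` has `SS_res > SS_tot` and a negative score. A tiny sketch with made-up numbers (not from the cars data):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_mean_pred = np.full_like(y_true, y_true.mean())  # always predict the mean
y_bad_pred = np.array([4.0, 3.0, 2.0, 1.0])        # worse than the mean

r2_mean = r2_score(y_true, y_mean_pred)
r2_bad = r2_score(y_true, y_bad_pred)
print(r2_mean, r2_bad)  # 0.0 -3.0
```

Here `SS_tot` is 5 and the bad predictions have `SS_res` of 20, giving `1 - 20/5 = -3`.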

Having run this validation step, we see that the **evidence points to simple linear regression being the best model.** Our validation process lets us **select** that **choice of model**, and as our final step we retrain it on the entire chunk of train/val data and see how it does on test data:

In [16]:
lm.fit(X,y)
print(f'Linear Regression test R^2: {lm.score(X_test, y_test):.3f}')

Linear Regression test R^2: 0.906


Not terrible!

This is a pretty solid selection method, but we can make it even more rigorous using **cross-validation**.

---
**Exercise**: Return to the beginning of this workflow (train-test split), and **try changing the random state (e.g. to 11)** before stepping through all the code blocks.

What happened to the evaluation results, and how would you explain it? Is the evidence about which model we should select the same, or different? 
       
---

## 3. Rigorous Validation Method: Cross-Validation / Test

Here we will break the data into 2 portions: 80% for a cross-validated training process, and 20% for final testing evaluation. 

Remember that the idea of CV is to make efficient use of the data available to us (using 80% instead of 60% above), while also performing multiple validation checks. For k-fold CV, we come up with k train/validation splits of the whole chunk of data, in such a way that **each observation is in the validation set exactly 1 time**. Here's a helpful diagram:

![](images/cross_validation_diagram.png)
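
The "exactly 1 time" property can be checked directly. A minimal sketch, assuming a toy dataset of 20 rows (the names `demo_kf` and `val_counts` are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

n_obs = 20
demo_kf = KFold(n_splits=5, shuffle=True, random_state=71)
val_counts = np.zeros(n_obs, dtype=int)

for train_ind, val_ind in demo_kf.split(np.zeros((n_obs, 1))):
    # train and validation indices never overlap within a fold
    assert len(np.intersect1d(train_ind, val_ind)) == 0
    val_counts[val_ind] += 1

print(val_counts)  # every entry is 1
```

Each row is used for validation once and for training in the other four folds.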

For simplicity we'll focus on linear regression and ridge regression (we also can feel pretty comfortable throwing out the full degree 2 polynomial regression based on the poor results above!) As we loop through our CV folds, we will train and validate both models and collect the results to compare at the end. Note that we scale the training features within the CV loop.

In [17]:
from sklearn.model_selection import KFold  #KFold generates the train/validation index splits for cross-validation

X, y = cars.drop('price',axis=1), cars['price']

X, X_test, y, y_test = train_test_split(X, y, test_size=.2, random_state=10) #hold out 20% of the data for final testing

#this helps with the way kf will generate indices below
X, y = np.array(X), np.array(y)

In [18]:
#run the Cross-Validation

kf = KFold(n_splits=5, shuffle=True, random_state = 71) # 5 folds; rows are shuffled before being split into folds
cv_lm_r2s, cv_lm_reg_r2s = [], [] # to collect the validation results for both models

for train_ind, val_ind in kf.split(X,y): # kf.split(X, y) yields 5 (train, val) index pairs since n_splits=5
    # loop through each fold and run the whole training code
    
    X_train, y_train = X[train_ind], y[train_ind]
    X_val, y_val = X[val_ind], y[val_ind] 
    
    #fresh models for each fold
    lm = LinearRegression()
    lm_reg = Ridge(alpha=1)

    #simple linear regression
    lm.fit(X_train, y_train)
    cv_lm_r2s.append(lm.score(X_val, y_val))
    
    #ridge with feature scaling
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_val_scaled = scaler.transform(X_val)
    
    
    lm_reg.fit(X_train_scaled, y_train)
    cv_lm_reg_r2s.append(lm_reg.score(X_val_scaled, y_val)) # Appending the score of each model through each kfold loop

print('Simple regression scores: ', cv_lm_r2s)
print('Ridge scores: ', cv_lm_reg_r2s, '\n')

print(f'Simple mean cv r^2: {np.mean(cv_lm_r2s):.3f} +- {np.std(cv_lm_r2s):.3f}')
print(f'Ridge mean cv r^2: {np.mean(cv_lm_reg_r2s):.3f} +- {np.std(cv_lm_reg_r2s):.3f}')

Simple regression scores:  [0.9474039095843965, 0.8724453019893137, 0.41023920794917035, 0.9327397319071149, 0.8358438145266573]
Ridge scores:  [0.9330767176657704, 0.857781878071569, 0.8316030184988591, 0.9084829083232389, 0.8458302814404041] 

Simple mean cv r^2: 0.800 +- 0.199
Ridge mean cv r^2: 0.875 +- 0.039


In [31]:
# We can see that through doing cross_validation, ridge is better. 

The plot thickens! Our simple validation method above pointed to simple linear regression being better than ridge, but k-fold shows the opposite. The ridge model appears to be both better on average and has less varying results.

**Since k-fold is more reliable than a single validation set, we select the ridge regression model**. This shows the dangers of relying on simple validation methods, especially when our sample sizes are small.

Finally, see that we do slightly better on the same test set (0.918 vs. 0.906 for simple linear regression)!

In [19]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(X_test)

lm_reg = Ridge(alpha=1)
lm_reg.fit(X_scaled,y)
print(f'Ridge Regression test R^2: {lm_reg.score(X_test_scaled, y_test):.3f}')

Ridge Regression test R^2: 0.918


---
**Exercise (time permitting)**: Modify the Cross-validation loop above so that we also fit degree 2 polynomial regression models and track their validation scores.

---

## 4. K-fold, in a Less Manual Way with Scikit-learn

The k-fold loop we created above required a chunk of code, but we usually expect sklearn to let us make everything much simpler. It turns out we can in this case too, at the expense of fine-grained control over exactly what happens within the k-fold loop. When we want to scale each training set within the k-fold loop or apply other transformations, the fine-grained control is nice, but often we can keep things simple and use the below.

In [34]:
from sklearn.model_selection import cross_val_score
lm = LinearRegression()

cross_val_score(lm, X, y, # estimator, features, target
                cv=5, # number of folds 
                scoring='r2') # scoring metric

# Returns the validation score for each of the 5 folds

array([ 0.6325312 ,  0.82652407, -2.03345873,  0.85532528, -1.24183066])

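We could also recreate the exact partitioning we used for the manual version by passing a KFold object to `cross_val_score` and using the same random state. Note that the scores below follow the same pattern as the manual linear regression scores above (four strong folds and a weak third fold), although the printed values are not identical.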

In [35]:
kf = KFold(n_splits=5, shuffle=True, random_state = 71)
cross_val_score(lm, X, y, cv=kf, scoring='r2')

array([0.9490019 , 0.8671545 , 0.21940447, 0.93619777, 0.83549919])

And we can quickly add more evidence that the regularized model will generalize better: choosing a new KFold partitioning, we can compare the model scores again and see that the ridge model does better, even without scaling.

In [36]:
kf = KFold(n_splits=5, shuffle=True, random_state = 1000)

print(np.mean(cross_val_score(lm, X, y, cv=kf, scoring='r2')))
print(np.mean(cross_val_score(lm_reg, X, y, cv=kf, scoring='r2')))

0.8092934100669888
0.8841813446192417
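
If we do want the scaling refit inside every fold without writing the loop by hand, one option is to wrap the scaler and the model in a scikit-learn pipeline and pass that to `cross_val_score`. The sketch below is an illustration on synthetic data, not the notebook's original method; `X_demo`, `y_demo`, and `scaled_ridge` are made-up names:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the car features, on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5)) * np.array([1, 10, 100, 1000, 10000])
coefs = np.array([3.0, 0.2, 0.01, 0.001, 0.0001])
y_demo = X_demo @ coefs + rng.normal(size=120)

# The scaler is refit on the training part of each fold automatically
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1))
demo_kf = KFold(n_splits=5, shuffle=True, random_state=1000)

pipe_scores = cross_val_score(scaled_ridge, X_demo, y_demo,
                              cv=demo_kf, scoring='r2')
print(pipe_scores.shape, pipe_scores.mean())
```

Because the pipeline is treated as a single estimator, the validation rows of each fold never influence the scaler's mean and standard deviation.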


---
**Exercise**: Using the `KFold` objects and the `cross_val_score` function above, loop through 5 different random states of your choice and collect mean CV scores for both `lm` and `lm_reg`. Compare the 5 pairs of scores.

Are you convinced yet that regularization is helpful with this feature set?  

---

In [39]:
lm_kfold_result, lm_reg_kfold_result = [], []
for i in [123, 5, 1000, 5, 77]:  # note: seed 5 appears twice, so the 2nd and 4th results below are identical
    kf = KFold(n_splits=5, shuffle=True,random_state= i)
    
    lm_kfold_result.append(np.mean(cross_val_score(lm, X, y, cv=kf, scoring='r2')))
    lm_reg_kfold_result.append(np.mean(cross_val_score(lm_reg, X,y, cv=kf, scoring='r2')))

In [40]:
lm_kfold_result   # Mean CV r2 score for lm, one entry per random state

[0.6649112257538974,
 0.5242685527520594,
 0.8092934100669888,
 0.5242685527520594,
 0.6434025516917388]

In [45]:
lm_reg_kfold_result  # Mean CV r2 score for lm_reg, one entry per random state

[0.8608219776804956,
 0.8378459328655963,
 0.8841813446192417,
 0.8378459328655963,
 0.8492606154032766]

In [43]:
np.mean(lm_kfold_result)

0.6332288586033487

In [44]:
np.mean(lm_reg_kfold_result)

0.8539911606868413