# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Perform cross-validation on a model to estimate its performance on unseen data
- Compare training and testing errors to determine whether a model is overfitting or underfitting

## Let's get started

The code to preprocess the data is included below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']

ames_cont = ames[continuous]

# log features
log_names = [f'{column}_log' for column in ames_cont.columns]

ames_log = np.log(ames_cont)
ames_log.columns = log_names

# normalize (subtract mean and divide by std)

def normalize(feature):
    return (feature - feature.mean()) / feature.std()

ames_log_norm = ames_log.apply(normalize)

# one hot encode categoricals
ames_ohe = pd.get_dummies(ames[categoricals], prefix=categoricals, drop_first=True)

preprocessed = pd.concat([ames_log_norm, ames_ohe], axis=1)

X = preprocessed.drop('SalePrice_log', axis=1)
y = preprocessed['SalePrice_log']
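
As a quick sanity check on the log-and-normalize step, the sketch below applies the same `normalize()` function to a small made-up DataFrame and confirms that each transformed column ends up with mean 0 and standard deviation 1. The `toy` data and its values are assumptions for illustration only and are not read from `ames.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the positive-valued Ames columns
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0, 14260.0],
    "SalePrice": [208500.0, 181500.0, 223500.0, 140000.0, 250000.0],
})

def normalize(feature):
    # Same z-score transform as in the lab: subtract mean, divide by std
    return (feature - feature.mean()) / feature.std()

# Log-transform, rename the columns, then normalize each column
toy_log = np.log(toy)
toy_log.columns = [f"{column}_log" for column in toy.columns]
toy_log_norm = toy_log.apply(normalize)

# Each column should now have mean ~0 and standard deviation ~1
print(toy_log_norm.mean().round(6))
print(toy_log_norm.std().round(6))
```

Because the target is also log-transformed and normalized, the errors reported later in this lab are in standardized log units, not dollars.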

### Train-test split

Perform a train-test split with a test set of 20%.

In [2]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [3]:
# Split the data into training and test sets (assign 20% to test set)
linreg = LinearRegression()
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=.2)

In [5]:
# A brief preview of train-test split
print(len(Xtrain), len(Xtest), len(ytrain), len(ytest))

1168 292 1168 292
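
The split above is random, so the test error you obtain will differ from run to run. A minimal sketch of how a fixed seed makes the split repeatable, assuming made-up arrays `X_demo` and `y_demo` with the same number of rows as the lab's data and an arbitrary seed of 42:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same number of rows as the lab's dataset
X_demo = np.arange(1460 * 2).reshape(1460, 2)
y_demo = np.arange(1460)

# random_state=42 is an arbitrary seed chosen for this sketch
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The same seed selects the same rows for the test set
print(np.array_equal(split_a[1], split_b[1]))
print(len(split_a[0]), len(split_a[1]))
```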


### Fit the model

Fit a linear regression model on the training set, then use it to make predictions on both the training and test sets.

In [7]:
linreg.fit(Xtrain, ytrain)
yhat_train = linreg.predict(Xtrain)
yhat_test = linreg.predict(Xtest)

### Residuals and MSE

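Calculate the residuals and the mean squared error on the test set.

In [8]:
from sklearn.metrics import mean_squared_error

# Residuals: observed minus predicted values on the test set
test_residuals = ytest - yhat_test
# scikit-learn's convention is y_true first, then y_pred
mean_squared_error(ytest, yhat_test)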

0.21523725366122143
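
To make the link between residuals and the mean squared error explicit, here is a small sketch that computes the MSE by hand and compares it with `mean_squared_error()`. The arrays `y_true` and `y_pred` are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted values
y_true = np.array([0.56, 0.21, 0.73, -0.44, 1.01])
y_pred = np.array([0.40, 0.35, 0.60, -0.10, 0.90])

# Residuals are observed minus predicted; the MSE is the mean of their squares
residuals = y_true - y_pred
mse_by_hand = np.mean(residuals ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(residuals)
print(mse_by_hand, mse_sklearn)
```

Since the residuals are squared, swapping the two arguments gives the same MSE, but keeping the `y_true, y_pred` order is a good habit because other metrics are not symmetric.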

## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds()` that splits a dataset into k evenly sized pieces. If the size of the full dataset is not divisible by k, make the first few folds one row larger than the later ones.

We want the folds to be a list of subsets of data!

In [25]:
import numpy as np

def kfolds(data, k):
    data = pd.DataFrame(data)
    # Every fold gets fold_size rows; the first `leftovers` folds get one extra row
    fold_size, leftovers = divmod(len(data), k)
    folds = []
    start = 0
    for i in range(k):
        end = start + fold_size + (1 if i < leftovers else 0)
        folds.append(data.iloc[start:end, :])
        start = end
    return folds
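
For comparison, here is a sketch of one alternative way to meet the same requirement with NumPy. `np.array_split` gives the first `len(data) % k` chunks one extra element, which is exactly the "first few folds one larger" rule. The function name `kfolds_array_split` and the `demo` DataFrame are hypothetical and not part of the lab's solution.

```python
import numpy as np
import pandas as pd

def kfolds_array_split(data, k):
    # Hypothetical alternative to kfolds(); not part of the lab's solution
    data = pd.DataFrame(data)
    # Split the row positions; the first len(data) % k chunks get one extra row
    positions = np.array_split(np.arange(len(data)), k)
    return [data.iloc[chunk, :] for chunk in positions]

# 10 rows into 4 folds: two folds of 3 rows followed by two folds of 2 rows
demo = pd.DataFrame({"x": range(10)})
demo_folds = kfolds_array_split(demo, 4)
print([len(fold) for fold in demo_folds])  # [3, 3, 2, 2]
```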

### Apply it to the Ames Housing data

In [26]:
# Make sure to concatenate the data again
ames_data = pd.concat([X, y], axis=1)

In [59]:
# Apply kfolds() to ames_data with 5 folds
folds = kfolds(ames_data, 5)

### Perform a linear regression for each fold and calculate the training and test error

For each fold, hold that fold out as the test set, fit a linear regression on the remaining folds, and calculate the training and test errors:

In [79]:
ames_data.drop(folds[1].index)

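| | LotArea_log | 1stFlrSF_log | GrLivArea_log | BldgType_2fmCon | BldgType_Duplex | BldgType_Twnhs | BldgType_TwnhsE | KitchenQual_Fa | KitchenQual_Gd | KitchenQual_TA | ... | Neighborhood_NridgHt | Neighborhood_OldTown | Neighborhood_SWISU | Neighborhood_Sawyer | Neighborhood_SawyerW | Neighborhood_Somerst | Neighborhood_StoneBr | Neighborhood_Timber | Neighborhood_Veenker | SalePrice_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.133185 | -0.803295 | 0.529078 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.559876 |
| 1 | 0.113403 | 0.418442 | -0.381715 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.212692 |
| 2 | 0.419917 | -0.576363 | 0.659449 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.733795 |
| 3 | 0.103311 | -0.439137 | 0.541326 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.437232 |
| 4 | 0.878108 | 0.112229 | 1.281751 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.014303 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | -0.259100 | -0.465447 | 0.416538 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.121392 |
| 1456 | 0.725171 | 1.980456 | 1.106213 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.577822 |
| 1457 | -0.002324 | 0.228260 | 1.469438 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.174306 |
| 1458 | 0.136814 | -0.077546 | -0.854179 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.399519 |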


In [82]:
from sklearn.metrics import mean_squared_error

test_errs = []
train_errs = []
k = 5

for n in range(k):
    # Hold out fold n as the test set and train on the remaining folds
    test = folds[n]
    train = ames_data.drop(test.index)

    # Separate the predictors from the target
    ytest = test[['SalePrice_log']]
    Xtest = test.drop('SalePrice_log', axis=1)
    ytrain = train[['SalePrice_log']]
    Xtrain = train.drop('SalePrice_log', axis=1)

    # Fit a linear regression model and predict on both sets
    linreg.fit(Xtrain, ytrain)
    yhat_test = linreg.predict(Xtest)
    yhat_train = linreg.predict(Xtrain)

    # Evaluate train and test errors (y_true first, then y_pred)
    test_errs.append(mean_squared_error(ytest, yhat_test))
    train_errs.append(mean_squared_error(ytrain, yhat_train))

print(train_errs)
print(test_errs)

[0.1717050965146466, 0.15507935685930538, 0.156599463262233, 0.16134557666308721, 0.15165048553131677]
[0.12431546148437427, 0.19350064631313132, 0.1891053043131118, 0.17079325250026914, 0.20742704588916955]


In [86]:
np.mean(test_errs)

0.17702834210001123
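
To compare training and test errors fold by fold, the sketch below reuses the per-fold values printed above and reports the gap for each fold along with the two means. The formatting and the `gap` column are illustrative choices, not part of the lab.

```python
import numpy as np

# Per-fold errors copied from the printed output of the loop above
train_errs = [0.1717050965146466, 0.15507935685930538, 0.156599463262233,
              0.16134557666308721, 0.15165048553131677]
test_errs = [0.12431546148437427, 0.19350064631313132, 0.1891053043131118,
             0.17079325250026914, 0.20742704588916955]

# A positive gap means the model did worse on the held-out fold than on its training data
for fold, (train_mse, test_mse) in enumerate(zip(train_errs, test_errs)):
    gap = test_mse - train_mse
    print(f"fold {fold}: train={train_mse:.4f}  test={test_mse:.4f}  gap={gap:+.4f}")

print(f"mean train MSE: {np.mean(train_errs):.4f}")
print(f"mean test MSE:  {np.mean(test_errs):.4f}")
```

In four of the five folds the test error is higher than the training error, and the mean test MSE (about 0.177) sits somewhat above the mean training MSE (about 0.159). A gap of this size is consistent with, at most, mild overfitting; a much larger gap would be a stronger warning sign.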

## Cross-Validation using Scikit-Learn

This was a bit of work! Now, let's use scikit-learn to perform 5-fold cross-validation and get the mean squared error for each fold. Have a look at the five individual MSEs stored in `cv_5_results` and explain what's going on.

In [84]:
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import cross_val_score

mse = make_scorer(mean_squared_error)
cv_5_results = cross_val_score(linreg, X, y, cv=5, scoring=mse)
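
A short sketch of what `cv=5` does for a regressor, using synthetic data from `make_regression` as a stand-in for the Ames features (the data, `model`, and the variable names are assumptions for illustration). Passing an integer uses `KFold` without shuffling, so the folds are consecutive blocks of rows, just like the hand-written `kfolds()` approach. The sketch also shows the built-in `"neg_mean_squared_error"` scorer, which returns the same values with the sign flipped.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; the lab itself uses X and y from the Ames dataset
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()
mse = make_scorer(mean_squared_error)

# An integer cv and an explicit, unshuffled KFold produce the same folds
mse_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring=mse)
kfold_scores = cross_val_score(model, X_demo, y_demo, cv=KFold(n_splits=5), scoring=mse)

# The built-in scorer reports negated MSE so that higher is always better
neg_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")

print(np.allclose(mse_scores, kfold_scores))
print(np.allclose(mse_scores, -neg_scores))
```

The unshuffled, consecutive folds are why the scikit-learn mean in this lab matches the from-scratch result. If the rows of a dataset are ordered in a meaningful way, consider `KFold(n_splits=5, shuffle=True, random_state=...)` instead.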

Next, calculate the mean of the MSEs over the 5 cross-validation folds, and compare it with the result from the single train-test split.

In [85]:
cv_5_results.mean()

0.17702834210001123
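
`cross_val_score()` returns only the held-out scores. If you also want the training errors, as in the from-scratch loop, one option is `cross_validate()` with `return_train_score=True`. The sketch below uses synthetic data from `make_regression` as a stand-in for the Ames features; the data and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; the lab itself uses X and y from the Ames dataset
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

results = cross_validate(
    LinearRegression(), X_demo, y_demo,
    cv=5, scoring="neg_mean_squared_error", return_train_score=True,
)

# Scores are negated MSEs, so flip the sign to get errors back
train_mse = -results["train_score"]
test_mse = -results["test_score"]

print(train_mse.mean(), test_mse.mean())
```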

##  Summary 

Congratulations! You are now familiar with cross-validation and know how to use `cross_val_score()`. Remember that cross-validation gives a more robust estimate of model performance than a single train-test split, so use it whenever possible!