# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Apply 5-fold cross-validation for regression
- Compare the results with ordinary holdout validation

## Let's get started

This time, let's only include the variables that were previously selected using recursive feature elimination. The preprocessing code is provided below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
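# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release.
from sklearn.datasets import load_boston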

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# Min-max scaling: map B and log(DIS) onto the interval [0, 1]
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

# Standardization: give log(LSTAT) zero mean and unit variance
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))

boston_features.head()

|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 0.542096 | 1.0 | 296.0 | 15.3 | 1.0 | -1.27526 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 0.623954 | 2.0 | 242.0 | 17.8 | 1.0 | -0.263711 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 0.623954 | 2.0 | 242.0 | 17.8 | 0.989737 | -1.627858 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 0.707895 | 3.0 | 222.0 | 18.7 | 0.994276 | -2.153192 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 0.707895 | 3.0 | 222.0 | 18.7 | 1.0 | -1.162114 |
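
The two transformations used in the preprocessing cell can be illustrated on a small array. The following is a minimal sketch, assuming hypothetical helper names (`min_max_scale`, `standardize`) and made-up toy values; it is not part of the original lab code.

```python
import numpy as np

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

def standardize(values):
    """Center values at 0 and divide by the population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical toy data; the log transform mirrors what the lab does for DIS and LSTAT
toy = np.array([1.0, 2.0, 4.0, 8.0])
scaled = min_max_scale(np.log(toy))   # evenly spaced between 0 and 1
z = standardize(np.log(toy))          # mean 0, standard deviation 1

print(scaled)
print(z.mean(), z.std())
```

Min-max scaling bounds a feature to a fixed range, while standardization expresses each value as a number of standard deviations from the mean.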


In [2]:
X = boston_features
y = pd.DataFrame(boston.target, columns=['MEDV'])

## Train test split

Perform a train-test split, holding out 20% of the data as the test set (`test_size=0.20`).

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
print(len(X_train), len(X_test), len(y_train), len(y_test))

404 102 404 102
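
To see where the 404/102 sizes come from, a holdout split can be sketched by hand as a shuffle of row indices. This is only an illustration: the helper name `holdout_indices` and the seed are assumptions, and the test count is rounded up, which is consistent with the sizes printed above.

```python
import math
import numpy as np

def holdout_indices(n_samples, test_size, seed=0):
    """Shuffle row indices and split them into train and test index arrays."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = math.ceil(test_size * n_samples)   # round the test count up
    return shuffled[n_test:], shuffled[:n_test]

# 506 is the number of rows in the Boston data used in this lab
train_idx, test_idx = holdout_indices(506, 0.20)
print(len(train_idx), len(test_idx))
```

Every row ends up in exactly one of the two sets, so the model is scored only on data it did not see during fitting.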


Fit the model on the training set, then use it to make predictions for both the training and test sets.

In [4]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)

Calculate the residuals and the mean squared error (MSE) for the training and test sets.

In [5]:
train_residuals = y_hat_train - y_train
test_residuals = y_hat_test - y_test

from sklearn.metrics import mean_squared_error

train_mse = mean_squared_error(y_train, y_hat_train)
test_mse = mean_squared_error(y_test, y_hat_test)
print('Train Mean Squared Error:', train_mse)
print('Test Mean Squared Error:', test_mse)

delta = abs(train_mse - test_mse)
print("delta: {}".format(delta))

Train Mean Squared Error: 16.41786509350068
Test Mean Squared Error: 17.744121372848614
delta: 1.326256279347934
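
The mean squared error is simply the average of the squared residuals. A minimal sketch with made-up numbers (the function name `mse` and the toy values are assumptions for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    """Average of the squared differences between predictions and targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true
    return np.mean(residuals ** 2)

# Hypothetical targets and predictions: residuals are 1, -1 and 2
y_true_demo = [3, 5, 7]
y_pred_demo = [4, 4, 9]
demo_mse = mse(y_true_demo, y_pred_demo)
print(demo_mse)  # (1 + 1 + 4) / 3 = 2.0
```

Because the residuals are squared, their sign does not matter and large errors are penalized more heavily than small ones.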


## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the size of the full dataset is not divisible by k, make the first few folds one larger than the later ones.

We want the folds to be a list of subsets of data!

In [6]:
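def kfolds(data, cols, k):
    """Split data into k consecutive folds and return them as a list of DataFrames."""
    kf = []
    size = len(data)
    d = size // k
    part_size = d
    r = size % k
    # NOTE: when size is not divisible by k, every fold is given d + 1 rows,
    # so the last fold absorbs the shortfall (506 rows with k=5 gives
    # 102, 102, 102, 102, 98) rather than only the first r folds being larger.
    if r > 0:
        part_size += 1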

    # Force data as pandas dataframe
    df = pd.DataFrame(data, columns=cols)

    # partition df
    rbound = 0
    for part in range(k):
        lbound = rbound
        tb = rbound + part_size
        rbound = tb if tb <= size else size
        df_part = df.iloc[lbound:rbound]
        kf.append(df_part)    

    return kf
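
The requirement that only the first few folds be one row larger can be met by computing the fold boundaries with `divmod`. The sketch below is one possible approach, not the lab's solution; `fold_bounds` is a hypothetical helper that returns `(start, stop)` pairs suitable for slicing with `df.iloc[start:stop]`.

```python
def fold_bounds(n_rows, k):
    """Return k (start, stop) pairs; the first n_rows % k folds get one extra row."""
    base, extra = divmod(n_rows, k)
    bounds = []
    start = 0
    for i in range(k):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

# 506 rows split into 5 folds: one fold of 102 rows followed by four of 101
bounds = fold_bounds(506, 5)
sizes = [stop - start for start, stop in bounds]
print(sizes)  # [102, 101, 101, 101, 101]
```

With this scheme the fold sizes never differ by more than one row, and no fold can end up empty as long as `k` does not exceed the number of rows.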

### Apply it to the Boston Housing Data

In [7]:
b2 = load_boston()
kf = kfolds(b2.data, b2.feature_names, 5)
len(kf)

5

In [8]:
# Scale each fold separately, then re-attach the target column
y2 = pd.DataFrame(b2.target, columns=['MEDV'])
for part in range(len(kf)):
    fold = kf[part].copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
    _b = fold["B"]
    _logdis = np.log(fold["DIS"])
    _loglstat = np.log(fold["LSTAT"])
    # Min-max scaling
    fold["B"] = (_b-min(_b))/(max(_b)-min(_b))
    fold["DIS"] = (_logdis-min(_logdis))/(max(_logdis)-min(_logdis))
    # Standardization
    fold["LSTAT"] = (_loglstat-np.mean(_loglstat))/np.sqrt(np.var(_loglstat))
    # The inner join on the index keeps only the target rows belonging to this fold
    kf[part] = pd.concat([fold, y2], axis=1, join='inner')
    print("{}\n".format(kf[part].head()))

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE       DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  0.322397  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  0.484302  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  0.484302  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  0.650328  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  0.650328  3.0  222.0   

   PTRATIO         B     LSTAT  MEDV  
0     15.3  1.000000 -1.246661  24.0  
1     17.8  1.000000 -0.062010  21.6  
2     17.8  0.975228 -1.659597  34.7  
3     18.7  0.986184 -2.274829  33.4  
4     18.7  1.000000 -1.114152  36.2  

        CRIM   ZN  INDUS  CHAS   NOX     RM   AGE       DIS  RAD    TAX  \
102  0.22876  0.0   8.56   0.0  0.52  6.405  85.4  0.409859  5.0  384.0   
103  0.21161  0.0   8.56   0.0  0.52  6.137  87.4  0.409859  5.0  384.0   
104  0.13960  0.0   8.56   0.0  0.52  6.167  90.0  0.344665  5.0  384.0   
105  0.13262  0.0   

### Perform a linear regression for each fold, and calculate the training and test error

For each fold, fit a linear regression model and record the training and test mean squared error.

In [9]:
test_errs = []
train_errs = []

for part in range(len(kf)):
    X_part = kf[part].drop('MEDV', axis=1)
    y_part = kf[part]['MEDV']
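    # Split in train and test for the fold
    # NOTE: this call splits the full X and y rather than X_part and y_part, so each
    # iteration fits on a random 70/30 split of all 506 rows (hence 354/152 below).
    # Pass X_part and y_part here to evaluate within the fold instead.
    X_part_train, X_part_test, y_part_train, y_part_test = train_test_split(X, y, test_size=0.3)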
    print(len(X_part_train), len(X_part_test), len(y_part_train), len(y_part_test))
    # Fit a linear regression model
    linreg.fit(X_part_train, y_part_train)
    y_part_hat_train = linreg.predict(X_part_train)
    y_part_hat_test = linreg.predict(X_part_test)
    train_residuals = y_part_hat_train - y_part_train
    test_residuals = y_part_hat_test - y_part_test
    train_part_mse = mean_squared_error(y_part_train, y_part_hat_train)
    test_part_mse = mean_squared_error(y_part_test, y_part_hat_test)    
    print('Train Mean Squared Error:', train_part_mse)
    print('Test Mean Squared Error:', test_part_mse)
    delta = abs(train_part_mse - test_part_mse)
    print("delta: {}".format(delta))
    train_errs.append(train_part_mse)
    test_errs.append(test_part_mse)

print("\n{}".format(train_errs))
print(test_errs)

354 152 354 152
Train Mean Squared Error: 16.963463823671816
Test Mean Squared Error: 16.4772303953525
delta: 0.4862334283193164
354 152 354 152
Train Mean Squared Error: 18.41079308855939
Test Mean Squared Error: 12.754837474683486
delta: 5.655955613875905
354 152 354 152
Train Mean Squared Error: 16.94553497573843
Test Mean Squared Error: 16.40526571851425
delta: 0.5402692572241783
354 152 354 152
Train Mean Squared Error: 16.445346561802243
Test Mean Squared Error: 17.331336978613255
delta: 0.885990416811012
354 152 354 152
Train Mean Squared Error: 16.835463172480136
Test Mean Squared Error: 16.296163148827787
delta: 0.5393000236523484

[16.963463823671816, 18.41079308855939, 16.94553497573843, 16.445346561802243, 16.835463172480136]
[16.4772303953525, 12.754837474683486, 16.40526571851425, 17.331336978613255, 16.296163148827787]
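
In standard k-fold cross-validation, each fold serves once as the test set while the model is fit on the remaining k-1 folds. The sketch below illustrates that loop under some loud assumptions: it uses synthetic data instead of the Boston data, ordinary least squares via `np.linalg.lstsq` instead of `LinearRegression`, contiguous unshuffled folds, and a hypothetical helper name `kfold_mse`.

```python
import numpy as np

def kfold_mse(X, y, k):
    """Return a list of (train_mse, test_mse) pairs, one per held-out fold."""
    X = np.column_stack([np.ones(len(X)), X])      # add an intercept column
    folds = np.array_split(np.arange(len(X)), k)   # earlier folds get any extra rows
    results = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        train_mse = np.mean((X[train_idx] @ coef - y[train_idx]) ** 2)
        test_mse = np.mean((X[test_idx] @ coef - y[test_idx]) ** 2)
        results.append((train_mse, test_mse))
    return results

# Synthetic regression problem (assumption: not the lab's data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

fold_results = kfold_mse(X_demo, y_demo, k=5)
for train_mse, test_mse in fold_results:
    print(round(train_mse, 4), round(test_mse, 4))
```

Every observation is used for testing exactly once, so the average test MSE over the folds depends less on a single lucky or unlucky split than one holdout estimate does.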


## Cross-Validation using Scikit-Learn

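That was a fair amount of work! Now let's use scikit-learn to perform 5-fold cross-validation and obtain the mean squared error. Look at the five individual MSEs and explain what's going on. Note that the `neg_mean_squared_error` scorer returns the negated MSE (scikit-learn scorers follow a "higher is better" convention), so the value below is negative.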

In [10]:
from sklearn.model_selection import cross_val_score

# scoring='neg_mean_squared_error' yields negated MSEs, so the mean is negative
cv_5_results = np.mean(cross_val_score(linreg, X, y, cv=5, scoring='neg_mean_squared_error'))
cv_5_results

-25.344451926139975
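
To inspect the five individual fold scores rather than only their mean, keep the array returned by `cross_val_score` and flip its sign. This is a minimal sketch on synthetic data from `make_regression` (an assumption, since the Boston loader is unavailable in recent scikit-learn releases); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumption: stands in for the lab's X and y)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# One negated MSE per fold; with an integer cv and a regressor,
# scikit-learn uses consecutive, unshuffled folds by default
neg_mse_per_fold = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
mse_per_fold = -neg_mse_per_fold   # flip the sign to recover ordinary MSEs

print(mse_per_fold)
print(mse_per_fold.mean())
```

Because the default folds are not shuffled, data stored in a meaningful order can produce fold scores that differ noticeably from those of a random split.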

Next, calculate the mean of the MSEs over the 5 folds, and compare it with the result from the single train-test split.

In [11]:
mean_train_mse = np.array(train_errs).mean()
mean_test_mse = np.array(test_errs).mean()

print("abs(train_mse - mean_train_mse) = abs({} - {}) = {}".format(train_mse, mean_train_mse, abs(train_mse - mean_train_mse)))
print("abs(test_mse - mean_test_mse) = abs({} - {}) = {}".format(test_mse, mean_test_mse, abs(test_mse - mean_test_mse)))

abs(train_mse - mean_train_mse) = abs(16.41786509350068 - 17.1201203244504) = 0.7022552309497208
abs(test_mse - mean_test_mse) = abs(17.744121372848614 - 15.852966743198255) = 1.8911546296503587
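
Besides the mean, the spread of the per-fold errors indicates how much the estimate depends on the particular split. A short sketch, reusing the five test MSEs printed earlier in this lab (the variable names here are illustrative):

```python
import numpy as np

# Test MSEs copied from the per-fold output above
fold_test_errs = [16.4772303953525, 12.754837474683486, 16.40526571851425,
                  17.331336978613255, 16.296163148827787]

mean_fold_mse = np.mean(fold_test_errs)
std_fold_mse = np.std(fold_test_errs)   # population standard deviation across splits

print(mean_fold_mse, std_fold_mse)
```

A large standard deviation relative to the mean suggests that a single holdout estimate, such as the test MSE of 17.74 obtained earlier, could easily have come out noticeably higher or lower with a different split.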


##  Summary 

Congratulations! You have now practiced k-fold cross-validation!