# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Apply 5-fold cross-validation for regression
- Compare the results with ordinary holdout validation

## Let's get started

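This time, let's only include the variables that were previously selected using recursive feature elimination. The preprocessing code is included below. Note that `load_boston` was removed in scikit-learn 1.2, so this cell requires an older version of the library.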

In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_boston

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# Min-max scaling (B as is, DIS after the log transform)
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

# Standardization (LSTAT after the log transform)
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
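
The two transformations used above can be checked on a small made-up array. This is a minimal sketch: the values in `toy` are invented for illustration and are not taken from the Boston data.

```python
import numpy as np

# Hypothetical toy values, not from the Boston data
toy = np.array([1.0, 2.0, 4.0, 8.0])
log_toy = np.log(toy)

# Min-max scaling maps the smallest value to 0 and the largest to 1
minmax = (log_toy - log_toy.min()) / (log_toy.max() - log_toy.min())

# Standardization subtracts the mean and divides by the standard deviation;
# np.sqrt(np.var(x)) is the same quantity as np.std(x) (both use ddof=0)
standardized = (log_toy - log_toy.mean()) / np.sqrt(np.var(log_toy))

print(minmax)
print(standardized.mean(), standardized.std())
```

After min-max scaling the values lie in [0, 1]; after standardization they have mean 0 and standard deviation 1, up to floating-point error.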

In [51]:
X = boston_features[['CHAS', 'RM', 'DIS', 'B', 'LSTAT']]
y = pd.DataFrame(boston.target)

## Train-test split

Perform a train-test split that holds out 20% of the data as the test set.

In [52]:
from sklearn.model_selection import train_test_split

In [53]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20)
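
Because `train_test_split` shuffles the rows before splitting, the test MSE changes from run to run unless the shuffle is seeded. The following is a minimal sketch on synthetic data; the arrays and the seed value 42 are assumptions for illustration, while `random_state` is the standard scikit-learn parameter for seeding.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target (hypothetical data)
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

# Fixing random_state makes the 80/20 split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

Two calls with the same seed return identical splits, which makes a reported holdout error repeatable.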

Fit the model and use it to make predictions on the training and test sets.

In [54]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings("ignore")

reg = LinearRegression()
reg.fit(X_train, y_train)

y_hat_train = reg.predict(X_train)
y_hat_test = reg.predict(X_test)

Calculate the residuals and the mean squared error

In [55]:
import numpy as np

X_train_residuals = y_train - y_hat_train
X_test_residuals = y_test - y_hat_test
X_train_MSE = mean_squared_error(y_train, y_hat_train)
X_test_MSE = mean_squared_error(y_test, y_hat_test)

In [56]:
X_test_MSE

24.731064107766134
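
The mean squared error is the average of the squared residuals, so it can also be computed directly from them. A small sketch with made-up numbers (both arrays are hypothetical):

```python
import numpy as np

# Hypothetical observed values and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

residuals = y_true - y_pred        # observed minus predicted
mse = np.mean(residuals ** 2)      # average of the squared residuals

print(residuals)
print(mse)
```

This is the same quantity that `mean_squared_error(y_true, y_pred)` returns.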

## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the size of the full dataset is not divisible by k, make the first few folds one larger than the later ones.

We want the folds to be a list of subsets of data!

In [57]:
def kfolds(data, k):
    # Force data into a pandas DataFrame
    # Note: `k = k+1` below increases the number of folds, not the fold size
    df = pd.DataFrame(data)
    d_lst = []
    samp = 0
    count = 0
    k  = k+1
    last = 0
    if len(data) % k == 0:
        while count <= k+1:
            d_lst.append(df.iloc[lambda x: (x.index >= samp) & (x.index <= (samp+int(len(data)/k)-1))])
            samp += int(len(data)/k)
            count +=1
    else:
        while count <= (k - (len(data) % k) + 1):
            d_lst.append(df.iloc[lambda x: (x.index >= samp) & (x.index <= samp+int(len(data)/k)-1)])
            samp += int(len(data)//k)
            count +=1
            last = samp+int(len(data)/k)-1
        d_lst.append(df.iloc[lambda x: (x.index >  last) & (x.index <= int(len(data))-1)])
    return d_lst #folds should be a list of subsets of data
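
Traced by hand on the 506-row Boston data, the function above appears to return more pieces than requested: because of `k = k+1`, a call with `k=5` yields six 84-row folds plus an empty one, and rows 504 and 505 land in no fold. The following is a minimal sketch of a splitter that follows the stated requirement (the first few folds one row larger). The name `kfolds_sketch` is hypothetical, and the sketch assumes consecutive, unshuffled folds.

```python
import pandas as pd

def kfolds_sketch(data, k):
    """Split data into k consecutive folds.

    The first len(data) % k folds receive one extra row.
    """
    df = pd.DataFrame(data)
    base, extra = divmod(len(df), k)
    folds = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        # Positional slicing, so it works for any index
        folds.append(df.iloc[start:start + size])
        start += size
    return folds

# Toy example: 11 rows into 3 folds
toy_df = pd.DataFrame({"x": range(11)})
demo_folds = kfolds_sketch(toy_df, k=3)
print([len(f) for f in demo_folds])
```

Slicing with `iloc[start:stop]` is positional, so the sketch does not depend on the DataFrame having a default integer index.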


### Apply it to the Boston Housing Data

In [60]:
# Make sure to concatenate the data again
data = pd.merge(X, y, on=X.index)
data["target"] = data[0]
data = data.drop('key_0', axis=1)
folds = kfolds(data, k=5)
print(folds[5].tail())
folds[4].head()

     CHAS     RM       DIS         B     LSTAT     0  target
499   0.0  5.569  0.317486  0.997151  0.572599  17.5    17.5
500   0.0  6.027  0.334399  1.000000  0.485410  16.8    16.8
501   0.0  6.593  0.331081  0.987619 -0.169811  22.4    22.4
502   0.0  6.120  0.297277  1.000000 -0.274682  20.6    20.6
503   0.0  6.976  0.274575  1.000000 -1.067939  23.9    23.9


     CHAS     RM       DIS         B     LSTAT     0  target
336   0.0  5.869  0.645772  1.000000 -0.147565  19.5    19.5
337   0.0  5.895  0.675609  0.994730 -0.023142  18.5    18.5
338   0.0  6.059  0.610606  0.998084 -0.382682  20.6    20.6
339   0.0  5.985  0.610606  1.000000 -0.157795  19.0    19.0
340   0.0  5.968  0.610606  1.000000 -0.236594  18.7    18.7


### Perform a linear regression for each fold, and calculate the training and test error

For each of the five iterations, hold one fold out as the test set, fit a linear regression on the remaining folds, and calculate the training and test mean squared error.

In [61]:
test_errs = []
train_errs = []
k=5

for n in range(k):
    # Split in train and test for the fold
    train = pd.concat([fold for i, fold in enumerate(folds) if i!=n])
    test = folds[n]
    # Fit a linear regression model
    reg = LinearRegression().fit(train[['CHAS', 'RM', 'DIS', 'B', 'LSTAT']], 
                                 train["target"])
    #Evaluate Train and Test Errors
    y_hat_train = reg.predict(train[['CHAS', 'RM', 'DIS', 'B', 'LSTAT']])
    y_hat_test = reg.predict(test[['CHAS', 'RM', 'DIS', 'B', 'LSTAT']])
    train_residuals = y_hat_train - train["target"]
    test_residuals = y_hat_test - test["target"]
    train_errs.append(np.mean(train_residuals**2))
    test_errs.append(np.mean(test_residuals**2))

print(train_errs)
print(test_errs)

[23.021437184446516, 23.346908726358503, 20.230582873154006, 20.652325176385006, 12.721515124341654]
[14.01931833988997, 11.01679923502021, 28.989972983030672, 28.357156773453404, 87.18398323109275]
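
Averaging the per-fold errors gives a single cross-validated estimate, and their spread shows how much the estimate depends on which fold is held out. A sketch using the values printed above, rounded to four decimals:

```python
import numpy as np

# Per-fold errors copied from the output above (rounded)
train_errs_printed = [23.0214, 23.3469, 20.2306, 20.6523, 12.7215]
test_errs_printed = [14.0193, 11.0168, 28.9900, 28.3572, 87.1840]

mean_train = np.mean(train_errs_printed)
mean_test = np.mean(test_errs_printed)
spread_test = np.max(test_errs_printed) - np.min(test_errs_printed)

print(round(mean_train, 2), round(mean_test, 2), round(spread_test, 2))
```

The test errors range far more widely across folds than the training errors do, which a single holdout split would not reveal.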


## Cross-Validation using Scikit-Learn

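This was a bit of work! Now let's use scikit-learn to perform 5-fold cross-validation and obtain the mean squared error. Because `scoring="neg_mean_squared_error"` returns negated MSEs (scikit-learn treats higher scores as better), the five values below are negative. Have a look at the five individual values and explain what's going on.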

In [62]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

cv_5_results = cross_val_score(reg, X, y, cv=5, scoring="neg_mean_squared_error")
cv_5_results

array([-13.40514492, -17.4440168 , -37.03271139, -58.27954385,
       -26.09798876])

Next, calculate the mean MSE over the five folds and compare it with the result from the train-test split.
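
One way to do this is sketched below, using the scores and the holdout MSE copied from the outputs above. Since the scores are negated MSEs, the sign is flipped before comparing.

```python
import numpy as np

# Scores copied from the cross_val_score output above (negated MSEs)
cv_scores = np.array([-13.40514492, -17.4440168, -37.03271139,
                      -58.27954385, -26.09798876])
holdout_mse = 24.731064107766134  # test MSE from the single split above

cv_mse = -cv_scores.mean()  # flip the sign to recover the MSE

print(round(cv_mse, 2), round(holdout_mse, 2))
```

The holdout figure comes from one unseeded random split, so it will differ on a rerun, while the cross-validated mean averages over five held-out folds.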

## Summary

Congratulations! You have now practiced k-fold cross-validation!