# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Apply 5-fold cross-validation for regression
- Compare the results with normal holdout validation

## Let's get started

This time, let's include only the variables that were previously selected using recursive feature elimination. The preprocessing code is included below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Note: load_boston was removed in scikit-learn 1.2, so this cell requires an older version
from sklearn.datasets import load_boston

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# Min-max scale B and log(DIS) to the range [0, 1]
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

# Standardize log(LSTAT) to zero mean and unit variance
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))

In [2]:
X = boston_features[['CHAS','RM','B', 'DIS', 'LSTAT']]
y = pd.DataFrame(boston.target, columns=['price'])

## Train test split

Perform a train-test split, holding out 20% of the data as the test set.

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.20)

In [5]:
print(len(X_train), len(X_test), len(y_train), len(y_test))

404 102 404 102
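
The split above is random, so the rows in each set (and any error computed from them) can change from run to run. A minimal sketch of how passing `random_state` makes the split repeatable. The toy arrays `X_demo` and `y_demo` and the seed `42` are assumptions for illustration, not part of the lab data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y (hypothetical, not the Boston data)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same seed gives the same split on every call
split_a = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
n_train, n_test = len(split_a[0]), len(split_a[1])
print(same_split, n_train, n_test)
```

Without a fixed seed, the test MSE reported later in this lab will differ slightly each time the notebook is rerun.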


Fit the model and apply it to make test set predictions.

In [6]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_hat_test = linreg.predict(X_test)

Calculate the residuals and the mean squared error

In [7]:
from sklearn.metrics import mean_squared_error
test_residuals = y_hat_test - y_test

test_mse = mean_squared_error(y_test, y_hat_test)
test_mse

19.504547448857608
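
The mean squared error is simply the average of the squared residuals, so it can also be computed by hand. A small sketch on made-up numbers (the arrays `y_true` and `y_pred` are hypothetical) showing that the manual calculation agrees with `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 32.7, 33.4])

residuals = y_pred - y_true              # prediction minus truth
mse_by_hand = np.mean(residuals ** 2)    # average squared residual
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_by_hand, mse_sklearn)
```

The sign convention of the residuals does not matter for the MSE, because each residual is squared.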

## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the size of the full dataset is not divisible by k, make the first few folds one larger than the later ones.

We want the folds to be a list of subsets of data!

In [8]:
def kfolds(data, k):
    # Force data as pandas dataframe
    data = pd.DataFrame(data)
    obs = len(data)
    fold_size = obs // k
    leftovers = obs%k
    folds = []
    start_obs = 0
    # The first `leftovers` folds each get one extra observation
    for i in range(1, k+1):
        if i <= leftovers:
            fold = data.iloc[start_obs : start_obs+fold_size+1]
            folds.append(fold)
            start_obs += fold_size+1
        else:
            fold = data.iloc[start_obs : start_obs+fold_size]
            folds.append(fold)
            start_obs += fold_size
    return folds
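
To check the splitting rule without any data, the fold sizes can be worked out with integer arithmetic alone. The helper `fold_sizes` below is a hypothetical name introduced for this sketch. It mirrors the rule used in `kfolds`: the first `n_obs % k` folds get one extra row.

```python
def fold_sizes(n_obs, k):
    """Sizes of k contiguous folds; the first n_obs % k folds get one extra row."""
    base, leftovers = divmod(n_obs, k)
    return [base + 1 if i < leftovers else base for i in range(k)]

sizes_boston = fold_sizes(506, 5)  # 506 rows = 404 train + 102 test from the split above
sizes_small = fold_sizes(11, 3)
print(sizes_boston, sizes_small)
```

A quick sanity check for any implementation is that the sizes sum to the number of observations and differ by at most one.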

### Apply it to the Boston Housing Data

In [11]:
# Make sure to concatenate the data again
boston_data = pd.concat([X, y], axis=1)
boston_data.head()

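|   | CHAS | RM | B | DIS | LSTAT | price |
|---|------|----|---|-----|-------|-------|
| 0 | 0.0 | 6.575 | 1.0 | 0.542096 | -1.27526 | 24.0 |
| 1 | 0.0 | 6.421 | 1.0 | 0.623954 | -0.263711 | 21.6 |
| 2 | 0.0 | 7.185 | 0.989737 | 0.623954 | -1.627858 | 34.7 |
| 3 | 0.0 | 6.998 | 0.994276 | 0.707895 | -2.153192 | 33.4 |
| 4 | 0.0 | 7.147 | 1.0 | 0.707895 | -1.162114 | 36.2 |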


In [12]:
boston_folds = kfolds(boston_data, 5)
boston_folds

[     CHAS     RM         B       DIS     LSTAT  price
 0     0.0  6.575  1.000000  0.542096 -1.275260   24.0
 1     0.0  6.421  1.000000  0.623954 -0.263711   21.6
 2     0.0  7.185  0.989737  0.623954 -1.627858   34.7
 3     0.0  6.998  0.994276  0.707895 -2.153192   33.4
 4     0.0  7.147  1.000000  0.707895 -1.162114   36.2
 5     0.0  6.430  0.992990  0.707895 -1.200048   28.7
 6     0.0  6.012  0.996722  0.671500  0.248456   22.9
 7     0.0  6.172  1.000000  0.700059  0.968416   27.1
 8     0.0  5.631  0.974104  0.709276  1.712312   16.5
 9     0.0  6.004  0.974305  0.743201  0.779802   18.9
 10    0.0  6.377  0.988956  0.727217  1.077829   15.0
 11    0.0  6.009  1.000000  0.719175  0.357391   18.9
 12    0.0  5.889  0.983862  0.663113  0.638571   21.7
 13    0.0  5.949  1.000000  0.601338 -0.432353   20.4
 14    0.0  6.096  0.957436  0.578763 -0.071152   18.2
 15    0.0  5.834  0.996772  0.582214 -0.390531   19.9
 16    0.0  5.935  0.974658  0.582214 -0.811149   23.1
 17    0.0

### Perform a linear regression for each fold, and calculate the training and test error

For each of the five folds, use that fold as the test set, fit a linear regression on the other four folds combined, and record the training and test mean squared error.

In [13]:
test_errs = []
train_errs = []
k=5

for n in range(k):
    # Split in train and test for the fold
    train = pd.concat([fold for i, fold in enumerate(boston_folds) if i!=n])
    test = boston_folds[n]
    
    # OR
#     train = []
#     for index, x in enumerate(boston_folds):
#         if index != n:
#             train.append(boston_folds[index])
#     test = boston_folds[n]
#     train = pd.concat(train)
    # Fit a linear regression model
    linreg.fit(train[X.columns], train[y.columns])
    
    # OR
#     linreg.fit(train[['CHAS', 'RM', 'DIS', 'B', 'LSTAT']], train['price'])

    # Evaluate train and test errors
    y_hat_train = linreg.predict(train[X.columns])
    y_hat_test = linreg.predict(test[X.columns])
    
    train_residuals = y_hat_train - train[y.columns]
    test_residuals = y_hat_test - test[y.columns]
    
    train_errs.append(np.mean(train_residuals.astype(float)**2))
    test_errs.append(np.mean(test_residuals.astype(float)**2))
print(train_errs)
print(test_errs)

[price    24.195577
dtype: float64, price    23.032087
dtype: float64, price    19.745073
dtype: float64, price    15.317101
dtype: float64, price    22.329973
dtype: float64]
[price    13.405145
dtype: float64, price    17.444017
dtype: float64, price    37.032711
dtype: float64, price    58.279544
dtype: float64, price    26.097989
dtype: float64]
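
The printed lists above contain one-element pandas Series rather than plain numbers, because the residuals are one-column DataFrames. A small sketch, on hypothetical residuals, of how selecting the column first yields a plain float that is easier to print and average:

```python
import pandas as pd

# Hypothetical residuals in a one-column DataFrame, shaped like `test_residuals` above
residuals = pd.DataFrame({"price": [1.0, -2.0, 3.0]})

as_series = (residuals ** 2).mean()                 # Series indexed by column name
as_float = float((residuals["price"] ** 2).mean())  # plain Python float
print(type(as_series).__name__, as_float)
```

Storing floats in `train_errs` and `test_errs` would be one way to make the printed lists more readable; the numbers themselves stay the same.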


## Cross-Validation using Scikit-Learn

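This was a bit of work! Now, let's perform 5-fold cross-validation to get the mean squared error through scikit-learn. Let's have a look at the five individual MSEs and explain what's going on. Because scikit-learn scorers follow a higher-is-better convention, `neg_mean_squared_error` returns the negated MSE for each fold.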

In [14]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

cv_5_results = cross_val_score(linreg, X, y, cv=5, scoring='neg_mean_squared_error')
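
For a regressor, passing an integer to `cv` makes `cross_val_score` use `KFold` without shuffling, which yields contiguous folds just like the from-scratch `kfolds` function. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo`, and the seed are assumptions for the example, not the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(23, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=23)

model = LinearRegression()
scores = cross_val_score(model, X_demo, y_demo, cv=5,
                         scoring="neg_mean_squared_error")

# Repeat the same procedure by hand with unshuffled, contiguous folds
manual = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[test_idx])
    manual.append(mean_squared_error(y_demo[test_idx], preds))

matches = np.allclose(-scores, np.array(manual))
print(matches)
```

This is why the scikit-learn scores in this lab are expected to match the test errors from the manual loop, apart from the sign.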

Next, calculate the mean of the MSE over the 5 folds and compare and contrast it with the result from the train-test split case.

In [15]:
cv_5_results

array([-13.40514492, -17.4440168 , -37.03271139, -58.27954385,
       -26.09798876])
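
The cell above only displays the scores, so here is a minimal sketch of the remaining step. To stay self-contained it copies the five scores and the holdout MSE from the outputs shown in this lab instead of recomputing them.

```python
import numpy as np

# Scores copied from the cross_val_score output above (negated MSE per fold)
cv_5_results = np.array([-13.40514492, -17.4440168, -37.03271139,
                         -58.27954385, -26.09798876])
holdout_mse = 19.504547448857608  # test MSE from the single train-test split

fold_mses = -cv_5_results   # flip the sign to recover the MSE per fold
cv_mse = fold_mses.mean()   # cross-validated estimate of the test MSE
print(round(cv_mse, 2), round(holdout_mse, 2))
```

The mean cross-validated MSE (about 30.45) is noticeably higher than the holdout MSE (about 19.50), and the individual fold MSEs range from roughly 13.4 to 58.3. A single random split gives an estimate that depends heavily on which rows happen to land in the test set, while cross-validation averages over five different test sets. One possible factor, stated here as an assumption rather than something the lab verifies: the folds are contiguous and unshuffled, so if the rows are ordered in some meaningful way, individual folds can differ systematically from the rest of the data.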

## Summary

Congratulations! You have now practiced k-fold cross-validation, both from scratch and with scikit-learn.