# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Apply 5-fold cross-validation for regression
- Compare the results with normal holdout validation

## Let's get started

This time, let's include only the variables that were previously selected using recursive feature elimination. The preprocessing code is included below.

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
logb = np.log(boston_features["B"])
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])
rm = boston_features['RM']

# Min-max scaling of the log-transformed B and DIS
boston_features["B"] = (logb-min(logb))/(max(logb)-min(logb))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

# Standardization of the log-transformed LSTAT and of RM
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
boston_features['RM'] = (rm-np.mean(rm))/np.sqrt(np.var(rm))

In [19]:
X = boston_features[['B', 'DIS', 'LSTAT', 'RM', 'CHAS']]
y = boston.target

## Train test split

Perform a train-test split with a test set size of 0.20 (20% of the data).

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

Fit the model on the training set and use it to make predictions for both the training and test sets.

In [22]:
from sklearn.linear_model import LinearRegression

In [23]:
linreg = LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [24]:
y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)

Calculate the residuals and the mean squared error

In [25]:
from sklearn.metrics import mean_squared_error

In [26]:
mse_train = mean_squared_error(y_train, y_hat_train)
mse_test = mean_squared_error(y_test, y_hat_test)
print(mse_train, mse_test)

22.620931335053108 18.00798227973473
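
The cells above compute the MSE directly with `mean_squared_error`, without showing the residuals. As a minimal sketch, using made-up toy values rather than the Boston data, the MSE is simply the mean of the squared residuals:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values chosen for illustration only (not from the lab's dataset)
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])

# Residuals are the differences between observed and predicted values
residuals = y_true - y_pred

# The MSE is the average of the squared residuals
mse_by_hand = np.mean(residuals ** 2)

print(residuals)
print(mse_by_hand, mean_squared_error(y_true, y_pred))
```

The same pattern should apply to the lab's variables, for example `y_train - y_hat_train` for the training residuals.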


## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the size of the full dataset is not divisible by k, make the first few folds one larger than the later ones.

The function should return the folds as a list of subsets of the data.

In [15]:
667//9

74

In [16]:
667%9

1
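
The floor division and remainder above are the two ingredients for sizing the folds. A small sketch of how they might be combined follows; the helper name `fold_sizes` is hypothetical and not part of the lab:

```python
def fold_sizes(n_samples, k):
    """Return k fold sizes that sum to n_samples, with the larger folds first."""
    # divmod gives the base size (n_samples // k) and the leftover (n_samples % k)
    base, leftover = divmod(n_samples, k)
    # The first `leftover` folds each take one extra sample
    return [base + 1 if i < leftover else base for i in range(k)]

print(fold_sizes(667, 9))
print(fold_sizes(506, 5))
```

For 667 samples and 9 folds, this gives one fold of 75 followed by eight folds of 74.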

In [45]:
def kfolds(data, k):
    # Force data into a pandas DataFrame
    data = pd.DataFrame(data)

    # Base fold size, and the number of leftover rows that need a fold one row larger
    fold_size = len(data)//k
    leftover = len(data)%k

    times_fold = k-leftover          # number of folds of size fold_size
    times_fold_plus_one = leftover   # number of folds of size fold_size + 1

    print(times_fold, times_fold_plus_one)

    folds = []

    # Regular-sized folds come first
    for i in range(times_fold):
        print(i*fold_size, (i+1)*fold_size)
        folds.append(data[i*fold_size:(i+1)*fold_size])

    # The larger folds follow, starting where the regular-sized folds end
    # (note: this places the larger folds last rather than first)
    start = times_fold*fold_size
    for i in range(times_fold_plus_one):
        print('in plus 1 for loop')
        print(start + (i*(fold_size+1)), start+((i+1)*(fold_size+1)))
        folds.append(data[start + (i*(fold_size+1)):start+((i+1)*(fold_size+1))])

    return folds
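
The function above places the larger folds at the end. As an alternative sketch that follows the instruction to make the first few folds larger, one could split the row positions with `np.array_split`, which puts the longer chunks first. The name `kfolds_first_larger` and the toy DataFrame are hypothetical:

```python
import numpy as np
import pandas as pd

def kfolds_first_larger(data, k):
    """Split data into k contiguous folds; the first len(data) % k folds get one extra row."""
    data = pd.DataFrame(data)
    # Split the row positions; np.array_split returns the longer chunks first
    index_chunks = np.array_split(np.arange(len(data)), k)
    # Select the rows for each chunk by position
    return [data.iloc[chunk] for chunk in index_chunks]

# Toy data for illustration only
demo = pd.DataFrame({"x": range(11)})
demo_folds = kfolds_first_larger(demo, 3)
print([len(fold) for fold in demo_folds])
```

Splitting positions rather than the DataFrame itself keeps the sketch independent of how `np.array_split` handles pandas objects.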

### Apply it to the Boston Housing Data

In [64]:
# Make sure to concatenate the data again
boston_five_best = pd.concat([X,pd.Series(y, name='MEDV')], axis=1)
boston_five_best.head()

|   | B | DIS | LSTAT | RM | CHAS | MEDV |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.542096 | -1.27526 | 0.413672 | 0.0 | 24.0 |
| 1 | 1.0 | 0.623954 | -0.263711 | 0.194274 | 0.0 | 21.6 |
| 2 | 0.998553 | 0.623954 | -1.627858 | 1.282714 | 0.0 | 34.7 |
| 3 | 0.999195 | 0.707895 | -2.153192 | 1.016303 | 0.0 | 33.4 |
| 4 | 1.0 | 0.707895 | -1.162114 | 1.228577 | 0.0 | 36.2 |


In [77]:
folds = kfolds(boston_five_best, 5)

4 1
0 101
101 202
202 303
303 404
in plus 1 for loop
404 506


In [80]:

# Sanity check: leave out the second fold (index 1) and concatenate the remaining four
train_folds = folds[:1]+folds[2:]
train = pd.concat(train_folds)
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 405 entries, 0 to 505
Data columns (total 6 columns):
B        405 non-null float64
DIS      405 non-null float64
LSTAT    405 non-null float64
RM       405 non-null float64
CHAS     405 non-null float64
MEDV     405 non-null float64
dtypes: float64(6)
memory usage: 22.1 KB


### Perform a linear regression for each fold, and calculate the training and test error

For each fold, hold that fold out as the test set, fit a linear regression on the remaining folds, and calculate the training and test MSE.

In [87]:
test_errs = []
train_errs = []
k=5

for n in range(k):
    # Split in train and test for the fold
    train = folds[:n] + folds[n+1:]
    train_df = pd.concat(train)
    train_X = train_df[['B', 'DIS', 'LSTAT', 'RM', 'CHAS']]
    train_y = train_df["MEDV"]
    
    # Hold out fold n as the test set
    test_df = folds[n]
    test_X = test_df[['B', 'DIS', 'LSTAT', 'RM', 'CHAS']]
    test_y = test_df["MEDV"]
    
    # Fit a linear regression model
    linreg = LinearRegression()
    linreg.fit(train_X, train_y)
    
    #Evaluate Train and Test Errors
    y_hat_train = linreg.predict(train_X)
    y_hat_test = linreg.predict(test_X)
    
    mse_train = mean_squared_error(train_y, y_hat_train)
    mse_test = mean_squared_error(test_y, y_hat_test)
    
    train_errs.append(mse_train)
    test_errs.append(mse_test)

print(train_errs)
print(test_errs)
print(np.mean(test_errs))

[24.124483087969228, 22.99697588523474, 19.614344728130472, 15.758258008483033, 22.319482832653986]
[13.413115726910467, 17.372864299175315, 37.65686446904524, 54.91912185933499, 21.363986528700725]
28.945190576633347


## Cross-Validation using Scikit-Learn

This was a bit of work! Now, let's perform 5-fold cross-validation to get the mean squared error through scikit-learn. Let's have a look at the five individual MSEs and explain what's going on.

In [83]:
X = boston_features[['B', 'DIS', 'LSTAT', 'RM', 'CHAS']]
y = boston.target

In [84]:
from sklearn.model_selection import cross_val_score

cv_5_results = cross_val_score(linreg, X, y, cv=5, scoring='neg_mean_squared_error' )
cv_5_results

array([-13.34501376, -17.66139356, -37.13346086, -55.08461959,
       -21.22956376])

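The MSE ranges from about 13 to about 55 across the five folds, which means the folds differ substantially from one another. With `cv=5`, scikit-learn splits the rows into consecutive blocks without shuffling, so each fold reflects whatever ordering is present in the data.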
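
One way to check whether the spread comes from the row ordering is to shuffle the rows before splitting. The sketch below passes a shuffled `KFold` object to `cross_val_score`. It uses synthetic data from `make_regression` as a stand-in for the Boston features, and the names `X_demo`, `y_demo`, and `shuffled_cv` are assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the lab's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Shuffle the rows before splitting; random_state makes the split reproducible
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)

neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=shuffled_cv, scoring="neg_mean_squared_error")

# The scorer returns negated MSE values, so flip the sign to get the MSE per fold
mse_per_fold = -neg_mse
print(mse_per_fold)
print(mse_per_fold.mean())
```

In the lab, `X` and `y` would take the place of `X_demo` and `y_demo`; whether the fold-to-fold spread shrinks depends on the data.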

Next, calculate the mean of the MSE over the 5 cross-validations and compare and contrast with the result from the train-test-split case.

In [86]:
abs(np.mean(cv_5_results))

28.890810306821628

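The test MSE for the train-test split was about 18.0, while the mean MSE across the five cross-validation folds is about 28.9. That is a large difference. It suggests that when the data varies a lot from one part of the dataset to another, the error estimate from a single split can be unreliable, because it depends heavily on which rows end up in the test set. Cross-validation averages over several splits and gives a more complete picture.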
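
To see how much a single holdout estimate can move around, one could repeat the train-test split with different random seeds and compare the test MSEs. This is a rough sketch on synthetic data from `make_regression`, not the Boston data; the variable names are assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the lab's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

holdout_mses = []
for seed in range(10):
    # A different seed gives a different random 80/20 split
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    holdout_mses.append(mean_squared_error(y_te, model.predict(X_te)))

# The gap between the smallest and largest value shows how split-dependent one estimate is
print(min(holdout_mses), max(holdout_mses))
```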

##  Summary 

Congratulations! You have now practiced k-fold cross-validation!