# Regression Model Validation - Lab

## Introduction

In this lab, you'll validate your regression model using a train-test split.


## Objectives

You will be able to:

- Calculate the mean squared error (MSE) as a measure of predictive performance
- Validate the model using the test data


## Let's use our Boston Housing Data again!

This time, let's only include the variables that were previously selected using recursive feature elimination. The preprocessing code is included below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# minmax scaling
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

#standardization
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
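
The cell above applies two different rescalings (after a log transform for `DIS` and `LSTAT`). As a minimal sketch of what each one does, here they are on a toy array; the values in `x` are made up for illustration and are not taken from the Boston data.

```python
import numpy as np

# Toy values standing in for a feature column (hypothetical, not Boston data).
x = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean and divide by the population standard deviation.
# np.sqrt(np.var(x)), as used in the cell above, is the same quantity as np.std(x).
x_standard = (x - x.mean()) / np.sqrt(np.var(x))

print(x_minmax)
print(x_standard.mean(), x_standard.std())
```

After min-max scaling every value lies in the interval from 0 to 1, while the standardized values have mean 0 and standard deviation 1 (up to floating-point rounding).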

In [2]:
X = boston_features[['CHAS', 'RM', 'DIS', 'B', 'LSTAT']]
y = pd.DataFrame(boston.target, columns=['MEDV'])
y.head()

| | MEDV |
|---|---|
| 0 | 24.0 |
| 1 | 21.6 |
| 2 | 34.7 |
| 3 | 33.4 |
| 4 | 36.2 |


## Perform a train-test-split

In [3]:
# Use a 70/30 train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test), len(y_train), len(y_test))

354 152 354 152
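
The counts printed above are consistent with the test set being rounded up to a whole number of rows and the remainder going to the training set. The helper below, `split_counts`, is a hypothetical name introduced for this sketch; the rounding rule is an assumption inferred from the printed output rather than a description of scikit-learn's internals.

```python
import math

def split_counts(n_samples, test_size):
    """Return (n_train, n_test), assuming the test count is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# 506 rows with test_size=0.3, as in the cell above.
print(split_counts(506, 0.3))  # (354, 152)
```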


## Apply your model to the train set

#### Importing and initializing the model class

In [4]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

#### Fitting the model to the train data

In [5]:
linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

#### Calculating predictions on the train set, and on the test set

In [6]:
y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)

#### Calculating your residuals

In [7]:
train_residuals = y_hat_train - y_train
test_residuals = y_hat_test - y_test

#### Calculating the Mean Squared Error
A good way to compare overall performance is to compare the mean squared error of the predictions on the train and test sets.

In [8]:
# directly
#mse_train = np.sum((y_train-y_hat_train)**2)/len(y_train)
#mse_test = np.sum((y_test-y_hat_test)**2)/len(y_test)
#print('Train Mean Squared Error:', mse_train)
#print('Test Mean Squared Error:', mse_test)

from sklearn.metrics import mean_squared_error

train_mse = mean_squared_error(y_train, y_hat_train)
test_mse = mean_squared_error(y_test, y_hat_test)
print('Train Mean Squared Error:', train_mse)
print('Test Mean Squared Error:', test_mse)

# absolute gap between train and test error
delta = abs(train_mse - test_mse)
print("delta: {}".format(delta))

Train Mean Squared Error: 21.790399879500264
Test Mean Squared Error: 21.637653154687097
delta: 0.15274672481316642
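
For reference, the MSE is the average of the squared differences between observed and predicted values, MSE = (1/n) Σ (y_i − ŷ_i)². A minimal sketch of that computation on made-up numbers follows; `mse` is a hypothetical helper name, and the `ravel()` calls are an added safeguard so that a column-shaped array and a flat array are not broadcast against each other.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return np.mean((y_true - y_pred) ** 2)

# Toy values (hypothetical): the errors are 1, 0 and -2.
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
print(mse(y_true, y_pred))  # (1 + 0 + 4) / 3
```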


If the test error is substantially worse than the train error, this is a sign that the model doesn't generalize well to future cases.

One simple way to see how the split affects these two errors is to alter the size of the train-test split. By default, scikit-learn's `train_test_split` allocates 25% of the data to the test set and 75% to the training set. Fitting a model on only 10% of the data is apt to give an unstable fit that matches those few rows well but generalizes poorly, while training on 99% of the data leaves so few test rows that the test error becomes a very noisy estimate.

## Evaluating the effect of train-test split size

Iterate over a range of train-test split sizes from 0.5 to 0.95. For each of these, generate a new train/test split. Fit a model to the training sample and calculate both the training error and the test error (MSE) for each split. Plot the two curves (train error vs. training size and test error vs. training size) on one graph.

In [9]:
def find_best_train_test_split(X, y):
    """Try test sizes 0.05, 0.10, ..., 0.95 and return the smallest
    |train MSE - test MSE| together with the test size that produced it."""
    delta_min = -1
    best_split = 0

    for test_ratio in range(5, 100, 5):
        tr = test_ratio/100
        print("test-size ratio: {}".format(tr))
        print("================================")
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=tr, random_state=42)
        print(len(X_train), len(X_test), len(y_train), len(y_test))
        linreg.fit(X_train, y_train)
        y_hat_train = linreg.predict(X_train)
        y_hat_test = linreg.predict(X_test)
        train_mse = mean_squared_error(y_train, y_hat_train)
        test_mse = mean_squared_error(y_test, y_hat_test)
        print('Train Mean Squared Error:', train_mse)
        print('Test Mean Squared Error:', test_mse)
        delta = abs(train_mse - test_mse)
        print("delta: {}".format(delta))
        # keep the split with the smallest train/test gap seen so far
        if delta_min == -1 or delta < delta_min:
            delta_min = delta
            best_split = tr
        print("================================\n")

    return (delta_min, best_split)

delta_min, best_split = find_best_train_test_split(X, y)
print("\nRESULT: delta_min={}, best split = {} train ratio, {} test ratio".format(delta_min, 1-best_split, best_split))

test-size ratio: 0.05
480 26 480 26
Train Mean Squared Error: 22.176211267599918
Test Mean Squared Error: 12.967956505177982
delta: 9.208254762421936

test-size ratio: 0.1
455 51 455 51
Train Mean Squared Error: 22.46122514195907
Test Mean Squared Error: 15.046884226368066
delta: 7.414340915591003

test-size ratio: 0.15
430 76 430 76
Train Mean Squared Error: 22.932119154437025
Test Mean Squared Error: 15.131445796411382
delta: 7.800673358025643

test-size ratio: 0.2
404 102 404 102
Train Mean Squared Error: 21.371163624945574
Test Mean Squared Error: 23.476104314750756
delta: 2.104940689805183

test-size ratio: 0.25
379 127 379 127
Train Mean Squared Error: 21.536815607555614
Test Mean Squared Error: 22.472035278906905
delta: 0.935219671351291

test-size ratio: 0.3
354 152 354 152
Train Mean Squared Error: 21.790399879500264
Test Mean Squared Error: 21.637653154687097
delta: 0.15274672481316642

test-size ratio: 0.35
328 178 328 178
Train Mean Squared Error: 21.521
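
The loop above prints the errors for each split, but the task also asks for the two error curves on one graph. One way this could look is sketched below, assuming synthetic stand-in data (`X_demo`, `y_demo`) instead of the Boston features; the helper names `errors_by_test_size` and `plot_errors` are hypothetical and introduced only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 200 rows, 3 features, linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

def errors_by_test_size(X, y, test_sizes, seed=0):
    """Fit one model per test size and return the train and test MSE lists."""
    train_errors, test_errors = [], []
    for test_size in test_sizes:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = LinearRegression().fit(X_tr, y_tr)
        train_errors.append(mean_squared_error(y_tr, model.predict(X_tr)))
        test_errors.append(mean_squared_error(y_te, model.predict(X_te)))
    return train_errors, test_errors

def plot_errors(test_sizes, train_errors, test_errors):
    """Plot both error curves against the test-set fraction."""
    import matplotlib.pyplot as plt  # imported here so the computation runs without it
    plt.plot(test_sizes, train_errors, marker="o", label="train MSE")
    plt.plot(test_sizes, test_errors, marker="o", label="test MSE")
    plt.xlabel("test-set fraction")
    plt.ylabel("mean squared error")
    plt.legend()
    plt.show()

test_sizes = [t / 100 for t in range(5, 100, 5)]
train_errors, test_errors = errors_by_test_size(X_demo, y_demo, test_sizes)

# Uncomment to draw the figure:
# plot_errors(test_sizes, train_errors, test_errors)
```

Collecting the errors in lists rather than printing them inside the loop keeps the computation separate from the presentation, so the same lists can be plotted, tabulated, or averaged later.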

## Evaluating the effect of train-test split size: extension

Repeat the previous example, but for each train-test split size, generate 100 iterations of models/errors and save the average train/test error. This helps account for any particularly good or bad models that might have resulted from lucky or unlucky splits of the data.

In [10]:
def find_best_train_test_split_over_n_trials(X, y, n):
    """Same search as find_best_train_test_split, but averaging the
    train and test MSE over n trials for each test size."""
    delta_min = -1
    best_split = 0

    for test_ratio in range(5, 100, 5):
        tr = test_ratio/100
        print("test-size ratio: {}".format(tr))
        print("================================")
        mean_train_mse = 0
        mean_test_mse = 0

        for i in range(0, n):
            # NOTE: random_state=42 is fixed, so every trial draws the same split and
            # the "mean" below equals the single-split MSE. Vary the seed per trial
            # (or omit random_state) to average over different splits.
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=tr, random_state=42)
            linreg.fit(X_train, y_train)
            y_hat_train = linreg.predict(X_train)
            y_hat_test = linreg.predict(X_test)
            train_mse = mean_squared_error(y_train, y_hat_train)
            mean_train_mse += train_mse
            test_mse = mean_squared_error(y_test, y_hat_test)
            mean_test_mse += test_mse

        mean_train_mse = mean_train_mse/n
        mean_test_mse = mean_test_mse/n
        print(len(X_train), len(X_test), len(y_train), len(y_test))
        print('(Mean over {} trials) Train Mean Squared Error: {}'.format(n, mean_train_mse))
        print('(Mean over {} trials) Test Mean Squared Error: {}'.format(n, mean_test_mse))
        delta = abs(mean_train_mse - mean_test_mse)
        print("delta: {}".format(delta))
        if delta_min == -1 or delta < delta_min:
            delta_min = delta
            best_split = tr
        print("================================\n")

    return (delta_min, best_split)

prev_delta_min, prev_best_split = delta_min, best_split
delta_min, best_split = find_best_train_test_split_over_n_trials(X, y, 100)
print("\nRESULT: delta_min={}, best split = {} train ratio, {} test ratio".format(delta_min, 1-best_split, best_split))

test-size ratio: 0.05
480 26 480 26
(Mean over 100 trials) Train Mean Squared Error: 22.17621126759993
(Mean over 100 trials) Test Mean Squared Error: 12.96795650517797
delta: 9.208254762421959

test-size ratio: 0.1
455 51 455 51
(Mean over 100 trials) Train Mean Squared Error: 22.46122514195911
(Mean over 100 trials) Test Mean Squared Error: 15.046884226368087
delta: 7.414340915591021

test-size ratio: 0.15
430 76 430 76
(Mean over 100 trials) Train Mean Squared Error: 22.932119154437004
(Mean over 100 trials) Test Mean Squared Error: 15.131445796411361
delta: 7.800673358025643

test-size ratio: 0.2
404 102 404 102
(Mean over 100 trials) Train Mean Squared Error: 21.371163624945606
(Mean over 100 trials) Test Mean Squared Error: 23.47610431475076
delta: 2.1049406898051544

test-size ratio: 0.25
379 127 379 127
(Mean over 100 trials) Train Mean Squared Error: 21.536815607555596
(Mean over 100 trials) Test Mean Squared Error: 22.472035278906947
delta: 0.9352196713513514



What's happening here? Evaluate your result!

In [11]:
print("delta_min={}, best_split={}".format(delta_min, best_split))
print("prev_delta_min={}, prev_best_split={}".format(prev_delta_min, prev_best_split))
print("abs(delta_min - prev_delta_min)={}, abs(best_split - prev_best_split)={}".format(abs(delta_min - prev_delta_min), abs(best_split - prev_best_split)))

delta_min=0.15274672481316642, best_split=0.3
prev_delta_min=0.15274672481316642, prev_best_split=0.3
abs(delta_min - prev_delta_min)=0.0, abs(best_split - prev_best_split)=0.0
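
The two runs agree exactly because `random_state=42` is passed on every trial, so all 100 iterations draw the same split. The averaged MSEs differ from the single-split values only in their last few digits, which is consistent with floating-point accumulation. The sketch below contrasts a fixed seed with a seed that changes per trial; it assumes synthetic stand-in data (`X_demo`, `y_demo`), and the helper name `trial_test_mse` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 300 rows, 2 features, linear signal plus noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=1.0, size=300)

def trial_test_mse(X, y, test_size, seeds):
    """Return one test MSE per seed; each seed determines its own split."""
    scores = []
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = LinearRegression().fit(X_tr, y_tr)
        scores.append(mean_squared_error(y_te, model.predict(X_te)))
    return np.array(scores)

# Same seed 100 times: every trial repeats the same split.
fixed_seed = trial_test_mse(X_demo, y_demo, 0.3, seeds=[42] * 100)
# A different seed per trial: 100 different splits.
varied_seed = trial_test_mse(X_demo, y_demo, 0.3, seeds=range(100))

print("fixed seed:  mean={:.4f}, std={:.4f}".format(fixed_seed.mean(), fixed_seed.std()))
print("varied seed: mean={:.4f}, std={:.4f}".format(varied_seed.mean(), varied_seed.std()))
```

With a fixed seed the spread across trials is zero, so averaging adds no information. With a varying seed the spread shows how much a single split can move the test error, which is the variation the averaging is meant to smooth out.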


## Summary

Congratulations! You have now practiced calculating the MSE and validating a model with a train-test split.