# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Train-Test Split
Week 3 | Lesson 2.3

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Explain the connection between the bias-variance tradeoff and the train-test split
- Perform a split of data into testing and training sets

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn as skl

### The Bias-Variance Tradeoff

We are seeking a model that generalizes. We will build the model using data that we have, but its value comes in predicting outcomes for data we have not yet seen. 

Let's briefly consider the source of possible error in our model:

$$E\left[y_0-\hat f(x_0)\right]^2 = \text{Var}\left(\hat f(x_0)\right) + \left[\text{Bias}\left(\hat f(x_0)\right)\right]^2 + \text{Var}(\epsilon)$$

What do these terms represent?

- $\text{Var}\left(\hat f(x_0)\right)$ : the variance of your model; how much the fitted model changes from one sample of data to another, i.e. the extent to which it adjusts to match the particular data it was given 
- $\left[\text{Bias}\left(\hat f(x_0)\right)\right]^2$ : the squared bias of your model; the extent to which your model is not capable of matching the data, even on average 
- $\text{Var}(\epsilon)$ : the variance of the inherent (irreducible) error

#### THESE ARE ALL NON-NEGATIVE

We can only hope to minimize them, and we have no control over the variance of the inherent error.
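
The decomposition can be checked numerically. The sketch below is an illustration under assumptions, not part of the lesson's data: it assumes we know the true function (the same curve that `make_data` uses later in this lesson), repeatedly draws noisy samples, fits a polynomial with `np.polyfit`, and estimates the variance and squared bias of the prediction at a single point `x0`. The names `true_f` and `simulate_fits` are made up for this example.

```python
import numpy as np

def true_f(x):
    # Assumed "true" relationship, chosen to match make_data in this lesson
    return 10 - 1.0 / (x + 0.1)

def simulate_fits(degree, x0=0.5, n_points=30, noise_sd=0.8,
                  n_repeats=500, seed=0):
    """Fit a polynomial to many independent noisy samples and
    return the prediction at x0 from each fit."""
    rng = np.random.RandomState(seed)
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        x = rng.rand(n_points) ** 2
        y = true_f(x) + noise_sd * rng.randn(n_points)
        coeffs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coeffs, x0)
    return preds

x0, noise_sd = 0.5, 0.8
for degree in (1, 3, 5):
    preds = simulate_fits(degree, x0=x0, noise_sd=noise_sd)
    variance = preds.var()                       # Var(f_hat(x0))
    bias_sq = (preds.mean() - true_f(x0)) ** 2   # Bias(f_hat(x0))^2
    # The expected error against a new noisy y0 adds Var(eps) = noise_sd**2
    print(degree, variance, bias_sq, noise_sd ** 2)
```

Each printed row gives one estimate of the three terms for a given polynomial degree; the last column never changes, which is the sense in which we have no control over it.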

## Minimize the Bias 

Minimizing the bias is easy. This is what we are doing when we fit an ordinary least squares (OLS) regression.

![](assets/ols.png)

In fact, according to the Gauss-Markov Theorem, 

> in a linear regression model in which the errors have expectation zero and are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator.

https://en.wikipedia.org/wiki/Gauss–Markov_theorem
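
To make the OLS estimator concrete, here is a minimal sketch on synthetic data (the intercept 2.0, slope 3.0 and noise level are assumptions chosen for the example). It computes the coefficients two ways, with `np.linalg.lstsq` and by solving the normal equations $X^TX\beta = X^Ty$ directly, and both should agree.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(50)
# Assumed ground truth for this illustration: y = 2 + 3x + small noise
y = 2.0 + 3.0 * x + 0.1 * rng.randn(50)

# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones_like(x), x])

# Least squares solution (minimizes the sum of squared residuals)
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# The same estimate from the normal equations
beta_normal = np.linalg.solve(A.T @ A, A.T @ y)

print(beta, beta_normal)
```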

In [None]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def make_data(N=30, err=0.8, rseed=1):
    # randomly sample the data
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 10 - 1. / (X.ravel() + 0.1)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))

In [None]:
fig = plt.figure(figsize=(20,6))

X, y = make_data()
xfit = np.linspace(-0.1, 1.0, 1000)[:, None]

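models = []
for i in range(7):
    degree = 4*i + 1  # polynomial degrees 1, 5, 9, ..., 25
    fig.add_subplot(1, 7, i + 1)
    model = PolynomialRegression(degree).fit(X, y)
    models.append(model)
    plt.scatter(X, y)
    plt.plot(xfit, model.predict(xfit))
    plt.ylim(-1, 12)
    plt.title('degree = {}'.format(degree))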


### Measure the Bias

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
# scikit-learn metrics take (y_true, y_pred), in that order
error = [mean_absolute_error(y, model.predict(X)) for model in models]
plt.plot(error)

### What if we pass new data?

In [None]:
X_new, y_new = make_data(rseed=2)

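# Error of the already-fitted models on data they have not seen
error_new = [mean_absolute_error(y_new, model.predict(X_new)) for model in models]
plt.plot(error, label='original data')
plt.plot(error_new, label='new data')
plt.ylim(-1,25)
plt.legend()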

## Minimize the sum of the Bias and Variance

This is a much more challenging problem. In essence, we seek a model that is simple enough that it does not chase the noise in our data (low variance) and yet flexible enough to fit our known data well (low bias). To do this, we split our data into two sets:

- a training set
- a test set
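
A split like this can be sketched by hand with a shuffled index. This is only an illustration of the idea; the function name `simple_split` and its parameters are made up here, and later in the lesson we use scikit-learn's own `train_test_split` instead.

```python
import numpy as np

def simple_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle the row indices, then cut them into a test part and a training part."""
    rng = np.random.RandomState(seed)
    n_samples = len(X)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 rows, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = simple_split(X, y, test_fraction=0.3)
print(X_train.shape, X_test.shape)
```

Every row ends up in exactly one of the two sets, so the test set really is data the model has not seen.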

In [None]:
data_file_location = '../../../data/boston.csv'
# sep=r'\s+' splits on runs of whitespace; it replaces the
# deprecated delim_whitespace=True option
boston_housing_df = pd.read_csv(data_file_location, 
                                index_col=None,
                                header=None,
                                sep=r'\s+')

boston_housing_df.columns = ["CRIM", "ZN", "INDUS", "CHAS", 
                             "NOX", "RM", "AGE", "DIS", 
                             "RAD", "TAX", "PTRATIO", "B", 
                             "LSTAT", "MEDV"]

In [None]:
boston_housing_df.describe()

#### Sort Parameters by their Correlation with `MEDV`

In [None]:
boston_abs_correlations = boston_housing_df.corr()['MEDV'].abs().sort_values(ascending=False)
boston_abs_correlations

#### Just get the Names

In [None]:
features_names = list(boston_abs_correlations.index)

#### Don't need `MEDV`!

In [None]:
# MEDV correlates perfectly with itself, so it is first in the sorted list
features_names.pop(0)
print(features_names)

---

# Best Practices in Developing Predictive Models

1. Clearly state the problem you wish to solve
1. Clearly state the model you will develop to solve the problem
1. Clearly state a metric you will use to assess your performance
1. Clearly define a benchmark against which you will measure the performance of your model using the metric you selected

## Modeling Median Home Value in Boston

### Problem Statement

### Solution Statement

### Metric Selection

<img src="assets/regression_metrics.png" width="600px">

In [None]:
from sklearn import metrics

In [None]:
def metric(y_true, y_pred):
    return metrics.mean_absolute_error(y_true, y_pred)

In [None]:
metric((1,1,1),(4,1,1))

In [None]:
metric((1,1,1),(10,1,1))
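
The two calls above can be worked out by hand: one miss of 3 spread over three points gives a mean absolute error of 1, and one miss of 9 gives 3. As a point of comparison (an addition for illustration, not the metric chosen in this lesson), the sketch below computes the mean squared error for the same inputs. The helper names `mae` and `mse` are made up for this example.

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error: grows linearly with the size of a miss
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mse(y_true, y_pred):
    # Mean squared error: grows with the square of a miss
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

y_true = (1, 1, 1)
print(mae(y_true, (4, 1, 1)), mse(y_true, (4, 1, 1)))    # 1.0 3.0
print(mae(y_true, (10, 1, 1)), mse(y_true, (10, 1, 1)))  # 3.0 27.0
```

Tripling the size of the miss triples the MAE but multiplies the MSE by nine, which is why the choice of metric should be stated up front.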

### Benchmark 
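
The lesson leaves the choice of benchmark open. One common option, shown here only as an assumption, is a naive model that ignores the features and always predicts the median of the training targets; any model worth keeping should beat its error. The function name and the numbers below are made up for illustration and are not Boston results.

```python
import numpy as np

def median_baseline_error(y_train, y_test):
    """Mean absolute error of always predicting the training median."""
    baseline_prediction = np.median(y_train)
    return np.mean(np.abs(np.asarray(y_test) - baseline_prediction))

# Made-up target values, for illustration only
y_train = [10.0, 20.0, 30.0, 40.0]
y_test = [20.0, 35.0]

print(median_baseline_error(y_train, y_test))  # 7.5
```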

---

# The Train-Test Split

The process looks as follows:

1. Split the data into two (not necessarily equally sized) sets, the training set and the test set, and set the test set aside
1. Fit the model to the best of our abilities using the training set
1. Evaluate the model separately using both the training set and the test set
   - the evaluation of the model using the training set can be taken to signify bias
   - the evaluation of the model using the test set can be taken to signify variance
1. Repeat steps 2 and 3 until an optimal sum of bias and variance is reached
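
The process above can be sketched end to end on synthetic data before we apply it to the Boston data. The data, coefficients and noise level here are assumptions made up for the example; the scikit-learn calls are the same ones used in the rest of this lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 3 features, a linear target plus small noise
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(100)

# Split, keeping 10% of the rows aside as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=11)

# Fit using the training set only
model = LinearRegression().fit(X_train, y_train)

# Evaluate separately on both sets
train_error = mean_absolute_error(y_train, model.predict(X_train))
test_error = mean_absolute_error(y_test, model.predict(X_test))
print(train_error, test_error)
```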

### Prepare the Data

Pull the target vector off of the dataframe.

Drop the target vector to prepare the feature matrix.

In [None]:
boston_housing_target = boston_housing_df['MEDV']
boston_housing_feature = boston_housing_df.drop('MEDV', axis=1)

### Step 1: Split the data into a Training Set and a Test Set

In [None]:
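# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module has been removed
from sklearn.model_selection import train_test_split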

In [None]:
feature_matrix_train, \
feature_matrix_test, \
target_vector_train, \
target_vector_test = train_test_split(boston_housing_feature, 
                                      boston_housing_target, 
                                      test_size=0.1,
                                      random_state=11)
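
Fixing `random_state` makes the split reproducible, which matters when we compare several models on the same test set. A small sketch on toy data (the arrays are made up for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Two splits with the same random_state pick exactly the same rows
split_a = train_test_split(X, y, test_size=0.1, random_state=11)
split_b = train_test_split(X, y, test_size=0.1, random_state=11)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[0].shape, split_a[1].shape)
```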

### Forward Selection

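We will use a simple form of forward selection to develop our models: features are added one at a time, in decreasing order of their absolute correlation with `MEDV`.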
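
For comparison, a fuller greedy version of forward selection re-fits the model at every step and adds whichever remaining feature lowers the training error the most. The sketch below is an illustration on synthetic data, not the procedure used in the cells that follow; the function `forward_selection` and the demo data are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def forward_selection(X, y, n_features):
    """Greedily pick column indices, one at a time, by training error."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_features):
        best_err, best_j = None, None
        for j in remaining:
            cols = selected + [j]
            model = LinearRegression().fit(X[:, cols], y)
            err = mean_absolute_error(y, model.predict(X[:, cols]))
            if best_err is None or err < best_err:
                best_err, best_j = err, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Synthetic data: column 2 matters most, column 0 less, column 1 not at all
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 3)
y_demo = 5.0 * X_demo[:, 2] + 2.0 * X_demo[:, 0] + 0.1 * rng.randn(200)

print(forward_selection(X_demo, y_demo, n_features=2))
```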

In [None]:
features_names

In [None]:
from sklearn.linear_model import LinearRegression

#### Let's store the errors

In [None]:
errors_training_set = []
errors_test_set = []

### Step 2: Fit the Model

Here we fit a linear model using a single feature, `LSTAT`.

#### Prepare the data for fitting

In [None]:
print(features_names[:1])
ftr_mtx_01_p_trn = pd.DataFrame(feature_matrix_train[features_names[:1]])
ftr_mtx_01_p_tst = pd.DataFrame(feature_matrix_test[features_names[:1]])

#### Build Model with One Feature

In [None]:
LINEAR_REGRESSOR = LinearRegression()
LINEAR_REGRESSOR.fit(ftr_mtx_01_p_trn, target_vector_train)

### Step 3: Evaluate the Model

In [None]:
predict_train = LINEAR_REGRESSOR.predict(ftr_mtx_01_p_trn)
predict_test = LINEAR_REGRESSOR.predict(ftr_mtx_01_p_tst)
# metric expects (y_true, y_pred), in that order
error_training_set = metric(target_vector_train, predict_train)
error_test_set = metric(target_vector_test, predict_test)
errors_training_set.append(error_training_set)
errors_test_set.append(error_test_set)
error_training_set, error_test_set

### Step 2: Fit the Model

Here we fit a linear model using two features, `LSTAT` and `RM`.

#### Prepare the data for fitting

In [None]:
print(features_names[:2])
ftr_mtx_02_p_trn = pd.DataFrame(feature_matrix_train[features_names[:2]])
ftr_mtx_02_p_tst = pd.DataFrame(feature_matrix_test[features_names[:2]])

#### Build Model with Two Features

In [None]:
LINEAR_REGRESSOR = LinearRegression()
LINEAR_REGRESSOR.fit(ftr_mtx_02_p_trn, target_vector_train)

### Step 3: Evaluate the Model

In [None]:
predict_train = LINEAR_REGRESSOR.predict(ftr_mtx_02_p_trn)
predict_test = LINEAR_REGRESSOR.predict(ftr_mtx_02_p_tst)
# metric expects (y_true, y_pred), in that order
error_training_set = metric(target_vector_train, predict_train)
error_test_set = metric(target_vector_test, predict_test)
errors_training_set.append(error_training_set)
errors_test_set.append(error_test_set)
error_training_set, error_test_set

### Step 2: Fit the Model

Here we fit a linear model using three features, `LSTAT`, `RM`, and `PTRATIO`.

#### Prepare the data for fitting

In [None]:
print(features_names[:3])
ftr_mtx_03_p_trn = pd.DataFrame(feature_matrix_train[features_names[:3]])
ftr_mtx_03_p_tst = pd.DataFrame(feature_matrix_test[features_names[:3]])

#### Build Model with Three Features

In [None]:
LINEAR_REGRESSOR = LinearRegression()
LINEAR_REGRESSOR.fit(ftr_mtx_03_p_trn, target_vector_train)

### Step 3: Evaluate the Model

In [None]:
predict_train = LINEAR_REGRESSOR.predict(ftr_mtx_03_p_trn)
predict_test = LINEAR_REGRESSOR.predict(ftr_mtx_03_p_tst)
# metric expects (y_true, y_pred), in that order
error_training_set = metric(target_vector_train, predict_train)
error_test_set = metric(target_vector_test, predict_test)
errors_training_set.append(error_training_set)
errors_test_set.append(error_test_set)
error_training_set, error_test_set

## Let Python Do The Work

In [None]:
# One feature matrix per model size: the top 4, 5, ..., 9 features
training_feature_matrices = [
    feature_matrix_train[features_names[:k]] for k in range(4, 10)
]

test_feature_matrices = [
    feature_matrix_test[features_names[:k]] for k in range(4, 10)
]

In [None]:
for training_matrix, test_matrix in zip(
    training_feature_matrices,
    test_feature_matrices):
    
    # Fit on the training set, then evaluate on both sets
    LINEAR_REGRESSOR = LinearRegression()
    LINEAR_REGRESSOR.fit(training_matrix, target_vector_train)
    predict_train = LINEAR_REGRESSOR.predict(training_matrix)
    predict_test = LINEAR_REGRESSOR.predict(test_matrix)
    # metric expects (y_true, y_pred), in that order
    error_training_set = metric(target_vector_train, predict_train)
    error_test_set = metric(target_vector_test, predict_test)
    errors_training_set.append(error_training_set)
    errors_test_set.append(error_test_set)

In [None]:
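# Entry i of each list comes from the model with i + 1 features
n_features = range(1, len(errors_training_set) + 1)
plt.plot(n_features, errors_training_set, label='training set')
plt.plot(n_features, errors_test_set, label='test set')
plt.xlabel('number of features')
plt.legend()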
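
One way to read a plot like this numerically is to pick the model size with the lowest test error. The sketch below uses made-up error values purely for illustration (they are not results from the Boston data), and the helper name `best_model_size` is hypothetical.

```python
import numpy as np

def best_model_size(test_errors):
    """Return the number of features whose model has the lowest test error,
    assuming entry i comes from the model with i + 1 features."""
    return int(np.argmin(test_errors)) + 1

# Made-up test errors for models with 1, 2, 3, 4 and 5 features
example_test_errors = [4.5, 3.9, 3.6, 3.7, 3.8]

print(best_model_size(example_test_errors))  # 3
```

In this made-up example the test error stops improving after three features, so adding more would only add complexity.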