In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
from math import sqrt
%matplotlib inline
np.set_printoptions(precision=3)
fig_width = 6.9
golden_mean = (sqrt(5)-1.0)/2.0    # Aesthetic ratio
fig_height = fig_width*golden_mean # height in inches

params = {
    'text.latex.preamble': r'\usepackage{gensymb}',  # a plain string; recent matplotlib rejects a list here
    'font.size': 10,
    'axes.labelsize': 10,   # fontsize for x and y labels
    'axes.titlesize': 12,
    'legend.fontsize': 8,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10,
    'text.usetex': True,
    'figure.figsize': [fig_width, fig_height],
    'font.family': 'serif'
}
rcParams.update(params)

# Model Validation



## Model validation the wrong way

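Consider a classification problem: predicting a player's position from statistics such as minutes, shots, passes, tackles, and saves. We start by loading the data: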


In [21]:
# Player classification.
data = pd.read_csv('Data/Players.csv')
features = ['minutes', 'shots', 'passes', 'tackles', 'saves']
X = data[features]
y = data.position

Then we train the model, and use it to predict labels for data we already know:

In [22]:
from sklearn.neighbors import KNeighborsClassifier        # 1. choose model class
model = KNeighborsClassifier(n_neighbors=1)               # 2. instantiate model
model.fit(X, y)                                           # 3. fit model to data
y_pred = model.predict(X)                                 # 4. predict on the same data used for fitting

Finally, we compute the fraction of correctly labeled points:

In [23]:
from sklearn.metrics import accuracy_score
accuracy_score(y, y_pred)

0.99327731092436977

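We see an accuracy score of 0.99, which indicates that 99% of points were correctly labeled by our model. However, this approach contains a fundamental flaw: it trains and evaluates the model on the same data. With `n_neighbors=1`, each training point's nearest neighbor is the point itself, so the model is essentially looking up labels it has memorized, and the score says little about how it will perform on unseen data.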
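To see why a high training score can be misleading, the sketch below fits a 1-nearest-neighbor model to purely random labels. The data is synthetic and the names `X_demo`, `y_demo`, `X_fresh`, and `y_fresh` are illustrative assumptions, not part of the players dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the players dataset):
# random features with random binary labels, so there is nothing to learn.
X_demo = rng.normal(size=(500, 5))
y_demo = rng.integers(0, 2, size=500)

demo_model = KNeighborsClassifier(n_neighbors=1)
demo_model.fit(X_demo, y_demo)

# Scoring on the training data: every point is its own nearest neighbor.
train_acc = accuracy_score(y_demo, demo_model.predict(X_demo))

# Scoring on fresh points drawn the same way: the labels are unpredictable.
X_fresh = rng.normal(size=(500, 5))
y_fresh = rng.integers(0, 2, size=500)
fresh_acc = accuracy_score(y_fresh, demo_model.predict(X_fresh))

print(train_acc)   # 1.0
print(fresh_acc)   # close to chance level
```

Even though the labels carry no signal at all, the training accuracy is perfect, which is why a score computed on the training data cannot be trusted as an estimate of real-world performance.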

## Model validation the right way: Holdout sets

A better sense of a model's performance can be found using what's known as a holdout set: *that is, we hold back some subset of the data from the training of the model, and then use this holdout set to check the model performance*. This splitting can be done using the train_test_split utility in Scikit-Learn:

In [24]:
from sklearn.model_selection import train_test_split
# split the data: 75% for training, 25% held out for testing

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

In [25]:
# fit the model on one set of data
model.fit(X_train, y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

In [26]:
# evaluate the model on the second set of data
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

0.46979865771812079

This time, the nearest-neighbor classifier is only about 47% accurate on the hold-out set. The hold-out set stands in for unknown data, because the model has not "seen" it during training.

**Limitation:** the held-out portion of the data is lost to model training, which can be a problem if the dataset is small.
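To make that cost concrete, the following sketch counts how many samples a `test_size=0.25` split withholds from training. The 100-row array is a made-up example, not the players data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, alternating labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# A quarter of the samples never reach model.fit().
print(X_tr.shape, X_te.shape)   # (75, 2) (25, 2)
```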

## Model validation via cross-validation

In cross-validation, the data is instead split repeatedly and multiple models are trained. 

The most commonly used version of cross-validation is **k-fold cross-validation**, where k is a user-specified number, usually 5 or 10.

An example with five folds is shown in the figure below.
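The fold assignment can also be listed directly. As a minimal sketch, assuming a toy array of ten samples (not the players data), scikit-learn's `KFold` splitter yields the train and test indices for each of the five folds:

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples with a single feature.
X_demo = np.arange(10).reshape(10, 1)

# Without shuffling, each fold's test set is a consecutive block of samples.
kfold = KFold(n_splits=5)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_demo), start=1):
    test_folds.append(test_idx.tolist())
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each sample appears in exactly one test fold, so every data point is used for evaluation once and for training in the remaining four rounds.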

**Cross-Validation in scikit-learn**

Cross-validation is implemented in scikit-learn using the **cross_val_score function** from the model_selection module.
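A minimal usage sketch follows. It assumes synthetic data from `make_classification` in place of the players dataset, and the names `X_demo`, `y_demo`, and `cv_model` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the players dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

cv_model = KNeighborsClassifier(n_neighbors=1)

# cv=5 requests five folds; one accuracy score is returned per fold.
scores = cross_val_score(cv_model, X_demo, y_demo, cv=5)

print(scores)          # array of five per-fold accuracies
print(scores.mean())   # a common single-number summary
```

Reporting the mean together with the spread of the per-fold scores gives a sense of how sensitive the model is to the particular split.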