# Model Validation

Model validation underpins two tasks:

- model evaluation and comparison
- hyperparameter tuning

In [33]:
from __future__ import print_function, division

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Use seaborn for plotting defaults
import seaborn as sns; sns.set()

In [34]:
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target

In [35]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

In [36]:
y_pred = clf.predict(X)

print("{0} / {1} correct".format(np.sum(y == y_pred), len(y)))

1797 / 1797 correct


In [8]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [9]:
y_pred = clf.predict(X)
print("{0} / {1} correct".format(np.sum(y == y_pred), len(y)))

1797 / 1797 correct


It seems we have a perfect classifier!

**Question: what's wrong with this?**

We made the mistake of testing our model on the same data that was used for training. This is generally not a good idea. A 1-nearest-neighbor classifier, for example, scores perfectly on its own training set simply because every training point is its own nearest neighbor. If we optimize our estimator this way, we will tend to over-fit the data: that is, we learn the noise.
A better way to test a model is to use a hold-out set that does not enter the training. We've seen this before using scikit-learn's train/test split utility:

In [37]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.66)
X_train.shape, X_test.shape

((1186, 64), (611, 64))

In [47]:
# clf = DecisionTreeClassifier()
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("{0} / {1} correct".format(np.sum(y_test == y_pred), len(y_test)))

602 / 611 correct


The metric we're using here, the number of correct predictions divided by the total number of samples, is known as the accuracy score.

In [48]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.98527004909983629

In [49]:
# alternatively directly within the model
clf.score(X_test, y_test)

0.98527004909983629
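As a quick check on what the accuracy score measures, the sketch below computes it by hand on a small set of made-up labels and compares the result with `accuracy_score`. The arrays `y_true` and `y_hat` are hypothetical and exist only for illustration; they are not taken from the digits data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels, invented for illustration only
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_hat = np.array([0, 1, 2, 1, 1, 0, 2, 2])

# Accuracy is the fraction of predictions that match the true labels
manual_accuracy = np.mean(y_true == y_hat)
library_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy, library_accuracy)
```

Six of the eight toy predictions match, so both values come out as 0.75.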

In [50]:
for n_neighbors in [1, 5, 10, 20, 30]:
    knn = KNeighborsClassifier(n_neighbors)
    knn.fit(X_train, y_train)
    print(n_neighbors, knn.score(X_test, y_test))

1 0.9852700491
5 0.981996726678
10 0.975450081833
20 0.959083469722
30 0.952536824877


In [32]:
for depth in [None, 2, 3, 5, 8, 10, 20]:
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    print(depth, clf.score(X_test, y_test))

None 0.846153846154
2 0.281505728314
3 0.427168576105
5 0.73977086743
8 0.847790507365
10 0.834697217676
20 0.849427168576


## Cross-Validation

One problem with validation sets is that you "lose" some of the data. Above, we used only about two thirds of the data for training (`train_size=0.66`) and the remaining third for validation. Another option is to use 2-fold cross-validation, where we split the sample in half and perform the validation twice, swapping the roles of the two halves:

In [51]:
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)
X1.shape, X2.shape

((898, 64), (899, 64))

In [57]:
score1 = KNeighborsClassifier(1).fit(X2, y2).score(X1, y1)
score2 = KNeighborsClassifier(1).fit(X1, y1).score(X2, y2)
print(score1)
print(score2)
print((score1+score2)/2)

0.983296213808
0.982202447164
0.982749330486


In [60]:
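# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score
# Score the 1-nearest-neighbor classifier explicitly rather than relying on
# whichever estimator `clf` last referred to
cv = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=2)
cv.mean()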

0.96048785080069765

In [61]:
cross_val_score(KNeighborsClassifier(1), X, y, cv=5)

array([ 0.96153846,  0.95303867,  0.96657382,  0.98879552,  0.95492958])
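To make the mechanics of `cross_val_score` more concrete, here is a minimal sketch that runs the 5-fold loop by hand. It assumes stratified, unshuffled folds (`StratifiedKFold`), and the same splitter object is passed to `cross_val_score` so that both computations see identical splits. The variable names are illustrative, and the exact scores may differ from the output above depending on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Assumed splitter: 5 stratified folds without shuffling
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    # Fit a fresh model on the training folds, score it on the held-out fold
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The library helper, given the same splitter, should reproduce the loop
library_scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=folds)

print(manual_scores)
print(library_scores)
```

Every sample is used for validation exactly once and for training in the other four folds, which is why no data is "lost" to a fixed hold-out set.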

## Bias-Variance Trade-off, Overfitting, Underfitting and Model Selection

Selecting a model that suits your data is one of the most important skills in applied machine learning.

If our estimator is underperforming, how should we move forward?

- Use a simpler or a more complicated model?
- Add more features to each observed data point?
- Add more training samples?

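Sometimes a more complicated model gives worse results on unseen data because it fits the noise in the training set (over-fitting), while an overly simple model can miss real structure (under-fitting).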
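One way to see a model fitting the noise is to compare its accuracy on the training set with its accuracy on held-out data as complexity grows. The sketch below does this for decision trees of increasing depth on the digits data. The depth values and `random_state=0` are arbitrary choices made here for reproducibility, not settings from the cells above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.66, random_state=0)

results = []
for depth in [2, 5, 10, None]:  # None lets the tree grow without a depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Record (depth, training accuracy, held-out accuracy)
    results.append((depth,
                    tree.score(X_train, y_train),
                    tree.score(X_test, y_test)))

for depth, train_acc, test_acc in results:
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

The unrestricted tree reaches perfect training accuracy while its held-out accuracy stays well below that. On this dataset the held-out score tends to level off rather than fall sharply, but the widening gap between the two columns is the signature of over-fitting.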