# The need for a third dataset

This notebook illustrates why we need to split our original data into three sets.

We use a somewhat peculiar "hyper-parameter" to illustrate the point: the random seed. You would never try to find "the best" random seed in a real-world problem, but it is a great "hyper-parameter" for us to optimise because we know it should not change anything!

In [1]:
%config InlineBackend.figure_format='retina'
%matplotlib inline

import numpy as np
np.random.seed(123)
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["font.size"] = 14

# Train, validation and test

We use the breast cancer dataset once again.

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import scale
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

# features are on very different scales so let's
# scale them all.
X = scale(X)

In [3]:
# Split your data into two datasets. "trainval" and "test"
# The test dataset should be put on a USB drive, locked in
# a safe and deleted from your laptop. Only unlock it once
# you have frozen *every* parameter and made all choices.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=2)

# The training and validation dataset is what we will use
# day to day to tune our model.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, random_state=2)

In [4]:
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

print("Train: {:.3f}".format(knn.score(X_train, y_train)))
print("Validation: {:.3f}".format(knn.score(X_val, y_val)))
print("Test: {:.3f}".format(knn.score(X_test, y_test)))

Train: 0.978
Validation: 0.981
Test: 0.986


As a sanity check, we get roughly the same performance on all three splits. There is nothing special about any of them.
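To put "roughly the same" into numbers, the sketch below works out how many samples end up in each split and how much a single misclassified sample moves the accuracy. It assumes the 569 samples of the breast cancer dataset and the default `test_size=0.25` of `train_test_split`, with the held-out part rounded up. The helper `split_sizes` is made up for this illustration and is not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size=0.25):
    """Sizes of the (kept, held-out) parts, assuming the held-out part is rounded up."""
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out


n_total = 569  # number of samples in the breast cancer dataset
n_trainval, n_test = split_sizes(n_total)
n_train, n_val = split_sizes(n_trainval)

for name, n in [("Train", n_train), ("Validation", n_val), ("Test", n_test)]:
    # 1 / n is the change in accuracy caused by one sample
    print("{}: {} samples, one sample = {:.3f} accuracy".format(name, n, 1 / n))
```

Under these assumptions the validation set holds 107 samples, so the validation score of 0.981 corresponds to 105 correct predictions out of 107. One sample more or less is already a change of almost one percentage point.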

Now we have split the data into three groups, but why? We have a **train**, **validation** and **test** data set.

So far we always fit our classifier on the **train** set, and measured performance on the **test** set. Now we split the **train** data into two smaller sets again and call them **train** and **validation**.

---

For a moment let's pretend there is no **test** set anymore. If you like, you can think of our train and validation sets as renamed train and test sets. Keeping the naming straight and communicating it to others takes discipline, though, so I recommend not renaming them.

We will fit our model on the **train** dataset, trying different techniques for increasing the accuracy. We know we need to use a different dataset to measure our performance, so we will use the **validation** set to do that. Once we know which technique has the highest score, we will use it on future data.

*Note:* This is an example for you to learn from, so we want to use different "techniques" that actually all have the same performance. This way we know that any difference we see must be due to random fluctuations.

Let's simulate 1000 different attempts to fit a model and select the one with the best validation score. Just like we learnt.

In [5]:
val = []
test = []
for i in range(1000):
    rng = np.random.RandomState(i)
    # think of this as tuning a hyper-parameter
    # a weird parameter but let's just roll with it
    noise = rng.normal(scale=.1, size=X_train.shape)

    knn = KNeighborsClassifier(n_neighbors=3)
    # add noise to our dataset and fit the classifier
    knn.fit(X_train + noise, y_train)

    val.append(knn.score(X_val, y_val))
    test.append(knn.score(X_test, y_test))

print("Validation: {:.3f}".format(np.max(val)))

Validation: 1.000


Wow! We managed to get a perfect score on data not used in the training! This "adding random noise" trick must be a magical algorithm after all!

Not so fast ...

You see that we can overfit the validation set by doing this. Luckily we have the third dataset (the test set) to get an unbiased estimate of the model's performance. It checks out: the model fitted on a noisy dataset is no better than any other.


In [6]:
print("Test: {:.3f}".format(test[np.argmax(val)]))

Test: 0.986
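The same effect shows up without any classifier at all. The sketch below is a toy simulation, not the notebook's experiment: it assumes 1000 "models" that all share the same true accuracy of 0.98, scores each one on 107 validation and 143 test samples (assumed sizes, matching default 25% splits of 569 samples), and treats every prediction as an independent coin flip. It uses only the standard library so it runs without the notebook state. The helper `simulated_score` is hypothetical.

```python
import random

rng = random.Random(0)


def simulated_score(n_samples, true_accuracy):
    """Fraction of n_samples that a model with the given true accuracy gets right."""
    correct = sum(rng.random() < true_accuracy for _ in range(n_samples))
    return correct / n_samples


# Assumptions: identical true accuracy for every model, independent samples.
true_accuracy = 0.98
n_val, n_test, n_models = 107, 143, 1000

val_scores = [simulated_score(n_val, true_accuracy) for _ in range(n_models)]
test_scores = [simulated_score(n_test, true_accuracy) for _ in range(n_models)]

# Select the model with the best validation score, as in the loop above.
best = max(range(n_models), key=lambda i: val_scores[i])
mean_val = sum(val_scores) / n_models

print("Mean validation score:  {:.3f}".format(mean_val))
print("Best validation score:  {:.3f}".format(val_scores[best]))
print("Test score of that one: {:.3f}".format(test_scores[best]))
```

The best validation score should come out well above the mean even though no model is truly better than the others. The test score of the selected model is just another draw around the true accuracy, because it played no part in the selection.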


The takeaway is that if you try enough (random) settings you will find one that performs very well on the dataset you use to choose between them, here the validation set. However, because you looked at that score while choosing the parameters, it is no longer an unbiased estimate of your algorithm's performance on unseen data. And that unbiased estimate is the number you and everyone else want to know! This is why we keep a third dataset, the test set, locked away until every choice has been made.
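For a hyper-parameter that does matter, the same discipline might look like the sketch below. It reuses the dataset, scaling and splits from this notebook. The list of `n_neighbors` values is an arbitrary choice made for illustration. All comparisons use the validation set, and the test set is scored exactly once at the end.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import scale

X, y = load_breast_cancer(return_X_y=True)
X = scale(X)

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=2)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, random_state=2)

# Step 1: compare candidates using the validation set only.
candidates = [1, 3, 5, 7, 9]  # illustrative values, not a recommendation
val_scores = {}
for k in candidates:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = knn.score(X_val, y_val)

best_k = max(val_scores, key=val_scores.get)

# Step 2: every choice is frozen, so now look at the test set, once.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("Chosen n_neighbors: {}".format(best_k))
print("Validation: {:.3f}".format(val_scores[best_k]))
print("Test: {:.3f}".format(final.score(X_test, y_test)))
```

The validation score printed here was used to make the choice, so it is likely to be optimistic. The test score is the one to report.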