*This notebook has been **significantly** modified from the original notebook available online, as detailed next. This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*<br>
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT).*

# Model validation

### The basic process for applying a supervised machine learning model:

1. Choose a class of model (e.g. polynomial regression)
2. Choose the model hyperparameters (e.g. polynomial degree)
3. Fit the model to the training data
4. Validate the model<br>
   4.1 If the model is not OK, go back to either step 1 or step 2.
5. **Use the validated model to predict labels for new data**

In order to make an informed choice, we need a way to **validate** that our **model** and our **hyperparameters** are a good fit to the data.

> **There are some pitfalls that you must avoid to do this effectively**.

## Thinking about Model Validation

In principle, **model validation** is very simple: after choosing a model and its hyperparameters, we can estimate how effective it is by applying it to data whose labels are known and comparing the predictions to the known values.

The following sections first show a naive approach to model validation and why it
fails, before exploring the use of holdout sets and cross-validation for more robust
model evaluation.

### Model validation the wrong way

Let's demonstrate the naive approach to validation using the Iris data.

>**We are going to use the same data for training and testing: WRONG!!**<br>
>*We are going to get what appears to be an accurate model, which is in fact wrong!*

In [1]:
# We will start by loading the data:
    
from sklearn.datasets import load_iris
iris = load_iris()

# features
X = iris.data

# variable to be predicted
y = iris.target


Next we choose a model and its hyperparameters.<br>
**Here we'll use a *k*-neighbors classifier** with ``n_neighbors=1``. Let's not look at the specifics of this model for the time being. Intuitively, we are **setting the label of an unknown point to be the same as the label of its closest training point:**

In [2]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)

**Then we train the model, and use it to predict labels for data we already know:**

In [3]:
model.fit(X, y)
y_model = model.predict(X)

**Finally, we compute the fraction of correctly labeled points:**

In [4]:
from sklearn.metrics import accuracy_score
accuracy_score(y, y_model)

1.0

We see an accuracy score of 1.0, which indicates that **100% of points were correctly labeled by our model!**

But is this truly measuring the expected accuracy? Have we really come upon a model that we expect to be correct 100% of the time?
As you may have gathered, the answer is no.

>In fact, this approach contains a **fundamental flaw**:
**it trains and evaluates the model on the same data.**
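
To see why a perfect score on the training data says little, here is a minimal sketch that is not part of the original notebook. It assumes synthetic Gaussian features and random labels (the names `X_noise`, `y_noise`, and `noise_model` are made up for illustration), so there is nothing real to learn. A 1-nearest-neighbor model still scores 1.0 on its own training points, because each point is its own nearest neighbor.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic data with NO relationship between features and labels
# (an assumption made purely for this illustration).
X_noise = rng.normal(size=(200, 4))
y_noise = rng.integers(0, 3, size=200)

noise_model = KNeighborsClassifier(n_neighbors=1)
noise_model.fit(X_noise, y_noise)

# Every training point is its own nearest neighbor (distance 0),
# so the model just returns the label it memorized.
train_accuracy = accuracy_score(y_noise, noise_model.predict(X_noise))

# Fresh points drawn the same way: the labels are random,
# so the score should sit near chance level.
X_fresh = rng.normal(size=(200, 4))
y_fresh = rng.integers(0, 3, size=200)
fresh_accuracy = accuracy_score(y_fresh, noise_model.predict(X_fresh))

print("accuracy on training data:", train_accuracy)
print("accuracy on fresh data:   ", fresh_accuracy)
```

The gap between the two numbers is exactly what evaluating on the training data hides.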

### Model validation the right way: Holdout sets

So what can be done?
A better sense of a model's performance can be found using what's known as a *holdout set*: that is, we hold back some subset of the data from the training of the model, and then use this holdout set to check the model performance.
This splitting can be done using the ``train_test_split`` utility in Scikit-Learn:

In [5]:
# Watch out!! sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
#from sklearn.cross_validation import train_test_split

from sklearn.model_selection import train_test_split

# split the data with 50% in each set
X1, X2, y1, y2 = train_test_split(X, y, random_state=0,
                                  train_size=0.5)

# fit the model on the training half
model.fit(X1, y1)

# evaluate the model on the testing set
y2_model = model.predict(X2)
accuracy_score(y2, y2_model)

0.9066666666666666

> **We see here a more reasonable result**: the nearest-neighbor classifier is about 90% accurate on this hold-out set.
The hold-out set is similar to unknown data, because the model has not "seen" it before.
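
A single holdout score depends on which points happen to land in each half. The following sketch, which is an addition and not from the original notebook, repeats the 50/50 split with a few different `random_state` values. The `stratify=y` argument and the choice of five seeds are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

holdout_scores = []
for seed in range(5):
    # stratify=y keeps the class proportions equal in both halves
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.5, random_state=seed, stratify=y
    )
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
    holdout_scores.append(accuracy_score(y_te, clf.predict(X_te)))

# One accuracy per split; the spread shows how noisy one split can be.
print(holdout_scores)
```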


### Accuracy classification score
The `accuracy_score` function computes either the fraction (the default) or the number of correctly classified samples.
> `sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)`

**Parameters**<br>
`normalize` : bool, optional (default=True)<br>
If `False`, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples.
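
As a small illustration of the `normalize` parameter, here is a sketch on hand-made labels (the two lists are hypothetical, chosen so that three of four predictions match):

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: predictions match the truth at positions 0, 1 and 3.
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 1, 2]

# Default (normalize=True): fraction of correct predictions, 3 / 4.
fraction_correct = accuracy_score(y_true, y_pred)

# normalize=False: number of correct predictions.
count_correct = accuracy_score(y_true, y_pred, normalize=False)

print(fraction_correct, count_correct)
```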

### Limitations of holdout set 
>One **disadvantage of using a holdout set** for model validation is that **part of our data is withheld from the model training.**
In the above case, half the dataset does not contribute to the training of the model!
This is not optimal, and can cause problems – especially if the initial set of training data is small.

### Model validation via cross-validation

One way to address this is to use *cross-validation*; that is, to do a sequence of fits where each subset of the data is used both as a training set and as a validation set.
Visually, it might look something like this:

![](figures/05.03-2-fold-CV.png)

Here we do two validation trials, alternately using each half of the data as a holdout set.
Using the split data from before, we could implement it like this:

In [6]:
y2_model = model.fit(X1, y1).predict(X2)
y1_model = model.fit(X2, y2).predict(X1)
accuracy_score(y1, y1_model), accuracy_score(y2, y2_model)

(0.96, 0.9066666666666666)

What comes out are two accuracy scores, which we could combine (by, say, taking the mean) to get a better measure of the global model performance.
This particular form of cross-validation is a **two-fold cross-validation** — that is, one in which we have split the data into two sets and used each in turn as a validation set.
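
The same two-fold procedure can also be written as a loop over a `KFold` splitter and averaged. This is a sketch rather than the notebook's own code: `KFold` with `shuffle=True` and `random_state=0` is an assumption, and it produces different halves than the `train_test_split` call used earlier, so the individual scores need not match the ones shown above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# shuffle=True matters here: the Iris rows are ordered by class,
# so unshuffled halves would contain different classes.
kfold = KFold(n_splits=2, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kfold.split(X):
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(accuracy_score(y[val_idx], clf.predict(X[val_idx])))

# Combine the per-fold scores into a single estimate.
mean_score = np.mean(fold_scores)
print(fold_scores, mean_score)
```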

We could expand on this idea to use even more trials, and more folds in the data—for example, here is a visual depiction of **five-fold cross-validation**:

![](figures/05.03-5-fold-CV.png)

Here we split the data into five groups, and use each of them in turn to evaluate the model fit on the other 4/5 of the data.
This would be rather tedious to do by hand, and so we can use Scikit-Learn's ``cross_val_score`` convenience routine to do it succinctly:

In [7]:
#from sklearn.cross_validation import cross_val_score

# five-fold cross-validation
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)

array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1.        ])

>**Repeating the validation across different subsets of the data gives us an even better idea of the performance of the algorithm.**
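
The per-fold scores are commonly summarized by their mean and spread. The sketch below is an addition, not part of the original notebook; reporting the standard deviation alongside the mean is a common convention rather than something the text prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5)

# The mean is the overall estimate; the standard deviation hints at
# how much the estimate moves from one fold to another.
print(f"accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```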

Scikit-Learn implements a number of cross-validation schemes that are useful in particular situations; these are implemented via iterators in the ``model_selection`` module.
For example, we might wish to go to the extreme case in which our number of folds is equal to the number of data points: that is, we train on all points but one in each trial.
>**This type of cross-validation is known as *leave-one-out* cross validation, and can be used as follows:**

In [8]:
#from sklearn.cross_validation import LeaveOneOut
from sklearn.model_selection import LeaveOneOut

# outdated format
#scores = cross_val_score(model, X, y, cv=LeaveOneOut(len(X)))

scores = cross_val_score(model, X, y, cv=LeaveOneOut())
scores

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

Because we have 150 samples, leave-one-out cross-validation yields scores for 150 trials, and each score indicates either a successful (1.0) or an unsuccessful (0.0) prediction.
> **Taking the mean of these gives an estimate of the accuracy (one minus the error rate):**

In [9]:
scores.mean()

0.96
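
The relationship between the leave-one-out scores, the accuracy, and the error rate can be spelled out as follows. This is a sketch added for illustration; the variable names are not from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
n_trials = loo.get_n_splits(X)  # one trial per sample

loo_scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=loo)

# Each score is 1.0 (held-out point predicted correctly) or 0.0 (missed).
estimated_accuracy = loo_scores.mean()
estimated_error_rate = 1.0 - estimated_accuracy
n_misclassified = int((loo_scores == 0).sum())

print(n_trials, estimated_accuracy, estimated_error_rate, n_misclassified)
```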

> **Other cross-validation schemes can be used similarly.** Take a look at Scikit-Learn's online [cross-validation documentation](http://scikit-learn.org/stable/modules/cross_validation.html).
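
As an example of "used similarly", the sketch below passes two other splitters from `sklearn.model_selection` to `cross_val_score` through the same `cv` argument. The choice of `StratifiedKFold` and `ShuffleSplit`, and their settings, is an assumption made for illustration and not a recommendation from the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=1)

# Two alternative splitters (illustrative settings).
schemes = {
    "stratified 5-fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "10 random 80/20 splits": ShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
}

# Any splitter object can be handed to cross_val_score via cv=...
mean_scores = {
    name: cross_val_score(clf, X, y, cv=cv).mean() for name, cv in schemes.items()
}

for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```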