In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Validation


* K-fold cross-validation
* `cross_val_score`

## The problem of overfitting

![](overfitting.png)

## Populations

![](pop1.png)

![](pop2.png)

![](pop3.png)

## Your model versus the population

A sample is a **subset** of a population.

You will likely **never** have data that covers the entire population.

That means that you will likely **never** be able to represent the entire population!

Your model will lie!

## Car owners and voters

In 1936, *millions* of mock ballots were mailed to car owners across the USA to predict who would win the presidential election.

The Republicans were the *clear* winners in the mock ballots, but the Democrats won the election.

What was wrong?

## The problem of generalisation

If X% of a sample has Y, it does **not** mean that X% of the population has Y!

**Always** ask yourself: is your data representative?
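
To make this concrete, here is a small simulation of a biased sample. Everything in it is an assumption made up for illustration (the population size, the share of car owners and the voting preferences); it is a sketch of the idea, not real polling data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 1,000,000 voters; all numbers are made up.
n = 1_000_000
owns_car = rng.random(n) < 0.2  # assume 20% of the population owns a car

# Assumed preferences: car owners lean Republican, everyone else leans Democrat.
p_republican = np.where(owns_car, 0.65, 0.40)
votes_republican = rng.random(n) < p_republican

# Biased sample: ballots are only sent to car owners.
biased_sample = votes_republican[owns_car]

# Random sample of the same size, drawn from the whole population.
random_sample = rng.choice(votes_republican, size=biased_sample.size, replace=False)

population_share = votes_republican.mean()
biased_share = biased_sample.mean()
random_share = random_sample.mean()

print(f'Republican share in population:    {population_share:.3f}')
print(f'Republican share in biased sample: {biased_share:.3f}')
print(f'Republican share in random sample: {random_share:.3f}')
```

Both samples have the same size, but only the random one should land close to the population share.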

## Back to overfitting 

In ML you learn from data that is **not** the entire population.

![](overfitting.png)

So, how can we get our model to work on the entire population?

Answer: By hiding some data from our model, and saving it for future tests.

## Cross-validation

Cross-validation just means training your model on *parts* of the data and *testing* it on the rest.

In [2]:
import pandas as pd

df = pd.read_csv('science.csv')

In [3]:
spending = df['US science spending']
suicides = df['Suicides']

In [4]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

In [5]:
from sklearn.model_selection import train_test_split

x_train, _, y_train, _ = train_test_split(spending, suicides)

In [6]:
model.fit(x_train.values.reshape(-1, 1), y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [7]:
model.score(x_train.values.reshape(-1, 1), y_train)

0.982474866330483

Note: The model *has already seen* this data, so the score says nothing about how well it handles new data.

In [8]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(spending, suicides)

In [9]:
model.fit(x_train.values.reshape(-1, 1), y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [10]:
model.score(x_test.values.reshape(-1, 1), y_test)

0.9214063625371502

## Training and testing data

We now have a split between 
* **Training data**: the data that the model sees
* **Testing data**: the data that the model is tested against

Note: the model should **never** train on the testing data.

## The overfitting curve

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Overfitting_svg.svg/1280px-Overfitting_svg.svg.png" style="width:40%"/>
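
The curve can be sketched with a few lines of code. The example below is an assumption-laden illustration: the data is synthetic (a noisy sine wave), and polynomial degree stands in for "model complexity". More complex models fit the training data at least as well, while the score on held-out data typically stops improving or gets worse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data (not from science.csv): a sine wave plus noise.
x = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * x).ravel() + rng.normal(scale=0.2, size=30)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0)

# Fit polynomials of increasing degree and record (train, test) R^2.
scores = {}
for degree in (1, 4, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    scores[degree] = (model.score(x_train, y_train),
                      model.score(x_test, y_test))
    print(f'degree {degree:2d}: train {scores[degree][0]:.3f}, '
          f'test {scores[degree][1]:.3f}')
```

Compare the two columns: a large gap between training and testing score is the signature of overfitting.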

## K-fold training/testing data

What if the data we choose to use is not representative?

Why not just do this many times?

Like, $K$ times?

In [11]:
from sklearn.model_selection import KFold

KFold?

In [12]:
from sklearn.model_selection import KFold

folder = KFold(n_splits=10)

In [13]:
for training_indices, testing_indices in folder.split(spending):
    print(training_indices, testing_indices)

[ 2  3  4  5  6  7  8  9 10] [0 1]
[ 0  1  3  4  5  6  7  8  9 10] [2]
[ 0  1  2  4  5  6  7  8  9 10] [3]
[ 0  1  2  3  5  6  7  8  9 10] [4]
[ 0  1  2  3  4  6  7  8  9 10] [5]
[ 0  1  2  3  4  5  7  8  9 10] [6]
[ 0  1  2  3  4  5  6  8  9 10] [7]
[ 0  1  2  3  4  5  6  7  9 10] [8]
[ 0  1  2  3  4  5  6  7  8 10] [9]
[0 1 2 3 4 5 6 7 8 9] [10]
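
Note that the folds above are taken in order. If the rows are sorted (for example by year), each fold covers only one stretch of the data. A hedged sketch of one way around this: `KFold` accepts `shuffle=True` and a `random_state`. The array below is a stand-in for 11 rows, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(11)  # stand-in for 11 rows of data

# Shuffle the row order before splitting; random_state makes it repeatable.
folder = KFold(n_splits=5, shuffle=True, random_state=0)

test_folds = []
for training_indices, testing_indices in folder.split(data):
    print(training_indices, testing_indices)
    test_folds.append(testing_indices)

# Every row is used for testing exactly once across the folds.
all_test = np.sort(np.concatenate(test_folds))
```

Each row still ends up in exactly one test fold; only the grouping changes.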


In [14]:
for training_indices, testing_indices in folder.split(spending):
    # Inputs come from spending, targets from suicides
    x_train = spending.iloc[training_indices]
    y_train = suicides.iloc[training_indices]
    x_test = spending.iloc[testing_indices]
    y_test = suicides.iloc[testing_indices]

In [15]:
for training_indices, testing_indices in folder.split(spending):
    # Inputs come from spending, targets from suicides
    x_train = spending.iloc[training_indices]
    y_train = suicides.iloc[training_indices]
    x_test = spending.iloc[testing_indices]
    y_test = suicides.iloc[testing_indices]

    # Train a fresh model on this fold and score it on the held-out rows
    model = LinearRegression()
    model.fit(x_train.values.reshape(-1, 1), y_train)
    print('Score:', model.score(x_test.values.reshape(-1, 1), y_test))


## Exercise

* Create 1000 random input numbers with `np.random.random_sample((1000, 1))`
* Create 1000 random output numbers with `np.random.random_sample((1000, 1))`
* Create a 10-fold cross-validation object `KFold`
* Loop through each fold and extract the `x_train`, `y_train`, `x_test` and `y_test` variables
* Train your model with the *training* data, and print the score of your model using the *testing* data

## Sklearn `cross_val_score`

By the way, this can be done much more simply:

In [16]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

model = LinearRegression()

In [17]:
cross_val_score(model, spending.values.reshape(-1, 1), suicides, cv=8)

array([-4.43825938,  0.92060946,  0.72126823,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ])

Note: with 11 rows and `cv=8`, the last five folds contain a single test sample each. $R^2$ is not well defined for a single sample, which is why those folds show up as 0 here.

In [18]:
import sklearn
sklearn.metrics.SCORERS.keys()

dict_keys(['explained_variance', 'r2', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'accuracy', 'roc_auc', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'brier_score_loss', 'adjusted_rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted'])

In [19]:
cross_val_score(model, spending.values.reshape(-1, 1), suicides, cv=5, scoring='r2')

array([ 0.19980696, -0.85358847, -8.53764768,  0.55620809, -3.2176773 ])
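
With only a handful of rows, the scores per fold jump around a lot. A common way to read them is to look at the mean and the spread across folds. The sketch below assumes synthetic data (a noisy straight line with 100 points) and passes a shuffled `KFold` object as `cv`; the names `x`, `y` and `cv` are illustrative, not part of the dataset above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic data (not from science.csv): y = 2x + 1 plus noise.
x = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * x.ravel() + 1.0 + rng.normal(scale=1.0, size=100)

# cv accepts a splitter object as well as a plain number of folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), x, y, cv=cv, scoring='r2')

print('R^2 per fold:', scores)
print(f'Mean: {scores.mean():.3f}, std: {scores.std():.3f}')
```

A small standard deviation means the model performs similarly no matter which part of the data it was tested on.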