In [30]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## The problem of overfitting

![](images/overfitting.png)

## Populations

![](images/pop1.png)

![](images/pop2.png)

![](images/pop3.png)

## Your model versus the population

A sample is a **subset** of a population.

You will likely **never** have data that covers the entire population.

That means that you will likely **never** be able to represent the entire population!

Your model will lie!

## Carowners and voters

In 1936 *millions* of mock ballots were mailed to car owners across the USA, to learn who would win the presidential election.

The Republican candidate was the *clear* winner in the mock ballots, but the Democrat won the election.

What was wrong?

## The problem of generalisation

If X% of a sample has Y, it does **not** follow that X% of the population has Y!

**Always** ask yourself: is your data representative?
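The car owner story can be imitated with a small simulation. This is only a sketch: the population size, the share of car owners, and the voting preferences below are made-up numbers chosen for illustration, not historical data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 200,000 voters, about 30% of whom own a car.
n_voters = 200_000
owns_car = rng.random(n_voters) < 0.30

# Assumed preferences (invented for this sketch):
# car owners lean Republican, everyone else leans Democrat.
p_republican = np.where(owns_car, 0.60, 0.35)
votes_republican = rng.random(n_voters) < p_republican

# Biased sample: ballots are mailed to car owners only.
biased_sample = rng.choice(np.flatnonzero(owns_car), size=10_000, replace=False)
# Representative sample: drawn uniformly from the whole population.
random_sample = rng.choice(n_voters, size=10_000, replace=False)

population_share = votes_republican.mean()
biased_share = votes_republican[biased_sample].mean()
random_share = votes_republican[random_sample].mean()

print(f"Republican share in population:    {population_share:.3f}")
print(f"Republican share in biased sample: {biased_share:.3f}")
print(f"Republican share in random sample: {random_share:.3f}")
```

The biased sample is just as large as the random one, yet only the random sample lands close to the population value. More data does not fix a sample that is not representative.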

## Back to overfitting 

In ML you learn from data that is **not** the entire population.

![](images/overfitting.png)

So, how can we get our model to work on the entire population?

Answer: By hiding some data from our model, and saving it for future tests.

## Training and testing data

We now have a split between:
* **Training data**: the data that the model sees
* **Testing data**: the data that the model is tested against

Note: the model should **never** train on the testing data.
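To make the idea concrete, a split can be done by hand: shuffle the row indices, then cut them into two groups that do not overlap. The toy arrays and the test size of 3 below are arbitrary choices for illustration; in practice a library helper does the same job.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 10 samples with 2 features each, plus one target per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle the row indices, then cut them into a test part and a training part.
indices = rng.permutation(len(X))
n_test = 3
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print("training rows:", sorted(train_idx.tolist()))
print("testing rows: ", sorted(test_idx.tolist()))
```

Every row ends up in exactly one of the two sets, so the model cannot see the testing rows while it trains.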

## Sklearn `train_test_split`

Splitting the data into training and testing sets lets you check whether your model generalises to data it has not seen.

But it **does not guarantee it**!

In [1]:
from sklearn.datasets import load_iris

In [2]:
X = load_iris().data
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [3]:
y = load_iris().target
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [4]:
load_iris().target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [5]:
from sklearn.model_selection import train_test_split
train_test_split(X, y)

[array([[6.3, 2.5, 5. , 1.9],
        [6.3, 2.9, 5.6, 1.8],
        [5.7, 3. , 4.2, 1.2],
        [5.1, 3.8, 1.5, 0.3],
        [7.3, 2.9, 6.3, 1.8],
        [5.8, 2.7, 5.1, 1.9],
        [6. , 2.2, 5. , 1.5],
        [5.8, 4. , 1.2, 0.2],
        [6.6, 3. , 4.4, 1.4],
        [5.6, 3. , 4.1, 1.3],
        [6.3, 2.5, 4.9, 1.5],
        [5. , 3. , 1.6, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [7.7, 2.6, 6.9, 2.3],
        [5.2, 2.7, 3.9, 1.4],
        [4.3, 3. , 1.1, 0.1],
        [5.5, 2.4, 3.7, 1. ],
        [6.4, 2.8, 5.6, 2.1],
        [5.1, 2.5, 3. , 1.1],
        [5. , 3.4, 1.6, 0.4],
        [4.9, 2.5, 4.5, 1.7],
        [6.3, 3.3, 4.7, 1.6],
        [4.9, 3.6, 1.4, 0.1],
        [6.1, 2.8, 4. , 1.3],
        [5.7, 2.5, 5. , 2. ],
        [6.9, 3.1, 5.1, 2.3],
        [4.9, 3. , 1.4, 0.2],
        [6. , 3. , 4.8, 1.8],
        [4.8, 3. , 1.4, 0.1],
        [6.3, 2.3, 4.4, 1.3],
        [7.7, 2.8, 6.7, 2. ],
        [5.7, 2.9, 4.2, 1.3],
        [6. , 2.9, 4.5, 1.5],
        [5

In [8]:
# split X and y into training data (75% by default) and test data (25% by default)
# the model learns from the training data; the test data checks whether it learned correctly
x_train, x_test, y_train, y_test = train_test_split(X, y)
print(y_train)

[0 1 0 2 1 0 0 1 2 1 1 2 0 1 1 0 1 0 0 2 2 0 1 1 2 0 2 1 0 0 2 2 2 2 1 1 2
 0 2 1 1 1 0 0 0 0 2 2 2 0 2 1 0 0 0 0 0 1 0 1 0 2 0 2 0 2 1 2 0 1 1 1 0 2
 2 2 1 0 2 1 1 1 1 1 1 2 1 2 0 0 1 0 0 1 2 0 1 1 2 0 0 0 1 0 1 1 1 2 0 2 2
 1]
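The split above is random, so it changes on every run. A sketch of three optional `train_test_split` parameters that make it easier to control: `test_size` sets the fraction held out, `random_state` makes the shuffle reproducible, and `stratify` keeps the class proportions similar in both parts. The seed value 0 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows, fix the shuffle, and balance the three classes.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print("training shape:", x_train.shape)
print("testing shape: ", x_test.shape)
print("classes in test set:", np.bincount(y_test))

# The same random_state gives the same split again.
x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print("identical split:", np.array_equal(x_test, x_test_2))
```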


In [9]:
# fit a linear regression model (this treats the class labels 0, 1, 2 as numbers)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [10]:
model.predict(x_train)

array([-0.01699789,  1.05870907, -0.00934693,  1.56238627,  1.28277086,
        0.12475026, -0.06670707,  1.30599488,  1.57598146,  1.27835643,
        1.18864536,  1.98679138, -0.02404528,  1.0189839 ,  1.3739602 ,
       -0.07964716,  1.19161032, -0.09670362, -0.129608  ,  1.76207635,
        1.8700417 ,  0.09679758,  1.36101825,  1.42302128,  1.64449836,
       -0.16138788,  1.90207138,  1.19953241,  0.02859678,  0.18326697,
        1.48830538,  1.73654435,  1.80857406,  1.9438798 ,  1.0242749 ,
        0.93634529,  2.04094191, -0.01436306,  1.78823139,  1.15483936,
        0.8657702 ,  0.97250753,  0.0426953 , -0.05170879, -0.10462571,
       -0.08905452,  1.58862163,  2.14030314,  1.99002605, -0.0820034 ,
        1.57538162,  1.18511078, -0.07904732, -0.06081624, -0.01905421,
       -0.04699258, -0.02730315,  1.5689322 , -0.09022914,  1.28008964,
       -0.17315149,  1.44155228,  0.10092827,  1.72269749, -0.03407755,
        1.51064583,  1.21510547,  1.76626417, -0.02963806,  1.18

In [None]:
model.score(x_train, y_train)

In [None]:
model.score(x_test, y_test)

## Evaluating a model

* Models are supposed to be as accurate as possible
  * `model.score`
  * Read the documentation

* But not *too* accurate on the training data
  * Overfitting
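What `model.score` measures depends on the kind of model: for scikit-learn regressors it is R², for classifiers it is accuracy. The sketch below checks this against the matching functions in `sklearn.metrics`. The choice of `KNeighborsClassifier` is an assumption made for illustration; any classifier would do.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Regressor: score() is the R² of the predictions.
regressor = LinearRegression().fit(x_train, y_train)
r2_from_score = regressor.score(x_test, y_test)
r2_from_metric = r2_score(y_test, regressor.predict(x_test))

# Classifier: score() is the fraction of correctly predicted labels.
classifier = KNeighborsClassifier().fit(x_train, y_train)
acc_from_score = classifier.score(x_test, y_test)
acc_from_metric = accuracy_score(y_test, classifier.predict(x_test))

print("regressor: ", r2_from_score, r2_from_metric)
print("classifier:", acc_from_score, acc_from_metric)
```

The two numbers are not comparable with each other, which is one reason to read the documentation before interpreting a score.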

## The overfitting curve

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Overfitting_svg.svg/1280px-Overfitting_svg.svg.png" style="width:40%"/>
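A curve like this can be approximated by training models of growing complexity and scoring each on both data sets. The sketch below uses synthetic noisy sine data and a decision tree whose depth is increased step by step; the data, the noise level, and the model are all assumptions picked for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data: a sine wave plus random noise.
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

depths = [1, 2, 3, 5, 8, None]  # None lets the tree grow until every leaf is pure
train_scores, test_scores = [], []
for depth in depths:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(x_train, y_train)
    train_scores.append(model.score(x_train, y_train))
    test_scores.append(model.score(x_test, y_test))
    print(f"max_depth={str(depth):>4}  train={train_scores[-1]:.3f}  test={test_scores[-1]:.3f}")
```

The fully grown tree memorises the training data, noise included, so its training score is perfect while its testing score is not. A widening gap between the two scores is the usual sign of overfitting.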

## Exercise

* Import `science.csv` to a pandas DataFrame
* Split the input (X) and target (y) using `train_test_split`
* Train the model on the training data
* Score the model based on the testing data

In [20]:
# Goal: predict suicides (y, the dependent variable) from US science spending (x, the independent variable)
from sklearn.linear_model import LinearRegression
import pandas as pd

df = pd.read_csv('data/science.csv')
print(df['US science spending'])
x = df['US science spending'].values.reshape(-1, 1)  # sklearn expects 2D input: reshape(-1, 1) turns the 11 values into an (11, 1) column, same as reshape(11, 1) here
print(x)
y = df['Suicides']
x_train, x_test, y_train, y_test = train_test_split(x, y)
print('4 sets of data: \n',x_train, x_test, y_train, y_test)

0     18079
1     18594
2     19753
3     20734
4     20831
5     23029
6     23597
7     23584
8     25525
9     27731
10    29449
Name: US science spending, dtype: int64
[[18079]
 [18594]
 [19753]
 [20734]
 [20831]
 [23029]
 [23597]
 [23584]
 [25525]
 [27731]
 [29449]]
4 sets of data: 
 [[18594]
 [25525]
 [23029]
 [20831]
 [18079]
 [29449]
 [23597]
 [19753]] [[20734]
 [23584]
 [27731]] 1     5688
8     8161
5     7336
4     6635
0     5427
10    9000
6     7248
2     6198
Name: Suicides, dtype: int64 3    6462
7    7491
9    8578
Name: Suicides, dtype: int64


In [21]:
model = LinearRegression()
model.fit(x_train, y_train)
model.score(x_test, y_test)
# For a regressor, model.score returns R² (the coefficient of determination):
# 1.0 is a perfect fit, 0 is no better than always predicting the mean, and the value can be negative.

0.9882538413933657
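With only 11 rows, the test set holds just 3 points, so the score depends on which rows happen to land in it. The sketch below repeats the split with a few different seeds. The values are copied from the printed output above so that the block runs without `science.csv`; the seeds 0 to 4 are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Values copied from the printed DataFrame columns above.
spending = np.array([18079, 18594, 19753, 20734, 20831, 23029,
                     23597, 23584, 25525, 27731, 29449]).reshape(-1, 1)
suicides = np.array([5427, 5688, 6198, 6462, 6635, 7336,
                     7248, 7491, 8161, 8578, 9000])

scores = []
for seed in range(5):
    x_train, x_test, y_train, y_test = train_test_split(
        spending, suicides, random_state=seed
    )
    model = LinearRegression().fit(x_train, y_train)
    scores.append(model.score(x_test, y_test))

print(np.round(scores, 3))
```

A single score from a single small split is therefore a rough estimate, not a fixed property of the model.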

## Other sklearn metrics

`model.score` uses a *default* metric (R² for regressors, accuracy for classifiers). But there are numerous others.

https://sklearn.org/modules/classes.html#module-sklearn.metrics

Metrics usually depend on the type of your model (classification, regression, etc.)

* **READ THE DOCUMENTATION**

In [22]:
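import sklearn.metrics  # import the submodule explicitly; a plain `import sklearn` may not load it
# Note: SCORERS was removed in newer scikit-learn releases; there, use sklearn.metrics.get_scorer_names() instead
sklearn.metrics.SCORERS.keys()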

dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'accuracy', 'roc_auc', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'brier_score_loss', 'adjusted_rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_weighted'])
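Each metric is also available as a plain function that takes the true values and the predictions. A minimal sketch with three common regression metrics, using a tiny hand-made example (the numbers are illustrative only):

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hand-made true values and predictions.
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mae = mean_absolute_error(y_true, y_pred)  # average absolute error
mse = mean_squared_error(y_true, y_pred)   # average squared error
r2 = r2_score(y_true, y_pred)              # the metric used by model.score for regressors

print(f"MAE: {mae}")      # 0.5
print(f"MSE: {mse}")      # 0.375
print(f"R²:  {r2:.4f}")   # 0.9486
```

MAE and MSE are errors (lower is better), while R² is a score (higher is better), so check the direction of a metric before comparing models with it.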