## How to protect yourself from overfitting ##

Underfitting is rare in practice; when it does occur, it usually stems from a lack of relevant data or from other problems that data science cannot solve directly. The real enemy of the data scientist is overfitting (also called overlearning): the model appears to perform well on the data it was trained on, but that performance is an illusion and a trap.

A simple and effective way to avoid this trap is k-fold cross-validation. Choose an integer k (10 is a common default) and randomly distribute the observations into k groups of equal size. Then repeat the following steps k times, once for each group i:



* Isolate one group i among the k groups, called the **test set**, and gather the k - 1 others into the **training set**.
* Fit the chosen model on the training set.
* Compute the error made by model i on the test set (group i) and compare it to the error it makes on the training set after optimization.

Comparing the training error with the test error shows the real explanatory power of a model, because it quantifies how the model performs on unseen data relative to data it has already seen. A test error far above the training error is the sign of overfitting.
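The steps listed above can be written out by hand. The following is a minimal sketch, not a reference implementation: the noisy sine data, the polynomial model fitted with `np.polyfit`, the helper names `mse` and `kfold_errors`, and the fixed seed are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible

# Hypothetical toy data: a noisy sine curve
x = rng.uniform(-3, 3, size=60)
y = np.sin(x) + rng.normal(scale=0.3, size=60)

k = 10
# Randomly distribute the observation indices into k groups of equal size
folds = np.array_split(rng.permutation(len(x)), k)

def mse(coefs, x_part, y_part):
    """Mean squared error of the polynomial with coefficients `coefs`."""
    return np.mean((np.polyval(coefs, x_part) - y_part) ** 2)

def kfold_errors(degree):
    """Average training and test MSE of a polynomial fit over the k folds."""
    train_errors, test_errors = [], []
    for i in range(k):
        test_idx = folds[i]  # group i is the test set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit the model on the training set only
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        train_errors.append(mse(coefs, x[train_idx], y[train_idx]))
        test_errors.append(mse(coefs, x[test_idx], y[test_idx]))
    return np.mean(train_errors), np.mean(test_errors)

results = {degree: kfold_errors(degree) for degree in (1, 3, 9)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree={degree}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Because the folds are fixed, the training error can only decrease as the polynomial degree grows, while the test error has no such guarantee. A widening gap between the two columns is what overfitting typically looks like.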


![over_learning](https://drive.google.com/uc?export=view&id=1Dx7vSLocMMcjiYmowVS2GzKXbvxgkN5l)


The figure above illustrates the principle of k-fold cross-validation. Each iteration yields a test error and a training error, which are used to evaluate the model. These errors are usually computed with the cost function chosen to optimize the model, or simply as the mean of the squared errors. In general, the training and test errors are expected to be of the same order of magnitude, and the overall error should be small relative to the values taken by the target variable.
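As an illustration of these two checks, the sketch below collects the training and test mean squared error of each fold with scikit-learn's `cross_validate`, then relates the test error to the spread of the target. The synthetic regression data, the names `X_reg` and `y_reg`, and the choice of dividing the RMSE by the standard deviation of the target are assumptions for the example; other normalizations are possible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Hypothetical regression data: a linear signal plus Gaussian noise
X_reg = rng.normal(size=(200, 3))
y_reg = X_reg @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# scikit-learn maximizes scores, so the MSE is exposed as its negative
cv_results = cross_validate(
    LinearRegression(), X_reg, y_reg,
    cv=10,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)

train_mse = -cv_results["train_score"]  # one value per fold
test_mse = -cv_results["test_score"]

# Put the error on the scale of the target: RMSE divided by the spread of y
relative_rmse = np.sqrt(test_mse.mean()) / y_reg.std()

print("train MSE per fold:", np.round(train_mse, 3))
print("test MSE per fold: ", np.round(test_mse, 3))
print("test RMSE relative to std of y: {:.2f}".format(relative_rmse))
```

If the two per-fold series are of the same order of magnitude and the relative RMSE is well below 1, the model is neither overfitting badly nor making errors as large as the target's own variation.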

### Cross Validation in Python

In [7]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Create random features and random binary labels
X = np.random.randn(100, 2)
y = np.random.randint(0, 2, 100)

# Hold out part of the data for testing (25% by default)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the model on the training set only
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = classifier.predict(X_test)

In [8]:
# k-fold cross-validation
from sklearn.model_selection import KFold
kf = KFold(n_splits=10)

# Generate the 10 train/validation splits of a 10-fold cross-validation
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# cross_val_score quickly returns the accuracy of the model fitted on each fold;
# cross_val_predict returns, for each observation, the prediction made
# while that observation was in the validation fold.
# Note: with a classifier, cv=10 uses stratified folds rather than kf above.
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics

scores = cross_val_score(classifier, X, y, cv=10)
predictions = cross_val_predict(classifier, X, y, cv=10)

Train: [10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99] Validation: [0 1 2 3 4 5 6 7 8 9]
Train: [ 0  1  2  3  4  5  6  7  8  9 20 21 22 23 24 25 26 27 28 29 30 31 32 33
 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99] Validation: [10 11 12 13 14 15 16 17 18 19]
Train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 30 31 32 33
 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99] Validation: [20 21 22 23 24 25 26 27 28 29]
Train: [ 0  1  2  3  4  5  6  7 

In [10]:
print("Scores for each cross_validation:\n {}".format(scores))
print("Predictions for each cross_validation:\n {}".format(predictions))

Scores for each cross_validation:
 [0.6 0.6 0.4 0.4 0.4 0.3 0.4 0.4 0.7 0.5]
Predictions for each cross_validation:
 [0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 1
 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1
 0 1 1 1 1 1 1 0 0 0 1 1 0 0 0 0 1 1 0 1 0 0 0 0 0 1]
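The ten scores printed above average 0.47, which is what one would expect from labels drawn at random: there is nothing to learn, and cross-validation makes that visible. The sketch below summarizes the per-fold scores by their mean and standard deviation. It also passes `shuffle=True` to `KFold` so that observations are assigned to folds at random, as in the procedure described earlier, instead of by position as in the splits printed above. The seeds and the regenerated data are assumptions for illustration, so the numbers will differ from the output above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Same kind of data as in the cells above: random features, random binary labels
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)

# shuffle=True assigns observations to folds at random instead of by position
kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)

# Summarize the k per-fold accuracies by their mean and spread
print("mean accuracy: {:.2f} (std {:.2f})".format(scores.mean(), scores.std()))
```

Reporting the spread alongside the mean matters with small folds: here each fold holds only 10 observations, so a single fold's accuracy can move in steps of 0.1.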
