# Combat overfitting and prevent biasing your model

In this notebook and in the next ones, you'll see different ways to combat overfitting.

In machine learning, we want a model to predict the class of data that it has never seen: we want it to **generalize** to unseen data rather than memorize the training data.

Overfitting occurs when the model does the opposite: it memorizes the training data and, as a result, cannot correctly classify data that it has never seen.



## Train-test split

As you saw in the previous notebook, you can simulate *unseen* data by using a train-test split and measuring the accuracy on the test set.
This is usually done when you have a lot of data and can easily spare some of it to form a testing set.

In [1]:
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split

dataset = datasets.load_breast_cancer()
X, y = dataset.data, dataset.target

In [None]:
# Splitting dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=41, test_size=0.3)

In [None]:
# Create the classifier and fit it to the training data (training)
classifier = DecisionTreeClassifier(random_state=1)
classifier.fit(X_train, y_train)

# Evaluate the classifier on the training set and on the testing set
score = classifier.score(X_train, y_train)
print("Evaluating the model on the training set yields an accuracy of {:.2f}%".format(score*100))
score = classifier.score(X_test, y_test)
print("Evaluating the model on the testing set yields an accuracy of {:.2f}%".format(score*100))

Moreover, changing `random_state`, which controls how the samples are distributed between the training and the testing set, changes the testing accuracy.
Because the training and testing sets contain different samples each time, the model learns from different data and is evaluated on different samples.

In [17]:
# Repeat the actions above for different random states
for random_state in range(4):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state, test_size=0.3)
    classifier = DecisionTreeClassifier(random_state=1)
    classifier.fit(X_train, y_train)
    score = classifier.score(X_test, y_test)
    print("Evaluating the model on the testing set yields an accuracy of {:.2f}% with random state {}".format(score*100, random_state))

Evaluating the model on the testing set yields an accuracy of 93.57% with random state 0
Evaluating the model on the testing set yields an accuracy of 92.98% with random state 1
Evaluating the model on the testing set yields an accuracy of 92.40% with random state 2
Evaluating the model on the testing set yields an accuracy of 94.74% with random state 3


A difference of about 2 percentage points is not negligible, especially when you are trying to create state-of-the-art (SOTA) models, that is, models that are the best in their category. One popular competition in which research teams around the world try to build the best model is [ILSVRC](http://image-net.org/challenges/LSVRC/), which is based on the ImageNet dataset.
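
To get a feel for how much the testing accuracy moves around, the same experiment can be repeated over more random states and summarized. The following is a minimal sketch, not part of the original experiment: the choice of 20 random states and the name `test_scores` are arbitrary assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)

# Collect the testing accuracy obtained with each split
test_scores = []
for random_state in range(20):  # 20 splits is an arbitrary, illustrative choice
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state, test_size=0.3
    )
    classifier = DecisionTreeClassifier(random_state=1)
    classifier.fit(X_train, y_train)
    test_scores.append(classifier.score(X_test, y_test))

test_scores = np.array(test_scores)

# Summarize the spread of the testing accuracy across splits
print("Min: {:.2f}%  Max: {:.2f}%".format(test_scores.min() * 100, test_scores.max() * 100))
print("Mean: {:.2f}% (+/- {:.2f})".format(test_scores.mean() * 100, test_scores.std() * 100))
```

The gap between the minimum and the maximum gives a rough idea of how much a single train-test split can mislead you about the model's real performance.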

## Cross-validation
When you don't have a lot of data but still want to test a model, you can use cross-validation.
One way to get a single value that represents the performance of the model more reliably is to average the results over several different train-test splits.
This is called *k-fold cross-validation*, which uses `k` different train-test splits.
It splits the data set into `k` different groups, called folds.
The model is then trained on `k-1` folds and tested on the remaining one. This operation is repeated `k` times, so that each fold is used for testing exactly once while the model is trained on the other folds.

<img src="assets/crossvalidation.png" />

[[Image source]](https://www.researchgate.net/publication/331209203_Tectonic_discrimination_of_olivine_in_basalt_using_data_mining_techniques_based_on_major_elements_a_comparative_study_from_multiple_perspectives)
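
The procedure described above can also be written out by hand, which may make it easier to see what happens in each round. This is only a sketch under a few assumptions: it uses scikit-learn's `KFold` without shuffling, and the names `kfold` and `fold_scores` are chosen here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)

k = 5
kfold = KFold(n_splits=k)  # no shuffling: each fold is a contiguous block of samples

fold_scores = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    # Train on the k-1 training folds, test on the remaining fold
    classifier = DecisionTreeClassifier(random_state=1)
    classifier.fit(X[train_idx], y[train_idx])
    score = classifier.score(X[test_idx], y[test_idx])
    fold_scores.append(score)
    print("Fold {}: trained on {} samples, tested on {} samples, accuracy {:.2f}%".format(
        fold, len(train_idx), len(test_idx), score * 100))

fold_scores = np.array(fold_scores)
print("Accuracy: {:.2f}% (+/- {:.2f})".format(fold_scores.mean() * 100, fold_scores.std() * 100))
```

Note that when `cross_val_score` is given a classifier and an integer `cv`, it uses stratified folds, so its scores will generally differ from the ones produced by this plain `KFold` loop.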

In [18]:
from sklearn.model_selection import cross_val_score
classifier = DecisionTreeClassifier(random_state=1)
scores = cross_val_score(classifier, X, y, cv=5) # cv is the number of folds (k)
print(scores)

# It is good practice to report the mean AND the standard deviation of the model accuracy
print("Accuracy: {:.2f}% (+/- {:.2f})".format(scores.mean() * 100, scores.std() * 100))

[0.90350877 0.90350877 0.92105263 0.94736842 0.91150442]
Accuracy: 91.74% (+/- 1.63)


You can find many other cross-validation strategies in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html).
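
As one example of such a strategy, a splitter object can be passed to `cross_val_score` through the `cv` parameter instead of an integer. The sketch below uses `StratifiedKFold` with shuffling, which keeps the class proportions roughly equal in every fold. The values `n_splits=5` and `random_state=0` are arbitrary choices made for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)

# Shuffle the samples before splitting and preserve the class balance in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

classifier = DecisionTreeClassifier(random_state=1)
scores = cross_val_score(classifier, X, y, cv=cv)

print(scores)
print("Accuracy: {:.2f}% (+/- {:.2f})".format(scores.mean() * 100, scores.std() * 100))
```

Changing `random_state` in the splitter changes which samples end up in each fold, so the individual fold scores will vary, but the mean should be more stable than a single train-test split.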

## References and additional reading material
<a id='references'></a>

[Difference between test and validation set - Machine Learning Mastery](https://machinelearningmastery.com/difference-test-validation-datasets/)