# Overfitting

In this notebook, we are going to tackle a major problem in machine learning: overfitting.

*The possibility of over-fitting exists because the criterion used for selecting the model is not the same as the criterion used to judge the suitability of a model. For example, a model might be selected by maximizing its performance on some set of training data, and yet its suitability might be determined by its ability to perform well on unseen data; then* **over-fitting occurs when a model begins to "memorize" training data rather than "learning" to generalize from a trend.**



To illustrate this, we're going to use the popular Iris dataset, a dataset of flower measurements (150 samples, 4 features, 3 species) that `sklearn` makes easy to load.

In [106]:
import tensorflow as tf
import numpy as np
from matplotlib import pyplot as plt
import sklearn
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [107]:
# Loading the data set
dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

# Here you see that there are 3 different classes
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
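
The printed array shows that the labels are stored sorted by class. A quick way to confirm how many samples each class has is to count them. This is only a small sketch; the variable name `class_counts` is introduced here for illustration and is not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

# Reload the data so that this snippet runs on its own
dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

# np.bincount counts how often each integer label (0, 1, 2) occurs
class_counts = np.bincount(y)
for name, count in zip(dataset.target_names, class_counts):
    print(f"{name}: {count} samples")
```

Because the three classes are equally sized and stored one after the other, slicing the arrays by position (as done in the next cell) selects whole classes rather than a random mix.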

In [108]:
# To show an extreme case of overfitting, we're going to take all the samples of classes 0 and 1,
# and only a single sample from class 2, as the training set

limit = 101
X_train, X_test = X[:limit], X[limit:]
y_train, y_test = y[:limit], y[limit:]
y_train

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2])

In [109]:
print("X_train shape : ", X_train.shape)
print("X_test shape : ", X_test.shape)

X_train shape :  (101, 4)
X_test shape :  (49, 4)
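
The shapes alone do not reveal how unbalanced this split is. As a sketch, the class counts on each side of the split can be printed directly; `train_counts` and `test_counts` are names made up for this illustration.

```python
import numpy as np
from sklearn import datasets

dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

# Same positional split as above: the first 101 samples are used for training
limit = 101
y_train, y_test = y[:limit], y[limit:]

# minlength=3 keeps a slot for every class, even when a class is absent
train_counts = np.bincount(y_train, minlength=3)
test_counts = np.bincount(y_test, minlength=3)
print("Samples per class in the training set:", train_counts)
print("Samples per class in the testing set: ", test_counts)
```

The training set holds 50, 50 and 1 samples of classes 0, 1 and 2, while the testing set consists of the 49 remaining samples of class 2.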


In [110]:
# In this example, we're going to use a simple decision tree to classify the 3 classes.
# Here, we initialize the model and fit it to the training data
classifier = DecisionTreeClassifier(random_state=1)
classifier.fit(X_train, y_train)


DecisionTreeClassifier(random_state=1)

In [111]:
score = classifier.score(X_train, y_train)
print("Evaluating the model on the training set yields an accuracy of {}%".format(score*100))
score=classifier.score(X_test, y_test)
print("Evaluating the model on the testing set yields an accuracy of {:.2f}%".format(score*100))

Evaluating the model on the training set yields an accuracy of 100.0%
Evaluating the model on the testing set yields an accuracy of 48.98%


What do we see here?

We have trained the model on the training set, which, as a reminder, only contains classes 0 and 1 and a single example of class 2.

Testing the model on the training set yields perfect accuracy. Indeed, the model has already seen these samples, and the whole model was built to maximize the training accuracy, so a very high accuracy on the training set is expected.

However, on the testing set, which contains only samples of class 2, the accuracy is poor: about half of the samples are misclassified. Why is that?
The problem is that the model has only seen a single example of class 2, so the rule it learned for that class is tailored to that one sample and does not generalize to the other class 2 flowers. It **overfitted** the training set.
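
To make this more concrete, one can look at what the tree predicts for the test samples and at the rules it learned. The snippet below is a sketch that rebuilds the unbalanced split and uses `export_text` from `sklearn.tree` to print the fitted tree; `predicted_counts` is a name introduced for this illustration. The exact rules and counts may vary between `sklearn` versions.

```python
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

# Rebuild the unbalanced split: a single class 2 sample in the training set
limit = 101
X_train, X_test = X[:limit], X[limit:]
y_train, y_test = y[:limit], y[limit:]

classifier = DecisionTreeClassifier(random_state=1)
classifier.fit(X_train, y_train)

# Every test sample truly belongs to class 2, so any prediction
# of class 0 or 1 counted here is a mistake
predictions = classifier.predict(X_test)
predicted_counts = np.bincount(predictions, minlength=3)
print("Predicted class counts on the testing set:", predicted_counts)

# Print the learned decision rules as plain text
print(export_text(classifier, feature_names=list(dataset.feature_names)))
```

In the printed tree, the branch leading to class 2 is derived from a single training sample, which is why the threshold it uses need not separate the remaining class 2 flowers from class 1.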

<img src="assets/overfitting.png" />

[[Image source]](https://www.educative.io/edpresso/overfitting-and-underfitting)

To combat this, we can shuffle the data before splitting it, so that the training and testing sets each contain a random mix of samples from every class.

In [112]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=1, test_size=0.2)
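
A shuffled split mixes the classes, but the number of samples per class in each set is still left to chance. As a sketch, the snippet below compares the class counts of the shuffled split with those obtained by passing the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both sets. The variable names ending in `_s` are made up for this illustration, and the stratified split is an alternative rather than what this notebook uses.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

# Shuffled split, as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, test_size=0.2)
print("Shuffled   - train:", np.bincount(y_tr, minlength=3),
      "test:", np.bincount(y_te, minlength=3))

# Stratified split: each class keeps the same share in both sets
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, random_state=1, test_size=0.2, stratify=y
)
print("Stratified - train:", np.bincount(y_tr_s, minlength=3),
      "test:", np.bincount(y_te_s, minlength=3))
```

With 50 samples per class and `test_size=0.2`, the stratified split places 40 samples of each class in the training set and 10 in the testing set.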

In [113]:
classifier.fit(X_train, y_train)
score = classifier.score(X_train, y_train)
print("Evaluating the model on the training set yields an accuracy of {}%".format(score*100))
score=classifier.score(X_test, y_test)
print("Evaluating the model on the testing set yields an accuracy of {:.2f}%".format(score*100))

Evaluating the model on the training set yields an accuracy of 100.0%
Evaluating the model on the testing set yields an accuracy of 96.67%


See how the accuracy on the testing set rose to be much closer to the training set accuracy?

In the next chapter, you'll see other ways in which overfitting can appear, and where you need to be careful when manipulating your data.
You'll also see several ways to prevent your model from overfitting.

## References and additional material

[Overfitting in machine learning - Elite Data Science](https://elitedatascience.com/overfitting-in-machine-learning)

[Overfitting and underfitting - Machine Learning Mastery](https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/)