# Early stopping to prevent overfitting.

These are commented and extended notes (with added examples, extra comments, and tested code) from a course on deep neural networks taught at Washington University in St. Louis. The instructor of this course was Jeff Heaton.


__Overfitting__ occurs when a neural network is trained to the point that it begins to memorize rather than generalize.

<div>
<img src="attachment:overfitting_curve.png" width="500"/>
</div>

__Credit__: Jeff Heaton.


It is important to segment the original data set into several datasets:

- Training set
- Validation set
- Holdout set

There are several ways to construct these subsets. The first method uses only a training set and a validation set.
The network is trained on the training set until its error on the validation set no longer improves. Looking at the curve above, we would like to stop near the optimal training point.

This method gives accurate "out of sample" predictions only for the validation set, which is usually about 20% of the data. Predictions for the training data will be overly optimistic, because these are the data the neural network was trained on.
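
The list above also mentions a holdout set, which the code in these notes does not use. As a minimal sketch of how all three subsets could be carved out, the helper below shuffles the sample indices and slices them. The function name `three_way_split` and the 60/20/20 proportions are assumptions made for illustration; they do not come from the course.

```python
import numpy as np

def three_way_split(n_samples, val_frac=0.2, holdout_frac=0.2, seed=42):
    """Return shuffled, non-overlapping index arrays (train, validation, holdout).

    Hypothetical helper for illustration only; the fractions are assumptions.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # shuffled indices 0..n_samples-1
    n_val = int(round(n_samples * val_frac))
    n_hold = int(round(n_samples * holdout_frac))
    holdout_idx = idx[:n_hold]
    val_idx = idx[n_hold:n_hold + n_val]
    train_idx = idx[n_hold + n_val:]          # everything that is left
    return train_idx, val_idx, holdout_idx

# 150 samples, the size of the Iris data set used later in these notes.
train_idx, val_idx, holdout_idx = three_way_split(150)
print(len(train_idx), len(val_idx), len(holdout_idx))  # 90 30 30
```

The index arrays can then select rows from the feature and target arrays, for example `x[train_idx]` and `y[train_idx]`. The holdout rows are never shown to the network during training or early stopping, so they give an estimate that the stopping decision has not influenced.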

![validation.png](attachment:validation.png)




# Implementation: early stopping with classification.

In [2]:
import pandas as pd
import io
import requests
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

# This is our previous example - Iris neural network

# loading the data set from Heaton's website

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/iris.csv", 
    na_values=['NA', '?'])


x = df[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
dummies = pd.get_dummies(df['species']) # Classification
species = dummies.columns
y = dummies.values

# New compared with the previous example: x_test and y_test.
# train_test_split (from scikit-learn) splits the data set.
#
# test_size: size of the test set. A float between 0.0 and 1.0 is a
#   fraction of the data; an int is an absolute number of samples.
#   If None, it is the complement of train_size, or 0.25 when
#   train_size is also None.
# train_size: size of the training set, given as None (the default),
#   an int (absolute number of samples), or a float between 0.0 and 1.0.
# random_state: None (the default) uses NumPy's global random state, so
#   the split changes between runs; pass an int for a reproducible split.

x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=42)

model = Sequential()
model.add(Dense(50, input_dim=x.shape[1], activation='relu')) # Hidden 1
model.add(Dense(25, activation='relu')) # Hidden 2
model.add(Dense(y.shape[1],activation='softmax')) # Output
# Adam optimizer (described in a 2015 paper by Kingma and Ba)
model.compile(loss='categorical_crossentropy', optimizer='adam')


# Also new in this example: the EarlyStopping callback.


monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, 
        verbose=1, mode='auto', restore_best_weights=True)
model.fit(x_train,y_train,validation_data=(x_test,y_test),
        callbacks=[monitor],verbose=2,epochs=1000)

Epoch 1/1000
4/4 - 0s - loss: 2.1972 - val_loss: 1.7421
Epoch 2/1000
4/4 - 0s - loss: 1.7584 - val_loss: 1.4154
Epoch 3/1000
4/4 - 0s - loss: 1.4062 - val_loss: 1.2009
Epoch 4/1000
4/4 - 0s - loss: 1.1932 - val_loss: 1.0958
Epoch 5/1000
4/4 - 0s - loss: 1.0754 - val_loss: 1.0411
Epoch 6/1000
4/4 - 0s - loss: 1.0142 - val_loss: 1.0053
Epoch 7/1000
4/4 - 0s - loss: 0.9650 - val_loss: 0.9661
Epoch 8/1000
4/4 - 0s - loss: 0.9249 - val_loss: 0.9224
Epoch 9/1000
4/4 - 0s - loss: 0.8835 - val_loss: 0.8672
Epoch 10/1000
4/4 - 0s - loss: 0.8377 - val_loss: 0.8072
Epoch 11/1000
4/4 - 0s - loss: 0.7894 - val_loss: 0.7541
Epoch 12/1000
4/4 - 0s - loss: 0.7505 - val_loss: 0.7092
Epoch 13/1000
4/4 - 0s - loss: 0.7186 - val_loss: 0.6677
Epoch 14/1000
4/4 - 0s - loss: 0.6857 - val_loss: 0.6293
Epoch 15/1000
4/4 - 0s - loss: 0.6528 - val_loss: 0.5956
Epoch 16/1000
4/4 - 0s - loss: 0.6221 - val_loss: 0.5643
Epoch 17/1000
4/4 - 0s - loss: 0.5928 - val_loss: 0.5365
Epoch 18/1000
4/4 - 0s - loss: 0.5669 - 

<tensorflow.python.keras.callbacks.History at 0x7faa78681be0>

Several parameters are passed to the **EarlyStopping** object.

* **monitor** The quantity to watch, here the validation loss `val_loss`.
* **min_delta** The minimum change in the monitored value that counts as an improvement. Keep it small; making it smaller still is unlikely to have much impact.
* **patience** The number of epochs that training waits for the validation error to improve before stopping.
* **verbose** How much progress information to print.
* **mode** Whether the monitored value should be minimized or maximized. Higher is better for accuracy, whereas lower is better for log-loss or RMSE. In general, leave this at "auto", which infers the direction from the name of the monitored quantity.
* **restore_best_weights** This should generally be set to True. It restores the weights from the epoch with the best (here, the lowest) validation loss. Unless you are manually tracking the weights yourself (we do not use this technique in this course), you should have Keras perform this step for you.

As the output above shows, not all of the requested epochs were used. Training stopped once the validation loss no longer improved.
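
To make the interplay of `min_delta` and `patience` concrete, here is a simplified, dependency-free sketch of the stopping rule for a loss that should be minimized. It is not the Keras implementation: the function name `early_stopping_epoch` and the validation losses are made up for illustration, and details such as baselines or maximized metrics are left out.

```python
def early_stopping_epoch(val_losses, min_delta=1e-3, patience=5):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Epochs are numbered from 1. stop_epoch is None if the rule never fires.
    Simplified illustration only, not the Keras source code.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement of at least min_delta
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            # Meaningful improvement: remember it and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Made-up validation losses: improvement stalls after epoch 4.
losses = [1.00, 0.80, 0.70, 0.65, 0.66, 0.67, 0.65, 0.66, 0.68, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(losses, min_delta=1e-3, patience=5)
print(f"stopped at epoch {stop_epoch}, best weights from epoch {best_epoch}")
# stopped at epoch 9, best weights from epoch 4
```

In this sketch, training stops five epochs after the last meaningful improvement. With `restore_best_weights=True`, the model would then be rolled back to the weights from epoch 4 rather than keeping those from epoch 9.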

In [4]:
# Measuring accuracy

from sklearn.metrics import accuracy_score

pred = model.predict(x_test)
predict_classes = np.argmax(pred,axis=1)
expected_classes = np.argmax(y_test,axis=1)
correct = accuracy_score(expected_classes,predict_classes)
print(f"Accuracy: {correct}")

Accuracy: 0.9736842105263158
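
The accuracy cell relies on `np.argmax` to turn both the softmax outputs and the one-hot targets into class indices. The small sketch below shows that step on hand-written numbers; the probabilities and labels are invented for illustration and are not model output.

```python
import numpy as np

# Invented softmax outputs for four samples and three classes.
probs = np.array([[0.90, 0.07, 0.03],
                  [0.10, 0.30, 0.60],
                  [0.20, 0.50, 0.30],
                  [0.60, 0.30, 0.10]])

# Invented one-hot encoded true labels for the same samples.
one_hot = np.array([[1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 0],
                    [0, 1, 0]])

predicted = np.argmax(probs, axis=1)    # index of the largest probability per row
expected = np.argmax(one_hot, axis=1)   # index of the 1 in each one-hot row
accuracy = np.mean(predicted == expected)

print(predicted, expected, accuracy)    # [0 2 1 0] [0 2 1 1] 0.75
```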


### Early Stopping with Regression

We can apply early stopping to a regression problem.  

In [8]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
import pandas as pd
import io
import os
import requests
import numpy as np
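from sklearn import metrics
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping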

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv", 
    na_values=['NA', '?'])

cars = df['name']

# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Pandas to Numpy
x = df[['cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin']].values # predictors
y = df['mpg'].values # dependent variable

# Split into validation and training sets
x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=42)

# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu')) # Hidden 1
model.add(Dense(10, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')

monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, 
        patience=5, verbose=1, mode='auto',
        restore_best_weights=True)
model.fit(x_train,y_train,validation_data=(x_test,y_test),
        callbacks=[monitor], verbose=2,epochs=1000)

Epoch 1/1000
10/10 - 0s - loss: 13604.1963 - val_loss: 1222.7080
Epoch 2/1000
10/10 - 0s - loss: 2614.7219 - val_loss: 1724.5574
Epoch 3/1000
10/10 - 0s - loss: 784.5842 - val_loss: 719.7219
Epoch 4/1000
10/10 - 0s - loss: 745.6928 - val_loss: 501.9683
Epoch 5/1000
10/10 - 0s - loss: 419.2739 - val_loss: 487.8710
Epoch 6/1000
10/10 - 0s - loss: 422.4509 - val_loss: 400.9056
Epoch 7/1000
10/10 - 0s - loss: 396.7486 - val_loss: 412.2203
Epoch 8/1000
10/10 - 0s - loss: 381.8182 - val_loss: 391.0242
Epoch 9/1000
10/10 - 0s - loss: 363.2636 - val_loss: 379.7307
Epoch 10/1000
10/10 - 0s - loss: 354.1170 - val_loss: 367.5441
Epoch 11/1000
10/10 - 0s - loss: 347.3123 - val_loss: 358.7933
Epoch 12/1000
10/10 - 0s - loss: 336.9625 - val_loss: 351.7666
Epoch 13/1000
10/10 - 0s - loss: 329.8938 - val_loss: 341.0068
Epoch 14/1000
10/10 - 0s - loss: 321.7982 - val_loss: 331.9789
Epoch 15/1000
10/10 - 0s - loss: 316.4487 - val_loss: 323.0038
Epoch 16/1000
10/10 - 0s - loss: 307.4712 - val_loss: 315.4

10/10 - 0s - loss: 30.3017 - val_loss: 24.7165
Epoch 134/1000
10/10 - 0s - loss: 30.9427 - val_loss: 24.7239
Epoch 135/1000
10/10 - 0s - loss: 29.9350 - val_loss: 23.7567
Epoch 136/1000
10/10 - 0s - loss: 29.1545 - val_loss: 23.6417
Epoch 137/1000
10/10 - 0s - loss: 29.2118 - val_loss: 26.1155
Epoch 138/1000
10/10 - 0s - loss: 30.4494 - val_loss: 23.9983
Epoch 139/1000
10/10 - 0s - loss: 28.4418 - val_loss: 22.8048
Epoch 140/1000
10/10 - 0s - loss: 28.3176 - val_loss: 23.6078
Epoch 141/1000
10/10 - 0s - loss: 28.0501 - val_loss: 24.5623
Epoch 142/1000
10/10 - 0s - loss: 27.6374 - val_loss: 22.3467
Epoch 143/1000
10/10 - 0s - loss: 27.3906 - val_loss: 22.0489
Epoch 144/1000
10/10 - 0s - loss: 26.8760 - val_loss: 24.3167
Epoch 145/1000
10/10 - 0s - loss: 27.2628 - val_loss: 21.8277
Epoch 146/1000
10/10 - 0s - loss: 26.5952 - val_loss: 21.8351
Epoch 147/1000
10/10 - 0s - loss: 26.3284 - val_loss: 21.2987
Epoch 148/1000
10/10 - 0s - loss: 26.4215 - val_loss: 21.5485
Epoch 149/1000
10/10 - 

<tensorflow.python.keras.callbacks.History at 0x7faa7871c250>

In [21]:
# Some comments about the last piece of code.

# This is a standard regression problem. The array y holds the dependent variable (mpg):

y 

# The matrix x holds the predictors:

x

# Here we do not classify; we apply regression. Notice that

pred = model.predict(x_test)
pred # the outputs are continuous values, not class probabilities.

array([[29.643208],
       [29.650608],
       [18.687214],
       [16.808949],
       [16.295704],
       [29.087452],
       [27.949024],
       [11.021015],
       [20.901356],
       [22.6375  ],
       [11.050739],
       [30.733326],
       [28.99187 ],
       [16.032887],
       [25.377308],
       [11.704334],
       [29.19701 ],
       [25.93015 ],
       [13.088245],
       [32.0602  ],
       [26.996616],
       [20.925968],
       [23.155643],
       [28.640415],
       [14.4547  ],
       [33.634342],
       [28.10088 ],
       [25.646137],
       [20.714924],
       [11.362629],
       [27.33936 ],
       [29.80991 ],
       [17.01247 ],
       [27.95406 ],
       [30.943592],
       [13.421223],
       [24.134022],
       [18.288013],
       [16.648304],
       [28.601871],
       [24.622532],
       [28.268345],
       [26.58025 ],
       [15.763569],
       [26.993671],
       [29.835285],
       [26.057087],
       [27.351322],
       [28.03638 ],
       [28.73145 ],


In [22]:
# Root Mean Square Error

pred = model.predict(x_test)
score = np.sqrt(metrics.mean_squared_error(y_test, pred))  # arguments in (y_true, y_pred) order
print(f"Final score (RMSE): {score}")

Final score (RMSE): 3.968743882514706
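
As a hedged aside, the same score can be computed by hand with NumPy, which also exposes a shape pitfall. As the printed array above shows, the predictions have shape `(n, 1)`, while `y_test` has shape `(n,)`. Subtracting them directly broadcasts to an `(n, n)` matrix instead of `n` errors. The numbers below are invented to show the effect.

```python
import numpy as np

y_true = np.array([20.0, 30.0, 25.0])        # shape (3,), like y_test
pred = np.array([[22.0], [27.0], [25.0]])    # shape (3, 1), like model.predict output

# Pitfall: (3, 1) minus (3,) broadcasts to a (3, 3) matrix of differences.
wrong = pred - y_true
print(wrong.shape)                           # (3, 3)

# Flatten the predictions first so the errors line up one-to-one.
errors = pred.flatten() - y_true             # [ 2. -3.  0.]
rmse = np.sqrt(np.mean(errors ** 2))         # square root of 13/3
print(f"RMSE: {rmse:.4f}")                   # RMSE: 2.0817
```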
