# Deep Learning - Day 2 - How to properly prevent overfitting

### Exercise objectives:
- Give a validation set to the model
- Use the stopping criterion to prevent the Neural network from overfitting

<hr>
<hr>

Yesterday, not everything was done properly, so let's get back to that.

# Data 

First, let's generate some data with the [`make_blobs`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html) function that we used yesterday.

❓ **Question** ❓ Generate 2000 samples, with 10 features each. There should be 8 classes of blobs (`centers` argument), with `cluster_std` equal to 7. Plot some dimensions to check your data.

In [None]:
import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
%matplotlib inline

### YOUR CODE HERE

❓ **Question** ❓ Use the `to_categorical` function from `tensorflow` to convert `y` to `y_cat` which is the categorical representation of `y` with one-hot encoding columns.

In [None]:
from tensorflow.keras.utils import to_categorical 

y_cat = ### YOUR CODE HERE
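
To make the target format concrete, here is a minimal NumPy sketch of what one-hot encoding produces. It is not the Keras implementation of `to_categorical`; the helper `one_hot` and the labels in `y_demo` are hypothetical names introduced only for illustration.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Return an (n_samples, num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    # Put a 1 in the column that matches each sample's class index
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded


y_demo = np.array([0, 2, 1, 2])  # hypothetical integer labels for 3 classes
y_demo_cat = one_hot(y_demo, num_classes=3)
print(y_demo_cat)
```

Each row has exactly one non-zero column, and taking the `argmax` of a row gives back the original integer label.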


# Part I : Valid cross-validation

Yesterday, we split the dataset into a train and a test set at the beginning of the notebook. Then we built different models, which were trained on the train set but evaluated on the test set.

So, at the end of the day, we used the test set every time we evaluated a model or a set of hyperparameters. We therefore _used_ the test set to select our best model. This is a form of overfitting: we could not tell whether our best model was best on any unseen data or only on the test set that was used to select it.

A first good practice is to avoid using `random_state` or any deterministic separation between your train and test set. In that case, your test set will change every time you re-run your notebook. But this is far from being sufficient.

To properly compare models, you have to run a proper cross-validation, a 10-fold split for instance. Let's see how to do it properly.
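
As a rough illustration of the splitting mechanics only (no neural network involved), the sketch below runs a 10-fold split on hypothetical random data and fits the scaler on the training fold alone, so that no statistics from the test fold leak into the preprocessing. The names `X_demo`, `kf_demo` and `fold_sizes` are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng()
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))  # hypothetical data

kf_demo = KFold(n_splits=10, shuffle=True)
fold_sizes = []

for train_idx, test_idx in kf_demo.split(X_demo):
    X_tr, X_te = X_demo[train_idx], X_demo[test_idx]

    # Fit the scaler on the training fold only, then apply it to both folds
    scaler = StandardScaler().fit(X_tr)
    X_tr_scaled = scaler.transform(X_tr)
    X_te_scaled = scaler.transform(X_te)

    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)
```

Every sample ends up in a test fold exactly once, which is what makes the averaged score less dependent on one particular split.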

❓ **Question** ❓ First, write a function that outputs a neural network with 3 layers
- a layer with 25 neurons, the `relu` activation function and the appropriate `input_dim`
- a layer with 10 neurons and the `relu` activation function.
- a last layer which is suited to the problem at hand (multiclass classification)

The function should include its compilation, with the `categorical_crossentropy` loss, the `adam` optimizer and the `accuracy` metric.

In [None]:
from tensorflow.keras import models
from tensorflow.keras import layers


def initialize_model():
    ### YOUR CODE HERE

Here, we will do a proper cross validation.

❓ **Question** ❓ Write a loop using the [K-Fold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) class of Scikit-Learn (select 10 splits) to fit your model on the train data and evaluate it on the test data. Store the result of each evaluation in the `results` variable.

Do not forget to standardize your train data before fitting the neural network.
Also, 150 epochs should be sufficient as a first approximation.

In [None]:
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler


kf = KFold(n_splits=10)
kf.get_n_splits(X)

results = []

for train_index, test_index in kf.split(X):
    # Split the data into train and test
    ### YOUR CODE HERE
    
    # Use the standard scaler
    ### YOUR CODE HERE
    
    # Initialize the model
    ### YOUR CODE HERE
    
    # Fit the model on the train data
    ### YOUR CODE HERE
    
    # Evaluate the model on the test data and append the result in the `results` variable
    ### YOUR CODE HERE
    

❓ **Question** ❓ Print the mean accuracy, and its standard deviation

In [None]:
### YOUR CODE HERE

❗ **Remark** ❗ You probably encountered one of the drawbacks of using a proper cross-validation for a neural network: **it takes a lot of time**. Therefore, for the rest of the week, we will do **only one split**. But remember that this is not entirely correct and, for real-life applications and problems, you are encouraged to use a proper cross-validation technique.

❗ **Remark** ❗ In general, practitioners split only once, as you did. Once they get to the end of their optimization, they launch a real cross-validation at 6pm, go home and get the final results the next day.

❓ **Question** ❓ For the rest of the exercise (and you will do the same for the rest of the week), split the dataset into train and test with a 70/30% training to test data ratio.



In [None]:
### YOUR CODE HERE

# Part II : Stop the learning before overfitting

Let's first show that if we train the model for too long, it will overfit the training data and will not be good on the test data.

❓ **Question** ❓ To do that, train the same neural network (do not forget to re-initialize it) with `validation_data=(X_test, y_test)` and 500 epochs. Store the history in the `history` variable.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Evaluate the model on the test set and print the accuracy

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Plot the history of the model with the following function : 

In [None]:
def plot_loss_accuracy(history):
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Test'], loc='best')
    plt.show()
    
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('Model Accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Test'], loc='best')
    plt.show()
    
    
### YOUR CODE HERE

We clearly see that the number of epochs we choose has a great influence on the final results: 
- If not enough epochs, then the algorithm is not optimal as it has not converged yet. 
- On the other hand, if too many epochs, we overfit the training data and the algorithm does not generalize well on test data.

What we want is basically to stop the algorithm when the test loss is minimal (or the test accuracy is maximal).

Let's introduce the early stopping criterion, which is a way to stop the training at an interesting epoch. It uses held-out data to check whether the loss has stopped improving. You cannot use the test data for that check; otherwise, it is a form of data leakage. Instead, it uses a subset of the initial training data, called the **validation set**.

It basically looks like the following : 

<img src="validation_set.png" alt="Validation set" style="height:350px;"/>

To split this data, we use, in the `fit` function, the `validation_split` keyword, which sets the fraction of the initial training set that goes into the validation set. On top of that, we use the `callbacks` keyword to call the early stopping criterion at the end of each epoch. You can find additional information in the [documentation](https://www.tensorflow.org/guide/keras/train_and_evaluate).
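
According to the Keras documentation, `validation_split` takes the validation samples from the end of the arrays passed to `fit`, before any shuffling. The NumPy sketch below mimics that behaviour under simplified assumptions: the helper `split_like_validation_split` is hypothetical, and the exact rounding Keras applies may differ.

```python
import numpy as np


def split_like_validation_split(X, y, validation_split):
    """Hold out the LAST fraction of the samples, without shuffling."""
    n_val = int(len(X) * validation_split)  # simplified rounding (assumption)
    split_at = len(X) - n_val
    return (X[:split_at], y[:split_at]), (X[split_at:], y[split_at:])


X_demo = np.arange(40).reshape(20, 2)  # hypothetical features
y_demo = np.arange(20)                 # hypothetical targets

(X_fit, y_fit), (X_val, y_val) = split_like_validation_split(
    X_demo, y_demo, validation_split=0.25
)
print(len(X_fit), len(X_val))
```

One consequence is that data sorted by class should be shuffled before calling `fit`, otherwise the validation set may contain only the last classes.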


❓ **Question** ❓ Launch the following code, plot the history and evaluate it on the test set

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

es = EarlyStopping()

model = initialize_model()

# Fit the model on the train data
history = model.fit(X_train, y_train,
                    validation_split=0.3,
                    epochs=500,
                    batch_size=16, 
                    verbose=0, 
                    callbacks=[es])

def plot_loss_accuracy(history):
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Validation'], loc='best')
    plt.show()
    
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('Model Accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Validation'], loc='best')
    plt.show()
    

### YOUR CODE HERE

❗ **Remark** ❗ The problem, with this type of approach, is that as soon as the loss of the validation set increases, the model stops. However, as neural network convergence is stochastic, it happens that the loss increases before decreasing again. For that reason, the Early Stopping criterion has the `patience` keyword that defines how many epochs without loss decrease you allow.
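
The effect of `patience` can be simulated on a plain list of validation losses. This is a simplified sketch of the stopping rule, not the Keras source code; `stopping_epoch` and `val_losses_demo` are hypothetical names, and the losses are made up.

```python
def stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None  # the criterion was never triggered


# Made-up validation losses: a small bump at epoch 2, then a real increase
val_losses_demo = [1.00, 0.80, 0.85, 0.70, 0.75, 0.76, 0.77]

for patience in (0, 2, 5):
    print(patience, stopping_epoch(val_losses_demo, patience))
```

With no patience, the small bump at epoch 2 ends the training before the loss reaches 0.70; a larger patience lets the training get past it.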

❓ **Question** ❓ Use the early stopping criterion with a patience of 30 epochs, plot the results and print the accuracy on the test set

In [None]:
### YOUR CODE HERE

❗ **Remark** ❗ You now see that the model continues to converge even though the loss sometimes increases before decreasing again. The number of patience epochs to select is highly related to the task at hand, and there is no general rule.

❗ **Remark** ❗ If you select a high patience, you might face the problem that, by the time training stops, the loss on the validation set has degraded a lot from its best value. To address this, the early stopping criterion allows you to stop the training _and_ restore the weights the neural network had when it reached its best score on the validation set, thanks to the `restore_best_weights` argument, which is set to `False` by default.

❓ **Question** ❓ Run the model with an early stopping criterion that restores the best weights, plot the loss and accuracy and print the accuracy on the test set
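
The gap between the epoch where training stops and the epoch with the best validation loss can be illustrated on made-up numbers. This sketch only models the case where early stopping actually triggers; `early_stop_summary` and the loss values are hypothetical.

```python
def early_stop_summary(val_losses, patience):
    """Track both the best epoch and the epoch at which training stops."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return {
                    "stopped_epoch": epoch,
                    "loss_at_stop": loss,
                    "best_epoch": best_epoch,
                    "best_loss": best_loss,
                }
    return {
        "stopped_epoch": None,
        "loss_at_stop": None,
        "best_epoch": best_epoch,
        "best_loss": best_loss,
    }


# Made-up validation losses that keep degrading after epoch 2
summary = early_stop_summary([0.90, 0.60, 0.50, 0.55, 0.62, 0.70, 0.81], patience=4)
print(summary)
```

Without restoring the best weights, the model keeps the weights from `stopped_epoch`; with `restore_best_weights=True`, it goes back to those from `best_epoch`.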

In [None]:
### YOUR CODE HERE

❗ **Remark 1** ❗ You can look at the [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping) to play with other parameters

❗ **Remark 2** ❗ The exact number of epochs no longer matters as long as the training hits the stopping criterion. So, in the future, set a large number of epochs and let the early stopping criterion end the training.

❓ **Question** ❓ If you look closely at the different plots, you might see that sometimes, between two epochs, the loss is different but the accuracy is the same. How can that happen?

Hint : look at the following class and two different predictions. What would the accuracy and loss be in the two cases?

In [None]:
# True label
y_true = 1

# Prediction 1
y_pred_1 = 0.55

# Prediction 2
y_pred_2 = 0.99
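
One way to explore the hint numerically is sketched below. It assumes a binary cross-entropy loss for a single sample and a 0.5 decision threshold for accuracy; both helper functions are hypothetical and written only for this illustration.

```python
import math


def binary_cross_entropy(y_true, y_pred):
    """Cross-entropy of one prediction, with y_pred the probability of class 1."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))


def is_correct(y_true, y_pred, threshold=0.5):
    """Accuracy only looks at which side of the threshold the prediction falls."""
    return int(y_pred >= threshold) == y_true


for y_pred in (0.55, 0.99):
    loss = binary_cross_entropy(1, y_pred)
    print(y_pred, is_correct(1, y_pred), round(loss, 4))
```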

# Part III : Batch-size & Epochs

❓ **Question** ❓ Run the previous model with different batch sizes (with the early stopping criterion) and plot the results.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Look at the oscillations of the accuracy and loss with respect to the batch size. Is this consistent with what we saw with the TensorFlow Playground?

❓ **Question** ❓ How many weight updates are there within one epoch, with respect to the number of samples and the batch size? Therefore, is one epoch longer with a large or a small batch size?
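
To check your reasoning, the count can be computed directly. This sketch assumes one weight update per batch, with the last batch possibly smaller than the others; the sample count of 1400 is a hypothetical value, and `updates_per_epoch` is a helper written for this example.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """One update per batch; the last batch may be incomplete."""
    return math.ceil(n_samples / batch_size)


n_samples_demo = 1400  # hypothetical number of training samples

for batch_size in (1, 16, 256, 1400):
    print(batch_size, updates_per_epoch(n_samples_demo, batch_size))
```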