# Modern Neural Networks - Notebook 4

This notebook continues from Notebook 3 of Modern Neural Networks. It covers the OOP-style development of a CNN (LeNet-5) with TensorFlow 2, along with experiments on different regularisation methods.

### Import the required libraries:

In [1]:
%matplotlib inline

import tensorflow as tf
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import timeit

In [2]:
import os
from IPython.display import display, Image

# Set up the working directory for the images.
# The trailing separator is kept so that file names can be appended directly.
image_folderName = 'Description Images'
image_path = os.path.abspath(image_folderName) + '/'

## 5 - Adding Regularisation to the Model:

## 5.1 - What is Regularisation?

The previous sections mainly trained models by minimising a loss function, updating the network weights to obtain better accuracy over time. Often, further measures are needed to prevent the model from overfitting the training data. Overfitting is undesirable because, when the trained model receives new, unseen data to classify, it will not perform with the same accuracy. The idea here is to employ techniques that prevent overfitting so that the model generalises well.

Approaches that do not rely on regularisation include:
1) Training the model on a rich dataset, which provides enough variability for the model to perform well in testing scenarios. \
2) Changing the model architecture through experimentation, ensuring that the model is neither too shallow (which leads to underfitting) nor too deep (which leads to overfitting).

Regularisation techniques include:
1) Early Stopping. \
2) L1 and L2 Regularisation. \
3) Dropout. \
4) Batch Normalisation.

The following sections discuss these techniques in more detail.

## 5.2 - Early Stopping:

This straightforward technique stops training at a certain point (training epoch). This prevents the overfitting that occurs when the model iterates over the dataset too many times, which is more common when the dataset has few training samples. The stopping point should be early enough to prevent overfitting, yet late enough for the model to learn everything it needs.

Validation on held-out data is the key to deciding on the early stopping point. A validation dataset is provided and the model is evaluated on it during training; from these validation results, it can be determined whether the training process should continue or not.

Note: This can be implemented automatically with Keras Callbacks, "tf.keras.callbacks.EarlyStopping".
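
The stopping rule described above can be sketched in plain Python. This is only a minimal illustration of patience-based stopping on a validation loss, not the Keras implementation: the function name `early_stopping_epoch` and the loss values are hypothetical, and the `patience` and `min_delta` parameters are named after the similarly named arguments of the Keras callback.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs. If that never
    happens, the index of the last epoch is returned.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # The validation loss improved: remember it and reset the counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical validation losses: improving at first, then rising again.
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print("Training would stop at epoch index:", stop_epoch)
```

With Keras, the same behaviour is typically requested by passing an `EarlyStopping` callback (for example with `monitor='val_loss'` and a chosen `patience`) to `model.fit()` through its `callbacks` argument.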

## 5.3 - L1 and L2 Regularisation:

In classical Machine Learning, regularisation generally penalises the coefficients of the fitting function, while in Deep Learning it is the weight matrices of the nodes that are penalised.

Mathematically, a regularisation term $R(P)$ is added to the loss function before training. This can be represented as follows:

$$ L(y, y^{true}) + \lambda R(P) \quad \text{with} \quad y = f(x, P) $$

Where,
- $L$ is the loss computed between the prediction $y$ and the ground-truth label $y^{true}$.
- $\lambda$ is the factor controlling the strength of the regularisation term.
- $y$ is the output of the function $f$, parameterised by $P$, for the input data $x$.

Next, the L1 and L2 regularisation terms can be defined as:

__For L1 Regularisation (a.k.a LASSO)__:

$$ R_{L1}(P) = \left\lvert \left\lvert P \right\rvert \right\rvert _{1} = \sum_{k} \left\lvert P_{k} \right\rvert$$

In more detail, the L1 regulariser (LASSO, Least Absolute Shrinkage and Selection Operator) makes the network minimise the sum of its absolute parameter values. Because there is no squaring factor, larger weights are not penalised disproportionately; instead, the parameters linked to the less important features are shrunk towards zero. Essentially, the network learns to ignore the less meaningful features by adopting sparse parameters. This is also useful for models that need to run in mobile applications, where a smaller network footprint matters.

__For L2 Regularisation (a.k.a RIDGE)__:

$$ R_{L2}(P) = \frac{1}{2} \left\lvert \left\lvert P \right\rvert \right\rvert _{2}^{2} = \sum_{k} \frac{1}{2} P_{k}^{2}$$

In more detail, the L2 regulariser (RIDGE) makes the network minimise the sum of its squared parameter values. This decays all of the parameter values, but does so more strongly for large parameters (because of the squared term). Essentially, the network keeps its parameter values low and therefore more homogeneously distributed. This prevents the network from developing a small set of large-valued parameters that dominate its predictions.
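
To see why the two penalties behave differently, it can help to look at their gradients with respect to a single parameter $P_k$. The following is a short derivation from the definitions of the two terms (for L1, it holds wherever $P_k \neq 0$); the learning rate $\eta$ is a symbol introduced here purely for illustration.

$$ \frac{\partial R_{L1}}{\partial P_{k}} = \mathrm{sign}(P_{k}), \qquad \frac{\partial R_{L2}}{\partial P_{k}} = P_{k} $$

In a plain gradient descent step, the regularisation term therefore contributes the following update to each parameter:

$$ \Delta P_{k}^{L1} = -\eta \lambda \, \mathrm{sign}(P_{k}), \qquad \Delta P_{k}^{L2} = -\eta \lambda P_{k} $$

The L1 update has the same magnitude for every non-zero parameter, so small parameters are pushed all the way to zero, which gives sparsity. The L2 update is proportional to the parameter itself, so large parameters shrink the most while small ones are barely affected.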

The code implementations for these will be shown in the sections below.
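
Ahead of the full implementation, the two penalty terms can be sketched with NumPy as a direct transcription of $R_{L1}(P)$ and $R_{L2}(P)$. The helper names `l1_penalty` and `l2_penalty`, the weight values, the data loss and the value of `lam` (standing in for $\lambda$) are all made up for this example.

```python
import numpy as np


def l1_penalty(params):
    """Sum of the absolute values of all parameters (L1 / LASSO term)."""
    return sum(np.sum(np.abs(p)) for p in params)


def l2_penalty(params):
    """Half the sum of the squared values of all parameters (L2 / RIDGE term)."""
    return sum(0.5 * np.sum(np.square(p)) for p in params)


# Hypothetical parameters of a tiny model: one weight matrix and one bias vector.
weights = [np.array([[1.0, -2.0], [0.0, 3.0]]), np.array([0.5, -0.5])]

lam = 0.01        # regularisation strength (lambda), assumed value
data_loss = 0.25  # assumed value of L(y, y_true) for some batch

l1 = l1_penalty(weights)
l2 = l2_penalty(weights)

# Regularised losses: L(y, y_true) + lambda * R(P)
total_l1 = data_loss + lam * l1
total_l2 = data_loss + lam * l2
print(l1, l2, total_l1, total_l2)
```

In Keras, penalties of this kind are usually attached per layer through arguments such as `kernel_regularizer`, using `tf.keras.regularizers.l1` or `tf.keras.regularizers.l2`. Note that the Keras L2 regulariser does not include the factor of $\frac{1}{2}$ used in the formula above, so the value of $\lambda$ is not directly interchangeable between the two.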

## 5.4 - Dropout:

Larger neural networks with a greater number of parameters are inherently prone to overfitting, and as the architecture grows more complex the model also becomes slower to compute, which makes it impractical to counter overfitting by averaging the predictions of many separately trained networks. Dropout can be used to address these issues. The main concept is to randomly drop units from the neural network during the training phase, temporarily removing each dropped unit along with its incoming and outgoing connections. Applying this technique amounts to sampling a thinned-out network that consists only of the units that survived dropout.

This method takes a hyperparameter $\rho$, the dropout ratio, which is the probability of each neuron being switched off at each training step. This value is typically set between 0.1 and 0.5.

Note: In TF code this is implemented with "tf.nn.dropout()", and in the Keras API with "tf.keras.layers.Dropout()".
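
The mechanism can be illustrated with a small NumPy sketch. It uses the "inverted dropout" formulation, in which the surviving activations are scaled by $1 / (1 - \rho)$ during training so that their expected value is unchanged and nothing has to be rescaled at test time. The function name `dropout`, its `training` flag and the example values are assumptions made for this illustration rather than the TensorFlow implementation.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Apply inverted dropout to an array of activations.

    `rate` plays the role of rho: the probability of a unit being dropped.
    """
    if not training or rate == 0.0:
        # At test time the layer is the identity.
        return activations
    keep_prob = 1.0 - rate
    # Each unit survives independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Scale the survivors so the expected activation stays the same.
    return activations * mask / keep_prob


rng = np.random.default_rng(seed=0)
activations = np.ones((4, 5))

dropped = dropout(activations, rate=0.5, rng=rng)
unchanged = dropout(activations, rate=0.5, rng=rng, training=False)
print(dropped)
```

With `rate=0.5`, every entry of `dropped` is either 0 (the unit was dropped) or 2 (the unit survived and was scaled by 1 / 0.5), while `unchanged` is identical to the input.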

#### The diagram below briefly describes the dropout being applied to the network:

In [None]:
# Dropout: 
display(Image(image_path + 'Dropout.png', width=600, unconfined=True))
print('Image ref -> https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf')

## 5.5 - Batch Normalisation:

Batch normalisation is a technique used in deep learning to tackle the problem known as internal covariate shift. In the hidden layers, the output of each layer becomes the input of the next. As the model's weights are updated after each training iteration of gradient descent, the distribution of these activations changes as well, which slows down training because each layer has to keep adapting to the new distribution.

The batch normalisation operation normalises the output of the previous layer by subtracting the batch mean and dividing by the batch standard deviation.
As the batches in stochastic gradient descent are randomly sampled, the data is unlikely to be normalised the same way twice. The network therefore learns to deal with these fluctuations, which makes it more robust and helps it generalise better.

Note: In TF code this is implemented with "tf.nn.batch_normalization()", and in the Keras API with "tf.keras.layers.BatchNormalization()".
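
The training-time operation described above can be sketched in NumPy as follows. Beyond the text, this sketch also includes the learnable scale `gamma` and shift `beta` that batch normalisation layers normally apply after normalising, and a small `eps` for numerical stability; the function name `batch_norm` and the input values are made up for this example. At inference time, layers such as the Keras one use moving averages of the mean and variance instead of the batch statistics, which is not shown here.

```python
import numpy as np


def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise a batch of shape (batch_size, features) per feature."""
    mean = x.mean(axis=0)                    # batch mean of each feature
    var = x.var(axis=0)                      # batch variance of each feature
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, roughly unit variance
    return gamma * x_hat + beta              # learnable scale and shift


# Hypothetical batch of 4 samples with 2 features on very different scales.
batch = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])

normalised = batch_norm(batch, gamma=np.ones(2), beta=np.zeros(2))
print(normalised.mean(axis=0))
print(normalised.std(axis=0))
```

After normalisation, both features have a mean of approximately 0 and a standard deviation of approximately 1, regardless of their original scale.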

## 5.6 - TF and Keras implementation: Adding Regularisation to the model:

### 5.6.1 - Load in the Data:

In [7]:
# Define the number of classes:
nb_classes = 10

# Define the image dimensions:
img_rows, img_cols, img_chnls = 28, 28, 1

# Define the input shape:
input_shape = (img_rows, img_cols, img_chnls)

# Load in the dataset:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

### 5.6.2 - Data Preprocessing:

NOTE: In the code below, the single " * " in "*input_shape" unpacks the tuple into separate positional arguments, so "reshape(n, *input_shape)" is equivalent to "reshape(n, 28, 28, 1)". Two asterisks, " ** ", unpack a dictionary into keyword arguments in the same way.
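
The unpacking notation from the note above can be tried out in isolation. The function `describe` and the dictionary `options` are hypothetical and only serve to show the two operators side by side.

```python
def describe(rows, cols, channels=1):
    """Return a short text description of an image shape."""
    return f"{rows}x{cols}x{channels}"


input_shape = (28, 28, 1)
options = {"rows": 28, "cols": 28, "channels": 1}

# A single * unpacks a tuple into positional arguments:
from_tuple = describe(*input_shape)   # same as describe(28, 28, 1)

# A double ** unpacks a dictionary into keyword arguments:
from_dict = describe(**options)       # same as describe(rows=28, cols=28, channels=1)

# The single * also works when building a new tuple, e.g. a full batch shape:
target_shape = (60000, *input_shape)
print(from_tuple, from_dict, target_shape)
```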

In [8]:
# Normalise the data:
x_train, x_test = x_train / 255.0, x_test / 255.0

# Inspect:
x_train.shape, x_test.shape

((60000, 28, 28), (10000, 28, 28))

In [9]:
# Reshape the inputs:
x_train = x_train.reshape(x_train.shape[0], *input_shape)
x_test = x_test.reshape(x_test.shape[0], *input_shape)

# Inspect:
x_train.shape, x_test.shape

((60000, 28, 28, 1), (10000, 28, 28, 1))

### 5.6.3 - 




## Summary:

