# Training Deep Neural Networks

We will explore the vanishing and exploding gradients problems, how to tackle complex tasks when you have little labeled data, various optimizers that speed up training, and a few popular regularization techniques.

## Setup

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

%load_ext tensorboard

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images")
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

![Vanishing-Gradients-in-DNN.png](images/Vanishing-Gradients-in-DNN.png)


### Weight Initialization: Glorot and He Initialization

In their paper, Glorot and Bengio argue that for the signal to flow properly in both directions, the variance of the outputs of each layer must be equal to the variance of its inputs, and the gradients must have equal variance before and after flowing through a layer in the reverse direction.

Glorot initialization (when using the logistic activation function):

Normal distribution with mean 0 and variance $\sigma^2 = \frac{1}{fan_{avg}}$

Or a uniform distribution between $-r$ and $+r$, with $r = \sqrt{\frac{3}{fan_{avg}}}$

Here, fan$_{avg}$ = (fan$_{in}$ + fan$_{out}$) / 2, where fan$_{in}$ and fan$_{out}$ are the number of inputs and outputs of the layer.

This initialization strategy is called **Xavier initialization** or **Glorot initialization**.

Yann LeCun proposed replacing fan$_{avg}$ with fan$_{in}$ in the equations above. This strategy is called **LeCun initialization**.

The initialization strategy for the *ReLU* activation function (and its variants) is sometimes called **He initialization**.

*Table: Initialization parameters for each type of activation function:*

| Initialization | Activation functions | $\sigma^2$ (Normal) |
|-|-|-|
| Glorot | None, tanh, logistic, softmax | $\frac{1}{fan_{avg}}$ |
| He | ReLU and variants | $\frac{2}{fan_{in}}$ |
| LeCun | SELU | $\frac{1}{fan_{in}}$ |
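
The table can be turned into numbers with a few lines of NumPy. This is a minimal sketch rather than Keras's own implementation: the helper `init_params()` and the layer sizes are hypothetical, chosen only to illustrate the formulas. The uniform limit follows from the fact that a uniform distribution on $[-r, r]$ has variance $r^2 / 3$.

```python
import numpy as np

def init_params(fan_in, fan_out, kind="glorot"):
    """Return (sigma, r): the standard deviation of the normal variant and the
    limit of the uniform variant. Hypothetical helper, for illustration only."""
    fan_avg = (fan_in + fan_out) / 2
    variance = {"glorot": 1 / fan_avg,
                "he": 2 / fan_in,
                "lecun": 1 / fan_in}[kind]
    sigma = np.sqrt(variance)
    r = np.sqrt(3 * variance)  # uniform(-r, r) has variance r**2 / 3
    return sigma, r

fan_in, fan_out = 784, 300  # assumed layer sizes (flattened 28x28 input, 300 units)
for kind in ("glorot", "he", "lecun"):
    sigma, r = init_params(fan_in, fan_out, kind)
    print(f"{kind:>6}: sigma={sigma:.4f}, r={r:.4f}")

# Empirical check: weights drawn uniformly in [-r, r] have variance close to 1 / fan_avg
rng = np.random.default_rng(42)
sigma, r = init_params(fan_in, fan_out, "glorot")
W = rng.uniform(-r, r, size=(fan_in, fan_out))
print(W.var(), 1 / ((fan_in + fan_out) / 2))
```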

In [2]:
[name for name in dir(keras.initializers) if not name.startswith("_")]

['Constant',
 'GlorotNormal',
 'GlorotUniform',
 'HeNormal',
 'HeUniform',
 'Identity',
 'Initializer',
 'LecunNormal',
 'LecunUniform',
 'Ones',
 'Orthogonal',
 'RandomNormal',
 'RandomUniform',
 'TruncatedNormal',
 'VarianceScaling',
 'Zeros',
 'constant',
 'deserialize',
 'get',
 'glorot_normal',
 'glorot_uniform',
 'he_normal',
 'he_uniform',
 'identity',
 'lecun_normal',
 'lecun_uniform',
 'ones',
 'orthogonal',
 'random_normal',
 'random_uniform',
 'serialize',
 'truncated_normal',
 'variance_scaling',
 'zeros']

By default, Keras uses Glorot initialization with a uniform distribution.

In [3]:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

<tensorflow.python.keras.layers.core.Dense at 0x7fe375b2e130>

If you want He initialization with a uniform distribution but based on fan$_{avg}$ rather than fan$_{in}$, you can use the `VarianceScaling` initializer like this:

In [4]:
init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
keras.layers.Dense(10, activation="relu", kernel_initializer=init)

<tensorflow.python.keras.layers.core.Dense at 0x7fe37d907400>

### Nonsaturating Activation functions

**ReLU** is generally considered a better activation function than saturating functions such as the logistic function or tanh, mostly because it does not saturate for positive values and is fast to compute. Unfortunately, it is not perfect. It suffers from a problem known as *dying ReLUs*: during training, some neurons stop outputting anything other than 0, especially if you use a large learning rate.

To solve this problem you may use a variant of ReLU, such as the **leaky ReLU**. This function is defined as LeakyReLU$_\alpha$(z) = max($\alpha$z, z). The hyperparameter $\alpha$ is the slope of the function for z < 0 and is typically set to 0.01. This small slope ensures that leaky ReLUs never "die": they can go into a long coma, but they have a chance to eventually wake up.

In the *randomized leaky ReLU* (RReLU), $\alpha$ is picked randomly in a given range during training and is fixed to an average value during testing. It also acts as a regularizer (it reduces the risk of overfitting).

In the *parametric leaky ReLU* (PReLU), $\alpha$ is learned during training (instead of being a hyperparameter, it becomes a parameter that backpropagation can modify like any other). PReLU is reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting.
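
The leaky ReLU definition translates directly into NumPy. The function name `leaky_relu()` and the sample inputs are assumptions made for this sketch; it is not the code Keras runs internally.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """LeakyReLU_alpha(z) = max(alpha * z, z), applied element-wise (assumes alpha < 1)."""
    return np.maximum(alpha * z, z)

z = np.array([-2.0, -0.5, 0.0, 3.0])
print(leaky_relu(z))             # negative inputs are scaled by alpha instead of zeroed
print(leaky_relu(z, alpha=0.2))  # a larger leak
```

With `alpha=0` the function reduces to the plain ReLU, which is why a neuron whose inputs are all negative receives no gradient and can "die".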

![Leaky ReLU](images/leaky_relu.png)
<center>Leaky ReLU: like ReLU, but with a small slope for negative values.</center>

In [5]:
[m for m in dir(keras.activations) if not m.startswith("_")]

['deserialize',
 'elu',
 'exponential',
 'get',
 'hard_sigmoid',
 'linear',
 'relu',
 'selu',
 'serialize',
 'sigmoid',
 'softmax',
 'softplus',
 'softsign',
 'swish',
 'tanh']

In [6]:
[m for m in dir(keras.layers) if "relu" in m.lower()]

['LeakyReLU', 'PReLU', 'ReLU', 'ThresholdedReLU']

Let's train a neural network on Fashion MNIST using the Leaky ReLU:

In [7]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_val, X_train = X_train_full[:5000], X_train_full[5000:]
y_val, y_train = y_train_full[:5000], y_train_full[5000:]

In [8]:
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(),
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(),
    keras.layers.Dense(10, activation="softmax")
])

In [9]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [10]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_val, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Now let's try PReLU:

In [11]:
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="he_normal"),
    keras.layers.PReLU(),
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.PReLU(),
    keras.layers.Dense(10, activation="softmax")
])

In [12]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [13]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_val, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10



In its authors' experiments, the **exponential linear unit (ELU)** outperformed all the ReLU variants: training time was reduced and the network performed better on the test set.

ELU activation function: 

$$ ELU_\alpha(z) = \left\{
\begin{array}{ll}
      \alpha(\exp(z) - 1) & \text{if } z < 0 \\
      z & \text{if } z \geq 0 \\
\end{array}
\right. $$

![ELU](images/ELU.png)


The main drawback of ELU is that it is slower to compute than ReLU and its variants (because of the exponential function).
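
As a quick illustration, here is a NumPy sketch of the ELU function. The name `elu()` and the test inputs are assumptions for this example, and `alpha=1.0` is used as the default.

```python
import numpy as np

def elu(z, alpha=1.0):
    """ELU_alpha(z): alpha * (exp(z) - 1) for z < 0, z otherwise."""
    z = np.asarray(z, dtype=float)
    # np.minimum keeps expm1 from overflowing on large positive inputs
    return np.where(z < 0, alpha * np.expm1(np.minimum(z, 0.0)), z)

z = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(z))  # negative inputs approach -alpha smoothly; positive inputs pass through
```

Unlike ReLU, the function has a nonzero gradient for z < 0, and the call to `expm1` is the extra cost mentioned above.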

Implementing ELU in TensorFlow is trivial: just specify the activation function when building each layer:

In [14]:
keras.layers.Dense(10, activation="elu")

<tensorflow.python.keras.layers.core.Dense at 0x7fe35c2ce9d0>

Scaled ELU (**SELU**) activation function:

During training, a neural network composed exclusively of a stack of dense layers using the SELU activation function and LeCun initialization will self-normalize: the output of each layer will tend to preserve the same mean and variance during training, which solves the vanishing/exploding gradients problem.

As a result, SELU often significantly outperforms other activation functions for such neural nets, especially deep ones. Unfortunately, the self-normalizing property of the SELU activation function is easily broken: you cannot use ℓ1 or ℓ2 regularization, regular dropout, max-norm, skip connections or other non-sequential topologies (so recurrent neural networks won't self-normalize).

However, in practice it works quite well with sequential CNNs. If you break self-normalization, SELU will not necessarily outperform other activation functions.

![SELU](images/selu.png)
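
To see self-normalization at work without training anything, you can push standardized random inputs through a deep stack of randomly initialized SELU layers and watch the activation statistics. This is only a sketch under stated assumptions: there are no bias terms, the layer width (100) and batch size (500) are arbitrary, and the two constants are the values published in the SELU paper.

```python
import numpy as np

# Constants from the SELU paper
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    return SCALE * np.where(z < 0, ALPHA * np.expm1(np.minimum(z, 0.0)), z)

rng = np.random.default_rng(42)
n_units = 100
Z = rng.standard_normal((500, n_units))  # standardized inputs: mean 0, std 1
for layer in range(100):
    # LeCun normal initialization: variance 1 / fan_in
    W = rng.normal(scale=np.sqrt(1 / n_units), size=(n_units, n_units))
    Z = selu(Z @ W)
    if layer % 25 == 24:
        print(f"layer {layer + 1}: mean={Z.mean():.2f}, std={Z.std():.2f}")
final_mean, final_std = Z.mean(), Z.std()
```

The mean should stay near 0 and the standard deviation near 1 even after 100 layers, so the signal neither dies out nor blows up.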

Let's create a neural net for Fashion MNIST with 100 hidden layers, using the SELU activation function:

In [15]:
np.random.seed(42)
tf.random.set_seed(42)

In [16]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="selu",
                             kernel_initializer="lecun_normal"))
for layer in range(99):
    model.add(keras.layers.Dense(100, activation="selu",
                                 kernel_initializer="lecun_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

In [17]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

**Note**: Do not forget to scale the inputs to mean 0 and standard deviation 1:

In [18]:
pixel_means = X_train.mean(axis=0, keepdims=True)
pixel_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_val_scaled = (X_val - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

In [19]:
history = model.fit(X_train_scaled, y_train, epochs=5,
                    validation_data=(X_val_scaled, y_val))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [20]:
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"))
for layer in range(99):
    model.add(keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

history = model.fit(X_train_scaled, y_train, epochs=5,
                    validation_data=(X_val_scaled, y_val))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


With ReLU in place of SELU, the results are not great at all: the network suffers from the vanishing/exploding gradients problem.

#### So which activation function should you use?

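* In general, SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic.
* If the network's architecture prevents it from self-normalizing, ELU may perform better than SELU.
* If you care a lot about runtime latency, you may prefer leaky ReLU.
* If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, such as RReLU (if the network is overfitting) or PReLU (if you have a huge training set).
* If speed is your priority, ReLU may still be the best choice: it is the most widely used activation function, so many libraries and hardware accelerators provide ReLU-specific optimizations.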

### Batch Normalization

Although careful weight initialization reduces the vanishing/exploding gradients problems at the beginning of training, it doesn't guarantee that they won't come back during training.
As a neural network learns, it updates its weights at each training step. If one of the weights becomes drastically larger than the others, it produces larger outputs, which cascade through the following layers and cause instability. This is where Batch Normalization (BN) comes in.

BN is applied to whichever layers of the network we choose, just before or after their activation function.

The BN operation performs the following steps:
1. It zero-centers and normalizes the input values.
2. It then transforms the result through scaling (multiplying by $\gamma$) and shifting (adding $\beta$).

Batch Normalization equations:

1. mean: $ \mu_B = \frac{1}{m_B}\sum_{i=1}^{m_B} x^{(i)} $

2. variance: $\sigma_B^2 = \frac{1}{m_B}\sum_{i=1}^{m_B}(x^{(i)} - \mu_B)^2 $

3. normalize (i.e., step 1): $ \hat x^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $

4. scale and shift (i.e., step 2): $ z^{(i)} = \gamma \otimes \hat x^{(i)} + \beta $

where
* $ \mu_B $ is the vector of input **means**, evaluated over the whole mini-batch $B$,
* $\sigma_B $ is the vector of input **standard deviations**, also evaluated over the whole mini-batch,
* $ m_B $ is the number of instances in the mini-batch,
* $ \hat x^{(i)} $ is the vector of zero-centered and normalized inputs for instance $i$,
* $ \gamma $ is the **output scale** parameter vector for the layer,
* $\otimes $ represents element-wise multiplication,
* $ \beta $ is the **output shift** parameter vector for the layer,
* $\epsilon $ is a tiny number that avoids division by zero (typically $10^{-5}$); it is called a *smoothing term*,
* $ z^{(i)} $ is the output of the BN operation: a rescaled and shifted version of the inputs.
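
The four equations map almost line by line onto NumPy. The sketch below covers the training-time computation on a single mini-batch only. The function name `batch_norm_train()` and the random mini-batch are assumptions for illustration; it does not track the moving averages that a real BN layer uses at test time.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-5):
    """Apply the four BN equations to one mini-batch X of shape (m_B, n_features)."""
    mu_B = X.mean(axis=0)                       # 1. mini-batch mean
    var_B = ((X - mu_B) ** 2).mean(axis=0)      # 2. mini-batch variance
    X_hat = (X - mu_B) / np.sqrt(var_B + eps)   # 3. zero-center and normalize
    return gamma * X_hat + beta                 # 4. scale and shift

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # a made-up mini-batch
gamma, beta = np.ones(4), np.zeros(4)             # scale 1 and shift 0 leave X_hat unchanged
Z = batch_norm_train(X, gamma, beta)
print(Z.mean(axis=0).round(3), Z.std(axis=0).round(3))
```

With $\gamma = 1$ and $\beta = 0$ each output feature has mean 0 and standard deviation very close to 1. During training the network is free to learn other values of $\gamma$ and $\beta$ for each feature.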

In [21]:
# Implementation of batch normalization

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

In [22]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_4 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense_211 (Dense)            (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_212 (Dense)            (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_213 (Dense)            (None, 10)               

In [23]:
bn1 = model.layers[1]
[(var.name, var.trainable) for var in bn1.variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

In [24]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [25]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_val, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


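The authors of the BN paper argued in favor of adding the BN layers before the activation functions, rather than after. To do this, remove the activation function from the hidden layers and add it as a separate layer after the BN layer. Moreover, since a BN layer includes one shift parameter per input, you can remove the bias term from the previous layer (`use_bias=False`):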

In [26]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(100, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(10, activation="softmax")
])

In [27]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [28]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_val, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Gradient Clipping
Gradient clipping clips the gradients during backpropagation so that they never exceed some threshold. It is most often used in recurrent neural networks (RNNs). For other types of networks, BN is usually sufficient.

All Keras optimizers accept `clipnorm` or `clipvalue` arguments:

In [29]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)

In [30]:
optimizer = keras.optimizers.SGD(clipnorm=1.0)
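
The two arguments behave differently, which a small NumPy sketch makes visible. The helper functions and the example gradient are assumptions for illustration: they mimic the two clipping rules on a single gradient vector and are not the optimizer's internal code.

```python
import numpy as np

def clip_by_value(grad, clipvalue):
    """Clip every component of the gradient to [-clipvalue, clipvalue]."""
    return np.clip(grad, -clipvalue, clipvalue)

def clip_by_norm(grad, clipnorm):
    """Rescale the whole gradient if its L2 norm exceeds clipnorm."""
    norm = np.linalg.norm(grad)
    return grad * clipnorm / norm if norm > clipnorm else grad

grad = np.array([0.9, 100.0])
print(clip_by_value(grad, 1.0))  # components become 0.9 and 1.0: the direction changes
print(clip_by_norm(grad, 1.0))   # direction preserved, but the first component almost vanishes
```

Clipping by value can change the orientation of the gradient vector, while clipping by norm keeps its direction and only shrinks its length. If gradients explode during training, it is worth trying both and checking which works better on the validation set.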

## Reusing Pretrained Layers

You should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle, then reuse the lower layers of this network. This technique is called **transfer learning**.

![Reusing pretrained layers](images/reusing_pretrained_layers.jpeg)

Try freezing all the reused layers first (i.e., make their weights non-trainable), then train your model and see how it performs.
Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves.
If you still cannot get good performance, and you have little training data, try dropping the top hidden layer(s) and freezing all the remaining hidden layers again.

### Transfer Learning in Keras

Let's split the Fashion MNIST training set in two:

* `X_train_A`: all images of all items except for sandals and shirts (classes 5 and 6).
* `X_train_B`: a much smaller training set of just the first 200 images of sandals or shirts.

We will train a model on set A (classification task with 8 classes), and try to reuse it to tackle set B (binary classification). We hope to transfer a little bit of knowledge from task A to task B, since classes in set A (sneakers, ankle boots, coats, t-shirts, etc.) are somewhat similar to classes in set B (sandals and shirts). 

In [31]:
def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6) # sandals or shirts
    y_A = y[~y_5_or_6]
    y_A[y_A > 6] -= 2 # class indices 7, 8, 9 should be moved to 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32) # binary classification task: is it a shirt (class 6)?
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))

(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_val_A, y_val_A), (X_val_B, y_val_B) = split_dataset(X_val, y_val)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]

X_train_A.shape, X_val_A.shape, X_test_A.shape, X_train_B.shape

((43986, 28, 28), (4014, 28, 28), (8000, 28, 28), (200, 28, 28))

In [32]:
tf.random.set_seed(42)
np.random.seed(42)

In [33]:
model_A = keras.models.Sequential()
model_A.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(keras.layers.Dense(n_hidden, activation="selu"))
model_A.add(keras.layers.Dense(8, activation="softmax"))

In [34]:
model_A.compile(loss="sparse_categorical_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])

In [35]:
history = model_A.fit(X_train_A, y_train_A, epochs=20,
                    validation_data=(X_val_A, y_val_A))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [36]:
model_A.save("my_model_A.h5")

In [37]:
model_B = keras.models.Sequential()
model_B.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_B.add(keras.layers.Dense(n_hidden, activation="selu"))
model_B.add(keras.layers.Dense(1, activation="sigmoid"))

In [38]:
model_B.compile(loss="binary_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])

In [39]:
history = model_B.fit(X_train_B, y_train_B, epochs=20,
                      validation_data=(X_val_B, y_val_B))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [40]:
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_5 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 784)               3136      
_________________________________________________________________
dense_214 (Dense)            (None, 300)               235200    
_________________________________________________________________
batch_normalization_4 (Batch (None, 300)               1200      
_________________________________________________________________
activation (Activation)      (None, 300)               0         
_________________________________________________________________
dense_215 (Dense)            (None, 100)               30000     
_________________________________________________________________
batch_normalization_5 (Batch (None, 100)              

In [41]:
model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

In [42]:
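# model_B_on_A shares its layers with model_A, so training it also modifies model_A.
# Keep an untouched copy: clone_model() copies only the architecture, not the weights.
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())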

In [43]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])

In [44]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_val_B, y_val_B))

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_val_B, y_val_B))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


In [45]:
model_B.evaluate(X_test_B, y_test_B)



[0.1408407837152481, 0.9704999923706055]

In [46]:
model_B_on_A.evaluate(X_test_B, y_test_B)



[0.06834527105093002, 0.9934999942779541]

In [48]:
(100 - 97.05) / (100 - 99.35)

4.538461538461503

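Great! We got quite a bit of transfer: the error rate dropped by a factor of about 4.5!

Keep in mind that transfer learning does not work very well with small dense networks, presumably because small networks learn few patterns and dense networks learn very specific patterns that are unlikely to be useful in other tasks. It works best with deep convolutional neural networks (CNNs), which tend to learn more general feature detectors, especially in the lower layers.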

## Faster Optimizers
The regular gradient descent updates the weights $\theta$ by directly subtracting the gradient of the cost function J($\theta$) multiplied by the learning rate $\eta$. The equation is:
$$ \theta \leftarrow \theta - \eta \nabla_{\theta}J(\theta) $$
It does not care what the earlier gradients were: if the local gradient is tiny, it goes very slowly.

### Momentum Optimization

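Momentum optimization, in contrast, cares a great deal about what the previous gradients were: at each iteration, it subtracts the local gradient (multiplied by the learning rate $\eta$) from the momentum vector $m$, and it updates the weights by adding this momentum vector.

Equations:
1. $ m\leftarrow \beta m - \eta \nabla_{\theta}J(\theta) $
2. $ \theta \leftarrow \theta + m $ 

where $m$ is the momentum vector and the hyperparameter $\beta$, called the momentum, is set between 0 (high friction) and 1 (no friction). A typical value is **0.9**.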
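
The two update equations can be sketched in NumPy as follows. The function name `momentum_step()` and the constant toy gradient are assumptions for this example. With a constant gradient, the momentum vector converges to $-\eta \nabla_{\theta}J(\theta) / (1 - \beta)$, so with $\beta = 0.9$ the steps end up 10 times larger than those of plain gradient descent.

```python
import numpy as np

def momentum_step(theta, m, grad, eta=0.001, beta=0.9):
    """One momentum update: m <- beta*m - eta*grad, then theta <- theta + m."""
    m = beta * m - eta * grad
    return theta + m, m

# Toy setting (an assumption for this demo): the gradient is constant and equal to 1
theta, m = np.array([0.0]), np.array([0.0])
constant_grad = np.array([1.0])
for step in range(200):
    theta, m = momentum_step(theta, m, constant_grad)
print(m)  # close to -0.01, i.e. 10 times the plain step -eta * grad = -0.001
```

This "terminal velocity" is what lets momentum optimization escape plateaus much faster than regular gradient descent.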

In [50]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

### Nesterov Accelerated Gradient

Nesterov Accelerated Gradient (NAG) measures the gradient of the cost function not at the local position $\theta$ but slightly ahead in the direction of the momentum, at $ \theta + \beta m $.

Equations:
1. $ m\leftarrow \beta m - \eta \nabla_{\theta}J(\theta + \beta m) $
2. $ \theta \leftarrow \theta + m $
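
The only change compared to regular momentum is where the gradient is evaluated, as this NumPy sketch shows. The names `nesterov_step()` and `grad_fn()` and the toy cost function $J(\theta) = \frac{1}{2}\lVert\theta\rVert^2$ are assumptions made for the example.

```python
import numpy as np

def nesterov_step(theta, m, grad_fn, eta=0.001, beta=0.9):
    """One NAG update: the gradient is evaluated at theta + beta*m, not at theta."""
    m = beta * m - eta * grad_fn(theta + beta * m)
    return theta + m, m

def grad_fn(theta):
    # Gradient of the toy cost J(theta) = 0.5 * ||theta||^2 (an assumption for this demo)
    return theta

theta, m = np.array([1.0, -2.0]), np.zeros(2)
for step in range(3):
    theta, m = nesterov_step(theta, m, grad_fn, eta=0.1)
    print(step, theta)
```

Because the momentum vector usually points in the right direction, measuring the gradient a bit further along it tends to be slightly more accurate than measuring it at the current position.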

In [51]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

### AdaGrad
Consider the elongated bowl problem: Gradient Descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum, then it very slowly goes down to the bottom of the valley.
The AdaGrad algorithm corrects its direction earlier to point a bit more toward the global optimum.

Equations:
1. $ s\leftarrow s + \nabla_{\theta}J(\theta) \otimes \nabla_{\theta}J(\theta) $
2. $ \theta \leftarrow \theta - \eta \nabla_{\theta}J(\theta) \oslash \sqrt{s + \epsilon}  $

The first step accumulates the square of the gradients into the vector $s$. This vectorized form is equivalent to computing $ s_i  \leftarrow s_i + (\partial J(\theta) / \partial \theta_i)^2 $ for each element $s_i$.

The second step is almost identical to Gradient Descent, but the gradient vector is scaled down by a factor of $\sqrt{s + \epsilon} $ ($ \oslash $ represents element-wise division, and $\epsilon$ is a smoothing term that avoids division by zero).

In short, the algorithm decays the learning rate, and it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an *adaptive learning rate*.

AdaGrad frequently performs well for simple quadratic problems, but it often stops too early when training neural networks: the learning rate gets scaled down so much that the algorithm stops before reaching the global optimum.
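
A NumPy sketch of the two AdaGrad equations shows both effects: steep dimensions are scaled down more, and the effective learning rate shrinks over time. The function name `adagrad_step()`, the made-up gradient and the value `eps=1e-10` are assumptions for this example.

```python
import numpy as np

def adagrad_step(theta, s, grad, eta=0.001, eps=1e-10):
    """One AdaGrad update following the two equations above."""
    s = s + grad * grad                            # accumulate squared gradients
    theta = theta - eta * grad / np.sqrt(s + eps)  # scale each dimension separately
    return theta, s

theta, s = np.zeros(2), np.zeros(2)
grad = np.array([1.0, 10.0])   # made-up gradient: the second dimension is 10x steeper
theta, s = adagrad_step(theta, s, grad, eta=0.1)
print(theta)  # both components move by about -0.1 despite the 10x difference
theta, s = adagrad_step(theta, s, grad, eta=0.1)
print(theta)  # the second step is smaller: the effective learning rate decays
```

Since `s` only ever grows, the steps keep shrinking, which is the reason AdaGrad can stall before reaching the optimum.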

In [52]:
optimizer = keras.optimizers.Adagrad(learning_rate=0.001)

### RMSProp

RMSProp fixes AdaGrad's tendency to slow down too fast and never converge to the global optimum. It does so by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training), using exponential decay in the first step.

Equations:
1. $ s\leftarrow \beta s + (1 - \beta)\nabla_{\theta}J(\theta) \otimes \nabla_{\theta}J(\theta) $
2. $ \theta \leftarrow \theta - \eta \nabla_{\theta}J(\theta) \oslash \sqrt{s + \epsilon}  $

The decay rate $\beta$ is typically set to 0.9 (in Keras, it is the `rho` argument).

Except on very simple problems, it almost always performs much better than AdaGrad.
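
The sketch below contrasts RMSProp's decaying average with AdaGrad's ever-growing sum. The function name `rmsprop_step()`, the made-up gradient sequence (one large gradient followed by 50 small ones) and `eps=1e-10` are assumptions for illustration.

```python
import numpy as np

def rmsprop_step(theta, s, grad, eta=0.001, beta=0.9, eps=1e-10):
    """One RMSProp update following the two equations above."""
    s = beta * s + (1 - beta) * grad * grad        # exponentially decaying average
    theta = theta - eta * grad / np.sqrt(s + eps)
    return theta, s

# One large gradient followed by many small ones
grads = [10.0] + [0.1] * 50
theta = np.array([0.0])
s_rms, s_ada = np.array([0.0]), np.array([0.0])
for g in grads:
    g = np.array([g])
    theta, s_rms = rmsprop_step(theta, s_rms, g)
    s_ada = s_ada + g * g                          # AdaGrad's accumulator never forgets
print(s_rms, s_ada)
```

After the 50 small gradients, RMSProp's `s` has mostly forgotten the early spike, while AdaGrad's accumulator is still dominated by it and would keep the steps tiny.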

In [53]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

### Adam and Nadam Optimization

#### Adam
Adam combines the ideas of momentum and RMSProp: just like momentum optimization it keeps track of an exponentially decaying average of past gradients; and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.

Equations:
1. $ m\leftarrow \beta_1 m - (1 - \beta_1)\nabla_{\theta}J(\theta) $
2. $ s\leftarrow \beta_2 s + (1 - \beta_2)\nabla_{\theta}J(\theta) \otimes \nabla_{\theta}J(\theta) $
3. $ \hat m \leftarrow \frac{m}{1-\beta_1^t} $
4. $ \hat s \leftarrow \frac{s}{1-\beta_2^t} $
5. $ \theta \leftarrow \theta + \eta \hat m \oslash \sqrt{\hat s + \epsilon}  $

Here, $t$ is the iteration number (starting at 1), the momentum decay hyperparameter $\beta_1$ is typically set to 0.9, and the scaling decay hyperparameter $\beta_2$ is often set to 0.999. The smoothing term $\epsilon$ is usually set to a tiny number such as $10^{-7}$. Steps 3 and 4 compensate for the fact that $m$ and $s$ are initialized at 0 and would otherwise be biased toward 0 at the beginning of training.
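
Put together, one Adam iteration can be sketched in NumPy as follows, keeping the sign convention of the equations above ($m$ accumulates the negative gradient). The function name `adam_step()` and the made-up gradient are assumptions for this example, and the argument `t` is the iteration number that appears as the exponent in the two bias-correction steps.

```python
import numpy as np

def adam_step(theta, m, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update, using the sign convention of the equations above."""
    m = beta1 * m - (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)        # bias correction: m and s start at 0
    s_hat = s / (1 - beta2 ** t)
    theta = theta + eta * m_hat / np.sqrt(s_hat + eps)
    return theta, m, s

theta, m, s = np.zeros(2), np.zeros(2), np.zeros(2)
grad = np.array([0.5, -20.0])           # made-up gradient
theta, m, s = adam_step(theta, m, s, grad, t=1)
print(theta)  # about [-0.001, 0.001]: the first step has size ~eta in every dimension
```

Thanks to the bias correction, the very first step already has a magnitude close to $\eta$ in each dimension, whatever the scale of the gradient.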


In [54]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

#### AdaMax

In [55]:
optimizer = keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

#### Nadam

In [57]:
optimizer = keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

### Learning Rate Scheduling

## Avoiding Overfitting Through Regularization

### $\ell1$ and $\ell2$ Regularization

In [58]:
layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))
# or l1(0.1) for ℓ1 regularization with a factor of 0.1
# or l1_l2(0.1, 0.01) for both ℓ1 and ℓ2 regularization, with factors 0.1 and 0.01 respectively

In [59]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="elu",
                       kernel_initializer="he_normal",
                       kernel_regularizer=keras.regularizers.l2(0.01)),
    keras.layers.Dense(100, activation="elu",
                       kernel_initializer="he_normal",
                       kernel_regularizer=keras.regularizers.l2(0.01)),
    keras.layers.Dense(10, activation="softmax",
                       kernel_regularizer=keras.regularizers.l2(0.01))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_val_scaled, y_val))

Epoch 1/2
Epoch 2/2


Since you will typically want to apply the same regularizer to all layers (as well as the same activation function and the same initialization strategy in all hidden layers), you may find yourself repeating the same arguments. To avoid this, either use loops or Python's `functools.partial()` function.

In [60]:
from functools import partial

RegularizedDense = partial(keras.layers.Dense,
                           activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_val_scaled, y_val))


Epoch 1/2
Epoch 2/2


### Dropout
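With dropout regularization, at every training step each neuron in the layers where dropout is applied (input and hidden layers, but never the output layer) has a probability $p$ (the dropout rate) of being temporarily "dropped out": it outputs 0 during this training step, but it may be active during the next one.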
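
The mechanism is easy to sketch in NumPy. The version below is the common "inverted dropout" variant, in which the surviving activations are divided by the keep probability so that their expected value is unchanged. The function name `dropout_train()` and the all-ones input are assumptions for this example, not the Keras source code.

```python
import numpy as np

def dropout_train(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
a = np.ones((1000, 100))
out = dropout_train(a, rate=0.2, rng=rng)
print(np.unique(out))   # only two values: 0 (dropped) and 1 / 0.8 (kept, rescaled)
print(out.mean())       # stays close to 1
```

At test time no mask is applied: the layer simply passes its inputs through unchanged.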

In [61]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_val_scaled, y_val))

Epoch 1/2
Epoch 2/2


If you observe that the model is overfitting, you can increase the dropout rate. Dropout tends to significantly slow down convergence, but it usually results in a much better model when tuned properly.

#### Alpha Dropout

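Use alpha dropout if you want to regularize a self-normalizing network based on the SELU activation function: it is a variant of dropout that preserves the mean and standard deviation of its inputs, whereas regular dropout would break self-normalization.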

In [62]:
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])

optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

history = model.fit(X_train_scaled, y_train, epochs=20, validation_data=(X_val_scaled, y_val))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [63]:
model.evaluate(X_test_scaled, y_test)



[0.4453473687171936, 0.8664000034332275]

In [64]:
model.evaluate(X_train_scaled, y_train)



[0.3312146067619324, 0.88919997215271]

### Monte-Carlo (MC) Dropout
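MC Dropout can boost the performance of any trained dropout model without having to retrain it or even modify it at all. The idea is to make many predictions over the test set with dropout still active (`training=True`) and then average them.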

In [71]:
tf.random.set_seed(42)
np.random.seed(42)

In [72]:
y_probas = np.stack([model(X_test_scaled, training=True)
                     for sample in range(100)])
y_proba = y_probas.mean(axis=0)
y_std = y_probas.std(axis=0)

In [82]:
np.round(model.predict(X_test_scaled[:1]), 2)

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.11, 0.  , 0.02, 0.  , 0.87]],
      dtype=float32)

In [84]:
np.round(y_probas[:, :1], 2)

array([[[0.  , 0.  , 0.  , 0.  , 0.  , 0.45, 0.  , 0.44, 0.  , 0.11]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.02, 0.  , 0.47, 0.  , 0.51]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.73, 0.  , 0.01, 0.  , 0.26]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.06, 0.  , 0.24, 0.  , 0.7 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.23, 0.  , 0.5 , 0.  , 0.27]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.35, 0.  , 0.63]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.29, 0.  , 0.17, 0.  , 0.54]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.03, 0.  , 0.56, 0.  , 0.41]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.45, 0.  , 0.12, 0.  , 0.43]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.09, 0.  , 0.07, 0.  , 0.84]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.05, 0.  , 0.29, 0.  , 0.66]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.88, 0.  , 0.04, 0.  , 0.08]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.69, 0.  , 0.22, 0.  , 0.09]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.45, 0.  , 0.36, 0.  , 0

This shows that when dropout is active, the model is no longer so sure. It still seems to prefer class 9, but it sometimes prefers class 5 or class 7 instead.

In [86]:
np.round(y_proba[:1], 2)

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.31, 0.  , 0.2 , 0.  , 0.48]],
      dtype=float32)

In [87]:
y_std = y_probas.std(axis=0)
np.round(y_std[:1], 2)

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.28, 0.  , 0.17, 0.  , 0.27]],
      dtype=float32)

Apparently, there's quite a lot of variance in the probability estimates.

In [88]:
y_pred = np.argmax(y_proba, axis=1)

In [89]:
accuracy = np.sum(y_pred == y_test) / len(y_test)
accuracy

0.8647
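
The steps above (stack many stochastic predictions, then take their mean and standard deviation) can be wrapped in a small helper. This is a sketch under stated assumptions: `mc_dropout_predict` and `toy_forward` are hypothetical names, and the toy network with random weights only stands in for a trained model so that the example runs without TensorFlow.

```python
import numpy as np

def mc_dropout_predict(stochastic_forward, X, n_samples=100):
    """Average `n_samples` stochastic forward passes over the same inputs."""
    probas = np.stack([stochastic_forward(X) for _ in range(n_samples)])
    # Mean = MC estimate of class probabilities, std = spread across passes
    return probas.mean(axis=0), probas.std(axis=0)

# Toy stand-in for a trained network: dropout on the inputs, then softmax
rng = np.random.default_rng(42)
W = rng.normal(size=(4, 3))

def toy_forward(X, rate=0.2):
    keep = rng.random(X.shape) < 1.0 - rate
    logits = (X * keep / (1.0 - rate)) @ W
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

X = rng.normal(size=(5, 4))
y_proba, y_std = mc_dropout_predict(toy_forward, X)
y_pred = y_proba.argmax(axis=1)
print(np.round(y_proba, 2))
print(np.round(y_std, 2))
```

With a Keras model, the callable could be something like `lambda X: model(X, training=True).numpy()`, mirroring the `training=True` call used earlier in this section.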

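If your model contains other layers that behave differently during training (such as BatchNormalization layers), you should not force training mode for the whole model as we did above. Instead, replace the dropout layers with the following MC variants: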

In [90]:
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

class MCAlphaDropout(keras.layers.AlphaDropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

We override the `call()` method to force its `training` argument to `True`, so these layers stay active at prediction time while every other layer behaves normally.
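
The override pattern can be seen in isolation with a toy layer written in plain Python. `ToyDropout` and `MCToyDropout` are hypothetical classes that only mimic the `call(inputs, training)` convention; they are not Keras layers.

```python
import random

class ToyDropout:
    """Hypothetical stand-in for a dropout layer: active only when training=True."""
    def __init__(self, rate, seed=None):
        self.rate = rate
        self._rng = random.Random(seed)

    def call(self, inputs, training=False):
        if not training:
            # Inference mode: pass the inputs through unchanged
            return list(inputs)
        keep_prob = 1.0 - self.rate
        return [x / keep_prob if self._rng.random() < keep_prob else 0.0
                for x in inputs]

class MCToyDropout(ToyDropout):
    def call(self, inputs, training=False):
        # Ignore the caller's flag and always run in training mode
        return super().call(inputs, training=True)

inputs = [1.0] * 10
print(ToyDropout(0.5, seed=0).call(inputs))    # unchanged at inference time
print(MCToyDropout(0.5, seed=0).call(inputs))  # dropout still applied
```

Because only the subclass ignores the flag, any other layer in the same model keeps its normal inference behavior.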

In [91]:
tf.random.set_seed(42)
np.random.seed(42)

mc_model = keras.models.Sequential([
    MCAlphaDropout(layer.rate) if isinstance(layer, keras.layers.AlphaDropout) else layer
    for layer in model.layers
])

In [92]:
mc_model.summary()

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_11 (Flatten)         (None, 784)               0         
_________________________________________________________________
mc_alpha_dropout (MCAlphaDro (None, 784)               0         
_________________________________________________________________
dense_240 (Dense)            (None, 300)               235500    
_________________________________________________________________
mc_alpha_dropout_1 (MCAlphaD (None, 300)               0         
_________________________________________________________________
dense_241 (Dense)            (None, 100)               30100     
_________________________________________________________________
mc_alpha_dropout_2 (MCAlphaD (None, 100)               0         
_________________________________________________________________
dense_242 (Dense)            (None, 10)              

In [93]:
# `learning_rate` is the current argument name; `lr` is a deprecated alias
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
mc_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

In [94]:
mc_model.set_weights(model.get_weights())

Now we can use the model with MC Dropout:

In [95]:
np.round(np.mean([mc_model.predict(X_test_scaled[:1]) for sample in range(100)], axis=0), 2)

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.3 , 0.  , 0.25, 0.  , 0.45]],
      dtype=float32)

### Max-Norm Regularization
For each neuron, max-norm regularization constrains the weights $w$ of the incoming connections such that $\lVert w \rVert_2 \leq r$, where $r$ is the max-norm hyperparameter and $\lVert \cdot \rVert_2$ is the $\ell_2$ norm.

It does not add a regularization loss term to the overall loss function. Instead, it is typically implemented by computing $\lVert w \rVert_2$ after each training step and rescaling $w$ if needed ($w \leftarrow w \frac{r}{\lVert w \rVert_2}$). Reducing $r$ increases the amount of regularization.
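
The rescaling step can be sketched in NumPy as follows. This assumes a dense kernel of shape `(n_inputs, n_units)`, so each column holds the incoming weights of one neuron and the norm is taken along `axis=0`. The function name `apply_max_norm` and the small `eps` term (added to avoid division by zero) are illustrative choices, not the library's code.

```python
import numpy as np

def apply_max_norm(kernel, r, axis=0, eps=1e-7):
    """Rescale each neuron's incoming weight vector so its L2 norm is at most r."""
    norms = np.sqrt(np.sum(np.square(kernel), axis=axis, keepdims=True))
    # Norms already below r are left as they are; larger ones are clipped to r
    desired = np.clip(norms, 0.0, r)
    return kernel * (desired / (eps + norms))

# Two neurons: the first column has norm 5.0, the second has norm 0.5
kernel = np.array([[3.0, 0.3],
                   [4.0, 0.4]])
constrained = apply_max_norm(kernel, r=1.0)
print(np.linalg.norm(constrained, axis=0))
```

Only the first column is shrunk, because its norm exceeds $r$; the second is left essentially untouched.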

In [98]:
layer = keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal",
                           kernel_constraint=keras.constraints.max_norm(1.))

In [97]:
from functools import partial

MaxNormDense = partial(keras.layers.Dense,
                       activation="selu", kernel_initializer="lecun_normal",
                       kernel_constraint=keras.constraints.max_norm(1.))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    MaxNormDense(300),
    MaxNormDense(100),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

history = model.fit(X_train_scaled, y_train, epochs=2, validation_data=(X_val_scaled, y_val))

Epoch 1/2
Epoch 2/2
