# Lecture 13: Training deep neural networks

![](https://www.tensorflow.org/images/colab_logo_32px.png)
[Run in colab](https://colab.research.google.com/drive/1ftihrW-_2cIzCkA3TYScFgoOe1bQwTrT)

In [None]:
import datetime
now = datetime.datetime.now()
print("Last executed: " + now.strftime("%Y-%m-%d %H:%M:%S"))

In [None]:
# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
def reset_state(seed=42):
    tf.keras.backend.clear_session()
    tf.random.set_seed(seed)
    np.random.seed(seed)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## Vanishing and exploding gradients

Training typically relies on gradients.

*Vanishing gradients problem*: For deep networks, gradients in lower layers can become very small.  Hence, the corresponding weights are barely updated during training.

*Exploding gradients problem*: In some situations (typically recurrent neural networks) gradients can become very large.  Hence, weight updates are very large and the training algorithm may not converge.

In general deep neural networks can suffer from *unstable gradients*.

### Problematic activation functions

One common cause of vanishing gradients in the past was the use of the sigmoid activation function (and unit Gaussian initialisation).

In [None]:
def logit(z):
    # Logistic sigmoid (despite the function name, the logit is strictly its inverse)
    return 1 / (1 + np.exp(-z))

In [None]:
z = np.linspace(-5, 5, 200)
 
plt.figure(figsize=(8,4))
plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([-5, 5], [1, 1], 'k--')
plt.plot([0, 0], [-0.2, 1.2], 'k-')
plt.plot([-5, 5], [-3/4, 7/4], 'g--')
plt.plot(z, logit(z), "b-", linewidth=2)
props = dict(facecolor='black', shrink=0.1)
plt.annotate('Saturating', xytext=(3.5, 0.7), xy=(5, 1), arrowprops=props, fontsize=14, ha="center")
plt.annotate('Saturating', xytext=(-3.5, 0.3), xy=(-5, 0), arrowprops=props, fontsize=14, ha="center")
plt.annotate('Linear', xytext=(2, 0.2), xy=(0, 0.5), arrowprops=props, fontsize=14, ha="center")
plt.grid(True)
plt.title("Sigmoid activation function", fontsize=14)
plt.axis([-5, 5, -0.2, 1.2]);

With sigmoid activations and unit Gaussian initialisation, the variance of the outputs grows at each layer, so the final layers essentially saturate.  The gradients at the final layers are then very small, and when they are propagated back through the network by back-propagation they vanish.

### Weight initialisation

To avoid this problem, signals and gradients must *not* decay as they propagate through the network.

Decaying signals/gradients can be avoided by promoting equal variance at the outputs and inputs of each layer.

This can be promoted by randomly initialising the weights from a Gaussian with standard deviation:

\begin{eqnarray}
\text{Sigmoid activation:} \quad\quad & \sigma = \sqrt{\frac{2}{n_{\rm inputs}+n_{\rm outputs}}} \\
\text{Hyperbolic tangent activation:} \quad\quad & \sigma = 4\sqrt{\frac{2}{n_{\rm inputs}+n_{\rm outputs}}} \\
\text{ReLU activation:} \quad\quad & \sigma = \sqrt{2}\sqrt{\frac{2}{n_{\rm inputs}+n_{\rm outputs}}} \\
\end{eqnarray}

where $n_{\rm inputs}$ and $n_{\rm outputs}$ are the number of input and output nodes, respectively, for the layer.

There are a lot of different weight initialisation strategies.
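
The effect of the scaling can be checked numerically.  The following is a rough sketch only: the layer width, depth and saturation threshold are arbitrary choices made for illustration, and `saturated_fraction` is a hypothetical helper, not part of any library.  It compares unit Gaussian initialisation with the sigmoid scaling given above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_units = 10, 256   # arbitrary sizes, chosen for illustration only
x = rng.standard_normal((1000, n_units))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def saturated_fraction(weight_std):
    """Pass x through n_layers sigmoid layers (no biases) and return the
    fraction of final-layer outputs within 0.01 of 0 or 1."""
    a = x
    for _ in range(n_layers):
        W = rng.standard_normal((n_units, n_units)) * weight_std
        a = sigmoid(a @ W)
    return np.mean(np.abs(a - 0.5) > 0.49)

# Unit Gaussian weights versus sigma = sqrt(2 / (n_inputs + n_outputs))
sat_unit = saturated_fraction(1.0)
sat_scaled = saturated_fraction(np.sqrt(2 / (n_units + n_units)))
print(f"Saturated outputs: unit Gaussian {sat_unit:.2f}, scaled {sat_scaled:.2f}")
```

With unit Gaussian weights most of the final outputs should sit in the flat parts of the sigmoid, whereas with the scaled initialisation they should stay in the near-linear region.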

#### Weight initialisation in TensorFlow

In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
[name for name in dir(keras.initializers) if not name.startswith("_")]

Can often simply set initialiser when defining layer.

In [None]:
reset_state()

keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

Or can set up a `VarianceScaling` object directly.

In [None]:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

### Non-saturating activation functions

ReLU activation behaves much better than the sigmoid in deep networks since it does not saturate for positive values (and it is fast to compute).

However, the ReLU does suffer from the *dying neuron* problem.

In this scenario neurons effectively die and only output zero.  The neuron is unlikely to come back to life since the gradient of the ReLU activation function is zero for negative inputs.

#### Leaky ReLU

The *leaky ReLU* avoids this problem and is defined by

$$
\text{LeakyReLU}_\alpha(z) = \max(\alpha z, z),
$$

where the hyperparameter $\alpha$ defines how much the leaky ReLU leaks (typically $\alpha=0.01$).

Let's plot the Leaky ReLU activation function for $\alpha=0.05$.

In [None]:
def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha*z, z)

In [None]:
z = np.linspace(-5, 5, 200)

plt.figure(figsize=(8,4))
plt.plot(z, leaky_relu(z, 0.05), "b-", linewidth=2)
plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([0, 0], [-0.5, 4.2], 'k-')
plt.grid(True)
props = dict(facecolor='black', shrink=0.1)
plt.annotate('Leak', xytext=(-3.5, 0.5), xy=(-5, -0.2), arrowprops=props, fontsize=14, ha="center")
plt.title("Leaky ReLU activation function", fontsize=14)
plt.axis([-5, 5, -0.5, 4.2]);

#### ELU

Another alternative is the *exponential linear unit* (ELU).
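
For reference, the ELU with hyperparameter $\alpha$ can be written as follows (this matches the NumPy definition used in this notebook):

$$
\text{ELU}_\alpha(z) =
\begin{cases}
\alpha \left( \exp(z) - 1 \right), & z < 0, \\
z, & z \geq 0.
\end{cases}
$$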

In [None]:
def elu(z, alpha=1):
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)

Let's plot the ELU activation function for $\alpha=1$.

In [None]:
plt.figure(figsize=(8,4))
plt.plot(z, elu(z), "b-", linewidth=2)
plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([-5, 5], [-1, -1], 'k--')
plt.plot([0, 0], [-2.2, 3.2], 'k-')
plt.grid(True)
plt.title(r"ELU activation function ($\alpha=1$)", fontsize=14)
plt.axis([-5, 5, -2.2, 3.2]);

Properties:
- Non-zero gradient for $z<0$ to avoid dying neuron issue.
- Smooth so gradients well defined.
- But is slower to compute.

#### Activation functions in TensorFlow

TensorFlow supports a lot of activation functions.

In [None]:
[m for m in dir(keras.activations) if not m.startswith("_")]

Can again simply set when defining layer or can construct directly.

In [None]:
reset_state()
keras.layers.Dense(10, activation="elu", name="hidden1")

In [None]:
reset_state()
keras.layers.Dense(10, activation=keras.layers.Activation("elu"), name="hidden1")

## Batch normalisation

While careful weight initialisation can reduce gradient problems at the beginning of training, it does not guarantee that these problems won't resurface during training.

*Batch normalisation* adds normalisation during training to address these issues.

It consists of zero-centering and normalising the inputs just before the activation function, followed by shifting and scaling the result.  The shift and scale are considered additional parameters that are learnt during training.

This approach allows training to select the appropriate scale and shift (mean) for each layer.

The mean and standard deviation of the unnormalised inputs are computed for each mini-batch, hence the name *batch normalisation*.

When the trained network is applied to the test set there are no batches, so instead a running mean and standard deviation computed on the *training* set are used.
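
The computation described above can be sketched in NumPy.  This is a minimal illustration rather than the Keras implementation: the function names are hypothetical, and the `momentum` and `eps` values are assumed for the example.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var,
                     momentum=0.99, eps=1e-3):
    """Normalise a mini-batch with its own statistics, then scale and shift.
    Also returns updated running statistics for use at test time."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)            # zero-centre and normalise
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return gamma * x_hat + beta, running_mean, running_var

def batch_norm_test(x, gamma, beta, running_mean, running_var, eps=1e-3):
    """At test time use the running statistics gathered during training."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))     # one mini-batch, 4 features
gamma, beta = np.ones(4), np.zeros(4)                # learnt scale and shift
out, run_mean, run_var = batch_norm_train(x, gamma, beta, np.zeros(4), np.ones(4))
print(out.mean(axis=0), out.std(axis=0))
```

With `gamma` set to one and `beta` to zero, each feature of the training output has approximately zero mean and unit standard deviation.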

### Batch normalisation in TensorFlow

In [None]:
reset_state()

import tensorflow as tf

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

model = keras.models.Sequential([
    keras.layers.Input(shape=[28, 28]),
    keras.layers.Flatten(),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(n_hidden1, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(n_hidden2, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(n_outputs, activation="softmax")
])

model.summary()

In [None]:
[(var.name, var.trainable) for var in model.layers[1].variables]

**Exercises:** *You can now complete Exercise 1 in the exercises associated with this lecture.*

## Pretraining and transfer learning

A deep network trained for one task can often be adapted for a similar task.

Reuse lower layers of network trained for another task.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture13_Images/transfer_learning.png" width="700px" style="display:block; margin:auto"/>

[Credit: Geron]

For transfer learning to be successful the data must have similar low-level features.

### Reusing a Keras model

Let's work through a transfer learning example.

Split the fashion MNIST training set into two:
* `X_train_A`: all images of all items, except sandals and shirts (classes 5 and 6).
* `X_train_B`: first 200 images of sandals or shirts.

The validation set and the test set are split similarly, but without restricting the number of images.

Dataset B corresponds to a simple problem (binary classification) but we only have a small number of training instances. 

Dataset A corresponds to a more difficult problem (classification between 8 classes) but we have much more data.

We will attempt to transfer knowledge from setting A to B, since classes in set A (sneakers, ankle boots, coats, t-shirts, etc.) are somewhat similar to classes in set B (sandals and shirts). 

Aside: Note that only patterns that occur in the same location can be reused since we are using `Dense` layers (CNNs will be much more effective in transferring information detected anywhere in the image due to their translational equivariance properties, as we'll see in the CNN lecture).

#### Set up data

In [None]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6) # sandals or shirts
    y_A = y[~y_5_or_6]
    y_A[y_A > 6] -= 2 # class indices 7, 8, 9 should be moved to 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32) # binary classification task: is it a shirt (class 6)?
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))

(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]

In [None]:
X_train_A.shape, X_train_B.shape

In [None]:
y_train_A[:30]

In [None]:
y_train_B[:30]

#### Define, compile, fit and save model on dataset A

In [None]:
reset_state()
model_A = keras.models.Sequential()
model_A.add(keras.layers.Input(shape=[28, 28]))
model_A.add(keras.layers.Flatten())
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(keras.layers.Dense(n_hidden, activation="selu"))
model_A.add(keras.layers.Dense(8, activation="softmax"))

In [None]:
model_A.summary()

In [None]:
model_A.compile(loss="sparse_categorical_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])

In [None]:
history = model_A.fit(X_train_A, y_train_A, epochs=20,
                    validation_data=(X_valid_A, y_valid_A))

We achieve an accuracy ~92%, which is reasonable.

In [None]:
model_A.save("my_model_A.keras")

#### Repeat on dataset B

In [None]:
model_B = keras.models.Sequential()
model_B.add(keras.layers.Input(shape=[28, 28]))
model_B.add(keras.layers.Flatten())
for n_hidden in (300, 100, 50, 50, 50):
    model_B.add(keras.layers.Dense(n_hidden, activation="selu"))
model_B.add(keras.layers.Dense(1, activation="sigmoid"))

In [None]:
model_B.compile(loss="binary_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])

In [None]:
history = model_B.fit(X_train_B, y_train_B, epochs=20,
                      validation_data=(X_valid_B, y_valid_B))

We achieve an accuracy ~97% since this is an easier problem (binary classification).

However, we could do better by transferring information from setting A.

### Freezing lower layers

The lower layers of the first network have already learnt low-level features for the first task, so they can be reused as they are. 

That is, we freeze their weights so that they are not altered during subsequent training of the new network.

We will take all layers from model A except its output layer and then add a new final output layer for our binary classification problem.

In [None]:
model_A = keras.models.load_model("my_model_A.keras")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1]) # Reuse all layers except output.
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

Note that `model_B_on_A` and `model_A` now share layers.  Training `model_B_on_A` will therefore also change the weights of `model_A`.

To avoid this you can clone a model and reuse the layers of the clone instead.
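
A minimal sketch of cloning is given below.  The model `small_model` is a hypothetical stand-in for the trained `model_A`, used here only so the example is self-contained.  Note that `keras.models.clone_model` copies the architecture only, so the weights are copied separately.

```python
import numpy as np
from tensorflow import keras

# Hypothetical stand-in for a trained model such as model_A
small_model = keras.models.Sequential([
    keras.layers.Input(shape=[28, 28]),
    keras.layers.Flatten(),
    keras.layers.Dense(50, activation="selu"),
    keras.layers.Dense(8, activation="softmax"),
])

# clone_model creates new layers with freshly initialised weights...
clone = keras.models.clone_model(small_model)
# ...so the weights must be copied across explicitly
clone.set_weights(small_model.get_weights())
```

Layers taken from `clone` can then be reused and trained without affecting the original model.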

Let's freeze all layers except the final dense output layer.

In [None]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])

In [None]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

Even with just one trained layer and a few epochs, our model is starting to learn the new problem.

Now let's unfreeze the lower layers and train the full model to fine-tune it.

In [None]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])

In [None]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))

In [None]:
model_B.evaluate(X_test_B, y_test_B)

In [None]:
model_B_on_A.evaluate(X_test_B, y_test_B)

### Model gardens

Many trained TensorFlow models are available at 
[https://github.com/tensorflow/models](https://github.com/tensorflow/models).

## Improved optimizers

Although standard (stochastic) gradient descent is very effective it can still be slow for deep networks.

There are a number of more advanced optimizers that provide improvements, e.g.:
- Momentum optimization
- Nesterov accelerated gradient
- AdaGrad
- RMSProp
- Adam optimization
- ...

Recall gradient descent, with cost function $J(\theta)$ and gradients $\nabla_\theta J(\theta)$, proceeds simply by updating the weights $\theta$ by taking a step of size $\eta$ (the learning rate) in the direction opposite to the gradient:

$$\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$$

### Momentum optimization

Momentum optimization uses the gradients to modify a momentum vector and uses the momentum to update the weights:

1. $m \leftarrow \beta m + \eta \nabla_\theta J(\theta)$
2. $\theta \leftarrow \theta - m$

Gradient is used as an acceleration rather than speed.  Can help to traverse plateaus and to avoid local minima.

The additional hyperparameter $\beta$ is introduced as a friction term to avoid the momentum growing too large (typically $\beta \sim 0.9$).
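
The two update steps can be sketched in NumPy on a toy problem.  The quadratic cost, starting point and number of iterations are assumptions made purely for illustration; plain gradient descent is run alongside for comparison.

```python
import numpy as np

def grad_J(theta):
    # Gradient of the toy cost J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2)
    return np.array([1.0, 10.0]) * theta

eta, beta = 0.01, 0.9
theta = np.array([5.0, 5.0])       # momentum optimization
theta_gd = theta.copy()            # plain gradient descent
m = np.zeros_like(theta)

for _ in range(500):
    m = beta * m + eta * grad_J(theta)        # step 1: update momentum
    theta = theta - m                         # step 2: update weights
    theta_gd = theta_gd - eta * grad_J(theta_gd)

print(np.linalg.norm(theta), np.linalg.norm(theta_gd))
```

On this problem, with the same learning rate, the momentum iterate ends up much closer to the optimum at the origin than plain gradient descent.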

### Nesterov accelerated gradient

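Nesterov accelerated gradient is a variant of momentum optimization where the gradient is computed slightly ahead, at the point towards which the momentum is carrying the weights (that is, at $\theta - \beta m$, since the update is $\theta \leftarrow \theta - m$):

1. $m \leftarrow \beta m + \eta \nabla_\theta J(\theta - \beta m)$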
2. $\theta \leftarrow \theta - m$

In general the momentum will be pointing toward the optimum and so Nesterov modification typically provides an improvement over standard momentum optimization.
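
A sketch of the Nesterov update on a toy quadratic cost (an assumption for illustration) is given below.  Since the weights are updated as $\theta \leftarrow \theta - m$, the look-ahead point used in the code is $\theta - \beta m$.

```python
import numpy as np

def grad_J(theta):
    # Gradient of the toy cost J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2)
    return np.array([1.0, 10.0]) * theta

eta, beta = 0.01, 0.9
theta = np.array([5.0, 5.0])
m = np.zeros_like(theta)

for _ in range(500):
    theta_ahead = theta - beta * m               # where the momentum alone would take us
    m = beta * m + eta * grad_J(theta_ahead)     # gradient evaluated at look-ahead point
    theta = theta - m

print(np.linalg.norm(theta))
```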

### AdaGrad

AdaGrad scales down the gradient vector along the steepest dimensions by incorporating a gradient squared term:

1. $s \leftarrow s + \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$
2. $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta) \oslash \sqrt{s+\epsilon}$

Note that $\otimes$ and $\oslash$ are elementwise multiplication and division, respectively.

The parameter $\epsilon$ is introduced for numerical stability (typically $\epsilon\sim 10^{-10}$).



Basically, AdaGrad corresponds to an *adaptive learning rate* where the learning rate is decayed faster for steep directions.

Consequently, it requires much less tuning of the learning rate $\eta$.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture13_Images/ada_grad.png" width="750px" style="display:block; margin:auto"/>

[Credit: Geron]
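
The AdaGrad steps can be sketched on a toy quadratic cost whose second dimension is ten times steeper than the first.  The cost, learning rate and iteration count are assumptions for illustration.

```python
import numpy as np

def grad_J(theta):
    # Gradient of the toy cost J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2)
    return np.array([1.0, 10.0]) * theta

eta, eps = 0.5, 1e-10
theta = np.array([5.0, 5.0])
s = np.zeros_like(theta)

for _ in range(200):
    g = grad_J(theta)
    s = s + g * g                               # accumulate squared gradients
    theta = theta - eta * g / np.sqrt(s + eps)  # elementwise scaled step

print(theta)
```

Because each dimension is rescaled by its own accumulated gradient, both coordinates follow essentially the same path towards zero despite the factor of ten difference in steepness.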

### RMSProp

RMSProp extends AdaGrad by introducing an exponential decay in the accumulated squared gradient:

1. $s \leftarrow \beta s + (1-\beta) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$
2. $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta) \oslash \sqrt{s+\epsilon}$

(Typically $\beta\sim 0.9$.)

Avoids the problem where AdaGrad slows down too fast and so doesn't converge to the global optimum.
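
A corresponding sketch of RMSProp, again on an assumed toy quadratic cost with illustrative hyperparameter values:

```python
import numpy as np

def grad_J(theta):
    # Gradient of the toy cost J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2)
    return np.array([1.0, 10.0]) * theta

eta, beta, eps = 0.05, 0.9, 1e-10
theta = np.array([5.0, 5.0])
s = np.zeros_like(theta)

for _ in range(500):
    g = grad_J(theta)
    s = beta * s + (1 - beta) * g * g           # exponentially decaying average
    theta = theta - eta * g / np.sqrt(s + eps)

print(theta)
```

The only change from AdaGrad is the first line of the update: old squared gradients are gradually forgotten rather than accumulated indefinitely.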

### Adam optimization

Adam optimization combines momentum and RMSProp:

1. $m \leftarrow \beta_1 m + (1-\beta_1) \nabla_\theta J(\theta)$
2. $s \leftarrow \beta_2 s + (1-\beta_2) \nabla_\theta J(\theta)\otimes\nabla_\theta J(\theta)$
3. $\hat{m} \leftarrow \frac{m}{1-\beta_1^{t}}$, where $t$ is the iteration number 
4. $\hat{s} \leftarrow \frac{s}{1-\beta_2^{t}}$, where $t$ is the iteration number
5. $\theta \leftarrow \theta - \eta \hat{m} \oslash \sqrt{\hat{s}+\epsilon}$

Steps 3 and 4 are introduced to boost $m$ and $s$ at the beginning of training (since they are initialised to 0 they would otherwise be biased towards 0 early on).

(Typically $\beta_1 \sim 0.9$, $\beta_2 \sim 0.999$.)
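
The Adam steps can be sketched as follows on an assumed toy quadratic cost.  The boosted values are kept in separate variables (`m_hat`, `s_hat`, names chosen for this example) so that the running averages themselves are not overwritten between iterations.

```python
import numpy as np

def grad_J(theta):
    # Gradient of the toy cost J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2)
    return np.array([1.0, 10.0]) * theta

eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-10
theta = np.array([5.0, 5.0])
m = np.zeros_like(theta)
s = np.zeros_like(theta)

for t in range(1, 501):                  # t is the iteration number, starting at 1
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g      # step 1: momentum-like average
    s = beta2 * s + (1 - beta2) * g * g  # step 2: RMSProp-like average
    m_hat = m / (1 - beta1 ** t)         # step 3: boost m early in training
    s_hat = s / (1 - beta2 ** t)         # step 4: boost s early in training
    theta = theta - eta * m_hat / np.sqrt(s_hat + eps)   # step 5

print(theta)
```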

## Regularization

Deep networks have many parameters (sometimes millions) and so are prone to overfitting.

Regularization therefore becomes increasingly important.

### Early stopping

A simple regularization strategy is to end training early, e.g. when performance on validation set starts to degrade.

Although early stopping works well, other regularisation techniques can lead to better performance.
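
The stopping rule can be sketched in plain Python.  The function name, the `patience` parameter (the number of epochs without improvement that are tolerated) and the validation losses are all hypothetical, made up for illustration.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch): the epoch with the lowest validation
    loss seen so far, and the epoch at which training would be stopped."""
    best_loss, best_epoch, wait = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0   # improvement: reset
        else:
            wait += 1
            if wait >= patience:                           # no improvement for too long
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses, one per epoch
val_losses = [0.90, 0.70, 0.60, 0.55, 0.57, 0.58, 0.60, 0.52]
best_epoch, stop_epoch = early_stopping(val_losses, patience=3)
print(best_epoch, stop_epoch)   # → 3 6
```

In Keras the same idea is available through the `keras.callbacks.EarlyStopping` callback, which takes a `patience` argument and can restore the best weights via `restore_best_weights=True`.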

### $\ell_2$ and $\ell_1$ regularization

*Tikhonov* regularization adopts $\ell_2$ regularising term (also called *Ridge regression*):


$$ R(\theta) = \frac{1}{2} \sum_{j=1}^n \theta_j^2 = \frac{1}{2}  \theta^{\rm T}\theta.$$


*Lasso* regularization adopts $\ell_1$ regularising term:

$$ R(\theta) =\sum_{j=1}^n \left\vert \theta_j \right\vert .$$

*Elastic net* regularization provides a mix of Tikhonov and Lasso regularization, controlled by mix ratio $r$:

$$ R(\theta) =  r\sum_{j=1}^n \left\vert \theta_j \right\vert + \frac{1-r}{2} \sum_{j=1}^n \theta_j^2.$$

- For $r=0$, corresponds to Tikhonov regularization.
- For $r=1$, corresponds to Lasso regularization.
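
The three regularising terms can be computed with a single function, sketched below in NumPy.  The function name and the example weights are chosen for illustration only.

```python
import numpy as np

def elastic_net_penalty(theta, r):
    """R(theta) = r * sum|theta_j| + (1 - r)/2 * sum theta_j^2."""
    l1 = np.sum(np.abs(theta))
    l2 = 0.5 * np.sum(theta ** 2)
    return r * l1 + (1 - r) * l2

theta = np.array([1.0, -2.0, 3.0])
print(elastic_net_penalty(theta, r=0.0))   # Tikhonov term: 7.0
print(elastic_net_penalty(theta, r=1.0))   # Lasso term: 6.0
print(elastic_net_penalty(theta, r=0.5))   # equal mix: 6.5
```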

### Dropout

Dropout is a very popular and effective regularisation technique developed by [Geoff Hinton in 2012](http://www.jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf).

Dropout involves simply dropping each neuron at each training step with probability $p$.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture13_Images/dropout.png" width="750px" style="display:block; margin:auto"/>

[Credit: Geron]

Dropout encourages each neuron to be as effective as possible individually and not to rely heavily on a few nearby neurons but to consider all input neurons carefully.

The probability $p$ is called the *dropout rate* (typically $p \sim 0.5$).

After training the neurons don't get dropped.

With dropout, each neuron receives input from fewer active neurons during training than it does when the network is applied during testing.  

For example, if $p=0.5$, on average only half as many input neurons are active during training as during testing.  During testing each neuron will therefore get an input signal (approximately) twice as large as during training.

It is important to account for this difference.

To compensate, after training each neuron's input weights are multiplied by the keep probability $1-p$ before applying the network to test data.
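
The compensation can be checked numerically for a single neuron.  The following NumPy sketch uses made-up activations and weights; it compares the average input signal under random dropout masks with the test-time signal computed using weights scaled by $1-p$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # dropout rate
n_inputs = 500
x = rng.random(n_inputs)                 # made-up activations of the input neurons
w = rng.standard_normal(n_inputs)        # made-up input weights of one neuron

# Training: each input neuron is kept with probability 1 - p at every step
masks = rng.random((5000, n_inputs)) >= p
train_signal = ((masks * x) @ w).mean()  # average input signal over many steps

# Testing: no neurons are dropped, so scale the weights by the keep probability
test_signal_raw = x @ w
test_signal_scaled = x @ (w * (1 - p))

print(train_signal, test_signal_raw, test_signal_scaled)
```

The scaled test-time signal should closely match the average training signal, whereas the unscaled one is about $1/(1-p)$ times larger.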

### Data augmentation

Data augmentation can be applied both as a regularization technique and to increase the volume of the training set.

Essentially, new training instances are created from the original training set.

For example, for images, data augmentation can be performed by rotating, shifting, scaling, flipping, changing the contrast, ..., of the original images in the training data-set.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture13_Images/data_augmentation.png" width="750px" style="display:block; margin:auto"/>

[Credit: Geron]

Appropriate data augmentation strategies depend on the type of data under consideration.

Typically training instances are generated on the fly to avoid additional storage requirements.  

TensorFlow has built-in functionality for many transformations for image data, making data augmentation for image data straightforward.
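
To illustrate generating instances on the fly, a minimal NumPy sketch is given below.  The `augment` function is hypothetical and applies only a random horizontal flip and a small random shift; note that `np.roll` wraps pixels around the image border, which is a simplification.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return a randomly flipped and shifted copy of a 2D image."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                       # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # Shift by (dy, dx) pixels; pixels leaving one edge re-enter at the other
    return np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))

rng = np.random.default_rng(0)
image = rng.random((28, 28))                         # stand-in for a training image
batch = np.stack([augment(image, rng) for _ in range(8)])
print(batch.shape)
```

In Keras, similar transformations are provided as preprocessing layers (for example `keras.layers.RandomFlip` and `keras.layers.RandomRotation`), which are only active during training.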