In [1]:
import numpy as np

# Vanishing/Exploding Gradients

Suppose we have an $L$-layer neural network. If we ignore the biases and use a linear activation at every layer, the prediction $\hat{y}$ is computed as:

$$\hat{y} = W^{[L]}W^{[L-1]}W^{[L-2]} \dots W^{[2]} W^{[1]} X$$

Now suppose that every $W^{[l]}$ has the same shape and the same values, $W$. The computation then simplifies to:

$$\hat{y} = W^{L} X$$

Why is this a problem? Let's illustrate this with an example.

Suppose we have a simple weight matrix with $1.5$ on its diagonal.

In [12]:
# Suppose we have this simple weight matrix, W
W = np.array([[1.5, 0], [0, 1.5]])
print(W)

[[ 1.5  0. ]
 [ 0.   1.5]]


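Now suppose that we have a very deep neural network, with 100 layers. Observe that when we perform the computation for $\hat{y}$, the entries of $W^{L}$ grow exponentially with depth. The same repeated product shows up in backpropagation, so the gradients blow up as well. This is known as the **exploding gradient** problem.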

In [13]:
L = 100
# W^L is a matrix power (repeated matrix product), not an element-wise power
y = np.linalg.matrix_power(W, L)
print(y)

[[  4.06561178e+17   0.00000000e+00]
 [  0.00000000e+00   4.06561178e+17]]
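
The repeated product $W^{L}$ is a matrix power, which in general differs from raising each entry of $W$ to the power $L$. The two coincide for the diagonal matrix used here. A small sketch of the difference, where the non-diagonal matrix `A` is a made-up example and not part of the original notebook:

```python
import numpy as np

# Hypothetical non-diagonal matrix, chosen only to expose the difference
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])

elementwise = np.power(A, 3)            # each entry cubed
matrix = np.linalg.matrix_power(A, 3)   # A @ A @ A

print(elementwise)
print(matrix)

# For a diagonal matrix the two operations give the same result
D = np.array([[1.5, 0.0],
              [0.0, 1.5]])
same_for_diagonal = np.allclose(np.power(D, 3), np.linalg.matrix_power(D, 3))
print(same_for_diagonal)
```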


Conversely, suppose that our weight matrix has $0.5$ on its diagonal. Observe that the values are now vanishingly small. This is known as the **vanishing gradient** problem.

In [15]:
W = np.array([[0.5, 0], [0, 0.5]])
y = np.linalg.matrix_power(W, L)  # matrix power, with L = 100 from the cell above
print(y)

[[  7.88860905e-31   0.00000000e+00]
 [  0.00000000e+00   7.88860905e-31]]
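
To connect these numbers to gradients: in the simplified linear network, backpropagation multiplies the upstream gradient by $W^{T}$ once per layer, so the gradient with respect to the input scales the same way as $W^{L}$. The following is a minimal sketch under that assumption. The helper name `input_gradient_norm` and the all-ones upstream gradient are illustrative choices, not something from the original notebook.

```python
import numpy as np

def input_gradient_norm(scale, num_layers, dim=2):
    """Norm of the gradient w.r.t. the input of a linear network
    whose layers all share the weight matrix scale * I."""
    W = scale * np.eye(dim)
    grad = np.ones(dim)  # assumed upstream gradient d(loss)/d(y_hat)
    for _ in range(num_layers):
        grad = W.T @ grad  # one backpropagation step through a linear layer
    return np.linalg.norm(grad)

for scale in (0.5, 1.0, 1.5):
    print(scale, input_gradient_norm(scale, num_layers=100))
```

Only a scale of exactly 1 keeps the gradient norm unchanged. Anything smaller shrinks it geometrically with depth, and anything larger grows it.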


## How to avoid this?

To avoid this, we can choose the scale of the initial weights carefully. The more inputs a neuron has, the smaller we want its weights to be, so that the weighted sum of those inputs stays at a reasonable scale.
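
One way to see why, as a sketch that assumes the inputs $x_i$ and weights $w_i$ of a neuron are all independent, have zero mean, and share common variances $\mathrm{Var}(x)$ and $\mathrm{Var}(w)$: the pre-activation $z = \sum_{i=1}^{n} w_i x_i$ then has variance

$$\mathrm{Var}(z) = \sum_{i=1}^{n} \mathrm{Var}(w_i)\,\mathrm{Var}(x_i) = n\,\mathrm{Var}(w)\,\mathrm{Var}(x)$$

so choosing $\mathrm{Var}(w) = 1/n$ keeps $\mathrm{Var}(z)$ equal to $\mathrm{Var}(x)$ no matter how many inputs the neuron has.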

### Variance Scaling

We can scale the initial weights of layer $l$ by setting the variance of the weight distribution to
$$\frac{1}{n^{[l-1]}}$$
where $n^{[l-1]}$ is the number of inputs to the layer, or, if we are using ReLU,
$$\frac{2}{n^{[l-1]}}$$

In [17]:
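inputs = 100    # number of inputs feeding the layer
shape = (5, 3)  # small illustrative shape so the printed matrix stays readable
# Standard normal samples scaled by sqrt(1 / inputs) have variance 1 / inputs
W = np.random.randn(shape[0], shape[1]) * np.sqrt(1 / inputs)
print(W)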

[[-0.01114848  0.1224484   0.10872784]
 [-0.04284195  0.01937481 -0.02973807]
 [-0.04492172 -0.01220667  0.134523  ]
 [-0.03760996  0.03397368 -0.11037234]
 [ 0.06605986  0.09200254  0.11398231]]
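
To see what this scaling buys us, here is a rough simulation that pushes random inputs through a stack of linear layers, once with unscaled standard normal weights and once with variance $1/n$. The layer width, depth, batch size, and seed are arbitrary assumptions made for the sketch, and `final_std` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
n = 100                          # units per layer (all layers assumed equal width)
num_layers = 50
x = rng.standard_normal((n, 1000))  # 1000 random input vectors

def final_std(weight_std):
    """Standard deviation of the activations after num_layers linear layers."""
    a = x
    for _ in range(num_layers):
        W = rng.standard_normal((n, n)) * weight_std
        a = W @ a  # linear layer, as in the simplified model above
    return a.std()

unscaled = final_std(1.0)              # weights with variance 1
scaled = final_std(np.sqrt(1 / n))     # weights with variance 1 / n

print("unscaled:", unscaled)
print("scaled:  ", scaled)
```

With unscaled weights, each layer multiplies the typical activation size by roughly $\sqrt{n}$, so the final value is astronomically large. With the $1/n$ variance it stays within a modest factor of the input scale.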


### Xavier Initialization

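We multiply standard normal weights by the factor

$$\sqrt{\frac{1}{n^{[l-1]}}}$$

which sets their standard deviation to this value, so that the variance is $\frac{1}{n^{[l-1]}}$, where $n^{[l-1]}$ is the number of inputs to layer $l$. This is useful if we are using the `tanh` activation. The variant proposed by Glorot and Bengio averages the input and output sizes and uses a variance of $\frac{2}{n^{[l-1]} + n^{[l]}}$.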
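
The `tanh` and ReLU scalings can be sketched together in one small helper. `init_weights` and its `activation` argument are hypothetical names introduced for illustration; this is not a library API, and the layer sizes below are arbitrary.

```python
import numpy as np

def init_weights(n_out, n_in, activation="tanh", rng=None):
    """Sample an (n_out, n_in) weight matrix with variance 1/n_in for tanh
    or 2/n_in for ReLU. Hypothetical helper, for illustration only."""
    rng = np.random.default_rng() if rng is None else rng
    numerator = 2.0 if activation == "relu" else 1.0
    return rng.standard_normal((n_out, n_in)) * np.sqrt(numerator / n_in)

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
W_tanh = init_weights(500, 400, "tanh", rng)
W_relu = init_weights(500, 400, "relu", rng)

# Empirical variance times n_in should be close to 1 (tanh) and 2 (ReLU)
print(W_tanh.var() * 400)
print(W_relu.var() * 400)
```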

There are a bunch of other weight initialization strategies out there to try. On a personal note, Keras ships a bunch of these `Initializer`s in its high-level API for us to try out. Glad that they have those!