## Vanishing/Exploding Gradients

This note answers the following questions:
- What causes the vanishing/exploding gradient problem?
  - Bad initial weights $W$
  - Bad choice of activation function
- What are overflow and underflow?


### 1. Bad Initial Weights

Consider a deep neural network with $l$ layers:

<center><img src='https://drive.google.com/uc?id=1AvMQI_iYsfr4tqSsxGJRYw4lKe7CToQq' width=800></img></center>

with a linear activation function $g(z) = z$ and biases $b^{[i]} = 0$ for all layers. Then the output of this network is:

$$\hat{y}=W^{[l]}W^{[l-1]}W^{[l-2]}...W^{[3]}W^{[2]}W^{[1]}x$$

This is because:

$$a^{[1]}=g(z^{[1]})=g(W^{[1]}x+b^{[1]})=W^{[1]}x$$

$$a^{[2]}=g(z^{[2]})=g(W^{[2]}a^{[1]}+b^{[2]})=W^{[2]}a^{[1]}=W^{[2]}W^{[1]}x$$

$$a^{[3]}=g(z^{[3]})=g(W^{[3]}a^{[2]}+b^{[3]})=W^{[3]}a^{[2]}=W^{[3]}W^{[2]}W^{[1]}x$$

$$\vdots$$

Since each hidden layer has 2 units and the input $x=[x_1, x_2]$ is also two-dimensional, every $W^{[i]}$ is a $2 \times 2$ matrix (**Note**: except the last layer's $W^{[l]}$).
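
The collapse of the layer-by-layer computation into a single matrix product can be checked numerically. The following is a minimal sketch, assuming NumPy, a hypothetical depth of 5, random weights, and a single output unit (none of which come from the figure above):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
num_layers = 5                   # hypothetical depth l
x = np.array([1.0, 2.0])         # hypothetical input [x_1, x_2]

# l-1 hidden-layer matrices of shape 2x2, then one 1x2 output matrix W^{[l]}.
weights = [rng.normal(size=(2, 2)) for _ in range(num_layers - 1)]
weights.append(rng.normal(size=(1, 2)))

# Layer-by-layer forward pass with g(z) = z and b^{[i]} = 0.
a = x
for W in weights:
    a = W @ a

# Collapse all layers into the single matrix W^{[l]} ... W^{[1]}.
collapsed = np.eye(2)
for W in weights:
    collapsed = W @ collapsed

y_hat = collapsed @ x
print(np.allclose(a, y_hat))     # both routes give the same output
```

With linear activations and zero biases, depth adds no expressive power: the whole network behaves like the single matrix `collapsed`.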

Now imagine that all the $W^{[i]}$ (for $i < l$) have the same value:

$$ W^{[i]} = \begin{bmatrix}
1.5 & 0 \\
0 & 1.5
\end{bmatrix}$$

Then the output value:

$$\hat{y} = W^{[l]}\begin{bmatrix}
1.5 & 0 \\
0 & 1.5
\end{bmatrix}^{l-1} x =  W^{[l]}\begin{bmatrix}
1.5^{l-1} & 0 \\
0 & 1.5^{l-1}
\end{bmatrix} x$$

When the network is very deep (very large $l$), the output $\hat{y}$ becomes extremely large: $1.5^{l-1} \rightarrow \infty$ as $l \rightarrow \infty$.
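
To get a feel for how fast this grows, here is a small sketch that applies the hidden layers one at a time. The helper name `hidden_scale`, the input $x = [1, 1]$, and the chosen depths are assumptions made for illustration:

```python
import numpy as np

def hidden_scale(w, depth):
    """Apply depth - 1 hidden layers W = w * I to x = [1, 1]; return one entry."""
    W = w * np.eye(2)
    a = np.array([1.0, 1.0])
    for _ in range(depth - 1):
        a = W @ a
    return a[0]

# The activation entering the last layer grows like 1.5 ** (depth - 1).
for depth in (10, 50, 100):
    print(depth, hidden_scale(1.5, depth))
```

Each extra layer multiplies the activations by another factor of 1.5, so the growth is exponential in the depth.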

This can result in very large gradients during back propagation. The explosion comes from exponential growth: the gradients are repeatedly multiplied through network layers whose weights have magnitude larger than 1.0.
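
As a worked illustration of why the gradients inherit this growth, assume a loss $\mathcal{L}$ (a symbol introduced here, not used elsewhere in this note) and treat gradients as column vectors. For the linear network above, the chain rule gives the gradient with respect to the activation of layer $k$:

$$\frac{\partial \mathcal{L}}{\partial a^{[k]}} = \left(W^{[k+1]}\right)^{T}\left(W^{[k+2]}\right)^{T}\cdots\left(W^{[l]}\right)^{T}\frac{\partial \mathcal{L}}{\partial \hat{y}}$$

If every hidden-layer matrix equals $1.5\,I$, the $l-1-k$ hidden factors collapse into a scalar:

$$\frac{\partial \mathcal{L}}{\partial a^{[k]}} = 1.5^{\,l-1-k}\left(W^{[l]}\right)^{T}\frac{\partial \mathcal{L}}{\partial \hat{y}}$$

so the gradients reaching the earliest layers (small $k$) are scaled by the largest power of 1.5.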

In the most extreme case, the values of parameters can become so large as to overflow and result in NaN values.

> **Note**: When numbers with very large magnitude are approximated as $\pm \infty$ on a computer, the rounding error is called **overflow**. Further arithmetic on these values can turn them into NaN. When
> - The model is unstable, resulting in large changes in loss between iterations
> - The model loss goes to NaN during training
>
> these are signs of the **gradient exploding problem**.
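
Overflow and the resulting NaN can be reproduced in a couple of lines. This is a sketch assuming NumPy and 64-bit floats; the exponent 2000 is an arbitrary depth chosen to be large enough to overflow:

```python
import numpy as np

# Silence the floating-point warnings so the sketch only shows the values.
with np.errstate(over="ignore", invalid="ignore"):
    scale = np.float64(1.5) ** 2000   # too large for float64, becomes inf
    diff = scale - scale              # inf - inf is undefined, becomes nan

print(scale, diff)
print(np.finfo(np.float64).max)       # largest finite float64 value
```

Once a value has become `inf`, ordinary operations such as subtraction or division by another `inf` produce `nan`, which then propagates through every later computation.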

Similarly, if all the $W^{[i]}$ have value:

$$ W^{[i]} = \begin{bmatrix}
0.5 & 0 \\
0 & 0.5
\end{bmatrix}$$<br>

Then the output value:

$$\hat{y} = W^{[l]}\begin{bmatrix}
0.5 & 0 \\
0 & 0.5
\end{bmatrix}^{l-1} x =  W^{[l]}\begin{bmatrix}
0.5^{l-1} & 0 \\
0 & 0.5^{l-1}
\end{bmatrix} x$$
<br>

When the neural network is very deep (very large $l$), the output $\hat{y}$ becomes extremely small: $0.5^{l-1} \rightarrow 0$ as $l \rightarrow \infty$. The same happens whenever the repeated $W^{[i]}$ is "smaller" than the identity matrix $I$, for example a diagonal matrix whose entries all have magnitude below 1.
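
The same shrinkage shows up in the backward direction. The sketch below computes the gradient of $\hat{y}$ with respect to the input $x$ for the linear network, assuming a hypothetical output layer $W^{[l]} = [1, 1]$ and an illustrative helper name `input_gradient_norm`:

```python
import numpy as np

def input_gradient_norm(scale, depth):
    """Norm of d(y_hat)/dx for a linear net whose hidden matrices are scale * I."""
    W_hidden = scale * np.eye(2)
    W_out = np.array([[1.0, 1.0]])   # hypothetical output layer W^{[l]}
    grad = W_out.T @ np.ones(1)      # gradient w.r.t. a^{[l-1]}, shape (2,)
    for _ in range(depth - 1):
        grad = W_hidden.T @ grad     # back through one hidden layer
    return np.linalg.norm(grad)

# The gradient norm decays like 0.5 ** (depth - 1).
for depth in (5, 20, 50):
    print(depth, input_gradient_norm(0.5, depth))
```

With a gradient this small, a gradient descent step barely changes the early-layer parameters, which is why training stalls.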

> **Note**: When numbers near zero are rounded to zero during calculation on a computer, the rounding error is called **underflow**. When
>
> - The training loss remains large and nearly unchanged across iterations
>
> it is a sign of the **gradient vanishing problem**.
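
Underflow is just as easy to reproduce. This sketch assumes NumPy and 64-bit floats; the exponent 2000 is an arbitrary choice that is large enough to underflow:

```python
import numpy as np

with np.errstate(under="ignore"):
    tiny = np.float64(0.5) ** 2000   # smaller than any positive float64, rounds to 0.0

print(tiny == 0.0)
print(np.finfo(np.float64).tiny)     # smallest positive normal float64
```

The result is exactly zero, so any later computation that divides by it or takes its logarithm will fail or return a non-finite value.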

### 2. Bad Choice of Activation Function

Recall that in back propagation we need the derivative of the activation function at every layer, $g'(z^{[i]})$. Suppose the sigmoid function is used as the activation function:

<center><img src='https://drive.google.com/uc?id=1v035gDUHLW-dck7ji82JChki7m30zhbj'></img></center>

<br>
The gradient is almost 0 when the input has a very large magnitude. Thus, during back propagation, there are almost no updates to the parameters ($W^{[i]}$s and $b^{[i]}$s).

This is a typical **gradient vanishing problem**. To fix it, use `relu` or other activation functions instead of sigmoid.
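
The saturation can be quantified with the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which never exceeds $0.25$. The sketch below compares it with the ReLU derivative; the function names and sample inputs are chosen here for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), with a maximum of 0.25 at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of max(0, z); the value at z = 0 is set to 0 by convention here.
    return (np.asarray(z) > 0).astype(float)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid_grad(z))   # tiny at both ends, largest in the middle
print(relu_grad(z))      # exactly 1 for every positive input
```

For positive inputs the ReLU derivative stays at 1 no matter how large the input is, so it does not shrink the gradient the way a saturated sigmoid does.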


#### Another example: **softmax function**

$$\text{softmax}(x_i) = \frac{\text{exp}(x_i)}{\sum_{j=1}^{n}\text{exp}(x_j)}$$

Consider a simple case where every $x_i = c$. We should expect all the outputs to equal $\frac{1}{n}$. However,

- When $c$ is negative with large magnitude, $\exp(c)$ underflows to 0, so the denominator of the softmax becomes 0 and the final result is undefined
- When $c$ is positive with large magnitude, $\exp(c)$ overflows to $+\infty$, so the whole expression becomes undefined
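
Both failure modes can be reproduced with a direct translation of the formula. This is a sketch assuming NumPy, with $n = 4$ and $c = \pm 1000$ picked as illustrative values:

```python
import numpy as np

def naive_softmax(x):
    # Direct translation of the formula, with no protection against overflow.
    e = np.exp(x)
    return e / e.sum()

# Silence the floating-point warnings so the sketch only shows the values.
with np.errstate(over="ignore", under="ignore", invalid="ignore", divide="ignore"):
    print(naive_softmax(np.full(4, 1000.0)))    # inf / inf gives nan
    print(naive_softmax(np.full(4, -1000.0)))   # 0 / 0 gives nan
    print(naive_softmax(np.full(4, 1.0)))       # moderate inputs work fine
```

In both extreme cases every output is `nan` rather than the expected $\frac{1}{n}$.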

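This problem can be solved by evaluating $\text{softmax}(z_i)$ instead, where $z_i = x_i-\underset{1 \leq j \leq n}{\max}(x_j)$. The result is mathematically unchanged, because the common factor $\text{exp}(-\max_j x_j)$ cancels between the numerator and the denominator.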

By doing so, the maximum value of $\text{exp}(z_i)$ equals $\text{exp}(0) = 1$, which rules out the possibility of overflow.
At the same time, at least one term in the denominator has value 1, which rules out a zero denominator caused by underflow.
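
A minimal sketch of the shifted softmax, assuming NumPy and a one-dimensional input; the function name `stable_softmax` is chosen here for illustration:

```python
import numpy as np

def stable_softmax(x):
    # Shift by the maximum so the largest exponent is exp(0) = 1.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

print(stable_softmax(np.full(4, 1000.0)))         # every entry is 1/4
print(stable_softmax(np.full(4, -1000.0)))        # every entry is 1/4
print(stable_softmax(np.array([1.0, 2.0, 3.0])))  # same as the unshifted formula
```

Individual terms $\text{exp}(z_i)$ can still underflow to 0 when $x_i$ is far below the maximum, but that only sets the corresponding output to 0 and does not make the result undefined.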

### Useful links:

[How to Fix the Vanishing Gradients Problem Using the ReLU](https://machinelearningmastery.com/how-to-fix-vanishing-gradients-using-the-rectified-linear-activation-function/)

[A Gentle Introduction to Exploding Gradients in Neural Networks](https://machinelearningmastery.com/exploding-gradients-in-neural-networks/)