# Vanishing and Exploding Gradients

## S.Alireza Mousavizade

***

# Introduction

Artificial neural networks were first proposed in 1943 as a model of the biological nervous system, with the goal of helping machines learn as humans do.
It was not until 1975, however, that machines could actually learn and recognize patterns in data: the famous back-propagation algorithm brought new hope for training multi-layered networks.
It allowed researchers to train supervised deep artificial neural networks from scratch, although with little success. The cause of this poor training performance with back-propagation was later identified by Sepp Hochreiter in 1991.

# The Problem
1. The Vanishing Gradients
2. The Exploding Gradients

## The Vanishing Gradients

In deep neural networks, **adding more and more hidden layers** lets the network learn **more complex arbitrary functions** and features, and therefore reach higher accuracy when predicting outcomes or identifying patterns in complex data such as images and speech.

But **adding a layer comes at a cost**, which we refer to as the **vanishing gradient**.
The error that is propagated backwards by the back-propagation algorithm can become so small by the time it reaches the layers near the input of the model that it has very little effect. This phenomenon is called the vanishing gradient problem.
It makes it difficult to know in which direction the parameters/weights should move to improve the cost function, and therefore causes premature convergence to a poor solution.

Below is an example of how back-propagation works mathematically for a network with 4 hidden layers.

<center>
<img src="vae_gradients_images/vanishing_and_exploding_gradients.jpg" alt="Drawing" style="width: 85%;"/>
</center>

Sometimes ∂J/∂b1 becomes (nearly) equal to zero and hence contributes nothing to the parameter updates, which causes the learning of the model to end prematurely.
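
To make the shrinking concrete, here is a minimal sketch in plain Python. It assumes a hypothetical chain of scalar sigmoid units (one unit per layer, all weights set to 1.0), which is much simpler than the network in the figure. By the chain rule, the gradient with respect to the input is a product of one factor per layer, and each sigmoid factor is at most 0.25, so the product shrinks geometrically with depth.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # never larger than 0.25

def gradient_through_chain(x, weights):
    """Return d(output)/d(x) for the chain a_k = sigmoid(w_k * a_{k-1})."""
    a = x
    grad = 1.0
    for w in weights:
        z = w * a
        grad *= sigmoid_grad(z) * w   # chain-rule factor contributed by this layer
        a = sigmoid(z)
    return grad

# Illustrative depths and input value (assumptions, not taken from the figure).
for depth in (1, 4, 10, 20):
    g = gradient_through_chain(0.5, [1.0] * depth)
    print(f"depth={depth:2d}  gradient={g:.3e}")
```

With unit weights the gradient at depth 20 is bounded by 0.25 to the power 20, far too small to drive a meaningful update of the first layer.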

## The Exploding Gradients

Let’s now look at another scenario that is very common in deep neural nets and leads to failure of model training.

While the weights are being updated, error gradients can accumulate and result in large gradients. These in turn produce large weight updates and make the network unstable; in the worst case, the weight values become NaN.
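
The following sketch illustrates the same chain-rule product in the opposite regime. It assumes a hypothetical chain of scalar linear layers, so each layer contributes its weight as a factor; the weight values 1.5 and 1e10 are chosen only for illustration.

```python
import math

def gradient_through_linear_chain(weights):
    """d(output)/d(input) for a_k = w_k * a_{k-1}: the product of all weights."""
    grad = 1.0
    for w in weights:
        grad *= w
    return grad

# Factors slightly above 1 already grow geometrically with depth.
for depth in (10, 50, 100):
    print(depth, gradient_through_linear_chain([1.5] * depth))

# With very large factors the product overflows to infinity,
# and arithmetic on infinities then produces NaN.
huge = gradient_through_linear_chain([1e10] * 40)
print(math.isinf(huge), math.isnan(huge - huge))
```

Once a NaN appears in a gradient, every weight it touches becomes NaN on the next update, which is why training cannot recover without intervention.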

# Solutions to the Vanishing Gradient Problem

## 1. Activation Function

The simplest solution is to use activation functions like ReLU (or leaky ReLU) instead of sigmoid or tanh.
[The Vanishing Gradient Problem Of Sigmoid](https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484)

### Sigmoid:

<center>
<img src="vae_gradients_images/sigmoid_vae_gradients.png" alt="Drawing" style="width: 85%;"/>
</center>


### Tanh:

<center>
<img src="vae_gradients_images/tanh_vae_gradients.png" alt="Drawing" style="width: 85%;"/>
</center>
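
The difference between these activation functions shows up directly in their derivatives. The sketch below evaluates them at a few sample points; the leaky ReLU slope of 0.01 is an assumed, commonly used default rather than a value from this document.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)              # peaks at 0.25 when z = 0

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2    # peaks at 1.0 when z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0      # exactly 1 for every positive input

def leaky_relu_grad(z, slope=0.01):
    return 1.0 if z > 0 else slope    # small but non-zero for negative inputs

print(f"{'z':>5} {'sigmoid':>10} {'tanh':>10} {'relu':>6} {'leaky':>6}")
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"{z:5.1f} {sigmoid_grad(z):10.5f} {tanh_grad(z):10.5f} "
          f"{relu_grad(z):6.2f} {leaky_relu_grad(z):6.2f}")
```

Sigmoid and tanh derivatives fall towards zero as the input moves away from the origin (saturation), while ReLU keeps a derivative of 1 on the positive side, so it does not shrink the back-propagated error there.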



## 2. Weight Initialization

- How should we choose the starting point for the iterative optimization process?

- The aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during the course of a forward pass.

<center>
<img src="vae_gradients_images/weight_initialization.png" alt="Drawing" style="width: 85%;"/>
</center>

## Ideas:

### Small Random Numbers


<center>
<img src="vae_gradients_images/wi1.png" alt="Drawing" style="width: 85%;"/>
</center>
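
A quick experiment can show why very small random weights are a problem. The sketch below is an assumption-laden toy: a fully connected tanh network of width 64 and depth 6, with weights drawn from a normal distribution with standard deviation 0.01. These sizes are hypothetical and are not taken from the figure above.

```python
import math
import random
import statistics

def activation_stds(weight_std, width=64, depth=6, seed=0):
    """Std of the tanh activations at each layer for N(0, weight_std^2) weights."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]   # random input vector
    stds = []
    for _ in range(depth):
        # Fresh weight matrix for this layer.
        w = [[rng.gauss(0.0, weight_std) for _ in range(width)]
             for _ in range(width)]
        a = [math.tanh(sum(wij * aj for wij, aj in zip(row, a))) for row in w]
        stds.append(statistics.pstdev(a))
    return stds

small = activation_stds(weight_std=0.01)
for layer, s in enumerate(small, start=1):
    print(f"layer {layer}: activation std = {s:.2e}")
```

The spread of the activations shrinks by roughly the same factor at every layer, so the deeper layers receive almost no signal in the forward pass and, correspondingly, almost no gradient in the backward pass.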

### Xavier Initialization

#### Steps to Derive:

#### Step 1:

<center>
<img src="vae_gradients_images/wi2.png" alt="Drawing" style="width: 85%;"/>
</center>

***

#### Step 2:

<center>
<img src="vae_gradients_images/wi3.png" alt="Drawing" style="width: 85%;"/>
</center>

***

#### Step 3:

<center>
<img src="vae_gradients_images/wi4.png" alt="Drawing" style="width: 85%;"/>
</center>

***

#### Step 4:

<center>
<img src="vae_gradients_images/wi5.png" alt="Drawing" style="width: 85%;"/>
</center>

***

#### Step 5:

<center>
<img src="vae_gradients_images/wi6.png" alt="Drawing" style="width: 85%;"/>
</center>

***

#### Step 6:

<center>
<img src="vae_gradients_images/wi7.png" alt="Drawing" style="width: 85%;"/>
</center>

***

#### Step 7:

<center>
<img src="vae_gradients_images/wi8.png" alt="Drawing" style="width: 85%;"/>
</center>
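
The derivation steps are given as images, so the following is only a numerical sanity check of the commonly quoted result, not a reproduction of those steps. It assumes the variant of Xavier initialization with Var(W) = 1 / fan_in; the original Glorot and Bengio paper also uses 2 / (fan_in + fan_out). The helper names and the layer size of 256 are hypothetical.

```python
import random
import statistics

def xavier_std(fan_in):
    """Std of the weights so that Var(W) = 1 / fan_in."""
    return (1.0 / fan_in) ** 0.5

def linear_layer(x, fan_out, weight_std, rng):
    """z = W x with W drawn i.i.d. from N(0, weight_std^2); no bias, no activation."""
    return [sum(rng.gauss(0.0, weight_std) * xi for xi in x)
            for _ in range(fan_out)]

rng = random.Random(0)
fan_in = fan_out = 256
x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]

z = linear_layer(x, fan_out, xavier_std(fan_in), rng)
ratio = statistics.pvariance(z) / statistics.pvariance(x)
print(f"Var(z) / Var(x) = {ratio:.2f}")
```

With this scaling the variance of the layer output stays close to the variance of its input, which is the property the initialization is designed to provide.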



## 3. Residual Connections

A residual connection adds the value at the beginning of the block, x, directly to the output at the end of the block (F(x) + x). Because this skip path does not pass through the activation functions that “squash” the derivatives, the overall derivative of the block is higher.
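
The effect on the derivative can be sketched with scalars. Assuming a hypothetical block F(a) = tanh(w * a) with w = 0.5, a plain stack multiplies the gradient by F'(a) at every block, while a residual stack multiplies it by F'(a) + 1.

```python
import math

def plain_chain_grad(x, w, depth):
    """d(output)/d(input) for a_{k+1} = tanh(w * a_k)."""
    a, grad = x, 1.0
    for _ in range(depth):
        h = math.tanh(w * a)
        grad *= w * (1.0 - h * h)          # F'(a) only
        a = h
    return grad

def residual_chain_grad(x, w, depth):
    """d(output)/d(input) for a_{k+1} = a_k + tanh(w * a_k)."""
    a, grad = x, 1.0
    for _ in range(depth):
        h = math.tanh(w * a)
        grad *= 1.0 + w * (1.0 - h * h)    # F'(a) plus 1 from the skip path
        a = a + h
    return grad

for depth in (5, 10, 20):
    print(depth,
          f"plain={plain_chain_grad(0.5, 0.5, depth):.3e}",
          f"residual={residual_chain_grad(0.5, 0.5, depth):.3e}")
```

In this toy setting every residual factor is at least 1, so the gradient cannot vanish, whereas the plain chain loses at least half of its gradient at each block.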

### ResNet Architecture

<center>
<img src="vae_gradients_images/resnet_arch.jpeg" alt="Drawing" style="width: 95%;"/>
</center>

> **Bottleneck Design:** The use of a bottleneck reduces the number of parameters and matrix multiplications. The idea is to make residual blocks as thin as possible in order to increase depth while using fewer parameters. Bottleneck blocks were introduced as part of the ResNet architecture and are used in deeper ResNets such as ResNet-50 and ResNet-101.

| Residual Block Connection | Bottleneck Design | 
| -- | -- | 
| ![](vae_gradients_images/residual_block_connection.png) | ![Residual Network](vae_gradients_images/resnet_bottleneck.png) |
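
A rough parameter count illustrates the saving. The comparison below is hypothetical: it assumes 256 input and output channels for both blocks and a bottleneck width of 64, and it ignores biases and normalization parameters.

```python
def conv_params(in_channels, out_channels, kernel):
    """Number of weights in a kernel x kernel convolution, ignoring biases."""
    return in_channels * out_channels * kernel * kernel

# Two stacked 3x3 convolutions at full width.
basic = 2 * conv_params(256, 256, 3)

# 1x1 reduce -> 3x3 at reduced width -> 1x1 expand.
bottleneck = (conv_params(256, 64, 1)
              + conv_params(64, 64, 3)
              + conv_params(64, 256, 1))

print(f"basic block:      {basic:,} weights")
print(f"bottleneck block: {bottleneck:,} weights")
print(f"ratio:            {basic / bottleneck:.1f}x")
```

Under these assumptions the bottleneck uses 69,632 weights against 1,179,648 for the two full-width 3x3 convolutions, which is what makes much deeper stacks affordable.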

### ResNet Performance

Winning results of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) over the past years on the top-5 classification task.

<center>
<img src="vae_gradients_images/image_net_results.jpg" alt="Drawing" style="width: 75%;"/>
</center>

### [Current State-of-the-art Models](https://paperswithcode.com/sota/image-classification-on-imagenet)


## 4. GRU and LSTM For RNN Cases

These two topics will be explained later.

## 5. Batch Normalization Layers

Batch Normalization (BN), proposed by Sergey Ioffe and Christian Szegedy in 2015, does not prevent the vanishing or exploding gradient problem in the sense of making it impossible. Rather, it reduces the probability that these problems occur. Accordingly, the original paper states:

> In traditional deep networks, too-high learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima. Batch Normalization helps address these issues. By normalizing activations throughout the network, it prevents small changes to the parameters from amplifying into larger and suboptimal changes in activations in gradients; for instance, it prevents the training from getting stuck in the saturated regimes of nonlinearities.
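
The normalization step itself is short. The sketch below covers only the training-time computation for a single feature over one mini-batch; the running statistics used at inference time are omitted, and the sample values are made up to sit deep in the saturated region of a sigmoid.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Hypothetical pre-activations that would saturate a sigmoid.
pre_activations = [8.0, 9.0, 10.0, 11.0, 12.0]
normalized = batch_norm(pre_activations)

print("before:", pre_activations)
print("after: ", [round(v, 3) for v in normalized])
```

After normalization the values are centred around zero with roughly unit variance, which moves them back into the region where the nonlinearity still has a useful derivative. The learnable `gamma` and `beta` let the network undo the normalization where that helps.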

***

# Solutions to the Exploding Gradient Problem

1. **Gradient Clipping**: When gradients explode, they can become NaN because of numerical overflow, or we may see irregular oscillations in the training cost when we plot the learning curve. One fix is gradient clipping, which places a predefined threshold on the gradients to prevent them from getting too large. Clipping by norm does not change the direction of the gradient, only its length.

<center>
<img src="vae_gradients_images/gradient-clipping.png" alt="Drawing" style="width: 50%;"/>
</center>


2. **Network Re-designing**: Using a smaller batch size and fewer layers during training might help with exploding gradients.

3. **LSTM Networks**: Using LSTMs, and perhaps related gated neuron structures, is the current best practice for avoiding exploding gradients in recurrent networks.

4. **Weight Regularization**: If exploding gradients still occur, check the size of the network weights and apply a penalty to the network's loss function for large weight values.
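
Two of the remedies above, gradient clipping and weight regularization, are simple enough to sketch in plain Python. The function names, the threshold of 5.0 and the penalty coefficient of 0.01 are illustrative assumptions; the clipping shown is the norm-based variant, and the penalty is an L2 penalty, which is one common choice among several.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)            # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grads] # same direction, shorter length

def l2_penalty(weights, lam):
    """Term added to the loss that grows with the squared size of the weights."""
    return lam * sum(w * w for w in weights)

grads = [30.0, 40.0]                  # L2 norm is 50
clipped = clip_by_norm(grads, max_norm=5.0)
print("clipped gradient:", clipped)
print("L2 penalty:", l2_penalty([3.0, 4.0], lam=0.01))
```

Because every component is multiplied by the same factor, the ratio between components is preserved, which is what keeps the update direction unchanged.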

