# Vanishing/Exploding (Unstable) Gradient Problem


**How to Solve?**

- Changing how we initialize weights
- Using non-saturating activation functions
- Batch-normalization
- Gradient clipping


## Changing how we initialize weights

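Draw the initial weights at random from a distribution with zero mean and a variance that depends on the layer's number of inputs (fan-in) and outputs (fan-out), so that the variance of the signal stays roughly constant from layer to layer in both the forward and the backward pass.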


### Glorot (Xavier) Initialization
When the weights are drawn from a normal distribution, about 68.2% of the values lie within one standard deviation of the mean (between -1σ and +1σ).

![image.png](attachment:3faed1ad-77d0-49b0-b396-afad555c9c7f.png)
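The 68.2% figure can be checked with the error function. This is a minimal sketch using only the standard library; the helper name `fraction_within` is made up for illustration.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

# One standard deviation covers roughly 68% of the values,
# two cover roughly 95%.
print(round(fraction_within(1), 4))
print(round(fraction_within(2), 4))
```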


### He Initialization

![image.png](attachment:3ee520f6-c5ca-4787-b510-77aee41b0652.png)

Using **He initialization** with **ELU** or any other variant of ReLU can
significantly reduce **the danger of unstable gradients** at **the beginning of training**.

It does not guarantee that they won't come back later in training, which is where **batch normalization** helps.

### LeCun Initialization

![image.png](attachment:6340b80f-43b9-4c6c-aa5d-f82b50b122bc.png)
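The three schemes differ only in the variance they assign to a zero-mean distribution. The sketch below uses the commonly cited normal-distribution variances (Glorot: 1 / fan_avg, He: 2 / fan_in, LeCun: 1 / fan_in). The function names `init_std` and `init_weights` are hypothetical, and a plain normal distribution is assumed here, whereas libraries often use a truncated normal or a uniform variant.

```python
import math
import random

def init_std(fan_in, fan_out, scheme):
    """Standard deviation of the zero-mean normal distribution for a scheme."""
    if scheme == "glorot":
        variance = 1.0 / ((fan_in + fan_out) / 2.0)  # 1 / fan_avg
    elif scheme == "he":
        variance = 2.0 / fan_in
    elif scheme == "lecun":
        variance = 1.0 / fan_in
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return math.sqrt(variance)

def init_weights(fan_in, fan_out, scheme, seed=0):
    """Sample a fan_in x fan_out weight matrix as a list of rows."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    std = init_std(fan_in, fan_out, scheme)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

# Example: a layer with 100 inputs and 50 outputs.
for scheme in ("glorot", "he", "lecun"):
    print(scheme, init_std(100, 50, scheme))
```

When fan-in equals fan-out, Glorot and LeCun give the same variance, while He is twice as large.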


---

![image.png](attachment:f4945418-e587-4969-a426-963c4f411a75.png)

## Using non-saturating activation functions

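Saturation at the extremes causes unstable gradients: where the function flattens out, its derivative is close to zero, so almost no gradient is propagated back through the layer.


Saturating activation functions:

- Sigmoid (saturates at both extremes)
- ReLU (for negative inputs only)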

Non-saturating activation functions:

- Leaky ReLU (with parameter alpha)
- Randomized Leaky ReLU (RReLU) -> alpha is random
- Parametric Leaky ReLU (PReLU) -> alpha is figured out during training
- Exponential Linear Unit (ELU) (with parameter alpha)
- Scaled ELU (SELU) -> alpha and the scale factor are fixed constants, not trainable
- Gaussian Error Linear Units (GELU)

![image.png](attachment:e434f90b-eba2-4f40-b777-67f0d1c291ad.png)
![image.png](attachment:d28cd3fe-cf44-4856-bfcc-393136e98874.png)
![image.png](attachment:6459833f-1f07-4dc7-a82f-49d93ba1b202.png)
![image.png](attachment:950265a6-0fe5-4918-b5b0-30d52d24e286.png)
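Several of the functions listed above can be written in a few lines each. This is a scalar sketch with the standard library only; the default `alpha` values are common choices rather than requirements, and the SELU constants are the fixed values from the SELU definition.

```python
import math

def leaky_relu(z, alpha=0.01):
    """Small slope alpha for negative inputs instead of a flat zero."""
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    """Smooth exponential curve for negative inputs, approaching -alpha."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

# Fixed SELU constants (not learned during training).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """ELU with a fixed alpha, multiplied by a fixed scale factor."""
    return SELU_SCALE * elu(z, SELU_ALPHA)

def gelu(z):
    """Exact GELU: z times the standard normal CDF evaluated at z."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-2.0, 0.0, 2.0):
    print(z, leaky_relu(z), elu(z), selu(z), gelu(z))
```

RReLU and PReLU share the `leaky_relu` formula; they differ only in how `alpha` is obtained (sampled at random, or learned during training).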

## Batch-normalization (BN)

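Center the inputs of each layer around **zero** and normalize them, then scale and shift the result with two learned parameters.

- Achieves the same accuracy in fewer epochs
- Can lead to better performance
- No need for a separate standardization layer when BN is the first layer of the model
- Reduces the need for regularization
- Each epoch takes longer because of the extra computation, but convergence usually needs fewer epochs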

**Normalization** means:

- set **mean** to **0**, and
- set **standard deviation** to **1**

![image.png](attachment:03fe9134-ed3c-46ec-a930-46db990b4afd.png)

![image.png](attachment:d9ea0296-ff08-4220-8016-858bcd96d2c2.png)

![image.png](attachment:91317e74-b215-4582-be80-321ecdc4ac05.png)

Variance = square of the standard deviation

![image.png](attachment:802949aa-dc07-44b5-8e98-05dde813bee7.png)

![image.png](attachment:17589f8d-bbac-4901-9c13-75f002a2d6fa.png)

![image.png](attachment:0c5c7384-d5bc-4819-9e98-4261080b5e64.png)

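The first two parameters (the scale γ and the shift β) are **trained** by backpropagation, but the third and fourth parameters (the moving mean and the moving variance) are **estimated from the dataset** as running averages and are not trainable.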
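The batch-normalization steps for a single feature can be sketched as follows. The names `gamma` and `beta` stand for the two trained parameters (scale and shift), `eps` is a small constant that avoids division by zero, and `momentum` controls the running averages; the function names and default values are illustrative assumptions, not a specific library's API.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-3):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # variance = std ** 2
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    out = [gamma * x_hat + beta for x_hat in normalized]
    return out, mean, var

def update_moving(moving, batch_stat, momentum=0.99):
    """Running average of a batch statistic, used later at inference time."""
    return momentum * moving + (1.0 - momentum) * batch_stat

out, mean, var = batch_norm_train([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0)
moving_mean = update_moving(0.0, mean)
moving_var = update_moving(1.0, var)
print(out, moving_mean, moving_var)
```

At inference time the moving mean and moving variance replace the per-batch statistics, so a single example can be normalized without a batch.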



## Gradient clipping (for exploding gradients in RNNs)

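- Clipping by value: clip every component of the gradient to a fixed range such as [-1, 1] (this can change the direction of the gradient)
- Clipping by norm: rescale the whole gradient vector when its norm exceeds a threshold (this preserves its direction)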

![image.png](attachment:59d38cf1-2cbe-4767-b0fc-36e2b332146a.png)

![image.png](attachment:79ba2baf-fc12-4e96-bc72-9fc321e00329.png)
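The two clipping strategies can be compared on a small gradient vector. This is a plain-Python sketch; the function names and the threshold of 1.0 are assumptions chosen for illustration.

```python
import math

def clip_by_value(grad, threshold=1.0):
    """Clip each component independently to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grad]

def clip_by_norm(grad, threshold=1.0):
    """Rescale the whole vector if its L2 norm exceeds the threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]

grad = [0.9, 100.0]
print(clip_by_value(grad))  # components are clipped one by one
print(clip_by_norm(grad))   # the ratio between components is preserved
```

With `[0.9, 100.0]`, clipping by value gives `[0.9, 1.0]`, which points in a very different direction from the original gradient, while clipping by norm keeps the second component dominant.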
