______________________
# Vanishing/Exploding gradient problem in RNNs
__________________________
Let's analyse the gradients of a simple RNN to understand the problem we encounter during training.

<img src='assets/vanish_grad1.png'>

RNNs are trained using 'Backpropagation through time (BPTT)', where the objective is to learn the parameters $W_{xh}, W_{hh}, W_{hy}$ (written as $U$, $W$ and $V$ in the equations below) by computing the error gradients with respect to those parameters and updating them with gradient descent.

Let's compute the gradient $\large\frac{\partial E} {\partial W}$ for the above RNN, which has three time steps. The full gradient for the shared weights is obtained by summing the gradients of the error at each time step: $\sum\limits_{t}\large\frac{\partial E_{t}} {\partial W_{hh}}$. For this exercise, however, we only calculate $\large\frac{\partial E_{t+1}} {\partial W}$.

Let's use the 'computational graph gradient flow approach' to track the gradient as it flows back through the various computational nodes. Go through the [Gradient flow in neural network lesson](http://localhost:8888/notebooks/intro/Gradient%20Flow.ipynb) for the details.

The equations involved:   
$$\begin{align}
z_{t} &= U x_{t} + W s_{t-1} \\
s_{t} &= \tanh(z_{t}) \\
q_{t} &= V s_{t} \\
o_{t} &= \mathrm{softmax}(q_{t})
\end{align}$$
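
The four equations above translate almost line for line into NumPy. The following is a minimal sketch, not the notebook's own implementation: the layer sizes, the random initialisation, and the helper names `softmax` and `rnn_step` are assumptions made purely for illustration.

```python
import numpy as np

def softmax(q):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(q - np.max(q))
    return e / e.sum()

def rnn_step(x_t, s_prev, U, W, V):
    """One time step of the simple RNN defined by the equations above."""
    z_t = U @ x_t + W @ s_prev   # z_t = U x_t + W s_{t-1}
    s_t = np.tanh(z_t)           # s_t = tanh(z_t)
    q_t = V @ s_t                # q_t = V s_t
    o_t = softmax(q_t)           # o_t = softmax(q_t)
    return z_t, s_t, q_t, o_t

# Hypothetical sizes and random weights, chosen only for this example.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 2
U = rng.normal(scale=0.5, size=(n_hidden, n_in))
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.5, size=(n_out, n_hidden))

s = np.zeros(n_hidden)           # initial hidden state
outputs = []
for x_t in rng.normal(size=(3, n_in)):   # three time steps
    z, s, q, o = rnn_step(x_t, s, U, W, V)
    outputs.append(o)
```

Note that the same `W` is used at every step, which is why its gradient has one contribution per time step.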


#### Backpropagating from the last output node:
* Since we need the gradient of $E_{t+1}$ with respect to $W$, we start at the output node $o_{t+1}$. Since the weight matrix $W$ is shared across all time steps, we need to calculate its gradient at each time step individually and add the results to get the final gradient. Note that all the notation below is in matrix form.
* ##### W gradient at time (t+1):
    + The error gradient $\large\frac{\partial E_{t+1}} {\partial o_{t+1}}$ is propagated back.
    + Next is the softmax activation gate, with output $o_{t+1}$ and input $q_{t+1}$, so the local gradient $\large\frac{\partial o_{t+1}} {\partial q_{t+1}}$ gets multiplied with the incoming gradient. Thus we get
    
    $$\frac{\partial E_{t+1}} {\partial o_{t+1}} . \frac{\partial o_{t+1}} {\partial q_{t+1}} $$
        
    + Now we encounter the multiplication gate with inputs $V$ and $s_{t+1}$, so the gradient towards $s_{t+1}$ is the incoming gradient multiplied by $V$. The gradient at this point is 
    
    $$\frac{\partial E_{t+1}} {\partial o_{t+1}} . \frac{\partial o_{t+1}} {\partial q_{t+1}} . \small V $$
    
    + After that comes the tanh activation unit whose output is $s_{t+1}$ and input is $z_{t+1}$. So the local gradient gets multiplied with the incoming gradient, and we get
    
    $$\frac{\partial E_{t+1}} {\partial o_{t+1}} . \frac{\partial o_{t+1}} {\partial q_{t+1}} . \small V . \frac{\partial s_{t+1}} {\partial z_{t+1}} $$
    
    + Next comes the multiplication gate where the gradient at W is obtained by multiplying the incoming gradient with the 'other input' $s_{t}$, while the gradient that flows through $s_{t}$ gets multiplied with W. So the gradient at W is:   
    
    $$\frac{\partial E_{t+1}} {\partial W} = 
    \frac{\partial E_{t+1}} {\partial o_{t+1}} . 
    \frac{\partial o_{t+1}} {\partial q_{t+1}} . 
    \small V . 
    \frac{\partial s_{t+1}} {\partial z_{t+1}} . 
    s_{t}$$ 
    
    and the other gradient, which flows back through $s_{t}$, is   
    $$\frac{\partial E_{t+1}} {\partial o_{t+1}} . 
    \frac{\partial o_{t+1}} {\partial q_{t+1}} . 
    \small V . 
    \frac{\partial s_{t+1}} {\partial z_{t+1}} . 
    \small W$$

* ##### W gradient at time (t):
    + Here the next node is the activation unit, so the local gradient gets multiplied with the incoming gradient from the previous step. We get, 
    
    $$\frac{\partial E_{t+1}} {\partial o_{t+1}} . 
    \frac{\partial o_{t+1}} {\partial q_{t+1}} . 
    \small V . 
    \frac{\partial s_{t+1}} {\partial z_{t+1}} . 
    \small W . 
    \frac{\partial s_{t}} {\partial z_{t}}$$
    
    + Next comes the multiplication gate where the gradient at W is obtained by multiplying the incoming gradient with the 'other input' $s_{t-1}$, while the gradient that flows through $s_{t-1}$ gets multiplied with W. So the gradient at W is:
    
     $$\frac{\partial E_{t+1}} {\partial W} =
     \frac{\partial E_{t+1}} {\partial o_{t+1}} . 
    \frac{\partial o_{t+1}} {\partial q_{t+1}} . 
    \small V . 
    \frac{\partial s_{t+1}} {\partial z_{t+1}} . 
    \small W . 
    \frac{\partial s_{t}} {\partial z_{t}}. s_{t-1}$$ 
     and the other gradient that flows through $s_{t-1}$ is 
     
     $$\frac{\partial E_{t+1}} {\partial o_{t+1}} . 
    \frac{\partial o_{t+1}} {\partial q_{t+1}} . 
    \small V . 
    \frac{\partial s_{t+1}} {\partial z_{t+1}} . 
    \small W . 
    \frac{\partial s_{t}} {\partial z_{t}}. W$$ 

* ##### W gradient at time (t-1):
    + Repeating the same as above, we get the gradient at W as:
    
     $$\frac{\partial E_{t+1}} {\partial W} =
     \frac{\partial E_{t+1}} {\partial o_{t+1}} . 
    \frac{\partial o_{t+1}} {\partial q_{t+1}} . 
    \small V . 
    \frac{\partial s_{t+1}} {\partial z_{t+1}} . 
    \small W . 
    \frac{\partial s_{t}} {\partial z_{t}}. W. \frac{\partial s_{t-1}} {\partial z_{t-1}}. s_{t-2}$$ 
    
So the final gradient is 

$$\begin{align}\frac{\partial E_{t+1}} {\partial W} &= 
    \frac{\partial E_{t+1}} {\partial o_{t+1}} . 
    \frac{\partial o_{t+1}} {\partial q_{t+1}} . 
    \small V . 
    \frac{\partial s_{t+1}} {\partial z_{t+1}} . 
    s_{t}  \\ &+ 
    \frac{\partial E_{t+1}} {\partial o_{t+1}} . 
    \frac{\partial o_{t+1}} {\partial q_{t+1}} . 
    \small V . 
    \frac{\partial s_{t+1}} {\partial z_{t+1}} . 
    \small W . 
    \frac{\partial s_{t}} {\partial z_{t}}. s_{t-1} 
    \\ &+ 
    \frac{\partial E_{t+1}} {\partial o_{t+1}} . 
    \frac{\partial o_{t+1}} {\partial q_{t+1}} . 
    \small V . 
    \frac{\partial s_{t+1}} {\partial z_{t+1}} . 
    \small W . 
    \frac{\partial s_{t}} {\partial z_{t}}. W. \frac{\partial s_{t-1}} {\partial z_{t-1}}. s_{t-2} \end{align}$$
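
The three-term sum can be checked numerically. The sketch below accumulates one term per time step while walking backwards, then compares the result with finite differences. Several things here are assumptions rather than part of the text: the error $E_{t+1}$ is taken to be the cross-entropy against a class index `y` (so that $\frac{\partial E_{t+1}}{\partial o_{t+1}} . \frac{\partial o_{t+1}}{\partial q_{t+1}}$ collapses to $o_{t+1} - y$), the sizes and random values are arbitrary, and all function names are hypothetical. In code, the products written with "." become transposed matrix products and outer products so that the shapes agree.

```python
import numpy as np

def forward(xs, s0, U, W, V):
    """Run the RNN over xs; return all hidden states and the last output."""
    states = [s0]                      # states[0] plays the role of s_{t-2}
    for x in xs:
        states.append(np.tanh(U @ x + W @ states[-1]))
    q = V @ states[-1]
    e = np.exp(q - q.max())
    return states, e / e.sum()

def loss(xs, s0, U, W, V, y):
    """Assumed loss: cross-entropy of the last output against class index y."""
    _, o = forward(xs, s0, U, W, V)
    return -np.log(o[y])

def grad_W(xs, s0, U, W, V, y):
    """Sum the per-time-step terms of dE_{t+1}/dW, walking back in time."""
    states, o = forward(xs, s0, U, W, V)
    delta_q = o.copy()
    delta_q[y] -= 1.0                  # dE/dq for softmax + cross-entropy
    delta_s = V.T @ delta_q            # gradient arriving at s_{t+1}
    dW = np.zeros_like(W)
    for k in range(len(xs), 0, -1):
        delta_z = delta_s * (1.0 - states[k] ** 2)   # through the tanh unit
        dW += np.outer(delta_z, states[k - 1])       # times the 'other input'
        delta_s = W.T @ delta_z                      # flows on to the previous state
    return dW

def numeric_grad_W(xs, s0, U, W, V, y, eps=1e-6):
    """Central finite differences, used only to check grad_W."""
    g = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            g[i, j] = (loss(xs, s0, U, Wp, V, y) - loss(xs, s0, U, Wm, V, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 2
U = rng.normal(scale=0.5, size=(n_hidden, n_in))
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.5, size=(n_out, n_hidden))
xs = rng.normal(size=(3, n_in))             # inputs at t-1, t, t+1
s0 = rng.normal(scale=0.5, size=n_hidden)   # nonzero so the s_{t-2} term matters
y = 1

analytic = grad_W(xs, s0, U, W, V, y)
numeric = numeric_grad_W(xs, s0, U, W, V, y)
print(np.max(np.abs(analytic - numeric)))   # largest disagreement between the two
```

Each pass through the loop in `grad_W` adds one more factor of $W$ and one more tanh derivative to the running gradient, mirroring how each successive term of the sum grows by one $W$ and one $\frac{\partial s}{\partial z}$.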

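Notice the terms $\frac{\partial s_{t-n}} {\partial z_{t-n}}$: they are derivatives of the activation function, and one more of them is multiplied in for every time step the gradient travels back. If the activation function is sigmoid, its derivative is never larger than 0.25 (for the tanh used above it is never larger than 1), so multiplying many such factors yields a very small number, which causes the vanishing gradient problem! The matrix $W$ is multiplied in just as often, so when its entries are large the product can instead grow very large, which is the exploding gradient problem.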
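
To see the shrinking numerically, the sketch below backpropagates a gradient vector through 20 time steps and records its norm after each step, once with sigmoid and once with tanh as the activation. It is an illustration under stated assumptions, not a benchmark: `W` is deliberately chosen orthogonal so that multiplying by it never changes the norm and only the activation derivatives can shrink the gradient, the vector `b_t` stands in for $U x_{t}$, and the function names, sizes and random inputs are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # never larger than 0.25

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2       # never larger than 1

def backprop_norms(act, act_grad, W, inputs):
    """Norm of the gradient w.r.t. the hidden state after each backward step."""
    s = np.zeros(W.shape[0])
    zs = []
    for b in inputs:                   # forward pass: z_t = W s_{t-1} + b_t
        z = W @ s + b
        zs.append(z)
        s = act(z)
    delta = np.ones(W.shape[0])        # stand-in for the gradient at the last state
    norms = [np.linalg.norm(delta)]
    for z in reversed(zs):             # backward pass, one time step at a time
        delta = W.T @ (delta * act_grad(z))
        norms.append(np.linalg.norm(delta))
    return np.array(norms)

rng = np.random.default_rng(0)
n_hidden, n_steps = 8, 20
W, _ = np.linalg.qr(rng.normal(size=(n_hidden, n_hidden)))  # orthogonal matrix
inputs = rng.normal(size=(n_steps, n_hidden))               # b_t stands in for U x_t

norms_sigmoid = backprop_norms(sigmoid, sigmoid_grad, W, inputs)
norms_tanh = backprop_norms(np.tanh, tanh_grad, W, inputs)
print(norms_sigmoid[-1], norms_tanh[-1])
```

With sigmoid, every backward step scales the norm by at most 0.25, so after 20 steps it is at most $0.25^{20}$ of its starting value. With tanh the norm can only stay the same or shrink in this setup, since each derivative is at most 1.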

## How does LSTM prevent the vanishing gradient problem?

We have seen above that the problem is primarily due to the derivative of the activation function. In the case of LSTM, the activation on the recurrent path that carries the cell state is the identity function, with a derivative of 1.0. So the backpropagated gradient neither vanishes nor explodes when passing through it. The details will be published in the next section.
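
As a rough illustration of why an identity activation helps, consider a simplified additive state update. This is an assumption made here for intuition only, not the full LSTM: the symbols $c_{t}$ (a carried state) and $g_{t}$ (whatever is added at step $t$) do not appear in the text above, $g_{t}$ is treated as independent of $c_{t-1}$, and the LSTM gates are ignored. Compare it with the simple RNN, where each step contributes an activation derivative and a factor of $W$:

$$\begin{align}
\text{simple RNN:} \quad \frac{\partial s_{t}} {\partial s_{t-1}} &= \mathrm{diag}\left(1 - s_{t}^{2}\right) W \\
\text{additive update:} \quad c_{t} &= c_{t-1} + g_{t} \\
\frac{\partial c_{t}} {\partial c_{t-1}} &= I \\
\frac{\partial c_{t}} {\partial c_{t-n}} &= I
\end{align}$$

Under these assumptions, the product over $n$ steps is still the identity matrix $I$, so the gradient along this path is passed back unchanged however many steps it travels.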
