Gated recurrent neural networks, such as LSTMs and GRUs, have achieved state-of-the-art performance on difficult sequence learning tasks.
Their gating mechanisms let them learn longer-term dependencies from data than a plain RNN typically can.

Consider a simple tanh RNN with the following recursive forward equations:
\begin{align}
a_t &= W_h h_{t-1} + W_i x_t \\
h_t &= g(a_t)
\end{align}
where $g = \tanh$ is applied elementwise and the base case of the recursion could be $h_0 = \epsilon \mathbb{1}$ for some small $\epsilon \geq 0$.
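The forward recursion above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `rnn_forward`, the argument `xs` (one input vector per time step, e.g. one-hot encoded tokens), and the default `eps` are assumptions made for the example.

```python
import numpy as np

def rnn_forward(W_h, W_i, xs, eps=0.0):
    """Run the tanh RNN over the inputs xs and return h_0, ..., h_T.

    W_h: (n_hidden, n_hidden) recurrent weights
    W_i: (n_hidden, n_input) input weights
    xs:  (T, n_input) array, row t-1 holds the input vector x_t
    """
    n_hidden = W_h.shape[0]
    h = np.full(n_hidden, eps)       # base case h_0 = eps * 1
    hs = [h]
    for x_t in xs:
        a_t = W_h @ h + W_i @ x_t    # pre-activation a_t
        h = np.tanh(a_t)             # h_t = g(a_t) with g = tanh
        hs.append(h)
    return np.stack(hs)              # shape (T + 1, n_hidden)
```

Returning every hidden state, rather than only the last one, keeps the values that the output layer and the backward pass need later.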
For simplicity, we use an output layer geared towards next-step prediction:
\begin{align}
z_t &= W_o h_t \\
l_t &= \log\bigl(\mathrm{softmax}(z_t, x_{t+1})\bigr) = z_{t,x_{t+1}} - \log\Bigl(\sum_k \exp(z_{t,k})\Bigr)
\end{align}
We can then use the negative log-likelihood of a sequence $x = [x_1,\ldots,x_T]$ as a function of the parameters $\theta=\{W_i, W_h, W_o\}$:
$$
L(\theta) = -\sum_{t=0}^{T-1} l_t
$$
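As a rough sketch of how the output layer and the sequence loss fit together, the snippet below evaluates $L(\theta)$ from precomputed hidden states. The names `log_softmax` and `sequence_nll` are hypothetical, and the code assumes tokens are integer class indices, with `hs[t]` holding $h_t$ and `tokens[t]` holding $x_{t+1}$ for $t = 0, \ldots, T-1$. Subtracting the maximum logit is a standard numerical stabilisation that does not change the value of $l_t$.

```python
import numpy as np

def log_softmax(z):
    """Log of the softmax of z, computed stably by shifting by max(z)."""
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

def sequence_nll(W_o, hs, tokens):
    """Negative log-likelihood L = -sum_t l_t for next-step prediction.

    W_o:    (n_vocab, n_hidden) output weights
    hs:     (T, n_hidden) array, hs[t] is h_t for t = 0..T-1
    tokens: length-T sequence of integer ids, tokens[t] is x_{t+1}
    """
    total = 0.0
    for h_t, target in zip(hs, tokens):
        z_t = W_o @ h_t                      # logits z_t = W_o h_t
        total -= log_softmax(z_t)[target]    # subtract l_t
    return total
```

With all-zero output weights every token is equally likely, so the loss for a length-$T$ sequence over a vocabulary of size $V$ reduces to $T \log V$, which makes a convenient sanity check.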


Back-propagation hinges on the gradients $\nabla_{a_t} L$ for all time steps.
Three observations let us compute these quantities:

1. For $t>s$, $l_s$ does not depend on $a_t$.
2. $l_t$ depends directly on $a_t$.
3. For $s>t$, $l_s$ depends on $a_t$ only through $a_{t+1}$.

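Thus, since $L = -\sum_t l_t$, we can compute $\nabla_{a_t} L$ with the total derivative rule:
$$
\nabla_{a_t} L = -\nabla_{a_t} l_t + \sum_d \frac{\partial L}{\partial a_{t+1,d}} \nabla_{a_t} a_{t+1,d}
$$
For the last time step the sum is dropped, since $L$ contains no loss terms after $l_{T-1}$.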
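The backward recursion can be sketched as below, under a few assumptions: tokens are integer ids, `inputs[k]` is the token fed in at step `k`, `targets[k]` is the token to predict from the resulting hidden state, and the names `preactivation_grads` and `nll_with_offsets` are hypothetical. Because $L = -\sum_t l_t$, the direct term enters as $-\nabla_{a_t} l_t = (1 - h_t^2) \odot W_o^\top(\mathrm{softmax}(z_t) - e_{x_{t+1}})$, and the recursive term is $(1 - h_t^2) \odot W_h^\top \nabla_{a_{t+1}} L$. The term $l_0$, which depends only on $h_0$, is left out because it does not depend on any $a_t$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def nll_with_offsets(W_h, W_i, W_o, inputs, targets, h0, offsets):
    """NLL with offsets[k] added to the k-th pre-activation.

    Differentiating with respect to offsets[k] at zero gives the gradient
    with respect to a_k, so this is handy for finite-difference checks.
    """
    h, loss = h0, 0.0
    for k, x in enumerate(inputs):
        h = np.tanh(W_h @ h + W_i[:, x] + offsets[k])
        loss -= np.log(softmax(W_o @ h)[targets[k]])
    return loss

def preactivation_grads(W_h, W_i, W_o, inputs, targets, h0):
    """Return the gradient of the NLL w.r.t. every pre-activation a_k."""
    # Forward pass, storing hidden states and predicted distributions.
    hs, ps = [], []
    h = h0
    for x in inputs:
        h = np.tanh(W_h @ h + W_i[:, x])
        hs.append(h)
        ps.append(softmax(W_o @ h))

    # Backward pass: start from the last step, where the recursive term is zero.
    deltas = [None] * len(inputs)
    carry = np.zeros_like(h0)            # holds W_h^T delta_{k+1}
    for k in reversed(range(len(inputs))):
        err = ps[k].copy()
        err[targets[k]] -= 1.0           # softmax(z_k) - one_hot(target)
        deltas[k] = (1.0 - hs[k] ** 2) * (W_o.T @ err + carry)
        carry = W_h.T @ deltas[k]
    return np.stack(deltas)
```

Comparing the rows of `preactivation_grads` against central finite differences of `nll_with_offsets` is a simple way to confirm that the signs and indices of the recursion are right.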



[Stanford Lecture Notes](https://cs224d.stanford.edu/lecture_notes/LectureNotes4.pdf)