The basic RNN equation is

$s_t = tanh(Ws_{t-1} + Ux_t)$

### Problem statement:

Let's take an RNN that predicts a character at each time step.

For easier understanding, we will skip the activation in the RNN, so the equations become

$s_t = Ws_{t-1} + Ux_t$

$z_t = Vs_t$

$\hat{y}_t = softmax(z_t)$
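To make the three equations concrete, here is a minimal NumPy sketch of the forward pass. The sizes (`vocab`, `hidden`), the random weights, and the four one-hot input characters are illustrative assumptions, not values from this notebook.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(xs, s0, W, U, V):
    """Run the activation-free RNN; return all states and all predictions."""
    states, preds = [s0], []
    for x in xs:
        s = W @ states[-1] + U @ x   # s_t = W s_{t-1} + U x_t
        z = V @ s                    # z_t = V s_t
        states.append(s)
        preds.append(softmax(z))     # y_hat_t = softmax(z_t)
    return states, preds

rng = np.random.default_rng(0)
vocab, hidden = 5, 3                 # assumed sizes, for illustration only
W = rng.normal(scale=0.5, size=(hidden, hidden))
U = rng.normal(scale=0.5, size=(hidden, vocab))
V = rng.normal(scale=0.5, size=(vocab, hidden))
xs = [np.eye(vocab)[i] for i in (0, 3, 1, 4)]   # four one-hot characters

states, preds = forward(xs, np.zeros(hidden), W, U, V)
print(preds[-1])                     # predicted distribution at t = 4
```

`states` holds $s_0 \dots s_4$ and `preds` holds one probability vector per time step.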

RNN will look like

![image.png](attachment:image.png)

***Cross Entropy loss:***

$$E = -\sum_{k} y_k \log(p_k)$$

- Let's derive, $\large{\frac{\partial E_4}{\partial V}} $

 $\large{\frac{\partial E_4}{\partial V} = \frac{\partial E_4}{\partial \hat{y_4}}.
 \frac{\partial \hat{y_4}}{\partial z_4}.\frac{\partial z_4}{\partial V}}$
 
 $= (\hat{y_4} - y_4).s_4$
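When $V$ is a matrix, the product $(\hat{y_4} - y_4).s_4$ is best read as an outer product, so the gradient has the same shape as $V$. The sketch below checks this against central finite differences; the sizes and random values are assumptions chosen only for the check.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(V, s4, y4):
    """Cross-entropy E_4 for logits z_4 = V s_4."""
    return -np.sum(y4 * np.log(softmax(V @ s4)))

rng = np.random.default_rng(1)
vocab, hidden = 4, 3                 # assumed sizes
V = rng.normal(size=(vocab, hidden))
s4 = rng.normal(size=hidden)
y4 = np.eye(vocab)[2]                # one-hot target

# Analytic gradient: (y_hat_4 - y_4) outer s_4
grad_V = np.outer(softmax(V @ s4) - y4, s4)

# Numerical gradient, one entry of V at a time
eps = 1e-6
numeric = np.zeros_like(V)
for i in range(vocab):
    for j in range(hidden):
        Vp, Vm = V.copy(), V.copy()
        Vp[i, j] += eps
        Vm[i, j] -= eps
        numeric[i, j] = (loss(Vp, s4, y4) - loss(Vm, s4, y4)) / (2 * eps)

print(np.max(np.abs(grad_V - numeric)))   # largest disagreement
```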

**Note:**

The derivative of the cross-entropy loss combined with softmax, taken with respect to the logit $z_i$, is $p_i - y_i$, where
- $p_i$ - predicted value
- $y_i$ - actual value
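
For reference, a short sketch of why this holds, assuming $p = softmax(z)$ and that the targets sum to one ($\sum_k y_k = 1$, which is true for a one-hot $y$). The softmax derivative is $\frac{\partial p_k}{\partial z_i} = p_k(\delta_{ki} - p_i)$, where $\delta_{ki}$ is 1 if $k = i$ and 0 otherwise. Then

$$\frac{\partial E}{\partial z_i} = -\sum_{k} \frac{y_k}{p_k}\frac{\partial p_k}{\partial z_i} = -\sum_{k} y_k(\delta_{ki} - p_i) = -y_i + p_i\sum_{k} y_k = p_i - y_i$$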

- Let's derive, $\large{\frac{\partial E_4}{\partial W}} $

![image.png](attachment:image.png)

- $E_4$ depends on $s_4$
- $s_4$ depends on $s_3$ and $W$
- $s_3$ depends on $s_2$ and $W$
- $s_2$ depends on $s_1$ and $W$
- $s_1$ depends on $s_0$ and $W$
- $s_0$ is a starting state

By chain rule,
$\large{\frac{\partial E_4}{\partial W}= \\
 \frac{\partial E_4}{\partial s_4}.\frac{\partial s_4}{\partial W} + \\
 \frac{\partial E_4}{\partial s_4}.\frac{\partial s_4}{\partial s_3}.\frac{\partial s_3}{\partial W} +\\
 \frac{\partial E_4}{\partial s_4}.\frac{\partial s_4}{\partial s_3}.\frac{\partial s_3}{\partial s_2}.\frac{\partial s_2}{\partial W} +\\
 \frac{\partial E_4}{\partial s_4}.\frac{\partial s_4}{\partial s_3}.\frac{\partial s_3}{\partial s_2}.\frac{\partial s_2}{\partial s_1}.\frac{\partial s_1}{\partial W}
} $

Simplified notation:

$$\large{\frac{\partial E_4}{\partial W}= \sum_{k=1}^4 \frac{\partial E_4}{\partial s_4}.(\prod_{j=k+1}^4\frac{\partial s_j}{\partial s_{j-1}})\frac{\partial s_k}{\partial W}
}$$

Solving equation:
    
$\large{\frac{\partial E_4}{\partial W}= 
 \frac{\partial E_4}{\partial s_4}
 (\frac{\partial s_4}{\partial W} + 
 \frac{\partial s_4}{\partial s_3}.\frac{\partial s_3}{\partial W} +
 \frac{\partial s_4}{\partial s_3}.\frac{\partial s_3}{\partial s_2}.\frac{\partial s_2}{\partial W} +
 \frac{\partial s_4}{\partial s_3}.\frac{\partial s_3}{\partial s_2}.\frac{\partial s_2}{\partial s_1}.\frac{\partial s_1}{\partial W})
} $

$\large{\frac{\partial E_4}{\partial W}= 
\frac{\partial E_4}{\partial s_4}
(s_3 + W.s_2 + W.W.s_1 + W.W.W.s_0)
}$

Now calculate $\large{\frac{\partial E_4}{\partial s_4}}$,

$\large{\frac{\partial E_4}{\partial s_4} = 
\frac{\partial E_4}{\partial \hat{y_4}}.\frac{\partial \hat{y_4}}{\partial z_4}.\frac{\partial z_4}{\partial s_4}
}$

$\large{\frac{\partial E_4}{\partial s_4} = 
(\hat{y_4} - y_4).V
}$

$\large{\frac{\partial E_4}{\partial W}= 
(\hat{y_4} - y_4).V.
(s_3 + W.s_2 + W.W.s_1 + W.W.W.s_0)
}$

Simplified,
$$\large{\frac{\partial E_4}{\partial W}= 
(\hat{y_4} - y_4).V.
\sum_{j=1}^4 W^{4-j}.s_{j-1}
}$$
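
The expression $(\hat{y_4} - y_4).V.(s_3 + W.s_2 + W.W.s_1 + W.W.W.s_0)$ can be checked numerically. The sketch below assumes a scalar hidden state (so `w` and `u` are plain numbers and products commute, exactly as the notation above treats them); all numeric values are made up for the check.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def run(w, u, v, xs, s0):
    """Scalar-state RNN: return states [s_0, ..., s_4] and y_hat_4."""
    states = [s0]
    for x in xs:
        states.append(w * states[-1] + u * x)
    return states, softmax(v * states[-1])

def loss(w, u, v, xs, s0, y):
    _, y_hat = run(w, u, v, xs, s0)
    return -np.sum(y * np.log(y_hat))

w, u, s0 = 0.7, -0.4, 0.3            # assumed scalar weights and start state
v = np.array([0.5, -1.0, 0.8])       # output weights, one per class
xs = [1.0, -0.5, 0.25, 2.0]          # x_1 .. x_4
y = np.array([0.0, 1.0, 0.0])        # one-hot target at t = 4

states, y_hat = run(w, u, v, xs, s0)
dE_ds4 = np.dot(y_hat - y, v)        # (y_hat_4 - y_4) . V
ds4_dw = sum(w ** (4 - j) * states[j - 1] for j in range(1, 5))
analytic = dE_ds4 * ds4_dw

eps = 1e-6
numeric = (loss(w + eps, u, v, xs, s0, y)
           - loss(w - eps, u, v, xs, s0, y)) / (2 * eps)
print(analytic, numeric)
```

Note that the power of `w` attached to `states[j - 1]` is `4 - j`, so the most recent state $s_3$ is multiplied by $W^0 = 1$.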

- Let's derive, $\large{\frac{\partial E_4}{\partial U}} $

![image.png](attachment:image.png)

- $E_4$ depends on $s_4$
- $s_4$ depends on $s_3$ and $U$
- $s_3$ depends on $s_2$ and $U$
- $s_2$ depends on $s_1$ and $U$
- $s_1$ depends on $s_0$ and $U$
- $s_0$ is a starting state

By the chain rule, with $\large{\frac{\partial E_4}{\partial s_4}}$ factored out:
    
$\large{\frac{\partial E_4}{\partial U}= 
 \frac{\partial E_4}{\partial s_4}
 (\frac{\partial s_4}{\partial U} + 
 \frac{\partial s_4}{\partial s_3}.\frac{\partial s_3}{\partial U} +
 \frac{\partial s_4}{\partial s_3}.\frac{\partial s_3}{\partial s_2}.\frac{\partial s_2}{\partial U} +
 \frac{\partial s_4}{\partial s_3}.\frac{\partial s_3}{\partial s_2}.\frac{\partial s_2}{\partial s_1}.\frac{\partial s_1}{\partial U})
} $

$\large{\frac{\partial E_4}{\partial U}= 
\frac{\partial E_4}{\partial s_4}
(x_4 + W.x_3 + W.W.x_2 + W.W.W.x_1)
}$

Simplified,
$$\large{\frac{\partial E_4}{\partial U}= 
(\hat{y_4} - y_4).V.
\sum_{j=1}^4 W^{4-j}.x_{j}
}$$
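
With matrix-valued weights, the same sum needs transposes and outer products that the scalar-style notation leaves implicit. The sketch below is one way to write it, under the assumption that states and inputs are column vectors: the error $V^T(\hat{y_4} - y_4)$ is pushed back one step at a time with $W^T$, and each step contributes an outer product with its input. Sizes and random values are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def final_state(W, U, xs, s0):
    s = s0
    for x in xs:
        s = W @ s + U @ x
    return s

def loss(W, U, V, xs, s0, y):
    return -np.sum(y * np.log(softmax(V @ final_state(W, U, xs, s0))))

rng = np.random.default_rng(2)
vocab, hidden = 4, 3                 # assumed sizes
W = rng.normal(scale=0.5, size=(hidden, hidden))
U = rng.normal(scale=0.5, size=(hidden, vocab))
V = rng.normal(scale=0.5, size=(vocab, hidden))
xs = [np.eye(vocab)[i] for i in (1, 0, 3, 2)]   # x_1 .. x_4
s0 = np.zeros(hidden)
y = np.eye(vocab)[0]

s4 = final_state(W, U, xs, s0)
delta = V.T @ (softmax(V @ s4) - y)  # dE_4/ds_4
grad_U = np.zeros_like(U)
for x in reversed(xs):               # visits x_4, x_3, x_2, x_1
    grad_U += np.outer(delta, x)     # contribution of this time step
    delta = W.T @ delta              # move the error one step back in time

# Finite-difference check, one entry of U at a time
eps = 1e-6
numeric = np.zeros_like(U)
for i in range(hidden):
    for j in range(vocab):
        Up, Um = U.copy(), U.copy()
        Up[i, j] += eps
        Um[i, j] -= eps
        numeric[i, j] = (loss(W, Up, V, xs, s0, y)
                         - loss(W, Um, V, xs, s0, y)) / (2 * eps)
print(np.max(np.abs(grad_U - numeric)))
```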

## In-depth explanation:

$s_1 = Ws_0 + Ux_1$

$s_2 = Ws_1 + Ux_2$

$s_3 = Ws_2 + Ux_3$

$s_4 = Ws_3 + Ux_4$

By product rule,
![image.png](attachment:image.png)

- Let's derive, $\large{\frac{\partial s_4}{\partial W}}$

$\large{\frac{\partial s_4}{\partial W} = s_3 + W.\frac{\partial s_3}{\partial W}}$

$\large{\frac{\partial s_3}{\partial W} = s_2 + W.\frac{\partial s_2}{\partial W}}$

$\large{\frac{\partial s_2}{\partial W} = s_1 + W.\frac{\partial s_1}{\partial W}}$

$\large{\frac{\partial s_1}{\partial W} = s_0}$

On substituting,

$\large{\frac{\partial s_1}{\partial W} = s_0}$

$\large{\frac{\partial s_2}{\partial W} = s_1 + W.s_0}$

$\large{\frac{\partial s_3}{\partial W} = s_2 + W.(s_1 + W.s_0) = s_2 + W.s_1 + W.W.s_0}$

$\large{\frac{\partial s_4}{\partial W} = s_3 + W.(s_2 + W.s_1 + W.W.s_0) = s_3 + W.s_2 + W.W.s_1 + W.W.W.s_0}$
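
The recursion above can be run forward alongside the states. This is a small sketch assuming a scalar hidden state and made-up values for `w`, `u`, `s0`, and the inputs; it compares the recursive result with the expanded sum.

```python
w, u, s0 = 0.7, -0.4, 0.3            # assumed scalar weights and start state
xs = [1.0, -0.5, 0.25, 2.0]          # x_1 .. x_4

states = [s0]
grads = [0.0]                        # ds_0/dw = 0: s_0 does not depend on w
for x in xs:
    # Product rule on w * s_{t-1}: ds_t/dw = s_{t-1} + w * ds_{t-1}/dw
    grads.append(states[-1] + w * grads[-1])
    states.append(w * states[-1] + u * x)

closed_form = (states[3] + w * states[2]
               + w**2 * states[1] + w**3 * states[0])
print(grads[4], closed_form)
```

Both numbers agree up to floating-point rounding, which is the substitution step done by the computer instead of by hand.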



Multiplying by $\large{\frac{\partial E_4}{\partial s_4}} = (\hat{y_4} - y_4).V$ and simplifying,
$$\large{\frac{\partial E_4}{\partial W}= 
(\hat{y_4} - y_4).V.
\sum_{j=1}^4 W^{4-j}.s_{j-1}
}$$

- Let's derive $\large{\frac{\partial s_4}{\partial U}}$

$\large{\frac{\partial s_4}{\partial U} = x_4 + W.\frac{\partial s_3}{\partial U}}$

$\large{\frac{\partial s_3}{\partial U} = x_3 + W.\frac{\partial s_2}{\partial U}}$

$\large{\frac{\partial s_2}{\partial U} = x_2 + W.\frac{\partial s_1}{\partial U}}$

$\large{\frac{\partial s_1}{\partial U} = x_1}$

On substituting,

$\large{\frac{\partial s_1}{\partial U} = x_1}$

$\large{\frac{\partial s_2}{\partial U} = x_2 + W.x_1}$

$\large{\frac{\partial s_3}{\partial U} = x_3 + W.(x_2 + W.x_1) = x_3 + W.x_2 + W.W.x_1}$

$\large{\frac{\partial s_4}{\partial U} = x_4 + W.(x_3 + W.x_2 + W.W.x_1) = x_4 + W.x_3 + W.W.x_2 + W.W.W.x_1}$
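
The same check works for $U$. The sketch below again assumes a scalar hidden state and illustrative values; it runs the recursion, compares it with the expanded sum, and confirms both against a finite-difference estimate of how $s_4$ changes with `u`.

```python
def final_state(w, u, xs, s0):
    """Return s_4 for the scalar RNN s_t = w * s_{t-1} + u * x_t."""
    s = s0
    for x in xs:
        s = w * s + u * x
    return s

w, u, s0 = 0.7, -0.4, 0.3            # assumed scalar weights and start state
xs = [1.0, -0.5, 0.25, 2.0]          # x_1 .. x_4

grad = 0.0                           # ds_0/du = 0
for x in xs:
    grad = x + w * grad              # ds_t/du = x_t + w * ds_{t-1}/du

closed_form = xs[3] + w * xs[2] + w**2 * xs[1] + w**3 * xs[0]

eps = 1e-6
numeric = (final_state(w, u + eps, xs, s0)
           - final_state(w, u - eps, xs, s0)) / (2 * eps)
print(grad, closed_form, numeric)
```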



Multiplying by $\large{\frac{\partial E_4}{\partial s_4}} = (\hat{y_4} - y_4).V$ and simplifying,
$$\large{\frac{\partial E_4}{\partial U}= 
(\hat{y_4} - y_4).V.
\sum_{j=1}^4 W^{4-j}.x_{j}
}$$