# Backpropagation -- the Workhorse of ML

### On the front
I took this note while studying the backpropagation algorithm from [ad-n-nn.pdf](./references/ad-n-nn.pdf). I had been stuck on which layout to use for the chain rule in backprop, and that kept me from seeing the general result that automatic differentiation gives. The note provided a new perspective and cleared up my confusion.    
This note is also a follow-up to another note of mine, [backprop_in_rnn](./backprop_in_rnn.ipynb), and tries to further justify the gradient calculation in Siraj's implementation of an RNN.

### As usual, let's take a concrete example:  
Consider `dWhy += np.dot(dy, hs[t].T)`. `dWhy` is the gradient w.r.t. `Why`, or $\frac{\partial l_t}{\partial W_{hy}}$.  

For a concrete example, let:  
`Why` = $W_{hy}$ = $\begin{bmatrix}
       w_{11} & w_{12} & w_{13}           \\[0.3em]
       w_{21} & w_{22} & w_{23}
     \end{bmatrix}$
  
 `hs[t]` = $H_t$ = $\begin{bmatrix}
       h_{1}\\[0.3em]
       h_{2}\\[0.3em]
       h_{3}
     \end{bmatrix}$
  
  `by` = $\begin{bmatrix}
       b_{1}\\[0.3em]
       b_{2}
     \end{bmatrix}$

Now, we have `ys[t] = np.dot(Why, hs[t]) + by`:  
`ys[t]` = $Y_t$ = $\begin{bmatrix}
      w_{11}h_1 + w_{12}h_2 + w_{13}h_3 + b_1\\[0.3em]
      w_{21}h_1 + w_{22}h_2 + w_{23}h_3 + b_2
     \end{bmatrix}
        = \begin{bmatrix}
       y_{1}\\[0.3em]
       y_{2}
     \end{bmatrix}$
     
So, `ps[t]` = $\begin{bmatrix}
       \frac{e^{y_1}}{e^{y_1} + e^{y_2}} \\[0.3em]
       \frac{e^{y_2}}{e^{y_1} + e^{y_2}}
     \end{bmatrix}
      = \begin{bmatrix}
       p_{1}\\[0.3em]
       p_{2}
     \end{bmatrix}$
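
The forward pass so far can be sketched in NumPy. This is only an illustration: the numbers in `Why`, `h`, and `by` are made up, and the short names `h`, `y`, `p` stand in for `hs[t]`, `ys[t]`, `ps[t]`.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
Why = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]])      # W_hy, shape (2, 3)
h = np.array([[1.0], [2.0], [3.0]])    # hs[t], shape (3, 1)
by = np.array([[0.5], [-0.5]])         # by, shape (2, 1)

y = np.dot(Why, h) + by                # ys[t], shape (2, 1)
p = np.exp(y) / np.sum(np.exp(y))      # ps[t], softmax over the two outputs

print(y.ravel())   # the two logits y_1, y_2
print(p.ravel())   # the two probabilities p_1, p_2, which sum to 1
```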

**Loss function:**  
In the code, the loss at the $t^{th}$ iteration is defined by `-np.log(ps[t][targets[t],0])`, i.e. the negative log of the probability that the model assigns to the right answer.


WLOG, let's just assume that at this iteration, the second letter is the target (or the ground truth). So, set `loss` = $l_t$ = $-\log(p_2)$.
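
As a small sketch of this loss, assuming made-up probabilities and a 0-based `target` index of 1 for "the second letter":

```python
import numpy as np

p = np.array([[0.3], [0.7]])   # hypothetical ps[t], shape (2, 1)
target = 1                     # stands in for targets[t]; index 1 is the second letter

# Same indexing pattern as -np.log(ps[t][targets[t], 0]) in the code.
loss = -np.log(p[target, 0])   # equals -log(p_2)

# The loss shrinks toward 0 as p_2 approaches 1.
better_p = np.array([[0.1], [0.9]])
better_loss = -np.log(better_p[target, 0])
print(loss, better_loss)
```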

Inspired by *ad-n-nn.pdf*, I draw the following expression graph:
![](./images/exp-graph.png)

*Sidenote: this [article](https://timvieira.github.io/blog/post/2017/08/18/backprop-is-not-just-the-chain-rule/) elaborates the connection between backprop and the method of Lagrange multipliers.*

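The backprop algorithm is as follows: 
![screenshot from the note](./images/bp.png)
Writing the forward pass as an expression graph shows the dependencies among the inputs, intermediate variables, and output. Here $\textbf{z} = W_{hy}H_t$ (so $z_i = \textbf{w}_i \textbf{h}$, with $\textbf{w}_i$ the $i$-th row of $W_{hy}$), $\textbf{y} = \textbf{z} + \textbf{b}$, and $L$ stands for $l_t$. This helps us look at backprop more closely:  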
$$ 
\begin{align*}
\frac{\partial L}{\partial p_1} = 0 \quad &\frac{\partial L}{\partial p_2} = -\frac{1}{p_2} \\
\\
\frac{\partial L}{\partial y_1} = \frac{\partial L}{\partial p_1}\frac{\partial p_1}{\partial y_1} + \frac{\partial L}{\partial p_2}\frac{\partial p_2}{\partial y_1} = p_1 \quad &\frac{\partial L}{\partial y_2} = \frac{\partial L}{\partial p_1}\frac{\partial p_1}{\partial y_2} + \frac{\partial L}{\partial p_2}\frac{\partial p_2}{\partial y_2} = p_2 - 1 \\
\\
\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial z_1} = p_1 \quad &\frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial z_2} = p_2 - 1\\
\\
\frac{\partial L}{\partial \textbf{w}_1} = \frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial \textbf{w}_1}=p_1[\begin{smallmatrix} h_1 & h_2 & h_3 \end{smallmatrix}] \quad &\frac{\partial L}{\partial \textbf{w}_2} = \frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial \textbf{w}_2} = (p_2 - 1)[\begin{smallmatrix} h_1 & h_2 & h_3 \end{smallmatrix}]
\end{align*}
$$
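
The derivation above can be checked numerically. The sketch below uses made-up values, and `loss_fn` is a hypothetical helper written for this note (it is not part of the original code). It compares the analytic gradient $\big[ \begin{smallmatrix} p_1 \\ p_2 - 1\end{smallmatrix} \big] [\begin{smallmatrix} h_1 & h_2 & h_3 \end{smallmatrix}]$ against central finite differences.

```python
import numpy as np

def loss_fn(Why, h, by, target):
    """Forward pass: y = Why h + by, p = softmax(y), L = -log p[target]."""
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    return -np.log(p[target, 0]), p

# Hypothetical values, chosen only for illustration.
Why = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]])
h = np.array([[1.0], [2.0], [3.0]])
by = np.array([[0.5], [-0.5]])
target = 1  # the second letter is the ground truth

loss, p = loss_fn(Why, h, by, target)

# Analytic gradients from the derivation.
dy = np.copy(p)
dy[target] -= 1           # [p_1, p_2 - 1]
dWhy = np.dot(dy, h.T)    # shape (2, 3)

# Central finite differences, one weight at a time.
eps = 1e-6
dWhy_num = np.zeros_like(Why)
for i in range(Why.shape[0]):
    for j in range(Why.shape[1]):
        Wp = Why.copy(); Wp[i, j] += eps
        Wm = Why.copy(); Wm[i, j] -= eps
        dWhy_num[i, j] = (loss_fn(Wp, h, by, target)[0]
                          - loss_fn(Wm, h, by, target)[0]) / (2 * eps)

print(np.max(np.abs(dWhy - dWhy_num)))  # should be tiny
```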

----------------------------------------------------------------------
*Sidenote: Given that* $ z_1 = \textbf{w}_1 \textbf{h} $, *show* $\frac{\partial z_1}{\partial \textbf{w}_1} = [\begin{smallmatrix} h_1 & h_2 & h_3 \end{smallmatrix}]$:  
$\begin{align*}
& \text{Let } \textbf{v} = \textbf{w}_1^\top \odot \textbf{h} = \bigg[ \begin{smallmatrix} w_{11}h_1 \\ w_{12}h_2 \\ w_{13}h_3 \end{smallmatrix} \bigg] \text{ (the elementwise product)}, \text{ then } z_1 = [\begin{smallmatrix} 1 & 1 & 1 \end{smallmatrix}] \textbf{v} \\
\\
& \text{Now, } \frac{\partial z_1}{\partial \textbf{v}} = [\begin{smallmatrix} 1 & 1 & 1 \end{smallmatrix}] \text{, and } \frac{\partial \textbf{v}}{\partial \textbf{w}_1} = \bigg[ \begin{smallmatrix} h_1 & 0 & 0 \\ 0 & h_2 & 0 \\ 0 & 0 & h_3 \end{smallmatrix} \bigg] \\
\\
&\text{By chain rule, }\frac{\partial z_1}{\partial \textbf{w}_1} = \frac{\partial z_1}{\partial \textbf{v}}\frac{\partial \textbf{v}}{\partial \textbf{w}_1} = [\begin{smallmatrix} h_1 & h_2 & h_3 \end{smallmatrix}]
\end{align*}
$

---
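
The two Jacobians in the sidenote are easy to write out in NumPy. This is a sketch with made-up numbers for `w1` and `h`.

```python
import numpy as np

w1 = np.array([0.1, 0.2, 0.3])   # hypothetical first row of Why
h = np.array([1.0, 2.0, 3.0])    # hypothetical hidden state

v = w1 * h                        # elementwise product [w11*h1, w12*h2, w13*h3]
z1 = np.ones(3) @ v               # summing v gives the dot product w1 . h

dz1_dv = np.ones((1, 3))          # Jacobian of z1 w.r.t. v
dv_dw1 = np.diag(h)               # Jacobian of v w.r.t. w1 (diagonal)
dz1_dw1 = dz1_dv @ dv_dw1         # chain rule, shape (1, 3)

print(dz1_dw1)                    # the row vector [h1 h2 h3]
```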

Reorder what we got: $\frac{\partial L}{\partial W_{hy}} =  \bigg[ \begin{smallmatrix} \frac{\partial L}{\partial \textbf{w}_1} \\ \frac{\partial L}{\partial \textbf{w}_2}\end{smallmatrix} \bigg] = \big[ \begin{smallmatrix} p_1[\begin{smallmatrix} h_1 & h_2 & h_3 \end{smallmatrix}]
\\(p_2 - 1)[\begin{smallmatrix} h_1 & h_2 & h_3 \end{smallmatrix}]\end{smallmatrix} \big] = \big[ \begin{smallmatrix} p_1 \\ p_2 - 1\end{smallmatrix} \big]$ $[ \begin{smallmatrix} h_1 & h_2 & h_3 \end{smallmatrix} ]$.

Again, we get the same result as the code `dWhy += np.dot(dy, hs[t].T)`, with `dy` corresponding to $\big[ \begin{smallmatrix} p_1 \\ p_2 - 1\end{smallmatrix} \big]$; we have just looked at it from a different angle.

### Punchline
To calculate $\frac{\partial L}{\partial W_{hy}}$, the backprop algorithm only needs to look one step forward to get the gradient of $W_{hy}$'s child, which is $\big[ \begin{smallmatrix} z_1 \\ z_2\end{smallmatrix} \big]$. The great thing is that, by the time backprop reaches $W_{hy}$, the gradient of $\big[ \begin{smallmatrix} z_1 \\ z_2\end{smallmatrix} \big]$ has already been computed. Therefore, we only need to calculate $\frac{\partial z_1}{\partial \textbf{w}_1}$ and $\frac{\partial z_2}{\partial \textbf{w}_2}$. What's even better is that $\frac{\partial z_1}{\partial \textbf{w}_1}$ and $\frac{\partial z_2}{\partial \textbf{w}_2}$ are identical, both equal to $[\begin{smallmatrix} h_1 & h_2 & h_3 \end{smallmatrix}]$ (this holds in general for a matrix-vector product, since every row of $W_{hy}$ is multiplied by the same $\textbf{h}$), so we only need one of them to compute $\frac{\partial L}{\partial W_{hy}}$. **Backprop has the same order of time complexity as forward propagation.**

To see that this result generalizes to the weights of any fully connected layer, note that besides `dWhy += np.dot(dy, hs[t].T)`, we also have `dWxh += np.dot(dhraw, xs[t].T)` and `dWhh += np.dot(dhraw, hs[t-1].T)`, where `dhraw` is the gradient of the child of both `Wxh` and `Whh`.
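
The same pattern can be checked for `Wxh` and `Whh`. The sketch below makes several assumptions that this note does not spell out: the hidden state follows the usual recurrence `h = tanh(Wxh x + Whh h_prev + bh)`, the shapes are arbitrary, and the upstream gradient `dh` is a made-up vector `c` (so the stand-in loss is `sum(c * h)`). `scalar_loss` and `numeric_grad` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)
Wxh = rng.normal(size=(3, 4))
Whh = rng.normal(size=(3, 3))
bh = rng.normal(size=(3, 1))
x = rng.normal(size=(4, 1))        # stands in for xs[t]
h_prev = rng.normal(size=(3, 1))   # stands in for hs[t-1]
c = rng.normal(size=(3, 1))        # made-up upstream gradient dL/dh

def scalar_loss(Wxh, Whh):
    """Stand-in loss whose gradient w.r.t. h is exactly c."""
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h_prev) + bh)
    return float(np.sum(c * h))

# Analytic gradients, same shape of expression as in the RNN code.
h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h_prev) + bh)
dh = c
dhraw = (1 - h * h) * dh           # backprop through tanh
dWxh = np.dot(dhraw, x.T)
dWhh = np.dot(dhraw, h_prev.T)

def numeric_grad(f, W, eps=1e-6):
    """Central finite differences of f with respect to each entry of W."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp = W.copy(); Wp[idx] += eps
        Wm = W.copy(); Wm[idx] -= eps
        g[idx] = (f(Wp) - f(Wm)) / (2 * eps)
    return g

dWxh_num = numeric_grad(lambda W: scalar_loss(W, Whh), Wxh)
dWhh_num = numeric_grad(lambda W: scalar_loss(Wxh, W), Whh)
print(np.max(np.abs(dWxh - dWxh_num)), np.max(np.abs(dWhh - dWhh_num)))
```

In both cases the weight gradient is the child's gradient times the transposed input of that weight matrix, exactly as for `Why`.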

It may seem that `dhnext = np.dot(Whh.T, dhraw)` and `dh = np.dot(Why.T, dy) + dhnext` do not fit this pattern. However, they are the same under the hood. Take this example to see why they look different:

$Z_t = W_{hy}H_t$ = $\begin{bmatrix}
       w_{11} & w_{12} & w_{13}           \\[0.3em]
       w_{21} & w_{22} & w_{23}
     \end{bmatrix}$ $\begin{bmatrix}
       h_{1}\\[0.3em]
       h_{2}\\[0.3em]
       h_{3}
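     \end{bmatrix}$. As you may notice, $W_{hy}$ is the left factor and $H_t$ is the right factor. So the gradient w.r.t. $W_{hy}$ multiplies the upstream gradient by $H_t^T$ on the right (`np.dot(dy, hs[t].T)`), while the gradient w.r.t. $H_t$ multiplies it by $W_{hy}^T$ on the left (`np.dot(Why.T, dy)`).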
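
To make the left/right asymmetry concrete, here is a sketch that computes both gradients for the same product and checks the one w.r.t. the hidden state by finite differences. The values are made up, `loss_of_h` is a hypothetical helper, and only the `Why` path is covered (the `dhnext` term from the next time step is left out).

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
Why = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]])
h = np.array([[1.0], [2.0], [3.0]])
by = np.array([[0.5], [-0.5]])
target = 1

def loss_of_h(h):
    """L = -log softmax(Why h + by)[target], as a function of h only."""
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    return -np.log(p[target, 0])

y = np.dot(Why, h) + by
p = np.exp(y) / np.sum(np.exp(y))
dy = np.copy(p)
dy[target] -= 1

dWhy = np.dot(dy, h.T)     # Why is the left factor: dy times h transposed, shape (2, 3)
dh = np.dot(Why.T, dy)     # h is the right factor: Why transposed times dy, shape (3, 1)

# Central finite differences w.r.t. each entry of h.
eps = 1e-6
dh_num = np.zeros_like(h)
for i in range(h.shape[0]):
    hp = h.copy(); hp[i, 0] += eps
    hm = h.copy(); hm[i, 0] -= eps
    dh_num[i, 0] = (loss_of_h(hp) - loss_of_h(hm)) / (2 * eps)

print(np.max(np.abs(dh - dh_num)))  # should be tiny
```

Each gradient has the same shape as the quantity it belongs to, which is a quick way to decide where the transpose goes.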