# Backpropagation

## Problem Premise

Given the following loss term:

$$L(f(x, W), y)$$

We want to compute the gradient with the chain rule:

$$\frac{dL}{dW} = \frac{dL}{df}\frac{df}{dW} \ \ \text{ or } \ \ \nabla_W L = \frac{dL}{df}\nabla_W{f}$$

if *W* is a vector of parameters.

**So... what's the problem?**

First of all: **Computing the gradient is very expensive in a neural network**.

Let $W$ be the input:

$$W \rightarrow A \rightarrow B \rightarrow C \rightarrow \cdots \rightarrow K \rightarrow L$$

In this case, the loss as a function of $W$ is

$$L(W) = L(K \cdots C (B(A(W))))$$

Per the chain rule, the Jacobian of $L$ with respect to $W$ is

$$J_{L}(W) = J_{L}(K) \cdots J_{C}(B)J_{B}(A)J_{A}(W)$$



So, **Backpropagation** evaluates this Jacobian product from the output of the neural network towards its input.
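As a minimal sketch of this output-to-input evaluation, the snippet below builds a tiny three-stage chain and multiplies the Jacobians starting from the loss. The chain itself (a linear map `M`, an elementwise `tanh`, and a sum-of-squares loss) and all function names are assumptions chosen for illustration, not something prescribed by these notes. Plain Python lists are used to keep it dependency-free, and a finite-difference estimate serves as a sanity check.

```python
import math

def matvec(M, v):
    """Matrix times column vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vecmat(r, M):
    """Row vector times matrix."""
    return [sum(r[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]

# Hypothetical 3x2 weight matrix for the first stage.
M = [[1.0, 2.0], [-0.5, 0.3], [0.7, -1.1]]

def forward(w):
    a = matvec(M, w)                # A(W): linear map, R^2 -> R^3
    b = [math.tanh(x) for x in a]   # B(A): elementwise tanh
    loss = sum(x * x for x in b)    # L(B): sum of squares, R^3 -> R
    return a, b, loss

def backward(w):
    _, b, _ = forward(w)
    grad = [2.0 * x for x in b]                          # J_L(B), a 1x3 row vector
    grad = [g * (1.0 - x * x) for g, x in zip(grad, b)]  # times J_B(A) = diag(1 - tanh^2)
    grad = vecmat(grad, M)                               # times J_A(W) = M
    return grad

def numerical_grad(w, eps=1e-6):
    """Central finite differences, used only to check the result."""
    out = []
    for i in range(len(w)):
        hi = list(w)
        lo = list(w)
        hi[i] += eps
        lo[i] -= eps
        out.append((forward(hi)[2] - forward(lo)[2]) / (2 * eps))
    return out

w0 = [0.4, -0.2]
analytic = backward(w0)
numeric = numerical_grad(w0)
print(analytic, numeric)
```

The intermediate `grad` stays a row vector at every step, so no full Jacobian-times-Jacobian product is ever formed.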


### Why do we use Backpropagation?

#### Efficiency
![](backprop1.png)

Since $L$ is a scalar, $J_L(K)$ is a row vector, so every multiplication, taken from left to right, again results in a row vector.
![](backprop2.png)

![](backprop3.png)

![](backprop4.png)

So, the computational cost of multiplying a row vector by an $n \times n$ matrix is $\mathcal{O}(n^2)$, while the cost of multiplying two $n \times n$ matrices is $\mathcal{O}(n^3)$.
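To make the gap concrete, the sketch below counts scalar multiplications for the two evaluation orders. It assumes naive (schoolbook) matrix multiplication and a chain of `k` square $n \times n$ Jacobians preceded by one $1 \times n$ row vector; the function names and the values of `n` and `k` are hypothetical.

```python
def matmul_cost(rows, inner, cols):
    """Scalar multiplications in a naive (rows x inner) @ (inner x cols) product."""
    return rows * inner * cols

def output_first_cost(n, k):
    # Start from the 1 x n row vector and multiply by each n x n Jacobian in turn.
    return sum(matmul_cost(1, n, n) for _ in range(k))

def input_first_cost(n, k):
    # Multiply the k Jacobians together first (k - 1 matrix-matrix products),
    # then apply the row vector once at the end.
    return (k - 1) * matmul_cost(n, n, n) + matmul_cost(1, n, n)

n, k = 1000, 10
print(output_first_cost(n, k))  # 10000000
print(input_first_cost(n, k))   # 9001000000
```

Under these assumptions, starting from the output side is about 900 times cheaper for the same result.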


#### Common Subexpressions

![](backprop5.png)

Note that values such as $J_C(B)$ are also required in order to perform the gradient update. Computing such a value before $J_B(A)$ and $J_A(W)$ and saving it will reduce both memory and computation time.

### Common Backpropagation Recipes

#### Cross Entropy Loss

Cross Entropy Loss:

$$L = -\log s_y \text{ for } y \in \{1, \cdots, n\} $$  

Then

$$J_{L}(s)_i = \nabla_s L_i^T =\begin{cases}
    -\frac{1}{s_i},& \text{for } i = y\\
    0,              & \text{otherwise}
\end{cases} $$

Equivalently, if the label is written as a one-hot vector with entries $y_i$:

$$J_{L}(s)_i = \nabla_s L_i^T = -\frac{y_i}{s_i} $$
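A minimal sketch of this recipe follows. It assumes `s` is already a vector of class probabilities and that classes are 0-indexed (the formulas above count from 1); the function names are illustrative.

```python
import math

def cross_entropy(s, y):
    """L = -log s_y, where s is a probability vector and y the true class index."""
    return -math.log(s[y])

def cross_entropy_grad(s, y):
    """Row vector J_L(s): -1/s_y at position y, 0 elsewhere."""
    return [-1.0 / s[i] if i == y else 0.0 for i in range(len(s))]

s = [0.2, 0.5, 0.3]
y = 1
grad = cross_entropy_grad(s, y)
print(grad)  # [0.0, -2.0, 0.0]
```

Only the entry of the true class is nonzero, which is what the one-hot form expresses.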

#### Multiclass SVM Loss

Multiclass SVM Loss:

$$L = \sum_{i \neq y} \max(0, 1-s_y + s_i)$$

Then

$$J_{L}(s)_i = \nabla_s L_i^T =\begin{cases}
    1, & \text{if } s_y - s_i < 1 \text{ and } i \neq y\\
    -\left|\{j \neq y : s_y - s_j < 1\}\right|, & \text{if } i = y\\
    0, & \text{otherwise}
\end{cases} $$
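The sketch below implements the hinge-loss gradient under two stated assumptions: the sum runs over the incorrect classes only ($i \neq y$), and classes are 0-indexed. It also fills in the entry for the correct class, obtained by differentiating with respect to $s_y$: each violated margin contributes $-1$ to it. Function names are illustrative.

```python
def svm_loss(s, y):
    """Multiclass hinge loss with margin 1, summed over incorrect classes."""
    return sum(max(0.0, 1.0 - s[y] + s[i]) for i in range(len(s)) if i != y)

def svm_grad(s, y):
    """Row vector J_L(s) of the hinge loss."""
    grad = [0.0] * len(s)
    for i in range(len(s)):
        if i != y and s[y] - s[i] < 1.0:
            grad[i] = 1.0     # violated margin pushes s_i down
            grad[y] -= 1.0    # and pushes s_y up
    return grad

s = [3.0, 1.0, 2.5]
y = 0
print(svm_loss(s, y))  # 0.5
print(svm_grad(s, y))  # [-1.0, 0.0, 1.0]
```

Here only class 2 violates the margin, so it receives $+1$ and the correct class receives $-1$.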

#### Multiclass Logistic Function

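For the multiclass logistic function (softmax),

$$y_i = \frac{\exp s_i}{\sum_{j=1}^n \exp s_j} = \frac{f}{g}$$

where $f = \exp s_i$ and $g = \sum_{j=1}^n \exp s_j$. Here $y_i$ denotes the softmax output, not the label. Applying the quotient rule gives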

$$\frac{\partial y_i}{\partial s_j} = y_i\delta_{ij} - y_iy_j$$

where $\delta_{ij}$ is the Kronecker delta:

$$\delta_{ij} = \begin{cases}
    1, & \text{if } i = j\\
    0, & \text{otherwise}
\end{cases}$$
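The softmax Jacobian can be sketched directly from this formula. The max-subtraction inside `softmax` is an added numerical-stability step that does not change the result; it and the function names are assumptions of this example rather than part of the derivation.

```python
import math

def softmax(s):
    m = max(s)  # shift by the max to avoid overflow in exp; result is unchanged
    e = [math.exp(x - m) for x in s]
    total = sum(e)
    return [x / total for x in e]

def softmax_jacobian(s):
    """Matrix with entries dy_i/ds_j = y_i * (delta_ij - y_j)."""
    y = softmax(s)
    n = len(s)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(n)]
            for i in range(n)]

scores = [1.0, 2.0, 0.5]
J = softmax_jacobian(scores)
for row in J:
    print(row)
```

Each row of `J` sums to zero (up to rounding), because the outputs always sum to one and so cannot all increase together.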

#### Matrix Multiply (FC) Layer

For an FC layer, the output is computed as follows:

$$y_i = \sum_{j=1}^n W_{ij}s_j$$

Then

$$\frac{\partial y_i}{\partial s_j} = W_{ij}$$

$$J_L(s) = J_L(y)W$$

$$\frac{\partial L}{\partial W_{ik}} = J_L(y)_i \, s_k \quad \text{i.e.} \quad \nabla_W L = J_L(y)^Ts^T$$
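A minimal sketch of the FC-layer backward pass, assuming the upstream gradient $J_L(y)$ arrives as a row vector `grad_y`. The function names and the toy loss (the sum of the outputs, so that `grad_y` is all ones) are assumptions made for illustration.

```python
def fc_forward(W, s):
    """y_i = sum_j W_ij * s_j"""
    return [sum(W[i][j] * s[j] for j in range(len(s))) for i in range(len(W))]

def fc_backward(W, s, grad_y):
    """Given the row vector grad_y = J_L(y), return J_L(s) and dL/dW."""
    m, n = len(W), len(s)
    # J_L(s) = J_L(y) W: row vector times matrix.
    grad_s = [sum(grad_y[i] * W[i][j] for i in range(m)) for j in range(n)]
    # dL/dW_ik = J_L(y)_i * s_k: outer product of grad_y and s.
    grad_W = [[grad_y[i] * s[k] for k in range(n)] for i in range(m)]
    return grad_s, grad_W

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
s = [1.0, -1.0]
grad_y = [1.0, 1.0, 1.0]  # gradient of the toy loss L = y_1 + y_2 + y_3

grad_s, grad_W = fc_backward(W, s, grad_y)
print(grad_s)  # [9.0, 12.0]
print(grad_W)  # [[1.0, -1.0], [1.0, -1.0], [1.0, -1.0]]
```

The weight gradient has the same shape as `W`, which is what the outer product $J_L(y)^T s^T$ expresses.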