# Backprop guide: backpropagation calculus and algebra

These notes are meant as a guide to doing backpropagation calculus. I'll include copious comments to explain what I'm doing.

# Further reading

* [Peter's Notes](http://peterroelants.github.io/posts/neural_network_implementation_part01/) goes over the derivatives

# Backpropagation Calculus

## Simple stuff

### Softmax and sigmoid-NLL

Scroll down to the bottom for these.

The softmax and the sigmoid are the most popular output activation layers for classification. The softmax selects one category out of many by returning class probabilities that sum to 1, and the sigmoid tests each category separately, returning a value between 0 and 1 for each.

When paired with their matching cost functions, both of these give an output error gradient that simplifies to $a_{out} - y$, but the algebra is more complicated. We'll start with a sigmoid using the quadratic cost function (a.k.a. mean squared error).

### Output layer with MSE

#### Multi-purpose tool: the chain rule

Doing the full backprop requires repeated application of the chain rule of differentiation, so that

$$\frac{\partial J(\theta)}{\partial W_{out}} = \frac{\partial J(\theta)}{\partial a_{out}} \frac{\partial a_{out}}{\partial z_{out}} \frac{\partial z_{out}}{\partial W_{out}}$$

is the full gradient for $W_{out}$. What you're seeing above is the chain rule of differentiation at work. Do you see how things repeat in the numerators and denominators? The chain rule allows us to break apart a problem into more manageable pieces.

The chain rule is

$$ \frac{d}{dx} f(g(x)) = f^\prime(g(x)) \, g^\prime(x)$$

meaning that you use it to drill down past multiple layers of function to get down to your $x$.
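As a quick sanity check, here is a minimal sketch that compares the chain rule against a finite-difference estimate. The functions $f(u) = e^u$ and $g(x) = x^2$ are arbitrary choices made for illustration; they are not part of the network.

```python
import math

def g(x):
    # Inner function, chosen only for illustration.
    return x ** 2

def g_prime(x):
    return 2 * x

def f(u):
    # Outer function, chosen only for illustration.
    return math.exp(u)

def f_prime(u):
    return math.exp(u)

def chain_rule_derivative(x):
    # d/dx f(g(x)) = f'(g(x)) * g'(x)
    return f_prime(g(x)) * g_prime(x)

def numerical_derivative(func, x, eps=1e-6):
    # Central difference approximation of the derivative.
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x = 0.7
analytic = chain_rule_derivative(x)
numeric = numerical_derivative(lambda t: f(g(t)), x)
print(analytic, numeric)
```

The two printed values should agree to several decimal places.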

#### MSE cost function

As in the Andrew Ng course, we'll denote the cost function by $J(\theta)$:

$$J(\theta) = \frac{1}{2} (a_{out} - y)^2$$

The cost function above is the quadratic cost function, which is easier to work with but unfortunately less useful in practice.

To start the backprop, we need to differentiate cost with respect to $a_{out}$, the model output. Using the power rule of differentiation, this is easy:

$$\frac{\partial J(\theta)}{\partial a_{out}} = \frac{\partial}{\partial a_{out}} \frac{1}{2}(a_{out} - y)^2 = a_{out} - y$$

This is the first step of the backprop, and so everything following will be multiplied by it.
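As a small worked example (the numbers are made up for illustration), take a single output $a_{out} = 0.8$ with target $y = 1$:

$$J(\theta) = \frac{1}{2}(0.8 - 1)^2 = 0.02, \qquad \frac{\partial J(\theta)}{\partial a_{out}} = 0.8 - 1 = -0.2$$

The gradient is negative, so nudging $a_{out}$ up toward the target lowers the cost.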

#### Sigmoid derivative

Next is the derivative of the sigmoid function's output, $a_{out} = \sigma(z_{out}) = \frac{1}{1+e^{-z_{out}}}$, with respect to its input, $z_{out}$.

$$\frac{\partial a_{out}}{\partial z_{out}} = \frac{\partial \sigma(z_{out})}{\partial z_{out}}$$

We start by applying the reciprocal rule for functions of the form $\frac{1}{f(x)}$

$$\frac{d}{dx} \frac{1}{f(x)} = - \frac{f^\prime(x)}{f(x)^2}$$

so that, with $f(z) = 1+e^{-z}$ and $f^\prime(z) = -e^{-z}$ (writing $z$ for $z_{out}$),

$$\frac{\partial \sigma(z_{out})}{\partial z_{out}} = \frac{e^{-z}}{(1+e^{-z})^2}$$

We then use a trick where we add 1 to the numerator while also subtracting 1 from it. This has no effect on the value of the fraction but manages to help us in our quest.

$$\frac{1+e^{-z}-1}{(1+e^{-z})^2} = \frac{1+e^{-z}}{(1+e^{-z})^2} + \frac{-1}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} - \left(\frac{1}{1+e^{-z}}\right)^2$$

We've managed to turn the derivative into the sigmoid function minus its squared value. With this we can factor out $\frac{1}{1+e^{-z}}$ and obtain the sigmoid derivative.

$$\frac{1}{1+e^{-z}}\left(1 - \frac{1}{1+e^{-z}}\right) = \frac{\partial \sigma(z_{out})}{\partial z_{out}} = \sigma(z_{out})(1-\sigma(z_{out}))$$
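The algebra above can be checked numerically. This is a minimal sketch: the helper names are my own, and the test points are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Final form: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def sigmoid_prime_raw(z):
    # Form obtained straight from the reciprocal rule.
    return math.exp(-z) / (1.0 + math.exp(-z)) ** 2

def central_difference(func, z, eps=1e-6):
    # Finite-difference estimate of the derivative.
    return (func(z + eps) - func(z - eps)) / (2 * eps)

for z in (-2.0, 0.0, 1.5):
    print(z, sigmoid_prime(z), sigmoid_prime_raw(z), central_difference(sigmoid, z))
```

All three columns should match at each point, and at $z = 0$ the derivative is $0.5 \cdot 0.5 = 0.25$.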

#### Weights and biases

The next step is differentiating $z_{out}$ with respect to $W_{out}$ and $B_{out}$. For the weights $W_{out}$

$$\frac{\partial z_{out}}{\partial W_{out}} = \frac{\partial}{\partial W_{out}} \left(a_2 W_{out} + B_{out}\right)$$

and the bias terms $B_{out}$

$$\frac{\partial z_{out}}{\partial B_{out}} = \frac{\partial}{\partial B_{out}} \left(a_2 W_{out} + B_{out}\right)$$

These become

$$\frac{\partial z_{out}}{\partial W_{out}} = a_2^T$$

and

$$\frac{\partial z_{out}}{\partial B_{out}} = 1$$

where $1$ here is a row vector of size $1 \times batch_{size}$. Likewise, $a_2^T$ will be of size $Layer2_{size} \times batch_{size}$.

Why? The "easy" reason for these is that the matrix algebra demands it. Matrix multiplication requires that the matrices being multiplied be of dimensions $n \times m$ and $m \times p$. You have to transpose these to get the right dimensions for the gradients.

In the case of the bias with the row vector of ones, this row vector has the effect of summing the bias gradients across batches, returning a gradient the exact same size as $B_{out}$. In other words, 

$$1 \times batch_{size} \cdot batch_{size} \times OutputLayer_{size} = 1 \times OutputLayer_{size}$$

and in the case of the weights, 

$$Layer2_{size} \times batch_{size} \cdot batch_{size} \times OutputLayer_{size} = Layer2_{size} \times OutputLayer_{size}$$
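A numerical check of these shapes, as a minimal sketch in plain Python. The toy data, the helper names, and the choice to sum the quadratic cost over the batch are my assumptions for illustration. The transposed products are compared against finite-difference gradients.

```python
import math

def matmul(X, Y):
    """Multiply an n x m matrix by an m x p matrix (lists of rows)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(a2, W, B):
    """a_out = sigmoid(a2 W + B), with the bias row added to every sample."""
    z = matmul(a2, W)
    return [[sigmoid(z[n][j] + B[0][j]) for j in range(len(B[0]))]
            for n in range(len(z))]

def cost(a2, W, B, y):
    """Quadratic cost summed over the batch and the output units (assumption)."""
    a_out = forward(a2, W, B)
    return sum(0.5 * (a_out[n][j] - y[n][j]) ** 2
               for n in range(len(y)) for j in range(len(y[0])))

# Hypothetical toy data: batch_size = 3, Layer2_size = 2, OutputLayer_size = 2.
a2 = [[0.1, 0.9], [0.4, 0.3], [0.8, 0.5]]
W = [[0.2, -0.4], [0.7, 0.1]]
B = [[0.05, -0.02]]
y = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

a_out = forward(a2, W, B)
# delta = dJ/da_out * da_out/dz_out, elementwise: batch_size x OutputLayer_size
delta = [[(a_out[n][j] - y[n][j]) * a_out[n][j] * (1.0 - a_out[n][j])
          for j in range(2)] for n in range(3)]

grad_W = matmul(transpose(a2), delta)      # (2 x 3) . (3 x 2) = 2 x 2
grad_B = matmul([[1.0, 1.0, 1.0]], delta)  # (1 x 3) . (3 x 2) = 1 x 2

def numerical_grad_W(i, j, eps=1e-6):
    # Central difference on a single weight entry.
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[i][j] += eps
    W_minus[i][j] -= eps
    return (cost(a2, W_plus, B, y) - cost(a2, W_minus, B, y)) / (2 * eps)

def numerical_grad_B(j, eps=1e-6):
    # Central difference on a single bias entry.
    B_plus = [B[0][:]]
    B_minus = [B[0][:]]
    B_plus[0][j] += eps
    B_minus[0][j] -= eps
    return (cost(a2, W, B_plus, y) - cost(a2, W, B_minus, y)) / (2 * eps)

print(grad_W, grad_B)
```

`grad_W` comes out the same size as `W` and `grad_B` the same size as `B`, and each entry should agree with `numerical_grad_W(i, j)` and `numerical_grad_B(j)`.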

Unfortunately, I do not know the mathematical reason for doing these transposes. I have searched online and in books but cannot find the reason. For now it will have to remain an unsolved mystery.

### Sigmoid output with negative log likelihood loss

We're told that the NLL cost function combined with a sigmoid output activation function simplifies to $a_{out} - y$ during backpropagation. How is this? Here's the demonstration.

Andrew Ng tells us that the NLL cost function is

$$NLL(a_{out})= -\left(y\log(a_{out}) + (1-y)\log(1-a_{out})\right)$$

and differentiating it with respect to $z_{out}$ gives

$$\frac{\partial J(\theta)}{\partial z_{out}} = \frac{\partial J(\theta)}{\partial a_{out}} \frac{\partial a_{out}}{\partial z_{out}}$$

where $\frac{\partial a_{out}}{\partial z_{out}}$ is the sigmoid derivative.

Let's approach this in pieces. Here is the derivative of the NLL portion. We apply the differentiation rule $\frac{d}{dx} \log(x) = \frac{1}{x}$, remembering that the inner derivative of $1-a_{out}$ contributes a factor of $-1$.

$$\frac{\partial}{\partial a_{out}} \left(-[y\log(a_{out}) + (1-y)\log(1-a_{out})]\right) = \frac{1-y}{1-a_{out}} - \frac{y}{a_{out}}$$

We then multiply this with the sigmoid derivative:

$$\left(\frac{1-y}{1-a_{out}} - \frac{y}{a_{out}}\right) \cdot a_{out} \cdot (1-a_{out}) = (1-y)a_{out} - y(1-a_{out}) = a_{out} - ya_{out} - y + ya_{out} = a_{out} - y$$

You can see that the two $y \cdot a_{out}$ terms cancel. We then get what we were looking for, $a_{out} - y$.
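Here is a minimal sketch that checks this result numerically. The function names and test values are mine. It compares the two-step product, the simplified $a_{out} - y$, and a finite-difference estimate of the NLL gradient with respect to $z_{out}$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(z, y):
    # Negative log likelihood of a sigmoid output for a single example.
    a = sigmoid(z)
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

def nll_grad_two_step(z, y):
    # dJ/da_out multiplied by da_out/dz_out, before any simplification.
    a = sigmoid(z)
    dJ_da = (1.0 - y) / (1.0 - a) - y / a
    da_dz = a * (1.0 - a)
    return dJ_da * da_dz

def nll_grad_simplified(z, y):
    # The simplified result: a_out - y
    return sigmoid(z) - y

def nll_grad_numerical(z, y, eps=1e-6):
    return (nll(z + eps, y) - nll(z - eps, y)) / (2 * eps)

for z in (-1.2, 0.3, 2.0):
    for y in (0.0, 1.0):
        print(z, y, nll_grad_two_step(z, y), nll_grad_simplified(z, y),
              nll_grad_numerical(z, y))
```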

### Softmax output with cross entropy loss

#### Softmax derivative

The softmax function is defined as 

$$softmax = s(z_i) = \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}$$

Looking at this formula, we can imagine that the sum in the denominator is going to cause us trouble; however, it is crucial to the softmax's behaviour.

The softmax function is important because it handles multiple classes. It does this by returning a probability for each class being the true one (as in the case of a neural network making a prediction); in the limiting case, that is 1 for the true class and 0 for the others. It behaves this way because the denominator constrains each output to lie between 0 and 1 and constrains the sum of all outputs to exactly 1. Basically, it's exactly what you want for picking between multiple options.

All outputs of the softmax depend on one another: if the model gets evidence that one class is likelier to be true, the softmax function will reduce the probability of all other classes accordingly. In other words, it's a zero sum game.
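To see the zero-sum behaviour in numbers, here is a minimal sketch with made-up scores for three classes. It uses the plain formula from above with no numerical-stability tricks.

```python
import math

def softmax(z):
    """Naive softmax: exponentiate every input and divide by the sum."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes.
before = softmax([1.0, 2.0, 3.0])
# Raise only the third score: its probability goes up, the others go down.
after = softmax([1.0, 2.0, 4.0])

print(before, sum(before))
print(after, sum(after))
```

Both outputs sum to 1, and the first two probabilities shrink even though their own scores did not change.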

All of this means that we have to calculate a gradient that takes into account not only $z_i$'s effect on $s(z_i)$ but also $z_j$'s effect on $s(z_i)$!

Let's get started.

Going forward, I'm just going to write $\sum_{k=1}^K e^{z_k}$ as $\sum e^z$, so that what we have to find is

$$\frac{\partial s(z_i)}{\partial z_i} = \frac{\partial}{\partial z_i} \frac{e^{z_i}}{\sum e^z}$$

where

$$\sum e^z = e^{z_1} + e^{z_2} + e^{z_3} + \dots + e^{z_K}$$

is the sum of the exponentials of all the $z_k$.

To start, we're going to apply the quotient rule:

$$f^\prime(x) = \left(\frac{g(x)}{h(x)}\right)^\prime = \frac{g^\prime(x)h(x) - g(x)h^\prime(x)}{h(x)^2}$$

Exponentials ($e^x$) are easy to differentiate: basically, nothing happens to them. So the numerator $g(x) = e^{z_i}$ differentiates to $e^{z_i}$, and the denominator $h(x) = \sum e^z$ also differentiates to $e^{z_i}$, because $e^{z_i}$ is the only term of the sum that depends on $z_i$. Putting all of that together, we get

$$\frac{\partial}{\partial z_i} \frac{e^{z_i}}{\sum e^z} = \frac{e^{z_i} \cdot \sum e^z - e^{z_i} \cdot e^{z_i}}{(\sum e^z)^2}$$

If we do some factoring of the above, we get

$$\frac{e^{z_i} \cdot \sum e^z - e^{z_i} \cdot e^{z_i}}{(\sum e^z)^2} = \frac{e^{z_i}}{\sum e^z} \left(\frac{\sum e^z - e^{z_i}}{\sum e^z}\right) = \frac{e^{z_i}}{\sum e^z} \left(1 - \frac{e^{z_i}}{\sum e^z} \right) = s(z_i)(1-s(z_i))$$

which is very reminiscent of the sigmoid derivative. In fact, it has the same form. However, we now have to deal with $\frac{\partial s(z_i)}{\partial z_j}$, the derivative of $s(z_i)$ with respect to $z_j$, where $j \neq i$.

Let's start by again applying the quotient rule, except this time the numerator $g(x) = e^{z_i}$ does not depend on $z_j$, so it differentiates to 0. The derivative of $h(x)$ (the denominator) this time is $e^{z_j}$. We quickly arrive at what we need.

$$\frac{\partial}{\partial z_j} \frac{e^{z_i}}{\sum e^z} = \frac{0 \cdot \sum e^z - e^{z_i} e^{z_j}}{(\sum e^z)^2} = -\frac{e^{z_i} e^{z_j}}{(\sum e^z)^2} = -s(z_i)s(z_j)$$
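Both cases together form the softmax Jacobian. Below is a minimal sketch that builds it from the two formulas and compares it against finite differences. The helper names and input vector are my own choices for illustration.

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Entry [i][j] is ds(z_i)/dz_j, built from the two cases derived above."""
    s = softmax(z)
    K = len(z)
    return [[s[i] * (1.0 - s[i]) if i == j else -s[i] * s[j]
             for j in range(K)] for i in range(K)]

def numerical_jacobian(z, eps=1e-6):
    """Central-difference estimate of the same matrix."""
    K = len(z)
    jac = [[0.0] * K for _ in range(K)]
    for j in range(K):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        s_plus = softmax(z_plus)
        s_minus = softmax(z_minus)
        for i in range(K):
            jac[i][j] = (s_plus[i] - s_minus[i]) / (2 * eps)
    return jac

z = [1.0, 2.0, 0.5]  # hypothetical pre-activations
analytic = softmax_jacobian(z)
numeric = numerical_jacobian(z)
for row in analytic:
    print(row)
```

Each row of the Jacobian sums to zero, which is the zero-sum behaviour again: whatever one output gains from a change in the inputs, the others lose.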

#### Cross entropy derivative

The cross entropy cost function is what's used with the softmax function. By looking at it, you can see that it is related to the NLL cost function used with the sigmoid. The cross entropy function is

$$J(\theta) = E_{CE} = -\sum_{k=1}^{K} y_k \log(s(z_k))$$

where $s(z_k)$ is the $k$-th element of $a_{out}$. Only the $k$-th term of the sum depends on $s(z_k)$, so we're interested in finding

$$\frac{\partial J(\theta)}{\partial s(z_k)} = \frac{\partial}{\partial s(z_k)} \left(-y_k \log(s(z_k))\right)$$

We very quickly arrive at what we need since

$$\frac{\partial}{\partial s(z_k)} \left(-y_k \log(s(z_k))\right) = \frac{-y_k}{s(z_k)}$$

#### Combining for output error gradient

What remains is to combine the two steps above so that we find the derivative of $J(\theta)$ with respect to $z_i$. We must remember though that $z_i$ has an effect on every $s(z_k)$, not just on $s(z_i)$, so we have to sum over all of them to get the derivative of $J(\theta)$.

$$\frac{\partial J(\theta)}{\partial z_i} = -\sum_{k=1}^{K} \frac{\partial y_k \log(s(z_{k}))}{\partial z_i}$$

We can apply the chain rule to clean up the log from the fraction

$$-\sum_{k=1}^{K} \frac{\partial y_k \log(s(z_{k}))}{\partial z_i} = -\sum_{k=1}^{K} \frac{\partial y_k \log(s(z_{k}))}{\partial s(z_{k})} \frac{\partial s(z_{k})}{\partial z_i} = -\sum_{k=1}^{K} \frac{y_k}{s(z_{k})} \frac{\partial s(z_{k})}{\partial z_i}$$

We can now substitute in the softmax derivatives from before

$$-\sum_{k=1}^{K} \frac{y_k}{s(z_{k})} \frac{\partial s(z_{k})}{\partial z_i} = -\frac{y_i}{s(z_{i})}s(z_i)(1-s(z_i)) + \sum_{k=1, k \neq i}^{K} \frac{y_k}{s(z_{k})} s(z_i)s(z_k)$$

and do a few simplifications

$$-\frac{y_i}{s(z_{i})}s(z_i)(1-s(z_i)) + \sum_{k=1, k \neq i}^{K} \frac{y_k}{s(z_{k})} s(z_i)s(z_k) = -y_i + y_i s(z_i) + \sum_{k=1, k \neq i}^{K} y_k s(z_i)$$

We have now reached the point where we can merge that summation!

$$-y_i + \sum_{k=1}^{K} y_k s(z_i) = -y_i + s(z_i) \sum_{k=1}^{K} y_k$$

At this point we're almost done. The remaining summation runs over the target vector $y$, which is one-hot (1 for the true class and 0 otherwise), so it is equal to 1. Once we get to $s(z_i) - y_i$, we can generalize it over observations as an elementwise subtraction.

$$-y_i + s(z_i) \sum_{k=1}^{K} y_k = s(z_i) - y_i = a_{out} - y$$
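The whole derivation can be checked numerically. This is a minimal sketch with my own helper names, a made-up input vector, and a one-hot target. It compares the long-form sum, the simplified $s(z_i) - y_i$, and a finite-difference gradient of the cross entropy.

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """J = -sum_k y_k log(s(z_k))."""
    s = softmax(z)
    return -sum(y_k * math.log(s_k) for y_k, s_k in zip(y, s))

def grad_long_form(z, y):
    """dJ/dz_i = -sum_k (y_k / s_k) * ds_k/dz_i, using both softmax cases."""
    s = softmax(z)
    K = len(z)
    grad = []
    for i in range(K):
        total = 0.0
        for k in range(K):
            ds_k_dz_i = s[i] * (1.0 - s[i]) if k == i else -s[k] * s[i]
            total -= (y[k] / s[k]) * ds_k_dz_i
        grad.append(total)
    return grad

def grad_simplified(z, y):
    """The simplified result: s(z_i) - y_i, elementwise."""
    s = softmax(z)
    return [s_i - y_i for s_i, y_i in zip(s, y)]

def grad_numerical(z, y, eps=1e-6):
    """Central-difference estimate of dJ/dz_i for every i."""
    grad = []
    for i in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[i] += eps
        z_minus[i] -= eps
        grad.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * eps))
    return grad

z = [0.5, -1.0, 2.0]  # hypothetical pre-activations
y = [0.0, 0.0, 1.0]   # one-hot target, so sum(y) == 1

print(grad_long_form(z, y))
print(grad_simplified(z, y))
print(grad_numerical(z, y))
```

The three printed gradients should agree, and the entry for the true class is negative because its probability is still below 1.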