# Neural Networks: Back-propagation

Backprop is how the network works out, after a feed-forward pass, how much each parameter contributed to the output error. Combined with gradient descent, back-propagation is what trains the network.


# Further reading

* [Peter's Notes](http://peterroelants.github.io/posts/neural_network_implementation_part01/) are a bit mathy and specific, but I've found them helpful when confused
* [Deep Learning Basics](http://alexminnaar.com/deep-learning-basics-neural-networks-backpropagation-and-stochastic-gradient-descent.html), a guide that covers about the same ground as this one
* [A Step by Step Backpropagation Example](https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/) using actual numbers
* [3Blue1Brown's calculus videos.](https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr) If you want to go deeper into calculus, these are good to get you motivated.
* [Again, 3Blue1Brown's neural network videos might be useful.](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)

## Intuition

If the feed-forward is data pushed all the way forward to the outputs, then back-propagation is the trickling back down of errors from the outputs all the way back to the earliest neurons.

Back-propagation is necessary if you want to use gradient descent on neural networks of multiple layers. Single-layer models can calculate their errors in one step, but multi-layer networks require a multi-step process to get it done.

![Visual aid](Images/intro_backprop_notation.png "Visual aid")

The back-propagation journey all starts at the output. Here there is a clear link between the choice of parameters (weights and biases) and the output error. The approach here is the same as simple gradient descent.

At the layer preceding the output, we'll call it $l_2$, there is an extra step. What is the link between $l_2$ weights and biases and the output error? It has multiple steps: $l_2$ has a direct effect on the output layer's data, and the output layer's data has a direct effect on what the model decides to output. It takes two steps to get back to the end.

In other words, the output layer is the boss and it is directly responsible for the model's error. If the output layer changes its behaviour, it can directly improve its accuracy. It's the easiest to train.

The hidden layers are not directly responsible for the model's error; however, they are responsible for providing the output layer accurate analyses of the model's input data. Knowing their boss, they have an idea of how to change their computations so that the big cheese makes more informed decisions. Their gradient formulas in fact depend on the output layer's weights (the boss's personality, you might say).

## Detour: gradient checking

Backprop takes snapshots of errors everywhere in the NN and uses these to adjust parameters. Normally this is done with calculus and repeated applications of the chain rule of differentiation.

Backprop can also be done by more primitive methods, albeit much more slowly. Numerical differentiation is used to teach students calculus, so it makes sense to show it here first before breaking out the chain rule.

The idea behind numerical differentiation is this: 

1. Take your NN as is
2. Adjust a parameter slightly and see the effect on output error
3. You now know the effect of that parameter on error

![A precise nudge to the weight](Images/intro_gradient_nudge.png "A precise nudge to the weight")

Given $J(\theta)$ your cost function, $\theta_{i,j}$ any single parameter anywhere in the neural network (all other parameters held fixed), and $\epsilon$ a small value used as a "nudge",

$$\frac{\partial J(\theta)}{\partial \theta_{i,j}} \approx \frac{J(\theta_{i,j} + \epsilon) - J(\theta_{i,j} - \epsilon)}{2\epsilon}$$
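The formula above can be sketched in a few lines of Python. This is only a toy illustration: `numerical_gradient` and `toy_cost` are hypothetical names, and the cost is a one-parameter stand-in for a real network's $J(\theta)$.

```python
def numerical_gradient(cost, theta, epsilon=1e-5):
    """Central-difference estimate of the slope of `cost` at `theta`."""
    return (cost(theta + epsilon) - cost(theta - epsilon)) / (2 * epsilon)


# Hypothetical one-parameter cost, J(theta) = theta**2.
# Its exact derivative is 2 * theta, so the slope at theta = 3 is 6.
def toy_cost(theta):
    return theta ** 2


estimate = numerical_gradient(toy_cost, 3.0)
print(estimate)  # very close to the exact slope, 6.0
```

In a real network, `cost` would run a full feed-forward pass for each nudge, which is why this method is so slow.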

Backprop does this for every parameter in the NN. If it sees that the output error increases when a parameter is increased, it will decrease the parameter. If output error increases when the parameter is decreased, backprop increases the parameter instead.

(This is the gradient descent algorithm: it sees error and rolls down the slope in the opposite direction.)
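Here is a minimal sketch of that rolling-downhill loop, assuming a made-up one-parameter cost whose minimum sits at $\theta = 2$. The names `toy_cost` and `learning_rate`, and the values chosen for them, are illustrative assumptions and not part of any particular library.

```python
def numerical_gradient(cost, theta, epsilon=1e-5):
    """Central-difference estimate of the slope of `cost` at `theta`."""
    return (cost(theta + epsilon) - cost(theta - epsilon)) / (2 * epsilon)


# Hypothetical cost with its lowest point at theta = 2.
def toy_cost(theta):
    return (theta - 2.0) ** 2


theta = -1.0          # arbitrary starting point
learning_rate = 0.1   # assumed step size

for _ in range(100):
    gradient = numerical_gradient(toy_cost, theta)
    # Positive gradient: error grows with theta, so decrease theta.
    # Negative gradient: error shrinks with theta, so increase theta.
    theta -= learning_rate * gradient

print(theta)  # ends up very close to 2.0
```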

Ultimately, this makes a neural network more complicated than any collection of corporate committees. Except in rare prophetic instances, an office worker will not know how many dollars their actions win/lose their company. With neural networks, a single neuron will know how much error it is causing its network. And yet, it's never guaranteed that the neuron can do something useful with this information!

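It's a good idea to use gradient checking: compare the gradients from your backprop code against the numerical estimates described above, and if they disagree, the backprop code probably has a bug. It's too slow to train with, but it's a good backup.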
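A gradient check might look like the sketch below. It assumes the simplest possible case, a single sigmoid neuron with one weight, one bias and a negative log-likelihood cost; the function names and the sample values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nll_cost(w, b, x, y):
    """Negative log-likelihood of one sample for a single sigmoid neuron."""
    a = sigmoid(w * x + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))


def analytic_gradients(w, b, x, y):
    """Calculus-based gradients: (a - y) * x for the weight, (a - y) for the bias."""
    a = sigmoid(w * x + b)
    return (a - y) * x, (a - y)


def numerical_gradients(w, b, x, y, epsilon=1e-5):
    """Nudge each parameter in turn and measure the change in cost."""
    dw = (nll_cost(w + epsilon, b, x, y) - nll_cost(w - epsilon, b, x, y)) / (2 * epsilon)
    db = (nll_cost(w, b + epsilon, x, y) - nll_cost(w, b - epsilon, x, y)) / (2 * epsilon)
    return dw, db


# Arbitrary example values.
w, b, x, y = 0.5, -0.3, 1.5, 1.0
analytic = analytic_gradients(w, b, x, y)
numeric = numerical_gradients(w, b, x, y)

# The largest disagreement between the two methods should be tiny.
max_gap = max(abs(g_a - g_n) for g_a, g_n in zip(analytic, numeric))
print(max_gap)
```

If `max_gap` came out large, that would point to a mistake in `analytic_gradients` (or in the cost function itself).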

## Back propagation II

I hope the above has made back propagation make some sense. It's now time for some light mathematics. Don't worry, I'll just paste the answers and skip the algebraic Tetris.

Here are the gradients with respect to error for parameters in the NN model.

Glossary:
* An error gradient: the "slope" of the error. All the model needs to know is which direction this is.
* The Jacobian: a matrix full of gradients. Since all of our parameters are stored in matrices, it makes sense that we'd store all of our gradients in matrices too.

**Note 1:** With these equations, the most important part is whether they're positive or negative, so you can look at them to see what affects their sign. Gradient descent will generally work all right as long as it's heading in the right direction (has the right sign).

**Note 2:** When a gradient is positive (error is increasing with parameter) you want to decrease the parameter. When the gradient is negative, you want to increase the parameter.

The gradients for the output layer weights and biases are

$$\frac{\partial J(\theta)}{\partial B_{out}} = 1 \cdot (a_{out} - y), \frac{\partial J(\theta)}{\partial W_{out}} = a_2^T \cdot (a_{out} - y)$$

The above equations make some sense. If the output neuron is overshooting the target, reduce the bias. It's a similar idea with the weights: if the weights cause the neuron to overshoot when they are given a positive input, they need to be reduced.

(You need the 1 in the bias gradient. It represents the intercept but is also necessary to get the right dimension: with a batch of samples, it is a row of ones that sums the errors over the samples.)
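To see how the dimensions work out, here is a small NumPy sketch of the two output-layer gradients. The layer sizes and the random values are made up for illustration; only the two gradient lines mirror the equations above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_hidden, n_out = 4, 3, 1       # assumed sizes, for illustration

a_2 = rng.random((n_samples, n_hidden))    # stand-in for layer 2's activations
a_out = rng.random((n_samples, n_out))     # stand-in for the model's outputs
y = np.array([[0.0], [1.0], [1.0], [0.0]]) # made-up targets

error = a_out - y                          # shape (n_samples, n_out)
ones = np.ones((1, n_samples))             # the "1" in the bias gradient

grad_B_out = ones @ error                  # shape (1, n_out): errors summed over samples
grad_W_out = a_2.T @ error                 # shape (n_hidden, n_out): same shape as W_out

print(grad_B_out.shape, grad_W_out.shape)  # (1, 1) (3, 1)
```

Each gradient ends up with the same shape as the parameter it belongs to, which is what lets you subtract one from the other during gradient descent.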

To proceed lower into the previous layer, we have to do some backprop. Here it is:

$$\delta_{out} = (a_{out} - y) \cdot W_{out}^T$$

We also need the derivative of the sigmoid function. We'll just call it $\sigma^\prime$.

We just have to include that in our equations and we'll be fine. The gradients for the second hidden layer are:

$$\frac{\partial J(\theta)}{\partial W_2} = a_1^T \cdot (\sigma^\prime(a_2) \circ \delta_{out}), \frac{\partial J(\theta)}{\partial B_2} = 1 \cdot \sigma^\prime(a_2) \circ \delta_{out}$$

The sigmoid derivative $\sigma^\prime$ is a newcomer, but otherwise these are similar to before. The weight gradients depend on layer 2's input, which comes from layer 1. The bias gradient is simpler, but it still has to pass through the $\sigma^\prime$ and the $\delta_{out}$.

For layer 1 we need a new delta.

$$\delta_2 = (\delta_{out} \circ \sigma^\prime(a_2)) \cdot W_2^T$$

Finally, the last backprop step.

$$\frac{\partial J(\theta)}{\partial W_1} = x^T \cdot (\sigma^\prime(a_1) \circ \delta_2), \frac{\partial J(\theta)}{\partial B_1} = 1 \cdot \sigma^\prime(a_1) \circ \delta_2$$

That's all there is to it.
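Putting the equations of this section together, a full backward pass could be sketched in NumPy as follows. This assumes sigmoid activations in every layer, a negative log-likelihood cost summed over samples, and made-up layer sizes and random data; the function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(a):
    # Sigmoid derivative written in terms of the sigmoid's output a.
    return a * (1.0 - a)


def forward(x, params):
    W_1, B_1, W_2, B_2, W_out, B_out = params
    a_1 = sigmoid(x @ W_1 + B_1)
    a_2 = sigmoid(a_1 @ W_2 + B_2)
    a_out = sigmoid(a_2 @ W_out + B_out)
    return a_1, a_2, a_out


def cost(x, y, params):
    # Negative log-likelihood, summed over all samples.
    a_out = forward(x, params)[2]
    return -np.sum(y * np.log(a_out) + (1 - y) * np.log(1 - a_out))


def backprop(x, y, params):
    W_1, B_1, W_2, B_2, W_out, B_out = params
    a_1, a_2, a_out = forward(x, params)
    ones = np.ones((1, x.shape[0]))  # the "1" in the bias gradients

    # Output layer.
    error = a_out - y
    grad_W_out = a_2.T @ error
    grad_B_out = ones @ error

    # Second hidden layer.
    delta_out = error @ W_out.T
    grad_W_2 = a_1.T @ (sigmoid_prime(a_2) * delta_out)
    grad_B_2 = ones @ (sigmoid_prime(a_2) * delta_out)

    # First hidden layer.
    delta_2 = (delta_out * sigmoid_prime(a_2)) @ W_2.T
    grad_W_1 = x.T @ (sigmoid_prime(a_1) * delta_2)
    grad_B_1 = ones @ (sigmoid_prime(a_1) * delta_2)

    return [grad_W_1, grad_B_1, grad_W_2, grad_B_2, grad_W_out, grad_B_out]


# Made-up data: 5 samples, 2 inputs, two hidden layers of 3 neurons, 1 output.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
y = rng.integers(0, 2, size=(5, 1)).astype(float)
shapes = [(2, 3), (1, 3), (3, 3), (1, 3), (3, 1), (1, 1)]
params = [rng.normal(size=s) for s in shapes]

grads = backprop(x, y, params)
print([g.shape for g in grads])
```

A gradient descent step would then subtract a small multiple of each gradient from its matching parameter.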

## Gradient interpretation

In this section, I'll do my best to narrate what back propagation is doing. Feel free to skip this section.

Back propagation is a repeated application of the chain-rule of differentiation, and its purpose is to determine the effect of a parameter on model error, which is the error gradient with respect to the parameter ($W_1$, $B_{out}$, etc).

Let's take the weight gradient below as an example, starting with the $a_1^T$ term.

$$\frac{\partial J(\theta)}{\partial W_2} = a_1^T \cdot (\sigma^\prime(a_2) \circ \delta_{out})$$

Recall that $W_2$'s role in the neural network is to do the following:

$$a_2 = \sigma(a_1 W_2 + B_2), z_2 = a_1 W_2 + B_2$$

Let's put everything together by answering the question: *how does $W_2$ affect $z_2$?*

The answer: *$W_2$ affects $z_2$ through its interaction with $a_1$.*

The derivative $\frac{\partial z_2}{\partial W_2} = \frac{\partial (a_1 W_2 + B_2)}{\partial W_2} = a_1$ signifies that $z_2$ increases by $a_1$ when $W_2$ increases by 1. This works out nicely here since $a_1 W_2$ is linear; normally, though, the derivative "slope" only holds in a very small area around the current point.

This is where the $a_1^T$ in the gradient comes from. What does it mean? It means that $W_2$'s job is to multiply $a_1$, so its contribution to the model output is $a_1$. Since model error is closely related to model output (through the cost function), $a_1$ is also $W_2$'s contribution to model error.

That covers $a_1^T$.

Let's now look at that $\sigma^\prime(a_2)$ term.

$\sigma^\prime(a_2)$ is the sigmoid's contribution to error. $W_2$'s contribution to error, seen above, passes through the sigmoid prime. Written in terms of the sigmoid's output $a_2 = \sigma(z_2)$, the derivative of the sigmoid function is $\sigma^\prime(a_2) = a_2 (1 - a_2)$. Looking at it a bit, it becomes apparent that the derivative reaches its maximum value of 0.25 when $a_2 = 1 - a_2$, that is when $a_2 = 0.5$ or $z_2 = 0$. The derivative vanishes when $a_2$ approaches 0 or 1. Thus the sigmoid's contribution to error: it restricts the flow of error depending on the value of $a_2$ fed to it.
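A few sample values make the shape of the sigmoid derivative concrete. The sketch below assumes the derivative is computed from the sigmoid's output, $a(1 - a)$; the function name `sigmoid_prime_from_output` and the chosen inputs are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime_from_output(a):
    """Slope of the sigmoid, computed from its output a = sigmoid(z)."""
    return a * (1.0 - a)


# The slope peaks at 0.25 when z = 0 (a = 0.5) and shrinks towards 0
# as the neuron saturates in either direction.
for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    a = sigmoid(z)
    print(f"z={z:+.1f}  a={a:.4f}  slope={sigmoid_prime_from_output(a):.4f}")
```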

Story so far: $\frac{\partial J(\theta)}{\partial W_2}$ is $a_1$ passed through $\sigma^\prime$, the latter at most being 0.25 but possibly close to 0.0.

The next component of $W_2$'s gradient is $\delta_{out} = (a_{out} - y) \cdot W_{out}^T$.

We are seeing $W_{out}$ here because $W_2$'s effect must pass through it to reach the model error. The idea here is that stronger $W_{out}$ values mean that whatever layer 2 outputs will be magnified, while weaker $W_{out}$ will attenuate $W_2$'s influence. Therefore $W_{out}$ is a part of $W_2$'s effect on error.

$(a_{out} - y)$ is more difficult to explain because it is an algebraic simplification. Its full form is $\frac{\partial NLL_{cost}(\sigma(z_{out}))}{\partial z_{out}}$, the derivative of the negative log-likelihood cost with respect to the output neuron's input $z_{out}$ (its weighted sum before the sigmoid). But that isn't too important. The first role of $a_{out} - y$ is to keep the gradient positive if $a_{out} > y$ but turn it negative if $a_{out} < y$: this makes sense since you want to increase/lower $a_{out}$ if it has undershot/overshot $y$. The second role of $a_{out} - y$ is to return a higher error value the wider a gap there is between model output and true output: this gap resides in $[-1, 1]$ since all outputs belong in $[0, 1]$.
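For the curious, here is a sketch of where that simplification comes from. It assumes a single output neuron with a sigmoid activation and the negative log-likelihood cost, and it writes $z_{out}$ for the output neuron's weighted sum before the sigmoid (a symbol introduced here by analogy with $z_2$).

$$J = -\left[ y \ln a_{out} + (1 - y) \ln (1 - a_{out}) \right], \quad a_{out} = \sigma(z_{out})$$

$$\frac{\partial J}{\partial a_{out}} = -\frac{y}{a_{out}} + \frac{1 - y}{1 - a_{out}} = \frac{a_{out} - y}{a_{out} (1 - a_{out})}$$

$$\frac{\partial a_{out}}{\partial z_{out}} = a_{out} (1 - a_{out})$$

$$\frac{\partial J}{\partial z_{out}} = \frac{\partial J}{\partial a_{out}} \cdot \frac{\partial a_{out}}{\partial z_{out}} = a_{out} - y$$

The sigmoid's slope cancels the denominator of the cost's slope, which is why no $\sigma^\prime$ term shows up in the output layer's gradients.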

There you have it. I hope this has helped you understand gradients a little bit better.