# How the backpropagation algorithm works
http://neuralnetworksanddeeplearning.com/chap2.html

## Warm up: a fast matrix-based approach to computing the output from a neural network
![](http://neuralnetworksanddeeplearning.com/images/tikz16.png)

Explicitly, we use $b^l_j$ for the bias of the $j^{th}$ neuron in the $l^{th}$ layer. And we use $a^l_j$ for the activation of the $j^{th}$ neuron in the $l^{th}$ layer.

![](http://neuralnetworksanddeeplearning.com/images/tikz17.png)

$
\begin{eqnarray} 
  a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right),
\end{eqnarray} \ (23)
$

where the sum is over all neurons k in the $(l-1)^{th}$ layer.

We use the obvious notation $\sigma(v)$ to denote this kind of elementwise application of a function. That is, the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$. As an example, if we have the function $f(x) = x^2$ then the vectorized form of $f$ has the effect

$
\begin{eqnarray}
  f\left(\left[ \begin{array}{c} 2 \\ 3 \end{array} \right] \right)
  = \left[ \begin{array}{c} f(2) \\ f(3) \end{array} \right]
  = \left[ \begin{array}{c} 4 \\ 9 \end{array} \right],
\end{eqnarray} \ (24)
$

that is, the vectorized f just squares every element of the vector.

With these notations in mind, Equation (23)  can be rewritten in the beautiful and compact vectorized form

$
\begin{eqnarray} 
  a^{l} = \sigma(w^l a^{l-1}+b^l).
\end{eqnarray} \ (25)
$

When using Equation (25) to compute $a^l$, we compute the intermediate quantity $z^l \equiv w^l a^{l-1}+b^l$ along the way. We call $z^l$ the weighted input to the neurons in layer l.  Equation (25) is sometimes written in terms of the weighted input, as $a^l =\sigma(z^l)$.
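To make Equation (25) concrete, here is a minimal sketch of one layer's forward step in plain Python. It assumes $\sigma$ is the logistic sigmoid; the helper name `layer_forward` and the example weights are made up for illustration.

```python
import math

def sigmoid(z):
    """Logistic sigmoid of a single number (assumed choice of sigma)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(w, a_prev, b):
    """Return (z, a) for one layer: z = w a_prev + b, a = sigma(z)."""
    # Row j of w holds the weights w_jk into neuron j of this layer.
    z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
         for row, b_j in zip(w, b)]
    a = [sigmoid(z_j) for z_j in z]
    return z, a

# Hypothetical layer: 2 neurons fed by 3 activations from the previous layer.
w = [[0.5, -0.5, 0.0],
     [1.0, 1.0, 1.0]]
b = [0.0, -3.0]
a_prev = [1.0, 1.0, 1.0]

z, a = layer_forward(w, a_prev, b)
print(z)  # [0.0, 0.0]
print(a)  # [0.5, 0.5]
```

Returning $z^l$ alongside $a^l$ is deliberate: the weighted inputs are needed again later when computing $\sigma'(z^l)$.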

## The two assumptions we need about the cost function
In the notation of the last section, the quadratic cost has the form

$
\begin{eqnarray}
  C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2,
\end{eqnarray} \ (26)
$

- n is the total number of training examples
- the sum is over individual training examples, x
- y=y(x) is the corresponding desired output
- L denotes the number of layers in the network
- $a^L = a^L(x)$ is the vector of activations output from the network when x is input

The first assumption we need is that the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples, x. We need this assumption because what backpropagation actually lets us do is compute the partial derivatives $\partial C_x / \partial w$ and $\partial C_x / \partial b$ for a single training example. We then recover $\partial C / \partial w$ and $\partial C / \partial b$ by averaging over training examples.

The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network:

![](http://neuralnetworksanddeeplearning.com/images/tikz18.png)

For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example x may be written as

$
\begin{eqnarray}
  C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2,
\end{eqnarray} \ (27)
$

and thus is a function of the output activations.
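Both assumptions can be seen in a small sketch of the quadratic cost: the per-example cost depends only on the output activations and the desired output, and the total cost is the average of the per-example costs. The function names and sample values below are illustrative, not from the text.

```python
def quadratic_cost_single(y, a_out):
    """C_x = 1/2 * ||y - a^L||^2 for one training example (Equation 27)."""
    return 0.5 * sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, a_out))

def quadratic_cost(ys, outputs):
    """C = (1/n) * sum_x C_x, the average of the per-example costs."""
    n = len(ys)
    return sum(quadratic_cost_single(y, a) for y, a in zip(ys, outputs)) / n

# Two made-up training examples: desired outputs and network outputs.
ys = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[1.0, 0.0], [1.0, 1.0]]

print(quadratic_cost_single(ys[1], outputs[1]))  # 0.5
print(quadratic_cost(ys, outputs))               # 0.25
```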

## The Hadamard product, s⊙t
We use s⊙t to denote the elementwise product of the two vectors. Thus the components of s⊙t are just $(s \odot t)_j = s_j t_j$. As an example,

$
\begin{eqnarray}
\left[\begin{array}{c} 1 \\ 2 \end{array}\right] 
  \odot \left[\begin{array}{c} 3 \\ 4\end{array} \right]
= \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right]
= \left[ \begin{array}{c} 3 \\ 8 \end{array} \right].
\end{eqnarray} \ (28)
$

This kind of elementwise multiplication is sometimes called the `Hadamard product` or `Schur product`. 
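In code, the Hadamard product is a one-line elementwise multiplication. A minimal sketch in plain Python, where the helper name `hadamard` is our own:

```python
def hadamard(s, t):
    """Elementwise product of two vectors of the same dimension."""
    if len(s) != len(t):
        raise ValueError("vectors must have the same length")
    return [s_j * t_j for s_j, t_j in zip(s, t)]

print(hadamard([1, 2], [3, 4]))  # [3, 8]
```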

## The four fundamental equations behind backpropagation
![](http://neuralnetworksanddeeplearning.com/images/tikz19.png)

We define the error $\delta^l_j$ of neuron j in layer l by

$
\begin{eqnarray} 
  \delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \ (29)
\end{eqnarray}
$

An equation for the error in the output layer, $\delta^L$: The components of $\delta^L$ are given by

$
\begin{eqnarray} 
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j). \ (BP1)
\end{eqnarray}
$

- $\partial C / \partial a^L_j$ measures how fast the cost is changing as a function of the $j^{th}$ output activation.
- $\sigma'(z^L_j)$ measures how fast the activation function $\sigma$ is changing at $z^L_j$.

If we're using the quadratic cost function then $C = \frac{1}{2} \sum_j(y_j-a^L_j)^2$, and so $\partial C / \partial a^L_j = (a_j^L-y_j)$.

It's easy to rewrite the equation in a matrix-based form, as

$
\begin{eqnarray} 
  \delta^L = \nabla_a C \odot \sigma'(z^L).\ (BP1a)
\end{eqnarray}
$

- $\nabla_a C$ is defined to be a vector whose components are the partial derivatives $\partial C / \partial a^L_j$. You can think of $\nabla_a C$ as expressing the rate of change of C with respect to the output activations.

As an example, in the case of the quadratic cost we have $\nabla_a C =(a^L-y)$, and so the fully matrix-based form of (BP1) becomes

$
\begin{eqnarray} 
  \delta^L = (a^L-y) \odot \sigma'(z^L).\ (30)
\end{eqnarray}
$
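As a rough illustration of Equation (30), the sketch below computes the output error for the quadratic cost. It assumes $\sigma$ is the logistic sigmoid, for which $\sigma'(z) = \sigma(z)(1-\sigma(z))$; the function names and sample values are illustrative.

```python
import math

def sigmoid(z):
    """Logistic sigmoid of a single number (assumed choice of sigma)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the logistic sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(a_out, y, z_out):
    """delta^L = (a^L - y) ⊙ sigma'(z^L), valid for the quadratic cost."""
    return [(a_j - y_j) * sigmoid_prime(z_j)
            for a_j, y_j, z_j in zip(a_out, y, z_out)]

# Made-up output layer with two neurons, both with weighted input 0.
z_out = [0.0, 0.0]
a_out = [sigmoid(z_j) for z_j in z_out]  # [0.5, 0.5]
y = [1.0, 0.0]

print(output_error(a_out, y, z_out))  # [-0.125, 0.125]
```

The sign of each component says which way the weighted input should move: the first neuron's output is too low, so its error is negative.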

An equation for the error $\delta^{l}$ in terms of the error in the next layer, $\delta^{l+1}$: In particular

$
\begin{eqnarray} 
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l),\ (BP2)
\end{eqnarray}
$

Suppose we know the error $\delta^{l+1}$ at the $(l+1)^{\rm th}$ layer. When we apply the transpose weight matrix, $(w^{l+1})^T$, we can think intuitively of this as moving the error backward through the network, giving us some sort of measure of the error at the output of the $l^{\rm th}$ layer. We then take the Hadamard product $\odot \sigma'(z^l)$. This moves the error backward through the activation function in layer l, giving us the error $\delta^l$ in the weighted input to layer l.

By combining (BP2) with (BP1) we can compute the error $\delta^l$ for any layer in the network. We start by using (BP1) to compute $\delta^L$, then apply Equation (BP2) to compute $\delta^{L-1}$, then Equation (BP2) again to compute $\delta^{L-2}$, and so on, all the way back through the network.
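A single application of (BP2) might look like the following sketch. It again assumes a logistic sigmoid, and the name `backpropagate_error` and the numbers are invented for the example.

```python
import math

def sigmoid_prime(z):
    """Derivative of the logistic sigmoid (assumed choice of sigma)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def backpropagate_error(w_next, delta_next, z):
    """delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)."""
    # Entry k of (w^{l+1})^T delta^{l+1} sums over the neurons j in layer l+1.
    propagated = [sum(w_next[j][k] * delta_next[j] for j in range(len(w_next)))
                  for k in range(len(z))]
    return [p_k * sigmoid_prime(z_k) for p_k, z_k in zip(propagated, z)]

# Hypothetical sizes: layer l has 3 neurons, layer l+1 has 2.
w_next = [[1.0, 2.0, 0.0],
          [0.0, 1.0, -1.0]]
delta_next = [0.5, -1.0]
z = [0.0, 0.0, 0.0]

print(backpropagate_error(w_next, delta_next, z))  # [0.125, 0.0, 0.25]
```

Note that the transpose is never built explicitly: indexing `w_next[j][k]` with `k` as the outer loop has the same effect.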

An equation for the rate of change of the cost with respect to any bias in the network: In particular:

$
\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =
  \delta^l_j. \ (BP3)
\end{eqnarray}
$

That is, the error $\delta^l_j$ is exactly equal to the rate of change $\partial C / \partial b^l_j$. We can rewrite (BP3) in shorthand as

$
\begin{eqnarray}
  \frac{\partial C}{\partial b} = \delta, \ (31)
\end{eqnarray}
$

where it is understood that δ is being evaluated at the same neuron as the bias b.
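To see why (BP3) holds, recall that $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$, so the bias $b^l_j$ affects the cost only through $z^l_j$, and $\partial z^l_j / \partial b^l_j = 1$. A short chain-rule sketch using definition (29):

$
\begin{eqnarray}
  \frac{\partial C}{\partial b^l_j}
  = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j}
  = \delta^l_j \cdot 1
  = \delta^l_j.
\end{eqnarray}
$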

An equation for the rate of change of the cost with respect to any weight in the network: In particular:

$
\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j.\ (BP4)
\end{eqnarray}
$

This tells us how to compute the partial derivatives $\partial C / \partial w^l_{jk}$ in terms of the quantities $\delta^l$ and $a^{l-1}$, which we already know how to compute. The equation can be rewritten in a less index-heavy notation as

$
\begin{eqnarray}  \frac{\partial
    C}{\partial w} = a_{\rm in} \delta_{\rm out},\ (32)
\end{eqnarray}
$

where it's understood that $a_{\rm in}$ is the activation of the neuron input to the weight w, and $\delta_{\rm out}$ is the error of the neuron output from the weight w. Zooming in to look at just the weight w, and the two neurons connected by that weight, we can depict this as:

![](http://neuralnetworksanddeeplearning.com/images/tikz20.png)

A nice consequence of Equation (32) is that when the activation $a_{\rm in}$ is small, $a_{\rm in} \approx 0$, the gradient term $\partial C / \partial w$ will also tend to be small.
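Equations (BP3) and (BP4) translate almost directly into code: the bias gradient is the error itself, and the weight gradient is the outer product of the error with the incoming activations. A minimal sketch with made-up values (the name `layer_gradients` is ours):

```python
def layer_gradients(delta, a_prev):
    """Return (grad_w, grad_b) for one layer from (BP4) and (BP3)."""
    grad_b = list(delta)                       # dC/db_j = delta_j
    grad_w = [[a_k * d_j for a_k in a_prev]    # dC/dw_jk = a^{l-1}_k * delta_j
              for d_j in delta]
    return grad_w, grad_b

delta = [0.5, 0.25]
a_prev = [1.0, 0.0, 2.0]   # the second input activation is exactly zero

grad_w, grad_b = layer_gradients(delta, a_prev)
print(grad_w)  # [[0.5, 0.0, 1.0], [0.25, 0.0, 0.5]]
print(grad_b)  # [0.5, 0.25]
```

The middle column of `grad_w` is zero because the corresponding input activation is zero, which is the effect described above: weights leaving an inactive neuron receive no gradient.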

![](http://neuralnetworksanddeeplearning.com/images/tikz21.png)

## The backpropagation algorithm
1. Input x: Set the corresponding activation $a^1$ for the input layer.
1. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{l} = w^l a^{l-1}+b^l$ and $a^{l} = \sigma(z^{l})$.
1. Output error $\delta^L$: Compute the vector $\delta^{L} = \nabla_a C \odot \sigma'(z^L)$.
1. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^{l})$.
1. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
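The five steps above can be sketched end to end in plain Python. This is an illustration under stated assumptions rather than a reference implementation: it assumes a logistic sigmoid and the quadratic cost, stores the network as nested lists, and uses helper names (`matvec`, `backprop`, `cost`) and example weights that are our own. The last lines compare one backpropagated derivative against a finite-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def matvec(w, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w_jk * v_k for w_jk, v_k in zip(row, v)) for row in w]

def backprop(weights, biases, x, y):
    """Return (grad_w, grad_b) of the quadratic cost C_x for one example.

    weights[i] and biases[i] belong to layer i + 2 in the text's numbering,
    because the input layer (layer 1) has no weights or biases.
    """
    # Steps 1-2: set a^1 = x, then feed forward, storing every z^l and a^l.
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        z = [wa + b_j for wa, b_j in zip(matvec(w, activations[-1]), b)]
        zs.append(z)
        activations.append([sigmoid(z_j) for z_j in z])

    # Step 3: output error, with nabla_a C = a^L - y for the quadratic cost.
    delta = [(a_j - y_j) * sigmoid_prime(z_j)
             for a_j, y_j, z_j in zip(activations[-1], y, zs[-1])]

    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for i in range(len(weights) - 1, -1, -1):
        # Step 5: dC/db_j = delta_j and dC/dw_jk = a^{l-1}_k * delta_j.
        grad_b[i] = delta
        grad_w[i] = [[a_k * d_j for a_k in activations[i]] for d_j in delta]
        if i > 0:
            # Step 4: move the error back one layer through (w^{l+1})^T.
            w = weights[i]
            delta = [sum(w[j][k] * delta[j] for j in range(len(w)))
                     * sigmoid_prime(zs[i - 1][k])
                     for k in range(len(zs[i - 1]))]
    return grad_w, grad_b

def cost(weights, biases, x, y):
    """Quadratic cost C_x, used only for the finite-difference check."""
    a = x
    for w, b in zip(weights, biases):
        a = [sigmoid(wa + b_j) for wa, b_j in zip(matvec(w, a), b)]
    return 0.5 * sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, a))

# Hypothetical 2-3-1 network and a single training example.
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],
           [[0.3, -0.1, 0.2]]]
biases = [[0.0, 0.1, -0.1], [0.2]]
x, y = [1.0, 0.5], [1.0]

grad_w, grad_b = backprop(weights, biases, x, y)

# Sanity check: central finite difference for one weight.
eps = 1e-5
original = weights[0][0][0]
weights[0][0][0] = original + eps
c_plus = cost(weights, biases, x, y)
weights[0][0][0] = original - eps
c_minus = cost(weights, biases, x, y)
weights[0][0][0] = original
numeric = (c_plus - c_minus) / (2 * eps)
print(abs(numeric - grad_w[0][0][0]) < 1e-7)  # True
```

Keeping every $z^l$ and $a^l$ from the forward pass is what lets the backward pass run in a single sweep from the output layer to layer 2.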

The backpropagation algorithm computes the gradient of the cost function for a single training example, $C = C_x$. In practice, it's common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of $m$ training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

1. Input a set of training examples
1. For each training example x: Set the corresponding input activation $a^{x,1}$, and perform the following steps:
    1. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1}+b^l$ and $a^{x,l} = \sigma(z^{x,l})$.
    1. Output error $\delta^{x,L}$: Compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$.
    1. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.
1. Gradient descent: For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \rightarrow
  w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l}$.

Of course, to implement stochastic gradient descent in practice you also need an outer loop generating mini-batches of training examples, and an outer loop stepping through multiple epochs of training. I've omitted those for simplicity.
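The gradient descent step of the mini-batch algorithm can be sketched for a single layer as follows. The sketch assumes the per-example errors $\delta^{x,l}$ and activations $a^{x,l-1}$ have already been computed by backpropagation; the function name `sgd_update_layer` and the numbers are hypothetical, and the outer loops over mini-batches and epochs are still left out.

```python
def sgd_update_layer(w, b, deltas, prev_activations, eta):
    """Apply one mini-batch gradient descent step to a single layer.

    deltas[x] is delta^{x,l} and prev_activations[x] is a^{x,l-1}
    for example x of the mini-batch; eta is the learning rate.
    """
    m = len(deltas)
    new_w = [row[:] for row in w]   # copy so the inputs are left unchanged
    new_b = b[:]
    for delta, a_prev in zip(deltas, prev_activations):
        for j, d_j in enumerate(delta):
            # b^l -> b^l - (eta/m) * sum_x delta^{x,l}
            new_b[j] -= eta / m * d_j
            for k, a_k in enumerate(a_prev):
                # w^l -> w^l - (eta/m) * sum_x delta^{x,l} (a^{x,l-1})^T
                new_w[j][k] -= eta / m * d_j * a_k
    return new_w, new_b

# Made-up layer with one neuron and two inputs, and a mini-batch of m = 2.
w = [[1.0, 1.0]]
b = [0.0]
deltas = [[1.0], [3.0]]
prev_activations = [[1.0, 0.0], [1.0, 2.0]]

new_w, new_b = sgd_update_layer(w, b, deltas, prev_activations, eta=0.5)
print(new_w)  # [[0.0, -0.5]]
print(new_b)  # [-1.0]
```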