# Backpropagation

<http://neuralnetworksanddeeplearning.com/chap2.html>

The goal of backpropagation is to compute the partial derivatives $\partial C / \partial w$ and $\partial C / \partial b$ of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network.

Put another way, the backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost. See <http://neuralnetworksanddeeplearning.com/chap2.html#backpropagation_the_big_picture>

## Notations

![img](http://neuralnetworksanddeeplearning.com/images/tikz17.png)

$w^l_{jk}$ = weight for the connection from the $k^{\rm th}$ neuron in the $(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer

$b^l_j$ = bias of the  $j^{\rm th}$ neuron in the $l^{\rm th}$ layer

$a^l_j$ = activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer
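The index order in $w^l_{jk}$ (destination neuron $j$ first, source neuron $k$ second) is easy to get backwards, so here is a minimal sketch of one feedforward step in plain Python. It assumes a sigmoid activation and the weighted input $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$; the function name `layer_forward` and the numeric values are illustrative only.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(w, b, a_prev):
    """Return (z, a) for one layer.

    w[j][k] is the weight from neuron k in layer l-1 to neuron j in layer l,
    so w has one row per neuron in layer l.
    """
    z = [sum(w_jk * a_k for w_jk, a_k in zip(w_j, a_prev)) + b_j
         for w_j, b_j in zip(w, b)]
    a = [sigmoid(z_j) for z_j in z]
    return z, a

# Layer l-1 has 3 neurons, layer l has 2 neurons (illustrative values).
w = [[0.1, -0.2, 0.3],
     [0.0, 0.5, -0.5]]
b = [0.0, 0.1]
a_prev = [1.0, 0.0, 1.0]
z, a = layer_forward(w, b, a_prev)
```

With this layout each row of `w` belongs to one neuron of layer $l$, which is why the layer can be written as a matrix-vector product without a transpose.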

## Assumptions

Assumption 1: the cost function can be written as an average over cost functions $C_x$ for individual training examples $x$. For the quadratic cost:

$C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2 = \frac{1}{n} \sum_x C_x$ where $C_x = \frac{1}{2} \|y-a^L \|^2$


Assumption 2: the cost can be written as a function of the outputs from the neural network. For the quadratic cost on a single training example:

$C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2$
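As a small illustration of both assumptions, the sketch below computes the quadratic cost $C_x$ from the network outputs alone and then averages it over examples. The function names and the sample targets and outputs are made up for this example.

```python
def example_cost(y, a_out):
    """C_x = 1/2 * ||y - a^L||^2 for one training example."""
    return 0.5 * sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, a_out))

def total_cost(targets, outputs):
    """C = (1/n) * sum over training examples of C_x."""
    n = len(targets)
    return sum(example_cost(y, a) for y, a in zip(targets, outputs)) / n

# Two training examples with two output neurons each (illustrative values).
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.8, 0.2], [0.4, 0.6]]
cost = total_cost(targets, outputs)
```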

Detailed discussion <http://neuralnetworksanddeeplearning.com/chap2.html#the_two_assumptions_we_need_about_the_cost_function>

## Equations

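$\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}$ is the error of neuron $j$ in layer $l$, where $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$ is the weighted input to that neuron and $a^l_j = \sigma(z^l_j)$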

Backpropagation gives a way of computing $\delta^l_j$ for every layer, and then relating those errors to the quantities of real interest, $\partial C / \partial w^l_{jk}$ and  $\partial C / \partial b^l_{j}$ 

### Equation 1:  error in the output layer $\delta^L$

$\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)\tag{BP1}$
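One way to see where (BP1) comes from is the chain rule. The sketch below assumes $a^L_j = \sigma(z^L_j)$, so the output activation $a^L_k$ depends on $z^L_j$ only when $k = j$ and every other term of the sum vanishes:

$\delta^L_j = \frac{\partial C}{\partial z^L_j} = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)$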

Matrix form: 

$\begin{eqnarray} 
  \delta^L = \nabla_a C \odot \sigma'(z^L)
\tag{BP1a}\end{eqnarray}$

$\nabla_a C$ is defined to be the vector whose components are the partial derivatives $\partial C / \partial a^L_j$, and $\odot$ denotes the elementwise (Hadamard) product

For the quadratic cost function $C = \frac{1}{2} \sum_j(y_j-a^L_j)^2$ we have $\partial C / \partial a^L_j = (a_j^L-y_j)$, so (BP1a) becomes:

$\begin{eqnarray} 
  \delta^L = (a^L-y) \odot \sigma'(z^L).
\tag{30}\end{eqnarray}$
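The quadratic-cost form of (BP1) translates almost directly into code. This is a minimal sketch that assumes a sigmoid activation; `output_error` is a hypothetical helper name and the inputs are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(a_out, y, z_out):
    """delta^L = (a^L - y) ⊙ sigma'(z^L), computed component by component."""
    return [(a_j - y_j) * sigmoid_prime(z_j)
            for a_j, y_j, z_j in zip(a_out, y, z_out)]

z_out = [0.0, 2.0]                       # weighted inputs to the output layer
a_out = [sigmoid(z_j) for z_j in z_out]  # output activations
y = [1.0, 1.0]                           # desired outputs
delta_out = output_error(a_out, y, z_out)
# First component: (0.5 - 1.0) * 0.25 = -0.125
```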

### Equation 2: error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$

$\begin{eqnarray} 
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l),
\tag{BP2}\end{eqnarray}$

Intuition: 
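- applying the transpose weight matrix $(w^{l+1})^T$ moves the error $\delta^{l+1}$ backward through the network, giving a measure of the error at the output of layer $l$
- taking the Hadamard product $\odot \sigma'(z^l)$ moves the error backward through the activation function in layer $l$, giving the error $\delta^l$ in the weighted input to layer $l$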
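The two steps of (BP2) can be sketched as follows, again assuming a sigmoid activation. The helper `backpropagate_error` and the numbers are illustrative; `w_next[j][k]` follows the $w^{l+1}_{jk}$ index order, so the transposed product sums over $j$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backpropagate_error(w_next, delta_next, z):
    """delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)."""
    # Step 1: transposed product. Component k sums over the neurons j
    # of layer l+1 that neuron k of layer l feeds into.
    back = [sum(w_next[j][k] * delta_next[j] for j in range(len(delta_next)))
            for k in range(len(z))]
    # Step 2: Hadamard product with sigma'(z^l).
    return [back_k * sigmoid_prime(z_k) for back_k, z_k in zip(back, z)]

w_next = [[1.0, 2.0],
          [3.0, 4.0]]       # weights from layer l (columns) to layer l+1 (rows)
delta_next = [1.0, -1.0]    # error in layer l+1
z = [0.0, 0.0]              # weighted inputs in layer l, so sigma'(z) = 0.25
delta = backpropagate_error(w_next, delta_next, z)
# (w^T delta_next) = [-2.0, -2.0], so delta = [-0.5, -0.5]
```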

### Equation 3: rate of change of the cost with respect to any bias in the network

$\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =
  \delta^l_j
\tag{BP3}\end{eqnarray}$

The error $\delta^l_j$ on the right-hand side is already known from (BP1) and (BP2). In shorthand:

$\begin{eqnarray}
  \frac{\partial C}{\partial b} = \delta
\tag{31}\end{eqnarray}$

Here $\delta$ is evaluated at the same neuron as the bias $b$

### Equation 4: rate of change of the cost with respect to any weight in the network

$\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j.
\tag{BP4}\end{eqnarray}$
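Both (BP3) and (BP4) follow from one application of the chain rule. The sketch below assumes the weighted input is $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$, so the cost depends on $w^l_{jk}$ and $b^l_j$ only through $z^l_j$:

$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j \, a^{l-1}_k$

$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j \cdot 1 = \delta^l_j$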

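(BP4) can also be written with fewer indices, where $a_{\rm in}$ is the activation of the neuron feeding into the weight $w$ and $\delta_{\rm out}$ is the error of the neuron the weight $w$ feeds into: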

$\begin{eqnarray}  \frac{\partial
    C}{\partial w} = a_{\rm in} \delta_{\rm out},
\tag{32}\end{eqnarray}$
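Given the error of a layer and the activations feeding into it, (BP3) and (BP4) reduce to a copy and an outer product. A minimal sketch, with a hypothetical helper name and made-up values:

```python
def gradients(a_prev, delta):
    """Return (grad_w, grad_b) for one layer.

    grad_b[j]    = delta[j]               (BP3)
    grad_w[j][k] = a_prev[k] * delta[j]   (BP4)
    """
    grad_b = list(delta)
    grad_w = [[a_k * d_j for a_k in a_prev] for d_j in delta]
    return grad_w, grad_b

a_prev = [1.0, 0.0, 2.0]   # activations a^{l-1}; the middle neuron is inactive
delta = [0.5, -1.0]        # error delta^l
grad_w, grad_b = gradients(a_prev, delta)
# grad_w[0] is [0.5, 0.0, 1.0]: the column for the inactive neuron is zero
```

Note that `grad_w` has the same shape as the weight matrix of the layer, one row per neuron in layer $l$ and one column per neuron in layer $l-1$.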


A weight learns slowly when $\partial C / \partial w \approx 0$. This happens when:
- $a_{\rm in} \approx 0$, i.e. the input neuron has low activation
- $\sigma'(z^L_j) \approx 0$, which occurs when $\sigma(z^L_j) \approx 0$ (low activation) or $\sigma(z^L_j) \approx 1$ (high activation); in that case the output neuron has saturated and the weight has stopped learning (or is learning slowly)
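To get a feel for saturation, the sketch below prints $\sigma(z)$ and $\sigma'(z)$ for a few weighted inputs, assuming the sigmoid activation. The chosen values of `z` are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# sigma'(z) peaks at z = 0 and shrinks quickly as |z| grows.
for z in (-10.0, -5.0, 0.0, 5.0, 10.0):
    print(f"z = {z:6.1f}  sigma = {sigmoid(z):.5f}  sigma' = {sigmoid_prime(z):.5f}")
```

The derivative is largest at $z = 0$, where it equals $0.25$, and it is close to zero at both ends of the range, which is where (BP1) produces a small $\delta^L_j$ even if the output is far from the target.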

Long example of slow learning <http://neuralnetworksanddeeplearning.com/chap3.html#the_cross-entropy_cost_function>

## Algorithm

<http://neuralnetworksanddeeplearning.com/chap2.html#the_backpropagation_algorithm>
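Putting (BP1) to (BP4) together for a single training example gives the sketch below: feed forward, compute the output error, propagate it backward, and read off the gradients. It is not the code from the linked chapter. It assumes sigmoid neurons and the quadratic cost, uses plain Python lists for clarity rather than speed, and all names and numeric values are illustrative. A finite-difference estimate of one derivative is included as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def feedforward(weights, biases, x):
    """Return weighted inputs zs and activations (activations[0] is x)."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        z = [sum(w_jk * a_k for w_jk, a_k in zip(w_j, activations[-1])) + b_j
             for w_j, b_j in zip(w, b)]
        zs.append(z)
        activations.append([sigmoid(z_j) for z_j in z])
    return zs, activations

def cost(weights, biases, x, y):
    """Quadratic cost C_x = 1/2 * ||y - a^L||^2 for one example."""
    _, activations = feedforward(weights, biases, x)
    return 0.5 * sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, activations[-1]))

def backprop(weights, biases, x, y):
    """Return (grad_w, grad_b) of C_x with the same shapes as weights, biases."""
    zs, activations = feedforward(weights, biases, x)
    # BP1 for the quadratic cost: delta^L = (a^L - y) ⊙ sigma'(z^L)
    delta = [(a_j - y_j) * sigmoid_prime(z_j)
             for a_j, y_j, z_j in zip(activations[-1], y, zs[-1])]
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = list(delta)                                   # BP3
        grad_w[l] = [[a_k * d_j for a_k in activations[l]]        # BP4
                     for d_j in delta]
        if l > 0:
            # BP2: move the error back through weights[l] and sigma'
            w = weights[l]
            back = [sum(w[j][k] * delta[j] for j in range(len(delta)))
                    for k in range(len(zs[l - 1]))]
            delta = [back_k * sigmoid_prime(z_k)
                     for back_k, z_k in zip(back, zs[l - 1])]
    return grad_w, grad_b

# A tiny 2-3-1 network with made-up parameters.
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],
           [[0.3, -0.1, 0.2]]]
biases = [[0.0, 0.1, -0.1], [0.05]]
x, y = [1.0, 0.5], [1.0]
grad_w, grad_b = backprop(weights, biases, x, y)

# Sanity check: central finite difference for one weight in the first layer.
eps = 1e-6
weights[0][1][0] += eps
c_plus = cost(weights, biases, x, y)
weights[0][1][0] -= 2 * eps
c_minus = cost(weights, biases, x, y)
weights[0][1][0] += eps  # restore the original value
numeric = (c_plus - c_minus) / (2 * eps)
print(abs(numeric - grad_w[0][1][0]) < 1e-7)
```

The gradients here are for one example only. Under Assumption 1, the gradient of the full cost is the average of these per-example gradients over the training set.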

