# Proof of the Four Backpropagation Equations (BP1 - BP4)

## **Notation**
Let:
- $L$: Total number of layers in the network.
- $w^l_{jk}$: Weight connecting the $k$-th neuron in layer $l-1$ to the $j$-th neuron in layer $l$.
- $b^l_j$: Bias of the $j$-th neuron in layer $l$.
- $z^l_j$: Weighted input to the $j$-th neuron in layer $l$, defined as:
  $$
  z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j
  $$
- $a^l_j$: Activation of the $j$-th neuron in layer $l$, defined as:
  $$
  a^l_j = \sigma(z^l_j)
  $$
  where $\sigma$ is the activation function.
- $\delta^l_j$: Error of the $j$-th neuron in layer $l$, defined as:
  $$
  \delta^l_j = \frac{\partial C}{\partial z^l_j}
  $$
  where $C$ is the cost function.
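
The definitions of $z^l_j$ and $a^l_j$ above can be sketched as a forward pass. This is a minimal illustration, not a reference implementation: it assumes the logistic sigmoid for $\sigma$, treats the input as layer 1, and uses a hypothetical 2-3-1 network whose parameter values are made up for the example.

```python
import math


def sigmoid(z):
    """Logistic function, one possible choice for sigma (an assumption)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, biases, x):
    """Return (zs, activations) for every layer.

    weights[i][j][k] plays the role of w^l_{jk} and biases[i][j] of b^l_j,
    where list position i corresponds to layer l = i + 2 (input = layer 1).
    """
    activations = [list(x)]  # a^1 is the input itself
    zs = []
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        # z^l_j = sum_k w^l_{jk} a^{l-1}_k + b^l_j
        z = [sum(W[j][k] * a_prev[k] for k in range(len(a_prev))) + b[j]
             for j in range(len(b))]
        zs.append(z)
        # a^l_j = sigma(z^l_j)
        activations.append([sigmoid(zj) for zj in z])
    return zs, activations


# Hypothetical 2-3-1 network with hand-picked parameters.
weights = [
    [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],  # w^2, shape 3x2
    [[0.3, -0.1, 0.2]],                      # w^3, shape 1x3
]
biases = [[0.0, 0.1, -0.1], [0.05]]
zs, activations = forward(weights, biases, [1.0, 0.5])
```

Storing every $z^l$ and $a^l$ during the forward pass is a deliberate choice: the backpropagation equations below reuse both.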


## **BP1: Output Layer Error**
The error in the output layer $L$ is:
$$
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)
$$

**Derivation:**
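1. By the chain rule, noting that $C$ depends on $z^L_j$ only through $a^L_j$ (each activation $a^L_k = \sigma(z^L_k)$ depends on its own weighted input alone):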
   $$
   \delta^L_j = \frac{\partial C}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \cdot \frac{\partial a^L_j}{\partial z^L_j}
   $$
2. Since $a^L_j = \sigma(z^L_j)$, we have:
   $$
   \frac{\partial a^L_j}{\partial z^L_j} = \sigma'(z^L_j)
   $$
3. Substituting, we get:
   $$
   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)
   $$
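
As a sanity check, BP1 can be compared against a finite-difference estimate of $\partial C / \partial z^L_j$. The sketch below assumes the logistic sigmoid and the quadratic cost $C = \frac{1}{2} \sum_j (a^L_j - y_j)^2$, for which $\partial C / \partial a^L_j = a^L_j - y_j$; neither choice is fixed by the derivation, and the numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def cost_from_z(z, y):
    """Quadratic cost (an assumption) written as a function of z^L."""
    return 0.5 * sum((sigmoid(zj) - yj) ** 2 for zj, yj in zip(z, y))


def delta_output(z, y):
    """BP1: delta^L_j = dC/da^L_j * sigma'(z^L_j), with dC/da^L_j = a^L_j - y_j."""
    return [(sigmoid(zj) - yj) * sigmoid_prime(zj) for zj, yj in zip(z, y)]


def numeric_delta(z, y, eps=1e-6):
    """Central finite difference of C with respect to each z^L_j."""
    out = []
    for j in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        out.append((cost_from_z(z_plus, y) - cost_from_z(z_minus, y)) / (2 * eps))
    return out


z_L = [0.3, -1.2, 0.8]  # hypothetical weighted inputs of the output layer
y = [1.0, 0.0, 0.0]     # hypothetical target
analytic = delta_output(z_L, y)
numeric = numeric_delta(z_L, y)
```

The two lists should agree up to the truncation and rounding error of the finite difference.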

## **BP2: Error Propagation**
The error in layer $l$ is related to the error in layer $l+1$ by:
$$
\delta^l_j = \left( \sum_k w^{l+1}_{kj} \delta^{l+1}_k \right) \sigma'(z^l_j)
$$

**Derivation:**
1. By the chain rule, since $C$ depends on $z^l_j$ only through the weighted inputs $z^{l+1}_k$ of the next layer:
   $$
   \delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \cdot \frac{\partial z^{l+1}_k}{\partial z^l_j}
   $$
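2. From the definition of $z^{l+1}_k$, written with summation index $i$ so that it does not clash with the fixed index $j$:
   $$
   z^{l+1}_k = \sum_i w^{l+1}_{ki} a^l_i + b^{l+1}_k = \sum_i w^{l+1}_{ki} \sigma(z^l_i) + b^{l+1}_k
   $$
   Only the $i = j$ term depends on $z^l_j$, so: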
   $$
   \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j)
   $$
3. Substituting, and using $\frac{\partial C}{\partial z^{l+1}_k} = \delta^{l+1}_k$, we get:
   $$
   \delta^l_j = \left( \sum_k w^{l+1}_{kj} \delta^{l+1}_k \right) \sigma'(z^l_j)
   $$
   $$
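
The propagation step can be checked numerically on a single pair of layers. In this sketch layer $l+1$ is taken to be the output layer, so $\delta^{l+1}$ comes from BP1; the logistic sigmoid, the quadratic cost, and all parameter values (`W_next`, `b_next`, `y`, `z_l`) are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


W_next = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]  # hypothetical w^{l+1}, shape 2x3
b_next = [0.1, -0.1]                           # hypothetical b^{l+1}
y = [1.0, 0.0]                                 # hypothetical target


def next_weighted_input(z):
    """z^{l+1}_k = sum_j w^{l+1}_{kj} sigma(z^l_j) + b^{l+1}_k."""
    a = [sigmoid(zj) for zj in z]
    return [sum(W_next[k][j] * a[j] for j in range(len(a))) + b_next[k]
            for k in range(len(b_next))]


def cost_from_z(z):
    """Quadratic cost (an assumption) as a function of z^l."""
    return 0.5 * sum((sigmoid(zk) - yk) ** 2
                     for zk, yk in zip(next_weighted_input(z), y))


def backprop_delta(z):
    """BP2: delta^l_j = (sum_k w^{l+1}_{kj} delta^{l+1}_k) * sigma'(z^l_j)."""
    z_next = next_weighted_input(z)
    delta_next = [(sigmoid(zk) - yk) * sigmoid_prime(zk)
                  for zk, yk in zip(z_next, y)]  # BP1 on layer l+1
    return [sum(W_next[k][j] * delta_next[k] for k in range(len(delta_next)))
            * sigmoid_prime(z[j])
            for j in range(len(z))]


def numeric_delta(z, eps=1e-6):
    """Central finite difference of C with respect to each z^l_j."""
    out = []
    for j in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        out.append((cost_from_z(z_plus) - cost_from_z(z_minus)) / (2 * eps))
    return out


z_l = [0.4, -0.7, 1.1]  # hypothetical weighted inputs of layer l
analytic = backprop_delta(z_l)
numeric = numeric_delta(z_l)
```

Note that the sum runs over the first index of `W_next`, which is the transpose of how the weights are used in the forward pass.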


## **BP3: Gradient of the Cost with Respect to Biases**
The gradient of the cost with respect to the biases is:
$$
\frac{\partial C}{\partial b^l_j} = \delta^l_j
$$

**Derivation:**
1. By the chain rule, since $b^l_j$ affects $C$ only through $z^l_j$:
   $$
   \frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j} \cdot \frac{\partial z^l_j}{\partial b^l_j}
   $$
2. From the definition of $z^l_j$:
   $$
   z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j
   $$
   so:
   $$
   \frac{\partial z^l_j}{\partial b^l_j} = 1
   $$
3. Substituting, and using the definition $\delta^l_j = \frac{\partial C}{\partial z^l_j}$, we get:
   $$
   \frac{\partial C}{\partial b^l_j} = \delta^l_j
   $$
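
A small numerical check of BP3 for one layer is sketched below. It assumes layer $l$ is the output layer (so that $\delta^l$ is available from BP1), the logistic sigmoid, and the quadratic cost; the values of `a_prev`, `W`, `b`, and `y` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


a_prev = [0.6, -0.3]           # hypothetical a^{l-1}
W = [[0.2, -0.5], [0.7, 0.1]]  # hypothetical w^l
b = [0.05, -0.2]               # hypothetical b^l
y = [1.0, 0.0]                 # hypothetical target


def weighted_input(bias):
    """z^l_j = sum_k w^l_{jk} a^{l-1}_k + b^l_j for the given biases."""
    return [sum(W[j][k] * a_prev[k] for k in range(len(a_prev))) + bias[j]
            for j in range(len(bias))]


def cost(bias):
    """Quadratic cost (an assumption) as a function of the biases only."""
    return 0.5 * sum((sigmoid(zj) - yj) ** 2
                     for zj, yj in zip(weighted_input(bias), y))


# delta^l_j via BP1, since layer l is the output layer in this sketch.
delta = [(sigmoid(zj) - yj) * sigmoid_prime(zj)
         for zj, yj in zip(weighted_input(b), y)]

# Central finite difference of C with respect to each b^l_j.
eps = 1e-6
numeric_grad_b = []
for j in range(len(b)):
    b_plus = list(b)
    b_minus = list(b)
    b_plus[j] += eps
    b_minus[j] -= eps
    numeric_grad_b.append((cost(b_plus) - cost(b_minus)) / (2 * eps))
```

By BP3, `delta` and `numeric_grad_b` should match up to finite-difference error, so no separate bias-gradient computation is needed once the errors are known.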

## **BP4: Gradient of the Cost with Respect to Weights**
The gradient of the cost with respect to the weights is:
$$
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j
$$

**Derivation:**
1. By the chain rule, since $w^l_{jk}$ affects $C$ only through $z^l_j$:
   $$
   \frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \cdot \frac{\partial z^l_j}{\partial w^l_{jk}}
   $$
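2. From the definition of $z^l_j$, written with summation index $m$ so that it does not clash with the fixed index $k$:
   $$
   z^l_j = \sum_m w^l_{jm} a^{l-1}_m + b^l_j
   $$
   Only the $m = k$ term depends on $w^l_{jk}$, so: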
   $$
   \frac{\partial z^l_j}{\partial w^l_{jk}} = a^{l-1}_k
   $$
3. Substituting, and using the definition $\delta^l_j = \frac{\partial C}{\partial z^l_j}$, we get:
   $$
   \frac{\partial C}{\partial w^l_{jk}} = \delta^l_j a^{l-1}_k
   $$
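
Taken together, the four equations give a complete gradient computation. The sketch below is one possible arrangement, not a definitive implementation: it assumes the logistic sigmoid, the quadratic cost, a single training example, and a hypothetical 2-3-1 network. The helper `max_weight_error` is introduced here only to compare BP4 against finite differences.

```python
import copy
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def forward(weights, biases, x):
    """Return (zs, activations); activations[0] is the input."""
    activations, zs = [list(x)], []
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        z = [sum(W[j][k] * a_prev[k] for k in range(len(a_prev))) + b[j]
             for j in range(len(b))]
        zs.append(z)
        activations.append([sigmoid(zj) for zj in z])
    return zs, activations


def cost(weights, biases, x, y):
    """Quadratic cost (an assumption) for one example."""
    _, activations = forward(weights, biases, x)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(activations[-1], y))


def backprop(weights, biases, x, y):
    """Return (grad_w, grad_b) with the same nesting as weights and biases."""
    zs, activations = forward(weights, biases, x)
    delta = [(a - t) * sigmoid_prime(z)
             for a, t, z in zip(activations[-1], y, zs[-1])]  # BP1
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for i in reversed(range(len(weights))):
        grad_b[i] = list(delta)  # BP3
        # BP4: dC/dw_{jk} = a^{l-1}_k * delta^l_j
        grad_w[i] = [[d * a_k for a_k in activations[i]] for d in delta]
        if i > 0:
            W = weights[i]
            delta = [sum(W[k][j] * delta[k] for k in range(len(delta)))
                     * sigmoid_prime(zs[i - 1][j])
                     for j in range(len(zs[i - 1]))]  # BP2
    return grad_w, grad_b


def max_weight_error(weights, biases, x, y, eps=1e-6):
    """Largest gap between BP4 and a central finite difference (hypothetical helper)."""
    grad_w, _ = backprop(weights, biases, x, y)
    worst = 0.0
    for i, W in enumerate(weights):
        for j, row in enumerate(W):
            for k in range(len(row)):
                w_plus = copy.deepcopy(weights)
                w_minus = copy.deepcopy(weights)
                w_plus[i][j][k] += eps
                w_minus[i][j][k] -= eps
                numeric = (cost(w_plus, biases, x, y)
                           - cost(w_minus, biases, x, y)) / (2 * eps)
                worst = max(worst, abs(numeric - grad_w[i][j][k]))
    return worst


# Hypothetical 2-3-1 network and training example.
weights = [
    [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],
    [[0.3, -0.1, 0.2]],
]
biases = [[0.0, 0.1, -0.1], [0.05]]
x, y = [1.0, 0.5], [1.0]
grad_w, grad_b = backprop(weights, biases, x, y)
error = max_weight_error(weights, biases, x, y)
```

Each `grad_w[i]` is the outer product of the layer's error with the previous layer's activations, which is BP4 written for all $j$ and $k$ at once.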