# Error Backpropagation

*Error backpropagation* (or simply backpropagation) is a method for
computing the **gradient** of the loss function with respect to the
**weights** of the network ***citation needed***. This gradient is
then used to update the weights in a direction that
**reduces the loss**, typically using an **optimization algorithm**
like stochastic gradient descent.

## Passes

### Forward pass

During the forward pass, the *input* data is passed through the network
**layer by layer**, and the *output* of each layer is computed using
the current **weights** and **biases**.

The final output of the network is compared to the **true labels** to
compute the **loss** using a *loss function*.
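
As a rough illustration of what a loss function looks like in code, here is a minimal sketch of two common choices, mean squared error and binary cross-entropy. The function names `mse` and `binary_cross_entropy` and the `eps` clamping value are assumptions made for this example, not part of any particular library.

```python
import math

def mse(predictions, targets):
    """Mean squared error over paired predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def binary_cross_entropy(predictions, targets, eps=1e-12):
    """Mean binary cross-entropy; predictions are probabilities in (0, 1)."""
    total = 0.0
    for p, t in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0); eps is an assumed value
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)

print(mse([1.0, 2.0], [1.0, 4.0]))         # 2.0
print(binary_cross_entropy([0.5], [1.0]))  # ln 2, roughly 0.693
```

In both cases a prediction that matches the label gives a loss near zero, and the loss grows as the prediction moves away from the label.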


Consider a simple neural network with one hidden layer, with the following notation:

- $x$ — input
- $y$ — true label
- $\hat{y}$ — predicted output
- $W_1$, $W_2$ — weights of layers 1 and 2
- $b_1$, $b_2$ — biases of layers 1 and 2
- $z_1$, $z_2$ — pre-activation outputs of layers 1 and 2
- $a_1$, $a_2$ — activation outputs of layers 1 and 2
- $L$ — loss
- $\sigma$ — activation function (sigmoid, ReLU, etc.)
- $\mathrm{Loss}$ — loss function (like MSE for regression tasks,
  CE for classification…)

The forward pass can be summarized as follows, where the output activation $a_2$ is the prediction $\hat{y}$:

$$
\begin{aligned}
z_1 &= W_1 x + b_1 \\
a_1 &= \sigma(z_1) \\
z_2 &= W_2 a_1 + b_2 \\
\hat{y} &= \sigma(z_2) \\
L &= \mathrm{Loss}(\hat{y}, y)
\end{aligned}
$$
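
The same five steps can be sketched in plain Python. This is a minimal sketch under simplifying assumptions: every quantity is a single scalar (one input, one hidden unit, one output), $\sigma$ is the sigmoid, and the loss is the squared error $\tfrac{1}{2}(\hat{y} - y)^2$. The function names `sigmoid` and `forward` are chosen for this example only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, y, W1, b1, W2, b2):
    """Scalar forward pass; returns every intermediate value plus the loss."""
    z1 = W1 * x + b1               # pre-activation of layer 1
    a1 = sigmoid(z1)               # activation of layer 1
    z2 = W2 * a1 + b2              # pre-activation of layer 2
    y_hat = sigmoid(z2)            # prediction (a2)
    loss = 0.5 * (y_hat - y) ** 2  # assumed loss: half squared error
    return z1, a1, z2, y_hat, loss

z1, a1, z2, y_hat, loss = forward(x=1.0, y=1.0, W1=0.0, b1=0.0, W2=0.0, b2=0.0)
print(a1, y_hat, loss)  # 0.5 0.5 0.125
```

Returning the intermediate values is deliberate: the backward pass reuses $a_1$ and $\hat{y}$, so they are usually cached during the forward pass.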


### Backward pass

Starting from the *output* layer, the **gradient of the loss** with
respect to the output is computed. It is then **propagated backward**
through the network, layer by layer, using the **chain rule**.

For each layer, the gradient of the loss with respect to the
**weights** and **biases** is computed.

The computed gradients are used to update the weights of the network.
This is typically done using an **optimization algorithm** like SGD,
Adam, or [RMSprop][1].

With the same notation as in the forward pass, the chain rule gives the
gradients below. The factor $W_2$ in the layer 1 gradients is
$\partial z_2 / \partial a_1$.


$$
\newcommand{\same}{\color{blue}{}}
\newcommand{\tame}{\color{orange}{}}
\renewcommand{\same}{\color{blue}{}}
\renewcommand{\tame}{\color{orange}{}}
$$


$$
\frac{\partial \tame L}{\partial \same{W_2}} = \frac{\partial \tame L}{\partial \hat{y}}
  \cdot \frac{\partial \hat{y}}{\partial z_2}
  \cdot \frac{\partial z_2}{\partial \same{W_2}}
$$

$$
\frac{\partial \tame L}{\partial \same{b_2}} = \frac{\partial \tame L}{\partial \hat{y}}
  \cdot \frac{\partial \hat{y}}{\partial z_2}
  \cdot \frac{\partial z_2}{\partial \same{b_2}}
$$

$$
\frac{\partial \tame L}{\partial \same{W_1}} = \left(
  \frac{\partial \tame L}{\partial \hat{y}}
  \cdot \frac{\partial \hat{y}}{\partial z_2}
  \cdot W_2
\right)
\cdot \frac{\partial a_1}{\partial z_1}
\cdot \frac{\partial z_1}{\partial \same{W_1}}
$$

$$
\frac{\partial \tame L}{\partial \same{b_1}} = \left(
  \frac{\partial \tame L}{\partial \hat{y}}
  \cdot \frac{\partial \hat{y}}{\partial z_2}
  \cdot W_2
\right)
\cdot \frac{\partial a_1}{\partial z_1}
\cdot \frac{\partial z_1}{\partial \same{b_1}}
$$
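
To make the chain-rule products concrete, here is a minimal scalar sketch that computes the four gradients and compares them with a finite-difference estimate. It assumes a sigmoid activation, for which $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, and the loss $\tfrac{1}{2}(\hat{y} - y)^2$, for which $\partial L / \partial \hat{y} = \hat{y} - y$. The names `loss_fn`, `gradients`, `delta1` and `delta2`, and the parameter values, are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(x, y, W1, b1, W2, b2):
    a1 = sigmoid(W1 * x + b1)
    y_hat = sigmoid(W2 * a1 + b2)
    return 0.5 * (y_hat - y) ** 2

def gradients(x, y, W1, b1, W2, b2):
    """Return (dL/dW1, dL/db1, dL/dW2, dL/db2) for the scalar network."""
    # Forward pass, keeping the values the backward pass needs.
    a1 = sigmoid(W1 * x + b1)
    y_hat = sigmoid(W2 * a1 + b2)
    # Output layer: dL/dz2 = dL/dy_hat * dy_hat/dz2.
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)
    dW2 = delta2 * a1  # dz2/dW2 = a1
    db2 = delta2       # dz2/db2 = 1
    # Hidden layer: propagate through W2 (dz2/da1) and sigma'(z1).
    delta1 = delta2 * W2 * a1 * (1.0 - a1)
    dW1 = delta1 * x   # dz1/dW1 = x
    db1 = delta1       # dz1/db1 = 1
    return dW1, db1, dW2, db2

params = {"W1": 0.5, "b1": -0.2, "W2": 1.5, "b2": 0.1}
x, y = 0.8, 1.0
h = 1e-6  # step for the central finite difference

max_error = 0.0
for name, analytic in zip(["W1", "b1", "W2", "b2"], gradients(x, y, **params)):
    up = loss_fn(x, y, **{**params, name: params[name] + h})
    down = loss_fn(x, y, **{**params, name: params[name] - h})
    numeric = (up - down) / (2 * h)
    max_error = max(max_error, abs(analytic - numeric))

print(max_error < 1e-7)  # True
```

Note that `delta2` is computed once and reused for all four gradients. This reuse of the shared factor is what makes backpropagation cheaper than differentiating each parameter independently.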

These gradients are then used to update the weights and biases, where
$\eta$ is the learning rate:

$$
W_2 \leftarrow W_2 - \eta \frac{\partial L}{\partial W_2}
$$

$$
b_2 \leftarrow b_2 - \eta \frac{\partial L}{\partial b_2}
$$

$$
W_1 \leftarrow W_1 - \eta \frac{\partial L}{\partial W_1}
$$

$$
b_1 \leftarrow b_1 - \eta \frac{\partial L}{\partial b_1}
$$
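
The update rules can be sketched as a plain gradient-descent loop on a single training example. As before, this is a scalar sketch assuming a sigmoid activation and the loss $\tfrac{1}{2}(\hat{y} - y)^2$. The helper `train_step`, the starting parameters, and the learning rate `eta=0.5` are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, y, p, eta):
    """One gradient-descent step; returns (updated params, loss before the update)."""
    # Forward pass.
    a1 = sigmoid(p["W1"] * x + p["b1"])
    y_hat = sigmoid(p["W2"] * a1 + p["b2"])
    loss = 0.5 * (y_hat - y) ** 2
    # Backward pass (chain rule, scalar case).
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)
    delta1 = delta2 * p["W2"] * a1 * (1.0 - a1)
    grads = {"W1": delta1 * x, "b1": delta1, "W2": delta2 * a1, "b2": delta2}
    # Update: parameter <- parameter - eta * gradient.
    new_p = {name: p[name] - eta * grads[name] for name in p}
    return new_p, loss

params = {"W1": 0.5, "b1": -0.2, "W2": 1.5, "b2": 0.1}
losses = []
for _ in range(100):
    params, loss = train_step(x=0.8, y=1.0, p=params, eta=0.5)
    losses.append(loss)

print(losses[-1] < losses[0])  # True
```

In practice the loop would iterate over many examples or mini-batches, and an optimizer such as Adam or RMSprop would replace the plain update line.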

[1]: https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html

## TODO

- explain what passes are there and what they do