### 09. Backpropagation

#### Table of Contents

- [1. Chain rule](#heading)
- [2. Computational graph and local gradients](#heading)
- [3. Forward and backward pass](#heading)
- [4. Backpropagation with PyTorch code](#heading)

#### 1. Chain rule

<img src="./images/chain_rule.png" alt="chain_rule" width="400"/>

The **chain rule** is a formula for computing the derivative of a composite function. Suppose a variable $z$ depends on a variable $y$, $z=f(y)$, and $y$ in turn depends on a variable $x$, $y=g(x)$. Then the derivative of $z$ with respect to $x$ is:

$\frac{\partial z}{\partial x}=\frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$
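
As a quick numerical check, the sketch below picks two illustrative functions, $y=g(x)=x^{2}$ and $z=f(y)=\sin(y)$, and compares the chain-rule product with a central finite-difference estimate of the derivative of $z$ with respect to $x$. These functions and the helper names are assumptions made for this example only.

```python
import math

# Illustrative functions (assumed for this example only).
def g(x):
    return x ** 2          # y = g(x)

def f(y):
    return math.sin(y)     # z = f(y)

def chain_rule_derivative(x):
    y = g(x)
    dz_dy = math.cos(y)    # derivative of sin(y) with respect to y
    dy_dx = 2 * x          # derivative of x^2 with respect to x
    return dz_dy * dy_dx

def finite_difference(x, eps=1e-6):
    # Central difference approximation of dz/dx.
    return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

x0 = 1.5
analytic = chain_rule_derivative(x0)
numeric = finite_difference(x0)
print(analytic, numeric)
```

The two printed values should agree to several decimal places, which is what the chain rule predicts.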

#### 2. Computational graph and local gradients

<img src="./images/computational_graph.png" alt="computational_graph" width="400"/>

A **computational graph** is a directed graph whose nodes correspond to mathematical operations or variables. It is a way of expressing and evaluating a mathematical expression.

Assume that $z$ is a function of the variables $x$ and $y$, $z=f(x,y)=xy$, and that the $Loss$ is a function of $z$, $Loss=g(z)$. If the derivative of $Loss$ with respect to $z$, $\frac{\partial Loss}{\partial z}$, is already known, how can we calculate the derivative of $Loss$ with respect to $x$?

First, we can obtain the **local gradients** (the derivative of $z$ with respect to $x$ or $y$):

$\frac{\partial z}{\partial x}=\frac{\partial (xy)}{\partial x}=y$, $\frac{\partial z}{\partial y}=\frac{\partial (xy)}{\partial y}=x$

Then, based on the **chain rule**, the derivative of $Loss$ with respect to $x$ is given by:

$\frac{\partial Loss}{\partial x}=\frac{\partial Loss}{\partial z} \frac{\partial z}{\partial x}=y\frac{\partial Loss}{\partial z}$
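
The multiply node can be sketched in a few lines of Python. The function names `multiply_forward` and `multiply_backward` and the choice $Loss=z^{2}$ are assumptions made for illustration; the text only requires that $\frac{\partial Loss}{\partial z}$ is known.

```python
def multiply_forward(x, y):
    # Forward pass of the node z = x * y.
    return x * y

def multiply_backward(x, y, dloss_dz):
    # Local gradients of z = x * y: dz/dx = y and dz/dy = x.
    # Chain rule: multiply each local gradient by the upstream gradient.
    dloss_dx = y * dloss_dz
    dloss_dy = x * dloss_dz
    return dloss_dx, dloss_dy

x, y = 3.0, 4.0
z = multiply_forward(x, y)        # 12.0
dloss_dz = 2 * z                  # assumed loss: Loss = z^2, so dLoss/dz = 2z
dloss_dx, dloss_dy = multiply_backward(x, y, dloss_dz)
print(dloss_dx, dloss_dy)         # 96.0 72.0
```

The node only needs its own inputs and the upstream gradient, which is why each operation in a graph can compute its part of the backward pass independently.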

#### 3. Forward and backward pass

Backpropagation consists of three steps:
 - (1) forward pass: compute loss
 - (2) compute local gradients
 - (3) backward pass: compute _dLoss/dWeights_ using the chain rule

<img src="./images/forward_backward_pass.png" alt="forward_backward_pass" width="400"/>

**Linear regression:**
 - The linear regression model is defined as: $\hat{y}=wx$
 - The predicted value is denoted $\hat{y}$ and the true value $y$
 - The error between the predicted and true values is: $s=\hat{y}-y$
 - The loss function is defined as the squared error: $Loss=s^{2}=(\hat{y}-y)^{2}$
 - Given the example $(x=1, y=2)$ and the initial weight $w=1$

**Forward pass:**
 - $\hat{y}=wx=1 \times 1=1$
 - $s=\hat{y}-y=1-2=-1$
 - $Loss=(\hat{y}-y)^{2}=(-1)^{2}=1$

**Local gradients:**
  - $\frac{\partial Loss}{\partial s}=\frac{\partial s^{2}}{\partial s}=2s$
  - $\frac{\partial s}{\partial \hat{y}}=\frac{\partial (\hat{y}-y)}{\partial \hat{y}}=1$
  - $\frac{\partial \hat{y}}{\partial w}=\frac{\partial (wx)}{\partial w}=x$

**Backward pass:** use the chain rule to compute _dLoss/dWeights_
 - $\frac{\partial Loss}{\partial w}=\frac{\partial Loss}{\partial s} \cdot \frac{\partial s}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w}=2sx=2 \times (-1) \times 1=-2$
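
The three steps can be reproduced by hand in plain Python. This is a minimal sketch that mirrors the numbers above; the variable names are chosen for this example.

```python
# Example and initial weight from the text.
x, y, w = 1.0, 2.0, 1.0

# (1) Forward pass.
y_hat = w * x            # 1.0
s = y_hat - y            # -1.0
loss = s ** 2            # 1.0

# (2) Local gradients.
dloss_ds = 2 * s         # -2.0
ds_dyhat = 1.0
dyhat_dw = x             # 1.0

# (3) Backward pass: multiply the local gradients (chain rule).
dloss_dw = dloss_ds * ds_dyhat * dyhat_dw
print(loss, dloss_dw)    # 1.0 -2.0
```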

#### 4. Backpropagation with PyTorch code

```python
import torch
```

```python
# use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# define x, y and the initial weight; requires_grad=True makes autograd track w
x = torch.tensor(1.0, device=device)
y = torch.tensor(2.0, device=device)
w = torch.tensor(1.0, device=device, requires_grad=True)
print(x)
print(y)
print(w)
```

```
tensor(1., device='cuda:0')
tensor(2., device='cuda:0')
tensor(1., device='cuda:0', requires_grad=True)
```


```python
# forward pass and compute the loss
y_hat = w * x
loss = (y_hat - y)**2
print(loss)
```

```
tensor(1., device='cuda:0', grad_fn=<PowBackward0>)
```


```python
# backward pass: autograd applies the chain rule and stores dLoss/dw in w.grad
loss.backward()
print(w.grad)

# update weights
# next forward and backward
```

```
tensor(-2., device='cuda:0')
```
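
The two placeholder comments in the last cell (update the weights, then repeat the forward and backward pass) could be filled in roughly as follows. This is a sketch, not the original notebook's code: the learning rate `lr = 0.1`, the 20 iterations, and running on the CPU are assumptions. The update is wrapped in `torch.no_grad()` so that autograd does not record it, and `w.grad.zero_()` clears the gradient because `backward()` accumulates into `w.grad`.

```python
import torch

x = torch.tensor(1.0)
y = torch.tensor(2.0)
w = torch.tensor(1.0, requires_grad=True)

lr = 0.1  # assumed learning rate

for step in range(20):  # assumed number of iterations
    # forward pass and loss
    y_hat = w * x
    loss = (y_hat - y) ** 2

    # backward pass: fills w.grad with dLoss/dw
    loss.backward()

    # gradient descent update, excluded from the autograd graph
    with torch.no_grad():
        w -= lr * w.grad

    # reset the accumulated gradient before the next iteration
    w.grad.zero_()

print(w.item(), loss.item())
```

With these settings $w$ moves toward 2, the value for which $\hat{y}=wx$ matches $y=2$ at $x=1$.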
