# A2 - Neural Nets: Backprop

Consider the following three-layer neural network (with two hidden layers):

![Three-layer neural network](lib/three_layer_net.png)

| Forward formula &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; | Backward formula &emsp;&emsp; | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp; |
|---|---|---|
| $z_{j}^{(2)} = \sum_i w_{ij}^{(1)} x_i + b_{j}^{(1)}$ | $\frac{\partial z_{j}^{(2)}}{\partial w_{ij}^{(1)}} = x_i$ | $\frac{\partial z_{j}^{(2)}}{\partial x_i} = w_{ij}^{(1)}$ |
| $h_{j}^{(2)} = \max\{z_{j}^{(2)}, 0\}$ | $\frac{\partial h_{j}^{(2)}}{\partial z_{j}^{(2)}} = \mathbb{1}(z_{j}^{(2)} > 0)$ | |
| $z_{k}^{(3)} = \sum_j w_{jk}^{(2)} h_{j}^{(2)} + b_{k}^{(2)}$ | $\frac{\partial z_{k}^{(3)}}{\partial w_{jk}^{(2)}} = h_{j}^{(2)}$ | $\frac{\partial z_{k}^{(3)}}{\partial h_{j}^{(2)}} = w_{jk}^{(2)}$ |
| $h_{k}^{(3)} = \max\{z_{k}^{(3)}, 0\}$ | $\frac{\partial h_{k}^{(3)}}{\partial z_{k}^{(3)}} = \mathbb{1}(z_{k}^{(3)} > 0)$ | |
| $z = \sum_k w_{k}^{(3)} h_{k}^{(3)} + b^{(3)}$ | $\frac{\partial z}{\partial w_{k}^{(3)}} = h_{k}^{(3)}$ | $\frac{\partial z}{\partial h_{k}^{(3)}} = w_{k}^{(3)}$ |
| $\hat{y} = h = \sigma(z)$ | $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y})$ | |
| $L = -\left[y\log\hat{y} + (1-y)\log(1-\hat{y})\right]$ | $\frac{\partial L}{\partial \hat{y}} = - \left[\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right]$ | |
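
The last two rows of the table can be sanity-checked numerically. The sketch below compares the closed-form derivatives of the sigmoid and the binary cross-entropy with what autograd computes. The scalar values of `z` and `y` are hypothetical and chosen only for illustration.

```python
import torch

# Hypothetical scalar pre-activation and label, chosen only for illustration
z = torch.tensor(0.5, requires_grad=True)
y = torch.tensor(1.0)

y_hat = torch.sigmoid(z)
y_hat.retain_grad()  # y_hat is a non-leaf tensor, so ask autograd to keep its gradient
L = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))
L.backward()

# Closed-form expressions from the table
with torch.no_grad():
    dL_dyhat = -(y / y_hat - (1 - y) / (1 - y_hat))
    dyhat_dz = y_hat * (1 - y_hat)

# Both comparisons should print True if the table formulas match autograd
print(torch.allclose(y_hat.grad, dL_dyhat))
print(torch.allclose(z.grad, dL_dyhat * dyhat_dz))
```

The second comparison is the chain rule in its smallest form: the gradient with respect to `z` is the product of the two local derivatives.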

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
class ThreeLayerNet(nn.Module):
    def __init__(self, d1, d2, d3):
        super().__init__()
        self.fc1 = nn.Linear(d1, d2)
        self.fc2 = nn.Linear(d2, d3)
        self.fc3 = nn.Linear(d3, 1)
    
    def forward(self, x):
        z2 = self.fc1(x)
        h2 = torch.relu(z2)
        z3 = self.fc2(h2)
        h3 = torch.relu(z3)
        z = self.fc3(h3)
        h = torch.sigmoid(z)
        
        # Also return the intermediate tensors so their values and gradients can be inspected
        intermediate = {
            'x': x, 
            'z2': z2, 
            'h2': h2, 
            'z3': z3, 
            'h3': h3, 
            'z' : z, 
            'h' : h,
        }
        # Non-leaf tensors drop their .grad after backward() unless retain_grad() is called
        for v in intermediate.values():
            v.retain_grad()
        return h, intermediate
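
The `retain_grad()` calls matter because autograd only stores `.grad` on leaf tensors by default. A minimal sketch of that behavior, using hypothetical tensors `w`, `a`, and `b` that are not part of the network above:

```python
import torch

w = torch.tensor([2.0], requires_grad=True)  # leaf tensor (hypothetical value)
a = w * 3.0      # non-leaf: its gradient is not stored by default
b = w * 3.0      # non-leaf: its gradient is kept because of retain_grad()
b.retain_grad()

(a + b).sum().backward()

print(w.is_leaf, a.is_leaf, b.is_leaf)  # True False False
print(w.grad)                           # tensor([6.])
print(b.grad)                           # tensor([1.])
```

In `ThreeLayerNet.forward`, every entry of `intermediate` except `x` is a non-leaf tensor, so without `retain_grad()` its `.grad` would be `None` after `loss.backward()`.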

In [None]:
model = ThreeLayerNet(2, 2, 3)  # 2 inputs, 2 units in the first hidden layer, 3 in the second
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Manually initialize the weights in place, so the optimizer created above
# keeps referring to the same parameter tensors
with torch.no_grad():
    model.fc1.weight.copy_(torch.tensor([[1., 0.], [0., 1.]]))
    model.fc1.bias.copy_(torch.tensor([0., 0.]))
    model.fc2.weight.copy_(torch.tensor([[1., 1.], [-1., 1.], [1., 0.]]))
    model.fc2.bias.copy_(torch.tensor([0., 0., 0.]))
    model.fc3.weight.copy_(torch.tensor([[1., -1., 0.]]))
    model.fc3.bias.copy_(torch.tensor([0.]))
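
When matching these matrices to the formulas, note that `nn.Linear` stores its weight with shape `(out_features, in_features)`. The entry $w_{jk}^{(2)}$ (from hidden unit $j$ to unit $k$) therefore sits at `fc2.weight[k, j]`. A small sketch with a hypothetical standalone layer `fc`:

```python
import torch
import torch.nn as nn

fc = nn.Linear(2, 3)        # hypothetical layer: 2 inputs, 3 outputs
print(fc.weight.shape)      # torch.Size([3, 2]), i.e. (out_features, in_features)

x = torch.tensor([[1., -1.]])
manual = x @ fc.weight.T + fc.bias  # the affine map written out by hand
print(torch.allclose(fc(x), manual))
```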

In [None]:
model

In [None]:
# Create some data
x = torch.tensor([[1., -1.]], requires_grad=True)
y = torch.tensor([[0.]])

In [None]:
# Initial model weights
for name, p in model.named_parameters():
    print('-', name)
    print(p.data)
    print()

## Forward computation

| Forward formula &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; | Value(s) &emsp;&emsp;&emsp; |
|---------|-------|
| $z_{j}^{(2)} = \sum_i w_{ij}^{(1)} x_i + b_{j}^{(1)}$ | ___ |
| $h_{j}^{(2)} = \max\{z_{j}^{(2)}, 0\}$ | ___ |
| $z_{k}^{(3)} = \sum_j w_{jk}^{(2)} h_{j}^{(2)} + b_{k}^{(2)}$ | ___ |
| $h_{k}^{(3)} = \max\{z_{k}^{(3)}, 0\}$ | ___ |
| $z = \sum_k w_{k}^{(3)} h_{k}^{(3)} + b^{(3)}$ | ___ |
| $\hat{y} = \sigma(z)$ | ___ |
| $L = -\left[y\log\hat{y} + (1-y)\log(1-\hat{y})\right]$| ___ |

In [None]:
# Forward pass
y_hat, results = model(x)
loss = criterion(y_hat, y)

In [None]:
for name, v in results.items():
    print(name, '\t', v.data)

print('L', '\t', loss.data)
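
One way to check hand-computed entries of the forward table is to write the forward formulas out with plain tensor operations. This is only a sketch: it copies the weights and the data point from the cells above, and the names `W1`, `b1`, and so on are introduced here for illustration.

```python
import torch

# Weights, biases, and data copied from the cells above (names are illustrative)
W1 = torch.tensor([[1., 0.], [0., 1.]])
b1 = torch.zeros(2)
W2 = torch.tensor([[1., 1.], [-1., 1.], [1., 0.]])
b2 = torch.zeros(3)
W3 = torch.tensor([[1., -1., 0.]])
b3 = torch.zeros(1)
x = torch.tensor([1., -1.])
y = torch.tensor(0.)

z2 = W1 @ x + b1                # first affine layer
h2 = z2.clamp(min=0)            # ReLU
z3 = W2 @ h2 + b2               # second affine layer
h3 = z3.clamp(min=0)            # ReLU
z = W3 @ h3 + b3                # output pre-activation
y_hat = torch.sigmoid(z)
L = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))

for name, value in [('z2', z2), ('h2', h2), ('z3', z3), ('h3', h3),
                    ('z', z), ('y_hat', y_hat), ('L', L)]:
    print(name, '\t', value)
```

The printed values should agree with the output of the model's forward pass.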

## Backward computation

| Backward formula &emsp;&emsp;&emsp; | Value(s) &emsp;&emsp;&emsp; | Backward formula &emsp;&emsp;&emsp; | Value(s) &emsp;&emsp;&emsp; |
|---|---|---|---|
| $\frac{\partial z_{j}^{(2)}}{\partial w_{ij}^{(1)}} = x_i$ | ___ | $\frac{\partial z_{j}^{(2)}}{\partial x_i} = w_{ij}^{(1)}$ | ___ |
| $\frac{\partial h_{j}^{(2)}}{\partial z_{j}^{(2)}} = \mathbb{1}(z_{j}^{(2)} > 0)$ | ___ | | |
| $\frac{\partial z_{k}^{(3)}}{\partial w_{jk}^{(2)}} = h_{j}^{(2)}$ | ___ | $\frac{\partial z_{k}^{(3)}}{\partial h_{j}^{(2)}} = w_{jk}^{(2)}$ | ___ |
| $\frac{\partial h_{k}^{(3)}}{\partial z_{k}^{(3)}} = \mathbb{1}(z_{k}^{(3)} > 0)$ | ___ | | |
| $\frac{\partial z}{\partial w_{k}^{(3)}} = h_{k}^{(3)}$ | ___ | $\frac{\partial z}{\partial h_{k}^{(3)}} = w_{k}^{(3)}$ | ___ |
| $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y})$ | ___ | | |
| $\frac{\partial L}{\partial \hat{y}} = - \left[\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right]$ | ___ | | |

Applying the chain rule:

$\frac{\partial L}{\partial z} = $

$\frac{\partial L}{\partial w_{k}^{(3)}} = $

$\frac{\partial L}{\partial w_{jk}^{(2)}} = $

$\frac{\partial L}{\partial w_{ij}^{(1)}} = $

In [None]:
# Backward pass
optimizer.zero_grad()
loss.backward()

In [None]:
for name, v in results.items():
    print('Gradient of', name, '\t', v.grad)

In [None]:
for name, p in model.named_parameters():
    print('Gradient of', name)
    print(p.grad)
    print()
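
The chain-rule products can also be assembled by hand from the local derivatives in the backward table and compared with autograd. The sketch below assumes the same weights and data as above. It omits the biases because they are all zero here, and names such as `dL_dz3` are introduced only for illustration.

```python
import torch
import torch.nn.functional as F

# Same weights and data as above; biases are zero and therefore omitted
W1 = torch.tensor([[1., 0.], [0., 1.]], requires_grad=True)
W2 = torch.tensor([[1., 1.], [-1., 1.], [1., 0.]], requires_grad=True)
W3 = torch.tensor([[1., -1., 0.]], requires_grad=True)
x = torch.tensor([1., -1.])
y = torch.tensor([0.])

# Forward pass, then autograd's backward pass as the reference
z2 = W1 @ x
h2 = torch.relu(z2)
z3 = W2 @ h2
h3 = torch.relu(z3)
z = W3 @ h3
y_hat = torch.sigmoid(z)
loss = F.binary_cross_entropy(y_hat, y)
loss.backward()

# Manual backward pass: multiply local derivatives from the output back to the input
with torch.no_grad():
    dL_dyhat = -(y / y_hat - (1 - y) / (1 - y_hat))
    dL_dz = dL_dyhat * y_hat * (1 - y_hat)
    dL_dW3 = torch.outer(dL_dz, h3)      # dz/dw_k = h_k
    dL_dh3 = W3.T @ dL_dz                # dz/dh_k = w_k
    dL_dz3 = dL_dh3 * (z3 > 0)           # ReLU indicator
    dL_dW2 = torch.outer(dL_dz3, h2)     # dz_k/dw_jk = h_j
    dL_dh2 = W2.T @ dL_dz3               # sum over k of dL/dz_k * w_jk
    dL_dz2 = dL_dh2 * (z2 > 0)           # ReLU indicator
    dL_dW1 = torch.outer(dL_dz2, x)      # dz_j/dw_ij = x_i

# Each comparison should print True
print(torch.allclose(W3.grad, dL_dW3))
print(torch.allclose(W2.grad, dL_dW2))
print(torch.allclose(W1.grad, dL_dW1))
```

Because the gradients are stored in the same `(out_features, in_features)` layout as the weights, each weight gradient is an outer product of the upstream gradient and the layer's input.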