# 3. Backpropagation From Scratch

Backpropagation is the algorithm neural networks use to learn.
It computes gradients: how much the loss would change if each parameter changed slightly.
Let's build it from scratch!

In [None]:
import torch
import matplotlib.pyplot as plt

## 1. The Chain Rule & Computational Graph

Neural networks are just big composite functions.
To find the derivative of a composite function, we use the **Chain Rule**.

If $y = f(u)$ and $u = g(x)$, then:
$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$
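
As a quick sanity check, the chain rule can be compared against a finite-difference estimate. The sketch below assumes an illustrative pair $f(u) = \sin(u)$ and $g(x) = x^2$; neither appears elsewhere in this notebook, and any differentiable pair would do. The evaluation point `x0` and step size `h` are also arbitrary choices.

```python
import math

# Illustrative functions (assumptions for this example):
# y = f(u) = sin(u), u = g(x) = x^2
def g(x):
    return x ** 2

def f(u):
    return math.sin(u)

def dy_dx_chain(x):
    """Chain rule: dy/dx = dy/du * du/dx = cos(u) * 2x."""
    u = g(x)
    dy_du = math.cos(u)
    du_dx = 2 * x
    return dy_du * du_dx

def dy_dx_numeric(x, h=1e-6):
    """Central finite difference of the composite f(g(x))."""
    return (f(g(x + h)) - f(g(x - h))) / (2 * h)

x0 = 1.5
analytic = dy_dx_chain(x0)
numeric = dy_dx_numeric(x0)
print(f"chain rule: {analytic:.6f}")
print(f"numeric:    {numeric:.6f}")
```

The two printed values should agree to several decimal places, since the finite difference only approximates the derivative.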

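As a running example, let's trace a simple computation: $L = (wx + b - y_{true})^2$
Its computational graph is a chain of simple operations (multiply, add, subtract, square), each with an easy local derivative.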
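
The computational graph for this expression can be made concrete with a tiny scalar autodiff class. This is a minimal sketch for illustration, not a description of PyTorch's internals; the `Value` class and its methods are hypothetical names introduced here, and the numbers ($w=2$, $b=1$, $x=3$, $y_{true}=10$) are the ones used in the manual example of the next section.

```python
class Value:
    """A scalar node in a computational graph (hypothetical helper)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # pushes self.grad to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __sub__(self, other):
        out = Value(self.data - other.data, (self, other))
        def _backward():
            # d(a - b)/da = 1 and d(a - b)/db = -1
            self.grad += out.grad
            other.grad -= out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def square(self):
        out = Value(self.data ** 2, (self,))
        def _backward():
            # d(a^2)/da = 2a
            self.grad += 2 * self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Visit nodes in reverse topological order so that each node's
        # gradient is complete before it is passed on to its parents.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0  # dL/dL
        for node in reversed(order):
            node._backward()


w_node, b_node = Value(2.0), Value(1.0)
x_node, y_node = Value(3.0), Value(10.0)

L_node = (w_node * x_node + b_node - y_node).square()
L_node.backward()

print(f"L     = {L_node.data}")
print(f"dL/dw = {w_node.grad}")
print(f"dL/db = {b_node.grad}")
```

Each operation only knows its own local derivative; the `backward` method strings those local derivatives together, which is the chain rule applied node by node.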

## 2. Manual Calculation Step-by-Step

Let's calculate gradients for a single linear neuron manually.
Forward pass:
1. $z = w \cdot x + b$
2. $\hat{y} = z$ (linear activation for simplicity)
3. $L = (\hat{y} - y_{true})^2$

Backward pass (Chain Rule):
We want $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$.

1. $\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y_{true})$
2. $\frac{\partial \hat{y}}{\partial z} = 1$
3. $\frac{\partial z}{\partial w} = x$
4. $\frac{\partial z}{\partial b} = 1$

Putting it together:
$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w} = 2(\hat{y} - y_{true}) \cdot 1 \cdot x$

$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial b} = 2(\hat{y} - y_{true}) \cdot 1 \cdot 1$
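
As a concrete check, plug in the values used in the code cell below ($w = 2$, $b = 1$, $x = 3$, $y_{true} = 10$):

1. $z = 2 \cdot 3 + 1 = 7$, so $\hat{y} = 7$ and $L = (7 - 10)^2 = 9$
2. $\frac{\partial L}{\partial \hat{y}} = 2(7 - 10) = -6$
3. $\frac{\partial L}{\partial w} = -6 \cdot 1 \cdot 3 = -18$
4. $\frac{\partial L}{\partial b} = -6 \cdot 1 \cdot 1 = -6$

Both gradients are negative because the prediction (7) is below the target (10): increasing $w$ or $b$ would reduce the loss.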

In [None]:
# Manual Backpropagation Example

# 1. Initialize parameters
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# 2. Input and Target
x = torch.tensor(3.0)
y_true = torch.tensor(10.0)

# 3. Forward Pass
z = w * x + b
y_pred = z
loss = (y_pred - y_true)**2

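print("Forward Pass:")
# .item() prints plain Python numbers instead of tensor(...) wrappers
print(f"x={x.item()}, w={w.item()}, b={b.item()}")
print(f"y_pred = {w.item()}*{x.item()} + {b.item()} = {y_pred.item()}")
print(f"loss = ({y_pred.item()} - {y_true.item()})^2 = {loss.item()}")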
print("-" * 30)

# 4. Backward Pass (Manual)
# detach() keeps the manual arithmetic out of autograd's graph
dL_dy_pred = 2 * (y_pred.detach() - y_true)
dy_pred_dz = 1.0
dz_dw = x
dz_db = 1.0

grad_w_manual = dL_dy_pred * dy_pred_dz * dz_dw
grad_b_manual = dL_dy_pred * dy_pred_dz * dz_db

print("Manual Gradients:")
print(f"dL/dw = {grad_w_manual}")
print(f"dL/db = {grad_b_manual}")

# 5. Verify with PyTorch Autograd
loss.backward()
print("-" * 30)
print("PyTorch Autograd Gradients:")
print(f"w.grad = {w.grad}")
print(f"b.grad = {b.grad}")

# Compare with a tolerance rather than exact float equality
assert torch.isclose(grad_w_manual, w.grad)
assert torch.isclose(grad_b_manual, b.grad)
print("\nMatch! The manual gradients agree with autograd.")

## 3. Building a 'Linear' Layer with Backprop

Now let's encapsulate this logic into a class.
We need to store `inputs` during the forward pass to use them in the backward pass.

In [None]:
class Linear:
    def __init__(self, in_features, out_features):
        # Initialize weights and bias
        self.W = torch.randn(out_features, in_features) * 0.1
        self.b = torch.zeros(out_features)
        
        # Gradients storage
        self.grad_W = None
        self.grad_b = None
        
        # Cache for backward pass
        self.input_cache = None

    def forward(self, x):
        """
        Forward pass: y = x @ W^T + b
        x shape: (batch_size, in_features)
        W shape: (out_features, in_features)
        output shape: (batch_size, out_features)
        """
        self.input_cache = x  # Store input for backward pass
        return x @ self.W.T + self.b

    def backward(self, grad_output):
        """
        Backward pass
        grad_output: dL/dy (gradient of loss w.r.t. output of this layer),
                     shape (batch_size, out_features)
        Returns dL/dx, shape (batch_size, in_features)
        """
        x = self.input_cache
        
        # 1. Calculate gradients for parameters
        # dL/dW = (dL/dy)^T * x
        self.grad_W = grad_output.T @ x
        
        # dL/db = sum(dL/dy) across batch
        self.grad_b = grad_output.sum(dim=0)
        
        # 2. Calculate gradient for input (to pass to previous layer)
        # dL/dx = dL/dy * W
        grad_input = grad_output @ self.W
        
        return grad_input
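
One way to gain confidence in the `backward` formulas is to compare them with PyTorch's autograd on a small random batch. The sketch below repeats a trimmed copy of the `Linear` class so that it runs on its own; the batch size, feature sizes, seed, and the use of `sum()` as a stand-in loss are arbitrary assumptions made only for this check.

```python
import torch

class Linear:
    """Trimmed copy of the layer above, repeated so this cell is self-contained."""
    def __init__(self, in_features, out_features):
        self.W = torch.randn(out_features, in_features) * 0.1
        self.b = torch.zeros(out_features)
        self.grad_W = None
        self.grad_b = None
        self.input_cache = None

    def forward(self, x):
        self.input_cache = x
        return x @ self.W.T + self.b

    def backward(self, grad_output):
        x = self.input_cache
        self.grad_W = grad_output.T @ x
        self.grad_b = grad_output.sum(dim=0)
        return grad_output @ self.W

torch.manual_seed(0)
layer = Linear(in_features=3, out_features=2)
x_check = torch.randn(4, 3)

# Manual path: with loss = out.sum(), dL/d_out is a tensor of ones
out = layer.forward(x_check)
grad_x_manual = layer.backward(torch.ones_like(out))

# Autograd path: the same computation on copies that track gradients
W_ag = layer.W.clone().requires_grad_(True)
b_ag = layer.b.clone().requires_grad_(True)
x_ag = x_check.clone().requires_grad_(True)
(x_ag @ W_ag.T + b_ag).sum().backward()

print("grad_W matches:", torch.allclose(layer.grad_W, W_ag.grad))
print("grad_b matches:", torch.allclose(layer.grad_b, b_ag.grad))
print("grad_x matches:", torch.allclose(grad_x_manual, x_ag.grad))
```

Checking the shapes is a useful habit as well: `grad_W` should have the shape of `W`, `grad_b` the shape of `b`, and the returned gradient the shape of the input.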

## 4. Building MSE Loss with Backprop

We also need a loss function that can compute its own gradient.

In [None]:
class MSELoss:
    def __init__(self):
        self.pred_cache = None
        self.target_cache = None

    def forward(self, pred, target):
        self.pred_cache = pred
        self.target_cache = target
        return ((pred - target) ** 2).mean()

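    def backward(self):
        """
        Computes dL/d_pred
        L = mean((pred - target)^2), averaged over every element
        dL/d_pred = 2 * (pred - target) / N, where N = number of elements
        (N equals batch_size when there is a single output feature)
        """
        n = self.pred_cache.numel()
        return 2 * (self.pred_cache - self.target_cache) / n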
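
Because `.mean()` averages over every element, the divisor in the gradient is the total number of elements; in the single-output case used in this notebook that is the same as the batch size. The plain-Python sketch below checks this with a finite difference on a made-up batch of 2 samples with 3 outputs each. The helper names `mse` and `mse_grad` and the data are hypothetical.

```python
def mse(pred, target):
    """Mean squared error over every element of two equally shaped 2-D lists."""
    n = sum(len(row) for row in pred)
    total = sum((p - t) ** 2
                for p_row, t_row in zip(pred, target)
                for p, t in zip(p_row, t_row))
    return total / n

def mse_grad(pred, target):
    """Analytic gradient: 2 * (pred - target) / (number of elements)."""
    n = sum(len(row) for row in pred)
    return [[2 * (p - t) / n for p, t in zip(p_row, t_row)]
            for p_row, t_row in zip(pred, target)]

# Hypothetical batch: 2 samples, 3 outputs each (6 elements in total)
pred = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
target = [[1.5, 2.0, 2.0], [3.0, 5.5, 6.0]]

analytic_grad = mse_grad(pred, target)

# Central finite difference with respect to one element, pred[0][2]
h = 1e-6
plus = [row[:] for row in pred]
minus = [row[:] for row in pred]
plus[0][2] += h
minus[0][2] -= h
numeric_grad = (mse(plus, target) - mse(minus, target)) / (2 * h)

wrong_divisor = 2 * (pred[0][2] - target[0][2]) / len(pred)

print(f"analytic (divide by 6 elements): {analytic_grad[0][2]:.6f}")
print(f"numeric estimate:                {numeric_grad:.6f}")
print(f"dividing by batch size 2:        {wrong_divisor:.6f}")
```

The first two numbers should agree, while dividing by the batch size overstates the gradient by a factor equal to the number of output features.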

## 5. Training Loop from Scratch

Let's train our custom Linear layer to learn a simple function: $y = 3x + 2$

In [None]:
# Generate dummy data
X = torch.rand(100, 1) * 5  # Inputs 0-5
Y = 3 * X + 2 + torch.randn(100, 1) * 0.1  # y = 3x + 2 + noise

# Initialize model and loss
model = Linear(in_features=1, out_features=1)
criterion = MSELoss()
learning_rate = 0.01
epochs = 100

losses = []

print(f"Initial weights: W={model.W.item():.4f}, b={model.b.item():.4f}")

for epoch in range(epochs):
    # 1. Forward Pass
    y_pred = model.forward(X)
    loss = criterion.forward(y_pred, Y)
    losses.append(loss.item())
    
    # 2. Backward Pass
    grad_loss = criterion.backward()
    model.backward(grad_loss)
    
    # 3. Update Parameters (SGD)
    model.W -= learning_rate * model.grad_W
    model.b -= learning_rate * model.grad_b
    
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: Loss = {loss.item():.4f}")

print(f"\nFinal weights: W={model.W.item():.4f}, b={model.b.item():.4f}")
print("Target weights: W=3.0000, b=2.0000")

# Plot training loss
plt.plot(losses)
plt.title("Training Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.show()
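
To see how the number of epochs affects the fit, the same update rule can be run on a simplified version of the problem in plain Python. This is a sketch under stated assumptions: evenly spaced inputs, no noise, and scalar parameters starting at zero. The helper `train` is a hypothetical name introduced here.

```python
def train(n_epochs, lr=0.01):
    """Full-batch gradient descent on y = 3x + 2 with MSE loss."""
    xs = [i * 0.05 for i in range(100)]   # evenly spaced inputs in [0, 5)
    ys = [3 * x + 2 for x in xs]          # noise-free targets
    n = len(xs)
    w_fit, b_fit = 0.0, 0.0
    for _ in range(n_epochs):
        # Residuals (prediction - target) for the whole batch
        errs = [(w_fit * x + b_fit) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error
        grad_w = sum(2 * e * x for e, x in zip(errs, xs)) / n
        grad_b = sum(2 * e for e in errs) / n
        # Gradient descent step
        w_fit -= lr * grad_w
        b_fit -= lr * grad_b
    return w_fit, b_fit

for n_epochs in (100, 1000, 5000):
    w_fit, b_fit = train(n_epochs)
    print(f"{n_epochs:>5} epochs: w = {w_fit:.4f}, b = {b_fit:.4f}")
```

In this simplified setting the slope settles quickly while the intercept approaches 2 more slowly, so a run of 100 epochs can end with a visible gap to the target bias that longer training closes.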

## 6. Conclusion

We just built backpropagation from scratch!
1. We understood the **Chain Rule**.
2. We calculated gradients **manually**.
3. We implemented a **Linear Layer** that computes its own gradients.
4. We trained it to learn a function!

This is the same idea PyTorch's `autograd` automates: it records the operations of the forward pass and applies the chain rule in reverse to compute every gradient.