# 9. Backpropagation Through the Network

**Computing gradients for all layers via the chain rule**

We have the loss gradients with respect to logits. Now we need to propagate these gradients backward through every layer:

1. **Output layer** → gradients for W_lm and hidden states
2. **Layer norm** → gradients for gamma, beta, and pre-norm activations
3. **FFN** → gradients for W1, b1, W2, b2
4. **Multi-head attention** → gradients for W_Q, W_K, W_V, W_O
5. **Embeddings** → gradients for E_token and E_pos

The key tool is the **chain rule**: if $L$ depends on $y$, which in turn depends on $x$, then:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$$
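
As a quick sanity check, the chain rule can be verified numerically on a toy scalar case. The functions `y_of_x` and `loss_of_y` below are hypothetical and not part of the model; they only illustrate multiplying local derivatives and comparing the result against a finite difference.

```python
# Toy example (hypothetical functions, not part of the model):
# y = 3x + 1 and L = y^2, so dL/dx = dL/dy * dy/dx = 2y * 3
def y_of_x(x):
    return 3.0 * x + 1.0

def loss_of_y(y):
    return y ** 2

x0 = 2.0
y0 = y_of_x(x0)

dL_dy = 2.0 * y0          # local derivative of y^2
dy_dx = 3.0               # local derivative of 3x + 1
dL_dx_chain = dL_dy * dy_dx

# Central finite difference as an independent check
eps = 1e-6
dL_dx_numeric = (loss_of_y(y_of_x(x0 + eps)) - loss_of_y(y_of_x(x0 - eps))) / (2 * eps)

print(f"chain rule: {dL_dx_chain:.4f}")
print(f"numeric:    {dL_dx_numeric:.4f}")
```

The same finite-difference comparison can be used to check the gradient of any layer.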

In [None]:
import random
import math

random.seed(42)

VOCAB_SIZE = 6
D_MODEL = 16
D_FF = 64
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS

TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]

In [None]:
# Helper functions
def random_vector(size, scale=0.1):
    return [random.gauss(0, scale) for _ in range(size)]

def random_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def format_vector(vec, decimals=4):
    return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"

## Step 1: Output Layer Gradients

The output layer computes: $\text{logits} = h \cdot W_{lm}^T$

We need:
- $\frac{\partial L}{\partial W_{lm}}$ to update the weights
- $\frac{\partial L}{\partial h}$ to continue backprop

For a linear layer $y = x \cdot W^T$:
- $\frac{\partial L}{\partial W} = (\frac{\partial L}{\partial y})^T \cdot x$
- $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot W$
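
The two formulas above can be sketched as a small helper and checked against a finite difference. This is a minimal illustration under assumed tiny shapes; `linear_forward`, `linear_backward`, and `toy_loss` are hypothetical names that do not appear elsewhere in the notebook.

```python
import random

random.seed(0)

def linear_forward(x, W):
    """y[pos][i] = sum_j x[pos][j] * W[i][j], i.e. y = x @ W^T."""
    return [[sum(xj * wj for xj, wj in zip(x_row, w_row)) for w_row in W]
            for x_row in x]

def linear_backward(dL_dy, x, W):
    """Return (dL_dW, dL_dx) for y = x @ W^T."""
    n_pos, n_out, n_in = len(x), len(W), len(W[0])
    # dL/dW = (dL/dy)^T @ x, summed over positions
    dL_dW = [[sum(dL_dy[p][i] * x[p][j] for p in range(n_pos))
              for j in range(n_in)] for i in range(n_out)]
    # dL/dx = dL/dy @ W
    dL_dx = [[sum(dL_dy[p][i] * W[i][j] for i in range(n_out))
              for j in range(n_in)] for p in range(n_pos)]
    return dL_dW, dL_dx

# Tiny shapes chosen only for illustration: 2 positions, 3 inputs, 4 outputs
x = [[random.gauss(0, 1) for _ in range(3)] for _ in range(2)]
W = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]

# Toy loss: L = sum of all outputs, so dL/dy is all ones
def toy_loss(x, W):
    return sum(sum(row) for row in linear_forward(x, W))

dL_dy = [[1.0] * 4 for _ in range(2)]
dL_dW, dL_dx = linear_backward(dL_dy, x, W)

# Finite-difference check on a single weight entry
eps = 1e-6
W[1][2] += eps
loss_plus = toy_loss(x, W)
W[1][2] -= 2 * eps
loss_minus = toy_loss(x, W)
W[1][2] += eps  # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)

print(f"analytic dL/dW[1][2] = {dL_dW[1][2]:.6f}")
print(f"numeric  dL/dW[1][2] = {numeric:.6f}")
```

The same pair of formulas is reused for every linear projection in the network (the FFN layers and the attention projections).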

In [None]:
# Simulated values (in practice, these come from forward pass)
seq_len = 4  # Only positions with targets

# Loss gradients from previous notebook
dL_dlogits = [
    [0.1785, 0.2007, 0.1759, -0.8746, 0.1563, 0.1632],
    [0.1836, 0.1969, 0.1805, 0.1233, -0.8500, 0.1657],
    [0.1795, 0.2050, 0.1782, 0.1207, 0.1437, -0.8272],
    [0.1855, 0.2017, -0.8229, 0.1271, 0.1391, 0.1695],
]

# Random hidden states and W_lm (for demonstration)
h = [random_vector(D_MODEL) for _ in range(seq_len)]
W_lm = random_matrix(VOCAB_SIZE, D_MODEL)

print(f"dL_dlogits shape: [{seq_len}, {VOCAB_SIZE}]")
print(f"h shape: [{seq_len}, {D_MODEL}]")
print(f"W_lm shape: [{VOCAB_SIZE}, {D_MODEL}]")

In [None]:
# Compute gradient for W_lm
# dL_dW_lm[i][j] = sum over positions of dL_dlogits[pos][i] * h[pos][j]
dL_dW_lm = [[0.0] * D_MODEL for _ in range(VOCAB_SIZE)]
for pos in range(seq_len):
    for i in range(VOCAB_SIZE):
        for j in range(D_MODEL):
            dL_dW_lm[i][j] += dL_dlogits[pos][i] * h[pos][j]

print("Gradient for W_lm (first row):")
print(f"  {format_vector(dL_dW_lm[0])}")

In [None]:
# Compute gradient for hidden states
# dL_dh[pos][j] = sum over vocab of dL_dlogits[pos][i] * W_lm[i][j]
dL_dh = [[0.0] * D_MODEL for _ in range(seq_len)]
for pos in range(seq_len):
    for j in range(D_MODEL):
        for i in range(VOCAB_SIZE):
            dL_dh[pos][j] += dL_dlogits[pos][i] * W_lm[i][j]

print("Gradient for hidden states (position 0):")
print(f"  {format_vector(dL_dh[0])}")

## Step 2: Layer Norm Gradients

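Layer norm is: $y = \gamma \odot \frac{x - \mu}{\sigma} + \beta$, where $\mu$ and $\sigma$ are the mean and standard deviation of $x$ across its features (in practice a small $\epsilon$ is added under the square root for numerical stability).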

The gradients involve:
- $\frac{\partial L}{\partial \gamma}$ and $\frac{\partial L}{\partial \beta}$ for parameters
- $\frac{\partial L}{\partial x}$ which requires the Jacobian of layer norm

The Jacobian is dense because every input element contributes to the mean and variance, so changing one input changes every normalized output, not just its own.
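
A minimal sketch of the input gradient for a single position follows. It uses the closed form that results from multiplying the upstream gradient by the layer norm Jacobian, and checks it with a finite difference. The stabilizer `LN_EPS` and the function names are assumptions made for this illustration.

```python
import math
import random

random.seed(0)
LN_EPS = 1e-5  # assumed stabilizer added to the variance

def layer_norm_forward(x, gamma, beta):
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    sigma = math.sqrt(var + LN_EPS)
    x_hat = [(v - mu) / sigma for v in x]
    y = [g * xh + b for g, xh, b in zip(gamma, x_hat, beta)]
    return y, x_hat, sigma

def layer_norm_backward_x(dL_dy, x_hat, sigma, gamma):
    """dL/dx_i = (dx_hat_i - mean(dx_hat) - x_hat_i * mean(dx_hat * x_hat)) / sigma."""
    d = len(x_hat)
    dx_hat = [dy * g for dy, g in zip(dL_dy, gamma)]
    mean_dx_hat = sum(dx_hat) / d
    mean_dx_hat_x_hat = sum(a * b for a, b in zip(dx_hat, x_hat)) / d
    return [(dx_hat[i] - mean_dx_hat - x_hat[i] * mean_dx_hat_x_hat) / sigma
            for i in range(d)]

# Made-up inputs for one position with 8 features
x = [random.gauss(0, 1) for _ in range(8)]
gamma = [random.gauss(1, 0.1) for _ in range(8)]
beta = [0.0] * 8
dL_dy = [random.gauss(0, 1) for _ in range(8)]  # pretend upstream gradient

y, x_hat, sigma = layer_norm_forward(x, gamma, beta)
dL_dx = layer_norm_backward_x(dL_dy, x_hat, sigma, gamma)

# Finite-difference check with the toy loss L = sum_i dL_dy[i] * y[i]
def toy_loss(x_in):
    y_out, _, _ = layer_norm_forward(x_in, gamma, beta)
    return sum(w * v for w, v in zip(dL_dy, y_out))

eps = 1e-6
x_plus = x[:]
x_plus[3] += eps
x_minus = x[:]
x_minus[3] -= eps
numeric = (toy_loss(x_plus) - toy_loss(x_minus)) / (2 * eps)

print(f"analytic dL/dx[3] = {dL_dx[3]:.6f}")
print(f"numeric  dL/dx[3] = {numeric:.6f}")
```

The two subtracted mean terms are what couple the features together: they account for the paths through $\mu$ and $\sigma$.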

In [None]:
# For gamma and beta, the gradients are simpler
# dL_dgamma[j] = sum over positions of dL_dy[pos][j] * x_norm[pos][j]
# dL_dbeta[j] = sum over positions of dL_dy[pos][j]
# Here the upstream gradient dL_dy is dL_dh from Step 1

# Simulated normalized values
x_norm = [random_vector(D_MODEL) for _ in range(seq_len)]

dL_dgamma = [0.0] * D_MODEL
dL_dbeta = [0.0] * D_MODEL

for pos in range(seq_len):
    for j in range(D_MODEL):
        dL_dgamma[j] += dL_dh[pos][j] * x_norm[pos][j]
        dL_dbeta[j] += dL_dh[pos][j]

print("Gradient for gamma (first 8 values):")
print(f"  {format_vector(dL_dgamma[:8])}")
print()
print("Gradient for beta (first 8 values):")
print(f"  {format_vector(dL_dbeta[:8])}")

## Step 3: FFN Gradients

The FFN computes:
1. $h_1 = x \cdot W_1^T + b_1$
2. $h_2 = \text{GELU}(h_1)$
3. $y = h_2 \cdot W_2^T + b_2$

We backprop through each in reverse order.

In [None]:
# GELU derivative
def gelu_derivative(x):
    """Derivative of GELU activation"""
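    # Computes Phi(x) + x * phi(x): the normal CDF Phi uses the tanh
    # approximation, while the PDF phi is exact, so the result is approximate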
    cdf = 0.5 * (1 + math.tanh(math.sqrt(2/math.pi) * (x + 0.044715 * x**3)))
    pdf = math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)
    return cdf + x * pdf

# Example
print("GELU derivatives at sample points:")
for x in [-1.0, 0.0, 1.0]:
    print(f"  GELU'({x:4.1f}) = {gelu_derivative(x):.4f}")
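
If the forward pass uses the tanh approximation of GELU (an assumption here, since the forward pass is defined in an earlier notebook), its derivative can also be written out directly with the product and chain rules. The sketch below does that and compares the result with a finite difference; the names `gelu_tanh`, `gelu_tanh_derivative`, and `numeric_derivative` are hypothetical.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)))

def gelu_tanh_derivative(x):
    """Analytic derivative of gelu_tanh via the product and chain rules."""
    u = SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)
    du_dx = SQRT_2_OVER_PI * (1.0 + 3.0 * 0.044715 * x ** 2)
    t = math.tanh(u)
    # d/dx [0.5 * x * (1 + tanh(u))]
    return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du_dx

def numeric_derivative(f, x, eps=1e-6):
    """Central finite difference."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

for x in [-1.0, 0.0, 1.0]:
    analytic = gelu_tanh_derivative(x)
    numeric = numeric_derivative(gelu_tanh, x)
    print(f"  x = {x:4.1f}: analytic = {analytic:.6f}, numeric = {numeric:.6f}")
```

Whichever form is chosen, the derivative used in the backward pass should match the activation used in the forward pass.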

In [None]:
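# W2 gradients: dL_dW2[i][j] = sum over positions of dL_dy[pos][i] * h2[pos][j]
# b2 gradients: dL_db2[i] = sum over positions of dL_dy[pos][i]

# Simulated values
h2 = [random_vector(D_FF) for _ in range(seq_len)]  # After GELU
dL_dy = dL_dh  # Stand-in for the gradient at the FFN output (after accounting for residual)

# W2 does not appear in its own gradient; it is only needed to propagate
# the gradient back to h2 (dL_dh2 = dL_dy · W2), which is not computed here
W2 = random_matrix(D_MODEL, D_FF)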

dL_dW2 = [[0.0] * D_FF for _ in range(D_MODEL)]
dL_db2 = [0.0] * D_MODEL

for pos in range(seq_len):
    for i in range(D_MODEL):
        dL_db2[i] += dL_dy[pos][i]
        for j in range(D_FF):
            dL_dW2[i][j] += dL_dy[pos][i] * h2[pos][j]

print(f"dL_dW2 shape: [{D_MODEL}, {D_FF}]")
print(f"dL_db2 shape: [{D_MODEL}]")
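
The remaining FFN gradients (for $W_1$, $b_1$, and the FFN input) follow the same pattern: go back through the second linear layer, multiply elementwise by the GELU derivative, then apply the linear-layer rule again. Below is a minimal single-position sketch with made-up small dimensions. It assumes the tanh form of GELU, and the names `ffn_forward` and `ffn_backward` are hypothetical.

```python
import math
import random

random.seed(0)
C = math.sqrt(2.0 / math.pi)

def gelu(x):
    return 0.5 * x * (1.0 + math.tanh(C * (x + 0.044715 * x ** 3)))

def gelu_grad(x):
    t = math.tanh(C * (x + 0.044715 * x ** 3))
    return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * C * (1.0 + 3.0 * 0.044715 * x ** 2)

def ffn_forward(x, W1, b1, W2, b2):
    """Single position. Returns (y, h1); h1 is kept for the backward pass."""
    h1 = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W1, b1)]
    h2 = [gelu(v) for v in h1]
    y = [sum(w * v for w, v in zip(row, h2)) + b for row, b in zip(W2, b2)]
    return y, h1

def ffn_backward(dL_dy, x, h1, W1, W2):
    """Returns (dL_dW1, dL_db1, dL_dx) for one position."""
    d_ff, d_model = len(W1), len(W1[0])
    # Back through the second linear layer: dL/dh2 = dL/dy @ W2
    dL_dh2 = [sum(dL_dy[i] * W2[i][k] for i in range(len(W2))) for k in range(d_ff)]
    # Back through GELU (elementwise)
    dL_dh1 = [g * gelu_grad(v) for g, v in zip(dL_dh2, h1)]
    # First linear layer: outer product for W1, copy for b1, W1^T for x
    dL_dW1 = [[dL_dh1[k] * x[j] for j in range(d_model)] for k in range(d_ff)]
    dL_db1 = dL_dh1[:]
    dL_dx = [sum(dL_dh1[k] * W1[k][j] for k in range(d_ff)) for j in range(d_model)]
    return dL_dW1, dL_db1, dL_dx

# Small illustrative sizes, not the notebook's D_MODEL and D_FF
d_model, d_ff = 4, 8
x = [random.gauss(0, 1) for _ in range(d_model)]
W1 = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_ff)]
b1 = [random.gauss(0, 0.1) for _ in range(d_ff)]
W2 = [[random.gauss(0, 0.5) for _ in range(d_ff)] for _ in range(d_model)]
b2 = [0.0] * d_model
dL_dy = [random.gauss(0, 1) for _ in range(d_model)]  # pretend upstream gradient

y, h1 = ffn_forward(x, W1, b1, W2, b2)
dL_dW1, dL_db1, dL_dx = ffn_backward(dL_dy, x, h1, W1, W2)

# Finite-difference check on one entry of W1, with L = sum_i dL_dy[i] * y[i]
def toy_loss():
    y_out, _ = ffn_forward(x, W1, b1, W2, b2)
    return sum(g * v for g, v in zip(dL_dy, y_out))

eps = 1e-6
W1[2][1] += eps
loss_plus = toy_loss()
W1[2][1] -= 2 * eps
loss_minus = toy_loss()
W1[2][1] += eps  # restore
numeric_W1 = (loss_plus - loss_minus) / (2 * eps)

print(f"analytic dL/dW1[2][1] = {dL_dW1[2][1]:.6f}")
print(f"numeric  dL/dW1[2][1] = {numeric_W1:.6f}")
```

For a full sequence, the parameter gradients from each position are summed, as in the `dL_dW2` loop above.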

## Step 4: Attention Gradients

This is the most complex part. We need gradients for:
- $W_Q$, $W_K$, $W_V$ (per head)
- $W_O$ (output projection)

The attention computation involves:
1. Q, K, V projections
2. Scaled dot-product attention with softmax
3. Concatenation and output projection

Each step requires careful application of the chain rule.
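
To make the middle step concrete, here is a minimal sketch of the backward pass through scaled dot-product attention for a single head. It assumes no causal mask and uses made-up small shapes; the helper names (`matmul`, `attention_forward`, `attention_backward`) are hypothetical and are not defined elsewhere in this notebook.

```python
import math
import random

random.seed(0)

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_forward(Q, K, V):
    scale = 1.0 / math.sqrt(len(Q[0]))
    scores = [[s * scale for s in row] for row in matmul(Q, transpose(K))]
    A = [softmax(row) for row in scores]  # attention weights
    return matmul(A, V), A

def attention_backward(dL_dO, Q, K, V, A):
    scale = 1.0 / math.sqrt(len(Q[0]))
    dL_dV = matmul(transpose(A), dL_dO)   # O = A @ V
    dL_dA = matmul(dL_dO, transpose(V))
    # Softmax backward, row by row: dS = A * (dA - sum(dA * A))
    dL_dS = []
    for a_row, da_row in zip(A, dL_dA):
        dot = sum(a * da for a, da in zip(a_row, da_row))
        dL_dS.append([a * (da - dot) for a, da in zip(a_row, da_row)])
    # S = scale * Q @ K^T
    dL_dQ = [[v * scale for v in row] for row in matmul(dL_dS, K)]
    dL_dK = [[v * scale for v in row] for row in matmul(transpose(dL_dS), Q)]
    return dL_dQ, dL_dK, dL_dV

# Illustrative sizes: 3 positions, head dimension 4
T, d_k = 3, 4
Q = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
K = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
V = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
dL_dO = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]  # pretend upstream gradient

O, A = attention_forward(Q, K, V)
dL_dQ, dL_dK, dL_dV = attention_backward(dL_dO, Q, K, V, A)

# Finite-difference check with the toy loss L = sum(dL_dO * O)
def toy_loss():
    out, _ = attention_forward(Q, K, V)
    return sum(g * o for g_row, o_row in zip(dL_dO, out) for g, o in zip(g_row, o_row))

def numeric_grad(M, i, j, eps=1e-6):
    original = M[i][j]
    M[i][j] = original + eps
    plus = toy_loss()
    M[i][j] = original - eps
    minus = toy_loss()
    M[i][j] = original
    return (plus - minus) / (2 * eps)

print(f"dL/dQ[1][2]: analytic = {dL_dQ[1][2]:.6f}, numeric = {numeric_grad(Q, 1, 2):.6f}")
print(f"dL/dK[0][1]: analytic = {dL_dK[0][1]:.6f}, numeric = {numeric_grad(K, 0, 1):.6f}")
print(f"dL/dV[2][3]: analytic = {dL_dV[2][3]:.6f}, numeric = {numeric_grad(V, 2, 3):.6f}")
```

Since Q, K, and V are themselves linear projections of the layer input, the gradients for $W_Q$, $W_K$, and $W_V$ then follow from the linear-layer rule in Step 1, using `dL_dQ`, `dL_dK`, and `dL_dV` as the upstream gradients.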

In [None]:
# Simplified example: gradient for W_O
# W_O projects concatenated attention outputs back to d_model

concat_attn = [random_vector(D_MODEL) for _ in range(seq_len)]
W_O = random_matrix(D_MODEL, D_MODEL)

# dL_dW_O[i][j] = sum over positions of dL_dattn[pos][i] * concat[pos][j]
# (dL_dh stands in for dL_dattn, the gradient at the attention output)
dL_dW_O = [[0.0] * D_MODEL for _ in range(D_MODEL)]
for pos in range(seq_len):
    for i in range(D_MODEL):
        for j in range(D_MODEL):
            dL_dW_O[i][j] += dL_dh[pos][i] * concat_attn[pos][j]

print(f"dL_dW_O shape: [{D_MODEL}, {D_MODEL}]")
print(f"First row: {format_vector(dL_dW_O[0][:8])}...")

## Step 5: Embedding Gradients

Finally, we compute gradients for the embeddings:
- $E_{token}$: token embeddings
- $E_{pos}$: position embeddings

The embedding lookup is essentially indexing, so gradients flow back only to the embeddings that were actually used. A token that appears more than once accumulates the gradients from each of its positions.

In [None]:
# Gradient for token embeddings
# Only tokens that appeared in the sequence receive gradients
tokens_used = [1, 3, 4, 5, 2]  # BOS, I, like, transformers, EOS

dL_dE_token = [[0.0] * D_MODEL for _ in range(VOCAB_SIZE)]

# Simulated gradient flowing into embeddings
dL_dX = [random_vector(D_MODEL) for _ in range(len(tokens_used))]

for pos, token_id in enumerate(tokens_used):
    for j in range(D_MODEL):
        dL_dE_token[token_id][j] += dL_dX[pos][j]

print("Token embedding gradients:")
for i, name in enumerate(TOKEN_NAMES):
    norm = sum(g**2 for g in dL_dE_token[i]) ** 0.5
    if norm > 0:
        print(f"  {name:12s}: gradient norm = {norm:.4f}")
    else:
        print(f"  {name:12s}: no gradient (not used)")
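
The position embeddings can be handled the same way. The sketch below assumes the model input is the sum of a token embedding and a position embedding, so both receive the same incoming gradient at each position. The function `embedding_backward` and the repeated-token sequence are hypothetical, chosen to show accumulation.

```python
import random

random.seed(0)

def embedding_backward(token_ids, dL_dX, vocab_size, max_seq_len):
    """Assumes X[pos] = E_token[token_ids[pos]] + E_pos[pos]."""
    d_model = len(dL_dX[0])
    dL_dE_token = [[0.0] * d_model for _ in range(vocab_size)]
    dL_dE_pos = [[0.0] * d_model for _ in range(max_seq_len)]
    for pos, token_id in enumerate(token_ids):
        for j in range(d_model):
            dL_dE_token[token_id][j] += dL_dX[pos][j]  # accumulates for repeated tokens
            dL_dE_pos[pos][j] += dL_dX[pos][j]         # each position occurs once per sequence
    return dL_dE_token, dL_dE_pos

# Hypothetical sequence in which token 3 ("I") appears twice
token_ids = [1, 3, 4, 3]
d_model = 4
dL_dX = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in token_ids]

dL_dE_token, dL_dE_pos = embedding_backward(token_ids, dL_dX, vocab_size=6, max_seq_len=5)

print("Token 3 gradient (sum of positions 1 and 3):")
print("  ", [round(g, 4) for g in dL_dE_token[3]])
print("Position 4 gradient (unused position, stays zero):")
print("  ", dL_dE_pos[4])
```

Positions beyond the end of the sequence, like tokens that never appear, keep a zero gradient.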

## Summary: The Chain Rule at Work

We've traced gradients backward through:

1. **Loss → Logits**: Simple formula $P(i) - \mathbb{1}[i=\text{target}]$
2. **Logits → Hidden states**: Linear layer backprop
3. **Layer norm**: Jacobian through normalization
4. **FFN**: Two linear layers + GELU derivative
5. **Attention**: Complex but systematic chain rule application
6. **Embeddings**: Gradient accumulation for used tokens

Every parameter now has a gradient telling us how to reduce the loss.

## What's Next

We have gradients for all ~2,600 parameters. Now we use the **AdamW optimizer** to actually update them.

AdamW combines:
- Adaptive learning rates per parameter
- Momentum to smooth updates
- Weight decay for regularization

That's the final step in our training loop.