# 8. Gradients: Loss

**Computing gradients of the loss with respect to logits**

Alright. Time for backpropagation.

We've got a loss of ~1.9, close to the $\ln 6 \approx 1.79$ that uniform guessing over our 6-token vocabulary would give. Now we need to figure out how to improve it. That means computing **gradients**.

Gradients tell us: "if we nudge this parameter by a tiny amount, how much does the loss change?" Once we know that for every parameter, we can adjust them in the direction that reduces loss.

This is **backpropagation**—walking backward through the computation graph, computing gradients via the chain rule.

Let's start at the end: the loss.

## The Beautiful Formula

For cross-entropy loss with softmax, the gradient has an incredibly clean closed form:

$$\frac{\partial L}{\partial \text{logit}_i} = P(i) - \mathbb{1}[i = \text{target}]$$

Where:
- $L$ = cross-entropy loss at a single position, $-\log P(\text{target})$
- $P(i)$ = softmax probability for token $i$
- $\mathbb{1}[i = \text{target}]$ = 1 if $i$ is the correct token, 0 otherwise

**That's it.**

For the correct class: gradient = $P(\text{target}) - 1$ (negative)

For all other classes: gradient = $P(i)$ (positive)

This is one of the most elegant results in machine learning.
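
If the closed form looks too good to be true, a short derivation shows where it comes from. This is a sketch for a single position, writing $z_i$ for $\text{logit}_i$ and $t$ for the target index (shorthand introduced here, not used elsewhere in the notebook). Expanding the softmax inside the loss gives:

$$L = -\log P(t) = -z_t + \log \sum_j e^{z_j}$$

Differentiating each term with respect to $z_i$:

$$\frac{\partial L}{\partial z_i} = -\mathbb{1}[i = t] + \frac{e^{z_i}}{\sum_j e^{z_j}} = P(i) - \mathbb{1}[i = \text{target}]$$

The first term contributes $-1$ only for the target logit, and the second term is exactly the softmax probability.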

In [1]:
import random
import math

random.seed(42)

VOCAB_SIZE = 6
D_MODEL = 16
D_FF = 64
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS

TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]

In [2]:
# For this notebook, we'll use pre-computed probabilities (from the forward pass),
# rounded to four decimals. These are what an untrained model would produce.
probs = [
    [0.1785, 0.2007, 0.1759, 0.1254, 0.1563, 0.1632],  # pos 0
    [0.1836, 0.1969, 0.1805, 0.1233, 0.1500, 0.1657],  # pos 1
    [0.1795, 0.2050, 0.1782, 0.1207, 0.1437, 0.1728],  # pos 2
    [0.1855, 0.2017, 0.1771, 0.1271, 0.1391, 0.1695],  # pos 3
]

targets = [3, 4, 5, 2]  # I, like, transformers, <EOS>
tokens = [1, 3, 4, 5, 2]

print("Targets for each position:")
for i, t in enumerate(targets):
    print(f"  Position {i} ({TOKEN_NAMES[tokens[i]]}) -> {TOKEN_NAMES[t]}")

Targets for each position:
  Position 0 (<BOS>) -> I
  Position 1 (I) -> like
  Position 2 (like) -> transformers
  Position 3 (transformers) -> <EOS>
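
As a quick sanity check, the probabilities above should reproduce the ~1.9 loss quoted at the start. The sketch below assumes the loss is the mean cross-entropy over the four positions (the averaging is an assumption here) and compares it with the loss of a uniform guess over 6 tokens.

```python
import math

# Same hard-coded probabilities and targets as in the cell above.
probs = [
    [0.1785, 0.2007, 0.1759, 0.1254, 0.1563, 0.1632],  # pos 0
    [0.1836, 0.1969, 0.1805, 0.1233, 0.1500, 0.1657],  # pos 1
    [0.1795, 0.2050, 0.1782, 0.1207, 0.1437, 0.1728],  # pos 2
    [0.1855, 0.2017, 0.1771, 0.1271, 0.1391, 0.1695],  # pos 3
]
targets = [3, 4, 5, 2]

# Cross-entropy at each position is -log of the probability given to the target.
per_position_loss = [-math.log(probs[i][t]) for i, t in enumerate(targets)]

# Assumption: the reported loss is the mean over positions.
mean_loss = sum(per_position_loss) / len(per_position_loss)

# Loss of a model that assigns 1/6 to every token.
uniform_loss = math.log(6)

print(f"Mean loss:    {mean_loss:.4f}")
print(f"Uniform loss: {uniform_loss:.4f}")
```

The mean comes out a little under 1.9, slightly worse than the uniform baseline, which is what we'd expect from a model that hasn't learned anything yet.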


## Why This Makes Sense

**Gradients point in the direction of INCREASING loss.**

Imagine for a moment that we could update the logits directly. A gradient descent step would be: `logit = logit - learning_rate × gradient`

So:
- **For the correct class**: gradient is negative ($P - 1 < 0$)
  - Subtracting a negative = adding → **INCREASE** this logit ✓
- **For incorrect classes**: gradient is positive ($P > 0$)
  - Subtracting a positive → **DECREASE** these logits ✓

We want to push the correct class's logit up and all others down.
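
To see the update rule in action, here is a minimal sketch that treats the logits at position 0 as if they were directly trainable. Two things are assumptions: the real logits aren't available in this notebook, so the sketch uses `log(p)` as stand-in logits (softmax maps them back to the same probabilities), and `learning_rate = 1.0` is an arbitrary illustrative value.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Position 0 probabilities from this notebook; the target is "I" (index 3).
probs_pos0 = [0.1785, 0.2007, 0.1759, 0.1254, 0.1563, 0.1632]
target = 3

# Assumption: stand-in logits. softmax(log(p)) == p when p sums to 1.
logits = [math.log(p) for p in probs_pos0]

# Gradient of the loss w.r.t. each logit: P(i) - 1[i == target].
grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs_pos0)]

learning_rate = 1.0  # hypothetical value, chosen only for illustration
new_logits = [z - learning_rate * g for z, g in zip(logits, grad)]
new_probs = softmax(new_logits)

print(f"P(target) before: {probs_pos0[target]:.4f}")
print(f"P(target) after:  {new_probs[target]:.4f}")
```

After one step the target's probability goes up and every other token's probability goes down, matching the sign argument above.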

In [3]:
def compute_loss_gradient(probs, target):
    """
    Compute gradient of cross-entropy loss w.r.t. logits.
    dL/dlogit[i] = P(i) - 1 if i == target, else P(i)
    """
    grad = probs.copy()
    grad[target] -= 1.0
    return grad

# Compute gradients for all positions
dL_dlogits = []
for i in range(len(targets)):
    grad = compute_loss_gradient(probs[i], targets[i])
    dL_dlogits.append(grad)

print("Loss Gradients w.r.t. Logits")
print("="*70)
print()
print(f"{'Position':<12} {'<PAD>':>8} {'<BOS>':>8} {'<EOS>':>8} {'I':>8} {'like':>8} {'trans':>8}")
print("-"*70)
for i, grad in enumerate(dL_dlogits):
    print(f"{TOKEN_NAMES[tokens[i]]:<12} {grad[0]:>8.4f} {grad[1]:>8.4f} {grad[2]:>8.4f} {grad[3]:>8.4f} {grad[4]:>8.4f} {grad[5]:>8.4f}")

Loss Gradients w.r.t. Logits
======================================================================

Position        <PAD>    <BOS>    <EOS>        I     like    trans
----------------------------------------------------------------------
<BOS>          0.1785   0.2007   0.1759  -0.8746   0.1563   0.1632
I              0.1836   0.1969   0.1805   0.1233  -0.8500   0.1657
like           0.1795   0.2050   0.1782   0.1207   0.1437  -0.8272
transformers   0.1855   0.2017  -0.8229   0.1271   0.1391   0.1695
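
One way to convince yourself the closed form is right is to compare it against a finite-difference estimate: nudge each logit up and down by a tiny amount and measure how the loss changes. The sketch below does this on made-up logits (the values in `example_logits` and the helper names are hypothetical, since the notebook only stores probabilities).

```python
import math

def cross_entropy(logits, target):
    """Loss at one position: -log softmax(logits)[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def analytic_grad(logits, target):
    """Closed form: softmax probabilities with 1 subtracted at the target."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    grad = [e / total for e in exps]
    grad[target] -= 1.0
    return grad

def numerical_grad(logits, target, eps=1e-5):
    """Central differences: (L(z + eps) - L(z - eps)) / (2 * eps) per logit."""
    grad = []
    for i in range(len(logits)):
        up = logits.copy()
        up[i] += eps
        down = logits.copy()
        down[i] -= eps
        diff = cross_entropy(up, target) - cross_entropy(down, target)
        grad.append(diff / (2 * eps))
    return grad

example_logits = [0.3, -1.2, 0.8, 0.1, -0.4, 1.5]  # hypothetical values
example_target = 3

analytic = analytic_grad(example_logits, example_target)
numerical = numerical_grad(example_logits, example_target)
max_diff = max(abs(a - n) for a, n in zip(analytic, numerical))
print(f"max |analytic - numerical| = {max_diff:.2e}")
```

The two should agree to many decimal places; a large gap would point to a bug in either the formula or the loss.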


In [4]:
# Detailed example for position 0
print("Detailed: Position 0 (<BOS> → I)")
print("="*60)
print()
print("Current probabilities:")
for j, name in enumerate(TOKEN_NAMES):
    marker = "← target" if j == targets[0] else ""
    print(f"  P({name:12s}) = {probs[0][j]:.4f} {marker}")
print()
print("Gradients (P(i) - 1[i==target]):")
for j, name in enumerate(TOKEN_NAMES):
    is_target = 1 if j == targets[0] else 0
    grad = probs[0][j] - is_target
    print(f"  dL/dlogit[{name:12s}] = {probs[0][j]:.4f} - {is_target} = {grad:>8.4f}")

Detailed: Position 0 (<BOS> → I)
============================================================

Current probabilities:
  P(<PAD>       ) = 0.1785 
  P(<BOS>       ) = 0.2007 
  P(<EOS>       ) = 0.1759 
  P(I           ) = 0.1254 ← target
  P(like        ) = 0.1563 
  P(transformers) = 0.1632 

Gradients (P(i) - 1[i==target]):
  dL/dlogit[<PAD>       ] = 0.1785 - 0 =   0.1785
  dL/dlogit[<BOS>       ] = 0.2007 - 0 =   0.2007
  dL/dlogit[<EOS>       ] = 0.1759 - 0 =   0.1759
  dL/dlogit[I           ] = 0.1254 - 1 =  -0.8746
  dL/dlogit[like        ] = 0.1563 - 0 =   0.1563
  dL/dlogit[transformers] = 0.1632 - 0 =   0.1632


## Verification: Gradients Sum to Zero

The gradients should sum to zero at each position:

$$\sum_i \frac{\partial L}{\partial \text{logit}_i} = \sum_i P(i) - 1 = 1 - 1 = 0$$
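
There is also an intuitive reason behind the zero sum: adding the same constant to every logit leaves the softmax, and therefore the loss, unchanged. The sum of the gradients is exactly the rate of change of the loss along that "shift everything" direction, so it has to be zero. A small sketch on hypothetical logits (the values and the `shift` amount are made up for illustration):

```python
import math

def cross_entropy(logits, target):
    """Loss at one position: -log softmax(logits)[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

example_logits = [0.3, -1.2, 0.8, 0.1, -0.4, 1.5]  # hypothetical values
example_target = 3

# Shift every logit by the same constant.
shift = 5.0
shifted_logits = [z + shift for z in example_logits]

loss_original = cross_entropy(example_logits, example_target)
loss_shifted = cross_entropy(shifted_logits, example_target)

print(f"Loss (original logits): {loss_original:.10f}")
print(f"Loss (shifted logits):  {loss_shifted:.10f}")
```

Both losses match up to floating-point error, no matter which constant is used for `shift`.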

In [5]:
print("Verification: Gradients sum to zero")
print()
for i, grad in enumerate(dL_dlogits):
    grad_sum = sum(grad)
    print(f"Position {i}: sum(gradients) = {grad_sum:12.10f} {'✓' if abs(grad_sum) < 1e-6 else '✗'}")

Verification: Gradients sum to zero

Position 0: sum(gradients) = -0.0000000000 ✓
Position 1: sum(gradients) = 0.0000000000 ✓
Position 2: sum(gradients) = -0.0001000000 ✗
Position 3: sum(gradients) = 0.0000000000 ✓

Position 2 misses the tolerance only because the hard-coded probabilities are rounded to four decimals: that row sums to 0.9999, so its gradients sum to -0.0001. With unrounded softmax outputs, the sum is zero up to floating-point error.


## Magnitude Matters

Notice the magnitudes:
- Target gradients: about -0.82 to -0.87 (large negative)
- Non-target gradients: about 0.12 to 0.21 (small positive)

The target gradient is **much larger** in magnitude because:
- Its magnitude is $1 - P(\text{target})$, and the model currently gives the correct token only ~0.13-0.18 probability
- The non-target gradients split that same total among five tokens, since each row sums to zero

As the model improves and assigns higher probability to the correct token, the target gradient shrinks toward zero.
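
To make the shrinking concrete, the sketch below evaluates the target gradient at a few probabilities. The first value is position 0's actual target probability; the others are hypothetical stages of training, not results from this model.

```python
def target_gradient(p_target):
    """Gradient w.r.t. the correct token's logit: P(target) - 1."""
    return p_target - 1.0

# 0.1254 comes from position 0; the rest are hypothetical later stages.
for p in [0.1254, 0.5, 0.9, 0.99]:
    print(f"P(target) = {p:.4f} -> target gradient = {target_gradient(p):+.4f}")
```

The closer the model gets to the right answer, the weaker the correction it receives, so a confident and correct prediction is left almost untouched.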

## What's Next

We've computed $\frac{\partial L}{\partial \text{logits}}$.

But we can't update the logits directly—they're computed from the hidden states via the language modeling head.

To backpropagate further, we need:
1. **Gradients for $W_{lm}$** (the language modeling head weights)
2. **Gradients for $h$** (the hidden states going into the LM head)

Then we'll continue backward through layer norm, FFN, attention, and finally the embeddings.
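
As a rough preview of that next step, here is a sketch of the two gradients for a bias-free linear head. It assumes `logits = h @ W_lm` with `h` of shape (positions, d_model) and `W_lm` of shape (d_model, vocab); the actual layout in this series may differ, and all values and helper names below are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(m):
    """Swap rows and columns."""
    return [list(row) for row in zip(*m)]

# Hypothetical toy sizes: 2 positions, d_model = 3, vocab = 4.
h = [[0.5, -1.0, 2.0],
     [1.5, 0.0, -0.5]]
W_lm = [[0.1, 0.2, -0.3, 0.4],
        [0.0, -0.1, 0.2, 0.3],
        [0.5, 0.1, 0.0, -0.2]]

# Hypothetical loss gradients w.r.t. logits (each row sums to zero).
dL_dlogits_toy = [[0.2, 0.3, -0.7, 0.2],
                  [-0.6, 0.1, 0.3, 0.2]]

# Chain rule for logits = h @ W_lm (assumed layout):
dL_dW_lm = matmul(transpose(h), dL_dlogits_toy)   # shape (d_model, vocab)
dL_dh = matmul(dL_dlogits_toy, transpose(W_lm))   # shape (positions, d_model)

print(f"dL/dW_lm shape: {len(dL_dW_lm)} x {len(dL_dW_lm[0])}")
print(f"dL/dh shape:    {len(dL_dh)} x {len(dL_dh[0])}")
```

Each gradient has the same shape as the quantity it belongs to, which is a handy check when wiring up the backward pass by hand.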

In [6]:
# Store for next notebook
grad_loss_data = {
    'dL_dlogits': dL_dlogits,
    'probs': probs,
    'targets': targets
}
print("Loss gradients stored for next notebook.")

Loss gradients stored for next notebook.
