# 7. Loss Calculation

**Measuring prediction error with cross-entropy loss**

Alright. The forward pass is almost done.

We've got these nice 16-dimensional vectors representing each token after going through embeddings, attention, FFN, and layer norm. But we still have a problem.

**The model isn't actually predicting anything yet.**

We have hidden states, not predictions. We need to convert these hidden states into probabilities over our vocabulary. Then we can measure how wrong we are.

That's what this step does: **project to vocabulary space** and **compute the loss**.

## The Task: Next-Token Prediction

Our model is a language model. Its job is simple: given a sequence of tokens, predict the next token.

```
Input:  <BOS> I    like transformers <EOS>
IDs:    [1,   3,   4,   5,          2]
```

At each position, we want to predict what comes next:

| Position | Current Token | Should Predict |
|----------|--------------|----------------|
| 0 | `<BOS>` | `I` (token 3) |
| 1 | `I` | `like` (token 4) |
| 2 | `like` | `transformers` (token 5) |
| 3 | `transformers` | `<EOS>` (token 2) |
| 4 | `<EOS>` | nothing (end) |
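
The table amounts to shifting the sequence by one position. A minimal sketch, assuming the same token IDs as above (the names `inputs` and `shifted_targets` are illustrative and not part of the notebook):

```python
# Token IDs for "<BOS> I like transformers <EOS>" (same as above)
tokens = [1, 3, 4, 5, 2]

# Every position except the last has a next token to predict
inputs = tokens[:-1]           # tokens the model conditions on
shifted_targets = tokens[1:]   # tokens it should predict

for pos, (cur, nxt) in enumerate(zip(inputs, shifted_targets)):
    print(f"position {pos}: token {cur} -> should predict token {nxt}")
```

The last position (`<EOS>`) has no target, so it contributes nothing to the loss.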

In [None]:
import random
import math

random.seed(42)

# Model hyperparameters
VOCAB_SIZE = 6
D_MODEL = 16
D_FF = 64
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS
EPSILON = 1e-5

TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]

In [None]:
# Helper functions
def random_vector(size, scale=0.1):
    return [random.gauss(0, scale) for _ in range(size)]

def random_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def add_vectors(v1, v2):
    return [a + b for a, b in zip(v1, v2)]

def matmul(A, B):
    m, n = len(A), len(A[0])
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)] for i in range(m)]

def transpose(A):
    return [[A[i][j] for i in range(len(A))] for j in range(len(A[0]))]

def softmax(vec):
    max_val = max(vec)
    exp_vec = [math.exp(v - max_val) for v in vec]
    sum_exp = sum(exp_vec)
    return [e / sum_exp for e in exp_vec]

def softmax_causal(vec):
    max_val = max(v for v in vec if v != float('-inf'))
    exp_vec = [math.exp(v - max_val) if v != float('-inf') else 0 for v in vec]
    sum_exp = sum(exp_vec)
    return [e / sum_exp for e in exp_vec]

def gelu(x):
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, gamma, beta, epsilon=1e-5):
    mean = sum(x) / len(x)
    variance = sum((xi - mean)**2 for xi in x) / len(x)
    std = math.sqrt(variance + epsilon)
    x_norm = [(xi - mean) / std for xi in x]
    return [gamma[i] * x_norm[i] + beta[i] for i in range(len(x))]

def format_vector(vec, decimals=4):
    return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"

In [None]:
# Recreate the full forward pass
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]
tokens = [1, 3, 4, 5, 2]
seq_len = len(tokens)
X = [add_vectors(E_token[tokens[i]], E_pos[i]) for i in range(seq_len)]

# Attention
W_Q = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_K = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_V = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
Q_all = [matmul(X, W_Q[h]) for h in range(NUM_HEADS)]
K_all = [matmul(X, W_K[h]) for h in range(NUM_HEADS)]
V_all = [matmul(X, W_V[h]) for h in range(NUM_HEADS)]

def compute_attention(Q, K, V):
    seq_len, d_k = len(Q), len(Q[0])
    scale = math.sqrt(d_k)
    scores = matmul(Q, transpose(K))
    scaled = [[s / scale for s in row] for row in scores]
    for i in range(seq_len):
        for j in range(seq_len):
            if j > i:
                scaled[i][j] = float('-inf')
    weights = [softmax_causal(row) for row in scaled]
    return matmul(weights, V)

attention_output_all = [compute_attention(Q_all[h], K_all[h], V_all[h]) for h in range(NUM_HEADS)]
# Concatenate the per-head outputs for each position (works for any NUM_HEADS)
concat_output = [[v for h in range(NUM_HEADS) for v in attention_output_all[h][i]] for i in range(seq_len)]
W_O = random_matrix(D_MODEL, D_MODEL)
multi_head_output = matmul(concat_output, transpose(W_O))

# FFN
W1 = random_matrix(D_FF, D_MODEL)
b1 = random_vector(D_FF)
W2 = random_matrix(D_MODEL, D_FF)
b2 = random_vector(D_MODEL)
hidden = [[sum(multi_head_output[i][k] * W1[j][k] for k in range(D_MODEL)) + b1[j] for j in range(D_FF)] for i in range(seq_len)]
activated = [[gelu(h) for h in row] for row in hidden]
ffn_output = [[sum(activated[i][k] * W2[j][k] for k in range(D_FF)) + b2[j] for j in range(D_MODEL)] for i in range(seq_len)]

# Residual + LayerNorm
residual = [add_vectors(multi_head_output[i], ffn_output[i]) for i in range(seq_len)]
gamma = [1.0] * D_MODEL
beta = [0.0] * D_MODEL
layer_norm_output = [layer_norm(residual[i], gamma, beta, EPSILON) for i in range(seq_len)]

print("Recreated full forward pass through transformer block")

## Step 1: Project to Vocabulary Space (Logits)

Our hidden states are 16-dimensional. Our vocabulary has 6 tokens. We need to map from 16D → 6D.

Enter the **language modeling head** (LM head): a simple linear projection.

$$\text{logits} = W_{lm} \cdot \text{hidden\_state}$$

These "logits" are **unnormalized scores**. Higher scores mean the model thinks that token is more likely.
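
To make the projection concrete, here is a minimal sketch for a single position, using a made-up 3-dimensional hidden state and a 2-token vocabulary. The numbers and the names `h`, `W_small`, and `toy_logits` are illustrative assumptions, not the notebook's values. Each logit is simply the dot product of one row of the weight matrix with the hidden state:

```python
# Hypothetical toy sizes: hidden size 3, vocabulary size 2
h = [1.0, 2.0, -1.0]            # one hidden state
W_small = [
    [0.5, 0.0, 1.0],            # weights for token 0
    [-1.0, 0.5, 0.0],           # weights for token 1
]

# One logit per vocabulary entry: a row of W_small dotted with h
toy_logits = [sum(w * x for w, x in zip(row, h)) for row in W_small]
print(toy_logits)
```

Stacking the hidden states of all positions as rows and multiplying by the transposed weight matrix computes these dot products for every position at once.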

In [None]:
# Initialize LM head weight matrix
W_lm = random_matrix(VOCAB_SIZE, D_MODEL)  # [6, 16]

print("LM Head Weight Matrix W_lm")
print(f"Shape: [{VOCAB_SIZE}, {D_MODEL}]")

In [None]:
# Compute logits: hidden_state @ W_lm^T
W_lm_T = transpose(W_lm)
logits = matmul(layer_norm_output, W_lm_T)

print("Logits (unnormalized scores)")
print(f"Shape: [{seq_len}, {VOCAB_SIZE}]")
print()
print(f"{'Position':<12} {'<PAD>':>8} {'<BOS>':>8} {'<EOS>':>8} {'I':>8} {'like':>8} {'trans':>8}")
print("-"*70)
for i, row in enumerate(logits):
    print(f"{TOKEN_NAMES[tokens[i]]:<12} {row[0]:>8.4f} {row[1]:>8.4f} {row[2]:>8.4f} {row[3]:>8.4f} {row[4]:>8.4f} {row[5]:>8.4f}")

## Step 2: Convert to Probabilities (Softmax)

Logits are scores, but they're not probabilities. They don't sum to 1. Some are negative.

**Softmax** fixes this:

$$P(\text{token}_i) = \frac{\exp(\text{logit}_i)}{\sum_j \exp(\text{logit}_j)}$$
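
A small worked example may help. The sketch below applies the formula to three made-up logits and checks two properties: the outputs sum to 1, and adding a constant to every logit leaves the probabilities unchanged. The second property is why the `softmax` helper defined earlier can subtract the maximum for numerical stability. The name `softmax_demo` and the input values are illustrative.

```python
import math

def softmax_demo(vec):
    # Subtract the max before exponentiating to avoid overflow
    max_val = max(vec)
    exps = [math.exp(v - max_val) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]

demo_logits = [2.0, 1.0, -1.0]
demo_probs = softmax_demo(demo_logits)

# Shifting every logit by the same constant should not change the result
shifted_probs = softmax_demo([v + 100.0 for v in demo_logits])

print([round(p, 4) for p in demo_probs])
print(round(sum(demo_probs), 4))
```

Note that the negative logit still gets a positive probability, and the ordering of the logits is preserved.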

In [None]:
# Apply softmax to get probabilities
probs = [softmax(row) for row in logits]

print("Probabilities (after softmax)")
print(f"Shape: [{seq_len}, {VOCAB_SIZE}]")
print()
print(f"{'Position':<12} {'<PAD>':>8} {'<BOS>':>8} {'<EOS>':>8} {'I':>8} {'like':>8} {'trans':>8} {'Sum':>8}")
print("-"*80)
for i, row in enumerate(probs):
    row_sum = sum(row)
    print(f"{TOKEN_NAMES[tokens[i]]:<12} {row[0]:>8.4f} {row[1]:>8.4f} {row[2]:>8.4f} {row[3]:>8.4f} {row[4]:>8.4f} {row[5]:>8.4f} {row_sum:>8.4f}")

## Step 3: Compute Loss (Cross-Entropy)

Now we need to measure how wrong the model is.

The metric is **cross-entropy loss**:

$$L = -\log P(\text{correct\_token})$$

- If the model is confident and correct (P = 1.0): loss = 0 → perfect!
- If the model is unsure (P = 0.5): loss ≈ 0.69 → okay
- If the model gives the correct token little probability (P = 0.1): loss ≈ 2.3 → bad!
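
The three reference values in the list can be checked directly. A minimal sketch (the helper name `cross_entropy` is illustrative):

```python
import math

def cross_entropy(p_correct):
    # Loss for one position: -log of the probability assigned to the
    # correct token. Written as log(1/p), which is mathematically the
    # same, so that p = 1.0 prints as 0.0000 rather than -0.0000.
    return math.log(1.0 / p_correct)

for p in (1.0, 0.5, 0.1):
    print(f"P(correct) = {p:.1f} -> loss = {cross_entropy(p):.4f}")
```

The loss grows without bound as the probability of the correct token approaches 0, so confident mistakes are penalized far more heavily than mild uncertainty.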

In [None]:
# Target tokens (what we should predict)
# At position i, we predict token i+1
targets = [3, 4, 5, 2]  # I, like, transformers, <EOS>

print("Targets (what the model should predict)")
print("="*60)
print()
for i in range(len(targets)):
    print(f"Position {i} ({TOKEN_NAMES[tokens[i]]:12s}) → should predict: {TOKEN_NAMES[targets[i]]} (token {targets[i]})")

In [None]:
# Compute cross-entropy loss
losses = []

print("Cross-Entropy Loss Calculation")
print("="*60)
print()
print(f"{'Position':<12} {'Current':<12} {'Target':<12} {'P(target)':>10} {'Loss':>10}")
print("-"*60)

for i in range(len(targets)):
    target = targets[i]
    prob_target = probs[i][target]
    loss = -math.log(prob_target)
    losses.append(loss)
    print(f"{i:<12} {TOKEN_NAMES[tokens[i]]:<12} {TOKEN_NAMES[target]:<12} {prob_target:>10.4f} {loss:>10.4f}")

total_loss = sum(losses)
avg_loss = total_loss / len(losses)
print("-"*60)
print(f"{'Total':<36} {' ':>10} {total_loss:>10.4f}")
print(f"{'Average':<36} {' ':>10} {avg_loss:>10.4f}")

## What Does This Loss Mean?

For reference:
- **Random guessing** (uniform over 6 tokens): $-\log(1/6) \approx 1.79$
- **Perfect prediction**: $-\log(1.0) = 0.0$

Our model's loss is close to random guessing. That's exactly what we'd expect from an **untrained model with random weights**.
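
The random-guessing baseline can be reproduced without any model at all. In the sketch below (names such as `uniform_logits` are illustrative), identical logits produce a uniform distribution, so the loss equals $\log 6$ whatever the target token is:

```python
import math

vocab_size = 6
uniform_logits = [0.0] * vocab_size   # no preference for any token

# Softmax of identical logits is the uniform distribution
exps = [math.exp(v) for v in uniform_logits]
uniform_probs = [e / sum(exps) for e in exps]

# The loss is the same for every possible target token
baseline_losses = [-math.log(p) for p in uniform_probs]
print(f"P(each token) = {uniform_probs[0]:.4f}")
print(f"Baseline loss = {baseline_losses[0]:.4f}")
```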

In [None]:
random_loss = -math.log(1 / VOCAB_SIZE)  # uniform guess over the vocabulary
print(f"Our average loss: {avg_loss:.4f}")
print(f"Random guessing:  {random_loss:.4f}")
print()
if avg_loss > random_loss:
    print("We're slightly worse than random - untrained model, as expected!")
else:
    print("We're slightly better than random - just luck with initialization.")

## Forward Pass: Complete!

We've computed:

1. ✅ **Embeddings** — Converted tokens to vectors
2. ✅ **Q/K/V Projections** — Prepared for attention
3. ✅ **Attention** — Computed context-aware representations
4. ✅ **Multi-head** — Combined multiple attention perspectives
5. ✅ **Feed-forward** — Applied non-linear transformations
6. ✅ **Layer normalization** — Stabilized activations
7. ✅ **Output projection** — Projected to vocabulary space
8. ✅ **Loss calculation** — Measured prediction error

**The forward pass is done.**

## What's Next: Backpropagation

Now comes the fun part.

We know the model is wrong (loss ≈ 1.9). The question is: **how do we fix it?**

We need to compute gradients: how much the loss changes when each parameter changes. Then we'll update those parameters to reduce the loss.

This is **backpropagation**: computing gradients by walking backward through the computation graph, applying the chain rule at each step.
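
As a small preview of what a gradient measures, the sketch below estimates one numerically: it nudges one logit at a time and records how the loss responds. This finite-difference approach is only an illustration on made-up values, not the backpropagation algorithm itself, and the names `loss_from_logits` and `numerical_gradient` are hypothetical.

```python
import math

def loss_from_logits(logits, target):
    # Cross-entropy of softmax(logits) against the target index,
    # computed as log-sum-exp minus the target logit
    max_val = max(logits)
    log_sum = max_val + math.log(sum(math.exp(v - max_val) for v in logits))
    return log_sum - logits[target]

def numerical_gradient(logits, target, h=1e-5):
    # Central difference: nudge each logit up and down by h
    grads = []
    for i in range(len(logits)):
        up = logits[:i] + [logits[i] + h] + logits[i + 1:]
        down = logits[:i] + [logits[i] - h] + logits[i + 1:]
        diff = loss_from_logits(up, target) - loss_from_logits(down, target)
        grads.append(diff / (2 * h))
    return grads

example_logits = [0.2, -0.1, 0.4]
grads = numerical_gradient(example_logits, target=2)
print([round(g, 4) for g in grads])
```

The entry for the target token is negative (raising that logit lowers the loss), while the entries for the other tokens are positive. Backpropagation obtains the same quantities analytically, and for every parameter rather than just the logits.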

In [None]:
# Store everything for backprop
forward_pass_data = {
    'tokens': tokens,
    'targets': targets,
    'X': X,
    'layer_norm_output': layer_norm_output,
    'logits': logits,
    'probs': probs,
    'losses': losses,
    'avg_loss': avg_loss,
    # All weights
    'E_token': E_token,
    'E_pos': E_pos,
    'W_Q': W_Q, 'W_K': W_K, 'W_V': W_V,
    'W_O': W_O,
    'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2,
    'W_lm': W_lm,
    'gamma': gamma, 'beta': beta
}
print(f"Forward pass complete. Loss: {avg_loss:.4f}")
print("Data stored for backpropagation.")