# 6. Layer Normalization

**Stabilizing activations and adding residual connections**

Okay. We've got our feed-forward network output. Are we done?

Not quite.

See, if we stopped here, we'd have **replaced** the attention output with the FFN output. All that careful work computing Q, K, V, attention scores, multi-head combinations... it could only reach later layers through whatever the FFN happens to pass along. Anything the FFN squashes is gone.

That's not ideal.

Plus, deep neural networks have this annoying tendency where activations can grow or shrink out of control as you stack more layers. Small errors compound, gradients explode or vanish, and training becomes a nightmare.

Two techniques solve these problems: **residual connections** and **layer normalization**.
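
To get a feel for how quickly things compound, here's a toy sketch. It assumes every layer multiplies an activation by the same fixed gain (the `per_layer_gain` values are made up, and real layers are far messier), but it shows why even a small per-layer bias matters once you stack dozens of layers.

```python
def stack_layers(value, per_layer_gain, num_layers):
    """Toy model: each 'layer' just multiplies the activation by a fixed gain."""
    for _ in range(num_layers):
        value *= per_layer_gain
    return value

# Hypothetical gains slightly above and slightly below 1
grown = stack_layers(1.0, per_layer_gain=1.1, num_layers=50)
shrunk = stack_layers(1.0, per_layer_gain=0.9, num_layers=50)

print(f"Gain 1.1 over 50 layers: {grown:.2f}")
print(f"Gain 0.9 over 50 layers: {shrunk:.6f}")
```

A 10% nudge per layer is enough to push a value of 1.0 past 100 in one direction or below 0.01 in the other.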

## Solution 1: Residual Connections

The fix is beautifully simple: **add** the FFN output to the attention output instead of replacing it.

$$\text{residual} = \text{attention\_output} + \text{FFN}(\text{attention\_output})$$

This is called a **residual connection** (or skip connection). The idea: let the FFN learn the **change** to make, not the entire new representation.

Benefits:
- **No information loss** — original attention output is preserved
- **Easier learning** — the FFN only needs to learn deltas
- **Better gradients** — during backprop, gradients can flow directly through the residual path
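
To make the "better gradients" point concrete, here's a minimal sketch using a scalar stand-in for the FFN. The function `toy_sublayer` and its weight are hypothetical, picked so that the sublayer's own slope is tiny. A finite-difference check shows that the residual path still contributes a slope of 1 on top of it.

```python
def toy_sublayer(x, w=0.001):
    """Hypothetical stand-in for the FFN: a sublayer with a very small slope."""
    return w * x

def with_residual(x):
    # Residual connection: output = input + sublayer(input)
    return x + toy_sublayer(x)

def numerical_slope(f, x, h=1e-6):
    # Central finite-difference approximation of df/dx
    return (f(x + h) - f(x - h)) / (2 * h)

slope_plain = numerical_slope(toy_sublayer, 2.0)
slope_residual = numerical_slope(with_residual, 2.0)

print(f"Slope without residual: {slope_plain:.4f}")
print(f"Slope with residual:    {slope_residual:.4f}")
```

Since the derivative of $x + f(x)$ is $1 + f'(x)$, the gradient reaching earlier layers never depends on the sublayer alone.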

## Solution 2: Layer Normalization

Even with residual connections, the scale of activations can drift as they pass through layer after layer. One dimension might grow huge, another shrink to near-zero.

**Layer normalization** solves this by normalizing each position's activations to have:
- Mean = 0
- Variance = 1

Then it applies learned scale ($\gamma$) and shift ($\beta$) parameters to restore expressiveness.

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
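
Before running this on real activations, here's a tiny worked example of what $\gamma$ and $\beta$ do. The input vector and the parameter values are made up for illustration, and `layer_norm_sketch` is just a compact version of the formula above, not a reference implementation.

```python
import math

def layer_norm_sketch(x, gamma, beta, epsilon=1e-5):
    """Normalize x to mean 0 / variance ~1, then scale by gamma and shift by beta."""
    mean = sum(x) / len(x)
    variance = sum((xi - mean) ** 2 for xi in x) / len(x)
    std = math.sqrt(variance + epsilon)
    return [g * (xi - mean) / std + b for xi, g, b in zip(x, gamma, beta)]

def mean_and_std(values):
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

x = [1.0, 2.0, 3.0, 4.0]  # made-up input

# gamma = 1, beta = 0: plain normalization
plain = layer_norm_sketch(x, gamma=[1.0] * 4, beta=[0.0] * 4)
# gamma = 2, beta = 0.5: the output is rescaled and recentered
shifted = layer_norm_sketch(x, gamma=[2.0] * 4, beta=[0.5] * 4)

plain_mean, plain_std = mean_and_std(plain)
shifted_mean, shifted_std = mean_and_std(shifted)
print(f"gamma=1, beta=0:   mean = {plain_mean:.4f}, std = {plain_std:.4f}")
print(f"gamma=2, beta=0.5: mean = {shifted_mean:.4f}, std = {shifted_std:.4f}")
```

With a uniform $\gamma$ and $\beta$, the output mean becomes $\beta$ and the standard deviation becomes roughly $\gamma$. That's the sense in which the learned parameters let the model undo the normalization where that helps.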

In [None]:
import random
import math

random.seed(42)

# Model hyperparameters
VOCAB_SIZE = 6
D_MODEL = 16
D_FF = 64
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS
EPSILON = 1e-5  # For numerical stability

TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]

In [None]:
# Helper functions (same as before)
def random_vector(size, scale=0.1):
    return [random.gauss(0, scale) for _ in range(size)]

def random_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def add_vectors(v1, v2):
    return [a + b for a, b in zip(v1, v2)]

def matmul(A, B):
    m, n = len(A), len(A[0])
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)] for i in range(m)]

def transpose(A):
    return [[A[i][j] for i in range(len(A))] for j in range(len(A[0]))]

def softmax(vec):
    max_val = max(v for v in vec if v != float('-inf'))
    exp_vec = [math.exp(v - max_val) if v != float('-inf') else 0 for v in vec]
    sum_exp = sum(exp_vec)
    return [e / sum_exp for e in exp_vec]

def gelu(x):
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def format_vector(vec, decimals=4):
    return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"

In [None]:
# Recreate everything from previous notebooks
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]
tokens = [1, 3, 4, 5, 2]
seq_len = len(tokens)
X = [add_vectors(E_token[tokens[i]], E_pos[i]) for i in range(seq_len)]

# Attention
W_Q = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_K = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_V = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
Q_all = [matmul(X, W_Q[h]) for h in range(NUM_HEADS)]
K_all = [matmul(X, W_K[h]) for h in range(NUM_HEADS)]
V_all = [matmul(X, W_V[h]) for h in range(NUM_HEADS)]

def compute_attention(Q, K, V):
    seq_len, d_k = len(Q), len(Q[0])
    scale = math.sqrt(d_k)
    scores = matmul(Q, transpose(K))
    scaled = [[s / scale for s in row] for row in scores]
    for i in range(seq_len):
        for j in range(seq_len):
            if j > i:
                scaled[i][j] = float('-inf')
    weights = [softmax(row) for row in scaled]
    return matmul(weights, V)

attention_output_all = [compute_attention(Q_all[h], K_all[h], V_all[h]) for h in range(NUM_HEADS)]
# Concatenate the per-head outputs along the feature dimension (works for any NUM_HEADS)
concat_output = [sum((attention_output_all[h][i] for h in range(NUM_HEADS)), []) for i in range(seq_len)]
W_O = random_matrix(D_MODEL, D_MODEL)
multi_head_output = matmul(concat_output, transpose(W_O))

# FFN
W1 = random_matrix(D_FF, D_MODEL)
b1 = random_vector(D_FF)
W2 = random_matrix(D_MODEL, D_FF)
b2 = random_vector(D_MODEL)
W1_T = transpose(W1)
hidden = matmul(multi_head_output, W1_T)
hidden = [[hidden[i][j] + b1[j] for j in range(D_FF)] for i in range(seq_len)]
activated = [[gelu(h) for h in row] for row in hidden]
W2_T = transpose(W2)
ffn_output = matmul(activated, W2_T)
ffn_output = [[ffn_output[i][j] + b2[j] for j in range(D_MODEL)] for i in range(seq_len)]

print("Recreated multi-head attention and FFN outputs")

## Step 1: Add Residual Connection

Just add the attention output and FFN output element-wise:

In [None]:
# Compute residual: attention output + FFN output
residual = [add_vectors(multi_head_output[i], ffn_output[i]) for i in range(seq_len)]

print("Residual Connection")
print("="*60)
print()
print("Example for position 0 (<BOS>):")
print(f"  Attention output: {format_vector(multi_head_output[0])}")
print(f"  FFN output:       {format_vector(ffn_output[0])}")
print(f"  Residual (sum):   {format_vector(residual[0])}")

## Step 2: Compute Mean and Variance

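For each position, compute statistics across the `D_MODEL` dimension (here $d_{\text{model}} = 16$, so $i$ runs from 0 to 15):

$$\mu = \frac{1}{d_{\text{model}}} \sum_{i=0}^{d_{\text{model}}-1} x_i$$

$$\sigma^2 = \frac{1}{d_{\text{model}}} \sum_{i=0}^{d_{\text{model}}-1} (x_i - \mu)^2$$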

In [None]:
def layer_norm(x, gamma, beta, epsilon=1e-5):
    """Apply layer normalization to a single vector"""
    # Compute mean
    mean = sum(x) / len(x)
    
    # Compute variance
    variance = sum((xi - mean)**2 for xi in x) / len(x)
    
    # Normalize
    std = math.sqrt(variance + epsilon)
    x_norm = [(xi - mean) / std for xi in x]
    
    # Scale and shift
    output = [gamma[i] * x_norm[i] + beta[i] for i in range(len(x))]
    
    return output, mean, variance

# Initialize gamma and beta (learnable parameters)
gamma = [1.0] * D_MODEL  # Scale, initialized to 1
beta = [0.0] * D_MODEL   # Shift, initialized to 0

print("Layer norm parameters:")
print(f"  gamma (scale): {gamma[:4]}... (all 1.0)")
print(f"  beta (shift):  {beta[:4]}... (all 0.0)")

In [None]:
# Detailed calculation for position 0
x = residual[0]
mean = sum(x) / D_MODEL
variance = sum((xi - mean)**2 for xi in x) / D_MODEL
std = math.sqrt(variance + EPSILON)

print("Detailed calculation for position 0 (<BOS>)")
print("="*60)
print()
print(f"Input (residual): {format_vector(x)}")
print()
print(f"Mean: {mean:.6f}")
print(f"Variance: {variance:.6f}")
print(f"Std (with epsilon): {std:.6f}")
print()

# Normalize first few values
print("Normalization examples:")
for i in range(3):
    norm_val = (x[i] - mean) / std
    print(f"  x[{i}] = ({x[i]:.4f} - {mean:.4f}) / {std:.4f} = {norm_val:.4f}")

In [None]:
# Apply layer norm to all positions
layer_norm_output = []
stats = []

for i in range(seq_len):
    output, mean, var = layer_norm(residual[i], gamma, beta, EPSILON)
    layer_norm_output.append(output)
    stats.append((mean, var))

print("Layer Norm Output")
print(f"Shape: [{seq_len}, {D_MODEL}]")
print()
for i, row in enumerate(layer_norm_output):
    print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")

## Verification: Did It Work?

Let's check that layer normalization actually did what it promised. Each position should now have:
- Mean ≈ 0
- Variance ≈ 1

In [None]:
print("Verification: Mean and Variance after LayerNorm")
print("="*60)
print()
for i in range(seq_len):
    x = layer_norm_output[i]
    mean = sum(x) / len(x)
    var = sum((xi - mean)**2 for xi in x) / len(x)
    print(f"Position {i} ({TOKEN_NAMES[tokens[i]]:12s}): mean = {mean:9.6f}, variance = {var:.6f}")

print()
print("Mean is ~0, variance is ~1. Layer norm worked!")
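
Why "≈ 1" and not exactly 1? Because $\epsilon$ sits inside the square root, the output variance works out to $\sigma^2 / (\sigma^2 + \epsilon)$, which is always a little below 1. The sketch below checks that on made-up inputs at three different scales. The `normalize` helper assumes $\gamma = 1$ and $\beta = 0$, as in this notebook.

```python
import math

EPS = 1e-5

def normalize(x, epsilon=EPS):
    """Layer norm with gamma = 1 and beta = 0; also returns the input variance."""
    mean = sum(x) / len(x)
    variance = sum((xi - mean) ** 2 for xi in x) / len(x)
    std = math.sqrt(variance + epsilon)
    return [(xi - mean) / std for xi in x], variance

def variance_of(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

results = []
for scale in (1.0, 0.1, 0.001):
    x = [scale * v for v in (1.0, 2.0, 3.0, 4.0)]  # made-up input at this scale
    x_norm, var_in = normalize(x)
    var_out = variance_of(x_norm)
    predicted = var_in / (var_in + EPS)
    results.append((scale, var_out, predicted))
    print(f"scale = {scale:<6} output variance = {var_out:.6f}  predicted = {predicted:.6f}")
```

For inputs whose variance is much larger than $\epsilon$, the gap is negligible. It only becomes noticeable when the input variance is comparable to $\epsilon$ itself.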

## Before and After

Let's compare position 1 (`I`) before and after layer normalization:

In [None]:
print("Position 1 ('I') - Before and After LayerNorm")
print("="*60)
print()
print("Before (residual):")
print(f"  {format_vector(residual[1])}")
print()
print("After (layer norm):")
print(f"  {format_vector(layer_norm_output[1])}")
print()
print("The magnitudes changed (normalized), but the relative ordering of values is preserved.")
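
What does "relationships preserved" mean here? With $\gamma = 1$ and $\beta = 0$, layer norm subtracts one number and divides by one positive number, so it can't reorder values or change the ratio between two gaps. Here's a small check on made-up values. Once $\gamma$ differs per dimension, this no longer has to hold.

```python
import math

def normalize(x, epsilon=1e-5):
    """Layer norm with gamma = 1 and beta = 0 (assumed, as in this notebook)."""
    mean = sum(x) / len(x)
    variance = sum((xi - mean) ** 2 for xi in x) / len(x)
    std = math.sqrt(variance + epsilon)
    return [(xi - mean) / std for xi in x]

def rank_order(values):
    # Indices of the values, sorted from smallest to largest
    return sorted(range(len(values)), key=lambda i: values[i])

before = [0.31, -0.12, 0.05, 0.47, -0.26]  # made-up activations
after = normalize(before)

same_order = rank_order(before) == rank_order(after)

# The ratio between two gaps is unchanged by a shared shift and scale
gap_ratio_before = (before[3] - before[0]) / (before[0] - before[2])
gap_ratio_after = (after[3] - after[0]) / (after[0] - after[2])

print(f"Same ordering: {same_order}")
print(f"Gap ratio before: {gap_ratio_before:.6f}, after: {gap_ratio_after:.6f}")
```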

## One Transformer Block: Complete!

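We just finished a (simplified) transformer block!

In a real transformer, you'd stack multiple blocks. The largest GPT-3 model has 96 of them. In the original post-norm layout, each block is:

1. Multi-head attention
2. Residual + Layer norm
3. Feed-forward network
4. Residual + Layer norm

(GPT-2 and GPT-3 move the layer norm to the input of each sub-layer instead, but the ingredients are the same.)

Our version takes a shortcut: we skipped step 2 and fed the attention output straight into the FFN, so there's only one residual connection and one layer norm. We're also only using **one block** to keep things manageable.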
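
The four steps listed above can be sketched as a single function. This is a rough outline, not this notebook's exact pipeline: `toy_attention` and `toy_ffn` are hypothetical stand-ins for the real sublayers, and layer norm is applied with $\gamma = 1$ and $\beta = 0$.

```python
import math

def layer_norm(x, epsilon=1e-5):
    # gamma = 1 and beta = 0 are assumed, so scale and shift are omitted
    mean = sum(x) / len(x)
    variance = sum((xi - mean) ** 2 for xi in x) / len(x)
    std = math.sqrt(variance + epsilon)
    return [(xi - mean) / std for xi in x]

def add_and_norm(x, sublayer_out):
    # Residual connection followed by layer norm, one position at a time
    return [layer_norm([a + b for a, b in zip(x_row, s_row)])
            for x_row, s_row in zip(x, sublayer_out)]

def transformer_block(x, attention_fn, ffn_fn):
    attn_out = attention_fn(x)        # 1. multi-head attention
    h = add_and_norm(x, attn_out)     # 2. residual + layer norm
    ffn_out = ffn_fn(h)               # 3. feed-forward network
    return add_and_norm(h, ffn_out)   # 4. residual + layer norm

def toy_attention(x):
    """Hypothetical stand-in: each position averages itself and all earlier positions."""
    out = []
    for i in range(len(x)):
        prefix = x[:i + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

def toy_ffn(x):
    """Hypothetical stand-in: an element-wise nonlinearity."""
    return [[math.tanh(v) for v in row] for row in x]

# Made-up input: 3 positions, 4 dimensions
x = [[0.1, -0.2, 0.3, 0.0],
     [0.4, 0.1, -0.3, 0.2],
     [-0.1, 0.5, 0.2, -0.4]]

out = transformer_block(x, toy_attention, toy_ffn)
for i, row in enumerate(out):
    print(f"pos {i}: " + ", ".join(f"{v:7.4f}" for v in row))
```

Because the block takes and returns the same shape, stacking blocks is just calling `transformer_block` again on its own output.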

## What's Next

The transformer block is done. Now we need to convert these 16-dimensional vectors into actual predictions.

How do we predict the next token?

We'll project these vectors into vocabulary space using a **language modeling head**, then compute the **loss** to see how wrong we are.

In [None]:
# Store for next notebook
layer_norm_data = {
    'X': X,
    'tokens': tokens,
    'layer_norm_output': layer_norm_output,
    'gamma': gamma,
    'beta': beta
}
print("Layer norm data stored for next notebook.")