# 5. Feed-Forward Network

**Position-wise transformations to add non-linearity and expressiveness**

Alright, multi-head attention is behind us. Now for something... simpler?

The feed-forward network (FFN).

Here's the deal: attention lets tokens talk to each other, which is great. But apart from the softmax that produces the attention weights, it's all linear projections and weighted sums. We need more **non-linearity** in here, some way for the model to learn complex, non-linear relationships.

That's where the FFN comes in.

## What is the Feed-Forward Network?

It's just a two-layer fully connected neural network, applied independently to each position with the same weights at every position.

That's it. No attention, no looking at other tokens. Just:
1. **Expand** the representation to a higher dimension
2. **Apply a non-linear activation** (GELU in our case)
3. **Project** back down to the original dimension

Think of it as giving each token's representation some "personal processing time" to transform itself in complex, non-linear ways.
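
To make the three steps concrete, here is a minimal sketch for a single token. The function name `ffn_single_token`, the toy sizes, and the hand-picked weights are all made up for illustration; they are not the weights this notebook uses later.

```python
import math

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def ffn_single_token(x, W1, b1, W2, b2):
    """Expand, activate, project for ONE token vector x."""
    # 1. Expand: each row of W1 produces one hidden unit
    hidden = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    # 2. Apply the non-linear activation element-wise
    activated = [gelu(h) for h in hidden]
    # 3. Project: each row of W2 produces one output dimension
    return [sum(w * a for w, a in zip(row, activated)) + b for row, b in zip(W2, b2)]

# Toy sizes (illustrative only): d_model = 2, d_ff = 4
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]   # [4, 2]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]        # [2, 4]
b2 = [0.0, 0.0]

y = ffn_single_token([1.0, -1.0], W1, b1, W2, b2)
print(len(y))  # the output has the same dimension as the input
```

The input and output both have 2 dimensions here; only the hidden layer in the middle is wider.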

## The Architecture

$$\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 \cdot x + b_1) + b_2$$

Breaking it down:
- **$W_1$**: Weights for the first layer, shape $[d_{ff}, d_{model}] = [64, 16]$ (expansion)
- **$b_1$**: Bias for the first layer, shape $[d_{ff}] = [64]$
- **GELU**: Gaussian Error Linear Unit activation
- **$W_2$**: Weights for the second layer, shape $[d_{model}, d_{ff}] = [16, 64]$ (projection)
- **$b_2$**: Bias for the second layer, shape $[d_{model}] = [16]$

**Why expand to 64 dimensions?** A common convention, going back to the original Transformer, is $d_{ff} = 4 \times d_{model}$. It's a convention rather than a requirement. The expansion gives the model more "room" to represent complex transformations.
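
The shapes above also tell us how many parameters the FFN has. A quick tally, using lowercase names so nothing in the notebook gets overwritten:

```python
d_model = 16
d_ff = 4 * d_model  # 64

w1_params = d_ff * d_model   # W1: [64, 16]
b1_params = d_ff             # b1: [64]
w2_params = d_model * d_ff   # W2: [16, 64]
b2_params = d_model          # b2: [16]

total_params = w1_params + b1_params + w2_params + b2_params
print(total_params)  # 2128
```

Almost all of that is in the two weight matrices (1024 each); the biases add only 80.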

In [None]:
import random
import math

random.seed(42)

# Model hyperparameters
VOCAB_SIZE = 6
D_MODEL = 16
D_FF = 64  # 4 * D_MODEL
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS

TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]

In [None]:
# Helper functions
def random_vector(size, scale=0.1):
    return [random.gauss(0, scale) for _ in range(size)]

def random_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def add_vectors(v1, v2):
    return [a + b for a, b in zip(v1, v2)]

def matmul(A, B):
    m, n = len(A), len(A[0])
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)] for i in range(m)]

def transpose(A):
    return [[A[i][j] for i in range(len(A))] for j in range(len(A[0]))]

def softmax(vec):
    max_val = max(v for v in vec if v != float('-inf'))
    exp_vec = [math.exp(v - max_val) if v != float('-inf') else 0 for v in vec]
    sum_exp = sum(exp_vec)
    return [e / sum_exp for e in exp_vec]

def gelu(x):
    """GELU activation: x * Φ(x) using tanh approximation"""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def format_vector(vec, decimals=4):
    return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"

In [None]:
# Recreate multi-head attention output from previous notebooks
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]
tokens = [1, 3, 4, 5, 2]
seq_len = len(tokens)
X = [add_vectors(E_token[tokens[i]], E_pos[i]) for i in range(seq_len)]

# QKV and attention
W_Q = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_K = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_V = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
Q_all = [matmul(X, W_Q[h]) for h in range(NUM_HEADS)]
K_all = [matmul(X, W_K[h]) for h in range(NUM_HEADS)]
V_all = [matmul(X, W_V[h]) for h in range(NUM_HEADS)]

def compute_attention(Q, K, V):
    seq_len, d_k = len(Q), len(Q[0])
    scale = math.sqrt(d_k)
    scores = matmul(Q, transpose(K))
    scaled = [[s / scale for s in row] for row in scores]
    for i in range(seq_len):
        for j in range(seq_len):
            if j > i:
                scaled[i][j] = float('-inf')
    weights = [softmax(row) for row in scaled]
    return matmul(weights, V)

attention_output_all = [compute_attention(Q_all[h], K_all[h], V_all[h]) for h in range(NUM_HEADS)]
# Concatenate the per-head outputs for each position (works for any NUM_HEADS)
concat_output = [[v for h in range(NUM_HEADS) for v in attention_output_all[h][i]] for i in range(seq_len)]
W_O = random_matrix(D_MODEL, D_MODEL)
multi_head_output = matmul(concat_output, transpose(W_O))

print("Recreated multi-head attention output")

## The GELU Activation

We're using GELU (Gaussian Error Linear Unit) instead of the classic ReLU.

$$\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}} \cdot (x + 0.044715 \cdot x^3)\right)\right)$$

**Why GELU instead of ReLU?**

ReLU just zeros out negative values: $\text{ReLU}(x) = \max(0, x)$. It's simple, but it's a hard cutoff.

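GELU is smoother. It still passes large positive values through almost unchanged, but it doesn't completely kill negative ones: slightly negative inputs give small negative outputs, and very negative inputs are pushed toward zero. Because the curve has no hard corner, the gradient changes smoothly and stays non-zero for moderately negative inputs, which is the usual argument for why it helps training.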
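
The formula above is an approximation. The exact definition is $x \cdot \Phi(x)$, where $\Phi$ is the standard normal CDF, and the standard library's `math.erf` is enough to compute it. The sketch below compares the two on a grid; the names `gelu_exact` and `gelu_tanh` are just labels for this example.

```python
import math

def gelu_exact(x):
    # Exact definition: x * Φ(x), with the normal CDF written via erf
    return 0.5 * x * (1 + math.erf(x / math.sqrt(2)))

def gelu_tanh(x):
    # tanh approximation, same formula as above
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

xs = [i / 10 for i in range(-50, 51)]   # grid on [-5, 5] in steps of 0.1
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)

print(f"largest gap on the grid: {max_gap:.6f}")
print(f"GELU(-1) = {gelu_exact(-1.0):.4f}   ReLU(-1) = {max(0.0, -1.0):.4f}")
```

The second line shows the "gentle suppression": ReLU returns exactly zero at $x = -1$, while GELU returns a small negative number.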

In [None]:
# Visualize GELU vs ReLU
print("GELU vs ReLU comparison")
print("="*40)
print(f"{'x':>8} | {'ReLU':>8} | {'GELU':>8}")
print("-"*40)
for x in [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]:
    relu = max(0, x)
    gelu_val = gelu(x)
    print(f"{x:>8.1f} | {relu:>8.4f} | {gelu_val:>8.4f}")

In [None]:
# Initialize FFN weights
W1 = random_matrix(D_FF, D_MODEL)   # [64, 16] - expansion
b1 = random_vector(D_FF)            # [64]
W2 = random_matrix(D_MODEL, D_FF)   # [16, 64] - projection
b2 = random_vector(D_MODEL)         # [16]

print("FFN Weight Shapes:")
print(f"  W1: [{D_FF}, {D_MODEL}] (expansion)")
print(f"  b1: [{D_FF}]")
print(f"  W2: [{D_MODEL}, {D_FF}] (projection)")
print(f"  b2: [{D_MODEL}]")

## Step 1: First Linear Layer (Expansion)

$$\text{hidden} = W_1 \cdot x + b_1$$

We're expanding from 16 dimensions to 64.
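
One thing that can be confusing: the formula is written for a single column vector ($W_1 \cdot x$), while code that handles all positions at once stores one token per row and multiplies by $W_1^T$. The sketch below checks that the two give the same numbers. It uses toy sizes, `_demo` names, and its own random generator, so it doesn't touch the notebook's variables or seed.

```python
import random

rng_demo = random.Random(0)  # local generator; the global random.seed(42) is untouched

d_model_demo, d_ff_demo, seq_len_demo = 4, 8, 3
W1_demo = [[rng_demo.gauss(0, 0.1) for _ in range(d_model_demo)] for _ in range(d_ff_demo)]
b1_demo = [rng_demo.gauss(0, 0.1) for _ in range(d_ff_demo)]
X_demo = [[rng_demo.gauss(0, 1) for _ in range(d_model_demo)] for _ in range(seq_len_demo)]

def expand_one(x):
    # Column-vector form from the formula: hidden = W1 · x + b1
    return [sum(W1_demo[j][k] * x[k] for k in range(d_model_demo)) + b1_demo[j]
            for j in range(d_ff_demo)]

# Row-vector (all positions at once) form: hidden = X @ W1^T + b1
W1_T_demo = [[W1_demo[j][k] for j in range(d_ff_demo)] for k in range(d_model_demo)]
batched = [[sum(X_demo[i][k] * W1_T_demo[k][j] for k in range(d_model_demo)) + b1_demo[j]
            for j in range(d_ff_demo)] for i in range(seq_len_demo)]

per_token = [expand_one(x) for x in X_demo]
max_diff = max(abs(a - b) for r1, r2 in zip(per_token, batched) for a, b in zip(r1, r2))
print(f"largest difference between the two forms: {max_diff}")
```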

In [None]:
# Compute first linear layer for all positions
# hidden = input @ W1^T + b1
W1_T = transpose(W1)
hidden = matmul(multi_head_output, W1_T)
hidden = [[hidden[i][j] + b1[j] for j in range(D_FF)] for i in range(seq_len)]

print("Hidden layer (after first linear)")
print(f"Shape: [{seq_len}, {D_FF}]")
print()
print("First 8 values for position 0 (<BOS>):")
print(f"  {format_vector(hidden[0][:8])}")

## Step 2: GELU Activation

Apply GELU element-wise to the hidden layer.

In [None]:
# Apply GELU activation
activated = [[gelu(h) for h in row] for row in hidden]

print("After GELU activation")
print()
print("Example for position 0, first 8 values:")
print(f"  Before: {format_vector(hidden[0][:8])}")
print(f"  After:  {format_vector(activated[0][:8])}")

## Step 3: Second Linear Layer (Projection)

$$\text{output} = W_2 \cdot \text{activated} + b_2$$

We're projecting back down from 64 dimensions to 16.

In [None]:
# Compute second linear layer
# output = activated @ W2^T + b2
W2_T = transpose(W2)
ffn_output = matmul(activated, W2_T)
ffn_output = [[ffn_output[i][j] + b2[j] for j in range(D_MODEL)] for i in range(seq_len)]

print("FFN Output")
print(f"Shape: [{seq_len}, {D_MODEL}]")
print()
for i, row in enumerate(ffn_output):
    print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")

## Before and After

Let's compare position 1 (`I`) before and after the FFN:

In [None]:
print("Position 1 ('I') - Before and After FFN")
print("="*60)
print()
print("Before FFN (multi-head output):")
print(f"  {format_vector(multi_head_output[1])}")
print()
print("After FFN:")
print(f"  {format_vector(ffn_output[1])}")

## What's the Point?

Here's what the FFN accomplishes:

1. **Non-linearity**: Apart from the softmax over scores, attention only applies linear projections and takes weighted averages of value vectors. The FFN adds a non-linear transformation of each token's features via GELU.

2. **Position-wise processing**: Each token is transformed independently of the others, using the same weights at every position. Attention mixed information *between* tokens; FFN processes each token *individually*.

3. **Expressiveness**: The expansion to 64 dimensions gives the model more capacity to represent complex functions.

4. **Feature transformation**: The FFN can learn to emphasize certain features, suppress others, create new combinations.
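
The position-wise claim is easy to test: change one token's input and see which outputs move. This is a self-contained sketch with toy sizes and random weights from a local generator; none of the `_demo` names come from the notebook.

```python
import math
import random

rng_demo = random.Random(0)  # local generator; the global random.seed(42) is untouched

def gelu_tanh(x):
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

D_IN, D_HID = 4, 16  # toy sizes, illustrative only
W1_demo = [[rng_demo.gauss(0, 0.5) for _ in range(D_IN)] for _ in range(D_HID)]
b1_demo = [rng_demo.gauss(0, 0.5) for _ in range(D_HID)]
W2_demo = [[rng_demo.gauss(0, 0.5) for _ in range(D_HID)] for _ in range(D_IN)]
b2_demo = [rng_demo.gauss(0, 0.5) for _ in range(D_IN)]

def ffn_token(x):
    # Same weights for every token; nothing here looks at other positions
    hidden = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1_demo, b1_demo)]
    act = [gelu_tanh(h) for h in hidden]
    return [sum(w * a for w, a in zip(row, act)) + b for row, b in zip(W2_demo, b2_demo)]

X_demo = [[rng_demo.gauss(0, 1) for _ in range(D_IN)] for _ in range(3)]
X_edit = [row[:] for row in X_demo]
X_edit[1] = [v + 1.0 for v in X_edit[1]]   # change ONLY the middle token

out_before = [ffn_token(x) for x in X_demo]
out_after = [ffn_token(x) for x in X_edit]

for i in range(3):
    print(i, "changed" if out_before[i] != out_after[i] else "unchanged")
```

Only position 1 should report a change. Run the same experiment through attention and the later positions would move too, since they attend to position 1.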

## What's Next

We just fed the attention output through the FFN and kept only the result. Whatever the FFN didn't carry through from its input is **lost**. That's... not great.

Enter: **residual connections** and **layer normalization**.

Instead of just using the FFN output, we're going to *add* it to the FFN's input (the multi-head attention output). That way we keep the old information while adding new transformations.
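
As a preview, the "add" part is just an element-wise sum of two same-shaped matrices. The sketch below uses made-up numbers and a hypothetical helper name, `add_residual`. It leaves out layer normalization entirely, and the next notebook may well structure this differently.

```python
def add_residual(sublayer_input, sublayer_output):
    # Element-wise sum, row by row; both arguments must have the same shape
    return [[a + b for a, b in zip(row_in, row_out)]
            for row_in, row_out in zip(sublayer_input, sublayer_output)]

# Made-up values: 2 positions, 2 dimensions
x_demo = [[1.0, 2.0], [3.0, 4.0]]        # stands in for the FFN's input
f_demo = [[0.25, -0.5], [0.0, 0.5]]      # stands in for the FFN's output

res_demo = add_residual(x_demo, f_demo)
print(res_demo)  # [[1.25, 1.5], [3.0, 4.5]]
```

Notice that where the FFN outputs zero, the input passes through untouched.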

In [None]:
# Store for next notebook
ffn_data = {
    'X': X,
    'tokens': tokens,
    'multi_head_output': multi_head_output,
    'ffn_output': ffn_output,
    'W1': W1, 'b1': b1,
    'W2': W2, 'b2': b2
}
print("FFN data stored for next notebook.")