# Attention Mechanism

**Computing attention scores and weighted combinations of values**

Alright, this is it. The attention mechanism itself.

This is the core innovation that makes transformers so powerful (and why they've basically taken over NLP, computer vision, and... well, everything).

## What is Attention?

Attention lets each token look at other tokens in the sequence and decide how much to focus on each one. The result is a context-aware representation: each token's output depends on the whole sequence or, with the causal mask used in this notebook, on every token up to and including its own position.

**The intuition:**
- **Query (Q)**: "What am I looking for?"
- **Key (K)**: "What do I offer?"
- **Value (V)**: "What information do I provide?"

For each token, we compute how well its query matches every key, then use those match scores to create a weighted combination of values. Simple idea, powerful results.
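
To make the query/key/value intuition concrete, here is a minimal sketch of a "soft lookup" for a single query. The toy vectors and the names `query`, `keys`, `values`, `match_scores`, and `blended` are made up for illustration and are not part of this notebook's model.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy data: one query, three keys, three values.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]

# How well does the query match each key? (dot product)
match_scores = [sum(q * k for q, k in zip(query, key)) for key in keys]

# Normalize the matches into attention weights.
weights = softmax(match_scores)

# Blend the values using those weights.
blended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(2)]

print("match scores:", match_scores)
print("weights:     ", [round(w, 3) for w in weights])
print("blended:     ", [round(x, 3) for x in blended])
```

The key that points in the same direction as the query gets the largest weight, so its value dominates the blend, but the other values still contribute a little.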

## The Attention Algorithm

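The causal attention used in this notebook has five steps. The standard formula below covers four of them; the causal mask (step 3) is applied to the scaled scores before the softmax: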

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Let's break this down:

1. **Compute attention scores**: $\text{scores} = QK^T$
2. **Scale**: $\text{scaled} = \frac{\text{scores}}{\sqrt{d_k}}$
3. **Apply causal mask**: Set future positions to $-\infty$
4. **Softmax**: Convert to probabilities
5. **Weighted sum**: $\text{output} = \text{weights} \cdot V$
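
The formula above leaves out step 3. One common way to write the causal version is with an additive mask, here called $M$ (a symbol introduced only for this sketch and not used elsewhere in the notebook):

$$\text{CausalAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V, \qquad M_{ij} = \begin{cases}
0 & \text{if } j \leq i \\
-\infty & \text{if } j > i
\end{cases}$$

Adding $-\infty$ to a score has the same effect as overwriting it with $-\infty$, so this form matches the five steps listed above.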

In [1]:
import random
import math

# Set seed for reproducibility (same as previous notebooks)
random.seed(42)

# Model hyperparameters
VOCAB_SIZE = 6
D_MODEL = 16
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS  # 8

TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]

In [2]:
# Helper functions
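def random_vector(size, scale=0.1):
    """Return `size` samples from a zero-mean Gaussian with std `scale`"""
    return [random.gauss(0, scale) for _ in range(size)]

def random_matrix(rows, cols, scale=0.1):
    """Return a rows x cols matrix of Gaussian samples with std `scale`"""
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def add_vectors(v1, v2):
    """Element-wise sum of two equal-length vectors"""
    return [a + b for a, b in zip(v1, v2)]

def matmul(A, B):
    """Multiply an m x n matrix A by an n x p matrix B"""
    m, n = len(A), len(A[0])
    p = len(B[0])
    result = [[0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            result[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
    return result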

def transpose(A):
    """Transpose matrix A"""
    rows, cols = len(A), len(A[0])
    return [[A[i][j] for i in range(rows)] for j in range(cols)]

def dot_product(v1, v2):
    """Compute dot product of two vectors"""
    return sum(a * b for a, b in zip(v1, v2))

def softmax(vec):
    """Compute softmax of a vector (handles -inf for masking)"""
    # Subtract max for numerical stability
    max_val = max(v for v in vec if v != float('-inf'))
    exp_vec = [math.exp(v - max_val) if v != float('-inf') else 0 for v in vec]
    sum_exp = sum(exp_vec)
    return [e / sum_exp for e in exp_vec]

def format_vector(vec, decimals=4):
    return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"

In [3]:
# Recreate embeddings and QKV from previous notebooks
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]

tokens = [1, 3, 4, 5, 2]  # <BOS>, I, like, transformers, <EOS>
seq_len = len(tokens)

token_embeddings = [E_token[token_id] for token_id in tokens]
X = [add_vectors(token_embeddings[i], E_pos[i]) for i in range(seq_len)]

# QKV weight matrices
W_Q = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_K = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_V = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]

# Compute Q, K, V
Q_all = [matmul(X, W_Q[h]) for h in range(NUM_HEADS)]
K_all = [matmul(X, W_K[h]) for h in range(NUM_HEADS)]
V_all = [matmul(X, W_V[h]) for h in range(NUM_HEADS)]

print("Recreated Q, K, V from previous notebooks")
print(f"Each has shape [{seq_len}, {D_K}]")

Recreated Q, K, V from previous notebooks
Each has shape [5, 8]


## Step 1: Compute Attention Scores

Multiply queries by keys (transposed) to get a matrix of "compatibility scores":

$$\text{scores} = QK^T$$

**Shapes:**
- $Q$: $[5, 8]$ (5 tokens, 8 dimensions per head)
- $K^T$: $[8, 5]$ (transposed from $[5, 8]$)
- $\text{scores}$: $[5, 5]$ (each token attending to each token)

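Each element $\text{scores}_{ij}$ is the raw compatibility between token $i$'s query and token $j$'s key. Larger values mean token $i$ will attend more to token $j$ once the scores are normalized by softmax.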

In [4]:
# Compute attention scores for Head 0
head = 0
Q = Q_all[head]
K = K_all[head]
V = V_all[head]

# scores = Q @ K^T
K_T = transpose(K)
scores = matmul(Q, K_T)

print(f"HEAD {head} - Attention Scores (Q @ K^T)")
print(f"Shape: [{seq_len}, {D_K}] @ [{D_K}, {seq_len}] = [{seq_len}, {seq_len}]")
print()
for i, row in enumerate(scores):
    print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")

HEAD 0 - Attention Scores (Q @ K^T)
Shape: [5, 8] @ [8, 5] = [5, 5]

  [-0.0126,  0.0213, -0.0152,  0.0211, -0.0137]  # pos 0: <BOS>
  [ 0.0021, -0.0134,  0.0119, -0.0027,  0.0091]  # pos 1: I
  [-0.0140,  0.0097, -0.0039,  0.0169, -0.0061]  # pos 2: like
  [-0.0018, -0.0119,  0.0046, -0.0016,  0.0088]  # pos 3: transformers
  [-0.0022,  0.0084, -0.0022, -0.0016, -0.0069]  # pos 4: <EOS>


### Detailed Calculation: Position 1 attending to Position 0

Let's see how the score for "I" attending to "\<BOS\>" is computed:

In [5]:
print("Computing score[1, 0] - how much 'I' attends to '<BOS>'")
print("="*60)
print()
print(f"Q[1] (head {head} query for 'I'):")
print(f"  {format_vector(Q[1])}")
print()
print(f"K[0] (head {head} key for '<BOS>'):")
print(f"  {format_vector(K[0])}")
print()
print("score[1, 0] = Q[1] · K[0] (dot product)")
score_1_0 = dot_product(Q[1], K[0])
print(f"           = {score_1_0:.4f}")

Computing score[1, 0] - how much 'I' attends to '<BOS>'

Q[1] (head 0 query for 'I'):
  [-0.0997, -0.0394,  0.0301,  0.0469,  0.0628, -0.0026, -0.0506,  0.0320]

K[0] (head 0 key for '<BOS>'):
  [-0.0090, -0.0398,  0.0085, -0.0527, -0.0375, -0.0001, -0.0328,  0.0792]

score[1, 0] = Q[1] · K[0] (dot product)
           = 0.0021


## Step 2: Scale the Scores

Divide by $\sqrt{d_k}$ to prevent very large values:

$$\text{scaled\_scores} = \frac{\text{scores}}{\sqrt{d_k}} = \frac{\text{scores}}{\sqrt{8}} = \frac{\text{scores}}{2.8284}$$

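**Why scale?** If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so unscaled scores grow in magnitude as $d_k$ increases. Large scores push softmax into a saturated region where gradients are very small, which slows and destabilizes training. Dividing by $\sqrt{d_k}$ brings the variance back to roughly 1 and keeps the scores in a reasonable range.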
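
The effect of the scale factor can be checked empirically. The sketch below assumes unit-variance Gaussian components (unlike the `scale=0.1` initialization used in this notebook), and `dot_variance` is a hypothetical helper written only for this illustration.

```python
import math
import random

random.seed(0)  # arbitrary seed, only so the sketch is repeatable

def dot_variance(d_k, trials=2000):
    """Empirical variance of q·k when all components are drawn from N(0, 1)."""
    dots = []
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(d_k)]
        k = [random.gauss(0, 1) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials

for d_k in (8, 32, 128):
    raw = dot_variance(d_k)
    # Dividing every score by sqrt(d_k) divides the variance by d_k.
    scaled = raw / d_k
    print(f"d_k={d_k:4d}  var(q·k) ≈ {raw:7.2f}  var(q·k / sqrt(d_k)) ≈ {scaled:.2f}")
```

The unscaled variance should come out close to $d_k$ in each row, and the scaled variance close to 1, regardless of the head dimension.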

In [6]:
scale = math.sqrt(D_K)
print(f"Scale factor: sqrt({D_K}) = {scale:.4f}")
print()

scaled_scores = [[s / scale for s in row] for row in scores]

print(f"Scaled Scores (scores / {scale:.4f})")
for i, row in enumerate(scaled_scores):
    print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")

Scale factor: sqrt(8) = 2.8284

Scaled Scores (scores / 2.8284)
  [-0.0045,  0.0075, -0.0054,  0.0075, -0.0048]  # pos 0: <BOS>
  [ 0.0007, -0.0047,  0.0042, -0.0009,  0.0032]  # pos 1: I
  [-0.0049,  0.0034, -0.0014,  0.0060, -0.0021]  # pos 2: like
  [-0.0006, -0.0042,  0.0016, -0.0006,  0.0031]  # pos 3: transformers
  [-0.0008,  0.0030, -0.0008, -0.0006, -0.0024]  # pos 4: <EOS>


## Step 3: Apply Causal Mask

For autoregressive (decoder-only) transformers, each position can only attend to **previous positions** (including itself). We set future positions to $-\infty$:

$$\text{masked\_scores}_{ij} = \begin{cases}
\text{scaled\_scores}_{ij} & \text{if } j \leq i \\
-\infty & \text{if } j > i
\end{cases}$$

**Mask pattern:**
```
Position:     0  1  2  3  4
0 (<BOS>)   [ ✓  ✗  ✗  ✗  ✗ ]  can only see itself
1 (I)       [ ✓  ✓  ✗  ✗  ✗ ]  can see 0, 1
2 (like)    [ ✓  ✓  ✓  ✗  ✗ ]  can see 0, 1, 2
3 (trans.)  [ ✓  ✓  ✓  ✓  ✗ ]  can see 0, 1, 2, 3
4 (<EOS>)   [ ✓  ✓  ✓  ✓  ✓ ]  can see all
```
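
The pattern above depends only on the sequence length, so it can be built once and reused for every head. A minimal sketch of that idea follows; `causal_mask` and `apply_mask` are illustrative names, not functions defined elsewhere in this notebook.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len grid: True where position i may attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def apply_mask(scores, mask):
    """Keep allowed scores; replace blocked ones with -inf so softmax gives them weight 0."""
    return [
        [s if allowed else float("-inf") for s, allowed in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]

# Print the same check/cross pattern for a 5-token sequence.
mask = causal_mask(5)
for i, row in enumerate(mask):
    print(i, " ".join("✓" if allowed else "✗" for allowed in row))
```

Separating the mask from the scores is a design choice: the loop-based version in the next cell does the same thing in one pass.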

In [7]:
# Apply causal mask
masked_scores = []
for i in range(seq_len):
    row = []
    for j in range(seq_len):
        if j <= i:
            row.append(scaled_scores[i][j])
        else:
            row.append(float('-inf'))
    masked_scores.append(row)

print("Masked Scores (future positions set to -inf)")
for i, row in enumerate(masked_scores):
    row_str = [f"{v:7.4f}" if v != float('-inf') else "   -inf" for v in row]
    print(f"  [{', '.join(row_str)}]  # pos {i}: {TOKEN_NAMES[tokens[i]]}")

Masked Scores (future positions set to -inf)
  [-0.0045,    -inf,    -inf,    -inf,    -inf]  # pos 0: <BOS>
  [ 0.0007, -0.0047,    -inf,    -inf,    -inf]  # pos 1: I
  [-0.0049,  0.0034, -0.0014,    -inf,    -inf]  # pos 2: like
  [-0.0006, -0.0042,  0.0016, -0.0006,    -inf]  # pos 3: transformers
  [-0.0008,  0.0030, -0.0008, -0.0006, -0.0024]  # pos 4: <EOS>


## Step 4: Apply Softmax

Convert scores to probabilities (each row sums to 1):

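$$\text{weights}_{ij} = \frac{e^{\text{masked\_scores}_{ij}}}{\sum_{k=0}^{n-1} e^{\text{masked\_scores}_{ik}}}$$

Here $n$ is the sequence length and positions are indexed from 0, as in the rest of the notebook. Masked entries contribute $e^{-\infty} = 0$ to the sum, so they receive zero weight.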

**What is softmax?**

The softmax function converts a vector of arbitrary real numbers into a probability distribution. All values end up between 0 and 1, and they sum to 1.

It's called "soft" max because it emphasizes the largest values while still keeping smaller values non-zero (unlike a "hard" max that just picks the biggest and zeroes everything else).
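
The contrast between "soft" and "hard" max can be seen on a small example. This is a sketch on made-up scores; `hard_max` is a hypothetical helper for illustration, and this `softmax` only handles finite inputs (the notebook's own version also handles `-inf`).

```python
import math

def softmax(vec):
    """Numerically stable softmax for finite inputs."""
    peak = max(vec)
    exps = [math.exp(v - peak) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]

def hard_max(vec):
    """One-hot vector: 1 at the first largest entry, 0 everywhere else."""
    winner = vec.index(max(vec))
    return [1.0 if i == winner else 0.0 for i in range(len(vec))]

scores = [2.0, 1.0, 0.1]  # hypothetical scores, chosen only for illustration

print("softmax :", [round(p, 3) for p in softmax(scores)])
print("hard max:", hard_max(scores))

# Multiplying the scores by a large factor pushes softmax toward the hard max.
print("softmax of 10x scores:", [round(p, 3) for p in softmax([10 * s for s in scores])])
```

With the original scores every entry keeps a non-zero share. With the scores multiplied by 10, nearly all of the weight lands on the largest entry, which is the saturation that the scaling in Step 2 is meant to avoid.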

In [8]:
# Example: softmax on position 1's scores
print("Example: Softmax for position 1 ('I')")
print("="*60)
print()
print(f"Masked scores for pos 1: {masked_scores[1][:2]} (only first 2 visible)")
print()

# Manual calculation
s0, s1 = masked_scores[1][0], masked_scores[1][1]
exp_0 = math.exp(s0)
exp_1 = math.exp(s1)
sum_exp = exp_0 + exp_1

print(f"exp({s0:.4f}) = {exp_0:.4f}")
print(f"exp({s1:.4f}) = {exp_1:.4f}")
print(f"sum = {sum_exp:.4f}")
print()
print(f"weight[1,0] = {exp_0:.4f} / {sum_exp:.4f} = {exp_0/sum_exp:.4f}")
print(f"weight[1,1] = {exp_1:.4f} / {sum_exp:.4f} = {exp_1/sum_exp:.4f}")
print()
print(f"Sum of weights: {exp_0/sum_exp + exp_1/sum_exp:.4f} (should be 1.0)")

Example: Softmax for position 1 ('I')

Masked scores for pos 1: [0.000737600323286676, -0.004733186959169455] (only first 2 visible)

exp(0.0007) = 1.0007
exp(-0.0047) = 0.9953
sum = 1.9960

weight[1,0] = 1.0007 / 1.9960 = 0.5014
weight[1,1] = 0.9953 / 1.9960 = 0.4986

Sum of weights: 1.0000 (should be 1.0)


In [9]:
# Apply softmax to all rows
attention_weights = [softmax(row) for row in masked_scores]

print("Attention Weights (after softmax)")
for i, row in enumerate(attention_weights):
    print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")
    
print()
print("Interpretation:")
print("- Position 0 (<BOS>) attends 100% to itself")
print("- Position 1 (I) attends ~50% to <BOS>, ~50% to itself")
print("- Later positions spread attention more evenly")

Attention Weights (after softmax)
  [ 1.0000,  0.0000,  0.0000,  0.0000,  0.0000]  # pos 0: <BOS>
  [ 0.5014,  0.4986,  0.0000,  0.0000,  0.0000]  # pos 1: I
  [ 0.3320,  0.3348,  0.3332,  0.0000,  0.0000]  # pos 2: like
  [ 0.2501,  0.2492,  0.2506,  0.2501,  0.0000]  # pos 3: transformers
  [ 0.1999,  0.2007,  0.1999,  0.2000,  0.1996]  # pos 4: <EOS>

Interpretation:
- Position 0 (<BOS>) attends 100% to itself
- Position 1 (I) attends ~50% to <BOS>, ~50% to itself
- Later positions spread attention more evenly


## Step 5: Compute Weighted Sum of Values

Multiply attention weights by values to get the final output:

$$\text{output} = \text{weights} \cdot V$$

Each output vector is a weighted combination of all the value vectors that this position can attend to.

In [10]:
# Compute attention output
attention_output = matmul(attention_weights, V)

print(f"Attention Output for Head {head}")
print(f"Shape: [{seq_len}, {seq_len}] @ [{seq_len}, {D_K}] = [{seq_len}, {D_K}]")
print()
for i, row in enumerate(attention_output):
    print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")

Attention Output for Head 0
Shape: [5, 5] @ [5, 8] = [5, 8]

  [ 0.0800,  0.0257, -0.0117, -0.1056,  0.0339, -0.0891, -0.0083, -0.0737]  # pos 0: <BOS>
  [ 0.0683,  0.0368, -0.0263, -0.0574,  0.0152, -0.0174, -0.0084, -0.0760]  # pos 1: I
  [ 0.0247,  0.0789,  0.0074, -0.0635,  0.0180, -0.0098, -0.0184, -0.0173]  # pos 2: like
  [ 0.0254,  0.0511, -0.0182, -0.0322,  0.0103, -0.0126, -0.0282,  0.0018]  # pos 3: transformers
  [ 0.0325,  0.0367, -0.0202, -0.0262,  0.0188, -0.0040, -0.0321,  0.0167]  # pos 4: <EOS>


In [11]:
# Detailed calculation for position 1
print("Detailed: Output for position 1 ('I')")
print("="*60)
print()
print(f"Weights: {attention_weights[1][:2]} (attending to pos 0 and 1)")
print(f"V[0] (value for <BOS>): {format_vector(V[0])}")
print(f"V[1] (value for I):     {format_vector(V[1])}")
print()
w0, w1 = attention_weights[1][0], attention_weights[1][1]
print(f"output[1] = {w0:.4f} × V[0] + {w1:.4f} × V[1]")
print()
manual_output = [w0 * V[0][d] + w1 * V[1][d] for d in range(D_K)]
print(f"Result: {format_vector(manual_output)}")

Detailed: Output for position 1 ('I')

Weights: [0.5013676934094159, 0.4986323065905841] (attending to pos 0 and 1)
V[0] (value for <BOS>): [ 0.0800,  0.0257, -0.0117, -0.1056,  0.0339, -0.0891, -0.0083, -0.0737]
V[1] (value for I):     [ 0.0565,  0.0479, -0.0409, -0.0089, -0.0037,  0.0547, -0.0085, -0.0782]

output[1] = 0.5014 × V[0] + 0.4986 × V[1]

Result: [ 0.0683,  0.0368, -0.0263, -0.0574,  0.0152, -0.0174, -0.0084, -0.0760]


## Complete Attention for Both Heads

Let's compute attention for both heads:

In [12]:
def compute_attention(Q, K, V, mask=True):
    """Compute scaled dot-product attention"""
    seq_len = len(Q)
    d_k = len(Q[0])
    scale = math.sqrt(d_k)
    
    # Step 1: Q @ K^T
    K_T = transpose(K)
    scores = matmul(Q, K_T)
    
    # Step 2: Scale
    scaled = [[s / scale for s in row] for row in scores]
    
    # Step 3: Causal mask
    if mask:
        for i in range(seq_len):
            for j in range(seq_len):
                if j > i:
                    scaled[i][j] = float('-inf')
    
    # Step 4: Softmax
    weights = [softmax(row) for row in scaled]
    
    # Step 5: Weighted sum
    output = matmul(weights, V)
    
    return weights, output

# Compute for both heads
attention_weights_all = []
attention_output_all = []

for h in range(NUM_HEADS):
    weights, output = compute_attention(Q_all[h], K_all[h], V_all[h])
    attention_weights_all.append(weights)
    attention_output_all.append(output)
    
    print(f"\nHEAD {h} - Attention Weights")
    for i, row in enumerate(weights):
        print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")


HEAD 0 - Attention Weights
  [ 1.0000,  0.0000,  0.0000,  0.0000,  0.0000]  # pos 0: <BOS>
  [ 0.5014,  0.4986,  0.0000,  0.0000,  0.0000]  # pos 1: I
  [ 0.3320,  0.3348,  0.3332,  0.0000,  0.0000]  # pos 2: like
  [ 0.2501,  0.2492,  0.2506,  0.2501,  0.0000]  # pos 3: transformers
  [ 0.1999,  0.2007,  0.1999,  0.2000,  0.1996]  # pos 4: <EOS>

HEAD 1 - Attention Weights
  [ 1.0000,  0.0000,  0.0000,  0.0000,  0.0000]  # pos 0: <BOS>
  [ 0.5009,  0.4991,  0.0000,  0.0000,  0.0000]  # pos 1: I
  [ 0.3342,  0.3337,  0.3322,  0.0000,  0.0000]  # pos 2: like
  [ 0.2514,  0.2494,  0.2510,  0.2482,  0.0000]  # pos 3: transformers
  [ 0.1999,  0.1997,  0.2001,  0.2000,  0.2003]  # pos 4: <EOS>


In [13]:
for h in range(NUM_HEADS):
    print(f"\nHEAD {h} - Attention Output")
    print(f"Shape: [{seq_len}, {D_K}]")
    for i, row in enumerate(attention_output_all[h]):
        print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")


HEAD 0 - Attention Output
Shape: [5, 8]
  [ 0.0800,  0.0257, -0.0117, -0.1056,  0.0339, -0.0891, -0.0083, -0.0737]  # pos 0: <BOS>
  [ 0.0683,  0.0368, -0.0263, -0.0574,  0.0152, -0.0174, -0.0084, -0.0760]  # pos 1: I
  [ 0.0247,  0.0789,  0.0074, -0.0635,  0.0180, -0.0098, -0.0184, -0.0173]  # pos 2: like
  [ 0.0254,  0.0511, -0.0182, -0.0322,  0.0103, -0.0126, -0.0282,  0.0018]  # pos 3: transformers
  [ 0.0325,  0.0367, -0.0202, -0.0262,  0.0188, -0.0040, -0.0321,  0.0167]  # pos 4: <EOS>

HEAD 1 - Attention Output
Shape: [5, 8]
  [ 0.0107, -0.0291, -0.0100, -0.0312,  0.0214,  0.0372,  0.0105,  0.0279]  # pos 0: <BOS>
  [-0.0199, -0.0151,  0.0026,  0.0107,  0.0091, -0.0204, -0.0320, -0.0193]  # pos 1: I
  [-0.0320, -0.0102,  0.0178, -0.0153,  0.0433,  0.0026,  0.0002, -0.0198]  # pos 2: like
  [-0.0111, -0.0085,  0.0093,  0.0101,  0.0440,  0.0237,  0.0056, -0.0311]  # pos 3: transformers
  [-0.0119, -0.0013, -0.0069,  0.0016,  0.0480,  0.0233,  0.0096, -0.0121]  # pos 4: <EOS>


## Interpreting Attention Patterns

The attention weights tell us what each token is "looking at":

- **Position 0 (`<BOS>`)** can only attend to itself (100%)
- **Position 1 (`I`)** attends ~50% to `<BOS>` and ~50% to itself
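- **Later positions** spread attention almost uniformly across every token they can see (roughly $1/(i+1)$ each at position $i$)

The attention is nearly uniform because the weights are randomly initialized with a small scale (0.1), so every score is close to zero, and softmax of near-equal scores is close to a uniform distribution. The model hasn't learned anything meaningful yet.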
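
If the weights were exactly uniform, each output row would simply be the mean of the value vectors up to that position. A minimal sketch of that baseline follows; `prefix_means` and `V_toy` are illustrative names, not part of the notebook.

```python
def prefix_means(V):
    """Row i is the average of V[0..i], i.e. what exactly uniform causal attention would output."""
    d = len(V[0])
    running = [0.0] * d
    means = []
    for i, row in enumerate(V):
        # Accumulate the running sum, then divide by the number of visible positions.
        running = [r + x for r, x in zip(running, row)]
        means.append([r / (i + 1) for r in running])
    return means

# Hypothetical values, small enough to check by hand.
V_toy = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
for i, row in enumerate(prefix_means(V_toy)):
    print(f"pos {i}: {row}")
```

Calling `prefix_means(V)` on the head 0 values and comparing with `attention_output` should show close agreement, since the attention weights above are within a fraction of a percent of uniform.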

In a trained model, you'd see more interesting patterns:
- Verbs attending strongly to their subjects
- Pronouns attending to their antecedents
- Related concepts attending to each other

## What's Next

We've got attention outputs from both heads. Now we need to combine them:
1. **Concatenate** the outputs from both heads
2. **Project** the concatenated result back to $d_{\text{model}}$ dimensions
3. Apply **residual connections** and **layer normalization**
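
As a preview of step 1 only, concatenation just joins each position's per-head vectors end to end. The sketch below uses made-up numbers; `concat_heads` and `toy_heads` are hypothetical names, and the projection, residual, and layer-norm steps are not shown.

```python
def concat_heads(head_outputs):
    """Join per-head outputs along the feature axis.

    head_outputs[h][i] is the vector for position i from head h.
    Returns one row per position with length num_heads * d_k.
    """
    seq_len = len(head_outputs[0])
    return [
        [x for head in head_outputs for x in head[i]]
        for i in range(seq_len)
    ]

# Hypothetical outputs: 2 heads, 2 positions, d_k = 2.
toy_heads = [
    [[1, 2], [3, 4]],  # head 0
    [[5, 6], [7, 8]],  # head 1
]
print(concat_heads(toy_heads))  # → [[1, 2, 5, 6], [3, 4, 7, 8]]
```

Applied to `attention_output_all`, this would give 5 rows of length 16, which matches `D_MODEL` because `D_K = D_MODEL // NUM_HEADS`.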

Then it's on to the feed-forward network. We're making progress!

In [14]:
# Store for next notebook
attention_data = {
    'attention_weights': attention_weights_all,
    'attention_output': attention_output_all,
    'X': X,
    'tokens': tokens,
    'Q': Q_all,
    'K': K_all,
    'V': V_all
}
print("Attention data stored for next notebook.")

Attention data stored for next notebook.
