# 2. QKV Projections

**Transforming embeddings into Query, Key, and Value representations**

We've got our embedding matrix `X`. Now things get interesting.

The next step is to compute **Query (Q)**, **Key (K)**, and **Value (V)** projections. These are the fundamental building blocks of the attention mechanism (and where a lot of the magic happens).

## What are Q, K, V?

The attention mechanism lets each token decide how much attention to pay to every other token in the sequence. Think of it like a database lookup, but fuzzy and learned:

- **Query (Q)**: "What am I looking for?"
- **Key (K)**: "What do I contain?"
- **Value (V)**: "What information do I have to offer?"

We take each token's embedding and project it into these three different representations using learned weight matrices. Same input, three different views.
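
To make the lookup analogy concrete, here is a minimal sketch contrasting an exact dictionary lookup with a "fuzzy" one. The toy keys and values and the helper names `softmax` and `soft_lookup` are illustrative assumptions, not part of this notebook's model. The scoring used here (a plain dot product followed by softmax) is only a rough preview of what the attention steps formalize later.

```python
import math

# Exact lookup: the query must match a key exactly, and we get back one value.
table = {"cat": 1.0, "dog": 2.0}
exact = table["cat"]

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Fuzzy lookup: score the query against every key, then blend all values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, blended

# Toy data (hypothetical): two keys, each paired with a 2-dimensional value.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
query = [2.0, 0.0]  # points in the same direction as the first key

weights, blended = soft_lookup(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("blended:", [round(b, 3) for b in blended])
```

Every key contributes something, but the best-matching key contributes the most. In the model, the queries, keys, and values are learned projections rather than hand-picked vectors.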

## Multi-Head Attention Structure

Our model uses **multi-head attention** with `num_heads = 2`. This means we compute attention independently in 2 different subspaces, then combine the results.

**Architecture:**
- **d_model:** 16 (embedding dimension)
- **num_heads:** 2
- **d_k:** d_model / num_heads = 8 (dimension per head)

Each head has its own set of weight matrices that project the 16-dimensional embeddings into 8-dimensional Q, K, V representations.
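
The arithmetic behind these shapes can be sketched in a few lines. The lowercase variable names are illustrative, and the parameter count assumes plain weight matrices with no bias terms, which is how the weights are built in this notebook.

```python
# Hyperparameters as stated in this section.
d_model = 16
num_heads = 2

# d_model must split evenly across heads for d_k to be an integer.
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
d_k = d_model // num_heads

# One [d_model, d_k] matrix each for Q, K, and V, per head (no bias terms assumed).
params_per_matrix = d_model * d_k
params_per_head = 3 * params_per_matrix
total_qkv_params = num_heads * params_per_head

print(f"d_k: {d_k}")
print(f"parameters per weight matrix: {params_per_matrix}")
print(f"parameters per head (Q, K, V): {params_per_head}")
print(f"total Q/K/V parameters: {total_qkv_params}")
```

Because `d_k = d_model / num_heads`, the total works out to `3 * d_model * d_model` no matter how many heads are used: more heads means narrower heads, not more parameters.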

In [None]:
import random

# Set seed for reproducibility (same as previous notebook)
random.seed(42)

# Model hyperparameters
VOCAB_SIZE = 6
D_MODEL = 16
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS  # 8

TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]

print(f"d_model: {D_MODEL}")
print(f"num_heads: {NUM_HEADS}")
print(f"d_k (dimension per head): {D_K}")

In [None]:
# Helper functions
def random_vector(size, scale=0.1):
    """Generate a random vector with values ~ N(0, scale^2)"""
    return [random.gauss(0, scale) for _ in range(size)]

def random_matrix(rows, cols, scale=0.1):
    """Generate a random matrix with values ~ N(0, scale^2)"""
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def add_vectors(v1, v2):
    """Element-wise addition of two vectors"""
    return [a + b for a, b in zip(v1, v2)]

def matmul(A, B):
    """Multiply matrices A @ B where A is [m, n] and B is [n, p]"""
    m, n = len(A), len(A[0])
    p = len(B[0])
    result = [[0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            result[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
    return result

def format_vector(vec, decimals=4):
    """Format vector as string with specified decimal places"""
    return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"

In [None]:
# Recreate embeddings from previous notebook (same random seed)
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]

tokens = [1, 3, 4, 5, 2]  # <BOS>, I, like, transformers, <EOS>
seq_len = len(tokens)

token_embeddings = [E_token[token_id] for token_id in tokens]
X = [add_vectors(token_embeddings[i], E_pos[i]) for i in range(seq_len)]

print(f"Embedding matrix X: [{seq_len}, {D_MODEL}]")
print("(Recreated from previous notebook with same random seed)")

## Weight Matrices

For each head, we need three weight matrices:

**For Head 0:**
- **W_Q[0]:** Query weight matrix `[16, 8]`
- **W_K[0]:** Key weight matrix `[16, 8]`
- **W_V[0]:** Value weight matrix `[16, 8]`

**For Head 1:**
- **W_Q[1]:** Query weight matrix `[16, 8]`
- **W_K[1]:** Key weight matrix `[16, 8]`
- **W_V[1]:** Value weight matrix `[16, 8]`

These matrices are initialized with small random values and are learned during training.
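
To get a feel for what "small" means here, the sketch below estimates how large a single projected value comes out when the input and the weights are drawn the way this notebook draws them (Gaussian with scale 0.1, and an input formed by adding two such embeddings). It assumes all entries are independent. The trial count, the seed, and the variable names are illustrative choices, not part of the model.

```python
import math
import random

random.seed(0)  # arbitrary seed, for illustration only

dim = 16          # matches d_model in this section
scale = 0.1       # matches the default scale of random_vector / random_matrix
n_trials = 20000

samples = []
for _ in range(n_trials):
    # One input row: token embedding + positional embedding, each ~ N(0, scale^2).
    x = [random.gauss(0, scale) + random.gauss(0, scale) for _ in range(dim)]
    # One column of a weight matrix ~ N(0, scale^2).
    w = [random.gauss(0, scale) for _ in range(dim)]
    samples.append(sum(a * b for a, b in zip(x, w)))

mean = sum(samples) / n_trials
empirical_std = math.sqrt(sum((s - mean) ** 2 for s in samples) / n_trials)

# Variance of a sum of dim independent zero-mean products: dim * Var(x) * Var(w).
predicted_std = math.sqrt(dim * (2 * scale**2) * (scale**2))

print(f"predicted std: {predicted_std:.4f}")
print(f"empirical std: {empirical_std:.4f}")
```

The predicted value is sqrt(16 × 0.02 × 0.01) ≈ 0.057, and the empirical estimate should land close to it. Under these assumptions, freshly initialized projections are smaller in magnitude than the embeddings they are computed from.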

In [None]:
# Initialize weight matrices for each head
W_Q = []
W_K = []
W_V = []

for head in range(NUM_HEADS):
    W_Q.append(random_matrix(D_MODEL, D_K))  # [16, 8]
    W_K.append(random_matrix(D_MODEL, D_K))  # [16, 8]
    W_V.append(random_matrix(D_MODEL, D_K))  # [16, 8]
    
print(f"Initialized weight matrices for {NUM_HEADS} heads")
print(f"Each W_Q, W_K, W_V has shape [{D_MODEL}, {D_K}]")

## Matrix Multiplication Basics

Before we dive into the actual Q, K, V calculations, let's review **matrix multiplication**. It's the core operation we'll be using... basically everywhere in this project.

### How Matrix Multiplication Works

When we multiply two matrices `A @ B`:
- **A** has shape `[m, n]` (m rows, n columns)
- **B** has shape `[n, p]` (n rows, p columns)
- The result has shape `[m, p]` (m rows, p columns)

**Key requirement:** The number of columns in A must equal the number of rows in B.

**The operation:** To compute element `[i, j]` in the result:

$$\text{result}_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj}$$

In other words, we take row $i$ from A, column $j$ from B, multiply corresponding elements, and sum them up.
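
The key requirement can also be enforced in code. The `matmul` helper defined earlier trusts its inputs: if `B` has too many rows the extra rows are silently ignored, and if it has too few the call fails with an `IndexError`. Below is a minimal sketch of a stricter variant. The name `checked_matmul` is hypothetical and the rest of this notebook does not use it.

```python
def checked_matmul(A, B):
    """Multiply A @ B for nested lists, raising ValueError on a shape mismatch."""
    m, n = len(A), len(A[0])
    if len(B) != n:
        raise ValueError(
            f"shape mismatch: A is [{m}, {n}] but B has {len(B)} rows"
        )
    p = len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
        for i in range(m)
    ]

# Compatible shapes: [1, 2] @ [2, 1] -> [1, 1]
print(checked_matmul([[1, 2]], [[3], [4]]))

# Incompatible shapes: [1, 2] @ [1, 2] is rejected with a readable message.
try:
    checked_matmul([[1, 2]], [[3, 4]])
except ValueError as err:
    print("Rejected:", err)
```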

In [None]:
# Simple example of matrix multiplication
A = [
    [1, 2, 3],
    [4, 5, 6]
]
B = [
    [1, 4],
    [2, 5],
    [3, 6]
]

print("Example: [2,3] @ [3,2] = [2,2]")
print()
print("A =", A)
print("B =", B)
print()
result = matmul(A, B)
print("Result[0,0] = (1*1) + (2*2) + (3*3) =", 1*1 + 2*2 + 3*3)
print("Result[0,1] = (1*4) + (2*5) + (3*6) =", 1*4 + 2*5 + 3*6)
print("Result[1,0] = (4*1) + (5*2) + (6*3) =", 4*1 + 5*2 + 6*3)
print("Result[1,1] = (4*4) + (5*5) + (6*6) =", 4*4 + 5*5 + 6*6)
print()
print("Result =", result)

## The Projection Operation

Now we can compute Q, K, V for each head using matrix multiplication:

$$
\begin{aligned}
Q_{\text{head}} &= X W_Q^{(\text{head})} \\
K_{\text{head}} &= X W_K^{(\text{head})} \\
V_{\text{head}} &= X W_V^{(\text{head})}
\end{aligned}
$$

Where:
- `X` has shape `[seq_len, d_model]` = `[5, 16]`
- `W_Q[head]`, `W_K[head]`, `W_V[head]` have shape `[d_model, d_k]` = `[16, 8]`
- `Q[head]`, `K[head]`, `V[head]` have shape `[seq_len, d_k]` = `[5, 8]`

Each row of the result represents the Q/K/V vector for one token position.
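
A common implementation choice is to stack the per-head matrices side by side into a single `[d_model, d_model]` matrix, do one multiplication, and slice the result per head. The sketch below checks on tiny random data that this gives the same numbers as the per-head projections. The toy sizes, the seed, and the names `toy_matmul`, `hconcat`, `X_toy`, and `W_Q_toy` are assumptions for illustration. This notebook itself keeps the matrices separate.

```python
import random

random.seed(0)  # arbitrary seed, for illustration only

def toy_matmul(A, B):
    """Multiply A @ B where A is [m, n] and B is [n, p]."""
    n, p = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(n)) for j in range(p)] for row in A]

def hconcat(matrices):
    """Concatenate matrices with equal row counts along the column axis."""
    n_rows = len(matrices[0])
    return [[v for M in matrices for v in M[r]] for r in range(n_rows)]

n_tokens, dim, heads = 3, 4, 2   # deliberately tiny toy sizes
head_dim = dim // heads

X_toy = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_tokens)]
W_Q_toy = [
    [[random.gauss(0, 0.1) for _ in range(head_dim)] for _ in range(dim)]
    for _ in range(heads)
]

# Per-head projections: one [n_tokens, head_dim] matrix per head.
Q_per_head = [toy_matmul(X_toy, W_Q_toy[h]) for h in range(heads)]

# Fused projection: one [n_tokens, dim] matrix, sliced into heads afterwards.
W_Q_full = hconcat(W_Q_toy)            # [dim, dim]
Q_full = toy_matmul(X_toy, W_Q_full)   # [n_tokens, dim]
Q_sliced = [
    [row[h * head_dim:(h + 1) * head_dim] for row in Q_full] for h in range(heads)
]

max_diff = max(
    abs(a - b)
    for h in range(heads)
    for row_a, row_b in zip(Q_per_head[h], Q_sliced[h])
    for a, b in zip(row_a, row_b)
)
print(f"largest difference between the two approaches: {max_diff:.2e}")
```

Keeping one matrix per head, as this notebook does, makes each head's computation easier to inspect. The fused form is mainly a convenience when a fast matrix library is available.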

In [None]:
# Compute Q, K, V for each head
Q_all = []
K_all = []
V_all = []

for head in range(NUM_HEADS):
    Q = matmul(X, W_Q[head])  # [5, 16] @ [16, 8] = [5, 8]
    K = matmul(X, W_K[head])
    V = matmul(X, W_V[head])
    
    Q_all.append(Q)
    K_all.append(K)
    V_all.append(V)

print(f"Computed Q, K, V for {NUM_HEADS} heads")
print(f"Each Q, K, V has shape [{seq_len}, {D_K}]")

## Detailed Calculation Example

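Let's walk through computing `Q[0][0]`, the query vector for the first token `<BOS>` in head 0 (stored as `Q_all[0][0]` in the code). The first index selects the head and the second selects the token position.

For each output dimension `j` (0 to 7), we compute: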
```
Q[0][0][j] = sum(X[0][i] * W_Q[0][i][j] for i in range(16))
```

In [None]:
# Detailed calculation for Q[0][0] (query for <BOS> in head 0)
print("Computing Q[0][0] - Query for <BOS> in Head 0")
print("="*60)
print()
print("Input: X[0] (embedding for <BOS>):")
print(f"  {format_vector(X[0])}")
print()
print("Operation: Q[0][0] = X[0] @ W_Q[0]")
print(f"  [1, {D_MODEL}] @ [{D_MODEL}, {D_K}] = [1, {D_K}]")
print()

# Show detailed computation for first output dimension
print("Example: Computing Q[0][0][0] (first dimension):")
terms = [f"({X[0][i]:.4f} * {W_Q[0][i][0]:.4f})" for i in range(4)]
print(f"  = {' + '.join(terms)} + ...")
val = sum(X[0][i] * W_Q[0][i][0] for i in range(D_MODEL))
print(f"  = {val:.4f}")
print()

print(f"Result: Q[0][0] = {format_vector(Q_all[0][0])}")

## Head 0 Results

In [None]:
print("HEAD 0 - Query Matrix Q[0]")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(Q_all[0]):
    print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")

In [None]:
print("HEAD 0 - Key Matrix K[0]")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(K_all[0]):
    print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")

In [None]:
print("HEAD 0 - Value Matrix V[0]")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(V_all[0]):
    print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")

## Head 1 Results

The second head uses different weight matrices, producing different Q, K, V representations:

In [None]:
print("HEAD 1 - Query Matrix Q[1]")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(Q_all[1]):
    print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")

In [None]:
print("HEAD 1 - Key Matrix K[1]")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(K_all[1]):
    print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")

In [None]:
print("HEAD 1 - Value Matrix V[1]")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(V_all[1]):
    print(f"  {format_vector(row)}  # pos {i}: {TOKEN_NAMES[tokens[i]]}")

## Why Multiple Heads?

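Notice that Head 0 and Head 1 produce different Q, K, V representations for the same input `X`, because each head projects through its own weight matrices. At this point the weights are still random, so the differences carry no meaning yet. The benefit of multi-head attention comes from training, which lets each head specialize.

Different heads can learn to focus on different types of relationships: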

- **Head 0** might learn to focus on syntactic relationships (like subject-verb agreement)
- **Head 1** might learn to focus on semantic relationships (like related concepts)

It's like having multiple experts examining the same data from different perspectives. Later, we'll combine the outputs from both heads to get a richer, more nuanced representation.
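
As a preview of that combining step, here is a minimal sketch that assumes the standard multi-head formulation, in which per-head outputs are concatenated along the feature axis (typically followed by an output projection, omitted here). The helper name `concat_heads` and the stand-in numbers are hypothetical. Whether this project combines heads in exactly this way is not shown in this section.

```python
def concat_heads(head_outputs):
    """Join per-head [seq_len, d_k] matrices into one [seq_len, num_heads * d_k] matrix."""
    n_rows = len(head_outputs[0])
    return [
        [value for head in head_outputs for value in head[r]]
        for r in range(n_rows)
    ]

# Stand-in per-head outputs (hypothetical values): 2 heads, 3 tokens, d_k = 2.
head_0 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
head_1 = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]

combined = concat_heads([head_0, head_1])
for row in combined:
    print(row)
```

With this notebook's sizes, two `[5, 8]` head outputs would concatenate back into a `[5, 16]` matrix, the same shape as `X`.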

## What's Next

Now that we have Q, K, V for both heads, we can compute attention itself in three steps:
1. **Attention scores**: How much should each token attend to every other token?
2. **Attention weights**: Normalized scores (using softmax)
3. **Attention output**: Weighted combination of value vectors

This is where the "attention" actually happens. Let's dive in.

In [None]:
# Store for next notebook
qkv_data = {
    'X': X,
    'tokens': tokens,
    'W_Q': W_Q,
    'W_K': W_K,
    'W_V': W_V,
    'Q': Q_all,
    'K': K_all,
    'V': V_all,
    'D_MODEL': D_MODEL,
    'D_K': D_K,
    'NUM_HEADS': NUM_HEADS
}
print("QKV data stored for next notebook.")