
# End-to-End Self-Attention Walkthrough (PyTorch)

This notebook takes a *single simple sentence* and walks through **every step**:
1. Tokenization → Token IDs  
2. Token embeddings + positional embeddings  
3. Compute **Q**, **K**, **V**  
4. Compute attention **scores**, **weights (softmax)**, and the **context vectors** (attention head output)  
5. (Optional) Extend to **multi-head attention**  
6. Interpretability notes

We keep the numbers small and deterministic for clarity.


In [1]:

import math
import torch
import torch.nn as nn
torch.set_printoptions(precision=4, sci_mode=False)
torch.manual_seed(0)  # reproducible numbers




## 1) Tokenization → Token IDs

We'll use a **tiny whitespace tokenizer** with a small fixed vocabulary for clarity.


In [2]:

# Tiny vocab for a toy example
vocab = {
    "<pad>": 0,
    "<bos>": 1,
    "<eos>": 2,
    "time": 3,
    "flies": 4,
    "fast": 5,
}

def tokenize(text):
    return text.lower().strip().split()

def encode(tokens):
    # Add BOS and EOS to be explicit (optional, but educational).
    # This toy vocab has no <unk> entry, so unknown words fall back to ID 0 (<pad>).
    ids = [vocab["<bos>"]] + [vocab.get(t, 0) for t in tokens] + [vocab["<eos>"]]
    return torch.tensor(ids, dtype=torch.long)

sentence = "Time flies fast"
tokens = tokenize(sentence)
ids = encode(tokens)
print("Sentence:", sentence)
print("Tokens  :", tokens)
print("Token IDs:", ids)


Sentence: Time flies fast
Tokens  : ['time', 'flies', 'fast']
Token IDs: tensor([1, 3, 4, 5, 2])
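
To make the token ↔ ID mapping concrete, here is a minimal sketch of the reverse direction. The helpers `decode` and `encode_plain` and the lookup `id_to_token` are hypothetical names added for illustration (they are not part of the notebook); `encode_plain` mirrors `encode` but returns a plain list so the snippet runs without PyTorch.

```python
# Toy vocabulary copied from the cell above.
vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "time": 3, "flies": 4, "fast": 5}

# Hypothetical helper (not in the original notebook): invert the mapping.
id_to_token = {idx: tok for tok, idx in vocab.items()}

def decode(ids):
    """Map a sequence of integer IDs back to token strings."""
    return [id_to_token[int(i)] for i in ids]

def encode_plain(tokens):
    """Same logic as encode(), but returns a plain list instead of a tensor."""
    return [vocab["<bos>"]] + [vocab.get(t, 0) for t in tokens] + [vocab["<eos>"]]

# Round trip for the example sentence.
round_trip = decode(encode_plain(["time", "flies", "fast"]))
print(round_trip)

# An out-of-vocabulary word falls back to ID 0, so it decodes as "<pad>".
oov = decode(encode_plain(["time", "flies", "slowly"]))
print(oov)
```

The second call shows a side effect of the `vocab.get(t, 0)` fallback: an unknown word becomes indistinguishable from padding. A dedicated unknown-token entry would avoid that, at the cost of one more embedding row.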



## 2) Token embeddings + positional embeddings

We use small dimensions so the numbers are easy to read.


In [3]:

emb_dim = 4
context_len = 8  # bigger than we need, just to show capacity

tok_emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=emb_dim)
pos_emb = nn.Embedding(num_embeddings=context_len, embedding_dim=emb_dim)

# Initialize with deterministic (but non-trivial) weights for readability
with torch.no_grad():
    tok_emb.weight.copy_(torch.tensor([
        [ 0.00,  0.00,  0.00,  0.00],  # <pad>
        [ 0.10,  0.20,  0.30,  0.40],  # <bos>
        [ 0.20,  0.10, -0.10,  0.00],  # <eos>
        [ 0.50,  0.10,  0.00, -0.20],  # time
        [ 0.30, -0.10,  0.40,  0.10],  # flies
        [ 0.05,  0.60, -0.20,  0.10],  # fast
    ]))
    pos_emb.weight.copy_(torch.tensor([
        [ 0.00,  0.00,  0.00,  0.00],  # position 0
        [ 0.01,  0.02,  0.03,  0.04],
        [ 0.02,  0.01, -0.01,  0.00],
        [ 0.03,  0.00,  0.01, -0.02],
        [ 0.04, -0.01,  0.02, -0.01],
        [ 0.05,  0.02,  0.00,  0.01],
        [ 0.06,  0.03, -0.02,  0.02],
        [ 0.07,  0.01,  0.01,  0.03],
    ], dtype=torch.float))

# Shape: [seq_len, emb_dim]
tok_vectors = tok_emb(ids)
pos_vectors = pos_emb(torch.arange(len(ids)))
x = tok_vectors + pos_vectors

print("Token embeddings:\n", tok_vectors)
print("\nPositional embeddings:\n", pos_vectors)
print("\nInput embeddings X (token + position):\n", x)
print("\nShape X:", x.shape)


Token embeddings:
 tensor([[ 0.1000,  0.2000,  0.3000,  0.4000],
        [ 0.5000,  0.1000,  0.0000, -0.2000],
        [ 0.3000, -0.1000,  0.4000,  0.1000],
        [ 0.0500,  0.6000, -0.2000,  0.1000],
        [ 0.2000,  0.1000, -0.1000,  0.0000]], grad_fn=<EmbeddingBackward0>)

Positional embeddings:
 tensor([[ 0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0100,  0.0200,  0.0300,  0.0400],
        [ 0.0200,  0.0100, -0.0100,  0.0000],
        [ 0.0300,  0.0000,  0.0100, -0.0200],
        [ 0.0400, -0.0100,  0.0200, -0.0100]], grad_fn=<EmbeddingBackward0>)

Input embeddings X (token + position):
 tensor([[ 0.1000,  0.2000,  0.3000,  0.4000],
        [ 0.5100,  0.1200,  0.0300, -0.1600],
        [ 0.3200, -0.0900,  0.3900,  0.1000],
        [ 0.0800,  0.6000, -0.1900,  0.0800],
        [ 0.2400,  0.0900, -0.0800, -0.0100]], grad_fn=<AddBackward0>)

Shape X: torch.Size([5, 4])
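
An embedding layer is a lookup table: calling it with IDs selects rows of its weight matrix. The sketch below checks this, assuming the same toy token weights as above; `looked_up`, `indexed`, and `same` are illustrative names, not part of the notebook.

```python
import torch
import torch.nn as nn

# Rebuild the toy token-embedding table from the cell above.
tok_emb = nn.Embedding(num_embeddings=6, embedding_dim=4)
with torch.no_grad():
    tok_emb.weight.copy_(torch.tensor([
        [0.00, 0.00, 0.00, 0.00],   # <pad>
        [0.10, 0.20, 0.30, 0.40],   # <bos>
        [0.20, 0.10, -0.10, 0.00],  # <eos>
        [0.50, 0.10, 0.00, -0.20],  # time
        [0.30, -0.10, 0.40, 0.10],  # flies
        [0.05, 0.60, -0.20, 0.10],  # fast
    ]))

ids = torch.tensor([1, 3, 4, 5, 2])

looked_up = tok_emb(ids)        # forward pass of the layer
indexed = tok_emb.weight[ids]   # plain row indexing into the weight matrix

same = torch.equal(looked_up, indexed)
print(same)
print(looked_up.shape)
```

The positional embedding works the same way, except that it is indexed by position (`torch.arange(len(ids))`) rather than by token ID.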



## 3) Q, K, V (Single Head)

Each row of X (token embedding plus positional embedding) is linearly projected into three vectors:
- **Q** (Query): “What am I looking for?”  
- **K** (Key): “What do I contain?”  
- **V** (Value): “What can I provide?”  


In [4]:

head_dim = 2  # small for readability

W_Q = nn.Linear(emb_dim, head_dim, bias=False)
W_K = nn.Linear(emb_dim, head_dim, bias=False)
W_V = nn.Linear(emb_dim, head_dim, bias=False)

# Use deterministic weights so values are interpretable.
# nn.Linear stores its weight as [out_features, in_features] and computes x @ weight.T.
with torch.no_grad():
    W_Q.weight.copy_(torch.tensor([
        [ 0.5,  0.0,  0.5,  0.0],
        [ 0.0,  0.5,  0.0, -0.5],
    ]))
    W_K.weight.copy_(torch.tensor([
        [ 0.4, -0.1,  0.3,  0.0],
        [-0.2,  0.6,  0.1,  0.2],
    ]))
    W_V.weight.copy_(torch.tensor([
        [ 0.3,  0.1, -0.2,  0.2],
        [ 0.1, -0.3,  0.4,  0.0],
    ]))

Q = W_Q(x)  # [seq_len, head_dim]
K = W_K(x)  # [seq_len, head_dim]
V = W_V(x)  # [seq_len, head_dim]

print("Q:\n", Q)
print("\nK:\n", K)
print("\nV:\n", V)


Q:
 tensor([[ 0.2000, -0.1000],
        [ 0.2700,  0.1400],
        [ 0.3550, -0.0950],
        [-0.0550,  0.2600],
        [ 0.0800,  0.0500]], grad_fn=<MmBackward0>)

K:
 tensor([[ 0.1100,  0.2100],
        [ 0.2010, -0.0590],
        [ 0.2540, -0.0590],
        [-0.0850,  0.3410],
        [ 0.0630, -0.0040]], grad_fn=<MmBackward0>)

V:
 tensor([[ 0.0700,  0.0700],
        [ 0.1270,  0.0270],
        [ 0.0290,  0.2150],
        [ 0.1380, -0.2480],
        [ 0.0950, -0.0350]], grad_fn=<MmBackward0>)
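
As a sanity check on the projection, the sketch below compares the `nn.Linear` output with an explicit matrix product and with one entry computed by hand. It uses only the first two rows of X to stay short; `Q_layer`, `Q_manual`, `matches`, and `q00` are illustrative names.

```python
import torch
import torch.nn as nn

# First two rows of X (<bos> and "time"), copied from the printed output above.
x = torch.tensor([
    [0.10, 0.20, 0.30, 0.40],
    [0.51, 0.12, 0.03, -0.16],
])

W_Q = nn.Linear(4, 2, bias=False)
with torch.no_grad():
    W_Q.weight.copy_(torch.tensor([
        [0.5, 0.0, 0.5, 0.0],
        [0.0, 0.5, 0.0, -0.5],
    ]))

Q_layer = W_Q(x)               # what the notebook computes
Q_manual = x @ W_Q.weight.T    # the same projection written out explicitly
matches = torch.allclose(Q_layer, Q_manual)
print(matches)

# First component of the first query, as a plain dot product:
# row 0 of the weight matrix with row 0 of X.
q00 = 0.5 * 0.10 + 0.0 * 0.20 + 0.5 * 0.30 + 0.0 * 0.40
print(q00)
```

Each row of the weight matrix produces one output component, so with these hand-picked values the first query component is simply half the sum of input features 0 and 2.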



## 4) Attention Scores → Weights → Context

Scores: \( \text{scores} = \frac{QK^\top}{\sqrt{d_k}} \), where \( d_k \) is the key dimension (`head_dim` in the code)  
Weights: \( \text{softmax}(\text{scores}) \) row-wise  
Context: \( Z = \text{weights} \cdot V \)


In [5]:

scale = math.sqrt(head_dim)
scores = (Q @ K.T) / scale                  # [seq_len, seq_len]
weights = torch.softmax(scores, dim=-1)     # attention distribution per query token
Z_single = weights @ V                      # [seq_len, head_dim]

print("Attention scores (scaled):\n", scores)
print("\nAttention weights (softmax rows):\n", weights)
print("\nContext vectors Z (single head):\n", Z_single)


Attention scores (scaled):
 tensor([[ 0.0007,  0.0326,  0.0401, -0.0361,  0.0092],
        [ 0.0418,  0.0325,  0.0427,  0.0175,  0.0116],
        [ 0.0135,  0.0544,  0.0677, -0.0442,  0.0161],
        [ 0.0343, -0.0187, -0.0207,  0.0660, -0.0032],
        [ 0.0136,  0.0093,  0.0123,  0.0072,  0.0034]], grad_fn=<DivBackward0>)

Attention weights (softmax rows):
 tensor([[0.1982, 0.2046, 0.2062, 0.1910, 0.1999],
        [0.2025, 0.2006, 0.2027, 0.1977, 0.1965],
        [0.1983, 0.2065, 0.2093, 0.1871, 0.1988],
        [0.2045, 0.1939, 0.1935, 0.2111, 0.1970],
        [0.2009, 0.2000, 0.2006, 0.1996, 0.1989]], grad_fn=<SoftmaxBackward0>)

Context vectors Z (single head):
 tensor([[0.0912, 0.0094],
        [0.0915, 0.0073],
        [0.0909, 0.0111],
        [0.0924, 0.0019],
        [0.0917, 0.0061]], grad_fn=<MmBackward0>)
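
The row-wise softmax can be reproduced by hand. The sketch below takes the first row of the printed scaled scores (the query for `<bos>`) and normalizes it in plain Python; `row`, `exps`, and `weights_row` are illustrative names. Because the inputs are the rounded printed values, the result should agree with the first row of the weights only up to rounding in the last digit.

```python
import math

# Scaled scores for the first query token (<bos>), copied from the printed output above.
row = [0.0007, 0.0326, 0.0401, -0.0361, 0.0092]

# Softmax: exponentiate each score, then divide by the sum of the exponentials.
exps = [math.exp(s) for s in row]
total = sum(exps)
weights_row = [e / total for e in exps]

print([round(w, 4) for w in weights_row])
print(sum(weights_row))  # a softmax row sums to 1 (up to floating-point error)
```

Since all five scores are close to zero, the exponentials are all close to 1 and the weights stay close to the uniform value 1/5.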



## 5) (Optional) Multi-Head Attention

Multiple heads let the model learn **different types** of relationships in parallel.  
We'll run **2 heads**, concatenate their outputs, and project the result back to the embedding dimension.


In [6]:

num_heads = 2
head_dim = 2  # same per-head size as the single-head example in section 3
d_out = num_heads * head_dim  # 4

# Build per-head linear layers (for clarity, explicit two-head setup)
Q_layers = nn.ModuleList([nn.Linear(emb_dim, head_dim, bias=False) for _ in range(num_heads)])
K_layers = nn.ModuleList([nn.Linear(emb_dim, head_dim, bias=False) for _ in range(num_heads)])
V_layers = nn.ModuleList([nn.Linear(emb_dim, head_dim, bias=False) for _ in range(num_heads)])

# Head 0 reuses the single-head weights from section 3; head 1 uses different values
with torch.no_grad():
    # Head 0
    Q_layers[0].weight.copy_(torch.tensor([[ 0.5,  0.0,  0.5,  0.0],
                                           [ 0.0,  0.5,  0.0, -0.5]]))
    K_layers[0].weight.copy_(torch.tensor([[ 0.4, -0.1,  0.3,  0.0],
                                           [-0.2,  0.6,  0.1,  0.2]]))
    V_layers[0].weight.copy_(torch.tensor([[ 0.3,  0.1, -0.2,  0.2],
                                           [ 0.1, -0.3,  0.4,  0.0]]))
    # Head 1 (tweaked values)
    Q_layers[1].weight.copy_(torch.tensor([[ 0.2,  0.2,  0.4,  0.0],
                                           [ 0.1,  0.3, -0.1,  0.3]]))
    K_layers[1].weight.copy_(torch.tensor([[ 0.1,  0.3,  0.2, -0.1],
                                           [ 0.2,  0.1, -0.2,  0.2]]))
    V_layers[1].weight.copy_(torch.tensor([[ 0.2, -0.1,  0.1,  0.3],
                                           [-0.1,  0.2,  0.0,  0.1]]))

# Compute head outputs and concatenate
head_outputs = []
for h in range(num_heads):
    Qh = Q_layers[h](x)
    Kh = K_layers[h](x)
    Vh = V_layers[h](x)
    scores_h = (Qh @ Kh.T) / math.sqrt(head_dim)
    weights_h = torch.softmax(scores_h, dim=-1)
    Zh = weights_h @ Vh
    head_outputs.append(Zh)

Z_cat = torch.cat(head_outputs, dim=-1)  # [seq_len, d_out]

# Output projection back to emb_dim
out_proj = nn.Linear(d_out, emb_dim, bias=False)
with torch.no_grad():
    out_proj.weight.copy_(torch.tensor([
        [ 0.5,  0.0,  0.3,  0.0],
        [ 0.0,  0.4,  0.0,  0.6],
        [ 0.2, -0.1,  0.5,  0.0],
        [ 0.1,  0.0,  0.0,  0.3],
    ]))

Z_multi = out_proj(Z_cat)  # [seq_len, emb_dim]

print("Concatenated heads Z (shape):", Z_cat.shape)
print("Z (multi-head) projected back to emb_dim:\n", Z_multi)


Concatenated heads Z (shape): torch.Size([5, 4])
Z (multi-head) projected back to emb_dim:
 tensor([[0.0650, 0.0160, 0.0497, 0.0152],
        [0.0653, 0.0150, 0.0501, 0.0152],
        [0.0650, 0.0165, 0.0496, 0.0151],
        [0.0656, 0.0130, 0.0507, 0.0153],
        [0.0654, 0.0145, 0.0503, 0.0152]], grad_fn=<MmBackward0>)
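
The explicit per-head loop is easy to read, but the same computation can also be written as one batched operation over a leading head axis. The following is a sketch of that alternative, not the notebook's method: the stacked tensors `Wq`, `Wk`, `Wv` and the helper `attend` are assumptions introduced here, filled with the per-head weights from the cell above, and the final output projection is left out.

```python
import math
import torch

num_heads, head_dim = 2, 2

# Input embeddings X, copied from the printed output in section 2.
x = torch.tensor([
    [0.10, 0.20, 0.30, 0.40],
    [0.51, 0.12, 0.03, -0.16],
    [0.32, -0.09, 0.39, 0.10],
    [0.08, 0.60, -0.19, 0.08],
    [0.24, 0.09, -0.08, -0.01],
])
seq_len = x.shape[0]

# Per-head weights stacked as [num_heads, head_dim, emb_dim].
Wq = torch.tensor([[[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, -0.5]],
                   [[0.2, 0.2, 0.4, 0.0], [0.1, 0.3, -0.1, 0.3]]])
Wk = torch.tensor([[[0.4, -0.1, 0.3, 0.0], [-0.2, 0.6, 0.1, 0.2]],
                   [[0.1, 0.3, 0.2, -0.1], [0.2, 0.1, -0.2, 0.2]]])
Wv = torch.tensor([[[0.3, 0.1, -0.2, 0.2], [0.1, -0.3, 0.4, 0.0]],
                   [[0.2, -0.1, 0.1, 0.3], [-0.1, 0.2, 0.0, 0.1]]])

def attend(q, k, v):
    """Scaled dot-product attention over the last two dimensions."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    return torch.softmax(scores, dim=-1) @ v

# Loop version: one head at a time, concatenated along the feature axis.
Z_loop = torch.cat(
    [attend(x @ Wq[h].T, x @ Wk[h].T, x @ Wv[h].T) for h in range(num_heads)],
    dim=-1,
)

# Batched version: x is broadcast across the head axis, giving [num_heads, seq_len, head_dim].
Qb = x @ Wq.transpose(-2, -1)
Kb = x @ Wk.transpose(-2, -1)
Vb = x @ Wv.transpose(-2, -1)
Zb = attend(Qb, Kb, Vb)
# Move the head axis next to the feature axis and flatten: [seq_len, num_heads * head_dim].
Z_batched = Zb.transpose(0, 1).reshape(seq_len, num_heads * head_dim)

batched_matches = torch.allclose(Z_loop, Z_batched, atol=1e-6)
print(batched_matches)
print(Z_batched.shape)
```

The batched form avoids the Python loop, which matters once the number of heads grows; the loop form keeps each head's tensors visible, which is why it suits a walkthrough.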



## 6) Interpretability Notes

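- Each row of the **attention weights** corresponds to a **query token**: it shows how much that token attends to each key token, and each row sums to 1.  
- The **context vector** for a token is a **weighted mixture** of all value vectors, using its row of weights.  
- **Multi-head** attention lets different heads focus on different patterns; the head outputs are concatenated and then projected back to the embedding dimension.  
- In this toy example every weight is close to 0.2 (uniform over 5 tokens) because the hand-picked projection weights yield scaled scores near zero, so the context vectors are all nearly identical.  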
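
The weighted-mixture view of the context vector can be checked by hand. The sketch below recombines the printed single-head values in plain Python for the first token (`<bos>`); `w`, `V_rows`, and `z` are illustrative names, and because the inputs are rounded printed values, the result matches the first row of the single-head context vectors only up to rounding.

```python
# Attention weights for the first query token (<bos>), from the printed output in section 4.
w = [0.1982, 0.2046, 0.2062, 0.1910, 0.1999]

# Value vectors for all five tokens, from the printed output in section 3.
V_rows = [
    [0.0700, 0.0700],
    [0.1270, 0.0270],
    [0.0290, 0.2150],
    [0.1380, -0.2480],
    [0.0950, -0.0350],
]

# Context vector = sum over tokens of (weight * value vector), one component at a time.
z = [sum(w_j * v_j[d] for w_j, v_j in zip(w, V_rows)) for d in range(2)]
print([round(c, 4) for c in z])
```

With weights this close to uniform, the result is close to the plain average of the value vectors, which is why all five context vectors in the single-head output look alike.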
