<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Chapter 3: Coding Attention Mechanisms

# From Tokens → Embeddings → Self-Attention (Q·K·V)
**Sentence:** *“Your journey starts with one step.”*

**Goals**
- See how raw text becomes vectors the model can work with.
- Build intuition for **Q (queries)**, **K (keys)**, **V (values)**.
- Understand **positional embeddings** and why Transformers need them.
- Compute attention weights and the head output step-by-step.

> 🧭 Big idea: self-attention lets each token *ask* other tokens what matters, then **mix** their info into a context vector.



## Agenda
1. Tokenized corpus vs. embeddings (what vs. how we compute)
2. Positional encodings (why Transformers need “where”)
3. Build input **X = TokenEmb + PosEmb**
4. Q/K/V projections: intuition + math
5. Attention scores → weights → head output
6. (Optional) Multi-head attention
7. Transformer block (where attention lives) and recap

> Speaker note: RNNs encode order by processing left→right; Transformers see all tokens at once and need explicit position signals.



Packages that are being used in this notebook:

In [1]:
import math, torch, torch.nn as nn
torch.set_printoptions(precision=4, sci_mode=False)
torch.manual_seed(0)  # reproducible

import random
from importlib.metadata import version  # explicit import; `import importlib` alone may not load the submodule

import tiktoken

print("tiktoken version:", version("tiktoken"))
print("torch version:", version("torch"))

tiktoken version: 0.9.0
torch version: 2.7.1


## 1) Tokenized Corpus vs Embeddings

- **Tokenized corpus** = a sequence of discrete symbols/IDs (not yet usable by neural ops).
- **Embeddings** = continuous vectors that compress meaning into **latent features**.

We’ll use the sentence from the book example:
> **“Your journey starts with one step.”**

**Key idea:** IDs are *what* tokens are; embeddings are *how* we represent them to compute.  
(Embedding lookup = efficient table lookup learned via backprop.)
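
To make the "table lookup" idea concrete, here is a minimal sketch showing that indexing an embedding table gives the same result as multiplying one-hot rows by the weight matrix. The sizes (9 vocabulary entries, 3-D vectors) and the example IDs are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed sizes for illustration: 9 vocabulary entries, 3-D embeddings
vocab_size, emb_dim = 9, 3
emb = nn.Embedding(vocab_size, emb_dim)

# Example token IDs (any integers in [0, vocab_size) would work)
ids = torch.tensor([1, 8, 3, 5, 7, 4, 6, 2])

# Path 1: direct table lookup (what nn.Embedding does)
looked_up = emb(ids)

# Path 2: one-hot rows times the weight matrix (same result, more work)
one_hot = nn.functional.one_hot(ids, num_classes=vocab_size).float()
via_matmul = one_hot @ emb.weight

print(looked_up.shape)                        # one 3-D vector per ID
print(torch.allclose(looked_up, via_matmul))  # the two paths agree
```

The lookup skips the multiplication by zeros, which is why it is the efficient form of the same linear map.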

In [2]:
sentence = "Your journey starts with one step."
tokens = [t.strip(".,!?").lower() for t in sentence.split()]

special = ["<pad>", "<bos>", "<eos>"]
vocab = {w:i for i,w in enumerate(special + sorted(set(tokens)))}
ivocab = {i:w for w,i in vocab.items()}

def encode(ws):
    return torch.tensor([vocab["<bos>"]] + [vocab[w] for w in ws] + [vocab["<eos>"]], dtype=torch.long)

ids = encode(tokens)
print("Tokens   :", tokens)
print("Vocab    :", vocab)
print("Token IDs:", ids.tolist())


Tokens   : ['your', 'journey', 'starts', 'with', 'one', 'step']
Vocab    : {'<pad>': 0, '<bos>': 1, '<eos>': 2, 'journey': 3, 'one': 4, 'starts': 5, 'step': 6, 'with': 7, 'your': 8}
Token IDs: [1, 8, 3, 5, 7, 4, 6, 2]


## 2) Embeddings & Latent Features

We’ll use **3-D** embeddings (tiny for readability).  
Think of dimensions as toy latent factors (purely pedagogical), e.g.:
- dim0: movement
- dim1: abstractness
- dim2: objectness

> Similar words point in similar directions: **cosine / dot product** measure alignment.

[Cosine Similarity](https://vizuara.substack.com/p/from-words-to-vectors-understanding)

# <img src="features-example.png" alt="alt text" width="700"/>

# <img src="feature-gauge.png" alt="alt text" width="700"/>

# <img src="cosine-similarity.png" alt="alt text" width="700"/>

# <img src="similar-words.png" alt="alt text" width="700"/>

# <img src="dissimilar-words.png" alt="alt text" width="700"/>

# <img src="oposite-words.png" alt="alt text" width="700"/>

# <img src="cosine-similarity-words.png" alt="alt text" width="700"/>

# <img src="cosine-similarity-formula.png" alt="alt text" width="700"/>

# <img src="implications-for-training.png" alt="alt text" width="700"/>

In [3]:
emb_dim = 3
emb = nn.Embedding(len(vocab), emb_dim)
with torch.no_grad():
    emb.weight.zero_()
    emb.weight[vocab["your"]]    = torch.tensor([0.43, 0.15, 0.89])
    emb.weight[vocab["journey"]] = torch.tensor([0.55, 0.87, 0.66])
    emb.weight[vocab["starts"]]  = torch.tensor([0.57, 0.85, 0.64])
    emb.weight[vocab["with"]]    = torch.tensor([0.22, 0.58, 0.33])
    emb.weight[vocab["one"]]     = torch.tensor([0.77, 0.25, 0.10])
    emb.weight[vocab["step"]]    = torch.tensor([0.05, 0.80, 0.55])

X_tokens = torch.stack([emb.weight[vocab[w]] for w in tokens])
print("Token embeddings (3D):\n", X_tokens)

def cosine(a,b, eps=1e-8):
    an, bn = a/(a.norm()+eps), b/(b.norm()+eps)
    return float((an*bn).sum())

for a,b in [("journey","starts"), ("one","step"), ("your","journey")]:
    va, vb = emb.weight[vocab[a]], emb.weight[vocab[b]]
    print(f"{a:>7s} vs {b:<7s} | cosine={cosine(va,vb):.3f} | dot={(va@vb).item():.3f}")


Token embeddings (3D):
 tensor([[0.4300, 0.1500, 0.8900],
        [0.5500, 0.8700, 0.6600],
        [0.5700, 0.8500, 0.6400],
        [0.2200, 0.5800, 0.3300],
        [0.7700, 0.2500, 0.1000],
        [0.0500, 0.8000, 0.5500]], grad_fn=<StackBackward0>)
journey vs starts  | cosine=1.000 | dot=1.475
    one vs step    | cosine=0.370 | dot=0.294
   your vs journey | cosine=0.781 | dot=0.954


From the Previous Chapter

Tiktokenizer

[Tiktokenizer Visualization Tool](https://tiktokenizer.vercel.app/?model=gpt2)

Vocabulary

In [4]:
# Load the tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Get vocabulary size
vocab_size = tokenizer.n_vocab
print(f"Vocabulary size: {vocab_size}")  # 50257 for GPT-2

# Get the decoder (maps token IDs back to strings)
decoder = tokenizer.decode_single_token_bytes

n = 10  # Number of tokens to display
random_token_ids = random.sample(range(vocab_size), min(n, vocab_size))

for token_id in random_token_ids:
    try:
        token_bytes = decoder(token_id)
        token_string = token_bytes.decode('utf-8', errors='replace')
        print(f"Token ID {token_id}: {repr(token_string)}")
    except Exception:
        print(f"Token ID {token_id}: <error>")

Vocabulary size: 50257
Token ID 3062: ' scient'
Token ID 1780: 'iting'
Token ID 5589: 'comp'
Token ID 25783: ' inh'
Token ID 17541: ' Insurance'
Token ID 4298: 'ront'
Token ID 46745: 'Ptr'
Token ID 43989: ' Romeo'
Token ID 50070: ' Tammy'
Token ID 4812: 'ffect'


Encoder and Decoder

In [5]:
# Get the encoder (maps strings to token IDs)
encoder = tiktoken.get_encoding("gpt2")

# Get the decoder (maps token IDs back to strings)
decoder = tokenizer.decode_single_token_bytes

# Your text
text = "Your journey starts with one step."


# Encode the text
token_ids = encoder.encode(text, allowed_special={"<|endoftext|>"})

# Get unique tokens used in this text
unique_tokens = set(token_ids)

# Create a mapping of token IDs to strings for this text
text_vocab = {}
for token_id in unique_tokens:
    try:
        token_bytes = encoder.decode_single_token_bytes(token_id)
        token_string = token_bytes.decode('utf-8', errors='replace')
        text_vocab[token_id] = token_string
    except Exception:
        text_vocab[token_id] = f"<unknown_{token_id}>"

# Print the sentence with the token IDs
print("Sentence with token IDs:", token_ids)

# Print the sentence with the token strings
print("Sentence with token strings:", [text_vocab[tid] for tid in token_ids])


display(text_vocab)




Sentence with token IDs: [7120, 7002, 4940, 351, 530, 2239, 13]
Sentence with token strings: ['Your', ' journey', ' starts', ' with', ' one', ' step', '.']


{2239: ' step',
 4940: ' starts',
 13: '.',
 7120: 'Your',
 530: ' one',
 7002: ' journey',
 351: ' with'}

## 3) Why Positional Encodings?

- **RNNs** encode order via *sequential computation* (hidden state depends on time step).
- **Transformers** attend to *all tokens in parallel*; self-attention has **no order** unless we inject it.
- Solution: add a **positional embedding** per index and sum:
$$
X_i = \text{TokenEmb}[t_i] + \text{PosEmb}[i]
$$

> Learned positional embeddings are a simple lookup by index (0, 1, 2, …). The **indexing** ties these vectors to positions; the language modeling objective pushes them to carry useful order info.


# <img src="lookup-table.png" alt="alt text" width="700"/>


Transformers process input sequences in parallel, unlike RNNs which process them sequentially. This parallelism means transformers **lack a built-in sense of word order**. Positional encodings solve this by injecting **sequence order** into the model.

Without positional information, a transformer would treat:

> "The cat sat on the mat."

the same as:

> "Mat the on sat cat the."

Even though the words are the same, the meaning is completely different.
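
A small check of this claim, as a sketch: it uses a parameter-free attention (the helper `simple_self_attention` and the random embeddings are made up for illustration) and shows that shuffling the input rows only shuffles the output rows. Each token receives the same context vector regardless of where it sits.

```python
import torch

torch.manual_seed(0)

def simple_self_attention(x):
    """Parameter-free self-attention: softmax(x x^T) x, with no positional signal."""
    weights = torch.softmax(x @ x.T, dim=-1)
    return weights @ x

x = torch.rand(6, 3)        # 6 made-up token embeddings, 3-D each
perm = torch.randperm(6)    # a random reordering of the 6 positions

out = simple_self_attention(x)
out_shuffled = simple_self_attention(x[perm])

# Reordering the inputs reorders the outputs in exactly the same way,
# so word order leaves no trace in any individual context vector.
print(torch.allclose(out[perm], out_shuffled, atol=1e-6))
```

Adding a position-dependent vector to each row breaks this symmetry, which is the job of positional embeddings.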

---

# <img src="lookup-table-position.png" alt="alt text" width="700"/>


### 1. Token Embeddings: “What the word means”

Each token ID in the vocabulary (for tokens like `"data"`, `"step"`, `"journey"`) maps to one learned vector:

- token_emb[id_data] ∈ ℝᵈ


These embeddings are **shared globally** ‚Äî every time the token `"data"` appears, the same vector is retrieved.

These vectors capture **semantic and syntactic meaning** learned during training:

- `"data"` ≈ `"information"`
- `"journey"` ≈ `"trip"`

So yes, it’s a **lookup table** from token ID → vector that stays consistent everywhere.

---

### 2. Positional Embeddings: “Where the word is”

A totally separate table:

- pos_emb[i] ∈ ℝᵈ


Where `i` is the **position index** (0, 1, 2, …, N−1).

These are **not tied to tokens at all**.  
The 5th word in a sentence (whatever it is) always uses `pos_emb[4]`.

---

### Combined Representation

For every input sentence, you build:

- Xᵢ = token_emb[tᵢ] + pos_emb[i]


This gives each token vector **two kinds of information**:

- **What it is** → token embedding  
- **Where it is** → position embedding
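
As a sketch of the same sum with two learned tables, the snippet below uses the GPT-2 token IDs printed earlier for the example sentence. The embedding width of 4 and the context length of 16 are assumptions to keep the output small; real models use much larger values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed sizes for illustration only
vocab_size, context_length, emb_dim = 50257, 16, 4

token_emb = nn.Embedding(vocab_size, emb_dim)      # one row per token ID
pos_emb = nn.Embedding(context_length, emb_dim)    # one row per position

# GPT-2 BPE IDs for "Your journey starts with one step."
token_ids = torch.tensor([7120, 7002, 4940, 351, 530, 2239, 13])
positions = torch.arange(len(token_ids))           # 0, 1, ..., 6

x = token_emb(token_ids) + pos_emb(positions)
print(x.shape)

# The same token at two different positions gets two different input vectors
same_token = torch.tensor([530, 530])
x_same = token_emb(same_token) + pos_emb(torch.arange(2))
print(torch.equal(x_same[0], x_same[1]))
```

The second check prints `False` because the token part is identical while the position part differs.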



In [6]:
pos_emb = nn.Embedding(16, emb_dim)
with torch.no_grad():
    pos_emb.weight.copy_(torch.tensor([
        [0.00, 0.00, 0.00],
        [0.01, 0.02, 0.03],
        [0.02, 0.01,-0.01],
        [0.03, 0.00, 0.01],
        [0.04,-0.01, 0.02],
        [0.05, 0.02, 0.00],
        [0.06, 0.03,-0.02],
        [0.07, 0.01, 0.01],
        [0.08, 0.00,-0.01],
        [0.09,-0.02, 0.02],
        [0.10, 0.03, 0.00],
        [0.11, 0.00, 0.01],
        [0.12, 0.01,-0.02],
        [0.13,-0.01, 0.02],
        [0.14, 0.02, 0.00],
        [0.15, 0.01, 0.01],
    ], dtype=torch.float))

pos = torch.arange(len(tokens))
X = X_tokens + pos_emb(pos)
print("X = token + positional embeddings:\n", X)


X = token + positional embeddings:
 tensor([[0.4300, 0.1500, 0.8900],
        [0.5600, 0.8900, 0.6900],
        [0.5900, 0.8600, 0.6300],
        [0.2500, 0.5800, 0.3400],
        [0.8100, 0.2400, 0.1200],
        [0.1000, 0.8200, 0.5500]], grad_fn=<AddBackward0>)


## 3.1 The problem with modeling long sequences

- Prior to the introduction of transformer models, encoder-decoder RNNs were commonly used for machine translation tasks
- In this setup, the encoder processes a sequence of tokens from the source language, using a hidden state (a kind of intermediate layer within the neural network) to generate a condensed representation of the entire input sequence:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/04.webp" width="500px">

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/05.webp" width="500px">

- Self-attention in transformers is a technique designed to enhance input representations by **enabling each position in a sequence to engage with and determine the relevance of every other position within the same sequence**.

<br>

- **Step 1:** compute unnormalized attention scores $\omega$
- Suppose we use the second input token as the query, that is, $q^{(2)} = x^{(2)}$. We then compute the unnormalized attention scores via dot products:
    - $\omega_{21} = x^{(1)} q^{(2)\top}$
    - $\omega_{22} = x^{(2)} q^{(2)\top}$
    - $\omega_{23} = x^{(3)} q^{(2)\top}$
    - ...
    - $\omega_{2T} = x^{(T)} q^{(2)\top}$
- Above, $\omega$ is the Greek letter "omega" used to symbolize the unnormalized attention scores
    - The subscript "21" in $\omega_{21}$ means that input sequence element 2 was used as a query against input sequence element 1

In [7]:
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

In [8]:
x2 = inputs[1]
x1 = inputs[0]
score = torch.dot(x2, x1)
score


tensor(0.9544)

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/08.webp" width="400px">

In [9]:
query = inputs[1]  # 2nd input token is the query

attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query) # dot product (transpose not necessary here since they are 1-dim vectors)

print(attn_scores_2)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])


In [10]:
x2 = inputs[1]  # "journey"
scores = torch.matmul(inputs, x2)
scores


tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])

- **Step 2:** normalize the unnormalized attention scores ("omegas", $\omega$) so that they sum up to 1
- Here is a simple way to normalize the unnormalized attention scores to sum up to 1 (a convention, useful for interpretation, and important for training stability):

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/09.webp" width="500px">

In [11]:
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()

print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())

Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)


- In practice, however, the softmax function is the common and recommended choice for normalization: it handles extreme values better and has more desirable gradient properties during training.
- Softmax exponentiates each score and divides by the sum of the exponentials, so the outputs are positive and sum to 1:
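
A naive softmax could be sketched as follows. The helper name `softmax_naive` is hypothetical, and this version has no protection against overflow for very large scores, which is one reason to prefer `torch.softmax`. The scores are the ones computed above for the query "journey".

```python
import torch

def softmax_naive(x):
    """Exponentiate, then divide by the sum (no overflow protection)."""
    exp_x = torch.exp(x)
    return exp_x / exp_x.sum(dim=0)

scores = torch.tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])

naive = softmax_naive(scores)
builtin = torch.softmax(scores, dim=0)

print("naive  :", naive)
print("builtin:", builtin)
print("match  :", torch.allclose(naive, builtin))

# Sum normalization can yield negative "weights" when a score is negative;
# softmax always returns positive values.
mixed = torch.tensor([-1.0, 2.0, 0.5])
print(mixed / mixed.sum())          # first entry is negative
print(torch.softmax(mixed, dim=0))  # all entries positive
```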

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/11.webp" width="400px">

- (Please note that the numbers in this figure are truncated to two digits after the decimal point to reduce visual clutter; the values in each row should add up to 1.0 or 100%; similarly, digits in other figures are truncated)

- **Step 3**: compute the context vector $z^{(2)}$ by multiplying the embedded input tokens $x^{(i)}$ by the attention weights and summing the resulting vectors:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/10.webp" width="500px">

In [12]:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


In [13]:
query = inputs[1] # 2nd input token is the query

context_vec_2 = torch.zeros(query.shape)
for i,x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i]*x_i

print(context_vec_2)

tensor([0.4419, 0.6515, 0.5683])


- In self-attention, the process starts with the calculation of attention scores, which are subsequently normalized to derive attention weights that total 1
- These attention weights are then utilized to generate the context vectors through a weighted summation of the inputs

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/12.webp" width="400px">

- Apply previous **step 1** to all pairwise elements to compute the unnormalized attention score matrix:

In [14]:
attn_scores = torch.empty(6, 6)

for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)

print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


- We can achieve the same as above more efficiently via matrix multiplication:

In [15]:
attn_scores = inputs @ inputs.T
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


- Similar to **step 2** previously, we normalize each row so that the values in each row sum to 1:

In [16]:
attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


In [17]:
attn_weights_2

tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])

- Quick verification that the values in each row indeed sum to 1:

In [18]:
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("Row 2 sum:", row_2_sum)

print("All row sums:", attn_weights.sum(dim=-1))

Row 2 sum: 1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])


- Apply previous **step 3** to compute all context vectors:

In [19]:
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])


- As a sanity check, the previously computed context vector $z^{(2)} = [0.4419, 0.6515, 0.5683]$ can be found in the 2nd row above:

In [20]:
print("Previous 2nd context vector:", context_vec_2)

Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])


## 4) Q, K, V: Intuition

- **Query (Q):** “What am *I* looking for?”
- **Key (K):** “What do *you* contain?”
- **Value (V):** “What information can you give me?”

Linear projections:
$$
Q = XW_Q,\quad K = XW_K,\quad V = XW_V
$$

> Alignment = **Q · K** (dot product). If my interests match your attributes, I attend to you.  
(We’ll compute scaled dot-product attention next.)


### Q–K–V Intuition: “Your journey starts with the first step.”

Let’s imagine one **trained attention head** has learned to capture the idea of  
**“who is involved in an action.”**

---

#### 🧩 1. Context: A head learning “who–what” relationships

In our sentence  
> **Your journey starts with the first step.**

this attention head may learn:

[Your] → [journey] → [step]


- **“Your”** gives *ownership* (the actor or possessor).  
- **“Journey”** and **“step”** describe *actions or goals*.  
- **“Starts with the first”** links those ideas together.

So, “journey” should attend mostly to “your” (who owns it) and “step” (how it happens).

---

#### 2. Roles of Q, K, and V

| Vector | Meaning | In our sentence |
|:--|:--|:--|
| **Q (Query)** | What this token is *looking for* | “journey” asks → “who owns me?” |
| **K (Key)** | What this token *offers* to others | “your” advertises → “I express ownership.” |
| **V (Value)** | The *actual information* it shares | “your” provides → the *ownership feature* |

---

#### 3. Conceptual numeric example

Imagine 3-dimensional learned vectors (made-up for intuition):

| Token | Q | K | V | Interpretation |
|:------|:-----------:|:-----------:|:-----------:|:--|
| **Your** | [0.1, 0.2, 0.3] | [0.9, 0.1, 0.2] | [1.0, 0.0, 0.0] | Offers *ownership* |
| **Journey** | [0.8, 0.1, 0.2] | [0.3, 0.7, 0.2] | [0.2, 1.0, 0.0] | *Goal / theme* |
| **Starts** | [0.3, 0.8, 0.1] | [0.1, 0.9, 0.3] | [0.0, 0.8, 0.6] | *Action* |
| **With** | [0.2, 0.4, 0.3] | [0.1, 0.2, 0.8] | [0.0, 0.3, 0.9] | *Connector* |
| **The** | [0.1, 0.1, 0.1] | [0.1, 0.1, 0.1] | [0.0, 0.0, 0.0] | Neutral |
| **First** | [0.3, 0.6, 0.2] | [0.2, 0.7, 0.3] | [0.3, 0.9, 0.1] | *Modifier* |
| **Step** | [0.6, 0.2, 0.5] | [0.5, 0.2, 0.5] | [0.1, 0.6, 0.9] | *Concrete action* |

---

#### 4. Step 1: Q·K builds the *attention map*

For token **“journey”**, compare its Query with all Keys:

| Compared with | Q · K (alignment) | Interpretation |
|:--|:--:|:--|
| *your* | **≈ 0.77** | Strong: “journey” finds its owner |
| *starts* | ≈ 0.23 | Weaker |
| *step* | ≈ 0.52 | Moderate |
| others | ≈ 0.11 to 0.35 | Lower than *your* and *step* |

After softmax, “journey” attends mostly to **your** and somewhat to **step**.

---

#### 5. Step 2: Weighted mix of V (the payload)

After computing the **alignment scores** between each token’s Query (**Q**) and all tokens’ Keys (**K**), including its own,  
we obtain the *attention weights* for each token by applying the **softmax** function.

---

##### 🔢 Step-by-step computation

1. **Compute attention scores** for token *i* against every token *j*:

$$
\text{score}_{ij} = \frac{Q_i K_j^T}{\sqrt{d_k}}
$$

2. **Normalize** the scores across all tokens *j* using the softmax function so that they sum to 1:

$$
a_{ij} = \frac{\exp(\text{score}_{ij})}
               {\sum_{m=1}^{N} \exp(\text{score}_{im})}
$$

Each $a_{ij}$ is a **scalar weight** representing how much token *i* “attends” to token *j*.

3. **Combine Values (V)** according to these weights:

$$
Z_i = \sum_{j=1}^{N} a_{ij} \, V_j
$$

Here:
- $V_j$ is the **value vector** (the information token *j* carries)
- $a_{ij}$ is how much token *i* listens to token *j*
- The sum is a **weighted linear combination**: effectively a soft, differentiable “mix” of the tokens’ features.
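
The three steps above can be sketched as one small function. The name `attention`, the sequence length, and the head dimension are assumptions for illustration; the last lines cross-check the result against PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention` (available in PyTorch 2.0 and later), which expects batch and head dimensions in front.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def attention(Q, K, V):
    """Scaled dot-product attention for a single unbatched sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / d_k**0.5               # step 1: score_ij
    weights = torch.softmax(scores, dim=-1)   # step 2: a_ij, each row sums to 1
    return weights @ V, weights               # step 3: Z_i = sum_j a_ij V_j

N, d_k = 5, 4   # assumed sequence length and head dimension
Q, K, V = torch.rand(N, d_k), torch.rand(N, d_k), torch.rand(N, d_k)

Z, A = attention(Q, K, V)
print(A.sum(dim=-1))   # each row sums to 1

# Cross-check: add batch and head dimensions, then strip them again
Z_ref = F.scaled_dot_product_attention(Q[None, None], K[None, None], V[None, None])[0, 0]
print(torch.allclose(Z, Z_ref, atol=1e-5))
```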

---

##### 🧩 Example with “journey”

From Step 4, “journey” had the following attention alignment. The weights are rounded and exaggerated for illustration; an exact softmax over all seven tokens would be flatter:

| Source token | raw $Q \cdot K$ | attention weight $a_{ij}$ | Interpretation |
|:--|:--:|:--:|:--|
| **your** | 0.77 | **0.6** | highest relevance (owner) |
| **step** | 0.52 | **0.3** | moderate relevance (action) |
| **starts** | 0.23 | **0.1** | low relevance (verb cue) |

Using these weights, we compute the **context vector** for “journey”:

$$
Z_\text{journey}
= 0.6\,V_\text{your}
+ 0.3\,V_\text{step}
+ 0.1\,V_\text{starts}
$$

$$
Z_\text{journey}
= 0.6[1,0,0]
+ 0.3[0.1,0.6,0.9]
+ 0.1[0.0,0.8,0.6]
\approx [0.63, 0.26, 0.33]
$$

✅ **Interpretation:**
- “Journey” keeps its original meaning (a goal)
- Gains *ownership features* from “your”
- Gains *motion features* from “step”

It becomes a **context-aware representation** of  
> “*your journey that involves a step or movement.*”

---

### 🔗 Summary Equation (complete)

$$
\boxed{
Z_i = \sum_{j=1}^{N}
      \text{softmax}_j\!
      \left(
      \frac{Q_i K_j^T}{\sqrt{d_k}}
      \right)
      V_j
}
$$

Every token’s new representation $Z_i$ is therefore a **weighted average of all tokens’ values**,  
weights determined by **how well their Queries and Keys align**.
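
To see the summary equation at work on the made-up vectors from the conceptual numeric example, the sketch below computes the exact scores and softmax weights for “journey”. The Q, K, V rows are copied from that table; everything else follows the equation. The exact weights come out flatter than the rounded 0.6 / 0.3 / 0.1 used in the walkthrough, but the ranking (your first, step second) is the same.

```python
import torch

tokens = ["your", "journey", "starts", "with", "the", "first", "step"]

# Made-up Q, K, V rows from the conceptual numeric example
Q = torch.tensor([[0.1, 0.2, 0.3], [0.8, 0.1, 0.2], [0.3, 0.8, 0.1],
                  [0.2, 0.4, 0.3], [0.1, 0.1, 0.1], [0.3, 0.6, 0.2],
                  [0.6, 0.2, 0.5]])
K = torch.tensor([[0.9, 0.1, 0.2], [0.3, 0.7, 0.2], [0.1, 0.9, 0.3],
                  [0.1, 0.2, 0.8], [0.1, 0.1, 0.1], [0.2, 0.7, 0.3],
                  [0.5, 0.2, 0.5]])
V = torch.tensor([[1.0, 0.0, 0.0], [0.2, 1.0, 0.0], [0.0, 0.8, 0.6],
                  [0.0, 0.3, 0.9], [0.0, 0.0, 0.0], [0.3, 0.9, 0.1],
                  [0.1, 0.6, 0.9]])

i = tokens.index("journey")
raw = Q[i] @ K.T                                         # Q·K against every key
weights = torch.softmax(raw / K.shape[-1]**0.5, dim=-1)  # scaled softmax
z_journey = weights @ V                                  # weighted mix of values

for tok, r, w in zip(tokens, raw.tolist(), weights.tolist()):
    print(f"{tok:>8s} | Q·K = {r:.2f} | weight = {w:.3f}")
print("Z_journey:", z_journey)
```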

---

#### 6. What if we remove V?

If we replaced V with K (or nothing):

$$
Z'_\text{journey}
= 0.6\,K_\text{your}
+ 0.3\,K_\text{step}
+ 0.1\,K_\text{starts}
\approx [0.70, 0.21, 0.30]
$$
Numerically similar but **semantically empty**:  
we just averaged identity tags (Keys), not actual content features (Values).  
The model knows *who is related* but not *what to transfer*.

---

#### 7. Intuitive takeaway

| Component | Function | If removed |
|:-----------|:----------|:-----------|
| **Q** | expresses what a token seeks | no search; all tokens equal |
| **K** | advertises what a token offers | no way to measure relevance |
| **V** | carries the **information** that flows | connections without content |

> 💬 **Without V**, attention is only *gossip about relevance*;  
> with V, it becomes an *information exchange* that builds meaning.

---

### Visual summary

      [Q] ---- alignment ----> [K]
       |                       |
       |                       |
       |                       v
       | <----- mix ---------- [V]
       v
      context-aware output (Z)


Back to the book

**Step 1**

Set the input and output dimensions, then create the query, key, and value weight matrices

In [23]:
x_2 = inputs[1] # second input element
d_in = inputs.shape[1] # the input embedding size, d=3
d_out = 2 # the output embedding size, d=2

In [24]:
inputs

tensor([[0.4300, 0.1500, 0.8900],
        [0.5500, 0.8700, 0.6600],
        [0.5700, 0.8500, 0.6400],
        [0.2200, 0.5800, 0.3300],
        [0.7700, 0.2500, 0.1000],
        [0.0500, 0.8000, 0.5500]])

In [25]:
x_2

tensor([0.5500, 0.8700, 0.6600])

- Below, we initialize the three weight matrices; note that we are setting `requires_grad=False` to reduce clutter in the outputs for illustration purposes, but if we were to use the weight matrices for model training, we would set `requires_grad=True` to update these matrices during model training

In [26]:
torch.manual_seed(123)

W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

In [27]:
W_query

Parameter containing:
tensor([[0.2961, 0.5166],
        [0.2517, 0.6886],
        [0.0740, 0.8665]])

In [28]:
W_key

Parameter containing:
tensor([[0.1366, 0.1025],
        [0.1841, 0.7264],
        [0.3153, 0.6871]])

In [29]:
W_value

Parameter containing:
tensor([[0.0756, 0.1966],
        [0.3164, 0.4017],
        [0.1186, 0.8274]])

- Next we compute the query, key, and value vectors:

In [30]:
query_2 = x_2 @ W_query # _2 because it's with respect to the 2nd input element
key_2 = x_2 @ W_key 
value_2 = x_2 @ W_value

print(query_2)

tensor([0.4306, 1.4551])


In [None]:
key_2

tensor([0.4433, 1.1419])

In [None]:
value_2

tensor([0.3951, 1.0037])

- As we can see below, we successfully projected the 6 input tokens from a 3D onto a 2D embedding space:

In [None]:
keys = inputs @ W_key 
values = inputs @ W_value

print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])


- In the next step, **step 2**, we compute the unnormalized attention scores by computing the dot product between the query and each key vector:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/15.webp" width="600px">

In [None]:
keys_2 = keys[1] # Python starts index at 0
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)

tensor(1.8524)


- Since we have 6 inputs, we have 6 attention scores for the given query vector:

In [None]:
attn_scores_2 = query_2 @ keys.T # All attention scores for given query
print(attn_scores_2)

tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/16.webp" width="600px">

- Next, in **step 3**, we compute the attention weights (normalized attention scores that sum up to 1) using the softmax function we used earlier
- The difference from earlier is that we now scale the attention scores by dividing them by the square root of the key dimension, $\sqrt{d_k}$ (i.e., `d_k**0.5`):

In [None]:
d_k = keys.shape[1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)

tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/17.webp" width="600px">
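
Why divide by $\sqrt{d_k}$? A rough sketch of the usual argument: if query and key components have roughly unit variance, the spread of their dot product grows like $\sqrt{d_k}$, and large scores push softmax toward a near one-hot output with very small gradients. The dimension 64 and the random vectors below are assumptions for illustration.

```python
import torch

torch.manual_seed(0)

d_k = 64                        # assumed head dimension, for illustration
q = torch.randn(10_000, d_k)    # unit-variance query components
k = torch.randn(10_000, d_k)    # unit-variance key components

dots = (q * k).sum(dim=-1)      # 10,000 independent dot products
print("std of raw dot products:", dots.std().item())                 # close to sqrt(64) = 8
print("std after / sqrt(d_k)  :", (dots / d_k**0.5).std().item())    # close to 1

# Large-magnitude scores push softmax toward a one-hot distribution
scores = torch.tensor([1.0, 2.0, 3.0])
soft = torch.softmax(scores, dim=-1)
sharp = torch.softmax(scores * 8, dim=-1)
print(soft)
print(sharp)
```

With `d_out = 2` in this notebook the effect is small, but it matters at the head dimensions used in real models.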

- In **step 4**, we now compute the context vector for input query vector 2:

In [None]:
context_vec_2 = attn_weights_2 @ values
print(context_vec_2)

tensor([0.3061, 0.8210])


### Implementing a compact SelfAttention class

- Putting it all together, we can implement the self-attention mechanism as follows:

In [None]:
import torch.nn as nn

class SelfAttention_v1(nn.Module):

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        
        attn_scores = queries @ keys.T # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))

tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/18.webp" width="400px">

- We can streamline the implementation above using PyTorch's Linear layers, which are equivalent to a matrix multiplication if we disable the bias units
- Another big advantage of using `nn.Linear` over our manual `nn.Parameter(torch.rand(...))` approach is that `nn.Linear` has a preferred weight initialization scheme, which leads to more stable model training

In [None]:
class SelfAttention_v2(nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))

tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)


- Note that `SelfAttention_v1` and `SelfAttention_v2` give different outputs because they use different initial weights for the weight matrices
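
To check that the two variants differ only in their initial weights, one can copy the weights from one into the other. The sketch below uses compact stand-ins with hypothetical names (`SelfAttentionParam`, `SelfAttentionLinear`) that mirror `SelfAttention_v1` and `SelfAttention_v2`. Note that `nn.Linear` stores its weight with shape `(d_out, d_in)`, so the transpose is copied.

```python
import torch
import torch.nn as nn

class SelfAttentionParam(nn.Module):
    """Mirrors SelfAttention_v1: raw nn.Parameter weight matrices."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        q, k, v = x @ self.W_query, x @ self.W_key, x @ self.W_value
        w = torch.softmax(q @ k.T / k.shape[-1]**0.5, dim=-1)
        return w @ v

class SelfAttentionLinear(nn.Module):
    """Mirrors SelfAttention_v2: bias-free nn.Linear layers."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        q, k, v = self.W_query(x), self.W_key(x), self.W_value(x)
        w = torch.softmax(q @ k.T / k.shape[-1]**0.5, dim=-1)
        return w @ v

torch.manual_seed(123)
x = torch.rand(6, 3)                 # made-up inputs: 6 tokens, 3-D each
sa_param = SelfAttentionParam(3, 2)
sa_linear = SelfAttentionLinear(3, 2)

# nn.Linear computes x @ weight.T, so copy the transposed matrices
with torch.no_grad():
    sa_linear.W_query.weight.copy_(sa_param.W_query.T)
    sa_linear.W_key.weight.copy_(sa_param.W_key.T)
    sa_linear.W_value.weight.copy_(sa_param.W_value.T)

print(torch.allclose(sa_param(x), sa_linear(x), atol=1e-6))
```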

## 3.5 Hiding future words with causal attention

- In causal attention, the attention weights above the diagonal are masked, ensuring that for any given input, the LLM cannot use future tokens when computing the context vectors from the attention weights

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/19.webp" width="400px">

### 3.5.1 Applying a causal attention mask

- In this section, we are converting the previous self-attention mechanism into a causal self-attention mechanism
- Causal self-attention ensures that the model's prediction for a certain position in a sequence is only dependent on the known outputs at previous positions, not on future positions
- In simpler words, this ensures that each next word prediction should only depend on the preceding words
- To achieve this, for each given token, we mask out the future tokens (the ones that come after the current token in the input text):

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/20.webp" width="600px">

- To illustrate and implement causal self-attention, let's work with the attention scores and weights from the previous section: 

In [None]:
# Reuse the query and key weight matrices of the
# SelfAttention_v2 object from the previous section for convenience
queries = sa_v2.W_query(inputs)
keys = sa_v2.W_key(inputs) 
attn_scores = queries @ keys.T

attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)

tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
        [0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
        [0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)


- The simplest way to mask out future attention weights is by creating a mask via PyTorch's tril function with elements below the main diagonal (including the diagonal itself) set to 1 and above the main diagonal set to 0:

In [None]:
context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)

tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])


- Then, we can multiply the attention weights with this mask to zero out the values above the diagonal:

In [None]:
masked_simple = attn_weights*mask_simple
print(masked_simple)

tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<MulBackward0>)


- However, if the mask were applied after softmax, like above, it would disrupt the probability distribution created by softmax
- Softmax ensures that all output values sum to 1
- Masking after softmax would require re-normalizing the outputs to sum to 1 again, which complicates the process and might lead to unintended effects

- To make sure that the rows sum to 1, we can normalize the attention weights as follows:

In [None]:
row_sums = masked_simple.sum(dim=-1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<DivBackward0>)


- While we are technically done with coding the causal attention mechanism now, let's briefly look at a more efficient approach to achieve the same as above
- So, instead of zeroing out attention weights above the diagonal and renormalizing the results, we can mask the unnormalized attention scores above the diagonal with negative infinity before they enter the softmax function:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/21.webp" width="450px">

In [None]:
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)

tensor([[0.2899,   -inf,   -inf,   -inf,   -inf,   -inf],
        [0.4656, 0.1723,   -inf,   -inf,   -inf,   -inf],
        [0.4594, 0.1703, 0.1731,   -inf,   -inf,   -inf],
        [0.2642, 0.1024, 0.1036, 0.0186,   -inf,   -inf],
        [0.2183, 0.0874, 0.0882, 0.0177, 0.0786,   -inf],
        [0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]],
       grad_fn=<MaskedFillBackward0>)


- As we can see below, now the attention weights in each row correctly sum to 1 again:

In [None]:
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)
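
- The two routes (mask after softmax and renormalize, or mask with negative infinity before softmax) can be checked against each other on arbitrary scores. The following is a minimal sketch with made-up random scores rather than the `attn_scores` from above; it works because `exp(-inf)` is exactly 0, so masked positions drop out of the softmax sum:

```python
import torch

torch.manual_seed(0)
n = 4
scores = torch.randn(n, n)                        # made-up, already-scaled attention scores

# Route 1: softmax first, then zero out future positions and renormalize
weights = torch.softmax(scores, dim=-1)
tril = torch.tril(torch.ones(n, n))
route1 = weights * tril
route1 = route1 / route1.sum(dim=-1, keepdim=True)

# Route 2: set future positions to -inf, then apply softmax once
future = torch.triu(torch.ones(n, n), diagonal=1).bool()
route2 = torch.softmax(scores.masked_fill(future, -torch.inf), dim=-1)

print(torch.allclose(route1, route2, atol=1e-6))  # both routes agree
print(route2.sum(dim=-1))                         # row sums of the masked weights
```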


### 3.5.2 Masking additional attention weights with dropout

- In addition, we apply dropout to reduce overfitting during training
- Dropout can be applied in several places:
  - for example, after computing the attention weights;
  - or after multiplying the attention weights with the value vectors
- Here, we will apply the dropout mask after computing the attention weights because it's more common

- Furthermore, in this specific example, we use a dropout rate of 50%, which means randomly masking out half of the attention weights. (When we train the GPT model later, we will use a lower dropout rate, such as 0.1 or 0.2.)

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/22.webp" width="400px">

- If we apply a dropout rate of 0.5 (50%), the non-dropped values will be scaled accordingly by a factor of 1/0.5 = 2
- The scaling is calculated by the formula 1 / (1 - `dropout_rate`)

In [None]:
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5) # dropout rate of 50%
example = torch.ones(6, 6) # create a matrix of ones

print(dropout(example))

tensor([[2., 2., 0., 2., 2., 0.],
        [0., 0., 0., 2., 0., 2.],
        [2., 2., 2., 2., 0., 2.],
        [0., 2., 2., 0., 0., 2.],
        [0., 2., 0., 2., 0., 2.],
        [0., 2., 2., 2., 2., 0.]])
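
- To make the 1 / (1 - `dropout_rate`) scaling concrete, here is a hand-written sketch of the same idea. It is only an illustration, not how `torch.nn.Dropout` is implemented internally, and the random pattern will not match the output above because the random numbers are drawn differently:

```python
import torch

torch.manual_seed(123)
p = 0.5                                    # dropout rate
x = torch.ones(6, 6)

# Keep each element with probability 1 - p, then rescale the survivors
keep_mask = (torch.rand_like(x) > p).float()
manual_dropout = x * keep_mask / (1 - p)

print(manual_dropout)
print("mean:", manual_dropout.mean().item())  # fluctuates around 1.0 across random draws
```

- The rescaling keeps the expected value of each entry unchanged, which is why no extra correction is needed when dropout is switched off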


In [None]:
torch.manual_seed(123)
print(dropout(attn_weights))

tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.7599, 0.6194, 0.6206, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.4921, 0.4925, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.3966, 0.0000, 0.3775, 0.0000, 0.0000],
        [0.0000, 0.3327, 0.3331, 0.3084, 0.3331, 0.0000]],
       grad_fn=<MulBackward0>)


- Note that the resulting dropout outputs may look different depending on your operating system; you can read more about this inconsistency [here on the PyTorch issue tracker](https://github.com/pytorch/pytorch/issues/121595)

### 3.5.3 Implementing a compact causal self-attention class

- Now, we are ready to write a working implementation of self-attention, including the causal and dropout masks
- One more thing we need is code to handle batches consisting of more than one input, so that our `CausalAttention` class supports the batch outputs produced by the data loader we implemented in chapter 2
- For simplicity, to simulate such batch input, we duplicate the input text example:

In [None]:
batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape) # 2 inputs with 6 tokens each, and each token has embedding dimension 3

torch.Size([2, 6, 3])


In [None]:
class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout) # New
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New

    def forward(self, x):
        b, num_tokens, d_in = x.shape # New batch dimension b
        # For inputs where `num_tokens` exceeds `context_length`, this will result in errors
        # in the mask creation further below.
        # In practice, this is not a problem since the LLM (chapters 4-7) ensures that inputs  
        # do not exceed `context_length` before reaching this forward method. 
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2) # Changed transpose
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)  # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_length
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights) # New

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(123)

context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)

context_vecs = ca(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]],

        [[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]]], grad_fn=<UnsafeViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])
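
- One way to convince ourselves that the mask really hides the future is to change only the last token and check that the context vectors of all earlier positions stay the same. The sketch below uses a small functional re-implementation (`causal_attention` is a hypothetical helper, and the weights and inputs are made up), not the `CausalAttention` class itself:

```python
import torch

def causal_attention(x, w_q, w_k, w_v):
    """Single-head causal attention for one sequence x of shape (num_tokens, d_in)."""
    queries, keys, values = x @ w_q, x @ w_k, x @ w_v
    scores = queries @ keys.T
    num_tokens = x.shape[0]
    future = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()
    scores = scores.masked_fill(future, -torch.inf)
    weights = torch.softmax(scores / keys.shape[-1] ** 0.5, dim=-1)
    return weights @ values

torch.manual_seed(123)
d_in, d_out, num_tokens = 3, 2, 6
w_q, w_k, w_v = (torch.rand(d_in, d_out) for _ in range(3))  # made-up weights
x = torch.rand(num_tokens, d_in)                             # made-up inputs

# Change only the last token and recompute
x_changed = x.clone()
x_changed[-1] += 1.0

out = causal_attention(x, w_q, w_k, w_v)
out_changed = causal_attention(x_changed, w_q, w_k, w_v)

print("earlier positions unchanged:", torch.allclose(out[:-1], out_changed[:-1], atol=1e-6))
print("last position unchanged:    ", torch.allclose(out[-1], out_changed[-1], atol=1e-6))
```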


- Note that dropout is only applied during training, not during inference
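
- In PyTorch, this switch is controlled by the module's mode. A minimal sketch with made-up attention weights:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
dropout = nn.Dropout(0.5)
weights = torch.full((2, 4), 0.25)        # made-up attention weights

dropout.train()                           # training mode: entries are zeroed or rescaled
train_out = dropout(weights)
print(train_out)

dropout.eval()                            # evaluation mode: dropout passes values through
eval_out = dropout(weights)
print(eval_out)
```

- Calling `.eval()` on a model that contains dropout layers (such as `CausalAttention`) sets all of its submodules to evaluation mode as well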

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/23.webp" width="500px">

## 3.6 Extending single-head attention to multi-head attention

### 3.6.1 Stacking multiple single-head attention layers

- Below is a summary of the self-attention implemented previously (causal and dropout masks not shown for simplicity)

- This is also called single-head attention:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/24.webp" width="400px">

- We simply stack multiple single-head attention modules to obtain a multi-head attention module:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/25.webp" width="400px">

- The main idea behind multi-head attention is to run the attention mechanism multiple times (in parallel) with different, learned linear projections. This allows the model to jointly attend to information from different representation subspaces at different positions.

In [None]:
class MultiHeadAttentionWrapper(nn.Module):

    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias) 
             for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)


torch.manual_seed(123)

context_length = batch.shape[1] # This is the number of tokens
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(
    d_in, d_out, context_length, 0.0, num_heads=2
)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]],

        [[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([2, 6, 4])


- In the implementation above, the embedding dimension is 4, because we use `d_out=2` as the embedding dimension for the key, query, and value vectors as well as the context vector. And since we have 2 attention heads, we have the output embedding dimension 2*2=4
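
- The shape bookkeeping of the concatenation can be seen in isolation with made-up tensors standing in for the per-head context vectors (a sketch, not the wrapper class itself):

```python
import torch

torch.manual_seed(123)
b, num_tokens, d_out, num_heads = 2, 6, 2, 2

# Stand-ins for the per-head context vectors (made-up values)
head_outputs = [torch.rand(b, num_tokens, d_out) for _ in range(num_heads)]

combined = torch.cat(head_outputs, dim=-1)
print(combined.shape)                                     # (b, num_tokens, num_heads * d_out)

# The first d_out columns come from head 0, the next d_out from head 1
print(torch.equal(combined[..., :d_out], head_outputs[0]))
print(torch.equal(combined[..., d_out:], head_outputs[1]))
```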

### 3.6.2 Implementing multi-head attention with weight splits

- While the above is an intuitive and fully functional implementation of multi-head attention (wrapping the single-head attention `CausalAttention` implementation from earlier), we can write a stand-alone class called `MultiHeadAttention` to achieve the same

- We don't concatenate single attention heads for this stand-alone `MultiHeadAttention` class
- Instead, we create single W_query, W_key, and W_value weight matrices and then split those into individual matrices for each attention head:

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # As in `CausalAttention`, for inputs where `num_tokens` exceeds `context_length`, 
        # this will result in errors in the mask creation further below. 
        # In practice, this is not a problem since the LLM (chapters 4-7) ensures that inputs  
        # do not exceed `context_length` before reaching this forward method.

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) 
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)
        
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2) 
        
        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec

torch.manual_seed(123)

batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]],

        [[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]]], grad_fn=<ViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])


- Note that the above is essentially a rewritten version of `MultiHeadAttentionWrapper` that is more efficient
- The resulting output looks a bit different since the random weight initializations differ, but both are fully functional implementations that can be used in the GPT class we will implement in the upcoming chapters
- Note that in addition, we added a linear projection layer (`self.out_proj`) to the `MultiHeadAttention` class above. This is simply a linear transformation that doesn't change the dimensions. It's a standard convention to use such a projection layer in LLM implementations, but it's not strictly necessary (recent research has shown that it can be removed without affecting the modeling performance; see the further reading section at the end of this chapter)
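
- To see what the `.view` and `.transpose` calls in `forward` do, the following sketch (with a made-up tensor standing in for `self.W_key(x)`) checks that each head simply receives a contiguous slice of columns of the full projection, and that transposing back and flattening recovers the original tensor:

```python
import torch

torch.manual_seed(123)
b, num_tokens, d_out, num_heads = 2, 6, 4, 2
head_dim = d_out // num_heads

keys = torch.rand(b, num_tokens, d_out)        # made-up stand-in for self.W_key(x)

# (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
#                        -> (b, num_heads, num_tokens, head_dim)
split = keys.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)

# Head i sees columns [i*head_dim, (i+1)*head_dim) of the full projection
matches = []
for i in range(num_heads):
    cols = keys[:, :, i * head_dim:(i + 1) * head_dim]
    matches.append(torch.equal(split[:, i], cols))
    print(f"head {i} matches column slice:", matches[-1])

# Undoing the split: transpose back, make memory contiguous, then flatten the heads
merged = split.transpose(1, 2).contiguous().view(b, num_tokens, d_out)
print("round trip recovers keys:", torch.equal(merged, keys))
```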


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/26.webp" width="400px">

- Note that if you are interested in a compact and efficient implementation of the above, you can also consider the [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) class in PyTorch
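
- A minimal usage sketch of that class follows, assuming a PyTorch version that supports the `batch_first` argument. Note two differences from our `MultiHeadAttention` class: it takes a single `embed_dim` for both input and output, and the causal mask is passed at call time as a boolean `attn_mask` in which `True` marks positions that may not be attended to. The sizes and the input tensor are made up:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
embed_dim, num_heads, num_tokens = 4, 2, 6
x = torch.rand(2, num_tokens, embed_dim)       # made-up batch: (b, num_tokens, embed_dim)

mha_torch = nn.MultiheadAttention(
    embed_dim, num_heads, dropout=0.0, bias=False, batch_first=True
)

# True above the diagonal = future positions are hidden
causal_mask = torch.triu(
    torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1
)

# Self-attention: query, key, and value are all the same input
out, attn_weights = mha_torch(x, x, x, attn_mask=causal_mask, need_weights=False)
print(out.shape)                               # (b, num_tokens, embed_dim)
print(attn_weights)                            # None, because need_weights=False
```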

- Since the above implementation may look a bit complex at first glance, let's look at what happens when executing `attn_scores = queries @ keys.transpose(2, 3)`:

In [None]:
# (b, num_heads, num_tokens, head_dim) = (1, 2, 3, 4)
a = torch.tensor([[[[0.2745, 0.6584, 0.2775, 0.8573],
                    [0.8993, 0.0390, 0.9268, 0.7388],
                    [0.7179, 0.7058, 0.9156, 0.4340]],

                   [[0.0772, 0.3565, 0.1479, 0.5331],
                    [0.4066, 0.2318, 0.4545, 0.9737],
                    [0.4606, 0.5159, 0.4220, 0.5786]]]])

print(a @ a.transpose(2, 3))

tensor([[[[1.3208, 1.1631, 1.2879],
          [1.1631, 2.2150, 1.8424],
          [1.2879, 1.8424, 2.0402]],

         [[0.4391, 0.7003, 0.5903],
          [0.7003, 1.3737, 1.0620],
          [0.5903, 1.0620, 0.9912]]]])


- In this case, the matrix multiplication implementation in PyTorch will handle the 4-dimensional input tensor so that the matrix multiplication is carried out between the 2 last dimensions (num_tokens, head_dim) and then repeated for the individual heads 

- For instance, the above becomes a more compact way to compute the matrix multiplication for each head separately, which would otherwise look as follows:

In [None]:
first_head = a[0, 0, :, :]
first_res = first_head @ first_head.T
print("First head:\n", first_res)

second_head = a[0, 1, :, :]
second_res = second_head @ second_head.T
print("\nSecond head:\n", second_res)

First head:
 tensor([[1.3208, 1.1631, 1.2879],
        [1.1631, 2.2150, 1.8424],
        [1.2879, 1.8424, 2.0402]])

Second head:
 tensor([[0.4391, 0.7003, 0.5903],
        [0.7003, 1.3737, 1.0620],
        [0.5903, 1.0620, 0.9912]])
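
- The same equivalence can be checked programmatically for any number of heads. The sketch below uses a made-up random tensor and also shows `torch.einsum` as one more way of writing the batched product:

```python
import torch

torch.manual_seed(123)
a = torch.rand(1, 2, 3, 4)                      # (b, num_heads, num_tokens, head_dim), made-up values

batched = a @ a.transpose(2, 3)                 # one call covers all heads

# Same computation written as an explicit loop over heads
per_head = torch.stack(
    [a[0, h] @ a[0, h].T for h in range(a.shape[1])]
).unsqueeze(0)

# Same computation with explicit index names: i and j run over tokens, d over head_dim
via_einsum = torch.einsum("bhid,bhjd->bhij", a, a)

print(batched.shape)                            # (b, num_heads, num_tokens, num_tokens)
print(torch.allclose(batched, per_head, atol=1e-6))
print(torch.allclose(batched, via_einsum, atol=1e-6))
```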


# Summary and takeaways

- See the [./multihead-attention.ipynb](./multihead-attention.ipynb) code notebook, which is a concise version of the data loader (chapter 2) plus the multi-head attention class that we implemented in this chapter and will need for training the GPT model in upcoming chapters
- You can find the exercise solutions in [./exercise-solutions.ipynb](./exercise-solutions.ipynb)

## Recap

- **Tokenized corpus** → discrete IDs (what).
- **Embeddings** → continuous vectors (how to compute meaning).
- **Positional embeddings** → inject order (where); learned by index and task signals.
- **Q/K/V** → attention scores reflect **alignment** (dot product).
- **Weights** (softmax rows) → how much each token listens to others.
- **Head output Z** → weighted mix of **V**; richer, context-aware representations.

> One slide summary: “**what + where → project → align(Q,K) → mix(V)**”.

**Q&A**


### With and without Value vectors

We'll simulate a trained head that relates **ownership** ("your") to **object nouns** ("journey", "step").
Observe how using `V` transfers *semantic meaning*, while omitting it loses that information.


In [21]:
import torch
import math

# made-up example for clarity
tokens = ["your","journey","starts","with","the","first","step"]
Q = torch.tensor([[0.1,0.2,0.3],
                  [0.8,0.1,0.2],
                  [0.3,0.8,0.1],
                  [0.2,0.4,0.3],
                  [0.1,0.1,0.1],
                  [0.3,0.6,0.2],
                  [0.6,0.2,0.5]])
K = torch.tensor([[0.9,0.1,0.2],
                  [0.3,0.7,0.2],
                  [0.1,0.9,0.3],
                  [0.1,0.2,0.8],
                  [0.1,0.1,0.1],
                  [0.2,0.7,0.3],
                  [0.5,0.2,0.5]])
V = torch.tensor([[1.0,0.0,0.0],
                  [0.2,1.0,0.0],
                  [0.0,0.8,0.6],
                  [0.0,0.3,0.9],
                  [0.0,0.0,0.0],
                  [0.3,0.9,0.1],
                  [0.1,0.6,0.9]])

scores = (Q @ K.T) / math.sqrt(3)
weights = torch.softmax(scores[1], dim=-1)  # journey's attention row

Z_with_V = (weights.unsqueeze(0) @ V).squeeze()
Z_no_V   = (weights.unsqueeze(0) @ K).squeeze()

print("Attention weights (journey):", weights.round(decimals=2))
print("Z_with_V:", Z_with_V.round(decimals=2))
print("Z_no_V  :", Z_no_V.round(decimals=2))


Attention weights (journey): tensor([0.1800, 0.1400, 0.1300, 0.1300, 0.1200, 0.1400, 0.1600])
Z_with_V: tensor([0.2600, 0.5000, 0.3500])
Z_no_V  : tensor([0.3500, 0.4000, 0.3400])


🎓 Presentation Plan: “Step-by-Step Through Self-Attention”

Objective:
To demystify how a Transformer processes a sentence, from text to embeddings, positional information, and the QKV self-attention mechanism, culminating in the attention head output.

In [9]:

import math
import torch
import torch.nn as nn
torch.set_printoptions(precision=4, sci_mode=False)
torch.manual_seed(0)


<torch._C.Generator at 0x7812f02ccfd0>


## 1) Tokenized Corpus vs Embeddings

- **Tokenized corpus**: a sequence of discrete symbols (tokens/IDs).  
- **Embeddings**: continuous vectors that *compress* token meaning into numbers (latent features).

We'll use the sentence used in the book: **"Your journey starts with one step."**


In [4]:

# Tokenization
sentence = "Your journey starts with one step."
tokens = [t.strip(".,!?").lower() for t in sentence.split()]
tokens


['your', 'journey', 'starts', 'with', 'one', 'step']

In [5]:

# Build tiny vocab from the sentence
special = ["<pad>", "<bos>", "<eos>"]
word_set = special + sorted(set(tokens))
vocab = {w:i for i,w in enumerate(word_set)}
ivocab = {i:w for w,i in vocab.items()}

def encode(ws):
    return torch.tensor([vocab["<bos>"]] + [vocab[w] for w in ws] + [vocab["<eos>"]], dtype=torch.long)

ids = encode(tokens)
print("Vocab:", vocab)
print("Token IDs:", ids.tolist())


Vocab: {'<pad>': 0, '<bos>': 1, '<eos>': 2, 'journey': 3, 'one': 4, 'starts': 5, 'step': 6, 'with': 7, 'your': 8}
Token IDs: [1, 8, 3, 5, 7, 4, 6, 2]



**Takeaway:** token IDs are **discrete**; neural nets need **vectors** to compute.  
Next, we embed each token ID into a **dense vector**. We'll use 3D to keep things readable.


In [6]:

# 3D embeddings + fixed values for the six words
emb_dim = 3
emb = nn.Embedding(len(vocab), emb_dim)

with torch.no_grad():
    emb.weight.zero_()
    emb.weight[vocab["your"]]    = torch.tensor([0.43, 0.15, 0.89])
    emb.weight[vocab["journey"]] = torch.tensor([0.55, 0.87, 0.66])
    emb.weight[vocab["starts"]]  = torch.tensor([0.57, 0.85, 0.64])
    emb.weight[vocab["with"]]    = torch.tensor([0.22, 0.58, 0.33])
    emb.weight[vocab["one"]]     = torch.tensor([0.77, 0.25, 0.10])
    emb.weight[vocab["step"]]    = torch.tensor([0.05, 0.80, 0.55])

X_tokens = torch.stack([emb.weight[vocab[w]] for w in tokens], dim=0)
print("Token embeddings (3D) for the 6 words:\n", X_tokens)
print("Shape:", X_tokens.shape)


Token embeddings (3D) for the 6 words:
 tensor([[0.4300, 0.1500, 0.8900],
        [0.5500, 0.8700, 0.6600],
        [0.5700, 0.8500, 0.6400],
        [0.2200, 0.5800, 0.3300],
        [0.7700, 0.2500, 0.1000],
        [0.0500, 0.8000, 0.5500]], grad_fn=<StackBackward0>)
Shape: torch.Size([6, 3])
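
One way to think about the lookup: an embedding layer behaves like multiplying a one-hot vector by the weight matrix, just without building the one-hot vectors. The sketch below illustrates this with randomly initialized weights (not the hand-set values above); the IDs 8, 3, 5 correspond to "your", "journey", "starts" in the toy vocabulary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_dim = 9, 3                      # sizes matching the toy vocabulary
emb_demo = nn.Embedding(vocab_size, emb_dim)    # random weights, for illustration only
ids_demo = torch.tensor([8, 3, 5])              # "your", "journey", "starts"

looked_up = emb_demo(ids_demo)                                      # row lookup
one_hot = F.one_hot(ids_demo, num_classes=vocab_size).float()      # (3, vocab_size)
via_matmul = one_hot @ emb_demo.weight                              # same rows via a matrix product

print(looked_up.shape)                          # (3, emb_dim)
print(torch.allclose(looked_up, via_matmul))
```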



## 2) Latent Dimensions & Similarity (Cosine / Dot)

An embedding dimension can be interpreted as a **latent feature**.  
For intuition, imagine toy axes like:
- dim0: "is_movement"  
- dim1: "is_abstract"  
- dim2: "is_objectness"

> These are made-up for teaching; real models learn their own axes.

We measure **similarity** with **cosine** or **dot product**.  
The attention score uses a **dot product** (scaled) between **Q** and **K**, which is like asking:  
> *Do my interests (Q) align with your attributes (K)?*


In [7]:

def cosine(a,b, dim=-1, eps=1e-8):
    an = a / (a.norm(dim=dim, keepdim=True)+eps)
    bn = b / (b.norm(dim=dim, keepdim=True)+eps)
    return (an*bn).sum(dim=dim)

pairs = [("journey","starts"), ("one","step"), ("your","journey"), ("with","your")]
for a,b in pairs:
    va, vb = emb.weight[vocab[a]], emb.weight[vocab[b]]
    cos = float(cosine(va, vb))
    dot = float(va @ vb)
    print(f"{a:>7s} vs {b:<7s}  cosine={cos:.3f}  dot={dot:.3f}")


journey vs starts   cosine=1.000  dot=1.475
    one vs step     cosine=0.370  dot=0.294
   your vs journey  cosine=0.781  dot=0.954
   with vs your     cosine=0.677  dot=0.475
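
The difference between the two measures is easiest to see by rescaling one vector: the dot product grows with vector length, while the cosine only looks at direction. A small sketch using the "journey" and "starts" vectors from above, cross-checked against PyTorch's built-in `F.cosine_similarity`:

```python
import torch
import torch.nn.functional as F

journey = torch.tensor([0.55, 0.87, 0.66])
starts = torch.tensor([0.57, 0.85, 0.64])

dot = journey @ starts
cos_manual = dot / (journey.norm() * starts.norm())
cos_builtin = F.cosine_similarity(journey, starts, dim=0)

print(f"dot={dot.item():.3f}  cosine={cos_manual.item():.3f}")
print(torch.allclose(cos_manual, cos_builtin))

# Doubling one vector doubles the dot product but leaves the cosine unchanged
dot_scaled = (2 * journey) @ starts
cos_scaled = F.cosine_similarity(2 * journey, starts, dim=0)
print(f"dot={dot_scaled.item():.3f}  cosine={cos_scaled.item():.3f}")
```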



## 3) Add Positional Embeddings

Self-attention alone has no sense of order, so we add positional vectors and sum with token embeddings.
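
To make "no sense of order" concrete: without positional information, shuffling the input tokens only shuffles the output rows in the same way, so the model cannot tell one word order from another. The sketch below uses a hypothetical helper `self_attention` with made-up weights and inputs (no causal mask):

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Unmasked single-head self-attention for x of shape (num_tokens, d_in)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = torch.softmax((q @ k.T) / k.shape[-1] ** 0.5, dim=-1)
    return weights @ v

torch.manual_seed(0)
num_tokens, d_in, d_out = 6, 3, 2
x_demo = torch.rand(num_tokens, d_in)                         # made-up token embeddings, no positions
w_q, w_k, w_v = (torch.rand(d_in, d_out) for _ in range(3))   # made-up projection weights

perm = torch.tensor([5, 4, 3, 2, 1, 0])                       # reverse the word order
out = self_attention(x_demo, w_q, w_k, w_v)
out_perm = self_attention(x_demo[perm], w_q, w_k, w_v)

# Shuffling the inputs only shuffles the outputs in the same way
print(torch.allclose(out[perm], out_perm, atol=1e-6))
```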


In [8]:

pos_emb = nn.Embedding(16, 3)
with torch.no_grad():
    pos_emb.weight.copy_(torch.tensor([
        [0.00, 0.00, 0.00],
        [0.01, 0.02, 0.03],
        [0.02, 0.01, -0.01],
        [0.03, 0.00, 0.01],
        [0.04, -0.01, 0.02],
        [0.05, 0.02, 0.00],
        [0.06, 0.03, -0.02],
        [0.07, 0.01, 0.01],
        [0.08, 0.00, -0.01],
        [0.09, -0.02, 0.02],
        [0.10, 0.03, 0.00],
        [0.11, 0.00, 0.01],
        [0.12, 0.01, -0.02],
        [0.13, -0.01, 0.02],
        [0.14, 0.02, 0.00],
        [0.15, 0.01, 0.01],
    ], dtype=torch.float))

pos = torch.arange(len(tokens))
X = X_tokens + pos_emb(pos)
print("X = token + positional embeddings:\n", X)


X = token + positional embeddings:
 tensor([[0.4300, 0.1500, 0.8900],
        [0.5600, 0.8900, 0.6900],
        [0.5900, 0.8600, 0.6300],
        [0.2500, 0.5800, 0.3400],
        [0.8100, 0.2400, 0.1200],
        [0.1000, 0.8200, 0.5500]], grad_fn=<AddBackward0>)


In [15]:
# Look up the positional embedding at position index 1 (the second token, "journey")
pos_emb(torch.tensor([1]))

tensor([[0.0100, 0.0200, 0.0300]], grad_fn=<EmbeddingBackward0>)

In [14]:
pos_emb(pos)

tensor([[ 0.0000,  0.0000,  0.0000],
        [ 0.0100,  0.0200,  0.0300],
        [ 0.0200,  0.0100, -0.0100],
        [ 0.0300,  0.0000,  0.0100],
        [ 0.0400, -0.0100,  0.0200],
        [ 0.0500,  0.0200,  0.0000]], grad_fn=<EmbeddingBackward0>)

In [16]:
X_tokens

tensor([[0.4300, 0.1500, 0.8900],
        [0.5500, 0.8700, 0.6600],
        [0.5700, 0.8500, 0.6400],
        [0.2200, 0.5800, 0.3300],
        [0.7700, 0.2500, 0.1000],
        [0.0500, 0.8000, 0.5500]], grad_fn=<StackBackward0>)

In [19]:
X_tokens[1] + pos_emb(torch.tensor([1]))

tensor([[0.5600, 0.8900, 0.6900]], grad_fn=<AddBackward0>)

In [20]:
print(pos_emb.weight)

Parameter containing:
tensor([[ 0.0000,  0.0000,  0.0000],
        [ 0.0100,  0.0200,  0.0300],
        [ 0.0200,  0.0100, -0.0100],
        [ 0.0300,  0.0000,  0.0100],
        [ 0.0400, -0.0100,  0.0200],
        [ 0.0500,  0.0200,  0.0000],
        [ 0.0600,  0.0300, -0.0200],
        [ 0.0700,  0.0100,  0.0100],
        [ 0.0800,  0.0000, -0.0100],
        [ 0.0900, -0.0200,  0.0200],
        [ 0.1000,  0.0300,  0.0000],
        [ 0.1100,  0.0000,  0.0100],
        [ 0.1200,  0.0100, -0.0200],
        [ 0.1300, -0.0100,  0.0200],
        [ 0.1400,  0.0200,  0.0000],
        [ 0.1500,  0.0100,  0.0100]], requires_grad=True)


![alt text](<Captura de tela 2025-11-01 160404.png>)


## 4) Attention on the Sentence (Single Head)

We compute **Q**, **K**, **V**, then:
\[
\text{scores} = \frac{QK^\top}{\sqrt{d_k}}, \quad
\text{weights} = \text{softmax}(\text{scores}), \quad
Z = \text{weights}\cdot V
\]


In [None]:

head_dim = 2
W_Q = nn.Linear(3, head_dim, bias=False)
W_K = nn.Linear(3, head_dim, bias=False)
W_V = nn.Linear(3, head_dim, bias=False)

with torch.no_grad():
    W_Q.weight.copy_(torch.tensor([[0.5, 0.0, 0.5],
                                   [0.0, 0.5, -0.5]]))
    W_K.weight.copy_(torch.tensor([[0.4, -0.1, 0.3],
                                   [-0.2, 0.6, 0.1]]))
    W_V.weight.copy_(torch.tensor([[0.3, 0.1, -0.2],
                                   [0.1, -0.3, 0.4]]))

Q, K, V = W_Q(X), W_K(X), W_V(X)
scores = (Q @ K.T) / math.sqrt(head_dim)
weights = torch.softmax(scores, dim=-1)
Z = weights @ V

print("Q:\n", Q, "\n")
print("K:\n", K, "\n")
print("V:\n", V, "\n")
print("Scores (scaled QK^T):\n", scores, "\n")
print("Attention weights (rows sum to 1):\n", weights, "\n")
print("Context vectors Z:\n", Z)
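
Why divide by the square root of `head_dim`? A common explanation is that dot products of higher-dimensional vectors have a larger spread, which pushes softmax toward near one-hot outputs. The sketch below is a rough numerical illustration under the assumption of independent, unit-variance entries (made-up random vectors, not the Q and K above):

```python
import torch

torch.manual_seed(0)
num_samples = 10_000

for d_k in (2, 64, 512):
    q = torch.randn(num_samples, d_k)           # made-up queries with unit-variance entries
    k = torch.randn(num_samples, d_k)           # made-up keys with unit-variance entries
    dots = (q * k).sum(dim=-1)                  # one dot product per sample
    scaled = dots / d_k ** 0.5
    print(f"d_k={d_k:4d}  std(raw)={dots.std().item():6.2f}  "
          f"std(scaled)={scaled.std().item():.2f}")
```

Under these assumptions, the raw standard deviation grows roughly like the square root of `d_k`, while the scaled version stays near 1 regardless of the dimension.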



## 5) Wrap-up

- **Tokenized corpus**: discrete IDs for words/subwords.  
- **Embeddings**: continuous vectors where dimensions act as **latent features**.  
- **Similarity** via cosine/dot reflects **alignment** of features.  
- **Attention** uses the dot product of \(Q\) and \(K\) so each token “listens” to the most relevant others; 
  the resulting **context vector \(Z\)** is a **weighted mixture** of **V** across the sentence.
