```{contents}
```

## Multi-Head Attention

**Multi-Head Attention = Multiple attention mechanisms running in parallel.**

Each head can learn to focus on **different relationships** between tokens.

Example (illustrative; real heads are rarely this cleanly specialized):

* Head 1: learns subject–verb alignment
* Head 2: learns long-range dependencies
* Head 3: learns coreference (“it” → “cat”)
* Head 4: learns punctuation/syntax patterns

The outputs of all heads are concatenated → projected → passed to the next layer.

---

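### Why multiple heads?

A single attention head produces only **one attention distribution** per token, so competing relations have to share a single weighting.
Multiple heads let the model track **different patterns simultaneously**, each in its own projected subspace.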

For example, in the sentence:

**“The cat that I adopted sleeps.”**

A good LLM needs to learn:

* subject relation: cat → sleeps
* relative clause: I adopted → cat
* article relations: The → cat
* semantic meaning: sleeps → cat

A single head would struggle to capture all of these at once.

---

### How Multi-Head Attention works

Suppose we have **h heads**.
The computation runs in four steps:

#### 1. Create separate projection matrices:

* $W_Q^1, W_K^1, W_V^1$
* $W_Q^2, W_K^2, W_V^2$
* ...
* $W_Q^h, W_K^h, W_V^h$

#### 2. Compute attention independently for each head:

$$
\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)
$$

#### 3. Concatenate all heads:

$$
\text{concat} = [\text{head}_1, \text{head}_2, \dots, \text{head}_h]
$$

#### 4. Apply a final output projection:

$$
\text{MHAoutput} = \text{concat} \cdot W_O
$$

Where $W_O$ is another learned matrix of shape $(h \cdot d_v) \times d_{\text{model}}$, with $d_v$ the per-head value dimension.
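The four steps above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function names, the random weights, and the sizes (`seq_len = 5`, `d_model = 8`, `h = 2`) are assumptions chosen only to show the shapes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for a single head.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V are lists of h matrices, each (d_model, d_head).
    heads = [attention(X @ wq, X @ wk, X @ wv)          # step 2
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    concat = np.concatenate(heads, axis=-1)             # step 3: (seq_len, h * d_head)
    return concat @ W_O                                 # step 4: (seq_len, d_model)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 8, 2
d_head = d_model // h

X = rng.normal(size=(seq_len, d_model))
W_Q = [rng.normal(size=(d_model, d_head)) for _ in range(h)]   # step 1
W_K = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_O = rng.normal(size=(h * d_head, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (5, 8)
```

Real implementations usually fuse the per-head matrices into one large projection and reshape, but the loop form mirrors the equations more directly.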

---

### **4. Visual Overview (simple)**

```
                  ┌─────────────┐
Input Embedding → │  Head 1     │ ─┐
                  ├─────────────┤  │
                  │  Head 2     │ ─┤
                  ├─────────────┤  │
                  │  Head 3     │ ─┤
                  └─────────────┘  │
                                   ↓
                      Concatenate Outputs
                                   ↓
                         Linear Projection
                                   ↓
                            MHA Output
```

Each head sees the same input but learns different patterns.

---

### **5. Mini Numerical Example (2 Heads)**

To keep it simple:

* Only **one token**
* Model dimension = 4
* Each head dimension = 2
* We show how heads create different outputs

#### Input token embedding:

```
X = [1, 2, 3, 4]
```

---

### **Head 1 projection matrices**

Pick simple values:

```
WQ1 = [[1,0],[0,1],[0,0],[1,0]]
WK1 = [[1,1],[0,1],[1,0],[0,1]]
WV1 = [[1,0],[0,2],[1,1],[0,1]]
```

Compute:

```
Q1 = X @ WQ1
K1 = X @ WK1
V1 = X @ WV1
```

With only one token, the softmax runs over a single score and equals 1, so the attention output is just the value vector:

```
Q1 = [5, 2]
K1 = [4, 7]
head1_output = V1 = [4, 11]
```

---

### **Head 2 projection matrices**

Different values:

```
WQ2 = [[0,1],[1,0],[1,0],[0,1]]
WK2 = [[0,1],[1,1],[0,0],[1,0]]
WV2 = [[0,1],[1,1],[0,2],[1,0]]
```

Compute:

```
Q2 = X @ WQ2
K2 = X @ WK2
V2 = X @ WV2
```

By the same argument:

```
Q2 = [5, 5]
K2 = [6, 3]
head2_output = V2 = [6, 9]
```

---

### **Concatenate heads**

```
concat = [4, 11, 6, 9]
```

---

### **Final output projection**

With a learned 4×4 matrix $W_O$:

```
MHA_output = concat @ W_O
```

This produces the final vector passed to the next layer.
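The whole example can be checked by running the matrices through NumPy. Because the sequence has a single token, the softmax is taken over one score and equals 1, so each head returns its value vector unchanged. The identity `W_O` below is a hypothetical placeholder (the text does not specify one); a trained model would use a learned matrix.

```python
import numpy as np

X = np.array([[1, 2, 3, 4]], dtype=float)  # one token, d_model = 4

WQ1 = np.array([[1, 0], [0, 1], [0, 0], [1, 0]], dtype=float)
WK1 = np.array([[1, 1], [0, 1], [1, 0], [0, 1]], dtype=float)
WV1 = np.array([[1, 0], [0, 2], [1, 1], [0, 1]], dtype=float)

WQ2 = np.array([[0, 1], [1, 0], [1, 0], [0, 1]], dtype=float)
WK2 = np.array([[0, 1], [1, 1], [0, 0], [1, 0]], dtype=float)
WV2 = np.array([[0, 1], [1, 1], [0, 2], [1, 0]], dtype=float)

def attention(Q, K, V):
    # Scaled dot-product attention with a max-shifted softmax.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

head1 = attention(X @ WQ1, X @ WK1, X @ WV1)
head2 = attention(X @ WQ2, X @ WK2, X @ WV2)
concat = np.concatenate([head1, head2], axis=-1)

W_O = np.eye(4)  # assumption: identity stand-in for the learned output projection
mha_output = concat @ W_O

print(head1.tolist())       # [[4.0, 11.0]]
print(head2.tolist())       # [[6.0, 9.0]]
print(mha_output.tolist())  # [[4.0, 11.0, 6.0, 9.0]]
```

With more than one token the softmax weights would no longer be trivially 1, and the Q and K projections would start to influence the result.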

---

### **6. Key points to remember**

#### **A. Each head has different WQ, WK, WV**

So each head attends to different features in the sequence.

#### **B. All heads see the full input**

But learn different attention patterns.

#### **C. Multi-head attention = multiple perspectives**

It is one of the mechanisms that helps LLMs:

* resolve pronouns
* understand relationships
* perform reasoning
* encode structure
* remember long context

#### **D. Outputs are merged**

Concatenation → linear projection → next layer.


**Summary**

Multi-head attention =
**“Run attention several times with different learned projections, so the model can focus on multiple aspects of the text at once.”**

Each head can learn something different.
Combining them gives a richer representation.

---

### 1. Why Attention Exists (Core Intuition)

Before explaining self-attention and cross-attention, understand the **problem** they solve.

Neural Machine Translation (NMT) requires:

* Reading a source sentence (English)
* Understanding it as a whole
* Generating a target sentence (French)

RNNs and LSTMs struggled because:

* They compress the entire meaning of a sentence into a *single* hidden vector.
* Long sentences are hard to encode correctly.
* They process words sequentially, slowing down training.

**Attention** solved this by allowing the model to:

* Look back at specific words it needs,
* Weigh them differently depending on context,
* And, once recurrence is dropped as in Transformers, process many words in parallel.

This idea became the foundation of Transformers.

---

### 2. What Q, K, V Represent (Intuitive View)

In all attention mechanisms, we project token embeddings into:

* **Query (Q)** → What I am looking for
* **Key (K)** → What information I offer
* **Value (V)** → The actual information content

Analogy:
Imagine researching in a library.

* Query = the question you're trying to answer
* Key = the index of each book
* Value = the content inside the book

Attention computes similarity between Query and Key, and uses that to decide how much of Value to read.

The formula:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

This is:

1. Compare Q with every K
2. Convert similarities into probabilities (softmax)
3. Blend the Value vectors using these probabilities

This blending produces **contextual embeddings**.
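The three steps map directly onto a few lines of NumPy. The sketch below is illustrative only: the function name and the toy `Q`, `K`, `V` values are made up for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    # 1. Compare every query with every key.
    scores = Q @ K.T / np.sqrt(d_k)
    # 2. Softmax over the key axis (shifted by the row max for stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # 3. Blend the value vectors using the weights.
    return weights @ V, weights

# Toy data (hypothetical): 3 tokens, d_k = 2.
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

context, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(3))  # each row sums to 1
print(context.round(3))  # one contextual vector per token
```

Each row of `weights` is a probability distribution over the keys, and each row of `context` is the corresponding weighted mix of value vectors.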

---

### 3. Self-Attention (Detailed, Practical Intuition)

Self-attention means:
**A token looks at every token in the same sentence (itself included) to understand its contextual meaning.**

All Q, K, V come from the *same* sentence.

#### Why do we need this?

Words change meaning based on context:

* *“bank”* (river bank vs monetary bank)
* *“trains”* (verb or noun)

Self-attention adjusts the embedding of each word based on the other words around it.

#### How it works in translation

Example:
“The boy **trains** the puppy.”

The raw word embedding of “trains” is ambiguous.
Self-attention allows “trains” to check:

* "boy" → subject
* "puppy" → object
* "the" → determiner

Because it sees these words, the model can infer that “trains” is a *verb*, not a noun.

**Effect:**
A new enriched, context-aware embedding of “trains” is produced.
This enriched embedding is what the encoder passes to the decoder.

#### Bidirectional vs Masked

* **Encoder self-attention**: Can look left and right (bidirectional).
* **Decoder self-attention**: Only looks left (causal mask), to prevent cheating by seeing future words when predicting.
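A causal mask can be sketched by setting the scores for future positions to negative infinity before the softmax, so their weights become exactly zero. For brevity this example skips the learned projections and uses the raw embeddings as both queries and keys, which is a simplification; the function name and sizes are hypothetical.

```python
import numpy as np

def causal_attention_weights(X):
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    # True above the diagonal marks future positions that must be hidden.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional embeddings (made up)

w = causal_attention_weights(X)
print(w.round(2))  # lower-triangular: row i only attends to positions 0..i
```

Dropping the mask (or passing an all-`False` mask) gives the bidirectional behaviour used in the encoder.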

---

### 4. Cross-Attention (Detailed, Practical Intuition)

Cross-attention connects the decoder with the encoder.

**The decoder uses its own Query, and attends to the encoder's Key and Value.**

#### Why?

When generating a translation, the decoder needs to look back at the source sentence.

Example:
Translating to French:

“The boy trains the puppy.” → “Le garçon entraîne le chiot.”

When the decoder is about to output the French equivalent of “trains”:

* Query = a projection of the decoder’s current hidden state
* Keys = projections of the encoder’s representation of each English word
* Values = a separate projection of the same encoder representations

Cross-attention determines which source word is most relevant.

#### What happens internally?

Decoder asks:

> “Which English word should I focus on now?”

In a well-trained model, the attention score is typically highest for the source word “trains”.

So the decoder retrieves that part of the encoder's output and uses it to output the correct French verb form “entraîne”.

#### Why cross-attention is critical

Without cross-attention:

* The decoder would have no access to the source sentence and would generate output blindly
* Translation quality would collapse
* Alignment between source and target words would be lost

Cross-attention is a learnable lookup into encoder memory.
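As a rough sketch, cross-attention differs from self-attention only in where the inputs come from: queries are projected from decoder states, keys and values from encoder outputs. The names, sizes, and random weights below are assumptions for illustration, not values from a trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_states, enc_outputs, W_Q, W_K, W_V):
    Q = dec_states @ W_Q    # queries come from the decoder
    K = enc_outputs @ W_K   # keys come from the encoder
    V = enc_outputs @ W_V   # values come from the encoder
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_head = 8, 4
enc_outputs = rng.normal(size=(5, d_model))  # 5 source tokens
dec_states = rng.normal(size=(3, d_model))   # 3 target tokens generated so far

W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

context, weights = cross_attention(dec_states, enc_outputs, W_Q, W_K, W_V)
print(weights.shape)  # (3, 5): one distribution over source tokens per target token
print(context.shape)  # (3, 4)
```

No causal mask is needed here, because the whole source sentence is already available to the decoder at every step.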

---

### 5. Putting Both Together (Full Translation Process)

#### Step 1: Encoder (Self-Attention)

The encoder reads the English sentence.

Self-attention refines each word:

* “trains” becomes a verb representation
* “boy” becomes a subject representation
* “puppy” becomes an object representation

It outputs a sequence of embeddings that represent the whole sentence meaningfully.

---

#### Step 2: Decoder (Masked Self-Attention)

When generating output token by token:

* The decoder uses masked self-attention to understand what it has generated so far.

---

#### Step 3: Cross-Attention (Connecting encoder and decoder)

At each decoding step:

* Decoder Q looks at encoder K, V
* Retrieves most relevant part of the source sentence
* Uses that to produce the next word

This is how alignment between languages emerges.

---

**Summary Table**

| Concept                      | Source of Q    | Source of K & V | Purpose                             |
| ---------------------------- | -------------- | --------------- | ----------------------------------- |
| **Self-attention (encoder)** | Encoder tokens | Encoder tokens  | Understand source sentence context  |
| **Self-attention (decoder)** | Decoder tokens | Decoder tokens  | Understand partial output so far    |
| **Cross-attention**          | Decoder tokens | Encoder tokens  | Link source meaning to output words |

---

**Most Important Intuition**

* **Self-attention helps each word understand its meaning by looking at surrounding words.**
* **Cross-attention helps the decoder retrieve the right source information at the right time.**
