```{contents}
```

## Multi-Head Attention

**Multi-Head Attention = Multiple attention mechanisms running in parallel.**

Each head learns to focus on **different relationships** between tokens.

Example (illustrative; trained heads are usually less cleanly separable than this):

* Head 1: learns subject–verb alignment
* Head 2: learns long-range dependencies
* Head 3: learns coreference (“it” → “cat”)
* Head 4: learns punctuation/syntax patterns

The outputs of all heads are concatenated → projected → passed to the next layer.

---

### Why multiple heads?

A single attention head produces **one attention distribution** per token, so it can emphasize only one weighting pattern over the sequence at a time.
Multiple heads allow the model to process **different patterns simultaneously**.

For example, in the sentence:

**“The cat that I adopted sleeps.”**

A good LLM needs to learn:

* subject relation: cat → sleeps
* relative clause: I adopted → cat
* article relations: The → cat
* semantic meaning: sleeps → cat

A single head would have to squeeze all of these relations into one attention distribution per token.
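To make this concrete, the sketch below computes the attention weights of two heads over the same input. It assumes scaled dot-product attention as the scoring rule. The embeddings and projection matrices are random stand-ins for learned values, and the sizes (6 tokens, one per word of the sentence above, model dimension 8, head dimension 4) and the helper name `attention_weights` are hypothetical choices for illustration only.

```python
import numpy as np

def attention_weights(X, W_Q, W_K):
    """Row-wise softmax of scaled dot-product scores for one head."""
    Q, K = X @ W_Q, X @ W_K
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)      # fixed seed, arbitrary choice
X = rng.normal(size=(6, 8))         # 6 tokens, assumed d_model = 8

# Two heads = two independent (W_Q, W_K) pairs (random, not trained)
heads = [(rng.normal(size=(8, 4)), rng.normal(size=(8, 4))) for _ in range(2)]
A1, A2 = (attention_weights(X, W_Q, W_K) for W_Q, W_K in heads)

print(A1.shape, A2.shape)   # one 6x6 weight matrix per head
print(A1[1].round(2))       # how token 1 distributes attention in head 1
print(A2[1].round(2))       # the same token in head 2
```

Each row of `A1` and `A2` sums to 1, so a single head gives every token exactly one distribution over the sequence; the second head supplies a second, independent one.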

---

### How Multi-Head Attention works

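Suppose we have **h heads** and model dimension $d_{\text{model}}$. Each head typically works in a smaller subspace of size $d_k = d_{\text{model}} / h$, so the combined cost stays close to that of one full-width head.
For each head: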

#### 1. Create separate projection matrices:

* $W_Q^1, W_K^1, W_V^1$
* $W_Q^2, W_K^2, W_V^2$
* ...
* $W_Q^h, W_K^h, W_V^h$

#### 2. Compute attention independently for each head:

$$
\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)
$$
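
The formula above does not spell out what $\text{Attention}$ computes. Assuming the standard scaled dot-product form (an assumption, since the text does not name the attention variant), with $d_k$ the per-head key dimension, it would read:

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

The softmax is taken row by row, so each token gets its own set of weights over all tokens.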

#### 3. Concatenate all heads:

$$
\text{concat} = [\text{head}_1, \text{head}_2, \dots, \text{head}_h]
$$

#### 4. Apply a final output projection:

$$
\text{MHAoutput} = \text{concat} \cdot W_O
$$

Here $W_O$ is another learned matrix. It mixes information across heads and maps the concatenated vector back to the model dimension.
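
The four steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it assumes scaled dot-product attention, uses random matrices in place of learned weights, and the function names and sizes (5 tokens, model dimension 8, 2 heads of width 4) are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head (rows are tokens)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) triples, one per head (step 1)."""
    outputs = [attention(X @ W_Q, X @ W_K, X @ W_V)   # step 2
               for W_Q, W_K, W_V in heads]
    concat = np.concatenate(outputs, axis=-1)          # step 3
    return concat @ W_O                                # step 4

rng = np.random.default_rng(0)
n_tokens, d_model, h = 5, 8, 2          # assumed sizes
d_head = d_model // h
X = rng.normal(size=(n_tokens, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(h)]
W_O = rng.normal(size=(h * d_head, d_model))

out = multi_head_attention(X, heads, W_O)
print(out.shape)   # (5, 8): same shape as the input X
```

The explicit loop over heads mirrors the description above. Libraries usually fuse the per-head matrices into one larger projection and reshape, which computes the same thing more efficiently.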

---

### Visual Overview (simple)

```
                ┌─────────────┐
Input Embedding →  Head 1      ─┐
                ├─────────────┤ │
                │  Head 2      │ │
                ├─────────────┤ │
                │  Head 3      │ │
                └─────────────┘ │
                                 ↓
                    Concatenate Outputs
                                 ↓
                       Linear Projection
                                 ↓
                         MHA Output
```

Each head sees the same input but learns different patterns.

---

### Mini Numerical Example (2 Heads)

To keep it simple:

* Only **one token**
* Model dimension = 4
* Each head dimension = 2
* We show how heads create different outputs

#### Input token embedding:

```
X = [1, 2, 3, 4]
```

---

#### Head 1 projection matrices

Pick simple values:

```
WQ1 = [[1,0],[0,1],[0,0],[1,0]]
WK1 = [[1,1],[0,1],[1,0],[0,1]]
WV1 = [[1,0],[0,2],[1,1],[0,1]]
```

Compute:

```
Q1 = X @ WQ1
K1 = X @ WK1
V1 = X @ WV1
```
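
For readers who want to check these products, here is a short NumPy sketch using the matrices above. The scaling by the square root of the head dimension assumes standard scaled dot-product attention, which the text does not state explicitly.

```python
import numpy as np

X = np.array([1, 2, 3, 4])

WQ1 = np.array([[1, 0], [0, 1], [0, 0], [1, 0]])
WK1 = np.array([[1, 1], [0, 1], [1, 0], [0, 1]])
WV1 = np.array([[1, 0], [0, 2], [1, 1], [0, 1]])

Q1 = X @ WQ1   # values 5, 2
K1 = X @ WK1   # values 4, 7
V1 = X @ WV1   # values 4, 11

# One query and one key give a single scalar score
score = Q1 @ K1 / np.sqrt(len(K1))   # 34 / sqrt(2), about 24.04

print(Q1, K1, V1, round(float(score), 2))
```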

After the attention calculation we get `head1_output`.
With only one token there is a single attention score, so its softmax weight is exactly 1 and the head output equals the value vector `V1`:

```
head1_output = [4, 11]
```

---

#### Head 2 projection matrices

Different values:

```
WQ2 = [[0,1],[1,0],[1,0],[0,1]]
WK2 = [[0,1],[1,1],[0,0],[1,0]]
WV2 = [[0,1],[1,1],[0,2],[1,0]]
```

Compute:

```
Q2 = X @ WQ2
K2 = X @ WK2
V2 = X @ WV2
```

By the same argument, the output of head 2 equals `V2`:

```
head2_output = [6, 9]
```

---

#### Concatenate heads

```
concat = [4, 11, 6, 9]
```

---

#### Final output projection

With some learned matrix $W_O$:

```
MHA_output = concat @ W_O
```

This produces the final vector passed to the next layer.
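
The whole example can also be run end to end. The sketch below assumes standard scaled dot-product attention; with a single token the softmax weight is exactly 1, so each head simply returns its value vector. The matrix `W_O` here is hypothetical, chosen only so the arithmetic is easy to follow, since the text leaves it unspecified.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (assumed form; rows are tokens)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

X = np.array([[1.0, 2.0, 3.0, 4.0]])   # one token, model dimension 4

head1 = (np.array([[1, 0], [0, 1], [0, 0], [1, 0]]),   # WQ1
         np.array([[1, 1], [0, 1], [1, 0], [0, 1]]),   # WK1
         np.array([[1, 0], [0, 2], [1, 1], [0, 1]]))   # WV1
head2 = (np.array([[0, 1], [1, 0], [1, 0], [0, 1]]),   # WQ2
         np.array([[0, 1], [1, 1], [0, 0], [1, 0]]),   # WK2
         np.array([[0, 1], [1, 1], [0, 2], [1, 0]]))   # WV2

outputs = [attention(X @ WQ, X @ WK, X @ WV) for WQ, WK, WV in (head1, head2)]
concat = np.concatenate(outputs, axis=-1)   # values 4, 11, 6, 9

# Hypothetical W_O (not from the text): each output mixes one value per head
W_O = np.array([[1, 0, 0, 1],
                [0, 1, 1, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1]])
mha_output = concat @ W_O                   # values 10, 20, 17, 13

print(concat)
print(mha_output)
```

With more than one token the softmax weights would no longer be trivial, and each head's output would become a weighted mix of several value vectors.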

---

### Key points to remember

#### A. Each head has different WQ, WK, WV

So each head can attend to different features in the sequence.

#### B. All heads see the full input

But they learn different attention patterns.

#### C. Multi-head attention == multi-perspective understanding

This is one of the mechanisms that helps LLMs:

* resolve pronouns
* understand relationships between tokens
* perform reasoning
* encode structure
* draw on long context

#### D. Outputs are merged

Concatenation → linear projection → next layer.

---

### Summary (easy version)

Multi-head attention =
**“Run attention several times with different learned projections, so the model can focus on multiple aspects of the text at once.”**

Each head learns something different.
Combine all → richer understanding.

