```{contents}
```

## Attention Mechanism

The **attention mechanism** is the core operation inside Transformers (and therefore LLMs). It lets the model **focus on the most relevant tokens** in a sequence when generating or understanding text.


---

### **1. What Attention Does (Intuition)**

For every word (token), the model asks:

**“Which other words in the sentence should I pay attention to?”**

Example:
In the sentence *“The cat sat on the mat because it was tired.”*
The model must understand that **“it” refers to “cat”**, not “mat”.

Attention assigns **scores** to all token pairs to decide relevance.

---

### **2. How Attention Works (Simple Explanation)**

Each token is converted into three vectors:

1. **Q (Query)** – What this token is looking for
2. **K (Key)** – What this token offers
3. **V (Value)** – Actual information content

Then attention is computed as:

```
Attention = softmax( Q · Kᵀ / √d ) · V
```

Where:

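* `Q · Kᵀ` = dot-product similarity between every query and every key
* `√d` = scaling factor, where `d` is the dimension of the key vectors; it keeps large dot products from saturating the softmax
* `softmax` = turns each row of scores into weights that sum to 1
* The result is a **weighted mix of all value vectors**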
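The formula can be sketched in a few lines of plain Python. This is a minimal illustration rather than how any particular library implements it; the helper names `softmax` and `attention` and the two-token inputs are made up for the example.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q · Kᵀ / √d) · V, with Q, K, V given as lists of row vectors."""
    d = len(K[0])  # dimension of each key vector
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by √d.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted mix of all value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Hypothetical inputs: two tokens, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
```

Each query here matches its own key best, so each output row leans toward the corresponding value vector while still mixing in a share of the other one.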

---

### **3. Why Self-Attention Is Powerful**

Self-attention lets each token look at *all* other tokens simultaneously.
This allows the model to learn:

* long-range dependencies
* context relationships
* meaning across distant words
* structural patterns

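Traditional RNNs process tokens one at a time, so information from a distant token has to pass through every intermediate step. That makes long-range dependencies hard to learn and prevents processing the whole sequence in parallel.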

---

### **4. Multi-Head Attention**

Instead of one attention calculation, LLMs use **multiple heads**.

Each head learns:

* different types of relationships
* different semantic patterns

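Patterns that individual heads are often described as picking up (illustrative; real heads are rarely this clean):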

* “subject → verb” relationships
* “coreference” (pronoun resolution)
* “syntax patterns”
* “entity tracking”
* “reasoning links”

The outputs of all heads are concatenated and passed through a learned linear projection.
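In the standard Transformer formulation this is usually written as below. The symbols are notation introduced here for illustration: $h$ is the number of heads, $X$ the input token matrix, $W_i^Q, W_i^K, W_i^V$ the per-head projection matrices, and $W^O$ the output projection.

$$
\text{head}_i = \text{Attention}(X W_i^Q,\; X W_i^K,\; X W_i^V), \qquad
\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O
$$

Each head typically works in a smaller subspace (dimension $d_{\text{model}} / h$), so the total cost is comparable to a single full-width attention.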

---

### **5. Why LLMs Scale So Well with Attention**

#### Advantages:

* Parallelizable (unlike RNNs)
* Global context understanding
* Any two tokens are connected in a single step, no matter how far apart they are
* Learns complex reasoning patterns

#### Disadvantages:

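* Quadratic compute and memory cost: O(n²) in the sequence length n (each token looks at all tokens)

This is why modern models use techniques such as:

* Sliding window attention (each token attends only to a local neighborhood)
* Sparse attention (only selected token pairs are scored)
* FlashAttention (exact attention computed in a memory-efficient way; the arithmetic is still quadratic)
* ALiBi and RoPE positional schemes (these do not reduce the cost, but help models cope with longer contexts)
* Other long-context optimizations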
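To make the quadratic cost concrete, the sketch below counts how many query-key scores must be computed with full attention versus a causal sliding window. The function names and the window size of 256 are arbitrary choices for illustration, not values taken from any specific model.

```python
def full_attention_pairs(n):
    """Number of (query, key) scores when every token sees every token."""
    return n * n

def sliding_window_pairs(n, window):
    """Scores when each token sees only itself and the previous window-1 tokens."""
    return sum(min(i + 1, window) for i in range(n))

for n in (1_000, 10_000):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=256))
```

Multiplying the sequence length by 10 multiplies the full-attention count by 100, while the windowed count grows only roughly linearly.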

---

### **6. Types of Attention in LLMs**

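* **Self-attention**: each token attends to the tokens of the same input sequence (including itself)
* **Cross-attention** (in encoder–decoder): decoder attends to encoder output
* **Causal attention** (GPT-style): a form of self-attention in which a token can only attend to itself and *earlier* tokens

Causal attention keeps the model from seeing the tokens it is being trained to predict, which is what makes next-token prediction valid.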
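One common way to implement the causal restriction is to set the scores of future positions to negative infinity before the softmax, so they receive zero weight. The sketch below assumes the usual convention that a token may attend to itself as well as to earlier tokens; the function name and the score matrix are made up for illustration.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where token i may only attend to positions 0..i."""
    weights = []
    for i, row in enumerate(scores):
        # Mask future positions (j > i) with -inf so exp() gives them weight 0.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)  # always finite, because position i itself is unmasked
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical 3x3 score matrix (rows = queries, columns = keys).
scores = [[1.0, 2.0, 3.0],
          [1.0, 1.0, 2.0],
          [2.0, 1.0, 3.0]]
weights = causal_softmax(scores)
```

The first token can only attend to itself, so its weight row is `[1.0, 0.0, 0.0]`; the last token sees the whole sequence.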

---

### **7. Summary (Easy Version)**

* Attention lets LLMs decide *which words matter most*.
* It uses **Q, K, V vectors** to compute relevance.
* Multi-head attention captures many relationships at once.
* It enables long-context understanding and reasoning.


### **Worked Example: Attention on a Tiny Sentence**

The following step-by-step demonstration computes attention by hand for a short sentence, using tiny numbers so that every operation is visible.

It covers:

1. Token embeddings
2. How Q, K, V are computed
3. How attention scores are produced
4. Softmax
5. Weighted sum of values
6. Final attention output

---

### **Sentence**

**“The cat sleeps.”**

Tokens:

1. The
2. cat
3. sleeps

We will compute **attention for the token "cat"** (token 2).

---

### **Step 1 — Token embeddings (toy 2-dimensional vectors)**

These are not real embeddings; just example numeric vectors.

```
x_the    = [1, 0]
x_cat    = [0, 1]
x_sleeps = [1, 1]
```

---

### **Step 2 — Define WQ, WK, WV matrices (simple numbers)**

These are the *learned weights* in real LLMs, but here we use simple matrices.

Assume:

```
WQ = [[1, 0],
      [0, 1]]

WK = [[1, 1],
      [0, 1]]

WV = [[1, 0],
      [0, 2]]
```

---

### **Step 3 — Compute Q, K, V for each token**

Use the formula:

$$
Q = XW_Q,\;\; K = XW_K,\;\; V = XW_V
$$

#### For **“cat”**:

```
Q_cat = x_cat @ WQ = [0,1] @ [[1,0],[0,1]] 
       = [0*1 + 1*0 , 0*0 + 1*1] 
       = [0, 1]
```

#### For **keys**:

```
K_the    = [1,0] @ [[1,1],[0,1]] = [1,1]
K_cat    = [0,1] @ [[1,1],[0,1]] = [0,1]
K_sleeps = [1,1] @ [[1,1],[0,1]] = [1,2]
```

#### For **values**:

```
V_the    = [1,0] @ [[1,0],[0,2]] = [1,0]
V_cat    = [0,1] @ [[1,0],[0,2]] = [0,2]
V_sleeps = [1,1] @ [[1,0],[0,2]] = [1,2]
```
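The projections above can be checked with a few lines of plain Python. The helper `project` is a name introduced here for illustration; it multiplies a row vector by a matrix, like `x @ W` in the notation above.

```python
def project(x, W):
    """Row vector times matrix: x @ W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

X = {"The": [1, 0], "cat": [0, 1], "sleeps": [1, 1]}
WQ = [[1, 0], [0, 1]]
WK = [[1, 1], [0, 1]]
WV = [[1, 0], [0, 2]]

Q = {tok: project(x, WQ) for tok, x in X.items()}
K = {tok: project(x, WK) for tok, x in X.items()}
V = {tok: project(x, WV) for tok, x in X.items()}

print(Q["cat"])  # [0, 1]
print(K)         # {'The': [1, 1], 'cat': [0, 1], 'sleeps': [1, 2]}
print(V)         # {'The': [1, 0], 'cat': [0, 2], 'sleeps': [1, 2]}
```

Because `WQ` is the identity matrix in this toy setup, every query is equal to its embedding.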

---

### **Step 4 — Compute attention scores using Q · Kᵀ**

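$$
\text{score} = Q_{\text{cat}} \cdot K^T
$$

To keep the arithmetic simple, this walkthrough skips the division by √d from the attention formula; with d = 2, every score below would otherwise be divided by √2 ≈ 1.414.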

#### Score(cat → The)

```
[0,1] · [1,1] = 0*1 + 1*1 = 1
```

#### Score(cat → cat)

```
[0,1] · [0,1] = 0*0 + 1*1 = 1
```

#### Score(cat → sleeps)

```
[0,1] · [1,2] = 0*1 + 1*2 = 2
```

#### Raw scores:

```
[1, 1, 2]
```
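The walkthrough uses unscaled dot products, whereas the attention formula divides by √d. The short sketch below computes both versions for the query of "cat" so the difference is visible; the variable names are chosen here for illustration.

```python
import math

q_cat = [0, 1]
keys = {"The": [1, 1], "cat": [0, 1], "sleeps": [1, 2]}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Raw scores, as used in the walkthrough.
raw = {tok: dot(q_cat, k) for tok, k in keys.items()}

# Scaled scores, as in the full formula (d = 2 in this toy example).
d = len(q_cat)
scaled = {tok: s / math.sqrt(d) for tok, s in raw.items()}

print(raw)     # {'The': 1, 'cat': 1, 'sleeps': 2}
print(scaled)
```

Scaling shrinks the gaps between scores, so the resulting softmax weights would be somewhat less peaked than the unscaled ones used in the following steps.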

---

### **Step 5 — Apply softmax to scores**

Softmax converts scores to probabilities (attention weights).

Compute exponentials:

```
exp([1,1,2]) = [2.718, 2.718, 7.389]
sum = 12.825
```

Attention weights:

```
[2.718/12.825 , 2.718/12.825 , 7.389/12.825]
= [0.212 , 0.212 , 0.576]
```

#### Final attention weights:

```
The    = 0.212
cat    = 0.212
sleeps = 0.576
```
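The softmax step can be reproduced as follows. This sketch subtracts the maximum score before exponentiating, a common trick for numerical stability that is not part of the hand calculation above and does not change the result.

```python
import math

def softmax(scores):
    m = max(scores)  # shift so the largest exponent is exp(0) = 1
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([1, 1, 2])
rounded = [round(w, 3) for w in weights]
print(rounded)  # [0.212, 0.212, 0.576]

# Shifting all scores by a constant leaves the weights unchanged.
shifted = softmax([101, 101, 102])
```

Only the differences between scores matter to softmax, which is why `[1, 1, 2]` and `[101, 101, 102]` give the same weights.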

Interpretation:

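* “sleeps” receives the highest weight from “cat”
* “The” and “cat” receive lower, equal weights

In this toy example the pattern is simply a consequence of the hand-picked matrices. In a trained model, it is the learned weights that would make a pairing such as **“cat sleeps”** score highly.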

---

### **Step 6 — Weighted sum of Value vectors**

Now multiply each V vector by its attention weight.

#### From “The”

```
0.212 * [1,0] = [0.212 , 0]
```

#### From “cat”

```
0.212 * [0,2] = [0 , 0.424]
```

#### From “sleeps”

```
0.576 * [1,2] = [0.576 , 1.152]
```

---

### **Step 7 — Add all weighted values**

```
[0.212 , 0] 
+ [0 , 0.424]
+ [0.576 , 1.152]
= [0.788 , 1.576]
```

#### **Final attention output for “cat”:**

```
[0.788 , 1.576]
```

This is the new, context-aware representation of the token “cat”.
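The whole walkthrough can also be run in matrix form for all three tokens at once. The sketch below uses plain Python lists and, to match the hand calculation, leaves out the √d scaling; the helper names are chosen here for illustration.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax_rows(S):
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

X = [[1, 0], [0, 1], [1, 1]]  # rows: The, cat, sleeps
WQ = [[1, 0], [0, 1]]
WK = [[1, 1], [0, 1]]
WV = [[1, 0], [0, 2]]

Q, K, V = matmul(X, WQ), matmul(X, WK), matmul(X, WV)
scores = matmul(Q, transpose(K))          # unscaled, as in the walkthrough
output = matmul(softmax_rows(scores), V)  # one output row per token

cat_output = [round(v, 3) for v in output[1]]
print(cat_output)  # [0.788, 1.576]
```

Row 1 of `output` reproduces the vector computed by hand for "cat"; rows 0 and 2 are the corresponding outputs for "The" and "sleeps".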

---

#### **Interpretation**

The model internally computed:

* “cat” relates strongly to “sleeps” → highest attention weight
* “cat” relates moderately to “The” and itself
* It created a new vector representing “cat in context”

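In a full Transformer, this vector then passes through the rest of the layer (output projection, residual connection, feed-forward network) before reaching the next layer.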

---

**Summary**

1. Tokens → embeddings
2. Multiply by WQ, WK, WV → get Q, K, V
3. Compute attention scores (dot product Q·Kᵀ)
4. Apply softmax → weights
5. Weighted sum of V → attention output

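This is the core computation behind attention in LLMs; real models add the √d scaling, multiple heads, masking, and far larger learned matrices.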