In [None]:
# @title
from IPython.display import display, HTML

display(HTML("""
<script>
const firstCell = document.querySelector('.cell.code_cell');
if (firstCell) {
  firstCell.querySelector('.input').style.pointerEvents = 'none';
  firstCell.querySelector('.input').style.opacity = '0.5';
}
</script>
"""))

html = """
<div style="display:flex; flex-direction:column; align-items:center; text-align:center; gap:12px; padding:8px;">
  <h1 style="margin:0;">👋 Welcome to <span style="color:#1E88E5;">Algopath Coding Academy</span>!</h1>

  <img src="https://raw.githubusercontent.com/sshariqali/mnist_pretrained_model/main/algopath_logo.jpg"
       alt="Algopath Coding Academy Logo"
       width="400"
       style="border-radius:15px; box-shadow:0 4px 12px rgba(0,0,0,0.2); max-width:100%; height:auto;" />

  <p style="font-size:16px; margin:0;">
    <em>Empowering young minds to think creatively, code intelligently, and build the future with AI.</em>
  </p>
</div>
"""

display(HTML(html))

## Day 7 - Part 3: The Attention Mechanism

---

### 🔗 **Continuing from Parts 1 & 2**

So far in Day 7, we've built a **Bigram Language Model**:

| Part | What We Did | Key Limitation |
|:----:|-------------|----------------|
| Part 1 | Built basic Bigram model | Only looks at 1 previous character |
| Part 2 | Added proper training loop | Still only looks at 1 previous character! |

**The Problem:** Our Bigram model predicts based on ONLY the last character. It has no way to use broader context!

```
Input: "The cat sat on the ___"
Bigram sees: "e" → makes prediction
We want:    "The cat sat on the" ‚Üí make prediction (use ALL context!)
```

---

### 🎯 **Agenda for this Notebook**

| Section | Topic | Description |
|:-------:|-------|-------------|
| 1 | **Transformers & Attention** | Why attention is revolutionary |
| 2 | **The Intuition** | Library analogy for Query-Key-Value |
| 3 | **Implementation** | Build attention step-by-step |

---

### 🎓 **Learning Objectives**

By the end of this notebook, you will:
- ✅ Understand WHY attention is needed
- ✅ Grasp the Query-Key-Value intuition
- ✅ Implement self-attention from scratch
- ✅ Understand causal masking (no peeking at the future!)

This is the **most important concept** in modern AI - let's master it! 🚀

---
## Section 1: Transformers & Attention

In 2017, researchers at Google published a groundbreaking paper titled **"Attention Is All You Need"**. They introduced the **Transformer architecture**, which fundamentally changed how we approach sequence processing tasks in AI.

| Innovation | Benefit | Impact |
|-----------|---------|--------|
| **Parallel Processing** | Process all words simultaneously | Much faster training than sequential RNNs on parallel hardware |
| **Attention Mechanism** | Direct connections between any words | Better understanding of context |
| **Scalability** | Works better with more data/parameters | Powers models from millions to trillions of parameters |
| **Transfer Learning** | Pre-train once, adapt to many tasks | Enables ChatGPT, GPT-4, BERT, and more |

The **attention mechanism** is what makes Transformers special. It allows the model to focus on relevant parts of the input when processing each word - just like how you naturally focus on important words when reading!

<div style="display: flex; justify-content: center; gap: 20px; align-items: center;">
  <div style="width: 40%; text-align: center;">
    <img src="https://substackcdn.com/image/fetch/$s_!jtT-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4d7dc94-6f18-4973-a501-de1d5b101c10_1903x856.png" width="100%"/>
    <p><i>Attention Mechanism</i></p>
  </div>
  
  <div style="width: 30%; text-align: center;">
    <img src="https://miro.medium.com/v2/resize:fit:1400/1*BHzGVskWGS_3jEcYYi6miQ.png" width="100%"/>
    <p><i>The Transformer Architecture</i></p>
  </div>
</div>

In this notebook, we'll learn **how attention works** - the fundamental mechanism behind modern language models. By the end, you'll understand the core operation that ChatGPT, GPT-4, and other AI systems use to process language!

Let's dive in! 🚀

---
## Section 2: The Intuition Behind Attention

Before diving into code, let's build some intuition with a library analogy.

**🏛️ The Library Analogy**

Imagine you're researching "climate change impacts" in a library:

**Your Query:** "How does climate change affect polar bears?"

**The Library Catalog (Keys):**
- Book 1: "Climate Change Overview" 🔑
- Book 2: "Polar Bear Biology" 🔑
- Book 3: "Arctic Ecosystems" 🔑
- Book 4: "17th Century Poetry" 🔑
- Book 5: "Ocean Acidification" 🔑

**What You Do:**

1. **Compare Query with Keys** (Matching step)
   - Your query ↔ "Climate Change Overview": High relevance! ✅
   - Your query ↔ "Polar Bear Biology": High relevance! ✅
   - Your query ↔ "Arctic Ecosystems": Medium relevance ✓
   - Your query ↔ "17th Century Poetry": No relevance ❌
   - Your query ↔ "Ocean Acidification": Low relevance

2. **Assign Attention Weights** (based on relevance)
   - Book 1: 0.35 (35% attention)
   - Book 2: 0.40 (40% attention) 🎯
   - Book 3: 0.20 (20% attention)
   - Book 4: 0.00 (0% attention)
   - Book 5: 0.05 (5% attention)

3. **Read Content (Values) Proportionally**
   - Spend 40% of your time on "Polar Bear Biology"
   - Spend 35% on "Climate Change Overview"
   - Spend 20% on "Arctic Ecosystems"
   - Skip "17th Century Poetry" entirely
   - Briefly skim "Ocean Acidification"

4. **Synthesize Information** (Weighted aggregation)
   - Your final understanding = 
     - 0.40 × (Polar Bear content) +
     - 0.35 × (Climate content) +
     - 0.20 × (Arctic content) +
     - 0.05 × (Ocean content)

**Mapping to Attention:**

| Library Concept | Attention Mechanism |
|----------------|--------------------|
| Your research question | **Query (Q)** |
| Book titles in catalog | **Keys (K)** |
| Book contents | **Values (V)** |
| Relevance matching | **Q·K (dot product)** |
| Time allocation | **Attention weights (α)** |
| Final understanding | **Output (weighted sum of V)** |

**The Formula Revealed:**

$$\text{Understanding} = \sum_{i} \alpha_i \cdot \text{Book}_i$$

$$\text{where } \alpha_i = \text{softmax}\left(\frac{\text{Query} \cdot \text{Key}_i}{\sqrt{d_k}}\right)$$

This is **exactly** how attention mechanisms work!
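
To make the weighted sum concrete, here is a minimal sketch that plugs the attention weights from the library example into the formula. The three-number "content" vectors for each book are made up purely for illustration; they are not part of the analogy itself.

```python
import torch

# Attention weights from the library example (Books 1-5); they sum to 1.
weights = torch.tensor([0.35, 0.40, 0.20, 0.00, 0.05])

# Hypothetical "content" vectors, one row per book (3 made-up features each).
book_values = torch.tensor([
    [1.0, 0.0, 0.0],   # Book 1: Climate Change Overview
    [0.0, 1.0, 0.0],   # Book 2: Polar Bear Biology
    [0.0, 0.0, 1.0],   # Book 3: Arctic Ecosystems
    [9.0, 9.0, 9.0],   # Book 4: 17th Century Poetry (weight 0, so it is ignored)
    [1.0, 1.0, 0.0],   # Book 5: Ocean Acidification
])

# Weighted sum: understanding = sum_i alpha_i * Book_i
understanding = weights @ book_values
print(understanding)  # tensor([0.4000, 0.4500, 0.2000])
```

Notice that the large numbers in the poetry book never reach the result, because its weight is 0.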

---
## 💻 Section 3: Implementation

Now let's implement attention step-by-step! We'll build up to the complete formula:

| | |
| :---: | :---: |
| $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ | <img src="https://velog.velcdn.com/images%2Fcha-suyeon%2Fpost%2Fba830026-6d8f-4e77-b288-f75dd3a51457%2Fimage.png" width="400" alt="Attention Formula Diagram"/> |

Let's start with some random input data:

In [2]:
import torch

### 📊 Creating Sample Input

We'll create random input data to work with:
- **B = 4**: Batch size (4 sequences)
- **T = 8**: Time/sequence length (8 tokens per sequence)
- **C = 65**: Channels/embedding dimension (same as our vocab size)

In [3]:
torch.manual_seed(1337)

B,T,C = 4,8,65 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 65])

---
### 🔒 The Look-Ahead Problem (Why We Need Masking)

**Critical Concept for Language Models!**

When training autoregressive models (like GPT), the model predicts the next token based ONLY on previous tokens:

```
Training sequence: "The cat sat on the mat"

Predicting position 3 ("sat"):
✅ Can see: "The", "cat"
❌ Cannot see: "sat", "on", "the", "mat" (these are in the future!)
```

**Why This Matters:**
- Without masking, during training the model would see the answer before predicting it (cheating! 🚫)
- At test time, it won't have access to future tokens
- This mismatch causes poor generalization

**The Solution - Causal Mask:**

A **lower triangular matrix** that only allows attention to previous positions:

$$\text{Mask} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1
\end{bmatrix}$$

Position 1 can only see position 1. Position 2 can see positions 1-2. And so on!

Let's create this mask using PyTorch's `tril` (lower triangular) function:

In [5]:
# Mask

mask = torch.tril(torch.ones(T,T))
mask

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])
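
As a quick sanity check, the sketch below rebuilds the mask at the smaller size used in the 4×4 matrix above and counts how many positions each row can see. The names `T_small`, `small_mask`, and `visible_counts` are introduced here only for illustration.

```python
import torch

T_small = 4  # small size so the result matches the 4x4 matrix shown earlier
small_mask = torch.tril(torch.ones(T_small, T_small))

# Row i has ones in columns 0..i, so position i can see i + 1 tokens
# (itself plus everything before it).
visible_counts = small_mask.sum(dim=1)

print(small_mask)
print(visible_counts)  # tensor([1., 2., 3., 4.])
```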

---
### 🔄 Version 1: Simple Averaging (Uniform Attention)

**Idea:** Each position averages ALL previous positions equally.

**How it works:**
1. Take the mask (1s and 0s)
2. Normalize each row so it sums to 1 (divide by row sum)
3. Multiply with input to get weighted average

**Example for position 3:**
```
mask row 3: [1, 1, 1, 0, 0, 0, 0, 0]
normalized: [0.33, 0.33, 0.33, 0, 0, 0, 0, 0]  (sums to 1!)
```

This means position 3's output is: (1/3 × token_1) + (1/3 × token_2) + (1/3 × token_3)

**Limitation:** Every previous token gets EQUAL weight. But shouldn't some tokens be more important than others? 🤔

In [6]:
# Version 1

attn_scores = mask / mask.sum(1, keepdim = True) # normalize the rows
out_1 = attn_scores @ x # (T,T) @ (B,T,C) ---> (B,T,T) @ (B,T,C) ---> (B,T,C)
out_1.shape

torch.Size([4, 8, 65])
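
One way to convince yourself that the matrix multiplication really computes a running average is to compare it against an explicit loop. This is a small self-contained sketch; `out_matmul` and `out_loop` are names chosen here for the comparison.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 65
x = torch.randn(B, T, C)

# Version 1: row-normalized lower-triangular mask
mask = torch.tril(torch.ones(T, T))
weights = mask / mask.sum(1, keepdim=True)
out_matmul = weights @ x  # (T, T) broadcasts over the batch -> (B, T, C)

# Reference: average the prefix x[:, :t+1] explicitly for every position t
out_loop = torch.zeros(B, T, C)
for t in range(T):
    out_loop[:, t] = x[:, : t + 1].mean(dim=1)

# Both approaches agree up to floating-point rounding
same = torch.allclose(out_matmul, out_loop, atol=1e-5)
print(same)
```

The loop is easier to read, but the matrix form does the same work in one batched operation, which is why the rest of the notebook builds on it.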

---
### 🎯 Version 2: Using Softmax (The Real Way!)

**Idea:** Use `softmax` for normalization, but first set future positions to `-inf`.

**Why `-inf`?**
- Softmax exponentiates every score, and `exp(-inf) = 0`, so a `-inf` score always gets weight 0
- So future positions contribute ZERO to the output
- All other positions share the remaining weight

In [7]:
# Version 2

attn_scores = torch.zeros(T,T)
attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))
attn_scores = torch.softmax(attn_scores, dim=-1)
out_2 = attn_scores @ x
out_2.shape

torch.Size([4, 8, 65])

**Why Version 2 is Better:**
- Version 1 only works with uniform weights
- Version 2 can have ANY starting weights (we'll learn these!)
- The `-inf` masking trick is standard in all Transformers

---
### 🔑 Adding Query, Key, Value: The Complete Attention!

Both versions above use **uniform weights** - every token gets equal attention. But we want **learned, content-dependent weights**!

**The Query-Key-Value Framework:**

Every token produces THREE vectors:
- **Query (Q)**: "What am I looking for?"
- **Key (K)**: "What do I contain?"
- **Value (V)**: "What information will I share?"

**How Attention Works:**
1. Each token's Query "asks a question"
2. All Keys "answer" how relevant they are (via dot product)
3. High Q·K score = high attention weight
4. Values are combined using these weights

**The Magic Formula:**
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The $\sqrt{d_k}$ scaling prevents dot products from getting too large (which would make softmax too "peaky").
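
To see why the scaling helps, here is a rough numerical sketch. It assumes query and key entries are independent with mean 0 and variance 1 (which is what `torch.randn` gives); under that assumption the raw dot product has variance close to $d_k$, and dividing by $\sqrt{d_k}$ brings it back to about 1. The value `d_k = 64` is an arbitrary choice for the demo.

```python
import torch

torch.manual_seed(0)
d_k = 64      # hypothetical head size for this demo
n = 10_000    # number of random query/key pairs

q = torch.randn(n, d_k)
k = torch.randn(n, d_k)

raw = (q * k).sum(dim=-1)      # unscaled dot products
scaled = raw / d_k ** 0.5      # scaled dot products

raw_var = raw.var().item()        # close to d_k
scaled_var = scaled.var().item()  # close to 1
print(raw_var, scaled_var)

# Effect on softmax: multiplying scores by sqrt(d_k) mimics skipping the
# scaling step, and it pushes more of the weight onto the largest entry.
scores = torch.randn(8)
peak_scaled = torch.softmax(scores, dim=-1).max().item()
peak_unscaled = torch.softmax(scores * d_k ** 0.5, dim=-1).max().item()
print(peak_scaled, peak_unscaled)
```

A very peaky softmax means each token listens to almost only one other token, and its gradients become tiny, which makes training harder.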

Let's create the Q, K, V projection layers. These are **learned** linear transformations:

In [13]:
head_size = 65
key = torch.nn.Linear(C, head_size, bias = False)
query = torch.nn.Linear(C, head_size, bias = False)
value = torch.nn.Linear(C, head_size, bias = False)

k = key(x)   # (B, T, head_size)
q = query(x) # (B, T, head_size)
v = value(x) # (B, T, head_size)

print(k.shape)
print(q.shape)
print(v.shape)

torch.Size([4, 8, 65])
torch.Size([4, 8, 65])
torch.Size([4, 8, 65])

Now let's compute **self-attention** step by step:

1. **Compute attention scores**: $QK^T$ (how much each token attends to others)
2. **Scale**: Divide by $\sqrt{d_k}$ to stabilize gradients
3. **Mask**: Set future positions to `-inf`
4. **Softmax**: Normalize to get attention weights (sum to 1)
5. **Apply to Values**: Weighted combination of V vectors

In [None]:
# Steps 1-2: scores and scaling
attn_scores = q @ k.transpose(-2,-1) / head_size**0.5  # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)

attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))  # Step 3: mask the future
attn_scores = torch.softmax(attn_scores, dim=-1)                 # Step 4: weights, each row sums to 1
out = attn_scores @ v                                            # Step 5: (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
out.shape
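
The five steps listed above can also be collected into one reusable function. This is only a sketch: the function name `causal_self_attention` and the choice `head_size = 16` are introduced here for illustration and are not part of the notebook's own cells.

```python
import torch
import torch.nn as nn

def causal_self_attention(x, query, key, value):
    """Single-head causal self-attention over x of shape (B, T, C)."""
    B, T, C = x.shape
    q, k, v = query(x), key(x), value(x)                   # each (B, T, head_size)
    head_size = k.shape[-1]

    scores = q @ k.transpose(-2, -1)                       # 1. QK^T -> (B, T, T)
    scores = scores / head_size ** 0.5                     # 2. scale by sqrt(d_k)
    mask = torch.tril(torch.ones(T, T))
    scores = scores.masked_fill(mask == 0, float('-inf'))  # 3. hide the future
    weights = torch.softmax(scores, dim=-1)                # 4. rows sum to 1
    return weights @ v, weights                            # 5. weighted sum of values

torch.manual_seed(1337)
B, T, C, head_size = 4, 8, 65, 16   # head_size=16 is an arbitrary choice here
x = torch.randn(B, T, C)
query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

out, weights = causal_self_attention(x, query, key, value)
print(out.shape)      # torch.Size([4, 8, 16])
print(weights.shape)  # torch.Size([4, 8, 8])
```

Note that the output's last dimension is `head_size`, not `C`: the head size does not have to match the embedding size.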

In [None]:
# Let's look at the attention weights for the first sequence
print("Attention weights for first sequence:")
print(attn_scores[0])

tensor([[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [7.1821e-01, 2.8179e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [6.2292e-01, 2.6785e-01, 1.0923e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [2.7394e-02, 3.5937e-02, 8.8566e-02, 8.4810e-01, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [9.2227e-01, 1.0381e-02, 4.4360e-02, 7.6919e-04, 2.2219e-02, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [7.4951e-02, 1.1837e-01, 2.4863e-01, 3.7787e-01, 6.3060e-03, 1.7387e-01,
         0.0000e+00, 0.0000e+00],
        [1.8730e-01, 5.8106e-02, 6.1382e-02, 3.6453e-03, 6.4791e-01, 1.8433e-02,
         2.3230e-02, 0.0000e+00],
        [4.1080e-01, 6.0570e-02, 2.1063e-02, 1.6063e-03, 1.6883e-01, 1.5380e-02,
         4.0297e-03, 3.1772e-01]], grad_fn=<SelectBackward0>)

---
## 📝 Summary: What We Learned

### 🎯 Key Concepts

| Concept | Description |
|---------|-------------|
| **Attention** | Mechanism allowing tokens to "communicate" and share information |
| **Query (Q)** | "What am I looking for?" - the question each token asks |
| **Key (K)** | "What do I contain?" - the relevance signal each token provides |
| **Value (V)** | "What information do I share?" - the content passed forward |
| **Causal Mask** | Prevents looking at future tokens (critical for language models!) |
| **Softmax** | Normalizes attention weights to sum to 1 |
| **Scaling (√d_k)** | Prevents dot products from getting too large |

### 🔄 The Attention Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

### 📊 Version Comparison

| Version | Description | Weights | Use Case |
|:-------:|-------------|---------|----------|
| **V1** | Simple averaging | Uniform (1/n) | Understanding the concept |
| **V2** | Softmax with masking | Uniform (but flexible) | Foundation for real attention |
| **Full** | Q-K-V attention | Learned, content-dependent | Real Transformers! |

### ➡️ What's Next?

In Day 8, we'll learn:
- 🧠 **Multi-Head Attention**: Run multiple attention "heads" in parallel
- 🏗️ **Transformer Blocks**: Combine attention with feed-forward layers
- 📈 **Scaling Up**: Build a real GPT-style language model!

**Congratulations!** You now understand the core mechanism behind ChatGPT, GPT-4, Claude, and all modern language models! 🎉

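### 🔍 Visualizing Attention Weights

Look back at the attention weights printed for the first sequence. Notice:
- Each row sums to 1 (softmax!)
- Upper triangle is 0 (causal mask!)
- Values vary from row to row because they depend on content through Q and K (randomly initialized here; training would learn them)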
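
These properties can also be checked in code rather than by eye. The sketch below recomputes attention with freshly initialized (untrained) projection layers, so the exact numbers will not match the printout above. The helper `attend` is a hypothetical name used only here. It also adds a causality test: changing the last token should leave every earlier output untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(1337)
B, T, C, head_size = 4, 8, 65, 65
x = torch.randn(B, T, C)
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
mask = torch.tril(torch.ones(T, T))

def attend(inp):
    """Masked, scaled dot-product attention; returns (output, weights)."""
    q, k, v = query(inp), key(inp), value(inp)
    scores = q @ k.transpose(-2, -1) / head_size ** 0.5
    scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

with torch.no_grad():
    out, weights = attend(x)

    # 1. Each row sums to 1
    rows_ok = torch.allclose(weights.sum(dim=-1), torch.ones(B, T), atol=1e-6)

    # 2. Everything above the diagonal is exactly 0
    upper_ok = bool((weights * (1 - mask) == 0).all())

    # 3. Causality: replace the last token and compare the earlier outputs
    x_changed = x.clone()
    x_changed[:, -1] = torch.randn(B, C)
    out_changed, _ = attend(x_changed)
    causal_ok = torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-6)

print(rows_ok, upper_ok, causal_ok)
```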