In [None]:
# @title
from IPython.display import display, HTML

display(HTML("""
<script>
const firstCell = document.querySelector('.cell.code_cell');
if (firstCell) {
  firstCell.querySelector('.input').style.pointerEvents = 'none';
  firstCell.querySelector('.input').style.opacity = '0.5';
}
</script>
"""))

html = """
<div style="display:flex; flex-direction:column; align-items:center; text-align:center; gap:12px; padding:8px;">
  <h1 style="margin:0;">👋 Welcome to <span style="color:#1E88E5;">Algopath Coding Academy</span>!</h1>

  <img src="https://raw.githubusercontent.com/sshariqali/mnist_pretrained_model/main/algopath_logo.jpg"
       alt="Algopath Coding Academy Logo"
       width="400"
       style="border-radius:15px; box-shadow:0 4px 12px rgba(0,0,0,0.2); max-width:100%; height:auto;" />

  <p style="font-size:16px; margin:0;">
    <em>Empowering young minds to think creatively, code intelligently, and build the future with AI.</em>
  </p>
</div>
"""

display(HTML(html))

### **1. Transformers and Attention**

In 2017, researchers at Google published a groundbreaking paper titled **"Attention Is All You Need"**. They introduced the **Transformer architecture**, which fundamentally changed how we approach sequence processing tasks in AI.

| Innovation | Benefit | Impact |
|-----------|---------|--------|
| **Parallel Processing** | Process all words simultaneously | Much faster training than sequential RNNs on parallel hardware |
| **Attention Mechanism** | Direct connections between any words | Better understanding of context |
| **Scalability** | Works better with more data/parameters | Powers models from millions to trillions of parameters |
| **Transfer Learning** | Pre-train once, adapt to many tasks | Enables ChatGPT, GPT-4, BERT, and more |

The **attention mechanism** is what makes Transformers special. It allows the model to focus on relevant parts of the input when processing each word - just like how you naturally focus on important words when reading!

<div style="display: flex; justify-content: center; gap: 20px; align-items: center;">
  <div style="width: 40%; text-align: center;">
    <img src="https://substackcdn.com/image/fetch/$s_!jtT-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4d7dc94-6f18-4973-a501-de1d5b101c10_1903x856.png" width="100%"/>
    <p><i>Attention Mechanism</i></p>
  </div>
  
  <div style="width: 30%; text-align: center;">
    <img src="https://miro.medium.com/v2/resize:fit:1400/1*BHzGVskWGS_3jEcYYi6miQ.png" width="100%"/>
    <p><i>The Transformer Architecture</i></p>
  </div>
</div>

In this notebook, we'll learn **how attention works** - the fundamental mechanism behind all modern language models. By the end, you'll understand the core computation that ChatGPT, GPT-4, and other AI systems use to process language!

Let's dive in! 🚀

### **2. The Intuition behind Attention**

Let's solidify your understanding with a library analogy:

**🏛️ The Library Analogy**

Imagine you're researching "climate change impacts" in a library:

**Your Query:** "How does climate change affect polar bears?"

**The Library Catalog (Keys):**
- Book 1: "Climate Change Overview" 🔑
- Book 2: "Polar Bear Biology" 🔑
- Book 3: "Arctic Ecosystems" 🔑
- Book 4: "17th Century Poetry" 🔑
- Book 5: "Ocean Acidification" 🔑

**What You Do:**

1. **Compare Query with Keys** (Matching step)
   - Your query ↔ "Climate Change Overview": High relevance! ✅
   - Your query ↔ "Polar Bear Biology": High relevance! ✅
   - Your query ↔ "Arctic Ecosystems": Medium relevance ✓
   - Your query ↔ "17th Century Poetry": No relevance ❌
   - Your query ↔ "Ocean Acidification": Low relevance

2. **Assign Attention Weights** (based on relevance)
   - Book 1: 0.35 (35% attention)
   - Book 2: 0.40 (40% attention) 🎯
   - Book 3: 0.20 (20% attention)
   - Book 4: 0.00 (0% attention)
   - Book 5: 0.05 (5% attention)

3. **Read Content (Values) Proportionally**
   - Spend 40% of your time on "Polar Bear Biology"
   - Spend 35% on "Climate Change Overview"
   - Spend 20% on "Arctic Ecosystems"
   - Skip "17th Century Poetry" entirely
   - Briefly skim "Ocean Acidification"

4. **Synthesize Information** (Weighted aggregation)
   - Your final understanding = 
     - 0.40 × (Polar Bear content) +
     - 0.35 × (Climate content) +
     - 0.20 × (Arctic content) +
     - 0.05 × (Ocean content)

**Mapping to Attention:**

| Library Concept | Attention Mechanism |
|----------------|--------------------|
| Your research question | **Query (Q)** |
| Book titles in catalog | **Keys (K)** |
| Book contents | **Values (V)** |
| Relevance matching | **Q·K (dot product)** |
| Time allocation | **Attention weights (α)** |
| Final understanding | **Output (weighted sum of V)** |

**The Formula Revealed:**

$$\text{Understanding} = \sum_{i} \alpha_i \cdot \text{Book}_i$$

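$$\text{where } \alpha_i = \text{softmax}_i\left(\frac{\text{Query} \cdot \text{Key}_i}{\sqrt{d}}\right) = \frac{\exp\left(\text{Query} \cdot \text{Key}_i / \sqrt{d}\right)}{\sum_j \exp\left(\text{Query} \cdot \text{Key}_j / \sqrt{d}\right)}$$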

This is **exactly** how attention mechanisms work!
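
The library steps above can be sketched in a few lines of plain Python. The relevance scores and the two-number "book contents" below are made-up values chosen only for illustration (they do not reproduce the exact 0.35 / 0.40 / 0.20 weights from the analogy), and the names `library_softmax`, `relevance`, and `contents` are hypothetical.

```python
import math

def library_softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

books = ["Climate Change Overview", "Polar Bear Biology", "Arctic Ecosystems",
         "17th Century Poetry", "Ocean Acidification"]

# Hypothetical values of (query . key) / sqrt(d) for each book.
relevance = [2.0, 2.2, 1.5, -3.0, 0.1]

# Hypothetical 2-dimensional "contents" (values) for each book.
contents = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [9.0, 9.0], [0.2, 0.1]]

weights = library_softmax(relevance)

# Weighted sum of the contents: the "final understanding".
understanding = [sum(w * c[j] for w, c in zip(weights, contents)) for j in range(2)]

for title, w in zip(books, weights):
    print(f"{title:26s} {w:.2f}")
print("understanding:", [round(u, 3) for u in understanding])
```

Note that softmax never assigns exactly zero weight, so the poetry book gets a very small weight rather than the 0.00 used in the analogy.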

### **3. Implementation**

**The Complete Formula:**

| | |
| :---: | :---: |
| $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ | <img src="https://velog.velcdn.com/images%2Fcha-suyeon%2Fpost%2Fba830026-6d8f-4e77-b288-f75dd3a51457%2Fimage.png" width="400" alt="Attention Formula Diagram"/> |
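
The formula divides the scores by $\sqrt{d_k}$. A common explanation, not spelled out in this notebook, is that the dot product of two random vectors with unit-variance entries has variance of about $d_k$, so unscaled scores grow with the head size and push softmax toward one-hot outputs. The sketch below checks that numerically in plain Python; `dot_variance`, the seed, and the trial count are illustrative choices.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def dot_variance(d_k, trials=2000):
    """Empirical variance of q.k and of q.k / sqrt(d_k) for random unit-variance vectors."""
    raw = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    scaled = [r / d_k ** 0.5 for r in raw]
    return statistics.pvariance(raw), statistics.pvariance(scaled)

for d_k in (4, 64):
    raw_var, scaled_var = dot_variance(d_k)
    print(f"d_k={d_k:3d}  var(q.k)={raw_var:6.1f}  var(q.k/sqrt(d_k))={scaled_var:.2f}")
```

The unscaled variance should come out close to `d_k`, while the scaled variance stays near 1 regardless of the head size.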


In [2]:
import torch

In [3]:
torch.manual_seed(1337)

B,T,C = 4,8,65 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 65])

**The Look-Ahead Problem:**

When training autoregressive models (like GPT), the model predicts the next token based ONLY on previous tokens:

```
Training sequence: "The cat sat on the mat"

Predicting position 3 ("sat"):
✅ Can see: "The", "cat"
❌ Cannot see: "sat", "on", "the", "mat" (these are in the future!)
```

**Why This Matters:**

Without masking, during training:
- The model would see the answer before predicting it (cheating! 🚫)
- At test time, it won't have access to future tokens
- This mismatch causes poor generalization

**The Causal Mask:**

A causal mask is a **lower triangular matrix** that only allows each position to attend to itself and to earlier positions:

$$\text{Mask} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1
\end{bmatrix}$$
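
As a small illustration, the mask can be built for the example sentence in plain Python to show which tokens each row may attend to. The names `tokens`, `causal`, and `visible_per_row` are illustrative and separate from the notebook's own variables. Row `i` sees tokens `0..i`, and its output is what would be used to predict token `i + 1`.

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]
n_tokens = len(tokens)

# Lower-triangular mask: row i has 1s in columns 0..i and 0s afterwards.
causal = [[1 if j <= i else 0 for j in range(n_tokens)] for i in range(n_tokens)]

# For each row, list the tokens that are allowed through the mask.
visible_per_row = [
    [tok for tok, allowed in zip(tokens, row) if allowed == 1]
    for row in causal
]

for row, visible in zip(causal, visible_per_row):
    print(row, "->", visible)
```

The second row, for example, sees only "The" and "cat", which matches the "Predicting position 3" case above.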

In [5]:
# Mask

mask = torch.tril(torch.ones(T,T))
mask

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

In [6]:
# Version 1

attn_scores = mask / mask.sum(1, keepdim = True) # normalize the rows
out_1 = attn_scores @ x # (T,T) @ (B,T,C) ---> (B,T,T) @ (B,T,C) ---> (B,T,C)
out_1.shape

torch.Size([4, 8, 65])

In [7]:
# Version 2

attn_scores = torch.zeros(T,T)
attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))
attn_scores = torch.softmax(attn_scores, dim=-1)
out_2 = attn_scores @ x
out_2.shape

torch.Size([4, 8, 65])
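
Version 1 and Version 2 should produce the same weights: a softmax over a row of zeros with the future positions set to `-inf` is a uniform average over the visible positions. The sketch below checks this in plain Python on a 4×4 mask so the arithmetic stays visible; the names `tri`, `avg_weights`, and `softmax_weights` are illustrative. In the notebook itself, `torch.allclose(out_1, out_2)` should make the same check on the tensors.

```python
import math

n = 4
tri = [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]

# Version 1: divide each mask row by its sum.
avg_weights = [[m / sum(row) for m in row] for row in tri]

def row_softmax(row):
    """Softmax over one row; math.exp(-inf) evaluates to 0.0."""
    exps = [math.exp(s) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Version 2: zero scores, masked positions set to -inf, then softmax.
scores = [[0.0 if m == 1.0 else float("-inf") for m in row] for row in tri]
softmax_weights = [row_softmax(row) for row in scores]

same = all(
    math.isclose(a, b)
    for r1, r2 in zip(avg_weights, softmax_weights)
    for a, b in zip(r1, r2)
)
print(same)
```

The softmax form matters because the zero scores can later be replaced by data-dependent scores without changing the rest of the code.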

- Every token at every position now emits three vectors: a Query, a Key, and a Value.
- The Query asks "What am I looking for?"
- The Key answers "What do I contain?"
- The Value says "What will I communicate?"
- The dot product of each Query with every Key gives the attention scores: how much affinity each token has for every other token.
- Finally, we apply the causal mask and softmax to the scores, then matrix-multiply the resulting weights with the Values to get the output.
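
The steps in this list can be traced by hand on a toy example. The sketch below uses made-up query, key, and value vectors for three tokens (in the notebook these come from learned `Linear` layers instead), and the function name `toy_causal_attention` is hypothetical.

```python
import math

# Hypothetical toy vectors for 3 tokens with head size 2.
toy_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_v = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

def toy_causal_attention(queries, keys, vals):
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Scaled scores against keys at positions 0..i only (causal masking).
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        weights = [e / sum(exps) for e in exps]
        # Weighted sum of the visible value vectors.
        out = [sum(w * vals[j][c] for j, w in enumerate(weights))
               for c in range(len(vals[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

toy_out, toy_weights = toy_causal_attention(toy_q, toy_k, toy_v)
for w, o in zip(toy_weights, toy_out):
    print([round(x, 3) for x in w], "->", [round(x, 3) for x in o])
```

The first token can only attend to itself, so its output is exactly its own value vector.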

In [13]:
head_size = 65
key = torch.nn.Linear(C, head_size, bias = False)
query = torch.nn.Linear(C, head_size, bias = False)
value = torch.nn.Linear(C, head_size, bias = False)

k = key(x)
q = query(x)
v = value(x)

print(k.shape)
print(q.shape)
print(v.shape)

torch.Size([4, 8, 65])
torch.Size([4, 8, 65])
torch.Size([4, 8, 65])


In [None]:
attn_scores = q @ k.transpose(-2,-1) / head_size**0.5  # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)

attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))
attn_scores = torch.softmax(attn_scores, dim=-1)
out = attn_scores @ v
out.shape

In [11]:
attn_scores[0]  # attention weights for the first sequence in the batch

tensor([[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [7.1821e-01, 2.8179e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [6.2292e-01, 2.6785e-01, 1.0923e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [2.7394e-02, 3.5937e-02, 8.8566e-02, 8.4810e-01, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [9.2227e-01, 1.0381e-02, 4.4360e-02, 7.6919e-04, 2.2219e-02, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [7.4951e-02, 1.1837e-01, 2.4863e-01, 3.7787e-01, 6.3060e-03, 1.7387e-01,
         0.0000e+00, 0.0000e+00],
        [1.8730e-01, 5.8106e-02, 6.1382e-02, 3.6453e-03, 6.4791e-01, 1.8433e-02,
         2.3230e-02, 0.0000e+00],
        [4.1080e-01, 6.0570e-02, 2.1063e-02, 1.6063e-03, 1.6883e-01, 1.5380e-02,
         4.0297e-03, 3.1772e-01]], grad_fn=<SelectBackward0>)
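
Two properties are worth checking in the printed matrix: every row sums to 1 (it is a softmax output), and every entry above the diagonal is 0 (the causal mask). As a quick sanity check, the sketch below copies the first four rows from the output above into plain Python lists; the name `printed_rows` is illustrative.

```python
# First four rows of the printed weights, copied from the output above.
printed_rows = [
    [1.0000e+00, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [7.1821e-01, 2.8179e-01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [6.2292e-01, 2.6785e-01, 1.0923e-01, 0.0, 0.0, 0.0, 0.0, 0.0],
    [2.7394e-02, 3.5937e-02, 8.8566e-02, 8.4810e-01, 0.0, 0.0, 0.0, 0.0],
]

# Each row should sum to 1, up to the rounding in the printed values.
row_sums = [sum(row) for row in printed_rows]

# Everything to the right of the diagonal should be exactly zero.
future_is_zero = all(
    row[j] == 0.0
    for i, row in enumerate(printed_rows)
    for j in range(i + 1, len(row))
)

print([round(s, 4) for s in row_sums])
print(future_is_zero)
```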