# Phase 2, Lesson 3: Attention Mechanism

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/suraaj3poudel/Learn-To-Make-GPT-Model/blob/main/phase2_text_and_embeddings/03_attention_mechanism.ipynb)

The breakthrough that powers GPT! 🧠

## What You'll Learn

1. Why attention is needed
2. How attention works
3. Self-attention mechanism
4. Building attention from scratch

This is THE key to modern AI!

In [None]:
# Setup
import numpy as np
import matplotlib.pyplot as plt

print('✅ Ready to learn attention!')

## 1. The Problem with Averaging

In Lesson 2, we **averaged** all word vectors:

```python
sentence = "I love this amazing product"
avg = mean([vec_I, vec_love, vec_this, vec_amazing, vec_product])
```

**Problem**: All words get equal weight!

But some words are more important:

- "I **love** this **amazing** product"
- The words "love" and "amazing" matter most for sentiment!

We need a way to **focus** on important words. That's **attention**!
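To make the "equal weight" problem concrete, the sketch below checks that a plain average is the same thing as a weighted sum in which every word gets weight 1/n. The random vectors are placeholders standing in for real embeddings (an assumption for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = ["I", "love", "this", "amazing", "product"]

# Placeholder embeddings: random numbers, not real word vectors
word_vecs = rng.standard_normal((len(sentence), 4))

# Plain average over the word axis
avg = word_vecs.mean(axis=0)

# The same result written as a weighted sum with uniform weights
uniform_weights = np.full(len(sentence), 1 / len(sentence))
weighted = uniform_weights @ word_vecs

print("Uniform weights:", uniform_weights)
print("Average equals uniform weighted sum:", np.allclose(avg, weighted))
```

Seen this way, attention only changes one thing: the weights stop being uniform.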

## 2. Attention: The Core Idea

**Attention** = Learn which words to focus on

Instead of:

```
output = mean(all_words)
```

We want:

```
output = 0.05 * word1 + 0.10 * word2 + 0.60 * word3 + ...
```

Where the weights (0.05, 0.10, 0.60, ...) are **learned**!

In [None]:
# Simple example: Manual attention weights
sentence = ["I", "love", "this", "amazing", "product"]
word_vecs = np.random.randn(5, 10)  # 5 words, 10-dim vectors

# Manual attention weights (what we want to learn)
attention_weights = np.array([0.05, 0.35, 0.10, 0.40, 0.10])

# Weighted sum (instead of average)
attended_output = np.sum(attention_weights[:, None] * word_vecs, axis=0)

print("Sentence:", " ".join(sentence))
print("\nAttention weights:")
for word, weight in zip(sentence, attention_weights):
    print(f"  {word}: {weight:.2f} {'🔥' if weight > 0.3 else ''}")

print(f"\nAttended output shape: {attended_output.shape}")
print("The model is 'paying attention' to 'love' and 'amazing'!")

## 3. How to Compute Attention Weights?

**Key insight**: Use the vectors themselves to decide importance!

**Formula** (simplified):

1. **Score**: How relevant is each word to what we're looking for?
2. **Softmax**: Convert scores to weights that sum to 1
3. **Weighted sum**: Combine vectors using weights

Let's build it!

In [None]:
def simple_attention(word_vectors, query_vector):
    """
    Compute attention weights and output

    Args:
        word_vectors: (num_words, embedding_dim) - the words to attend to
        query_vector: (embedding_dim,) - what we're looking for
    """
    # Step 1: Compute scores (dot product)
    scores = np.dot(word_vectors, query_vector)

    # Step 2: Softmax to get weights
    exp_scores = np.exp(scores - np.max(scores))  # Numerical stability
    weights = exp_scores / exp_scores.sum()

    # Step 3: Weighted sum
    output = np.sum(weights[:, None] * word_vectors, axis=0)

    return output, weights

# Example
words = np.random.randn(5, 10)  # 5 words, 10 dimensions
query = np.random.randn(10)      # What we're looking for

output, weights = simple_attention(words, query)

print("Attention weights:", weights)
print(f"Sum of weights: {weights.sum():.4f} (should be 1.0)")
print(f"\nOutput shape: {output.shape}")
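With random vectors it is hard to check the three steps by eye, so here is a small worked example with hand-picked numbers. The vectors are chosen (an assumption for illustration) so that the scores come out as 2, 1 and 0. It also shows why subtracting the maximum score is safe: shifting every score by the same constant leaves the softmax unchanged, while keeping `np.exp` from overflowing on large scores.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)  # shifting by a constant does not change the result
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hand-picked vectors so the dot products are easy to verify
word_vectors = np.array([[2.0, 0.0],
                         [1.0, 0.0],
                         [0.0, 1.0]])
query_vector = np.array([1.0, 0.0])

# Step 1: scores (dot product of each word with the query) -> [2., 1., 0.]
scores = word_vectors @ query_vector

# Step 2: softmax -> approximately [0.665, 0.245, 0.090]
weights = softmax(scores)

# Step 3: weighted sum of the word vectors
output = weights @ word_vectors

# Scores this large would overflow np.exp without the max-shift
big_weights = softmax(scores + 1000.0)

print("Scores: ", scores)
print("Weights:", np.round(weights, 3))
print("Output: ", np.round(output, 3))
print("Shift-invariant:", np.allclose(weights, big_weights))
```

The word whose vector points most in the query's direction gets roughly two thirds of the weight, and the word orthogonal to the query still keeps a small share: softmax never assigns exactly zero.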

## 4. Self-Attention: The Magic Ingredient

**Self-attention** = Each word attends to every word in the sentence (including itself)!

For the sentence "I love programming":

- "I" attends to ["I", "love", "programming"]
- "love" attends to ["I", "love", "programming"]
- "programming" attends to ["I", "love", "programming"]

Each word gets its own unique representation based on context!

In [None]:
def self_attention(word_vectors):
    """
    Self-attention: each word attends to all words

    Args:
        word_vectors: (num_words, embedding_dim)

    Returns:
        outputs: (num_words, embedding_dim) - attended representations
        all_weights: (num_words, num_words) - attention matrix
    """
    num_words = word_vectors.shape[0]
    outputs = []
    all_weights = []

    # For each word as query
    for i in range(num_words):
        query = word_vectors[i]

        # Attend to all words (including itself)
        output, weights = simple_attention(word_vectors, query)
        outputs.append(output)
        all_weights.append(weights)

    return np.array(outputs), np.array(all_weights)

# Example
sentence_vecs = np.random.randn(4, 8)  # 4 words, 8 dimensions
attended_vecs, attention_matrix = self_attention(sentence_vecs)

print(f"Input shape: {sentence_vecs.shape}")
print(f"Output shape: {attended_vecs.shape}")
print(f"Attention matrix shape: {attention_matrix.shape}")
print(f"\nAttention matrix:\n{attention_matrix}")
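The per-word loop can also be written as a couple of matrix operations, which is closer to how attention is usually computed in practice. The sketch below is one possible formulation, not part of the original lesson code: `self_attention_matrix` and `softmax_rows` are hypothetical helper names, and the loop version is re-implemented inline so the block runs on its own.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise, numerically stable softmax for a 2-D score matrix."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def self_attention_matrix(word_vectors):
    """Unscaled self-attention in matrix form (hypothetical helper)."""
    scores = word_vectors @ word_vectors.T  # (num_words, num_words) dot products
    weights = softmax_rows(scores)          # each row sums to 1
    outputs = weights @ word_vectors        # (num_words, embedding_dim)
    return outputs, weights

def self_attention_loop(word_vectors):
    """Reference version: one query at a time, mirroring the loop above."""
    outputs, all_weights = [], []
    for query in word_vectors:
        scores = word_vectors @ query
        exp_scores = np.exp(scores - scores.max())
        weights = exp_scores / exp_scores.sum()
        outputs.append(weights @ word_vectors)
        all_weights.append(weights)
    return np.array(outputs), np.array(all_weights)

rng = np.random.default_rng(0)
vecs = rng.standard_normal((4, 8))  # 4 words, 8 dimensions

out_matrix, w_matrix = self_attention_matrix(vecs)
out_loop, w_loop = self_attention_loop(vecs)

print("Outputs match:", np.allclose(out_matrix, out_loop))
print("Weights match:", np.allclose(w_matrix, w_loop))
```

Row `i` of the score matrix holds exactly the scores the loop computes when word `i` is the query, so both versions produce the same attention matrix.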

## 5. Visualizing Attention

Let's see which words attend to which!

In [None]:
# Create a simple sentence
sentence = ["I", "love", "machine", "learning"]
num_words = len(sentence)

# Create word vectors (simplified)
np.random.seed(42)
vecs = np.random.randn(num_words, 10)

# Compute self-attention
_, attention_matrix = self_attention(vecs)

# Visualize
plt.figure(figsize=(8, 6))
plt.imshow(attention_matrix, cmap='Blues')
plt.colorbar(label='Attention Weight')
plt.xticks(range(num_words), sentence)
plt.yticks(range(num_words), sentence)
plt.xlabel('Attending to...')
plt.ylabel('Word')
plt.title('Self-Attention Matrix')

# Add values
for i in range(num_words):
    for j in range(num_words):
        plt.text(j, i, f'{attention_matrix[i, j]:.2f}',
                 ha="center", va="center", color="black", fontsize=10)

plt.tight_layout()
plt.show()

print("Each row shows where that word 'pays attention'")
print("Darker = more attention")

## 6. Scaled Dot-Product Attention (Real Version)

The real attention mechanism used in Transformers!

**Formula**:

```
Attention(Q, K, V) = softmax(QK^T / √d_k) V
```

Where:

- **Q** (Query): What am I looking for?
- **K** (Key): What do I contain?
- **V** (Value): What do I actually output?
- **d_k**: Dimension of the query and key vectors (used for scaling)

Let's implement it!

In [None]:
def scaled_dot_product_attention(Q, K, V):
    """
    Scaled dot-product attention (used in Transformers!)

    Args:
        Q: (num_queries, d_k) - Queries
        K: (num_keys, d_k) - Keys
        V: (num_values, d_v) - Values (num_values must equal num_keys)

    Returns:
        output: (num_queries, d_v)
        attention_weights: (num_queries, num_keys)
    """
    d_k = Q.shape[-1]

    # Compute attention scores
    scores = np.dot(Q, K.T) / np.sqrt(d_k)  # Scaling is important!

    # Softmax to get weights (row-wise, shifted by the row max for stability)
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

    # Apply attention to values
    output = np.dot(attention_weights, V)

    return output, attention_weights

# Example: 4 words, 8-dimensional embeddings
num_words = 4
d_model = 8

np.random.seed(42)
embeddings = np.random.randn(num_words, d_model)

# Simplified self-attention: Q = K = V = embeddings
# (real Transformers first apply learned linear projections to get Q, K and V)
output, weights = scaled_dot_product_attention(embeddings, embeddings, embeddings)

print(f"Input shape: {embeddings.shape}")
print(f"Output shape: {output.shape}")
print(f"\nAttention weights shape: {weights.shape}")
print(f"\nAttention weights:\n{weights}")
print(f"\nEach row sums to: {weights.sum(axis=1)}")
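In a Transformer, Q, K and V are usually not the raw embeddings. Each one comes from multiplying the embeddings by its own learned weight matrix. The sketch below illustrates the shapes involved. `W_q`, `W_k` and `W_v` are hypothetical names, and they are filled with random numbers here because nothing is being trained (an assumption for illustration).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Same formula as above: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(42)
num_words, d_model = 4, 8
d_k, d_v = 6, 5  # query/key and value sizes may differ from d_model

embeddings = rng.standard_normal((num_words, d_model))

# Random stand-ins for learned projection matrices (no training happens here)
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

# Project the same embeddings into three different roles
Q = embeddings @ W_q  # (num_words, d_k)
K = embeddings @ W_k  # (num_words, d_k)
V = embeddings @ W_v  # (num_words, d_v)

output, weights = scaled_dot_product_attention(Q, K, V)

print(f"Q: {Q.shape}, K: {K.shape}, V: {V.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
```

Separate projections let a word ask for one thing as a query while offering something different as a key or value, so the attention matrix no longer has to be built from symmetric dot products of a vector with itself.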

## 7. Why Scaling Matters

The division by √d_k is crucial!

**Without scaling**: Dot products grow with the dimension d_k → softmax saturates → gradients vanish

Let's see the difference:

In [None]:
# Compare with and without scaling
d_k = 64  # Typical dimension
Q = np.random.randn(3, d_k)
K = np.random.randn(3, d_k)

# Without scaling
scores_no_scale = np.dot(Q, K.T)
print("Without scaling:")
print(f"  Scores: {scores_no_scale[0]}")
print(f"  Range: [{scores_no_scale.min():.2f}, {scores_no_scale.max():.2f}]")

# With scaling
scores_scaled = np.dot(Q, K.T) / np.sqrt(d_k)
print("\nWith scaling:")
print(f"  Scores: {scores_scaled[0]}")
print(f"  Range: [{scores_scaled.min():.2f}, {scores_scaled.max():.2f}]")

print("\n✅ Scaling keeps values in a reasonable range!")
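The score ranges above hint at the problem, but the effect on softmax is what matters. For independent vectors whose components have mean 0 and variance 1, the dot product has variance d_k, so dividing by √d_k brings the variance back to about 1. The sketch below checks this empirically and then compares how "peaked" the softmax is with and without scaling. The sample sizes and the `softmax_rows` helper are illustrative choices, not from the lesson.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise, numerically stable softmax."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 64

# 1) Spread of dot products between independent standard-normal vectors
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))
dots = np.sum(q * k, axis=1)

var_unscaled = dots.var()
var_scaled = (dots / np.sqrt(d_k)).var()
print(f"Variance without scaling: {var_unscaled:.1f} (theory: d_k = {d_k})")
print(f"Variance with scaling:    {var_scaled:.2f} (theory: 1)")

# 2) Effect on softmax: average of the largest weight in each row
Q = rng.standard_normal((200, d_k))  # 200 queries
K = rng.standard_normal((10, d_k))   # 10 keys
raw_scores = Q @ K.T

peak_unscaled = softmax_rows(raw_scores).max(axis=1).mean()
peak_scaled = softmax_rows(raw_scores / np.sqrt(d_k)).max(axis=1).mean()
print(f"Mean largest weight without scaling: {peak_unscaled:.2f}")
print(f"Mean largest weight with scaling:    {peak_scaled:.2f}")
```

When the largest weight in a row is close to 1, the softmax behaves almost like a hard "pick one word" choice. Its gradient with respect to the scores is then close to zero, which is the saturation problem the scaling is meant to avoid.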

## 8. Multi-Head Attention (Preview)

**Multi-head attention** = Run multiple attention mechanisms in parallel!

Why?

- Different heads can focus on different aspects
- Head 1 might focus on syntax
- Head 2 might focus on semantics
- Head 3 might focus on long-range dependencies

We'll implement this fully in Phase 3!

In [None]:
# Conceptual example
num_heads = 3
d_model = 12
d_k = d_model // num_heads  # Each head gets smaller dimension

print(f"Model dimension: {d_model}")
print(f"Number of heads: {num_heads}")
print(f"Dimension per head: {d_k}")
print("\nEach head learns to attend to different patterns!")
print("Then we concatenate all heads and project back to d_model.")
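As a rough preview of the "split, attend, concatenate, project" flow, here is a minimal sketch. It makes two simplifying assumptions: the heads are formed by slicing the feature axis directly rather than through the learned per-head projections a real Transformer would use, and `W_o` (a hypothetical name for the output projection) is random rather than trained. It is meant to show the shapes, not a full implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with a row-wise stable softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    return weights @ V, weights

num_words, d_model, num_heads = 4, 12, 3
d_k = d_model // num_heads  # 4 dimensions per head

rng = np.random.default_rng(0)
x = rng.standard_normal((num_words, d_model))

# Split the feature axis into heads:
# (num_words, d_model) -> (num_heads, num_words, d_k)
heads = x.reshape(num_words, num_heads, d_k).transpose(1, 0, 2)

# Run attention independently inside each head
head_outputs = []
for head in heads:
    out, _ = scaled_dot_product_attention(head, head, head)
    head_outputs.append(out)  # each is (num_words, d_k)

# Concatenate heads back along the feature axis: (num_words, d_model)
concat = np.concatenate(head_outputs, axis=-1)

# Random stand-in for the learned output projection
W_o = rng.standard_normal((d_model, d_model))
output = concat @ W_o

print(f"Per-head input shape: {heads.shape}")
print(f"Concatenated shape:   {concat.shape}")
print(f"Final output shape:   {output.shape}")
```

Because each head only sees its own slice of the features, each one produces its own attention pattern, and the output projection mixes the heads' results back together.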

## Summary

### What We Learned:

1. **Attention** = Learn what to focus on
2. **Self-attention** = Each word attends to all words
3. **Scaled dot-product** = The real attention formula
4. **Q, K, V** = Query, Key, Value matrices
5. **Scaling** = Keeps gradients stable

### Key Insights:

- Attention replaces fixed averaging
- It's **learned** - the model decides what's important
- Powers all modern language models (GPT, BERT, etc.)

### Next Steps:

👉 **Phase 3**: Build a complete **Transformer** using attention!

You now understand the core of modern AI! 🚀