### **Overview: What We'll Build**

In this notebook, we'll implement the complete Encoder from scratch. We'll build it step by step:

| Step | Component |
|------|-----------|
| **1** | Positional Encoding |
| **2** | Position-wise Feed-Forward Network |
| **3** | Complete Encoder Layer |
| **4** | Encoder Stack + Testing |

Let's start with **Positional Encoding**! 🚀

---

## **Step 1: Positional Encoding**

---

### **1.1 Why Do We Need Positional Encoding?**

#### **The Problem: Self-Attention Has No Sense of Order**

Remember how self-attention works? It computes relationships between all pairs of words using dot products:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

**The Issue:** This operation is **permutation equivariant**: shuffling the input words just shuffles the outputs in the same way, so the operation itself carries no information about word order!

**Demonstration:**

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

In [None]:
def scaled_dot_product_attention(Q, K, V, mask = None):
    """
    Scaled Dot-Product Attention
    
    Args:
        Q: Query tensor (batch_size, seq_len, d_k)
        K: Key tensor (batch_size, seq_len, d_k)
        V: Value tensor (batch_size, seq_len, d_v)
        mask: Optional mask (batch_size, seq_len, seq_len) or (seq_len, seq_len)
    
    Returns:
        output: Attention output (batch_size, seq_len, d_v)
        attention_weights: Attention weights (batch_size, seq_len, seq_len)
    """
    # Get dimension for scaling
    d_k = Q.size(-1)
    
    # Step 1 & 2: Compute scaled scores
    # Q: (batch, seq_len, d_k)
    # K.transpose: (batch, d_k, seq_len)
    # scores: (batch, seq_len, seq_len)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    
    # Step 3: Apply mask (if provided)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # Step 4: Apply softmax
    attention_weights = F.softmax(scores, dim=-1)
    
    # Step 5: Weighted sum of values
    output = torch.matmul(attention_weights, V)
    
    return output, attention_weights

In [None]:
def simple_attention(Q, K, V):
    """Simple scaled dot-product attention"""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

# Create word embeddings (without position info)
# Sentence: "cat sat mat" - 3 words, 4-dimensional embeddings
embeddings = {
    'cat': torch.tensor([1.0, 0.0, 0.5, 0.2]),
    'sat': torch.tensor([0.0, 1.0, 0.3, 0.8]),
    'mat': torch.tensor([0.5, 0.0, 1.0, 0.1])
}

# Original order: "cat sat mat"
sentence1 = torch.stack([embeddings['cat'], embeddings['sat'], embeddings['mat']]).unsqueeze(0)

# Shuffled order: "mat cat sat"
sentence2 = torch.stack([embeddings['mat'], embeddings['cat'], embeddings['sat']]).unsqueeze(0)

print("Original sentence: 'cat sat mat'")
print(f"Shape: {sentence1.shape}")
print()
print("Shuffled sentence: 'mat cat sat'")
print(f"Shape: {sentence2.shape}")

In [None]:
# Compute attention for both orders
output1, weights1 = simple_attention(sentence1, sentence1, sentence1)
output2, weights2 = simple_attention(sentence2, sentence2, sentence2)

print("=" * 60)
print("ATTENTION WEIGHTS COMPARISON")
print("=" * 60)

print("\n--- Original: 'cat sat mat' ---")
print("Attention weights:")
print(weights1.squeeze().detach().numpy().round(3))

print("\n--- Shuffled: 'mat cat sat' ---")
print("Attention weights:")
print(weights2.squeeze().detach().numpy().round(3))

print("\n" + "=" * 60)
print("KEY OBSERVATION:")
print("=" * 60)
print("\nNotice how the attention pattern for 'cat' is the SAME")
print("regardless of its position in the sentence!")
print("\n🚨 The model has NO IDEA about word order!")
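
To make this observation concrete, here is a minimal, dependency-free sketch that uses plain Python lists instead of tensors. The `softmax` and `self_attention` helpers are illustrative names, not part of the notebook's code. It checks that the attention output for 'cat' is the same whether 'cat' is the first or the second word.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (list of lists)."""
    d_k = len(X[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X] for q in X]
    weights = [softmax(row) for row in scores]
    return [[sum(w * v[j] for w, v in zip(row, X)) for j in range(d_k)] for row in weights]

# Same toy embeddings as in the demonstration
cat = [1.0, 0.0, 0.5, 0.2]
sat = [0.0, 1.0, 0.3, 0.8]
mat = [0.5, 0.0, 1.0, 0.1]

out_original = self_attention([cat, sat, mat])  # "cat sat mat"
out_shuffled = self_attention([mat, cat, sat])  # "mat cat sat"

# 'cat' is at index 0 in the original sentence and index 1 in the shuffled one
max_diff = max(abs(a - b) for a, b in zip(out_original[0], out_shuffled[1]))
print(f"Max difference in the output for 'cat': {max_diff:.2e}")
```

The difference should sit at floating-point noise level: moving a word changes where its output row appears, not what that row contains.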

#### **Why This Is a Problem**

Word order is **crucial** for understanding language:

| Sentence | Meaning |
|----------|--------|
| "The dog bit the man" | 🐕 Dog is the attacker |
| "The man bit the dog" | 👨 Man is the attacker |

Same words, completely different meanings! Without positional information, the model would treat these identically.

**Solution:** Add position information to the embeddings using **Positional Encoding**!

### **1.2 The Positional Encoding Formula**

The original Transformer paper uses **sine and cosine functions** of different frequencies:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Where:
- $pos$ = position in the sequence (0, 1, 2, ...)
- $i$ = dimension-pair index (0, 1, 2, ..., $d_{model}/2 - 1$)
- $d_{model}$ = embedding dimension (e.g., 512)

**Breaking It Down:**

| Component | Meaning |
|-----------|--------|
| $pos$ | Which position in the sequence (0th word, 1st word, etc.) |
| $i$ | Index of the sine/cosine pair (covers dimensions $2i$ and $2i+1$) |
| $10000^{2i/d_{model}}$ | Scaling term that grows with $i$, so the wavelength gets longer at higher dimensions |
| Even dimensions (2i) | Use sine |
| Odd dimensions (2i+1) | Use cosine |

**Intuition:** Each dimension captures position at a different "frequency":
- Low dimensions (small $i$): High frequency, captures fine-grained positions
- High dimensions (large $i$): Low frequency, captures coarse positions
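
To put rough numbers on this intuition, the sketch below prints the wavelength (how many positions one full sine/cosine cycle spans) for each dimension pair. It assumes a toy `d_model = 8` and uses only Python's `math` module; the variable names are illustrative.

```python
import math

d_model = 8  # assumed toy value, small enough to print every pair

wavelengths = []
for i in range(d_model // 2):
    # Angular frequency of pair i: 1 / 10000^(2i / d_model)
    omega = 1.0 / (10000 ** (2 * i / d_model))
    # Number of positions needed for one full cycle at this frequency
    wavelength = 2 * math.pi / omega
    wavelengths.append(wavelength)
    print(f"pair i={i} (dims {2 * i}, {2 * i + 1}): wavelength ~ {wavelength:.1f} positions")
```

The first pair repeats roughly every 6 positions, while the last pair needs thousands of positions to complete a cycle, which is why low dimensions separate neighbouring positions and high dimensions change only slowly.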

<div align="center">
  <img src="https://machinelearningmastery.com/wp-content/uploads/2022/01/PE3.png" width="600"/>
  <p><i>Positional Encoding: Each column is a position, each row is a dimension</i></p>
</div>

### **1.3 Step-by-Step Implementation**

Let's build the Positional Encoding step by step to understand each part:

In [None]:
# Parameters
max_seq_len = 10  # Maximum sequence length
d_model = 8       # Embedding dimension (small for visualization)

print(f"Creating positional encoding for:")
print(f"  - Max sequence length: {max_seq_len}")
print(f"  - Embedding dimension: {d_model}")

#### **Step 1a: Create Position Indices**

In [None]:
# Create position indices: [0, 1, 2, ..., max_seq_len-1]
# Shape: (max_seq_len, 1)
position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)

print("Position indices (pos):")
print(position.T)  # Transpose for horizontal display
print(f"\nShape: {position.shape}")

#### **Step 1b: Create the Division Term**

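We need to compute $10000^{2i/d_{model}}$ for each dimension pair $i$.

Following a common implementation pattern, we compute its **reciprocal** with `exp` and `log`, so that the angle is obtained by multiplying the position by this term instead of dividing:
$$\frac{1}{10000^{2i/d_{model}}} = e^{-2i \cdot \ln(10000) / d_{model}}$$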
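
As a quick sanity check, the sketch below compares the reciprocal $1/10000^{2i/d_{model}}$ (the factor that gets multiplied by the position) computed directly and computed through `exp` and `log`. It assumes a toy `d_model = 8` and uses only Python's `math` module.

```python
import math

d_model = 8  # assumed toy value

pairs = []
for two_i in range(0, d_model, 2):
    # Direct evaluation of 1 / 10000^(2i / d_model)
    direct = 1.0 / (10000.0 ** (two_i / d_model))
    # Same quantity via exp(-2i * ln(10000) / d_model)
    via_exp = math.exp(-two_i * math.log(10000.0) / d_model)
    pairs.append((direct, via_exp))
    print(f"2i={two_i}: direct={direct:.6f}  exp/log={via_exp:.6f}")
```

Both columns should agree up to floating-point rounding, and the values shrink as $2i$ grows.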

In [None]:
# Create dimension indices for even dimensions: [0, 2, 4, 6, ...]
# We only need d_model/2 values since each i covers 2 dimensions (sin and cos)
div_term_indices = torch.arange(0, d_model, 2).float()
print(f"Dimension indices (2i): {div_term_indices.numpy()}")

# Compute the reciprocal of the division term: 1 / 10000^(2i/d_model)
# Written as exp(-2i * ln(10000) / d_model)
div_term = torch.exp(div_term_indices * (-np.log(10000.0) / d_model))

print(f"\nDivision terms (1/10000^(2i/d_model)):")
print(div_term.numpy().round(6))

print(f"\nShape: {div_term.shape}")
print(f"\nNote: These decrease as dimension increases (lower frequency)")

#### **Step 1c: Compute Sine and Cosine Values**

In [None]:
# Initialize the positional encoding matrix
# Shape: (max_seq_len, d_model)
pe = torch.zeros(max_seq_len, d_model)

# Compute position * div_term for all positions and dimensions
# position: (max_seq_len, 1)
# div_term: (d_model/2,)
# Result: (max_seq_len, d_model/2)
angles = position * div_term

print("Angles (pos * div_term):")
print(angles.numpy().round(3))
print(f"\nShape: {angles.shape}")

In [None]:
# Apply sine to even indices (0, 2, 4, ...)
pe[:, 0::2] = torch.sin(angles)

# Apply cosine to odd indices (1, 3, 5, ...)
pe[:, 1::2] = torch.cos(angles)

print("Positional Encoding Matrix:")
print(pe.numpy().round(3))
print(f"\nShape: {pe.shape}")
print("\nRows = Positions (0 to 9)")
print("Columns = Dimensions (0 to 7)")
print("Even columns = sin, Odd columns = cos")
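
As a cross-check on the vectorized steps, the encoding can also be written as a direct, loop-based transcription of the formula. This is a minimal reference sketch in plain Python; `positional_encoding` and `pe_reference` are hypothetical names, and an even `d_model` is assumed.

```python
import math

def positional_encoding(pos, d_model):
    """Return the encoding of one position as a list (assumes even d_model)."""
    row = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        row.append(math.sin(angle))  # dimension 2i
        row.append(math.cos(angle))  # dimension 2i + 1
    return row

max_seq_len, d_model = 10, 8
pe_reference = [positional_encoding(pos, d_model) for pos in range(max_seq_len)]

# Print the first few rows for comparison with the matrix computed with tensors
for pos in range(3):
    print(f"pos {pos}:", [round(v, 3) for v in pe_reference[pos]])
```

Position 0 alternates between 0 and 1 because $\sin(0) = 0$ and $\cos(0) = 1$; the remaining rows should match the tensor version up to float32 rounding.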

### **1.4 Visualizing Positional Encodings**

In [None]:
# Create a larger positional encoding for better visualization
max_seq_len_viz = 100
d_model_viz = 64

position_viz = torch.arange(0, max_seq_len_viz, dtype=torch.float).unsqueeze(1)
div_term_viz = torch.exp(torch.arange(0, d_model_viz, 2).float() * (-np.log(10000.0) / d_model_viz))

pe_viz = torch.zeros(max_seq_len_viz, d_model_viz)
pe_viz[:, 0::2] = torch.sin(position_viz * div_term_viz)
pe_viz[:, 1::2] = torch.cos(position_viz * div_term_viz)

# Visualize as heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(pe_viz.numpy(), cmap='RdBu_r', center=0, 
            xticklabels=8, yticklabels=10)
plt.xlabel('Embedding Dimension', fontsize=12)
plt.ylabel('Position in Sequence', fontsize=12)
plt.title('Positional Encoding Visualization\n(Red = Positive, Blue = Negative)', fontsize=14)
plt.tight_layout()
plt.show()

print("Observations:")
print("1. Left side (low dimensions): High frequency patterns - changes rapidly with position")
print("2. Right side (high dimensions): Low frequency patterns - changes slowly")
print("3. Each position has a UNIQUE pattern!")

In [None]:
# Visualize individual dimensions
fig, axes = plt.subplots(2, 2, figsize=(14, 8))

dimensions_to_plot = [0, 1, 30, 31]  # Low and high frequency dimensions
positions = np.arange(max_seq_len_viz)

for idx, dim in enumerate(dimensions_to_plot):
    ax = axes[idx // 2, idx % 2]
    values = pe_viz[:, dim].numpy()
    
    ax.plot(positions, values, 'b-', linewidth=2)
    ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel('Position', fontsize=10)
    ax.set_ylabel('Encoding Value', fontsize=10)
    
    func_type = 'sin' if dim % 2 == 0 else 'cos'
    freq_type = 'HIGH' if dim < 16 else 'LOW'
    ax.set_title(f'Dimension {dim} ({func_type}) - {freq_type} Frequency', fontsize=12)
    ax.grid(True, alpha=0.3)

plt.suptitle('Positional Encoding: Different Frequencies at Different Dimensions', 
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n📊 Key Insight:")
print("- Low dimensions (0, 1): Oscillate rapidly - distinguish nearby positions")
print("- Higher dimensions (30, 31): Oscillate slowly - capture global position")
print("- Together, they create a UNIQUE fingerprint for each position!")

### **1.5 The Complete PositionalEncoding Module**

Now let's wrap everything into a proper `nn.Module`:

In [None]:
class PositionalEncoding(nn.Module):
    """
    Positional Encoding using sine and cosine functions.
    
    Adds positional information to input embeddings so the model
    can distinguish between different positions in the sequence.
    
    Args:
        d_model: Embedding dimension (must match input embedding size)
        max_seq_len: Maximum sequence length to support
        dropout: Dropout probability (default: 0.1)
    """
    
    def __init__(self, d_model, max_seq_len=5000, dropout=0.1):
        super(PositionalEncoding, self).__init__()
        
        self.dropout = nn.Dropout(p=dropout)
        
        # Create positional encoding matrix
        # Shape: (max_seq_len, d_model)
        pe = torch.zeros(max_seq_len, d_model)
        
        # Position indices: [0, 1, 2, ..., max_seq_len-1]
        # Shape: (max_seq_len, 1)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        
        # Reciprocal division term: 1 / 10000^(2i/d_model), computed via exp and log
        # Shape: (d_model/2,)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model)
        )
        
        # Apply sine to even indices
        pe[:, 0::2] = torch.sin(position * div_term)
        
        # Apply cosine to odd indices
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Add batch dimension: (1, max_seq_len, d_model)
        pe = pe.unsqueeze(0)
        
        # Register as buffer (not a parameter, but should be saved with model)
        # Buffers are tensors that are part of the module but not trainable
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        """
        Add positional encoding to input embeddings.
        
        Args:
            x: Input tensor of shape (batch_size, seq_len, d_model)
            
        Returns:
            Tensor of same shape with positional encoding added
        """
        # x: (batch_size, seq_len, d_model)
        # self.pe: (1, max_seq_len, d_model)
        # We slice pe to match the actual sequence length
        seq_len = x.size(1)
        
        # Add positional encoding (broadcasts across batch dimension)
        x = x + self.pe[:, :seq_len, :]
        
        return self.dropout(x)

print("✅ PositionalEncoding class defined!")

### **1.6 Testing the PositionalEncoding Module**

In [None]:
# Test parameters
batch_size = 2
seq_len = 5
d_model = 16

# Create the positional encoding module
pos_encoder = PositionalEncoding(d_model=d_model, max_seq_len=100, dropout=0.0)

# Create dummy embeddings (simulating word embeddings)
embeddings = torch.randn(batch_size, seq_len, d_model)

print("Input embeddings:")
print(f"  Shape: {embeddings.shape}")
print(f"  Sample values (batch 0, position 0): {embeddings[0, 0, :4].numpy().round(3)}")

# Apply positional encoding
output = pos_encoder(embeddings)

print("\nOutput (embeddings + positional encoding):")
print(f"  Shape: {output.shape}")
print(f"  Sample values (batch 0, position 0): {output[0, 0, :4].detach().numpy().round(3)}")

print("\n✅ Positional encoding applied successfully!")

In [None]:
# Verify that different positions get different encodings
print("=" * 60)
print("VERIFYING UNIQUE POSITION ENCODINGS")
print("=" * 60)

# Get the positional encodings for first 5 positions
pe_values = pos_encoder.pe[0, :5, :8].numpy().round(3)  # First 5 positions, first 8 dims

print("\nPositional encodings for positions 0-4 (first 8 dimensions):")
print()
for pos in range(5):
    print(f"Position {pos}: {pe_values[pos]}")

print("\n✅ Each position has a unique encoding pattern!")

In [None]:
# Now let's verify that attention can distinguish positions
print("=" * 60)
print("ATTENTION WITH POSITIONAL ENCODING")
print("=" * 60)

# Same word embeddings as before
embeddings = {
    'cat': torch.tensor([1.0, 0.0, 0.5, 0.2]),
    'sat': torch.tensor([0.0, 1.0, 0.3, 0.8]),
    'mat': torch.tensor([0.5, 0.0, 1.0, 0.1])
}

# Create positional encoder for d_model=4
pos_enc_small = PositionalEncoding(d_model=4, max_seq_len=10, dropout=0.0)

# Original order: "cat sat mat"
sentence1 = torch.stack([embeddings['cat'], embeddings['sat'], embeddings['mat']]).unsqueeze(0)

# Shuffled order: "mat cat sat"  
sentence2 = torch.stack([embeddings['mat'], embeddings['cat'], embeddings['sat']]).unsqueeze(0)

# Add positional encodings
sentence1_with_pos = pos_enc_small(sentence1)
sentence2_with_pos = pos_enc_small(sentence2)

# Compute attention
output1, weights1 = simple_attention(sentence1_with_pos, sentence1_with_pos, sentence1_with_pos)
output2, weights2 = simple_attention(sentence2_with_pos, sentence2_with_pos, sentence2_with_pos)

print("\n--- Original: 'cat sat mat' (with positional encoding) ---")
print("Attention weights:")
print(weights1.squeeze().detach().numpy().round(3))

print("\n--- Shuffled: 'mat cat sat' (with positional encoding) ---")
print("Attention weights:")
print(weights2.squeeze().detach().numpy().round(3))

print("\n" + "=" * 60)
print("KEY OBSERVATION:")
print("=" * 60)
print("\n✅ Now the attention patterns are DIFFERENT!")
print("✅ The model can distinguish word ORDER!")
print("✅ Same words at different positions → different representations!")

### **1.7 Why Sine and Cosine? The Clever Math**

You might wonder: Why use sine and cosine specifically? There are several elegant reasons:

#### **Reason 1: Unique Encoding for Each Position**

The combination of sines and cosines at different frequencies creates a unique "fingerprint" for each position.
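
One way to probe this claim empirically is to measure the smallest Euclidean distance between any two encoded positions. The sketch below does that for the first 100 positions, assuming `d_model = 16`; `positional_encoding` is a hypothetical helper written in plain Python, not the notebook's module.

```python
import math
from itertools import combinations

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of a single position (assumes even d_model)."""
    row = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        row += [math.sin(angle), math.cos(angle)]
    return row

d_model, num_positions = 16, 100  # assumed values for illustration
rows = [positional_encoding(p, d_model) for p in range(num_positions)]

# Smallest distance over all pairs of distinct positions
min_dist = min(math.dist(a, b) for a, b in combinations(rows, 2))
print(f"Smallest distance between two encodings: {min_dist:.4f}")
```

A strictly positive minimum means no two of these positions share an encoding. This is an empirical check for one setting, not a proof for every length and dimension.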

#### **Reason 2: Bounded Values**

Sine and cosine always lie between -1 and 1, so the encoding stays on the same scale no matter how large the position is:
$$-1 \leq \sin(x), \cos(x) \leq 1$$

#### **Reason 3: Relative Position Information**

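This is the cleverest part! For any fixed offset $k$, we can express $PE_{pos+k}$ as a linear function of $PE_{pos}$. Writing $\omega_i = 1/10000^{2i/d_{model}}$ for the frequency of pair $i$:

$$\sin(\omega_i (pos + k)) = \sin(\omega_i \, pos)\cos(\omega_i k) + \cos(\omega_i \, pos)\sin(\omega_i k)$$
$$\cos(\omega_i (pos + k)) = \cos(\omega_i \, pos)\cos(\omega_i k) - \sin(\omega_i \, pos)\sin(\omega_i k)$$

The coefficients $\cos(\omega_i k)$ and $\sin(\omega_i k)$ depend only on the offset $k$, not on $pos$. This is why the original paper hypothesized that the model can easily learn to attend to **relative positions** (e.g., "the word 3 positions back")!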
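
The angle-addition identities can be checked numerically. In the sketch below, each (sin, cos) pair has its own frequency $\omega_i = 1/10000^{2i/d_{model}}$, and `shift_by` applies a per-pair rotation whose coefficients depend only on the offset $k$. Both helpers are hypothetical names in plain Python, and `d_model = 8` is an assumed toy value.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of a single position (assumes even d_model)."""
    row = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        row += [math.sin(angle), math.cos(angle)]
    return row

def shift_by(encoding, k, d_model):
    """Map PE(pos) to PE(pos + k) using only k, never pos itself."""
    shifted = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        s, c = encoding[2 * i], encoding[2 * i + 1]  # sin(w*pos), cos(w*pos)
        sk, ck = math.sin(omega * k), math.cos(omega * k)
        shifted.append(s * ck + c * sk)  # sin(w*(pos + k))
        shifted.append(c * ck - s * sk)  # cos(w*(pos + k))
    return shifted

d_model, k = 8, 3
errors = {}
for pos in (0, 10):
    predicted = shift_by(positional_encoding(pos, d_model), k, d_model)
    actual = positional_encoding(pos + k, d_model)
    errors[pos] = max(abs(p - a) for p, a in zip(predicted, actual))
    print(f"pos={pos} -> pos+{k}: max error {errors[pos]:.2e}")
```

The same `shift_by` call works for both starting positions, which is exactly what "a linear function that depends only on the offset" means.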

#### **Reason 4: Extrapolation to Longer Sequences**

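Since sine and cosine are defined for any input, the encoding can be computed for any position, even one beyond the sequence lengths seen during training. The original paper hypothesized that this may help the model extrapolate to longer sequences.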

In [None]:
# Demonstrate relative position property
print("=" * 60)
print("RELATIVE POSITION DEMONSTRATION")
print("=" * 60)

# Get positional encodings for positions 0, 3, and 10, 13 (same relative offset of 3)
pe_large = PositionalEncoding(d_model=64, max_seq_len=100, dropout=0.0)

pe_0 = pe_large.pe[0, 0, :8].numpy()
pe_3 = pe_large.pe[0, 3, :8].numpy()
pe_10 = pe_large.pe[0, 10, :8].numpy()
pe_13 = pe_large.pe[0, 13, :8].numpy()

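# Compute differences (these depend on the absolute position, so they differ)
diff_0_to_3 = pe_3 - pe_0
diff_10_to_13 = pe_13 - pe_10

print("\nDifference between position 0 and 3:")
print(diff_0_to_3.round(4))

print("\nDifference between position 10 and 13:")
print(diff_10_to_13.round(4))

# Dot products over complete (sin, cos) pairs depend only on the offset
print(f"\nDot product PE(0) . PE(3):   {np.dot(pe_0, pe_3):.4f}")
print(f"Dot product PE(10) . PE(13): {np.dot(pe_10, pe_13):.4f}")

print("\n🎯 The raw differences are NOT equal, but the dot products match (up to rounding)!")
print("An offset acts as a fixed rotation of each (sin, cos) pair, which the model can learn.")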