# Day 17: Introduction to Transformers

**Goal:** Understand the Transformer architecture and why it replaced RNNs for most NLP tasks.

**Time estimate:** 3-4 hours

**What you'll learn:**
- Self-attention mechanism and scaled dot-product attention
- Multi-head attention
- Positional encodings
- Complete Transformer block

---

## Part 1: Why Transformers? (RNN Limitations)

### The Problem with RNNs/LSTMs

From Days 13-16, we saw RNNs process sequences **sequentially**:

```
Input:  [token_0, token_1, token_2, ..., token_n]
RNN:     h_0  -->  h_1  -->  h_2  -->  ...  -->  h_n
```

**Issues:**
1. **Sequential processing is slow**: Can't parallelize across time steps
2. **Long-range dependencies fade**: Information decays over long sequences
3. **Vanishing/exploding gradients**: Even LSTMs have difficulty with very long sequences
4. **Limited by hidden state size**: All information compressed into fixed-size vector

### The Transformer Solution

**Key Innovation:** Replace sequential recurrence with **self-attention** that can attend to **any** position in one step.

```
Input:  [token_0, token_1, token_2, ..., token_n]
         |         |         |              |
         +-------- Self-Attention --------+
         |         |         |              |
Output: [out_0, out_1, out_2, ..., out_n]
```

**Advantages:**
- ✅ **Fully parallelizable**: All positions in a sequence are processed simultaneously during training
- ✅ **Direct long-range connections**: Each position can attend to any other position
- ✅ **Constant path length**: Any two positions are connected in a single attention step, with no chain of hidden states in between
- ✅ **Interpretable**: Attention weights show which positions each token draws on, a useful (if partial) window into the model

**Result:** Transformers became the foundation for:
- BERT (2018)
- GPT family (2018-2023)
- T5, RoBERTa, XLNet, ELECTRA, and most modern NLP models

## Part 2: Self-Attention Mechanism

### Intuition: "What should I pay attention to?"

When processing token `i` in a sequence, we want to:
1. Compare token `i` with **all** other tokens
2. Compute **relevance scores** (how important is each token?)
3. **Aggregate information** from all tokens, weighted by relevance

### Mathematical Formulation

#### Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Where:
- $Q$ = **Query** matrix (shape: `[seq_len, d_k]`)
- $K$ = **Key** matrix (shape: `[seq_len, d_k]`)
- $V$ = **Value** matrix (shape: `[seq_len, d_v]`)
- $d_k$ = dimension of keys (used for scaling)

**Step by step:**

1. **Compute attention scores**: $QK^T$ (shape: `[seq_len, seq_len]`)
   - Entry $(i, j)$ = how much token $i$ should attend to token $j$
   
2. **Scale by $\sqrt{d_k}$**: Keeps the softmax in a well-behaved range
   - Without scaling, the magnitude of the dot products grows with $d_k$
   - Large scores saturate the softmax, which produces tiny gradients
   
3. **Apply softmax**: Normalize to get probabilities
   - Each row sums to 1
   - Represents "attention distribution"
   
4. **Weight values**: Multiply by $V$
   - Aggregate value vectors weighted by attention
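
The effect of step 2 can be checked numerically. The sketch below assumes query and key components are independent draws from a standard normal distribution (the usual assumption behind the $\sqrt{d_k}$ factor); the sample size and variable names are illustrative choices, not part of the notebook.

```python
import torch

torch.manual_seed(0)

d_k = 64
n_pairs = 10_000

# Assumption: query/key components are i.i.d. N(0, 1)
q = torch.randn(n_pairs, d_k)
k = torch.randn(n_pairs, d_k)

raw_scores = (q * k).sum(dim=-1)          # one dot product per (q, k) pair
scaled_scores = raw_scores / d_k ** 0.5   # divide by sqrt(d_k)

raw_std = raw_scores.std().item()
scaled_std = scaled_scores.std().item()
print(f"std of raw dot products:    {raw_std:.2f}  (theory: sqrt(d_k) = {d_k ** 0.5:.2f})")
print(f"std of scaled dot products: {scaled_std:.2f}  (theory: 1.00)")

# Larger scores push softmax toward a one-hot distribution
scores = torch.randn(8) * d_k ** 0.5      # a row of unscaled scores
max_unscaled = torch.softmax(scores, dim=-1).max().item()
max_scaled = torch.softmax(scores / d_k ** 0.5, dim=-1).max().item()
print(f"largest softmax weight: unscaled={max_unscaled:.3f}, scaled={max_scaled:.3f}")
```

Under this assumption the raw scores have a standard deviation close to $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to roughly 1, which keeps the softmax away from its saturated region.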

### Example: Attention in Action

Sentence: "The cat sat on the mat"

When processing "cat", attention might assign:
- "The" (high relevance - article modifying "cat")
- "sat" (high relevance - main verb, with "cat" as its subject)
- "on" (medium relevance - location)
- "the" (medium relevance - another article)
- "mat" (medium relevance - location object)

The output for "cat" is a weighted combination of the value vectors of all tokens, each derived from a word embedding.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import seaborn as sns

# Set seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

### Implementation: Scaled Dot-Product Attention

In [None]:
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Compute scaled dot-product attention.
    
    Args:
        Q: Query matrix, shape (batch_size, seq_len, d_k)
        K: Key matrix, shape (batch_size, seq_len, d_k)
        V: Value matrix, shape (batch_size, seq_len, d_v)
        mask: Optional mask broadcastable to (batch_size, seq_len, seq_len);
              positions where mask == 0 are blocked (e.g., future tokens)
    
    Returns:
        output: Attention output, shape (batch_size, seq_len, d_v)
        attention_weights: Attention weights, shape (batch_size, seq_len, seq_len)
    """
    d_k = Q.shape[-1]
    
    # Step 1: Compute attention scores (Q @ K^T) / sqrt(d_k)
    # Shape: (batch_size, seq_len, seq_len)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / np.sqrt(d_k)
    
    # Step 2: Apply mask (optional, for causal attention)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    
    # Step 3: Apply softmax
    # Shape: (batch_size, seq_len, seq_len)
    attention_weights = torch.softmax(scores, dim=-1)
    
    # A fully masked row is all -inf, so softmax yields NaN; replace those with 0
    attention_weights = torch.nan_to_num(attention_weights, 0.0)
    
    # Step 4: Multiply by values
    # Shape: (batch_size, seq_len, d_v)
    output = torch.matmul(attention_weights, V)
    
    return output, attention_weights

print("✓ Scaled dot-product attention implemented")

### Test Attention: Simple Example

In [None]:
# Create a simple example
# Sequence of 4 tokens, embedding dimension = 64
seq_len = 4
d_k = 64
d_v = 64
batch_size = 1

# Random Q, K, V (in practice, come from embeddings)
Q = torch.randn(batch_size, seq_len, d_k).to(device)
K = torch.randn(batch_size, seq_len, d_k).to(device)
V = torch.randn(batch_size, seq_len, d_v).to(device)

# Compute attention
output, attention_weights = scaled_dot_product_attention(Q, K, V)

print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
print(f"\nAttention weights (for first sequence):")
print(attention_weights[0])
print(f"\nRow sums (should be 1.0): {attention_weights[0].sum(dim=1)}")

### Visualize Attention Patterns

In [None]:
# Create a more interpretable example with sentence tokens
words = ["The", "cat", "sat", "on", "the", "mat"]
seq_len = len(words)

# Random Q, K, V stand in for a trained model's projections,
# so the pattern below is arbitrary (a trained model would show structure)
torch.manual_seed(42)
Q = torch.randn(1, seq_len, 64)
K = torch.randn(1, seq_len, 64)
V = torch.randn(1, seq_len, 64)

output, attention_weights = scaled_dot_product_attention(Q, K, V)

# Visualize attention patterns
fig, ax = plt.subplots(figsize=(8, 7))

# Plot heatmap
sns.heatmap(attention_weights[0].detach().cpu().numpy(), 
            annot=True, 
            fmt='.2f', 
            cmap='viridis',
            xticklabels=words,
            yticklabels=words,
            cbar_kws={'label': 'Attention Weight'},
            ax=ax)

ax.set_xlabel('Key (attend to)', fontsize=12, fontweight='bold')
ax.set_ylabel('Query (attend from)', fontsize=12, fontweight='bold')
ax.set_title('Self-Attention Weights\n"The cat sat on the mat"', 
             fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("Each row shows: FROM which token (left) TO which tokens (top)")
print(f"\nExample: Token '{words[1]}' (cat) attends to:")
for j, word in enumerate(words):
    weight = attention_weights[0, 1, j].item()
    print(f"  {word:6s}: {weight:.3f}")

## Part 3: Multi-Head Attention

### Motivation: Why Multiple Heads?

A single attention head produces **one** attention distribution per token, so it tends to capture one type of relationship at a time.

**Example:**
- "The cat sat on the mat"
- One head might learn: "subject-verb dependencies"
- Another head might learn: "determiner-noun links"
- Another might learn: "prepositional phrases"

With **multiple heads**, the model can simultaneously attend to:
- Different types of relationships
- Different parts of the sequence
- Different levels of abstraction

### Mathematical Formulation

$$\text{MultiHeadAttention}(Q, K, V) = \text{Concat}(head_1, ..., head_h) W^O$$

Where each head is:
$$head_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

**Parameters:**
- $h$ = number of heads
- $W_i^Q, W_i^K, W_i^V$ = linear projections for head $i$
- $W^O$ = output projection
- Each head operates on dimension $d_k = d_{model}/h$

**Benefit:** Computational cost similar to single-head with dimension $d_{model}$, but much more expressive.
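
One way to see this is to count parameters for different head counts. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` (not the layer implemented in this notebook) so that it runs on its own; `count_params` is a hypothetical helper introduced for illustration.

```python
import torch.nn as nn

d_model = 64

def count_params(module):
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# Same d_model, different numbers of heads
counts = {
    h: count_params(nn.MultiheadAttention(embed_dim=d_model, num_heads=h))
    for h in (1, 4, 8)
}

for h, n in counts.items():
    print(f"num_heads={h}: d_k={d_model // h:2d}, parameters={n:,}")
```

Because each head works on $d_k = d_{model}/h$ dimensions, the projection matrices have the same total size regardless of $h$: adding heads splits the work rather than adding to it.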

### Implementation: Multi-Head Attention Layer

In [None]:
class MultiHeadAttention(nn.Module):
    """
    Multi-head self-attention layer.
    
    Args:
        d_model: Dimension of model (embedding dimension)
        num_heads: Number of attention heads
    """
    
    def __init__(self, d_model, num_heads):
        super().__init__()
        
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # Dimension per head
        
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        
        # Output projection
        self.W_o = nn.Linear(d_model, d_model)
        
        self.register_buffer('scale', torch.tensor(self.d_k ** 0.5))
    
    def forward(self, x, mask=None):
        """
        Args:
            x: Input tensor, shape (batch_size, seq_len, d_model)
            mask: Optional attention mask
        
        Returns:
            output: Output tensor, shape (batch_size, seq_len, d_model)
            attention_weights: Attention weights from all heads
        """
        batch_size, seq_len, d_model = x.shape
        
        # Step 1: Linear projections
        # Shape: (batch_size, seq_len, d_model)
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        
        # Step 2: Split into multiple heads
        # Reshape from (batch_size, seq_len, d_model)
        # to (batch_size, seq_len, num_heads, d_k)
        # then transpose to (batch_size, num_heads, seq_len, d_k)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        
        # Step 3: Apply attention for each head
        # Compute scores: (batch_size, num_heads, seq_len, seq_len)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        
        # Apply mask if provided
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        # Attention weights: (batch_size, num_heads, seq_len, seq_len)
        attention_weights = torch.softmax(scores, dim=-1)
        attention_weights = torch.nan_to_num(attention_weights, 0.0)
        
        # Apply attention to values
        # (batch_size, num_heads, seq_len, d_k)
        attended_values = torch.matmul(attention_weights, V)
        
        # Step 4: Concatenate heads
        # Transpose from (batch_size, num_heads, seq_len, d_k)
        # to (batch_size, seq_len, num_heads, d_k)
        # then reshape to (batch_size, seq_len, d_model)
        concat = attended_values.transpose(1, 2).contiguous()
        concat = concat.view(batch_size, seq_len, d_model)
        
        # Step 5: Final linear projection
        output = self.W_o(concat)
        
        return output, attention_weights

print("✓ Multi-head attention layer implemented")

### Test Multi-Head Attention

In [None]:
# Create multi-head attention layer
d_model = 64
num_heads = 4
seq_len = 6
batch_size = 2

mha = MultiHeadAttention(d_model, num_heads).to(device)

# Create input
x = torch.randn(batch_size, seq_len, d_model).to(device)

# Forward pass
output, attn_weights = mha(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print(f"  (batch_size={batch_size}, num_heads={num_heads}, seq_len={seq_len}, seq_len={seq_len})")

# Count parameters
params = sum(p.numel() for p in mha.parameters())
print(f"\nTotal parameters: {params:,}")
print(f"  (4 linear layers, each {d_model} × {d_model} weights + {d_model} biases = {4 * (d_model * d_model + d_model):,})")

### Visualize Multiple Heads

In [None]:
# Get attention weights for visualization
words = ["The", "cat", "sat", "on", "the", "mat"]
x_demo = torch.randn(1, len(words), d_model).to(device)
output_demo, attn_weights_demo = mha(x_demo)

# Plot attention patterns for each head
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for head_idx in range(num_heads):
    ax = axes[head_idx]
    
    # Get attention weights for this head
    head_attn = attn_weights_demo[0, head_idx].detach().cpu().numpy()
    
    # Plot heatmap
    sns.heatmap(head_attn, 
                annot=True, 
                fmt='.2f', 
                cmap='Blues',
                xticklabels=words,
                yticklabels=words,
                ax=ax,
                cbar=False)
    
    ax.set_title(f'Head {head_idx + 1}', fontweight='bold')
    ax.set_xlabel('Attend to')
    ax.set_ylabel('Attend from')

plt.suptitle('Multi-Head Attention Patterns', fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("Note: this layer is untrained, so these patterns come from random initialization.")
print("After training, heads often specialize: one might focus on adjacent words, another on content words, etc.")

## Part 4: Positional Encodings

### The Problem: Where is Position Information?

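Self-attention without position information is **permutation-equivariant**: reordering the input tokens only reorders the outputs.
```
"The cat sat" vs. "cat The sat" → each token gets the same output vector in both orderings
```

This is because:
- Attention scores depend only on the **content** of token pairs (their similarity), not on where they appear
- There is no inherent notion of position or order
- We need to **inject position information** into the embeddings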
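
A quick check of this property, as a minimal sketch: `self_attention` below is a stripped-down stand-in with $Q = K = V = x$ and no learned projections (an assumption made to keep the example short), and the random tensor only plays the role of three token embeddings.

```python
import torch

torch.manual_seed(0)

def self_attention(x):
    """Plain self-attention with Q = K = V = x (no projections, no positions)."""
    d_k = x.shape[-1]
    scores = x @ x.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(1, 3, 8)          # stand-in embeddings for ["The", "cat", "sat"]
perm = torch.tensor([1, 0, 2])    # reorder to ["cat", "The", "sat"]

out = self_attention(x)              # original order
out_perm = self_attention(x[:, perm])  # permuted order

# Permuting the inputs permutes the output rows in exactly the same way
same_vectors = torch.allclose(out_perm, out[:, perm], atol=1e-5)
print(f"Outputs are the same vectors, reordered: {same_vectors}")
```

Each token ends up with the same vector in either ordering, so nothing in the output tells the model which word came first.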

### Solution: Positional Encodings

Add position-dependent values to token embeddings.

**Sinusoidal Positional Encodings (from "Attention is All You Need"):**

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$

Where:
- $pos$ = position in sequence (0, 1, 2, ...)
- $i$ = index of the sine/cosine pair (0, 1, ..., $d_{model}/2 - 1$), so dimensions $2i$ and $2i+1$ share one frequency
- $d_{model}$ = embedding dimension

**Why sinusoidal?**
1. Creates a unique pattern for each position
2. Model can learn relative positions ($PE_{pos+k}$ is a linear transformation of $PE_{pos}$)
3. May extrapolate to sequences longer than those seen in training
4. Fixed (non-learned) - reduces parameters
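
Point 2 follows from the angle-addition identities: shifting a position by $k$ rotates each sine/cosine pair by an angle that depends only on $k$. The sketch below verifies this numerically; the `shift` helper and the float64 table are illustrative choices made here, not part of the notebook's `PositionalEncoding` class.

```python
import torch

d_model, max_len = 64, 100
pos = torch.arange(max_len, dtype=torch.float64).unsqueeze(1)      # (max_len, 1)
i = torch.arange(d_model // 2, dtype=torch.float64).unsqueeze(0)   # (1, d_model/2)
freqs = 1.0 / (10000 ** (2 * i / d_model))                         # one frequency per pair

# Sinusoidal table: sin on even dimensions, cos on odd dimensions
pe = torch.zeros(max_len, d_model, dtype=torch.float64)
pe[:, 0::2] = torch.sin(pos * freqs)
pe[:, 1::2] = torch.cos(pos * freqs)

def shift(pe_row, k):
    """Map PE(pos) to PE(pos + k) with a rotation that depends only on k."""
    s, c = pe_row[0::2], pe_row[1::2]
    sin_k, cos_k = torch.sin(k * freqs[0]), torch.cos(k * freqs[0])
    out = torch.empty_like(pe_row)
    out[0::2] = s * cos_k + c * sin_k    # sin(a + b) = sin a cos b + cos a sin b
    out[1::2] = c * cos_k - s * sin_k    # cos(a + b) = cos a cos b - sin a sin b
    return out

k = 5
max_err = max(
    (shift(pe[p], k) - pe[p + k]).abs().max().item()
    for p in range(max_len - k)
)
print(f"Max error of the linear map over all positions: {max_err:.2e}")
```

The same rotation works for every starting position, which is what allows a linear layer to express "k tokens to the right" without knowing the absolute position.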

### Alternative: Learned Positional Embeddings

Simply learn embedding table:
$$PE = \text{Embedding}(pos)$$

- ✅ Simpler, model can learn whatever helps
- ❌ Limited to the maximum sequence length seen in training
- Widely used in practice (e.g., BERT and GPT-2); the original paper reported nearly identical results for both variants
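
A learned table can be sketched in a few lines. The class name `LearnedPositionalEmbedding`, its attribute names, and the length check are hypothetical choices for illustration; the interface mirrors an "add to the input embeddings" layer.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Hypothetical learned alternative: one trainable vector per position."""

    def __init__(self, d_model, max_seq_len=1000):
        super().__init__()
        self.max_seq_len = max_seq_len
        self.pos_embedding = nn.Embedding(max_seq_len, d_model)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model)
        seq_len = x.size(1)
        if seq_len > self.max_seq_len:
            raise ValueError(f"seq_len {seq_len} exceeds max_seq_len {self.max_seq_len}")
        positions = torch.arange(seq_len, device=x.device)   # (seq_len,)
        return x + self.pos_embedding(positions)              # broadcasts over the batch

layer = LearnedPositionalEmbedding(d_model=16, max_seq_len=32)
out = layer(torch.zeros(2, 10, 16))
n_params = sum(p.numel() for p in layer.parameters())
print(f"Output shape: {tuple(out.shape)}, extra parameters: {n_params:,}")
```

The explicit length check makes the limitation visible: positions beyond `max_seq_len` have no embedding at all.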

### Implementation: Positional Encodings

In [None]:
class PositionalEncoding(nn.Module):
    """
    Positional encoding using sinusoidal functions.
    
    Args:
        d_model: Dimension of model
        max_seq_len: Maximum sequence length
    """
    
    def __init__(self, d_model, max_seq_len=1000):
        super().__init__()
        
        # Create position indices: [0, 1, 2, ..., max_seq_len-1]
        positions = torch.arange(max_seq_len).unsqueeze(1)  # Shape: (max_seq_len, 1)
        
        # Create dimension indices: [0, 1, 2, ..., d_model-1]
        dimensions = torch.arange(d_model).unsqueeze(0)  # Shape: (1, d_model)
        
        # Compute angle rates: 1 / 10000^(2i/d_model), where i = dimension // 2
        angle_rates = 1 / (10000 ** (2 * (dimensions // 2) / d_model))
        
        # Compute angles: pos / 10000^(2i/d_model)
        angle_rads = positions * angle_rates  # Shape: (max_seq_len, d_model)
        
        # Apply sin to even indices, cos to odd indices
        angle_rads[:, 0::2] = torch.sin(angle_rads[:, 0::2])  # sin for 0, 2, 4, ...
        angle_rads[:, 1::2] = torch.cos(angle_rads[:, 1::2])  # cos for 1, 3, 5, ...
        
        # Register as buffer (not a parameter, but part of module state)
        self.register_buffer('pe', angle_rads.unsqueeze(0))  # Shape: (1, max_seq_len, d_model)
    
    def forward(self, x):
        """
        Args:
            x: Input embeddings, shape (batch_size, seq_len, d_model)
        
        Returns:
            x + positional_encoding
        """
        seq_len = x.size(1)
        return x + self.pe[:, :seq_len, :]

print("✓ Positional encoding layer implemented")

### Visualize Positional Encodings

In [None]:
# Create positional encoding
d_model = 512
pe = PositionalEncoding(d_model, max_seq_len=100)

# Extract positional encodings (without the input, just the PE)
pos_enc = pe.pe[0, :50, :].detach().cpu().numpy()  # First 50 positions

# Visualize
fig, ax = plt.subplots(figsize=(12, 8))

# Plot as heatmap
im = ax.imshow(pos_enc.T, cmap='coolwarm', aspect='auto')

ax.set_xlabel('Position in Sequence', fontsize=12, fontweight='bold')
ax.set_ylabel('Dimension Index', fontsize=12, fontweight='bold')
ax.set_title(f'Sinusoidal Positional Encodings (d_model={d_model})', 
             fontsize=14, fontweight='bold')

# Add colorbar
cbar = plt.colorbar(im, ax=ax)
cbar.set_label('Encoding Value', fontweight='bold')

plt.tight_layout()
plt.show()

print("Observations:")
print("- Each position has a unique pattern")
print("- Different dimensions oscillate at different frequencies")
print("- Lower dimension indices oscillate quickly (capture fine-grained positions)")
print("- Higher dimension indices oscillate slowly (capture long-range positions)")

### Property: Learning Relative Positions

In [None]:
# Demonstrate that sinusoidal PE can represent relative positions
# PE(pos + k) can be computed from PE(pos) via linear transformation

pos_enc_small = PositionalEncoding(d_model=64, max_seq_len=100)
pe_table = pos_enc_small.pe[0].detach().cpu()  # Shape: (100, 64)

# Pick two positions
pos1, pos2 = 10, 15
pe_pos1 = pe_table[pos1]
pe_pos2 = pe_table[pos2]

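# Offset between the two positions
offset = pos2 - pos1
pe_offset = pe_table[offset]  # PE at position `offset`, used only for the shape check below

# PE(pos2) is an exact linear transformation of PE(pos1): each (sin, cos) pair
# is rotated by an angle that depends only on the offset, not on pos1.
# This allows the model to learn relative position relationships.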

print(f"Position 1: {pos1}")
print(f"Position 2: {pos2}")
print(f"Offset: {offset}")
print(f"\nPE dimensions match: {pe_pos1.shape} = {pe_pos2.shape} = {pe_offset.shape}")
print(f"\nKey property: Sinusoidal encoding allows model to learn")
print(f"relative positions through linear transformations.")

## Part 5: Transformer Block

### Complete Transformer Encoder Block

A complete transformer block combines:
1. **Multi-head self-attention**
2. **Add & Norm (residual connection + layer norm)**
3. **Feed-forward network**
4. **Add & Norm**

### Architecture

```
Input (batch_size, seq_len, d_model)
    |
    ├─→ Multi-Head Attention sublayer
    |       |
    |       ├─→ Attention
    |       ├─→ Add (residual)
    |       └─→ LayerNorm
    |
    ├─→ Feed-Forward sublayer
    |       |
    |       ├─→ Linear → ReLU → Linear
    |       ├─→ Add (residual)
    |       └─→ LayerNorm
    |
Output (batch_size, seq_len, d_model)
```

### Key Components

**Layer Normalization:**
$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

- Normalizes each token's feature vector across the $d_{model}$ dimension ($\mu$ and $\sigma^2$ are computed per position, independent of the batch)
- Stabilizes training in residual networks
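
To make the normalization axis concrete, the sketch below recomputes the formula by hand over the last dimension and compares it with `nn.LayerNorm`. It relies on a freshly constructed layer having $\gamma = 1$ and $\beta = 0$; the tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 8)   # (batch_size, seq_len, d_model)

layer_norm = nn.LayerNorm(8)   # gamma initialised to 1, beta to 0
eps = layer_norm.eps

# Statistics are taken over the feature dimension, separately for every token
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)   # biased (population) variance
manual = (x - mu) / torch.sqrt(var + eps)

matches = torch.allclose(layer_norm(x), manual, atol=1e-5)
print(f"Manual computation matches nn.LayerNorm: {matches}")
print(f"Per-token mean after normalization (max abs): {manual.mean(dim=-1).abs().max().item():.1e}")
```

Each of the 2 × 5 token vectors is normalized on its own, so the result does not depend on other tokens or other sequences in the batch.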

**Residual Connection:**
$$y = \text{Sublayer}(x) + x$$

- Helps gradient flow in deep networks
- The identity path lets each sublayer learn a small correction to its input instead of a full transformation

**Feed-Forward Network:**
$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

- Typical expansion: $d_{model} \to 4 \times d_{model} \to d_{model}$
- Applied independently to each position
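
"Applied independently to each position" means the same two-layer network runs on every token vector with no interaction between positions. A small check, using an `nn.Sequential` stand-in for the feed-forward network (an illustrative substitute, with arbitrary sizes):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32

# Stand-in FFN: Linear -> ReLU -> Linear, with the 4x expansion
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)   # (batch_size, seq_len, d_model)

full = ffn(x)   # whole sequence at once
per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)  # one position at a time

position_wise = torch.allclose(full, per_position, atol=1e-5)
print(f"Same result when applied position by position: {position_wise}")
```

Mixing information across positions is therefore left entirely to the attention sublayer.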

### Implementation: Transformer Block

In [None]:
class FeedForwardNetwork(nn.Module):
    """
    Position-wise Feed-Forward Network.
    
    Args:
        d_model: Input/output dimension
        d_ff: Hidden layer dimension (typically 4 × d_model)
    """
    
    def __init__(self, d_model, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
    
    def forward(self, x):
        return self.linear2(F.relu(self.linear1(x)))


class TransformerBlock(nn.Module):
    """
    Single Transformer encoder block.
    
    Args:
        d_model: Model dimension
        num_heads: Number of attention heads
        d_ff: Feed-forward hidden dimension
        dropout: Dropout rate
    """
    
    def __init__(self, d_model, num_heads, d_ff=2048, dropout=0.1):
        super().__init__()
        
        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, num_heads)
        
        # Feed-forward network
        self.ffn = FeedForwardNetwork(d_model, d_ff)
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: Input, shape (batch_size, seq_len, d_model)
            mask: Optional attention mask
        
        Returns:
            output: Output, shape (batch_size, seq_len, d_model)
        """
        # Multi-head attention with residual connection
        attn_output, _ = self.attention(x, mask)
        attn_output = self.dropout1(attn_output)
        x = self.norm1(x + attn_output)  # Add & Norm
        
        # Feed-forward with residual connection
        ffn_output = self.ffn(x)
        ffn_output = self.dropout2(ffn_output)
        x = self.norm2(x + ffn_output)  # Add & Norm
        
        return x


print("✓ Transformer block implemented")

### Test Transformer Block

In [None]:
# Create a transformer block
d_model = 128
num_heads = 8
d_ff = 512
seq_len = 10
batch_size = 4

transformer_block = TransformerBlock(d_model, num_heads, d_ff, dropout=0.1).to(device)

# Create input
x = torch.randn(batch_size, seq_len, d_model).to(device)

# Forward pass
output = transformer_block(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")

# Count parameters
params = sum(p.numel() for p in transformer_block.parameters())
print(f"\nTransformer block parameters: {params:,}")

# Breakdown
attn_params = sum(p.numel() for p in transformer_block.attention.parameters())
ffn_params = sum(p.numel() for p in transformer_block.ffn.parameters())
norm_params = sum(p.numel() for p in transformer_block.norm1.parameters()) + \
              sum(p.numel() for p in transformer_block.norm2.parameters())

print(f"  Multi-head attention: {attn_params:,}")
print(f"  Feed-forward: {ffn_params:,}")
print(f"  Layer norms: {norm_params:,}")

## Part 6: Complete Transformer Encoder

Stack multiple transformer blocks with embeddings and positional encodings.

In [None]:
class TransformerEncoder(nn.Module):
    """
    Complete Transformer encoder (stack of transformer blocks).
    
    Args:
        vocab_size: Size of vocabulary
        d_model: Model dimension
        num_heads: Number of attention heads
        num_layers: Number of transformer blocks
        d_ff: Feed-forward dimension
        max_seq_len: Maximum sequence length
        dropout: Dropout rate
    """
    
    def __init__(self, vocab_size, d_model, num_heads, num_layers, 
                 d_ff=2048, max_seq_len=1000, dropout=0.1):
        super().__init__()
        
        # Embeddings
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_len)
        
        # Stack of transformer blocks
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        
        # Output layer norm
        self.final_norm = nn.LayerNorm(d_model)
        
        self.d_model = d_model
    
    def forward(self, x, mask=None):
        """
        Args:
            x: Input token IDs, shape (batch_size, seq_len)
            mask: Optional attention mask
        
        Returns:
            output: Encoded representations, shape (batch_size, seq_len, d_model)
        """
        # Embedding
        x = self.embedding(x) * np.sqrt(self.d_model)  # Scale embeddings
        
        # Add positional encoding
        x = self.pos_encoding(x)
        
        # Pass through transformer blocks
        for block in self.transformer_blocks:
            x = block(x, mask)
        
        # Final normalization
        x = self.final_norm(x)
        
        return x


print("✓ Transformer encoder implemented")

### Test Complete Transformer Encoder

In [None]:
# Create transformer encoder
vocab_size = 10000
d_model = 256
num_heads = 8
num_layers = 3
d_ff = 1024
seq_len = 20
batch_size = 8

encoder = TransformerEncoder(
    vocab_size=vocab_size,
    d_model=d_model,
    num_heads=num_heads,
    num_layers=num_layers,
    d_ff=d_ff,
    dropout=0.1
).to(device)

# Create input (random token IDs)
x = torch.randint(1, vocab_size, (batch_size, seq_len)).to(device)

# Forward pass
output = encoder(x)

print(f"Input shape: {x.shape}")
print(f"Input (token IDs): {x[0]}")
print(f"\nOutput shape: {output.shape}")
print(f"Output (embeddings): {output[0, 0, :5]} ... (showing first 5 dims)")

# Count total parameters
total_params = sum(p.numel() for p in encoder.parameters())
print(f"\nTotal parameters: {total_params:,}")

# Breakdown by component
embedding_params = sum(p.numel() for p in encoder.embedding.parameters())
transformer_params = sum(p.numel() for p in 
                         [p for block in encoder.transformer_blocks 
                          for p in block.parameters()])
norm_params = sum(p.numel() for p in encoder.final_norm.parameters())

print(f"\nBreakdown:")
print(f"  Embeddings: {embedding_params:,}")
print(f"  Transformer blocks: {transformer_params:,}")
print(f"  Final norm: {norm_params:,}")

## Part 7: Comparison with RNNs

### Why Transformers Won

In [None]:
# Compare Transformer vs RNN properties
import pandas as pd

comparison = pd.DataFrame({
    'Property': [
        'Parallelization',
        'Long-range dependencies',
        'Gradient flow path length',
        'Interpretability',
        'Parameter efficiency',
        'Memory usage',
        'Training speed',
        'Inference on long seqs',
    ],
    'RNN/LSTM': [
        '❌ Sequential',
        '⚠️ Difficult (vanishing gradients)',
        '❌ O(seq_len)',
        '⚠️ Black box hidden states',
        '✅ Smaller models',
        '✅ O(hidden_dim) state at inference',
        '❌ Slow',
        '✅ O(1) per token (fixed-size state)',
    ],
    'Transformer': [
        '✅ Fully parallel',
        '✅ Direct attention',
        '✅ O(1)',
        '✅ Attention weights',
        '⚠️ Larger (for same capacity)',
        '⚠️ O(seq_len²) (quadratic)',
        '✅ Fast',
        '⚠️ O(seq_len) per token (with cached keys/values)',
    ]
})

print(comparison.to_string(index=False))

print("\n" + "="*80)
print("Key Insight: Transformers spend more memory and compute per layer to gain parallelism and direct long-range connections")
print("O(seq_len²) is manageable for moderate sequence lengths on modern hardware")
print("="*80)

### Computational Complexity Analysis
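
The cell below uses a simplified model that counts only sequence length. As a rough complement, per-layer costs are often approximated as $O(n^2 \cdot d)$ for self-attention and $O(n \cdot d^2)$ for a recurrent layer, where $d$ is the hidden dimension. The helper names and the value `d = 512` in this sketch are assumptions for illustration, and constant factors are ignored.

```python
def attention_cost(n, d):
    """Rough per-layer cost of self-attention: n x n scores over d features."""
    return n * n * d

def recurrent_cost(n, d):
    """Rough per-layer cost of a recurrent layer: n steps, each a d x d product."""
    return n * d * d

d = 512  # assumed hidden dimension
for n in (64, 512, 4096):
    ratio = attention_cost(n, d) / recurrent_cost(n, d)
    print(f"n={n:5d}, d={d}: attention / recurrent cost = {ratio:.3f}")
```

Under this approximation the ratio is simply $n / d$, so attention does less arithmetic than recurrence whenever the sequence is shorter than the hidden dimension, and more once it is longer.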

In [None]:
# Analyze computational complexity
import matplotlib.pyplot as plt

# Sequence lengths
seq_lengths = np.array([10, 50, 100, 200, 500, 1000])

# Simplified cost in arbitrary units (the hidden dimension is ignored)
rnn_complexity = seq_lengths  # O(n): one step per token
transformer_complexity = seq_lengths ** 2  # O(n²): one score per token pair

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Linear scale
ax = axes[0]
ax.plot(seq_lengths, rnn_complexity, 'o-', label='RNN O(n)', linewidth=2, markersize=8)
ax.plot(seq_lengths, transformer_complexity, 's-', label='Transformer O(n²)', linewidth=2, markersize=8)
ax.set_xlabel('Sequence Length', fontsize=12, fontweight='bold')
ax.set_ylabel('Computational Cost (units)', fontsize=12, fontweight='bold')
ax.set_title('Computational Complexity (Linear Scale)', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Log scale
ax = axes[1]
ax.loglog(seq_lengths, rnn_complexity, 'o-', label='RNN O(n)', linewidth=2, markersize=8)
ax.loglog(seq_lengths, transformer_complexity, 's-', label='Transformer O(n²)', linewidth=2, markersize=8)
ax.set_xlabel('Sequence Length', fontsize=12, fontweight='bold')
ax.set_ylabel('Computational Cost (units)', fontsize=12, fontweight='bold')
ax.set_title('Computational Complexity (Log Scale)', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Observations:")
print(f"- At seq_len=100: attention does {(100**2) / 100:.0f}x more work in this simplified model")
print(f"- At seq_len=1000: attention does {(1000**2) / 1000:.0f}x more work in this simplified model")
print(f"\nBut: the O(n) RNN steps must run one after another, while the attention work runs in parallel on a GPU, so Transformers usually train faster in practice.")

## Part 8: Key Takeaways

### What You've Learned

✅ **Self-Attention:** How to compute attention weights and aggregate information  
✅ **Multi-Head Attention:** Learning multiple types of relationships simultaneously  
✅ **Positional Encodings:** Injecting position information into embeddings  
✅ **Transformer Block:** Combining attention, FFN, and normalization  
✅ **Transformer Encoder:** Full encoder architecture  

### Why Transformers Matter

**Parallelization:** No sequential dependency across positions, so GPUs can process a whole sequence at once  
**Long-range:** Direct attention to any position  
**Interpretability:** Attention weights offer a partial view of what the model attends to  
**Scalability:** Works with very large datasets and models  

### Foundation for Days 18-30

**Day 18:** Text Classification with Transformers  
**Days 21-22:** Transformer decoder (for generation)  
**Days 24-25:** Building GPT from scratch  
**Day 26:** BERT (encoder-only)  

---

## Part 9: Exercises

### Basic
1. **Modify attention mask:** Create a causal mask for autoregressive attention
2. **Add dropout:** Add dropout to scaled dot-product attention
3. **Experiment with heads:** Compare 1-head vs 8-head vs 16-head attention

### Intermediate
1. **Implement relative position encoding:** Use Shaw et al. 2018 approach
2. **Add learnable positional embeddings:** Replace sinusoidal with learned PE
3. **Analyze attention:** What patterns do different heads learn?

### Advanced
1. **Implement efficient attention:** Linear-complexity attention (e.g., Linformer) or sparse patterns
2. **Add layer weight sharing:** Reuse weights across blocks (like ALBERT)
3. **Rotary positional embeddings (RoPE):** Modern alternative to sinusoidal
4. **Analyze gradient flow:** How deep can Transformers get without exploding gradients?

## Part 10: Summary

You've successfully implemented a complete Transformer encoder from scratch!

**Key Components:**
- Scaled dot-product attention
- Multi-head attention with parallel heads
- Sinusoidal positional encodings
- Transformer block with residual connections
- Complete encoder stack

**Next:** Tomorrow (Day 18) we'll use Transformers for text classification and explore fine-tuning with pretrained models.

**Further Reading:**
- "Attention is All You Need" (Vaswani et al., 2017) - Original paper
- "The Illustrated Transformer" (Jay Alammar) - Excellent visual guide
- PyTorch Transformer Tutorial - Official implementation guide