# Day 21: Transformer Encoder Architecture

**Goal:** Build a complete Transformer encoder from scratch, understanding all components.

**Time estimate:** 4-5 hours

**What you'll learn:**
- Layer normalization and why it's crucial
- Encoder layer architecture
- Feed-forward networks in Transformers
- Residual connections and their role
- Stacking encoder layers
- Complete encoder with embeddings

**Topics build on:**
- Day 17: Self-attention and multi-head attention
- Day 19-20: Language modeling and generation

---

## Setup

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import FancyBboxPatch

# Set seeds
torch.manual_seed(42)
np.random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
print(f"PyTorch: {torch.__version__}")

## Part 1: Layer Normalization

### Why Layer Normalization?

Batch normalization normalizes across the batch dimension:
$$\text{BatchNorm}(x) = \gamma \odot \frac{x - \mathbb{E}_{batch}[x]}{\sqrt{\text{Var}_{batch}[x] + \epsilon}} + \beta$$

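**Problems for sequence models like Transformers:**
- Statistics depend on the batch, so small batches give noisy estimates
- At inference time batch_size is often 1, so batch norm has to fall back on running averages collected during training (a train/test discrepancy)
- Variable sequence lengths and padding make per-batch statistics inconsistent across positions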

Layer normalization normalizes across the feature dimension:
$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mathbb{E}_{features}[x]}{\sqrt{\text{Var}_{features}[x] + \epsilon}} + \beta$$

**Advantages:**
✅ Independent of batch size
✅ Works with variable sequence lengths
✅ No train/test discrepancy
✅ Better for Transformers (empirically)
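
A small check makes the batch-independence point concrete. The sketch below uses PyTorch's built-in `nn.LayerNorm` and `nn.BatchNorm1d` (not the class defined next) and compares how one sample is normalized in two different batch contexts. The tensor sizes and the scale of the extra samples are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 8
sample = torch.randn(1, d_model)            # the sample we care about
others = 5.0 * torch.randn(3, d_model) + 2  # batch-mates with different statistics
batch = torch.cat([sample, others], dim=0)  # (4, d_model)

layer_norm = nn.LayerNorm(d_model)
batch_norm = nn.BatchNorm1d(d_model)  # training mode: uses current batch statistics

# LayerNorm: statistics come from the sample's own features
ln_alone = layer_norm(sample)[0]
ln_in_batch = layer_norm(batch)[0]

# BatchNorm: statistics come from whichever samples share the batch
bn_small = batch_norm(batch[:2])[0]
bn_large = batch_norm(batch)[0]

ln_diff = (ln_alone - ln_in_batch).abs().max().item()
bn_diff = (bn_small - bn_large).abs().max().item()
print(f"LayerNorm difference: {ln_diff:.6f}")
print(f"BatchNorm difference: {bn_diff:.6f}")
```

The layer-norm result for the sample should not change with its batch-mates, while the batch-norm result does.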

In [None]:
class LayerNormalization(nn.Module):
    """
    Layer normalization: normalize features independently of batch.
    
    Normalizes across feature dimension (last dimension).
    """
    
    def __init__(self, d_model, eps=1e-6):
        """
        Args:
            d_model: Dimension of features to normalize
            eps: Small value for numerical stability
        """
        super().__init__()
        self.d_model = d_model
        self.eps = eps
        
        # Learnable scale (gamma) and shift (beta)
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
    
    def forward(self, x):
        """
        Args:
            x: Tensor of shape (..., d_model)
        
        Returns:
            Normalized tensor same shape as input
        """
        # Compute mean and variance across feature dimension
        # For (..., d_model): compute mean/var along last dimension
        mean = x.mean(dim=-1, keepdim=True)  # (..., 1)
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # (..., 1)
        
        # Normalize
        x_norm = (x - mean) / torch.sqrt(var + self.eps)  # (..., d_model)
        
        # Scale and shift
        return self.gamma * x_norm + self.beta


print("✓ Layer normalization implemented")

### Test Layer Normalization

In [None]:
# Create test data
batch_size = 4
seq_len = 6
d_model = 8

x = torch.randn(batch_size, seq_len, d_model)
print(f"Input shape: {x.shape}")
print(f"Input mean (before norm): {x.mean(dim=-1)[0, 0].item():.4f}")
print(f"Input std (before norm): {x.std(dim=-1)[0, 0].item():.4f}")

# Apply layer norm
ln = LayerNormalization(d_model)
x_norm = ln(x)

print(f"\nAfter layer norm:")
print(f"Output shape: {x_norm.shape}")
print(f"Output mean (after norm): {x_norm.mean(dim=-1)[0, 0].item():.4f}")
print(f"Output std (after norm): {x_norm.std(dim=-1, unbiased=False)[0, 0].item():.4f}")  # biased std, matching the layer norm formula
print(f"\nNote: Mean ≈ 0, Std ≈ 1 ✓")

# Compare with PyTorch's LayerNorm (use the same eps; PyTorch defaults to 1e-5)
pytorch_ln = nn.LayerNorm(d_model, eps=1e-6)
pytorch_output = pytorch_ln(x)

print(f"\nComparison with PyTorch LayerNorm:")
max_diff = (x_norm - pytorch_output).abs().max().item()
print(f"Max difference: {max_diff:.6f}")
print(f"Outputs match: {max_diff < 1e-5} ✓")

## Part 2: Feed-Forward Network

### Position-wise Feed-Forward Network (FFN)

Applied to each position independently:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Or with GELU activation:
$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$

**Key properties:**
- Applied to each token independently
- Shared parameters across positions
- Expansion: $d_{model} \to d_{ff} \to d_{model}$ (typically $d_{ff} = 4 \times d_{model}$)
- Adds non-linearity and capacity
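
The "applied to each token independently" property can be checked directly. This is a minimal sketch using a plain `nn.Sequential` stand-in for the FFN (the sizes are illustrative assumptions): running the whole sequence at once gives the same result as running one position at a time, and shuffling the positions only shuffles the outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_ff, seq_len = 8, 32, 5
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, seq_len, d_model)

# Whole sequence at once
out_full = ffn(x)

# One position at a time, then stacked back together
out_per_pos = torch.stack([ffn(x[:, t, :]) for t in range(seq_len)], dim=1)

# Shuffling positions just shuffles the outputs: no mixing between tokens
perm = torch.randperm(seq_len)
out_shuffled = ffn(x[:, perm, :])

same_as_per_position = torch.allclose(out_full, out_per_pos, atol=1e-5)
permutation_equivariant = torch.allclose(out_shuffled, out_full[:, perm, :], atol=1e-5)
print(same_as_per_position, permutation_equivariant)
```

All mixing between tokens therefore has to come from the attention sublayer.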

In [None]:
class FeedForwardNetwork(nn.Module):
    """
    Position-wise feed-forward network.
    
    Two linear layers with activation in between.
    Applied independently to each position.
    """
    
    def __init__(self, d_model, d_ff=2048, activation='relu'):
        """
        Args:
            d_model: Input/output dimension
            d_ff: Hidden layer dimension (typically 4 * d_model)
            activation: 'relu' or 'gelu'
        """
        super().__init__()
        
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        
        if activation == 'relu':
            self.activation = F.relu
        elif activation == 'gelu':
            self.activation = F.gelu
        else:
            raise ValueError(f"Unknown activation: {activation}")
    
    def forward(self, x):
        """
        Args:
            x: Input tensor (..., d_model)
        
        Returns:
            Output tensor same shape as input
        """
        # First linear layer + activation: (..., d_model) -> (..., d_ff)
        x = self.linear1(x)
        x = self.activation(x)
        
        # Second linear layer: (..., d_ff) -> (..., d_model)
        x = self.linear2(x)
        
        return x


print("✓ Feed-forward network implemented")

### Test FFN

In [None]:
d_model = 64
d_ff = 256

ffn = FeedForwardNetwork(d_model, d_ff, activation='relu')

# Test data
x = torch.randn(batch_size, seq_len, d_model)

# Forward pass
output = ffn(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Shapes match: {output.shape == x.shape} ✓")

# Count parameters
params = sum(p.numel() for p in ffn.parameters())
print(f"\nFFN parameters: {params:,}")
print(f"  Linear1: {d_model} × {d_ff} + {d_ff} (bias) = {d_model * d_ff + d_ff:,}")
print(f"  Linear2: {d_ff} × {d_model} + {d_model} (bias) = {d_ff * d_model + d_model:,}")

## Part 3: Residual Connections

### Add & Norm Pattern

Residual connections (skip connections) are crucial:
$$y = \text{LayerNorm}(x + \text{Sublayer}(x))$$

Or pre-norm variant:
$$y = x + \text{Sublayer}(\text{LayerNorm}(x))$$

**Benefits:**
✅ Gradient flow: Can skip over deep layers
✅ Identity shortcut: If sublayer learns little, output ≈ input
✅ Training stability: Easier to train very deep networks
✅ Feature preservation: Input features propagate forward
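
The gradient-flow benefit is easy to observe on a toy stack. The sketch below is an illustration under assumed settings (30 small `tanh` layers with PyTorch's default initialization, not the encoder itself): it measures the gradient that reaches the input with and without skip connections.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

depth, dim = 30, 16
layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

def input_grad_norm(use_residual):
    """Gradient norm of the summed output with respect to the input."""
    x = torch.randn(4, dim, requires_grad=True)
    h = x
    for layer in layers:
        out = torch.tanh(layer(h))
        h = h + out if use_residual else out
    h.sum().backward()
    return x.grad.norm().item()

plain = input_grad_norm(use_residual=False)
residual = input_grad_norm(use_residual=True)
print(f"Without residuals: {plain:.2e}")
print(f"With residuals:    {residual:.2e}")
```

Without the shortcut the gradient shrinks at every layer. With it, the identity path carries the gradient straight back to the input.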

In [None]:
class ResidualConnection(nn.Module):
    """
    Residual connection with layer normalization.
    
    Implements: output = LayerNorm(input + Sublayer(input))
    """
    
    def __init__(self, d_model, sublayer, dropout=0.1):
        """
        Args:
            d_model: Dimension
            sublayer: The sublayer to apply (attention or FFN)
            dropout: Dropout rate
        """
        super().__init__()
        self.norm = LayerNormalization(d_model)
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        """
        Args:
            x: Input tensor (..., d_model)
        
        Returns:
            Output tensor same shape
        """
        # Apply sublayer, then dropout, add the residual, and normalize (post-norm)
        return self.norm(x + self.dropout(self.sublayer(x)))


print("✓ Residual connection implemented")

## Part 4: Multi-Head Self-Attention (Review)

Let's reimplement the multi-head attention for the encoder:

In [None]:
class MultiHeadAttention(nn.Module):
    """
    Multi-head self-attention.
    """
    
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('scale', torch.tensor(self.d_k ** 0.5))
    
    def forward(self, x, mask=None):
        """
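        Args:
            x: Input (batch, seq_len, d_model)
            mask: Optional mask broadcastable to (batch, num_heads, seq_len, seq_len);
                positions where mask == 0 are not attended to
        
        Returns:
            output: (batch, seq_len, d_model)
            attn_weights: (batch, num_heads, seq_len, seq_len)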
        """
        batch_size, seq_len, d_model = x.shape
        
        # Project to Q, K, V: (batch, seq_len, d_model)
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        
        # Reshape for multi-head: (batch, seq_len, num_heads, d_k) -> (batch, num_heads, seq_len, d_k)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale  # (batch, num_heads, seq_len, seq_len)
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        # Apply softmax and dropout
        attn_weights = torch.softmax(scores, dim=-1)
        # A row whose keys are all masked is all -inf and softmax returns NaN; zero it out
        attn_weights = torch.nan_to_num(attn_weights, nan=0.0)
        attn_weights = self.dropout(attn_weights)
        
        # Apply attention to values
        attended = torch.matmul(attn_weights, V)  # (batch, num_heads, seq_len, d_k)
        
        # Concatenate heads
        attended = attended.transpose(1, 2).contiguous()  # (batch, seq_len, num_heads, d_k)
        attended = attended.view(batch_size, seq_len, d_model)  # (batch, seq_len, d_model)
        
        # Final projection
        output = self.W_o(attended)
        
        return output, attn_weights


print("✓ Multi-head attention implemented")
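
The `mask` argument above follows the convention "0 means do not attend". A common use is hiding padding tokens. The sketch below builds such a mask from token IDs, assuming a padding ID of 0 (an assumption for this example), and applies it to stand-in scores rather than to the full attention module.

```python
import torch

torch.manual_seed(0)

pad_id = 0  # assumed padding token ID
tokens = torch.tensor([[1, 2, 3, 0, 0],
                       [4, 5, 0, 0, 0]])  # (batch=2, seq_len=5)

# True where a real token is present, False at padding.
# Shape (batch, 1, 1, seq_len) broadcasts over heads and query positions.
mask = (tokens != pad_id).unsqueeze(1).unsqueeze(2)

num_heads = 2
scores = torch.randn(2, num_heads, 5, 5)  # stand-in for Q @ K^T / sqrt(d_k)
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)

print(mask.shape)
print(weights[0, 0, 0])  # weights on the two padded keys are zero
```

The resulting `mask` can be passed as the second argument to the attention module's `forward`.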

## Part 5: Encoder Layer

### Complete Encoder Layer

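Combines (using the pre-norm ordering from Part 3):
1. Layer norm, then multi-head self-attention
2. Dropout and residual add
3. Layer norm, then feed-forward network
4. Dropout and residual add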

**Architecture:**
```
Input (batch, seq_len, d_model)
  ↓
Layer Norm
  ↓
Multi-Head Attention
  ↓
Dropout
  ↓
Add (residual) → [intermediate]
  ↓
Layer Norm
  ↓
Feed-Forward
  ↓
Dropout
  ↓
Add (residual) → Output
```

In [None]:
class EncoderLayer(nn.Module):
    """
    Single Transformer encoder layer.
    
    Combines:
    - Multi-head self-attention
    - Feed-forward network
    - Layer normalization and residual connections
    """
    
    def __init__(self, d_model, num_heads, d_ff=2048, dropout=0.1, activation='relu'):
        """
        Args:
            d_model: Model dimension
            num_heads: Number of attention heads
            d_ff: Feed-forward hidden dimension
            dropout: Dropout rate
            activation: Activation function ('relu' or 'gelu')
        """
        super().__init__()
        
        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        
        # Feed-forward network
        self.ffn = FeedForwardNetwork(d_model, d_ff, activation)
        
        # Layer normalization (pre-norm variant)
        self.norm1 = LayerNormalization(d_model)
        self.norm2 = LayerNormalization(d_model)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: Input tensor (batch, seq_len, d_model)
            mask: Optional attention mask
        
        Returns:
            output: Encoded representation (batch, seq_len, d_model)
            attn_weights: Attention weights for visualization
        """
        # Multi-head attention with residual
        x_norm = self.norm1(x)
        attn_output, attn_weights = self.attention(x_norm, mask)
        x = x + self.dropout(attn_output)
        
        # Feed-forward with residual
        x_norm = self.norm2(x)
        ffn_output = self.ffn(x_norm)
        x = x + self.dropout(ffn_output)
        
        return x, attn_weights


print("✓ Encoder layer implemented")

### Test Encoder Layer

In [None]:
# Configuration
d_model = 256
num_heads = 8
d_ff = 1024
seq_len = 20
batch_size = 4

# Create encoder layer
encoder_layer = EncoderLayer(d_model, num_heads, d_ff, dropout=0.1)

# Test data
x = torch.randn(batch_size, seq_len, d_model)

# Forward pass
output, attn_weights = encoder_layer(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print(f"\nShapes correct: {output.shape == x.shape} ✓")

# Count parameters
params = sum(p.numel() for p in encoder_layer.parameters())
print(f"\nEncoder layer parameters: {params:,}")

# Breakdown
attn_params = sum(p.numel() for p in encoder_layer.attention.parameters())
ffn_params = sum(p.numel() for p in encoder_layer.ffn.parameters())
norm_params = sum(p.numel() for p in encoder_layer.norm1.parameters()) + \
              sum(p.numel() for p in encoder_layer.norm2.parameters())

print(f"  Attention: {attn_params:,}")
print(f"  FFN: {ffn_params:,}")
print(f"  Normalization: {norm_params:,}")
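
The breakdown printed above can be cross-checked by hand. The helper below (a hypothetical name, written for this notebook's layer definition with biases on every linear layer) counts the parameters from the dimensions alone. Note that `num_heads` does not appear: splitting into heads reshapes the projections without adding weights.

```python
def encoder_layer_param_count(d_model, d_ff):
    """Parameter count for one encoder layer as defined in this notebook."""
    attention = 4 * (d_model * d_model + d_model)   # W_q, W_k, W_v, W_o with biases
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # linear1 + linear2
    norms = 2 * (2 * d_model)                       # gamma and beta for two layer norms
    return {"attention": attention, "ffn": ffn, "norms": norms,
            "total": attention + ffn + norms}

counts = encoder_layer_param_count(d_model=256, d_ff=1024)
for name, value in counts.items():
    print(f"{name:>9}: {value:,}")
```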

## Part 6: Positional Encodings (Review)

In [None]:
class PositionalEncoding(nn.Module):
    """
    Sinusoidal positional encodings.
    """
    
    def __init__(self, d_model, max_seq_len=1000):
        super().__init__()
        
        # Create position indices
        positions = torch.arange(max_seq_len).unsqueeze(1)  # (max_seq_len, 1)
        dimensions = torch.arange(d_model).unsqueeze(0)  # (1, d_model)
        
        # Compute angle rates
        angle_rates = 1 / (10000 ** (2 * (dimensions // 2) / d_model))
        angle_rads = positions * angle_rates  # (max_seq_len, d_model)
        
        # Apply sin to even indices, cos to odd
        angle_rads[:, 0::2] = torch.sin(angle_rads[:, 0::2])
        angle_rads[:, 1::2] = torch.cos(angle_rads[:, 1::2])
        
        # Register as buffer
        self.register_buffer('pe', angle_rads.unsqueeze(0))  # (1, max_seq_len, d_model)
    
    def forward(self, x):
        """
        Args:
            x: Input embeddings (batch, seq_len, d_model)
        
        Returns:
            x + positional encoding
        """
        seq_len = x.size(1)
        return x + self.pe[:, :seq_len, :]


print("✓ Positional encoding implemented")
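
To get a feel for the values this produces, the scalar sketch below evaluates the same formula with plain `math` (the function name is made up for this example). Position 0 gives alternating 0 and 1. Low dimensions change quickly with position and high dimensions change slowly.

```python
import math

def sinusoidal_pe(pos, i, d_model):
    """Encoding value at position `pos`, dimension `i` (same formula as the class above)."""
    angle = pos / (10000 ** (2 * (i // 2) / d_model))
    return math.sin(angle) if i % 2 == 0 else math.cos(angle)

d_model = 8
for pos in (0, 1, 50):
    row = [sinusoidal_pe(pos, i, d_model) for i in range(d_model)]
    print(pos, [f"{v:+.3f}" for v in row])
```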

## Part 7: Complete Transformer Encoder

### Stack Multiple Encoder Layers

In [None]:
class TransformerEncoder(nn.Module):
    """
    Complete Transformer encoder.
    
    Stacks multiple encoder layers with embeddings and positional encodings.
    """
    
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff=2048,
                 max_seq_len=1000, dropout=0.1, activation='relu'):
        """
        Args:
            vocab_size: Size of vocabulary
            d_model: Model dimension
            num_heads: Number of attention heads
            num_layers: Number of encoder layers
            d_ff: Feed-forward hidden dimension
            max_seq_len: Maximum sequence length
            dropout: Dropout rate
            activation: Activation function
        """
        super().__init__()
        
        self.d_model = d_model
        
        # Embeddings
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_len)
        
        # Stack of encoder layers
        self.encoder_layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout, activation)
            for _ in range(num_layers)
        ])
        
        # Final layer norm
        self.final_norm = LayerNormalization(d_model)
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: Input token IDs (batch, seq_len)
            mask: Optional attention mask
        
        Returns:
            output: Encoded representations (batch, seq_len, d_model)
            all_attn_weights: Attention weights from all layers
        """
        # Embedding and position encoding
        x = self.embedding(x) * (self.d_model ** 0.5)  # Scale embeddings by sqrt(d_model)
        x = self.pos_encoding(x)
        x = self.dropout(x)
        
        # Pass through encoder layers
        all_attn_weights = []
        for layer in self.encoder_layers:
            x, attn_weights = layer(x, mask)
            all_attn_weights.append(attn_weights)
        
        # Final normalization
        x = self.final_norm(x)
        
        return x, all_attn_weights


print("✓ Transformer encoder implemented")

### Test Complete Encoder

In [None]:
# Configuration
vocab_size = 5000
d_model = 256
num_heads = 8
num_layers = 3
d_ff = 1024
seq_len = 20
batch_size = 8

# Create encoder
encoder = TransformerEncoder(
    vocab_size=vocab_size,
    d_model=d_model,
    num_heads=num_heads,
    num_layers=num_layers,
    d_ff=d_ff,
    dropout=0.1
)

print(f"Encoder architecture:")
print(encoder)

# Count parameters
total_params = sum(p.numel() for p in encoder.parameters())
print(f"\nTotal parameters: {total_params:,}")

In [None]:
# Test data
x = torch.randint(1, vocab_size, (batch_size, seq_len))

print(f"Input shape: {x.shape}")
print(f"Sample input (token IDs): {x[0]}")

# Forward pass
output, attn_weights = encoder(x)

print(f"\nOutput shape: {output.shape}")
print(f"Number of layers: {len(attn_weights)}")
print(f"Attention weights per layer shape: {attn_weights[0].shape}")
print(f"  (batch={batch_size}, num_heads={num_heads}, seq_len={seq_len}, seq_len={seq_len})")

# Verify shapes
print(f"\n✓ Output shape correct: {output.shape == (batch_size, seq_len, d_model)}")

## Part 8: Visualization and Analysis

### Analyze Attention Patterns

In [None]:
# Create simple example for visualization
# Simple sentence: "The cat sat on the mat"
vocab = {
    '<pad>': 0, 'the': 1, 'cat': 2, 'sat': 3,
    'on': 4, 'mat': 5
}
tokens = torch.tensor([[1, 2, 3, 4, 1, 5]], dtype=torch.long)  # "The cat sat on the mat"
word_labels = ['The', 'cat', 'sat', 'on', 'the', 'mat']

# Create small encoder for visualization
small_encoder = TransformerEncoder(
    vocab_size=6,
    d_model=64,
    num_heads=4,
    num_layers=2,
    d_ff=256,
    dropout=0.1
)

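# Get attention weights (eval mode turns off dropout so the weights are not randomly zeroed)
small_encoder.eval()
with torch.no_grad():
    output, attn_weights_list = small_encoder(tokens)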

print(f"Encoder output shape: {output.shape}")
print(f"Number of layers: {len(attn_weights_list)}")

# Visualize attention from first layer, first head
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Layer 1
layer_1_attn = attn_weights_list[0][0].detach()  # (num_heads, seq_len, seq_len)
head_0_attn = layer_1_attn[0].cpu().numpy()  # First head

ax = axes[0]
im = ax.imshow(head_0_attn, cmap='viridis', aspect='auto')
ax.set_xticks(range(len(word_labels)))
ax.set_yticks(range(len(word_labels)))
ax.set_xticklabels(word_labels, rotation=45, ha='right')
ax.set_yticklabels(word_labels)
ax.set_title('Layer 1, Head 1 Attention\n(What each word attends to)', fontweight='bold')
ax.set_xlabel('Key (attend to)', fontweight='bold')
ax.set_ylabel('Query (attend from)', fontweight='bold')
plt.colorbar(im, ax=ax, label='Attention Weight')

# Layer 2
layer_2_attn = attn_weights_list[1][0].detach()
head_0_attn = layer_2_attn[0].cpu().numpy()

ax = axes[1]
im = ax.imshow(head_0_attn, cmap='viridis', aspect='auto')
ax.set_xticks(range(len(word_labels)))
ax.set_yticks(range(len(word_labels)))
ax.set_xticklabels(word_labels, rotation=45, ha='right')
ax.set_yticklabels(word_labels)
ax.set_title('Layer 2, Head 1 Attention\n(Refined attention after first layer)', fontweight='bold')
ax.set_xlabel('Key (attend to)', fontweight='bold')
ax.set_ylabel('Query (attend from)', fontweight='bold')
plt.colorbar(im, ax=ax, label='Attention Weight')

plt.tight_layout()
plt.show()

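print("\nObservations:")
print("- This encoder is untrained, so the patterns reflect random initialization, not language")
print("- Any diagonal or word-specific structure in these plots is coincidental")
print("- After training, heads often show interpretable patterns (e.g. a verb attending to its subject)")
print("- Re-run this cell on a trained model to compare how the two layers differ")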

### Visualize All Heads in a Layer

In [None]:
# Visualize all 4 heads from layer 1
layer_1_attn = attn_weights_list[0][0].detach().cpu().numpy()  # (4, 6, 6)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for head_idx in range(layer_1_attn.shape[0]):  # 4 heads in small_encoder; the global num_heads is still 8
    ax = axes[head_idx]
    head_attn = layer_1_attn[head_idx]
    
    im = ax.imshow(head_attn, cmap='Blues', aspect='auto')
    ax.set_xticks(range(len(word_labels)))
    ax.set_yticks(range(len(word_labels)))
    ax.set_xticklabels(word_labels, rotation=45, ha='right', fontsize=9)
    ax.set_yticklabels(word_labels, fontsize=9)
    ax.set_title(f'Head {head_idx + 1}', fontweight='bold')
    
    if head_idx % 2 == 0:
        ax.set_ylabel('Query', fontweight='bold')
    if head_idx >= 2:
        ax.set_xlabel('Key', fontweight='bold')
    
    plt.colorbar(im, ax=ax, fraction=0.046)

plt.suptitle('Layer 1: All Attention Heads', fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

print("\nKey insight:")
print("Each head has its own projections, so after training heads can specialize:")
print("- Some heads might focus on adjacent words")
print("- Others might focus on syntactic roles")
print("- Multi-head attention = ensemble of different perspectives")

## Part 9: Architectural Variants

### Pre-Norm vs Post-Norm

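**Post-Norm (original paper; the `ResidualConnection` class in Part 3):**
$$y = \text{LayerNorm}(x + \text{Sublayer}(x))$$

**Pre-Norm (what `EncoderLayer` in Part 5 implements):**
$$y = x + \text{Sublayer}(\text{LayerNorm}(x))$$

Pre-norm is increasingly popular: the residual path stays an identity from input to output, which improves gradient flow and tends to make deep stacks more stable to train.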
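
The two formulas differ in what they hand to the next layer. In this minimal sketch a single `nn.Linear` stands in for the sublayer and the input is deliberately large-scale (both are assumptions for illustration). The post-norm output is normalized per token, while the pre-norm output keeps the scale of the residual stream, which is one reason a pre-norm stack usually ends with a final layer norm.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 16
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or the FFN

def post_norm(x):
    return norm(x + sublayer(x))

def pre_norm(x):
    return x + sublayer(norm(x))

x = 10.0 * torch.randn(2, 5, d_model)  # deliberately large-scale input

# Average per-token standard deviation of each variant's output
post_std = post_norm(x).std(dim=-1, unbiased=False).mean().item()
pre_std = pre_norm(x).std(dim=-1, unbiased=False).mean().item()

print(f"post-norm per-token std: {post_std:.3f}")
print(f"pre-norm  per-token std: {pre_std:.3f}")
```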

In [None]:
class EncoderLayerPreNorm(nn.Module):
    """
    Encoder layer with pre-normalization (more stable).
    
    Apply normalization before sublayers instead of after.
    Computes the same function as EncoderLayer from Part 5, written more compactly.
    """
    
    def __init__(self, d_model, num_heads, d_ff=2048, dropout=0.1, activation='relu'):
        super().__init__()
        
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.ffn = FeedForwardNetwork(d_model, d_ff, activation)
        
        # Layer norms come BEFORE sublayers (pre-norm)
        self.norm1 = LayerNormalization(d_model)
        self.norm2 = LayerNormalization(d_model)
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Apply norm FIRST, then sublayer
        attn_output, attn_weights = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_output)
        
        ffn_output = self.ffn(self.norm2(x))
        x = x + self.dropout(ffn_output)
        
        return x, attn_weights


print("✓ Pre-norm encoder layer implemented")
print("\nComparison:")
print("Post-Norm: Original paper; can work well but is more sensitive to warmup and depth")
print("Pre-Norm:  Better gradient flow, more stable training")

## Part 10: Key Concepts Summary

### Layer Normalization
- Normalizes features **across feature dimension**
- Independent of batch size (better for Transformers)
- Learnable scale and shift parameters

### Feed-Forward Network
- Position-wise: Applied independently to each token
- Two-layer MLP: $d_{model} \to d_{ff} \to d_{model}$
- Typical expansion: $d_{ff} = 4 \times d_{model}$
- Adds non-linearity and capacity

### Residual Connections
- $y = x + \text{Sublayer}(x)$ (skip connection)
- Crucial for training deep networks
- Preserves input features
- Improves gradient flow

### Encoder Layer Architecture
- Multi-head attention
- Residual connection
- Feed-forward network
- Residual connection
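- Total: ~12 × d_model² parameters when $d_{ff} = 4 \times d_{model}$ (about 4 × d_model² for attention and 8 × d_model² for the FFN), independent of num_heads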

### Multi-Head Attention
- $h$ parallel attention mechanisms
- Each head: dimension $d_{model} / h$
- Concatenate outputs
- Learn different relationship types
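
The split into heads and the later concatenation are pure reshapes. The sketch below uses small, arbitrary sizes to show that head 0 receives the first $d_{model} / h$ features of every token, and that merging the heads recovers the original tensor.

```python
import torch

batch, seq_len, d_model, h = 2, 5, 8, 2
d_k = d_model // h

x = torch.arange(batch * seq_len * d_model, dtype=torch.float32).view(batch, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
heads = x.view(batch, seq_len, h, d_k).transpose(1, 2)

# Head 0 sees the first d_k features of every token, head 1 the next d_k
first_head_matches = torch.equal(heads[:, 0], x[..., :d_k])

# Merge: (batch, h, seq_len, d_k) -> (batch, seq_len, d_model)
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
round_trip_ok = torch.equal(merged, x)
print(heads.shape, first_head_matches, round_trip_ok)
```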

### Information Flow in Encoder
1. **Embedding & Position Encoding:** Add position information
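2. **Layer 1:** Mixes raw token embeddings; tends to capture local and surface patterns
3. **Layer 2:** Builds on already-contextualized tokens; higher-order patterns
4. **Layer N:** Increasingly abstract, semantic information

Every layer can attend to the whole sequence; depth adds more rounds of mixing, not a wider window.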

## Part 11: Practical Considerations

### Computational Complexity

For a single encoder layer:
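- Multi-head attention: **O(seq_len² × d_model + seq_len × d_model²)** (score and value products, plus the Q, K, V and output projections)
- Feed-forward: **O(seq_len × d_ff × d_model)**
- Total: **O(seq_len² × d_model + seq_len × d_model² + seq_len × d_ff × d_model)**

With $d_{ff} = 4 \times d_{model}$, the feed-forward term dominates for short sequences; the **O(seq_len² × d_model)** attention term takes over once seq_len exceeds a few multiples of d_model.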
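
To plug in your own sizes, a rough multiply-accumulate count per term is enough. The helper below is a back-of-the-envelope sketch (hypothetical name; it ignores softmax, layer norm, biases and dropout) and reports which term is largest for a few sequence lengths.

```python
def encoder_layer_macs(seq_len, d_model, d_ff):
    """Approximate multiply-accumulates per sequence for one encoder layer."""
    projections = 4 * seq_len * d_model * d_model      # W_q, W_k, W_v, W_o
    attn_scores = 2 * seq_len * seq_len * d_model      # Q @ K^T and weights @ V
    ffn = 2 * seq_len * d_model * d_ff                 # two linear layers
    return {"projections": projections, "attn_scores": attn_scores, "ffn": ffn}

d_model = 256
largest = {}
for seq_len in (20, 512, 4096):
    macs = encoder_layer_macs(seq_len, d_model, d_ff=4 * d_model)
    largest[seq_len] = max(macs, key=macs.get)
    summary = ", ".join(f"{name}={value:.2e}" for name, value in macs.items())
    print(f"seq_len={seq_len:>5}: {summary} -> largest: {largest[seq_len]}")
```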

### Memory Complexity

- Attention weights: O(batch × num_heads × seq_len²) per layer, which can be the bottleneck for long sequences
- Activation checkpointing: Trade computation for memory
- Sparse attention: Only attend to subset of positions
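
The quadratic growth is easy to quantify. This sketch estimates the size of one layer's attention-weight tensor, assuming float32 storage and a batch of 8 with 8 heads (illustrative values; the function name is made up for this example).

```python
def attention_weights_bytes(batch_size, num_heads, seq_len, bytes_per_value=4):
    """Memory for one layer's attention weight tensor (float32 by default)."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

for seq_len in (512, 2048, 8192):
    mib = attention_weights_bytes(batch_size=8, num_heads=8, seq_len=seq_len) / 2**20
    print(f"seq_len={seq_len:>5}: {mib:,.0f} MiB per layer")
```

Each 4× increase in sequence length costs 16× the memory, before counting the other activations kept for backpropagation.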

### Training Tips

1. **Gradient clipping:** Important for stability
2. **Warmup:** Linear warmup of learning rate helps
3. **Weight initialization:** Xavier/He initialization standard
4. **Dropout:** 0.1-0.2 typical
5. **Learning rate:** Usually 1e-4 to 1e-3
6. **Label smoothing:** 0.1 helps with generalization
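
Several of these tips fit into a few lines of a training loop. The sketch below is illustrative only: the model is a placeholder `nn.Linear`, the data is random, and the hyperparameters are example values. It combines linear warmup via `LambdaLR`, gradient clipping via `clip_grad_norm_`, and label smoothing via the `label_smoothing` argument of `nn.CrossEntropyLoss` (available in PyTorch 1.10 and later).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(16, 4)  # placeholder model; swap in the encoder plus a task head
base_lr, warmup_steps = 1e-3, 10
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

# Linear warmup: the LR multiplier ramps from 1/warmup_steps to 1, then stays at 1
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

lrs = []
for step in range(15):
    inputs = torch.randn(8, 16)           # random stand-in data
    targets = torch.randint(0, 4, (8,))

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    # Rescale gradients so their global norm is at most 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()
    scheduler.step()

print([f"{lr:.1e}" for lr in lrs])
```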

### Common Issues

- **Unstable training:** Check gradient clipping, learning rate
- **Poor convergence:** Try warmup, adjust learning rate schedule
- **Overfitting:** Increase dropout, add regularization
- **Memory issues:** Use gradient checkpointing, reduce sequence length

## Part 12: Exercises

### Basic
1. **Modify hidden dimension:** Try d_ff = 2×d_model vs 8×d_model. How does it affect performance?
2. **Change activation:** Replace ReLU with GELU. What's the difference?
3. **Analyze layer outputs:** Print norms of outputs at each layer.

### Intermediate
1. **Implement the post-norm variant:** Compare its training stability with the pre-norm `EncoderLayer`
2. **Add layer-wise visualization:** Track how representations change across layers
3. **Gradient analysis:** Check gradients at each layer during backprop
4. **Residual importance:** Remove residual connections, see what breaks

### Advanced
1. **Efficient attention:** Implement sparse or linear attention
2. **Relative position bias:** Replace absolute positions with relative
3. **Adaptive computation:** Skip layers for easy examples
4. **Layer sharing:** Share weights across encoder layers (ALBERT style)
5. **Profile computation:** Measure FLOPs and memory usage by component

## Part 13: Coming Up

### What We've Built

✅ Complete Transformer encoder from scratch
✅ All component implementations (attention, FFN, norm)
✅ Understanding of architectural design choices
✅ Visualization and analysis tools

### Day 22: Transformer Decoder
- Masked self-attention (causal attention)
- Cross-attention (encoder-decoder attention)
- Sequence generation
- Complete Seq2Seq Transformer

### Days 23-30: Complete Transformer-Based Models
- Day 23: Tokenization strategies (BPE, WordPiece)
- Day 24-25: GPT architecture and training
- Day 26: BERT and bidirectional models
- Day 27-30: minGPT implementation and training

### Why This Matters

Understanding encoders is crucial because:
- **BERT and RoBERTa:** Encoder-only models
- **GPT and Llama:** Decoder-only, but built from the same blocks (self-attention, FFN, residuals, normalization) with a causal mask
- **T5 and mBART:** Full encoder-decoder
- **Vision Transformers:** Use encoder architecture
- **Multimodal models:** Often have encoder-decoder structure

The encoder block is a **fundamental building block** of modern Transformer models!