# Mini Transformer Tutorial

## Introduction

This tutorial walks through the Mini Transformer implementation, a simplified version of the Transformer architecture designed for educational purposes. The Mini Transformer demonstrates the core ideas behind self-attention without the complexity of full-scale models.

### What You'll Learn
- The fundamental components of a Transformer model
- How multi-head attention works
- Implementation of Transformer layers
- Model configuration and initialization
- Forward pass through the network
- Practical usage and testing

In [None]:
# Import required libraries
import torch
import torch.nn as nn
import sys
from pathlib import Path
import time

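# Add project root to path
# (Path('.').parent is still '.', so resolve to an absolute path first to get the real parent directory)
sys.path.append(str(Path('.').resolve().parent))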

# Import our Mini Transformer implementation
from src.model.mini_transformer import MiniTransformer, MiniTransformerConfig

# Set random seed for reproducibility
torch.manual_seed(42)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name()}")

## 1. Understanding the Transformer Architecture

The Transformer architecture, introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017), reshaped natural language processing by replacing recurrent layers with self-attention mechanisms. The key components include:

1. **Multi-Head Attention**: Allows the model to focus on different parts of the input simultaneously
2. **Feed-Forward Networks**: Position-wise fully connected layers
3. **Layer Normalization**: Stabilizes training
4. **Residual Connections**: Enable training of deep networks

Our Mini Transformer simplifies these concepts while maintaining the core functionality.
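
To make the four components listed above concrete, here is a minimal sketch of a single Transformer block built from standard PyTorch modules. It is not the repository's implementation: the class name `SketchBlock`, the post-norm ordering, and the GELU activation are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class SketchBlock(nn.Module):
    """Illustrative post-norm Transformer block (hypothetical, not the repository's code)."""

    def __init__(self, hidden_size=128, num_heads=4, intermediate_size=256, dropout=0.1):
        super().__init__()
        # 1. Multi-head self-attention (PyTorch's built-in module)
        self.attention = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True
        )
        # 2. Position-wise feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size),
        )
        # 3. Layer normalization, applied after each sub-layer
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention: query, key, and value are all the same tensor
        attn_out, _ = self.attention(x, x, x, need_weights=False)
        # 4. Residual connections: add the sub-layer input back before normalizing
        x = self.norm1(x + self.dropout(attn_out))
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

block = SketchBlock().eval()
x = torch.randn(2, 16, 128)  # (batch, sequence, hidden)
with torch.no_grad():
    y = block(x)
print(y.shape)  # torch.Size([2, 16, 128])
```

The block preserves the input shape, which is what allows several such layers to be stacked.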

## 2. Model Configuration

The `MiniTransformerConfig` class defines all the hyperparameters for our model. Let's examine and create a configuration:

In [None]:
# Create a model configuration
config = MiniTransformerConfig(
    vocab_size=1000,          # Size of vocabulary
    hidden_size=128,          # Dimension of hidden layers
    num_attention_heads=4,    # Number of attention heads
    num_hidden_layers=4,      # Number of Transformer layers
    intermediate_size=256,    # Size of feed-forward layers
    max_position_embeddings=64, # Maximum sequence length
    dropout_prob=0.1,         # Dropout probability
    use_cuda=torch.cuda.is_available(), # Enable CUDA optimizations
    use_cudnn=True            # Enable cuDNN optimizations
)

print("Mini Transformer Configuration:")
print(f"  Vocabulary size: {config.vocab_size}")
print(f"  Hidden size: {config.hidden_size}")
print(f"  Attention heads: {config.num_attention_heads}")
print(f"  Hidden layers: {config.num_hidden_layers}")
print(f"  Intermediate size: {config.intermediate_size}")
print(f"  Max position embeddings: {config.max_position_embeddings}")
print(f"  Dropout probability: {config.dropout_prob}")
print(f"  Head dimension: {config.head_dim}")
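
The last line of the cell above reads `config.head_dim`, which is not passed to the constructor. A plausible reading is that it is derived as `hidden_size // num_attention_heads`. The dataclass below is a hypothetical stand-in (named `SketchConfig`) showing how such a derived field and a divisibility check could look; the real `MiniTransformerConfig` may differ.

```python
from dataclasses import dataclass

@dataclass
class SketchConfig:
    """Hypothetical stand-in for MiniTransformerConfig; fields mirror the cell above."""
    vocab_size: int = 1000
    hidden_size: int = 128
    num_attention_heads: int = 4
    num_hidden_layers: int = 4
    intermediate_size: int = 256
    max_position_embeddings: int = 64
    dropout_prob: float = 0.1

    def __post_init__(self):
        # Each head receives an equal slice of the hidden dimension
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")

    @property
    def head_dim(self) -> int:
        # Assumed derivation: per-head dimension
        return self.hidden_size // self.num_attention_heads

cfg = SketchConfig()
print(cfg.head_dim)  # 32
```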

## 3. Creating the Model

Now let's create an instance of our Mini Transformer using the configuration:

In [None]:
# Create the model
model = MiniTransformer(config)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Model created successfully")
print(f"Device: {device}")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Show model structure
print(f"\nModel structure:")
print(f"  Embedding layers: token_embeddings, position_embeddings")
print(f"  Transformer layers: {len(model.layers)} layers")
print(f"  Final layer norm and LM head")

## 4. Understanding Model Components

Let's examine the individual components of our Mini Transformer:

In [None]:
# Examine embedding layers
print("Embedding Layers:")
print(f"  Token embeddings shape: {model.token_embeddings.weight.shape}")
print(f"  Position embeddings shape: {model.position_embeddings.weight.shape}")

# Examine first Transformer layer
first_layer = model.layers[0]
print(f"\nFirst Transformer Layer:")
print(f"  Attention mechanism: {type(first_layer.attention).__name__}")
print(f"  Layer norm 1: {type(first_layer.layer_norm1).__name__}")
print(f"  Feed-forward network: {type(first_layer.ffn).__name__}")
print(f"  Layer norm 2: {type(first_layer.layer_norm2).__name__}")

# Examine attention mechanism in detail
attention = first_layer.attention
print(f"\nAttention Mechanism:")
print(f"  Hidden size: {attention.hidden_size}")
print(f"  Number of heads: {attention.num_heads}")
print(f"  Head dimension: {attention.head_dim}")
print(f"  Q projection: {attention.q_proj}")
print(f"  K projection: {attention.k_proj}")
print(f"  V projection: {attention.v_proj}")
print(f"  Output projection: {attention.o_proj}")
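
The Q, K, V, and output projections printed above are the building blocks of scaled dot-product attention. The sketch below shows one common way they are combined. It uses standalone `nn.Linear` layers with the same names rather than the repository's attention class, and it applies no attention mask; both are simplifying assumptions.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, hidden_size, num_heads = 2, 16, 128, 4
head_dim = hidden_size // num_heads  # 32

# Standalone projections (illustrative; same names as the attributes printed above)
q_proj = nn.Linear(hidden_size, hidden_size)
k_proj = nn.Linear(hidden_size, hidden_size)
v_proj = nn.Linear(hidden_size, hidden_size)
o_proj = nn.Linear(hidden_size, hidden_size)

def split_heads(t):
    # (batch, seq, hidden) -> (batch, heads, seq, head_dim)
    return t.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

x = torch.randn(batch, seq_len, hidden_size)
q, k, v = split_heads(q_proj(x)), split_heads(k_proj(x)), split_heads(v_proj(x))

# Scaled dot-product attention, computed independently per head
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)  # (batch, heads, seq, seq)
weights = torch.softmax(scores, dim=-1)                 # each row sums to 1
context = weights @ v                                   # (batch, heads, seq, head_dim)

# Merge heads back into the hidden dimension and apply the output projection
merged = context.transpose(1, 2).reshape(batch, seq_len, hidden_size)
out = o_proj(merged)
print(weights.shape, out.shape)
```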

## 5. Forward Pass Through the Model

Let's perform a forward pass through our Mini Transformer with sample data:

In [None]:
# Create sample input data
batch_size = 4
sequence_length = 16

# Generate random token IDs
input_ids = torch.randint(0, config.vocab_size, (batch_size, sequence_length)).to(device)
print(f"Input IDs shape: {input_ids.shape}")
print(f"Sample input IDs: {input_ids[0][:10]}...")

# Perform forward pass (the first call includes one-time warm-up cost, so this timing is only indicative)
start_time = time.time()
outputs = model(input_ids)
forward_time = time.time() - start_time

print(f"\nForward pass completed in {forward_time:.4f} seconds")
print(f"Logits shape: {outputs['logits'].shape}")
print(f"Hidden states shape: {outputs['hidden_states'].shape}")

# Examine logits
logits = outputs['logits']
print(f"\nLogits statistics:")
print(f"  Mean: {logits.mean().item():.4f}")
print(f"  Std: {logits.std().item():.4f}")
print(f"  Min: {logits.min().item():.4f}")
print(f"  Max: {logits.max().item():.4f}")

## 6. Training with Loss Computation

Let's see how the model computes loss for training:

In [None]:
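# Create labels for loss computation.
# Random labels are used here for demonstration; for real language modeling,
# the labels would be the input tokens shifted by one position.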
labels = torch.randint(0, config.vocab_size, (batch_size, sequence_length)).to(device)
print(f"Labels shape: {labels.shape}")

# Forward pass with labels to compute loss
start_time = time.time()
outputs_with_loss = model(input_ids, labels=labels)
loss_time = time.time() - start_time

print(f"\nForward pass with loss computation in {loss_time:.4f} seconds")
print(f"Logits shape: {outputs_with_loss['logits'].shape}")
print(f"Loss: {outputs_with_loss['loss'].item():.4f}")

# Examine loss computation details
logits_flat = outputs_with_loss['logits'].view(-1, config.vocab_size)
labels_flat = labels.view(-1)
print(f"\nFlattened logits shape: {logits_flat.shape}")
print(f"Flattened labels shape: {labels_flat.shape}")
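
For reference, a language-modeling loss is typically a cross-entropy over flattened logits and next-token targets. The sketch below builds shifted targets by hand and uses all-zero stand-in logits instead of a real model. Whether `MiniTransformer` shifts labels internally is not shown in this tutorial, so treat the shifting step as an assumption.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 1000
tokens = torch.randint(0, vocab_size, (4, 16))

# Next-token prediction: the output at position t is scored against token t+1
inputs = tokens[:, :-1]   # (4, 15)
targets = tokens[:, 1:]   # (4, 15)

# Stand-in logits: all zeros means a uniform distribution over the vocabulary.
# A real model would compute these from `inputs`.
logits = torch.zeros(inputs.shape[0], inputs.shape[1], vocab_size)

# Flatten to (batch * seq, vocab) and (batch * seq,) as in the cell above
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(f"loss = {loss.item():.4f}, ln(vocab_size) = {math.log(vocab_size):.4f}")
```

A uniform prediction gives a loss of ln(vocab_size), about 6.91 for 1000 tokens. An untrained model should report a value in that neighborhood, which makes it a useful sanity check.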

## 7. Performance Analysis

Let's analyze the performance characteristics of our Mini Transformer:

In [None]:
def benchmark_model(model, input_ids, iterations=100):
    """Benchmark forward-pass latency and token throughput"""
    def sync():
        # CUDA kernels run asynchronously; wait for them before reading the clock
        if input_ids.is_cuda:
            torch.cuda.synchronize()

    # Gradients are not needed for a forward-only benchmark
    with torch.no_grad():
        # Warmup
        for _ in range(10):
            _ = model(input_ids)
        sync()

        # Benchmark
        start_time = time.perf_counter()
        for _ in range(iterations):
            _ = model(input_ids)
        sync()
        end_time = time.perf_counter()

    avg_time = (end_time - start_time) / iterations
    # Each forward pass processes batch_size * sequence_length tokens
    tokens_per_second = (input_ids.shape[0] * input_ids.shape[1]) / avg_time

    return avg_time, tokens_per_second

# Benchmark the model
avg_time, tokens_per_sec = benchmark_model(model, input_ids)

print(f"Performance Analysis:")
print(f"  Average forward pass time: {avg_time*1000:.2f} ms")
print(f"  Tokens per second: {tokens_per_sec:.2f}")
print(f"  Parameters: {total_params:,}")
print(f"  Parameters per sequence element: {total_params/(batch_size*sequence_length):.2f}")

## 8. Gradient Computation and Backward Pass

Let's examine how gradients are computed during training:

In [None]:
# Switch to training mode (enables dropout); gradients are tracked by default outside torch.no_grad()
model.train()

# Forward pass with loss
outputs = model(input_ids, labels=labels)
loss = outputs['loss']

print(f"Loss before backward pass: {loss.item():.4f}")
print(f"Loss requires gradient: {loss.requires_grad}")

# Check gradients before backward pass
first_layer_weights = model.layers[0].attention.q_proj.weight
print(f"\nFirst layer Q-projection weights require gradient: {first_layer_weights.requires_grad}")
print(f"First layer Q-projection gradients present: {first_layer_weights.grad is not None}")

# Backward pass
start_time = time.time()
loss.backward()
backward_time = time.time() - start_time

print(f"\nBackward pass completed in {backward_time:.4f} seconds")
print(f"First layer Q-projection gradients present: {first_layer_weights.grad is not None}")
print(f"First layer Q-projection gradient norm: {first_layer_weights.grad.norm().item():.4f}")
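
The cell above stops after `loss.backward()`. A complete training step would also update the weights. The sketch below shows the usual sequence on a tiny stand-in model (`toy_model`, a single linear layer). The AdamW optimizer, the learning rate, and the clipping threshold are illustrative assumptions, not settings taken from this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_model = nn.Linear(8, 4)  # stand-in for the Mini Transformer
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-3)

x = torch.randn(16, 8)
y = torch.randint(0, 4, (16,))
before = toy_model.weight.detach().clone()

optimizer.zero_grad()                       # clear gradients from any previous step
loss = F.cross_entropy(toy_model(x), y)     # forward pass and loss
loss.backward()                             # populate .grad on every parameter
# Rescale gradients if their total norm exceeds max_norm; returns the norm before clipping
grad_norm = torch.nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm=1.0)
optimizer.step()                            # apply the update

changed = not torch.equal(before, toy_model.weight.detach())
print(f"grad norm: {grad_norm.item():.4f}, weights changed: {changed}")
```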

## 9. Model Optimization Features

Our Mini Transformer implementation includes several optimization features:

In [None]:
# Examine CUDA optimizations
print("CUDA Optimizations:")
print(f"  Model uses CUDA: {model.use_cuda}")
print(f"  cuDNN benchmarking enabled: {torch.backends.cudnn.benchmark}")

# Examine weight initialization
print(f"\nWeight Initialization:")
embedding_weights = model.token_embeddings.weight
print(f"  Embedding weights mean: {embedding_weights.mean().item():.4f}")
print(f"  Embedding weights std: {embedding_weights.std().item():.4f}")

attention_weights = model.layers[0].attention.q_proj.weight
print(f"  Attention weights mean: {attention_weights.mean().item():.4f}")
print(f"  Attention weights std: {attention_weights.std().item():.4f}")

# Examine layer normalization
layer_norm = model.layers[0].layer_norm1
print(f"\nLayer Normalization:")
print(f"  Weight mean: {layer_norm.weight.mean().item():.4f}")
print(f"  Weight std: {layer_norm.weight.std().item():.4f}")
print(f"  Bias mean: {layer_norm.bias.mean().item():.4f}")

## 10. Practical Usage Examples

Let's look at some practical examples of how to use the Mini Transformer:

In [None]:
# Example 1: Simple inference
model.eval()
with torch.no_grad():
    sample_input = torch.randint(0, config.vocab_size, (1, 8)).to(device)
    outputs = model(sample_input)
    
    print("Example 1: Simple Inference")
    print(f"  Input shape: {sample_input.shape}")
    print(f"  Output logits shape: {outputs['logits'].shape}")
    print(f"  Hidden states shape: {outputs['hidden_states'].shape}")

# Example 2: Getting predictions
with torch.no_grad():
    logits = outputs['logits']
    predictions = torch.argmax(logits, dim=-1)
    
    print(f"\nExample 2: Predictions")
    print(f"  Predicted token IDs: {predictions[0].tolist()}")
    print(f"  Prediction probabilities shape: {torch.softmax(logits, dim=-1).shape}")

# Example 3: Model saving and loading
print(f"\nExample 3: Model Persistence")
save_path = "mini_transformer_checkpoint.pth"

# Save model state
torch.save(model.state_dict(), save_path)
print(f"  Model saved to {save_path}")

# Create new model and load weights
new_model = MiniTransformer(config).to(device)
new_model.load_state_dict(torch.load(save_path, map_location=device))
new_model.eval()  # new modules start in training mode; disable dropout so outputs are comparable
print(f"  Model loaded successfully")

# Verify the models produce the same output
with torch.no_grad():
    original_output = model(sample_input)
    loaded_output = new_model(sample_input)
    
    outputs_match = torch.allclose(original_output['logits'], loaded_output['logits'], atol=1e-6)
    print(f"  Outputs match: {outputs_match}")

# Clean up
import os
if os.path.exists(save_path):
    os.remove(save_path)
    print(f"  Cleanup: Removed {save_path}")
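
Example 2 takes the argmax at every position, but text generation only needs the prediction at the last position, appended to the input in a loop. The sketch below shows greedy decoding against a stand-in model (`ToyLM`) that returns a dict with a `'logits'` key, like the outputs used in this tutorial. The helper `greedy_generate` is hypothetical, and it assumes the real model attends causally.

```python
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    """Stand-in that mimics the dict output used in this tutorial."""
    def __init__(self, vocab_size=1000, hidden_size=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids):
        return {"logits": self.head(self.embed(input_ids))}

@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens, max_positions):
    model.eval()
    for _ in range(max_new_tokens):
        # Keep the context within the positional-embedding limit
        context = input_ids[:, -max_positions:]
        # Only the logits at the final position predict the next token
        next_logits = model(context)["logits"][:, -1, :]
        next_token = next_logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids

torch.manual_seed(0)
prompt = torch.randint(0, 1000, (1, 8))
generated = greedy_generate(ToyLM(), prompt, max_new_tokens=5, max_positions=64)
print(generated.shape)  # torch.Size([1, 13])
```

With untrained weights the generated IDs are meaningless. The point is the loop structure and the truncation to `max_positions`, which corresponds to `max_position_embeddings` in the configuration.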

## 11. Scaling Considerations

Let's examine how the model scales with different configurations:

In [None]:
def analyze_scaling():
    """Analyze how model parameters scale with configuration"""
    configs = [
        {"name": "Small", "hidden_size": 64, "num_heads": 2, "num_layers": 2, "intermediate_size": 128},
        {"name": "Medium", "hidden_size": 128, "num_heads": 4, "num_layers": 4, "intermediate_size": 256},
        {"name": "Large", "hidden_size": 256, "num_heads": 8, "num_layers": 6, "intermediate_size": 512}
    ]
    
    print("Scaling Analysis:")
    print(f"{'Config':<10} {'Hidden':<8} {'Heads':<6} {'Layers':<7} {'FFN':<8} {'Params':<12} {'Ratio':<8}")
    print("-" * 65)
    
    small_params = None
    for config_spec in configs:
        config = MiniTransformerConfig(
            vocab_size=1000,
            hidden_size=config_spec["hidden_size"],
            num_attention_heads=config_spec["num_heads"],
            num_hidden_layers=config_spec["num_layers"],
            intermediate_size=config_spec["intermediate_size"],
            max_position_embeddings=64
        )
        
        model = MiniTransformer(config)
        params = sum(p.numel() for p in model.parameters())
        
        if small_params is None:
            small_params = params
        
        ratio = params / small_params
        
        print(f"{config_spec['name']:<10} {config_spec['hidden_size']:<8} {config_spec['num_heads']:<6} {config_spec['num_layers']:<7} {config_spec['intermediate_size']:<8} {params:<12,} {ratio:<8.1f}")

analyze_scaling()
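
The growth in the table can also be estimated by hand. The function below counts parameters for a conventional layout: token and position embeddings, four biased attention projections and a two-layer biased feed-forward network per layer, two layer norms per layer, a final layer norm, and an untied, bias-free LM head. These layout details are assumptions, so the estimate may not match the counts reported by `analyze_scaling()` exactly.

```python
def estimate_params(vocab_size, hidden_size, num_layers, intermediate_size, max_positions):
    """Rough parameter count for an assumed conventional Transformer layout."""
    embeddings = vocab_size * hidden_size + max_positions * hidden_size
    # Q, K, V, and output projections, each hidden x hidden plus a bias
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Two linear layers with biases: hidden -> intermediate -> hidden
    ffn = (hidden_size * intermediate_size + intermediate_size) \
        + (intermediate_size * hidden_size + hidden_size)
    # Two layer norms per layer, each with a weight and a bias vector
    layer_norms = 2 * 2 * hidden_size
    final_norm = 2 * hidden_size
    lm_head = hidden_size * vocab_size  # assumed untied and without bias
    return embeddings + num_layers * (attention + ffn + layer_norms) + final_norm + lm_head

for name, hidden, layers, ffn_size in [
    ("Small", 64, 2, 128),
    ("Medium", 128, 4, 256),
    ("Large", 256, 6, 512),
]:
    print(f"{name:<8} {estimate_params(1000, hidden, layers, ffn_size, 64):>12,}")
```

The per-layer terms grow with the square of `hidden_size`, while the embedding and LM head terms grow linearly. This is why the embeddings dominate the small configuration and the Transformer layers dominate the large one.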

## Summary

In this tutorial, we've explored the Mini Transformer implementation in depth:

- **Architecture**: Understanding the core components of the Transformer model
- **Configuration**: How to set up model hyperparameters
- **Implementation**: Examining the code structure and components
- **Forward Pass**: How data flows through the network
- **Training**: Loss computation and gradient calculation
- **Optimizations**: CUDA and cuDNN optimizations for performance
- **Practical Usage**: Real-world examples of model usage
- **Scaling**: How model complexity grows with configuration

The Mini Transformer is an educational implementation that demonstrates the fundamental concepts of attention and the Transformer architecture while remaining small enough for learning and experimentation. It provides the foundation for understanding more complex models such as the Advanced Transformer in this repository.