# Training Demo Tutorial

## Introduction

This tutorial demonstrates a toy-scale training workflow for the Mini Transformer model, from environment setup and data loading through training, evaluation, checkpointing, and profiling. The same techniques can be adapted to the Advanced Transformer and scaled up to distributed training with DeepSpeed, which this notebook does not cover.

### What You'll Learn
- Setting up training environments
- Configuring training parameters
- Implementing training loops
- Monitoring training progress
- Optimizing training performance
- Best practices for model training

In [None]:
# Import required libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import sys
from pathlib import Path
import time
import numpy as np
import matplotlib.pyplot as plt

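# Add project root to path. Path('.').parent is still '.', so resolve first.
# This assumes the notebook runs from a direct subdirectory of the project root.
sys.path.append(str(Path('.').resolve().parent))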

# Import our model implementations
from src.model.mini_transformer import MiniTransformer, MiniTransformerConfig
from src.training.train_toy import TextDataset

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name()}")

## 1. Training Environment Setup

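Before training, we select the device and enable two CUDA optimizations: cuDNN autotuning (`cudnn.benchmark`), which picks the fastest convolution algorithms for fixed input shapes, and TensorFloat-32 (TF32) math, which speeds up matrix multiplications on Ampere or newer NVIDIA GPUs at slightly reduced precision.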

In [None]:
def setup_training_environment():
    """Setup training environment with CUDA optimizations"""
    # Enable cuDNN benchmarking for better performance
    torch.backends.cudnn.benchmark = True
    torch.backends.cudnn.enabled = True
    
    # Enable TensorFloat-32 for better performance on modern GPUs
    if torch.cuda.is_available():
        torch.backends.cuda.matmul.allow_tf32 = True
        torch.backends.cudnn.allow_tf32 = True
    
    # Set device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    print(f"Training environment setup completed")
    print(f"Device: {device}")
    print(f"cuDNN enabled: {torch.backends.cudnn.enabled}")
    print(f"cuDNN benchmark: {torch.backends.cudnn.benchmark}")
    if torch.cuda.is_available():
        print(f"TensorFloat-32 enabled: {torch.backends.cuda.matmul.allow_tf32}")
    
    return device

# Setup environment
device = setup_training_environment()

## 2. Creating Sample Data

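For demonstration purposes, we'll build a synthetic corpus by joining pairs of ten base sentences about AI topics. Each sample depends only on `i % 10`, so the dataset contains just ten distinct strings repeated many times; it is useful for exercising the pipeline, not for measuring generalization.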

In [None]:
def create_sample_texts(num_samples=1000):
    """Create sample text data for training"""
    base_texts = [
        "The field of artificial intelligence has seen tremendous growth in recent years.",
        "Machine learning algorithms can learn patterns from data without explicit programming.",
        "Deep learning models, particularly neural networks, have achieved remarkable results.",
        "Natural language processing enables computers to understand and generate human language.",
        "Computer vision allows machines to interpret and understand visual information.",
        "Reinforcement learning trains agents to make decisions through trial and error.",
        "Data science combines statistics, programming, and domain expertise to extract insights.",
        "Big data technologies handle the storage and processing of massive datasets.",
        "Cloud computing provides scalable resources for machine learning workloads.",
        "Ethical AI ensures that artificial intelligence systems are fair and unbiased."
    ]
    
    # Generate a larger dataset
    sample_texts = []
    for i in range(num_samples):
        text = base_texts[i % len(base_texts)] + " " + base_texts[(i + 1) % len(base_texts)]
        sample_texts.append(text[:128])  # Limit text length
    
    return sample_texts

# Create sample data
sample_texts = create_sample_texts(5000)
print(f"Created {len(sample_texts)} sample texts")
print(f"Sample text: {sample_texts[0][:50]}...")
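
Because the synthetic corpus repeats a handful of strings, a random split over samples would put the same text in both training and validation. The sketch below, which is not part of the original pipeline, shows one way to hold out data by unique string instead. The helper name `split_by_unique_text`, the 20% fraction, and the stand-in `demo_texts` are illustrative assumptions.

```python
import random


def split_by_unique_text(texts, val_fraction=0.2, seed=42):
    """Split texts so that no string appears in both the train and validation sets."""
    unique_texts = sorted(set(texts))          # deterministic order before shuffling
    rng = random.Random(seed)                  # local RNG, leaves global state alone
    rng.shuffle(unique_texts)

    num_val = max(1, int(len(unique_texts) * val_fraction))
    val_set = set(unique_texts[:num_val])

    train = [t for t in texts if t not in val_set]
    val = [t for t in texts if t in val_set]
    return train, val


# Stand-in data with the same repetition pattern as create_sample_texts()
demo_texts = [f"sentence {i % 10}" for i in range(5000)]
train_texts, val_texts = split_by_unique_text(demo_texts)
print(len(set(demo_texts)), len(train_texts), len(val_texts))
```

With only ten distinct strings the validation set is tiny in terms of content, so a split like this mainly guards against leakage; a real corpus would be needed for a meaningful validation score.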

## 3. Dataset and DataLoader Configuration

We'll create a dataset and configure a DataLoader for efficient training.

In [None]:
# Create dataset
dataset = TextDataset(sample_texts, max_length=64)
print(f"Dataset created with {len(dataset)} samples")

# Create DataLoader with optimizations
batch_size = 16
dataloader = DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=True,
    pin_memory=torch.cuda.is_available(),  # Pinned memory speeds up host-to-GPU copies
    num_workers=2,    # Use multiple workers for data loading
    persistent_workers=True,  # Keep workers alive between epochs
    prefetch_factor=2  # Prefetch data for better performance
)

print(f"DataLoader configured with batch size: {batch_size}")
print(f"Number of batches: {len(dataloader)}")

# Examine a sample batch
sample_batch = next(iter(dataloader))
print(f"\nSample batch:")
print(f"  Input IDs shape: {sample_batch['input_ids'].shape}")
print(f"  Labels shape: {sample_batch['labels'].shape}")
print(f"  Device: {sample_batch['input_ids'].device}")
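
The "Number of batches" value printed above follows directly from the dataset size and the batch size. Here is a small sketch of that arithmetic, assuming the dataset yields one sample per input text; `num_batches` is an illustrative helper, not a PyTorch API.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields per epoch for a map-style dataset."""
    if drop_last:
        return num_samples // batch_size        # incomplete final batch is discarded
    return math.ceil(num_samples / batch_size)  # final batch may be smaller


print(num_batches(5000, 16))                    # final batch holds 5000 % 16 = 8 samples
print(num_batches(5000, 16, drop_last=True))
```

Passing `drop_last=True` to the `DataLoader` is worth considering when a smaller final batch would distort per-batch timing or loss averages.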

## 4. Model Configuration and Initialization

Let's configure and initialize our Mini Transformer model for training.

In [None]:
# Create model configuration
config = MiniTransformerConfig(
    vocab_size=10000,
    hidden_size=256,
    num_attention_heads=4,
    num_hidden_layers=4,
    intermediate_size=512,
    max_position_embeddings=64,
    dropout_prob=0.1,
    use_cuda=torch.cuda.is_available(),
    use_cudnn=True
)

# Create model
model = MiniTransformer(config)
model.to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Model configuration:")
print(f"  Hidden size: {config.hidden_size}")
print(f"  Attention heads: {config.num_attention_heads}")
print(f"  Hidden layers: {config.num_hidden_layers}")
print("\nModel created successfully")
print(f"  Device: {device}")
print(f"  Total parameters: {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
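
As a sanity check on the printed parameter count, the weights of a generic Transformer can be estimated from the configuration alone. The sketch below assumes a conventional layout (token and position embeddings, four attention projections and a two-layer feed-forward block per layer) and ignores biases and LayerNorm. The real `MiniTransformer` may be structured differently, so treat the result as a rough estimate rather than the expected value.

```python
def estimate_transformer_params(vocab_size, hidden_size, num_layers,
                                intermediate_size, max_positions,
                                tied_lm_head=True):
    """Rough weight count for a generic Transformer; biases and LayerNorm are ignored."""
    embeddings = vocab_size * hidden_size + max_positions * hidden_size
    attention = 4 * hidden_size * hidden_size            # Q, K, V and output projections
    feed_forward = 2 * hidden_size * intermediate_size   # up and down projections
    lm_head = 0 if tied_lm_head else vocab_size * hidden_size
    return embeddings + num_layers * (attention + feed_forward) + lm_head


estimate = estimate_transformer_params(
    vocab_size=10000, hidden_size=256, num_layers=4,
    intermediate_size=512, max_positions=64,
)
head_dim = 256 // 4  # hidden_size must divide evenly by num_attention_heads
print(f"Estimated parameters: {estimate:,} (head_dim={head_dim})")
```

If the measured total is noticeably larger, an untied output projection (`tied_lm_head=False`) is one likely explanation, since it adds another `vocab_size * hidden_size` weights.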

## 5. Optimizer and Learning Rate Scheduler

We'll set up an optimizer and learning rate scheduler for effective training.

In [None]:
# Create optimizer with weight decay for better generalization
learning_rate = 5e-4
optimizer = optim.AdamW(
    model.parameters(),
    lr=learning_rate,
    weight_decay=0.01,
    betas=(0.9, 0.999),
    eps=1e-8
)

# Create learning rate scheduler
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=1000,  # Counted in scheduler.step() calls; the training loop steps once per batch
    eta_min=1e-6
)

print(f"Optimizer configured:")
print(f"  Type: {type(optimizer).__name__}")
print(f"  Learning rate: {learning_rate}")
print(f"  Weight decay: 0.01")
print(f"\nLearning rate scheduler:")
print(f"  Type: {type(scheduler).__name__}")
print(f"  T_max: 1000")
print(f"  Eta min: 1e-6")
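
To see what learning rates this schedule produces, the closed-form cosine annealing curve can be evaluated without PyTorch. This is a minimal sketch using the values configured above; `cosine_annealing_lr` is an illustrative helper, and the step count of 626 assumes 2 epochs of 313 batches each.

```python
import math


def cosine_annealing_lr(step, base_lr=5e-4, eta_min=1e-6, t_max=1000):
    """Closed-form cosine annealing value after `step` scheduler steps (0 <= step <= t_max)."""
    cosine = (1 + math.cos(math.pi * step / t_max)) / 2   # goes from 1 to 0
    return eta_min + (base_lr - eta_min) * cosine


for step in (0, 250, 500, 626, 1000):
    print(f"step {step:4d}: lr = {cosine_annealing_lr(step):.6f}")
```

Because the run takes fewer than `T_max` steps under that assumption, the learning rate never reaches `eta_min`. Setting `T_max` to the total number of training steps would let the schedule complete exactly at the end of training.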

## 6. Training Loop Implementation

Let's implement a training loop with proper monitoring and logging.

In [None]:
def train_model(model, dataloader, optimizer, scheduler, device, num_epochs=3):
    """Train the model with monitoring and logging"""
    model.train()
    
    # Initialize metrics tracking
    epoch_losses = []
    epoch_times = []
    
    for epoch in range(num_epochs):
        epoch_start_time = time.time()
        total_loss = 0
        num_batches = 0
        
        # Progress tracking
        batch_times = []
        
        for batch_idx, batch in enumerate(dataloader):
            batch_start_time = time.time()
            
            # Move tensors to device with non-blocking transfer
            input_ids = batch["input_ids"].to(device, non_blocking=True)
            labels = batch["labels"].to(device, non_blocking=True)
            
            # Forward pass
            outputs = model(input_ids, labels=labels)
            loss = outputs["loss"]
            
            # Backward pass
            optimizer.zero_grad(set_to_none=True)  # More memory efficient
            loss.backward()
            
            # Gradient clipping for stability
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            
            # Optimizer step
            optimizer.step()
            
            # Update learning rate
            scheduler.step()
            
            # Track metrics
            total_loss += loss.item()
            num_batches += 1
            
            batch_time = time.time() - batch_start_time
            batch_times.append(batch_time)
            
            # Log progress periodically
            if batch_idx % 50 == 0:
                current_lr = optimizer.param_groups[0]['lr']
                avg_batch_time = np.mean(batch_times[-50:])  # mean over the last (up to) 50 batches
                samples_per_sec = input_ids.size(0) / avg_batch_time
                print(f"Epoch {epoch+1}/{num_epochs}, Batch {batch_idx:3d}, "
                      f"Loss: {loss.item():.4f}, LR: {current_lr:.6f}, "
                      f"Batch Time: {avg_batch_time:.3f}s, Samples/sec: {samples_per_sec:.1f}")
        
        # End of epoch metrics
        epoch_time = time.time() - epoch_start_time
        avg_loss = total_loss / num_batches
        
        epoch_losses.append(avg_loss)
        epoch_times.append(epoch_time)
        
        print(f"\nEpoch {epoch+1}/{num_epochs} completed:")
        print(f"  Average Loss: {avg_loss:.4f}")
        print(f"  Epoch Time: {epoch_time:.2f}s")
        print(f"  Average Batch Time: {np.mean(batch_times):.3f}s")
        print(f"  Current Learning Rate: {optimizer.param_groups[0]['lr']:.6f}")
        print("-" * 50)
    
    return epoch_losses, epoch_times

# Train the model
print("Starting training...")
train_start_time = time.time()
epoch_losses, epoch_times = train_model(model, dataloader, optimizer, scheduler, device, num_epochs=2)
total_train_time = time.time() - train_start_time

print(f"\nTraining completed in {total_train_time:.2f}s")
print(f"Average epoch time: {np.mean(epoch_times):.2f}s")
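
The training loop relies on gradient clipping for stability. The sketch below illustrates the idea behind clipping by global norm using plain Python lists in place of tensors: all gradients are scaled by a single factor so that their combined L2 norm does not exceed `max_norm`. It is an illustration of the technique, not PyTorch's implementation, and `clip_by_global_norm` is a hypothetical helper name.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))       # never scales gradients up
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]        # two toy "parameter tensors", global norm 5
clipped, norm_before = clip_by_global_norm(grads)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(f"norm before: {norm_before:.3f}, after: {norm_after:.3f}")
```

Scaling every gradient by the same factor preserves the update direction, which is why global-norm clipping is usually preferred over clamping each value independently.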

## 7. Training Metrics Visualization

Let's visualize the training metrics to understand the model's learning progress.

In [None]:
# Plot training metrics
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Plot loss over epochs
ax1.plot(range(1, len(epoch_losses) + 1), epoch_losses, 'b-o')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training Loss Over Epochs')
ax1.grid(True, alpha=0.3)

# Plot epoch times
ax2.bar(range(1, len(epoch_times) + 1), epoch_times, color='orange', alpha=0.7)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Time (seconds)')
ax2.set_title('Epoch Training Times')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Final training loss: {epoch_losses[-1]:.4f}")
print(f"Loss improvement: {epoch_losses[0] - epoch_losses[-1]:.4f}")

## 8. Model Evaluation

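Let's evaluate our trained model by computing its average loss and perplexity. Note that this demo reuses the training DataLoader, so the numbers measure fit to the training data rather than generalization; use a held-out split for a real validation score.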

In [None]:
def evaluate_model(model, dataloader, device):
    """Evaluate the model on validation data"""
    model.eval()
    total_loss = 0
    num_batches = 0
    
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device, non_blocking=True)
            labels = batch["labels"].to(device, non_blocking=True)
            
            outputs = model(input_ids, labels=labels)
            total_loss += outputs["loss"].item()
            num_batches += 1
    
    avg_loss = total_loss / num_batches
    perplexity = torch.exp(torch.tensor(avg_loss)).item()
    
    return avg_loss, perplexity

# Evaluate the model (on the training data, since this demo has no held-out split)
eval_loss, perplexity = evaluate_model(model, dataloader, device)

print(f"Model Evaluation:")
print(f"  Loss (training data): {eval_loss:.4f}")
print(f"  Perplexity: {perplexity:.2f}")
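
Perplexity is the exponential of the mean cross-entropy, so it is easy to sanity-check by hand. The sketch below assumes the model's loss is a mean cross-entropy in nats and weights each batch by its token count; `evaluate_model` above averages batches equally, which is exact only when every batch contains the same number of tokens. The helper name and the token counts are illustrative.

```python
import math


def perplexity_from_losses(batch_losses, batch_token_counts):
    """Token-weighted mean cross-entropy (in nats) and the matching perplexity."""
    total_tokens = sum(batch_token_counts)
    weighted_sum = sum(loss_value * count
                       for loss_value, count in zip(batch_losses, batch_token_counts))
    mean_loss = weighted_sum / total_tokens
    return mean_loss, math.exp(mean_loss)


# A model that predicts uniformly over a 10,000-token vocabulary has loss ln(10000),
# so its perplexity equals the vocabulary size.
uniform_loss = math.log(10000)
loss, ppl = perplexity_from_losses([uniform_loss, uniform_loss], [1024, 512])
print(f"loss = {loss:.4f}, perplexity = {ppl:.1f}")
```

This gives a useful reference point: a loss near `ln(vocab_size)` means the model is doing no better than uniform guessing.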

## 9. Model Saving and Loading

Let's save our trained model and demonstrate how to load it later.

In [None]:
import os
import json

# Create output directory
output_dir = Path("training_output")
output_dir.mkdir(exist_ok=True)

# Save model
model_path = output_dir / "mini_transformer_trained.pth"
torch.save(model.state_dict(), model_path)

# Save configuration
config_path = output_dir / "config.json"
with open(config_path, "w") as f:
    json.dump(config.__dict__, f, indent=2)

print(f"Model saved to {model_path}")
print(f"Configuration saved to {config_path}")

# Demonstrate loading
print(f"\nLoading model...")
loaded_model = MiniTransformer(config)
loaded_model.to(device)
loaded_model.load_state_dict(torch.load(model_path, map_location=device))
print(f"Model loaded successfully")

# Verify the models produce the same output
model.eval()
loaded_model.eval()

with torch.no_grad():
    sample_batch = next(iter(dataloader))
    input_ids = sample_batch["input_ids"].to(device)
    
    original_output = model(input_ids, labels=sample_batch["labels"].to(device))
    loaded_output = loaded_model(input_ids, labels=sample_batch["labels"].to(device))
    
    outputs_match = torch.allclose(original_output["loss"], loaded_output["loss"], atol=1e-6)
    print(f"Outputs match: {outputs_match}")

# Clean up
if model_path.exists():
    os.remove(model_path)
if config_path.exists():
    os.remove(config_path)
if output_dir.exists() and not any(output_dir.iterdir()):
    output_dir.rmdir()
    
print(f"\nCleanup completed")
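
The cell above reloads the weights using the `config` object that is still in memory. In a fresh session the configuration would have to be rebuilt from `config.json` first. The sketch below shows that round trip with a hypothetical stand-in dataclass, `DemoConfig`; whether `MiniTransformerConfig` accepts its saved fields as keyword arguments is an assumption that should be checked against its definition.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class DemoConfig:
    """Hypothetical stand-in for MiniTransformerConfig with a few of its fields."""
    vocab_size: int = 10000
    hidden_size: int = 256
    num_hidden_layers: int = 4


def save_config(config, path):
    """Write the config fields to a JSON file."""
    Path(path).write_text(json.dumps(asdict(config), indent=2))


def load_config(path):
    """Rebuild a config object from the JSON file written by save_config."""
    return DemoConfig(**json.loads(Path(path).read_text()))


with tempfile.TemporaryDirectory() as tmp_dir:
    config_file = Path(tmp_dir) / "config.json"
    original = DemoConfig(hidden_size=128)
    save_config(original, config_file)
    restored = load_config(config_file)

print(restored == original)
```

Rebuilding the model from the restored config before calling `load_state_dict` ensures the saved weights match the architecture they were trained with.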

## 10. Performance Optimization Techniques

Let's explore some advanced performance optimization techniques for training.

In [None]:
def benchmark_training_optimizations():
    """Benchmark different training optimization techniques"""
    print("Performance Optimization Benchmarking:")
    print("=" * 50)
    
    # Create a small dataset for benchmarking
    small_texts = create_sample_texts(100)
    small_dataset = TextDataset(small_texts, max_length=32)
    
    optimizations = [
        {"name": "Baseline", "pin_memory": False, "num_workers": 0},
        {"name": "Pinned Memory", "pin_memory": True, "num_workers": 0},
        {"name": "Multi-worker", "pin_memory": False, "num_workers": 2},
        {"name": "Optimized", "pin_memory": True, "num_workers": 2},
    ]
    
    results = []
    
    for opt in optimizations:
        # Create DataLoader with specific optimization
        dataloader = DataLoader(
            small_dataset,
            batch_size=8,
            shuffle=True,
            pin_memory=opt["pin_memory"],
            num_workers=opt["num_workers"],
            persistent_workers=opt["num_workers"] > 0
        )
        
        # Create model
        config = MiniTransformerConfig(
            vocab_size=10000,
            hidden_size=128,
            num_attention_heads=2,
            num_hidden_layers=2,
            intermediate_size=256,
            max_position_embeddings=32
        )
        model = MiniTransformer(config).to(device)
        optimizer = optim.AdamW(model.parameters(), lr=5e-4)
        
        # Warmup
        for _ in range(2):
            for batch in dataloader:
                input_ids = batch["input_ids"].to(device, non_blocking=opt["pin_memory"])
                labels = batch["labels"].to(device, non_blocking=opt["pin_memory"])
                outputs = model(input_ids, labels=labels)
                optimizer.zero_grad()
                outputs["loss"].backward()
                optimizer.step()
        
        # Benchmark (synchronize so queued CUDA kernels are included in the timing)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(5):
            for batch in dataloader:
                input_ids = batch["input_ids"].to(device, non_blocking=opt["pin_memory"])
                labels = batch["labels"].to(device, non_blocking=opt["pin_memory"])
                outputs = model(input_ids, labels=labels)
                optimizer.zero_grad()
                outputs["loss"].backward()
                optimizer.step()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        end_time = time.time()
        
        avg_time = (end_time - start_time) / 5
        results.append((opt["name"], avg_time))
        
        print(f"{opt['name']:<15}: {avg_time:.3f}s per epoch")
    
    # Show improvement
    baseline_time = results[0][1]
    best_time = min(results, key=lambda x: x[1])[1]
    improvement = (baseline_time - best_time) / baseline_time * 100
    
    print(f"\nPerformance improvement: {improvement:.1f}% with optimized settings")

if torch.cuda.is_available():
    benchmark_training_optimizations()
else:
    print("Performance benchmarking requires CUDA availability")
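
The warmup-then-measure pattern used in the benchmark can be factored into a small reusable helper. This is a minimal sketch with an illustrative name, `time_callable`. The optional `sync` argument is an assumption of this sketch: it lets the caller pass a function such as `torch.cuda.synchronize` so that asynchronous GPU work is finished before each clock read.

```python
import statistics
import time


def time_callable(fn, warmup=2, repeats=5, sync=None):
    """Return per-call wall-clock times for fn, after untimed warmup calls.

    `sync` is an optional zero-argument callable run before each clock read,
    e.g. torch.cuda.synchronize when fn launches asynchronous GPU work.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return timings


timings = time_callable(lambda: sum(i * i for i in range(100_000)))
print(f"median: {statistics.median(timings) * 1e3:.2f} ms, "
      f"min: {min(timings) * 1e3:.2f} ms")
```

Reporting the median rather than the mean makes the result less sensitive to a single slow run, which matters for short benchmarks like the one above.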

## 11. Gradient Analysis

Let's analyze the gradients during training to understand the learning dynamics.

In [None]:
def analyze_gradients(model):
    """Analyze gradient statistics"""
    total_norm = 0
    layer_norms = []
    
    for name, param in model.named_parameters():
        if param.grad is not None:
            param_norm = param.grad.detach().norm(2)
            total_norm += param_norm.item() ** 2
            layer_norms.append((name, param_norm.item()))
    
    total_norm = total_norm ** (1. / 2)
    
    print(f"Gradient Analysis:")
    print(f"  Total gradient norm: {total_norm:.4f}")
    print("\n  Top 5 parameters by gradient norm:")
    layer_norms.sort(key=lambda x: x[1], reverse=True)
    for name, norm in layer_norms[:5]:
        print(f"    {name}: {norm:.4f}")
    
    return total_norm, layer_norms

# Analyze gradients (requires a fresh backward pass)
sample_batch = next(iter(dataloader))
input_ids = sample_batch["input_ids"].to(device)
labels = sample_batch["labels"].to(device)

# Clear gradients left over from the last training step so they don't accumulate
model.zero_grad(set_to_none=True)
outputs = model(input_ids, labels=labels)
outputs["loss"].backward()

total_grad_norm, layer_grad_norms = analyze_gradients(model)
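
`analyze_gradients` reports one norm per parameter tensor, which can be noisy to read for deeper models. The sketch below shows one way to aggregate those `(name, norm)` pairs into per-module norms by name prefix. The parameter names in the example are hypothetical; the real ones come from `model.named_parameters()` and may need a different `depth`.

```python
import math
from collections import defaultdict


def group_grad_norms(named_norms, depth=2):
    """Combine per-parameter L2 norms into per-module norms via the name prefix."""
    squared = defaultdict(float)
    for name, norm in named_norms:
        prefix = ".".join(name.split(".")[:depth])
        squared[prefix] += norm ** 2           # L2 norms combine as root-sum-of-squares
    return {prefix: math.sqrt(total) for prefix, total in squared.items()}


# Hypothetical parameter names; the real ones come from model.named_parameters()
example = [
    ("layers.0.attention.weight", 3.0),
    ("layers.0.attention.bias", 4.0),
    ("layers.1.ffn.weight", 1.0),
    ("embedding.weight", 2.0),
]
for prefix, norm in sorted(group_grad_norms(example).items()):
    print(f"{prefix}: {norm:.4f}")
```

Comparing grouped norms across layers is a quick way to spot vanishing or exploding gradients at a particular depth.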

## 12. Memory Usage Monitoring

Let's monitor memory usage during training to understand resource requirements.

In [None]:
def monitor_memory_usage():
    """Monitor GPU memory usage"""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        max_allocated = torch.cuda.max_memory_allocated() / 1e9
        
        print(f"Memory Usage Monitoring:")
        print(f"  Allocated memory: {allocated:.2f} GB")
        print(f"  Reserved memory: {reserved:.2f} GB")
        print(f"  Max allocated memory: {max_allocated:.2f} GB")
        
        return allocated, reserved, max_allocated
    else:
        print("Memory monitoring requires CUDA availability")
        return None, None, None

memory_stats = monitor_memory_usage()
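
The measured numbers can be compared against a back-of-the-envelope estimate. For fp32 training with AdamW, each parameter needs storage for its weight, its gradient, and two optimizer moment buffers, or roughly 16 bytes per parameter before activations and allocator overhead. The sketch below encodes that arithmetic; the helper name and the 5-million-parameter figure are illustrative, not measured values.

```python
def estimate_training_state_bytes(num_params, bytes_per_value=4):
    """Memory for weights, gradients and AdamW's two moment buffers (no activations)."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    adam_moments = 2 * num_params * bytes_per_value    # exp_avg and exp_avg_sq
    return weights + gradients + adam_moments


num_params = 5_000_000                                 # illustrative model size
total_bytes = estimate_training_state_bytes(num_params)
print(f"~{total_bytes / 1e9:.3f} GB for {num_params:,} fp32 parameters")
```

Activations typically account for the remainder of the allocated memory and grow with batch size and sequence length, so the gap between this estimate and the measured peak gives a rough sense of activation cost.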

## Summary

In this tutorial, we've demonstrated a comprehensive training workflow for the Mini Transformer:

- **Environment Setup**: Configuring CUDA optimizations for better performance
- **Data Preparation**: Creating datasets and efficient data loaders
- **Model Configuration**: Setting up model hyperparameters
- **Optimizer Setup**: Configuring AdamW optimizer with learning rate scheduling
- **Training Loop**: Implementing a robust training loop with monitoring
- **Performance Metrics**: Tracking loss, timing, and other metrics
- **Model Persistence**: Saving and loading trained models
- **Optimization Techniques**: Benchmarking different performance optimizations
- **Gradient Analysis**: Understanding learning dynamics through gradient monitoring
- **Memory Monitoring**: Tracking resource usage during training

This training demo provides a foundation for training Transformer models effectively. The techniques demonstrated here can be scaled up for larger models and adapted to different architectures. For enterprise-scale training, consider using the DeepSpeed training script, which provides additional optimizations for distributed training across multiple GPUs.