# Week 3.3: Model Training Hyperparameters

In this notebook, we'll explore key hyperparameters and optimization techniques for training large language models effectively. These techniques build upon the performance optimizations covered in the previous lesson, now focusing on training effectiveness rather than just speed.

We'll continue examining Andrej Karpathy's [build-nanogpt](https://github.com/karpathy/build-nanogpt) repository, focusing on the commits that implement these hyperparameter optimizations.

## Overview of Hyperparameter Optimizations

The chapter timestamps from the accompanying lecture give the following progression of optimizations:

1. **Hyperparameters, AdamW, gradient clipping** (02:14:55)
2. **Learning rate scheduler: warmup + cosine decay** (02:21:06)
3. **Batch size schedule, weight decay, FusedAdamW** (02:26:21 - 90ms)
4. **Gradient accumulation** (02:34:09)
5. **Distributed Data Parallel (DDP)** (02:46:52)
6. **Datasets used in GPT-2, GPT-3, FineWeb (EDU)** (03:10:21)
7. **Validation data split, validation loss, sampling revive** (03:23:10)
8. **Evaluation: HellaSwag, starting the run** (03:28:23)

## 1. Hyperparameters, AdamW, and Gradient Clipping

The first optimization focuses on properly configuring the optimizer and implementing gradient clipping to stabilize training.

### AdamW Optimizer

AdamW is a variant of the Adam optimizer that implements weight decay correctly, separating it from the adaptive learning rate mechanism.

```python
# Configure AdamW optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=learning_rate,
    betas=(0.9, 0.95),  # Default is (0.9, 0.999)
    eps=1e-8,
    weight_decay=0.1  # Decoupled: shrinks the weights directly instead of being added to the gradient
)
```
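To make "decoupled" concrete, the sketch below applies a single optimizer step by hand to one scalar weight whose loss gradient is zero. It is plain Python rather than PyTorch, the function names `adam_l2_step` and `adamw_step` are illustrative and not library APIs, and the learning rate is chosen only to make the numbers easy to read.

```python
import math

def adam_l2_step(w, grad, lr=0.1, wd=0.1, b1=0.9, b2=0.95, eps=1e-8):
    """First Adam step with L2 regularization folded into the gradient."""
    g = grad + wd * w                  # decay term is mixed into the gradient
    m = (1 - b1) * g                   # first moment, starting from zero
    v = (1 - b2) * g * g               # second moment, starting from zero
    m_hat = m / (1 - b1)               # bias correction for step 1
    v_hat = v / (1 - b2)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def adamw_step(w, grad, lr=0.1, wd=0.1, b1=0.9, b2=0.95, eps=1e-8):
    """First AdamW step: decay is applied to the weight, not the gradient."""
    w = w * (1 - lr * wd)              # decoupled weight decay
    m = (1 - b1) * grad
    v = (1 - b2) * grad * grad
    m_hat = m / (1 - b1)
    v_hat = v / (1 - b2)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

w0 = 2.0
w_l2 = adam_l2_step(w0, grad=0.0)        # moves by roughly lr, i.e. about 0.1
w_decoupled = adamw_step(w0, grad=0.0)   # shrinks by lr * wd * w0 = 0.02
print(w_l2, w_decoupled)
```

With L2 regularization, Adam's normalization turns even a small decay term into a full-size step, so the amount of decay no longer depends on `wd` in the intended way. With decoupled decay the weight shrinks by a fixed fraction per step.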

### Gradient Clipping

Gradient clipping prevents exploding gradients by scaling them when their norm exceeds a threshold.

```python
# Apply gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
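The norm in question is a single global L2 norm taken over all gradients together, not one norm per tensor. The following is a minimal plain-Python sketch of that arithmetic, assuming gradients are given as lists of floats; `clip_by_global_norm` is a hypothetical helper written for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradient vectors by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]   # joint norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length changes. `torch.nn.utils.clip_grad_norm_` returns the norm measured before clipping, which is a useful value to log when watching for instabilities.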

### Benefits:
- AdamW decouples weight decay from the gradient-based adaptive update
- Gradient clipping limits the damage from occasional very large gradients, which are most common early in training
- β₂ is lowered from the default 0.999 to 0.95, following the GPT-3 paper's settings

## 2. Learning Rate Scheduler: Warmup + Cosine Decay

Learning rate scheduling is crucial for efficient transformer training, typically using a warmup period followed by cosine decay.

```python
import math

def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) if it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

# Update learning rate during training
lr = get_lr(iteration)
for param_group in optimizer.param_groups:
    param_group['lr'] = lr
```
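To get a feel for the shape of this schedule, the sketch below evaluates the same formula at a few iterations. It is self-contained, and the small `warmup_iters` and `lr_decay_iters` values are illustrative defaults chosen so the output is short, not settings from the repository.

```python
import math

def lr_at(it, max_lr=6e-4, min_lr=6e-5, warmup_iters=10, lr_decay_iters=100):
    """Warmup + cosine decay with the settings passed in explicitly."""
    if it < warmup_iters:
        return max_lr * it / warmup_iters          # linear warmup
    if it > lr_decay_iters:
        return min_lr                              # floor after the decay window
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

for it in (0, 5, 10, 55, 100, 150):
    print(f"iter {it:3d}: lr = {lr_at(it):.2e}")
```

Note that the learning rate at iteration 0 is exactly zero, so the first optimizer step changes nothing. Some implementations, build-nanogpt among them, use `(it + 1)` in the warmup branch to avoid that wasted step.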

### Benefits:
- Warmup period helps stabilize early training when gradients might be erratic
- Cosine decay provides a smooth learning rate reduction
- Large early steps make fast progress, while the small final learning rate lets the model settle into a lower loss
- Typically improves final model quality and training stability

## 3. Batch Size Schedule, Weight Decay, and FusedAdamW

These optimizations focus on efficient batch processing and further optimizer improvements.

### Batch Size Schedule

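The GPT-3 paper describes ramping the batch size up linearly over the early part of training. Small batches are cheaper per step, and early in training they may already provide a useful gradient signal. A simple step-wise ramp looks like this: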

```python
# Example of batch size scheduling
batch_size = min(max_batch_size, initial_batch_size * (iteration // batch_size_schedule_interval + 1))
```
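As a quick check of how this expression behaves, the sketch below wraps it in a function and evaluates it at a few iterations. The initial size, cap, and interval are hypothetical values picked for illustration.

```python
def scheduled_batch_size(iteration, initial_batch_size=32,
                         max_batch_size=128, interval=1000):
    """Step-wise ramp: grow by one multiple of the initial size per interval, up to a cap."""
    stage = iteration // interval + 1
    return min(max_batch_size, initial_batch_size * stage)

sizes = [scheduled_batch_size(it) for it in (0, 999, 1000, 2500, 3000, 10000)]
print(sizes)
```

One design option, assuming the training loop already uses gradient accumulation, is to implement the ramp by changing the number of accumulation steps rather than the micro-batch size, so tensor shapes stay fixed throughout training.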

### Weight Decay with Parameter Filtering

Weight decay is typically applied only to weight matrices and embeddings, not to biases or normalization parameters:

```python
# Apply weight decay only to parameters with 2 or more dimensions
# (weight matrices and embeddings, not biases or LayerNorm parameters)
decay_params = []
nodecay_params = []
for pname, p in model.named_parameters():
    if p.dim() >= 2:
        decay_params.append(p)
    else:
        nodecay_params.append(p)
        
optim_groups = [
    {"params": decay_params, "weight_decay": weight_decay},
    {"params": nodecay_params, "weight_decay": 0.0}
]
```
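The dimension rule can be checked without building a model. The sketch below applies it to a handful of hypothetical parameter names and shapes, loosely modeled on a GPT-2-sized block; the names are assumptions for illustration only.

```python
import math

# Hypothetical parameter names and shapes
param_shapes = {
    "wte.weight": (50304, 768),              # token embedding matrix
    "h.0.ln_1.weight": (768,),               # LayerNorm gain
    "h.0.ln_1.bias": (768,),                 # LayerNorm shift
    "h.0.attn.c_attn.weight": (2304, 768),   # attention projection weights
    "h.0.attn.c_attn.bias": (2304,),         # attention projection bias
}

# Same rule as above: 2 or more dimensions means weight decay is applied
decay = [name for name, shape in param_shapes.items() if len(shape) >= 2]
no_decay = [name for name, shape in param_shapes.items() if len(shape) < 2]

n_decay = sum(math.prod(param_shapes[name]) for name in decay)
n_no_decay = sum(math.prod(param_shapes[name]) for name in no_decay)
print(f"decayed: {len(decay)} tensors, {n_decay:,} parameters")
print(f"not decayed: {len(no_decay)} tensors, {n_no_decay:,} parameters")
```

The non-decayed group contains more tensors here, but it accounts for a tiny share of the parameter count, so almost all of the model is still regularized.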

### FusedAdamW

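A fused optimizer updates all parameters in a small number of GPU kernel launches instead of looping over tensors one at a time. Recent PyTorch releases accept `fused=True` in `torch.optim.AdamW` on CUDA, which is the route build-nanogpt takes. The snippet below instead uses NVIDIA Apex's `FusedAdam` (which applies AdamW-style decoupled decay by default) when it is installed, and falls back to PyTorch's AdamW otherwise: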

```python
# Using apex for faster optimizer implementation when available
try:
    from apex.optimizers import FusedAdam
    optimizer = FusedAdam(optim_groups, lr=learning_rate, betas=(0.9, 0.95))
except ImportError:
    optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=(0.9, 0.95))
```

### Benefits:
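- A batch size ramp spends fewer tokens per step early in training, when small batches may already give a useful gradient signal
- Selective weight decay regularizes the weight matrices while leaving biases and normalization parameters unconstrained
- A fused AdamW kernel shortens the optimizer step
- With these changes, the lecture reports a training iteration time of about 90ms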

## 4. Gradient Accumulation

Gradient accumulation allows training with effectively larger batch sizes by accumulating gradients across multiple forward-backward passes before updating the model.

```python
# One training step with gradient accumulation
accum_iter = 4  # Accumulate gradients over 4 micro-batches
optimizer.zero_grad(set_to_none=True)

for micro_step in range(accum_iter):
    X, Y = get_batch('train')  # Fetch a fresh micro-batch on every pass
    with torch.amp.autocast(device_type=device_type, dtype=dtype):
        logits, loss = model(X, Y)
        # Scale the loss so the accumulated gradient is a mean, not a sum
        loss = loss / accum_iter

    # Backward pass: gradients add up in .grad across micro-steps
    scaler.scale(loss).backward()

# Apply a single optimizer step after all micro-batches have been processed
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)
scaler.update()
```
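The division of the loss by the number of accumulation steps is easy to get wrong, so here is a small plain-Python check of why it is needed. It uses a one-parameter linear model with a hand-written mean-squared-error gradient; the data and the `mse_grad` helper are made up for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Reference: one batch containing all four examples
full_grad = mse_grad(w, xs, ys)

# Accumulate over two micro-batches of two examples each
accum_steps = 2
micro = len(xs) // accum_steps
accum_grad = 0.0
unscaled_grad = 0.0
for step in range(accum_steps):
    start, end = step * micro, (step + 1) * micro
    g = mse_grad(w, xs[start:end], ys[start:end])
    accum_grad += g / accum_steps   # scaled, as in the training loop
    unscaled_grad += g              # what happens if the division is forgotten

print(full_grad, accum_grad, unscaled_grad)
```

Each micro-batch loss is already a mean over its own examples, so summing the gradients without the extra division overstates the full-batch gradient by a factor of `accum_steps`.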

### Benefits:
- Enables larger effective batch sizes without increasing peak memory usage
- Reduces gradient noise by averaging over more examples per update
- Lets a memory-limited GPU reproduce a batch size chosen for larger hardware, at the cost of more time per optimizer step

## 5. Distributed Data Parallel (DDP)

DDP enables training across multiple GPUs or nodes, significantly accelerating training for large models.

```python
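import os

import torch
from torch.utils.data import DataLoader

# Initialize process group (environment variables are set by the launcher, e.g. torchrun)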
torch.distributed.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
device = f'cuda:{local_rank}'

# Wrap model in DDP
model = torch.nn.parallel.DistributedDataParallel(
    model, 
    device_ids=[local_rank], 
    output_device=local_rank
)

# Use distributed sampler for dataloader
sampler = torch.utils.data.distributed.DistributedSampler(dataset)
dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

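### Benefits:
- Near-linear scaling of training throughput with the number of GPUs, as long as communication overlaps with computation
- Automatic gradient averaging across all processes during the backward pass
- Efficient GPU-to-GPU communication using the NCCL backend
- Each process holds a full model replica, so DDP speeds up training but does not help with a model that is too large for a single GPU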
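Because every process handles its own micro-batches, the number of tokens consumed per optimizer step is the product of micro-batch size, sequence length, accumulation steps, and world size. The sketch below shows the resulting arithmetic for a fixed token budget; the function name and the specific numbers are assumptions for illustration.

```python
def grad_accum_steps_for(total_batch_tokens, micro_batch_size, seq_len, world_size):
    """Accumulation steps needed per process to reach a fixed token budget per update."""
    tokens_per_micro_step = micro_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("total batch must be divisible by B * T * world_size")
    return total_batch_tokens // tokens_per_micro_step

# Hypothetical budget of 2**19 tokens per optimizer step
single_gpu = grad_accum_steps_for(524288, micro_batch_size=16, seq_len=1024, world_size=1)
eight_gpus = grad_accum_steps_for(524288, micro_batch_size=16, seq_len=1024, world_size=8)
print(single_gpu, eight_gpus)
```

Keeping the token budget fixed and letting the accumulation count absorb the change in world size means the optimization hyperparameters do not need to be retuned when moving between one GPU and several.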

## 6. Datasets Used in GPT-2, GPT-3, FineWeb (EDU)

The quality and composition of training data significantly impact model performance. This section explores datasets used for training large language models.

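### WebText and OpenWebText (GPT-2)
- GPT-2 was trained on WebText, which was not released; OpenWebText is an open recreation of it
- Built from pages linked in Reddit posts with at least 3 karma
- ~8 million documents with ~40GB of text

### Common Crawl, The Pile, and C4 (GPT-3 era)
- GPT-3 was trained mostly on a filtered version of Common Crawl, supplemented by WebText2, two book corpora, and English Wikipedia
- The Pile: 825GB of diverse English text from 22 sources
- C4 (Colossal Clean Crawled Corpus): roughly 750GB of filtered Common Crawl text, about 156 billion tokens

### FineWeb EDU
- Subset of the FineWeb Common Crawl dataset, selected by a classifier that scores pages for educational value
- Aims for higher-quality, more informative text than a general web crawl
- build-nanogpt trains on its 10-billion-token sample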

```python
# Example dataset loading code
train_data = np.memmap(f'{data_dir}/train.bin', dtype=np.uint16, mode='r')
val_data = np.memmap(f'{data_dir}/val.bin', dtype=np.uint16, mode='r')

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i+block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i+1:i+1+block_size].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)
```
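The key detail in `get_batch` is that the targets are the inputs shifted by one token. The plain-Python sketch below shows the slicing on a short list of stand-in token ids, which are made up for illustration.

```python
data = list(range(100, 110))   # ten stand-in token ids: 100 .. 109
block_size = 4

i = 3                                   # a sampled start position
x = data[i : i + block_size]            # model inputs
y = data[i + 1 : i + 1 + block_size]    # next-token targets, shifted by one

# Largest start index that torch.randint(len(data) - block_size, ...) can return
max_start = len(data) - block_size - 1
last_y = data[max_start + 1 : max_start + 1 + block_size]

print(x, y, last_y)
```

Each position in `y` holds the token that follows the corresponding position in `x`, and the upper bound used for the random start index leaves exactly enough room for the shifted target slice.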

## 7. Validation Data Split, Validation Loss, Sampling Revive

Proper validation procedures are essential for monitoring training and avoiding overfitting.

### Validation Data Split

```python
# Creating validation split
n = len(data)
train_data = data[:int(n*0.9)]  # 90% for training
val_data = data[int(n*0.9):]    # 10% for validation
```

### Validation Loss Monitoring

```python
# Evaluate on validation set periodically
if iter_num % eval_interval == 0:
    model.eval()
    losses = []
    for _ in range(eval_iters):
        with torch.no_grad():
            X, Y = get_batch('val')
            logits, loss = model(X, Y)
            losses.append(loss.item())
    val_loss = torch.tensor(losses).mean().item()
    model.train()
    
    # Log metrics (train_loss is assumed to be tracked by the training loop)
    print(f"Iteration {iter_num}: train loss {train_loss:.4f}, val loss {val_loss:.4f}")
    
    # Early stopping logic or learning rate adjustment based on val_loss
```

### Sampling During Training (Model Output Preview)

```python
# Generate sample outputs during training to monitor quality
if iter_num % sample_interval == 0:
    model.eval()
    context = "Once upon a time"
    encoded = tokenizer.encode(context)
    x = torch.tensor([encoded], dtype=torch.long, device=device)
    
    # Generate sample text
    with torch.no_grad():
        y = model.generate(x, max_new_tokens=100, temperature=0.8)[0]
        
    decoded = tokenizer.decode(y.tolist())
    print(f"\nSample at iteration {iter_num}:\n{decoded}\n")
    model.train()
```

### Benefits:
- Validation loss provides a signal for overfitting
- Sample generation reveals model capabilities during training
- Helps in early stopping or hyperparameter adjustment

## 8. Evaluation: HellaSwag

Beyond validation loss, evaluating models on standardized benchmarks provides a more comprehensive understanding of capabilities.

### HellaSwag Benchmark

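HellaSwag is a commonsense sentence-completion benchmark: each item gives a context and several candidate endings, and the model must pick the most plausible one. A language model can be evaluated on it without fine-tuning by scoring how likely it finds each ending.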

```python
# Example HellaSwag evaluation code
def evaluate_hellaswag(model, tokenizer):
    model.eval()
    correct = 0
    total = 0
    
    for item in hellaswag_dataset:
        # Get context and candidate answers
        context = item['context']
        candidates = item['candidates']
        label = item['label']
        
        # Score each candidate
        scores = []
        for candidate in candidates:
            # Calculate log probability of candidate given context
            score = score_text(model, tokenizer, context, candidate)
            scores.append(score)
            
        # Select highest scoring candidate
        pred = scores.index(max(scores))
        
        # Check if prediction is correct
        if pred == label:
            correct += 1
        total += 1
    
    return correct / total
```
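The `score_text` helper is left undefined above. One common choice is the average log-probability of the ending's tokens given the context, which avoids favoring short endings. The sketch below shows only that selection step in plain Python; the per-token log-probabilities are invented numbers that would, in practice, come from the model's output at the ending's positions.

```python
def pick_ending(candidate_token_logprobs, normalize=True):
    """Return the index of the best ending by summed or length-normalized log-probability."""
    scores = []
    for token_logprobs in candidate_token_logprobs:
        total = sum(token_logprobs)
        scores.append(total / len(token_logprobs) if normalize else total)
    return scores.index(max(scores))

# Hypothetical per-token log-probabilities for four endings of different lengths
candidates = [
    [-1.0, -1.0],                       # short ending
    [-0.5, -0.5, -0.5, -0.5, -0.5],     # long ending, each token fairly likely
    [-2.0, -2.0, -2.0],
    [-3.0],
]

pred_sum = pick_ending(candidates, normalize=False)
pred_norm = pick_ending(candidates, normalize=True)
print(pred_sum, pred_norm)
```

With summed log-probabilities the short first ending wins simply because it has fewer tokens, while the length-normalized score prefers the second ending, whose tokens are individually more likely.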

### Benefits:
- Provides standardized, task-specific evaluation
- Allows comparison with other models in the field
- Tests specific reasoning capabilities beyond simple perplexity
- Helps identify model strengths and weaknesses

## Implementing a Complete Training Pipeline

Below is a simplified implementation that incorporates these hyperparameter optimizations into a complete training pipeline. It assumes that `TransformerModel`, `get_batch`, `estimate_loss`, `generate_sample`, `evaluate_hellaswag`, and `tokenizer` are defined elsewhere.

```python
import os
import math
import time
import numpy as np
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Hyperparameters
learning_rate = 6e-4
min_lr = 6e-5
weight_decay = 0.1
max_iters = 600000
warmup_iters = 2000
lr_decay_iters = 500000
eval_interval = 1000
eval_iters = 200
grad_clip = 1.0
grad_accum_steps = 8
batch_size = 12

# Set up distributed training if available
ddp = int(os.environ.get('RANK', -1)) != -1
if ddp:
    dist.init_process_group(backend='nccl')
    ddp_rank = int(os.environ['RANK'])
    ddp_local_rank = int(os.environ['LOCAL_RANK'])
    ddp_world_size = int(os.environ['WORLD_SIZE'])
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0
else:
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    master_process = True

device_type = 'cuda' if 'cuda' in device else 'cpu'

# Create model (simplified example)
model = TransformerModel(vocab_size=50304).to(device)

# Configure optimizer
# Separate weight decay parameters
decay_params = [p for n, p in model.named_parameters() if p.dim() >= 2]
nodecay_params = [p for n, p in model.named_parameters() if p.dim() < 2]
optim_groups = [
    {"params": decay_params, "weight_decay": weight_decay},
    {"params": nodecay_params, "weight_decay": 0.0}
]

# Use fused AdamW from Apex if available
try:
    from apex.optimizers import FusedAdam
    optimizer = FusedAdam(optim_groups, lr=learning_rate, betas=(0.9, 0.95))
    print("Using FusedAdam")
except ImportError:
    optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=(0.9, 0.95))
    print("Using torch.optim.AdamW")

# Mixed precision setup: bfloat16 on CPU or on GPUs that support it, else float16
dtype = 'bfloat16' if device_type == 'cpu' or torch.cuda.is_bf16_supported() else 'float16'
ctx = torch.amp.autocast(device_type=device_type, dtype=getattr(torch, dtype))
# Loss scaling is only needed for float16
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

# Wrap model for DDP
if ddp:
    model = DDP(model, device_ids=[ddp_local_rank])

# Torch compile if available (PyTorch 2.0+)
if hasattr(torch, 'compile'):
    model = torch.compile(model)

# Set up learning rate scheduler function
def get_lr(it):
    # Linear warmup followed by cosine decay
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

# Training loop
iter_num = 0
best_val_loss = float('inf')

# Main loop
model.train()
while iter_num < max_iters:

    # Update learning rate according to schedule
    lr = get_lr(iter_num)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # Gradient accumulation loop
    for micro_step in range(grad_accum_steps):
        # Get batch data (get_batch reads the global batch_size)
        X, Y = get_batch('train')

        # Forward pass with mixed precision
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / grad_accum_steps  # Scale for accumulation

        # Backward pass with gradient scaling
        scaler.scale(loss).backward()

    # Update weights after accumulation
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)

    # Evaluation (master process only)
    if iter_num % eval_interval == 0 and master_process:
        model.eval()
        val_loss = estimate_loss(model, 'val', eval_iters)
        print(f"Step {iter_num}: val loss {val_loss:.4f}, lr {lr:.6f}")

        # Generate sample
        if iter_num > 0:
            sample_text = generate_sample(model, tokenizer, "Once upon a time")
            print(f"\nSample:\n{sample_text}\n")

        # Save checkpoint if best
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            if iter_num > 0:
                checkpoint = {
                    'model': model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                }
                torch.save(checkpoint, "best_model.pt")
                print(f"Saved new best model with val_loss {best_val_loss:.4f}")

        model.train()

    iter_num += 1

# Evaluate on HellaSwag benchmark after training
if master_process:
    hellaswag_score = evaluate_hellaswag(model, tokenizer)
    print(f"HellaSwag score: {hellaswag_score:.4f}")

# Shut down the process group cleanly
if ddp:
    dist.destroy_process_group()
```

## Conclusion and Key Takeaways

This notebook covered essential hyperparameter optimizations for training large language models effectively. Key takeaways include:

1. **Optimizer configuration matters**:
   - AdamW with proper weight decay separation
   - Gradient clipping to prevent instability
   - FusedAdamW for performance optimization

2. **Learning rate scheduling is crucial**:
   - Warmup period followed by cosine decay
   - Finding the right minimum and maximum learning rates

3. **Efficient training strategies**:
   - Gradient accumulation for larger effective batch sizes
   - Distributed Data Parallel for multi-GPU training
   - Selective weight decay for better generalization

4. **Data and evaluation**:
   - High-quality datasets are essential (FineWeb EDU vs standard web crawls)
   - Proper validation splits and monitoring
   - Task-specific evaluation with benchmarks like HellaSwag

These optimizations go beyond simply making training faster: they improve model quality, stability, and final performance on downstream tasks. By implementing these techniques, you can train more effective language models with the same compute budget.