![Thinkube AI Lab](../icons/tk_full_logo.svg)

# Distributed Training with DDP ⚡

Scale training across multiple GPUs:
- DistributedDataParallel basics
- Multi-GPU setup
- Distributed training loop
- Synchronization and communication
- Performance optimization

## Why Distributed Training?

Benefits of multi-GPU training:

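- **Faster Training**: Near-linear speedup in the best case; communication overhead usually keeps it somewhat below linear
- **Larger Batches**: Each GPU processes its own mini-batch, so the effective batch size is the per-GPU batch size times the number of GPUs
- **Bigger Models**: Not with DDP alone, which keeps a full model replica on every GPU; models that don't fit on one GPU need model parallelism or FSDP
- **Efficiency**: Keeps every available GPU busy instead of leaving devices idle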

PyTorch recommends DistributedDataParallel (DDP) over DataParallel for multi-GPU data-parallel training, even on a single machine.

## Setup Distributed Environment

In [None]:
# Initialize distributed training
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import os

# TODO: Setup environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT)
# TODO: Initialize process group
# TODO: Set device for this process
# TODO: Display rank and world size
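
One possible way to fill in the setup cell is sketched below. It assumes the process is started by a launcher such as `torchrun`, which sets `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR` and `MASTER_PORT`. The single-process defaults, the CPU/`gloo` fallback, and the helper names `find_free_port` and `setup_distributed` are assumptions added so the cell also runs in a plain notebook.

```python
import os
import socket

import torch
import torch.distributed as dist


def find_free_port() -> int:
    """Ask the OS for an unused TCP port (only used by the one-process fallback)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def setup_distributed():
    """Initialize the default process group from environment variables."""
    # A launcher normally sets these; the defaults below describe a
    # single process on the local machine (an assumption for demos).
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    os.environ.setdefault("LOCAL_RANK", "0")
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", str(find_free_port()))

    use_cuda = torch.cuda.is_available()
    backend = "nccl" if use_cuda else "gloo"
    # With no init_method given, the env:// method reads the variables above.
    dist.init_process_group(backend=backend)

    local_rank = int(os.environ["LOCAL_RANK"])
    if use_cuda:
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cpu")
    return dist.get_rank(), dist.get_world_size(), device


rank, world_size, device = setup_distributed()
print(f"rank {rank} of {world_size} using {device}")
```

Each process runs this same code; only the environment variables differ between ranks.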

## Distributed Data Loading

Each process (one per GPU) reads a different shard of the dataset:

In [None]:
# Setup distributed data loader
from torch.utils.data.distributed import DistributedSampler

# TODO: Load dataset
# TODO: Create DistributedSampler
# TODO: Create DataLoader with sampler
# TODO: Ensure different data on each GPU
# TODO: Display data distribution
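
The sharding behaviour can be checked without any GPUs. The sketch below simulates four ranks in one process by passing `num_replicas` and `rank` explicitly; the toy dataset and the simulated world size are assumptions. In a real DDP job both arguments would normally be omitted so the sampler reads them from the process group.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 16 samples (placeholder for the real dataset).
dataset = TensorDataset(torch.arange(16, dtype=torch.float32).unsqueeze(1))
simulated_world_size = 4  # assumption: pretend there are 4 GPUs

shards = []
for simulated_rank in range(simulated_world_size):
    sampler = DistributedSampler(
        dataset,
        num_replicas=simulated_world_size,
        rank=simulated_rank,
        shuffle=True,
        seed=0,
    )
    sampler.set_epoch(0)  # all ranks must agree on the epoch
    shards.append(sorted(sampler))  # dataset indices assigned to this rank
    print(f"rank {simulated_rank}: indices {shards[-1]}")

# The DataLoader takes the sampler instead of shuffle=True.
loader = DataLoader(dataset, batch_size=2, sampler=sampler)
num_batches = len(loader)
```

Because 16 divides evenly by 4, every rank gets 4 distinct indices. When the dataset size is not divisible by the world size, the sampler pads with repeated samples unless `drop_last=True` is set.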

## Wrap Model with DDP

Wrap the model so that gradients are averaged across processes during the backward pass:

In [None]:
# Create DDP model

# TODO: Define model
# TODO: Move model to local GPU
# TODO: Wrap with DDP
# TODO: Define optimizer on DDP model
# TODO: Display model structure
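
A minimal sketch of the wrapping step follows. The small `nn.Sequential` network is a placeholder, and the one-process `gloo` group created when none exists is an assumption that lets the cell run without a launcher; in a real job the group from the setup cell would already be initialized.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Demo-only fallback: a one-process group backed by a temporary file.
if not dist.is_initialized():
    store_file = os.path.join(tempfile.mkdtemp(), "store")
    dist.init_process_group(
        "gloo", init_method=f"file://{store_file}", rank=0, world_size=1
    )

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
on_gpu = dist.get_backend() == "nccl"
device = torch.device("cuda", local_rank) if on_gpu else torch.device("cpu")

# Placeholder model; move it to the local device before wrapping.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2)).to(device)
ddp_model = DDP(model, device_ids=[local_rank] if on_gpu else None)

# Build the optimizer from the wrapped model's parameters.
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
print(ddp_model)
```

The original network stays reachable as `ddp_model.module`, which matters later when saving checkpoints.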

## Distributed Training Loop

Train across all GPUs:

In [None]:
# DDP training loop

# TODO: Loop through epochs
# TODO: Set sampler epoch for shuffling
# TODO: Training step on local GPU
# TODO: Gradient synchronization happens automatically
# TODO: Log metrics only on rank 0
# TODO: Synchronize metrics across GPUs with all_reduce
# TODO: Display training progress
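
The loop below is one way the steps above might fit together. The synthetic regression data, the linear model, and the hyperparameters are assumptions chosen so the cell finishes quickly on CPU; the one-process `gloo` fallback is likewise only for demonstration.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

if not dist.is_initialized():  # demo-only one-process fallback
    store_file = os.path.join(tempfile.mkdtemp(), "store")
    dist.init_process_group(
        "gloo", init_method=f"file://{store_file}", rank=0, world_size=1
    )

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
on_gpu = dist.get_backend() == "nccl"
device = torch.device("cuda", local_rank) if on_gpu else torch.device("cpu")

# Synthetic data: the target is the sum of the features.
torch.manual_seed(0)
features = torch.randn(64, 10)
targets = features.sum(dim=1, keepdim=True)
dataset = TensorDataset(features, targets)
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

model = DDP(nn.Linear(10, 1).to(device), device_ids=[local_rank] if on_gpu else None)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

epoch_losses = []
for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    totals = torch.zeros(2, device=device)  # [summed loss, sample count]
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # DDP all-reduces gradients during backward
        optimizer.step()
        totals[0] += loss.item() * x.size(0)
        totals[1] += x.size(0)
    # Combine the per-rank sums so every rank sees the global mean loss.
    dist.all_reduce(totals, op=dist.ReduceOp.SUM)
    epoch_losses.append((totals[0] / totals[1]).item())
    if dist.get_rank() == 0:  # log once, not once per process
        print(f"epoch {epoch}: mean loss {epoch_losses[-1]:.4f}")
```

Reducing the summed loss and the sample count separately gives a correct global mean even when ranks see different numbers of samples.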

## Synchronization

Coordinate between processes with collective operations:

In [None]:
# Synchronization operations

# TODO: Use dist.barrier() to synchronize all processes
# TODO: Use dist.all_reduce() to sum metrics
# TODO: Use dist.broadcast() to share data
# TODO: Demonstrate gradient synchronization
# TODO: Show how DDP handles it automatically
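
The three collectives named above can be tried in isolation. In this sketch each rank contributes `rank + 1` to the sum, and rank 0 broadcasts two made-up settings; those values and the one-process `gloo` fallback are assumptions for illustration.

```python
import os
import tempfile

import torch
import torch.distributed as dist

if not dist.is_initialized():  # demo-only one-process fallback
    store_file = os.path.join(tempfile.mkdtemp(), "store")
    dist.init_process_group(
        "gloo", init_method=f"file://{store_file}", rank=0, world_size=1
    )

rank, world_size = dist.get_rank(), dist.get_world_size()
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
on_gpu = dist.get_backend() == "nccl"
device = torch.device("cuda", local_rank) if on_gpu else torch.device("cpu")

# barrier: every rank waits here until all ranks have arrived.
dist.barrier()

# all_reduce: each rank contributes rank + 1; afterwards all ranks hold the sum.
metric = torch.tensor([float(rank + 1)], device=device)
dist.all_reduce(metric, op=dist.ReduceOp.SUM)
expected_sum = world_size * (world_size + 1) / 2

# broadcast: rank 0 owns the values, the other ranks receive a copy in place.
if rank == 0:
    settings = torch.tensor([0.001, 32.0], device=device)
else:
    settings = torch.zeros(2, device=device)
dist.broadcast(settings, src=0)

if rank == 0:
    print(f"sum over ranks: {metric.item()} (expected {expected_sum})")
    print(f"broadcast settings: {settings.tolist()}")
```

Every rank must call the same collectives in the same order. DDP issues its own gradient all-reduce during `backward()`, so manual collectives are only needed for extra values such as metrics.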

## Save and Load Checkpoints

Handle checkpointing in a distributed setting:

In [None]:
# Distributed checkpointing

# TODO: Save only on rank 0
# TODO: Save model.module.state_dict() (unwrap DDP)
# TODO: Add barrier before/after saving
# TODO: Load checkpoint on all ranks
# TODO: Map to correct device
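
A possible shape for the checkpoint helpers is sketched here. The function names `save_checkpoint` and `load_checkpoint`, the dictionary keys, and the temporary file path are hypothetical. The helpers skip the distributed calls when no process group exists, so the cell also runs in a single process.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def _dist_ready() -> bool:
    return dist.is_available() and dist.is_initialized()


def _unwrap(model: nn.Module) -> nn.Module:
    """Return the underlying module so saved keys have no 'module.' prefix."""
    return model.module if isinstance(model, DDP) else model


def save_checkpoint(model, optimizer, path):
    rank = dist.get_rank() if _dist_ready() else 0
    if rank == 0:  # only one process writes the file
        state = {
            "model": _unwrap(model).state_dict(),
            "optimizer": optimizer.state_dict(),
        }
        torch.save(state, path)
    if _dist_ready():
        dist.barrier()  # other ranks wait until the file is complete


def load_checkpoint(model, optimizer, path, device):
    # map_location places tensors on this rank's device, not the saver's.
    state = torch.load(path, map_location=device)
    _unwrap(model).load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])


# Round trip with a placeholder model.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
save_checkpoint(model, optimizer, checkpoint_path)

restored = nn.Linear(4, 2)
restored_optimizer = torch.optim.SGD(restored.parameters(), lr=0.1)
load_checkpoint(restored, restored_optimizer, checkpoint_path, torch.device("cpu"))
print("weights match:", torch.equal(restored.weight, model.weight))
```

Saving the unwrapped `state_dict` keeps the checkpoint loadable by a plain, non-DDP model for inference.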

## Performance Comparison

Measure speedup:

In [None]:
# Benchmark single vs multi-GPU

# TODO: Time single GPU training
# TODO: Time multi-GPU training
# TODO: Calculate speedup
# TODO: Measure throughput (samples/sec)
# TODO: Display comparison chart
# TODO: Analyze efficiency (linear scaling?)
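
Once both runs are timed, the comparison reduces to a few ratios. The helper `scaling_report` below is a hypothetical name, and the timings passed to it are made-up placeholders, not measurements. When timing CUDA code, calling `torch.cuda.synchronize()` before reading the clock avoids measuring only the kernel launch.

```python
def scaling_report(single_gpu_seconds, multi_gpu_seconds, num_gpus, num_samples):
    """Summarize how well a multi-GPU run scales against a single-GPU run."""
    speedup = single_gpu_seconds / multi_gpu_seconds
    return {
        "speedup": speedup,
        # 1.0 means perfectly linear scaling; lower means overhead.
        "efficiency": speedup / num_gpus,
        "single_gpu_throughput": num_samples / single_gpu_seconds,
        "multi_gpu_throughput": num_samples / multi_gpu_seconds,
    }


# Placeholder numbers for illustration only.
report = scaling_report(
    single_gpu_seconds=120.0,
    multi_gpu_seconds=40.0,
    num_gpus=4,
    num_samples=60000,
)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

With these placeholder inputs, 4 GPUs give a 3x speedup, which is 75% scaling efficiency.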

## Debugging DDP

Common issues and solutions:

In [None]:
# Debugging tips

# TODO: Check for unused parameters
# TODO: Verify gradient synchronization
# TODO: Monitor communication overhead
# TODO: Check for hangs (a barrier or collective that not every rank reaches)
# TODO: Validate data distribution
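
Unused parameters can be found on a plain model before it is wrapped in DDP. The sketch below runs one backward pass and lists parameters whose gradient is still `None`; the class `TwoHeadNet` and the helper `find_unused_parameters` are hypothetical names made up for this illustration.

```python
import torch
import torch.nn as nn


class TwoHeadNet(nn.Module):
    """Toy model with one head that the forward pass never calls."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.used_head = nn.Linear(8, 1)
        self.unused_head = nn.Linear(8, 1)

    def forward(self, x):
        return self.used_head(torch.relu(self.backbone(x)))


def find_unused_parameters(model, loss):
    """Return names of trainable parameters that received no gradient."""
    model.zero_grad(set_to_none=True)
    loss.backward()
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


model = TwoHeadNet()
loss = model(torch.randn(4, 8)).sum()
unused = find_unused_parameters(model, loss)
print("parameters without gradients:", unused)
```

If the list is non-empty, either remove the dead branch or pass `find_unused_parameters=True` to DDP. Setting the environment variable `TORCH_DISTRIBUTED_DEBUG=DETAIL` before launch makes PyTorch emit additional diagnostics for such problems.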

## Clean Up

In [None]:
# Cleanup distributed resources

# TODO: Destroy process group
# TODO: Clear CUDA cache
# TODO: Display cleanup status
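
A small teardown helper might look like the following. The name `cleanup_distributed` is hypothetical, and the guards are there so the cell is safe to run even when no process group or GPU exists.

```python
import gc

import torch
import torch.distributed as dist


def cleanup_distributed() -> bool:
    """Destroy the process group if one exists; return True if it did."""
    destroyed = False
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
        destroyed = True
    gc.collect()  # drop lingering Python references to tensors
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached blocks to the driver
    return destroyed


was_active = cleanup_distributed()
print("process group destroyed" if was_active else "no process group was active")
```

`torch.cuda.empty_cache()` only releases memory that PyTorch has cached but is no longer using; tensors that are still referenced stay allocated.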

## Best Practices

- ✅ Use DistributedSampler for data loading
- ✅ Save checkpoints only on rank 0
- ✅ Log metrics only on rank 0 or aggregate across ranks
- ✅ Use barriers to synchronize when needed
- ✅ Set find_unused_parameters=True only when some parameters receive no gradient, since it adds overhead
- ✅ Monitor GPU utilization on all devices
- ✅ Test with 1 GPU first, then scale up
- ✅ Use gradient accumulation if the per-GPU batch size is limited by memory
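
Gradient accumulation combines well with DDP's `no_sync()` context manager, which skips the gradient all-reduce on micro-batches that do not end in an optimizer step. The sketch below assumes 4 accumulation steps, random placeholder data, and a one-process `gloo` group for demonstration.

```python
import contextlib
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

if not dist.is_initialized():  # demo-only one-process fallback
    store_file = os.path.join(tempfile.mkdtemp(), "store")
    dist.init_process_group(
        "gloo", init_method=f"file://{store_file}", rank=0, world_size=1
    )

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
on_gpu = dist.get_backend() == "nccl"
device = torch.device("cuda", local_rank) if on_gpu else torch.device("cpu")

model = DDP(nn.Linear(10, 1).to(device), device_ids=[local_rank] if on_gpu else None)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

accumulation_steps = 4  # assumption: 4 micro-batches per optimizer step
micro_batches = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(8)]

update_count = 0
optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    x, y = x.to(device), y.to(device)
    is_update_step = (step + 1) % accumulation_steps == 0
    # Skip the all-reduce except on the micro-batch that triggers an update.
    sync_context = contextlib.nullcontext() if is_update_step else model.no_sync()
    with sync_context:
        loss = loss_fn(model(x), y) / accumulation_steps
        loss.backward()  # gradients accumulate in param.grad
    if is_update_step:
        optimizer.step()
        optimizer.zero_grad()
        update_count += 1

print(f"optimizer steps taken: {update_count}")
```

Dividing the loss by `accumulation_steps` keeps the accumulated gradient on the same scale as one large batch.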

## Next Steps

Continue with:
- **04-transformers-training.ipynb** - Train large transformer models
- **05-mlops-integration.ipynb** - Track distributed experiments