<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>


# Deep Learning Basics with PyTorch

**Dr. Yves J. Hilpisch with GPT-5**


# Chapter 15 — Training Large Models (DDP, AMP, Checkpointing)

This notebook provides small, runnable snippets that mirror the chapter’s concepts:

- Gradient accumulation (larger effective batch on limited memory)
- AMP (automatic mixed precision) demo when CUDA is available
- Checkpointing patterns (save/load state)
- DDP (Distributed Data Parallel) launch sketch (reference code)

These are minimal, CPU-friendly examples intended to build intuition — not full training runs.

In [None]:
import math, os, sys, time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

print(f'Python: {sys.version.split()[0]}')
print(f'Torch : {torch.__version__}')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device:', device)

## Gradient Accumulation (CPU-friendly demo)

Accumulate gradients over several micro-batches before each optimizer step. This simulates a larger batch size while peak memory stays at the level of a single micro-batch.

In [None]:
# Tiny synthetic regression problem
torch.manual_seed(0)  # reproducibility
N, D = 512, 16
X = torch.randn(N, D)  # inputs
true_w = torch.randn(D, 1)
y = X @ true_w + 0.1*torch.randn(N, 1)  # targets/labels
ds = TensorDataset(X, y)  # wrap tensors as a dataset
loader = DataLoader(ds, batch_size = 32, shuffle = True)  # create data loader

model = nn.Sequential(nn.Linear(D, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
opt = torch.optim.SGD(model.parameters(), lr = 0.1)  # optimizer setup / step
accum_steps = 4  # 32*4 = effective 128
loss_fn = nn.MSELoss()

model.train()
running = 0.0
opt.zero_grad(set_to_none = True)
for step, (xb, yb) in enumerate(loader, start = 1):
    xb, yb = xb.to(device), yb.to(device)
    pred = model(xb)  # forward pass / predictions
    loss = loss_fn(pred, yb) / accum_steps  # training objective
    loss.backward()
    running += loss.item()
    if step % accum_steps == 0:
        opt.step()
        opt.zero_grad(set_to_none = True)
        # report every accumulation to show progress
        print(f'step {step:3d} loss {running:.4f}')
        running = 0.0
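
The division by `accum_steps` in the loop above is what makes the summed micro-batch gradients match a single large-batch gradient. The sketch below checks this on a toy linear model. It assumes a mean-reduced loss and equally sized micro-batches; the helper `make_model` and the tensor shapes are illustrative and not part of the chapter's code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(128, 16)  # one "large" batch of 128 samples
y = torch.randn(128, 1)
loss_fn = nn.MSELoss()  # mean reduction (the default)

def make_model():
    # hypothetical helper: re-seed so both models start from identical weights
    torch.manual_seed(1)
    return nn.Linear(16, 1)

# Reference: a single backward pass over the full batch
full = make_model()
loss_fn(full(X), y).backward()

# Accumulated: four micro-batches of 32, each loss divided by accum_steps
accum_steps = 4
acc = make_model()
for xb, yb in zip(X.chunk(accum_steps), y.chunk(accum_steps)):
    (loss_fn(acc(xb), yb) / accum_steps).backward()  # grads add up in .grad

max_diff = (full.weight.grad - acc.weight.grad).abs().max().item()
print(f'max abs gradient difference: {max_diff:.2e}')
```

The two gradients should agree up to floating-point rounding. Layers whose behavior depends on batch statistics (for example batch normalization) would not be covered by this equivalence.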


## AMP (Automatic Mixed Precision)

Use `torch.amp.autocast` and `torch.amp.GradScaler` on CUDA to reduce memory use and speed up training. The demo is skipped when CUDA is not available.

In [None]:
import torch, torch.nn as nn

use_amp = torch.cuda.is_available()
if use_amp:
    scaler = torch.amp.GradScaler('cuda', enabled=True)
    model = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, 1)).to('cuda')
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    model.train()
    for xb, yb in loader:
        xb, yb = xb.cuda(), yb.cuda()
        opt.zero_grad(set_to_none=True)
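        with torch.amp.autocast('cuda', enabled=True):
            pred = model(xb)  # forward pass runs in mixed precision
            loss = nn.functional.mse_loss(pred, yb)
        # backward pass and optimizer step run outside the autocast context
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        break  # a single step is enough for the demo
    print('AMP step complete on CUDA')
else:
    print('CUDA not available; AMP demo skipped.')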
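
To build intuition for why `GradScaler` is part of the recipe, the CPU-only sketch below shows a small value underflowing to zero in `float16` and surviving once it is scaled up first. The value `1e-8` and the scale `2**16` are illustrative assumptions; the real scaler adjusts its scale dynamically during training.

```python
import torch

tiny_grad = torch.tensor(1e-8)  # float32 value too small for float16
as_fp16 = tiny_grad.to(torch.float16)  # underflows to zero

scale = 2.0 ** 16  # illustrative loss scale (assumption)
scaled_fp16 = (tiny_grad * scale).to(torch.float16)  # now representable
recovered = scaled_fp16.float() / scale  # unscale in float32

print('direct cast      :', as_fp16.item())
print('scaled, unscaled :', recovered.item())
```

Scaling the loss before `backward()` scales every gradient by the same factor, which is why the scaler has to unscale them before the optimizer step.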


## Checkpointing

Save and load model/optimizer state for resumable training.

In [None]:
ckpt_path = 'checkpoint_ch15_demo.pt'
state = {
    'model': model.state_dict(),
    'optimizer': opt.state_dict(),
    'meta': {'epoch': 1, 'accum_steps': accum_steps},
}
torch.save(state, ckpt_path)
print('Saved checkpoint to', ckpt_path)

# restore
loaded = torch.load(ckpt_path, map_location = device)
model.load_state_dict(loaded['model'])
opt.load_state_dict(loaded['optimizer'])
print('Restored epoch:', loaded['meta']['epoch'])
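
For resumable training, the save/load pattern above is often wrapped in two small helpers so that a run can pick up at the right epoch. The following is a minimal sketch under stated assumptions: `save_checkpoint`, `load_checkpoint`, and the file name `resume_demo.pt` are hypothetical, and writing to a temporary file followed by `os.replace` is one common way to avoid leaving a half-written checkpoint behind.

```python
import os
import tempfile
import torch
import torch.nn as nn

def save_checkpoint(path, model, opt, epoch):
    # write to a temporary file first, then swap it into place
    tmp = path + '.tmp'
    torch.save({'model': model.state_dict(),
                'optimizer': opt.state_dict(),
                'meta': {'epoch': epoch}}, tmp)
    os.replace(tmp, path)

def load_checkpoint(path, model, opt):
    # return the epoch to start from (0 if no checkpoint exists yet)
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location='cpu')
    model.load_state_dict(state['model'])
    opt.load_state_dict(state['optimizer'])
    return state['meta']['epoch'] + 1

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
path = os.path.join(tempfile.mkdtemp(), 'resume_demo.pt')

start = load_checkpoint(path, model, opt)  # fresh run: no file yet
for epoch in range(start, 2):
    # ... one epoch of training would go here ...
    save_checkpoint(path, model, opt, epoch)
resumed = load_checkpoint(path, model, opt)  # epoch a restarted run would begin at
print('first start:', start, '| resume at:', resumed)
```

When AMP is in use, the scaler's `state_dict()` can be stored in the same dictionary so that the loss scale is restored as well.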


## DDP Launch Sketch (reference)

Full DDP runs are typically launched via the CLI using `torchrun` (or `python -m torch.distributed.run`). Below is a minimal training script outline. Save as `train_ddp.py` and launch with:

```bash
torchrun --standalone --nproc_per_node=2 train_ddp.py
```
This cell shows the code for reference (not executed here).

In [None]:
ddp_code = r'''
import os, torch, torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group('nccl' if torch.cuda.is_available() else 'gloo')
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count() if torch.cuda.is_available() else 'cpu'
    if torch.cuda.is_available(): torch.cuda.set_device(device)

    # Toy data
    X = torch.randn(1024, 16)  # inputs
    y = torch.randn(1024, 1)  # targets/labels
    ds = TensorDataset(X, y)  # wrap tensors as a dataset
    sampler = DistributedSampler(ds,  # shard dataset across ranks
        shuffle = True) if dist.is_initialized() else None
    loader = DataLoader(ds, batch_size = 64, sampler = sampler,  # create data loader
        shuffle = (sampler is None))  # the sampler shuffles when present

    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
    if torch.cuda.is_available(): model = model.cuda(device)
    model = DDP(model, device_ids = [device] if torch.cuda.is_available() else None)

    opt = torch.optim.AdamW(model.parameters(), lr = 1e-3)  # optimizer setup / step
    loss_fn = nn.MSELoss()

    for epoch in range(2):
        if sampler is not None: sampler.set_epoch(epoch)
        for xb, yb in loader:
            if torch.cuda.is_available():
                xb, yb = xb.cuda(device), yb.cuda(device)
            opt.zero_grad(set_to_none = True)
            loss = loss_fn(model(xb), yb)  # training objective
            loss.backward()
            opt.step()
        if rank == 0:
            print('epoch', epoch)
    dist.destroy_process_group()

if __name__ == '__main__':
    main()
'''
print(ddp_code)
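
The sharding done by `DistributedSampler` in the script above can be inspected in a single process by passing `num_replicas` and `rank` explicitly, so no process group is needed. This is a small sketch for intuition only; the ten-element dataset and the two simulated ranks are assumptions chosen to keep the output readable.

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

ds = TensorDataset(torch.arange(10))  # toy dataset with indices 0..9

# Simulate two ranks; shuffle=False keeps the assignment easy to read
shards = [
    list(DistributedSampler(ds, num_replicas=2, rank=r, shuffle=False))
    for r in range(2)
]
for r, idx in enumerate(shards):
    print(f'rank {r} sees indices {idx}')
```

Each rank receives a disjoint subset and together the subsets cover the dataset. With `shuffle = True`, calling `sampler.set_epoch(epoch)` as in the script changes the permutation from one epoch to the next.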


