<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>


# Deep Learning Basics with PyTorch

**Dr. Yves J. Hilpisch with GPT-5**


# Chapter 15 — Training Large Models (DDP, AMP, Checkpointing)

This notebook provides small, runnable snippets that mirror the chapter’s concepts:

- Gradient accumulation (larger effective batch on limited memory)
- AMP (automatic mixed precision) demo when CUDA is available
- Checkpointing patterns (save/load state)
- DDP (Distributed Data Parallel) launch sketch (reference code)

These are minimal, CPU-friendly examples intended to build intuition — not full training runs.

## Overview

This notebook is a concise, hands-on companion to Chapter 15 of Deep Learning Basics with PyTorch.
Run each cell, read the short notes,
and try small variations to build intuition.

Tips:
- Run cells top to bottom; restart kernel if state gets confusing.
- Prefer small, fast experiments; iterate quickly and observe outputs.
- Keep an eye on shapes, dtypes, and devices when using PyTorch.


### Inspect environment and device

Import the libraries we need and report the interpreter and accelerator availability.

In [None]:
import math
import os
import sys
import time

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

print(f'Python: {sys.version.split()[0]}')  # show interpreter version
print(f'Torch : {torch.__version__}')  # show torch version
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # prefer GPU when available
print('Device:', device)  # confirm compute target

## Gradient Accumulation (CPU-friendly demo)

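Accumulate gradients over several micro-batches before each optimizer step. This emulates a larger batch size while peak memory stays at the level of a single micro-batch. With `batch_size=32` and `accum_steps=4`, as used below, each update sees an effective batch of 128 samples.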

### Simulate gradient accumulation

Train a tiny regression model with micro-batch accumulation to emulate a larger batch size.

In [None]:
torch.manual_seed(0)  # reproducibility
N, D = 512, 16  # dataset size and feature width
X = torch.randn(N, D)  # inputs
true_w = torch.randn(D, 1)  # ground-truth weights
y = X @ true_w + 0.1 * torch.randn(N, 1)  # noisy labels
ds = TensorDataset(X, y)  # wrap tensors as a dataset
loader = DataLoader(ds, batch_size=32, shuffle=True)  # iterate mini-batches

model = nn.Sequential(nn.Linear(D, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)  # tiny regression network
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # SGD optimizer
accum_steps = 4  # number of micro-batches per update
loss_fn = nn.MSELoss()  # mean squared error loss

model.train()  # enable training mode
running = 0.0  # track scaled loss
opt.zero_grad(set_to_none=True)  # reset gradients before loop
for step, (xb, yb) in enumerate(loader, start=1):
    xb, yb = xb.to(device), yb.to(device)  # move data to device
    pred = model(xb)  # forward pass
    loss = loss_fn(pred, yb) / accum_steps  # scale loss for accumulation
    loss.backward()  # backprop scaled loss
    running += loss.item()  # accumulate scalar loss
    if step % accum_steps == 0:
        opt.step()  # apply gradients
        opt.zero_grad(set_to_none=True)  # clear gradients for next cycle
        print(f'step {step:3d} loss {running:.4f}')  # report running loss
        running = 0.0  # reset accumulator
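
The division by `accum_steps` in the loop above can be checked numerically. The following is a minimal sketch, assuming equally sized micro-batches and a mean-reduced loss: it compares the gradient from one full-batch backward pass with the gradient accumulated over four micro-batches. The names `toy_model`, `toy_X`, and `toy_y` are illustrative and chosen so that they do not overwrite the notebook's `model`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducibility
toy_X = torch.randn(32, 4)  # 32 samples, 4 features
toy_y = torch.randn(32, 1)  # random regression targets
toy_model = nn.Linear(4, 1)  # illustrative model, separate from `model`
toy_loss_fn = nn.MSELoss()  # mean-reduced loss

# Reference: a single backward pass over the full batch of 32 samples.
toy_model.zero_grad(set_to_none=True)
toy_loss_fn(toy_model(toy_X), toy_y).backward()
full_grad = toy_model.weight.grad.clone()

# Accumulation: four micro-batches of 8 samples, each loss divided by 4.
toy_accum_steps = 4
toy_model.zero_grad(set_to_none=True)
for xb, yb in zip(toy_X.chunk(toy_accum_steps), toy_y.chunk(toy_accum_steps)):
    (toy_loss_fn(toy_model(xb), yb) / toy_accum_steps).backward()  # grads add up in .grad
accum_grad = toy_model.weight.grad.clone()

grads_match = torch.allclose(full_grad, accum_grad, atol=1e-6)
print('gradients match:', grads_match)
```

If the last micro-batch is smaller than the others, the two gradients no longer agree exactly, because every micro-batch mean receives the same weight.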

## AMP (Automatic Mixed Precision)

Use `torch.amp.autocast` and `torch.amp.GradScaler` on CUDA to reduce memory use and speed up training. The demo is skipped on CPU.

### Demonstrate automatic mixed precision

Run a single automatic mixed-precision step when CUDA is available, or explain why it is skipped.

In [None]:
import torch
import torch.nn as nn

use_amp = torch.cuda.is_available()  # only run AMP on CUDA
if use_amp:
    scaler = torch.amp.GradScaler('cuda', enabled=True)  # manage scaled gradients
    model = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, 1)).to('cuda')  # beefier AMP model
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)  # optimizer for AMP example
    model.train()  # enable training mode
    for xb, yb in loader:
        xb, yb = xb.cuda(), yb.cuda()  # move batch to GPU
        opt.zero_grad(set_to_none=True)  # clear gradients
        with torch.amp.autocast('cuda', enabled=True):  # mixed-precision region
            pred = model(xb)  # forward pass
            loss = nn.functional.mse_loss(pred, yb)  # compute loss
        scaler.scale(loss).backward()  # backprop scaled loss
        scaler.step(opt)  # apply optimizer step
        scaler.update()  # adjust scaling factor
        break  # single AMP iteration for illustration
    message = 'AMP step complete on CUDA'  # success status
else:
    message = 'CUDA not available; AMP demo skipped.'  # fallback status

message  # display outcome
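
The two AMP building blocks can also be inspected without a GPU. The sketch below is an illustration under stated assumptions, not the CUDA code path: it uses CPU autocast with `bfloat16` to show that parameters stay in float32 while eligible operations run in lower precision, and it imitates loss scaling by hand to show why small float16 gradients need it. The scale factor `2.0 ** 16` matches the documented default `init_scale` of `GradScaler`; the variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducibility
layer = nn.Linear(16, 8)  # parameters are stored in float32
x = torch.randn(4, 16)  # float32 input

# autocast selects a lower-precision dtype for eligible ops such as linear layers.
with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
    out = layer(x)
print('param dtype :', layer.weight.dtype)
print('output dtype:', out.dtype)

# Loss scaling by hand: a gradient this small is lost when cast to float16 ...
tiny_grad = torch.tensor(1e-8)
underflowed = tiny_grad.to(torch.float16)

# ... but survives if it is multiplied by a scale factor before the cast.
scale = 2.0 ** 16
scaled = (tiny_grad * scale).to(torch.float16)
recovered = scaled.to(torch.float32) / scale  # unscale in float32

print('float16 without scaling:', underflowed.item())
print('recovered after scaling:', recovered.item())
```

`GradScaler` automates the same idea: it scales the loss before `backward()`, unscales the gradients before the optimizer step, and adjusts the factor when it detects overflow.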

## Checkpointing

Save and load model/optimizer state for resumable training.

### Save and reload checkpoints

Store the model and optimizer state, then reload to confirm the checkpoint workflow.

In [None]:
ckpt_path = 'checkpoint_ch15_demo.pt'  # checkpoint file location
state = {
    'model': model.state_dict(),
    'optimizer': opt.state_dict(),
    'meta': {'epoch': 1, 'accum_steps': accum_steps},
}
torch.save(state, ckpt_path)  # serialize training state
print('Saved checkpoint to', ckpt_path)  # confirm save

loaded = torch.load(ckpt_path, map_location=device)  # load checkpoint on current device
model.load_state_dict(loaded['model'])  # restore model weights
opt.load_state_dict(loaded['optimizer'])  # restore optimizer
print('Restored epoch:', loaded['meta']['epoch'])  # confirm metadata
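
For resumable training, the save and load steps are often wrapped in small helpers. The following is a minimal sketch of one possible pattern, not a fixed recipe: the helper names `save_checkpoint`, `load_checkpoint`, and `make_model_and_opt` are hypothetical, and writing to a temporary file before renaming is an added precaution that is not part of the cell above. The round trip restores the state into a freshly initialized model and optimizer and checks that the weights agree.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, epoch):
    """Write to a temporary file first, then rename it into place."""
    state = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'meta': {'epoch': epoch},
    }
    tmp_path = path + '.tmp'
    torch.save(state, tmp_path)
    os.replace(tmp_path, path)  # swap in the finished file in one step


def load_checkpoint(path, model, optimizer):
    """Restore model and optimizer in place; return the epoch to resume from."""
    state = torch.load(path, map_location='cpu')
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    return state['meta']['epoch'] + 1


def make_model_and_opt():
    """Build a tiny model with its optimizer (illustrative only)."""
    net = nn.Linear(4, 1)
    return net, torch.optim.AdamW(net.parameters(), lr=1e-3)


torch.manual_seed(0)
src_model, src_opt = make_model_and_opt()
src_model(torch.randn(8, 4)).sum().backward()
src_opt.step()  # one step so the optimizer has state worth saving

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.pt')
    save_checkpoint(demo_path, src_model, src_opt, epoch=3)
    dst_model, dst_opt = make_model_and_opt()  # fresh, differently initialized
    next_epoch = load_checkpoint(demo_path, dst_model, dst_opt)

weights_equal = all(
    torch.equal(a, b)
    for a, b in zip(src_model.state_dict().values(), dst_model.state_dict().values())
)
print('resume at epoch', next_epoch, '| weights equal:', weights_equal)
```

Restoring the optimizer state matters for optimizers such as AdamW, which keep per-parameter moment estimates; without it, training resumes with those statistics reset.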

## DDP Launch Sketch (reference)

Full DDP runs are typically launched via the CLI using `torchrun` (or `python -m torch.distributed.run`). Below is a minimal training script outline. Save as `train_ddp.py` and launch with:

```bash
torchrun --standalone --nproc_per_node=2 train_ddp.py
```
This cell shows the code for reference (not executed here).

### Provide a minimal DDP training script

Include a ready-to-run DistributedDataParallel example as a reference string.

In [None]:
ddp_code = r'''
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    dist.init_process_group('nccl' if torch.cuda.is_available() else 'gloo')
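    rank = dist.get_rank()  # global rank across all processes
    local_rank = int(os.environ.get('LOCAL_RANK', 0))  # set by torchrun: process index on this node
    device = local_rank if torch.cuda.is_available() else 'cpu'  # one GPU per local process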
    if torch.cuda.is_available():
        torch.cuda.set_device(device)

    # Toy data
    X = torch.randn(1024, 16)  # synthetic inputs
    y = torch.randn(1024, 1)  # synthetic targets
    ds = TensorDataset(X, y)  # wrap tensors as a dataset
    sampler = DistributedSampler(ds, shuffle=True) if dist.is_initialized() else None  # shard data across ranks
    loader = DataLoader(ds, batch_size=64, sampler=sampler, shuffle=(sampler is None))  # construct loader

    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
    if torch.cuda.is_available():
        model = model.cuda(device)
    model = DDP(model, device_ids=[device] if torch.cuda.is_available() else None)

    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)  # optimizer configuration
    loss_fn = nn.MSELoss()  # regression loss

    for epoch in range(2):
        if sampler is not None:
            sampler.set_epoch(epoch)
        for xb, yb in loader:
            if torch.cuda.is_available():
                xb, yb = xb.cuda(device), yb.cuda(device)
            opt.zero_grad(set_to_none=True)
            loss = loss_fn(model(xb), yb)  # forward and loss computation
            loss.backward()
            opt.step()
        if rank == 0:
            print('epoch', epoch)
    dist.destroy_process_group()


if __name__ == '__main__':
    main()
'''
print(ddp_code)  # show reference script
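
The data sharding done by `DistributedSampler` in the script can be inspected in a single process. The sketch below assumes two ranks and passes `num_replicas` and `rank` explicitly, so no process group is required; `toy_ds` and `shuffled_shards` are illustrative names that do not appear in the reference script.

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

toy_ds = TensorDataset(torch.arange(8).float().unsqueeze(1))  # 8 samples

# Explicit num_replicas and rank: each simulated rank receives its own shard.
shards = [
    list(DistributedSampler(toy_ds, num_replicas=2, rank=r, shuffle=False))
    for r in range(2)
]
print('rank 0 indices:', shards[0])
print('rank 1 indices:', shards[1])


def shuffled_shards(epoch):
    """Return the per-rank index lists for one epoch with shuffling enabled."""
    result = []
    for r in range(2):
        sampler = DistributedSampler(toy_ds, num_replicas=2, rank=r, shuffle=True, seed=0)
        sampler.set_epoch(epoch)  # epoch and seed together fix the permutation
        result.append(list(sampler))
    return result


epoch0, epoch1 = shuffled_shards(0), shuffled_shards(1)
print('epoch 0 shards:', epoch0)
print('epoch 1 shards:', epoch1)
```

Within an epoch the shards are disjoint and together cover the dataset. Calling `set_epoch` is what lets the shuffle order differ between epochs, which is why the training loop in the script calls it at the start of each epoch.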

## Exercises

1. Sketch a pseudo-DDP setup: outline the processes, seed management, and gradient synchronization.
2. Compare gradient accumulation vs. true data parallelism conceptually.


