<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>


# Deep Learning Basics with PyTorch

**Dr. Yves J. Hilpisch with GPT-5**


# Chapter 12 — Training at Scale

## Overview

This notebook is a concise, hands-on companion to Chapter 12 of Deep Learning Basics with PyTorch.
It covers a throughput check, mixed-precision training, gradient accumulation,
checkpointing, and learning-rate schedules. Run each cell, read the short notes,
and try small variations to build intuition.

Tips:
- Run cells top to bottom; restart the kernel if state gets confusing.
- Prefer small, fast experiments; iterate quickly and observe outputs.
- Keep an eye on shapes, dtypes, and devices when using PyTorch.


## Throughput quick check (toy)

Measure how fast a dense matrix multiply runs on the current device. This tiny
benchmark is not a substitute for full profiling, but it gives a rough estimate
of the raw compute throughput available, which is a useful sanity check before
launching a long training job.

In [None]:
import time  # timing utilities for throughput estimation
import torch  # tensor library powering PyTorch workloads

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # prefer GPU when available
torch.manual_seed(0)  # deterministic inputs for fair comparisons
matrix = torch.randn(4096, 4096, device=device)  # dense matrix used for matmul benchmark
iterations = 40  # number of matmul steps to average over

_ = matrix @ matrix  # warm-up run so one-time initialization cost is not timed
if device.type == 'cuda':  # flush outstanding GPU work before timing
    torch.cuda.synchronize()  # ensure GPU queue is empty before timing

start = time.perf_counter()  # start high-resolution timer
for _ in range(iterations):  # loop over matmul workload
    _ = matrix @ matrix  # square matrix multiply as throughput proxy
if device.type == 'cuda':  # ensure GPU kernels finish before stopping timer
    torch.cuda.synchronize()  # wait for GPU kernels to complete

elapsed = time.perf_counter() - start  # total elapsed wall-clock seconds
avg_time = elapsed / iterations  # average latency per matmul
ops = 2 * matrix.size(0) ** 3  # operation count for dense matmul (2*n^3)
gflops = ops / avg_time / 1e9  # billions of floating-point ops per second
print(f'device={device.type} avg_step={avg_time:.4f}s gflops≈{gflops:.1f}')  # report results

## AMP training step

Automatic mixed precision (AMP) helps GPUs sustain higher throughput by running
eligible operations in float16 while keeping numerically sensitive ones in
float32. The cell below performs a single optimization step with loss scaling
and falls back to standard FP32 on CPU-only environments.

In [None]:

import torch  # tensor operations
import torch.nn.functional as F  # neural network losses
from torch import nn  # module definitions

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # pick execution device
model = nn.Linear(128, 10).to(device)  # simple linear classifier for the demo
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # optimizer with modest learning rate
use_amp = device.type == 'cuda'  # AMP only runs when CUDA is available
scaler = torch.amp.GradScaler('cuda', enabled=use_amp) if use_amp else None  # gradient scaler for mixed precision

inputs = torch.randn(32, 128, device=device)  # mini-batch of features
targets = torch.randint(0, 10, (32,), device=device)  # integer class labels

optimizer.zero_grad(set_to_none=True)  # clear gradients before the step
if use_amp:  # mixed precision execution path
    with torch.amp.autocast('cuda'):  # run eligible ops in float16, keep sensitive ones in float32
        logits = model(inputs)  # forward pass under autocast
        loss = F.cross_entropy(logits, targets)  # loss computed under autocast
    scaler.scale(loss).backward()  # scale the loss so small gradients do not underflow in float16
    scaler.step(optimizer)  # unscale gradients, then step; the step is skipped on inf/NaN gradients
    scaler.update()  # adjust scaling factor for next iteration
else:  # CPU / non-CUDA fallback path
    logits = model(inputs)  # forward pass in fp32
    loss = F.cross_entropy(logits, targets)  # standard cross-entropy loss
    loss.backward()  # accumulate gradients
    optimizer.step()  # optimizer update without scaling

print(f"running_amp={use_amp} loss={loss.detach().item():.4f}")  # summarise step configuration


## Gradient accumulation

Accumulate gradients across several micro-batches to emulate a larger effective
batch size without exceeding device memory. The example also verifies that the
accumulated update matches a single large batch.
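
Why dividing each micro-batch loss by the number of accumulation steps reproduces the large-batch gradient can be sketched as follows. The symbols are introduced here for illustration only: $\ell_i$ is the per-sample loss, $K$ is `accum_steps`, $m$ is `micro_batch`, and $L_k$ is the mean loss of chunk $k$. The sketch assumes equal-sized chunks and the default mean reduction of `F.cross_entropy`.

$$
L_{\text{big}}
= \frac{1}{Km} \sum_{i=1}^{Km} \ell_i
= \sum_{k=1}^{K} \frac{1}{K} \left( \frac{1}{m} \sum_{i \in \text{chunk } k} \ell_i \right)
= \sum_{k=1}^{K} \frac{L_k}{K}
$$

Because differentiation is linear, $\nabla L_{\text{big}} = \sum_{k=1}^{K} \nabla \left( L_k / K \right)$, which is exactly what repeated `backward()` calls on the scaled chunk losses add up in `.grad`. Any remaining difference between the two updates should come from floating-point rounding only.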

In [None]:
import torch  # tensor library
import torch.nn.functional as F  # neural network losses
from torch import nn  # module definitions

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # prefer GPU for speed
micro_batch = 16  # size of each micro-batch
accum_steps = 4  # number of micro-batches per update

# Reference run with a single large batch
reference_model = nn.Linear(128, 10).to(device)  # model for large-batch baseline
init_state = {k: v.detach().clone() for k, v in reference_model.state_dict().items()}  # snapshot initial weights for a fair comparison
reference_opt = torch.optim.SGD(reference_model.parameters(), lr=1e-2)  # optimizer for baseline
reference_opt.zero_grad(set_to_none=True)  # clear gradients
big_inputs = torch.randn(micro_batch * accum_steps, 128, device=device)  # combined batch
big_targets = torch.randint(0, 10, (micro_batch * accum_steps,), device=device)  # combined labels
big_loss = F.cross_entropy(reference_model(big_inputs), big_targets)  # large-batch loss
big_loss.backward()  # compute gradients in one shot
reference_opt.step()  # apply update
reference_state = {k: v.detach().clone() for k, v in reference_model.state_dict().items()}  # snapshot weights

# Gradient accumulation variant
accum_model = nn.Linear(128, 10).to(device)  # second model for accumulation test
accum_model.load_state_dict(init_state)  # start from the same initial weights as the baseline
accum_opt = torch.optim.SGD(accum_model.parameters(), lr=1e-2)  # optimizer for accumulation
accum_opt.zero_grad(set_to_none=True)  # clear gradients
for inputs_chunk, targets_chunk in zip(big_inputs.chunk(accum_steps), big_targets.chunk(accum_steps)):  # iterate micro-batches
    loss_chunk = F.cross_entropy(accum_model(inputs_chunk), targets_chunk) / accum_steps  # scale loss per chunk
    loss_chunk.backward()  # accumulate scaled gradients
accum_opt.step()  # apply single update after accumulation
accum_state = {k: v.detach().clone() for k, v in accum_model.state_dict().items()}  # snapshot accumulated weights

max_diff = max((accum_state[key] - reference_state[key]).abs().max().item() for key in accum_state)  # compute max parameter delta
print(f'eff_batch={micro_batch * accum_steps} max_param_diff={max_diff:.2e}')  # report closeness of updates

## Checkpoint save/load (toy)

Persist the model, optimizer, and bookkeeping so training can resume cleanly
after an interruption. Always record the epoch/step alongside the state dicts.
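
A minimal sketch of such a checkpoint round trip is shown below. The dictionary keys (`epoch`, `global_step`, `model_state`, `optimizer_state`), the bookkeeping values, the temporary file name, and the use of momentum (so that the optimizer actually carries state worth saving) are illustrative assumptions, not part of the chapter's own code.

```python
import os  # path handling for the checkpoint file
import tempfile  # temporary directory so the demo leaves no files behind

import torch  # tensor library
import torch.nn.functional as F  # neural network losses
from torch import nn  # module definitions

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # pick execution device
model = nn.Linear(128, 10).to(device)  # toy model to checkpoint
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # momentum is an assumption

# One training step so the optimizer holds momentum buffers.
inputs = torch.randn(32, 128, device=device)  # mini-batch of features
targets = torch.randint(0, 10, (32,), device=device)  # integer class labels
optimizer.zero_grad(set_to_none=True)  # clear gradients before the step
F.cross_entropy(model(inputs), targets).backward()  # compute gradients
optimizer.step()  # apply the update

checkpoint = {
    'epoch': 3,  # hypothetical bookkeeping value
    'global_step': 1200,  # hypothetical bookkeeping value
    'model_state': model.state_dict(),  # model parameters
    'optimizer_state': optimizer.state_dict(),  # optimizer hyperparameters and buffers
}

with tempfile.TemporaryDirectory() as tmp_dir:  # scratch location for the demo
    ckpt_path = os.path.join(tmp_dir, 'checkpoint.pt')  # illustrative file name
    torch.save(checkpoint, ckpt_path)  # persist everything in one file
    restored = torch.load(ckpt_path, map_location=device, weights_only=True)  # load onto current device

# Rebuild fresh objects and restore their state, as a resumed job would.
resumed_model = nn.Linear(128, 10).to(device)  # same architecture as the saved model
resumed_opt = torch.optim.SGD(resumed_model.parameters(), lr=1e-2, momentum=0.9)  # same optimizer setup
resumed_model.load_state_dict(restored['model_state'])  # restore weights
resumed_opt.load_state_dict(restored['optimizer_state'])  # restore optimizer buffers
start_epoch = restored['epoch'] + 1  # continue with the epoch after the saved one

weights_match = all(
    torch.equal(p, q) for p, q in zip(model.parameters(), resumed_model.parameters())
)  # verify the round trip was lossless
print(f'resume_epoch={start_epoch} step={restored["global_step"]} weights_match={weights_match}')
```

Saving the optimizer state alongside the model matters whenever the optimizer keeps per-parameter buffers (momentum, Adam moments); restoring only the weights would silently reset them.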

## Learning-rate schedules (constant, step, cosine)

Plot three learning-rate strategies to visualise how each schedule evolves over time.
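
One way to produce such a plot is sketched below using schedulers from `torch.optim.lr_scheduler`. The base rate, number of epochs, step size, and decay factor are illustrative assumptions, and `trace_lr` is a hypothetical helper written for this example. Plotting is skipped if matplotlib is not installed.

```python
import torch  # tensor library
from torch import nn  # parameter container for the dummy optimizer

epochs = 30  # number of epochs to trace (assumption)
base_lr = 0.1  # initial learning rate (assumption)


def trace_lr(make_scheduler):
    """Return the learning rate in effect at each epoch for one scheduler."""
    params = [nn.Parameter(torch.zeros(1))]  # dummy parameter; no real training happens
    optimizer = torch.optim.SGD(params, lr=base_lr)  # optimizer whose lr is scheduled
    scheduler = make_scheduler(optimizer)  # build the scheduler under test
    history = []  # learning rate per epoch
    for _ in range(epochs):
        history.append(scheduler.get_last_lr()[0])  # record current rate
        optimizer.step()  # no gradients here; keeps the expected call order
        scheduler.step()  # advance the schedule by one epoch
    return history


schedules = {
    'constant': trace_lr(lambda opt: torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda epoch: 1.0)),
    'step': trace_lr(lambda opt: torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)),
    'cosine': trace_lr(lambda opt: torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)),
}

for name, history in schedules.items():  # short textual summary of each schedule
    print(f'{name:>8}: start={history[0]:.4f} mid={history[epochs // 2]:.4f} end={history[-1]:.4f}')

try:
    import matplotlib.pyplot as plt  # optional plotting dependency
except ImportError:
    plt = None  # fall back to the printed summary only

if plt is not None:
    fig, ax = plt.subplots(figsize=(6, 3))  # compact figure
    for name, history in schedules.items():
        ax.plot(range(epochs), history, label=name)  # one line per schedule
    ax.set_xlabel('epoch')
    ax.set_ylabel('learning rate')
    ax.legend()
    plt.show()
```

The constant schedule stays at `base_lr`, the step schedule halves the rate every 10 epochs, and the cosine schedule decays smoothly toward zero over `T_max` epochs.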

## Exercises

1. Profile a short run with AMP on/off; compare speed and memory usage.
2. Accumulate gradients over N micro-batches and check that the update matches a single batch N times as large.


