<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# Deep Learning Basics with PyTorch

**Dr. Yves J. Hilpisch with GPT-5**

# Chapter 12: Training at Scale
Cross-device benchmark: CPU vs MPS vs CUDA.

## Overview

Benchmark the CPU, Apple's MPS backend, and NVIDIA CUDA by timing a small training loop and a repeated inference pass.
The goal is to compare relative throughput across devices, not to optimize model accuracy.

In [None]:
# !pip -q install torch torchvision matplotlib  # install dependencies if running in a fresh env
import time  # timing utilities
from dataclasses import dataclass  # structured configuration object
from typing import Iterable  # type hints for helpers

import torch  # core tensor library
from torch import nn  # neural network layers
from torch.utils.data import DataLoader, TensorDataset  # lightweight dataset utilities

from IPython import get_ipython  # runtime configuration tweaks

get_ipython().run_line_magic("config", "InlineBackend.figure_format = 'retina'")
torch.manual_seed(0)  # ensure reproducible random data

## Benchmark configuration

Configure dataset size, model width, and batch size for the benchmark.

In [None]:
@dataclass
class BenchmarkConfig:
    num_samples: int = 50_000  # synthetic dataset size
    num_features: int = 300  # feature dimensionality
    num_classes: int = 10  # number of classes
    hidden_dim: int = 512  # hidden width for MLP
    batch_size: int = 512  # data loader batch size
    epochs: int = 3  # training epochs per device

CONFIG = BenchmarkConfig()

## Synthetic dataset helper

Return a reproducible toy classification dataset used by the benchmark.

In [None]:
def make_dataset(cfg: BenchmarkConfig) -> TensorDataset:
    """Return a synthetic classification dataset."""
    features = torch.randn(cfg.num_samples, cfg.num_features)  # random features
    labels = torch.randint(0, cfg.num_classes, (cfg.num_samples,))  # random class labels
    return TensorDataset(features, labels)  # wrap in TensorDataset for PyTorch loaders

DATASET = make_dataset(CONFIG)  # materialize once for entire notebook run
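
The dataset above is reproducible only as long as the global seed is set immediately before it is created; re-running the cell on its own draws different data. A minimal sketch of one alternative, assuming a hypothetical helper `make_seeded_dataset` (not part of the notebook) that draws from a dedicated `torch.Generator`:

```python
import torch
from torch.utils.data import TensorDataset


def make_seeded_dataset(
    num_samples: int, num_features: int, num_classes: int, seed: int = 0
) -> TensorDataset:
    """Hypothetical variant of make_dataset with its own random generator."""
    gen = torch.Generator().manual_seed(seed)  # independent of the global RNG state
    features = torch.randn(num_samples, num_features, generator=gen)
    labels = torch.randint(0, num_classes, (num_samples,), generator=gen)
    return TensorDataset(features, labels)


# Two calls with the same seed yield identical tensors.
a = make_seeded_dataset(1_000, 300, 10)
b = make_seeded_dataset(1_000, 300, 10)
print(torch.equal(a.tensors[0], b.tensors[0]))
```

With this approach, every device sees the same data regardless of how many random numbers were drawn elsewhere in the session.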

## Model definition

Use a small MLP so training focuses on throughput rather than complex architectures.

In [None]:
def make_model(cfg: BenchmarkConfig) -> nn.Module:
    """Create a simple feed-forward classifier."""
    return nn.Sequential(
        nn.Linear(cfg.num_features, cfg.hidden_dim),
        nn.ReLU(),
        nn.Linear(cfg.hidden_dim, cfg.num_classes),
    )
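
To get a feel for how small this model is, the parameter count can be checked by hand against what PyTorch reports. The sketch below rebuilds the same layer stack with the default sizes (300 features, 512 hidden units, 10 classes); `count_parameters` is a hypothetical helper introduced for illustration:

```python
from torch import nn


def count_parameters(model: nn.Module) -> int:
    """Count trainable parameters (hypothetical helper for illustration)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Same architecture as make_model with the default BenchmarkConfig values.
mlp = nn.Sequential(nn.Linear(300, 512), nn.ReLU(), nn.Linear(512, 10))

# Each Linear layer has in_features * out_features weights plus out_features biases.
expected = (300 * 512 + 512) + (512 * 10 + 10)
print(count_parameters(mlp), expected)
```

A model this small tends to leave a GPU underutilized, which is worth keeping in mind when reading the timings.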

## Device enumeration

Detect which of the CPU, Apple's Metal Performance Shaders (`mps`), and NVIDIA CUDA (`cuda`) backends are available on this machine.

In [None]:
def available_devices() -> list[str]:
    devices = ['cpu']  # CPU always present
    if torch.backends.mps.is_available():
        devices.append('mps')
    if torch.cuda.is_available():
        devices.append('cuda')
    return devices

AVAILABLE = available_devices()
print("devices detected:", AVAILABLE)
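
Both GPU backends execute kernels asynchronously, so a timer should only be stopped after the device has finished its queued work. A minimal sketch of a device-agnostic helper, assuming PyTorch 2.0 or later for `torch.mps.synchronize()`; the name `synchronize` is hypothetical and not part of the notebook:

```python
import torch


def synchronize(device: str) -> None:
    """Block until pending work on the device has finished (hypothetical helper)."""
    if device == "cuda" and torch.cuda.is_available():
        torch.cuda.synchronize()
    elif device == "mps" and torch.backends.mps.is_available():
        torch.mps.synchronize()  # assumes PyTorch 2.0+
    # CPU operations run synchronously, so nothing needs to be done.


synchronize("cpu")  # no-op on the CPU
```

Calling such a helper right before reading the clock keeps the timing logic identical across devices.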

## Training benchmark helper

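Time a brief training loop on the given device using the shared configuration. The measured time includes data loading and the host-to-device copy of each mini-batch, not just the compute on the device.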

In [None]:
def benchmark_training(device: str, cfg: BenchmarkConfig) -> float:
    model = make_model(cfg).to(device)  # move model to target device
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # simple optimizer
    loader = DataLoader(DATASET, batch_size=cfg.batch_size, shuffle=True)  # iterate dataset
    criterion = nn.CrossEntropyLoss()  # classification loss

    start = time.perf_counter()  # start timer
    for epoch in range(cfg.epochs):  # iterate epochs
        for xb, yb in loader:  # iterate mini-batches
            xb = xb.to(device)  # move features
            yb = yb.to(device)  # move labels
            optimizer.zero_grad(set_to_none=True)  # clear gradients
            logits = model(xb)  # forward pass
            loss = criterion(logits, yb)  # compute loss
            loss.backward()  # backpropagate
            optimizer.step()  # update weights
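    if device == 'cuda':  # wait for queued CUDA kernels before stopping the timer
        torch.cuda.synchronize()
    elif device == 'mps':  # MPS also runs asynchronously (torch.mps needs PyTorch 2.0+)
        torch.mps.synchronize()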
    return time.perf_counter() - start  # elapsed seconds
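
Since the stated goal is relative throughput, it can help to convert the elapsed seconds into samples per second. The helper below is a hypothetical sketch, and the elapsed time of 1.5 seconds is a made-up value for illustration, not a measured result:

```python
def samples_per_second(num_samples: int, epochs: int, elapsed_seconds: float) -> float:
    """Convert a training time into throughput (hypothetical helper)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_samples * epochs / elapsed_seconds


# Default config: 50,000 samples for 3 epochs; 1.5 s is an illustrative value only.
throughput = samples_per_second(50_000, 3, 1.5)
print(f"{throughput:,.0f} samples/s")
```

Throughput stays comparable when `num_samples` or `epochs` change, whereas raw seconds do not.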

## Inference benchmark helper

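Measure the average inference latency per batch on the given device over 200 forward passes. No warm-up pass is run, so the first iterations include one-time initialization costs.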

In [None]:
def benchmark_inference(device: str, cfg: BenchmarkConfig) -> float:
    model = make_model(cfg).to(device)
    model.eval()
    batch = torch.randn(cfg.batch_size, cfg.num_features, device=device)  # synthetic batch
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(200):
            _ = model(batch)
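        if device == 'cuda':  # wait for queued CUDA kernels before stopping the timer
            torch.cuda.synchronize()
        elif device == 'mps':  # MPS also runs asynchronously (torch.mps needs PyTorch 2.0+)
            torch.mps.synchronize()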
    elapsed = time.perf_counter() - start
    return elapsed / 200  # average seconds per batch
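
The first forward passes on a GPU often include one-time costs such as kernel compilation and memory allocation. A common remedy is to run a few untimed warm-up iterations and report the median over several timed repeats. The sketch below is one possible variant under those assumptions; `timed_inference` and its parameters are hypothetical, and `torch.mps.synchronize()` assumes PyTorch 2.0 or later:

```python
import statistics
import time

import torch
from torch import nn


def timed_inference(
    model: nn.Module, batch: torch.Tensor, warmup: int = 10, repeats: int = 5, iters: int = 50
) -> float:
    """Return the median seconds per batch after warm-up (hypothetical helper)."""
    device_type = batch.device.type

    def sync() -> None:
        # GPU backends queue work asynchronously; wait before reading the clock.
        if device_type == "cuda":
            torch.cuda.synchronize()
        elif device_type == "mps":
            torch.mps.synchronize()

    model.eval()
    per_batch = []
    with torch.no_grad():
        for _ in range(warmup):  # untimed warm-up passes
            model(batch)
        sync()
        for _ in range(repeats):  # several timed repeats
            start = time.perf_counter()
            for _ in range(iters):
                model(batch)
            sync()
            per_batch.append((time.perf_counter() - start) / iters)
    return statistics.median(per_batch)


model = nn.Sequential(nn.Linear(300, 512), nn.ReLU(), nn.Linear(512, 10))
latency = timed_inference(model, torch.randn(512, 300))
print(f"{latency * 1e3:.3f} ms per batch")
```

The median is less sensitive than the mean to an occasional slow repeat caused by other processes on the machine.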

## Run benchmarks

Execute training and inference benchmarks across available devices.

In [None]:
results = []
for device in AVAILABLE:
    train_time = benchmark_training(device, CONFIG)
    infer_time = benchmark_inference(device, CONFIG)
    results.append((device, train_time, infer_time))

for device, train_time, infer_time in results:
    print(f"{device:>4} | train {train_time:.2f}s | infer {infer_time * 1e3:.2f} ms per batch")
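
Absolute times depend heavily on the machine, so a ratio against the CPU baseline is often easier to interpret. The sketch below assumes the same `(device, train_time, infer_time)` tuple layout as `results`; the helper `relative_speedups` is hypothetical and the numbers are made up for illustration, not measured:

```python
def relative_speedups(
    results: list[tuple[str, float, float]], baseline: str = "cpu"
) -> dict[str, float]:
    """Training speedup of each device relative to the baseline (hypothetical helper)."""
    times = {device: train_time for device, train_time, _ in results}
    base = times[baseline]
    return {device: base / train_time for device, train_time in times.items()}


# Illustrative values only; real timings will differ per machine.
example_results = [("cpu", 6.0, 0.004), ("cuda", 1.5, 0.0005)]
speedups = relative_speedups(example_results)
for device, factor in speedups.items():
    print(f"{device:>4} | {factor:.1f}x vs cpu")
```

A value above 1 means the device trained faster than the CPU; a value below 1 means the transfer and launch overhead outweighed the compute advantage.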

## Exercises

- Adjust `BenchmarkConfig` (e.g., larger `num_features`) and observe throughput differences.
- Switch the optimizer to Adam or add an extra hidden layer to test impact on device scaling.
- Increase `epochs` and compare how startup overhead differs across devices.
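
As a possible starting point for the first exercise, `dataclasses.replace` can derive configuration variants without mutating the shared default. The class below mirrors the notebook's `BenchmarkConfig` so that the sketch is self-contained; the sweep values are arbitrary assumptions:

```python
from dataclasses import dataclass, replace


@dataclass
class BenchmarkConfig:
    """Mirror of the notebook's configuration, repeated to keep the sketch standalone."""
    num_samples: int = 50_000
    num_features: int = 300
    num_classes: int = 10
    hidden_dim: int = 512
    batch_size: int = 512
    epochs: int = 3


base = BenchmarkConfig()
# Arbitrary feature widths chosen for illustration.
sweep = [replace(base, num_features=n) for n in (300, 1_200, 4_800)]
for cfg in sweep:
    print(cfg.num_features, cfg.hidden_dim, cfg.batch_size)
```

Note that `benchmark_training` reads the global `DATASET`, which was built from `CONFIG`; when `num_features` changes, the dataset has to be regenerated with `make_dataset(cfg)` so that its shape matches the model's input layer.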
