<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>


# Deep Learning Basics with PyTorch

**Dr. Yves J. Hilpisch with GPT-5**


# Chapter 9 — Working with Data in PyTorch
Datasets, DataLoaders, transforms, and batching.

## Overview

This notebook is a concise, hands-on walkthrough of working with data in PyTorch: datasets, data loaders, transforms, and batching.
Use it as a companion to the chapter: run each cell, read the short notes,
and try small variations to build intuition.

Tips:
- Run cells top to bottom; restart the kernel if the state gets confusing.
- Prefer small, fast experiments; iterate quickly and observe outputs.
- Keep an eye on shapes, dtypes, and devices when using PyTorch.


In [None]:
# !pip -q install torch numpy matplotlib scikit-learn
import torch, numpy as np, matplotlib.pyplot as plt
from torch import nn
from torch.utils.data import TensorDataset, DataLoader, Dataset
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
%config InlineBackend.figure_format = 'retina'


## Dataset and DataLoader (moons)

In [None]:
torch.manual_seed(0)
X, y = make_moons(n_samples=600, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
X_tr = torch.tensor(X_tr, dtype=torch.float32)
X_te = torch.tensor(X_te, dtype=torch.float32)
y_tr = torch.tensor(y_tr, dtype=torch.long)
y_te = torch.tensor(y_te, dtype=torch.long)

# Wrap tensors as datasets and loaders
train_loader = DataLoader(TensorDataset(X_tr, y_tr), batch_size=64, shuffle=True)
test_loader = DataLoader(TensorDataset(X_te, y_te), batch_size=256)

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))

    def forward(self, x):
        return self.net(x)

model = TinyMLP()
opt = torch.optim.Adam(model.parameters(), lr=5e-3)
loss_fn = nn.CrossEntropyLoss()
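
Before training, it can help to pull one batch from a loader and check its shapes and dtypes. The following is a minimal sketch on stand-in data: `X_demo`, `y_demo`, and `demo_loader` are hypothetical names, and the 450 samples only mirror the size of a 75% training split of 600 points.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical stand-in data: 450 samples, 2 features, binary labels
torch.manual_seed(0)
X_demo = torch.randn(450, 2)
y_demo = torch.randint(0, 2, (450,))

demo_loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=64, shuffle=True)

# Pull a single batch and inspect shapes and dtypes
Xb, yb = next(iter(demo_loader))
print(Xb.shape, Xb.dtype)
print(yb.shape, yb.dtype)

# Batches per epoch: ceil(450 / 64) = 8; the last batch holds 450 - 7 * 64 = 2 samples
n_batches = len(demo_loader)
batch_sizes = [len(xb) for xb, _ in demo_loader]
print(n_batches, batch_sizes)
```

The short final batch is kept by default; passing `drop_last=True` to `DataLoader` would discard it instead.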


In [None]:
for epoch in range(10):
    model.train()
    for Xb, yb in train_loader:
        logits = model(Xb)
        loss = loss_fn(logits, yb)
        opt.zero_grad()
        loss.backward()
        opt.step()

model.eval()
with torch.no_grad():
    accuracy = (model(X_te).argmax(1) == y_te).float().mean().item()
accuracy
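
The cell above scores the model on the full test tensor in one pass. When the test set does not fit in memory at once, the same accuracy can be accumulated batch by batch from a loader. The sketch below assumes a classifier that returns logits; `evaluate_accuracy`, `demo_model`, and the random data are hypothetical stand-ins, not part of the chapter's code.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

def evaluate_accuracy(model, loader):
    """Accumulate classification accuracy over all batches of a loader."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for Xb, yb in loader:
            preds = model(Xb).argmax(dim=1)
            correct += (preds == yb).sum().item()
            total += yb.size(0)
    return correct / total

# Hypothetical stand-in model and data (untrained, random labels)
torch.manual_seed(0)
demo_model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
X_demo = torch.randn(150, 2)
y_demo = torch.randint(0, 2, (150,))
demo_loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=64)

acc_batched = evaluate_accuracy(demo_model, demo_loader)

# Full-tensor accuracy for comparison
with torch.no_grad():
    acc_full = (demo_model(X_demo).argmax(1) == y_demo).float().mean().item()
print(acc_batched, acc_full)
```

Counting correct predictions and dividing once at the end avoids the bias that averaging per-batch accuracies would introduce when the last batch is smaller.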


## Transform: standardization

In [None]:
class Standardize:
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __call__(self, x):
        return (x - self.mean) / (self.std + 1e-8)

mu, sigma = X_tr.mean(0), X_tr.std(0)
std = Standardize(mu, sigma)
X_tr_s, X_te_s = std(X_tr), std(X_te)

# wrap tensors as datasets
train_loader = DataLoader(TensorDataset(X_tr_s, y_tr), batch_size=64, shuffle=True)
test_loader = DataLoader(TensorDataset(X_te_s, y_te), batch_size=256)
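
Standardizing the tensors up front works well for small in-memory data. An alternative is to apply the transform per sample inside a `Dataset`, so the raw data stays untouched. The sketch below is one possible way to do this; `TransformedDataset`, `standardize`, and the random data are hypothetical names introduced for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TransformedDataset(Dataset):
    """Hypothetical wrapper: applies `transform` to the features on access."""
    def __init__(self, X, y, transform=None):
        self.X, self.y, self.transform = X, y, transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        x = self.X[i]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.y[i]

# Stand-in data with non-zero mean and non-unit scale
torch.manual_seed(0)
X_demo = torch.randn(200, 2) * 3.0 + 5.0
y_demo = torch.randint(0, 2, (200,))

# Statistics are computed once, from the (training) data only
mu, sigma = X_demo.mean(0), X_demo.std(0)

def standardize(x):
    return (x - mu) / (sigma + 1e-8)

ds = TransformedDataset(X_demo, y_demo, transform=standardize)
loader = DataLoader(ds, batch_size=50)

# Collect all transformed batches and check the resulting statistics
X_all = torch.cat([xb for xb, _ in loader])
print(X_all.mean(0), X_all.std(0))
```

The printed per-feature means should be close to 0 and the standard deviations close to 1. For a test split, the same `mu` and `sigma` from the training data would be reused rather than recomputed.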


## Custom collate (variable length)

In [None]:
class ToySeq(Dataset):
    def __init__(self, rng, n=20):
        self.x = [
            torch.tensor(rng.integers(1, 10, size=rng.integers(3, 8)))
            for _ in range(n)
        ]
        self.y = [int(xi.sum() % 2) for xi in self.x]

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i].float(), self.y[i]

def pad_collate(batch):
    xs, ys = zip(*batch)
    L = max(x.size(0) for x in xs)
    Xp = torch.zeros(len(xs), L)
    for i, x in enumerate(xs):
        Xp[i, :x.size(0)] = x
    return Xp, torch.tensor(ys, dtype=torch.long)

rng = np.random.default_rng(0)  # RNG setup
seq_loader = DataLoader(ToySeq(rng), batch_size=4, collate_fn=pad_collate)  # create data loader
xb, yb = next(iter(seq_loader))
xb.shape, yb.shape
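
Zero padding alone does not tell a model which positions are real. A common variation is to return the original lengths and a boolean mask as well. The sketch below uses `torch.nn.utils.rnn.pad_sequence` for the padding; the function name `pad_collate_with_lengths` and the hand-written batch are hypothetical, chosen only for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate_with_lengths(batch):
    """Pad variable-length sequences; also return lengths and a validity mask."""
    xs, ys = zip(*batch)
    lengths = torch.tensor([x.size(0) for x in xs], dtype=torch.long)
    # pad_sequence right-pads every sequence to the longest one in the batch
    Xp = pad_sequence(list(xs), batch_first=True, padding_value=0.0)
    # mask[i, t] is True where position t holds real data for sequence i
    mask = torch.arange(Xp.size(1)).unsqueeze(0) < lengths.unsqueeze(1)
    return Xp, lengths, mask, torch.tensor(ys, dtype=torch.long)

# Hand-written batch of (sequence, label) pairs with lengths 3, 2, and 5
batch = [
    (torch.tensor([1.0, 2.0, 3.0]), 0),
    (torch.tensor([4.0, 5.0]), 1),
    (torch.tensor([6.0, 7.0, 8.0, 9.0, 1.0]), 1),
]
Xp, lengths, mask, yb = pad_collate_with_lengths(batch)
print(Xp.shape, lengths.tolist())
print(mask)
```

The function can be passed to a `DataLoader` through `collate_fn`, in which case each batch unpacks into four tensors instead of two.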


## Exercises

1. Write a tiny custom Dataset; use a DataLoader and inspect batching/padding.
2. Measure throughput for two DataLoader settings (num_workers, pin_memory).
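
As a possible starting point for the second exercise, the sketch below times plain iteration over a loader, with no model work. The helper `measure_throughput` and the random dataset are hypothetical; the run shown uses `num_workers=0` so it behaves the same on any platform.

```python
import time
import torch
from torch.utils.data import TensorDataset, DataLoader

def measure_throughput(dataset, n_epochs=2, **loader_kwargs):
    """Iterate over the dataset and return (samples seen, samples per second)."""
    loader = DataLoader(dataset, **loader_kwargs)
    n_samples = 0
    start = time.perf_counter()
    for _ in range(n_epochs):
        for Xb, yb in loader:
            n_samples += Xb.size(0)
    elapsed = time.perf_counter() - start
    return n_samples, n_samples / elapsed

# Hypothetical stand-in dataset: 10,000 samples with 2 features
ds = TensorDataset(torch.randn(10_000, 2), torch.randint(0, 2, (10_000,)))

# Baseline: data loading in the main process, no pinned memory
n_seen, rate = measure_throughput(ds, batch_size=256, num_workers=0, pin_memory=False)
print(f"{n_seen} samples, {rate:,.0f} samples/s")
```

To compare settings, call the helper again with, for example, `num_workers=2` or `pin_memory=True`. Pinned memory mainly matters when batches are copied to a CUDA device, and for a tiny in-memory dataset like this one, worker processes may well be slower than the baseline because of their startup and communication overhead.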


