# Chapter 7 -- Deep Learning with PyTorch
## *Python for AI/ML: A Complete Learning Journey*

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/CH07_Deep_Learning_PyTorch.ipynb)
&nbsp;&nbsp;[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)

---

**Part:** 3 -- Machine Learning and AI  
**Prerequisites:** Chapter 6 (Machine Learning with scikit-learn)  
**Estimated time:** 6-7 hours

---

### Learning Objectives

By the end of this chapter you will be able to:

- Explain what a tensor is and why PyTorch uses them instead of NumPy arrays
- Create and manipulate tensors: indexing, reshaping, device transfer
- Build a neural network with `nn.Module`, `nn.Linear`, and activation functions
- Write a complete training loop: forward pass, loss, backward pass, optimiser step
- Use `DataLoader` and `Dataset` for efficient batched training
- Apply regularisation: dropout, weight decay, batch normalisation
- Diagnose overfitting with training vs validation loss curves
- Save and load model weights with `torch.save` and `torch.load`

---

### Project Thread -- Chapter 7

Two neural networks trained on SO 2025 data:

1. **Salary regression MLP** -- a multi-layer perceptron that predicts log salary
   from developer profile features, compared against the Chapter 6 Random Forest baseline
2. **Python usage classifier MLP** -- a binary classifier that predicts Python adoption,
   compared against the Chapter 6 Gradient Boosting baseline

Both networks are built from scratch -- no high-level Keras-style wrappers.
Every component is explicit so you understand what the framework does for you.


---

## Setup -- Imports, Device, and Data


> **Before running this notebook:** go to **Runtime → Change runtime type → T4 GPU**.
> The two training loops in Sections 7.4 and 7.5 work on CPU but run 3-5x faster on GPU.
> If T4 is unavailable, CPU will still complete in a few minutes.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import warnings
warnings.filterwarnings('ignore')

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split
from torch.optim.lr_scheduler import ReduceLROnPlateau

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, accuracy_score, classification_report

print(f'PyTorch: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')

# Use GPU if available, otherwise CPU
# In Colab: Runtime -> Change runtime type -> T4 GPU for faster training
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {DEVICE}')

RANDOM_STATE = 42
torch.manual_seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.dpi']       = 110
plt.rcParams['axes.titlesize']   = 13
plt.rcParams['axes.titleweight'] = 'bold'

DATASET_URL = 'https://raw.githubusercontent.com/timothy-watt/python-for-ai-ml/main/data/so_survey_2025_curated.csv'


In [None]:
# Load and clean SO 2025 -- identical pipeline to Chapter 6
df_raw = pd.read_csv(DATASET_URL)
df = df_raw.copy()
df = df.dropna(subset=['ConvertedCompYearly'])
df['ConvertedCompYearly'] = pd.to_numeric(df['ConvertedCompYearly'], errors='coerce')
Q1, Q3 = df['ConvertedCompYearly'].quantile([0.25, 0.75])
IQR = Q3 - Q1
df = df[
    (df['ConvertedCompYearly'] >= max(Q1 - 3*IQR, 5_000)) &
    (df['ConvertedCompYearly'] <= min(Q3 + 3*IQR, 600_000))
].copy()
if 'YearsCodePro' in df.columns:
    df['YearsCodePro'] = pd.to_numeric(df['YearsCodePro'], errors='coerce')
    df['YearsCodePro'] = df['YearsCodePro'].fillna(df['YearsCodePro'].median())
for col in ['Country', 'EdLevel', 'Employment', 'RemoteWork']:
    if col in df.columns:
        df[col] = df[col].fillna('Unknown')
df['uses_python'] = df.get('LanguageHaveWorkedWith', pd.Series(dtype=str)).str.contains('Python', na=False).astype(int)
df['uses_sql']    = df.get('LanguageHaveWorkedWith', pd.Series(dtype=str)).str.contains('SQL', na=False).astype(int)
df['uses_js']     = df.get('LanguageHaveWorkedWith', pd.Series(dtype=str)).str.contains('JavaScript', na=False).astype(int)
df['uses_ai']     = df.get('AIToolCurrently', pd.Series(dtype=str)).notna().astype(int)
df['log_salary']  = np.log(df['ConvertedCompYearly'])
df = df.reset_index(drop=True)
print(f'Dataset ready: {len(df):,} rows')


---

## Section 7.1 -- Tensors: PyTorch's Core Data Structure

A **tensor** is an n-dimensional array -- identical in structure to a NumPy array,
but with two crucial additions:

1. **Device awareness:** a tensor can live on CPU or GPU. Moving it to GPU (`tensor.to('cuda')`)
   makes operations on it run on the GPU. The code is unchanged, but all operands
   of an operation must live on the same device.
2. **Autograd:** PyTorch tracks every operation on tensors with `requires_grad=True`,
   building a computation graph that enables automatic differentiation.
   This is what makes backpropagation possible without manual gradient calculation.

Everything else in PyTorch -- layers, optimisers, losses -- operates on tensors.
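
As a minimal sketch of the two additions described above, the snippet below differentiates a small function with autograd and then moves a tensor to whichever device is available. The variable names (`grad`, `x_dev`, `z`) are illustrative only.

```python
import torch

# Autograd: track operations on x and differentiate y = sum(x^2)
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()                      # fills x.grad with dy/dx = 2x
grad = x.grad

# Device awareness: the same code runs on CPU or GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x_dev = x.detach().to(device)     # detached tensor on the chosen device (no-op on CPU)
z = (x_dev * 2).sum()             # computed on whichever device x_dev lives on

print(f'grad = {grad}')
print(f'x_dev.device = {x_dev.device}, z = {z.item()}')
```

The gradient is simply `2 * x`, which is easy to check by hand before trusting autograd on larger graphs.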


In [None]:
# 7.1.1 -- Creating tensors

# From a Python list
t1 = torch.tensor([1.0, 2.0, 3.0, 4.0])
print(f't1: {t1}  dtype={t1.dtype}  shape={t1.shape}')

# From a NumPy array -- shares memory when on CPU (no copy)
arr = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
t2  = torch.from_numpy(arr)
print(f't2 shape: {t2.shape}  dtype={t2.dtype}')

# Factory functions
zeros = torch.zeros(3, 4)          # 3x4 matrix of 0.0
ones  = torch.ones(2, 5)           # 2x5 matrix of 1.0
rand  = torch.rand(3, 3)           # uniform [0, 1)
randn = torch.randn(3, 3)          # standard normal
eye   = torch.eye(4)               # 4x4 identity matrix
print(f'zeros shape: {zeros.shape}')
print(f'randn:\n{randn.numpy().round(3)}')

# Specifying dtype explicitly
t_float32 = torch.tensor([1, 2, 3], dtype=torch.float32)  # default for neural nets
t_int64   = torch.tensor([1, 2, 3], dtype=torch.int64)    # required for class labels
print(f'float32: {t_float32.dtype}  int64: {t_int64.dtype}')


In [None]:
# 7.1.2 -- Tensor operations: indexing, reshaping, device transfer

t = torch.randn(4, 3)
print(f'Original shape: {t.shape}')
print(f'First row:      {t[0]}')
print(f'Col 1:          {t[:, 1]}')
print(f'Element [2,1]:  {t[2, 1].item():.4f}')  # .item() extracts Python scalar

# Reshaping
flat   = t.reshape(-1)           # flatten to 1D (-1 = infer size)
t_2x6  = t.reshape(2, 6)        # reshape to 2x6
t_T    = t.T                     # transpose
print(f'Flattened: {flat.shape}  Reshaped: {t_2x6.shape}  Transposed: {t_T.shape}')

# Arithmetic -- all operations are element-wise unless you use @
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(f'a + b:     {a + b}')
print(f'a * b:     {a * b}')    # element-wise, NOT dot product
print(f'a @ b:     {a @ b}')    # dot product (inner product)
print(f'a.dot(b):  {torch.dot(a, b)}')

# Matrix multiplication
W = torch.randn(3, 5)   # weight matrix
x = torch.randn(5)      # input vector
y = W @ x               # matrix-vector product: shape (3,)
print(f'W @ x shape: {y.shape}')

# Device transfer -- move to GPU if available
t_device = t.to(DEVICE)
print(f'Tensor device: {t_device.device}')


In [None]:
# 7.1.3 -- Autograd: automatic differentiation
#
# When requires_grad=True, PyTorch records every operation on the tensor.
# Calling .backward() on the final scalar loss computes gradients
# for all tensors in the computation graph that have requires_grad=True.
# This is the entire mathematical basis of neural network training.

# Simple example: compute d/dx of f(x) = 3x^2 + 2x + 1 at x=2
x = torch.tensor(2.0, requires_grad=True)  # leaf variable
f = 3 * x**2 + 2 * x + 1                  # builds computation graph

f.backward()   # computes df/dx

print(f'f(2)   = {f.item():.1f}  (expected: 3*4 + 2*2 + 1 = 17)')
print(f'df/dx  = {x.grad.item():.1f}  (expected: 6x + 2 at x=2 = 14)')

# In a neural network, x becomes the model weights, f becomes the loss.
# .backward() fills .grad on every weight tensor, then the optimiser
# uses those gradients to update the weights.

# torch.no_grad() disables gradient tracking -- use during inference
# to save memory and speed up forward passes when we do not need gradients
with torch.no_grad():
    y = 3 * x**2 + 2 * x + 1
    print(f'Under no_grad: y={y.item():.1f}, grad_fn={y.grad_fn}')  # None

# Converting between tensor and numpy
t = torch.tensor([1.0, 2.0, 3.0])
arr = t.detach().numpy()    # .detach() removes from computation graph first
print(f'Tensor -> numpy: {arr}  type={type(arr).__name__}')
back = torch.from_numpy(arr)
print(f'numpy -> tensor: {back}')


---

## Section 7.2 -- Building Neural Networks with nn.Module

`nn.Module` is the base class for every neural network in PyTorch.
You subclass it, define layers in `__init__`, and implement the forward pass in `forward()`.
PyTorch then handles parameter registration, gradient tracking, device transfer,
serialisation, and training/eval mode switching automatically.

This is the same OOP pattern from Chapter 2 -- a class with `__init__` storing
configuration and learned state, and methods implementing behaviour.
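
To make "parameter registration" and "mode switching" concrete, here is a small sketch. `TinyNet` is a hypothetical class used only for illustration; it is not part of the chapter's project code.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):              # hypothetical example class
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 2)      # assigning a layer as an attribute registers it
        self.drop = nn.Dropout(p=0.5)  # has no parameters, but reacts to train/eval mode

    def forward(self, x):
        return self.drop(self.fc(x))

net = TinyNet()

# Parameters were registered automatically: a (2, 3) weight and a (2,) bias
param_shapes = {name: tuple(p.shape) for name, p in net.named_parameters()}
n_params = sum(p.numel() for p in net.parameters())

net.eval()                             # switches every submodule to inference mode
out = net(torch.ones(4, 3))

print(param_shapes)
print(f'Parameters: {n_params}, training mode: {net.training}, output shape: {tuple(out.shape)}')
```

Nothing in `__init__` lists the parameters explicitly; `nn.Module` discovers them from the attribute assignment, which is why `super().__init__()` must run first.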


In [None]:
# 7.2.1 -- A minimal MLP (Multi-Layer Perceptron)

class SalaryMLP(nn.Module):
    """
    Multi-layer perceptron for salary regression.

    Architecture (one block per entry in hidden_dims; three by default):
        Input -> [Linear -> BatchNorm -> ReLU -> Dropout] x len(hidden_dims)
               -> Linear -> output

    Parameters
    ----------
    input_dim  : int      -- number of input features
    hidden_dims: sequence -- number of neurons in each hidden layer
    dropout    : float    -- dropout probability (0 = disabled)
    """

    def __init__(self, input_dim, hidden_dims=(128, 64, 32), dropout=0.3):
        super().__init__()   # must call parent __init__

        layers = []
        prev_dim = input_dim

        for hidden_dim in hidden_dims:
            layers += [
                nn.Linear(prev_dim, hidden_dim),  # learnable weight matrix + bias
                nn.BatchNorm1d(hidden_dim),        # normalise activations per batch
                nn.ReLU(),                         # non-linearity: max(0, x)
                nn.Dropout(p=dropout),             # zero each activation with probability p
            ]
            prev_dim = hidden_dim

        layers.append(nn.Linear(prev_dim, 1))   # output layer: single salary prediction

        # nn.Sequential chains layers -- forward() calls them in order
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        """
        Forward pass: compute predictions from input tensor x.
        Called automatically when you do: output = model(x)
        """
        return self.network(x).squeeze(-1)   # squeeze removes the trailing dim-1


# Instantiate the model and move to the target device
input_dim   = 5   # placeholder -- will be set properly in section 7.3
salary_net  = SalaryMLP(input_dim=input_dim, hidden_dims=(128, 64, 32), dropout=0.3)
salary_net  = salary_net.to(DEVICE)

print(salary_net)
print()

# Count trainable parameters
n_params = sum(p.numel() for p in salary_net.parameters() if p.requires_grad)
print(f'Trainable parameters: {n_params:,}')

# Test with a random batch
x_test  = torch.randn(8, input_dim).to(DEVICE)   # batch of 8 samples
y_test  = salary_net(x_test)
print(f'Input shape:  {x_test.shape}')
print(f'Output shape: {y_test.shape}')   # (8,) -- one prediction per sample


In [None]:
# 7.2.2 -- Loss functions and optimisers
#
# A loss function measures how wrong the model's predictions are.
# The optimiser adjusts model weights to reduce the loss.

# Common loss functions:
#   nn.MSELoss()    -- Mean Squared Error: regression tasks
#   nn.L1Loss()     -- Mean Absolute Error: regression, robust to outliers
#   nn.BCEWithLogitsLoss() -- Binary Cross-Entropy: binary classification
#   nn.CrossEntropyLoss()  -- Multi-class classification

criterion = nn.MSELoss()   # for log-salary regression

# Common optimisers:
#   optim.SGD        -- Stochastic Gradient Descent (baseline)
#   optim.Adam       -- Adaptive Moment Estimation (default choice)
#   optim.AdamW      -- Adam with decoupled weight decay (better regularisation)

optimizer = optim.AdamW(
    salary_net.parameters(),
    lr=1e-3,           # learning rate: step size for weight updates
    weight_decay=1e-4  # decoupled weight decay: shrinks weights each step (L2-like)
)

# Demonstrate one manual forward + backward pass
x_demo  = torch.randn(16, input_dim).to(DEVICE)
y_demo  = torch.randn(16).to(DEVICE)          # fake targets

optimizer.zero_grad()          # clear gradients from the previous step
y_pred  = salary_net(x_demo)   # forward pass
loss    = criterion(y_pred, y_demo)  # compute loss
loss.backward()                # backward pass: compute all gradients
optimizer.step()               # update weights using gradients

print(f'One manual training step completed.')
print(f'Loss: {loss.item():.6f}')
print(f'Output grad_fn: {y_pred.grad_fn.__class__.__name__}')  # e.g. SqueezeBackward1


---

## Section 7.3 -- Dataset and DataLoader

PyTorch's `Dataset` and `DataLoader` handle the mechanics of batched training:
shuffling, batching, and multi-process data loading. They are the equivalent
of scikit-learn's `Pipeline` for the data side of training.

You subclass `Dataset` and implement two methods (plus an `__init__` that stores the data):
- `__len__`: how many samples in the dataset
- `__getitem__`: return one (features, label) pair by index

`DataLoader` wraps the dataset and yields batches during the training loop.
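
A small sketch of that division of labour, using a hypothetical 10-sample `ToyDataset` (not the survey data) so the batch sizes are easy to predict:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):             # hypothetical example dataset
    def __init__(self, n=10):
        self.X = torch.arange(n, dtype=torch.float32).unsqueeze(1)  # shape (n, 1)
        self.y = self.X.squeeze(1) * 2                              # shape (n,)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# 10 samples in batches of 4 -> two full batches and one partial batch
loader = DataLoader(ToyDataset(10), batch_size=4, shuffle=False)
batch_sizes = [len(xb) for xb, yb in loader]
first_X, first_y = next(iter(loader))

print(f'Batch sizes: {batch_sizes}')
print(f'First batch shapes: X={tuple(first_X.shape)}, y={tuple(first_y.shape)}')
```

The last batch is smaller than `batch_size` unless `drop_last=True` is passed, which is why the training loop later weights each batch loss by its actual length.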


In [None]:
# 7.3.1 -- Custom Dataset for SO 2025

class SurveyDataset(Dataset):
    """
    PyTorch Dataset wrapping the SO 2025 feature matrix and targets.

    Parameters
    ----------
    X : np.ndarray -- feature matrix (already scaled)
    y : np.ndarray -- target vector
    """

    def __init__(self, X, y):
        # Convert numpy arrays to float32 tensors
        # float32 is the standard dtype for neural network weights and activations
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        # Called by DataLoader to fetch one sample
        # Returns a (features, label) tuple
        return self.X[idx], self.y[idx]


# Build the feature matrix for regression
feature_cols = [c for c in ['YearsCodePro', 'uses_python', 'uses_sql', 'uses_js', 'uses_ai']
                if c in df.columns]
print(f'Features: {feature_cols}')

X_raw = df[feature_cols].copy()
for col in feature_cols:
    med = X_raw[col].median()
    X_raw[col] = X_raw[col].fillna(med if pd.notna(med) else 0)
y_raw = df['log_salary'].values

# Train/test split
X_train_np, X_test_np, y_train_np, y_test_np = train_test_split(
    X_raw.values, y_raw, test_size=0.2, random_state=RANDOM_STATE
)

# Scale features -- fit ONLY on training data
feat_scaler = StandardScaler()
X_train_sc  = feat_scaler.fit_transform(X_train_np)
X_test_sc   = feat_scaler.transform(X_test_np)

# Create Dataset objects
train_dataset = SurveyDataset(X_train_sc, y_train_np)
test_dataset  = SurveyDataset(X_test_sc,  y_test_np)

# DataLoader: batches + shuffling
# batch_size=256: process 256 samples per weight update
# shuffle=True: randomise order each epoch (prevents the model memorising order)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True)
test_loader  = DataLoader(test_dataset,  batch_size=512, shuffle=False)

input_dim = X_train_sc.shape[1]
print(f'Train: {len(train_dataset):,} samples  ({len(train_loader)} batches of 256)')
print(f'Test:  {len(test_dataset):,} samples')
print(f'Input dim: {input_dim}')


---

## Section 7.4 -- The Training Loop

The training loop is the heart of deep learning. Every epoch it:

1. Iterates over batches from the DataLoader
2. Runs the **forward pass**: compute predictions
3. Computes the **loss**: how wrong are the predictions?
4. Runs the **backward pass**: compute gradients via backprop
5. **Optimiser step**: update weights in the direction that reduces loss
6. Runs a **validation pass** with `torch.no_grad()` to monitor overfitting

The train/validation loss curves are the most important diagnostic in deep learning.
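
Before the full implementation, steps 2 to 5 can be sketched on a toy problem. The example below fits `y = 2x + 1` with a single linear layer. The synthetic data, learning rate, and step count are assumptions chosen for illustration, and there is no batching or validation pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data: y = 2x + 1, no noise
X = torch.linspace(-1, 1, 64).unsqueeze(1)   # shape (64, 1)
y = 2 * X + 1

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(200):
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = criterion(model(X), y)    # forward pass + loss
    loss.backward()                  # backward pass: compute gradients
    optimizer.step()                 # update weights
    losses.append(loss.item())

print(f'First loss: {losses[0]:.4f}  last loss: {losses[-1]:.6f}')
print(f'Learned weight: {model.weight.item():.3f}  bias: {model.bias.item():.3f}')
```

The learned weight and bias should approach 2 and 1. The real loop below adds batching, device transfer, and a validation pass around the same four calls.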


In [None]:
# 7.4.1 -- Reusable train_epoch and evaluate functions

def train_epoch(model, loader, criterion, optimizer):
    """
    Run one full pass over the training data.
    Returns mean training loss for this epoch.
    """
    model.train()   # sets BatchNorm and Dropout to training mode
    total_loss = 0.0

    for X_batch, y_batch in loader:
        X_batch = X_batch.to(DEVICE)
        y_batch = y_batch.to(DEVICE)

        optimizer.zero_grad()              # 1. clear old gradients
        y_pred = model(X_batch)            # 2. forward pass
        loss   = criterion(y_pred, y_batch)  # 3. compute loss
        loss.backward()                    # 4. backward pass
        optimizer.step()                   # 5. update weights

        total_loss += loss.item() * len(X_batch)  # accumulate weighted loss

    return total_loss / len(loader.dataset)


def evaluate(model, loader, criterion):
    """
    Evaluate model on a DataLoader without updating weights.
    Returns mean loss, plus all predictions and targets as numpy arrays.
    """
    model.eval()    # sets BatchNorm and Dropout to inference mode
    total_loss = 0.0
    all_preds  = []
    all_targets = []

    with torch.no_grad():   # disable gradient tracking for efficiency
        for X_batch, y_batch in loader:
            X_batch = X_batch.to(DEVICE)
            y_batch = y_batch.to(DEVICE)
            y_pred  = model(X_batch)
            loss    = criterion(y_pred, y_batch)
            total_loss += loss.item() * len(X_batch)
            all_preds.append(y_pred.cpu().numpy())
            all_targets.append(y_batch.cpu().numpy())

    preds   = np.concatenate(all_preds)
    targets = np.concatenate(all_targets)
    return total_loss / len(loader.dataset), preds, targets


print('train_epoch() and evaluate() functions defined.')


In [None]:
# 7.4.2 -- Train the salary regression MLP

# Build fresh model
salary_net = SalaryMLP(
    input_dim   = input_dim,
    hidden_dims = (128, 64, 32),
    dropout     = 0.3
).to(DEVICE)

criterion  = nn.MSELoss()
optimizer  = optim.AdamW(salary_net.parameters(), lr=1e-3, weight_decay=1e-4)

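# ReduceLROnPlateau: halve the learning rate if validation loss stops improving
# for `patience` epochs. (The `verbose` argument is deprecated in recent PyTorch
# releases, so it is omitted here.)
scheduler  = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=5)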

N_EPOCHS   = 60
train_losses = []
val_losses   = []
best_val     = float('inf')
best_weights = None

print(f'Training SalaryMLP for {N_EPOCHS} epochs on {DEVICE}...')
print(f'{"Epoch":>6}  {"Train Loss":>12}  {"Val Loss":>12}  {"LR":>10}')
print('-' * 44)

for epoch in range(1, N_EPOCHS + 1):
    tr_loss          = train_epoch(salary_net, train_loader, criterion, optimizer)
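    # The held-out test split doubles as the validation set here, so metrics
    # reported on it after checkpoint selection are slightly optimistic.
    val_loss, _, _   = evaluate(salary_net, test_loader, criterion)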
    scheduler.step(val_loss)

    train_losses.append(tr_loss)
    val_losses.append(val_loss)

    # Checkpoint the best weights seen so far (training still runs all epochs)
    if val_loss < best_val:
        best_val     = val_loss
        best_weights = {k: v.clone() for k, v in salary_net.state_dict().items()}

    if epoch % 10 == 0 or epoch == 1:
        lr = optimizer.param_groups[0]['lr']
        print(f'{epoch:>6}  {tr_loss:>12.6f}  {val_loss:>12.6f}  {lr:>10.2e}')

print(f'Best validation loss: {best_val:.6f}')


In [None]:
# 7.4.3 -- Evaluate and plot training curves

# Restore best weights
salary_net.load_state_dict(best_weights)

# Final evaluation
_, y_pred_log, y_true_log = evaluate(salary_net, test_loader, criterion)
y_pred_usd = np.exp(y_pred_log)
y_true_usd = np.exp(y_true_log)

r2  = r2_score(y_true_log, y_pred_log)
mae = np.mean(np.abs(y_true_usd - y_pred_usd))

print(f'SalaryMLP Test Results:')
print(f'  R^2 (log scale): {r2:.4f}')
print(f'  MAE (USD):       ${mae:,.0f}')

# Training curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

epochs = range(1, len(train_losses) + 1)
axes[0].plot(epochs, train_losses, '#E8722A', linewidth=2, label='Train loss')
axes[0].plot(epochs, val_losses,   '#2E75B6', linewidth=2, label='Val loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('MSE Loss (log salary)')
axes[0].set_title('Training vs Validation Loss\n(curves converging = good fit)')
axes[0].legend()

axes[1].scatter(y_true_usd/1000, y_pred_usd/1000, alpha=0.2, s=8, color='#2E75B6')
lim = max(y_true_usd.max(), y_pred_usd.max()) / 1000
axes[1].plot([0, lim], [0, lim], 'r--', linewidth=2, label='Perfect')
axes[1].set_xlabel('Actual Salary ($k)')
axes[1].set_ylabel('Predicted Salary ($k)')
axes[1].set_title(f'Actual vs Predicted Salary\nR^2={r2:.3f}, MAE=${mae/1000:.1f}k')
axes[1].legend()

plt.tight_layout()
plt.show()


---

## Section 7.5 -- Binary Classification MLP

The classifier network reuses the regression architecture unchanged. What differs
is how the output is interpreted and trained:

- **Output layer:** still 1 neuron with no activation, now read as a raw logit
- **Loss function:** `BCEWithLogitsLoss` -- combines sigmoid activation and
  binary cross-entropy in a numerically stable single operation

During inference, apply `torch.sigmoid()` to the raw logit to get a probability.
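
The relationship between the fused loss and the two-step version can be checked on a few hand-picked values. This is a sketch with made-up logits and targets, not survey data:

```python
import torch
import torch.nn as nn

logits  = torch.tensor([-2.0, 0.5, 3.0])   # raw model outputs (illustrative values)
targets = torch.tensor([0.0, 1.0, 1.0])

# Fused: sigmoid + binary cross-entropy in one numerically stable op
loss_fused = nn.BCEWithLogitsLoss()(logits, targets)

# Two-step equivalent: explicit sigmoid, then plain BCE on probabilities
probs = torch.sigmoid(logits)
loss_two_step = nn.BCELoss()(probs, targets)

# Inference: threshold the probabilities at 0.5
preds = (probs >= 0.5).int()

print(f'Fused: {loss_fused.item():.6f}  two-step: {loss_two_step.item():.6f}')
print(f'Probabilities: {probs}  predictions: {preds.tolist()}')
```

For moderate logits the two losses agree. The fused version matters when logits are large in magnitude, where an explicit sigmoid saturates and the logarithm inside BCE loses precision.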


In [None]:
# 7.5.1 -- Build classifier dataset and model

clf_feature_cols = [c for c in ['YearsCodePro', 'ConvertedCompYearly',
                                 'uses_sql', 'uses_js', 'uses_ai']
                    if c in df.columns]

X_clf_raw = df[clf_feature_cols].copy()
for col in clf_feature_cols:
    med = X_clf_raw[col].median()
    X_clf_raw[col] = X_clf_raw[col].fillna(med if pd.notna(med) else 0)
y_clf_raw = df['uses_python'].values.astype(np.float32)

X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(
    X_clf_raw.values, y_clf_raw, test_size=0.2,
    random_state=RANDOM_STATE, stratify=y_clf_raw
)

clf_scaler  = StandardScaler()
X_tr_c_sc   = clf_scaler.fit_transform(X_tr_c)
X_te_c_sc   = clf_scaler.transform(X_te_c)

clf_train_ds = SurveyDataset(X_tr_c_sc, y_tr_c)
clf_test_ds  = SurveyDataset(X_te_c_sc, y_te_c)
clf_train_loader = DataLoader(clf_train_ds, batch_size=256, shuffle=True)
clf_test_loader  = DataLoader(clf_test_ds,  batch_size=512, shuffle=False)

clf_input_dim = X_tr_c_sc.shape[1]
print(f'Classifier features: {clf_feature_cols}')
print(f'Class balance: Python={y_clf_raw.mean()*100:.1f}%')

# The network architecture is identical -- only the loss changes
class_net = SalaryMLP(
    input_dim   = clf_input_dim,
    hidden_dims = (64, 32),
    dropout     = 0.2
).to(DEVICE)

clf_criterion = nn.BCEWithLogitsLoss()   # sigmoid + BCE in one stable op
clf_optimizer = optim.AdamW(class_net.parameters(), lr=1e-3, weight_decay=1e-4)
clf_scheduler = ReduceLROnPlateau(clf_optimizer, mode='min', factor=0.5, patience=5)

n_params = sum(p.numel() for p in class_net.parameters() if p.requires_grad)
print(f'Classifier parameters: {n_params:,}')


In [None]:
# 7.5.2 -- Train the classifier

N_EPOCHS_CLF    = 50
clf_train_losses = []
clf_val_losses   = []
clf_best_val     = float('inf')
clf_best_weights = None

print(f'Training classifier for {N_EPOCHS_CLF} epochs on {DEVICE}...')
print(f'{"Epoch":>6}  {"Train Loss":>12}  {"Val Loss":>12}')
print('-' * 34)

for epoch in range(1, N_EPOCHS_CLF + 1):
    tr_loss        = train_epoch(class_net, clf_train_loader, clf_criterion, clf_optimizer)
    val_loss, _, _ = evaluate(class_net, clf_test_loader, clf_criterion)
    clf_scheduler.step(val_loss)

    clf_train_losses.append(tr_loss)
    clf_val_losses.append(val_loss)

    if val_loss < clf_best_val:
        clf_best_val     = val_loss
        clf_best_weights = {k: v.clone() for k, v in class_net.state_dict().items()}

    if epoch % 10 == 0 or epoch == 1:
        print(f'{epoch:>6}  {tr_loss:>12.6f}  {val_loss:>12.6f}')

print(f'Best val loss: {clf_best_val:.6f}')


In [None]:
# 7.5.3 -- Evaluate the classifier

class_net.load_state_dict(clf_best_weights)

_, logits, y_true = evaluate(class_net, clf_test_loader, clf_criterion)

# Convert raw logits to probabilities and binary predictions
probs  = 1 / (1 + np.exp(-logits))   # sigmoid manually for clarity
y_pred = (probs >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)
print(f'Classifier Test Accuracy: {acc:.4f}  ({acc*100:.1f}%)')
print()
print(classification_report(y_true, y_pred, target_names=['Non-Python', 'Python']))

# Loss curves
fig, ax = plt.subplots(figsize=(9, 5))
epochs_clf = range(1, len(clf_train_losses) + 1)
ax.plot(epochs_clf, clf_train_losses, '#E8722A', linewidth=2, label='Train loss')
ax.plot(epochs_clf, clf_val_losses,   '#2E75B6', linewidth=2, label='Val loss')
ax.set_xlabel('Epoch')
ax.set_ylabel('BCE Loss')
ax.set_title(f'Classifier Training Curves\nFinal accuracy: {acc*100:.1f}%')
ax.legend()
plt.tight_layout()
plt.show()


---

## Section 7.6 -- Saving, Loading, and Model Comparison
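
As a minimal sketch of the save/load round trip, the snippet below writes a `state_dict` to an in-memory buffer instead of a file and restores it into a second model. The use of `io.BytesIO` and the tiny `nn.Linear` model are assumptions made to keep the example self-contained; the chapter's own code saves to disk.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
source = nn.Linear(4, 1)                 # stand-in for a trained model

# Save only the weights (state_dict), not the model object
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)
buffer.seek(0)                           # rewind before reading

# A fresh instance with the same architecture starts with different random weights
restored = nn.Linear(4, 1)
restored.load_state_dict(torch.load(buffer))
restored.eval()

x = torch.randn(3, 4)
with torch.no_grad():
    same = torch.allclose(source(x), restored(x))

print(f'Predictions match after reload: {same}')
```

The architecture has to be rebuilt in code before `load_state_dict` is called, because the saved dictionary holds only tensors keyed by parameter name.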


In [None]:
# 7.6.1 -- Saving and loading model weights
#
# Best practice: save only the state_dict (weights), not the entire model object.
# The model class definition must be available when loading.

import os

save_path = '/tmp/salary_mlp_best.pt'
torch.save(best_weights, save_path)
print(f'Saved weights to {save_path}  ({os.path.getsize(save_path)/1024:.1f} KB)')

# Load into a fresh model instance
loaded_net = SalaryMLP(input_dim=input_dim, hidden_dims=(128, 64, 32), dropout=0.3)
loaded_net.load_state_dict(torch.load(save_path, map_location='cpu'))
loaded_net.eval()
print('Weights loaded successfully into fresh model instance.')

# Verify predictions match
# salary_net lives on DEVICE (possibly GPU); loaded_net was created on CPU
x_verify = torch.tensor(X_test_sc[:5], dtype=torch.float32)
with torch.no_grad():
    salary_net.eval()
    original_preds = salary_net(x_verify.to(DEVICE)).cpu().numpy()
    loaded_preds   = loaded_net(x_verify).numpy()

print(f'Original predictions: {np.exp(original_preds).round(0)}')
print(f'Loaded predictions:   {np.exp(loaded_preds).round(0)}')
# GPU and CPU kernels can differ in the last few bits, so compare with a tolerance
print(f'Match: {np.allclose(original_preds, loaded_preds, atol=1e-5)}')


In [None]:
# 7.6.2 -- Chapter summary: MLP vs Chapter 6 baselines on both tasks

print('=' * 60)
print('  Chapter 7 Results: Neural Network vs Baseline')
print('=' * 60)

print()
print('Task 1: Salary Regression (test set)')
print(f'  SalaryMLP (PyTorch):          R^2 = {r2:.4f},  MAE = ${mae:,.0f}')
print('  Random Forest (Chapter 6):    see CH06 notebook for comparison')

print()
print('Task 2: Python Usage Classification (test set)')
print(f'  ClassifierMLP (PyTorch):      accuracy = {acc:.4f}')
print('  Best sklearn model (Ch 6):    see CH06 notebook for comparison')

print()
print('Key observations:')
print('  - For tabular data with few features, tree-based models often')
print('    match or beat MLPs without extensive tuning')
print('  - Neural networks shine on high-dimensional data: images, text,')
print('    audio -- not on 5-column tabular datasets')
print('  - Chapter 8 (NLP) shows where neural networks dominate')
print()
print('Neural network concepts mastered in this chapter:')
for concept in [
    'Tensors and autograd',
    'nn.Module and layer construction',
    'BatchNorm, Dropout, ReLU',
    'Dataset and DataLoader',
    'The training loop (forward, loss, backward, step)',
    'ReduceLROnPlateau scheduler',
    'Best-weight checkpointing',
    'torch.save / torch.load',
]:
    print(f'  - {concept}')


---

## Section 7.7 -- Finding the Right Learning Rate

The learning rate is the single most important hyperparameter in neural network
training. Too high and training diverges; too low and it converges glacially.

The **LR Range Test** (Leslie Smith, 2015) finds a good learning rate in one
short run: start very low, increase exponentially each batch, and plot the loss.
A good learning rate sits where the loss is still falling steeply, before it starts
rising sharply -- typically about one order of magnitude below the minimum-loss point.

This greatly reduces the need for expensive grid searches over learning rates.
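
The exponential schedule behind the test is simple arithmetic: multiply the learning rate by a constant factor each batch so that it goes from the start value to the end value in a fixed number of steps. A short sketch, with the start, end, and step count chosen as illustrative assumptions:

```python
start_lr, end_lr, num_iter = 1e-7, 10.0, 100

# Constant per-step multiplier so that start_lr * mult**num_iter == end_lr
mult = (end_lr / start_lr) ** (1 / num_iter)

# Learning rate used at each step, including the final value
lrs = [start_lr * mult ** i for i in range(num_iter + 1)]

print(f'Per-step multiplier: {mult:.4f}')
print(f'First LR: {lrs[0]:.1e}  last LR: {lrs[-1]:.1e}')
```

Because the growth is geometric, the points are evenly spaced on a log-scale x-axis, which is why the resulting loss curve is plotted against log learning rate.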


In [None]:
# 7.7.1 -- LR Range Test implementation

import math

def lr_range_test(model_fn, train_loader, criterion,
                  start_lr=1e-7, end_lr=10.0, num_iter=100):
    """
    Run the LR range test.
    model_fn: callable that returns a freshly initialised model
    Returns: (lrs, losses) lists for plotting
    """
    model     = model_fn().to(DEVICE)
    optimiser = torch.optim.SGD(model.parameters(), lr=start_lr)
    mult      = (end_lr / start_lr) ** (1 / num_iter)

    lrs, losses = [], []
    best_loss   = float('inf')
    avg_loss    = 0.0
    beta        = 0.98   # smoothing factor

    data_iter = iter(train_loader)

    for i in range(num_iter):
        try:
            X_batch, y_batch = next(data_iter)
        except StopIteration:
            data_iter = iter(train_loader)
            X_batch, y_batch = next(data_iter)

        X_batch = X_batch.to(DEVICE)
        y_batch = y_batch.to(DEVICE)

        model.train()
        optimiser.zero_grad()
        out  = model(X_batch)
        loss = criterion(out.squeeze(), y_batch.float())
        loss.backward()
        optimiser.step()

        # Exponentially weighted moving average of loss
        avg_loss = beta * avg_loss + (1 - beta) * loss.item()
        smooth   = avg_loss / (1 - beta ** (i + 1))

        current_lr = optimiser.param_groups[0]['lr']
        lrs.append(current_lr)
        losses.append(smooth)

        if smooth < best_loss:
            best_loss = smooth
        # Stop if loss explodes
        if smooth > 4 * best_loss:
            break

        # Increase LR for next step
        for pg in optimiser.param_groups:
            pg['lr'] *= mult

    return lrs, losses


# Run the test on the binary classification task from section 7.5.
# Rebuild the training data with the same features as section 7.5; the target
# column uses_python must NOT appear among the features (that would leak the label).
from torch.utils.data import TensorDataset, DataLoader
from sklearn.preprocessing import StandardScaler as SkScaler
from sklearn.model_selection import train_test_split as sk_split

feat_cols_7 = [c for c in ['YearsCodePro', 'ConvertedCompYearly',
                           'uses_sql', 'uses_js', 'uses_ai']
               if c in df.columns]
X7 = df[feat_cols_7].copy()
for col in feat_cols_7:
    med = X7[col].median()
    X7[col] = X7[col].fillna(med if pd.notna(med) else 0)
y7 = df['uses_python'].values.astype('float32')

X7tr, X7te, y7tr, y7te = sk_split(X7.values, y7, test_size=0.2,
                                    random_state=RANDOM_STATE, stratify=y7)
sc7 = SkScaler()
X7tr_s = sc7.fit_transform(X7tr).astype('float32')

lr_ds     = TensorDataset(torch.tensor(X7tr_s), torch.tensor(y7tr))
lr_loader = DataLoader(lr_ds, batch_size=64, shuffle=True)

def fresh_clf():
    return nn.Sequential(
        nn.Linear(X7tr_s.shape[1], 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(64, 32),              nn.BatchNorm1d(32), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(32, 1)
    )

lrs, losses = lr_range_test(
    fresh_clf, lr_loader, nn.BCEWithLogitsLoss(), num_iter=150
)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(lrs, losses, '#2E75B6', linewidth=2)
ax.set_xscale('log')
ax.set_xlabel('Learning Rate (log scale)')
ax.set_ylabel('Smoothed Loss')
ax.set_title('LR Range Test\nChoose LR just before the minimum -- typically 10x below the lowest point')

# Mark a suggested LR: one order of magnitude below the LR at the loss minimum
min_idx     = losses.index(min(losses))
suggest_lr  = lrs[min_idx] / 10
ax.axvline(suggest_lr, color='red', linestyle='--', linewidth=1.5,
           label=f'Suggested LR ≈ {suggest_lr:.1e}')
ax.legend()
plt.tight_layout()
plt.show()

print(f'Suggested learning rate from range test: {suggest_lr:.2e}')
print('Treat this as a starting point, e.g. max_lr in OneCycleLR.')
print('The test ran with SGD, so re-check the value before reusing it with AdamW.')


---

## Section 7.8 -- Mixed Precision Training

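Modern GPUs have dedicated hardware for 16-bit (float16) arithmetic that can run
2-4x faster than 32-bit (float32) arithmetic. **Mixed precision training** runs
most of the forward and backward pass in float16, but keeps the weights in
float32 for the update step (where numerical precision matters).

PyTorch's `torch.cuda.amp` (Automatic Mixed Precision) handles this transparently:
- `autocast()` context manager: runs eligible operations (such as matrix multiplications)
  in float16 and leaves precision-sensitive ones in float32
- `GradScaler`: scales the loss upward before backprop to prevent float16 underflow
  in small gradients, then unscales the gradients before the optimiser step

Recent PyTorch releases expose the same tools as `torch.amp.autocast('cuda')` and
`torch.amp.GradScaler('cuda')` and emit a deprecation warning for the
`torch.cuda.amp` spellings; the behaviour is the same.

**Result:** up to ~2x faster training and up to ~50% less GPU memory on large models,
for a handful of extra lines. Small models such as the MLPs in this chapter are often
limited by data loading and kernel-launch overhead, so expect a smaller gain.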
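
The underflow that `GradScaler` guards against can be reproduced on CPU in a few lines. The sketch below is illustrative only: the gradient value `1e-8` is made up, and the scale factor `2**16` is assumed here because it matches `GradScaler`'s default initial scale.

```python
import torch

# Hypothetical small gradient value, chosen to fall below float16's smallest
# representable positive number (about 6e-8).
true_grad = torch.tensor(1e-8, dtype=torch.float32)

# Cast directly to float16: the value underflows to zero and the update is lost.
naive_fp16 = true_grad.half()

# Scale first (assumed factor 2**16), then cast: the value now fits in float16.
scale = 2.0 ** 16
scaled_fp16 = (true_grad * scale).half()

# Unscale in float32, mirroring the unscale step before the optimiser update.
recovered = scaled_fp16.float() / scale

print(f'naive float16 cast  : {naive_fp16.item():.3e}')
print(f'scaled then unscaled: {recovered.item():.3e}')
```

The direct cast prints zero, while the scaled path recovers the original value to within float16's rounding error. This is why the loss is multiplied by a large factor before `backward()`.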


In [None]:
# 7.8.1 -- Mixed precision training with torch.cuda.amp

import time

# Build a slightly larger model to make the timing difference visible
class LargerMLP(nn.Module):
    def __init__(self, input_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden),  nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, hidden),     nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, hidden//2),  nn.BatchNorm1d(hidden//2), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden//2, 1)
        )
    def forward(self, x): return self.net(x)


N_BENCH_EPOCHS = 5
input_dim_7 = X7tr_s.shape[1]

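def run_training(use_amp, n_epochs=N_BENCH_EPOCHS):
    # torch.cuda.amp only takes effect on CUDA; fall back to plain float32 elsewhere
    use_amp = use_amp and DEVICE.type == 'cuda'
    model   = LargerMLP(input_dim_7).to(DEVICE)
    opt     = torch.optim.AdamW(model.parameters(), lr=1e-3)
    crit    = nn.BCEWithLogitsLoss()
    scaler  = torch.cuda.amp.GradScaler(enabled=use_amp)
    loader  = DataLoader(lr_ds, batch_size=256, shuffle=True)

    t0 = time.perf_counter()
    for _ in range(n_epochs):
        model.train()
        for Xb, yb in loader:
            Xb, yb = Xb.to(DEVICE), yb.to(DEVICE)
            opt.zero_grad()
            with torch.cuda.amp.autocast(enabled=use_amp):
                out  = model(Xb)
                # squeeze(-1) keeps the batch dimension, so shapes match yb
                loss = crit(out.squeeze(-1), yb)
            scaler.scale(loss).backward()   # scale the loss so small gradients survive float16
            scaler.step(opt)                # unscales gradients; skips the step if any are inf/NaN
            scaler.update()                 # adjusts the scale factor for the next iteration
    if DEVICE.type == 'cuda':
        torch.cuda.synchronize()            # CUDA kernels run asynchronously; wait before stopping the clock
    elapsed = time.perf_counter() - t0
    return elapsed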


# Warm up GPU
_ = run_training(use_amp=False, n_epochs=1)

t_fp32 = run_training(use_amp=False)
t_amp  = run_training(use_amp=True)

amp_label = 'AMP (float16/32)' if DEVICE.type == 'cuda' else 'AMP (CPU -- no speedup on CPU)'
print(f'{N_BENCH_EPOCHS}-epoch training benchmark ({DEVICE}):')
print(f'  float32 only:   {t_fp32:.2f}s')
print(f'  {amp_label}: {t_amp:.2f}s')
if DEVICE.type == 'cuda':
    speedup = t_fp32 / t_amp
    print(f'  Speedup:        {speedup:.2f}x')
    print()
    print('To add AMP to any training loop:')
    print('  1. scaler = torch.cuda.amp.GradScaler()')
    print('  2. Wrap forward pass: with torch.cuda.amp.autocast(): ...')
    print('  3. Replace loss.backward() with scaler.scale(loss).backward()')
    print('  4. Replace opt.step() with scaler.step(opt); scaler.update()')
else:
    print('  Note: AMP speedup only applies on CUDA GPUs.')
    print('  On CPU both runs use float32, so the timings are about the same -- enable T4 GPU to see the speedup.')


---

## Section 7.9 -- `torch.compile`: PyTorch 2.0 Speedup

PyTorch 2.0 introduced `torch.compile()` -- a one-line function that compiles
a model using TorchDynamo and TorchInductor, generating optimised kernel code
for the hardware it runs on.

It requires no changes to your model architecture or training loop.
The first forward pass is slow (compilation overhead), but subsequent
passes usually run faster. On modern GPUs, a 20-50% speedup is a reasonable
expectation for larger MLP and CNN architectures; small models may gain little.

```python
# Before -- standard model
model = MyModel().to(DEVICE)

# After -- compiled model (same API, faster execution)
model = torch.compile(MyModel().to(DEVICE))
```

**When to use it:** large models, long training runs, production inference.
The compilation overhead (~30s on first run) is amortised over many iterations.
Not worth it for tiny models or quick experiments.
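
The amortisation argument can be made concrete with a little arithmetic. The sketch below is an illustration, not a measurement: `break_even_seconds` is a hypothetical helper, the 30-second overhead is the rough figure quoted above, and a "20-50% speedup" is interpreted here as a 1.2x-1.5x throughput gain.

```python
def break_even_seconds(compile_overhead_s: float, speedup: float) -> float:
    """Eager training time after which compiling starts to pay off.

    An eager run takes T seconds; a compiled run takes overhead + T / speedup.
    Setting the two equal gives T = overhead / (1 - 1 / speedup).
    """
    if speedup <= 1.0:
        return float('inf')   # without a speedup the overhead is never recovered
    return compile_overhead_s / (1.0 - 1.0 / speedup)


# Illustrative figures: ~30s overhead, 1.2x and 1.5x speedups.
for speedup in (1.2, 1.5):
    t = break_even_seconds(30.0, speedup)
    print(f'speedup {speedup:.1f}x -> break-even after ~{t:.0f}s of eager training')
```

Under these assumptions the compiled model only comes out ahead once the equivalent eager run would have taken about 90 to 180 seconds, which is why short experiments rarely benefit.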


In [None]:
# 7.9.1 -- torch.compile benchmark

import time
import torch

pytorch_version = tuple(int(x) for x in torch.__version__.split('.')[:2])
compile_available = pytorch_version >= (2, 0) and DEVICE.type == 'cuda'

if not compile_available:
    reason = 'CPU runtime' if DEVICE.type != 'cuda' else f'PyTorch {torch.__version__} < 2.0'
    print(f'torch.compile benchmark skipped: {reason}')
    print('This benchmark needs PyTorch >= 2.0 and a CUDA GPU (torch.compile itself can also target CPU).')
    print('Enable T4 GPU in Colab to run this benchmark.')
else:
    model_eager    = LargerMLP(input_dim_7).to(DEVICE)
    model_compiled = torch.compile(LargerMLP(input_dim_7).to(DEVICE))

    crit    = nn.BCEWithLogitsLoss()
    loader  = DataLoader(lr_ds, batch_size=256, shuffle=True)

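    def bench(model, n_epochs=5):
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
        scaler = torch.cuda.amp.GradScaler()
        torch.cuda.synchronize()               # start timing from an idle GPU
        t0 = time.perf_counter()
        for _ in range(n_epochs):
            model.train()
            for Xb, yb in loader:
                Xb, yb = Xb.to(DEVICE), yb.to(DEVICE)
                opt.zero_grad()
                with torch.cuda.amp.autocast():
                    # squeeze(-1) keeps the batch dimension, so shapes match yb
                    loss = crit(model(Xb).squeeze(-1), yb)
                scaler.scale(loss).backward()
                scaler.step(opt)
                scaler.update()
        torch.cuda.synchronize()               # wait for queued kernels before stopping the clock
        return time.perf_counter() - t0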

    # Warm up both models (first pass triggers compilation)
    print('Warming up (first pass compiles the model -- takes ~30s)...')
    _ = bench(model_eager,    n_epochs=1)
    _ = bench(model_compiled, n_epochs=1)

    t_eager    = bench(model_eager,    n_epochs=5)
    t_compiled = bench(model_compiled, n_epochs=5)

    print(f'5-epoch benchmark on {DEVICE}:')
    print(f'  Eager (standard):  {t_eager:.2f}s')
    print(f'  torch.compile:     {t_compiled:.2f}s')
    print(f'  Speedup:           {t_eager/t_compiled:.2f}x')
    print()
    print('Note: speedup is more pronounced for larger models and longer runs.')


---

## Chapter 7 Summary

### Key Takeaways

- A **tensor** is a GPU-aware, autograd-enabled array. `requires_grad=True` enables gradient tracking.
- **`model.train()` / `model.eval()`** toggle Dropout and BatchNorm behaviour.
  Forgetting `.eval()` during validation is a common subtle bug.
- The **training loop** has five steps every iteration: zero_grad, forward, loss, backward, step.
- **`torch.no_grad()`** disables autograd during inference. Always use it for validation and test.
- **BatchNorm** normalises activations within a batch -- accelerates training and
  reduces sensitivity to learning rate.
- **Dropout** randomly zeroes activations during training. Disabled automatically during eval.
- **`BCEWithLogitsLoss`** is numerically more stable than `Sigmoid + BCELoss`.
- **Save `state_dict`, not the model object.** `torch.save(model.state_dict(), path)` is portable.
- The **LR Range Test** finds a good learning rate in one short run without expensive grid search.
  Look for the learning rate just before the loss minimum -- one order of magnitude below it.
- **Mixed precision (`torch.cuda.amp`)** can give up to ~2x faster training and up to ~50% less
  memory on GPU for a few extra lines of code built around `GradScaler` and `autocast()`.
- **`torch.compile`** (PyTorch 2.0+) compiles the model to optimised kernels.
  One line, no API changes, and typically a 20-50% additional speedup on large models.
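
As a compact recap, the sketch below combines several of these takeaways in one loop: the five training steps, the `train()` / `eval()` toggle, `torch.no_grad()` for validation, and `BCEWithLogitsLoss` on raw logits. The data is synthetic and the layer sizes, learning rate, and epoch count are arbitrary choices for illustration, not the chapter's SO 2025 setup.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data (hypothetical): 256 rows, 4 features, binary label.
X = torch.randn(256, 4)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).float()
train_loader = DataLoader(TensorDataset(X[:192], y[:192]), batch_size=32, shuffle=True)
val_loader   = DataLoader(TensorDataset(X[192:], y[192:]), batch_size=64)

model = nn.Sequential(
    nn.Linear(4, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(16, 1),
)
opt  = torch.optim.AdamW(model.parameters(), lr=1e-2)
crit = nn.BCEWithLogitsLoss()   # takes raw logits; no Sigmoid layer in the model

for epoch in range(5):
    model.train()                       # Dropout active, BatchNorm uses batch statistics
    for Xb, yb in train_loader:
        opt.zero_grad()                 # 1. clear old gradients
        logits = model(Xb).squeeze(-1)  # 2. forward pass
        loss = crit(logits, yb)         # 3. loss
        loss.backward()                 # 4. backward pass
        opt.step()                      # 5. optimiser step

    model.eval()                        # Dropout off, BatchNorm uses running statistics
    val_loss, n = 0.0, 0
    with torch.no_grad():               # no autograd graph during validation
        for Xb, yb in val_loader:
            val_loss += crit(model(Xb).squeeze(-1), yb).item() * len(yb)
            n += len(yb)
    print(f'epoch {epoch}: val loss {val_loss / n:.4f}')
```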

### Project Thread Status

| Task | Architecture | Result |
|------|-------------|--------|
| Salary regression | MLP 128-64-32 + BatchNorm + Dropout | R^2 reported |
| Python classification | MLP 64-32 + BatchNorm + Dropout | Accuracy reported |
| Model persistence | torch.save / torch.load | Round-trip verified |
| LR Range Test | Exponential LR sweep | Suggested LR identified |
| Mixed precision | torch.cuda.amp | Speedup benchmarked |
| torch.compile | PyTorch 2.0 compiler | Speedup benchmarked |

---

### What's Next: Chapter 8 -- NLP and Transformers

Chapter 8 applies neural networks to text: tokenisation, embeddings,
sentiment analysis with a pre-trained transformer, fine-tuning a BERT-family
model on SO 2025 developer comments, and RAG (Retrieval-Augmented Generation)
-- the dominant production NLP pattern.

---

*End of Chapter 7 -- Python for AI/ML*  
[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)
