# 🏋️ Exercise 1 — **Solution Notebook** (Built on the Demo)

This is the **worked solution**. It *extends* the demo rather than repeating it:
the solution tasks are implemented in new cells placed **above** the demo, and all
original demo cells follow **below**, unchanged.
 
**Scope:** 
- CIFAR‑10 loaders with knobs.
- Quick training/evaluation loops.
- Model comparison.
- Profiling with TensorBoard.
- Ablations for batch size and `num_workers`.

## 🛠️ Setup

This cell:
- Imports PyTorch, TorchVision, and utility packages used across all tasks.
- Sets the device to GPU if available, otherwise defaults to CPU.
- Prints the device being used for the notebook.

In [1]:
# Core imports
import os, time, math, json, shutil, sys
from collections import defaultdict

import torch
from torch import nn, optim
from torch.utils.data import DataLoader

import torchvision
from torchvision import transforms

# TensorBoard logging (for profiler traces & scalars)
from torch.utils.tensorboard import SummaryWriter

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

Device: cuda


## Task A — 🔄 Central Knobs for DataLoaders (Batch Size, Num Workers)

This section:
- Builds reusable CIFAR‑10 DataLoaders with configurable batch size and number of workers.
- Defines:
  - `_ex1_build_datasets`: Creates and caches the CIFAR‑10 training and test datasets. Both are normalized; the training set also gets random-crop and horizontal-flip augmentation.
  - `_ex1_build_loaders`: Builds DataLoaders for the datasets.
  - `set_ex1_dataloader_params`: Updates DataLoader parameters and rebuilds the loaders.
- Ensures flexibility for experimentation with different DataLoader configurations.

In [2]:
# ---- DataLoader builder with knobs ----
_ex1_datasets = {}
_ex1_loaders = {}
_ex1_cfg = {"batch_size": 128, "num_workers": 2}

def _ex1_build_datasets():
    global _ex1_datasets
    if _ex1_datasets:
        return _ex1_datasets
    # Standard CIFAR-10 normalization
    transform_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])
    transform_test = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])
    train_ds = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform_train)
    test_ds  = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform_test)
    _ex1_datasets = {"train": train_ds, "test": test_ds}
    return _ex1_datasets

def _ex1_build_loaders():
    global _ex1_loaders
    ds = _ex1_build_datasets()
    bs = _ex1_cfg["batch_size"]
    nw = _ex1_cfg["num_workers"]
    train_loader = DataLoader(ds["train"], batch_size=bs, shuffle=True,  num_workers=nw, pin_memory=True)
    test_loader  = DataLoader(ds["test"],  batch_size=bs, shuffle=False, num_workers=nw, pin_memory=True)
    _ex1_loaders = {"train": train_loader, "test": test_loader}
    return _ex1_loaders

def set_ex1_dataloader_params(batch_size=128, num_workers=2):
    """Update loader knobs and rebuild loaders."""
    _ex1_cfg["batch_size"] = int(batch_size)
    _ex1_cfg["num_workers"] = int(num_workers)
    # Force rebuild
    _ex1_loaders.clear()
    return _ex1_build_loaders()

# Build initial loaders
_ = set_ex1_dataloader_params(batch_size=128, num_workers=2)
print("DataLoader params:", _ex1_cfg)

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [00:01<00:00, 106105046.02it/s]


Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
DataLoader params: {'batch_size': 128, 'num_workers': 2}
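
Because `batch_size` determines how many steps make up an epoch, it helps to know how many batches the loaders above produce. The following is a minimal sketch, assuming the standard CIFAR‑10 split sizes (50,000 training and 10,000 test images) and the `DataLoader` default of `drop_last=False`. `batches_per_epoch` is a hypothetical helper, not part of the notebook.

```python
import math

# Standard CIFAR-10 split sizes (assumed here; check len(dataset) in practice).
CIFAR10_TRAIN_SIZE = 50_000
CIFAR10_TEST_SIZE = 10_000

def batches_per_epoch(num_samples: int, batch_size: int) -> tuple:
    """Return (number of batches, size of the last batch) when drop_last=False."""
    num_batches = math.ceil(num_samples / batch_size)
    last_batch = num_samples - (num_batches - 1) * batch_size
    return num_batches, last_batch

for bs in (64, 128, 256):
    print(bs, batches_per_epoch(CIFAR10_TRAIN_SIZE, bs))
```

With the default batch size of 128 this gives 391 training batches, so the 20-step quick runs later in the notebook cover only about 5% of an epoch.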


## Utility — 🧩 Small Model Zoo and Quick Train/Eval

This section:
- Provides utility functions for training and evaluating models.
- Defines:
  - `get_ex1_model`: Builds an untrained (`weights=None`) ResNet-18, ResNet-34, or MobileNetV2 with 10 output classes and moves it to the selected device.
  - `ex1_train_one_epoch`: Trains the model for one epoch with optional step limits for quick runs.
  - `ex1_eval_one_epoch`: Evaluates the model for one epoch, computing accuracy and throughput.
- Measures throughput in samples per second for both training and evaluation.

In [3]:
# ---- Models ----
def get_ex1_model(name: str):
    name = name.lower()
    if name == "resnet18":
        m = torchvision.models.resnet18(weights=None, num_classes=10)
    elif name == "resnet34":
        m = torchvision.models.resnet34(weights=None, num_classes=10)
    elif name in ("mobilenet_v2", "mbv2"):
        m = torchvision.models.mobilenet_v2(weights=None, num_classes=10)
    else:
        raise ValueError(f"Unsupported model: {name}")
    return m.to(device)

# ---- Train / Eval helpers ----
def ex1_train_one_epoch(model, loader, optimizer, criterion, max_steps=None):
    model.train()
    n_seen = 0
    t0 = time.time()
    total_loss = 0.0

    for step, (x, y) in enumerate(loader):
        if max_steps is not None and step >= max_steps:
            break
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)

        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * x.size(0)
        n_seen += x.size(0)

    t1 = time.time()
    throughput = n_seen / (t1 - t0) if (t1 - t0) > 0 else float('nan')
    return {
        "loss": total_loss / max(1, n_seen),
        "num_samples": n_seen,
        "latency_s": t1 - t0,
        "throughput_samp_per_s": throughput,
    }

@torch.no_grad()
def ex1_eval_one_epoch(model, loader, criterion, max_steps=None):
    model.eval()
    n_seen = 0
    correct = 0
    t0 = time.time()
    total_loss = 0.0

    for step, (x, y) in enumerate(loader):
        if max_steps is not None and step >= max_steps:
            break
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        logits = model(x)
        loss = criterion(logits, y)
        total_loss += loss.item() * x.size(0)

        pred = logits.argmax(dim=1)
        correct += (pred == y).sum().item()
        n_seen += x.size(0)

    t1 = time.time()
    throughput = n_seen / (t1 - t0) if (t1 - t0) > 0 else float('nan')
    accuracy = correct / max(1, n_seen)
    return {
        "loss": total_loss / max(1, n_seen),
        "acc": accuracy,
        "num_samples": n_seen,
        "latency_s": t1 - t0,
        "throughput_samp_per_s": throughput,
    }
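
Both helpers accumulate `loss.item() * x.size(0)` and divide by `n_seen`, which yields a per-sample average even when the final batch is smaller than the rest. The sketch below uses made-up batch losses and sizes (all values are hypothetical) to show how that differs from averaging the batch means directly.

```python
# Hypothetical per-batch mean losses and batch sizes; the last batch is partial.
batch_losses = [2.0, 2.0, 4.0]
batch_sizes = [128, 128, 16]

# Naive: average of batch means, which over-weights the small final batch.
mean_of_means = sum(batch_losses) / len(batch_losses)

# What the helpers do: weight each batch mean by its sample count.
total_loss = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
n_seen = sum(batch_sizes)
per_sample_mean = total_loss / max(1, n_seen)

print(f"mean of batch means: {mean_of_means:.4f}")
print(f"per-sample mean:     {per_sample_mean:.4f}")
```

When every batch is full, as in a 20-step run at batch size 128, the two averages coincide; the weighting only matters once a partial batch is included.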

### 🚀 Quick Smoke Run (Task A Deliverable)

This cell:
- Runs a short training and evaluation loop to verify that the DataLoaders and model are working correctly.
- Limits the number of steps to ensure the run is quick.
- Prints training and validation statistics, including loss, throughput, and latency.

In [4]:
# ---- Task A: quick run ----
loaders = set_ex1_dataloader_params(batch_size=128, num_workers=2)
train_loader, test_loader = loaders["train"], loaders["test"]

model = get_ex1_model("resnet18")
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

train_stats = ex1_train_one_epoch(model, train_loader, optimizer, criterion, max_steps=20)
val_stats   = ex1_eval_one_epoch(model, test_loader, criterion, max_steps=20)

print("Train stats:", train_stats)
print("Val   stats:", val_stats)

Train stats: {'loss': 4.506065475940704, 'num_samples': 2560, 'latency_s': 1.396545648574829, 'throughput_samp_per_s': 1833.0943944528221}
Val   stats: {'loss': 621.117073059082, 'acc': 0.0984375, 'num_samples': 2560, 'latency_s': 0.40219855308532715, 'throughput_samp_per_s': 6365.0154391701435}
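
To interpret numbers from such a short run, it helps to compare them with what an uninformed classifier would score. Here is a small sketch of that baseline, assuming balanced classes:

```python
import math

NUM_CLASSES = 10  # CIFAR-10

# Accuracy expected from guessing uniformly at random.
chance_accuracy = 1.0 / NUM_CLASSES

# Cross-entropy of a model that assigns probability 1/10 to every class.
chance_loss = -math.log(1.0 / NUM_CLASSES)

print(f"chance accuracy: {chance_accuracy:.3f}")
print(f"chance loss:     {chance_loss:.3f}")
```

The reported validation accuracy of about 0.098 sits at this chance level, which is unsurprising after 20 steps. A loss far above ln 10 ≈ 2.30 suggests the model is confidently wrong on many samples, so this smoke run is best read as a check of the pipeline and its throughput, not of model quality.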


## Task B — 🆚 Add a Second Model and Compare

This section:
- Compares the performance of two models (`resnet18` and `resnet34`) using the same pipeline.
- Measures training and evaluation metrics for both models.
- Returns a dictionary containing the results for each model.

In [5]:
# ---- Task B: compare two models under same loaders ----
def ex1_compare_models(model_names=("resnet18","resnet34"), max_steps=20):
    results = {}
    for name in model_names:
        m = get_ex1_model(name)
        opt = optim.SGD(m.parameters(), lr=0.1, momentum=0.9)
        tr = ex1_train_one_epoch(m, train_loader, opt, criterion, max_steps=max_steps)
        ev = ex1_eval_one_epoch(m, test_loader, criterion, max_steps=max_steps)
        results[name] = {"train": tr, "val": ev}
    return results

compare_models = ex1_compare_models(model_names=("resnet18","resnet34"), max_steps=20)
compare_models

{'resnet18': {'train': {'loss': 4.539602792263031,
   'num_samples': 2560,
   'latency_s': 0.9383394718170166,
   'throughput_samp_per_s': 2728.223715285868},
  'val': {'loss': 15.011657333374023,
   'acc': 0.110546875,
   'num_samples': 2560,
   'latency_s': 0.3471341133117676,
   'throughput_samp_per_s': 7374.67134986188}},
 'resnet34': {'train': {'loss': 5.312578654289245,
   'num_samples': 2560,
   'latency_s': 1.404033899307251,
   'throughput_samp_per_s': 1823.3177997077576},
  'val': {'loss': 349704.790625,
   'acc': 0.1046875,
   'num_samples': 2560,
   'latency_s': 0.36156678199768066,
   'throughput_samp_per_s': 7080.296441658243}}}
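
The raw dictionary is easier to read once it is reduced to ratios. The sketch below copies the throughput figures from the output above into plain dictionaries; `speedup` is a hypothetical helper introduced only for illustration.

```python
# Throughput figures (samples/s) copied from the output above.
train_tp = {"resnet18": 2728.223715285868, "resnet34": 1823.3177997077576}
val_tp = {"resnet18": 7374.67134986188, "resnet34": 7080.296441658243}

def speedup(throughputs: dict, fast: str, slow: str) -> float:
    """Ratio of two throughputs; values above 1 mean `fast` is quicker."""
    return throughputs[fast] / throughputs[slow]

print(f"train: {speedup(train_tp, 'resnet18', 'resnet34'):.2f}x")
print(f"val:   {speedup(val_tp, 'resnet18', 'resnet34'):.2f}x")
```

On this run ResNet-18 trains roughly 1.5x faster than ResNet-34, while validation throughput is nearly identical, which may indicate that evaluation is limited by data loading rather than by the model. Timings from 20 steps are noisy (the same ResNet-18 configuration reached about 1,833 samples/s in the smoke run), so treat the ratios as indicative only.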

## Task C — 📊 PyTorch Profiler → TensorBoard (Top Ops)

This section:
- Profiles the training loop using PyTorch Profiler.
- Saves profiling traces to TensorBoard for visualization.
- Captures:
  - Operator-level performance metrics.
  - CUDA and CPU activity.
  - Memory usage.
- Prints the top operators by total CUDA or CPU time.
- Provides a path to the TensorBoard logs for further analysis.

In [6]:
# ---- Task C: Profiler to TensorBoard ----
from torch.profiler import profile, ProfilerActivity, schedule

log_dir = "./runs/ex1_solution"
if os.path.exists(log_dir):
    shutil.rmtree(log_dir)
writer = SummaryWriter(log_dir=log_dir)

model_prof = get_ex1_model("resnet18")
opt_prof = optim.SGD(model_prof.parameters(), lr=0.1, momentum=0.9)

# A tiny schedule to keep it quick
sched = schedule(wait=1, warmup=1, active=2, repeat=1)

top_ops = []

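def _trace_handler(p):
    # Save a Chrome trace (open it in chrome://tracing or Perfetto)
    p.export_chrome_trace(os.path.join(log_dir, "trace.json"))
    # Also write the trace in the layout TensorBoard's profiler plugin reads
    # (viewing it requires the separate torch-tb-profiler package)
    torch.profiler.tensorboard_trace_handler(log_dir)(p)
    # Print the top ops by total CUDA time if available, else by total CPU time
    try:
        evt = p.key_averages().table(sort_by="cuda_time_total", row_limit=10)
    except Exception:
        evt = p.key_averages().table(sort_by="cpu_time_total", row_limit=10)
    print(evt)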

activities = [ProfilerActivity.CPU]
if device.type == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities,
             schedule=sched,
             on_trace_ready=_trace_handler,
             record_shapes=True,
             profile_memory=True,
             with_stack=False) as prof:

    steps = 0
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        opt_prof.zero_grad()
        out = model_prof(xb)
        loss = criterion(out, yb)
        loss.backward()
        opt_prof.step()

        prof.step()
        steps += 1
        if steps >= 20:
            break

writer.flush()
writer.close()
print(f"Profiler traces saved to: {log_dir}. Launch TensorBoard pointing to this folder.")

STAGE:2025-08-24 19:59:57 26644:26644 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2025-08-24 19:59:57 26644:26644 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2025-08-24 19:59:57 26644:26644 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
autograd::engine::evaluate_function: ConvolutionBack...         0.51%     529.000us         5.66%       5.827ms     145.675us       0.000us         0.00%      50.437ms       1.261ms           0 b           0 b      53.70 Mb     -72.75 M
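
The schedule `wait=1, warmup=1, active=2, repeat=1` determines which of the 20 loop iterations are actually recorded. The sketch below is a simplified pure-Python model of that logic, written for illustration only; `profiler_phase` is a hypothetical function, not the library implementation, and it ignores the `skip_first` option.

```python
def profiler_phase(step: int, wait: int, warmup: int, active: int, repeat: int) -> str:
    """Simplified model of the phase torch.profiler.schedule assigns to a step."""
    cycle = wait + warmup + active
    # With repeat > 0, profiling stops after `repeat` full cycles.
    if repeat > 0 and step // cycle >= repeat:
        return "NONE"
    pos = step % cycle
    if pos < wait:
        return "NONE"
    if pos < wait + warmup:
        return "WARMUP"
    # The last active step of a cycle closes it; the trace handler runs afterwards.
    return "RECORD_AND_SAVE" if pos == cycle - 1 else "RECORD"

phases = [profiler_phase(s, wait=1, warmup=1, active=2, repeat=1) for s in range(6)]
print(phases)
```

Under this model only steps 2 and 3 are recorded and `_trace_handler` is invoked once; the remaining iterations run without profiling.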

## Task D — 🔬 Ablations: Batch Size & `num_workers`

This section:
- Sweeps through different configurations of batch size and number of workers.
- Measures training and validation throughput for each configuration.
- Returns the results as a pandas DataFrame for easy analysis.

In [7]:
# ---- Task D: Ablations ----
import pandas as pd

def ex1_ablate(configs=((64,2),(128,2),(128,4)), max_steps=20):
    rows = []
    for bs, nw in configs:
        set_ex1_dataloader_params(bs, nw)
        m = get_ex1_model("resnet18")
        opt = optim.SGD(m.parameters(), lr=0.1, momentum=0.9)
        tr = ex1_train_one_epoch(m, _ex1_loaders["train"], opt, criterion, max_steps=max_steps)
        ev = ex1_eval_one_epoch(m, _ex1_loaders["test"], criterion, max_steps=max_steps)
        rows.append({
            "batch_size": bs,
            "num_workers": nw,
            "train_throughput": tr["throughput_samp_per_s"],
            "val_throughput": ev["throughput_samp_per_s"],
        })
    return pd.DataFrame(rows)

df_abl = ex1_ablate(configs=((64,2),(128,2),(128,4)), max_steps=20)
df_abl

| | batch_size | num_workers | train_throughput | val_throughput |
|---|---|---|---|---|
| 0 | 64 | 2 | 2867.734725 | 5682.424407 |
| 1 | 128 | 2 | 2885.691414 | 7082.141089 |
| 2 | 128 | 4 | 2751.918987 | 9267.580045 |
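
Three hand-picked configurations leave most of the grid unexplored. One way to extend the sweep is to generate every combination with `itertools.product`; the values below are hypothetical and should be adapted to the CPU cores and GPU memory available.

```python
from itertools import product

# Hypothetical sweep values; adjust to the hardware at hand.
batch_sizes = (64, 128, 256)
worker_counts = (0, 2, 4)

# Every (batch_size, num_workers) pair, in the tuple-of-pairs shape ex1_ablate expects.
configs = tuple(product(batch_sizes, worker_counts))

print(len(configs), "configurations")
print(configs[:3])
```

Passing this as `ex1_ablate(configs=configs)` would run nine short trainings. Since each measurement covers only 20 steps and includes worker start-up time, repeating the sweep and comparing medians is likely to give more trustworthy numbers.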


## 🔍 Launch TensorBoard

This cell:
- Launches TensorBoard to visualize the profiling traces saved in the `./runs/ex1_solution` directory.
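- Serves TensorBoard on port 6006 and binds to all network interfaces (`--host 0.0.0.0`), so it is reachable from the host's browser even when the notebook runs in a container or on a remote machine.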

### Launch TensorBoard from VS Code (UI way)

- Open the project folder in VS Code.

- Press `Ctrl/Cmd + Shift + P` → run “Python: Launch TensorBoard” (or “Launch TensorBoard”).

- Choose your log directory (e.g., `runs/` or `logs/`), pick a port (default `6006`).

- TensorBoard opens in an editor tab (or your browser).

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./runs/ex1_solution --host 0.0.0.0 --port 6006
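
If the dashboard comes up empty, one quick check is whether the profiling cell actually wrote trace files into the log directory. This is a minimal sketch: `list_trace_files` is a hypothetical helper, and the `*.json` pattern assumes uncompressed Chrome-format traces such as the `trace.json` exported in Task C.

```python
from pathlib import Path

def list_trace_files(log_dir: str) -> list:
    """Return the JSON trace files found under log_dir, sorted by path."""
    root = Path(log_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.json") if p.is_file())

traces = list_trace_files("./runs/ex1_solution")
print(f"{len(traces)} trace file(s) found")
for path in traces:
    print(" ", path, f"({path.stat().st_size / 1e6:.1f} MB)")
```

An empty result means the profiler never reached the end of its active window or wrote somewhere else, so the profiling cell should be re-run before TensorBoard is restarted.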