# Exercise 1 — **Solution Notebook** (Built on the Demo)

This is the **worked solution** that *extends* the demo rather than repeating it.
All original demo cells appear **below** unchanged. The solution tasks are implemented
in the new cells you see **above** the demo.

**Generated:** 2025-08-24 19:57:12  
**Scope:** CIFAR‑10 loaders with knobs, quick training/eval, model comparison, profiler→TensorBoard, and ablations.


## Setup

This cell imports PyTorch, TorchVision, and utility packages used across all tasks.


In [1]:
# Core imports
import os, time, math, json, shutil, sys
from collections import defaultdict

import torch
from torch import nn, optim
from torch.utils.data import DataLoader

import torchvision
from torchvision import transforms

# TensorBoard logging (for profiler traces & scalars)
from torch.utils.tensorboard import SummaryWriter

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

Device: cuda


## Task A — Central knobs for DataLoaders (batch size, num_workers)

We build **reusable CIFAR‑10 dataloaders** and expose knobs via `set_ex1_dataloader_params`.
If your demo already defines loaders, you can *ignore them* and just use these solution loaders,
or adapt the function to call your demo's builder.
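The knob pattern described above (store settings centrally, invalidate the cache, rebuild) can be illustrated without PyTorch. This is a minimal, framework-free sketch, not the notebook's implementation: `build_loaders` is a hypothetical stand-in that returns plain dicts where real code would construct `DataLoader` objects, and the names `cfg`, `cache`, and `set_params` are illustrative only.

```python
# Framework-free sketch of "central knobs + cached loaders".
# Assumption: build_loaders() is a hypothetical stand-in; real code would
# create torch.utils.data.DataLoader objects instead of plain dicts.
cfg = {"batch_size": 128, "num_workers": 2}
cache = {}

def build_loaders():
    """Build loader descriptions from cfg once, then reuse the cache."""
    if not cache:
        for split in ("train", "test"):
            cache[split] = {
                "batch_size": cfg["batch_size"],
                "num_workers": cfg["num_workers"],
                "shuffle": split == "train",  # shuffle only the training split
            }
    return dict(cache)  # shallow copy so callers do not alias the cache

def set_params(batch_size=128, num_workers=2):
    """Store the knobs as ints, drop stale loaders, and rebuild."""
    cfg["batch_size"] = int(batch_size)
    cfg["num_workers"] = int(num_workers)
    cache.clear()  # force a rebuild with the new settings
    return build_loaders()

loaders = set_params(batch_size="64", num_workers=4.0)
print(loaders["train"])
```

Clearing the cache inside the setter is what guarantees that a later call never returns loaders built with the old batch size.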


In [2]:
# ---- DataLoader builder with knobs ----
_ex1_datasets = {}
_ex1_loaders = {}
_ex1_cfg = {"batch_size": 128, "num_workers": 2}

def _ex1_build_datasets():
    """
    TODO: Create CIFAR-10 train/test datasets with standard transforms.
      - Train transforms: RandomCrop(32, padding=4), RandomHorizontalFlip(), ToTensor(), Normalize(mean,std)
      - Test  transforms: ToTensor(), Normalize(mean,std)
      - Use root="./data", set train=True/False, download=True
      - Return a dict: {"train": train_ds, "test": test_ds}
    Hints:
      - mean = (0.4914, 0.4822, 0.4465)
      - std  = (0.2023, 0.1994, 0.2010)
    """
    global _ex1_datasets
    if _ex1_datasets:
        return _ex1_datasets

    raise NotImplementedError("Implement _ex1_build_datasets() per the TODO comments")

def _ex1_build_loaders():
    """
    TODO: Build DataLoaders from the datasets with params in _ex1_cfg.
      - Read batch_size and num_workers from _ex1_cfg
      - train loader: shuffle=True, pin_memory=True
      - test  loader: shuffle=False, pin_memory=True
      - Return a dict: {"train": train_loader, "test": test_loader}
    """
    global _ex1_loaders
    raise NotImplementedError("Implement _ex1_build_loaders() per the TODO comments")

def set_ex1_dataloader_params(batch_size=128, num_workers=2):
    """
    Update loader knobs and rebuild loaders.

    TODO:
      - Cast inputs to int and store in _ex1_cfg
      - Clear _ex1_loaders cache to force rebuild
      - Return the new loaders by calling _ex1_build_loaders()
    """
    raise NotImplementedError("Implement set_ex1_dataloader_params() per the TODO comments")


# Build initial loaders
_ = set_ex1_dataloader_params(batch_size=128, num_workers=2)
print("DataLoader params:", _ex1_cfg)

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [00:01<00:00, 106105046.02it/s]


Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
DataLoader params: {'batch_size': 128, 'num_workers': 2}
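
The `Normalize(mean, std)` transform listed in the `_ex1_build_datasets` TODO applies `(x - mean) / std` to each channel. The sketch below works through that arithmetic in plain Python for a single pixel, using the mean and std values from the hints. The helper `normalize_pixel` is a hypothetical name introduced for illustration; the real transform operates on whole tensors.

```python
# Per-channel normalization arithmetic: (x - mean) / std.
# MEAN and STD are the values given in the hints of _ex1_build_datasets.
MEAN = (0.4914, 0.4822, 0.4465)
STD = (0.2023, 0.1994, 0.2010)

def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple((x - m) / s for x, m, s in zip(rgb, MEAN, STD))

black = normalize_pixel((0.0, 0.0, 0.0))
white = normalize_pixel((1.0, 1.0, 1.0))
print("black ->", tuple(round(v, 3) for v in black))
print("white ->", tuple(round(v, 3) for v in white))
```

After normalization the inputs are roughly zero-centered, so black maps to negative values and white to positive values in every channel.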


## Utility — Small model zoo and quick train/eval

We define:
- `get_ex1_model(name)` — small torchvision models with `num_classes=10`.
- `ex1_train_one_epoch` and `ex1_eval_one_epoch` — quick loops with `max_steps` for short runs.
- Both loops also report wall-clock latency (seconds) and throughput (samples/sec).


In [3]:
# ---- Models ----
def get_ex1_model(name: str):
    name = name.lower()
    if name == "resnet18":
        m = torchvision.models.resnet18(weights=None, num_classes=10)
    elif name == "resnet34":
        m = torchvision.models.resnet34(weights=None, num_classes=10)
    elif name in ("mobilenet_v2", "mbv2"):
        m = torchvision.models.mobilenet_v2(weights=None, num_classes=10)
    else:
        raise ValueError(f"Unsupported model: {name}")
    return m.to(device)

# ---- Train / Eval helpers ----
def ex1_train_one_epoch(model, loader, optimizer, criterion, max_steps=None):
    """
    TODO: Implement a single training epoch.

    Requirements:
      - Set model to train mode.
      - Iterate over DataLoader (respect max_steps if provided).
      - Move inputs/labels to the global `device` with non_blocking=True.
      - Zero grads, forward pass, compute loss, backward, optimizer step.
      - Track total samples and total loss (sum loss * batch_size).
      - Measure wall-clock latency (seconds) and compute throughput (samples/sec).
      - Return a dict with:
          {
            "loss": <mean_loss_over_seen_samples>,
            "num_samples": <n_seen>,
            "latency_s": <epoch_time_seconds>,
            "throughput_samp_per_s": <n_seen / latency_s>,
          }
    Notes:
      - Assumes the global `device` is defined and the `time` module is imported.
    """
    # ===== TODO 1: enter train mode =====
    # model.train()

    # ===== TODO 2: init counters and start timer =====

    # ===== TODO 3: iterate over loader =====

    # ===== TODO 4: stop timer, compute metrics =====

    # ===== TODO 5: return metrics dict =====

    raise NotImplementedError("Complete ex1_train_one_epoch per the TODOs above.")


@torch.no_grad()
def ex1_eval_one_epoch(model, loader, criterion, max_steps=None):
    """
    TODO: Implement a single evaluation epoch.

    Requirements:
      - Set model to eval mode.
      - Iterate over DataLoader (respect max_steps if provided).
      - Move inputs/labels to the global `device` with non_blocking=True.
      - Forward pass only; compute loss and accuracy (argmax over logits).
      - Track total samples, correct predictions, and summed loss * batch_size.
      - Measure wall-clock latency (seconds) and compute throughput (samples/sec).
      - Return a dict with:
          {
            "loss": <mean_loss_over_seen_samples>,
            "acc": <correct / n_seen>,
            "num_samples": <n_seen>,
            "latency_s": <eval_time_seconds>,
            "throughput_samp_per_s": <n_seen / latency_s>,
          }
    Notes:
      - Assumes the global `device` is defined and the `time` module is imported.
    """
    # ===== TODO 1: eval mode & init counters/timer =====

    # ===== TODO 2: iterate over loader =====

    # ===== TODO 3: finalize metrics and return =====
    raise NotImplementedError("Complete ex1_eval_one_epoch per the TODOs above.")
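
The metric bookkeeping that both docstrings ask for (sum `loss * batch_size`, count samples, respect `max_steps`, time the loop) can be sketched independently of any model. The example below is an illustration under stated assumptions, not the notebook's solution: `summarize_epoch` is a hypothetical helper, and each "batch" is just a `(batch_size, mean_batch_loss)` pair standing in for a real forward pass.

```python
import time

def summarize_epoch(batches, max_steps=None):
    """Aggregate metrics from an iterable of (batch_size, mean_batch_loss) pairs."""
    n_seen, loss_sum = 0, 0.0
    start = time.perf_counter()
    for step, (batch_size, batch_loss) in enumerate(batches):
        if max_steps is not None and step >= max_steps:
            break
        # Weight each batch's mean loss by its size so that a short final
        # batch does not skew the epoch average.
        loss_sum += batch_loss * batch_size
        n_seen += batch_size
    latency = time.perf_counter() - start
    return {
        "loss": loss_sum / max(n_seen, 1),
        "num_samples": n_seen,
        "latency_s": latency,
        "throughput_samp_per_s": n_seen / latency if latency > 0 else float("inf"),
    }

# Hypothetical batches: two full batches of 128 and a short one of 16.
fake_batches = [(128, 2.0), (128, 1.0), (16, 4.0)]
stats = summarize_epoch(fake_batches)
capped = summarize_epoch(fake_batches, max_steps=2)
print(stats["loss"], stats["num_samples"])
print(capped["loss"], capped["num_samples"])
```

In a real GPU loop, CUDA kernels run asynchronously, so calling `torch.cuda.synchronize()` before reading the timer is one way to keep the latency figure honest.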


### Quick smoke run (Task A deliverable)

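One short train/eval pass confirms that everything is wired together. Both passes are capped at 20 steps (20 × 128 = 2,560 samples) so the cell finishes in seconds; accuracy near the 10% chance level is expected at this point.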


In [4]:
# ---- Task A: quick run ----
loaders = set_ex1_dataloader_params(batch_size=128, num_workers=2)
train_loader, test_loader = loaders["train"], loaders["test"]

model = get_ex1_model("resnet18")
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

train_stats = ex1_train_one_epoch(model, train_loader, optimizer, criterion, max_steps=20)
val_stats   = ex1_eval_one_epoch(model, test_loader, criterion, max_steps=20)

print("Train stats:", train_stats)
print("Val   stats:", val_stats)

Train stats: {'loss': 4.506065475940704, 'num_samples': 2560, 'latency_s': 1.396545648574829, 'throughput_samp_per_s': 1833.0943944528221}
Val   stats: {'loss': 621.117073059082, 'acc': 0.0984375, 'num_samples': 2560, 'latency_s': 0.40219855308532715, 'throughput_samp_per_s': 6365.0154391701435}


## Task B — Add a second model and compare

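We compare `resnet18` with `resnet34` (swap in `mobilenet_v2` if you prefer).
Both models run through the same loaders, loss, optimizer settings, and 20-step cap, so the comparison is like for like.
With so few steps at `lr=0.1`, validation loss can be very large and accuracy stays near chance, so the throughput numbers are the useful signal here.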


In [5]:
# ---- Task B: compare two models under same loaders ----
def ex1_compare_models(model_names=("resnet18", "resnet34"), max_steps=20):
    """
    TODO: Compare multiple models on a short training/eval run.

    Requirements:
      - Loop over each name in model_names.
      - Build a model instance with get_ex1_model(name).
      - Construct an optimizer (SGD with lr=0.1, momentum=0.9).
      - Call ex1_train_one_epoch(...) with train_loader, optimizer, criterion, and max_steps.
      - Call ex1_eval_one_epoch(...) with test_loader, criterion, and max_steps.
      - Store results in a dict keyed by model name, with subkeys "train" and "val".
      - Return the dict at the end.

    Expected return format:
      {
        "resnet18": {"train": {...}, "val": {...}},
        "resnet34": {"train": {...}, "val": {...}},
      }
    """
    results = {}

    # ===== TODO 1: iterate over model_names =====
    # for name in model_names:
    #     # TODO 2: build model with get_ex1_model(name)
    #
    #     # TODO 3: define optimizer (SGD, lr=0.1, momentum=0.9)
    #
    #     # TODO 4: run one short training epoch
    #
    #     # TODO 5: run one short eval epoch
    #
    #     # TODO 6: store results

    # ===== TODO 7: return results =====
    # return results

    raise NotImplementedError("Complete ex1_compare_models per the TODOs above.")


compare_models = ex1_compare_models(model_names=("resnet18","resnet34"), max_steps=20)
compare_models

{'resnet18': {'train': {'loss': 4.539602792263031,
   'num_samples': 2560,
   'latency_s': 0.9383394718170166,
   'throughput_samp_per_s': 2728.223715285868},
  'val': {'loss': 15.011657333374023,
   'acc': 0.110546875,
   'num_samples': 2560,
   'latency_s': 0.3471341133117676,
   'throughput_samp_per_s': 7374.67134986188}},
 'resnet34': {'train': {'loss': 5.312578654289245,
   'num_samples': 2560,
   'latency_s': 1.404033899307251,
   'throughput_samp_per_s': 1823.3177997077576},
  'val': {'loss': 349704.790625,
   'acc': 0.1046875,
   'num_samples': 2560,
   'latency_s': 0.36156678199768066,
   'throughput_samp_per_s': 7080.296441658243}}}
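
One way to read the result dict above is as a throughput ratio between the two models. The sketch below copies the four throughput values from that output and computes the ratio for each phase. `speedup` is a hypothetical helper, and a single 20-step run is noisy, so treat the ratios as rough indications only.

```python
# Throughput values (samples/sec) copied from the compare_models output above.
throughput = {
    "resnet18": {"train": 2728.223715285868, "val": 7374.67134986188},
    "resnet34": {"train": 1823.3177997077576, "val": 7080.296441658243},
}

def speedup(results, fast, slow, phase):
    """Return the throughput of `fast` divided by that of `slow` for one phase."""
    return results[fast][phase] / results[slow][phase]

ratios = {
    phase: speedup(throughput, "resnet18", "resnet34", phase)
    for phase in ("train", "val")
}
for phase, ratio in ratios.items():
    print(f"{phase}: resnet18 has {ratio:.2f}x the throughput of resnet34")
```

In this run the gap is much larger for training than for evaluation.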

## Task C — PyTorch Profiler → TensorBoard (top ops)

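We wrap a very short training loop in `torch.profiler.profile`, export a Chrome trace to the TensorBoard log directory, and print the top operators by CUDA time (CPU time when no GPU is available).
Open TensorBoard and navigate to **Profile → Operator View** to see the top ops.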


In [6]:
# ---- Task C: Profiler to TensorBoard (Starter with TODOs) ----
from torch.profiler import profile, ProfilerActivity, schedule

# TODO 1: Define a log directory and reset it if it exists
# Hints:
#   - use log_dir = "./runs/ex1_solution"
#   - if os.path.exists(log_dir): shutil.rmtree(log_dir)
#   - create a SummaryWriter with this log_dir
log_dir = None
writer = None

# TODO 2: Build a small model (e.g., resnet18) and optimizer (SGD)
#   - model_prof = get_ex1_model("resnet18")
#   - opt_prof = optim.SGD(model_prof.parameters(), lr=0.1, momentum=0.9)
model_prof = None
opt_prof = None

# TODO 3: Define a short profiler schedule to keep run quick
#   - schedule(wait=1, warmup=1, active=2, repeat=1)
sched = None

# Helper: trace handler
def _trace_handler(p):
    # TODO 4: Export Chrome trace to log_dir (trace.json)
    # Hints:
    #   - p.export_chrome_trace(os.path.join(log_dir, "trace.json"))
    # TODO 5: Print top ops (CUDA if available, else CPU)
    # Hints:
    #   - use p.key_averages().table(sort_by="cuda_time_total", row_limit=10)
    #   - fall back to cpu_time_total if cuda not available
    pass

# TODO 6: Choose activities
#   - Always include ProfilerActivity.CPU
#   - Add ProfilerActivity.CUDA if device.type == "cuda"
activities = []

# TODO 7: Profile training loop
#   - Wrap with `with profile(...) as prof:`
#   - Include arguments:
#       activities=activities, schedule=sched,
#       on_trace_ready=_trace_handler,
#       record_shapes=True, profile_memory=True, with_stack=False
#   - Inside loop:
#       for xb, yb in train_loader:
#           xb, yb = xb.to(device), yb.to(device)
#           opt_prof.zero_grad()
#           out = model_prof(xb)
#           loss = criterion(out, yb)
#           loss.backward()
#           opt_prof.step()
#           prof.step()
#           stop after ~20 steps
with profile(...):
    pass  # TODO: implement training + prof.step()

# TODO 8: Flush and close the SummaryWriter
# writer.flush()
# writer.close()

print(f"Profiler traces saved to: {log_dir}. Launch TensorBoard pointing to this folder.")


STAGE:2025-08-24 19:59:57 26644:26644 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2025-08-24 19:59:57 26644:26644 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2025-08-24 19:59:57 26644:26644 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
autograd::engine::evaluate_function: ConvolutionBack...         0.51%     529.000us         5.66%       5.827ms     145.675us       0.000us         0.00%      50.437ms       1.261ms           0 b           0 b      53.70 Mb     -72.75 M
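
The schedule suggested in TODO 3, `schedule(wait=1, warmup=1, active=2, repeat=1)`, decides per step whether the profiler is idle, warming up, or recording. The sketch below is a hand-written model of that decision based on the documented behavior, not the library's own code. `phase_for_step` and the phase strings are illustrative names, and the default `skip_first=0` is assumed.

```python
def phase_for_step(step, wait=1, warmup=1, active=2, repeat=1):
    """Model which profiler phase a given step index falls into."""
    cycle = wait + warmup + active
    if repeat > 0 and step >= repeat * cycle:
        return "none"  # all requested cycles are finished
    pos = step % cycle
    if pos < wait:
        return "none"  # profiler idle
    if pos < wait + warmup:
        return "warmup"  # tracing on, results discarded
    # The last active step of a cycle is the one that triggers on_trace_ready.
    return "record_and_save" if pos == cycle - 1 else "record"

phases = [phase_for_step(s) for s in range(6)]
print(phases)
```

Under this model, a loop of about 20 steps records only steps 2 and 3, and the trace handler fires once, which keeps the profiling overhead small.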

## Task D — Ablations: Batch size & `num_workers`

We sweep three `(batch_size, num_workers)` configurations and measure training and evaluation throughput over 20 steps each.


In [7]:
# ---- Task D: Ablations ----
import pandas as pd

def ex1_ablate(configs=((64, 2), (128, 2), (128, 4)), max_steps=20):
    """
    TODO: Run a small ablation to study how DataLoader batch size and num_workers
    affect training/evaluation throughput.

    Requirements:
      - Iterate over each (batch_size, num_workers) pair in configs.
      - Update dataloader params using set_ex1_dataloader_params(bs, nw).
      - Build a resnet18 model and SGD optimizer (lr=0.1, momentum=0.9).
      - Train briefly with ex1_train_one_epoch (max_steps).
      - Evaluate briefly with ex1_eval_one_epoch (max_steps).
      - Collect results (bs, nw, train throughput, val throughput) into rows.
      - Return as a tidy pandas DataFrame.
    """

    rows = []

    # ===== TODO 1: loop over configs =====
    # for bs, nw in configs:
    #     # TODO 2: set dataloader parameters
    #
    #     # TODO 3: get model (resnet18) and optimizer (SGD)
    #
    #     # TODO 4: run one short training epoch
    #
    #     # TODO 5: run one short eval epoch
    #
    #     # TODO 6: collect throughput results

    # ===== TODO 7: return results as DataFrame =====

    raise NotImplementedError("Complete ex1_ablate per the TODOs above.")


df_abl = ex1_ablate(configs=((64,2),(128,2),(128,4)), max_steps=20)
df_abl

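|   | batch_size | num_workers | train_throughput (samples/s) | val_throughput (samples/s) |
|---|------------|-------------|------------------------------|----------------------------|
| 0 | 64         | 2           | 2867.734725                  | 5682.424407                |
| 1 | 128        | 2           | 2885.691414                  | 7082.141089                |
| 2 | 128        | 4           | 2751.918987                  | 9267.580045                |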
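
To pick out the best configuration from the ablation results above, the rows can be parsed and ranked with the standard library alone. The sketch re-types the table as CSV text; the variable names are illustrative. With `pandas` available, the same ranking could be done directly on `df_abl`.

```python
import csv
import io

# The ablation results above, re-typed as CSV text (throughput in samples/sec).
TABLE = """batch_size,num_workers,train_throughput,val_throughput
64,2,2867.734725,5682.424407
128,2,2885.691414,7082.141089
128,4,2751.918987,9267.580045
"""

# Parse knob columns as ints and throughput columns as floats.
rows = [
    {k: float(v) if "throughput" in k else int(v) for k, v in row.items()}
    for row in csv.DictReader(io.StringIO(TABLE))
]

best_train = max(rows, key=lambda r: r["train_throughput"])
best_val = max(rows, key=lambda r: r["val_throughput"])
print("best train config:", (best_train["batch_size"], best_train["num_workers"]))
print("best val config:  ", (best_val["batch_size"], best_val["num_workers"]))
```

In this run the training throughput differs by only a few percent across configurations, while evaluation throughput benefits clearly from the larger batch and extra workers. A single short sweep is noisy, so repeating it before drawing conclusions is advisable.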


In [None]:
%load_ext tensorboard
%tensorboard --logdir ./runs/ex1_solution --host 0.0.0.0 --port 6006