# 07. PyTorch Experiment Tracking

Machine learning is very experimental.

To figure out which experiments are worth pursuing, you need **experiment tracking**: it helps you figure out what doesn't work so you can figure out what **does** work.

In this notebook, we're going to see an example of programmatically tracking experiments.

Resources:
* Book version of notebook: https://www.learnpytorch.io/07_pytorch_experiment_tracking/
* Ask a question: https://github.com/mrdbourke/pytorch-deep-learning/discussions
* Extra-curriculum: https://madewithml.com/courses/mlops/experiment-tracking/

In [1]:
import torch
import torchvision
from torch import nn
from torchvision import transforms
from torchinfo import summary
import matplotlib.pyplot as plt
from tqdm import tqdm

from going_modular import data_setup, engine

print(torch.__version__)
print(torchvision.__version__)

2.5.1
0.20.1


In [2]:
# Setup device agnostic code
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
device

device(type='mps')

In [3]:
# Set seeds
def set_seeds(seed: int = 42):
    """
    Sets random seeds for torch operations.

    Args:
        seed (int, optional): Random seed to set. Defaults to 42.
    """
    # Set the seed for general torch operations
    torch.manual_seed(seed)

    # Set the seed for CUDA+MPS torch operations (ones that happen on the GPU)
    torch.cuda.manual_seed(seed)
    torch.mps.manual_seed(seed)

In [4]:
set_seeds()

## 1. Get data

We want to get the pizza, steak and sushi images.

That way we can run experiments building FoodVision Mini and see which model performs best.

In [5]:
import os
import zipfile
from pathlib import Path
from typing import Optional

import requests
from tqdm.auto import tqdm


def download_data(
    source: str,
    destination: str,
    remove_source: bool = True,
    chunk_size: int = 1024
) -> Path:
    """Downloads a zipped dataset from source and unzips to destination."""

    data_path = Path("data")
    image_path = data_path / destination
    data_path.mkdir(parents=True, exist_ok=True)

    if image_path.is_dir():
        print(f"[INFO] {image_path} directory exists, skipping download.")
        return image_path

    print(f"[INFO] Creating directory {image_path}...")
    image_path.mkdir(parents=True, exist_ok=True)

    target_file = data_path / Path(source).name

    print(f"[INFO] Downloading {target_file.name}...")
    response = requests.get(source, stream=True)
    response.raise_for_status()

    total_size = int(response.headers.get("content-length", 0))

    with open(target_file, "wb") as f, tqdm(
        desc="Downloading",
        total=total_size,
        unit="B",
        unit_scale=True,
        unit_divisor=1024,
    ) as pbar:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:
                f.write(chunk)
                pbar.update(len(chunk))


    print(f"[INFO] Unzipping {target_file.name}...")
    with zipfile.ZipFile(target_file, "r") as zip_ref:
        members = zip_ref.infolist()
        for member in tqdm(members, desc="Extracting", unit="file"):
            zip_ref.extract(member, image_path)

    if remove_source:
        target_file.unlink()

    print("[INFO] Download and extraction complete.")
    return image_path


image_path = download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                           destination="pizza_steak_sushi")
image_path

[INFO] data/pizza_steak_sushi directory exists, skipping download.


PosixPath('data/pizza_steak_sushi')

## 2. Create Datasets and DataLoaders

### 2.1 Create DataLoaders with manual transforms
The goal with transforms is to ensure your custom data is formatted in a reproducible way as well as a way that will suit pretrained models.

In [6]:
# Setup directories
train_dir = image_path / "train"
test_dir = image_path / "test"
train_dir, test_dir

(PosixPath('data/pizza_steak_sushi/train'),
 PosixPath('data/pizza_steak_sushi/test'))

In [7]:
# Setup ImageNet normalization levels
# See here: https://pytorch.org/vision/0.12/models.html
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

# Create transform pipeline manually
manual_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    normalize
])
print(f"Manually created transforms: {manual_transforms}")

# Create DataLoaders
from going_modular import data_setup
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(train_dir=train_dir,
                                                                               test_dir=test_dir,
                                                                               train_transform=manual_transforms,
                                                                               test_transform=manual_transforms,
                                                                               batch_size=32)
train_dataloader, test_dataloader, class_names

Manually created transforms: Compose(
    Resize(size=(224, 224), interpolation=bilinear, max_size=None, antialias=True)
    ToTensor()
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
)


(<torch.utils.data.dataloader.DataLoader at 0x30514ac30>,
 <torch.utils.data.dataloader.DataLoader at 0x30461d2b0>,
 ['pizza', 'steak', 'sushi'])
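
To make the `Normalize` step concrete, here is a minimal sketch in plain Python (no PyTorch needed) of the per-channel arithmetic it performs: `(value - mean) / std`. It uses the standard ImageNet statistics; the helper name `normalize_pixel` is made up for this illustration and is not part of torchvision.

```python
# Standard ImageNet channel statistics (RGB order).
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply (value - mean) / std to one RGB pixel whose values are in [0, 1]."""
    return [(value - m) / s for value, m, s in zip(rgb, mean, std)]


# A mid-grey pixel after ToTensor(), which scales uint8 values to [0, 1].
pixel = [128 / 255] * 3
normalized = normalize_pixel(pixel)
print([round(v, 4) for v in normalized])

# A pixel exactly at the channel means lands on zero in every channel.
print(normalize_pixel(IMAGENET_MEAN))  # [0.0, 0.0, 0.0]
```

The same grey value ends up different in each channel because each channel has its own mean and standard deviation.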

### 2.2 Create DataLoaders using automatically created transforms

The same principle applies to automatically created transforms: we want our custom data in the same format as the data the pretrained model was trained on.

In [8]:
# Setup dirs
train_dir = image_path / "train"
test_dir = image_path / "test"

# Setup pretrained weights
weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT # "DEFAULT" = best available

# Get the transforms from the weights (the transforms that were used to obtain this particular set of weights)
automatic_transforms = weights.transforms()
print(f"Automatically created transforms: {automatic_transforms}")

# Create DataLoaders
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(train_dir=train_dir,
                                                                               test_dir=test_dir,
                                                                               train_transform=automatic_transforms,
                                                                               test_transform=automatic_transforms,
                                                                               batch_size=32)
train_dataloader, test_dataloader, class_names

Automatically created transforms: ImageClassification(
    crop_size=[224]
    resize_size=[256]
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BICUBIC
)


(<torch.utils.data.dataloader.DataLoader at 0x3051d20f0>,
 <torch.utils.data.dataloader.DataLoader at 0x3051d21b0>,
 ['pizza', 'steak', 'sushi'])
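
The printed preset resizes to 256 and then crops to 224. As a rough sketch of that geometry in plain Python: the shorter side is scaled to `resize_size` (keeping the aspect ratio) and a `crop_size` square is taken from the centre. The function name is hypothetical and the rounding here is simplified, so torchvision may differ by a pixel on sizes that don't divide evenly.

```python
def resize_then_center_crop(width, height, resize_size=256, crop_size=224):
    """Return the resized (width, height) and the (left, top, right, bottom) crop box."""
    # Scale so the shorter side equals resize_size, keeping the aspect ratio.
    scale = resize_size / min(width, height)
    new_w, new_h = int(width * scale), int(height * scale)
    # Take a crop_size x crop_size window from the centre of the resized image.
    left = (new_w - crop_size) // 2
    top = (new_h - crop_size) // 2
    return (new_w, new_h), (left, top, left + crop_size, top + crop_size)


resized, box = resize_then_center_crop(width=768, height=512)
print(resized, box)  # (384, 256) (80, 16, 304, 240)
```

Note the difference from the manual pipeline, which resizes straight to 224x224 and so changes the aspect ratio of non-square images instead of cropping them.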

## 3. Get a pretrained model, freeze the base layers and change the classifier head

In [9]:
weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT # "DEFAULT" = best available
model = torchvision.models.efficientnet_b0(weights=weights).to(device)

In [10]:
# Freeze all base layers by setting their requires_grad attribute to False
for param in model.features.parameters():
    # print(param)
    param.requires_grad = False

In [11]:
# Adjust the classifier head
set_seeds()
model.classifier = nn.Sequential(
    nn.Dropout(p=0.2, inplace=True),
    nn.Linear(in_features=1280, out_features=len(class_names), bias=True)
).to(device)
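
With the base layers frozen, the only parameters left to train are in the new classifier head. A quick back-of-the-envelope check in plain Python (the helper `linear_param_count` is a made-up name for this sketch): a linear layer has `in_features * out_features` weights plus one bias per output, and dropout has no learnable parameters.

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of learnable values in a fully connected layer."""
    weights = in_features * out_features
    biases = out_features if bias else 0
    return weights + biases


num_classes = 3  # pizza, steak, sushi
print(linear_param_count(1280, num_classes))  # 3843
```

So only a few thousand values are updated during training, which is a large part of why training with a frozen feature extractor is fast.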

In [12]:
from torchinfo import summary

summary(model,
        input_size=(32, 3, 224, 224),
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=20,
        row_settings=["var_names"])

Layer (type (var_name))                                      Input Shape          Output Shape         Param #              Trainable
EfficientNet (EfficientNet)                                  [32, 3, 224, 224]    [32, 3]              --                   Partial
├─Sequential (features)                                      [32, 3, 224, 224]    [32, 1280, 7, 7]     --                   False
│    └─Conv2dNormActivation (0)                              [32, 3, 224, 224]    [32, 32, 112, 112]   --                   False
│    │    └─Conv2d (0)                                       [32, 3, 224, 224]    [32, 32, 112, 112]   (864)                False
│    │    └─BatchNorm2d (1)                                  [32, 32, 112, 112]   [32, 32, 112, 112]   (64)                 False
│    │    └─SiLU (2)                                         [32, 32, 112, 112]   [32, 32, 112, 112]   --                   --
│    └─Sequential (1)                                        [32, 32, 112, 112]   [32, 

## 4. Train a single model and track results

In [13]:
# Define loss function and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

To track experiments, we're going to use TensorBoard: https://www.tensorflow.org/tensorboard

And to interact with TensorBoard, we can use PyTorch's SummaryWriter: https://pytorch.org/docs/stable/tensorboard.html
* Also see here: https://pytorch.org/docs/stable/tensorboard.html#torch.utils.tensorboard.writer.SummaryWriter

In [14]:
# Setup a SummaryWriter
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
writer

<torch.utils.tensorboard.writer.SummaryWriter at 0x305e2b290>

In [15]:
from tqdm.auto import tqdm
from typing import Dict, List, Tuple

from going_modular.engine import train_step, test_step

def train(model: torch.nn.Module,
          train_dataloader: torch.utils.data.DataLoader,
          test_dataloader: torch.utils.data.DataLoader,
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module,
          epochs: int,
          device: torch.device,
          writer: SummaryWriter = writer) -> Dict[str, List]:
    """Trains and tests a PyTorch model.

    Passes a target PyTorch model through train_step() and test_step()
    functions for a number of epochs, training and testing the model
    in the same epoch loop.

    Calculates, prints and stores evaluation metrics throughout.

    Args:
        model: A PyTorch model to be trained and tested.
        train_dataloader: A DataLoader instance for the model to be trained on.
        test_dataloader: A DataLoader instance for the model to be tested on.
        optimizer: A PyTorch optimizer to help minimize the loss function.
        loss_fn: A PyTorch loss function to calculate loss on both datasets.
        epochs: An integer indicating how many epochs to train for.
        device: A target device to compute on (e.g. "cuda" or "cpu").
        writer: A SummaryWriter instance to log metrics and the model graph to.

    Returns:
        A dictionary of training and testing loss as well as training and
        testing accuracy metrics. Each metric has a value in a list for
        each epoch.
        In the form: {train_loss: [...],
        train_acc: [...],
        test_loss: [...],
        test_acc: [...]}

    For example if training for epochs=2:
        {train_loss: [2.0616, 1.0537],
        train_acc: [0.3945, 0.3945],
        test_loss: [1.2641, 1.5706],
        test_acc: [0.3400, 0.2973]}
    """
    # Create empty results dictionary
    results = {"train_loss": [],
               "train_acc": [],
               "test_loss": [],
               "test_acc": []
               }

    model.to(device)
    # Loop through training and testing steps for a number of epochs
    for epoch in tqdm(range(epochs)):
        train_loss, train_acc = train_step(model=model,
                                           dataloader=train_dataloader,
                                           loss_fn=loss_fn,
                                           optimizer=optimizer,
                                           device=device)
        test_loss, test_acc = test_step(model=model,
                                        dataloader=test_dataloader,
                                        loss_fn=loss_fn,
                                        device=device)

        # Print out what's happening
        print(f"Epoch: {epoch+1} | "
              f"train_loss: {train_loss:.4f} | "
              f"train_acc: {train_acc:.4f} | "
              f"test_loss: {test_loss:.4f} | "
              f"test_acc: {test_acc:.4f}"
              )

        # Update results dictionary
        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)


        ### New: Experiment tracking
        writer.add_scalars(main_tag="Loss",
                          tag_scalar_dict={"train_loss":train_loss,
                                           "test_loss": test_loss},
                          global_step=epoch)
        writer.add_scalars(main_tag="Accuracy",
                          tag_scalar_dict={"train_acc":train_acc,
                                           "test_acc": test_acc},
                          global_step=epoch)
    writer.add_graph(model=model,
                     input_to_model=torch.randn(32, 3, 224, 224).to(device))

    # Close the writer
    writer.close()
    ## End new ##

    # Return the filled results at the end of the epochs
    return results

In [16]:
# Train model
set_seeds()
results = train(model=model,
                train_dataloader=train_dataloader,
                test_dataloader=test_dataloader,
                optimizer=optimizer,
                loss_fn=loss_fn,
                epochs=5,
                device=device)

  0%|          | 0/5 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 1.0823 | train_acc: 0.4062 | test_loss: 0.8991 | test_acc: 0.5909
Epoch: 2 | train_loss: 0.8564 | train_acc: 0.7695 | test_loss: 0.7927 | test_acc: 0.8456
Epoch: 3 | train_loss: 0.7914 | train_acc: 0.7891 | test_loss: 0.7373 | test_acc: 0.8561
Epoch: 4 | train_loss: 0.7206 | train_acc: 0.7500 | test_loss: 0.6338 | test_acc: 0.8759
Epoch: 5 | train_loss: 0.6368 | train_acc: 0.7812 | test_loss: 0.6190 | test_acc: 0.8665


In [17]:
results

{'train_loss': [1.0823059901595116,
  0.8563981279730797,
  0.791380912065506,
  0.7206062972545624,
  0.636790856719017],
 'train_acc': [0.40625, 0.76953125, 0.7890625, 0.75, 0.78125],
 'test_loss': [0.8990757266680399,
  0.792700986067454,
  0.7373119592666626,
  0.6338079373041788,
  0.6189916928609213],
 'test_acc': [0.5909090909090909,
  0.8456439393939394,
  0.8560606060606061,
  0.8759469696969697,
  0.8664772727272728]}
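
Since `results` is a plain dictionary of per-epoch lists, it's easy to query without TensorBoard. A small sketch, using the values printed above rounded to four decimal places (the helper `best_epoch` is a name invented for this example):

```python
# Values copied (rounded to 4 decimal places) from the training output above.
results = {
    "train_loss": [1.0823, 0.8564, 0.7914, 0.7206, 0.6368],
    "train_acc": [0.4062, 0.7695, 0.7891, 0.7500, 0.7812],
    "test_loss": [0.8991, 0.7927, 0.7373, 0.6338, 0.6190],
    "test_acc": [0.5909, 0.8456, 0.8561, 0.8759, 0.8665],
}


def best_epoch(values, higher_is_better=True):
    """Return the 1-indexed epoch with the best value, plus that value."""
    pick = max if higher_is_better else min
    index = pick(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


print(best_epoch(results["test_acc"]))                           # (4, 0.8759)
print(best_epoch(results["test_loss"], higher_is_better=False))  # (5, 0.619)
```

Here the best test accuracy and the lowest test loss fall on different epochs, which is one reason to track both.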

## 5. View our model's results with TensorBoard

There are a few ways to view TensorBoard results:
https://www.tensorflow.org/tensorboard/get_started

In [18]:
import shutil
import sys
import os

print("Python:", sys.executable)
print("TensorBoard:", shutil.which("tensorboard"))


Python: /opt/anaconda3/bin/python
TensorBoard: None


In [19]:
import os
import sys

tb_path = os.path.join(os.path.dirname(sys.executable), "tensorboard")
os.environ["TENSORBOARD_BINARY"] = tb_path

print("Using TensorBoard at:", tb_path)


Using TensorBoard at: /opt/anaconda3/bin/tensorboard


In [20]:
%reload_ext tensorboard

In [32]:
%tensorboard --logdir runs

## 6. Create a function to prepare `SummaryWriter()` instance

By default, `SummaryWriter()` saves to `runs/CURRENT_DATETIME_HOSTNAME`, so every new writer gets its own subfolder of `runs`.

What if we wanted to save different experiments to different folders?

In essence, one experiment = one folder.

For example, we'd like to track:
* Experiment date/timestamp
* Experiment name
* Model name
* Extra - is there anything else that should be tracked?

Let's create a function to create a `SummaryWriter()` instance to take all of these things into account.

So ideally we end up tracking experiments to a directory:

`runs/YYYY-MM-DD/experiment_name/model_name/extra`
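
One practical reason to put the year first: folder names in `YYYY-MM-DD` form sort chronologically when sorted as plain strings, so a file browser lists runs in date order. A small illustration with made-up dates:

```python
from datetime import date

# Example dates, deliberately out of order.
days = [date(2025, 12, 14), date(2025, 2, 3), date(2024, 12, 31)]

iso_names = sorted(d.strftime("%Y-%m-%d") for d in days)
dmy_names = sorted(d.strftime("%d-%m-%Y") for d in days)

print(iso_names)  # ['2024-12-31', '2025-02-03', '2025-12-14']
print(dmy_names)  # ['03-02-2025', '14-12-2025', '31-12-2024']
```

The year-first names come out oldest to newest; the day-first names do not.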

In [22]:
from typing import Optional

from torch.utils.tensorboard import SummaryWriter

def create_writer(experiment_name: str,
                  model_name: str,
                  extra: Optional[str] = None) -> SummaryWriter:
    """
    Creates a torch.utils.tensorboard.SummaryWriter instance tracking to a specific directory.

    Args:
        experiment_name: The name of the experiment.
        model_name: The name of the model.
        extra: An optional string used as a final subdirectory of the log path (e.g. "5_epochs").

    Returns:
        A SummaryWriter instance.
    """
    from datetime import datetime
    import os

    # Get the current date in YYYY-MM-DD format (year first)
    timestamp = datetime.now().strftime("%Y-%m-%d")

    if extra:
        # Create log directory path
        log_dir = os.path.join("runs", timestamp, experiment_name, model_name, extra)
    else:
        log_dir = os.path.join("runs", timestamp, experiment_name, model_name)

    print(f"[INFO] Created SummaryWriter saving to {log_dir}")
    return SummaryWriter(log_dir=log_dir)

In [25]:
from datetime import datetime
timestamp = datetime.now().strftime("%Y-%m-%d")
timestamp

'2025-12-14'

In [23]:
example_writer = create_writer(experiment_name="data_10_percent",
                               model_name="effnetb0",
                               extra="5_epochs")
example_writer

[INFO] Created SummaryWriter saving to runs/2025-12-14/data_10_percent/effnetb0/5_epochs


<torch.utils.tensorboard.writer.SummaryWriter at 0x318da6840>

### 6.1 Update the `train()` function to include a writer parameter

In [34]:
from tqdm.auto import tqdm
from typing import Dict, List, Optional, Tuple

from going_modular.engine import train_step, test_step

def train(model: torch.nn.Module,
          train_dataloader: torch.utils.data.DataLoader,
          test_dataloader: torch.utils.data.DataLoader,
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module,
          epochs: int,
          device: torch.device,
          writer: Optional[SummaryWriter] = None) -> Dict[str, List]:
    """Trains and tests a PyTorch model.

    Passes a target PyTorch model through train_step() and test_step()
    functions for a number of epochs, training and testing the model
    in the same epoch loop.

    Calculates, prints and stores evaluation metrics throughout.

    Args:
        model: A PyTorch model to be trained and tested.
        train_dataloader: A DataLoader instance for the model to be trained on.
        test_dataloader: A DataLoader instance for the model to be tested on.
        optimizer: A PyTorch optimizer to help minimize the loss function.
        loss_fn: A PyTorch loss function to calculate loss on both datasets.
        epochs: An integer indicating how many epochs to train for.
        device: A target device to compute on (e.g. "cuda" or "cpu").
        writer: An optional SummaryWriter instance to log to. If None, nothing is tracked.

    Returns:
        A dictionary of training and testing loss as well as training and
        testing accuracy metrics. Each metric has a value in a list for
        each epoch.
        In the form: {train_loss: [...],
        train_acc: [...],
        test_loss: [...],
        test_acc: [...]}

    For example if training for epochs=2:
        {train_loss: [2.0616, 1.0537],
        train_acc: [0.3945, 0.3945],
        test_loss: [1.2641, 1.5706],
        test_acc: [0.3400, 0.2973]}
    """
    # Create empty results dictionary
    results = {"train_loss": [],
               "train_acc": [],
               "test_loss": [],
               "test_acc": []
               }

    model.to(device)
    # Loop through training and testing steps for a number of epochs
    for epoch in tqdm(range(epochs)):
        train_loss, train_acc = train_step(model=model,
                                           dataloader=train_dataloader,
                                           loss_fn=loss_fn,
                                           optimizer=optimizer,
                                           device=device)
        test_loss, test_acc = test_step(model=model,
                                        dataloader=test_dataloader,
                                        loss_fn=loss_fn,
                                        device=device)

        # Print out what's happening
        print(f"Epoch: {epoch+1} | "
              f"train_loss: {train_loss:.4f} | "
              f"train_acc: {train_acc:.4f} | "
              f"test_loss: {test_loss:.4f} | "
              f"test_acc: {test_acc:.4f}"
              )

        # Update results dictionary
        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)


        ### New: Experiment tracking
        # Log this epoch's metrics only if a writer was provided
        if writer:
            writer.add_scalars(main_tag="Loss",
                               tag_scalar_dict={"train_loss": train_loss,
                                                "test_loss": test_loss},
                               global_step=epoch)
            writer.add_scalars(main_tag="Accuracy",
                               tag_scalar_dict={"train_acc": train_acc,
                                                "test_acc": test_acc},
                               global_step=epoch)

    if writer:
        # Log the model graph once, then close the writer after the final epoch
        writer.add_graph(model=model,
                         input_to_model=torch.randn(32, 3, 224, 224).to(device))
        writer.close()
    ## End new ##

    # Return the filled results at the end of the epochs
    return results
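
The optional `writer` parameter can be tried out without PyTorch. Below is a stripped-down sketch of the pattern: `RecordingWriter` is a hypothetical stand-in that only mimics the two `SummaryWriter` methods used here, and `log_losses` is an invented helper, not part of `going_modular`.

```python
class RecordingWriter:
    """Hypothetical stand-in for SummaryWriter that records calls in memory."""

    def __init__(self):
        self.scalars = []
        self.close_calls = 0

    def add_scalars(self, main_tag, tag_scalar_dict, global_step):
        self.scalars.append((main_tag, dict(tag_scalar_dict), global_step))

    def close(self):
        self.close_calls += 1


def log_losses(losses, writer=None):
    """Log one (train_loss, test_loss) pair per epoch, but only if a writer is given."""
    for epoch, (train_loss, test_loss) in enumerate(losses):
        if writer:
            writer.add_scalars(main_tag="Loss",
                               tag_scalar_dict={"train_loss": train_loss,
                                                "test_loss": test_loss},
                               global_step=epoch)
    if writer:
        writer.close()  # close once, after the last epoch


losses = [(1.08, 0.90), (0.86, 0.79)]
log_losses(losses)  # no writer: nothing is tracked and nothing breaks

writer = RecordingWriter()
log_losses(losses, writer=writer)
print(len(writer.scalars), writer.close_calls)  # 2 1
```

Defaulting `writer` to `None` means the same training code can run with or without tracking, and each experiment can pass in its own writer pointing at its own folder.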

## 7. Setting up a series of modelling experiments
* Challenge: Set up 2x modelling experiments with EffNetB0 and the pizza, steak, sushi data: train one model for 5 epochs and another model for 10 epochs

### 7.1 What kind of experiments should you run?

The number of machine learning experiments you can run is like the number of different models you can build... almost limitless.

However, you can't test everything...

So what should you test?
* Change the number of epochs
* Change the number of hidden layers/units
* Change the amount of data (right now we're using 10% of the Food101 dataset for pizza, steak, sushi)
* Change the learning rate
* Try different kinds of data augmentation
* Choose a different model architecture

This is why transfer learning is powerful: it gives you a working model that you can apply to your own problem.

### 7.2 What experiments are we going to run?

We're going to turn 3 dials:
1. Model size - EffnetB0 vs EffnetB2 (in terms of number of params)
2. Dataset size - 10% of pizza, steak, sushi images vs 20% (generally more data = better results)
3. Training time - 5 epochs vs 10 epochs (generally longer training time = better results... before the model starts to overfit)

To begin, we're still keeping things relatively small so that our experiments run quickly.
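
Three dials with two settings each gives 2 x 2 x 2 = 8 combinations. A minimal sketch of enumerating them with `itertools.product`; the names `effnetb2` and `data_20_percent` are assumed here by analogy with the `effnetb0` and `data_10_percent` names used earlier.

```python
from itertools import product

model_names = ["effnetb0", "effnetb2"]               # dial 1: model size
data_sizes = ["data_10_percent", "data_20_percent"]  # dial 2: dataset size
epoch_counts = [5, 10]                               # dial 3: training time

# Every combination of the three dials.
experiments = list(product(data_sizes, model_names, epoch_counts))
print(len(experiments))  # 8

for number, (data_size, model_name, epochs) in enumerate(experiments, start=1):
    print(f"{number}: {data_size}/{model_name}/{epochs}_epochs")
```

Each printed line doubles as a folder name in the `experiment_name/model_name/extra` layout, so every combination gets its own log directory.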
