In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import torch
import torchvision

torch.__version__, torchvision.__version__

('2.1.2', '0.16.2')

# Set up device-agnostic code

**Note**: Sometimes, depending on your data/hardware, you might find that your model trains faster on CPU than GPU. It could be that the overhead for copying data/model to and from the GPU outweighs the compute benefits offered by the GPU.

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')
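
One way to check the note above on a given machine is to time the same operation on the CPU and on whichever accelerator is available, with the copies included in the measurement. The sketch below is illustrative only: `time_matmul`, the matrix size, and the repeat count are hypothetical choices and not part of the original notebook.

```python
import time

import torch


def time_matmul(device: torch.device, size: int = 256, repeats: int = 10) -> float:
    """Return the average seconds per iteration, copies to and from `device` included."""
    x_cpu = torch.randn(size, size)

    start = time.perf_counter()
    for _ in range(repeats):
        x = x_cpu.to(device)  # host -> device copy (a no-op when the device is the CPU)
        y = (x @ x).cpu()     # compute, then copy the result back to the host
    if device.type == "cuda":
        torch.cuda.synchronize()  # make sure all queued GPU work has finished
    return (time.perf_counter() - start) / repeats


# Compare the CPU against a GPU, if one is available on this machine
devices = {"cpu": torch.device("cpu")}
if torch.cuda.is_available():
    devices["cuda"] = torch.device("cuda")

timings = {name: time_matmul(dev) for name, dev in devices.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per iteration")
```

For small inputs like this one, the copy overhead can dominate; larger matrices or batches tend to shift the balance toward the GPU.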

# Experiment Tracking

It is possible to track experiments using Python dictionaries and by comparing their metric printouts during training, but this is a very manual process. If there are a dozen (or more) different models to compare, **experiment tracking** becomes a necessity.

Considering that *machine learning* and *deep learning* are very experimental, different models and hyperparameters need to be tried out. In order to track the results of various combinations of data, model architectures and training regimes, **experiment tracking helps to figure out what works and what doesn't**.
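
As a point of comparison, the manual approach mentioned above (dictionaries and CSV files) can be sketched in pure Python. The run names and metric values below are made up for illustration, and `best_run` and `to_csv` are hypothetical helpers rather than anything defined in this notebook.

```python
import csv
import io

# Hypothetical results from two runs, shaped like the dictionary a training function might return
results = [
    {"model_name": "model-a", "epochs": 5, "lr": 0.001, "test_loss": 0.64, "test_acc": 0.84},
    {"model_name": "model-b", "epochs": 10, "lr": 0.001, "test_loss": 0.58, "test_acc": 0.87},
]


def best_run(runs, metric="test_acc"):
    """Return the run with the highest value for `metric`."""
    return max(runs, key=lambda run: run[metric])


def to_csv(runs):
    """Serialize a list of result dictionaries to CSV text (one row per run)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(runs[0].keys()))
    writer.writeheader()
    writer.writerows(runs)
    return buffer.getvalue()


print(best_run(results)["model_name"])  # model-b
print(to_csv(results))
```

This works for a handful of runs, but every new metric, plot, or hyperparameter has to be wired in by hand, which is the gap dedicated tracking tools fill.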

## Different ways to track machine learning experiments 

There are as many ways to track machine learning experiments as there are experiments to run. Because of its tight integration with PyTorch and its widespread use, TensorBoard is used here. It is part of the TensorFlow deep learning library and an excellent way to visualize different parts of a model. The principles, however, are similar across the other experiment tracking tools. The following table covers a few.

| **Method** | **Setup** | **Pros** | **Cons** | **Cost** |
| ----- | ----- | ----- | ----- | ----- |
| Python dictionaries, CSV files, printouts | None | Easy to set up, runs in pure Python | Hard to keep track of large numbers of experiments | Free |
| [TensorBoard](https://www.tensorflow.org/tensorboard/get_started) | Minimal, install [`tensorboard`](https://pypi.org/project/tensorboard/) | Extensions built into PyTorch, widely recognized and used, easily scales. | User experience not as nice as other options. | Free |
| [Weights & Biases Experiment Tracking](https://wandb.ai/site/experiment-tracking) | Minimal, install [`wandb`](https://docs.wandb.ai/quickstart), make an account | Incredible user experience, make experiments public, tracks almost anything. | Requires external resource outside of PyTorch. | Free for personal use |
| [MLFlow](https://mlflow.org/) | Minimal, install `mlflow` and start tracking | Fully open-source MLOps lifecycle management, many integrations. | A little harder to set up a remote tracking server than other services. | Free |

# 1. Set up boilerplate code for experiment tracking

In [4]:
import pathlib
from torchvision.datasets import ImageFolder

target_dir = pathlib.Path("data/food-101/pizza-steak-sushi")

# get a set of pre-trained weights
weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT

# get transforms used to create pre-trained model
transforms = weights.transforms()
train_data = ImageFolder(root=target_dir / "train", transform=transforms)
test_data = ImageFolder(root=target_dir / "test", transform=transforms)

In [5]:
import os
from torch.utils.data import DataLoader

# setup batch size and number of workers
BATCH_SIZE = 32
NUM_WORKERS = os.cpu_count()

# create data loaders
train_dataloader = DataLoader(dataset=train_data,
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              num_workers=NUM_WORKERS)

test_dataloader = DataLoader(dataset=test_data,
                             batch_size=BATCH_SIZE,
                             shuffle=False,
                             num_workers=NUM_WORKERS)

In [6]:
from torch import nn

model = torchvision.models.efficientnet_b0(weights=weights).to(device)

# freeze all base layers in backbone
for param in model.features.parameters():
    param.requires_grad = False

# update the classifier head
torch.manual_seed(42), torch.cuda.manual_seed(42)
model.classifier = nn.Sequential(
    nn.Dropout(p=0.2, inplace=True),
    nn.Linear(in_features=1280, out_features=len(train_data.classes))
).to(device)
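
A quick way to confirm that freezing behaves as intended is to count trainable and frozen parameters. The sketch below uses a tiny stand-in network (`TinyNet`, a hypothetical model with the same `features`/`classifier` layout) so that it runs without downloading weights; `count_parameters` is likewise a hypothetical helper.

```python
from torch import nn


def count_parameters(model: nn.Module) -> dict:
    """Count trainable, frozen and total parameters of a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return {"trainable": trainable, "frozen": total - trainable, "total": total}


class TinyNet(nn.Module):
    """Stand-in with the same features/classifier split as the model above."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(8, 4), nn.ReLU())
        self.classifier = nn.Sequential(nn.Dropout(p=0.2), nn.Linear(4, 3))

    def forward(self, x):
        return self.classifier(self.features(x))


tiny = TinyNet()

# Freeze the backbone, exactly as done for model.features above
for param in tiny.features.parameters():
    param.requires_grad = False

counts = count_parameters(tiny)
print(counts)  # {'trainable': 15, 'frozen': 36, 'total': 51}
```

Applied to the EfficientNet-B0 model above, `count_parameters(model)` should report only the new classifier head as trainable, which is 1280 * 3 + 3 = 3843 parameters if the dataset has three classes, as its directory name suggests.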

# 2. Track experiments with TensorBoard

## 2.1. Adjust `train()` function to track results with `SummaryWriter()`

To track experiments, the `torch.utils.tensorboard.SummaryWriter()` class saves various parts of a model's training progress to disk in TensorBoard format. By default, `SummaryWriter()` writes to the directory set by the `log_dir` parameter, which defaults to `runs/CURRENT_DATETIME_HOSTNAME`.
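
When many experiments are run, a descriptive `log_dir` is often easier to navigate than the default. The following is a minimal sketch of one possible naming scheme; `build_log_dir`, its arguments, and the example experiment name are hypothetical and not part of the `SummaryWriter` API.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_log_dir(experiment_name: str,
                  model_name: str,
                  extra: Optional[str] = None,
                  root: str = "runs") -> Path:
    """Build a path of the form root/YYYY-MM-DD/experiment_name/model_name[/extra]."""
    date = datetime.now().strftime("%Y-%m-%d")
    log_dir = Path(root) / date / experiment_name / model_name
    if extra:
        log_dir = log_dir / extra
    return log_dir


# Hypothetical experiment and model names, for illustration only
log_dir = build_log_dir("data-10-percent", "efficientnet-b0", extra="5-epochs")
print(log_dir)
```

The resulting path can then be handed to the writer as `SummaryWriter(log_dir=str(log_dir))`, so that each experiment lands in its own directory.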

In [7]:
from tqdm.auto import tqdm
from timeit import default_timer as timer

import torchmetrics
from torch.utils.tensorboard import SummaryWriter

def train_model(
        model: torch.nn.Module, 
        loss_fn: torch.nn.Module, 
        optim: torch.optim.Optimizer, 
        accuracy: torchmetrics.Metric, 
        f1: torchmetrics.Metric, 
        train_dataloader: DataLoader, 
        test_dataloader: DataLoader, 
        epochs: int = 5,
        model_name: str = "baseline-model"):
    """Performs training and evaluation of the model"""

    total_train_time = 0

    # create a writer with all default settings
    writer = SummaryWriter()

    for epoch in tqdm(range(epochs)):
        start = timer()

        # training
        train_loss_per_batch = train_acc_per_batch = train_f1_per_batch = 0

        model.train()
        for X, y in train_dataloader:
            X, y = X.to(device), y.to(device)

            # forward pass
            logits = model(X)
            loss = loss_fn(logits, y)
            train_loss_per_batch += loss.item()

            # backward pass
            optim.zero_grad()
            loss.backward()

            # update parameters
            optim.step()

            # calculate accuracy and f1 score
            train_acc_per_batch += accuracy(logits.softmax(dim=1), y).item()
            train_f1_per_batch += f1(logits.softmax(dim=1), y).item()

        train_loss_per_batch /= len(train_dataloader)
        train_acc_per_batch /= len(train_dataloader)
        train_f1_per_batch /= len(train_dataloader)

        # testing
        test_loss_per_batch = test_acc_per_batch = test_f1_per_batch = 0

        model.eval()
        with torch.inference_mode():
            for X, y in test_dataloader:
                X, y = X.to(device), y.to(device)

                # forward pass
                logits = model(X)
                loss = loss_fn(logits, y)
                test_loss_per_batch += loss.item()

                # calculate accuracy and f1 score
                test_acc_per_batch += accuracy(logits.softmax(dim=1), y).item()
                test_f1_per_batch += f1(logits.softmax(dim=1), y).item()

        test_loss_per_batch /= len(test_dataloader)
        test_acc_per_batch /= len(test_dataloader)
        test_f1_per_batch /= len(test_dataloader)

        end = timer()
        total_train_time += end - start
        print(f"Epoch: {epoch + 1}/{epochs}, "
              f"train_loss: {train_loss_per_batch:.4f}, test_loss: {test_loss_per_batch:.4f}, "
              f"train_acc: {train_acc_per_batch:.4f}, test_acc: {test_acc_per_batch:.4f}, "
              f"train_f1: {train_f1_per_batch:.4f}, test_f1: {test_f1_per_batch:.4f}, "
              f"time: {end - start:.2f}s")
        
        ### Experiment tracking ###

        # add loss results to SummaryWriter
        loss_dict = {"train_loss": train_loss_per_batch, "test_loss": test_loss_per_batch}
        writer.add_scalars(main_tag="Loss", tag_scalar_dict=loss_dict, global_step=epoch)

        # add accuracy results to SummaryWriter
        acc_dict = {"train_acc": train_acc_per_batch, "test_acc": test_acc_per_batch}
        writer.add_scalars(main_tag="Accuracy", tag_scalar_dict=acc_dict, global_step=epoch)
        
        # add f1 score results to SummaryWriter
        f1_dict = {"train_f1": train_f1_per_batch, "test_f1": test_f1_per_batch}
        writer.add_scalars(main_tag="F1", tag_scalar_dict=f1_dict, global_step=epoch)
        
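        # track the PyTorch model architecture (the graph does not change between epochs,
        # so this call could also be made once, outside the loop)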
        writer.add_graph(model=model, input_to_model=torch.randn(32, 3, 224, 224).to(device))
    
    # close the writer
    writer.close()
        
    return {
        "train_loss": train_loss_per_batch,
        "train_acc": train_acc_per_batch,
        "train_f1": train_f1_per_batch,
        "test_loss": test_loss_per_batch,
        "test_acc": test_acc_per_batch,
        "test_f1": test_f1_per_batch,
        "total_train_time": total_train_time,
        "model_name": model_name
    }
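
The tracking calls in `train_model()` can be tried in isolation, without training a model. The sketch below assumes `tensorboard` is installed, writes to a temporary directory, and logs made-up loss values (`fake_history` is hypothetical data standing in for real results).

```python
import tempfile
from pathlib import Path

from torch.utils.tensorboard import SummaryWriter

# Hypothetical per-epoch (train_loss, test_loss) pairs standing in for real results
fake_history = [(1.00, 0.90), (0.80, 0.75), (0.65, 0.70)]

# Write to a throwaway directory instead of the default runs/ folder
log_dir = Path(tempfile.mkdtemp()) / "demo-run"
writer = SummaryWriter(log_dir=str(log_dir))

for epoch, (train_loss, test_loss) in enumerate(fake_history):
    # Group both curves under the "Loss" tag so they appear on one chart
    writer.add_scalars(main_tag="Loss",
                       tag_scalar_dict={"train_loss": train_loss, "test_loss": test_loss},
                       global_step=epoch)

# Flush pending events and close the underlying files
writer.close()

event_files = sorted(log_dir.rglob("events.out.tfevents.*"))
print(f"{len(event_files)} event file(s) written under {log_dir}")
```

Pointing TensorBoard at `log_dir` should then show the two loss curves plotted against the epoch number.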

In [8]:
from torchmetrics import Accuracy, F1Score

# set seed for reproducibility
torch.manual_seed(42), torch.cuda.manual_seed(42)

# pick loss function and optimizer
loss_fn = nn.CrossEntropyLoss()
optim = torch.optim.Adam(params=model.parameters(), lr=0.001)

# define eval metrics
accuracy = Accuracy(task="multiclass", num_classes=len(train_data.classes)).to(device)
f1 = F1Score(task="multiclass", num_classes=len(train_data.classes)).to(device)

# train model
model_metrics = train_model(model, loss_fn, optim, accuracy, f1, train_dataloader, test_dataloader, model_name="efficientnet-b0")

  0%|          | 0/5 [00:00<?, ?it/s]

Epoch: 1/5, train_loss: 1.0883, test_loss: 0.8914, train_acc: 0.4180, test_acc: 0.6818, train_f1: 0.4180, test_f1: 0.6818, time: 84.41s


 20%|██        | 1/5 [01:37<06:29, 97.41s/it]

Epoch: 2/5, train_loss: 0.8937, test_loss: 0.8082, train_acc: 0.6641, test_acc: 0.7746, train_f1: 0.6641, test_f1: 0.7746, time: 85.57s


 40%|████      | 2/5 [03:15<04:54, 98.04s/it]

Epoch: 3/5, train_loss: 0.7450, test_loss: 0.7433, train_acc: 0.8438, test_acc: 0.7538, train_f1: 0.8438, test_f1: 0.7538, time: 84.56s


 60%|██████    | 3/5 [04:53<03:15, 97.80s/it]

Epoch: 4/5, train_loss: 0.7797, test_loss: 0.6849, train_acc: 0.6992, test_acc: 0.8040, train_f1: 0.6992, test_f1: 0.8040, time: 85.69s


 80%|████████  | 4/5 [06:32<01:38, 98.12s/it]

Epoch: 5/5, train_loss: 0.6322, test_loss: 0.6428, train_acc: 0.7695, test_acc: 0.8362, train_f1: 0.7695, test_f1: 0.8362, time: 84.81s


100%|██████████| 5/5 [08:09<00:00, 97.99s/it]


In [9]:
# Check out the model results
model_metrics

{'train_loss': 0.6321721002459526,
 'train_acc': 0.76953125,
 'train_f1': 0.76953125,
 'test_loss': 0.6428378224372864,
 'test_acc': 0.8361742496490479,
 'test_f1': 0.8361742496490479,
 'total_train_time': 425.051124458,
 'model_name': 'efficientnet-b0'}
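
The accuracy and F1 values above are identical at every epoch. This is expected if the F1 score is micro-averaged, which is assumed here to be the default of the `F1Score` wrapper in the installed `torchmetrics` version. For single-label multiclass predictions, every wrong prediction counts as one false positive (for the predicted class) and one false negative (for the true class), so micro precision, recall and F1 all reduce to accuracy. A plain-Python sketch with made-up labels illustrates this.

```python
def accuracy(preds, targets):
    """Fraction of predictions that match the targets."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def micro_f1(preds, targets, num_classes):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1 once."""
    tp = fp = fn = 0
    for c in range(num_classes):
        for p, t in zip(preds, targets):
            tp += (p == c and t == c)
            fp += (p == c and t != c)
            fn += (p != c and t == c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up predictions and labels for a 3-class problem
preds = [0, 2, 1, 1, 0, 2, 2, 1]
targets = [0, 1, 1, 1, 0, 2, 0, 2]

print(accuracy(preds, targets), micro_f1(preds, targets, num_classes=3))  # 0.625 0.625
```

To get an F1 score that carries information beyond accuracy, a per-class average such as `average="macro"` can be passed when constructing `F1Score`.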

# 3. View experiments with TensorBoard

TensorBoard can be viewed in two main ways:

| Code environment | How to view TensorBoard | Resource |
| ----- | ----- | ----- |
| VS Code | Press `SHIFT + CMD + P` (`SHIFT + CTRL + P` on Windows and Linux) to open the Command Palette and search for the command "Python: Launch TensorBoard". | [VS Code Guide on TensorBoard and PyTorch](https://code.visualstudio.com/docs/datascience/pytorch-support#_tensorboard-integration) |
| Jupyter and Colab Notebooks | Make sure [TensorBoard is installed](https://pypi.org/project/tensorboard/), load it with `%load_ext tensorboard` and then view your results with `%tensorboard --logdir DIR_WITH_LOGS`. | [`torch.utils.tensorboard`](https://pytorch.org/docs/stable/tensorboard.html) and [Get started with TensorBoard](https://www.tensorflow.org/tensorboard/get_started) |
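
To find a value for `DIR_WITH_LOGS`, it can help to list which directories actually contain TensorBoard event files. The sketch below is a hypothetical helper (`find_log_dirs`) that relies on the `events.out.tfevents.*` file naming; it builds a throwaway directory with empty placeholder files so that it runs on its own.

```python
import tempfile
from pathlib import Path


def find_log_dirs(root: Path) -> list:
    """Return every directory under `root` that contains a TensorBoard event file."""
    return sorted({path.parent for path in root.rglob("events.out.tfevents.*")})


# Build a throwaway runs/ folder with two fake runs (empty placeholder event files)
root = Path(tempfile.mkdtemp()) / "runs"
for run in ("Jan01_10-00-00_host", "Jan02_11-30-00_host"):
    (root / run).mkdir(parents=True)
    (root / run / "events.out.tfevents.0.host").touch()

log_dirs = find_log_dirs(root)
for log_dir in log_dirs:
    print(log_dir.name)
# Jan01_10-00-00_host
# Jan02_11-30-00_host
```

In practice, pointing `--logdir` at the parent `runs` directory is usually enough, since TensorBoard picks up the individual runs beneath it.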

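Experiments could previously be uploaded to [tensorboard.dev](https://tensorboard.dev/) to share them publicly with others, but that hosted service has since been shut down, so logs now need to be shared by other means (for example, by sharing the log directory itself).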
