# Hyperparameter tuning with Ray Tune

In this notebook we perform hyperparameter tuning using the Ray Tune library. [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) is an industry-standard tool for distributed hyperparameter tuning. We will integrate hyperparameter tuning into the problem of training a classifier on the CIFAR10 dataset.

*Following the official tutorial: https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html*

*Which is an extension of: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html*

## Installation

Imports:

In [1]:
from functools import partial
import os
import tempfile
from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split

import torchvision
import torchvision.transforms as transforms

from ray import tune
from ray import train
from ray.train import Checkpoint, get_checkpoint
from ray.tune.schedulers import ASHAScheduler
import ray.cloudpickle as pickle



GPU setup if possible:

In [2]:
# set up device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
if device.type == 'cuda':
    print(f'using: {torch.cuda.get_device_name(0)}')
else:
    print(f'using: CPU with cores available: {os.cpu_count()}')

using: NVIDIA GeForce MX350


## Data

We wrap the data loaders in their own function and pass a global data directory. This way we can share a data directory between different trials.

The data used is the CIFAR10 dataset. Note that torchvision will automatically download this data into the `data/` directory if it is not found:

In [3]:
def load_data(data_dir='data/'):
    # turn into a tensor and then normalize it
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    )

    # get our datasets (will download if not found)
    trainset = torchvision.datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=transform
    )

    testset = torchvision.datasets.CIFAR10(
        root=data_dir, train=False, download=True, transform=transform
    )

    return trainset, testset

Let's also set where we want the data:

In [4]:
data_dir = os.path.abspath('data/')

## Network

We can only tune those parameters that are configurable. In this example, we can specify the sizes of the first two fully connected layers, given by `l1` and `l2` respectively. The network consists of convolutional layers and max pooling layers followed by a fully connected head:

In [5]:
class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)  # 5x5 kernel taking in 3 channels and outputting 6 channels
        self.pool = nn.MaxPool2d(2, 2)  # max pooling with a 2x2 kernel and a stride of 2 (so no overlap but no gaps)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
Net()

Net(
  (conv1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)
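
The `16 * 5 * 5` input size of `fc1` follows from the spatial size of the feature maps after the two convolution and pooling stages. As a small worked example, the sketch below traces that size for a 32x32 CIFAR10 image. `conv_out` is a hypothetical helper written for this illustration (it assumes no dilation), not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (assumes no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                      # CIFAR10 images are 32x32
size = conv_out(size, 5)       # conv1, 5x5 kernel -> 28
size = conv_out(size, 2, 2)    # 2x2 max pool, stride 2 -> 14
size = conv_out(size, 5)       # conv2, 5x5 kernel -> 10
size = conv_out(size, 2, 2)    # 2x2 max pool, stride 2 -> 5

flat_features = 16 * size * size  # conv2 outputs 16 channels
print(flat_features)  # 400, matching in_features of fc1 above
```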

## Setting up Ray Tune

Before we can take a look at training, let's first set our configurations for the hyperparameter tuning (define Ray Tune's search space). As an example:

In [6]:
config = {
    "l1": tune.choice([2 ** i for i in range(9)]),
    "l2": tune.choice([2 ** i for i in range(9)]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([2, 4, 8, 16])
}
config

{'l1': <ray.tune.search.sample.Categorical at 0x2114e1246d0>,
 'l2': <ray.tune.search.sample.Categorical at 0x2114fd03750>,
 'lr': <ray.tune.search.sample.Float at 0x2114dab1fd0>,
 'batch_size': <ray.tune.search.sample.Categorical at 0x2114fd03e50>}

``tune.choice()`` accepts a list of values and samples uniformly from it. In this example, the ``l1`` and ``l2`` parameters are powers of 2 between 1 and 256, so one of 1, 2, 4, 8, 16, 32, 64, 128, or 256. The ``lr`` (learning rate) is sampled log-uniformly between 0.0001 and 0.1, so each order of magnitude in that range is equally likely. Lastly, ``batch_size`` is a choice between 2, 4, 8, and 16.
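
To make the two kinds of search space concrete, here is a minimal sketch in plain Python. It does not call Ray Tune: `sample_choice` and `sample_loguniform` are hypothetical helpers that only mimic the behaviour described for `tune.choice` and `tune.loguniform`, and the seed and sample count are arbitrary.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def sample_choice(options):
    # every option is equally likely
    return rng.choice(options)

def sample_loguniform(low, high):
    # draw uniformly in log space, then map back to the original scale
    return math.exp(rng.uniform(math.log(low), math.log(high)))

layer_sizes = [2 ** i for i in range(9)]  # 1, 2, 4, ..., 256
samples = [
    {
        "l1": sample_choice(layer_sizes),
        "l2": sample_choice(layer_sizes),
        "lr": sample_loguniform(1e-4, 1e-1),
        "batch_size": sample_choice([2, 4, 8, 16]),
    }
    for _ in range(1000)
]

# the range spans three decades, so roughly a third of the draws land in each
below_1e3 = sum(s["lr"] < 1e-3 for s in samples) / len(samples)
print(f"fraction of lr below 1e-3: {below_1e3:.2f}")
```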

For each trial, Ray Tune randomly samples a combination of parameters from these search spaces. It then trains a number of models in parallel and finds the best performing one among them. We will also use the ``ASHAScheduler``, which terminates poorly performing trials early.

Let's also set up the resources we will have access to:

In [7]:
# use half the available cpu cores, and use any gpus
resources_per_trial = {"cpu": os.cpu_count() // 2, "gpu": torch.cuda.device_count()}

The `num_samples` parameter sets how many times Ray Tune samples a configuration from the search space, which is the number of trials it will run:

In [8]:
num_samples = 10

Defining our scheduler for the tuning:

In [9]:
scheduler = ASHAScheduler(
    metric="loss",
    mode="min",
    max_t=10,
    grace_period=1,
    reduction_factor=2,
)
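
As a rough illustration of what these settings imply, the sketch below assumes that trials are compared at "rungs" spaced geometrically from `grace_period` by `reduction_factor`, and that about `1 / reduction_factor` of the trials continue past each rung. This is a simplified, synchronous picture of successive halving, not Ray Tune's actual asynchronous implementation, and `rung_milestones` is a hypothetical helper.

```python
def rung_milestones(grace_period, reduction_factor, max_t):
    """Iterations at which trials are compared, under the assumption above."""
    milestones = []
    t = grace_period
    while t < max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones

milestones = rung_milestones(grace_period=1, reduction_factor=2, max_t=10)
print(milestones)  # [1, 2, 4, 8]

# with 10 trials and a reduction factor of 2, about half continue past each rung
survivors = 10
for t in milestones:
    survivors = max(1, survivors // 2)
    print(f"after iteration {t}: about {survivors} trial(s) continue")
```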

## Setting up Training and Testing

The full code for the training function is quite complex, as seen below.

First we set up the network using the configuration, then restore the state of the previous checkpoint (if one exists). Next we load the data and split it 80/20 into training and validation sets. We then train the model for one epoch, compute its validation metrics, and save a checkpoint. This is repeated for up to 10 epochs, with the metrics reported to Ray Tune after each one.

Ray Tune can then use these metrics to decide which hyperparameter configuration leads to the best results. The same metrics are used to stop poorly performing trials early, so that resources are not wasted on them.
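
The save-and-restore pattern used for checkpoints can be sketched in isolation. The example below is a simplified stand-in: it uses the standard library `pickle` rather than `ray.cloudpickle`, a plain dictionary in place of the network and optimizer state dicts, and two hypothetical helpers, `save_state` and `load_state`.

```python
import pickle
import tempfile
from pathlib import Path

def save_state(directory, epoch, state):
    # write the epoch counter and an arbitrary state object to data.pkl
    with open(Path(directory) / "data.pkl", "wb") as fp:
        pickle.dump({"epoch": epoch, "state": state}, fp)

def load_state(directory):
    # returns None when there is nothing to resume from
    path = Path(directory) / "data.pkl"
    if not path.exists():
        return None
    with open(path, "rb") as fp:
        return pickle.load(fp)

with tempfile.TemporaryDirectory() as checkpoint_dir:
    first = load_state(checkpoint_dir)  # nothing saved yet
    save_state(checkpoint_dir, epoch=3, state={"weights": [0.1, 0.2]})
    restored = load_state(checkpoint_dir)

print(first)     # None
print(restored)  # {'epoch': 3, 'state': {'weights': [0.1, 0.2]}}
```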

In [10]:
def train_cifar(config, device="cpu", data_dir=None):
    # set up network on device using configuration, along with parameters
    net = Net(config["l1"], config["l2"])
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

    # if we have a checkpoint, then load it
    checkpoint = get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as checkpoint_dir:
            data_path = Path(checkpoint_dir) / "data.pkl"
            with open(data_path, "rb") as fp:
                checkpoint_state = pickle.load(fp)
            # the checkpoint is written after its epoch finishes, so resume from the next one
            start_epoch = checkpoint_state["epoch"] + 1
            net.load_state_dict(checkpoint_state["net_state_dict"])
            optimizer.load_state_dict(checkpoint_state["optimizer_state_dict"])
    else:
        start_epoch = 0

    # load the data and split into training and validation
    trainset, testset = load_data(data_dir)

    train_val_split = int(len(trainset) * 0.8)  # 80/20 split
    train_subset, val_subset = random_split(
        trainset, [train_val_split, len(trainset) - train_val_split]
    )

    trainloader = torch.utils.data.DataLoader(
        train_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=0
    )
    valloader = torch.utils.data.DataLoader(
        val_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=0
    )

    # train for 10 epochs total
    for epoch in range(start_epoch, 10):
        # train this epoch
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print(
                    "[%d, %5d] loss: %.3f"
                    % (epoch + 1, i + 1, running_loss / epoch_steps)
                )
                # reset both counters so the next printed average covers only new batches
                running_loss = 0.0
                epoch_steps = 0

        # get the validation metrics
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.item()
                val_steps += 1

        # save a checkpoint of our current epoch and network state
        checkpoint_data = {
            "epoch": epoch,
            "net_state_dict": net.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        }
        with tempfile.TemporaryDirectory() as checkpoint_dir:
            data_path = Path(checkpoint_dir) / "data.pkl"
            with open(data_path, "wb") as fp:
                pickle.dump(checkpoint_data, fp)

            checkpoint = Checkpoint.from_directory(checkpoint_dir)
            # tell raytune how we performed
            train.report(
                {"loss": val_loss / val_steps, "accuracy": correct / total},
                checkpoint=checkpoint,
            )

    print("Finished Training")

We also use a hold-out test set with data that has not been used for training the model. Let's wrap this in a function:

In [11]:
def test_accuracy(net, device="cpu"):
    # load testing data
    trainset, testset = load_data()

    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=0
    )

    # evaluate the model
    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return correct / total

## Tuning

Now we can finally start the tuning.

We wrap the ``train_cifar`` function with ``functools.partial`` to set our parameters. We can also tell Ray Tune what resources should be available for each trial.
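
As a small illustration of what `functools.partial` does here, the sketch below binds the extra arguments of a stand-in function so that only `config` remains to be supplied. `train_fn` is a hypothetical placeholder for `train_cifar` and does no training.

```python
from functools import partial

def train_fn(config, device="cpu", data_dir=None):
    # stand-in for train_cifar: just report what it received
    return f"lr={config['lr']} on {device} with data in {data_dir}"

# fix the extra arguments up front; only `config` is left open
trainable = partial(train_fn, device="cpu", data_dir="data/")

result_a = trainable({"lr": 0.01})
result_b = trainable({"lr": 0.001})
print(result_a)  # lr=0.01 on cpu with data in data/
print(result_b)  # lr=0.001 on cpu with data in data/
```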

In [13]:
result = tune.run(
    partial(train_cifar, data_dir=data_dir, device=device),
    resources_per_trial=resources_per_trial,
    config=config,
    num_samples=num_samples,
    scheduler=scheduler
)

best_trial = result.get_best_trial("loss", "min", "last")
print(f"Best trial config: {best_trial.config}")
print(f"Best trial final validation loss: {best_trial.last_result['loss']}")
print(f"Best trial final validation accuracy: {best_trial.last_result['accuracy']}")

2024-07-15 14:57:45,427	INFO tune.py:616 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949


Current time: 2024-07-15 15:49:30
Running for: 00:51:44.93
Memory: 6.7/11.7 GiB

| Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| train_cifar_d4897_00000 | TERMINATED | 127.0.0.1:8112 | 4 | 256 | 2 | 0.00829845 | 10 | 815.185 | 2.30639 | 0.1 |
| train_cifar_d4897_00001 | TERMINATED | 127.0.0.1:14188 | 4 | 2 | 32 | 0.0491035 | 1 | 104.856 | 2.32841 | 0.098 |
| train_cifar_d4897_00002 | TERMINATED | 127.0.0.1:17588 | 16 | 128 | 64 | 0.00260339 | 10 | 477.295 | 1.09617 | 0.6222 |
| train_cifar_d4897_00003 | TERMINATED | 127.0.0.1:3472 | 16 | 2 | 4 | 0.00540818 | 10 | 382.508 | 1.59017 | 0.3832 |
| train_cifar_d4897_00004 | TERMINATED | 127.0.0.1:12132 | 4 | 128 | 2 | 0.000200899 | 1 | 114.262 | 2.30472 | 0.102 |
| train_cifar_d4897_00005 | TERMINATED | 127.0.0.1:15960 | 4 | 1 | 1 | 0.00254347 | 2 | 249.805 | 1.89697 | 0.1902 |
| train_cifar_d4897_00006 | TERMINATED | 127.0.0.1:8296 | 2 | 1 | 1 | 0.0353829 | 1 | 189.712 | 2.33851 | 0.1002 |
| train_cifar_d4897_00007 | TERMINATED | 127.0.0.1:8272 | 8 | 4 | 16 | 0.0951171 | 1 | 31.1576 | 2.35162 | 0.0978 |
| train_cifar_d4897_00008 | TERMINATED | 127.0.0.1:1296 | 16 | 8 | 2 | 0.00957997 | 4 | 205.509 | 1.63253 | 0.3348 |
| train_cifar_d4897_00009 | TERMINATED | 127.0.0.1:16976 | 16 | 64 | 64 | 0.00202307 | 10 | 403.011 | 1.12562 | 0.6052 |


| Trial name | accuracy | loss | should_checkpoint |
|---|---|---|---|
| train_cifar_d4897_00000 | 0.1 | 2.30639 | True |
| train_cifar_d4897_00001 | 0.098 | 2.32841 | True |
| train_cifar_d4897_00002 | 0.6222 | 1.09617 | True |
| train_cifar_d4897_00003 | 0.3832 | 1.59017 | True |
| train_cifar_d4897_00004 | 0.102 | 2.30472 | True |
| train_cifar_d4897_00005 | 0.1902 | 1.89697 | True |
| train_cifar_d4897_00006 | 0.1002 | 2.33851 | True |
| train_cifar_d4897_00007 | 0.0978 | 2.35162 | True |
| train_cifar_d4897_00008 | 0.3348 | 1.63253 | True |
| train_cifar_d4897_00009 | 0.6052 | 1.12562 | True |


2024-07-15 15:49:30,424	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to 'C:/Users/seani/ray_results/train_cifar_2024-07-15_14-57-45' in 0.0289s.
2024-07-15 15:49:30,439	INFO tune.py:1041 -- Total run time: 3105.01 seconds (3104.90 seconds for the tuning loop).


Best trial config: {'l1': 128, 'l2': 64, 'lr': 0.0026033864414838764, 'batch_size': 16}
Best trial final validation loss: 1.0961708369731904
Best trial final validation accuracy: 0.6222
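
The selection made by `get_best_trial("loss", "min", "last")` can be checked by hand against the results reported above. The sketch below is plain Python over the last reported loss and accuracy of each trial, copied from the output of this run. It only mirrors the idea of picking the trial with the lowest final validation loss and does not use the Ray Tune API.

```python
# (trial name, last reported validation loss, last reported accuracy)
last_results = [
    ("train_cifar_d4897_00000", 2.30639, 0.1),
    ("train_cifar_d4897_00001", 2.32841, 0.098),
    ("train_cifar_d4897_00002", 1.09617, 0.6222),
    ("train_cifar_d4897_00003", 1.59017, 0.3832),
    ("train_cifar_d4897_00004", 2.30472, 0.102),
    ("train_cifar_d4897_00005", 1.89697, 0.1902),
    ("train_cifar_d4897_00006", 2.33851, 0.1002),
    ("train_cifar_d4897_00007", 2.35162, 0.0978),
    ("train_cifar_d4897_00008", 1.63253, 0.3348),
    ("train_cifar_d4897_00009", 1.12562, 0.6052),
]

# pick the trial whose last reported loss is smallest
best_name, best_loss, best_acc = min(last_results, key=lambda row: row[1])
print(best_name, best_loss, best_acc)  # train_cifar_d4897_00002 1.09617 0.6222
```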


Now let's get our best trial model and evaluate it on the test data:

In [14]:
# get the model
best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
best_trained_model.to(device)

# get its checkpoint so we can load the state
best_checkpoint = result.get_best_checkpoint(trial=best_trial, metric="accuracy", mode="max")
with best_checkpoint.as_directory() as checkpoint_dir:
    data_path = Path(checkpoint_dir) / "data.pkl"
    with open(data_path, "rb") as fp:
        best_checkpoint_data = pickle.load(fp)

    best_trained_model.load_state_dict(best_checkpoint_data["net_state_dict"])

    # with the loaded state we can then get the test accuracy
    test_acc = test_accuracy(best_trained_model, device)
    print("Best trial test set accuracy: {}".format(test_acc))

Files already downloaded and verified
Files already downloaded and verified
Best trial test set accuracy: 0.6119
