## Introduction

In this notebook, we tune a deep fully-connected neural network by varying its number of layers and the number of neurons per layer. To this end, we use Optuna, a framework for tuning the hyperparameters of machine learning models. We apply random search and Bayesian optimization to create baselines, so that we can compare their results with the hyperparameter tuning results of the differential evolution algorithm we implemented.

Let's start by importing the necessary packages.

## Data Loading & Transforms

In [1]:
!jupyter nbextension enable --py widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


In [2]:
import numpy as np
import pandas as pd
import pickle
import time

import torch
import torchvision.datasets as datasets
from torchvision import transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data

# !pip install optuna
# Kaggle already has it installed in its notebooks btw.
import optuna
from optuna.trial import TrialState

import os

import joblib

In [3]:
print('You are in', os.getcwd())
data_dir = os.getcwd()

You are in /home/dick


In [4]:
DEVICE = torch.device("cpu")
classes = 10
epochs = 100

Let's now download the datasets and transform them into pytorch tensors.

In [5]:
mnist_trainset = datasets.MNIST(root=data_dir, train=True, download=True, transform=transforms.ToTensor())
mnist_testset = datasets.MNIST(root=data_dir, train=False, download=True, transform=transforms.ToTensor())



The training set contains 60,000 samples and the test set contains 10,000 hand-written digit images. The class labels range from 0 to 9.

In [6]:
print(mnist_trainset)
print()
print(mnist_testset)

Dataset MNIST
    Number of datapoints: 60000
    Root location: /home/dick
    Split: Train
    StandardTransform
Transform: ToTensor()

Dataset MNIST
    Number of datapoints: 10000
    Root location: /home/dick
    Split: Test
    StandardTransform
Transform: ToTensor()


We can now create the dataloader objects for the datasets we downloaded. To make a fair comparison, we use the default hyperparameters of the notebook that performs the optimization with the differential evolution algorithm. If you'd like to increase `batch_size` or shuffle the samples, feel free to do so.

In [7]:
train_loader = torch.utils.data.DataLoader(mnist_trainset, batch_size=10, shuffle=False)
test_loader = torch.utils.data.DataLoader(mnist_testset, batch_size=10, shuffle=False)

Before creating the model, let's print the shapes of the training instances in one batch. We won't be checking the test instances since the same results apply to them. 

In [8]:
print('Shape of the input features in one batch:', next(iter(train_loader))[0].shape)
print('Shape of the targets in one batch:', next(iter(train_loader))[1].shape)

Shape of the input features in one batch: torch.Size([10, 1, 28, 28])
Shape of the targets in one batch: torch.Size([10])


To be clear, the shape of the input tensors above corresponds to **[Batch Size, Channels, Height, Width]**. MNIST is grayscale, so its **channel count is 1**. The rest is self-explanatory: each batch consists of **10 images** of **28 x 28 pixels**.
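
Since the model in this notebook is fully connected, each image is flattened into a vector of 28 * 28 = 784 values before it is passed to the network. A minimal sketch of that reshaping on a dummy tensor (`batch` and `flat` are illustrative names, not variables of the notebook):

```python
import torch

# A dummy batch shaped like one MNIST batch: [batch, channels, height, width].
batch = torch.zeros(10, 1, 28, 28)

# Keep the batch dimension and flatten everything else into one axis.
flat = batch.view(batch.size(0), -1)

print(flat.shape)  # torch.Size([10, 784])
```

The `-1` lets PyTorch infer the flattened size, so the same line keeps working if the batch size changes.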

## Defining the Model

We've finished creating the data loaders, so we can now start defining the model. Remember that we tune only the number of layers and neurons; the other hyperparameters are the same as those in DE.ipynb.

In [9]:
def baseline_model(trial):
    
    # In the default setting, the number of layers ranges from 5 to 70, while the number of hidden neurons
    # per layer ranges from 4 to 10.
    n_layers = trial.suggest_int("n_layers", 5, 70)
    layers = []
    
    in_features = 28 * 28
    for i in range(n_layers):
        out_features = trial.suggest_int("n_units_l{}".format(i), 4, 10)
        layers.append(nn.Linear(in_features, out_features))
        layers.append(nn.ReLU())

        in_features = out_features
        
    layers.append(nn.Linear(in_features, 10))
    layers.append(nn.LogSoftmax(dim = 1))

    return nn.Sequential(*layers)
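
To get a feel for the size of the networks in this search space, the parameter count of such a stack of `nn.Linear` layers can be computed by hand: each layer contributes `in_features * out_features` weights plus `out_features` biases. The helper below is a hypothetical sketch (`count_parameters` is not part of the notebook) that mirrors the layer layout of `baseline_model`:

```python
def count_parameters(hidden_units, in_features=28 * 28, n_classes=10):
    """Count weights and biases of an MLP laid out like baseline_model."""
    total = 0
    for out_features in hidden_units:
        total += in_features * out_features + out_features  # weights + biases
        in_features = out_features
    # Final classification layer.
    total += in_features * n_classes + n_classes
    return total

# Smallest depth with 8 units per layer vs. largest depth with 10 units per layer.
print(count_parameters([8] * 5))    # 6658
print(count_parameters([10] * 70))  # 15550
```

Because the hidden layers are so narrow, the first layer (784 inputs) accounts for most of the parameters in either case.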

We can now create our objective function to tune with Optuna.

## Hyperparameter Tuning

Optuna requires an objective function to start its experiments. Below we will be creating it by instantiating the model and fetching the datasets, followed by training and testing.

In [10]:
# Reduce these numbers for faster training.
BATCHSIZE = 10
N_TRAIN_EXAMPLES = BATCHSIZE * 6000
N_VALID_EXAMPLES = BATCHSIZE * 1000
early_stopping_rounds = 3
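
These limits are expressed in examples, and the loops in `objective` compare `batch_idx * BATCHSIZE` against them. With the values above, the limits equal the full MNIST split sizes, so no data is actually skipped. The sketch below replays that check without touching any data (`batches_used` is a hypothetical helper, and the split sizes are the ones printed earlier):

```python
BATCHSIZE = 10
N_TRAIN_EXAMPLES = BATCHSIZE * 6000
TRAIN_SIZE = 60000  # MNIST training split size

def batches_used(dataset_size, limit, batch_size):
    """Count the batches processed before the `break` check stops the loop."""
    total_batches = -(-dataset_size // batch_size)  # ceiling division
    used = 0
    for batch_idx in range(total_batches):
        if batch_idx * batch_size >= limit:
            break
        used += 1
    return used

print(batches_used(TRAIN_SIZE, N_TRAIN_EXAMPLES, BATCHSIZE))  # 6000
print(batches_used(TRAIN_SIZE, BATCHSIZE * 600, BATCHSIZE))   # 600
```

Lowering the multiplier (for example from 6000 to 600) is what shortens an epoch.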

In [11]:
def objective(trial):

    model = baseline_model(trial).to(DEVICE)

    # Tune the specific hyperparameters of the gradient descent if you wish.
    # optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "RMSprop", "SGD"])
    # lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    
    optimizer_name = 'Adam'
    lr = 0.001
    
    optimizer = getattr(optim, optimizer_name)(model.parameters(), lr=lr)
    loss_fn = torch.nn.NLLLoss()
    
    # Get the MNIST dataloaders.
    #train_loader, valid_loader = train_loader, test_loader
    valid_loader = test_loader
    
    train_batch_count = min(len(train_loader.dataset), N_TRAIN_EXAMPLES) / BATCHSIZE
    valid_batch_count = min(len(valid_loader.dataset), N_VALID_EXAMPLES) / BATCHSIZE
    # Training of the model.
    stop_counter = 0
    best_val_loss = float('inf')
    for epoch in range(epochs):
        if stop_counter > early_stopping_rounds:
            print()
            print('No improvement in validation data for the specified number of early stopping rounds. Stopping training.')
            break
                
        model.train()
        trainloss = 0
        for batch_idx, (data, target) in enumerate(train_loader):
            # Limiting training data for faster epochs.
            if batch_idx * BATCHSIZE >= N_TRAIN_EXAMPLES:
                break
            
            
            data, target = data.view(data.size(0), -1).to(DEVICE), target.to(DEVICE)
            
            optimizer.zero_grad()
            output = model(data)

            loss = loss_fn(output, target)
            loss.backward()
            optimizer.step()
            trainloss += loss.item()
            
        trainloss = trainloss / train_batch_count
        # Validation of the model.
        model.eval()
        validationloss = 0
        correct = 0
        with torch.no_grad():
            for batch_idx, (data, target) in enumerate(valid_loader):
                # Limiting validation data.
                if batch_idx * BATCHSIZE >= N_VALID_EXAMPLES:
                    break
                    
                data, target = data.view(data.size(0), -1).to(DEVICE), target.to(DEVICE)
                
                output = model(data)
                loss = loss_fn(output, target)
                validationloss += loss.item()
                
                # Get the index of the max probability.
                pred = output.argmax(dim=1, keepdim=True)
                correct += pred.eq(target.view_as(pred)).sum().item()
                
        validationloss = validationloss / valid_batch_count
        
        if validationloss < best_val_loss:
            best_val_loss = validationloss
            stop_counter = 0
        else:
            stop_counter += 1
        
        accuracy = correct /  min(len(valid_loader.dataset), N_VALID_EXAMPLES)
        
        print('Epoch {} => Training_Loss:{:.4f}, Validation_Loss:{:.8f}, Validation_Acc:{:.8f}, Stop_counter:{}'.format(epoch, trainloss, validationloss, accuracy, stop_counter))
        trial.report(accuracy, epoch)

        # Handle pruning based on the intermediate value. This stops trials that don't look promising.
        # To measure the worst-case runtime of Optuna, comment out the lines below.
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

    return accuracy
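
One detail of the early stopping logic is worth checking in isolation. Because the loop tests `stop_counter > early_stopping_rounds` at the start of each epoch, training continues until the counter exceeds the limit. The sketch below replays only the counter logic on a made-up list of validation losses (`epochs_run` and its `strict` flag are hypothetical, introduced for illustration):

```python
def epochs_run(val_losses, early_stopping_rounds, strict=True):
    """Return how many epochs execute for a sequence of validation losses.

    strict=True mirrors `stop_counter > early_stopping_rounds`;
    strict=False uses `>=` instead (a hypothetical variant).
    """
    stop_counter = 0
    best_val_loss = float("inf")
    executed = 0
    for loss in val_losses:
        if strict:
            should_stop = stop_counter > early_stopping_rounds
        else:
            should_stop = stop_counter >= early_stopping_rounds
        if should_stop:
            break
        executed += 1
        if loss < best_val_loss:
            best_val_loss = loss
            stop_counter = 0
        else:
            stop_counter += 1
    return executed

# One improving epoch followed by epochs that never improve.
losses = [1.0] + [2.0] * 9
print(epochs_run(losses, 3, strict=True))   # 5
print(epochs_run(losses, 3, strict=False))  # 4
```

With `early_stopping_rounds = 3`, the `>` comparison therefore tolerates four non-improving epochs before stopping, while `>=` would stop after three.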

We can now create the study object and start our experiments. You can add a time limit to the experiments by passing the `timeout` parameter to the study's `optimize` method.

Here, we use the TPE sampler (Optuna's default sampler), which optimizes the hyperparameters using Bayesian statistics.

In [12]:
# You can change the logging type if you'd like to see more verbosity in trials. For now I just want to see the progress and no other details
# regarding the trials.
optuna.logging.set_verbosity(optuna.logging.WARNING)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10, timeout=1500, show_progress_bar = True)

pruned_trials = study.get_trials(deepcopy=False, states=[TrialState.PRUNED])
complete_trials = study.get_trials(deepcopy=False, states=[TrialState.COMPLETE])

print("Study statistics: ")
print("  Number of finished trials: ", len(study.trials))
print("  Number of pruned trials: ", len(pruned_trials))
print("  Number of complete trials: ", len(complete_trials))

print("Best trial:")
best_trial = study.best_trial

print("  Value: ", best_trial.value)

print("  Params: ")
for key, value in best_trial.params.items():
    print("    {}: {}".format(key, value))



  0%|          | 0/10 [00:00<?, ?it/s]

Epoch 0 => Training_Loss:2.3022, Validation_Loss:2.30153231, Validation_Acc:0.10280000, Stop_counter:0
Epoch 1 => Training_Loss:2.3018, Validation_Loss:2.30150786, Validation_Acc:0.10280000, Stop_counter:0
Epoch 2 => Training_Loss:2.3018, Validation_Loss:2.30147860, Validation_Acc:0.10280000, Stop_counter:0
Epoch 3 => Training_Loss:2.3017, Validation_Loss:2.30146436, Validation_Acc:0.10280000, Stop_counter:0
Epoch 4 => Training_Loss:2.3017, Validation_Loss:2.30145326, Validation_Acc:0.10280000, Stop_counter:0
Epoch 5 => Training_Loss:2.3017, Validation_Loss:2.30144351, Validation_Acc:0.10280000, Stop_counter:0
Epoch 6 => Training_Loss:2.3016, Validation_Loss:2.30143410, Validation_Acc:0.10280000, Stop_counter:0
Epoch 7 => Training_Loss:2.3016, Validation_Loss:2.30142711, Validation_Acc:0.10280000, Stop_counter:0
Epoch 8 => Training_Loss:2.3016, Validation_Loss:2.30142132, Validation_Acc:0.10280000, Stop_counter:0
Epoch 9 => Training_Loss:2.3016, Validation_Loss:2.30141070, Validation_A

So the Bayesian optimization baseline took approximately 5 minutes to complete its run on the CPU and reached an accuracy of 39%. It proposes 9 layers, with every layer having 8 or 9 neurons, except for the third layer.

Let's save the results of the trials for later usage.

In [18]:
!mkdir studies

mkdir: cannot create directory ‘studies’: File exists


In [19]:
study_path = './studies'

In [20]:
joblib.dump(study, study_path + '/mnist_optuna_tpe_10trials2.pkl')
study = joblib.load(study_path + '/mnist_optuna_tpe_10trials2.pkl')
df_bayes = study.trials_dataframe()
df_bayes.head(3)

Unnamed: 0,number,value,datetime_start,datetime_complete,duration,state
0,0,,2022-01-21 20:45:35.656644,2022-01-21 20:45:35.656696,00:00:00.000052,FAIL


Let's look at the tuning results obtained by Random Search.

In [31]:
optuna.logging.set_verbosity(optuna.logging.WARNING)

random_sampler=optuna.samplers.RandomSampler(seed = 10)

study = optuna.create_study(sampler = random_sampler, direction="maximize")
study.optimize(objective, n_trials=10, timeout=1500, show_progress_bar = True)

pruned_trials = study.get_trials(deepcopy=False, states=[TrialState.PRUNED])
complete_trials = study.get_trials(deepcopy=False, states=[TrialState.COMPLETE])

print("Study statistics: ")
print("  Number of finished trials: ", len(study.trials))
print("  Number of pruned trials: ", len(pruned_trials))
print("  Number of complete trials: ", len(complete_trials))

print("Best trial:")
trial = study.best_trial

print("  Value: ", trial.value)

print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))



  0%|          | 0/10 [00:00<?, ?it/s]

Epoch 0 => Training_Loss:2.3023, Validation_Loss:2.30151453, Validation_Acc:0.10280000, Stop_counter:0
Epoch 1 => Training_Loss:2.3017, Validation_Loss:2.30152230, Validation_Acc:0.10280000, Stop_counter:1
Epoch 2 => Training_Loss:2.3017, Validation_Loss:2.30151260, Validation_Acc:0.10280000, Stop_counter:0
Epoch 3 => Training_Loss:2.3016, Validation_Loss:2.30150102, Validation_Acc:0.10280000, Stop_counter:0
Epoch 4 => Training_Loss:2.3016, Validation_Loss:2.30149903, Validation_Acc:0.10280000, Stop_counter:0
Epoch 5 => Training_Loss:2.3015, Validation_Loss:2.30149046, Validation_Acc:0.10280000, Stop_counter:0
Epoch 6 => Training_Loss:2.3015, Validation_Loss:2.30148129, Validation_Acc:0.10280000, Stop_counter:0
Epoch 7 => Training_Loss:2.3015, Validation_Loss:2.30147279, Validation_Acc:0.10280000, Stop_counter:0
Epoch 8 => Training_Loss:2.3015, Validation_Loss:2.30146418, Validation_Acc:0.10280000, Stop_counter:0
Epoch 9 => Training_Loss:2.3015, Validation_Loss:2.30145525, Validation_A

With an accuracy of 14%, random search performed much worse than Bayesian search. The model it proposes also has a lot more parameters than the one proposed by the Bayesian approach.

In [38]:
joblib.dump(study, study_path + '/mnist_optuna_random_10trials2.pkl')
study = joblib.load(study_path + '/mnist_optuna_random_10trials2.pkl')
df_random = study.trials_dataframe()
df_random.head(3)

## Conclusion

Both the Bayesian and random search approaches returned weak accuracy values. This may be due to the number of tuning rounds, which we set to 10. Increasing it will likely lead to better results, especially for the Bayesian search baseline, since it improves its proposals based on the results of former trials; 10 trials is probably too few for it to come up with better options.
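
As a sanity check on the logged values, it helps to know what chance level looks like for this task. A classifier that assigns equal probability to all 10 classes has an NLL loss of ln(10) and an expected accuracy of about 10%. The arithmetic is sketched below (`n_classes`, `chance_loss` and `chance_accuracy` are illustrative names):

```python
import math

n_classes = 10

# NLL of a uniform prediction: -log(1 / n_classes) = log(n_classes).
chance_loss = -math.log(1.0 / n_classes)
chance_accuracy = 1.0 / n_classes

print(round(chance_loss, 4), chance_accuracy)  # 2.3026 0.1
```

The validation losses (about 2.3014) and accuracies (0.1028) in the visible epochs of both runs sit very close to these values, which suggests that those particular trials had not moved away from chance level. This is only a reading of the truncated logs, not a diagnosis of every trial.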