### Goal
This tutorial shows how to evaluate several trained models and pick the best-performing one. We train ANNs with different numbers of hidden units on the MNIST dataset and find the one with the highest accuracy on the test set.

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
import devtorch
%load_ext autoreload
%autoreload 2



### Let's train a few models

In [9]:
class ANNClassifier(devtorch.DevModel):
    
    def __init__(self, n_in, n_hidden, n_out):
        super().__init__()
        self._n_in = n_in
        self._n_hidden = n_hidden
        self._n_out = n_out
        self.layer1 = nn.Linear(n_in, n_hidden, bias=False)
        self.layer2 = nn.Linear(n_hidden, n_out, bias=False)
        self.init_weight(self.layer1.weight, "glorot_uniform")
        self.init_weight(self.layer2.weight, "glorot_uniform")
    
    @property
    def hyperparams(self):
        return {**super().hyperparams, "params": {"n_in": self._n_in, "n_hidden": self._n_hidden, "n_out": self._n_out}}
    
    def forward(self, x):
        x = F.leaky_relu(self.layer1(x.flatten(1, 3)))
        return F.leaky_relu(self.layer2(x))

In [31]:
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
train_dataset = datasets.MNIST("../../data", train=True, download=True, transform=transform)
test_dataset = datasets.MNIST("../../data", train=False, download=True, transform=transform)

def loss(output, target):
    return F.cross_entropy(output, target.long())

root = "../../data/multi"  # directory where the checkpoints are saved

for n_hidden in [1, 10, 100]:
    model = ANNClassifier(784, n_hidden, 10)
    model_id = f"units_{n_hidden}"  # the name of the checkpoint; if none is provided, devtorch auto-generates one.
    trainer = devtorch.get_trainer(
        loss, root=root, id=model_id, model=model, train_dataset=train_dataset,
        n_epochs=8, batch_size=128, lr=0.001, device="cuda",
    )
    trainer.train(save=True)

INFO:trainer:Completed epoch 0 with loss 972.7274848222733 in 7.4804s
INFO:trainer:Completed epoch 1 with loss 955.9463212490082 in 7.4609s
INFO:trainer:Completed epoch 2 with loss 948.9528036117554 in 7.3285s
INFO:trainer:Completed epoch 3 with loss 941.0648840665817 in 7.3228s
INFO:trainer:Completed epoch 4 with loss 924.0126966238022 in 7.3340s
INFO:trainer:Completed epoch 5 with loss 901.7036259174347 in 7.3348s
INFO:trainer:Completed epoch 6 with loss 892.2680432796478 in 7.3324s
INFO:trainer:Completed epoch 7 with loss 884.9409943819046 in 7.3330s
INFO:trainer:Completed epoch 0 with loss 329.5164772942662 in 7.3363s
INFO:trainer:Completed epoch 1 with loss 153.00060449913144 in 7.3250s
INFO:trainer:Completed epoch 2 with loss 137.13186548650265 in 7.3231s
INFO:trainer:Completed epoch 3 with loss 129.66909927129745 in 7.3273s
INFO:trainer:Completed epoch 4 with loss 124.7119121607393 in 7.3354s
INFO:trainer:Completed epoch 5 with loss 120.97485350817442 in 7.3396s
INFO:trainer:Com
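
One way to read these numbers: they are far larger than a typical per-sample cross-entropy, which suggests that the trainer logs the sum of the per-batch losses over an epoch. This is an assumption; neither the logs nor the code above confirm it. Under that assumption, dividing by the number of batches gives a per-batch average that can be compared with ln(10) ≈ 2.30, the cross-entropy of a uniform guess over 10 classes. A minimal sketch of the arithmetic, using the 60,000-image MNIST training split and the batch_size=128 passed to the trainer:

```python
import math

n_train = 60_000   # size of the MNIST training split
batch_size = 128   # value passed to get_trainer above

# Number of batches per epoch (the last batch is smaller than the rest).
n_batches = math.ceil(n_train / batch_size)

# Epoch losses copied from the log above (first and last epoch of units_1).
logged = {
    "units_1, epoch 0": 972.7274848222733,
    "units_1, epoch 7": 884.9409943819046,
}

# ASSUMPTION: each logged value is the sum of the per-batch mean losses.
per_batch = {name: value / n_batches for name, value in logged.items()}

# Cross-entropy of a uniform guess over 10 classes.
chance_level = math.log(10)

for name, value in per_batch.items():
    print(f"{name}: {value:.3f} per batch (chance level {chance_level:.3f})")
```

Read this way, the 1-unit model stays close to chance level throughout training, which matches its low test accuracy at the end of this tutorial.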

### Inspect different model hyperparams

Each model was saved under a name that includes its number of hidden units (model_id = f"units_{n_hidden}"). Suppose, however, that the names did not encode this and we wanted to check how the saved models differ. For this we can use the devtorch.build_models_df function, passing it a mapper that picks the hyperparameters we care about out of each checkpoint:

In [32]:
def hyperparams_mapper(hyperparams):
    return {"n_hidden": hyperparams["model"]["params"]["n_hidden"]} 

devtorch.build_models_df(root, hyperparams_mapper)

           n_hidden
model_id
units_1           1
units_10         10
units_100       100
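
To make the mapper's role concrete, here is a minimal sketch that applies it to hand-written dictionaries. The nesting under "model" and "params" follows the accesses in hyperparams_mapper above; the `saved` dictionary is a hypothetical stand-in for what devtorch stores, and real checkpoints likely carry more keys.

```python
def hyperparams_mapper(hyperparams):
    # Keep only the field we want as a column in the resulting table.
    return {"n_hidden": hyperparams["model"]["params"]["n_hidden"]}

# HYPOTHETICAL stand-in for the hyperparameters stored with each checkpoint.
saved = {
    "units_1": {"model": {"params": {"n_in": 784, "n_hidden": 1, "n_out": 10}}},
    "units_10": {"model": {"params": {"n_in": 784, "n_hidden": 10, "n_out": 10}}},
    "units_100": {"model": {"params": {"n_in": 784, "n_hidden": 100, "n_out": 10}}},
}

# One row per model_id, with one column per key returned by the mapper.
rows = {model_id: hyperparams_mapper(hp) for model_id, hp in saved.items()}

for model_id, row in rows.items():
    print(model_id, row)
```

Returning more keys from the mapper (for example n_in or n_out) would presumably add more columns to the table.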


### Get best model

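We can now compare the accuracy of the different models using the devtorch.build_metric_df function. It needs a function that rebuilds a model from its saved hyperparameters and a metric function to evaluate on the model's outputs: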

In [33]:
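def model_loader(hyperparams):
    # Rebuild the model from the hyperparameters stored with its checkpoint.
    return ANNClassifier(**hyperparams["model"]["params"])

def eval_metric(output, target):
    # Number of correct predictions: the predicted class is the index of the largest output.
    return (torch.max(output, 1)[1] == target).sum().cpu().item()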

In [48]:
metric_df = devtorch.build_metric_df(
    root,
    model_loader,
    test_dataset,
    eval_metric,
    model_ids=None,  # None loads all models; pass a list of model_ids to evaluate only those.
    batch_size=256,
    device="cuda",
    dtype=torch.float,
    # **kwargs: additional custom arguments are passed on to the model's forward call.
)

INFO:validator:Computing metric for units_1...
INFO:validator:Computing metric for units_10...
INFO:validator:Computing metric for units_100...
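
Note that eval_metric returns a count of correct predictions, not an accuracy. Assuming build_metric_df stores one metric_score row per batch (the group-and-sum in the next cell suggests this, but it is an assumption), summing the counts and dividing by the dataset size gives the exact accuracy. Averaging per-batch accuracies would instead over-weight the smaller final batch. A small sketch with made-up counts, using the 10,000-image MNIST test split and batch_size=256:

```python
import math

n_test = 10_000    # size of the MNIST test split
batch_size = 256   # value passed to build_metric_df above

n_batches = math.ceil(n_test / batch_size)
last_batch = n_test - (n_batches - 1) * batch_size
batch_sizes = [batch_size] * (n_batches - 1) + [last_batch]

# MADE-UP counts: every full batch is entirely correct, the last batch is half correct.
correct = [batch_size] * (n_batches - 1) + [last_batch // 2]

# Pooled accuracy: total correct predictions over total samples.
pooled = sum(correct) / n_test

# Mean of per-batch accuracies: gives the small last batch the same weight as a full one.
mean_of_batches = sum(c / b for c, b in zip(correct, batch_sizes)) / n_batches

print(n_batches, last_batch)
print(f"pooled: {pooled:.4f}, mean of per-batch accuracies: {mean_of_batches:.4f}")
```

The two numbers differ only because the last batch is smaller than the others; with equally sized batches they would coincide.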


In [49]:
# Sum the correct-prediction counts per model, then divide by the number of test samples.
metric_df.groupby("model_id")["metric_score"].sum() / len(test_dataset)

model_id
units_1      0.1903
units_10     0.9291
units_100    0.9718
Name: metric_score, dtype: float64
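
The best model is then simply the one with the highest accuracy. A minimal sketch using the values printed above, written in plain Python so that it runs without the checkpoints; on the actual pandas Series, `idxmax()` returns the same label.

```python
# Accuracies copied from the output above.
accuracy = {"units_1": 0.1903, "units_10": 0.9291, "units_100": 0.9718}

# The key with the largest value is the best-performing model.
best_model_id = max(accuracy, key=accuracy.get)

print(best_model_id, accuracy[best_model_id])  # units_100 0.9718
```

Here the 100-unit network wins, and going from 10 to 100 hidden units adds about four percentage points of test accuracy.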