# train

CIFAR-10 is one of the best-known image recognition benchmark datasets in the deep learning space. It is an independently relabelled subset of the now-retired 80 Million Tiny Images dataset containing just 10 different classes overall:

```csv
airplanes,cars,birds,cats,deer,dogs,frogs,horses,ships,trucks
```

The model from the paper accompanying the release of this dataset was 20% inacurrate at the time of the dataset's release in 2009-2010. A ResNet achieved 4% inaccuracy on this classification task back in 2016. As of 2020, benchmark inaccuracy on this task has dropped down to around 1%, according to [paperswithcode](https://paperswithcode.com/sota/image-classification-on-cifar-10), rendering this simple benchmark dataset a (mostly) solved problem. How far we've come in ten years!

In this notebook we will train a simple convolutional neural net (CNN), written in PyTorch, on this dataset, demonstrating some of features of the Spell platform in the process.

## initial model

`CIFAR10` is available as a prepackaged dataset via `torchvision.data`. Note that, as a convention, we recommend downloading/mounting datasets to the `/mnt/` directory.

In [1]:
import torchvision

transform = torchvision.transforms.Compose([
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.RandomPerspective(),
    torchvision.transforms.ToTensor()
])
dataset = torchvision.datasets.CIFAR10("/mnt/cifar10/", train=True, transform=transform, download=True)

Files already downloaded and verified


In [2]:
dataset

Dataset CIFAR10
    Number of datapoints: 50000
    Root location: /mnt/cifar10/
    Split: Train
    StandardTransform
Transform: Compose(
               RandomHorizontalFlip(p=0.5)
               RandomPerspective(p=0.5)
               ToTensor()
           )

In [3]:
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

In [4]:
X_ex, y_ex = next(iter(dataloader))

In [5]:
X_ex.shape, y_ex.shape

(torch.Size([32, 3, 32, 32]), torch.Size([32]))

In [6]:
X_ex[0]

tensor([[[0.5137, 0.5294, 0.5294,  ..., 0.5412, 0.5373, 0.5294],
         [0.5451, 0.5529, 0.5569,  ..., 0.5686, 0.5569, 0.5647],
         [0.5804, 0.5882, 0.5922,  ..., 0.6039, 0.5922, 0.6039],
         ...,
         [0.0157, 0.0353, 0.0196,  ..., 0.8824, 0.8902, 0.8902],
         [0.0118, 0.0196, 0.0157,  ..., 0.8745, 0.8784, 0.8784],
         [0.0039, 0.0039, 0.0000,  ..., 0.8588, 0.8392, 0.8431]],

        [[0.7059, 0.7137, 0.7216,  ..., 0.7294, 0.7176, 0.7255],
         [0.7098, 0.7176, 0.7176,  ..., 0.7255, 0.7216, 0.7294],
         [0.7333, 0.7373, 0.7373,  ..., 0.7451, 0.7451, 0.7529],
         ...,
         [0.0039, 0.0196, 0.0196,  ..., 0.8745, 0.8863, 0.8902],
         [0.0039, 0.0078, 0.0157,  ..., 0.8706, 0.8784, 0.8824],
         [0.0039, 0.0039, 0.0000,  ..., 0.8667, 0.8392, 0.8431]],

        [[0.8510, 0.8627, 0.8627,  ..., 0.8706, 0.8627, 0.8588],
         [0.8510, 0.8588, 0.8510,  ..., 0.8588, 0.8588, 0.8549],
         [0.8627, 0.8667, 0.8627,  ..., 0.8745, 0.8784, 0.

In [7]:
y_ex

tensor([7, 9, 9, 8, 1, 8, 2, 2, 0, 2, 0, 0, 8, 7, 8, 5, 9, 7, 2, 0, 6, 1, 9, 2,
        5, 7, 7, 8, 5, 9, 6, 3])

Next, we define the model. This model is a PyTorch implementation of `cifar10_cnn.py` from [Keras's example repository](https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py). This is a very simple convolutional net with a feedforward head.

In [8]:
import torch
from torch import nn

class CIFAR10Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn_block_1 = nn.Sequential(*[
            nn.Conv2d(3, 32, 3),
            nn.ReLU(),
            nn.Conv2d(32, 32, 3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Dropout(0.25)
        ])
        self.cnn_block_2 = nn.Sequential(*[
            nn.Conv2d(32, 32, 3),
            nn.ReLU(),
            nn.Conv2d(32, 32, 3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Dropout(0.25)
        ])
        self.flatten = lambda inp: torch.flatten(inp, 1)
        self.head = nn.Sequential(*[
            nn.Linear(800, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, 10)
        ])
    
    def forward(self, X):
        X = self.cnn_block_1(X)
        X = self.cnn_block_2(X)
        X = self.flatten(X)
        X = self.head(X)
        return X

In [9]:
clf = CIFAR10Model()
clf.cuda()
clf.forward(X_ex.cuda())

tensor([[ 0.0564, -0.0351, -0.0408,  0.0134, -0.0342,  0.0464, -0.0170, -0.0187,
          0.0066,  0.1017],
        [ 0.0363, -0.0267, -0.0080,  0.0199, -0.0397,  0.0223,  0.0141, -0.0222,
          0.0240,  0.0678],
        [ 0.0201, -0.0339, -0.0081,  0.0263, -0.0137,  0.0421,  0.0132, -0.0238,
          0.0063,  0.0338],
        [ 0.0316, -0.0498, -0.0134, -0.0228, -0.0386,  0.0235,  0.0184, -0.0051,
          0.0069,  0.0663],
        [ 0.0459, -0.0410, -0.0340,  0.0152, -0.0261,  0.0106, -0.0131, -0.0257,
         -0.0175,  0.0725],
        [ 0.0168, -0.0545, -0.0368,  0.0115, -0.0336,  0.0123,  0.0006, -0.0544,
         -0.0121,  0.0392],
        [ 0.0511, -0.0244, -0.0268,  0.0225, -0.0433,  0.0171, -0.0108, -0.0183,
         -0.0120,  0.0647],
        [ 0.0187, -0.0182, -0.0229,  0.0184,  0.0037,  0.0273, -0.0214, -0.0423,
          0.0150,  0.0753],
        [ 0.0419, -0.0385, -0.0244,  0.0510, -0.0440,  0.0714,  0.0249, -0.0244,
          0.0277,  0.0620],
        [-0.0140, -

Now the training loop.

In [10]:
from torch import optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(clf.parameters())

In [11]:
criterion(clf.forward(X_ex.cuda()), y_ex.cuda())

tensor(2.3002, device='cuda:0', grad_fn=<NllLossBackward>)

In [12]:
import numpy as np

def train():
    NUM_EPOCHS = 10
    for epoch in range(1, NUM_EPOCHS + 1):
        losses = []

        for i, (X_batch, y_cls) in enumerate(dataloader):
            optimizer.zero_grad()

            y = y_cls.cuda()
            X_batch = X_batch.cuda()

            y_pred = clf(X_batch)
            loss = criterion(y_pred, y)
            loss.backward()
            optimizer.step()

            curr_loss = loss.item()
            if i % 200 == 0:
                print(
                    f'Finished epoch {epoch}/{NUM_EPOCHS}, batch {i}. Loss: {curr_loss:.3f}.'
                )

            losses.append(curr_loss)

        print(
            f'Finished epoch {epoch}. '
            f'avg loss: {np.mean(losses)}; median loss: {np.min(losses)}'
        )

In [13]:
train()

Finished epoch 1/10, batch 0. Loss: 2.296.
Finished epoch 1/10, batch 200. Loss: 1.928.
Finished epoch 1/10, batch 400. Loss: 1.823.
Finished epoch 1/10, batch 600. Loss: 1.735.
Finished epoch 1/10, batch 800. Loss: 1.578.
Finished epoch 1/10, batch 1000. Loss: 1.532.
Finished epoch 1/10, batch 1200. Loss: 1.561.
Finished epoch 1/10, batch 1400. Loss: 1.461.
Finished epoch 1. avg loss: 1.7841214290888585; median loss: 1.163797378540039
Finished epoch 2/10, batch 0. Loss: 1.663.
Finished epoch 2/10, batch 200. Loss: 1.401.
Finished epoch 2/10, batch 400. Loss: 1.728.
Finished epoch 2/10, batch 600. Loss: 1.167.
Finished epoch 2/10, batch 800. Loss: 1.632.
Finished epoch 2/10, batch 1000. Loss: 1.687.
Finished epoch 2/10, batch 1200. Loss: 1.801.
Finished epoch 2/10, batch 1400. Loss: 1.213.
Finished epoch 2. avg loss: 1.4924049427397954; median loss: 0.9426190853118896
Finished epoch 3/10, batch 0. Loss: 1.706.
Finished epoch 3/10, batch 200. Loss: 1.484.
Finished epoch 3/10, batch 400.

That completes our initial model definition. I added checkpointing and metrics tracking (via `send_metric`) to the following training script:

In [14]:
%%writefile ../models/train_basic.py
import torchvision
from torch.utils.data import DataLoader
import torch
from torch import nn
from torch import optim
import numpy as np
from spell.metrics import send_metric

import os
CWD = os.environ["PWD"]
if not os.path.exists(f"{CWD}/checkpoints/"):
    os.mkdir(f"{CWD}/checkpoints/")
IS_GPU = torch.cuda.is_available()

transform_train = torchvision.transforms.Compose([
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
train_dataset = torchvision.datasets.CIFAR10("/mnt/cifar10/", train=True, transform=transform_train, download=True)
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=False)


class CIFAR10Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn_block_1 = nn.Sequential(*[
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Dropout(0.25)
        ])
        self.cnn_block_2 = nn.Sequential(*[
            nn.Conv2d(64, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Dropout(0.25)
        ])
        self.flatten = lambda inp: torch.flatten(inp, 1)
        self.head = nn.Sequential(*[
            nn.Linear(64 * 8 * 8, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, 10)
        ])

    def forward(self, X):
        X = self.cnn_block_1(X)
        X = self.cnn_block_2(X)
        X = self.flatten(X)
        X = self.head(X)
        return X


clf = CIFAR10Model()
if IS_GPU:
    clf.cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(clf.parameters())


def train():
    NUM_EPOCHS = 10
    for epoch in range(1, NUM_EPOCHS + 1):
        losses = []

        for i, (X_batch, y_cls) in enumerate(train_dataloader):
            optimizer.zero_grad()

            if IS_GPU:
                y = y_cls.cuda()
                X_batch = X_batch.cuda()

            y_pred = clf(X_batch)
            loss = criterion(y_pred, y)
            loss.backward()
            optimizer.step()

            curr_loss = loss.item()
            if i % 200 == 0:
                print(
                    f'Finished epoch {epoch}/{NUM_EPOCHS}, batch {i}. Loss: {curr_loss:.3f}.'
                )
                send_metric("loss", curr_loss)

            losses.append(curr_loss)

        print(
            f'Finished epoch {epoch}. '
            f'avg loss: {np.mean(losses)}; median loss: {np.median(losses)}'
        )

        torch.save(clf.state_dict(), f"{CWD}/checkpoints/epoch_{epoch}.pth")
    torch.save(clf.state_dict(), f"{CWD}/checkpoints/model_final.pth")


if __name__ == "__main__":
    train()


Overwriting ../models/train_basic.py


We are on a dirty commit because we created this notebook file and this model training script in a Spell workspace, and it does not yet exist in the backing `git` repository (for more on how Spell runs and workspaces interact with git refer to ["How runs interact with git"](https://spell.ml/docs/run_overview#how-runs-interact-with-git) in our docs). The good news is that this is easy to do using our built-in JupyterLab git extension ([`jupyterlab/jupyterlab-git`](https://github.com/jupyterlab/jupyterlab-git)) on the sidebar:

![](https://i.imgur.com/zRUN7vh.png)

In [4]:
# This is temporarily necessary.
# !spell login --identity aleksey@spell.ml --password jF4D@4#meRZF

[0m[0mHello, Aleksey Bilogur!
[0m[0m

In [281]:
!spell run --machine-type t4 \
    --github-url https://github.com/spellml/cnn-cifar10.git \
    python models/train_basic.py

[0m💫 Casting spell #90…
[0m✨ Stop viewing logs with ^C
[1m[36m⭐[0m Machine_Requested… Run created -- waiting for a t4 machine.[0mm^C

[0m✨ Your run is still running remotely.
[0m✨ Use 'spell kill 90' to terminate your run
[0m✨ Use 'spell logs 90' to view logs again
[0m[K[0m[?25h[0m[0m

## improved model training script

The following updated training script includes several additional bells and whistles typical of a Spell run model training script:

* Model checkpointing has been added. If a checkpoint file is present, the training job will resume from the latest checkpoint automatically.
* It uses the dataset on disk, if it already exists.
* The number of epochs, convolutional block dropout, output head dropout, convolutional filter count, and dense layer filter count are all configurable using command line arguments.
* Logs to Spell metrics.
* Logs to Tensorboard.
* Uses a validation set, and generates validation statistics.

In [83]:
%%writefile ../models/train.py
import torchvision
from torch.utils.data import DataLoader
import torch
from torch import nn
from torch import optim
from torch.utils.tensorboard import SummaryWriter

import numpy as np
import argparse
from spell.metrics import send_metric

import re
import os
CWD = os.environ["PWD"]
if not os.path.exists(f"{CWD}/checkpoints/"):
    os.mkdir(f"{CWD}/checkpoints/")
writer = SummaryWriter(f"{CWD}/tensorboard/")

parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, dest='epochs', default=20)
parser.add_argument('--batch_size', type=int, dest='batch_size', default=32)

parser.add_argument('--conv1_filters', type=int, dest='conv1_filters', default=32)
parser.add_argument('--conv2_filters', type=int, dest='conv2_filters', default=64)
parser.add_argument('--dense_layer', type=int, dest='dense_layer', default=512)

parser.add_argument('--conv1_dropout', type=float, dest='conv1_dropout', default=0.25)
parser.add_argument('--conv2_dropout', type=float, dest='conv2_dropout', default=0.25)
parser.add_argument('--dense_dropout', type=float, dest='dense_dropout', default=0.5)

parser.add_argument('--from_checkpoint', type=str, dest='from_checkpoint', default="")

args = parser.parse_args()

# Used for testing purposes.
# class Args:
#     def __init__(self):
#         self.epochs = 50
#         self.batch_size = 32
#         self.conv1_filters = 32
#         self.conv2_filters = 64
#         self.dense_layer = 512
#         self.conv1_dropout = 0.25
#         self.conv2_dropout = 0.25
#         self.dense_dropout = 0.5
#         self.from_checkpoint = False
# args = Args()

transform_train = torchvision.transforms.Compose([
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
transform_test = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])

download = not os.path.exists("/mnt/cifar10/")
if download:
    print("CIFAR10 dataset not on disk, downloading...")
else:
    print("CIFAR10 dataset is already on disk! Skipping download.")

train_dataset = torchvision.datasets.CIFAR10("/mnt/cifar10/", train=True, transform=transform_train, download=download)
train_dataloader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=False)
test_dataset = torchvision.datasets.CIFAR10("/mnt/cifar10/", train=False, transform=transform_test, download=download)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=args.batch_size, shuffle=False)


class CIFAR10Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn_block_1 = nn.Sequential(*[
            nn.Conv2d(3, args.conv1_filters, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(args.conv1_filters, args.conv2_filters, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Dropout(args.conv1_dropout)
        ])
        self.cnn_block_2 = nn.Sequential(*[
            nn.Conv2d(args.conv2_filters, args.conv2_filters, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(args.conv2_filters, args.conv2_filters, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Dropout(args.conv2_dropout)
        ])
        self.flatten = lambda inp: torch.flatten(inp, 1)
        self.head = nn.Sequential(*[
            nn.Linear(args.conv2_filters * 8 * 8, args.dense_layer),
            nn.ReLU(),
            nn.Dropout(args.dense_dropout),
            nn.Linear(args.dense_layer, 10)
        ])

    def forward(self, X):
        X = self.cnn_block_1(X)
        X = self.cnn_block_2(X)
        X = self.flatten(X)
        X = self.head(X)
        return X


clf = CIFAR10Model()

if args.from_checkpoint:
    if args.from_checkpoint == "latest":
        start_epoch = max([int(re.findall("[0-9]{1,2}", fp)[0]) for fp in os.listdir("/mnt/checkpoints/")])
    else:
        start_epoch = args.from_checkpoint
    clf.load_state_dict(torch.load(f"/mnt/checkpoints/epoch_{start_epoch}.pth"))
    print(f"Resuming training from epoch {start_epoch}...")
else:
    start_epoch = 1

clf.cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.RMSprop(clf.parameters(), lr=0.0001, weight_decay=1e-6)


def test(epoch, num_epochs):
    losses = []
    n_right, n_total = 0, 0
    clf.eval()

    for i, (X_batch, y_cls) in enumerate(test_dataloader):
        with torch.no_grad():
            y = y_cls.cuda()
            X_batch = X_batch.cuda()

            y_pred = clf(X_batch)
            loss = criterion(y_pred, y)
            losses.append(loss.item())
            _, y_pred_cls = y_pred.max(1)
            n_right, n_total = n_right + (y_pred_cls == y_cls.cuda()).sum().item(), n_total + len(X_batch)

    val_acc = n_right / n_total
    val_loss = np.mean(losses)
    send_metric("val_loss", val_loss)
    send_metric("val_acc", val_acc)
    writer.add_scalar("val_loss", val_loss, (len(train_dataloader) // 200 + 1) * epoch + (i // 200))
    writer.add_scalar("val_acc", val_acc, (len(train_dataloader) // 200 + 1) * epoch + (i // 200))
    print(
        f'Finished epoch {epoch}/{num_epochs} avg val loss: {val_loss:.3f}; median val loss: {np.median(losses):.3f}; '
        f'val acc: {val_acc:.3f}.'
    )


def train():
    clf.train()
    NUM_EPOCHS = args.epochs

    for epoch in range(start_epoch, NUM_EPOCHS + 1):
        losses = []

        for i, (X_batch, y_cls) in enumerate(train_dataloader):
            optimizer.zero_grad()

            y = y_cls.cuda()
            X_batch = X_batch.cuda()

            y_pred = clf(X_batch)
            loss = criterion(y_pred, y)
            loss.backward()
            optimizer.step()

            train_loss = loss.item()
            if i % 200 == 0:
                print(
                    f'Finished epoch {epoch}/{NUM_EPOCHS}, batch {i}. loss: {train_loss:.3f}.'
                )
                send_metric("train_loss", train_loss)
                writer.add_scalar("train_loss", train_loss, (len(train_dataloader) // 200 + 1) * epoch + (i // 200))
            losses.append(train_loss)

        print(
            f'Finished epoch {epoch}. '
            f'avg loss: {np.mean(losses)}; median loss: {np.median(losses)}'
        )
        test(epoch, NUM_EPOCHS)
        if epoch % 5 == 0:
            torch.save(clf.state_dict(), f"{CWD}/checkpoints/epoch_{epoch}.pth")

    torch.save(clf.state_dict(), f"{CWD}/checkpoints/model_final.pth")


if __name__ == "__main__":
    train()


Overwriting ../models/train.py


Some test runs:

In [73]:
!spell run --machine-type t4 \
    --github-url https://github.com/spellml/cnn-cifar10.git \
    --tensorboard-dir /spell/tensorboard/ -- \
    python models/train.py

[0m💫 Casting spell #94…
[0m✨ Stop viewing logs with ^C
[0m^C

[0m✨ Your run is still running remotely.
[0m✨ Use 'spell kill 94' to terminate your run
[0m✨ Use 'spell logs 94' to view logs again
[0m[0m

In [38]:
!spell run --machine-type t4 \
    --github-url https://github.com/spellml/cnn-cifar10.git \
    --tensorboard-dir /spell/tensorboard/ -- \
    python models/train.py --batch_size 64 --dense_dropout 0.25

[0m💫 Casting spell #93…
[0m✨ Stop viewing logs with ^C
[0m[K[0m[?25h[0m✨ Machine_Requested… done
[1m[36m⭐[0m Building… Machine acquired -- commencing run[0mm^C

[0m✨ Your run is still running remotely.
[0m✨ Use 'spell kill 93' to terminate your run
[0m✨ Use 'spell logs 93' to view logs again
[0m[K[0m[?25h[0m[0m

In [85]:
!spell run --machine-type t4 \
    --github-url https://github.com/spellml/cnn-cifar10.git \
    --tensorboard-dir /spell/tensorboard/ \
    --mount runs/94/checkpoints/:/mnt/checkpoints/ \
    --mount uploads/cifar10/:/mnt/cifar10/ -- \
    python models/train.py --from_checkpoint latest

[0m💫 Casting spell #100…
[0m✨ Stop viewing logs with ^C
[0m[K[0m[?25h[0m✨ Building… doneuired -- commencing run[0m
[1m[36m🌟[0m Machine_Requested… Run created -- waiting for a t4 machine.[0m^C

[0m✨ Your run is still running remotely.
[0m✨ Use 'spell kill 100' to terminate your run
[0m✨ Use 'spell logs 100' to view logs again
[0m[K[0m[?25h[0m[0m

In [None]:
!spell hyper grid \
    --machine-type t4 \
    --param batch_size=16,32,64 \
    --param conv2_filters=32,64 \
    --github-url https://github.com/spellml/cnn-cifar10.git \
    --tensorboard-dir /spell/tensorboard/ \
    --mount uploads/cifar10/:/mnt/cifar10/ -- \
    python models/train.py \
        --epochs 20 \
        --batch_size :batch_size: \
        --conv2_filters :conv2_filters: