# Track an experiment while training a PyTorch model locally or in your notebook


Amazon SageMaker Studio is a web-based, integrated development environment (IDE) for machine learning that lets you build, train, debug, deploy, and monitor your machine learning models. SageMaker Studio provides all the tools you need to take your models from data preparation to experimentation to production while boosting your productivity. In a single unified visual interface, customers can perform the following tasks:

* Write and execute code in Jupyter notebooks

* Prepare data for machine learning

* Build and train machine learning models

* Deploy the models and monitor the performance of their predictions

* Track and debug the machine learning experiments


In this example, we will train a simple PyTorch MNIST classifier using a Studio notebook. To execute this notebook, you should select the `PyTorch 1.12 Python 3.8 CPU Optimized` image and an `ml.c5.large` instance.

## Setup

We will use TensorBoard to inspect our training logs. Install it using the cell below.

In [None]:
! pip install tensorboard==2.13.0

In [None]:
from torchvision import datasets, transforms
import torch
from torch.utils.tensorboard import SummaryWriter
from matplotlib import pyplot as plt
import os


## Download the dataset
Let's now use the torchvision library to download the MNIST dataset and apply a transformation on each image.

In [None]:
# download the dataset
# this downloads the data to the ./mnist_data folder and sets up the
# transform (ToTensor + Normalize) that is applied when samples are loaded
datasets.MNIST.urls = [
    "https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-images-idx3-ubyte.gz",
    "https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-labels-idx1-ubyte.gz",
    "https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz",
    "https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz",
]

train_set = datasets.MNIST(
    "mnist_data",
    train=True,
    transform=transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    ),
    download=True,
)

test_set = datasets.MNIST(
    "mnist_data",
    train=False,
    transform=transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    ),
    download=True,
)
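
To make the transform concrete, here is a minimal sketch of the arithmetic that `ToTensor` followed by `Normalize((0.1307,), (0.3081,))` applies to a single pixel. It uses plain Python instead of torchvision, and the helper name `normalize_pixel` is hypothetical, introduced only for illustration.

```python
# Mean and standard deviation passed to transforms.Normalize in the cell above.
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081


def normalize_pixel(raw_value, mean=MNIST_MEAN, std=MNIST_STD):
    """Map a raw 0-255 pixel value to the value the model receives."""
    scaled = raw_value / 255.0    # ToTensor rescales uint8 pixels to [0, 1]
    return (scaled - mean) / std  # Normalize subtracts the mean, divides by std


# Background pixels (0) become negative; the brightest pixels (255) end up near 2.8.
for raw in (0, 128, 255):
    print(raw, round(normalize_pixel(raw), 4))
```

Note that the transform is applied when samples are read through the dataset (for example `train_set[2]`), while `train_set.data` still holds the raw, unnormalized pixel values.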

View an example image from the dataset.

In [None]:
plt.imshow(train_set.data[2].numpy())

## Train a MNIST classifier

We will define a simple CNN architecture and train the classifier.


Define your CNN architecture and training function. The training code below calls `writer.add_scalar` with the epoch as the step, so the loss and accuracy of your model are logged for each epoch and can be plotted in TensorBoard.

In [None]:
# Based on https://github.com/pytorch/examples/blob/master/mnist/main.py
class Net(torch.nn.Module):
    def __init__(self, hidden_channels, kernel_size, drop_out):
        super(Net, self).__init__()
        self.conv1 = torch.nn.Conv2d(1, hidden_channels, kernel_size=kernel_size)
        self.conv2 = torch.nn.Conv2d(hidden_channels, 20, kernel_size=kernel_size)
        self.conv2_drop = torch.nn.Dropout2d(p=drop_out)
        self.fc1 = torch.nn.Linear(320, 50)
        self.fc2 = torch.nn.Linear(50, 10)

    def forward(self, x):
        x = torch.nn.functional.relu(torch.nn.functional.max_pool2d(self.conv1(x), 2))
        x = torch.nn.functional.relu(
            torch.nn.functional.max_pool2d(self.conv2_drop(self.conv2(x)), 2)
        )
        x = x.view(-1, 320)
        x = torch.nn.functional.relu(self.fc1(x))
        x = torch.nn.functional.dropout(x, training=self.training)
        x = self.fc2(x)
        return torch.nn.functional.log_softmax(x, dim=1)
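
The hard-coded `320` in `fc1` and in `x.view(-1, 320)` follows from the layer shapes. The sketch below reproduces that arithmetic in plain Python, assuming 28x28 MNIST inputs, unpadded stride-1 convolutions, and 2x2 max pooling as in `Net`. The helper names are hypothetical and are not part of the notebook.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


def pool_out(size, kernel_size):
    """Spatial output size of max pooling whose stride equals its kernel size."""
    return size // kernel_size


def flattened_features(image_size=28, kernel_size=5, out_channels=20):
    """Number of features entering fc1 after two conv + pool stages."""
    size = pool_out(conv_out(image_size, kernel_size), 2)  # 28 -> 24 -> 12
    size = pool_out(conv_out(size, kernel_size), 2)        # 12 -> 8 -> 4
    return out_channels * size * size                      # 20 * 4 * 4


print(flattened_features())               # 320
print(flattened_features(kernel_size=3))  # 500
```

Because `conv2` always produces 20 channels, the value 320 does not depend on `hidden_channels`, but it does depend on `kernel_size`: with a kernel size other than 5, `fc1` and the `view` call would need a different feature count.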

Create a TensorBoard `SummaryWriter`.

In [None]:
writer = SummaryWriter('runs/mnist_experiment_1')

Define the functions for model training.

In [None]:
def log_performance(model, data_loader, device, epoch, metric_type="Test"):
    model.eval()
    loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in data_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss += torch.nn.functional.nll_loss(
                output, target, reduction="sum"
            ).item()  # sum up batch loss
            # get the index of the max log-probability
            pred = output.max(1, keepdim=True)[1]
            correct += pred.eq(target.view_as(pred)).sum().item()
    loss /= len(data_loader.dataset)
    accuracy = 100.0 * correct / len(data_loader.dataset)
    # log metrics

    writer.add_scalar(f"{metric_type} average loss", loss, epoch)
    writer.add_scalar(f"{metric_type} accuracy", accuracy, epoch)

    print(
        "{} Average loss: {:.4f}, {} Accuracy: {:.4f}%;\n".format(
            metric_type, loss, metric_type, accuracy
        )
    )


def train_model(
    train_set, test_set, optimizer="sgd", epochs=10, hidden_channels=10
):
    """
    Function that trains the CNN classifier to identify the MNIST digits.
    Args:
        train_set (torchvision.datasets.mnist.MNIST): train dataset
        test_set (torchvision.datasets.mnist.MNIST): test dataset
        optimizer (str): the optimization algorithm to use for training your CNN
                         available options are sgd and adam
        epochs (int): number of complete passes of the training dataset through the algorithm
        hidden_channels (int): number of hidden channels in your model
    """

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # set the seed for generating random numbers
    torch.manual_seed(42)

    train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_set, batch_size=1000, shuffle=True)
    print(
        "Processes {}/{} ({:.0f}%) of train data".format(
            len(train_loader.sampler),
            len(train_loader.dataset),
            100.0 * len(train_loader.sampler) / len(train_loader.dataset),
        )
    )

    print(
        "Processes {}/{} ({:.0f}%) of test data".format(
            len(test_loader.sampler),
            len(test_loader.dataset),
            100.0 * len(test_loader.sampler) / len(test_loader.dataset),
        )
    )
    model = Net(hidden_channels, kernel_size=5, drop_out=0.5).to(device)
    model = torch.nn.DataParallel(model)
    momentum = 0.5
    lr = 0.01
    log_interval = 100
    if optimizer == "sgd":
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    else:
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(1, epochs + 1):
        print("Training Epoch:", epoch)
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader, 1):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = torch.nn.functional.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % log_interval == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)], Train Loss: {:.6f};".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.sampler),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )
        log_performance(model, train_loader, device, epoch, "Train")
        log_performance(model, test_loader, device, epoch, "Test")
    return model


def save_model(model, model_dir):
    print("Saving the model.")
    path = os.path.join(model_dir, "model.pth")
    # recommended way from http://pytorch.org/docs/master/notes/serialization.html
    torch.save(model.cpu().state_dict(), path)
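
`log_performance` sums the per-sample losses with `reduction="sum"` and divides by the dataset size once at the end, rather than averaging per-batch means. The plain-Python sketch below illustrates why that matters when the last batch is smaller than the others. The toy loss values and helper names are made up for illustration.

```python
def batch_sizes(num_samples, batch_size):
    """Batch sizes a loader yields when the last partial batch is kept."""
    full, remainder = divmod(num_samples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


def chunks(values, batch_size):
    """Split a list into consecutive batches."""
    return [values[i:i + batch_size] for i in range(0, len(values), batch_size)]


def mean_of_batch_means(losses, batch_size):
    """Average the per-batch mean losses (over-weights a small last batch)."""
    means = [sum(b) / len(b) for b in chunks(losses, batch_size)]
    return sum(means) / len(means)


def sum_then_divide(losses, batch_size):
    """Sum losses batch by batch, then divide once by the sample count."""
    total = sum(sum(b) for b in chunks(losses, batch_size))
    return total / len(losses)


# With 60,000 training images and batch_size=64 the last batch is half full.
sizes = batch_sizes(60000, 64)
print(len(sizes), sizes[-1])  # 938 32

# Toy per-sample losses: one full batch of four plus a single leftover sample.
toy_losses = [1.0, 1.0, 1.0, 1.0, 5.0]
print(mean_of_batch_means(toy_losses, 4))  # 3.0
print(sum_then_divide(toy_losses, 4))      # 1.8
```

The second number is the true per-sample average, which is what the `loss /= len(data_loader.dataset)` line computes.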

Before we train our model, we will set up and launch TensorBoard in SageMaker Studio. Follow these instructions:

1. Click the Amazon SageMaker Studio button in the top-left corner of Studio to open the Amazon SageMaker Studio Launcher. This launcher must be opened from your root directory.

2. In the Launcher, under Utilities and files, click System terminal.

3. From the terminal, run the following commands. You must run this from the `/home/sagemaker-user` root directory.

    ```bash
    pip install tensorboard
    tensorboard --logdir laughing-potato/runs/mnist_experiment_1
    ```

To open TensorBoard in your browser, follow these instructions:

1. Copy your Studio URL and replace `lab?` with `proxy/6006/` as follows. You must include the trailing `/` character.

    `https://<YOUR_URL>.studio.region.sagemaker.aws/jupyter/default/proxy/6006/`

2. Navigate to the URL to examine your results.
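
The URL rewrite in step 1 can also be scripted. The helper below is a minimal sketch: the function name `tensorboard_proxy_url` and the example Studio URL are hypothetical, and it assumes your Studio URL contains `lab?` and that anything after it can be dropped.

```python
def tensorboard_proxy_url(studio_url, port=6006):
    """Replace the `lab?` part of a Studio URL with `proxy/<port>/`."""
    marker = "lab?"
    if marker not in studio_url:
        raise ValueError("Expected a Studio URL containing 'lab?'")
    # Keep everything before `lab?` and discard whatever follows it.
    base = studio_url.split(marker, 1)[0]
    # The trailing slash is required for the proxy to resolve.
    return f"{base}proxy/{port}/"


# Hypothetical Studio URL, used only to show the transformation.
example_url = "https://d-example.studio.us-east-1.sagemaker.aws/jupyter/default/lab?"
print(tensorboard_proxy_url(example_url))
```

The default port `6006` matches the `proxy/6006/` path in step 1; pass a different `port` only if you started TensorBoard on another port.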


In [None]:
%%time

train_model(
    train_set=train_set,
    test_set=test_set,
    epochs=3,
    hidden_channels=2,
    optimizer="adam",
)

Close the `SummaryWriter`.

In [None]:
writer.close()