# Exercise 1: Classification of handwritten digits using an MLP

In this exercise, you will train a multi-layer perceptron (MLP) to classify handwritten digits from the MNIST dataset. The MNIST dataset consists of 28x28 grayscale images of handwritten digits (0 to 9). The task is to classify each image into one of the 10 classes (one for each digit).

The steps you will follow are:
* Set up the environment
* Load the dataset
* Define the model
* Train the model
* Evaluate the model

## Set up the environment

In [None]:
# Set up the environment for this notebook

# Install dependencies for this notebook
! pip install -q torch matplotlib torchvision

# Import PyTorch and other required Python libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

import numpy as np
import matplotlib.pyplot as plt

# Set up the device for model training
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU:", torch.cuda.get_device_name(0))
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS")
else:
    device = torch.device("cpu")
    print("Using CPU")

## Load the dataset

We obtain the MNIST data using the torchvision.datasets module. The first time we run this code, the data will be downloaded and stored in a local directory (`./data`). After that, the data will be directly loaded from the local directory.

In [None]:
# Download the training and test datasets and create data loaders

batch_size = 128

train_dataset = datasets.MNIST(
    "./data", train=True, download=True, transform=transforms.ToTensor()
)

# split the training dataset into training and validation datasets
train_dataset, validation_dataset = torch.utils.data.random_split(
    train_dataset, [40000, 20000]
)

test_dataset = datasets.MNIST(
    "./data", train=False, transform=transforms.ToTensor()
)

train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset, batch_size=batch_size, shuffle=True
)

validation_loader = torch.utils.data.DataLoader(
    dataset=validation_dataset, batch_size=batch_size, shuffle=False
)

test_loader = torch.utils.data.DataLoader(
    dataset=test_dataset, batch_size=batch_size, shuffle=False
)

The `transforms.ToTensor()` transform converts each image to a PyTorch tensor. The data loader then groups the samples into so-called "mini-batches" of the batch size we specified above and supplies them to the model during training.

Let's take a closer look at one of these mini-batches. Each mini-batch is a list containing two tensors: the first tensor contains the images in the mini-batch, and the second tensor contains the corresponding labels. Since it is cumbersome to say "mini-batch" all the time, we will sometimes refer to a mini-batch simply as a "batch".
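
To make the batching concrete, here is a small framework-free sketch that splits sample indices into mini-batches. It does not use PyTorch: `make_batches` is a hypothetical helper written only for illustration, and the numbers assume the 40,000-sample training split and `batch_size = 128` used above.

```python
import math


def make_batches(num_samples, batch_size):
    """Split sample indices 0..num_samples-1 into consecutive mini-batches."""
    indices = list(range(num_samples))
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]


batches = make_batches(num_samples=40000, batch_size=128)

# The last batch is smaller when batch_size does not divide num_samples evenly.
print("number of batches:", len(batches))          # 313
print("size of first batch:", len(batches[0]))     # 128
print("size of last batch:", len(batches[-1]))     # 64
assert len(batches) == math.ceil(40000 / 128)
```

With `shuffle=True`, the data loader additionally reshuffles the sample order at every epoch before batching; the sketch leaves that step out.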

In [None]:
# We iterate over the training loader to see that we get the following:
# - X_train: a tensor of size (batch_size, 1, 28, 28) with a batch of images
# - y_train: a tensor of size (batch_size) with the corresponding labels

for X_train, y_train in train_loader:
    print("X_train:", X_train.size(), "type:", X_train.type())
    print("y_train:", y_train.size(), "type:", y_train.type())

    # we just want to show the dimensions of the first batch
    break

Let's take a look at the first 10 images and labels from this batch.

In [None]:
pltsize = 1
plt.figure(figsize=(10 * pltsize, pltsize))

for i in range(10):
    plt.subplot(1, 10, i + 1)
    plt.axis("off")
    plt.imshow(X_train[i, :, :, :].numpy().reshape(28, 28), cmap="gray_r")
    plt.title(f"Class: {y_train[i].item()}")

## Define the model

Here we define our Multi-Layer Perceptron (MLP) network.  Recall that the MLP is composed of a series of fully-connected layers.  In this case, we will define a network with two hidden layers.

In PyTorch, we define our MLP using a class. We have to write the `__init__()` and `forward()` methods, and PyTorch will automatically generate a `backward()` method for computing the gradients for the backward pass.

Read through this code line-by-line, and follow the comments explaining what each line does. If you don't get everything, that's ok. If there is a new term you do not understand, e.g. ReLU, please feel free to look it up on the internet.

We will also cover PyTorch in more depth in future exercises.

In [None]:
# Define the model as an MLP with two hidden layers of size 50 and ReLU activations.
# Read more about the different types of layers here: http://pytorch.org/docs/nn.html
# We also add dropout layers to prevent overfitting.
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        self.fc1 = nn.Linear(28 * 28, 50)  # 28*28 is the size of the input image
        self.fc1_drop = nn.Dropout(0.2)  # Dropout layers help prevent overfitting

        self.fc2 = nn.Linear(50, 50)  # 50 is the size of the hidden layer
        self.fc2_drop = nn.Dropout(0.2)  # Dropout layers help prevent overfitting

        self.fc3 = nn.Linear(50, 10)  # 10 is the size of the output layer

    def forward(self, x):
        x = x.view(-1, 28 * 28)  # Flatten the input image

        x = F.relu(self.fc1(x))  # Apply ReLU activation to the first layer
        x = self.fc1_drop(x)  # Apply dropout to the first layer

        x = F.relu(self.fc2(x))  # Apply ReLU activation to the second layer
        x = self.fc2_drop(x)  # Apply dropout to the second layer

        # Apply log softmax to the output layer to get log-probabilities
        # over the 10 classes of digits
        return F.log_softmax(self.fc3(x), dim=1)


# We instantiate the model and move it to the GPU if available.
model = Net().to(device)

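# We use Stochastic Gradient Descent (SGD) as our optimizer and Cross Entropy as our loss function.
# See http://pytorch.org/docs/optim.html#algorithms for more information about the different optimization
# algorithms PyTorch offers.
# Note: nn.CrossEntropyLoss applies log-softmax internally. Log-softmax leaves values that are already
# log-probabilities unchanged, so the log_softmax in forward() is redundant here but the loss is still correct.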

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
loss_fn = nn.CrossEntropyLoss()

print(model)

## Train the model

While training the model, we will monitor its performance on the validation set. Sometimes we use the performance on the validation set to make decisions about when to stop training the model (e.g. if the performance on the validation set starts to degrade, we may stop training early to avoid overfitting).
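
The early-stopping idea can be sketched without any framework. The helper below is hypothetical (this notebook does not implement early stopping), and the `patience` parameter and the loss values are made up for illustration: training stops once the validation loss has failed to improve on its best value for `patience` consecutive epochs.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: they improve for four epochs, then degrade.
example_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(example_losses, patience=2))  # stops at epoch 6
```

In practice one would usually also keep a copy of the model parameters from the best epoch (epoch 4 in this example) and restore them after stopping.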

Let's now define functions to `train()` and `validate()` the model. 

In [None]:
# Functions to train and validate the model

def train(training_loss_values, epoch, log_interval=100):
    model.train()  # Set model to training mode
    train_loss = 0
    # Loop over each batch from the training set
    for batch_idx, (data, target) in enumerate(train_loader):
        data = data.to(device)  # Copy data to GPU if needed
        target = target.to(device)  # Copy target to GPU if needed

        # Run an optimization step. See https://pytorch.org/docs/stable/optim.html#taking-an-optimization-step
        optimizer.zero_grad()  # Zero out gradients from previous step
        output = model(data)  # Pass data through the network
        loss = loss_fn(output, target)  # Calculate loss
        train_loss += loss.item()  # Accumulate loss for the epoch average
        loss.backward()  # Backpropagation
        optimizer.step()  # Update weights

        if batch_idx % log_interval == 0:
            print(
                "Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format(
                    epoch,
                    batch_idx * len(data),
                    len(train_loader.dataset),
                    100.0 * batch_idx / len(train_loader),
                    loss.item(),
                )
            )
    train_loss /= len(train_loader)  # Calculate average loss
    training_loss_values.append(train_loss)  # Add to loss vector


def validate(validation_loss_vector, accuracy_vector):
    model.eval()  # Set model to evaluation mode (disables dropout)
    val_loss, correct = 0.0, 0  # Running loss and number of correct classifications
    with torch.no_grad():  # Gradients are not needed for validation
        for data, target in validation_loader:
            data = data.to(device)  # Copy data to GPU if needed
            target = target.to(device)  # Copy target to GPU if needed
            output = model(data)  # Pass data through the network
            val_loss += loss_fn(output, target).item()  # Accumulate batch loss
            pred = output.argmax(dim=1)  # Index of the max log-probability
            correct += pred.eq(target).sum().item()  # Add number of correct classifications

    val_loss /= len(validation_loader)  # Calculate average loss
    validation_loss_vector.append(val_loss)  # Add to loss vector

    accuracy = 100.0 * correct / len(validation_loader.dataset)  # Accuracy in percent
    accuracy_vector.append(accuracy)  # Add to accuracy vector

    print(
        "\nValidation set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n".format(
            val_loss, correct, len(validation_loader.dataset), accuracy
        )
    )


We will now train our model for 10 epochs and afterwards plot the training and validation losses over the training epochs. An epoch is a single pass through the training data.


In [None]:
%%time
epochs = 10

training_loss_values, validation_loss_values, accuracy_values = [], [], []
for epoch in range(1, epochs + 1):
    train(training_loss_values, epoch)
    validate(validation_loss_values, accuracy_values)

We now plot the training and validation losses over the training epochs.

Loss is a measure of how far a model's predictions are from the correct labels, i.e. how bad the model is. The goal of training a model is to find a set of parameters that minimizes the loss. The loss is calculated using a loss function, which takes the model's prediction and the correct label as input and returns a number. The lower the loss, the better the model.
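
As an illustration of a loss function, the sketch below computes the cross-entropy loss for a single sample in plain Python: the raw class scores are turned into log-probabilities, and the loss is the negative log-probability of the correct class. The function names and the 3-class scores are made up for this example; it is a simplified stand-in for what a cross-entropy loss computes, not PyTorch's implementation.

```python
import math


def log_softmax(scores):
    """Convert raw scores into log-probabilities (numerically stable)."""
    max_score = max(scores)
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return [s - log_sum_exp for s in scores]


def cross_entropy(scores, label):
    """Negative log-probability assigned to the correct class."""
    return -log_softmax(scores)[label]


# Made-up scores for a 3-class problem; the correct class is index 0.
confident_and_right = [4.0, 0.0, 0.0]
unsure = [0.0, 0.0, 0.0]
confident_and_wrong = [0.0, 4.0, 0.0]

for name, scores in [
    ("confident and right", confident_and_right),
    ("unsure", unsure),
    ("confident and wrong", confident_and_wrong),
]:
    print(f"{name}: loss = {cross_entropy(scores, label=0):.4f}")
```

The loss is smallest when the model puts a high score on the correct class and largest when it is confidently wrong. For the "unsure" case, every class gets probability 1/3, so the loss is log(3). `nn.CrossEntropyLoss` averages this per-sample value over the mini-batch by default.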

The training process tries to find the set of parameters that minimizes the loss on the training data, i.e. the training loss. However, we also want to know how well the model performs on data it has never seen before, i.e. the validation data. If the model performs much worse on the validation data than on the training data, it is memorizing patterns in the training data rather than learning rules that generalize to new data. This is called overfitting.

In [None]:
# Plot training loss and validation loss vs. epochs

plt.figure(figsize=(5, 3))
plt.plot(
    np.arange(1, epochs + 1), training_loss_values, label="Training loss", color="red"
)
plt.plot(np.arange(1, epochs + 1), validation_loss_values, label="Validation loss", color="blue")
plt.legend()
plt.title("Loss with 20% Dropout")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.xticks(np.arange(1, epochs + 1))
plt.show()


We take note of the 20% dropout rate in the hidden layers. Dropout is a regularization technique that helps prevent overfitting. It works by randomly setting the outputs of neurons in a layer to zero during the forward pass of the training process. This makes the network less reliant on any individual neuron and discourages it from memorizing the training data. Dropout is only active in training mode (`model.train()`) and is switched off in evaluation mode (`model.eval()`). It therefore increases the loss on the training data but not on the validation data, which helps explain why the training loss is higher than the validation loss.
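
The following plain-Python sketch illustrates the mechanism on a list of activations. The `dropout` function here is a hypothetical stand-in written for this example. It follows the same "inverted dropout" scheme as `nn.Dropout`: values that survive are scaled by 1 / (1 - p) during training so that the expected activation stays the same, and nothing is changed in evaluation mode.

```python
import random


def dropout(values, p=0.2, training=True, rng=random):
    """Inverted dropout on a list of activations.

    In training mode each value is zeroed with probability p and the
    survivors are scaled by 1 / (1 - p), so the expected value is unchanged.
    In evaluation mode the input is returned as-is.
    """
    if not training or p == 0.0:
        return list(values)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
activations = [1.0] * 10000

train_out = dropout(activations, p=0.2, training=True, rng=rng)
eval_out = dropout(activations, p=0.2, training=False)

dropped_fraction = train_out.count(0.0) / len(train_out)
print(f"fraction zeroed in training mode: {dropped_fraction:.3f}")       # close to 0.2
print(f"mean in training mode: {sum(train_out) / len(train_out):.3f}")   # close to 1.0
print("evaluation mode leaves values unchanged:", eval_out == activations)
```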

## Evaluate the model

Now that we have trained our model, we want to evaluate its performance on the test set. We do this to get an estimate of how well the model will perform on data it has never seen before. Technically, we should only do this once, at the very end of training, when we are completely finished with all hyperparameter tuning. However, at the end of this exercise, we will ask you to change some hyperparameters and retrain the model.
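
A common way to summarize performance on a held-out set is accuracy, the percentage of samples whose highest-scoring class matches the true label. The sketch below shows the computation in plain Python with made-up scores for a 3-class problem; `argmax` and `accuracy` are illustrative helpers, and in the notebook the rows would be the model outputs for the test images.

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(outputs, labels):
    """Percentage of rows whose highest-scoring class matches the label."""
    correct = sum(argmax(row) == label for row, label in zip(outputs, labels))
    return 100.0 * correct / len(labels)


# Made-up per-class scores for four samples of a 3-class problem.
outputs = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
labels = [1, 0, 2, 0]

print(f"accuracy: {accuracy(outputs, labels):.0f}%")  # 3 of 4 correct: 75%
```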

In [None]:
# Show model predictions and the true labels for a few images from the test dataset

model.eval()  # Set the model to evaluation mode
data, label = next(iter(test_loader))  # Get one batch from test_loader (batch_size = 128 images)
data = data.to(device)  # Send data to device
with torch.no_grad():  # Gradients are not needed for inference
    output = model(data)  # Pass data through the network
pred = output.argmax(dim=1).cpu()  # Index of the max log-probability for each element in the batch

# Graph 10 images and their predicted labels (green for correct, red for incorrect)
plt.figure(figsize=(10, 10))
for i in range(10):
    plt.subplot(1, 10, i + 1)
    plt.axis("off")
    plt.imshow(data[i, :, :, :].cpu().numpy().reshape(28, 28), cmap="gray_r")
    plt.title(
        f"Pred: {pred[i].item()}\nLabel: {label[i].item()}", color=("green" if pred[i] == label[i] else "red")
    )

## Model tuning

Modify the MLP model. Try to improve the classification accuracy, or experiment with the effects of different parameters.

First try changing the learning rate `lr` for the optimizer and run the notebook again.

Did you increase or decrease the learning rate? What was the effect on the training and validation losses? Try the opposite and see what happens.
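
To build intuition for what to expect, the toy sketch below minimizes the one-dimensional function f(w) = w² using the same update form as SGD with momentum (velocity = momentum × velocity + gradient, then w = w − lr × velocity, i.e. zero dampening and no Nesterov momentum). The function `sgd_momentum` and the three learning rates are chosen only for illustration; the MLP's loss surface is far more complex, so the thresholds will differ.

```python
def sgd_momentum(lr, momentum=0.5, steps=20, start=1.0):
    """Minimize f(w) = w**2 with SGD plus momentum; return the visited losses."""
    w, velocity = start, 0.0
    losses = []
    for _ in range(steps):
        grad = 2.0 * w                           # derivative of w**2
        velocity = momentum * velocity + grad    # accumulate past gradients
        w = w - lr * velocity                    # parameter update
        losses.append(w * w)
    return losses


for lr in (0.001, 0.1, 2.0):
    final_loss = sgd_momentum(lr)[-1]
    print(f"lr={lr}: loss after 20 steps = {final_loss:.3g}")
```

In this toy problem the smallest learning rate barely moves the loss within 20 steps, the middle one converges quickly, and the largest one overshoots so badly that the loss grows instead of shrinking.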

Next, try adding an additional hidden layer to the MLP model. How does this affect the training and validation losses?

Finally, try changing the number of hidden units in each hidden layer. How does this affect the training and validation losses?
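
When comparing architectures, it can help to know how many trainable parameters each one has. The sketch below counts the weights and biases of a fully connected network from its layer widths; `count_mlp_parameters` is a hypothetical helper, and the two alternative architectures are only examples of the kinds of changes suggested above. It assumes every linear layer has a bias term, which is the default for `nn.Linear`, and it ignores dropout layers, which have no parameters.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer from input to output;
    a layer mapping n_in units to n_out units has n_in * n_out weights
    plus n_out biases.
    """
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


print(count_mlp_parameters([784, 50, 50, 10]))       # model in this exercise: 42310
print(count_mlp_parameters([784, 50, 50, 50, 10]))   # one extra hidden layer: 44860
print(count_mlp_parameters([784, 100, 100, 10]))     # wider hidden layers: 89610
```

Most parameters sit in the first layer because of the 784-dimensional input, so widening the hidden layers grows the model much faster than adding another 50-unit layer.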

You can also consult the PyTorch documentation at http://pytorch.org/.