# Classification of handwritten digits

This notebook demonstrates the use of simple neural networks to classify handwritten digits.

It uses the well-known MNIST dataset, comprising 70,000 greyscale scans of handwritten digits, each 28x28 pixels. The dataset has been widely used, both as a benchmark and for educational purposes. See https://en.wikipedia.org/wiki/MNIST_database for more information.

We will train a machine learning model, in this case a neural network, for a classification task of the MNIST digits. That is, the input of the model is an image and the output is a label, i.e., the corresponding digit.

## Import packages

We'll use:
* time -- a standard Python library providing time-related functions
* matplotlib for showing images -- more info at https://matplotlib.org/
* PyTorch, a machine learning library -- more info at https://pytorch.org/
  * torch.nn for using neural networks
  * torch.utils.data for loading datasets
  * torchvision.datasets for access to the MNIST dataset
  * torchvision.transforms for data transformation among images and tensors

In [None]:
import time
import matplotlib.pyplot as plt
import torch
from torch import nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets
from torchvision.transforms import ToTensor

Determine whether the machine has a hardware accelerator; otherwise, use the CPU. Typical values are "cuda" for NVIDIA GPUs and "mps" for Apple Silicon.

In [None]:
device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"Using {device} device")
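The `torch.accelerator` module is only present in recent PyTorch releases. As a hedged alternative for older installations, the sketch below checks the CUDA and MPS backends directly. The helper name `pick_device` and the variable `device_name` are illustrative assumptions, not part of the notebook.

```python
import torch

def pick_device() -> str:
    """Hypothetical helper: choose a device name without torch.accelerator."""
    if torch.cuda.is_available():              # NVIDIA GPUs
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in very old versions
    if mps is not None and mps.is_available():  # Apple Silicon
        return "mps"
    return "cpu"

device_name = pick_device()
print(f"Using {device_name} device")
```

This sketch only covers the two accelerators named above; other backends would need their own checks.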

## Dataset loading

Next we load the MNIST dataset and transform it into a sequence of tensors. Tensors are the basic data type employed by PyTorch.

The first time this cell runs, it downloads the dataset from the internet; on later runs it reuses the previously downloaded copy.

Then we split the dataset into three subsets:
* __train set__ -- used for training the model (70% in this example)
* __test set__ -- used for evaluating the model on instances unseen during training (15%)
* __validation set__ -- used to detect overfitting during training (15%; here it is only evaluated after each epoch and does not influence training)

In [None]:
# Load the dataset (with the default train=True, this is the 60,000-image training partition)
full_dataset = datasets.MNIST(
    root="data",
    download=True,
    transform=ToTensor(),
)

# Get the dataset size in number of instances
dataset_size = len(full_dataset)

# Split the dataset
train_dataset, validation_dataset, test_dataset = random_split(
    full_dataset, [int(0.7*dataset_size),
                   int(0.15*dataset_size),
                   int(0.15*dataset_size)]
)
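Because `int()` truncates, the three lengths above only add up to the dataset size when each product happens to be a whole number, and `random_split` requires integer lengths to sum exactly to the dataset length. The sketch below shows one way to make the lengths always add up; `split_lengths` is a hypothetical helper, and assigning the leftover instances to the first subset is an arbitrary choice. Recent PyTorch versions also accept fractions directly, as in `random_split(full_dataset, [0.7, 0.15, 0.15])`.

```python
def split_lengths(total, fractions):
    """Hypothetical helper: integer subset lengths that always sum to `total`."""
    lengths = [int(total * f) for f in fractions]
    # Give any instances lost to truncation to the first subset.
    lengths[0] += total - sum(lengths)
    return lengths

even_split = split_lengths(60000, [0.7, 0.15, 0.15])
odd_split = split_lengths(59999, [0.7, 0.15, 0.15])
print(even_split, sum(even_split))
print(odd_split, sum(odd_split))
```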

During training (and also during testing), data is split into __batches__. During training, the model weights are updated once per batch.

Next we define the batch size and the dataloaders necessary for what follows. Dataloaders divide data into batches. Small batches introduce more noise during learning, which may help escape from local minima but might slow down convergence. Large batches typically speed up training but may lead to overfitting more easily.

The cell also shows the dimensions of one batch from the test dataset, both the input (X) and the output (y). The input dimensions are: batch size, number of channels (1), and height and width of each image (28 by 28).

In [None]:
# Define the batch size
batch_size = 64

# Create data loaders.
train_dataloader      = DataLoader(train_dataset, batch_size=batch_size)
validation_dataloader = DataLoader(validation_dataset, batch_size=batch_size)
test_dataloader       = DataLoader(test_dataset, batch_size=batch_size)

# Show the shape of the first batch from the test dataset
for X, y in test_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break
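To make the batching concrete, here is a small arithmetic sketch of how many batches one epoch contains. It assumes a train set of 42,000 instances (70% of 60,000) and a loader that keeps the final, smaller batch, which is the `DataLoader` default. The function `batch_sizes` is illustrative only.

```python
import math

def batch_sizes(num_instances, batch_size):
    """Sizes of the batches produced when the last partial batch is kept."""
    full, rest = divmod(num_instances, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

sizes = batch_sizes(42000, 64)
print(len(sizes), sizes[-1])  # 657 16
```

With these numbers, each epoch performs 657 weight updates, the last one on only 16 instances.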

Here we show some examples from the three sets, one row per set: train, validation, and test. Above each image is its label in the dataset (the ground truth).

In [None]:
(fig,ax) = plt.subplots(3, 10)
for i,ds in enumerate([train_dataset, validation_dataset, test_dataset]):
    for j in range(10):
        ax[i,j].imshow(ds[j][0][0], cmap="gray")
        ax[i,j].set_axis_off()
        ax[i,j].set_title(ds[j][1])
plt.axis('off')
plt.show()

## Create a neural network model

The next cells define several neural network models. All networks take a 28x28 image (28*28 = 784 values) as input and have 10 outputs, one per digit. The predicted label is the output with the greatest value.

The first one, __ClassicMNIST__, is a simple fully connected model. It offers several alternatives, from a purely linear network to one with 2 hidden layers. The activation function for the hidden layers is the ReLU (rectified linear unit), whose output equals the input if it is non-negative and zero otherwise.
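As a small plain-Python illustration of the two rules just described (not how PyTorch implements them), the sketch below applies ReLU to a list of values and picks the label with the greatest output. The function names and sample numbers are made up for the example.

```python
def relu(values):
    """ReLU: keep non-negative inputs, replace negative ones with zero."""
    return [max(0.0, v) for v in values]

def predicted_label(outputs):
    """Index of the greatest output, i.e. the predicted class."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

activated = relu([-1.5, 0.0, 2.0])
label = predicted_label([0.1, 2.3, -0.7])
print(activated)  # [0.0, 0.0, 2.0]
print(label)      # 1
```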

In [None]:
class ClassicMNIST(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()

        # linear network
        self.linear = nn.Sequential(
            nn.Linear(28*28, 10)
        )

        # 1 hidden layer
        self.hidden1 = nn.Sequential(
            nn.Linear(28*28, 20),
            nn.ReLU(),
            nn.Linear(20, 10)
        )

        # 2 hidden layers
        self.hidden2 = nn.Sequential(
            nn.Linear(28*28, 40),
            nn.ReLU(),
            nn.Linear(40, 20),
            nn.ReLU(),
            nn.Linear(20, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        # Edit the next line to define which of the above neural networks to use
        logits = self.hidden2(x)
        return logits
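To compare the size of the three alternatives, the following sketch counts the trainable parameters of each one. It relies on the fact that a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The helper names are illustrative.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def count_params(layer_sizes):
    """Total parameters of a stack of fully connected layers."""
    return sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

linear_total = count_params([784, 10])
hidden1_total = count_params([784, 20, 10])
hidden2_total = count_params([784, 40, 20, 10])
print(linear_total, hidden1_total, hidden2_total)  # 7850 15910 32430
```

ReLU layers add no parameters, so they do not appear in the count.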

__SmallMNISTCNN__ is a model comprising: (1) two convolutional layers, each one with batch normalization, a ReLU activation function, and a max pooling layer, and (2) two fully connected layers, with a ReLU and a dropout in between them.

A CNN (convolutional neural network) is an architecture in which weights are shared among several receptive fields, loosely mimicking the animal visual cortex.

Batch normalization layers help stabilize training by normalizing their inputs to zero mean and unit variance over each batch.

Max pooling layers aggregate data by taking the maximum value among groups of inputs (receptive fields).

Dropout layers randomly deactivate some neurons during training, forcing what is learned to be more distributed across the network, which helps prevent overfitting.
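The three layer types described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not PyTorch's implementation: pooling assumes even image sizes, batch normalization omits the learnable scale and shift and the running statistics that `nn.BatchNorm2d` keeps, and dropout rescales the surviving values by 1 / (1 - p) so their expected sum is unchanged. All names and sample values are made up.

```python
import math
import random

def max_pool_2x2(image):
    """Keep the maximum of each non-overlapping 2x2 block."""
    return [
        [max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
         for c in range(0, len(image[0]), 2)]
        for r in range(0, len(image), 2)
    ]

def batch_normalize(values, eps=1e-5):
    """Shift and scale one feature across a batch to zero mean, unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def dropout(values, p, rng):
    """Zero each value with probability p; rescale the survivors."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]

image = [[1, 2, 0, 1],
         [3, 4, 1, 0],
         [0, 0, 5, 6],
         [1, 0, 7, 8]]
pooled = max_pool_2x2(image)
normalized = batch_normalize([1.0, 2.0, 3.0, 4.0])
dropped = dropout([1.0] * 8, p=0.25, rng=random.Random(0))
print(pooled)  # [[4, 1], [1, 8]]
print(normalized)
print(dropped)
```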

In [None]:
class SmallMNISTCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2, 2))
        self.fc = nn.Sequential(
            nn.Linear(32*7*7, 128),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(128, 10))

    def forward(self, x):
        x = self.cnn(x)
        x = torch.flatten(x, start_dim=1)
        x = self.fc(x)
        return x

Next we define a __LeNet5__ model, based on a well-known CNN created by the pioneer Yann LeCun back in 1998. CNNs are a cornerstone of many modern state-of-the-art deep learning models for images and similar data.

In [None]:
class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(6),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 2, stride = 2),
            #
            nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 2, stride = 2)
        )
        self.fc = nn.Sequential(
            nn.Linear(256, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, 10)
        )

    def forward(self, x):
        x = self.cnn(x)
        x = torch.flatten(x, start_dim=1)
        x = self.fc(x)
        return x
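The input sizes of the first fully connected layers (`32*7*7` in SmallMNISTCNN and `256` in LeNet5) follow from how each convolution and pooling layer changes the image size. The sketch below traces that arithmetic with the standard output-size formula for a square input. The helper `conv_out` is an illustrative name, and the kernel, stride, and padding values are copied from the two class definitions above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# SmallMNISTCNN: 3x3 convolutions with padding 1, each followed by 2x2 pooling
small = 28
for _ in range(2):
    small = conv_out(small, kernel=3, padding=1)  # convolution keeps the size
    small = conv_out(small, kernel=2, stride=2)   # pooling halves it
small_cnn_features = 32 * small * small           # 32 channels of 7x7

# LeNet5 as defined above: 5x5 convolutions without padding, then 2x2 pooling
lenet = 28
for _ in range(2):
    lenet = conv_out(lenet, kernel=5)             # convolution shrinks by 4
    lenet = conv_out(lenet, kernel=2, stride=2)   # pooling halves it
lenet_features = 16 * lenet * lenet               # 16 channels of 4x4

print(small_cnn_features, lenet_features)  # 1568 256
```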

Next we instantiate the model and transfer it to the AI accelerator (if any). The structure of the model is printed.

We also define the loss function (the cross entropy loss) and the optimization algorithm (Adam).

The cross entropy loss, an information-theoretic concept, penalizes the model for assigning low probability to the correct label, and thus promotes accurate answers.
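For a single instance, the cross entropy can be written as the negative logarithm of the softmax probability that the network assigns to the correct class. The plain-Python sketch below computes it from raw outputs (logits); `nn.CrossEntropyLoss` also takes logits and, by default, averages this quantity over the batch. The function name and sample values are illustrative.

```python
import math

def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to the target class."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp()
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

logits = [0.5, 3.0, -1.0]
loss_when_right = cross_entropy(logits, target=1)   # class 1 has the largest output
loss_when_wrong = cross_entropy(logits, target=2)   # class 2 has the smallest output
loss_uniform = cross_entropy([0.0] * 10, target=4)  # equals log(10)
print(loss_when_right, loss_when_wrong, loss_uniform)
```

The loss is small when the correct class has the largest output and grows as the network favours other classes.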

Adam is a widely used optimization method. It uses the gradient of the loss function with respect to the weights, computed by backpropagation, and takes a step in the descent direction, with a momentum term and per-weight adaptive step sizes to make gradient descent faster.

In [None]:
## Uncomment the model you want to try out, leaving all others commented out

model = ClassicMNIST().to(device)
#model = SmallMNISTCNN().to(device)
#model = LeNet5().to(device)

print(model)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
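To give an idea of what the optimizer does at each step, here is a plain-Python sketch of the Adam update rule applied to a single weight. It is a simplified illustration, not the code of `torch.optim.Adam`; the toy objective (w - 3)^2, the learning rate of 0.1, and the function name are assumptions chosen for the example.

```python
import math

def adam_minimize(grad, w, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on one scalar weight, given a function returning the gradient."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g      # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g  # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias corrections for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Minimize (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimize(lambda w: 2 * (w - 3), w=0.0, steps=1000)
print(w_final)
```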

## Training the model

We start by defining the train and the test functions.

The train function iterates over all batches of the training set. For each batch, it computes the prediction error (loss), __backpropagates__ the error through the network, and adjusts the network parameters so as to decrease the loss. This is a form of (mini-batch) stochastic gradient descent, since each step uses a different batch from the previous one, which helps escape from local minima.

The function also prints the current loss at most once per second.

In [None]:
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    t = 0
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Prints loss progress every second
        if time.monotonic()-t > 1:
            t = time.monotonic()
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
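The call to `optimizer.zero_grad()` matters because PyTorch accumulates gradients across calls to `backward()` instead of overwriting them. The minimal sketch below shows this on a single scalar tensor. It resets the gradient directly with `w.grad.zero_()` rather than through an optimizer, and the variable names are illustrative.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
first = w.grad.item()        # gradient of 2*w with respect to w

(2 * w).backward()
accumulated = w.grad.item()  # the second gradient was added to the first

w.grad.zero_()               # reset before the next step
cleared = w.grad.item()

print(first, accumulated, cleared)  # 2.0 4.0 0.0
```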

The test function goes through all batches in the given dataset, accumulating the total loss and the number of correctly classified instances. It prints the accuracy and the average loss per batch.

In [None]:
def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f}")
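The line that counts correct predictions can be followed on a tiny hand-made batch. In this sketch, `pred` holds made-up outputs for three instances and three classes (rather than ten), and `y` holds their made-up true labels.

```python
import torch

pred = torch.tensor([[0.1, 2.0, -1.0],
                     [1.5, 0.2, 0.3],
                     [0.0, 0.1, 0.9]])
y = torch.tensor([1, 2, 2])

predicted = pred.argmax(1)  # index of the greatest output in each row
correct = (predicted == y).type(torch.float).sum().item()

print(predicted.tolist(), correct)  # [1, 0, 2] 2.0
```

Two of the three instances are classified correctly, so the accuracy on this batch would be 2/3.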

Here we actually perform the training of the model instantiated above.

The first line defines the number of epochs, that is, how many times training goes through all batches in the training set.

The "for" loop iterates over all epochs, printing the performance on both the validation and test sets after each one.

In [None]:
# Define here the number of epochs. Increase if loss descent does not stabilize in time.
epochs = 5

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    print("Validation: ", end="")
    test(validation_dataloader, model, loss_fn)
    print("Test: ", end="")
    test(test_dataloader, model, loss_fn)
print("Done!")
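The validation set was introduced as a guard against overfitting. One common way to use it, which this notebook does not do, is early stopping: training halts once the validation loss has not improved for a given number of epochs. The sketch below illustrates only the stopping rule. The helper `early_stopping_epoch`, the `patience` parameter, and the loss values are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Hypothetical helper: 1-based epoch at which training would stop."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0  # improvement: reset the counter
        else:
            waited += 1             # no improvement this epoch
            if waited >= patience:
                return epoch
    return len(val_losses)          # never triggered: train all epochs

# Made-up validation losses, one per epoch
stop_at = early_stopping_epoch([0.40, 0.31, 0.28, 0.29, 0.30, 0.27])
print(stop_at)  # 5
```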

Use this cell __only__ if you wish to save the trained model to a file (otherwise, just skip it).

In [None]:
torch.save(model.state_dict(), "model.pth")
print("Saved PyTorch Model State to model.pth")

Use this cell __only__ if you wish to load a model from a file (otherwise, just skip it).

In [None]:
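# Re-create the same architecture that was saved (ClassicMNIST here), then load its weights
model = ClassicMNIST().to(device)
model.load_state_dict(torch.load("model.pth", map_location=device, weights_only=True))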
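A state dict stores only the weights, so it can only be loaded into a model with the same architecture as the one that was saved. The sketch below checks a save and load round trip on a tiny stand-in model, using a temporary directory so that no file is left behind. The layer sizes and the file name `tiny.pth` are arbitrary choices for the example.

```python
import os
import tempfile

import torch
from torch import nn

original = nn.Linear(4, 2)
restored = nn.Linear(4, 2)  # same architecture, different random weights

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "tiny.pth")
    torch.save(original.state_dict(), path)
    state = torch.load(path, map_location="cpu", weights_only=True)
    restored.load_state_dict(state)

same = all(
    torch.equal(a, b)
    for a, b in zip(original.state_dict().values(), restored.state_dict().values())
)
print(same)  # True
```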

## Testing the model

The next cell shows classification of some of the instances from the dataset.

As before, three rows are shown, one for each of the train, validation, and test sets. Above each image, the first number is the __classification result__ and the second number, in parentheses, is the __ground truth__ (that is, the true value).

In [None]:
model.eval()
(fig,ax) = plt.subplots(3, 10)
for i,ds in enumerate([train_dataset, validation_dataset, test_dataset]):
    for j in range(10):
        x, y_gt = ds[j]
        x = x[None,:].to(device)  # add a batch dimension of size 1
        with torch.no_grad():     # gradients are not needed for inference
            y_out = model(x)
        pred = y_out[0].argmax(0)
        
        ax[i,j].imshow(ds[j][0][0], cmap="gray")
        ax[i,j].set_axis_off()
        ax[i,j].set_title(f"{int(pred)} ({y_gt})")
plt.axis('off')
plt.show()