<a href="https://colab.research.google.com/github/sdgroeve/Machine_Learning_course_UGent_D012554_2025/blob/main/notebooks/mnist_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MNIST Digit Classification with Logistic Regression and Validation

This notebook demonstrates how to build a simple logistic regression model to classify handwritten digits from the MNIST dataset using PyTorch.


## Setting Up the Environment

First, we'll import the necessary libraries and set a random seed for reproducibility.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np

# Set random seed for reproducibility
torch.manual_seed(42)

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

### Libraries Used:

- **torch**: The main PyTorch library for tensor operations and neural network building blocks
- **torch.nn**: Contains neural network layers, loss functions, and other components
- **torch.optim**: Provides optimization algorithms like SGD, Adam, etc.
- **torch.utils.data**: Contains utilities for data loading and batching, including the `random_split` function for creating validation sets
- **torchvision**: Provides datasets, model architectures, and image transformations for computer vision
- **matplotlib.pyplot**: For visualization and plotting
- **numpy**: For numerical operations

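We set a random seed to make our results reproducible: rerunning this code with the same seed should give the same results on the same hardware and software versions. Note that PyTorch does not guarantee bit-identical results across platforms, devices, or releases.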
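
As a minimal sketch of what seeding does (the tensor size here is an arbitrary choice for illustration), re-seeding the global generator before drawing random numbers reproduces the same values:

```python
import torch

# Seed the global generator, then draw some random numbers
torch.manual_seed(42)
first_draw = torch.rand(3)

# Re-seeding with the same value restarts the random sequence
torch.manual_seed(42)
second_draw = torch.rand(3)

print(torch.equal(first_draw, second_draw))  # True
```

The same mechanism is what makes weight initialization, `random_split`, and shuffling repeatable from run to run.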

## Loading and Preprocessing the MNIST Dataset

The MNIST dataset consists of 28x28 grayscale images of handwritten digits (0-9). It contains 60,000 training images and 10,000 test images.

We'll use torchvision's built-in datasets module to download and load the data. We'll also split the training data into training and validation sets.

In [None]:
# Define data transformation
transform = transforms.Compose([
    transforms.ToTensor(),  # Convert images to PyTorch tensors and normalize pixel values to [0, 1]
])

# Download and load the full training data
full_train_dataset = datasets.MNIST(
    root='./data',          # Directory where the data will be stored
    train=True,             # Use the training split
    download=True,          # Download the data if it's not already downloaded
    transform=transform     # Apply the defined transformations
)

validation_split = 0.1  # 10% of training data for validation

# Split training data into training and validation sets
val_size = int(len(full_train_dataset) * validation_split)
train_size = len(full_train_dataset) - val_size
train_dataset, val_dataset = random_split(full_train_dataset, [train_size, val_size])

print(f"Training set size: {train_size}")
print(f"Validation set size: {val_size}")

# Download and load test data
test_dataset = datasets.MNIST(
    root='./data',          # Directory where the data will be stored
    train=False,            # Use the test split
    download=True,          # Download the data if it's not already downloaded
    transform=transform     # Apply the defined transformations
)

print(f"Test dataset size: {len(test_dataset)}")

### Data Preprocessing Explanation:

1. **transforms.Compose**: Combines multiple transforms together. In this case, we're only using one transform, but in more complex scenarios, you might apply multiple transformations like resizing, cropping, normalization, etc.

2. **transforms.ToTensor()**: Converts PIL images or NumPy arrays to PyTorch tensors. It also scales the pixel values from [0, 255] to [0, 1], which is a common preprocessing step for neural networks.

3. **datasets.MNIST**: This is a built-in dataset class in torchvision that handles downloading and loading the MNIST dataset.
   - `root='./data'`: Specifies where to store the dataset files
   - `train=True/False`: Whether to use the training or test split
   - `download=True`: Automatically download the dataset if it's not already present
   - `transform=transform`: Apply the defined transformations to the images

4. **random_split**: Splits a dataset into non-overlapping new datasets of given lengths.
   - We calculate the validation set size as a percentage of the full training dataset
   - We split the full training dataset into training and validation sets
   - This gives us a way to monitor the model's performance on unseen data during training

The MNIST dataset is already split into training and test sets. We further split the training set into training and validation sets. The training set is used to train the model, the validation set is used to monitor performance during training, and the test set is used to evaluate the final model's performance on completely unseen data.
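
The split logic can be sketched on a small stand-in dataset. The `toy_*` names and the 100-sample `TensorDataset` below are assumptions made for illustration, not part of the notebook, and passing an explicit `generator` is one optional way to make a particular split repeatable:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in for MNIST: 100 blank "images" with integer ids as labels
toy_dataset = TensorDataset(torch.zeros(100, 1, 28, 28), torch.arange(100))

# Same 90/10 arithmetic as in the notebook
toy_val_size = int(len(toy_dataset) * 0.1)
toy_train_size = len(toy_dataset) - toy_val_size

# A dedicated generator makes this particular split repeatable
generator = torch.Generator().manual_seed(42)
toy_train, toy_val = random_split(
    toy_dataset, [toy_train_size, toy_val_size], generator=generator
)

# The two subsets share no indices and together cover the whole dataset
train_ids = set(toy_train.indices)
val_ids = set(toy_val.indices)
print(len(toy_train), len(toy_val))   # 90 10
print(train_ids.isdisjoint(val_ids))  # True
```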

## Creating Data Loaders

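Data loaders handle batching and shuffling, and can optionally load data in parallel with multiple worker processes (via `num_workers`; we keep the default of 0 here, so loading happens in the main process).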

In [None]:
batch_size = 32        # Number of samples processed before the model is updated

# Create data loaders
train_loader = DataLoader(
    dataset=train_dataset,  # The dataset to load from
    batch_size=batch_size,  # How many samples per batch
    shuffle=True            # Whether to shuffle the data
)

val_loader = DataLoader(
    dataset=val_dataset,    # The dataset to load from
    batch_size=batch_size,  # How many samples per batch
    shuffle=False           # No need to shuffle validation data
)

test_loader = DataLoader(
    dataset=test_dataset,   # The dataset to load from
    batch_size=batch_size,  # How many samples per batch
    shuffle=False           # No need to shuffle test data
)

# Calculate how many batches we have
print(f"Number of training batches: {len(train_loader)}")
print(f"Number of validation batches: {len(val_loader)}")
print(f"Number of test batches: {len(test_loader)}")

### Data Loader Explanation:

The `DataLoader` class provides an efficient way to iterate through the dataset in batches during training and evaluation.

- **dataset**: The dataset from which to load the data
- **batch_size**: How many samples to load per batch
- **shuffle**: Whether to shuffle the data at the start of each epoch
  - For the training data, we set `shuffle=True` to ensure that the model sees the data in a different order each epoch, which helps prevent the model from memorizing the order of the samples
  - For the validation and test data, we set `shuffle=False` because the order doesn't matter for evaluation

We create three data loaders:
1. **train_loader**: For training the model
2. **val_loader**: For validating the model during training
3. **test_loader**: For final evaluation of the model

The DataLoader will automatically handle the batching of data, which means it will group the data into batches of size `batch_size` and provide an iterator to go through these batches.
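
The number of batches follows directly from the dataset sizes. A small sketch of the arithmetic, assuming the 90/10 split described earlier and the `DataLoader` default `drop_last=False` (so the final, smaller batch is kept):

```python
import math

batch_size = 32

# Sizes after the 90/10 split of the 60,000 training images, plus the 10,000 test images
dataset_sizes = {"train": 54_000, "validation": 6_000, "test": 10_000}

batch_counts = {}
for name, num_samples in dataset_sizes.items():
    # With drop_last=False the last partial batch still counts as a batch
    num_batches = math.ceil(num_samples / batch_size)
    last_batch = num_samples - (num_batches - 1) * batch_size
    batch_counts[name] = num_batches
    print(f"{name}: {num_batches} batches, last batch has {last_batch} samples")
```

These counts should match what `len(train_loader)`, `len(val_loader)`, and `len(test_loader)` report.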

## Visualizing Training Examples

Let's visualize some examples from the training dataset to get a better understanding of what our model will be working with.

In [None]:
# Function to visualize examples from the dataset
def show_examples(dataset, num_examples=5):
    plt.figure(figsize=(15, 3))
    for i in range(num_examples):
        # For datasets wrapped in random_split, we need to access the dataset differently
        if isinstance(dataset, torch.utils.data.Subset):
            img, label = dataset.dataset[dataset.indices[i]]
        else:
            img, label = dataset[i]
        plt.subplot(1, num_examples, i+1)  # Create a subplot in a 1 x num_examples grid
        plt.imshow(img.squeeze().numpy(), cmap='gray')  # Display the image in grayscale
        plt.title(f'Label: {label}')       # Set the title to the label
        plt.axis('off')                    # Hide the axes
    plt.tight_layout()                     # Adjust the spacing between subplots
    plt.show()

# Show 5 examples from the training dataset
show_examples(train_dataset, num_examples=5)

### Visualization Function Explanation:

The `show_examples` function displays a specified number of examples from the dataset:

1. `plt.figure(figsize=(15, 3))`: Creates a figure with a width of 15 inches and a height of 3 inches

2. For each example:
   - Check if the dataset is a `Subset` (which is the case for our training and validation sets after using `random_split`)
   - If it is a `Subset`, access the underlying dataset using the indices from the subset
   - Otherwise, access the dataset directly
   - `plt.subplot(1, num_examples, i+1)`: Creates a subplot in a 1 × num_examples grid at position i+1
   - `img.squeeze().numpy()`: Removes dimensions of size 1 and converts the tensor to a NumPy array
   - `plt.imshow(..., cmap='gray')`: Displays the image in grayscale
   - `plt.title(f'Label: {label}')`: Sets the title to show the label
   - `plt.axis('off')`: Hides the x and y axes

3. `plt.tight_layout()`: Adjusts the spacing between subplots to avoid overlap

4. `plt.show()`: Displays the figure

This visualization helps us understand what the MNIST images look like. Each image is a 28×28 grayscale image of a handwritten digit from 0 to 9.

## Building the Model

Now, let's define our model. In the context of neural networks, multiclass logistic regression (also called softmax regression) is essentially a single linear layer followed by a softmax activation function.

In [None]:
# Define the logistic regression model
class LogisticRegressionModel(nn.Module):
    def __init__(self, input_size, num_classes):
        super(LogisticRegressionModel, self).__init__()
        self.linear = nn.Linear(input_size, num_classes)  # Single fully connected layer

    def forward(self, x):
        x = x.view(x.size(0), -1)  # Flatten the input: reshape from [batch_size, 1, 28, 28] to [batch_size, 784]
        x = self.linear(x)  # Raw class scores (logits); CrossEntropyLoss applies softmax internally
        return x

# Initialize the model
input_size = 28 * 28  # MNIST images are 28x28 pixels
num_classes = 10       # There are 10 digit classes (0-9)
model = LogisticRegressionModel(input_size, num_classes).to(device)

# Print the model architecture
print(model)

### Model Architecture Explanation:

Our logistic regression model is very simple:

1. **Model Class**: We define a custom `LogisticRegressionModel` class that inherits from `nn.Module`, which is the base class for all neural networks in PyTorch.

2. **Initialization (`__init__`)**:
   - `super(LogisticRegressionModel, self).__init__()`: Calls the constructor of the parent class
   - `self.linear = nn.Linear(input_size, num_classes)`: Creates a single fully connected (linear) layer that maps from `input_size` (784) to `num_classes` (10)

3. **Forward Pass (`forward`)**:
   - `x.view(x.size(0), -1)`: Reshapes the input tensor from [batch_size, 1, 28, 28] to [batch_size, 784], effectively flattening the 28×28 images into 784-dimensional vectors
   - `self.linear(x)`: Applies the linear transformation, producing raw scores (logits) for each class

4. **Model Instantiation**:
   - `input_size = 28 * 28`: The number of input features (784)
   - `num_classes = 10`: The number of output classes (digits 0-9)
   - `model = LogisticRegressionModel(input_size, num_classes).to(device)`: Creates an instance of the model and moves it to the appropriate device (CPU or GPU)
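
To make the shapes concrete, here is a small sketch that pushes a fake batch through a standalone `nn.Linear` layer. The random `fake_images` tensor is an assumption standing in for a real MNIST batch:

```python
import torch
import torch.nn as nn

# A fake batch with the same shape as a batch of MNIST images
fake_images = torch.rand(32, 1, 28, 28)

# Flatten each image into a 784-dimensional vector
flat = fake_images.view(fake_images.size(0), -1)

# One linear layer: a 784 x 10 weight matrix plus 10 biases
linear = nn.Linear(28 * 28, 10)
logits = linear(flat)

num_params = sum(p.numel() for p in linear.parameters())
print(flat.shape)    # torch.Size([32, 784])
print(logits.shape)  # torch.Size([32, 10])
print(num_params)    # 7850
```

The whole model therefore has 784 × 10 weights plus 10 biases, 7,850 trainable parameters in total.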

This model performs logistic regression by applying a single linear transformation to the flattened input images. The output is a vector of 10 values (logits), one for each digit class. These logits will be converted to probabilities using the softmax function, which is implicitly applied by the cross-entropy loss function we'll use for training.
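
The implicit softmax can be checked numerically. The sketch below uses made-up logits and labels and compares `nn.CrossEntropyLoss` on raw logits with a manual softmax followed by the mean negative log-probability of the correct class:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for a batch of 2 samples and 3 classes
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3]])
labels = torch.tensor([0, 1])

# Softmax turns each row of logits into probabilities that sum to 1
probs = F.softmax(logits, dim=1)

# CrossEntropyLoss applied directly to the raw logits ...
loss_from_logits = nn.CrossEntropyLoss()(logits, labels)

# ... matches the mean negative log-probability of the correct classes
manual_loss = -torch.log(probs[torch.arange(2), labels]).mean()

print(probs.sum(dim=1))
print(loss_from_logits.item(), manual_loss.item())
```

This is why a model trained with `nn.CrossEntropyLoss` is expected to output raw logits rather than probabilities.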

In [None]:
# @title
# class TwoLayerNN(nn.Module):
#     def __init__(self, input_size, hidden_size, output_size):
#         super(TwoLayerNN, self).__init__()
#         self.fc1 = nn.Linear(input_size, hidden_size)  # First layer
#         self.relu = nn.ReLU()  # Activation function
#         self.fc2 = nn.Linear(hidden_size, output_size)  # Second layer

#     def forward(self, x):
#         x = x.view(-1, 784)  # Flatten the input
#         x = self.fc1(x)
#         x = self.relu(x)
#         x = self.fc2(x)
#         return x  # No softmax needed since CrossEntropyLoss includes it

# model = TwoLayerNN(28*28, 128, 10).to(device)

In [None]:
# @title
# class CNNModel(nn.Module):
#     def __init__(self):
#         super(CNNModel, self).__init__()
#         self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)  # Conv layer 1
#         self.relu = nn.ReLU()
#         self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1)  # Conv layer 2
#         self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # Max pooling
#         self.fc1 = nn.Linear(64 * 7 * 7, 128)  # Fully connected layer
#         self.fc2 = nn.Linear(128, 10)  # Output layer (10 classes)

#     def forward(self, x):
#         x = self.conv1(x)
#         x = self.relu(x)
#         x = self.pool(x)
#         x = self.conv2(x)
#         x = self.relu(x)
#         x = self.pool(x)
#         x = x.view(x.size(0), -1)  # Flatten
#         x = self.fc1(x)
#         x = self.relu(x)
#         x = self.fc2(x)
#         return x  # No softmax needed since CrossEntropyLoss includes it

# model = CNNModel().to(device)

## Defining Loss Function and Optimizer

Now, we need to define:
1. A loss function to measure how well the model is performing
2. An optimizer to update the model parameters based on the computed gradients

In [None]:
learning_rate = 0.03    # Step size at each iteration while moving toward a minimum of the loss function

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()  # Combines LogSoftmax and NLLLoss in one single class
optimizer = optim.SGD(model.parameters(), lr=learning_rate)  # Stochastic Gradient Descent optimizer

### Loss Function and Optimizer Explanation:

1. **Loss Function (Cross-Entropy Loss)**:
   - `nn.CrossEntropyLoss()`: This is a commonly used loss function for multi-class classification problems
   - It combines two operations:
     - Log-softmax: Converts the raw model outputs (logits) into log-probabilities
     - Negative Log-Likelihood Loss: Takes the negative log-probability assigned to the correct class, averaged over the batch
   - The loss increases as the predicted probability diverges from the actual label

2. **Optimizer (Stochastic Gradient Descent)**:
   - `optim.SGD(model.parameters(), lr=learning_rate)`: Creates an SGD optimizer that will update the model parameters
   - `model.parameters()`: The parameters to optimize (weights and biases of the model)
   - `lr=learning_rate`: The learning rate (0.03), which controls the step size during optimization

The optimizer will use the gradients computed during backpropagation to update the model parameters in a direction that minimizes the loss function.
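
A single SGD update can be sketched on a toy problem. The one-parameter loss `(w - 1)^2` below is an assumption chosen so the gradient is easy to verify by hand, and the comparison shows that plain `optim.SGD` (no momentum or weight decay) applies `w - lr * grad`:

```python
import torch

learning_rate = 0.03

# A single trainable parameter and a toy loss: loss = (w - 1)^2
w = torch.tensor([3.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=learning_rate)

loss = ((w - 1.0) ** 2).sum()
optimizer.zero_grad()
loss.backward()  # d(loss)/dw = 2 * (w - 1) = 4.0

# What plain SGD should produce: w - lr * grad
expected_w = w.detach().clone() - learning_rate * w.grad

optimizer.step()
print(w.detach(), expected_w)
```

Here the parameter moves from 3.0 to 3.0 - 0.03 × 4.0 = 2.88, a small step toward the minimum at 1.0.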

## Creating an Evaluation Function

Let's define a function to evaluate the model on a given dataset. This will be used to evaluate the model on both the validation and test sets.

In [None]:
# Function to evaluate the model
def evaluate(model, data_loader):
    model.eval()  # Set the model to evaluation mode
    correct = 0
    total = 0
    running_loss = 0.0

    with torch.no_grad():  # Disable gradient calculation
        for images, labels in data_loader:
            images, labels = images.to(device), labels.to(device)  # Move data to the appropriate device
            outputs = model(images)  # Get model predictions
            loss = criterion(outputs, labels)  # Calculate the loss

            running_loss += loss.item()  # Accumulate the loss
            _, predicted = torch.max(outputs, 1)  # Get the predicted class
            total += labels.size(0)  # Count the total number of samples
            correct += (predicted == labels).sum().item()  # Count the number of correct predictions

    accuracy = 100 * correct / total  # Accuracy as a percentage
    avg_loss = running_loss / len(data_loader)  # Average loss

    return avg_loss, accuracy

### Evaluation Function Explanation:

The `evaluate` function assesses the model's performance on a given dataset:

1. `model.eval()`: Sets the model to evaluation mode, which disables features like dropout and uses the running statistics for batch normalization (though our simple model doesn't use these)

2. `with torch.no_grad()`: Disables gradient calculation, which reduces memory usage and speeds up computation since we don't need gradients for evaluation

3. For each batch:
   - Move the data to the appropriate device (CPU or GPU)
   - Pass the images through the model to get predictions
   - Calculate the loss between predictions and true labels
   - Accumulate the loss
   - Get the predicted class (the index with the highest value)
   - Count the total number of samples and the number of correct predictions

4. Calculate and return the average loss and accuracy:
   - `accuracy = 100 * correct / total`: Accuracy as a percentage
   - `avg_loss = running_loss / len(data_loader)`: Average loss over all batches

This function allows us to evaluate the model on both the validation set during training and the test set after training is complete.
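
The prediction and accuracy steps can be traced by hand on a tiny example. The outputs and labels below are made up for illustration:

```python
import torch

# Made-up model outputs for 4 samples and 3 classes
outputs = torch.tensor([[0.1, 2.0, 0.3],
                        [1.5, 0.2, 0.1],
                        [0.2, 0.1, 3.0],
                        [0.9, 0.8, 0.1]])
labels = torch.tensor([1, 0, 2, 1])

# torch.max over dim 1 returns (max values, their indices);
# the indices are the predicted classes
_, predicted = torch.max(outputs, 1)

correct = (predicted == labels).sum().item()
accuracy = 100 * correct / labels.size(0)

print(predicted.tolist())  # [1, 0, 2, 0]
print(accuracy)            # 75.0
```

Three of the four predictions match their labels, so the accuracy is 75%.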

## Training the Model

Now, let's train our model for the specified number of epochs. During training, we'll track the loss and accuracy on both the training and validation sets to monitor the model's progress.

In [None]:
num_epochs = 20         # Number of complete passes through the training dataset

# Training loop
train_losses = []      # To store the training loss for each epoch
train_accuracies = []  # To store the training accuracy for each epoch
val_losses = []        # To store the validation loss for each epoch
val_accuracies = []    # To store the validation accuracy for each epoch

for epoch in range(num_epochs):
    # Training phase
    model.train()  # Set the model to training mode
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in train_loader:  # Iterate through the training data in batches
        images, labels = images.to(device), labels.to(device)  # Move data to the appropriate device

        # Forward pass
        outputs = model(images)  # Get model predictions
        loss = criterion(outputs, labels)  # Calculate the loss

        # Backward and optimize
        optimizer.zero_grad()  # Zero the parameter gradients
        loss.backward()        # Compute gradients
        optimizer.step()       # Update parameters

        # Track statistics
        running_loss += loss.item()  # Accumulate the loss
        _, predicted = torch.max(outputs, 1)  # Get the predicted class
        total += labels.size(0)  # Count the total number of samples
        correct += (predicted == labels).sum().item()  # Count the number of correct predictions

    # Calculate training metrics
    train_loss = running_loss / len(train_loader)  # Average loss for the epoch
    train_accuracy = 100 * correct / total  # Accuracy as a percentage

    # Validation phase
    val_loss, val_accuracy = evaluate(model, val_loader)  # Evaluate on validation set

    # Store metrics
    train_losses.append(train_loss)
    train_accuracies.append(train_accuracy)
    val_losses.append(val_loss)
    val_accuracies.append(val_accuracy)

    # Print epoch statistics
    print(f'Epoch [{epoch+1}/{num_epochs}], '
          f'Train Loss: {train_loss:.4f}, Train Acc: {train_accuracy:.2f}%, '
          f'Val Loss: {val_loss:.4f}, Val Acc: {val_accuracy:.2f}%')

### Training Loop Explanation:

The training loop iterates through the dataset multiple times (epochs) to train the model:

1. **Initialization**:
   - `train_losses`, `train_accuracies`, `val_losses`, `val_accuracies`: Lists to store the metrics for each epoch

2. **For each epoch**:
   
   - **Training phase**:
     - `model.train()`: Sets the model to training mode
     - Initialize counters for loss, correct predictions, and total samples
     - For each batch:
       - Move the data to the appropriate device (CPU or GPU)
       - **Forward pass**: Pass the images through the model and calculate the loss
       - **Backward pass and optimization**: Clear gradients, compute gradients, and update parameters
       - **Track statistics**: Accumulate loss and count correct predictions
     - Calculate training loss and accuracy for the epoch
   
   - **Validation phase**:
     - Call the `evaluate` function on the validation set to get validation loss and accuracy
   
   - **Store and print metrics**:
     - Store the training and validation metrics for later visualization
     - Print the statistics for the current epoch

This training loop follows the standard pattern for training neural networks with validation: train on the training set, evaluate on the validation set, and track metrics to monitor progress. By monitoring both training and validation metrics, we can detect issues like overfitting (when the model performs well on the training data but poorly on the validation data).
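
The core of the loop is the forward pass, backward pass, and parameter update. As a minimal sketch on a hypothetical toy problem (random features and labels, with `toy_model` standing in for the real model), one such step should lower the loss on the batch it was computed from:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy problem: 64 samples, 20 features, 3 classes
features = torch.randn(64, 20)
targets = torch.randint(0, 3, (64,))

toy_model = nn.Linear(20, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.03)

# Forward pass: loss before the update
loss_before = criterion(toy_model(features), targets)

# Backward pass and update: clear old gradients, backpropagate, step
optimizer.zero_grad()
loss_before.backward()
optimizer.step()

# Loss on the same batch after the update
with torch.no_grad():
    loss_after = criterion(toy_model(features), targets)

print(f"before: {loss_before.item():.4f}, after: {loss_after.item():.4f}")
```

Repeating this step over many batches and epochs is all the training loop does; the rest is bookkeeping for the metrics.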

## Visualizing Training and Validation Metrics

Let's visualize how the loss and accuracy changed during training for both the training and validation sets. This will help us understand the model's learning process and detect any potential issues like overfitting.

In [None]:
# Plot training and validation metrics
plt.figure(figsize=(12, 5))

# Plot loss
plt.subplot(1, 2, 1)
plt.plot(train_losses, label='Training')
plt.plot(val_losses, label='Validation')
plt.title('Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)

# Plot accuracy
plt.subplot(1, 2, 2)
plt.plot(train_accuracies, label='Training')
plt.plot(val_accuracies, label='Validation')
plt.title('Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy (%)')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

### Visualization Explanation:

We create two plots side by side to visualize the training and validation metrics:

1. **Loss Plot**:
   - Shows how the loss decreases over epochs for both training and validation sets
   - A decreasing loss indicates that the model is learning and improving its predictions
   - If the training loss continues to decrease but the validation loss starts to increase, it may indicate overfitting

2. **Accuracy Plot**:
   - Shows how the accuracy increases over epochs for both training and validation sets
   - An increasing accuracy indicates that the model is making more correct predictions
   - If the training accuracy continues to increase but the validation accuracy plateaus or decreases, it may indicate overfitting

These plots help us understand the model's learning dynamics and can indicate if the model is converging to a good solution or if there are issues like overfitting or underfitting.
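
The same check can be done programmatically on the recorded metrics. The sketch below uses made-up loss values (not results from this notebook) and hypothetical `example_*` lists in place of `train_losses` and `val_losses`:

```python
# Made-up per-epoch losses, for illustration only
example_train_losses = [0.90, 0.60, 0.45, 0.38, 0.33, 0.30]
example_val_losses = [0.92, 0.65, 0.52, 0.50, 0.53, 0.57]

# Epoch (1-based) with the lowest validation loss
best_epoch = min(
    range(len(example_val_losses)), key=example_val_losses.__getitem__
) + 1

# Later epochs where training loss still fell but validation loss rose
diverging_epochs = [
    epoch + 1
    for epoch in range(best_epoch, len(example_val_losses))
    if example_train_losses[epoch] < example_train_losses[epoch - 1]
    and example_val_losses[epoch] > example_val_losses[epoch - 1]
]

print(best_epoch)        # 4
print(diverging_epochs)  # [5, 6]
```

In this made-up run, validation loss is lowest at epoch 4 and rises afterwards while training loss keeps falling, which is the overfitting pattern described above.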

## Evaluating the Model on Test Set

Finally, let's evaluate our trained model on the test dataset to see how well it generalizes to completely unseen data.

In [None]:
# Evaluate the model on test set
test_loss, test_accuracy = evaluate(model, test_loader)
print(f'Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.2f}%')

### Test Evaluation Explanation:

We use the `evaluate` function we defined earlier to evaluate the model on the test set. This gives us a final measure of how well our model generalizes to completely unseen data.

The test set is a separate dataset that was not used during training or validation. It provides an unbiased evaluation of the final model's performance. A good model should have a test accuracy that is close to its validation accuracy. If the test accuracy is significantly lower than the validation accuracy, it might indicate that the validation set was not representative of the general data distribution.