# Mandatory Assignment 1

## Part I: Multi-layer Perceptron

***

*Please fill in the following information:*

Name: <Your name> <br>
Student ID: <Your student ID> <br>
Group: <Your group number>

****

### Introduction

Traditional machine learning methods, like logistic regression in Scikit-Learn, are often sufficient for datasets that are linearly separable. However, more complex problems sometimes demand a more intricate approach. Neural networks, which incorporate additional layers, excel at learning these complex relationships, leading to improved performance. These extra layers, known as "hidden" layers, process the input into one or more intermediate forms before generating the final prediction.

Logistic regression maps its input to a prediction using a single fully connected layer, also referred to as a Single-Layer Perceptron. This layer performs a linear transformation (a matrix multiplication combined with a bias). In contrast, a neural network with multiple connected layers is typically referred to as a Multi-Layer Perceptron (MLP). For instance, in the simple MLP shown below, a 4-dimensional input is mapped to a 5-dimensional hidden representation, which is subsequently transformed into a single output used for prediction.

<img src="../media/MLP.png" width="500"/>
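
As a rough illustration of the shapes involved, the toy network in the figure can be sketched in PyTorch as follows. The names `toy_mlp` and `x`, the batch size of 3, and the choice of ReLU as the hidden nonlinearity are assumptions made for this example only.

```python
import torch
import torch.nn as nn

# Toy network matching the figure: 4 inputs -> 5 hidden units -> 1 output.
toy_mlp = nn.Sequential(
    nn.Linear(4, 5),  # input -> 5-dimensional hidden representation
    nn.ReLU(),        # nonlinearity between the layers (assumed for this sketch)
    nn.Linear(5, 1),  # hidden representation -> single output
)

x = torch.rand(3, 4)  # a batch of 3 hypothetical 4-dimensional inputs
y = toy_mlp(x)

print(x.shape, "->", y.shape)  # one output value per input in the batch
```

Each `nn.Linear` holds its own weight matrix and bias, so the hidden size (5 here) is the only thing the two layers need to agree on.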

In this assignment, your task will be to construct an MLP for the well-known MNIST dataset.

#### Nonlinearities revisited

Nonlinearities are usually applied between the layers of a neural network. As discussed in class 2, there are several reasons for this. A key reason is that without any nonlinearity, a sequence of linear transformations (fully connected layers) reduces to a single linear transformation, limiting the model's expressiveness to that of a single layer. Including nonlinearities between layers prevents this reduction, enabling neural networks to approximate far more complex functions. This is what makes neural networks so powerful.
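
The collapse of stacked linear layers can be checked numerically. The sketch below uses arbitrary layer sizes and hypothetical names (`layer1`, `layer2`, `merged`); it builds a single layer with weight `W2 @ W1` and bias `W2 @ b1 + b2` and compares it with the two-layer stack.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two fully connected layers with no nonlinearity in between (sizes are arbitrary).
layer1 = nn.Linear(4, 5)
layer2 = nn.Linear(5, 3)

with torch.no_grad():
    # Equivalent single layer: W = W2 @ W1, b = W2 @ b1 + b2
    merged = nn.Linear(4, 3)
    merged.weight.copy_(layer2.weight @ layer1.weight)
    merged.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)

    x = torch.rand(8, 4)
    stacked_out = layer2(layer1(x))           # linear -> linear
    merged_out = merged(x)                    # one linear layer
    relu_out = layer2(torch.relu(layer1(x)))  # linear -> ReLU -> linear

same = torch.allclose(stacked_out, merged_out, atol=1e-5)
print("stack equals single layer:", same)
print("max difference once ReLU is inserted:", (relu_out - stacked_out).abs().max().item())
```

With a ReLU between the layers, the output will generally no longer match any single linear layer, which is the point made above.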

Numerous nonlinear activation functions are frequently employed in neural networks, but one of the most commonly used is the [rectified linear unit (ReLU)](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)):

\begin{align}
\mathrm{ReLU}(x) = \max(0,x)
\end{align}
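
A minimal sketch of ReLU applied elementwise to a tensor, using made-up input values. `torch.relu` is the built-in; the `torch.clamp` line spells out the same `max(0, x)` by hand for comparison.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

relu_builtin = torch.relu(x)         # built-in ReLU
relu_manual = torch.clamp(x, min=0)  # max(0, x) written explicitly

print(relu_builtin.tolist())  # [0.0, 0.0, 0.0, 1.5, 3.0]
```

Negative entries are set to zero and non-negative entries pass through unchanged.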

***

### Assignment

Build a 2-layer MLP for MNIST digit classification. Feel free to play around with the model architecture and see how the training time/performance changes, but to begin, try the following:

* Image (784 dimensions) ->  
* fully connected layer (500 hidden units) -> 
* nonlinearity (ReLU) ->  
* fully connected layer (10 output units, one per digit class) -> 
* softmax


*Some hints*:
- Even as we add additional layers, we still only require a single optimizer to learn the parameters. 
- To get the best performance, you may want to play with the learning rate and the number of training epochs.
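
One practical note on the final softmax step, offered as a sketch rather than a prescribed solution: PyTorch's `nn.CrossEntropyLoss` expects raw, unnormalized scores (logits) and applies log-softmax internally, so a model trained with it typically returns the output of the last fully connected layer directly. The logits and labels below are made-up values used only to illustrate this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical logits for a batch of 2 examples and 10 classes.
logits = torch.randn(2, 10)
labels = torch.tensor([3, 7])

# CrossEntropyLoss applied directly to the logits...
ce = nn.CrossEntropyLoss()(logits, labels)

# ...matches log-softmax followed by the negative log-likelihood loss.
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Softmax is monotonic, so it does not change which class scores highest.
same_prediction = torch.equal(
    logits.argmax(dim=1),
    F.softmax(logits, dim=1).argmax(dim=1),
)

print(torch.allclose(ce, manual), same_prediction)
```

This also means `torch.argmax` can be applied to the logits at evaluation time without computing the softmax explicitly.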

***


In [1]:
# auxiliary imports
import random
import matplotlib.pyplot as plt
from tqdm import tqdm

# pytorch
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# sklearn for metrics
from sklearn.metrics import classification_report

### Activate GPU
Use a GPU if one is available. This is optional, but it will speed up training.

In [None]:
# Device will determine whether to run the training on GPU or CPU.
if torch.backends.mps.is_available():  # GPU on MacOS
    device = "mps"
elif torch.cuda.is_available():  # GPU on Linux/Windows
    device = "cuda"
else:  # default to CPU if no GPU is available
    device = "cpu"

device = torch.device(device)
print(f"Running pytorch version {torch.__version__} with backend = {device}")

### Load data

In [None]:
train = datasets.MNIST(
    root = 'data',  # The root directory where the dataset will be stored
    download = True,  # If the dataset is not found at root, it will be downloaded
    train = True,  # The train dataset (as opposed to the test dataset)
    transform = transforms.ToTensor()  # transformations to be applied to the dataset, in this case, convert the images to tensors
)
test = datasets.MNIST(
    root = 'data',
    download = True,
    train = False,
    transform = transforms.ToTensor()
)

train_loader = DataLoader(
    train,  # The dataset
    batch_size = 10,  # The size of each batch (10 images in this case)
    shuffle = True  # Shuffle the training data so each epoch sees a new batch order
)
test_loader = DataLoader(
    test,
    batch_size = 10,
    shuffle = False  # Evaluation metrics do not depend on order, so no shuffling is needed
)

### Inspect data

In [None]:
train, test

In [None]:
# Pick a random example from the training set
selection = random.randrange(len(train))  # randrange already excludes the upper bound
image, label = train[selection]

# Plot the image
print(f"Default image shape: {image.shape}")
image = image.view([28,28])

print(f"Reshaped image shape: {image.shape}")
plt.imshow(image, cmap="gray")

# Print the label
print(f"The label for this image: {label}")

***

### Build the model

In [36]:
class MLP(nn.Module):

  def __init__(self):
    super().__init__()

    # TODO: Define the layers of the network
  
  def forward(self, X : torch.Tensor):

    # TODO: Define the forward pass of the network

    return X

### Hyperparameters

In [25]:
LR = ...  # TODO: set the learning rate
EPOCHS = ... # TODO: set the number of epochs (i.e. passes over the dataset)
LOSS = nn.CrossEntropyLoss()  # the loss function - We suggest using CrossEntropyLoss

### Instantiate the model and optimizer

In [27]:
mlp = MLP()  # Create an instance of the MLP model
mlp.to(device)  # Move the model to the device (GPU or CPU)

optimizer = torch.optim.SGD(  # The optimizer
   mlp.parameters(),
   lr=LR,
   # Feel free to experiment with other parameters
)

### Training

In [None]:

mlp.train() # Set the model to training mode

for epoch in range(EPOCHS):
    
    running_loss = 0.0  # i.e. the loss for this epoch

    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS}"):
        
        images, labels = batch  # unpack images and labels from the batch

        images = images.to(device) # Move the images to GPU if available
        labels = labels.to(device) # Move the labels to GPU if available

        # Zero gradients
        optimizer.zero_grad()  # Zero the gradients, i.e. reset the gradients to zero so that they don't accumulate between batches

        # Forward pass
        images = images.view(-1, 28 * 28)  # Flatten the images
        outputs = mlp(images)  # Forward pass the images through the model

        # Compute loss
        loss = LOSS(outputs, labels)  # Compute the loss

        running_loss += loss.item()  # Add the loss to the running loss

        # Backward pass and optimization
        loss.backward()  # Compute the gradients
        optimizer.step() # Update the weights of the model using the gradients

    print(f"Epoch {epoch+1}/{EPOCHS}, Loss: {running_loss/len(train_loader):.4f}")

### Evaluation

In [None]:
# Evaluation Loop
mlp.eval()  # Set model to evaluation mode
y_true = []  # To store true labels
y_pred = []  # To store predicted labels

with torch.no_grad():  # Disable gradient tracking, as we are not updating the model parameters
    
    for batch in tqdm(test_loader, desc="Testing"):
        images, labels = batch

        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        images = images.view(-1, 28 * 28) # Flatten the images to 1D
        outputs = mlp(images)  # Forward pass
        predictions = torch.argmax(outputs, dim=1)

        # Store labels for the classification report
        y_true.extend(labels.cpu().numpy())  # Move to CPU and convert to numpy
        y_pred.extend(predictions.cpu().numpy())  # Move to CPU and convert to numpy

# Print Classification Report
print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=[str(i) for i in range(10)]))
