<a href="https://colab.research.google.com/github/urness/CS167Fall2025/blob/main/Day20_MLP_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day20
## MLP Training

#### CS167: Machine Learning, Fall 2025


# **Do This: Apply for access to Colab Pro for Education**

1. Open Colab
2. Settings --> Select "Colab Pro"
3. Click on "Learn more" button
4. Click on "No cost for students and educators" button
5. Fill out the form.
6. Verify your email.

**What does Colab Pro for Education do for you?**

- It gives you the same features as a paid Colab Pro subscription for one year free of charge.
- Better compute resources compared to the free tier: access to more powerful GPUs, more memory, and longer session runtimes.


## __Put the Model on Training Device (GPU or CPU)__

*Not **mandatory** for today, but it can save you some computation time.*

We want to accelerate the training process using a graphics processing unit (GPU). Fortunately, Colab gives us access to one. You need to enable it from _Runtime (or click on the down arrow near RAM & DISK in the upper right) --> Change runtime type_ and select a GPU. The check below only detects CUDA GPUs, so a TPU runtime would fall back to the CPU.

In [None]:
# check to see if torch.cuda is available, otherwise it will use CPU
import torch
import torch.nn as nn
import numpy as np
device = (
    "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
print(f"Using {device} device")
# if it prints 'cuda' then colab is running using GPU device



---



# **Review**

# Creating Linear Layers using PyTorch

In [None]:
# create input values
torch.manual_seed(2) # for reproducibility (you will get the same random number every time you run this cell)

number_of_samples = 1
random_X = torch.randn(number_of_samples, 2) # two X values

## **Let's build the simple 1-hidden layer feedforward neural network!**

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/mlp_toy_examle_wo_weights.png" width=400/>
</div>


In [None]:
# creation of our network
torch.manual_seed(2)
my_first_mlp = nn.Sequential(
                nn.Linear(2, 3),
                nn.ReLU(),
                nn.Linear(3, 1)
)

In [None]:
# forward pass step
rand_input = random_X
output = my_first_mlp(rand_input)
print('final output: ', output.data)



---


# **Creating a custom PyTorch network class**

A multilayer perceptron (MLP) is the simplest type of neural network. It consists of perceptrons (aka nodes, neurons) arranged in layers.
Create a network class with two methods:
- `__init__()`
- `forward()`


In [None]:
import torch
from torch import nn

# You can give your new network any name, e.g., SimpleMLP.
# However, you must inherit from nn.Module to
# create your own network class. That way, you can access a lot of
# useful methods and attributes from the parent class nn.Module

class SimpleMLP(nn.Module):
  def __init__(self):
    super().__init__()
    # your network layer construction should take place here
    # ...
    # ...

  def forward(self, x):
    # your code for MLP forward pass should take place here
    # ...
    # ...
    return x

# **Now, using a custom PyTorch network class**

## **Let's build the simple 1-hidden layer feedforward neural network!**

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/mlp_toy_examle_wo_weights.png" width=400/>
</div>


In [None]:
import torch
from torch import nn

class SimpleMLP(nn.Module):
  def __init__(self):
    super().__init__()
    # your network layer construction should take place here
    self.network_layers = nn.Sequential(
                nn.Linear(2, 3),    # hidden layer: 2 input features → 3 neurons
                nn.ReLU(),
                nn.Linear(3, 1)      # output layer: 3 → 1 output
        )

  def forward(self, x):
    # your code for MLP forward pass should take place here
    output = self.network_layers(x)
    return output

In [None]:
torch.manual_seed(2)

# Create an instance
model = SimpleMLP()

# Forward pass
output = model(random_X)
print('output: ', output.data)



---



# __Building Modular Code for Multilayer Perceptron (MLP)__

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/mlp_network1.png" width=800/>
</div>

Let's create the MLP as shown in the picture above using this template. In general, we will follow this template for constructing other neural networks such as CNN, RNN, and Transformer in PyTorch. Hence, it is a very generic setup. Here are the useful PyTorch modules we will be using for MLP construction:
- [nn.Linear()](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear)
  - creates the dense connections (corresponding to the weights of edges) between two adjacent layers (_left layer_ and _right layer_)
  - just provide __#neurons_left_layer__ and __#neurons_right_layer__
- [nn.Sigmoid()](https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html#sigmoid)
- [nn.ReLU()](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#relu)
- [nn.Softmax()](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html#softmax)

Test a forward pass of our MLP using one of the training samples. Each image is a 28x28 matrix of numbers, so you first need to convert it into a contiguous vector of 784 values using the following PyTorch module:
- [nn.Flatten()](https://docs.pytorch.org/docs/stable/generated/torch.nn.Flatten.html)
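
To see what flattening does before it is used inside a network, here is a minimal sketch on a fake batch of images. The random tensor and its batch size of 4 are assumptions chosen only for illustration; `nn.Flatten` is the module named above.

```python
import torch
from torch import nn

# a fake batch of 4 grayscale 28x28 images: (batch, channel, height, width)
fake_images = torch.randn(4, 1, 28, 28)

flatten = nn.Flatten()        # by default keeps dim 0 (the batch) and flattens the rest
flat = flatten(fake_images)

print(fake_images.shape)      # shape before flattening
print(flat.shape)             # one 784-value vector per image
```

Because the batch dimension is preserved, the flattened tensor can be fed directly into a first layer such as `nn.Linear(784, 256)`.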


In [None]:
import torch
from torch import nn
import pdb

class SimpleMLPv1(nn.Module):
  def __init__(self):
    super().__init__()
    self.flatten = nn.Flatten()

    # your network layer construction should take place here
    self.network_layers = nn.Sequential(
                nn.Linear(784, 256),  # linear transformation module (input=784, output=256)
                nn.ReLU(),
                nn.Linear(256, 10) # linear transformation module (input=256, output=10)
                              # usually this number should be equal to the total number of classes in your classification task
    )

  def forward(self, x):
    # your code for MLP forward pass should take place here
    x = self.flatten(x)
    output = self.network_layers(x)
    return output


# **Another model with 2 hidden layers**

In [None]:
import torch
from torch import nn
import pdb

class SimpleMLPv2(nn.Module):
  def __init__(self):
    super().__init__()
    self.flatten = nn.Flatten()

    self.network_layers = nn.Sequential(
                nn.Linear(784, 512),  # linear transformation module (input=784, output=512)
                nn.ReLU(),
                nn.Linear(512, 256),  # linear transformation module (input=512, output=256)
                nn.ReLU(),
                nn.Linear(256, 10) # linear transformation module (input=256, output=10)
                              # usually this number should be equal to the total number of classes in your classification task
    )

  def forward(self, x):
    x = self.flatten(x)
    output = self.network_layers(x)
    return output


In [None]:
# check the structure of your MLP
mlp_model = SimpleMLPv2()
print(mlp_model)

In [None]:
# check the sizes of weights and biases of your MLP's 1st hidden layer

# Access the first Linear layer (input → 512)
first_linear = mlp_model.network_layers[0]

print("Weights shape:", first_linear.weight.shape)
print("Biases shape:", first_linear.bias.shape)

In [None]:
# check the randomly initialized values of weights and biases of your MLP's 1st hidden layer
print('weights of first_hidden_layer: \n ', first_linear.weight)
print('bias of first_hidden_layer: \n ', first_linear.bias)

In [None]:
# check the sizes of weights and biases of your MLP's 2nd hidden layer
# Access the second Linear layer (512 → 256)
second_linear = mlp_model.network_layers[2]
print("Weights shape:", second_linear.weight.shape)
print("Biases shape:", second_linear.bias.shape)

In [None]:
# check the sizes of weights and biases of your MLP's 3rd layer
# Access the third Linear layer (256 → 10)
third_linear = mlp_model.network_layers[4]
print("Weights shape:", third_linear.weight.shape)
print("Biases shape:", third_linear.bias.shape)
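
The shapes printed above follow one rule: an `nn.Linear(in, out)` layer stores a weight matrix of shape (out, in) plus a bias vector of length out. As a quick sanity check, the sketch below counts the learnable parameters of SimpleMLPv2 with plain arithmetic. The helper name `linear_param_count` is made up for illustration and is not part of PyTorch.

```python
def linear_param_count(n_in, n_out):
    """Number of learnable values in a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

# the three Linear layers of SimpleMLPv2 as (inputs, outputs)
layer_sizes = [(784, 512), (512, 256), (256, 10)]

counts = [linear_param_count(n_in, n_out) for n_in, n_out in layer_sizes]
total_params = sum(counts)

for (n_in, n_out), count in zip(layer_sizes, counts):
    print(f"Linear({n_in}, {n_out}): {count} parameters")
print("total:", total_params)
```

ReLU and Flatten have no learnable parameters, so `sum(p.numel() for p in mlp_model.parameters())` should report the same total.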

In [None]:
## Exercise -- Create a model SimpleMLPv3 with 3 hidden layers
# Input layer has 784 inputs
# First layer has 256 nodes
# Second layer has 128 nodes
# Third layer has 64 nodes
# Output layer has 10


# __Load the Dataset for your MLP__

We can easily import some [built-in datasets](https://docs.pytorch.org/vision/main/datasets.html) from PyTorch's torchvision.datasets module:
- [MNIST](https://en.wikipedia.org/wiki/MNIST_database)
  - each image is a 28x28 grayscale image
  - each image is associated with a label from __10 classes__
  - training set of 60,000 examples and a test set of 10,000 examples

<div>
<img src="https://upload.wikimedia.org/wikipedia/commons/b/b1/MNIST_dataset_example.png" width=500/>
</div>


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import torch
from torch.utils.data import Dataset
import torchvision
# torchvision has many deep learning benchmark datasets: MNIST, CIFAR-10, Caltech-101, etc.
import torchvision.datasets as datasets
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt


training_data = datasets.MNIST(
    root="/content/drive/MyDrive/CS167/datasets", # headsup! You can replace this path so that it points to a directory in your Google Drive
    train=True,
    download=True,
    transform=ToTensor() # convert each image to a float tensor with pixel values scaled to [0, 1]
)

test_data = datasets.MNIST(
    root="/content/drive/MyDrive/CS167/datasets", # headsup! You can replace this path so that it points to a directory in your Google Drive
    train=False,
    download=True,
    transform=ToTensor()
)


In [None]:
# now that the files are downloaded, let's look at an image from the training set

## note you can access any of the 60,000 training images
image, label = training_data[41] # I chose 41, just for fun -- try some others
print(image.shape, label)

plt.figure(figsize=(2, 2))  # small figure to avoid upscaling
plt.imshow(image.squeeze(), cmap="gray", interpolation="none")
plt.title(f"Label: {label}")
plt.show()

## __Prepare Your Data with DataLoader for Training/Testing__
So far we have looked at one sample at a time. As we saw in our discussion of optimizers, specifically __Stochastic Gradient Descent (SGD)__, training passes the data through the network in __minibatches__. PyTorch has a class called __DataLoader__, which does this automatically for us as long as we provide the right arguments. It can:
- prepare the __minibatches__ with the given _batch_size_, e.g., 16, 32, 64, 128, etc.
- use multiprocessing to speed up data retrieval
- reshuffle the data at every __epoch__
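
To get a feel for what a given batch size means, this small sketch works out how many minibatches one epoch contains for the 60,000 MNIST training images. It is plain arithmetic rather than a DataLoader call, and it assumes the DataLoader default of keeping the final, smaller batch instead of dropping it.

```python
import math

num_samples = 60_000   # size of the MNIST training set
batch_size = 128

# number of minibatches in one epoch (the last one may be only partly full)
num_batches = math.ceil(num_samples / batch_size)

# how many samples are left over for the final minibatch
last_batch_size = num_samples - (num_batches - 1) * batch_size

print("batches per epoch:", num_batches)
print("size of last batch:", last_batch_size)
```

With a real DataLoader, `len(train_dataloader)` reports the same number of batches.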


In [None]:
from torch.utils.data import DataLoader
#                             pairs of items,    minibatch size,        random shuffling turned ON
train_dataloader = DataLoader(training_data,     batch_size=128,        shuffle=True)
test_dataloader  = DataLoader(test_data,         batch_size=128,        shuffle=False) # for testing/inference: it is not necessary to shuffle


Take it one batch at a time...

In [None]:
# explore the data from the train_dataloader
train_inputs, train_labels = next(iter(train_dataloader)) # returns a batch of 128 train-images and train-labels

print(f"Images batch shape: {train_inputs.shape}")
print(f"Labels batch shape: {train_labels.shape}")

#display some more examples from the MNIST dataset from this batch
figure = plt.figure(figsize=(5, 5))
cols, rows = 5, 2

for i in range(cols * rows):
    img= train_inputs[i]
    label = train_labels[i]
    figure.add_subplot(rows, cols, i+1)
    plt.title(int(label))
    plt.axis("off")
    plt.imshow(img.squeeze(), cmap="gray")
plt.show()

# __Forward Pass using your Dataset and your MLP__

The forward method inside our network class, __SimpleMLPv2__, is invoked when we pass an input tensor __'X'__ to the network object we instantiated earlier, i.e., __mlp_model__, as follows:
- _output = mlp_model(X)_

Finally, we convert the output from the model into probabilities using the __Softmax()__ module:
- _predicted_probability = softmax_activation(output)_
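
As an illustration of what the softmax step computes, here is a hand-written version in plain Python. The function `softmax` and the three example scores are assumptions made up for this sketch; in the notebook itself the work is done by `nn.Softmax`.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # subtracting the largest score avoids overflow and does not change the result
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]          # made-up raw scores for 3 classes
probs = softmax(logits)

print([round(p, 3) for p in probs])
print("sum:", sum(probs))
print("predicted class:", probs.index(max(probs)))
```

Softmax preserves the ordering of the scores, so the predicted class is the same whether you take the argmax of the raw outputs or of the probabilities.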




---


Remember: the code above set up a network (mlp_model, via the class SimpleMLPv2) that takes 784 inputs (28x28 pixel values) and outputs 10 values, but we haven't trained it yet. We can still put an image through the network and see what it predicts.

In [None]:
img   = train_inputs[127]                 # sample at index 127; any index from 0 to batch_size-1 = 127 works
label = train_labels[127]

softmax_activation = nn.Softmax(dim=1)    # normalize across the 10 class scores (dim=1)

# Load up the model
mlp_model = SimpleMLPv2()

# data and model should be placed on the same device (either GPU or CPU)
X = img.unsqueeze(0).to(device)           # add a batch dimension, then send the data tensor to the GPU (if available)
mlp_model.to(device)                      # send the model to the GPU (if available)
# print(f"device {device} and model: \n {mlp_model}")
output = mlp_model(X)                     # the last layer returns 10 raw scores (logits), each in (-infinity, infinity)

predicted_probability = softmax_activation(output)  # rescales the logits to values in [0, 1] that sum to 1: the model's predicted probability for each class

print('predicted probability \n', predicted_probability)
y_pred = predicted_probability.argmax()
print(f"Predicted class: {y_pred.item()}")
print(f"Actual class: {label.item()}")


## __Defining the Loss Function__

- [nn.CrossEntropyLoss()](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss)
  - useful when training a __classification problem__ with __C__ classes.
  - criterion computes the cross entropy loss between input logits (raw scores before softmax) and target
- [nn.MSELoss()](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss)
  - useful when training a __regression problem__
  - criterion that measures the mean squared error (squared L2 norm) between each element in the input _x_ and target _y_
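
To make the two loss definitions concrete, here is a hedged sketch that computes each by hand for a single example. The functions `cross_entropy` and `mse` and all input numbers are made up for illustration. `nn.CrossEntropyLoss` and `nn.MSELoss` perform the equivalent computation on tensors and average over the batch by default.

```python
import math

def cross_entropy(logits, target):
    """Cross entropy for one sample: -log of the softmax probability of the true class."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[target]

def mse(predictions, targets):
    """Mean squared error over paired lists of numbers."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

logits = [2.0, 1.0, 0.1]                   # made-up raw scores for 3 classes
loss_right = cross_entropy(logits, 0)      # true class is the highest-scoring one
loss_wrong = cross_entropy(logits, 2)      # true class is the lowest-scoring one
print(loss_right, loss_wrong)              # the second loss is larger

mse_value = mse([2.5, 0.0], [3.0, 1.0])    # (0.25 + 1.0) / 2
print(mse_value)
```

The cross-entropy loss is small when the true class already has the highest score and grows as the model puts more probability on the wrong classes.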


In [None]:
# initialize the loss function
loss_fn = nn.CrossEntropyLoss() # this is useful for multiclass classification task

## __Initializing the Optimizer__

Optimization, as we have discussed earlier, is the process of adjusting model parameters to reduce model error at each training step.

PyTorch provides a selection of optimization algorithms in the [torch.optim](https://pytorch.org/docs/stable/optim.html) package. Some of them are as follows:
- [torch.optim.SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD)
- [torch.optim.Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam)
- [torch.optim.RMSprop](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html#torch.optim.RMSprop)

In addition to selecting the optimizer, we also choose the *hyperparameters*: adjustable settings that control the optimization process. You can influence the training and convergence of the model by tweaking them:
- __epochs:__ the number of complete passes over the training dataset
- __batch size:__ the number of data samples propagated through the network before each parameter update
- __learning rate:__ how large a step the optimizer takes at each parameter update (that is, at each batch)
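
To illustrate how the learning rate scales each update, here is a toy sketch of gradient descent on a single parameter. The loss (w - 3)**2 and the helper names `gradient` and `run_sgd` are assumptions invented for this example; a real optimizer such as `torch.optim.SGD` applies the same rule to every parameter of the network.

```python
def gradient(w):
    """Derivative of the toy loss (w - 3)**2 with respect to w."""
    return 2 * (w - 3)

def run_sgd(learning_rate, steps, w=0.0):
    """Apply the update rule w <- w - learning_rate * gradient repeatedly."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

w_small_lr = run_sgd(learning_rate=0.01, steps=20)
w_large_lr = run_sgd(learning_rate=0.1, steps=20)

print(w_small_lr)   # still far from the minimum at w = 3
print(w_large_lr)   # much closer to 3 after the same number of steps
```

A larger learning rate is not always better: on this toy loss, any value above 1.0 would make the updates overshoot and diverge.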



In [None]:
learning_rate = 1e-3
batch_size    = 128 # If the total sample size is N, batch_size=128 splits the data into ceil(N/128) mini-batches (the last one may be smaller)
epochs        = 10
# let's use SGD optimization algorithm for training our model
optimizer     = torch.optim.SGD(mlp_model.parameters(), lr=learning_rate)

# __Putting Everything Together MLP__

__Putting Everything Together using our SimpleMLPv2 Network on the MNIST Dataset__

(we've set up all of these steps above, except for steps 4 and 5)

In [None]:
# Step 1: load the Torch library and other utilities
#----------------------------------------------------

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
import time


In [None]:
# Step 2: load the dataset, ie, we are experimenting with MNIST
#--------------------------------------------------------------------------------------------------


training_data = datasets.MNIST(
    root="/content/drive/MyDrive/CS167/datasets", # headsup! You can replace this path so that it points to a directory in your Google Drive
    train=True,
    download=True,
    transform=ToTensor() # convert each image to a float tensor with pixel values scaled to [0, 1]
)

test_data = datasets.MNIST(
    root="/content/drive/MyDrive/CS167/datasets", # headsup! You can replace this path so that it points to a directory in your Google Drive
    train=False,
    download=True,
    transform=ToTensor()
)


In [None]:
# Step 3: Create your MLP Network (this is just copied from above)
#--------------------------------------------------------------------------------------------------

import torch
from torch import nn
import pdb

class SimpleMLPv2(nn.Module):
  def __init__(self):
    super().__init__()
    self.flatten = nn.Flatten()

    self.network_layers = nn.Sequential(
                nn.Linear(784, 512),  # linear transformation module (input=784, output=512)
                nn.ReLU(),
                nn.Linear(512, 256),  # linear transformation module (input=512, output=256)
                nn.ReLU(),
                nn.Linear(256, 10) # linear transformation module (input=256, output=10)
                              # usually this number should be equal to the total number of classes in your classification task
    )

  def forward(self, x):
    x = self.flatten(x)
    output = self.network_layers(x)
    return output


In [None]:
## Step 4: Your training and testing functions
#--------------------------------------------------------------------------------------

def train_loop(dataloader, model, loss_fn, optimizer):
    """
    Executes one full training epoch for the given model.

    Iterates over all batches in the provided DataLoader, performing the following steps:
    - Moves input and target tensors to the selected device (CPU or GPU)
    - Computes predictions and loss for each batch
    - Performs backpropagation and optimizer updates
    - Tracks and prints training loss periodically

    Args:
        dataloader (torch.utils.data.DataLoader):
            The DataLoader providing batches of training data (inputs and labels).
        model (torch.nn.Module):
            The neural network model to be trained.
        loss_fn (torch.nn.Module or callable):
            The loss function used to compute the training loss.
        optimizer (torch.optim.Optimizer):
            The optimizer responsible for updating the model’s parameters.

    Returns:
        float: The average training loss across all batches in this epoch.
    """
    size = len(dataloader.dataset)

    model.train()                   # set the model to training mode for best practices

    train_loss = 0

    for batch, (X, y) in enumerate(dataloader):

        # compute prediction and loss
        X = X.to(device)                  # send data to the GPU device (if available)
        y = y.to(device)
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()      # compute gradients
        optimizer.step()     # apply updates
        optimizer.zero_grad()# clear old gradients

        train_loss += loss.item()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

    return train_loss/len(dataloader)

def test_loop(dataloader, model, loss_fn):
    """
    Evaluates the model’s performance on a test (or validation) dataset.

    Runs a forward pass over all batches in the provided DataLoader with gradient
    computation disabled, accumulating loss and accuracy metrics.

    Args:
        dataloader (torch.utils.data.DataLoader):
            The DataLoader providing batches of test or validation data.
        model (torch.nn.Module):
            The trained neural network model to evaluate.
        loss_fn (torch.nn.Module or callable):
            The loss function used to compute the evaluation loss.

    Returns:
        float: The average loss over all test batches.

    Prints:
        Accuracy (% of correct predictions) and average test loss.
    """

    model.eval()                    # set the model to evaluation mode for best practices

    size        = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        for X, y in dataloader:

            X = X.to(device)                     # send data to the GPU device (if available)
            y = y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
    return test_loss
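
The accuracy line inside `test_loop` packs several steps into one expression. The sketch below unpacks it on a tiny made-up batch; the logits and labels are invented for illustration, while the tensor operations are the same ones used in the function above.

```python
import torch

# made-up logits for 4 samples and 3 classes, plus their true labels
pred = torch.tensor([[2.0, 0.1, 0.3],
                     [0.2, 1.5, 0.1],
                     [0.9, 0.8, 0.7],
                     [0.1, 0.2, 3.0]])
y = torch.tensor([0, 1, 2, 2])

predicted_classes = pred.argmax(1)                 # index of the largest score in each row
matches = (predicted_classes == y)                 # boolean tensor, True where the guess is right
correct = matches.type(torch.float).sum().item()   # count of correct predictions

print(predicted_classes)
print(correct / len(y))                            # fraction of correct predictions
```

Summing these counts over all batches and dividing by the dataset size gives the accuracy that `test_loop` prints.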

In [None]:

# Step 5: prepare the DataLoader and select your optimizer and set the parameters for learning the model from DataLoader
#------------------------------------------------------------------------------------------------------------------------------

mlp_model = SimpleMLPv2() ## model Class name here
mlp_model.to(device)      ## device should have been determined earlier (at top of notebook)
learning_rate = 1e-3
batch_size_val = 64
epochs = 20
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(mlp_model.parameters(), lr=learning_rate)

train_dataloader = DataLoader(training_data, batch_size=batch_size_val, shuffle=True)  # reshuffle the training data every epoch
test_dataloader = DataLoader(test_data, batch_size=batch_size_val)


train_losses = []
test_losses  = []
start_time   = time.time()
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    avg_train_loss = train_loop(train_dataloader, mlp_model, loss_fn, optimizer)
    avg_test_loss  = test_loop(test_dataloader, mlp_model, loss_fn)
    train_losses.append(avg_train_loss)
    test_losses.append(avg_test_loss)

print("Total training time: %.3f sec" %( (time.time()-start_time)) )
print("Total training time: %.3f hrs" %( (time.time()-start_time)/3600) )

print(mlp_model.__class__.__name__, " model has been trained!")


In [None]:
# visualizing the loss curves
plt.plot(range(1,epochs+1), train_losses)
plt.plot(range(1,epochs+1), test_losses)
plt.title('Model average losses after each epoch')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'])
plt.show()

In [None]:
# Now with a trained model, let's see how well it does on a few specific examples:

from torch.utils.data import DataLoader
test_dataloader  = DataLoader(test_data,         batch_size=128,        shuffle=False) # for testing/inference: it is not necessary to shuffle
# we need to load data a batch at a time -- loading all of the data in memory is not efficient (or even possible sometimes)

mlp_model.eval() # puts model into evaluation mode (training = False)
images_shown = 12

test_inputs, test_labels = next(iter(test_dataloader)) # returns the first batch of 128 test images and test labels
test_inputs = test_inputs.to(device) # make sure we are on the same device (GPU or CPU) as the model
test_labels = test_labels.to(device)

# run a forward pass -- no need to compute gradients
with torch.no_grad():
    logits = mlp_model(test_inputs)

# what are the predictions?
preds = logits.argmax(dim=1)

# plot values in a grid
plt.figure(figsize=(10,6))
for i in range(images_shown):
    ax = plt.subplot(3, 4, i+1)
    plt.imshow(test_inputs[i].cpu().squeeze(), cmap="gray", interpolation="nearest")
    title = f"P: {int(preds[i])}"
    if preds[i] == test_labels[i]:
        title += " ✓"
    else:
        title += f" ✗ (T: {int(test_labels[i])})"
    ax.set_title(title, fontsize=8)
    ax.axis("off")
plt.tight_layout()
plt.show()