# Architecture and Hyperparameter Search

In [1]:
# Imports for this project
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data  # TensorDataset and DataLoader, used in the training cells
import torchvision
import torchvision.transforms as transforms
import numpy as np
import random
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
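# Report whether a GPU is visible; only query its name when one exists
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))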


## 1. The Basic Model

#### The model has 28 ∗ 28 = 784 input features, and 10 output features. How many parameters does it have, in total?

7850 parameters.

Each of the 784 input features has one weight per class, and there are 10 classes, which gives 784 * 10 = 7840 weights. Each class also has its own bias, so the total is 7840 + 10 = 7850 parameters.
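The count above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `linear_layer_params` is introduced here for illustration and is not part of the notebook.

```python
def linear_layer_params(n_inputs: int, n_outputs: int) -> int:
    """Count the weights plus biases of one fully connected layer."""
    weights = n_inputs * n_outputs  # one weight per (input, output) pair
    biases = n_outputs              # one bias per output node
    return weights + biases


total = linear_layer_params(28 * 28, 10)
print(total)  # 7850
```

For an actual `nn.Module`, the same number can be read off with `sum(p.numel() for p in model.parameters())`.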


##### Train a simple linear softmax model, and try to minimize its loss without overfitting. What is the minimal loss you achieve on the training set, and corresponding loss on the testing set? Accuracy on the training set, and corresponding accuracy on the testing set? Explain the choices you made, including step size, and how you knew that training further was not going to be worthwhile.

In [5]:
# Grab the MNIST dataset
training_set = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transforms.ToTensor())
testing_set = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transforms.ToTensor())

In [9]:
# Begin data preprocessing

def data_prepocessor(dataset: torch.utils.data.Dataset):
    """
    Prepare the data for training by scaling pixel values to [-0.5, 0.5] and flattening the images.

    Args:
        dataset (torch.utils.data.Dataset): The dataset to preprocess.

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: The flattened image data of shape (N, 784)
        and the corresponding labels of shape (N,).
    """
    
    new_data = (dataset.data / 255.0) - 0.5
    flattened_img_data = new_data.view(new_data.shape[0], -1)
    targets = dataset.targets

    return flattened_img_data, targets

x_train, y_train = data_prepocessor(training_set)
x_test, y_test = data_prepocessor(testing_set)

In [None]:
"""
Establish a NN class for multi-class classification of MNIST dataset. 
"""

class NumberClassifierNN(nn.Module):
    """
    A simple neural network for classifying MNIST digits.
    
    Emulates logistic regression by using a single linear layer.
    """
    
    def __init__(self):
        super().__init__()

        self.layer1 = nn.Linear(784, 10)  # Input layer representing all 784 input features mapped to 10 output classes

        # No activation function needed for logits output
        self.activation_function = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the neural network.
        
        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, 784)
        
        Returns:
            torch.Tensor: Output tensor of shape (batch_size, 10) representing class scores for each digit (0-9)
        """

        logits = self.layer1(x)
        logits = self.activation_function(logits)

        return logits

In [None]:
# Training block

# Initialize model, loss function, and optimizer
model = NumberClassifierNN()
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.20)

# Convert training data to TensorDataset for DataLoader
training_dataset = data.TensorDataset(x_train, y_train)

batch_size = len(training_dataset)

# DataLoader for batching
data_loader = data.DataLoader(training_dataset, batch_size=batch_size, shuffle=True)

epochs = 150

for epoch in range(epochs):
    total_loss = 0

    # Run through the data. Will be one big batch for now
    for inputs, labels in data_loader:
        # Forward pass + computing the loss of the batch
        logits = model(inputs)
        loss = loss_function(logits, labels)

        # Clear the gradient from previous step, and perform gradient descent
        optimizer.zero_grad()
        loss.backward() # Compute the gradients
        optimizer.step() # Update the weights

        total_loss += loss.item() # Accumulate the loss, though there will only be one batch

    print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}")

In [None]:
# Flip mode of model
model.eval()

# Find accuracy and loss on training set
with torch.no_grad():
    train_logits = model(x_train)
    train_loss = loss_function(train_logits, y_train).item()
    train_pred = torch.argmax(train_logits, dim=1)
    train_acc = (train_pred == y_train).float().mean().item()

# Find accuracy and loss on testing set
with torch.no_grad():
    test_logits = model(x_test)
    test_loss = loss_function(test_logits, y_test).item()
    test_pred = torch.argmax(test_logits, dim=1)
    test_acc = (test_pred == y_test).float().mean().item()

print(f"Train Loss: {train_loss:.4f}, Train Accuracy: {train_acc * 100:.2f}%")
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc * 100:.2f}%")



Analysis of Part 2

Results

After training the simple linear softmax model, I reached a minimum loss of 0.4321 on the training data and 0.4120 on the testing data. The corresponding accuracies are 88.55% on the training data and 89.17% on the testing data.

Decisions Made

I started with a learning rate of 0.1, which I expected to be a safe choice, and ran 10 epochs to see how quickly the loss would fall. The model learned at a steady rate, but 10 epochs were not enough: the total loss was still around 1.58 when training stopped. I therefore kept the learning rate the same and increased the number of epochs to 25.

During the second run, the loss reached 1.083, and the rate of improvement had not yet started to slow down, so I tried experimenting with a different learning rate.

For the third run, I used a learning rate of 0.2 with 25 epochs. The loss reached 0.8568, which was better than the previous run but still left room for improvement.

For the fourth run, I kept the learning rate at 0.2 but increased the number of epochs to 50, which gave a loss of 0.6082. Although this was my best run so far, I decided to try a larger learning rate.

For the fifth run, I used a learning rate of 0.25 with 50 epochs and reached a loss of 0.6081. This was marginally better than the previous attempt, but the loss jumped up on every other epoch and then fell again on the next one. I took this oscillation as a sign that a learning rate above 0.25 would not be a good idea. I then experimented with learning rates between 0.2 and 0.25 and found that 0.2 was the best choice, since it never showed that rebounding behaviour.

For my next significant attempt, I wanted to see how low the loss could go with more epochs, so I used a learning rate of 0.2 and 100 epochs. This gave a loss of 0.4786, but I wanted to see how far I could train the model before it started to overfit. I therefore tried progressively larger epoch counts and used the testing data to check that the model was still generalizing well.

After this process, I settled on a learning rate of 0.2 and 150 epochs, which gave a loss of 0.4328. Training beyond 150 epochs kept the loss in the range 0.43 to 0.44, so further training was not worthwhile.
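The stopping rule described above (stop once extra epochs no longer move the loss) can be made explicit in code. The following is only a sketch of one possible criterion, not the procedure used for the runs above, which were judged by inspecting the printed losses. The function name `has_plateaued` and the `window` and `tolerance` defaults are assumptions chosen for illustration.

```python
def has_plateaued(losses: list[float], window: int = 10, tolerance: float = 0.01) -> bool:
    """Return True if the loss improved by less than `tolerance`
    over the last `window` epochs."""
    if len(losses) <= window:
        return False  # not enough history to judge
    improvement = losses[-window - 1] - losses[-1]
    return improvement < tolerance


# Hypothetical loss histories, made up for this example
still_improving = [1.0 - 0.05 * i for i in range(15)]
flat = [0.44] * 15

print(has_plateaued(still_improving))  # False
print(has_plateaued(flat))             # True
```

Inside a training loop, one would append each epoch's loss to a list and stop once the function returns True. Note that a loss that oscillates without improving also counts as a plateau under this rule.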

##### Does the end result change if you start with a different initialization for your parameters? And if so, in what way?

In [None]:
# Training block

# Initialize model, loss function, and optimizer
model = NumberClassifierNN()

loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.20)

# Convert training data to TensorDataset for DataLoader
training_dataset = data.TensorDataset(x_train, y_train)

batch_size = len(training_dataset)

# DataLoader for batching
data_loader = data.DataLoader(training_dataset, batch_size=batch_size, shuffle=True)

epochs = 150

# # Optionally override the default initialization with constant values for the weights and biases
# nn.init.constant_(model.layer1.weight, 1000.0)
# nn.init.constant_(model.layer1.bias, 10000.0)

for epoch in range(epochs):
    total_loss = 0

    # Run through the data. Will be one big batch for now
    for inputs, labels in data_loader:
        # Forward pass + computing the loss of the batch
        logits = model(inputs)
        loss = loss_function(logits, labels)

        # Clear the gradient from previous step, and perform gradient descent
        optimizer.zero_grad()
        loss.backward() # Compute the gradients
        optimizer.step() # Update the weights

        total_loss += loss.item() # Accumulate the loss, though there will only be one batch

    print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}")

## 2. A Fixed-Size Layer Model

#### In this section, I want to consider a model with hidden layers. In particular, assume that the network has k ≥ 1 hidden layers, and each layer has m ≥ 1 nodes. Notice, I’m fixing each hidden layer to be the same size (for simplicity’s sake). We can assume a tanh activation function.

##### Find a formula for Parameters(k, m), the number of trainable parameters in a network of this shape.

The first layer has 784 nodes, one per pixel in each image, and each of these nodes connects to every node in the second layer (the first hidden layer), which has m nodes. As such, each individual node in the first hidden layer has Z parameters, where Z is...

$Z = (\sum_{i=1}^{784}{1}) + 1$

The summation represents the number of weights attached to an individual node from the input layer and the single 1 represents the bias for the current node in the first hidden layer.

Since this second layer has m nodes, we can generalize the formula for the total number of parameters in the second layer to...

$Z * m = m * ((\sum_{i=1}^{784}{1}) + 1)$

If the model has more than one hidden layer, then we know that the third layer also has m nodes. Each node in this third layer will have Q parameters where Q is...

$Q = (\sum_{i=1}^{m}1) + 1$

Furthermore, the entire third layer would have... 

$Q * m = m * ((\sum_{i=1}^{m}1) + 1)$

For any hidden layer beyond the first hidden layer, we can make a general formula for the number of parameters represented with S...

$S = (k - 1) * (Q * m) = (k - 1) * (m * ((\sum_{i=1}^{m}1) + 1))$

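We multiply by k - 1 rather than k because these weights sit on the connections between consecutive hidden layers, and k hidden layers have only k - 1 such connections. The connection from the input into the first hidden layer was already counted in Z * m.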

When the hidden layer is connected to the output layer, we know that there are 10 output nodes. As such, the number of parameters for one node in the output layer would be V...

$V = Q = (\sum_{i=1}^{m}1) + 1$

Which can be generalized for the entire output layer...

$V * 10 = 10 * ((\sum_{i=1}^{m}1) + 1)$

For the entire NN, we can say that the number of parameters it has is

$Parameters(k, m) = (Z * m) + (k - 1)(Q * m) + (V * 10)$

$= m * ((\sum_{i=1}^{784}{1}) + 1) + (k - 1) * (m * ((\sum_{i=1}^{m}1) + 1)) +  10 * ((\sum_{i=1}^{m}1) + 1)$

$= 785m + (k-1)(m^2 + m) + 10m + 10$

$= 795m + km^2 + km -m^2 - m + 10$

$= km^2 - m^2 + 794m + km + 10$

$= (k - 1)m^2 + (794 + k)m + 10$
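As a sanity check on the algebra, the closed form can be compared against a direct layer-by-layer count. This is a small sketch; the function names `parameters` and `parameters_by_layer` are introduced here for illustration.

```python
def parameters(k: int, m: int) -> int:
    """Closed form: (k - 1) m^2 + (794 + k) m + 10."""
    return (k - 1) * m**2 + (794 + k) * m + 10


def parameters_by_layer(k: int, m: int, n_in: int = 784, n_out: int = 10) -> int:
    """Sum weights and biases layer by layer for k hidden layers of m nodes."""
    sizes = [n_in] + [m] * k + [n_out]
    # Each consecutive pair (a, b) contributes a * b weights and b biases
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


# The two counts should agree for any k >= 1 and m >= 1
for k, m in [(1, 1), (1, 32), (3, 64), (5, 100)]:
    assert parameters(k, m) == parameters_by_layer(k, m)

print(parameters(1, 32))  # 25450
print(parameters(3, 64))  # 59210
```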

##### For a fixed number of parameters P , what are the smallest and largest values k can have such that Parameters(k, m) = P ? Note, I am essentially asking the smallest and largest number of layers a network like this could have, with a fixed number of parameters. Let $k_P$ be this max number of layers. Note: What should you do if $k_P$ is not an integer?

Let's start by finding the smallest value of k. Since k is the number of hidden layers and k ≥ 1, the smallest possible value is k = 1. This means that...

$P = (k - 1)m^2 + (794 + k)m + 10$

$P = (1 - 1)m^2 + (794 + 1)m + 10$

$P = 795m + 10$

As such, the minimum value of k is 1, and it is attainable exactly when P = 795m + 10 for some integer m ≥ 1.

To find $k_P$, we can use the same equation from above.

$P = (k - 1)m^2 + (794 + k)m + 10$

From the problem statement, P is a fixed constant. Every term involving m grows as m grows, so increasing m forces k to decrease in order to keep P constant. The largest possible k is therefore obtained with the smallest possible m, namely m = 1. So...

$P = k - 1 + 794 + k + 10$

$P = 2k + 803$

$k = \frac{P - 803}{2}$

$k_P = \frac{P - 803}{2}$

So $k_P$ depends only on the number of parameters: it is the number of hidden layers in a network with m = 1 and the specified constant number of parameters, P.

If $k_P$ is not an integer, we should floor it. Rounding up would require an additional layer that P cannot fully pay for, whereas flooring gives a whole number of layers, each with its full m nodes, and a parameter count that does not exceed P.
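The floor rule can be illustrated numerically. This is a sketch under the m = 1 assumption used in the derivation above; the function name `max_layers` and the example value P = 10000 are chosen for illustration.

```python
def parameters(k: int, m: int) -> int:
    """Closed form derived earlier: (k - 1) m^2 + (794 + k) m + 10."""
    return (k - 1) * m**2 + (794 + k) * m + 10


def max_layers(P: int) -> int:
    """Largest k with Parameters(k, 1) <= P, i.e. floor((P - 803) / 2)."""
    if P < 805:
        # Parameters(1, 1) = 805 is the smallest network of this shape
        raise ValueError("P is too small for even one hidden layer with m = 1")
    return (P - 803) // 2


P = 10000
k_max = max_layers(P)
print(k_max)                     # 4598
print(parameters(k_max, 1))      # 9999, fits within P
print(parameters(k_max + 1, 1))  # 10001, one more layer overshoots P
```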

##### For a given number of parameters P , let $m_{P(k)}$ be the number of nodes per layer that make Parameters(k, m) as close to P as possible. So a network with k layers and $m_{P(k)}$ nodes per layer should have approximately P total parameters.


##### For a given P value, we can plot a graph of ‘final training loss’ for networks of shape (k, $m_{P(k)}$) as a function of k. For (at least) three sufficiently large values of P (so that the networks are non-trivial) that are sufficiently distinct (so that you don’t accidentally create the same network shapes for two different P values), plot the three ‘final training loss’ curves on the same axis. What trends do you observe, and how do these compare to the baseline model?

For this problem, when solving for $m_{P(k)}$, we can treat P and k as constants, since both are known numbers when we are graphing.

Using the solved number of parameters formula...

$P = (k - 1)m^2 + (794 + k)m + 10$

$(k - 1)m^2 + (794 + k)m + (10 - P) = 0$

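We can now use the quadratic formula to find $m_{P(k)}$. Since m must be positive, only the "+" root is meaningful, and the result has to be rounded to an integer. For k = 1 the quadratic term vanishes and the equation is linear, giving $m = (P - 10)/795$ instead...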

$m_{P(k)} = \frac{-(794 + k) \pm \sqrt{(794 + k)^2 - 4(k - 1)(10 - P)}}{2(k - 1)}$
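A possible way to turn this formula into an integer layer width is sketched below. It takes the positive root, handles k = 1 separately (where the equation is linear), and then picks whichever neighbouring integer brings the parameter count closest to P. The function name `nodes_per_layer` is an assumption made for this example.

```python
import math


def parameters(k: int, m: int) -> int:
    """Closed form derived earlier: (k - 1) m^2 + (794 + k) m + 10."""
    return (k - 1) * m**2 + (794 + k) * m + 10


def nodes_per_layer(P: int, k: int) -> int:
    """Integer m >= 1 that brings Parameters(k, m) closest to P."""
    if k == 1:
        m_real = (P - 10) / 795  # linear case: P = 795 m + 10
    else:
        a, b, c = k - 1, 794 + k, 10 - P
        m_real = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)  # positive root
    # Compare the integers on either side of the real-valued solution
    candidates = sorted({max(1, math.floor(m_real)), max(1, math.ceil(m_real))})
    return min(candidates, key=lambda m: abs(parameters(k, m) - P))


print(nodes_per_layer(25450, 1))   # 32
print(nodes_per_layer(59210, 3))   # 64
print(nodes_per_layer(100000, 2))  # 110
```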

Now we can move on to defining our NN class and training/graphing loop.

In [None]:
class Custom_NN(nn.Module):
    """
    A full feed-forward neural network with k hidden layers of size m.
    """
    def __init__(self, k: int, m: int):
        super().__init__()
        
        layers = []
        # First hidden layer: maps the 784 input features to m nodes
        layers.append(nn.Linear(784, m))
        layers.append(nn.Tanh())

        # k-1 hidden layers
        for _ in range(k - 1):
            layers.append(nn.Linear(m, m))
            layers.append(nn.Tanh())
            
        # Encapsulate all hidden layers
        self.hidden_layers = nn.Sequential(*layers)
        
        # Output layer
        self.output_layer = nn.Linear(m, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.hidden_layers(x)
        x = self.output_layer(x)
        return x
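One way the training step for a (k, m) network might look is sketched below. It is not the notebook's own loop: `build_mlp` is a compact stand-in for `Custom_NN`, `final_training_loss` is a hypothetical helper, the learning rate and epoch defaults are simply carried over from the baseline model as an assumption, and random tensors replace MNIST so the block runs on its own.

```python
import torch
import torch.nn as nn
import torch.optim as optim


def build_mlp(k: int, m: int) -> nn.Sequential:
    """k hidden layers of m tanh units, 784 inputs, 10 output logits."""
    sizes = [784] + [m] * k
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        layers += [nn.Linear(n_in, n_out), nn.Tanh()]
    layers.append(nn.Linear(m, 10))
    return nn.Sequential(*layers)


def final_training_loss(model, x, y, lr=0.2, epochs=150):
    """Full-batch SGD; returns the training loss after the last update."""
    loss_function = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_function(model(x), y)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        return loss_function(model(x), y).item()


# Random stand-in data (assumption); in the notebook this would be x_train, y_train
torch.manual_seed(0)
x_demo = torch.rand(256, 784) - 0.5
y_demo = torch.randint(0, 10, (256,))

shapes = [(1, 32), (2, 31), (3, 30)]  # hypothetical (k, m) pairs
losses = {k: final_training_loss(build_mlp(k, m), x_demo, y_demo, epochs=5)
          for k, m in shapes}

# The parameter count of a built network matches Parameters(3, 64)
n_params = sum(p.numel() for p in build_mlp(3, 64).parameters())
print(n_params)  # 59210
```

With real data, the resulting dictionary could be drawn with `plt.plot(list(losses), list(losses.values()))`, once per value of P, to produce the requested curves.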

## 3. Improving Architecture

##### For a P of your choice and the optimal network shape as determined above - try to find an even better network shape (layers of unequal size, for instance) that gives better results for the same (approximate) total number of parameters. Is it better to have uniform layers? Layers of decreasing size? Increasing size? Experiment with it, and summarize your results. You may want to save the best model you find. Bonus: Does regularization (/weight decay) help?