# Assignment overview <ignore>
The overarching goal of this assignment is to produce a research report in which you implement, analyse, and discuss various Neural Network techniques. You will be guided through the process of producing this report, which will provide you with experience in report writing that will be useful in any research project you might be involved in later in life.

All of your report, including code and Markdown/text, ***must*** be written up in ***this*** notebook. This is not typical for research, but is solely for the purpose of this assignment. Please make sure you change the title of this file so that XXXXXX is replaced by your candidate number. You can use code cells to write code to implement, train, test, and analyse your NNs, as well as to generate figures to plot data and the results of your experiments. You can use Markdown/text cells to describe and discuss the modelling choices you make, the methods you use, and the experiments you conduct. So that we can mark your reports with greater consistency, please ***do not***:

* rearrange the sequence of cells in this notebook.
* delete any cells, including the ones explaining what you need to do.

If you want to add more code cells, for example to help organise the figures you want to show, then please add them directly after the code cells that have already been provided. 

Please provide verbose comments throughout your code so that it is easy for us to interpret what you are attempting to achieve with your code. Long comments are useful at the beginning of a block of code. Short comments, e.g. to explain the purpose of a new variable, or one of several steps in some analyses, are useful on every few lines of code, if not on every line. Please do not use the code cells for writing extensive sentences/paragraphs that should instead be in the Markdown/text cells.

# Abstract/Introduction (instructions) - 15 MARKS <ignore>
Use the next Markdown/text cell to write a short introduction to your report. This should include:
* a brief description of the topic (image classification) and of the dataset being used (CIFAR10 dataset). (2 MARKS)
* a brief description of how the CIFAR10 dataset has aided the development of neural network techniques, with examples. (3 MARKS)
* a descriptive overview of what the goal of your report is, including what you investigated. (5 MARKS)
* a summary of your major findings. (3 MARKS)
* two or more relevant references. (2 MARKS)

### Abstract


Through structured experimentation, this assignment explores and demonstrates a number of properties of artificial neural networks. Using the CIFAR-10 dataset and classification task and a relatively simple convolutional neural network, the effect that a number of network and parameter choices have on training and performance shall be explored.

Looking first at varying learning rates and learning rate decay, then at dropout for regularisation, transfer learning and batch normalisation, it is hoped that a number of important and essential properties shall be described and demonstrated.

Whilst the aim of the task is not strictly to achieve the best performance, the impact that the different techniques and choices have on performance shall be discussed.

It shall be shown that.........

training dynamics and performance of    

### Introduction

The labelled CIFAR-10 dataset was created as part of a study exploring different approaches to training generative models for natural images [3]. It and its larger sibling CIFAR-100 have since been used for benchmarking and testing in many exploratory and groundbreaking papers relating to computer vision and image classification, not least in the development of AlexNet [x], ResNet [y] and, most recently, transformer-for-vision architectures [z], as well as many others [a, b, c]. It is fitting, then, to use it to explore some of the fundamental properties of artificial neural networks (NNs) in this assignment.

The objective of this assignment is to examine the impact of NN design choices on performance and gradient flow in a deep convolutional neural network trained to perform multiclass classification on CIFAR-10 images. This shall be done through 3 experiments, each examining a different property of NN model learning.

The first experiment examined the effect that altering the learning rate has on training and performance. As well as experimenting with different learning rates, a learning rate 'scheduler' was designed and its performance compared to that of a high-performing unscheduled (fixed) learning rate.

The second experiment aimed to understand and demonstrate the impact of introducing a dropout layer into the architecture of the network. Different dropout rates were trialled and their effects compared to baseline performance in both training and evaluation. The effect of dropout was also tested in a transfer learning context.

The third experiment focused on understanding the effects of both dropout and batch normalisation on gradient flow in the network during backpropagation. It was designed to show the impact that these interventions have on the propagation of the gradient through the different layers of the network at different points of the training cycle.

The experiments showed.....

# Methodology (instructions) - 55 MARKS <ignore>
Use the next cells in this Methodology section to describe and demonstrate the details of what you did, in practice, for your research. Cite at least two academic papers that support your model choices. The overarching principle of writing the Methodology is to ***provide sufficient details for someone to replicate your model and to reproduce your results, without having to resort to your code***. You must include at least these components in the Methodology:
* Data - Describe the dataset, including how it is divided into training, validation, and test sets. Describe any pre-processing you perform on the data, and explain any advantages or disadvantages to your choice of pre-processing. 
* Architecture - Describe the architecture of your model, including all relevant hyperparameters. The architecture must include 3 convolutional layers followed by two fully connected layers. Include a figure with labels to illustrate the architecture.
* Loss function - Describe the loss function(s) you are using, and explain any advantages or disadvantages there are with respect to the classification task.
* Optimiser - Describe the optimiser(s) you are using, including its hyperparameters, and explain any advantages or disadvantages there are to using that optimiser.
* Experiments - Describe how you conducted each experiment, including any changes made to the baseline model that has already been described in the other Methodology sections. Explain the methods used for training the model and for assessing its performance on validation/test data.


## Data (7 MARKS) <ignore>

As already mentioned, the data used in this assignment is the CIFAR-10 dataset developed by Krizhevsky and Hinton, et al. [x]. It consists of 60,000 colour 32x32 images, each of which belongs to one of 10 mutually exclusive classes and is labelled correspondingly. These classes describe the subject of the image and are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck.

The CIFAR-10 dataset, along with many others, is available for download via the convenient PyTorch (torchvision) `datasets` module, which uses a boolean `train` flag to let the user load the 50,000 training images and 10,000 test images into separate `Dataset` instances very easily, and this was the method used here. As part of this process it is possible to apply transforms to the data as it is loaded. Using the `Normalize` transform, the data was standardised so that the pixel values of each of the 3 input channels have a mean of 0 and a standard deviation of 1 across all of the datasets, the idea being to focus the learning on the underlying properties of the data rather than any incidental variance in the raw data.
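
As a rough illustration of the per-channel standardisation described above, the sketch below applies the same arithmetic as a `Normalize` transform, `(x - mean) / std`, to one channel of toy pixel values. It uses plain Python lists rather than tensors so that it runs without any dependencies; the helper name `standardise_channel` and the pixel values are hypothetical.

```python
import math

def standardise_channel(pixels):
    """Shift and scale one channel so it has mean 0 and standard deviation 1."""
    mean = sum(pixels) / len(pixels)
    # population standard deviation of the channel
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))
    return [(p - mean) / std for p in pixels]

# toy "red channel" values already scaled to [0, 1] (made-up numbers)
red = [0.1, 0.4, 0.5, 0.8, 0.2]
red_std = standardise_channel(red)

# check the result: mean should be ~0 and variance ~1
new_mean = sum(red_std) / len(red_std)
new_var = sum((v - new_mean) ** 2 for v in red_std) / len(red_std)
print(round(new_mean, 6), round(new_var, 6))
```

In practice the mean and standard deviation would be computed once per channel and then reused for every image in every split.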

The training set consisting of 50,000 labelled samples was then split to create a validation or development set of 5,000 samples (with a random seed set for consistency). The final numbers for training, validation and testing were then 45,000, 5,000 and 10,000 respectively. The class distribution for each dataset was checked and found to be well balanced, which means that accuracy should be a reliable measure of the overall performance of the model on test data (although this differed in experiments 2 and 3).
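
The seeded split can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code used for the report: the function name `split_indices` and the seed value `0` are assumptions. In PyTorch a similar effect is usually obtained with `torch.utils.data.random_split` and a seeded generator.

```python
import random

def split_indices(n_samples, n_val, seed):
    """Return (train_idx, val_idx) from a seeded shuffle of range(n_samples)."""
    rng = random.Random(seed)          # local generator so the split is repeatable
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # the first n_val shuffled indices form the validation set, the rest train
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(n_samples=50_000, n_val=5_000, seed=0)
print(len(train_idx), len(val_idx))    # 45000 5000
```

Because the generator is seeded, calling the function again with the same arguments returns exactly the same split.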

Batching for stochastic gradient descent was handled by the `DataLoader` class, which yields samples from the shuffled dataset without replacement in batches of a user-specified size until the data has been exhausted. The data is then shuffled again and new batches are drawn from the newly shuffled data, so that the batches in every epoch are different. This adds stochasticity, a key element in SGD's approximation of the true gradient of the loss.
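
The batching behaviour described above (reshuffle each epoch, then draw batches without replacement until the data runs out) can be imitated with a small generator. This is a dependency-free sketch of the idea, not the `DataLoader` implementation; `iterate_minibatches` is a hypothetical name and the sizes are toy values.

```python
import random

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield lists of indices covering every sample once, in a fresh random order."""
    order = list(range(n_samples))
    rng.shuffle(order)                      # reshuffle at the start of every epoch
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = random.Random(0)
epoch1 = list(iterate_minibatches(10, 4, rng))   # first pass over the data
epoch2 = list(iterate_minibatches(10, 4, rng))   # second pass, new shuffle
print([len(b) for b in epoch1])                  # [4, 4, 2]
```

The final batch is smaller whenever the dataset size is not a multiple of the batch size, as with the 2 left-over samples here.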

It was decided that a simple train, validation, test split would be appropriate for the task at hand. Cross-validation is good for getting a very accurate idea of the likely performance of a model on the test set, and exposes the model to all of the possible training data. However, performance was not the chief aim of this assignment. Rather, the aim was understanding, for which the comparison of performance on training, validation and test sets is sufficient.


<figure><center><img src="./figs/classdisttraining.png" width=300><img src="./figs/classdistval.png" width=300><img src="./figs/class dist test.png" width=300><figcaption> Figure 1. Class distributions of the training, validation and test sets. </figcaption></center></figure>


## Architecture (17 MARKS) <ignore>


<figure><center><img src="./figs/baseline_model_diagram.png" width=800><figcaption> Figure 2. BaselineNet Convolutional Neural Network. </figcaption></center></figure>


The baseline architecture was designed in accordance with the assignment brief and some initial experimentation with hyperparameters.

The basic architecture visible in Fig. 2 is outlined in more detail below, followed by the reasoning and results that led to some of the design decisions.

The inputs to the model are the 32x32 colour images, which have a depth of 3 owing to the RGB channels. These are convolved with the first convolutional layer `conv1`, comprising 16 filters as outlined in Table 1 below. The resulting output is passed through a ReLU activation and then through a max pooling layer `pool1`, which halves the spatial dimensions of the previous layer's output whilst preserving the number of channels. Pooling layers `pool2` and `pool3` perform the same operation and have the same structure. The output of `pool1` is convolved with `conv2`, giving a 32-channel, 16x16 output which is passed through a ReLU activation followed by another max pooling layer `pool2`. The final convolutional layer `conv3` increases the number of channels to 64, before the activations are passed through a ReLU activation and max pooled again by `pool3`. The result is flattened and passed to the first fully connected layer `fc1`, through another ReLU activation, and finally into `fc2`, the 10-dimensional output layer. The softmax activation is applied to these outputs as part of the loss calculation rather than inside the model.

<center>***Table 1: Convolutional Neural Network Architecture***</center> 

| Layer     |k (n filters) | F (filter dimensions)| S (stride) | P (padding) | Input Dimensions | Output Dimensions |
|-----------|-------------|----------------------|------------|-------------|------------------|-------------------|
| `conv1`   | 16          | 3x3 (x3)                | 1          | 1           | 32x32x3          | 32x32x16          |
| `pool1`   | -           | 2x2 (x16)               | 2          | -           | 32x32x16         | 16x16x16          |
| `conv2`   | 32          | 3x3 (x16)               | 1          | 1           | 16x16x16         | 16x16x32          |
| `pool2`   | -           | 2x2 (x32)               | 2          | -           | 16x16x32         | 8x8x32            |
| `conv3`   | 64          | 3x3 (x32)               | 1          | 1           | 8x8x32           | 8x8x64            |
| `pool3`   | -           | 2x2 (x64)               | 2          | -           | 8x8x64           | 4x4x64            |
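
The output dimensions in Table 1 follow from the standard formula $\lfloor (W - F + 2P)/S \rfloor + 1$ for an input of width $W$. The short sketch below walks that formula through the layers of the table as a sanity check. The helper name `conv_out` is hypothetical, and a padding of 0 is assumed for the pooling layers, where the table shows a dash.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size - kernel + 2 * padding) // stride + 1

# (name, output channels or None for pooling, kernel, stride, padding) from Table 1
layers = [
    ("conv1", 16, 3, 1, 1), ("pool1", None, 2, 2, 0),
    ("conv2", 32, 3, 1, 1), ("pool2", None, 2, 2, 0),
    ("conv3", 64, 3, 1, 1), ("pool3", None, 2, 2, 0),
]

size, channels = 32, 3          # CIFAR-10 input: 32x32 pixels, 3 channels
shapes = {}
for name, out_channels, kernel, stride, padding in layers:
    size = conv_out(size, kernel, stride, padding)
    if out_channels is not None:    # pooling keeps the channel count unchanged
        channels = out_channels
    shapes[name] = (size, size, channels)
    print(f"{name}: {size}x{size}x{channels}")

flattened = size * size * channels  # number of features passed to fc1
print("flattened features:", flattened)
```

The final value, 4 x 4 x 64 = 1024, is the input size that the first fully connected layer has to accept.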



The hyperparameters for the BaselineNet architecture were chosen based on a combination of the assignment brief, initial experimentation, and common practices in the field.

The batch size of 64 was selected as a balance between computational efficiency and the ability to capture a representative sample of the dataset in each iteration. This size allows for efficient data processing on a GPU while providing a reasonable approximation of the gradient during training.

The filter dimensions of 3x3 were chosen as they have been shown to be effective in capturing local spatial patterns while keeping the number of parameters relatively low [VGG???]. 

The increasing number of filters (16, 32, 64) in the convolutional layers allows the network to learn progressively more complex and abstract features as the depth increases. 

Setting a stride of 1 combined with a padding of 1 in the convolutional layers ensures that the spatial resolution is preserved and prevents information loss at the edges.

The max pooling layers with a pool size of 2x2 and stride of 2 help to reduce the spatial dimensions, thereby reducing the number of parameters and providing a form of translation invariance. 

Overall, these choices strike a balance between model complexity, computational efficiency, and the ability to learn meaningful features from the CIFAR-10 dataset.

Owing to the number of training runs required to get accurate, averaged results, parameter count was a legitimate consideration as it impacted training time significantly.

The choice of ReLU (Rectified Linear Unit) as the activation function throughout the BaselineNet architecture is based on its proven effectiveness and computational efficiency. ReLU has become the default activation function for many deep learning models due to its ability to alleviate the vanishing gradient problem and promote sparse representations [ReLU_Advantages]. It introduces non-linearity into the network, allowing it to learn complex patterns and representations. ReLU is computationally efficient compared to other activation functions like sigmoid or tanh, as it involves a simple thresholding operation. This efficiency enables faster training and inference times. Additionally, ReLU has been shown to accelerate convergence during training by providing a consistent gradient flow [ReLU_Convergence].


## Loss function (3 MARKS) <ignore>

The loss function used for each experiment was cross-entropy loss, implemented using the `nn.CrossEntropyLoss` class from PyTorch [x].

It is a widely used loss function for classification problems such as this, where the target is binomial or multinomial. Cross-entropy works on logits that have been transformed by a softmax activation into what is effectively a probability distribution across the output classes. It compares this output probability distribution to a one-hot encoded version of the class label, where the value at the index for the true class is 1 and all the others are 0. This acts as a target probability distribution, and the cross-entropy loss essentially quantifies the difference between the predicted distribution and the one-hot encoded true label distribution.

Mathematically, for a single sample with true label $y$ and predicted probabilities $\hat{y}$, the cross-entropy loss is calculated as:
$$\text{CE}(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

where $C$ is the number of classes, $y_i$ is the true label (0 or 1) for class $i$, and $\hat{y}_i$ is the predicted probability for class $i$. By minimizing the average cross-entropy loss over all training samples, the model learns to assign high probabilities to the correct class and low probabilities to the incorrect ones.
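
To make the formula concrete, the following minimal sketch evaluates it for a single sample in a toy 3-class problem. The logits are made-up values and the helper names `softmax` and `cross_entropy` are illustrative only. It shows that the loss is small when the high-probability class is the true one, and large when it is not.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot, probs):
    """CE(y, y_hat) = -sum_i y_i * log(y_hat_i); only the true class contributes."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

logits = [2.0, 0.5, -1.0]           # made-up scores for a 3-class toy problem
probs = softmax(logits)
loss_if_class0 = cross_entropy([1, 0, 0], probs)   # model favours the true class
loss_if_class2 = cross_entropy([0, 0, 1], probs)   # model favours a wrong class
print(round(loss_if_class0, 4), round(loss_if_class2, 4))
```

Because the label is one-hot, the sum collapses to the negative log of the probability assigned to the true class.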

The logarithm in the formula strongly penalizes misclassifications and encourages the model to produce well-calibrated probability estimates. Cross-entropy is a principled and effective loss function for multi-class classification, aligning with the maximum likelihood estimation objective. It drives the model to minimise the difference between its output probabilities and the true class probability, encouraging it to learn a mapping from features to the correct output probabilities.

Practically, the use of the PyTorch module removes the need for a softmax layer in the model architecture itself, as the loss function takes in the raw logits and applies the `nn.LogSoftmax()` activation function [x] and `nn.NLLLoss()` [x] (negative log-likelihood loss) in a single operation that encapsulates the above. Used as it is here in a mini-batch stochastic gradient descent context, the function also handles the averaging of the loss across the mini-batch. This averaging is important because it keeps the scale of the loss independent of the batch size and provides a stable estimate of the loss for the batch, and then across batches in the epoch.
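
The combination of log-softmax, negative log-likelihood and batch averaging can be written out by hand as below. This is a plain-Python sketch of the arithmetic under the default mean reduction, not the PyTorch implementation; the function names and toy logits are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """log(softmax(z)) computed via the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def batch_cross_entropy(batch_logits, targets):
    """Mean negative log-likelihood of the target classes over a mini-batch."""
    losses = [-log_softmax(z)[t] for z, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

batch_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]   # toy logits, 3 classes
targets = [0, 1]                                      # integer class labels
loss = batch_cross_entropy(batch_logits, targets)
print(round(loss, 4))
```

Note that the targets are integer class indices rather than one-hot vectors, which is also the form that `nn.CrossEntropyLoss` accepts.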

## Optimiser (4 MARKS) <ignore>

The optimiser used to handle parameter updates and implement gradient descent was stochastic gradient descent (SGD), implemented using the `optim.SGD` class from PyTorch.

SGD is the most straightforward and in many ways the original optimiser in artificial neural networks. Aside from pure gradient descent, calculated as an average of the gradients for the entire training dataset, it is as straightforward an approach to optimisation as one can use.

The idea is that it estimates the true gradient of the loss function using a small subset of the training data (a mini-batch) and updates the parameters with this approximate gradient, weighted by a learning rate which, in this approach, is fixed and user defined.

This process is repeated for multiple mini-batch samples taken from the training data without replacement (until the entire data set has been seen - representing an 'epoch' of training) and then repeated until a stopping criterion is met - in this case a set number of epochs.

Mathematically, the estimated gradient for a mini-batch of size $B$ sampled from the training data is computed as:
$$\nabla_\theta L(\theta_t) \approx \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta L(\theta_t; x_i, y_i)$$
where $(x_i, y_i)$ represents the $i$-th example in the mini-batch.
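
The estimate above feeds directly into the update $\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$, where $\alpha$ is the learning rate. As a small, hedged illustration, the sketch below runs this update on a one-parameter toy model $y = wx$ with a squared-error loss. The data, the learning rate of 0.1 and the function names are all hypothetical and unrelated to the CNN used in the report.

```python
import random

def minibatch_gradient(w, batch):
    """Average gradient of the squared error 0.5 * (w*x - y)**2 over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def sgd_step(w, batch, lr):
    """One plain SGD update: w <- w - lr * (mini-batch gradient estimate)."""
    return w - lr * minibatch_gradient(w, batch)

# toy data generated by y = 3x, so the ideal weight is 3.0
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
rng = random.Random(0)

w = 0.0
for step in range(200):
    batch = rng.sample(data, 2)        # mini-batch of size B = 2
    w = sgd_step(w, batch, lr=0.1)
print(round(w, 3))
```

Each step only sees two of the four samples, yet the weight still moves towards the value that minimises the loss on the full dataset.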

There are a number of more sophisticated optimisers available when training NNs today, not least the 'Adam' (Adaptive Moment Estimation) optimiser [x], which is near ubiquitous, recommended for most cases, and is "one of the more robust and effective optimization algorithms to use in deep learning" [x]. These approaches, by incorporating properties such as the 'momentum' of the gradient as well as adaptive learning rates, have been shown to allow for a smoother and more direct journey through parameter space to the minimum loss. However, one of the objectives was to explore the effect of the learning rate on performance, and with SGD the parameters are updated based only on the gradient and the learning rate. With this very direct formulation it is easier to understand and interpret the impact of the learning rate on the model's performance: the learning rate has a clear and direct influence on the step size of the parameter updates, making it straightforward to study its effect. SGD is highly sensitive to the choice of learning rate, and this sensitivity is precisely what makes it suitable for studying the impact of the learning rate on model performance. By varying the learning rate and observing the corresponding changes in model behaviour, insight can be gained into the optimal learning rate range and its effect on convergence speed and generalisation. The absence of adaptive learning rates in SGD in particular ensures that the learning rate remains consistent throughout the training process.
This consistency allows for a clearer analysis of the relationship between the learning rate and model performance, without the confounding effects of adaptive learning rate schedules.

An interesting further development would be to introduce momentum and compare performance, then try Adagrad and then Adam. However, as the aim was just to explore the learning rate, it was felt better to keep the optimiser algorithm as simple as possible.


## Experiments <ignore>
### Experiment 1 (8 MARKS)

#### 1.1 - Learning Rates

Initial exploratory trials of single training runs for learning rates chosen between $0.5$ and $10^{-6}$ established the extremes of model behaviour with respect to the learning rate (no learning at a learning rate of $10^{-4}$ or below, unstable learning at $0.2$ or above). These investigations suggested a sensible range from which to select the 5 learning rates to compare, which were $0.1, 0.075, 0.05, 0.025 \text{ and } 0.01$.

The data was loaded and split as described above. A batch size of 64 was used and training was run for 50 epochs for each trial. The inbuilt PyTorch SGD optimiser was used, along with the `CrossEntropyLoss` criterion.

For each learning rate, 5 trials were conducted. That is, 5 different models were instantiated, trained and evaluated for each learning rate in order to get a true sense of performance. The data for each trial was collated into a dictionary and stored in JSON format so that all data was accessible for visualisation.

Individual trials were run as follows. A random seed was set before each model was initialised, with the same 5 seeds used across all learning rates to ensure a fair comparison, as the models for a given seed all started with the same initial weights.
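
The bookkeeping for these trials might look something like the sketch below. It is only a schematic: `run_trial` is a hypothetical stand-in that returns fake numbers instead of training a network, and the seed values are assumptions, since the report does not list them. The point is the structure, a dictionary keyed by learning rate and seed that serialises cleanly to JSON.

```python
import json
import random

def run_trial(learning_rate, seed):
    """Hypothetical stand-in for training one model; returns fake per-epoch losses."""
    rng = random.Random(seed)      # the seed alone fixes the starting point
    return [round(rng.random(), 4) for _ in range(3)]

learning_rates = [0.1, 0.075, 0.05, 0.025, 0.01]
seeds = [0, 1, 2, 3, 4]            # assumed seed values, reused for every learning rate

# nested dictionary: learning rate -> seed -> recorded metrics
results = {
    str(lr): {str(seed): run_trial(lr, seed) for seed in seeds}
    for lr in learning_rates
}

serialised = json.dumps(results)   # text that could be written to a .json file
restored = json.loads(serialised)
print(len(restored), len(restored["0.1"]))   # 5 5
```

Keys are converted to strings because JSON object keys must be strings, which keeps the stored and reloaded dictionaries identical.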

Models were trained by mini-batch stochastic gradient descent as described above. During training, each batch was scored in terms of loss and accuracy. Batch scores were averaged across the epoch to give the training loss and accuracy for that epoch. After training for each epoch, the model was then taken out of training mode, with gradient computation halted, and the validation set was iterated through in batches, with the validation batch losses and accuracies again being averaged to give a validation loss and accuracy for the epoch. Smoothing was applied where validation volatility made results hard to interpret.

Accumulating these measures across epochs rather than batches is a somewhat arbitrary, although conventional, approach. It is a convenient way to keep track of how many times the model has been exposed to all of the training data and is easy to understand when plotting performance graphs.

By running this experiment with different learning rates and multiple random seeds, a reasonable understanding of how the learning rate affects the model's performance can be obtained.

The averaged results provide insights into the model's training progress and generalization ability.

#### 1.2 - Learning Rate Scheduler

Having established the performance of different learning rates above, it was clear that the model could tolerate a relatively high initial learning rate, but that this needed to drop significantly and arrive at or beneath 0.02 by the end of the 50 epochs to ensure a more fine-grained exploration of the loss landscape in later stages.

A number of approaches to learning rate scheduling are available, but it made sense to try some simple approaches first. A simple step decay, an exponential decay and an inverse time decay were therefore plotted and trialled to establish which achieved the desired shape of decay, and with what decay rate or schedule. The best fit was found to be an inverse time function with a decay rate of 0.25:

$\alpha_t = \frac{\alpha_0}{1 + kt}$

Where: $\alpha_t$ is the learning rate at time step $t$, $\alpha_0$ is the initial learning rate, $k$ is the decay rate, $t$ is the current time step or iteration.
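
The schedule is easy to tabulate. The sketch below evaluates the formula once per epoch for 50 epochs. The initial learning rate of 0.1 and the choice of the epoch as the time step $t$ are assumptions made for illustration, since neither is fixed by the text above; only the decay rate $k = 0.25$ is taken from it.

```python
def inverse_time_decay(initial_lr, decay_rate, t):
    """alpha_t = alpha_0 / (1 + k * t)."""
    return initial_lr / (1 + decay_rate * t)

initial_lr = 0.1     # assumed starting value, not stated in the report
decay_rate = 0.25    # k, as reported above
n_epochs = 50

# one learning rate per epoch, with t counted from 0
schedule = [inverse_time_decay(initial_lr, decay_rate, epoch) for epoch in range(n_epochs)]
print(schedule[0], round(schedule[4], 4), round(schedule[-1], 4))
```

Under these assumptions the rate halves after 4 epochs and ends below the 0.02 target. In PyTorch, a multiplier of the form `1 / (1 + k * epoch)` could be supplied to `torch.optim.lr_scheduler.LambdaLR` to obtain a similar effect.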

This was used as a function for modifying the learning rate during training, and the training and validation performance were gathered as in experiment 1.1 across 5 trials.

The performance of these models trained with the learning rate scheduler was also assessed on the test data, and their performance compared to that of the non-learning rate scheduled models on the test data. 

### Experiment 2 (8 MARKS) <ignore>

#### 2.1 - Dropout Rates

For this experiment, in accordance with the assignment brief, the original training data was re-split into 2 halves to create a new dataset for training. 

A set of dropout rates for experimentation was defined: $0, 0.2, 0.4, 0.6 \text{ and } 0.8$.

As with the previous method for training and evaluating with varying learning rates, 5 trials for each dropout rate were carried out with models initialised with consistent seeding, then trained with all other hyperparameters fixed and results recorded as above.
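
For readers unfamiliar with what the dropout rate controls, the following sketch implements 'inverted' dropout on a list of activations: each unit is zeroed with probability $p$ during training and the survivors are scaled by $1/(1-p)$, while at evaluation time the layer does nothing. This is a plain-Python illustration with hypothetical names and toy inputs, not the `DropoutNet` code.

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each unit with probability p and rescale survivors."""
    if not training or p == 0.0:
        return list(activations)            # identity at evaluation time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10_000                       # toy activations, all equal to 1
dropped = dropout(acts, p=0.4, training=True, rng=rng)

fraction_zeroed = dropped.count(0.0) / len(dropped)   # should be close to p
mean_after = sum(dropped) / len(dropped)              # should stay close to 1
print(round(fraction_zeroed, 2), round(mean_after, 2))
```

The rescaling keeps the expected activation unchanged, so no adjustment is needed when dropout is switched off for validation and testing.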

The experiment's results can help show the effect of dropout regularisation on model performance, and determine the optimal dropout rate for the DropoutNet model on this specific classification task.

It allows for the comparison of different dropout rates and their effect on reducing overfitting and improving generalization. The results can guide the selection of an appropriate dropout rate that achieves the best balance between training performance and generalization.

#### 2.2 - Dropout and Transfer Learning

The second part of this experiment aims to investigate the performance of dropout regularization in the context of transfer learning.

It compares the performance of a number of models, both during training and on the withheld test set.

The performance of the best performing model from experiment 1 (which had no opportunity for further training on the new data split) was to be compared with:
*i)* a model trained on the original data without dropout which was partially retrained on the new data
*ii)* a model trained on the original data *with* dropout which was partially retrained on the new data.

In both of the latter cases the partial retraining amounted to transfer learning, where trained models had some weights 'frozen' (kept fixed) whilst others were reinitialised and made trainable on the new data.

Performance for all models was compared during training and validation as well as on the test dataset.

Transfer learning was implemented as follows.

Two models, one with dropout and one without, were initialised and trained as in previous experiments, iterating over 5 random seeds, on the training data for 50 epochs, with performance monitored across the training and validation sets.

The final instance of each model was saved to disk so that the trained model's weights were stored, along with a record of its initial performance.

The validation and training datasets were then swapped, the models were loaded and all of their layers were frozen except their fully connected layers which were manually re-initialised, meaning they were subject to training. 

These two models were then trained as in previous experiments, 5 times each by iterating over the random seeds, but this time they were trained and validated on the swapped data.

These two models had effectively been trained twice on slightly differently distributed datasets, and so their final layers had been trained on the new data. 

Their performance on this new dataset as well as on the test set was plotted and stored. 

By conducting this experiment, the performance of the models with and without dropout regularization can be compared in a transfer learning scenario. The use of swapped train and validation data allows for evaluating the models' ability to generalize to a different data distribution.

The averaged results and smoothed plots provide insights into how the pretrained models with and without dropout perform when fine-tuned on the swapped data. The test results on the original test dataloader assess the models' performance on unseen data.

The experiment's results can help determine the effectiveness of dropout regularization in transfer learning and whether the pretrained model with dropout outperforms the model without dropout in this specific scenario. It provides valuable information on the models' ability to adapt to new data distributions and generalize well.


### Experiment 3 (8 MARKS) <ignore>

This experiment investigates the gradient flow in three different neural network models: BaselineNet (without regularization), DropoutNet (with dropout regularization), and BatchNormNet (with batch normalization). The goal was to analyze and compare the mean and standard deviation of the gradients in the first 5 epochs and the last 5 epochs of training for each model.

This was done by extracting the raw gradient values for each layer from the model during training for the first 5 training steps, and the last 5 training steps. 

Conveniently, PyTorch makes these values accessible on the model: after the backward pass, each parameter's gradient is stored in its `.grad` attribute.
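
Once the gradient values have been collected, the per-layer mean and standard deviation are straightforward to compute. The sketch below uses plain lists of made-up numbers in place of flattened gradient tensors so that it runs without PyTorch. The layer names and the function `gradient_statistics` are hypothetical, and absolute values are taken so that positive and negative gradients do not cancel.

```python
import statistics

def gradient_statistics(gradients_by_layer):
    """Mean and standard deviation of absolute gradient values for each layer."""
    stats = {}
    for layer, grads in gradients_by_layer.items():
        abs_grads = [abs(g) for g in grads]
        stats[layer] = (statistics.mean(abs_grads), statistics.pstdev(abs_grads))
    return stats

# made-up gradient values standing in for flattened `.grad` tensors
toy_gradients = {
    "conv1.weight": [0.02, -0.01, 0.03, -0.04],
    "fc2.weight": [0.5, -0.3, 0.1, -0.7],
}
stats = gradient_statistics(toy_gradients)
for layer, (mean_abs, std_abs) in stats.items():
    print(f"{layer}: mean={mean_abs:.4f} std={std_abs:.4f}")
```

In a real run the dictionary would be filled by looping over `model.named_parameters()` after each backward pass and copying out each parameter's gradient.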

#### 3.1 - Baseline Model
This process was carried out for the baseline model.

#### 3.2 - Dropout Model
This process was carried out for the dropout model.

#### 3.3 - Batch Normalisation Model
For this experiment it was necessary to implement batch normalisation.
The process was then carried out for the batch norm model.
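
As a reminder of what batch normalisation does to a layer's activations during training, the sketch below normalises one feature across a toy mini-batch to zero mean and (approximately) unit variance, then applies a scale `gamma` and shift `beta`. It is a simplified, dependency-free illustration with hypothetical names. It omits the running statistics used at evaluation time, and in a real layer `gamma` and `beta` would be learned.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across the mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)   # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

activations = [2.0, 4.0, 6.0, 8.0]      # one feature over a toy batch of 4
normalised = batch_norm(activations)

# check the result: mean should be ~0 and variance ~1
out_mean = sum(normalised) / len(normalised)
out_var = sum((v - out_mean) ** 2 for v in normalised) / len(normalised)
print(round(out_mean, 6), round(out_var, 4))
```

The small constant `eps` guards against division by zero when a feature has no variance within the batch.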

```
The experiment follows these steps:

1. Set the number of epochs to 30 and the learning rate to 0.05.

2. For the BaselineNet model:
   - Initialize the model and set the random seed to 1984 for reproducibility.
   - Define the loss function (CrossEntropyLoss) and optimizer (SGD) with the specified learning rate.
   - Collect the absolute gradients for the first 5 epochs and the last 5 epochs using the `collect_gradients_abs` function.
   - Compute the mean and standard deviation of the absolute gradients for the first 5 epochs and the last 5 epochs using the `compute_gradient_statistics_abs` function.
   - Plot the mean and standard deviation of the absolute gradients for the first 5 epochs and the last 5 epochs using the `plot_gradient_statistics_abs` function.

3. Repeat step 2 for the DropoutNet model with a dropout rate of 0.6.

4. Repeat step 2 for the BatchNormNet model.

The `collect_gradients_abs` function collects the absolute gradients for the specified epochs during training. It iterates over the batches in the `train_dataloader` and performs the forward pass, loss computation, and backward pass. The absolute gradients of each layer are collected for the first 5 batches of the first epoch and the last 5 batches of the last epoch.

The `compute_gradient_statistics_abs` function computes the mean and standard deviation of the absolute gradients for each layer based on the collected gradients.

The `plot_gradient_statistics_abs` function visualizes the mean and standard deviation of the absolute gradients for each layer using a bar plot. It creates a figure with two subplots: one for the mean gradients and one for the standard deviations. The x-axis represents the layers, and the y-axis represents the mean or standard deviation values. The bars are grouped by the first 5 epochs and the last 5 epochs for comparison.

By running this experiment, you can observe and compare the gradient flow in the BaselineNet, DropoutNet, and BatchNormNet models. The plots will show the mean and standard deviation of the absolute gradients for each layer in the first 5 epochs and the last 5 epochs. This analysis can provide insights into how the gradients evolve during training and how different regularization techniques (dropout and batch normalization) affect the gradient flow compared to the baseline model.

The results can help understand the impact of regularization on the gradient magnitudes and the stability of the gradient flow throughout the training process.
```
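The per-layer statistics described above can be illustrated on a toy model. This is a minimal sketch, not the report's `collect_gradients_abs` or `compute_gradient_statistics_abs` implementation: the helper name `layer_abs_gradient_stats`, the toy model and the random data are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


def layer_abs_gradient_stats(model):
    """Return {parameter_name: (mean, std)} of |grad| for each parameter with a gradient.

    Hypothetical helper for illustration only.
    """
    stats = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            continue  # parameter took no part in the backward pass
        abs_grad = param.grad.detach().abs()
        stats[name] = (abs_grad.mean().item(), abs_grad.std().item())
    return stats


torch.manual_seed(0)

# toy stand-in for a real network (assumption: two small linear layers)
toy_model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
inputs = torch.randn(16, 8)            # random batch of 16 examples
targets = torch.randint(0, 2, (16,))   # random class labels in {0, 1}

# one forward and backward pass populates .grad on every parameter
loss = nn.CrossEntropyLoss()(toy_model(inputs), targets)
loss.backward()

grad_stats = layer_abs_gradient_stats(toy_model)
for name, (mean, std) in grad_stats.items():
    print(f"{name}: mean |grad| = {mean:.6f}, std = {std:.6f}")
```

In a real run, the same statistics would be gathered after `loss.backward()` for the chosen batches or epochs and then aggregated per layer before plotting.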
# 3.4 Batch norm on performance

The batch norm model was trained as before: 5 instances were trained and their results averaged, and performance was assessed on the training, validation and test sets in comparison with the other models.
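Averaging over 5 trained instances amounts to taking a per-epoch mean across runs. The sketch below uses made-up accuracy values (not results from this report), and the standard deviation is an extra that the experiment code does not compute; it is shown only as one possible way to report the spread between seeds.

```python
from statistics import mean, stdev

# made-up validation accuracies: 5 runs (one per seed) x 3 epochs
val_accuracies_by_run = [
    [0.40, 0.55, 0.62],
    [0.38, 0.57, 0.60],
    [0.42, 0.53, 0.64],
    [0.41, 0.56, 0.61],
    [0.39, 0.54, 0.63],
]

# zip(*runs) regroups the values so that each tuple holds one epoch across all runs
mean_per_epoch = [mean(epoch_values) for epoch_values in zip(*val_accuracies_by_run)]
std_per_epoch = [stdev(epoch_values) for epoch_values in zip(*val_accuracies_by_run)]

for epoch, (m, s) in enumerate(zip(mean_per_epoch, std_per_epoch), start=1):
    print(f"epoch {epoch}: mean acc = {m:.3f}, std across seeds = {s:.3f}")
```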




In [1]:
############################################
### Code for building the baseline model ###
############################################

# relevant imports

import torch
import torch.nn as nn
import torch.nn.functional as F # as per convention

class BaselineNet(nn.Module):
    def __init__(self):
        super().__init__()
        # max pool layers - separate instances are not strictly needed, but they make it easier to refer to the diagram
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)

        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)

        self.fc1 = nn.Linear(in_features=64 * 4 * 4, out_features=64)
        self.fc2 = nn.Linear(in_features=64, out_features=10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        x = F.relu(self.conv3(x))
        x = self.pool3(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
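
The `in_features=64 * 4 * 4` of `fc1` follows from the spatial size of the feature maps after three convolution and pooling stages. The sketch below traces that size with the standard output-size formula, assuming 32x32 CIFAR-10 inputs; the helper names are hypothetical and not part of the model code.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


def pool_output_size(size, kernel_size=2, stride=2):
    """Spatial output size of a max-pool layer with no padding."""
    return (size - kernel_size) // stride + 1


size = 32  # CIFAR-10 images are 32x32 pixels
for out_channels in (16, 32, 64):
    # 3x3 convolution with stride 1 and padding 1 keeps the spatial size
    size = conv_output_size(size, kernel_size=3, stride=1, padding=1)
    # 2x2 max pooling with stride 2 halves it
    size = pool_output_size(size)
    print(f"after conv ({out_channels} channels) + pool: {size}x{size}")

flattened_features = 64 * size * size
print("features into fc1:", flattened_features)
```

The final spatial size is 4x4 with 64 channels, which gives the 1024 input features used by `fc1`.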

# Results (instructions) - 55 MARKS <ignore>
Use the Results section to summarise your findings from the experiments. For each experiment, use the Markdown/text cell to describe and explain your results, and use the code cell (and additional code cells if necessary) to conduct the experiment and produce figures to show your results.

### Experiment 1 (17 MARKS) <ignore>

1.
<figure><center><img src="./figs/e1/lrchaos.png" width=200><figcaption> BLURB. </figcaption></center></figure>
<figure><center><img src="./figs/e1/tranval_no_learn.png" width=200><figcaption> 3. </figcaption></center></figure>

<figure><center><img src="./figs/e1//lr1.png" width=200><figcaption> 1. </figcaption></center></figure>
<figure><center><img src="./figs/e1/lr2.png" width=200><figcaption> BLURB. </figcaption></center></figure>
<figure><center><img src="./figs/e1/lr3.png" width=200><figcaption> BLURB. </figcaption></center></figure>
<figure><center><img src="./figs/e1/lr4.png" width=200><figcaption> BLURB. </figcaption></center></figure>
<figure><center><img src="./figs/e1/lr5.png" width=200><figcaption> BLURB. </figcaption></center></figure>

<figure><center><img src="./figs/e1/smoothed loss accuracy.png" width=200><figcaption> BLURB. </figcaption></center></figure>
<figure><center><img src="./figs/e1/leraning rates test performance.PNG" width=200><figcaption> BLURB. </figcaption></center></figure>
2.

<figure><center><img src="./figs/e1/lr_scheculer experiments.png" width=200><figcaption> BLURB. </figcaption></center></figure>
<figure><center><img src="./figs/e1/lr decay comparison.PNG" width=100><figcaption> BLURB. </figcaption></center></figure>

<figure><center><img src="./figs/e1/LR SCHEDULER final results.png" width=200><figcaption> WITH SCHEDULER. </figcaption></center></figure>
<figure><center><img src="./figs/e1/results accuracy camparison lr and scheduler.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>


In [2]:

#############################
### code for Experiment 1 ###
#############################

# UTIL functions that are used here and in all other experiments are included at the bottom of this cell. 
# This choice was made so the experiment code came first to help with readability
# it does mean some function calls show as undefined


# imports 
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split
import torch.optim as optim
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import math

# use GPU where available
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")


# EXPERIMENT 1.1 ------------- Learning Rates -------------

# DATA LOADING AND SPLITTING

# set seed for data split
torch.manual_seed(0)

# create transform object so conversion to Tensor and normalising carried out on data download (functionality as part of torchvision.datasets method)
transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# get the data - 'train' boolean specifies whether to get training or test data
train_data = torchvision.datasets.CIFAR10(root='./data', train=True, transform=transform, download=True)
test_data = torchvision.datasets.CIFAR10(root='./data', train=False, transform=transform)

# set value for validation split (10% validation)
num_validation_samples = 5000
num_train_samples = len(train_data) - num_validation_samples

# split training data
train_data, val_data = random_split(train_data, [num_train_samples, num_validation_samples])

# confirm split number
print(len(train_data)) # 45000 training egs
print(len(val_data)) # 5000 validation egs
print(len(test_data)) # 10000 test egs

# set batch size and initialise dataloaders for the different datasets
batch_size = 64
train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False)


# RUNNING TRAINING AND VALIDATION

num_epochs = 50
random_seeds = list(range(1, 6))

learning_rates_for_experiment = [0.1, 0.075, 0.05, 0.025, 0.01]
# initialise dictionary for storing data for saving to JSON
averaged_results = {lr:{} for lr in learning_rates_for_experiment}
path_to_save = f'./run_data/learning_rates/FINAL.json'
path_to_load = f'./run_data/learning_rates/FINAL.json'
save_experiment = True
# iterate over learning rates to be tested
for learning_rate in learning_rates_for_experiment:
    # initialise empty lists for collecting data for each learning rates (over the 5 runs)
    epoch_train_losses_by_run = []
    epoch_val_losses_by_run = []
    epoch_train_accuracies_by_run = []
    epoch_val_accuracies_by_run = []
    test_losses = []
    test_accuracies = []
    reports = []
    
    # 5 random seeds = 5 different runs for each learning rate
    for random_seed in random_seeds:
        # set seed prior to initialising model (as used for initial weights as well as any dropout layers)
        torch.manual_seed(random_seed)
        # initialise model, criterion and optimiser
        model = BaselineNet().to(device)
        criterion = nn.CrossEntropyLoss()
        optimiser = optim.SGD(model.parameters(), lr=learning_rate)
        
        model, train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, _,_ = run_training_and_validation(model, device, learning_rate, num_epochs, criterion, optimiser, train_dataloader, val_dataloader, manual_lr_schedule=False, plot=True)
        epoch_train_losses_by_run.append(train_epoch_losses)
        epoch_val_losses_by_run.append(val_epoch_losses)
        epoch_train_accuracies_by_run.append(train_epoch_accuracy)
        epoch_val_accuracies_by_run.append(val_epoch_accuracy)
        
        test_loss, test_accuracy, report = run_testing(model, device, criterion, test_dataloader)
        test_losses.append(test_loss)
        test_accuracies.append(test_accuracy)
        reports.append(report)
    
    average_train_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_train_losses_by_run)]
    average_val_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_val_losses_by_run)]
    average_train_accuracies = [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_train_accuracies_by_run)]
    average_val_accuracies =  [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_val_accuracies_by_run)]
    
    average_test_loss = sum(test_losses)/len(test_losses)
    average_test_accuracy = sum(test_accuracies)/len(test_accuracies)
    
    averaged_results[learning_rate] = {'seeds':random_seeds,
                                       'av_train_losses': average_train_losses,
                                       'av_val_losses': average_val_losses,
                                       'av_train_acc': average_train_accuracies,
                                       'av_val_acc': average_val_accuracies,
                                       'all_train_losses':epoch_train_losses_by_run,
                                       'all_val_losses': epoch_val_losses_by_run,
                                       'all_train_accuracies': epoch_train_accuracies_by_run,
                                       'all_val_accuracies': epoch_val_accuracies_by_run,
                                       'all_test_losses':test_losses, 
                                       'all_test_accuracies':test_accuracies,
                                       'av_test_loss': average_test_loss,
                                       'av_test_accuracy':average_test_accuracy}
    plot_single_train_val_smoothed(average_train_losses,average_val_losses,average_train_accuracies,average_val_accuracies, num_epochs, smoothing_window=5, title=f'lr: {learning_rate}')

if save_experiment:
    with open(path_to_save, 'w') as file:
        json.dump(averaged_results, file, indent=4)  # 'indent' makes the output formatted and easier to read


# PLOTTING

plot_all_models_performance_from_disk(path_to_load, enforce_axis=True)
plot_performance_comparison_from_file(path_to_load, enforce_axis=True)
display_accuracy_heatmap(path_to_load)


# EXPERIMENT 1.2 ------------- LR SCHEDULER -------------

# INVESTIGATE LR DECAY

# exploring different learning_rate decay approaches and plotting them to see how the learning rate will actually behave across 50 epochs
def adjust_learning_rate(epoch, initial_lr, decay_type, decay_rate=0.1, decay_interval=10):
    if decay_type == 'inverse_time':
        new_lr = initial_lr / (1 + decay_rate * epoch)
    elif decay_type == 'exponential':
        new_lr = initial_lr * (math.e ** (-1 * decay_rate * epoch))
    elif decay_type == 'step':
        num_decays = epoch // decay_interval
        new_lr = initial_lr * (decay_rate ** num_decays)
    return new_lr

def plot_learning_rate_decay(num_epochs, initial_lr, decay_functions):
    fig, axs = plt.subplots(len(decay_functions), figsize=(8, 4 * len(decay_functions)))
    if len(decay_functions) == 1:
        axs = [axs]
    
    for i, (decay_type, decay_rate, decay_interval) in enumerate(decay_functions):
        lr_values = [adjust_learning_rate(epoch, initial_lr, decay_type, decay_rate, decay_interval) for epoch in range(num_epochs)]
        
        if decay_type == 'step':
            title = f'Decay Function: {decay_type}, Decay Rate: {decay_rate}, Decay Interval: {decay_interval}'
        else:
            title = f'Decay Function: {decay_type}, Decay Rate: {decay_rate}'
        
        axs[i].plot(range(num_epochs), lr_values)
        axs[i].set_title(title)
        axs[i].set_xlabel('Epoch')
        axs[i].set_ylabel('Learning Rate')
    
    plt.tight_layout()
    plt.show()

num_epochs = 50
initial_lr = 0.1

decay_functions = [
    ('inverse_time', 0.1, 0),
    ('inverse_time', 0.05, 0),
    ('step', 0.5, 10),
    ('step', 0.1, 5),
    ('exponential', 0.25, 0),
    ('exponential', 0.1, 0)
]

plot_learning_rate_decay(num_epochs, initial_lr, decay_functions)


# RUN TRAINING AND VALIDATION WITH LEARNING RATE DECAY

# implementing the LR decay scheduler that best fit what I wanted to happen
# creating function that will be passed in to the training function to be applied at the start of every epoch
def adjust_initial_learning_rate(optimiser, epoch, initial_lr=0.1, decay_rate=0.25):    
    new_lr = initial_lr / (1 + decay_rate *epoch)
    for param_group in optimiser.param_groups:
        param_group['lr'] = new_lr
    print('LR:',new_lr)
    return optimiser


num_epochs = 50

initial_learning_rate = 0.1
decay_rate = 0.25

random_seeds = list(range(1, 6))

averaged_results = {decay_rate:{}}
path_to_save = f'./run_data/lr_decay/final_decaying_lr_initial_lr_{initial_learning_rate}_decay_{decay_rate}.json'
path_to_load = f'./run_data/lr_decay/final_decaying_lr_initial_lr_{initial_learning_rate}_decay_{decay_rate}.json'

save_experiment = True

epoch_train_losses_by_run = []
epoch_val_losses_by_run = []
epoch_train_accuracies_by_run = []
epoch_val_accuracies_by_run = []
test_losses = []
test_accuracies = []
reports = []
    
for random_seed in random_seeds:
    print('DECAY: ', decay_rate)
    print('seed:', random_seed)
    torch.manual_seed(random_seed)

    model = BaselineNet().to(device)
    criterion = nn.CrossEntropyLoss()
    optimiser = optim.SGD(model.parameters(), lr=initial_learning_rate)

    model,train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, train_report,val_report = run_training_and_validation(model, device, initial_learning_rate, num_epochs, criterion, optimiser, train_dataloader, val_dataloader, manual_lr_schedule=True, scheduler_func=adjust_initial_learning_rate, plot=True)
    epoch_train_losses_by_run.append(train_epoch_losses)
    epoch_val_losses_by_run.append(val_epoch_losses)
    epoch_train_accuracies_by_run.append(train_epoch_accuracy)
    epoch_val_accuracies_by_run.append(val_epoch_accuracy)
    
    test_loss, test_accuracy, report = run_testing(model, device, criterion, test_dataloader)
    test_losses.append(test_loss)
    test_accuracies.append(test_accuracy)
    reports.append(report)

    
# average over all runs once every seed has finished (outside the seed loop)
average_train_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_train_losses_by_run)]
average_val_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_val_losses_by_run)]
average_train_accuracies = [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_train_accuracies_by_run)]
average_val_accuracies =  [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_val_accuracies_by_run)]
average_test_loss = sum(test_losses)/len(test_losses)
average_test_accuracy = sum(test_accuracies)/len(test_accuracies)

averaged_results[decay_rate] = {'seeds':random_seeds,
                                'av_train_losses': average_train_losses,
                                'av_val_losses': average_val_losses,
                                'av_train_acc': average_train_accuracies,
                                'av_val_acc': average_val_accuracies,
                                'all_train_losses':epoch_train_losses_by_run,
                                'all_val_losses': epoch_val_losses_by_run,
                                'all_train_accuracies': epoch_train_accuracies_by_run,
                                'all_val_accuracies': epoch_val_accuracies_by_run,
                                'all_test_losses':test_losses,
                                'all_test_accuracies':test_accuracies,
                                'av_test_loss': average_test_loss,
                                'av_test_accuracy':average_test_accuracy}

plot_single_train_val_smoothed(average_train_losses,average_val_losses,average_train_accuracies,average_val_accuracies, num_epochs, smoothing_window=3, title=f'LR: {initial_learning_rate}, DECAY: {decay_rate}')
    
if save_experiment:
    with open(path_to_save, 'w') as file:
        json.dump(averaged_results, file, indent=4)  # 'indent' makes the output formatted and easier to read

# PLOTTING
lr_decay_data = path_to_load
plot_all_models_performance_from_disk(lr_decay_data, enforce_axis=True)
plot_performance_comparison_from_file(lr_decay_data, enforce_axis=True)
display_accuracy_heatmap(lr_decay_data)


# ---------UTILITY FUNCTIONS USED ACROSS ALL EXPERIMENTS---------

# These functions comprised a utils.py file during development

# MODEL RELATED (click to expand) :
def run_training_and_validation(model, device, initial_lr, num_epochs, criterion, optimiser, train_dataloader, val_dataloader, metrics = False, manual_lr_schedule = False, scheduler_func=None, plot = False):

    # key function which performs training and validation of a model for params and data. 
    
    # returns all of the data gathered from the training and validation run organised by epoch. Optional params added during development to accommodate different experiments (eg lr_scheduling)
    
    # optional metrics and plot parameters allow for plotting as well as generation of classification report used for analysis of results
    
    # when plotting, includes a call to plot_single_train_val_smoothed() util function defined below
    # when training includes a call to the get_accuracy() function below

    train_epoch_losses = []
    train_epoch_accuracy = []
    val_epoch_losses = []
    val_epoch_accuracy = []
    
    for epoch in range(num_epochs):
        train_running_batch_losses = []
        train_running_batch_accuracy = []
        
        if epoch == num_epochs-1:
            train_all_preds = []
            train_all_labels = []
            val_all_preds = []
            val_all_labels = []
        
        if manual_lr_schedule:
            optimiser = scheduler_func(optimiser, epoch, initial_lr)

        model.train()
        for i, (images, labels) in enumerate(train_dataloader):
            images = images.to(device)
            labels = labels.to(device)
            
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            
            accuracy = get_accuracy(outputs, labels)
            
            loss.backward()
            optimiser.step()
            optimiser.zero_grad()

            train_running_batch_losses.append(loss.item())
            train_running_batch_accuracy.append(accuracy)
            # if i % 50 == 0:
            #   training_progress_bar.set_description(f'Training Epoch [{epoch+1}/{num_epochs}], Step [{i}/{len(train_dataloader)}], Loss: {loss.item():.4f}, Acc: {accuracy:.4f}')
            
            if epoch == num_epochs-1:
                _, preds = torch.max(outputs, 1)
                train_all_preds.extend(preds.cpu().numpy())  # Move predictions to CPU and convert to numpy for sklearn
                train_all_labels.extend(labels.cpu().numpy())  # Move labels to CPU and convert to numpy

        train_epoch_losses.append(sum(train_running_batch_losses)/len(train_running_batch_losses))
        train_epoch_accuracy.append(sum(train_running_batch_accuracy)/len(train_running_batch_accuracy))
        model.eval()
        with torch.no_grad():
            val_running_batch_losses = []
            val_running_batch_accuracy = []

            for i, (images, labels) in enumerate(val_dataloader):
                images = images.to(device)
                labels = labels.to(device)
                
                outputs = model(images)
                loss = criterion(outputs, labels)
                
                accuracy = get_accuracy(outputs, labels)

                val_running_batch_losses.append(loss.item())
                val_running_batch_accuracy.append(accuracy)
                # if i % 20 == 0:
                #   val_progress_bar.set_description(f'Validation Epoch [{epoch+1}/{num_epochs}], Step [{i}/{len(val_dataloader)}], Loss: {loss.item():.4f}, Acc: {accuracy:.4f}')
                
                if epoch == num_epochs-1:
                    _, preds = torch.max(outputs, 1)
                    val_all_preds.extend(preds.cpu().numpy())  # Move predictions to CPU and convert to numpy for sklearn
                    val_all_labels.extend(labels.cpu().numpy())  # Move labels to CPU and convert to numpy

            val_epoch_losses.append(sum(val_running_batch_losses)/len(val_running_batch_losses))
            val_epoch_accuracy.append(sum(val_running_batch_accuracy)/len(val_running_batch_accuracy))
            print(f'Epoch [{epoch+1}/{num_epochs}] - Train Loss: {train_epoch_losses[epoch]:.4f}, Acc: {train_epoch_accuracy[epoch]:.4f} | Val Loss: {val_epoch_losses[epoch]:.4f}, Acc: {val_epoch_accuracy[epoch]:.4f}')
            class_names = ['plane', 'car', 'bird', 'cat','deer', 'dog', 'frog', 'horse', 'ship', 'truck']
            
    if plot:
        plot_single_train_val_smoothed(train_epoch_losses, val_epoch_losses, train_epoch_accuracy, val_epoch_accuracy, num_epochs, smoothing_window=10, title=f'single run lr={initial_lr}, decay={manual_lr_schedule}')
    
    if metrics:
        train_report = classification_report(train_all_labels, train_all_preds, target_names=(class_names))
        val_report = classification_report(val_all_labels, val_all_preds, target_names=(class_names))
        # print('FINAL EPOCH TRAINING SUMMARY:')
        # print(train_report)
        # print('FINAL EPOCH VALIDATION SUMMARY:')
        # print(val_report)
        
        return (model,train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, train_report,val_report)
    else:
        return (model, train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, 0,0)

def get_accuracy(logits, targets):
    
        # key function used in all training, validation and testing runs to calculate the accuracy of predictions made by a model.
        
        # takes in logits (raw output scores from the model) and targets (actual class labels) and returns a float representing the accuracy of the predictions.

        # get the indices of the maximum value of all elements in the input tensor (which are the predicted class labels)
        _, predicted_labels = torch.max(logits, 1)
        
        # calculate the number of correctly predicted labels.
        correct_predictions = (predicted_labels == targets).sum().item()
        
        # calculate the accuracy.
        accuracy = correct_predictions / targets.size(0)
        
        return accuracy

def run_testing(model, device, criterion, test_dataloader):
    # this function was used to test trained models on the test dataset
    # it returns the loss, accuracy and classification report for analysis
    model.eval()
    with torch.no_grad():
        test_running_batch_losses = []
        test_running_batch_accuracy = []
        test_all_preds = []
        test_all_labels = []

        for i, (images, labels) in enumerate(test_dataloader):
            images = images.to(device)
            labels = labels.to(device)
            
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            accuracy = get_accuracy(outputs, labels)

            test_running_batch_losses.append(loss.item())
            test_running_batch_accuracy.append(accuracy)
            # test_progress_bar.set_description(f'testidation Epoch [{epoch+1}/{num_epochs}], Step [{i}/{len(test_dataloader)}], Loss: {loss.item():.4f}, Acc: {accuracy:.4f}')
            _, preds = torch.max(outputs, 1)
            test_all_preds.extend(preds.cpu().numpy())  # Move predictions to CPU and convert to numpy for sklearn
            test_all_labels.extend(labels.cpu().numpy())  # Move labels to CPU and convert to numpy

    test_loss = sum(test_running_batch_losses)/len(test_running_batch_losses)
    test_accuracy = sum(test_running_batch_accuracy)/len(test_running_batch_accuracy)

    print('TESTING COMPLETE!!')
    print(f'Test Loss: {test_loss:.4f}, Test Acc: {test_accuracy:.4f}')
    report = classification_report(test_all_labels, test_all_preds, target_names=(['plane', 'car', 'bird', 'cat','deer', 'dog', 'frog', 'horse', 'ship', 'truck']))
    print(report)
    return test_loss, test_accuracy, report

# PLOTTING/VISUALISING RELATED (click to expand):
def plot_single_train_val_smoothed(train_epoch_losses, val_epoch_losses, train_epoch_accuracy, val_epoch_accuracy, num_epochs, smoothing_window=5, title=None):
    # function used in many contexts to plot training and validation losses and accuracies of a single run
    # takes in the values returned from a single run of training and validation and plots them 
    # smoothing param allows for clearer picture of the progress during validation especially as it can be volatile 
    
    # convert lists to pandas Series
    train_epoch_losses_series = pd.Series(train_epoch_losses)
    val_epoch_losses_series = pd.Series(val_epoch_losses)
    train_epoch_accuracy_series = pd.Series(train_epoch_accuracy)
    val_epoch_accuracy_series = pd.Series(val_epoch_accuracy)

    # calculate moving averages using the provided smoothing window
    smooth_train_epoch_losses = train_epoch_losses_series.rolling(window=smoothing_window).mean()
    smooth_val_epoch_losses = val_epoch_losses_series.rolling(window=smoothing_window).mean()
    smooth_train_epoch_accuracy = train_epoch_accuracy_series.rolling(window=smoothing_window).mean()
    smooth_val_epoch_accuracy = val_epoch_accuracy_series.rolling(window=smoothing_window).mean()

    fig, ax = plt.subplots(1, 2, figsize=(14, 5))

    # Plot training and validation loss with moving averages
    ax[0].plot(train_epoch_losses, label='Training Loss', alpha=0.3)
    ax[0].plot(val_epoch_losses, label='Validation Loss', alpha=0.3)
    ax[0].plot(smooth_train_epoch_losses, label='Smoothed Training Loss', color='blue')
    ax[0].plot(smooth_val_epoch_losses, label='Smoothed Validation Loss', color='orange')
    ax[0].set_xlabel('Epochs')
    ax[0].set_ylabel('Loss')
    ax[0].set_title('Training and Validation Loss')
    ax[0].legend()

    # Set x-axis of the loss plot to show a tick every 10 epochs
    ax[0].set_xticks(range(0, num_epochs + 1, 10))

    # Plot training and validation accuracy with moving averages
    ax[1].plot(train_epoch_accuracy, label='Training Accuracy', alpha=0.3)
    ax[1].plot(val_epoch_accuracy, label='Validation Accuracy', alpha=0.3)
    ax[1].plot(smooth_train_epoch_accuracy, label='Smoothed Training Accuracy', color='blue')
    ax[1].plot(smooth_val_epoch_accuracy, label='Smoothed Validation Accuracy', color='orange')
    ax[1].set_xlabel('Epochs')
    ax[1].set_ylabel('Accuracy')
    ax[1].set_title('Training and Validation Accuracy')
    ax[1].legend()

    # Set x-axis of the accuracy plot to show a tick every 10 epochs
    ax[1].set_xticks(range(0, num_epochs + 1, 10))

    # Set y-axis for accuracy to range from 0 to 1 with ticks at intervals of 0.1
    ax[1].set_ylim(0, 1)
    ax[1].set_yticks([i * 0.1 for i in range(11)])
    if title:
        fig.suptitle(title, fontsize=16)

    plt.tight_layout()
    plt.show()

def display_accuracy_heatmap(path_to_load):
    # helper function for displaying best performing models in a convenient way
    with open(path_to_load, 'r') as file:
        results = json.load(file)
    
    rates = []
    av_test_losses = []
    av_test_accuracy = []
    for rate, value_dict in results.items():
        rates.append(rate)
        av_test_losses.append(value_dict['av_test_loss'])
        av_test_accuracy.append(value_dict['av_test_accuracy'])
    
    # Creating the DataFrame
    df = pd.DataFrame({
        'Average Test Loss': av_test_losses,
        'Average Test Accuracy': av_test_accuracy
    }, index=rates)
    
    # Applying conditional formatting to highlight the best value in each column
    def highlight_best(column):
        if column.name == 'Average Test Loss':
            is_best = column == column.min()
        else:
            is_best = column == column.max()
        return ['background: green' if v else '' for v in is_best]
    
    styled_df = df.style.apply(highlight_best, axis=0)
    
    return styled_df

def plot_single_model_performance(single_var_multi_run_data, title=None, enforce_axis=False):
    # function used for plotting the performance of a single value of the variable being investigated across multiple runs
    # for example during experiments 1.1 and 2.1
    
    # plots individual runs in background and a clearer average run 
    
    epochs = range(1, len(single_var_multi_run_data['av_train_losses']) + 1)
    n_runs = len(single_var_multi_run_data['all_train_losses'])
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    
    if title:
        title += f' across {n_runs} runs'
        fig.suptitle(title, fontsize=12)

    # Plot losses
    for train_loss, val_loss in zip(single_var_multi_run_data['all_train_losses'], single_var_multi_run_data['all_val_losses']):
        ax1.plot(epochs, train_loss, color='blue', alpha=0.3, linewidth=0.5, label='Individual Run Training Losses')
        ax1.plot(epochs, val_loss, color='orange', alpha=0.3, linewidth=0.5, label='Individual Run Validation Losses')
    ax1.plot(epochs, single_var_multi_run_data['av_train_losses'], color='blue', linewidth=1.2, label='Average Training Loss')
    ax1.plot(epochs, single_var_multi_run_data['av_val_losses'], color='orange', linewidth=1.2, label='Average Validation Loss')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.set_title('Losses')
    
    # Remove duplicate labels in the legend
    handles, labels = ax1.get_legend_handles_labels()
    unique_labels = ["Average Training Loss", "Average Validation Loss", "Individual Run Training Losses", "Individual Run Validation Losses"]
    unique_handles = [handles[labels.index(label)] for label in unique_labels]
    ax1.legend(unique_handles, unique_labels)

    # Plot accuracies
    for train_acc, val_acc in zip(single_var_multi_run_data['all_train_accuracies'], single_var_multi_run_data['all_val_accuracies']):
        ax2.plot(epochs, train_acc, color='blue', alpha=0.3, linewidth=0.5, label='Individual Run Training Accuracies')
        ax2.plot(epochs, val_acc, color='orange', alpha=0.3, linewidth=0.5, label='Individual Run Validation Accuracies')
    ax2.plot(epochs, single_var_multi_run_data['av_train_acc'], color='blue', linewidth=1.2, label='Average Training Accuracy')
    ax2.plot(epochs, single_var_multi_run_data['av_val_acc'], color='orange', linewidth=1.2, label='Average Validation Accuracy')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy')
    ax2.set_title('Accuracies')
    
    # Remove duplicate labels in the legend
    handles, labels = ax2.get_legend_handles_labels()
    unique_labels = ["Average Training Accuracy", "Average Validation Accuracy", "Individual Run Training Accuracies", "Individual Run Validation Accuracies"]
    unique_handles = [handles[labels.index(label)] for label in unique_labels]
    ax2.legend(unique_handles, unique_labels)
    
    if enforce_axis:
        ax1.set_ylim(0, 5)
        ax2.set_ylim(0, 1)

    plt.tight_layout()
    plt.show()    
    
def plot_all_models_performance_from_disk(path_to_load, variable_name=None, enforce_axis=False):
    with open(path_to_load, 'r') as file:
        averaged_results = json.load(file)
        
    for variable_val, data in averaged_results.items():
        plot_single_model_performance(data, title=f'Training/Validation Losses and Accuracy for {variable_name} = {variable_val}', enforce_axis=enforce_axis)

def plot_performance_comparison_from_file(path_to_load, enforce_axis=False, smooth_window=5):
    with open(path_to_load, 'r') as file:
        results = json.load(file)
    learning_rates = list(results.keys())
    num_epochs = len(results[learning_rates[0]]['av_train_losses'])

    fig_size = (12, 16)
    fig, ((ax_train_loss, ax_train_acc), (ax_val_loss, ax_val_acc),
          (ax_train_loss_smoothed, ax_train_acc_smoothed),
          (ax_val_loss_smoothed, ax_val_acc_smoothed)) = plt.subplots(4, 2, figsize=fig_size)

    plot_metrics(ax_train_loss, results, learning_rates, num_epochs, 'av_train_losses', 'Average Training Loss')
    plot_metrics(ax_train_acc, results, learning_rates, num_epochs, 'av_train_acc', 'Average Training Accuracy')
    plot_metrics(ax_val_loss, results, learning_rates, num_epochs, 'av_val_losses', 'Average Validation Loss')
    plot_metrics(ax_val_acc, results, learning_rates, num_epochs, 'av_val_acc', 'Average Validation Accuracy')
    plot_metrics(ax_train_loss_smoothed, results, learning_rates, num_epochs, 'av_train_losses', 'Smoothed Training Loss', smoothed=True, smooth_window=smooth_window)
    plot_metrics(ax_train_acc_smoothed, results, learning_rates, num_epochs, 'av_train_acc', 'Smoothed Training Accuracy', smoothed=True, smooth_window=smooth_window)
    plot_metrics(ax_val_loss_smoothed, results, learning_rates, num_epochs, 'av_val_losses', 'Smoothed Validation Loss', smoothed=True, smooth_window=smooth_window)
    plot_metrics(ax_val_acc_smoothed, results, learning_rates, num_epochs, 'av_val_acc', 'Smoothed Validation Accuracy', smoothed=True, smooth_window=smooth_window)

    if enforce_axis:
        for ax in [ax_val_acc, ax_val_loss, ax_train_acc, ax_train_loss,
                   ax_val_acc_smoothed, ax_val_loss_smoothed, ax_train_acc_smoothed, ax_train_loss_smoothed]:
            ax.set_ylim(0, 5) if 'Loss' in ax.get_ylabel() else ax.set_ylim(0, 1)

    plt.tight_layout()
    plt.show()

    if len(learning_rates) > 2:
        plot_comparative_metrics(results, learning_rates, num_epochs, 'Comparative Accuracies', 'av_train_acc', 'av_val_acc', enforce_axis)
        plot_comparative_metrics(results, learning_rates, num_epochs, 'Comparative Accuracies (Smoothed)', 'av_train_acc', 'av_val_acc', enforce_axis, smoothed=True, smooth_window=smooth_window)
    elif len(learning_rates) == 2:
        fig_acc_two, ax_acc_two = plt.subplots(figsize=(6, 4))
        fig_acc_two.suptitle('Comparative Accuracies', fontsize=12)

        for lr in learning_rates:
            ax_acc_two.plot(range(1, num_epochs + 1), results[lr]['av_val_acc'], label=f"Validation ({lr})", linestyle='-')
            ax_acc_two.plot(range(1, num_epochs + 1), results[lr]['av_train_acc'], label=f"Training ({lr})", linestyle='--')

        ax_acc_two.set_xlabel('Epoch')
        ax_acc_two.set_ylabel('Accuracy')
        ax_acc_two.set_title('Accuracy Comparison')
        ax_acc_two.legend(loc='upper right')

        if enforce_axis:
            ax_acc_two.set_ylim(0, 1)

        plt.tight_layout()
        plt.show()

        plot_comparative_metrics(results, learning_rates, num_epochs, 'Comparative Accuracies (Smoothed)', 'av_train_acc', 'av_val_acc', enforce_axis, smoothed=True, smooth_window=smooth_window)

def plot_metrics(ax, results, learning_rates, num_epochs, metric_key, title, smoothed=False, smooth_window=5):
    for lr in learning_rates:
        if smoothed:
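            metric = np.convolve(results[lr][metric_key], np.ones(smooth_window) / smooth_window, mode='valid')
            # 'valid' mode returns num_epochs - smooth_window + 1 points; centre each one on its window (1-indexed epochs)
            start_epoch = smooth_window // 2 + 1
            ax.plot(range(start_epoch, start_epoch + len(metric)), metric, label=str(lr))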
        else:
            ax.plot(range(1, num_epochs + 1), results[lr][metric_key], label=str(lr))
    ax.set_xlabel('Epoch')
    ax.set_ylabel(title)
    ax.set_title(title)
    ax.legend(title='Learning Rates', loc='lower right')

def plot_comparative_metrics(results, learning_rates, num_epochs, fig_title, train_key, val_key, enforce_axis=False, smoothed=False, smooth_window=5):
    fig, (ax_train, ax_val) = plt.subplots(1, 2, figsize=(12, 4))
    fig.suptitle(fig_title, fontsize=12)

    plot_metrics(ax_train, results, learning_rates, num_epochs, train_key, f'Training {fig_title}', smoothed, smooth_window)
    plot_metrics(ax_val, results, learning_rates, num_epochs, val_key, f'Validation {fig_title}', smoothed, smooth_window)

    if enforce_axis:
        ax_train.set_ylim(0, 1)
        ax_val.set_ylim(0, 1)

    plt.tight_layout()
    plt.show()
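
The smoothed panels above rely on a moving average over the per-epoch curves. As a minimal sketch of that idea (plain Python rather than `np.convolve`, with the hypothetical helper names `moving_average` and `centred_epochs` and made-up loss values), the snippet below shows how many points a "valid" window produces and which epoch each point is centred on:

```python
def moving_average(values, window):
    """Return the 'valid' moving average: one value per full window."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def centred_epochs(num_epochs, window):
    """1-indexed epoch at the centre of each full window (intended for odd windows)."""
    start = window // 2 + 1
    return list(range(start, start + num_epochs - window + 1))


# Illustrative loss curve (made-up values, not results from the experiments)
losses = [5.0, 4.0, 3.0, 2.0, 1.0, 1.0, 1.0]
smoothed = moving_average(losses, window=5)
epochs = centred_epochs(len(losses), window=5)

# Both lists have len(losses) - window + 1 entries, so they can be plotted together
print(list(zip(epochs, smoothed)))
```

With 7 epochs and a window of 5 only 3 smoothed points remain, which is why the smoothed curves start later and end earlier than the raw ones.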


### Experiment 2 (19 MARKS) <ignore>

2.1
<figure><center><img src="./figs/e2/dr0.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e2/dr02.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e2/dr04.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e2/dr046.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e2/dr08.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e2/overall dropout comparisons.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e2/dropout rates test results.PNG" width=200><figcaption> COMPARISON. </figcaption></center></figure>
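
To make the role of the dropout rate in these comparisons concrete, the following is a minimal pure-Python sketch of inverted dropout, the scheme `nn.Dropout` applies: during training each activation is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`, while at evaluation time the layer is an identity. The function name `inverted_dropout` and the toy activations are assumptions for illustration only, and the sketch restricts `p` to values below 1.

```python
import random


def inverted_dropout(activations, p, training, rng):
    """Zero each activation with probability p and rescale the rest (training only)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("this sketch expects p in [0, 1)")
    if not training or p == 0.0:
        # Evaluation mode (or p = 0): activations pass through unchanged
        return list(activations)
    scale = 1.0 / (1.0 - p)
    # Keep a unit with probability 1 - p, scaled so the expected value is preserved
    return [a * scale if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(0)      # fixed seed so the sketch is repeatable
acts = [1.0] * 10000        # toy activations, all equal to 1
dropped = inverted_dropout(acts, p=0.6, training=True, rng=rng)

kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"kept fraction ~ {kept_fraction:.3f}, mean after dropout ~ {mean_after:.3f}")

# In evaluation mode nothing is dropped or rescaled
assert inverted_dropout(acts[:5], p=0.6, training=False, rng=rng) == acts[:5]
```

Because the surviving units are rescaled, the expected activation passed to `fc2` stays roughly the same for every dropout rate, so the rates differ mainly in how much noise they inject during training.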



2.2
Model numbers: 0 = baseline, 1 = dropout  
Pretrained models
<figure><center><img src="./figs/e2/pretrained_0.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e2/pretrained_1.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e2/pretrained comparison.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e2/pretrained test results.PNG" width=200><figcaption> COMPARISON. </figcaption></center></figure>

Transfer learning

<figure><center><img src="./figs/e2/retrained_baseline.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e2/retrained_dropout.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e2/retrained_comparison.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e2/retrained comparison test results.PNG" width=200><figcaption> COMPARISON. </figcaption></center></figure>
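
Each curve in these figures is an average over several random seeds. A minimal sketch of that per-epoch averaging is given below; the helper name `average_curves` and the accuracy values are made up for illustration and are not results from the experiments.

```python
def average_curves(curves_by_run):
    """Average several equal-length per-epoch curves, epoch by epoch."""
    if not curves_by_run:
        raise ValueError("need at least one run")
    if len({len(curve) for curve in curves_by_run}) != 1:
        raise ValueError("all runs must cover the same number of epochs")
    # zip(*runs) groups together the values that belong to the same epoch
    return [sum(epoch_values) / len(epoch_values)
            for epoch_values in zip(*curves_by_run)]


# Made-up validation accuracies for three seeds over three epochs
val_acc_by_seed = [
    [0.30, 0.45, 0.55],
    [0.34, 0.47, 0.53],
    [0.32, 0.43, 0.57],
]
average_val_acc = average_curves(val_acc_by_seed)
print([round(a, 3) for a in average_val_acc])
```

Averaging epoch by epoch keeps the shape of the learning curve while reducing the seed-to-seed noise that is visible in the faint individual-run lines.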





In [6]:
#############################
### Code for Experiment 2 ###
#############################

# --- EXPERIMENT 2.1 - Dropout Rates ---

# DATA LOADING AND NEW SPLIT
torch.manual_seed(0)

batch_size = 64

transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_data = torchvision.datasets.CIFAR10(root='./data', train=True, transform=transform, download=True)
test_data = torchvision.datasets.CIFAR10(root='./data', train=False, transform=transform)

# half and half split
num_validation_samples = 25000
num_train_samples = len(train_data) - num_validation_samples
train_data, val_data = random_split(train_data, [num_train_samples, num_validation_samples])

print(len(train_data)) # 25000 training examples
print(len(val_data)) # 25000 validation examples
print(len(test_data)) # 10000 test examples

train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False)

# DROPOUT MODEL DEFINITION

class DropoutNet(nn.Module):
    def __init__(self, dropout_rate):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(in_features=64 * 4 * 4, out_features=64)
        self.dropout = nn.Dropout(p=dropout_rate)  # Dropout layer after the first FC layer
        self.fc2 = nn.Linear(in_features=64, out_features=10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(x)
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        x = F.relu(self.conv3(x))
        x = self.pool(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)  # Applying dropout after activation
        x = self.fc2(x)
        return x


# TRAINING WITH DIFFERENT DROPOUT RATES

num_epochs = 50
learning_rate = 0.05

random_seeds = list(range(1, 6))
dropout_rates_for_experiment = [0, 0.2, 0.4, 0.6, 0.8]

averaged_results = {dr:{} for dr in dropout_rates_for_experiment}

path_to_save = f'./run_data/dropout/C2_final_dropout_rate_compatison_lr_{learning_rate}_{num_epochs}_epochs.json'
path_to_load = f'./run_data/dropout/C2_final_dropout_rate_compatison_lr_{learning_rate}_{num_epochs}_epochs.json'
save_experiment = True


for dropout_rate in dropout_rates_for_experiment:
    print('DR: ', dropout_rate) 
    epoch_train_losses_by_run = []
    epoch_val_losses_by_run = []
    epoch_train_accuracies_by_run = []
    epoch_val_accuracies_by_run = []
    test_losses = []
    test_accuracies = []
    reports = []
    
    for random_seed in random_seeds:
        print('DR: ', dropout_rate) 
        print('seed:', random_seed)
        torch.manual_seed(random_seed)
        
        model = DropoutNet(dropout_rate).to(device)
        criterion = nn.CrossEntropyLoss()
        optimiser = optim.SGD(model.parameters(), lr=learning_rate)

        model, train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, _,_ = run_training_and_validation(model, device, learning_rate, num_epochs, criterion, optimiser, train_dataloader, val_dataloader, metrics = False, manual_lr_schedule=False, plot=True)
        epoch_train_losses_by_run.append(train_epoch_losses)
        epoch_val_losses_by_run.append(val_epoch_losses)
        epoch_train_accuracies_by_run.append(train_epoch_accuracy)
        epoch_val_accuracies_by_run.append(val_epoch_accuracy)
        
        test_loss, test_accuracy, report = run_testing(model, device, criterion, test_dataloader)
        test_losses.append(test_loss)
        test_accuracies.append(test_accuracy)
        reports.append(report)
        
    average_train_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_train_losses_by_run)]
    average_val_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_val_losses_by_run)]
    average_train_accuracies = [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_train_accuracies_by_run)]
    average_val_accuracies =  [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_val_accuracies_by_run)]
    average_test_loss = sum(test_losses)/len(test_losses)
    average_test_accuracy = sum(test_accuracies)/len(test_accuracies)
    
    averaged_results[dropout_rate] = {'seeds':random_seeds,'av_train_losses': average_train_losses,
                                       'av_val_losses': average_val_losses,
                                       'av_train_acc': average_train_accuracies,
                                       'av_val_acc': average_val_accuracies,
                                       'all_train_losses':epoch_train_losses_by_run,
                                       'all_val_losses': epoch_val_losses_by_run,
                                       'all_train_accuracies': epoch_train_accuracies_by_run,
                                       'all_val_accuracies': epoch_val_accuracies_by_run,
                                       'all_test_losses':test_losses, 
                                       'all_test_accuracies':test_accuracies,
                                       'av_test_loss': average_test_loss,
                                       'av_test_accuracy':average_test_accuracy}
    print('average for ')
    print('DR: ', dropout_rate) 
    plot_single_train_val_smoothed(average_train_losses,average_val_losses,average_train_accuracies,average_val_accuracies, num_epochs, smoothing_window=3, title=f'DROPOUT: {dropout_rate}')

if save_experiment:
    with open(path_to_save, 'w') as file:
        json.dump(averaged_results, file, indent=4)  # 'indent' makes the output formatted and easier to read
        
# PLOTTING
dropout_data = path_to_load
plot_all_models_performance_from_disk(dropout_data, variable_name='dropout rate', enforce_axis=True)
plot_performance_comparison_from_file(dropout_data, enforce_axis=True)
display_accuracy_heatmap(dropout_data)

# --- EXPERIMENT 2.2 - TRANSFER LEARNING ---

# SWAP DATASETS WITH NEW DATALOADERS

torch.manual_seed(0)

batch_size = 64

original_train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
original_val_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=True)

swapped_train_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=True)
swapped_val_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)

test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False)

# TRAINING ON ORIGINAL DATA

# Train and save the models ready for transfer learning:
# train two models (one with dropout, one without) on the ORIGINAL half-and-half split, then save a copy of each to disk
best_dropout_rate = 0.6

num_epochs = 50
learning_rate = 0.05

random_seeds = list(range(1, 6))  # flat list of seeds; torch.manual_seed expects an int, not a list


path_to_save = f'./run_data/transfer_learning/transfer_learn_original_dat_{num_epochs}_epochs_lr_{learning_rate}.json'
path_to_load = f'./run_data/transfer_learning/transfer_learn_original_dat_{num_epochs}_epochs_lr_{learning_rate}.json'

models = [0, 1]
averaged_results = {i:{} for i in models}

save_experiment = True

# train them both on the original data
for i, model in enumerate(models):
    epoch_train_losses_by_run = []
    epoch_val_losses_by_run = []
    epoch_train_accuracies_by_run = []
    epoch_val_accuracies_by_run = []
    test_losses = []
    test_accuracies = []
    reports = []
    
    for random_seed in random_seeds:
        print('MODEL: ', i) 
        print('seed:', random_seed)
        torch.manual_seed(random_seed)
        
        model = BaselineNet() if i == 0 else DropoutNet(dropout_rate=best_dropout_rate)
        model.to(device)
        
        criterion = nn.CrossEntropyLoss()
        optimiser = optim.SGD(model.parameters(), lr=learning_rate)
        
        model, train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, _,_ = run_training_and_validation(model, device, learning_rate, num_epochs, criterion, optimiser, original_train_dataloader, original_val_dataloader, metrics = False, manual_lr_schedule=False, plot=True)
        epoch_train_losses_by_run.append(train_epoch_losses)
        epoch_val_losses_by_run.append(val_epoch_losses)
        epoch_train_accuracies_by_run.append(train_epoch_accuracy)
        epoch_val_accuracies_by_run.append(val_epoch_accuracy)
        
        test_loss, test_accuracy, report = run_testing(model, device, criterion, test_dataloader)
        test_losses.append(test_loss)
        test_accuracies.append(test_accuracy)
        reports.append(report)
        
    average_train_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_train_losses_by_run)]
    average_val_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_val_losses_by_run)]
    average_train_accuracies = [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_train_accuracies_by_run)]
    average_val_accuracies =  [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_val_accuracies_by_run)]
    average_test_loss = sum(test_losses)/len(test_losses)
    average_test_accuracy = sum(test_accuracies)/len(test_accuracies)
    
    averaged_results[i] = {'seeds':random_seeds,'av_train_losses': average_train_losses,
                                       'av_val_losses': average_val_losses,
                                       'av_train_acc': average_train_accuracies,
                                       'av_val_acc': average_val_accuracies,
                                       'all_train_losses':epoch_train_losses_by_run,
                                       'all_val_losses': epoch_val_losses_by_run,
                                       'all_train_accuracies': epoch_train_accuracies_by_run,
                                       'all_val_accuracies': epoch_val_accuracies_by_run,
                                       'all_test_losses':test_losses, 
                                       'all_test_accuracies':test_accuracies,
                                       'av_test_loss': average_test_loss,
                                       'av_test_accuracy':average_test_accuracy}
    print('average for ')
    print('Model: ', i) 
    plot_single_train_val_smoothed(average_train_losses,average_val_losses,average_train_accuracies,average_val_accuracies, num_epochs, smoothing_window=3, title=f'PRETRAINING MODEL: {i}')
    
    # save last version of model to disk for retraining    
    torch.save(model, f'./models/trained_model_{i}.pth')

    
if save_experiment:
    with open(path_to_save, 'w') as file:
        json.dump(averaged_results, file, indent=4)  # 'indent' makes the output formatted and easier to read
        
# PLOTTING
pre_training_data = path_to_load
plot_all_models_performance_from_disk(pre_training_data, variable_name='model', enforce_axis=True)
plot_performance_comparison_from_file(pre_training_data, enforce_axis=True)
display_accuracy_heatmap(pre_training_data)


# PERFORM TRANSFER LEARNING
# load in the two pretrained models and then reinitialise some layers
# retrain on the SWAPPED data

num_epochs = 50
learning_rate = 0.05
random_seeds = list(range(1,6))

path_to_save = f'./run_data/transfer_learning/transfer_learning_data_{num_epochs}_epochs_lr_{learning_rate}.json'
path_to_load = f'./run_data/transfer_learning/transfer_learning_data_{num_epochs}_epochs_lr_{learning_rate}.json'

models = [0, 1]
averaged_results = {i:{} for i in models}

save_experiment = True

# train them both on the swapped train and val data - test data same
for i, model in enumerate(models):
    epoch_train_losses_by_run = []
    epoch_val_losses_by_run = []
    epoch_train_accuracies_by_run = []
    epoch_val_accuracies_by_run = []
    test_losses = []
    test_accuracies = []
    reports = []
    
    for random_seed in random_seeds:
        print('MODEL: ', i) 
        print('seed:', random_seed)
        torch.manual_seed(random_seed)
        # load the saved model and reinitialise its fully connected layers
        if i == 0:
            pretrained_model_non_dropout = torch.load('./models/trained_model_0.pth')
            pretrained_model_non_dropout.fc1 =  nn.Linear(in_features=64 * 4 * 4, out_features=64)
            pretrained_model_non_dropout.fc2 = nn.Linear(in_features=64, out_features=10)
            model = pretrained_model_non_dropout
        elif i == 1:
            pretrained_model_best_dropout = torch.load('./models/trained_model_1.pth')
            pretrained_model_best_dropout.fc1 =  nn.Linear(in_features=64 * 4 * 4, out_features=64)
            pretrained_model_best_dropout.fc2 = nn.Linear(in_features=64, out_features=10)
            model = pretrained_model_best_dropout
        model.to(device)
        criterion = nn.CrossEntropyLoss()
        optimiser = optim.SGD(model.parameters(), lr=learning_rate)
        model, train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, _,_ = run_training_and_validation(model, device, learning_rate, num_epochs, criterion, optimiser, swapped_train_dataloader, swapped_val_dataloader, metrics = False, manual_lr_schedule=False, plot=True)
        epoch_train_losses_by_run.append(train_epoch_losses)
        epoch_val_losses_by_run.append(val_epoch_losses)
        epoch_train_accuracies_by_run.append(train_epoch_accuracy)
        epoch_val_accuracies_by_run.append(val_epoch_accuracy)
        
        test_loss, test_accuracy, report = run_testing(model, device, criterion, test_dataloader)
        test_losses.append(test_loss)
        test_accuracies.append(test_accuracy)
        reports.append(report)
        
    average_train_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_train_losses_by_run)]
    average_val_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_val_losses_by_run)]
    average_train_accuracies = [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_train_accuracies_by_run)]
    average_val_accuracies =  [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_val_accuracies_by_run)]
    average_test_loss = sum(test_losses)/len(test_losses)
    average_test_accuracy = sum(test_accuracies)/len(test_accuracies)
    
    averaged_results[i] = {'seeds':random_seeds,'av_train_losses': average_train_losses,
                                       'av_val_losses': average_val_losses,
                                       'av_train_acc': average_train_accuracies,
                                       'av_val_acc': average_val_accuracies,
                                       'all_train_losses':epoch_train_losses_by_run,
                                       'all_val_losses': epoch_val_losses_by_run,
                                       'all_train_accuracies': epoch_train_accuracies_by_run,
                                       'all_val_accuracies': epoch_val_accuracies_by_run,
                                       'all_test_losses':test_losses, 
                                       'all_test_accuracies':test_accuracies,
                                       'av_test_loss': average_test_loss,
                                       'av_test_accuracy':average_test_accuracy}
    print('average for ')
    print('Model: ', i) 
    plot_single_train_val_smoothed(average_train_losses,average_val_losses,average_train_accuracies,average_val_accuracies, num_epochs, smoothing_window=3, title=f'TRANSFER LEARNING MODEL: {i}')
    


if save_experiment:
    with open(path_to_save, 'w') as file:
        json.dump(averaged_results, file, indent=4)  # 'indent' makes the output formatted and easier to read

# plotting results

transfer_learned_data = path_to_load
plot_all_models_performance_from_disk(transfer_learned_data, variable_name='model', enforce_axis=True)
plot_performance_comparison_from_file(transfer_learned_data, enforce_axis=True)
display_accuracy_heatmap(transfer_learned_data)


Files already downloaded and verified
25000
25000
10000


### Experiment 3 (19 MARKS) <ignore>

3.1
<figure><center><img src="./figs/e3/gradients baseline model.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>

3.2
<figure><center><img src="./figs/e3/gradients dropout model.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>

3.3
<figure><center><img src="./figs/e3/gradients batchnorm model (matching others).png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e3/gradients batchnorm model (not matching others).png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
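
The bars in these gradient figures summarise, for each layer, the mean and the standard deviation of the absolute gradients collected over five consecutive batches ("episodes"). The sketch below reproduces that reduction on a toy layer with two weights using only the standard library; the helper name `abs_gradient_stats` and the gradient values are assumptions for illustration. Like `torch.std` with its default settings, `statistics.stdev` uses the sample (n - 1) estimate.

```python
import statistics


def abs_gradient_stats(grads_per_episode):
    """Summarise one layer: grads_per_episode is a list of flat gradient lists, one per episode."""
    abs_grads = [[abs(g) for g in episode] for episode in grads_per_episode]
    # Regroup so that each tuple holds one weight's |gradient| across all episodes
    per_weight = list(zip(*abs_grads))
    mean_per_weight = [statistics.fmean(w) for w in per_weight]
    std_per_weight = [statistics.stdev(w) for w in per_weight]
    # A single bar per layer: average the per-weight statistics
    return statistics.fmean(mean_per_weight), statistics.fmean(std_per_weight)


# Made-up gradients for a layer with two weights over three episodes
toy_gradients = [
    [0.1, -0.2],
    [-0.3, 0.4],
    [0.2, -0.6],
]
layer_mean, layer_std = abs_gradient_stats(toy_gradients)
print(f"mean |grad| = {layer_mean:.3f}, std |grad| = {layer_std:.3f}")
```

Taking the absolute value first means that gradients of opposite sign do not cancel, so the bars reflect gradient magnitude rather than direction.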


3.4 
running and plotting a batch norm model (as defined)

<figure><center><img src="./figs/e3/batch norm performance.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>

<figure><center><img src="./figs/e3/batch norm test results.PNG" width=200><figcaption> COMPARISON. </figcaption></center></figure>

running and plotting a batch norm model (extra capacity)
<figure><center><img src="./figs/e3/nathc norm extra capacity.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
<figure><center><img src="./figs/e3/nathc norm extra capacity test results.PNG.png" width=200><figcaption> COMPARISON. </figcaption></center></figure>
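
For reference, the normalisation step that a batch norm layer applies in training mode can be sketched in a few lines. This is a simplified one-feature version in plain Python, not the model used in the experiments (whose definition is not shown in this section); the function name `batch_norm_1d` and the input values are hypothetical, and `eps=1e-5` matches the PyTorch default.

```python
import math


def batch_norm_1d(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a batch, then apply a learnable scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    # Training-mode batch norm normalises with the biased (divide by n) variance
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Made-up activations of a single feature for a batch of four examples
activations = [2.0, 4.0, 6.0, 8.0]
normalised = batch_norm_1d(activations)

out_mean = sum(normalised) / len(normalised)
out_var = sum((x - out_mean) ** 2 for x in normalised) / len(normalised)
print(f"mean ~ {out_mean:.6f}, variance ~ {out_var:.6f}")
```

Keeping each layer's inputs at roughly zero mean and unit variance is one common explanation for why gradient magnitudes in a batch norm model can differ from those of the baseline and dropout models.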


In [None]:
#############################
### Code for Experiment 3 ###
#############################

# return to original data splits

batch_size = 64

torch.manual_seed(0)

transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])


train_data = torchvision.datasets.CIFAR10(root='./data', train=True, transform=transform, download=True)
test_data = torchvision.datasets.CIFAR10(root='./data', train=False, transform=transform)

num_validation_samples = 5000
num_train_samples = len(train_data) - num_validation_samples

train_data, val_data = random_split(train_data, [num_train_samples, num_validation_samples])

print(len(train_data)) # 45000 training examples
print(len(val_data)) # 5000 validation examples
print(len(test_data)) # 10000 test examples

train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False)



# define functions for accumulating gradients

def collect_gradients_abs_4(model, dataloader, device, criterion, optimizer, num_epochs):
    first_5_episodes_gradients_abs = {name: [] for name, _ in model.named_parameters()}
    last_5_episodes_gradients_abs = {name: [] for name, _ in model.named_parameters()}

    for epoch in range(num_epochs):
        model.train().to(device)
        for batch_count, (images, labels) in enumerate(dataloader, 1):
            images = images.to(device)
            labels = labels.to(device)

            optimizer.zero_grad()

            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()

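            # Record |gradient| for every parameter in this batch ("episode")
            episode_gradients_abs = {}
            for name, param in model.named_parameters():
                if param.grad is None and param.requires_grad:
                    # Trainable parameter that received no gradient: record zeros
                    episode_gradients_abs[name] = torch.zeros_like(param.data)
                elif param.grad is not None:
                    # Store a detached copy so later optimiser steps cannot alter it
                    episode_gradients_abs[name] = torch.abs(param.grad.clone().detach())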

            if epoch == 0 and batch_count <= 5:
                for name, grad_abs in episode_gradients_abs.items():
                    first_5_episodes_gradients_abs[name].append(grad_abs)
            elif epoch == num_epochs - 1 and batch_count > len(dataloader) - 5:
                for name, grad_abs in episode_gradients_abs.items():
                    last_5_episodes_gradients_abs[name].append(grad_abs)

            optimizer.step()

    return first_5_episodes_gradients_abs, last_5_episodes_gradients_abs

def compute_gradient_statistics_abs_4(gradients_abs):
    mean_gradients_abs = {}
    std_gradients_abs = {}
    for layer_name, layer_gradients_abs in gradients_abs.items():
        layer_gradients_abs = torch.stack(layer_gradients_abs)
        mean_gradients_abs[layer_name] = torch.mean(layer_gradients_abs, dim=0)
        std_gradients_abs[layer_name] = torch.std(layer_gradients_abs, dim=0)
    return mean_gradients_abs, std_gradients_abs

def plot_gradient_statistics_abs_4(mean_gradients_first5_abs, std_gradients_first5_abs, mean_gradients_last5_abs, std_gradients_last5_abs, skip_bn=True):
    if skip_bn:
        # Filter out batch normalization layers
        layer_names = [name for name in mean_gradients_first5_abs.keys() if not name.startswith('bn')]
    else:
        layer_names = list(mean_gradients_first5_abs.keys())

    num_layers = len(layer_names)
    x = np.arange(num_layers)
    width = 0.35

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    fig.suptitle('Gradient Statistics (Absolute Means, Absolute Standard Deviations)', fontsize=16)

    # Plot mean absolute gradients (first 5 batches of the first epoch vs last 5 batches of the last epoch)
    ax1.bar(x - width/2, [torch.mean(mean_gradients_first5_abs[name]).item() for name in layer_names], width, label='First 5 Episodes')
    ax1.bar(x + width/2, [torch.mean(mean_gradients_last5_abs[name]).item() for name in layer_names], width, label='Last 5 Episodes')
    ax1.set_xticks(x)
    ax1.set_xticklabels(layer_names, rotation=45)
    ax1.set_xlabel('Layer')
    ax1.set_ylabel('Mean of Absolute Gradients')
    ax1.set_title('Mean of Absolute Gradients vs Layer')
    ax1.legend()

    # Plot standard deviations of absolute gradients (same first/last 5 episodes)
    ax2.bar(x - width/2, [torch.mean(std_gradients_first5_abs[name]).item() for name in layer_names], width, label='First 5 Episodes')
    ax2.bar(x + width/2, [torch.mean(std_gradients_last5_abs[name]).item() for name in layer_names], width, label='Last 5 Episodes')
    ax2.set_xticks(x)
    ax2.set_xticklabels(layer_names, rotation=45)
    ax2.set_xlabel('Layer')
    ax2.set_ylabel('Standard Deviation of Absolute Gradients')
    ax2.set_title('Standard Deviation of Absolute Gradients vs Layer')
    ax2.legend()

    plt.tight_layout()
    plt.show()

def plot_model_comparison(first_5_mean_gradients_non_drop, first_5_mean_gradients_dropout, first_5_mean_gradients_bn,
                          last_5_mean_gradients_non_drop, last_5_mean_gradients_dropout, last_5_mean_gradients_bn,
                          first_5_std_gradients_non_drop, first_5_std_gradients_dropout, first_5_std_gradients_bn,
                          last_5_std_gradients_non_drop, last_5_std_gradients_dropout, last_5_std_gradients_bn):
    layer_names = [name for name in first_5_mean_gradients_non_drop.keys() if not name.startswith('bn')]
    num_layers = len(layer_names)
    x = np.arange(num_layers)
    width = 0.2

    fig, axs = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Model Comparison - Gradient Statistics', fontsize=16)

    # Plot mean absolute gradients for the first 5 episodes
    axs[0, 0].bar(x - width, [torch.mean(first_5_mean_gradients_non_drop[name]).item() for name in layer_names], width, label='Non-Dropout')
    axs[0, 0].bar(x, [torch.mean(first_5_mean_gradients_dropout[name]).item() for name in layer_names], width, label='Dropout')
    axs[0, 0].bar(x + width, [torch.mean(first_5_mean_gradients_bn[name]).item() for name in layer_names], width, label='Batch Norm')
    axs[0, 0].set_xticks(x)
    axs[0, 0].set_xticklabels(layer_names, rotation=45)
    axs[0, 0].set_xlabel('Layer')
    axs[0, 0].set_ylabel('Mean of Absolute Gradients')
    axs[0, 0].set_title('First 5 Episodes - Mean of Absolute Gradients')
    axs[0, 0].legend()
    # axs[0, 0].set_ylim(0, 0.04)
    

    # Plot mean absolute gradients for the last 5 episodes
    axs[0, 1].bar(x - width, [torch.mean(last_5_mean_gradients_non_drop[name]).item() for name in layer_names], width, label='Non-Dropout')
    axs[0, 1].bar(x, [torch.mean(last_5_mean_gradients_dropout[name]).item() for name in layer_names], width, label='Dropout')
    axs[0, 1].bar(x + width, [torch.mean(last_5_mean_gradients_bn[name]).item() for name in layer_names], width, label='Batch Norm')
    axs[0, 1].set_xticks(x)
    axs[0, 1].set_xticklabels(layer_names, rotation=45)
    axs[0, 1].set_xlabel('Layer')
    axs[0, 1].set_ylabel('Mean of Absolute Gradients')
    axs[0, 1].set_title('Last 5 Episodes - Mean of Absolute Gradients')
    axs[0, 1].legend()
    # axs[0, 1].set_ylim(0, 0.2)
    

    # Plot standard deviation of absolute gradients for the first 5 episodes
    axs[1, 0].bar(x - width, [torch.mean(first_5_std_gradients_non_drop[name]).item() for name in layer_names], width, label='Non-Dropout')
    axs[1, 0].bar(x, [torch.mean(first_5_std_gradients_dropout[name]).item() for name in layer_names], width, label='Dropout')
    axs[1, 0].bar(x + width, [torch.mean(first_5_std_gradients_bn[name]).item() for name in layer_names], width, label='Batch Norm')
    axs[1, 0].set_xticks(x)
    axs[1, 0].set_xticklabels(layer_names, rotation=45)
    axs[1, 0].set_xlabel('Layer')
    axs[1, 0].set_ylabel('Standard Deviation of Absolute Gradients')
    axs[1, 0].set_title('First 5 Episodes - Standard Deviation of Absolute Gradients')
    axs[1, 0].legend()

    # Plot standard deviation of absolute gradients for the last 5 episodes
    axs[1, 1].bar(x - width, [torch.mean(last_5_std_gradients_non_drop[name]).item() for name in layer_names], width, label='Non-Dropout')
    axs[1, 1].bar(x, [torch.mean(last_5_std_gradients_dropout[name]).item() for name in layer_names], width, label='Dropout')
    axs[1, 1].bar(x + width, [torch.mean(last_5_std_gradients_bn[name]).item() for name in layer_names], width, label='Batch Norm')
    axs[1, 1].set_xticks(x)
    axs[1, 1].set_xticklabels(layer_names, rotation=45)
    axs[1, 1].set_xlabel('Layer')
    axs[1, 1].set_ylabel('Standard Deviation of Absolute Gradients')
    axs[1, 1].set_title('Last 5 Epochs - Standard Deviation of Absolute Gradients')
    axs[1, 1].legend()
    # axs[1, 1].set_ylim(0, 0.2)


    plt.tight_layout()
    plt.show()
# Set epochs and learning rate
num_epochs = 50
learning_rate = 0.05

# 3.1 Gradient flow for the original model
torch.manual_seed(1984)
non_drop_model = BaselineNet()
criterion = nn.CrossEntropyLoss()
optimiser = optim.SGD(non_drop_model.parameters(), lr=learning_rate)
first_5_epochs_gradients_abs_non_drop, last_5_epochs_gradients_abs_non_drop = collect_gradients_abs_4(non_drop_model, train_dataloader, device, criterion, optimiser, num_epochs)
first_5_mean_gradients_non_drop, first_5_std_gradients_non_drop = compute_gradient_statistics_abs_4(first_5_epochs_gradients_abs_non_drop)
last_5_mean_gradients_non_drop, last_5_std_gradients_non_drop = compute_gradient_statistics_abs_4(last_5_epochs_gradients_abs_non_drop)
plot_gradient_statistics_abs_4(first_5_mean_gradients_non_drop, first_5_std_gradients_non_drop, last_5_mean_gradients_non_drop, last_5_std_gradients_non_drop)

# 3.2 Gradient flow for the model with dropout
torch.manual_seed(1984)
drop_model = DropoutNet(0.6)
criterion = nn.CrossEntropyLoss()
optimiser = optim.SGD(drop_model.parameters(), lr=learning_rate)
first_5_epochs_gradients_abs_dropout, last_5_epochs_gradients_abs_dropout = collect_gradients_abs_4(drop_model, train_dataloader, device, criterion, optimiser, num_epochs)
first_5_mean_gradients_dropout, first_5_std_gradients_dropout = compute_gradient_statistics_abs_4(first_5_epochs_gradients_abs_dropout)
last_5_mean_gradients_dropout, last_5_std_gradients_dropout = compute_gradient_statistics_abs_4(last_5_epochs_gradients_abs_dropout)
plot_gradient_statistics_abs_4(first_5_mean_gradients_dropout, first_5_std_gradients_dropout, last_5_mean_gradients_dropout, last_5_std_gradients_dropout)

# 3.3 Gradient flow for the model with batch normalization
# Create the model with batch norm, as per the brief

class BatchNormNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.fc1 = nn.Linear(in_features=64 * 4 * 4, out_features=64)
        self.bn4 = nn.BatchNorm1d(64)
        self.fc2 = nn.Linear(in_features=64, out_features=10)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.pool(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.pool(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.pool(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.bn4(self.fc1(x)))
        x = self.fc2(x)
        return x
    
torch.manual_seed(1984)
bn_model = BatchNormNet()
criterion = nn.CrossEntropyLoss()
optimiser = optim.SGD(bn_model.parameters(), lr=learning_rate)
first_5_epochs_gradients_abs_bn, last_5_epochs_gradients_abs_bn = collect_gradients_abs_4(bn_model, train_dataloader, device, criterion, optimiser, num_epochs)
first_5_mean_gradients_bn, first_5_std_gradients_bn = compute_gradient_statistics_abs_4(first_5_epochs_gradients_abs_bn)
last_5_mean_gradients_bn, last_5_std_gradients_bn = compute_gradient_statistics_abs_4(last_5_epochs_gradients_abs_bn)
plot_gradient_statistics_abs_4(first_5_mean_gradients_bn, first_5_std_gradients_bn, last_5_mean_gradients_bn, last_5_std_gradients_bn, skip_bn=True)
plot_gradient_statistics_abs_4(first_5_mean_gradients_bn, first_5_std_gradients_bn, last_5_mean_gradients_bn, last_5_std_gradients_bn, skip_bn=False)
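
The gradient-flow plots above rely on per-layer absolute gradient statistics. As a minimal sketch of the underlying idea (not the notebook's `collect_gradients_abs_4`, which is defined earlier), the mean absolute gradient of each parameter can be read off after a single backward pass. The helper `mean_abs_gradients` and the toy model and data below are hypothetical and used for illustration only.

```python
import torch
import torch.nn as nn


def mean_abs_gradients(model, inputs, targets, criterion):
    """Run one forward/backward pass and return {parameter name: mean |gradient|}."""
    model.zero_grad()  # clear gradients left over from any previous step
    loss = criterion(model(inputs), targets)
    loss.backward()  # populate .grad on every trainable parameter
    return {
        name: param.grad.detach().abs().mean().item()
        for name, param in model.named_parameters()
        if param.grad is not None
    }


# Toy model and random data (assumptions, not the CIFAR setup used in this report)
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 3))
toy_inputs = torch.randn(16, 8)
toy_targets = torch.randint(0, 3, (16,))

toy_stats = mean_abs_gradients(toy_model, toy_inputs, toy_targets, nn.CrossEntropyLoss())
for name, value in toy_stats.items():
    print(f"{name}: {value:.6f}")
```

Collecting these values once per batch and averaging them over the first and last few epochs would give one bar per layer, comparable in spirit to the plots above.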

# 3.4 Full training of the batch norm model over several random seeds

num_epochs = 50
learning_rate = 0.05

random_seeds = list(range(1, 6))
path_to_save = f'./run_data/batch_norm/batch_norm_{num_epochs}_epochs_LR_{learning_rate}.json'
path_to_load = f'./run_data/batch_norm/batch_norm_{num_epochs}_epochs_LR_{learning_rate}.json'
averaged_results = {'bn':{}}
save_experiment = True

# Train the batch norm model on the original data, once per random seed

epoch_train_losses_by_run = []
epoch_val_losses_by_run = []
epoch_train_accuracies_by_run = []
epoch_val_accuracies_by_run = []
test_losses = []
test_accuracies = []
reports = []

for random_seed in random_seeds:
    print('seed:', random_seed)
    
    torch.manual_seed(random_seed)
    
    model = BatchNormNet()
    model.to(device)
    
    criterion = nn.CrossEntropyLoss()
    optimiser = optim.SGD(model.parameters(), lr=learning_rate)
    
    model, train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, _,_ = run_training_and_validation(model, device, learning_rate, num_epochs, criterion, optimiser, train_dataloader, val_dataloader, metrics = False, manual_lr_schedule=False, plot=True)
    epoch_train_losses_by_run.append(train_epoch_losses)
    epoch_val_losses_by_run.append(val_epoch_losses)
    epoch_train_accuracies_by_run.append(train_epoch_accuracy)
    epoch_val_accuracies_by_run.append(val_epoch_accuracy)
    
    test_loss, test_accuracy, report = run_testing(model, device, criterion, test_dataloader)
    test_losses.append(test_loss)
    test_accuracies.append(test_accuracy)
    reports.append(report)
    
average_train_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_train_losses_by_run)]
average_val_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_val_losses_by_run)]
average_train_accuracies = [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_train_accuracies_by_run)]
average_val_accuracies =  [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_val_accuracies_by_run)]
average_test_loss = sum(test_losses)/len(test_losses)
average_test_accuracy = sum(test_accuracies)/len(test_accuracies)

# Store the per-run curves and the across-seed averages for the batch norm model
averaged_results['bn'] = {
    'seeds': random_seeds,
    'av_train_losses': average_train_losses,
    'av_val_losses': average_val_losses,
    'av_train_acc': average_train_accuracies,
    'av_val_acc': average_val_accuracies,
    'all_train_losses': epoch_train_losses_by_run,
    'all_val_losses': epoch_val_losses_by_run,
    'all_train_accuracies': epoch_train_accuracies_by_run,
    'all_val_accuracies': epoch_val_accuracies_by_run,
    'all_test_losses': test_losses,
    'all_test_accuracies': test_accuracies,
    'av_test_loss': average_test_loss,
    'av_test_accuracy': average_test_accuracy,
}
print('average for batch norm model')
plot_single_train_val_smoothed(average_train_losses, average_val_losses, average_train_accuracies, average_val_accuracies, num_epochs, smoothing_window=3, title='BATCH NORM MODEL')

    
if save_experiment:
    with open(path_to_save, 'w') as file:
        json.dump(averaged_results, file, indent=4)  # 'indent' makes the output formatted and easier to read

batch_norm = 'run_data/batch_norm/batch_norm_50_epochs_LR_0.05.json'
plot_all_models_performance_from_disk(batch_norm, enforce_axis=True)
plot_performance_comparison_from_file(batch_norm, enforce_axis=True)
display_accuracy_heatmap(batch_norm)
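
The averages above report only the mean across seeds. As a hedged sketch, the spread between runs could be summarised alongside it using the sample standard deviation per epoch. The helper `per_epoch_mean_and_std` and the example values are hypothetical; in this notebook the input would be a list such as `epoch_val_accuracies_by_run`.

```python
from statistics import mean, stdev


def per_epoch_mean_and_std(runs):
    """Given one list of per-epoch values per run, return (means, stds) per epoch."""
    per_epoch = list(zip(*runs))  # regroup so each tuple holds one epoch across runs
    means = [mean(values) for values in per_epoch]
    # The sample standard deviation needs at least two runs; fall back to 0.0 otherwise
    stds = [stdev(values) if len(values) > 1 else 0.0 for values in per_epoch]
    return means, stds


# Made-up values for three runs of two epochs each (illustration only)
example_runs = [[1.0, 0.5], [3.0, 0.5], [2.0, 0.5]]
example_means, example_stds = per_epoch_mean_and_std(example_runs)
print(example_means)
print(example_stds)
```

Plotting the mean curve with a shaded band of one standard deviation either side is one way to show how sensitive the result is to the random seed.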


# Conclusions and Discussion (instructions) - 25 MARKS <ignore>
In this section, you are expected to:
* briefly summarise and describe the conclusions from your experiments (8 MARKS).
* discuss whether or not your results are expected, providing scientific reasons (8 MARKS).
* discuss two or more alternative/additional methods that may enhance your model, with scientific reasons (4 MARKS). 
* Reference two or more relevant academic publications that support your discussion. (4 MARKS)

*Write your Conclusions/Discussion here*

# References (instructions) <ignore>
Use the cell below to add your references. A good format to use for references is like this:

[AB Name], [CD Name], [EF Name] ([year]), [Article title], [Journal/Conference Name] [volume], [page numbers] or [article number] or [doi]

Some examples:

JEM Bennett, A Phillipides, T Nowotny (2021), Learning with reinforcement prediction errors in a model of the Drosophila mushroom body, Nat. Comms 12:2569, doi: 10.1038/s41467-021-22592-4

SO Kaba, AK Mondal, Y Zhang, Y Bengio, S Ravanbakhsh (2023), Proc. 40th Int. Conf. Machine Learning, 15546-15566

[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[3] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
[4] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
[6] Nguyen Huu Phong and Bernardete Ribeiro. Rethinking recurrent neural networks and other improvements for image classification. CoRR, abs/2007.15161, 2020.

[7] PyTorch Foundation. CrossEntropyLoss - PyTorch 2.3 documentation, https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html (accessed May 12, 2024).
[x] PyTorch Foundation. LogSoftmax - PyTorch 2.3 documentation, https://pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html#torch.nn.LogSoftmax (accessed May 12, 2024).
[x] PyTorch Foundation. NLLLoss - PyTorch 2.3 documentation, https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html#torch.nn.NLLLoss (accessed May 12, 2024).
[x] PyTorch Foundation. SGD - PyTorch 2.3 documentation, https://pytorch.org/docs/stable/generated/torch.optim.SGD.html (accessed May 12, 2024).
[x] PyTorch Foundation. datasets - PyTorch 2.3 documentation, https://pytorch.org/vision/0.8/datasets.html (accessed May 12, 2024).

[1] Diederik P. Kingma. Adam: A method for stochastic optimization. 201

[1] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, "Dive into Deep Learning," arXiv preprint arXiv:2106.11342, 2021. [Online]. Available: https://d2l.ai/ (accessed May 12, 2024).