# Assignment overview <ignore>
The overarching goal of this assignment is to produce a research report in which you implement, analyse, and discuss various Neural Network techniques. You will be guided through the process of producing this report, which will provide you with experience in report writing that will be useful in any research project you might be involved in later in life.

All of your report, including code and Markdown/text, ***must*** be written up in ***this*** notebook. This is not typical for research, but is solely for the purpose of this assignment. Please make sure you change the title of this file so that XXXXXX is replaced by your candidate number. You can use code cells to write code to implement, train, test, and analyse your NNs, as well as to generate figures to plot data and the results of your experiments. You can use Markdown/text cells to describe and discuss the modelling choices you make, the methods you use, and the experiments you conduct. So that we can mark your reports with greater consistency, please ***do not***:

* rearrange the sequence of cells in this notebook.
* delete any cells, including the ones explaining what you need to do.

If you want to add more code cells, for example to help organise the figures you want to show, then please add them directly after the code cells that have already been provided. 

Please provide verbose comments throughout your code so that it is easy for us to interpret what you are attempting to achieve with your code. Long comments are useful at the beginning of a block of code. Short comments, e.g. to explain the purpose of a new variable, or one of several steps in some analyses, are useful on every few lines of code, if not on every line. Please do not use the code cells for writing extensive sentences/paragraphs that should instead be in the Markdown/text cells.

# Abstract/Introduction (instructions) - 15 MARKS <ignore>
Use the next Markdown/text cell to write a short introduction to your report. This should include:
* a brief description of the topic (image classification) and of the dataset being used (CIFAR10 dataset). (2 MARKS)
* a brief description of how the CIFAR10 dataset has aided the development of neural network techniques, with examples. (3 MARKS)
* a descriptive overview of what the goal of your report is, including what you investigated. (5 MARKS)
* a summary of your major findings. (3 MARKS)
* two or more relevant references. (2 MARKS)

### Abstract
Through structured experimentation this assignment explores and demonstrates a number of fundamental properties of artificial neural networks during training and testing. 

Using a relatively simple convolutional neural network to classify images in the CIFAR-10 dataset, the effects of network and hyperparameter choices are demonstrated and analysed with a focus not on performance but on clarity of understanding.

Many of the results demonstrated the behaviours expected of the interventions employed. However, the poor performance of the model using batch normalisation was not expected, although it was ultimately understandable given the context.

### Introduction
The labelled CIFAR-10 dataset and its larger sibling CIFAR-100 have been used for benchmarking and testing in many exploratory and groundbreaking papers relating to computer vision and image classification, not least in the development of AlexNet [8], ResNet [4] and, most recently, transformer-for-vision architectures [3]. It is fitting, then, to use it to explore some of the fundamental properties of artificial neural networks (NNs) in this assignment.

The first experiment examined the effect that altering the learning rate (LR) has on training and performance. As well as experimenting with different learning rates, a LR 'scheduler' was designed and its effect on performance analysed through comparison to models with static learning rates.

The second experiment aimed to demonstrate and offer insight into the impact of introducing a dropout layer into the architecture of the network. Different dropout rates were trialled and their effects compared to baseline performance during both training and evaluation. The effect of dropout was also tested in a transfer learning context.

The third experiment focused on analysing gradient flow during back propagation in different architectures. Gradients were measured and plotted in all layers during training in the baseline model, a model with dropout, and a model with batch normalisation, with results compared to gain insight. The performance of the model training with batch normalisation was also compared to that of the other models.

Approaches and methods are introduced in the Methodology section, with results and analysis offered afterwards.

# Methodology (instructions) - 55 MARKS <ignore>
Use the next cells in this Methodology section to describe and demonstrate the details of what you did, in practice, for your research. Cite at least two academic papers that support your model choices. The overarching principle of writing the Methodology is to ***provide sufficient details for someone to replicate your model and to reproduce your results, without having to resort to your code***. You must include at least these components in the Methodology:
* Data - Describe the dataset, including how it is divided into training, validation, and test sets. Describe any pre-processing you perform on the data, and explain any advantages or disadvantages to your choice of pre-processing. 
* Architecture - Describe the architecture of your model, including all relevant hyperparameters. The architecture must include 3 convolutional layers followed by two fully connected layers. Include a figure with labels to illustrate the architecture.
* Loss function - Describe the loss function(s) you are using, and explain any advantages or disadvantages there are with respect to the classification task.
* Optimiser - Describe the optimiser(s) you are using, including its hyperparameters, and explain any advantages or disadvantages there are to using that optimiser.
* Experiments - Describe how you conducted each experiment, including any changes made to the baseline model that has already been described in the other Methodology sections. Explain the methods used for training the model and for assessing its performance on validation/test data.


## Data (7 MARKS) <ignore>

The CIFAR-10 dataset was developed as a labelled subset of the 80 million tiny images dataset [7]. It consists of 60,000 low resolution (32x32) colour images split into 50,000 training examples and 10,000 testing examples. Each image belongs to one of 10 mutually exclusive classes and is labelled accordingly. These classes describe the subject of the image ('airplane', 'cat', 'ship', etc.).

It is conveniently accessible, along with many other benchmarking datasets, via the torchvision `datasets` module, which loads the training and test data into separate `torch.utils.data.Dataset` instances; this was the method used here. 
 
As part of this process it is possible to apply transforms to the data as it is loaded, and here the data was both converted to tensors and normalised using this approach. Normalisation used a mean and standard deviation of 0.5 for each of the 3 input channels, which maps pixel values from [0, 1] to [-1, 1]. This centres the inputs around 0 on a consistent scale, so that learning is driven by the informative variation in the data rather than by the arbitrary offset of the raw pixel values.  

The training instances were split to create a validation set of 5000 samples (with a random seed set for consistency across experiments). The class distribution for each dataset was found to be well balanced (see Fig 1), meaning simple accuracy will be a reliable measure of overall performance across the classes.

<figure><center><img src="./figs/classdisttraining.png" width=200><img src="./figs/classdistval.png" width=200><img src="./figs/class dist test.png" width=200><figcaption style="max-width: 600px"> Figure 1. Class distributions across the training, validation, and testing datasets</figcaption></center></figure>

Data batching for stochastic gradient descent was handled by the `DataLoader` class, which yields samples without replacement from the shuffled dataset in batches of a size that can be specified by the user.

It was decided that a single train and validation split would be appropriate for the task at hand. Cross-validation was discounted as the benefit of a more accurate idea of the likely performance of the model, or exposure to absolutely all of the possible training data was not an important consideration here. This is because the objective is to understand the minutiae of model behaviour rather than to maximise final performance. 

## Architecture (17 MARKS) <ignore>


<figure><center><img src="./figs/baseline_model_diagram.png" width=800><figcaption style="max-width: 600px"> Fig 2. BaselineNet Convolutional Neural Network architecture. </figcaption></center></figure>

<figure><center><img src="./figs/TABLE.PNG" width=600><figcaption style="max-width: 600px"> Table 1: Convolutional Neural Network Architecture</figcaption></center></figure>

The choices for the initial architecture were based on a combination of the assignment brief, initial experimentation, and common practices in the field.

Fig 2 shows the overall architecture of the model, whilst Table 1 shows the detail of the convolutional layers. There were a number of considerations that went into these choices. 

Filter dimensions of 3x3 were chosen as they have been shown to be effective in capturing local spatial patterns while keeping the number of parameters relatively low. Indeed, VGG net demonstrated the power of stacked 3x3 filter-based convolutional layers [14], and although they were used in a much deeper network there, that network was classifying much higher resolution images.

The increasing number of filters in the convolutional layers allows the network to learn progressively more complex and abstract features as the depth increases, and was another property shown to be effective in the VGG network [14]. 

Setting the stride and padding to 1 in the convolutional layers ensured that the spatial resolution was preserved, while preventing information loss at the edges.

The max pooling layers all have a pool size of 2x2 and a stride of 2. This reduces the spatial dimensions, and therefore the number of parameters in the layers that follow, but it also provides a form of translation invariance because the exact position of a feature within the 2x2 window becomes less important; the max pooling layer only keeps the maximum activation value within each window.

A batch size of 64 was selected as a balance between computational efficiency and the ability to capture a representative sample of the dataset in each iteration.

The choice of size for the fully connected layer was a balance between the capacity requirements of the model and the number of parameters that could realistically be trained over numerous runs during the experiments. As `fc1` takes as its input the <lt>$1024$</lt> activations from the flattened output of the final convolutional block, this layer has <lt>$1024 \times d$</lt> weights, where $d$ is the dimensionality of `fc1`. The final value of <lt>$64$</lt> outputs resulted in <lt>$65,536$</lt> trainable weights (plus 64 biases), which was a good compromise.
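
The shape and parameter arithmetic described above can be checked with a short, framework-free sketch. The channel counts and layer sizes are taken from the `BaselineNet` definition later in this notebook; the helper names `conv_out` and `pool_out` are hypothetical and exist only for this illustration.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial size after max pooling with no padding."""
    return (size - kernel) // stride + 1


size = 32                     # CIFAR-10 images are 32x32
channels = [3, 16, 32, 64]    # input channels, then conv1..conv3 output channels
conv_params = 0
for c_in, c_out in zip(channels[:-1], channels[1:]):
    # each block is a 3x3 convolution (size preserved) followed by 2x2 pooling (size halved)
    size = pool_out(conv_out(size))
    conv_params += c_in * c_out * 3 * 3 + c_out   # weights + biases

flat_features = channels[-1] * size * size        # input width of fc1
fc1_weights = flat_features * 64                  # weights only, as quoted in the text
fc_params = (flat_features * 64 + 64) + (64 * 10 + 10)   # fc1 and fc2, weights + biases
total_params = conv_params + fc_params

print(size, flat_features, fc1_weights, total_params)
```

Under these assumptions the spatial size falls from 32 to 4 over the three blocks, which gives the 1024 flattened activations and 65,536 `fc1` weights quoted above.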

ReLU was chosen for the non-linear activation throughout for the same reasons it is often chosen, namely its ability to mitigate vanishing gradients: unlike sigmoid or tanh it does not saturate for positive inputs, and so its derivative does not shrink towards 0 during back propagation.


## Loss function (3 MARKS) <ignore>

The loss function used for each experiment was cross-entropy loss, implemented using the `nn.CrossEntropyLoss` class from PyTorch [9].

It is widely used in classification problems such as this, where the target variable is binomial or multinomial. 

It works by first transforming the raw logits of the output layer into what is effectively a probability distribution via the softmax activation function. Where <lt>$C$</lt> is the number of classes, its output is a $C$-dimensional vector of real numbers in the range (0, 1) that sum to 1.

To calculate the loss this distribution is compared to a one-hot encoded version of the true class label. This acts as a target probability distribution for the comparison and the cross entropy loss calculation essentially quantifies the difference via the following calculation.  

For a single sample with true label <lt>$y$</lt> and predicted probabilities <lt>$\hat{y}$</lt>, the cross-entropy loss is calculated as:
<lt>$$\text{CE}(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$</lt>

where <lt>$y_i$</lt> is the true label (0 or 1) for class <lt>$i$</lt>, and <lt>$\hat{y}_i$</lt> is the predicted probability for class <lt>$i$</lt> as given by the softmax output. 

By minimizing the average cross-entropy loss over all training samples, the model learns to assign high probabilities to the correct class and low probabilities to the incorrect ones.

Practically, the PyTorch module precludes the need for a softmax layer in the model architecture itself, as the criterion takes in the raw logits and then applies the `nn.LogSoftmax()` activation function [11] and the `nn.NLLLoss()` [12] (Negative Log-Likelihood Loss) in a single operation that encapsulates the above. 
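
As a minimal sketch of the calculation described in this section, the following plain-Python example computes the loss for a single sample in two ways: softmax followed by the cross-entropy sum against a one-hot target, and log-softmax followed by the negative log-likelihood of the true class. The logits are made-up values and the function names are hypothetical; this illustrates the arithmetic only and is not PyTorch's implementation.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    shift = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """CE(y, y_hat) = -sum_i y_i * log(y_hat_i) with a one-hot target y."""
    probs = softmax(logits)
    one_hot = [1.0 if i == true_class else 0.0 for i in range(len(logits))]
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))


def log_softmax_nll(logits, true_class):
    """Same loss via log-softmax followed by negative log-likelihood."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return -(logits[true_class] - log_sum_exp)


example_logits = [2.0, 1.0, 0.1]             # made-up logits for a 3-class problem
loss_a = cross_entropy(example_logits, true_class=0)
loss_b = log_softmax_nll(example_logits, true_class=0)
print(loss_a, loss_b)
```

The two routes give the same value because the one-hot target zeroes every term of the sum except the one for the true class.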

## Optimiser (4 MARKS) <ignore>

The optimiser used to handle parameter updates was mini-batch stochastic gradient descent (SGD), implemented using the `optim.SGD` class from PyTorch. 

SGD estimates the true gradient of the loss function with respect to the parameters of the model by calculating the gradient on a small subset of the training data (a mini-batch) and updates the parameters of the model with this approximate gradient, scaled by a LR which - in this approach - is fixed, and is a user-defined hyperparameter that can be tuned. 

This process is repeated for multiple mini-batch samples taken from the training data without replacement (until the entire data set has been seen - representing an 'epoch' of training) and then repeated until a stopping criterion is met - in this case a set number of epochs.

Mathematically, the estimated gradient for a mini-batch of size $B$ sampled from the training data is computed as:
<lt>$$\nabla_\theta L(\theta_t) \approx \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta L(\theta_t; x_i, y_i)$$</lt>
where <lt>$(x_i, y_i)$</lt> represents the <lt>$i$</lt>-th example in the mini-batch.

A number of more sophisticated optimisers are available when training NNs today; however, as performance was not the chief consideration, SGD was chosen to make analysing the impact of LR on performance straightforward and transparent. With SGD, parameters are updated directly based only on the gradient and the learning rate. Keeping to this very direct formulation makes it easier to understand and interpret the impact of the LR on the model's performance.
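
The update rule described in this section can be sketched on a toy problem. The example below fits a single slope parameter to made-up, noise-free data using mini-batches drawn without replacement in each epoch. It is an illustration under those assumptions rather than the training code used in the experiments, which relied on `optim.SGD`; the function name `sgd_fit_slope` and all values are hypothetical.

```python
import random


def sgd_fit_slope(xs, ys, lr=0.1, batch_size=7, epochs=200, seed=0):
    """Fit y = theta * x with mini-batch SGD on the loss 0.5 * (theta * x - y) ** 2."""
    rng = random.Random(seed)
    theta = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                         # new sampling order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # mini-batch estimate of the gradient: mean of the per-sample gradients
            grad = sum((theta * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            theta -= lr * grad                       # theta <- theta - lr * gradient
    return theta


xs = [i / 10 for i in range(-10, 11)]                # toy inputs in [-1, 1]
ys = [3.0 * x for x in xs]                           # toy targets with true slope 3
theta_hat = sgd_fit_slope(xs, ys)
print(theta_hat)
```

Because the only quantities involved in each step are the gradient estimate and `lr`, changing `lr` changes the step size directly, which is the transparency referred to above.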

## Experiments <ignore>
### Experiment 1 (8 MARKS)

#### 1.1 - Learning Rates
A number of exploratory training runs with LRs between <lt>$0.5$</lt> and <lt>$10^{-6}$</lt> were carried out to establish the extremes of model behaviour. A top value above which learning would be unstable, and a low value below which no learning would occur, were found ($0.15$ and $0.001$ respectively). 

These findings suggested the range from which to select the 5 LRs to compare for the experiment, which were chosen as <lt>$0.1, 0.075, 0.05, 0.025 \text{, and }0.01$</lt>.

For all trials the data was loaded and processed as described above using the model, criterion and optimiser specified. 

For each learning rate, 5 trials were conducted; that is, 5 different models were instantiated, trained and evaluated. The results for each of these trials were recorded. The 5 trials were run by iterating over a list of random seeds that was kept constant for all trials in all experiments to allow for comparison and consistency in weight initialisation, dropout activation selection and all other random processes.  

Models were trained by mini-batch stochastic gradient descent as described above. During training each batch was scored in terms of loss and accuracy, where loss was as above and accuracy was simply a count of how many images were correctly classified divided by the number of images in the batch.

Batch scores were averaged across the epoch to give the training loss and accuracy for that epoch. After each epoch of training, the model was taken out of training mode - halting gradient computations - and the validation scores were calculated by iterating through the validation data in batches. At the end of training each model was then evaluated against the test dataset. In order to obtain the 'test score' for each LR as shown in Fig 4, the average of the 5 models instantiated for that LR was taken. Test scores were calculated as validation scores were. 
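
The scoring described above can be sketched in a few lines of plain Python. The predicted and true labels here are made up, and `batch_accuracy` and `epoch_average` are hypothetical names used only for this illustration.

```python
def batch_accuracy(predicted, actual):
    """Fraction of samples in one batch whose predicted class matches the label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def epoch_average(batch_scores):
    """Unweighted mean of the per-batch scores collected during one epoch."""
    return sum(batch_scores) / len(batch_scores)


# two made-up batches of (predicted labels, true labels)
batches = [
    ([0, 1, 2, 3], [0, 1, 2, 0]),   # 3 of 4 correct
    ([5, 5, 1, 9], [5, 4, 1, 8]),   # 2 of 4 correct
]
scores = [batch_accuracy(p, a) for p, a in batches]
epoch_acc = epoch_average(scores)
print(scores, epoch_acc)
```

An unweighted mean of batch scores matches the overall accuracy exactly only when every batch has the same size; a smaller final batch is weighted slightly more heavily per sample.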

Accumulating these metrics across epochs rather than batches is a somewhat arbitrary, although conventional, approach. It is a convenient way to keep track of how many times the model has been exposed to all of the training data and is easy to understand when plotting performance graphs. 

The data for each learning rate's performance was stored in a JSON file for later plotting and analysis. 

#### 1.2 - LR Scheduler

Having established above the performance of different LRs it was clear that the model could tolerate a relatively high initial LR but that this needed to drop significantly and arrive at or beneath 0.02 by the end of the 50 epochs to ensure a more fine grained exploration of the loss landscape in later stages. 

A number of approaches to LR scheduling were explored visually as in Fig 5. below. The function and decay rate that best fit the finding of experiment 1.1 was 'inverse time decay' with a decay rate of 0.25. This function is defined as <lt>$\alpha_t = \frac{\alpha_0}{1 + kt}$</lt> where <lt>$\alpha_t$</lt> is the LR at time step <lt>$t$</lt>, <lt>$\alpha_0$</lt> is the initial learning rate, <lt>$k$</lt> is the decay rate, <lt>$t$</lt> is the current time step or iteration. 

How this function modifies the LR over the epochs can be seen in plot 1 of Fig 5.
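
A minimal sketch of the inverse time decay function defined above follows. The decay rate of 0.25 comes from the text; the initial LR of 0.1 and the choice to count epochs from 0 are assumptions made for this illustration, and `inverse_time_decay` is a hypothetical name.

```python
def inverse_time_decay(initial_lr, decay_rate, epoch):
    """alpha_t = alpha_0 / (1 + k * t), evaluated once per epoch."""
    return initial_lr / (1.0 + decay_rate * epoch)


initial_lr = 0.1     # assumed starting LR
decay_rate = 0.25    # decay rate quoted in the text
schedule = [inverse_time_decay(initial_lr, decay_rate, t) for t in range(50)]

# first epoch at which the LR is at or beneath 0.02 (small tolerance for float rounding)
first_low_epoch = next(t for t, lr in enumerate(schedule) if lr <= 0.02 + 1e-12)
print(schedule[0], schedule[-1], first_low_epoch)
```

Under these assumptions the LR reaches 0.02 roughly a third of the way through the 50 epochs and continues to shrink slowly afterwards. In PyTorch the same multiplicative factor, `1 / (1 + k * t)`, could be supplied to `torch.optim.lr_scheduler.LambdaLR`, although the text does not state which mechanism was used.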

A model was then trained as above using this LR decay function, which was applied every epoch. The results of this training were gathered and plotted, as can be seen in Fig 6.

### Experiment 2 (8 MARKS) <ignore>

#### 2.1 - Dropout Rates
For this experiment the original training data was re-split into two halves to create new training and validation datasets of 25,000 samples each, and a new model was defined incorporating dropout in the fully connected layers.

Of the two fully connected layers in the model, one is connected to the output layer and would not typically have dropout applied (as these connections are directly outputting the logits used for classification). Through flattening, the final convolutional layer is in a sense fully connected to `fc1`. However, it is not generally a good idea to apply dropout to CNN activations, as it can disrupt the spatial structure and correlation in the feature map representations. It was therefore decided to use a single dropout layer applied only to the activations of `fc1`.

The set of dropout rates for experimentation was defined as <lt>$0, 0.2, 0.4, 0.6 \text{, and }0.8$</lt>, as 0 had to be included as a no-dropout baseline and 1 would mean no activations were passed forward at all.
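
To make the effect of the dropout rate concrete, here is a framework-free sketch of inverted dropout, the variant in which surviving activations are rescaled by `1 / (1 - rate)` during training so that no rescaling is needed at evaluation time (this is also how PyTorch's `nn.Dropout` behaves). The function name `dropout` and the input values are hypothetical.

```python
import random


def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)              # evaluation mode: pass through unchanged
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
activations = [1.0] * 10000                   # made-up activations
dropped = dropout(activations, rate=0.6, training=True, rng=rng)

kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(kept_fraction, mean_activation)
```

With a rate of 0.6 roughly 40% of activations survive, each scaled up by 2.5, so the expected activation is unchanged. A rate of 1 would leave nothing to rescale, which is why it was excluded.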

The same approach as in experiment 1.1 was taken to training and validation, with 5 trials carried out for each dropout rate with models initialised with consistent seeding. 

The experiment's results (seen in Fig 9) show the effect of dropout regularisation on model performance, and Fig 11 established the optimal dropout rate for the DropoutNet model on this specific classification task, which was 0.6. 

#### 2.2 - Dropout and Transfer Learning

The second part of this experiment investigated the performance of dropout regularisation in the context of transfer learning.

It compared the performance of the best performing model from experiment 1 with:
* *i)* a model pretrained on the original data *without* dropout then retrained on the new data
* *ii)* a model pretrained on the original data *with* dropout then retrained on the new data. 

In both of the latter cases the retraining was partial and amounted to transfer learning, where pretrained models had some weights 'frozen' whilst others were reinitialised and made trainable on the new data.

Transfer learning was implemented as follows.

Two models, one with dropout and one without, were initialised and trained as in previous experiments, iterating over 5 random seeds and gathering training, validation and testing performance data. The final instance of each model was saved to disk so that the trained model's weights were stored.

The validation and training datasets were then swapped, the models were loaded and all of their layers were frozen except their fully connected layers which were manually re-initialised, meaning they were subject to training. 
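
The freezing and re-initialisation step can be sketched in PyTorch as follows. `TinyNet` is a hypothetical stand-in for the pretrained model and `prepare_for_transfer` is a hypothetical helper; the code assumes that "re-initialised" means restoring PyTorch's default initialisation via `reset_parameters()`, which the text does not confirm.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a pretrained model with conv and fc layers."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(4 * 8 * 8, 16)
        self.fc2 = nn.Linear(16, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)


def prepare_for_transfer(model):
    """Freeze every layer, then re-initialise and unfreeze the fully connected layers."""
    for param in model.parameters():
        param.requires_grad = False           # frozen: no gradient, no update
    for layer in (model.fc1, model.fc2):
        layer.reset_parameters()              # discard the pretrained fc weights
        for param in layer.parameters():
            param.requires_grad = True        # trainable on the new data
    return [p for p in model.parameters() if p.requires_grad]


model = TinyNet()                             # in practice, load the saved weights first
trainable = prepare_for_transfer(model)
optimizer = torch.optim.SGD(trainable, lr=0.05)   # illustrative LR; optimise only the unfrozen parameters
```

Passing only the trainable parameters to the optimiser is one option; passing all parameters would also work, since frozen parameters never receive gradients.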

These two models were then trained on the new, swapped data as in previous experiments. By the end of this process these models (in their different layers) had effectively been exposed to two slightly differently distributed datasets - pretrained on the original data, and then their final layers had been trained on the new data. 

Their performance during this retraining on training, validation and testing data was recorded. 

The averaged results and smoothed plots seen in Figs 14, 16, 17 and 18 provide insights into how the pretrained models with and without dropout perform when fine-tuned on the swapped data. The test results on the original test dataset assess the models' performance on unseen data and can be compared to other models. 

### Experiment 3 (8 MARKS) <ignore>

#### 3. Gradient Flow Analysis

#### 3.1, 3.2, 3.3

This experiment investigated gradient flow in different networks. It compared the previously seen models with a model using batch normalisation, in order to discover any differences in how gradients propagate through these different architectures. 

For this experiment a new model implementing batch normalisation was defined using PyTorch's built-in `nn.BatchNorm` layers. These implement a process by which the activations of a layer are normalised by subtracting the batch mean and dividing by the batch standard deviation. After normalisation, the activations are scaled and shifted using learnable parameters (`bn.weight` and `bn.bias` in Fig 23). It has been shown to enable higher learning rates and to have a regularising effect, among other benefits [1], [6]. It was applied to all except for the last layer here. 
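
The normalise-then-scale-and-shift process described above can be sketched for a single feature in plain Python. The `eps` term is an addition not mentioned in the text; it guards against division by zero, and `1e-5` is the default used by PyTorch's batch normalisation layers. The function name and input values are hypothetical.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)   # biased batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


activations = [1.0, 2.0, 3.0, 4.0]            # made-up activations for one feature
normalised = batch_norm(activations)          # gamma=1, beta=0: mean ~0, variance ~1
scaled = batch_norm(activations, gamma=2.0, beta=0.5)   # learnable scale and shift applied

norm_mean = sum(normalised) / len(normalised)
norm_var = sum((x - norm_mean) ** 2 for x in normalised) / len(normalised)
print(norm_mean, norm_var)
```

Here `gamma` and `beta` play the roles of `bn.weight` and `bn.bias`. This sketch covers training-time behaviour only; at evaluation time PyTorch uses running estimates of the mean and variance instead of batch statistics.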

The process for each of the baseline, dropout and batch normalised models was the same and was as follows. 

The gradient for each layer in each model was gathered and averaged across the first 5 epochs and the last 5 epochs during training. PyTorch conveniently makes these values accessible as a property of the model, and all that was required was to collect and calculate the averages for each layer across the correct epochs.

Training was carried out over 30 epochs. The original data split was reinstated, and a fixed LR of 0.05 was selected. 

For each model the same random seed was set, then the model, criterion and optimiser were initialised as in previous experiments. During training, rather than gathering performance data, the gradient data was collected as described above. 

This data was then plotted in a variety of forms to highlight the trends in the data. 

It should be noted that rather than the raw gradient values being collected, it was the absolute values. The reasons for this are made clear in the results section for this experiment. 
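
One way to collect absolute gradient values per layer is sketched below. It assumes the gradients are read from each parameter's `.grad` attribute after `loss.backward()`, which is one common approach and not necessarily the one used for this experiment. The small `nn.Sequential` model, the random data and the helper name `mean_abs_gradients` are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical stand-in model and random data, used only to produce some gradients
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 3))
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(16, 8)
targets = torch.randint(0, 3, (16,))


def mean_abs_gradients(model):
    """Mean absolute gradient for each named parameter that has a gradient."""
    return {
        name: param.grad.abs().mean().item()
        for name, param in model.named_parameters()
        if param.grad is not None
    }


model.zero_grad()                             # clear any stale gradients
loss = criterion(model(inputs), targets)
loss.backward()                               # populates .grad on every parameter
grads = mean_abs_gradients(model)
print(grads)
```

Calling such a helper after each backward pass and averaging the results over the chosen epochs would give one summary value per layer.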

#### 3.4

Finally, a batch normalised model was trained on the original data for 50 epochs as in previous experiments, with performance on the original training, validation and test datasets recorded and plotted. It was compared and analysed in relation to the other models' performance.

In [1]:
############################################
### Code for building the baseline model ###
############################################

# relevant imports

import torch
import torch.nn as nn
import torch.nn.functional as F # as per convention

class BaselineNet(nn.Module):
    def __init__(self):
        super().__init__()
        # max pool layers - not strictly needed to be separate instances but helps with reference to the diagram
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)

        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)

        self.fc1 = nn.Linear(in_features=64 * 4 * 4, out_features=64)
        self.fc2 = nn.Linear(in_features=64, out_features=10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        x = F.relu(self.conv3(x))
        x = self.pool3(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Results (instructions) - 55 MARKS <ignore>
Use the Results section to summarise your findings from the experiments. For each experiment, use the Markdown/text cell to describe and explain your results, and use the code cell (and additional code cells if necessary) to conduct the experiment and produce figures to show your results.

### Experiment 1 (17 MARKS) <ignore>

#### 1.1
As can be seen in Fig 1, initial experimentation established reasonable limits within which to select LRs for further testing. Rates of 0.15 and above led to unusual, erratic behaviour such as that seen in Fig 1.2, where the LR is so high that the model cannot converge to an optimal solution and instead overshoots. On the other hand, Fig 1.1 shows the other extreme, where the LR is so low that no learning can occur.

<figure><center><img src="./figs/e1/lrchaos.png" width=700><img src="./figs/e1/tranval_no_learn.png" width=700><figcaption style="max-width: 600px"> Fig 1. Showing behavioural extremes for different learning rates: unstable learning at a LR of 0.2, and minimal learning at a LR of 0.001 </figcaption></center></figure>

The performances of different LRs can be seen in Fig 2 below and are well summarised in Fig 3. Looking at Fig 2, it can be seen that as LRs get smaller the generalisation gap between the training and validation loss and accuracy is slower to develop, and less extreme. This shows that models trained with higher LRs are able to fit to the training data more quickly, but also overfit to it more quickly. The impact of the LR can also be seen in the volatility of the training loss, which is markedly lower at lower learning rates.

<figure><center><img src="./figs/e1//lr1.png" width=700><img src="./figs/e1/lr2.png" width=700><img src="./figs/e1/lr3.png" width=700><img src="./figs/e1/lr4.png" width=700><img src="./figs/e1/lr5.png" width=700><figcaption style="max-width: 600px"> Fig 2. Performance plots showing individual and averaged training and validation losses and accuracies for models trained with descending LRs across 50 epochs of training </figcaption></center></figure>

In terms of performance on unseen data, the test performances in Fig 4 and the smoothed validation losses and accuracies in Fig 3 give a good overview of how LRs affect this, with lower LRs leading to a reduced validation loss at the end of the 50 epochs owing to slower fitting (and thus overfitting), but also lower accuracy in test and validation. This was in contrast to the quicker rise to high accuracy for those with high learning rates, followed by a plateauing and gradual decline. 

<figure><center><img src="./figs/e1/smoothed loss accuracy.png" width=700><figcaption style="max-width: 600px"> Fig 3. Smoothed averaged results for accuracies and losses across 50 epochs on validation data for models trained with different learning rates</figcaption></center></figure>

<figure><center><img src="./figs/e1/leraning rates test performance.PNG" width=300><figcaption style="max-width: 600px"> Fig 4. Test set performance of models trained with different LRs highlighting the best result for each metric in green</figcaption></center></figure>

#### 1.2

Having observed the different performances above, it was clear the ideal balance would be a LR that began at the highest end of the LRs above (0.1), but that decayed reasonably quickly in order to avoid the onset of overfitting around 10 epochs. 

Different approaches to decay and how they affect LR over the 50 epochs can be seen in Fig 5. The smooth inverse time function with a decay rate of 0.25 seemed to have the ideal combination and was found to perform well relative to the others. 

<figure><center><img src="./figs/e1/lr_scheculer experiments.png" width=350><figcaption style="max-width: 600px"> Fig 5. Different LR decay schedules affect on the active LR across 50 epochs </figcaption></center></figure>

As can be seen when a model using this scheduler is compared with a model using a static LR (see Fig 1.3), there is a slight improvement in overall performance with a scheduler. However, the most substantial difference appears to be in the stability of the validation loss and accuracy, despite the model seeming to overfit to the training data. The LR scheduled model's validation accuracy stabilises in a way that did not occur with any of the other models that saturated at close to 100% training accuracy before. This is likely because the ever-decreasing LR means that after a certain point the parameters will settle, as they will only be receiving negligible updates. 

<figure><center><img src="./figs/e1/LR SCHEDULER final results.png" width=700><figcaption style="max-width: 600px"> Fig 6. Performance over 50 epochs of training for model trained with LR scheduler </figcaption></center></figure>
<figure><center><img src="./figs/e1/results accuracy camparison lr and scheduler.png" width=350><figcaption style="max-width: 600px"> Fig 7. Comparison of performance across training of model trained with a LR scheduler, and the best performing model without a scheduler (LR of 0.05)</figcaption></center></figure>
<figure><center><img src="./figs/e1/lr decay comparison.PNG" width=400><figcaption style="max-width: 600px"> Fig 8. Comparison of test results between a model trained with a LR scheduler, and the best performing model trained without a scheduler (LR of 0.05) highlighting the best result for each metric in green</figcaption></center></figure>

That this model achieves 100% on the training set is notable in itself - something which none of the earlier models did. This is again likely owing to the high LR models taking steps too coarse to home in on a particular point in parameter space that would give 100% accuracy, and the very low LR models possibly being unable to traverse the loss landscape effectively owing to updates that are too small, possibly getting stuck in sub-optimal minima.  

Overall this experiment demonstrates well the impact that different LRs can have on learning in a NN model. 

In [2]:

#############################
### code for Experiment 1 ###
#############################

# UTIL functions that are used here and in all other experiments are included at the bottom of this cell. 
# This choice was made so the experiment code came first to help with readability
# it does mean some function calls show as undefined


# imports 
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split
import torch.optim as optim
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import math

# use GPU where available
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")


# EXPERIMENT 1.1 ------------- LRs -------------

# DATA LOADING AND SPLITTING

# set seed for data split
torch.manual_seed(0)

# create transform object so conversion to Tensor and normalising carried out on data download (functionality as part of torchvision.datasets method)
transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# get the data - 'train' boolean specifies whether to get training or test data
train_data = torchvision.datasets.CIFAR10(root='./data', train=True, transform=transform, download=True)
test_data = torchvision.datasets.CIFAR10(root='./data', train=False, transform=transform)

# set value for validation split (10% validation)
num_validation_samples = 5000
num_train_samples = len(train_data) - num_validation_samples

# split training data
train_data, val_data = random_split(train_data, [num_train_samples, num_validation_samples])

# confirm split number
print(len(train_data)) # 45000 training egs
print(len(val_data)) # 5000 validation egs
print(len(test_data)) # 10000 test egs

# set batch size and initialise dataloaders for the different datasets
batch_size = 64
train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False)


# RUNNING TRAINING AND VALIDATION

num_epochs = 50
random_seeds = list(range(1, 6))

learning_rates_for_experiment = [0.1, 0.075, 0.05, 0.025, 0.01]
# initialise dictionary for storing data for saving to JSON
averaged_results = {lr:{} for lr in learning_rates_for_experiment}
path_to_save = f'./run_data/learning_rates/FINAL.json'
path_to_load = f'./run_data/learning_rates/FINAL.json'
save_experiment = True
# iterate over LRs to be tested
for learning_rate in learning_rates_for_experiment:
    # initialise empty lists for collecting data for each LRs (over the 5 runs)
    epoch_train_losses_by_run = []
    epoch_val_losses_by_run = []
    epoch_train_accuracies_by_run = []
    epoch_val_accuracies_by_run = []
    test_losses = []
    test_accuracies = []
    reports = []
    
    # 5 random seeds = 5 different runs for each learning rate
    for random_seed in random_seeds:
        # set seed prior to initialising model (as used for initial weights as well as any dropout layers)
        torch.manual_seed(random_seed)
        # initialise model, criterion and optimiser
        model = BaselineNet().to(device)
        criterion = nn.CrossEntropyLoss()
        optimiser = optim.SGD(model.parameters(), lr=learning_rate)
        
        model, train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, _,_ = run_training_and_validation(model, device, learning_rate, num_epochs, criterion, optimiser, train_dataloader, val_dataloader, manual_lr_schedule=False, plot=True)
        epoch_train_losses_by_run.append(train_epoch_losses)
        epoch_val_losses_by_run.append(val_epoch_losses)
        epoch_train_accuracies_by_run.append(train_epoch_accuracy)
        epoch_val_accuracies_by_run.append(val_epoch_accuracy)
        
        test_loss, test_accuracy, report = run_testing(model, device, criterion, test_dataloader)
        test_losses.append(test_loss)
        test_accuracies.append(test_accuracy)
        reports.append(report)
    
    average_train_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_train_losses_by_run)]
    average_val_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_val_losses_by_run)]
    average_train_accuracies = [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_train_accuracies_by_run)]
    average_val_accuracies =  [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_val_accuracies_by_run)]
    
    average_test_loss = sum(test_losses)/len(test_losses)
    average_test_accuracy = sum(test_accuracies)/len(test_accuracies)
    
    averaged_results[learning_rate] = {'seeds':random_seeds,
                                       'av_train_losses': average_train_losses,
                                       'av_val_losses': average_val_losses,
                                       'av_train_acc': average_train_accuracies,
                                       'av_val_acc': average_val_accuracies,
                                       'all_train_losses':epoch_train_losses_by_run,
                                       'all_val_losses': epoch_val_losses_by_run,
                                       'all_train_accuracies': epoch_train_accuracies_by_run,
                                       'all_val_accuracies': epoch_val_accuracies_by_run,
                                       'all_test_losses':test_losses, 
                                       'all_test_accuracies':test_accuracies,
                                       'av_test_loss': average_test_loss,
                                       'av_test_accuracy':average_test_accuracy}
    plot_single_train_val_smoothed(average_train_losses,average_val_losses,average_train_accuracies,average_val_accuracies, num_epochs, smoothing_window=5, title=f'lr: {learning_rate}')

if save_experiment:
    with open(path_to_save, 'w') as file:
        json.dump(averaged_results, file, indent=4)  # 'indent' makes the output formatted and easier to read


# PLOTTING

plot_all_models_performance_from_disk(path_to_load, enforce_axis=True)
plot_performance_comparison_from_file(path_to_load, enforce_axis=True)
display_accuracy_heatmap(path_to_load)


# EXPERIMENT 1.2 ------------- LR SCHEDULER -------------

# INVESTIGATE LR DECAY

# exploring different learning rate decay approaches and plotting them to see how the LR will actually behave across 50 epochs
def adjust_learning_rate(epoch, initial_lr, decay_type, decay_rate=0.1, decay_interval=10):
    if decay_type == 'inverse_time':
        new_lr = initial_lr / (1 + decay_rate * epoch)
    elif decay_type == 'exponential':
        new_lr = initial_lr * math.exp(-decay_rate * epoch)
    elif decay_type == 'step':
        num_decays = epoch // decay_interval
        new_lr = initial_lr * (decay_rate ** num_decays)
    else:
        # fail loudly rather than reaching the return with new_lr undefined
        raise ValueError(f'unknown decay_type: {decay_type}')
    return new_lr

def plot_learning_rate_decay(num_epochs, initial_lr, decay_functions):
    fig, axs = plt.subplots(len(decay_functions), figsize=(8, 4 * len(decay_functions)))
    if len(decay_functions) == 1:
        axs = [axs]
    
    for i, (decay_type, decay_rate, decay_interval) in enumerate(decay_functions):
        lr_values = [adjust_learning_rate(epoch, initial_lr, decay_type, decay_rate, decay_interval) for epoch in range(num_epochs)]
        
        if decay_type == 'step':
            title = f'Decay Function: {decay_type}, Decay Rate: {decay_rate}, Decay Interval: {decay_interval}'
        else:
            title = f'Decay Function: {decay_type}, Decay Rate: {decay_rate}'
        
        axs[i].plot(range(num_epochs), lr_values)
        axs[i].set_title(title)
        axs[i].set_xlabel('Epoch')
        axs[i].set_ylabel('Learning Rate')
    
    plt.tight_layout()
    plt.show()

num_epochs = 50
initial_lr = 0.1

decay_functions = [
    ('inverse_time', 0.1, 0),
    ('inverse_time', 0.05, 0),
    ('step', 0.5, 10),
    ('step', 0.1, 5),
    ('exponential', 0.25, 0),
    ('exponential', 0.1, 0)
]

plot_learning_rate_decay(num_epochs, initial_lr, decay_functions)


# RUN TRAINING AND VALIDATION WITH LRDECAY

# implementing the LR decay scheduler that best fit what I wanted to happen
# creating function that will be passed in to the training function to be applied at the start of every epoch
def adjust_initial_learning_rate(optimiser, epoch, initial_lr=0.1, decay_rate=0.25):    
    new_lr = initial_lr / (1 + decay_rate *epoch)
    for param_group in optimiser.param_groups:
        param_group['lr'] = new_lr
    print('LR:',new_lr)
    return optimiser


num_epochs = 50

initial_learning_rate = 0.1
decay_rate = 0.25

random_seeds = list(range(1, 6))

averaged_results = {decay_rate:{}}
path_to_save = f'./run_data/lr_decay/final_decaying_lr_initial_lr_{initial_learning_rate}_decay_{decay_rate}.json'
path_to_load = f'./run_data/lr_decay/final_decaying_lr_initial_lr_{initial_learning_rate}_decay_{decay_rate}.json'

save_experiment = True

epoch_train_losses_by_run = []
epoch_val_losses_by_run = []
epoch_train_accuracies_by_run = []
epoch_val_accuracies_by_run = []
test_losses = []
test_accuracies = []
reports = []
    
for random_seed in random_seeds:
    print('DECAY: ', decay_rate)
    print('seed:', random_seed)
    torch.manual_seed(random_seed)

    model = BaselineNet().to(device)
    criterion = nn.CrossEntropyLoss()
    optimiser = optim.SGD(model.parameters(), lr=initial_learning_rate)

    model,train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, train_report,val_report = run_training_and_validation(model, device, initial_learning_rate, num_epochs, criterion, optimiser, train_dataloader, val_dataloader, manual_lr_schedule=True, scheduler_func=adjust_initial_learning_rate, plot=True)
    epoch_train_losses_by_run.append(train_epoch_losses)
    epoch_val_losses_by_run.append(val_epoch_losses)
    epoch_train_accuracies_by_run.append(train_epoch_accuracy)
    epoch_val_accuracies_by_run.append(val_epoch_accuracy)
    
    test_loss, test_accuracy, report = run_testing(model, device, criterion, test_dataloader)
    test_losses.append(test_loss)
    test_accuracies.append(test_accuracy)
    reports.append(report)

    
# average across runs once all seeds have finished (mirrors the structure of Experiment 1.1)
average_train_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_train_losses_by_run)]
average_val_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_val_losses_by_run)]
average_train_accuracies = [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_train_accuracies_by_run)]
average_val_accuracies =  [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_val_accuracies_by_run)]
average_test_loss = sum(test_losses)/len(test_losses)
average_test_accuracy = sum(test_accuracies)/len(test_accuracies)

averaged_results[decay_rate] = {'seeds':random_seeds,
                                'av_train_losses': average_train_losses,
                                'av_val_losses': average_val_losses,
                                'av_train_acc': average_train_accuracies,
                                'av_val_acc': average_val_accuracies,
                                'all_train_losses':epoch_train_losses_by_run,
                                'all_val_losses': epoch_val_losses_by_run,
                                'all_train_accuracies': epoch_train_accuracies_by_run,
                                'all_val_accuracies': epoch_val_accuracies_by_run,
                                'all_test_losses':test_losses, 
                                'all_test_accuracies':test_accuracies,
                                'av_test_loss': average_test_loss,
                                'av_test_accuracy':average_test_accuracy}

plot_single_train_val_smoothed(average_train_losses,average_val_losses,average_train_accuracies,average_val_accuracies, num_epochs, smoothing_window=3, title=f'LR: {initial_learning_rate}, DECAY: {decay_rate}')
    
if save_experiment:
    with open(path_to_save, 'w') as file:
        json.dump(averaged_results, file, indent=4)  # 'indent' makes the output formatted and easier to read

# PLOTTING
lr_decay_data = path_to_load
plot_all_models_performance_from_disk(lr_decay_data, enforce_axis=True)
plot_performance_comparison_from_file(lr_decay_data, enforce_axis=True)
display_accuracy_heatmap(lr_decay_data)


# ---------UTILITY FUNCTIONS USED ACROSS ALL EXPERIMENTS---------

# These functions comprised a utils.py file during development

# MODEL RELATED (click to expand) :
def run_training_and_validation(model, device, initial_lr, num_epochs, criterion, optimiser, train_dataloader, val_dataloader, metrics = False, manual_lr_schedule = False, scheduler_func=None, plot = False):

    # key function which performs training and validation of a model for params and data. 
    
    # returns all of the data gathered from the training and validation run organised by epoch. Optional params added during development to accommodate different experiments (eg lr_scheduling)
    
    # optional metrics and plot parameters allow for plotting as well as generation of classification report used for analysis of results
    
    # when plotting, includes a call to plot_single_train_val_smoothed() util function defined below
    # when training includes a call to the get_accuracy() function below

    train_epoch_losses = []
    train_epoch_accuracy = []
    val_epoch_losses = []
    val_epoch_accuracy = []
    
    for epoch in range(num_epochs):
        train_running_batch_losses = []
        train_running_batch_accuracy = []
        
        if epoch == num_epochs-1:
            train_all_preds = []
            train_all_labels = []
            val_all_preds = []
            val_all_labels = []
        
        if manual_lr_schedule:
            optimiser = scheduler_func(optimiser, epoch, initial_lr)

        model.train()
        for i, (images, labels) in enumerate(train_dataloader):
            images = images.to(device)
            labels = labels.to(device)
            
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            
            accuracy = get_accuracy(outputs, labels)
            
            loss.backward()
            optimiser.step()
            optimiser.zero_grad()

            train_running_batch_losses.append(loss.item())
            train_running_batch_accuracy.append(accuracy)
            # if i % 50 == 0:
            #   training_progress_bar.set_description(f'Training Epoch [{epoch+1}/{num_epochs}], Step [{i}/{len(train_dataloader)}], Loss: {loss.item():.4f}, Acc: {accuracy:.4f}')
            
            if epoch == num_epochs-1:
                _, preds = torch.max(outputs, 1)
                train_all_preds.extend(preds.cpu().numpy())  # Move predictions to CPU and convert to numpy for sklearn
                train_all_labels.extend(labels.cpu().numpy())  # Move labels to CPU and convert to numpy

        train_epoch_losses.append(sum(train_running_batch_losses)/len(train_running_batch_losses))
        train_epoch_accuracy.append(sum(train_running_batch_accuracy)/len(train_running_batch_accuracy))
        model.eval()
        with torch.no_grad():
            val_running_batch_losses = []
            val_running_batch_accuracy = []

            for i, (images, labels) in enumerate(val_dataloader):
                images = images.to(device)
                labels = labels.to(device)
                
                outputs = model(images)
                loss = criterion(outputs, labels)
                
                accuracy = get_accuracy(outputs, labels)

                val_running_batch_losses.append(loss.item())
                val_running_batch_accuracy.append(accuracy)
                # if i % 20 == 0:
                #   val_progress_bar.set_description(f'Validation Epoch [{epoch+1}/{num_epochs}], Step [{i}/{len(val_dataloader)}], Loss: {loss.item():.4f}, Acc: {accuracy:.4f}')
                
                if epoch == num_epochs-1:
                    _, preds = torch.max(outputs, 1)
                    val_all_preds.extend(preds.cpu().numpy())  # Move predictions to CPU and convert to numpy for sklearn
                    val_all_labels.extend(labels.cpu().numpy())  # Move labels to CPU and convert to numpy

            val_epoch_losses.append(sum(val_running_batch_losses)/len(val_running_batch_losses))
            val_epoch_accuracy.append(sum(val_running_batch_accuracy)/len(val_running_batch_accuracy))
            print(f'Epoch [{epoch+1}/{num_epochs}] - Train Loss: {train_epoch_losses[epoch]:.4f}, Acc: {train_epoch_accuracy[epoch]:.4f} | Val Loss: {val_epoch_losses[epoch]:.4f}, Acc: {val_epoch_accuracy[epoch]:.4f}')
            class_names = ['plane', 'car', 'bird', 'cat','deer', 'dog', 'frog', 'horse', 'ship', 'truck']
            
    if plot:
        plot_single_train_val_smoothed(train_epoch_losses, val_epoch_losses, train_epoch_accuracy, val_epoch_accuracy, num_epochs, smoothing_window=10, title=f'single run lr={initial_lr}, decay={manual_lr_schedule}')
    
    if metrics:
        train_report = classification_report(train_all_labels, train_all_preds, target_names=(class_names))
        val_report = classification_report(val_all_labels, val_all_preds, target_names=(class_names))
        # print('FINAL EPOCH TRAINING SUMMARY:')
        # print(train_report)
        # print('FINAL EPOCH VALIDATION SUMMARY:')
        # print(val_report)
        
        return (model,train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, train_report,val_report)
    else:
        return (model, train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, 0,0)

def get_accuracy(logits, targets):
    
        # key function used in all training, validation and testing runs to calculate the accuracy of predictions made by a model.
        
        # takes in logits (raw output scores from the model) and targets (actual class labels) and returns a float representing the accuracy of the predictions.

        # get the indices of the maximum value of all elements in the input tensor (which are the predicted class labels)
        _, predicted_labels = torch.max(logits, 1)
        
        # calculate the number of correctly predicted labels.
        correct_predictions = (predicted_labels == targets).sum().item()
        
        # calculate the accuracy.
        accuracy = correct_predictions / targets.size(0)
        
        return accuracy

def run_testing(model, device, criterion, test_dataloader):
    # this function was used to test trained models on the test dataset
    # it returns the loss, the accuracy and the classification report for analysis
    model.eval()
    with torch.no_grad():
        test_running_batch_losses = []
        test_running_batch_accuracy = []
        test_all_preds = []
        test_all_labels = []

        for i, (images, labels) in enumerate(test_dataloader):
            images = images.to(device)
            labels = labels.to(device)
            
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            accuracy = get_accuracy(outputs, labels)

            test_running_batch_losses.append(loss.item())
            test_running_batch_accuracy.append(accuracy)
            # test_progress_bar.set_description(f'testidation Epoch [{epoch+1}/{num_epochs}], Step [{i}/{len(test_dataloader)}], Loss: {loss.item():.4f}, Acc: {accuracy:.4f}')
            _, preds = torch.max(outputs, 1)
            test_all_preds.extend(preds.cpu().numpy())  # Move predictions to CPU and convert to numpy for sklearn
            test_all_labels.extend(labels.cpu().numpy())  # Move labels to CPU and convert to numpy

    test_loss = sum(test_running_batch_losses)/len(test_running_batch_losses)
    test_accuracy = sum(test_running_batch_accuracy)/len(test_running_batch_accuracy)

    print('TESTING COMPLETE!!')
    print(f'Test Loss: {test_loss:.4f}, Test Acc: {test_accuracy:.4f}')
    report = classification_report(test_all_labels, test_all_preds, target_names=(['plane', 'car', 'bird', 'cat','deer', 'dog', 'frog', 'horse', 'ship', 'truck']))
    print(report)
    return test_loss, test_accuracy, report

# PLOTTING/VISUALISING RELATED (click to expand):
def plot_single_train_val_smoothed(train_epoch_losses, val_epoch_losses, train_epoch_accuracy, val_epoch_accuracy, num_epochs, smoothing_window=5, title=None):
    # function used in many contexts to plot training and validation losses and accuracies of a single run
    # takes in the values returned from a single run of training and validation and plots them 
    # smoothing param allows for clearer picture of the progress during validation especially as it can be volatile 
    
    # convert lists to pandas Series
    train_epoch_losses_series = pd.Series(train_epoch_losses)
    val_epoch_losses_series = pd.Series(val_epoch_losses)
    train_epoch_accuracy_series = pd.Series(train_epoch_accuracy)
    val_epoch_accuracy_series = pd.Series(val_epoch_accuracy)

    # calculate moving averages using the provided smoothing window
    smooth_train_epoch_losses = train_epoch_losses_series.rolling(window=smoothing_window).mean()
    smooth_val_epoch_losses = val_epoch_losses_series.rolling(window=smoothing_window).mean()
    smooth_train_epoch_accuracy = train_epoch_accuracy_series.rolling(window=smoothing_window).mean()
    smooth_val_epoch_accuracy = val_epoch_accuracy_series.rolling(window=smoothing_window).mean()

    fig, ax = plt.subplots(1, 2, figsize=(14, 5))

    # Plot training and validation loss with moving averages
    ax[0].plot(train_epoch_losses, label='Training Loss', alpha=0.3)
    ax[0].plot(val_epoch_losses, label='Validation Loss', alpha=0.3)
    ax[0].plot(smooth_train_epoch_losses, label='Smoothed Training Loss', color='blue')
    ax[0].plot(smooth_val_epoch_losses, label='Smoothed Validation Loss', color='orange')
    ax[0].set_xlabel('Epochs')
    ax[0].set_ylabel('Loss')
    ax[0].set_title('Training and Validation Loss')
    ax[0].legend()

    # Set x-axis ticks on the loss plot at every 10th epoch
    ax[0].set_xticks(range(0, num_epochs + 1, 10))

    # Plot training and validation accuracy with moving averages
    ax[1].plot(train_epoch_accuracy, label='Training Accuracy', alpha=0.3)
    ax[1].plot(val_epoch_accuracy, label='Validation Accuracy', alpha=0.3)
    ax[1].plot(smooth_train_epoch_accuracy, label='Smoothed Training Accuracy', color='blue')
    ax[1].plot(smooth_val_epoch_accuracy, label='Smoothed Validation Accuracy', color='orange')
    ax[1].set_xlabel('Epochs')
    ax[1].set_ylabel('Accuracy')
    ax[1].set_title('Training and Validation Accuracy')
    ax[1].legend()

    # Set x-axis ticks on the accuracy plot at every 10th epoch
    ax[1].set_xticks(range(0, num_epochs + 1, 10))

    # Set y-axis for accuracy to range from 0 to 1 with ticks at intervals of 0.1
    ax[1].set_ylim(0, 1)
    ax[1].set_yticks([i * 0.1 for i in range(11)])
    if title:
        fig.suptitle(title, fontsize=16)

    plt.tight_layout()
    plt.show()

def display_accuracy_heatmap(path_to_load):
    # helper function for displaying best performing models in a convenient way
    with open(path_to_load, 'r') as file:
        results = json.load(file)
    
    rates = []
    av_test_losses = []
    av_test_accuracy = []
    for rate, value_dict in results.items():
        rates.append(rate)
        av_test_losses.append(value_dict['av_test_loss'])
        av_test_accuracy.append(value_dict['av_test_accuracy'])
    
    # Creating the DataFrame
    df = pd.DataFrame({
        'Average Test Loss': av_test_losses,
        'Average Test Accuracy': av_test_accuracy
    }, index=rates)
    
    # Applying conditional formatting to highlight the best value in each column
    def highlight_best(column):
        if column.name == 'Average Test Loss':
            is_best = column == column.min()
        else:
            is_best = column == column.max()
        return ['background: green' if v else '' for v in is_best]
    
    styled_df = df.style.apply(highlight_best, axis=0)
    
    return styled_df

def plot_single_model_performance(single_var_multi_run_data, title=None, enforce_axis=False):
    # function used for plotting the performance of a single value of the variable being investigated across multiple runs 
    # for example during experiments 1.1 and 2.1
    
    # plots individual runs in background and a clearer average run 
    
    epochs = range(1, len(single_var_multi_run_data['av_train_losses']) + 1)
    n_runs = len(single_var_multi_run_data['all_train_losses'])
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    
    if title:
        title += f' across {n_runs} runs'
        fig.suptitle(title, fontsize=12)

    # Plot losses
    for train_loss, val_loss in zip(single_var_multi_run_data['all_train_losses'], single_var_multi_run_data['all_val_losses']):
        ax1.plot(epochs, train_loss, color='blue', alpha=0.3, linewidth=0.5, label='Individual Run Training Losses')
        ax1.plot(epochs, val_loss, color='orange', alpha=0.3, linewidth=0.5, label='Individual Run Validation Losses')
    ax1.plot(epochs, single_var_multi_run_data['av_train_losses'], color='blue', linewidth=1.2, label='Average Training Loss')
    ax1.plot(epochs, single_var_multi_run_data['av_val_losses'], color='orange', linewidth=1.2, label='Average Validation Loss')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.set_title('Losses')
    
    # Remove duplicate labels in the legend
    handles, labels = ax1.get_legend_handles_labels()
    unique_labels = ["Average Training Loss", "Average Validation Loss", "Individual Run Training Losses", "Individual Run Validation Losses"]
    unique_handles = [handles[labels.index(label)] for label in unique_labels]
    ax1.legend(unique_handles, unique_labels)

    # Plot accuracies
    for train_acc, val_acc in zip(single_var_multi_run_data['all_train_accuracies'], single_var_multi_run_data['all_val_accuracies']):
        ax2.plot(epochs, train_acc, color='blue', alpha=0.3, linewidth=0.5, label='Individual Run Training Accuracies')
        ax2.plot(epochs, val_acc, color='orange', alpha=0.3, linewidth=0.5, label='Individual Run Validation Accuracies')
    ax2.plot(epochs, single_var_multi_run_data['av_train_acc'], color='blue', linewidth=1.2, label='Average Training Accuracy')
    ax2.plot(epochs, single_var_multi_run_data['av_val_acc'], color='orange', linewidth=1.2, label='Average Validation Accuracy')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy')
    ax2.set_title('Accuracies')
    
    # Remove duplicate labels in the legend
    handles, labels = ax2.get_legend_handles_labels()
    unique_labels = ["Average Training Accuracy", "Average Validation Accuracy", "Individual Run Training Accuracies", "Individual Run Validation Accuracies"]
    unique_handles = [handles[labels.index(label)] for label in unique_labels]
    ax2.legend(unique_handles, unique_labels)
    
    if enforce_axis:
        ax1.set_ylim(0, 5)
        ax2.set_ylim(0, 1)

    plt.tight_layout()
    plt.show()    
    
def plot_all_models_performance_from_disk(path_to_load, variable_name=None, enforce_axis=False):
    with open(path_to_load, 'r') as file:
        averaged_results = json.load(file)
        
    for variable_val, data in averaged_results.items():
        # plot_single_model_performance appends ' across {n_runs} runs' to the title, so it is not repeated here
        plot_single_model_performance(data, title=f'Training/Validation Losses and Accuracy for {variable_name} = {variable_val}', enforce_axis=enforce_axis)

def plot_performance_comparison_from_file(path_to_load, enforce_axis=False, smooth_window=5):
    with open(path_to_load, 'r') as file:
        results = json.load(file)
    learning_rates = list(results.keys())
    num_epochs = len(results[learning_rates[0]]['av_train_losses'])

    fig_size = (12, 16)
    fig, ((ax_train_loss, ax_train_acc), (ax_val_loss, ax_val_acc),
          (ax_train_loss_smoothed, ax_train_acc_smoothed),
          (ax_val_loss_smoothed, ax_val_acc_smoothed)) = plt.subplots(4, 2, figsize=fig_size)

    plot_metrics(ax_train_loss, results, learning_rates, num_epochs, 'av_train_losses', 'Average Training Loss')
    plot_metrics(ax_train_acc, results, learning_rates, num_epochs, 'av_train_acc', 'Average Training Accuracy')
    plot_metrics(ax_val_loss, results, learning_rates, num_epochs, 'av_val_losses', 'Average Validation Loss')
    plot_metrics(ax_val_acc, results, learning_rates, num_epochs, 'av_val_acc', 'Average Validation Accuracy')
    plot_metrics(ax_train_loss_smoothed, results, learning_rates, num_epochs, 'av_train_losses', 'Smoothed Training Loss', smoothed=True, smooth_window=smooth_window)
    plot_metrics(ax_train_acc_smoothed, results, learning_rates, num_epochs, 'av_train_acc', 'Smoothed Training Accuracy', smoothed=True, smooth_window=smooth_window)
    plot_metrics(ax_val_loss_smoothed, results, learning_rates, num_epochs, 'av_val_losses', 'Smoothed Validation Loss', smoothed=True, smooth_window=smooth_window)
    plot_metrics(ax_val_acc_smoothed, results, learning_rates, num_epochs, 'av_val_acc', 'Smoothed Validation Accuracy', smoothed=True, smooth_window=smooth_window)

    if enforce_axis:
        for ax in [ax_val_acc, ax_val_loss, ax_train_acc, ax_train_loss,
                   ax_val_acc_smoothed, ax_val_loss_smoothed, ax_train_acc_smoothed, ax_train_loss_smoothed]:
            ax.set_ylim(0, 5) if 'Loss' in ax.get_ylabel() else ax.set_ylim(0, 1)

    plt.tight_layout()
    plt.show()

    if len(learning_rates) > 2:
        plot_comparative_metrics(results, learning_rates, num_epochs, 'Comparative Accuracies', 'av_train_acc', 'av_val_acc', enforce_axis)
        plot_comparative_metrics(results, learning_rates, num_epochs, 'Comparative Accuracies (Smoothed)', 'av_train_acc', 'av_val_acc', enforce_axis, smoothed=True, smooth_window=smooth_window)
    elif len(learning_rates) == 2:
        fig_acc_two, ax_acc_two = plt.subplots(figsize=(6, 4))
        fig_acc_two.suptitle('Comparative Accuracies', fontsize=12)

        for lr in learning_rates:
            ax_acc_two.plot(range(1, num_epochs + 1), results[lr]['av_val_acc'], label=f"Validation ({lr})", linestyle='-')
            ax_acc_two.plot(range(1, num_epochs + 1), results[lr]['av_train_acc'], label=f"Training ({lr})", linestyle='--')

        ax_acc_two.set_xlabel('Epoch')
        ax_acc_two.set_ylabel('Accuracy')
        ax_acc_two.set_title('Accuracy Comparison')
        ax_acc_two.legend(loc='upper right')

        if enforce_axis:
            ax_acc_two.set_ylim(0, 1)

        plt.tight_layout()
        plt.show()

        plot_comparative_metrics(results, learning_rates, num_epochs, 'Comparative Accuracies (Smoothed)', 'av_train_acc', 'av_val_acc', enforce_axis, smoothed=True, smooth_window=smooth_window)

def plot_metrics(ax, results, learning_rates, num_epochs, metric_key, title, smoothed=False, smooth_window=5):
    for lr in learning_rates:
        if smoothed:
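            metric = np.convolve(results[lr][metric_key], np.ones(smooth_window) / smooth_window, mode='valid')
            # 'valid' mode returns num_epochs - smooth_window + 1 points, so build an x range of matching length
            # with each point centred on its window (epochs are 1-indexed)
            start_epoch = smooth_window // 2 + 1
            ax.plot(range(start_epoch, start_epoch + len(metric)), metric, label=str(lr))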
        else:
            ax.plot(range(1, num_epochs + 1), results[lr][metric_key], label=str(lr))
    ax.set_xlabel('Epoch')
    ax.set_ylabel(title)
    ax.set_title(title)
    ax.legend(title='Learning Rates', loc='lower right')

def plot_comparative_metrics(results, learning_rates, num_epochs, fig_title, train_key, val_key, enforce_axis=False, smoothed=False, smooth_window=5):
    fig, (ax_train, ax_val) = plt.subplots(1, 2, figsize=(12, 4))
    fig.suptitle(fig_title, fontsize=12)

    plot_metrics(ax_train, results, learning_rates, num_epochs, train_key, f'Training {fig_title}', smoothed, smooth_window)
    plot_metrics(ax_val, results, learning_rates, num_epochs, val_key, f'Validation {fig_title}', smoothed, smooth_window)

    if enforce_axis:
        ax_train.set_ylim(0, 1)
        ax_val.set_ylim(0, 1)

    plt.tight_layout()
    plt.show()


### Experiment 2 (19 MARKS) <ignore>

#### 2.1
The effect of increasing dropout rates can clearly be seen in Figs 9.1 to 9.5. All models were initialised with the same parameters other than the dropout rate, and the results demonstrate the effect of dropout as a regularisation technique: as the dropout rate increases from 0 to 0.8, there is a reduction in both the speed and the extent to which the model fits the training data. This is reflected in the final accuracy the model reaches on the training data and in how quickly it gets there. It can also be seen in the significant decrease in the generalisation gap between training and validation performance: the gap is largest in the absence of dropout (Fig 9.1) and smallest at the highest dropout rate.
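
One way to put a number on the generalisation gap is to subtract the averaged validation accuracy from the averaged training accuracy at the final epoch. The sketch below assumes a results dictionary shaped like the `averaged_results` saved by the experiment code (keys `av_train_acc` and `av_val_acc`). The helper name `generalisation_gap` and the toy numbers are illustrative assumptions, not results from this report.

```python
def generalisation_gap(results, key_train="av_train_acc", key_val="av_val_acc"):
    """Return the final-epoch train-minus-validation accuracy gap per dropout rate."""
    gaps = {}
    for dropout_rate, metrics in results.items():
        gaps[dropout_rate] = metrics[key_train][-1] - metrics[key_val][-1]
    return gaps


# Toy numbers for illustration only (not taken from the experiments).
toy_results = {
    0.0: {"av_train_acc": [0.50, 0.90], "av_val_acc": [0.48, 0.65]},
    0.8: {"av_train_acc": [0.40, 0.60], "av_val_acc": [0.39, 0.58]},
}
gaps = generalisation_gap(toy_results)
```

A larger value indicates that more of what the model has learned is specific to the training data.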

 <figure><center><img src="./figs/e2/dr0.png" width=800><img src="./figs/e2/dr02.png" width=800><img src="./figs/e2/dr04.png" width=800><img src="./figs/e2/dr046.png" width=800><img src="./figs/e2/dr08.png" width=800><figcaption style="max-width: 600px"> Fig 9. Performance plots showing individual and averaged training and validation losses and accuracies for models trained with increasing dropout rates across 50 epochs of training </figcaption></center></figure>
<figure><center><img src="./figs/e2/overall dropout comparisons.png" width=800><figcaption style="max-width: 600px"> Fig 10. Smoothed averaged results for accuracies and losses across 50 epochs on validation data for models trained with different dropout rates </figcaption></center></figure>
<figure><center><img src="./figs/e2/dropout rates test results.PNG" width=300><figcaption style="max-width: 600px"> Fig 11. Test set performance of models trained with different dropout rates highlighting the best result for each metric in green</figcaption></center></figure>
In Fig 10 the comparative performance of models trained with different dropout rates can be seen clearly. Looking at the validation loss, the rise in that loss begins earlier and develops more severely for the lower dropout rates. Despite this, accuracy is relatively well preserved, as the models with lower dropout rates still attain reasonable performance on both the test and validation datasets. That being said, the best test loss and test performance belong to the models with higher dropout rates, albeit by a small margin.
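
A simple proxy for the onset of the rise in validation loss is the epoch at which the averaged validation loss is lowest. This is a minimal sketch, assuming a plain list of per-epoch losses such as the `av_val_losses` entry of the saved results. `overfitting_onset` is a hypothetical helper and the numbers are made up for illustration.

```python
def overfitting_onset(val_losses):
    """Return the 1-indexed epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Illustrative loss curve: falls for four epochs, then rises again.
toy_val_losses = [1.9, 1.4, 1.1, 1.0, 1.2, 1.5, 1.9]
onset_epoch = overfitting_onset(toy_val_losses)
```

Comparing this epoch across dropout rates gives a rough, single-number view of how early each model begins to overfit.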

#### 2.2
The findings above are reiterated in the freshly trained models shown in Figs 12 and 13, where the model trained without dropout (model 0 in these experiments) demonstrates poor generalisability and marked overfitting, while the model trained with dropout fits less closely to the training data, as seen in its lower training accuracy. However, what it does learn mostly generalises to the validation dataset. These findings are also visible in the comparison plot in Fig 14.

<figure><center><img src="./figs/e2/pretrained_0.png" width=800><figcaption style="max-width: 600px"> Fig 12. Performance plots showing individual and averaged training and validation losses and accuracies for the Baseline (non-dropout) model (model 0) during training on the original data over 50 epochs. </figcaption></center></figure>

<figure><center><img src="./figs/e2/pretrained_1.png" width=800><figcaption style="max-width: 600px"> Fig 13. Performance plots showing individual and averaged training and validation losses and accuracies for the model with dropout implemented (model 1) during training on the original data over 50 epochs. </figcaption></center></figure>

<figure><center><img src="./figs/e2/pretrained comparison.png" width=400><figcaption style="max-width: 600px"> Fig 14. Direct comparison of the averaged and smoothed performance of the non-dropout (model 0) and dropout (model 1) models over 50 epochs of training and validation on the original data</figcaption></center></figure>

<figure><center><img src="./figs/e2/pretrained test results.PNG" width=300><figcaption style="max-width: 600px"> Fig 15. Test set performance of models trained without (0) and with (1) dropout implemented, highlighting the best result for each metric in green. </figcaption></center></figure>

Both of the models above then had their fully connected layers retrained on a reversed version of the original dataset (and so essentially a 'new' dataset in terms of a new distribution) whilst their other parameters remained frozen. Their performance during this second phase of training can be seen in Figs 16 and 17.  

It is clear that the dropout-free model fits and then overfits to this new data extremely quickly, with only a very brief period during which it is learning generalisable information from the new data. The model pre-trained and retrained with dropout, on the other hand, still overfits, but the process is smoother and more gradual, with a less severe transition from fitting with generalisation to overfitting.
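
Retraining only the fully connected layers is usually arranged by switching off `requires_grad` on the pretrained parameters and passing only the trainable parameters to the optimiser. The sketch below shows one way this could look under those assumptions; it is not the code that produced the results above. `TinyNet` is a hypothetical stand-in that reuses the `conv1`, `fc1` and `fc2` layer names of `DropoutNet`.

```python
import torch
import torch.nn as nn
import torch.optim as optim


class TinyNet(nn.Module):
    """Hypothetical stand-in using the same layer names as DropoutNet."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(4 * 8 * 8, 16)
        self.fc2 = nn.Linear(16, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.flatten(x, 1)
        return self.fc2(torch.relu(self.fc1(x)))


torch.manual_seed(0)
model = TinyNet()  # stands in for a model loaded from disk

# Freeze every existing parameter (the pretrained feature extractor).
for param in model.parameters():
    param.requires_grad = False

# Freshly constructed layers have requires_grad=True by default.
model.fc1 = nn.Linear(4 * 8 * 8, 16)
model.fc2 = nn.Linear(16, 10)

# Hand only the trainable parameters to the optimiser.
trainable = [p for p in model.parameters() if p.requires_grad]
optimiser = optim.SGD(trainable, lr=0.05)

# One illustrative update on random data: the frozen layer must not change.
conv_before = model.conv1.weight.clone()
inputs, labels = torch.randn(8, 3, 8, 8), torch.randint(0, 10, (8,))
loss = nn.CrossEntropyLoss()(model(inputs), labels)
loss.backward()
optimiser.step()
conv_unchanged = torch.equal(conv_before, model.conv1.weight)
```

Filtering on `requires_grad` keeps the optimiser from tracking the frozen convolutional weights at all, so only the two new linear layers (four parameter tensors) are updated.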



<figure><center><img src="./figs/e2/retrained_baseline.png" width=800><figcaption width=500> Fig 16. Performance plots showing individual and averaged training and validation losses and accuracies for Baseline (non-dropout) model (model 0) during retraining on swapped data over 50 epochs</figcaption></center></figure>

<figure><center><img src="./figs/e2/retrained_dropout.png" width=800><figcaption style="max-width: 600px">Fig 17. Performance plots showing individual and averaged training and validation losses and accuracies for the model with dropout (model 1) during retraining on swapped data over 50 epochs</figcaption></center></figure>

<figure><center><img src="./figs/e2/retrained_comparison.png" width=400><figcaption style="max-width: 600px">Fig 18. Direct comparison of the averaged and smoothed performance of the non-dropout (model 0) and dropout (model 1) models over 50 epochs of training and validation on the swapped data </figcaption></center></figure>

<figure><center><img src="./figs/e2/retrained comparison test results.PNG" width=300><figcaption style="max-width: 600px"> Fig 19. Test set performance of models retrained on the swapped dataset, having been previously trained on the original dataset without (0) and with (1) dropout implemented, highlighting the best result for each metric in green.</figcaption></center></figure>

In terms of overall performance on the test set, as can be seen in Fig 19, the model with dropout performs better on both metrics, although not enormously better. It also performs better than any other model so far, other than the one which used the LR scheduler.

It can therefore be said that the regularisation effect of a single dropout layer was able to improve performance almost to the same level as basic LR scheduling.

In [6]:
#############################
### Code for Experiment 2 ###
#############################

# --- EXPERIMENT 2.1 - Dropout Rates ---

# DATA LOADING AND NEW SPLIT
torch.manual_seed(0)

batch_size = 64

transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_data = torchvision.datasets.CIFAR10(root='./data', train=True, transform=transform, download=True)
test_data = torchvision.datasets.CIFAR10(root='./data', train=False, transform=transform)

# half and half split
num_validation_samples = 25000
num_train_samples = len(train_data) - num_validation_samples
train_data, val_data = random_split(train_data, [num_train_samples, num_validation_samples])

print(len(train_data)) # 25000 training egs
print(len(val_data)) # 25000 validation egs
print(len(test_data)) # 10000 test egs

train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False)

# DROPOUT MODEL DEFINITION

class DropoutNet(nn.Module):
    def __init__(self, dropout_rate):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(in_features=64 * 4 * 4, out_features=64)
        self.dropout = nn.Dropout(p=dropout_rate)  # Dropout layer after the first FC layer
        self.fc2 = nn.Linear(in_features=64, out_features=10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(x)
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        x = F.relu(self.conv3(x))
        x = self.pool(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)  # Applying dropout after activation
        x = self.fc2(x)
        return x


# TRAINING WITH DIFFERENT DROPOUT RATES

num_epochs = 50
learning_rate = 0.05

random_seeds = list(range(1, 6))
dropout_rates_for_experiment = [0, 0.2, 0.4, 0.6, 0.8]

averaged_results = {dr:{} for dr in dropout_rates_for_experiment}

path_to_save = f'./run_data/dropout/C2_final_dropout_rate_compatison_lr_{learning_rate}_{num_epochs}_epochs.json'
path_to_load = f'./run_data/dropout/C2_final_dropout_rate_compatison_lr_{learning_rate}_{num_epochs}_epochs.json'
save_experiment = True


for dropout_rate in dropout_rates_for_experiment:
    print('DR: ', dropout_rate) 
    epoch_train_losses_by_run = []
    epoch_val_losses_by_run = []
    epoch_train_accuracies_by_run = []
    epoch_val_accuracies_by_run = []
    test_losses = []
    test_accuracies = []
    reports = []
    
    for random_seed in random_seeds:
        print('DR: ', dropout_rate) 
        print('seed:', random_seed)
        torch.manual_seed(random_seed)
        
        model = DropoutNet(dropout_rate).to(device)
        criterion = nn.CrossEntropyLoss()
        optimiser = optim.SGD(model.parameters(), lr=learning_rate)

        model, train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, _,_ = run_training_and_validation(model, device, learning_rate, num_epochs, criterion, optimiser, train_dataloader, val_dataloader, metrics = False, manual_lr_schedule=False, plot=True)
        epoch_train_losses_by_run.append(train_epoch_losses)
        epoch_val_losses_by_run.append(val_epoch_losses)
        epoch_train_accuracies_by_run.append(train_epoch_accuracy)
        epoch_val_accuracies_by_run.append(val_epoch_accuracy)
        
        test_loss, test_accuracy, report = run_testing(model, device, criterion, test_dataloader)
        test_losses.append(test_loss)
        test_accuracies.append(test_accuracy)
        reports.append(report)
        
    average_train_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_train_losses_by_run)]
    average_val_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_val_losses_by_run)]
    average_train_accuracies = [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_train_accuracies_by_run)]
    average_val_accuracies =  [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_val_accuracies_by_run)]
    average_test_loss = sum(test_losses)/len(test_losses)
    average_test_accuracy = sum(test_accuracies)/len(test_accuracies)
    
    averaged_results[dropout_rate] = {'seeds':random_seeds,'av_train_losses': average_train_losses,
                                       'av_val_losses': average_val_losses,
                                       'av_train_acc': average_train_accuracies,
                                       'av_val_acc': average_val_accuracies,
                                       'all_train_losses':epoch_train_losses_by_run,
                                       'all_val_losses': epoch_val_losses_by_run,
                                       'all_train_accuracies': epoch_train_accuracies_by_run,
                                       'all_val_accuracies': epoch_val_accuracies_by_run,
                                       'all_test_losses':test_losses, 
                                       'all_test_accuracies':test_accuracies,
                                       'av_test_loss': average_test_loss,
                                       'av_test_accuracy':average_test_accuracy}
    print('average for ')
    print('DR: ', dropout_rate) 
    plot_single_train_val_smoothed(average_train_losses,average_val_losses,average_train_accuracies,average_val_accuracies, num_epochs, smoothing_window=3, title=f'DROPOUT: {dropout_rate}')

if save_experiment:
    with open(path_to_save, 'w') as file:
        json.dump(averaged_results, file, indent=4)  # 'indent' makes the output formatted and easier to read
        
# PLOTTING
dropout_data = path_to_load
plot_all_models_performance_from_disk(dropout_data, enforce_axis=True)
plot_performance_comparison_from_file(dropout_data, enforce_axis=True)
display_accuracy_heatmap(dropout_data)

# --- EXPERIMENT 2.2 - TRANSFER LEARNING ---

# SWAP DATASETS WITH NEW DATALOADERS

torch.manual_seed(0)

batch_size = 64

original_train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
original_val_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=True)

swapped_train_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=True)
swapped_val_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)

test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False)

# TRAINING ON ORIGINAL DATA

# train and save models ready for transfer learning
# train two models - one dropout, one not dropout, train them on the ORIGINAL half and half data, then save a copy of the models to disk
best_dropout_rate = 0.6

num_epochs = 50
learning_rate = 0.05

random_seeds = list(range(1, 6))  # flat list of seeds 1-5 (a nested list would pass a list to torch.manual_seed)


path_to_save = f'./run_data/transfer_learning/transfer_learn_original_dat_{num_epochs}_epochs_lr_{learning_rate}.json'
path_to_load = f'./run_data/transfer_learning/transfer_learn_original_dat_{num_epochs}_epochs_lr_{learning_rate}.json'

models = [0, 1]
averaged_results = {i:{} for i in models}

save_experiment = True

# train them both on the original data
for i, model in enumerate(models):
    epoch_train_losses_by_run = []
    epoch_val_losses_by_run = []
    epoch_train_accuracies_by_run = []
    epoch_val_accuracies_by_run = []
    test_losses = []
    test_accuracies = []
    reports = []
    
    for random_seed in random_seeds:
        print('MODEL: ', i) 
        print('seed:', random_seed)
        torch.manual_seed(random_seed)
        
        model = BaselineNet() if i == 0 else DropoutNet(dropout_rate=best_dropout_rate)
        model.to(device)
        
        criterion = nn.CrossEntropyLoss()
        optimiser = optim.SGD(model.parameters(), lr=learning_rate)
        
        model, train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, _,_ = run_training_and_validation(model, device, learning_rate, num_epochs, criterion, optimiser, original_train_dataloader, original_val_dataloader, metrics = False, manual_lr_schedule=False, plot=True)
        epoch_train_losses_by_run.append(train_epoch_losses)
        epoch_val_losses_by_run.append(val_epoch_losses)
        epoch_train_accuracies_by_run.append(train_epoch_accuracy)
        epoch_val_accuracies_by_run.append(val_epoch_accuracy)
        
        test_loss, test_accuracy, report = run_testing(model, device, criterion, test_dataloader)
        test_losses.append(test_loss)
        test_accuracies.append(test_accuracy)
        reports.append(report)
        
    average_train_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_train_losses_by_run)]
    average_val_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_val_losses_by_run)]
    average_train_accuracies = [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_train_accuracies_by_run)]
    average_val_accuracies =  [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_val_accuracies_by_run)]
    average_test_loss = sum(test_losses)/len(test_losses)
    average_test_accuracy = sum(test_accuracies)/len(test_accuracies)
    
    averaged_results[i] = {'seeds':random_seeds,'av_train_losses': average_train_losses,
                                       'av_val_losses': average_val_losses,
                                       'av_train_acc': average_train_accuracies,
                                       'av_val_acc': average_val_accuracies,
                                       'all_train_losses':epoch_train_losses_by_run,
                                       'all_val_losses': epoch_val_losses_by_run,
                                       'all_train_accuracies': epoch_train_accuracies_by_run,
                                       'all_val_accuracies': epoch_val_accuracies_by_run,
                                       'all_test_losses':test_losses, 
                                       'all_test_accuracies':test_accuracies,
                                       'av_test_loss': average_test_loss,
                                       'av_test_accuracy':average_test_accuracy}
    print('average for ')
    print('Model: ', i) 
    plot_single_train_val_smoothed(average_train_losses,average_val_losses,average_train_accuracies,average_val_accuracies, num_epochs, smoothing_window=3, title=f'PRETRAINING MODEL: {i}')
    
    # save last version of model to disk for retraining    
    torch.save(model, f'./models/trained_model_{i}.pth')

    
if save_experiment:
    with open(path_to_save, 'w') as file:
        json.dump(averaged_results, file, indent=4)  # 'indent' makes the output formatted and easier to read
        
# PLOTTING
pre_training_data = path_to_load
plot_all_models_performance_from_disk(pre_training_data, enforce_axis=True)
plot_performance_comparison_from_file(pre_training_data, enforce_axis=True)
display_accuracy_heatmap(pre_training_data)


# PERFORM TRANSFER LEARNING
# load in the two pretrained models and then reinitialise some layers
# retrain on the SWAPPED data

num_epochs = 50
learning_rate = 0.05
random_seeds = list(range(1,6))

path_to_save = f'./run_data/transfer_learning/transfer_learning_data_{num_epochs}_epochs_lr_{learning_rate}.json'
path_to_load = f'./run_data/transfer_learning/transfer_learning_data_{num_epochs}_epochs_lr_{learning_rate}.json'

models = [0, 1]
averaged_results = {i:{} for i in models}

save_experiment = True

# train them both on the swapped train and val data - test data same
for i, model in enumerate(models):
    epoch_train_losses_by_run = []
    epoch_val_losses_by_run = []
    epoch_train_accuracies_by_run = []
    epoch_val_accuracies_by_run = []
    test_losses = []
    test_accuracies = []
    reports = []
    
    for random_seed in random_seeds:
        print('MODEL: ', i) 
        print('seed:', random_seed)
        torch.manual_seed(random_seed)
        # here handle the loading of the saved model and reinitialisation of the fully connected layers
        if i == 0:
            pretrained_model_non_dropout = torch.load('./models/trained_model_0.pth')
            pretrained_model_non_dropout.fc1 =  nn.Linear(in_features=64 * 4 * 4, out_features=64)
            pretrained_model_non_dropout.fc2 = nn.Linear(in_features=64, out_features=10)
            model = pretrained_model_non_dropout
        elif i == 1:
            pretrained_model_best_dropout = torch.load('./models/trained_model_1.pth')
            pretrained_model_best_dropout.fc1 =  nn.Linear(in_features=64 * 4 * 4, out_features=64)
            pretrained_model_best_dropout.fc2 = nn.Linear(in_features=64, out_features=10)
            model = pretrained_model_best_dropout
        model.to(device)
        criterion = nn.CrossEntropyLoss()
        optimiser = optim.SGD(model.parameters(), lr=learning_rate)
        model, train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, _,_ = run_training_and_validation(model, device, learning_rate, num_epochs, criterion, optimiser, swapped_train_dataloader, swapped_val_dataloader, metrics = False, manual_lr_schedule=False, plot=True)
        epoch_train_losses_by_run.append(train_epoch_losses)
        epoch_val_losses_by_run.append(val_epoch_losses)
        epoch_train_accuracies_by_run.append(train_epoch_accuracy)
        epoch_val_accuracies_by_run.append(val_epoch_accuracy)
        
        test_loss, test_accuracy, report = run_testing(model, device, criterion, test_dataloader)
        test_losses.append(test_loss)
        test_accuracies.append(test_accuracy)
        reports.append(report)
        
    average_train_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_train_losses_by_run)]
    average_val_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_val_losses_by_run)]
    average_train_accuracies = [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_train_accuracies_by_run)]
    average_val_accuracies =  [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_val_accuracies_by_run)]
    average_test_loss = sum(test_losses)/len(test_losses)
    average_test_accuracy = sum(test_accuracies)/len(test_accuracies)
    
    averaged_results[i] = {'seeds':random_seeds,'av_train_losses': average_train_losses,
                                       'av_val_losses': average_val_losses,
                                       'av_train_acc': average_train_accuracies,
                                       'av_val_acc': average_val_accuracies,
                                       'all_train_losses':epoch_train_losses_by_run,
                                       'all_val_losses': epoch_val_losses_by_run,
                                       'all_train_accuracies': epoch_train_accuracies_by_run,
                                       'all_val_accuracies': epoch_val_accuracies_by_run,
                                       'all_test_losses':test_losses, 
                                       'all_test_accuracies':test_accuracies,
                                       'av_test_loss': average_test_loss,
                                       'av_test_accuracy':average_test_accuracy}
    print('average for ')
    print('Model: ', i) 
    plot_single_train_val_smoothed(average_train_losses,average_val_losses,average_train_accuracies,average_val_accuracies, num_epochs, smoothing_window=3, title=f'TRANSFER LEARNING MODEL: {i}')
    


if save_experiment:
    with open(path_to_save, 'w') as file:
        json.dump(averaged_results, file, indent=4)  # 'indent' makes the output formatted and easier to read

# plotting results

transfer_learned_data = path_to_load
plot_all_models_performance_from_disk(transfer_learned_data, enforce_axis=True)
plot_performance_comparison_from_file(transfer_learned_data, enforce_axis=True)
display_accuracy_heatmap(transfer_learned_data)


Files already downloaded and verified
25000
25000
10000


### Experiment 3 (19 MARKS) <ignore>

Figs 20, 21 and 22 show gradient flow through the different models that have been tested so far, plus a third model which has had batch normalisation added. As batch normalisation brings with it new parameters and new layers, these have been omitted for easier comparison in Fig 22, although the gradient flow in those layers can be seen in Fig 23.

The absolute value of the gradient was used for all statistics as it provided a clearer representation of gradient magnitudes at the different layers. Absolute values show the size of gradient regardless of sign which was found to be more useful for trying to visualise the propagation of those gradients through the layers. 
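
The reason for taking absolute values can be seen with a toy example: gradients of opposite sign cancel when averaged, so a layer receiving large updates can look as though it receives almost none. The values below are illustrative assumptions, not measured gradients.

```python
# Toy per-weight gradients with mixed signs (illustrative values only).
grads = [0.5, -0.5, 0.4, -0.4]

# The signed mean cancels out even though every gradient is large.
signed_mean = sum(grads) / len(grads)

# The mean of absolute values reflects the actual update magnitude.
absolute_mean = sum(abs(g) for g in grads) / len(grads)
```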

#### 3.1
The result in Fig 20 shows that, for the baseline model, the gradient in the first 5 episodes is small overall and virtually non-existent in the earlier layers. In the last 5 episodes the gradients are higher overall, but also seem to have developed a different spread, with larger gradients in the early layers and smaller gradients in the later layers. Variability seems to be in proportion to the size of the gradient.

These results indicate that for the baseline model there were initially very small updates being made to parameters, primarily in the later layers, with little gradient reaching the earliest layers. By the end of training this has changed significantly and there is more information being passed to the earlier layers.

<figure><center><img src="./figs/e3/gradients baseline model.png" width=800><figcaption style="max-width: 600px"> Fig 20. Mean and standard deviation of the gradients of the loss function with respect to the parameters at each layer of the baseline model during training. </figcaption></center></figure>

#### 3.2
Fig 21 shows some marked similarities to Fig 20, indicating similarities in gradient flow between the models with and without dropout. The most significant difference is the magnitude of the gradients, which are higher in both the first and last 5 episodes for the dropout model, though with similar variability and a very similar pattern of propagation to that described above.

<figure><center><img src="./figs/e3/gradients dropout model.png" width=800><figcaption style="max-width: 600px"> Fig 21. Mean and standard deviation of the gradients of the loss function with respect to the parameters at each layer of the model with dropout implemented during training. </figcaption></center></figure>

#### 3.3
The results for gradient propagation in the batch normalised model (Figs 22 and 23) are significantly different. Firstly, the gradients for all of the bias terms of convolutional layers that have had batch normalisation applied simply disappear. This is because the role of the bias parameter is essentially taken over by the parameters of the batch normalisation layer (as seen in Fig 23) due to the 'absorption of bias' phenomenon in batch normalisation [x, y].
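
The disappearing bias gradient can be reproduced in isolation. In training mode, batch normalisation subtracts the per-channel batch mean, so a constant added by the preceding convolution's bias is removed before it can affect the loss. The sketch below uses a single convolution followed by `nn.BatchNorm2d` on random data; the layer sizes, the squared-error loss and the offset target are assumptions chosen only to make the effect visible, not the setup used in the experiments.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # bias=True by default
bn = nn.BatchNorm2d(8)                            # learnable scale and shift
bn.train()                                        # use batch statistics

x = torch.randn(16, 3, 8, 8)
target = torch.randn(16, 8, 8, 8) + 1.0  # offset so the BN shift receives a clear gradient

loss = ((bn(conv(x)) - target) ** 2).mean()
loss.backward()

# The conv bias is cancelled by the mean subtraction, so its gradient is
# zero up to floating-point error; the BN shift takes over its role.
conv_bias_grad = conv.bias.grad.abs().max().item()
bn_bias_grad = bn.bias.grad.abs().max().item()
```

This is why convolutions that feed directly into batch normalisation are often constructed with `bias=False`.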

In the layers that *are* in common, however, a number of other things are striking. Firstly, the values of the gradients are dramatically higher for all layers in the first 5 episodes, which is especially significant for the earlier layers, where virtually no gradient was reaching in the models without batch normalisation. In the last 5 episodes the gradients are broadly similar.

The distribution of the gradient is also more consistent within the batch normalised model. Whereas in the models without batch normalisation the gradient shifts from mostly updating the later layers to mostly updating the earlier layers, with batch normalisation it is more evenly distributed (with more going to the earlier layers) throughout training.

<figure><center><img src="./figs/e3/gradients batchnorm model (matching others).png" width=800><figcaption style="max-width: 600px"> Fig 22. Mean and standard deviation of the gradients of the loss function with respect to the parameters at each layer of the model with batch normalisation during training. Note: in this plot the batch normalisation layers and their parameter gradients are not shown, to facilitate comparison with previous models. </figcaption></center></figure>
<figure><center><img src="./figs/e3/gradients batchnorm model (not matching others).png" width=1000><figcaption style="max-width: 600px"> Fig 23. Mean and standard deviation of the gradients of the loss function with respect to the parameters at each layer of the model with batch normalisation during training. Batch normalisation parameter gradients included. </figcaption></center></figure>

<figure><center><img src="./figs/e3/gradint flow relative metrics.png" width=800><figcaption style="max-width: 600px"> Fig 24. Comparison grouped by metric. </figcaption></center></figure>

#### 3.4 
The performance of the batch normalised model can be seen below in Fig 25, and on the test dataset in Fig 26. It can be seen that the model overfits quickly and performs quite poorly on the test data, with substantial instability in the validation performance. This is perhaps a surprising result given the regularisation effect batch normalisation is often associated with [x, y], and shall be discussed further in the analysis below, as there are other properties of batch normalisation which may be responsible for this finding.

<figure><center><img src="./figs/e3/batch norm performance.png" width=800><figcaption style="max-width: 600px"> Fig 25. Performance plots showing individual and averaged training and validation losses and accuracies for a model with batch normalisation applied and trained on the original data over 50 epochs. </figcaption></center></figure>
<figure><center><img src="./figs/e3/batch norm test results.PNG" width=300><figcaption style="max-width: 600px"> Fig 26. Test performance of model trained with batch normalisation</figcaption></center></figure>

In [None]:
#############################
### Code for Experiment 3 ###
#############################

# return to original data splits

batch_size = 64

torch.manual_seed(0)

transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])


train_data = torchvision.datasets.CIFAR10(root='./data', train=True, transform=transform, download=True)
test_data = torchvision.datasets.CIFAR10(root='./data', train=False, transform=transform)

num_validation_samples = 5000
num_train_samples = len(train_data) - num_validation_samples

train_data, val_data = random_split(train_data, [num_train_samples, num_validation_samples])

print(len(train_data)) # 45000 training egs
print(len(val_data)) # 5000 validation egs
print(len(test_data)) # 10000 test egs

train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False)



# define functions for accumulating gradients

def collect_gradients_abs_4(model, dataloader, device, criterion, optimizer, num_epochs):
    first_5_episodes_gradients_abs = {name: [] for name, _ in model.named_parameters()}
    last_5_episodes_gradients_abs = {name: [] for name, _ in model.named_parameters()}

    for epoch in range(num_epochs):
        model.train().to(device)
        for batch_count, (images, labels) in enumerate(dataloader, 1):
            images = images.to(device)
            labels = labels.to(device)

            optimizer.zero_grad()

            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()

            # absolute gradient of every parameter for this batch ("episode")
            episode_gradients_abs = {}
            for name, param in model.named_parameters():
                if param.grad is None and param.requires_grad:
                    # trainable parameter that received no gradient: record zeros
                    episode_gradients_abs[name] = torch.zeros_like(param.data)
                elif param.grad is not None:
                    episode_gradients_abs[name] = torch.abs(param.grad.clone().detach())

            if epoch == 0 and batch_count <= 5:
                for name, grad_abs in episode_gradients_abs.items():
                    first_5_episodes_gradients_abs[name].append(grad_abs)
            elif epoch == num_epochs - 1 and batch_count > len(dataloader) - 5:
                for name, grad_abs in episode_gradients_abs.items():
                    last_5_episodes_gradients_abs[name].append(grad_abs)

            optimizer.step()

    return first_5_episodes_gradients_abs, last_5_episodes_gradients_abs

def compute_gradient_statistics_abs_4(gradients_abs):
    mean_gradients_abs = {}
    std_gradients_abs = {}
    for layer_name, layer_gradients_abs in gradients_abs.items():
        layer_gradients_abs = torch.stack(layer_gradients_abs)
        mean_gradients_abs[layer_name] = torch.mean(layer_gradients_abs, dim=0)
        std_gradients_abs[layer_name] = torch.std(layer_gradients_abs, dim=0)
    return mean_gradients_abs, std_gradients_abs

def plot_gradient_statistics_abs_4(mean_gradients_first5_abs, std_gradients_first5_abs, mean_gradients_last5_abs, std_gradients_last5_abs, skip_bn=True):
    if skip_bn:
        # Filter out batch normalization layers
        layer_names = [name for name in mean_gradients_first5_abs.keys() if not name.startswith('bn')]
    else:
        layer_names = list(mean_gradients_first5_abs.keys())

    num_layers = len(layer_names)
    x = np.arange(num_layers)
    width = 0.35

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    fig.suptitle('Gradient Statistics (Absolute Means, Absolute Standard Deviations)', fontsize=16)

    # Plot mean absolute gradients (collected over 5 training batches, not 5 epochs)
    ax1.bar(x - width/2, [torch.mean(mean_gradients_first5_abs[name]).item() for name in layer_names], width, label='First 5 Batches')
    ax1.bar(x + width/2, [torch.mean(mean_gradients_last5_abs[name]).item() for name in layer_names], width, label='Last 5 Batches')
    ax1.set_xticks(x)
    ax1.set_xticklabels(layer_names, rotation=45)
    ax1.set_xlabel('Layer')
    ax1.set_ylabel('Mean of Absolute Gradients')
    ax1.set_title('Mean of Absolute Gradients vs Layer')
    ax1.legend()

    # Plot standard deviations of absolute gradients (collected over 5 training batches, not 5 epochs)
    ax2.bar(x - width/2, [torch.mean(std_gradients_first5_abs[name]).item() for name in layer_names], width, label='First 5 Batches')
    ax2.bar(x + width/2, [torch.mean(std_gradients_last5_abs[name]).item() for name in layer_names], width, label='Last 5 Batches')
    ax2.set_xticks(x)
    ax2.set_xticklabels(layer_names, rotation=45)
    ax2.set_xlabel('Layer')
    ax2.set_ylabel('Standard Deviation of Absolute Gradients')
    ax2.set_title('Standard Deviation of Absolute Gradients vs Layer')
    ax2.legend()

    plt.tight_layout()
    plt.show()

def plot_model_comparison(first_5_mean_gradients_non_drop, first_5_mean_gradients_dropout, first_5_mean_gradients_bn,
                          last_5_mean_gradients_non_drop, last_5_mean_gradients_dropout, last_5_mean_gradients_bn,
                          first_5_std_gradients_non_drop, first_5_std_gradients_dropout, first_5_std_gradients_bn,
                          last_5_std_gradients_non_drop, last_5_std_gradients_dropout, last_5_std_gradients_bn):
    # Use the parameter names of the baseline model, excluding batch norm layers
    layer_names = [name for name in first_5_mean_gradients_non_drop.keys() if not name.startswith('bn')]
    num_layers = len(layer_names)
    x = np.arange(num_layers)
    width = 0.2

    fig, axs = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Model Comparison - Gradient Statistics', fontsize=16)

    # Plot mean absolute gradients for the first 5 batches of training
    axs[0, 0].bar(x - width, [torch.mean(first_5_mean_gradients_non_drop[name]).item() for name in layer_names], width, label='Non-Dropout')
    axs[0, 0].bar(x, [torch.mean(first_5_mean_gradients_dropout[name]).item() for name in layer_names], width, label='Dropout')
    axs[0, 0].bar(x + width, [torch.mean(first_5_mean_gradients_bn[name]).item() for name in layer_names], width, label='Batch Norm')
    axs[0, 0].set_xticks(x)
    axs[0, 0].set_xticklabels(layer_names, rotation=45)
    axs[0, 0].set_xlabel('Layer')
    axs[0, 0].set_ylabel('Mean of Absolute Gradients')
    axs[0, 0].set_title('First 5 Batches - Mean of Absolute Gradients')
    axs[0, 0].legend()
    # axs[0, 0].set_ylim(0, 0.04)
    

    # Plot mean absolute gradients for the last 5 batches of training
    axs[0, 1].bar(x - width, [torch.mean(last_5_mean_gradients_non_drop[name]).item() for name in layer_names], width, label='Non-Dropout')
    axs[0, 1].bar(x, [torch.mean(last_5_mean_gradients_dropout[name]).item() for name in layer_names], width, label='Dropout')
    axs[0, 1].bar(x + width, [torch.mean(last_5_mean_gradients_bn[name]).item() for name in layer_names], width, label='Batch Norm')
    axs[0, 1].set_xticks(x)
    axs[0, 1].set_xticklabels(layer_names, rotation=45)
    axs[0, 1].set_xlabel('Layer')
    axs[0, 1].set_ylabel('Mean of Absolute Gradients')
    axs[0, 1].set_title('Last 5 Batches - Mean of Absolute Gradients')
    axs[0, 1].legend()
    # axs[0, 1].set_ylim(0, 0.2)
    

    # Plot standard deviation of absolute gradients for the first 5 batches of training
    axs[1, 0].bar(x - width, [torch.mean(first_5_std_gradients_non_drop[name]).item() for name in layer_names], width, label='Non-Dropout')
    axs[1, 0].bar(x, [torch.mean(first_5_std_gradients_dropout[name]).item() for name in layer_names], width, label='Dropout')
    axs[1, 0].bar(x + width, [torch.mean(first_5_std_gradients_bn[name]).item() for name in layer_names], width, label='Batch Norm')
    axs[1, 0].set_xticks(x)
    axs[1, 0].set_xticklabels(layer_names, rotation=45)
    axs[1, 0].set_xlabel('Layer')
    axs[1, 0].set_ylabel('Standard Deviation of Absolute Gradients')
    axs[1, 0].set_title('First 5 Batches - Standard Deviation of Absolute Gradients')
    axs[1, 0].legend()

    # Plot standard deviation of absolute gradients for the last 5 batches of training
    axs[1, 1].bar(x - width, [torch.mean(last_5_std_gradients_non_drop[name]).item() for name in layer_names], width, label='Non-Dropout')
    axs[1, 1].bar(x, [torch.mean(last_5_std_gradients_dropout[name]).item() for name in layer_names], width, label='Dropout')
    axs[1, 1].bar(x + width, [torch.mean(last_5_std_gradients_bn[name]).item() for name in layer_names], width, label='Batch Norm')
    axs[1, 1].set_xticks(x)
    axs[1, 1].set_xticklabels(layer_names, rotation=45)
    axs[1, 1].set_xlabel('Layer')
    axs[1, 1].set_ylabel('Standard Deviation of Absolute Gradients')
    axs[1, 1].set_title('Last 5 Batches - Standard Deviation of Absolute Gradients')
    axs[1, 1].legend()
    # axs[1, 1].set_ylim(0, 0.2)


    plt.tight_layout()
    plt.show()

# Set epochs and learning rate
num_epochs = 50
learning_rate = 0.05

# 3.1 Gradient flow for the original model
torch.manual_seed(1984)
non_drop_model = BaselineNet()
criterion = nn.CrossEntropyLoss()
optimiser = optim.SGD(non_drop_model.parameters(), lr=learning_rate)
first_5_epochs_gradients_abs_non_drop, last_5_epochs_gradients_abs_non_drop = collect_gradients_abs_4(non_drop_model, train_dataloader, device, criterion, optimiser, num_epochs)
first_5_mean_gradients_non_drop, first_5_std_gradients_non_drop = compute_gradient_statistics_abs_4(first_5_epochs_gradients_abs_non_drop)
last_5_mean_gradients_non_drop, last_5_std_gradients_non_drop = compute_gradient_statistics_abs_4(last_5_epochs_gradients_abs_non_drop)
plot_gradient_statistics_abs_4(first_5_mean_gradients_non_drop, first_5_std_gradients_non_drop, last_5_mean_gradients_non_drop, last_5_std_gradients_non_drop)

# 3.2 Gradient flow for the model with dropout
torch.manual_seed(1984)
drop_model = DropoutNet(0.6)
criterion = nn.CrossEntropyLoss()
optimiser = optim.SGD(drop_model.parameters(), lr=learning_rate)
first_5_epochs_gradients_abs_dropout, last_5_epochs_gradients_abs_dropout = collect_gradients_abs_4(drop_model, train_dataloader, device, criterion, optimiser, num_epochs)
first_5_mean_gradients_dropout, first_5_std_gradients_dropout = compute_gradient_statistics_abs_4(first_5_epochs_gradients_abs_dropout)
last_5_mean_gradients_dropout, last_5_std_gradients_dropout = compute_gradient_statistics_abs_4(last_5_epochs_gradients_abs_dropout)
plot_gradient_statistics_abs_4(first_5_mean_gradients_dropout, first_5_std_gradients_dropout, last_5_mean_gradients_dropout, last_5_std_gradients_dropout)

# 3.3 Gradient flow for the model with batch normalization
# Create a model with batch normalisation after each hidden layer, as per the brief

class BatchNormNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.fc1 = nn.Linear(in_features=64 * 4 * 4, out_features=64)
        self.bn4 = nn.BatchNorm1d(64)
        self.fc2 = nn.Linear(in_features=64, out_features=10)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.pool(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.pool(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.pool(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.bn4(self.fc1(x)))
        x = self.fc2(x)
        return x
    
torch.manual_seed(1984)
bn_model = BatchNormNet()
criterion = nn.CrossEntropyLoss()
optimiser = optim.SGD(bn_model.parameters(), lr=learning_rate)
first_5_epochs_gradients_abs_bn, last_5_epochs_gradients_abs_bn = collect_gradients_abs_4(bn_model, train_dataloader, device, criterion, optimiser, num_epochs)
first_5_mean_gradients_bn, first_5_std_gradients_bn = compute_gradient_statistics_abs_4(first_5_epochs_gradients_abs_bn)
last_5_mean_gradients_bn, last_5_std_gradients_bn = compute_gradient_statistics_abs_4(last_5_epochs_gradients_abs_bn)
plot_gradient_statistics_abs_4(first_5_mean_gradients_bn, first_5_std_gradients_bn, last_5_mean_gradients_bn, last_5_std_gradients_bn, skip_bn=True)
plot_gradient_statistics_abs_4(first_5_mean_gradients_bn, first_5_std_gradients_bn, last_5_mean_gradients_bn, last_5_std_gradients_bn, skip_bn=False)

# 3.4 Full training run for the batch norm model
# Train, validate and test the batch norm model over several random seeds

num_epochs = 50
learning_rate = 0.05

random_seeds = list(range(1, 6))
path_to_save = f'./run_data/batch_norm/batch_norm_{num_epochs}_epochs_LR_{learning_rate}.json'
path_to_load = f'./run_data/batch_norm/batch_norm_{num_epochs}_epochs_LR_{learning_rate}.json'
averaged_results = {'bn':{}}
save_experiment = True

# Train the batch norm model on the original data, once per random seed

epoch_train_losses_by_run = []
epoch_val_losses_by_run = []
epoch_train_accuracies_by_run = []
epoch_val_accuracies_by_run = []
test_losses = []
test_accuracies = []
reports = []

for random_seed in random_seeds:
    print('seed:', random_seed)
    
    torch.manual_seed(random_seed)
    
    model = BatchNormNet()
    model.to(device)
    
    criterion = nn.CrossEntropyLoss()
    optimiser = optim.SGD(model.parameters(), lr=learning_rate)
    
    model, train_epoch_losses, train_epoch_accuracy, val_epoch_losses, val_epoch_accuracy, _, _ = run_training_and_validation(
        model, device, learning_rate, num_epochs, criterion, optimiser,
        train_dataloader, val_dataloader,
        metrics=False, manual_lr_schedule=False, plot=True)
    epoch_train_losses_by_run.append(train_epoch_losses)
    epoch_val_losses_by_run.append(val_epoch_losses)
    epoch_train_accuracies_by_run.append(train_epoch_accuracy)
    epoch_val_accuracies_by_run.append(val_epoch_accuracy)
    
    test_loss, test_accuracy, report = run_testing(model, device, criterion, test_dataloader)
    test_losses.append(test_loss)
    test_accuracies.append(test_accuracy)
    reports.append(report)
    
average_train_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_train_losses_by_run)]
average_val_losses = [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*epoch_val_losses_by_run)]
average_train_accuracies = [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_train_accuracies_by_run)]
average_val_accuracies =  [sum(epoch_accuracies) / len(epoch_accuracies) for epoch_accuracies in zip(*epoch_val_accuracies_by_run)]
average_test_loss = sum(test_losses)/len(test_losses)
average_test_accuracy = sum(test_accuracies)/len(test_accuracies)

averaged_results['bn'] = {'seeds':random_seeds,'av_train_losses': average_train_losses,
                                    'av_val_losses': average_val_losses,
                                    'av_train_acc': average_train_accuracies,
                                    'av_val_acc': average_val_accuracies,
                                    'all_train_losses':epoch_train_losses_by_run,
                                    'all_val_losses': epoch_val_losses_by_run,
                                    'all_train_accuracies': epoch_train_accuracies_by_run,
                                    'all_val_accuracies': epoch_val_accuracies_by_run,
                                    'all_test_losses':test_losses, 
                                    'all_test_accuracies':test_accuracies,
                                    'av_test_loss': average_test_loss,
                                    'av_test_accuracy':average_test_accuracy}
print('Average results for the batch norm model')
plot_single_train_val_smoothed(average_train_losses, average_val_losses, average_train_accuracies, average_val_accuracies, num_epochs, smoothing_window=3, title='BATCH NORM MODEL')

    
if save_experiment:
    with open(path_to_save, 'w') as file:
        json.dump(averaged_results, file, indent=4)  # 'indent' makes the output formatted and easier to read

batch_norm = 'run_data/batch_norm/batch_norm_50_epochs_LR_0.05.json'
plot_all_models_performance_from_disk(batch_norm, enforce_axis=True)
plot_performance_comparison_from_file(batch_norm, enforce_axis=True)
display_accuracy_heatmap(batch_norm)


# Conclusions and Discussion (instructions) - 25 MARKS <ignore>
In this section, you are expected to:
* briefly summarise and describe the conclusions from your experiments (8 MARKS).
* discuss whether or not your results are expected, providing scientific reasons (8 MARKS).
* discuss two or more alternative/additional methods that may enhance your model, with scientific reasons (4 MARKS). 
* Reference two or more relevant academic publications that support your discussion. (4 MARKS)

These experiments demonstrated some of the fundamental properties of artificial neural networks.

Experiment one demonstrated the effect that the learning rate (LR) can have on a model's ability to fit the training data, and the impact that this has on generalisation. It showed that too high an LR leads to coarse updates that cause instability, variable performance, and an inability to reach the minimum of the loss. It also showed that low LRs lead to slow progress but a closer fit to the training data.

It was shown that an LR scheduler can balance these properties, giving quick learning early on and finer-grained updates in the later stages of training. However, this translated into only a small benefit in validation and test performance.

Experiment two demonstrated the regularisation effect of dropout, both in regular training and in a transfer learning scenario. Dropout had a significant impact on the generalisation gap, which shrank as the dropout rate increased, and it significantly reduced the validation loss. It also had a profound regularising effect in the transfer learning example. Although it did improve performance, the improvements were fairly small.

Experiment three clearly demonstrated the powerful impact of batch normalisation on the propagation of gradients backwards through the layers of a neural network. The stark contrast in the average gradients arriving at the early layers at the start of training was striking. Its impact on model performance, however, was somewhat disappointing, although I believe it can be understood.

The results in experiments one and two are very much to be expected.

Learning rates are known to have a significant impact on the training dynamics and convergence of neural networks. High LRs can lead to overshooting the optimal solution and oscillations around the minimum, while low LRs result in slow convergence but more stable updates. The use of LR schedulers, such as reducing the LR over time, allows for faster initial convergence while fine-tuning the model in later stages. This is consistent with the observations in experiment one. 
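
The learning-rate behaviour described above can be reproduced on a toy problem. The following is a minimal sketch, not the code used in experiment one: it runs plain gradient descent on the one-dimensional loss f(w) = w², and the helper name `gradient_descent`, the learning rates, and the step-decay schedule are all illustrative assumptions.

```python
def gradient_descent(lr_schedule, w0=1.0, steps=20):
    """Minimise f(w) = w**2 and return |w| after each step.

    lr_schedule is a hypothetical helper: a function that maps the step
    index to the learning rate used at that step.
    """
    w = w0
    history = []
    for step in range(steps):
        grad = 2.0 * w                      # derivative of w**2
        w = w - lr_schedule(step) * grad    # plain gradient descent update
        history.append(abs(w))
    return history


# Illustrative (assumed) learning rates, not those used in the experiments.
too_high = gradient_descent(lambda step: 1.1)    # every step overshoots the minimum
too_low = gradient_descent(lambda step: 0.01)    # stable, but progress is slow
# Step decay: start with a large learning rate and halve it every 5 steps.
decayed = gradient_descent(lambda step: 0.9 * 0.5 ** (step // 5))

print(f"high LR:    final |w| = {too_high[-1]:.3f}")
print(f"low LR:     final |w| = {too_low[-1]:.3f}")
print(f"decayed LR: final |w| = {decayed[-1]:.2e}")
```

On this toy loss the high learning rate moves away from the minimum at w = 0, the low learning rate approaches it slowly, and the decayed schedule combines large early steps with small late ones. A real network's loss surface is far less well behaved, so this only illustrates the mechanism.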

The regularization effect of dropout is also well-established. Dropout introduces noise and stochasticity into the network by randomly dropping activations, preventing over-reliance on individual neurons and promoting more robust representations. This leads to improved generalization and reduced overfitting, as demonstrated in experiment two. Experiment two also demonstrates the impact that this can have on a network's ability to fit the data, with reduced performance on the training data accompanying the increased accuracy on unseen data.
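
The mechanism can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the `DropoutNet` implementation used in experiment two: the function name `dropout` and the all-ones input are assumptions, and the rate of 0.6 is chosen to match the rate passed to `DropoutNet` in the gradient flow experiment. It uses the "inverted" scaling convention, in which surviving activations are multiplied by 1 / (1 - p) during training.

```python
import random


def dropout(activations, p, training=True, rng=random):
    """Inverted dropout on a list of activations.

    During training each activation is zeroed with probability p and the
    survivors are scaled by 1 / (1 - p), so the expected value is unchanged.
    At evaluation time the activations pass through untouched.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in the range [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
activations = [1.0] * 10_000      # illustrative input, all ones
dropped = dropout(activations, p=0.6, rng=rng)

fraction_zeroed = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"fraction zeroed: {fraction_zeroed:.2f}, mean after dropout: {mean_after:.2f}")
```

Because a different random subset of units is silenced on every forward pass, no single unit can be relied upon, which is the source of the regularising noise discussed above.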

The results of experiment three, however, were less expected. Although the gradient flow analysis clearly showed the powerful effect of batch normalization on the propagation of gradients backward through the layers of the neural network, the batch-normalized model generalised poorly, with weak performance on the test set.

This result was surprising, as one of the reported benefits of batch normalization is its regularization effect [1], [6], and I was expecting it to *reduce* overfitting, but it did not. However, my understanding is that one of the headline benefits of batch normalisation is that it speeds up learning through this early propagation of gradients to all layers (as was seen here). Given this, more careful consideration needs to be given to the other hyperparameters, which should complement this drastic change. In these experiments, all hyperparameters were fixed other than those being investigated.
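
To make the normalisation step concrete, the sketch below applies it to a single feature in plain Python. It is a simplified illustration, not the `nn.BatchNorm2d` layers used in `BatchNormNet`: the function name `batch_norm_1d` and the two example batches are assumptions, and the learnable scale and shift are fixed at 1 and 0. It shows that the same activation is mapped to different values depending on which other samples share its mini-batch, which is one commonly given explanation for the regularising noise that batch normalisation adds.

```python
import math


def batch_norm_1d(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift.

    Uses the batch mean and the biased batch variance, as in training mode.
    gamma and beta stand in for the learnable scale and shift parameters.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# The same activation (100.0) placed in two different illustrative batches.
batch_a = [100.0, 80.0, 120.0, 95.0, 105.0]
batch_b = [100.0, 140.0, 150.0, 130.0, 160.0]

out_a = batch_norm_1d(batch_a)
out_b = batch_norm_1d(batch_b)

mean_a = sum(out_a) / len(out_a)
var_a = sum((v - mean_a) ** 2 for v in out_a) / len(out_a)
print(f"batch A after normalisation: mean={mean_a:.4f}, variance={var_a:.4f}")
print(f"activation 100.0 maps to {out_a[0]:.3f} in batch A and {out_b[0]:.3f} in batch B")
```

After normalisation each feature has roughly zero mean and unit variance within the batch, regardless of the scale of the incoming activations.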

There are a number of approaches I would explore to enhance the model. As well as experimenting to find more complementary hyperparameters for use with batch normalisation, I would use a more advanced optimiser such as Adam [2], which is near ubiquitous and recommended for most cases as "one of the more robust and effective optimization algorithms to use in deep learning" [15].
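
For reference, the Adam update rule from [2] can be sketched on a one-parameter problem. This is an illustrative sketch and was not run as part of the experiments: the function name `adam_minimise`, the toy loss (w - 3)², and the learning rate of 0.1 are assumptions, while the remaining defaults follow the values suggested in [2].

```python
import math


def adam_minimise(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a scalar function with the Adam update rule."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy loss (w - 3)**2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimise(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(f"w after optimisation: {w_final:.4f}")
```

Dividing by the running root-mean-square of the gradient gives each parameter its own effective step size, which is why Adam is often less sensitive to the choice of learning rate than plain SGD.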

I would also like to experiment with skip connections, which were introduced in more recent and successful architectures, most notably ResNets [4]. I would also try augmenting the training data using techniques that increase its size and diversity, such as random rotations, flips, crops, and color jittering. Data augmentation can help improve the model's ability to generalize by exposing it to a wider range of variations and reducing overfitting.
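
Two of these augmentations, random horizontal flips and random crops, can be sketched on a tiny image stored as a list of rows. This is a framework-free illustration under assumed names (`random_horizontal_flip`, `random_crop`) and an assumed 3x3 example image. In a PyTorch pipeline the equivalent transforms from `torchvision.transforms` would normally be applied to the training set instead.

```python
import random


def random_horizontal_flip(image, p=0.5, rng=random):
    """Mirror every row of a 2-D image (a list of rows) with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]


def random_crop(image, size, rng=random):
    """Cut a randomly positioned size x size window out of a 2-D image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)      # randint is inclusive at both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


rng = random.Random(0)        # fixed seed so the sketch is repeatable
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]           # illustrative 3x3 single-channel image

flipped = random_horizontal_flip(image, p=1.0, rng=rng)   # p=1.0 forces the flip
cropped = random_crop(image, size=2, rng=rng)
print("flipped:", flipped)
print("cropped:", cropped)
```

Each epoch then sees a slightly different version of every training image, which increases the effective diversity of the data without collecting new samples.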



# References (instructions) <ignore>
Use the cell below to add your references. A good format to use for references is like this:

[AB Name], [CD Name], [EF Name] ([year]), [Article title], [Journal/Conference Name] [volume], [page numbers] or [article number] or [doi]

Some examples:

JEM Bennett, A Phillipides, T Nowotny (2021), Learning with reinforcement prediction errors in a model of the Drosophila mushroom body, Nat. Comms 12:2569, doi: 10.1038/s41467-021-22592-4

SO Kaba, AK Mondal, Y Zhang, Y Bengio, S Ravanbakhsh (2023), Proc. 40th Int. Conf. Machine Learning, 15546-15566




<ignore> 

1. Bjorck, J., Gomes, C. P., Selman, B. (2018), Understanding batch normalization, CoRR abs/1806.02375.

2. Kingma, D. P., Ba, J. (2014), Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.

3. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020), An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929.

4. He, K., Zhang, X., Ren, S., Sun, J. (2016), Deep residual learning for image recognition, In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.

5. Huang, G., Liu, Z., Weinberger, K. Q. (2016), Densely connected convolutional networks, CoRR abs/1608.06993.

6. Ioffe, S., Szegedy, C. (2015), Batch normalization: Accelerating deep network training by reducing internal covariate shift, CoRR abs/1502.03167.

7. Krizhevsky, A., Hinton, G., et al. (2009), Learning multiple layers of features from tiny images.

8. Krizhevsky, A., Sutskever, I., Hinton, G. E. (2017), Imagenet classification with deep convolutional neural networks, Communications of the ACM 60(6), 84–90.

9. PyTorch Foundation (2023), CrossEntropyLoss - PyTorch 2.3 documentation, https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html (accessed May 12, 2024).

10. PyTorch Foundation (2023), datasets - PyTorch 2.3 documentation, https://pytorch.org/vision/0.8/datasets.html (accessed May 12, 2024).

11. PyTorch Foundation (2023), LogSoftmax - PyTorch 2.3 documentation, https://pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html#torch.nn.LogSoftmax (accessed May 12, 2024).

12. PyTorch Foundation (2023), NLLLoss - PyTorch 2.3 documentation, https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html#torch.nn.NLLLoss (accessed May 12, 2024).

13. PyTorch Foundation (2023), SGD - PyTorch 2.3 documentation, https://pytorch.org/docs/stable/generated/torch.optim.SGD.html (accessed May 12, 2024).

14. Simonyan, K., Zisserman, A. (2014), Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.

15. Zhang, A., Lipton, Z. C., Li, M., Smola, A. J. (2020), Dive into Deep Learning, https://d2l.ai/ (accessed May 12, 2024).

16. Phong, N. H., Ribeiro, B. (2020), Rethinking recurrent neural networks and other improvements for image classification, CoRR abs/2007.15161.
