### Abstract
Through structured experimentation this assignment explores and demonstrates a number of fundamental properties of artificial neural networks and how they are trained. 

Using a relatively simple convolutional neural network to classify images in the CIFAR-10 dataset, a number of network and hyperparameter choices are explored. 

The aim of the task is not to try and maximise classification accuracy on a test set. Rather the impact that different approaches and choices have on performance during training, validation and testing shall be the focus. 

Many of the results conform to the expectations associated with the various interventions tried. However, by examining them systematically and reflecting on the model behaviour in context, a better understanding can be gained of the nuances of these techniques' impact. Some results were also surprising, especially the poor performance of the model utilising batch normalisation, although under further analysis this was understandable.  

### Introduction

The labelled CIFAR-10 dataset was created as part of a study exploring different approaches to training generative models for natural images [7]. It and its larger sibling CIFAR-100 have since been used for benchmarking and testing in many exploratory and groundbreaking papers relating to computer vision and image classification, not least in the development of AlexNet [8], ResNet [4] and, most recently, transformer-for-vision architectures [3]. It is fitting, then, to use it to explore some of the fundamental properties of artificial neural networks (NN) in this assignment.

The first experiment examined the effect that altering the learning rate (LR) has on training and performance. As well as experimenting with different learning rates, a LR 'scheduler' was designed and its performance compared to models with static learning rates.

The second experiment aimed to demonstrate and offer insight into the impact of introducing a dropout layer into the architecture of the network. Different dropout rates were trialled and their effects compared to baseline performance in both training and evaluation. The effect of dropout was also tested in a transfer learning context.

The third experiment focused on understanding gradient flow during backpropagation in different architectures. Gradient flow was observed in the baseline and dropout models from experiments 1 and 2, and was also visualised for a new model with batch normalisation implemented. 

Approaches and methods are introduced in the methodology section, with results and analysis offered afterwards.

## Data (7 MARKS) <ignore>

The CIFAR-10 dataset was developed by Krizhevsky, Hinton, et al. as a labelled subset of the 80 Million Tiny Images dataset [7]. It consists of 60,000 colour 32x32 images split into 50,000 training examples and 10,000 testing examples. Each image belongs to one of 10 mutually exclusive classes and is labelled correspondingly. These classes describe the subject of the image, for example 'airplane', 'cat' or 'ship'.

It is conveniently accessible, along with many other benchmarking datasets, via a PyTorch `datasets` class which takes a boolean flag, enabling the user to load the training and test data into separate `torch.utils.data.Dataset` instances very easily. This was the method used here. 
 
As part of this process it is possible to apply transforms to the data as it is loaded, and this approach was used to convert the data to tensors and normalise it. The normalisation (such that the data in each of the 3 input channels had a mean of 0 and a standard deviation of 1) ensured all inputs were standardised, so that the model could focus on the informative variation rather than any incidental variance.  
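As a rough illustration of the per-channel standardisation described above, the sketch below rescales one channel to mean 0 and standard deviation 1 in plain Python. The pixel values and the helper name `standardise_channel` are made up for illustration; the report's own pipeline applied the equivalent operation through transforms at load time.

```python
import math

def standardise_channel(values):
    """Rescale one channel so it has mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / std for v in values]

# Hypothetical pixel intensities for one colour channel, already scaled to [0, 1].
red_channel = [0.1, 0.4, 0.5, 0.9, 0.6]
red_standardised = standardise_channel(red_channel)

# Check the result: the mean should be ~0 and the standard deviation ~1.
new_mean = sum(red_standardised) / len(red_standardised)
new_std = math.sqrt(sum(v ** 2 for v in red_standardised) / len(red_standardised))
print(round(new_mean, 6), round(new_std, 6))
```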

The 50,000 training instances were split to create a validation set of 5,000 samples (with a random seed set for consistency across experiments), leaving 45,000 for training. The class distribution for each dataset was checked and found to be well balanced (see Fig 1), meaning simple accuracy is a reliable measure of overall performance across the classes.
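The seeded split and the balance check can be sketched in plain Python as below. This is only an illustration of the idea: the seed value, the helper name `split_indices` and the stand-in labels are assumptions, not the code used for the report.

```python
import random
from collections import Counter

def split_indices(n_total, n_val, seed):
    """Shuffle indices reproducibly and hold out n_val of them for validation."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    return indices[n_val:], indices[:n_val]

# Hypothetical seed; the same seed always reproduces the same split.
train_idx, val_idx = split_indices(n_total=50_000, n_val=5_000, seed=42)

# Stand-in labels: the CIFAR-10 training set has 5,000 images per class.
labels = [i % 10 for i in range(50_000)]
val_counts = Counter(labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), sorted(val_counts.items()))
```

With a uniformly random split of a balanced dataset, each class should appear roughly 500 times in the validation set, which is the kind of balance Fig 1 reports.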

<figure><center><img src="./figs/classdisttraining.png" width=200><img src="./figs/classdistval.png" width=200><img src="./figs/class dist test.png" width=200><figcaption style="max-width: 600px"> Figure 1. Class distributions across the training, validation, and testing datasets</figcaption></center></figure>


Data batching for stochastic gradient descent was handled by the `DataLoader` class, which yields samples without replacement from the shuffled dataset in batches of a user-specified size.

It was decided that a single train and validation split would be appropriate for the task at hand. Cross-validation was discounted because its benefits (a more accurate estimate of the likely performance of the model, and exposure to all of the available training data) were not important considerations here. This is because the objective is to understand the minutiae of model behaviour rather than to maximise final performance. 

## Architecture (17 MARKS) <ignore>


<figure><center><img src="./figs/baseline_model_diagram.png" width=800><figcaption style="max-width: 600px"> Fig 2. BaselineNet Convolutional Neural Network architecture. </figcaption></center></figure>

The Baseline architectural choices were based on a combination of the assignment brief, initial experimentation, and common practices in the field.

Fig 2. shows the overall architecture of the model, whilst Table 1. gives the specific details of kernel dimensions for convolutional layers, pooling dimensions, stride and padding values, as well as the input and output dimensions for each layer. A number of considerations went into these choices. 

<figure><center><img src="./figs/TABLE.PNG" width=600><figcaption style="max-width: 600px"> Table 1: Convolutional Neural Network Architecture</figcaption></center></figure>

The filter dimensions of 3x3 were chosen as they have been shown to be effective in capturing local spatial patterns while keeping the number of parameters relatively low. Indeed, VGG16 net demonstrated the power of stacked 3x3 filter-based convolutional layers [14], and although they were used in a much deeper network, they were also used on much higher resolution images, and so they are a reasonable choice for what is quite a similar overall architecture here. 

The increasing number of filters (16, 32, 64) in the convolutional layers allows the network to learn progressively more complex and abstract features as the depth increases, and was another property shown to be effective in the VGG network [14]. Setting the stride and padding to 1 in the convolutional layers preserved the spatial resolution and prevented information loss at the edges; this is a common technique with 3x3 filters.

The pooling layers used max pooling with a pool size of 2x2 and a stride of 2. This reduces the spatial dimensions, thereby reducing the number of parameters needed later in the network, and also provides a form of translation invariance: because the max pooling layer keeps only the maximum activation value within each window, the exact position of a feature within the 2x2 window becomes less important.
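The effect of these stride, padding and pooling choices on spatial size can be traced with the standard output-size formula, floor((n + 2p - k) / s) + 1. The sketch below assumes three stages, each a 3x3 convolution followed by 2x2 max pooling; that layout is an assumption (Table 1 holds the actual details), though it is consistent with the 1024 flattened activations reported for `fc1`.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling window along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32  # CIFAR-10 images are 32x32
trace = [size]
for _ in range(3):  # assumed: three conv + pool stages (16, 32, 64 filters)
    size = conv_out(size, kernel=3, stride=1, padding=1)  # 3x3 conv keeps the size
    size = conv_out(size, kernel=2, stride=2, padding=0)  # 2x2 max pool halves it
    trace.append(size)

flattened = 64 * size * size  # 64 feature maps in the final stage
print(trace, flattened)  # [32, 16, 8, 4] 1024
```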

The batch size of 64 was selected as a balance between computational efficiency and the ability to capture a representative sample of the dataset in each iteration. This size allows for efficient data processing while providing a reasonable approximation of the gradient during training.

The choice of size of the fully connected layer was a balance between the capacity of the model and the number of parameters that could realistically be trained over the numerous runs required to get accurate, averaged results for the various experiments. `fc1` takes as its input the <lt>$1024$</lt> activations from the flattened output of the preceding convolutional layer. The final value of <lt>$64$</lt> outputs resulted in <lt>$65,536$</lt> weight parameters for that layer. Initially <lt>$128$</lt> outputs were tried, but this resulted in roughly <lt>$131,000$</lt> weight parameters, which simply took too long to train and would have added capacity to the model that, as will be seen, was not missing. 
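The parameter counts quoted above follow from simple arithmetic, sketched here. The helper names are illustrative, and the channel progression 3 to 16 to 32 to 64 for the convolutional layers is an assumption based on the filter counts given earlier.

```python
def linear_params(n_in, n_out, bias=True):
    """Count weights (and optionally biases) in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

def conv_params(c_in, c_out, kernel=3, bias=True):
    """Count weights (and optionally biases) in a square-kernel conv layer."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)

fc1_weights_64 = linear_params(1024, 64, bias=False)    # 65,536
fc1_weights_128 = linear_params(1024, 128, bias=False)  # 131,072

# Assumed channel progression 3 -> 16 -> 32 -> 64 for the three conv layers.
conv_total = sum(conv_params(c_in, c_out)
                 for c_in, c_out in [(3, 16), (16, 32), (32, 64)])
print(fc1_weights_64, fc1_weights_128, conv_total)
```

Under these assumptions `fc1` holds well over half of the model's parameters, which is why its width has such a direct effect on training time.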

ReLU (Rectified Linear Unit) was chosen as the non-linear activation function throughout the BaselineNet architecture for the same reason it is often used in NNs: it helps to avoid vanishing gradients, because it does not saturate for positive inputs as activations such as sigmoid or tanh do. Its derivative therefore does not shrink towards 0, which would otherwise cause a diminution of the gradient through many small multiplications. 
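A toy calculation makes the "many small multiplications" point concrete. The sketch below multiplies the activation derivative across a hypothetical stack of 10 layers, ignoring the weights entirely, so it is an illustration of the mechanism rather than a model of the actual network.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid; it never exceeds 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

depth = 10
pre_activation = 2.0  # hypothetical positive pre-activation at every layer

# Backpropagation multiplies one such derivative per layer.
sigmoid_chain = sigmoid_grad(pre_activation) ** depth
relu_chain = relu_grad(pre_activation) ** depth
print(f"sigmoid: {sigmoid_chain:.2e}, relu: {relu_chain:.1f}")
```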


## Loss function (3 MARKS) <ignore>

The loss function used for each experiment was cross-entropy loss, implemented using the `nn.CrossEntropyLoss` class from PyTorch [9].

It is widely used in classification problems such as this, where the target variable is binomial or multinomial. 

It works by first transforming the raw logits of the output layer into what is effectively a probability distribution via the softmax activation function, which is applied across the output logits. Where <lt>$C$</lt> is the number of classes, it outputs a $C$-dimensional vector of real numbers in the range (0, 1) that sum to 1, which is why it can be treated as a probability distribution across the output classes.

The cross entropy loss function compares this distribution to a one-hot encoded version of the true class label. This acts as a target probability distribution and the cross entropy loss calculation essentially quantifies the difference between the predicted distribution in the form of the softmax outputs, and the one-hot encoded true label distribution. 

Mathematically, for a single sample with true label <lt>$y$</lt> and predicted probabilities <lt>$\hat{y}$</lt>, the cross-entropy loss is calculated as:
<lt>$$\text{CE}(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$</lt>

where <lt>$y_i$</lt> is the true label (0 or 1) for class <lt>$i$</lt>, and <lt>$\hat{y}_i$</lt> is the predicted probability for class <lt>$i$</lt> as given by the softmax output. By minimizing the average cross-entropy loss over all training samples, the model learns to assign high probabilities to the correct class and low probabilities to the incorrect ones.

Practically, the use of the PyTorch module removes the need for a softmax layer in the model architecture itself, as the criterion takes in the raw logits and then applies the `nn.LogSoftmax()` activation function [11] and the `nn.NLLLoss()` [12] (Negative Log-Likelihood Loss) in a single operation that encapsulates the above. Used as it is here, in a mini-batch stochastic gradient descent context, the function also handles the averaging of the loss across the mini-batch. This averaging is important because it makes the scale of the loss independent of the batch size and provides a stable estimate of the overall loss for the batch and then across batches in the epoch.
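The combination of log-softmax, negative log-likelihood and batch averaging can be written out in plain Python as a minimal sketch. It mirrors the formula above rather than PyTorch's internals, and the logits and targets are made-up values.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def cross_entropy(batch_logits, batch_targets):
    """Mean negative log-likelihood of the target class across a mini-batch."""
    losses = [-log_softmax(logits)[target]
              for logits, target in zip(batch_logits, batch_targets)]
    return sum(losses) / len(losses)

# Hypothetical logits for a 3-class problem and a batch of two samples.
batch_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
batch_targets = [0, 2]
loss = cross_entropy(batch_logits, batch_targets)

# Uniform logits give a loss of log(C), the value expected from an untrained model.
uniform_loss = cross_entropy([[0.0, 0.0, 0.0]], [1])
print(round(loss, 4), round(uniform_loss, 4))
```

For the 10 CIFAR-10 classes the same reasoning gives a starting loss near log(10), roughly 2.3, which is a useful sanity check on early training curves.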

## Optimiser (4 MARKS) <ignore>

The optimiser used to handle parameter updates and implement gradient descent was stochastic gradient descent (SGD), implemented using the `optim.SGD` class from PyTorch. 

SGD is one of the most straightforward optimisers one can use. It estimates the true gradient of the loss function with respect to the parameters of the model by calculating the gradient on a small subset of the training data (a mini-batch), and updates the parameters of the model with this approximate gradient, scaled by a LR which, in this approach, is fixed and is a user-defined hyperparameter that can be tuned. 

This process is repeated for multiple mini-batch samples taken from the training data without replacement (until the entire data set has been seen - representing an 'epoch' of training) and then repeated until a stopping criterion is met - in this case a set number of epochs.

Mathematically, the estimated gradient for a mini-batch of size $B$ sampled from the training data is computed as:
<lt>$$\nabla_\theta L(\theta_t) \approx \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta L(\theta_t; x_i, y_i)$$</lt>
where <lt>$(x_i, y_i)$</lt> represents the <lt>$i$</lt>-th example in the mini-batch.
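The mini-batch estimate and the resulting update, $\theta \leftarrow \theta - \eta \nabla_\theta L$, can be sketched on a toy one-parameter regression problem. Everything in this example (the data, the learning rate of 0.1 and the squared-error loss) is a stand-in chosen for illustration, not the setup used in the experiments.

```python
import random

def minibatch_gradient(theta, batch):
    """Average gradient of the squared error 0.5 * (theta * x - y)^2 over a mini-batch."""
    return sum((theta * x - y) * x for x, y in batch) / len(batch)

rng = random.Random(0)
# Toy data generated from y = 3x; the model is a single parameter theta.
data = [(x, 3.0 * x) for x in [rng.uniform(-1, 1) for _ in range(256)]]

theta, lr, batch_size = 0.0, 0.1, 64
for epoch in range(50):
    rng.shuffle(data)  # sample without replacement within each epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        theta -= lr * minibatch_gradient(theta, batch)  # plain SGD update

print(round(theta, 3))  # should end close to the true slope of 3
```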

A number of more sophisticated optimisers are available when training NNs today, not least the 'Adam' (Adaptive Moment Estimation) optimiser [2], which is near ubiquitous and recommended for most cases as "one of the more robust and effective optimization algorithms to use in deep learning" [15]. These approaches, by incorporating properties such as the 'momentum' of the gradient and adaptive LRs, have been shown to lead to a smoother and more direct journey through parameter space to the minimum loss. 

As performance was not the chief consideration, SGD was chosen to make analysing the impact of LR on performance straightforward and transparent. With SGD, parameters are updated directly based only on the gradient and the learning rate. Keeping to this very direct formulation makes it easier to understand and interpret the impact of the LR on the model's performance. SGD is highly sensitive to the choice of LR, and this sensitivity is precisely what makes it suitable for studying the impact of LR on model performance. 

An interesting continuation of this experiment would be to introduce momentum and compare performance, then Adagrad, then Adam. But as the aim is just to explore learning rate, it was decided to keep the optimiser algorithm as simple as possible. 


## Experiments <ignore>
### Experiment 1 (8 MARKS)

#### 1.1 - Learning Rates

In order to explore the effect of the LR on model performance, a number of exploratory training runs with different LRs between <lt>$0.5$</lt> and <lt>$1 \times 10^{-6}$</lt> were first carried out. These established the extremes of model behaviour with respect to that range of learning rates. An upper value above which learning would be unstable and a lower value below which no learning would occur were found ($0.15$ and $0.001$ respectively). 

These investigations suggested a range from which to select the 5 LRs to compare for the experiment, which were chosen as <lt>$0.1, 0.075, 0.05, 0.025 \text{, and } 0.01$</lt>.

For all trials the data was loaded and processed as described above, with the model, criterion and optimiser also as above. 

For each learning rate, 5 trials were conducted; that is, 5 different models were instantiated, trained and evaluated. The results for each of these trials were recorded, along with the average values for each epoch across the 5 trials. The 5 trials were run by iterating over a list of random seeds that was kept constant for all trials in all experiments, to allow for comparison and consistency in weight initialisation, dropout activation selection and all other random processes.  

Models were trained by mini-batch stochastic gradient descent as described above. During training each batch was scored in terms of loss and accuracy, where loss was as above and accuracy was simply a count of how many images were correctly classified divided by the number of images in the batch.

Batch scores were averaged across the epoch to give the training loss and accuracy for that epoch. After training for each epoch, the model was taken out of training mode, with gradient computations halted, and the validation set was iterated through in batches, with the validation batch losses and accuracies again being averaged to give a validation loss and accuracy for the epoch. After the validation pass of the final epoch, each model was evaluated against the test dataset. The 'test score' reported for a LR below is the average of the test scores of the 5 models instantiated for that LR. Test scores were calculated in the same way as validation scores. 
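The batch scoring and epoch averaging described above amount to the following, shown here as a small sketch with made-up predictions. The helper names `batch_accuracy` and `epoch_mean` are illustrative only.

```python
def batch_accuracy(predictions, targets):
    """Fraction of correctly classified images in one batch."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

def epoch_mean(batch_scores):
    """Unweighted mean of per-batch scores, giving the epoch-level metric."""
    return sum(batch_scores) / len(batch_scores)

# Hypothetical predicted and true class indices for three small batches.
batches = [
    ([3, 5, 1, 0], [3, 5, 2, 0]),
    ([7, 7, 2, 9], [7, 1, 2, 9]),
    ([4, 4, 4, 4], [4, 4, 4, 4]),
]
accuracies = [batch_accuracy(p, t) for p, t in batches]
epoch_accuracy = epoch_mean(accuracies)
print(accuracies, round(epoch_accuracy, 4))  # [0.75, 0.75, 1.0] 0.8333
```

One caveat of an unweighted mean is that a smaller final batch counts as much as a full one, so the epoch score can differ very slightly from the per-sample average.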

Accumulating these metrics across epochs rather than batches is a somewhat arbitrary, although conventional, approach. It is a convenient way to keep track of how many times the model has been exposed to all of the training data and is easy to understand when plotting performance graphs. 

The data for each learning rate's performance was stored in a JSON file for later plotting and analysis. 

#### 1.2 - LR Scheduler

Having established above the performance of different LRs it was clear that the model could tolerate a relatively high initial LR but that this needed to drop significantly and arrive at or beneath 0.02 by the end of the 50 epochs to ensure a more fine grained exploration of the loss landscape in later stages. 

A number of approaches to LR scheduling were explored visually, as in Fig 5. below. The function and decay rate that best fit the findings of experiment 1.1 was 'inverse time decay' with a decay rate of 0.25. This function is defined as <lt>$\alpha_t = \frac{\alpha_0}{1 + kt}$</lt> where <lt>$\alpha_t$</lt> is the LR at time step <lt>$t$</lt>, <lt>$\alpha_0$</lt> is the initial learning rate, <lt>$k$</lt> is the decay rate, and <lt>$t$</lt> is the current time step or iteration. 
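The inverse time decay formula is easy to evaluate directly. The sketch below assumes an initial LR of 0.1 (the highest of the static rates tested), a decay rate of 0.25, and a time step counted in epochs starting from 0; the exact indexing used in the experiments is not stated, so treat these as assumptions.

```python
def inverse_time_decay(initial_lr, decay_rate, step):
    """alpha_t = alpha_0 / (1 + k * t)."""
    return initial_lr / (1.0 + decay_rate * step)

# Assumed: alpha_0 = 0.1, k = 0.25, and t is the epoch index starting at 0.
schedule = [inverse_time_decay(0.1, 0.25, epoch) for epoch in range(50)]

for epoch in (0, 10, 25, 49):
    print(epoch, round(schedule[epoch], 5))
```

Under these assumptions the LR falls to 0.02 at epoch 16, since 0.1 / (1 + 0.25 × 16) = 0.02, and ends at roughly 0.0075, which fits the requirement of arriving at or beneath 0.02 well before the end of training.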

How this function modifies the LR over the epochs can be seen in plot 1 of Fig 5.

A model was then trained as above using this LR decay function, which was applied every epoch. The results of this training were gathered and plotted as can be seen in Fig 6. Fig 7. shows a comparison between this model and the best performing model from experiment 1.

### Experiment 2 (8 MARKS) <ignore>

#### 2.1 - Dropout Rates

For this experiment, in accordance with the assignment brief, the original training data was re-split into two halves to create new training and validation datasets of size 25,000 each, and a new model was defined which incorporated dropout in the fully connected layers.

Of the two fully connected layers, one is connected to the output layer and would not typically have dropout applied (as these connections directly output the logits which amount to the classification choice of the network). Through flattening, there is a sense in which the final convolutional layer is fully connected to the first truly 'fully connected' layer. However, it is again not generally a good idea to apply dropout to CNN activations, as it can disrupt the spatial structure and correlation present in the feature map representations. As such, it was decided to experiment with a single dropout layer applied to the activations of the first fully connected layer, which follows the flattening operation, and to vary the dropout rate in this layer only.

A set of dropout rates for experimentation was defined as <lt>$0, 0.2, 0.4, 0.6 \text{, and } 0.8$</lt>, because 0 had to be included and 1 would mean no activations were passed forward at all.
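To illustrate what each dropout rate does to a layer's activations, here is a minimal sketch of 'inverted' dropout, in which surviving activations are rescaled by 1 / (1 - rate) so the expected activation is unchanged. PyTorch's `nn.Dropout` uses this scaling convention, but the function below is a standalone illustration with made-up inputs, not the DropoutNet code.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0] * 10_000  # hypothetical fc1 outputs, all equal for clarity
for rate in [0.0, 0.2, 0.4, 0.6, 0.8]:
    dropped = dropout(activations, rate, rng)
    zero_fraction = sum(a == 0.0 for a in dropped) / len(dropped)
    mean_activation = sum(dropped) / len(dropped)
    print(rate, round(zero_fraction, 3), round(mean_activation, 3))
```

The fraction of zeroed activations tracks the dropout rate while the mean activation stays near 1, and in evaluation mode (`training=False`) the activations pass through untouched.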

The same approach as in experiment 1.1 was taken to gathering data, with 5 trials carried out for each dropout rate and with models initialised with consistent seeding, then trained with all other hyperparameters fixed and results recorded as above. 

The experiment's results (seen in Fig 9) show the effect of dropout regularisation on model performance, and Fig 11 established the optimal dropout rate for the DropoutNet model on this specific classification task, which was 0.6. 

This rate was taken forward into experiment 2.2.

#### 2.2 - Dropout and Transfer Learning

The second part of this experiment aims to investigate the performance of dropout regularization in the context of transfer learning.

It compares the performance of the best performing model from experiment 1 with:
*i)* a model pretrained on the original data *without* dropout then retrained on the new data
*ii)* a model pretrained on the original data *with* dropout then retrained on the new data. 

In both of the latter cases the retraining was partial and amounted to transfer learning, where trained models had some weights 'frozen' (kept fixed) whilst others were reinitialised and made trainable on the new data.

Performance for all models was compared during training and validation as well as on the test dataset.

Transfer learning was implemented as follows.

Two models, one with dropout and one without, were initialised and trained as in previous experiments, iterating over 5 random seeds and gathering training, validation and testing performance data. The final instance of each model was saved to disk so that the trained model's weights were stored, along with a record of its performance data. 

The validation and training datasets were then swapped, the models were loaded and all of their layers were frozen except their fully connected layers which were manually re-initialised, meaning they were subject to training. 

These two models were then trained on the new, swapped data as in previous experiments - 5 times each by iterating over the random seeds. These models had effectively been trained twice on slightly differently distributed datasets - pretrained on the original data, and then their final layers had been trained on the new data. 

Their performance during this retraining on training, validation and testing data was recorded. 

By conducting this experiment, the performance of the models with and without dropout regularization could be compared in a transfer learning scenario. The use of swapped train and validation data allows for evaluating the models' ability to generalize to a different data distribution.

The averaged results and smoothed plots seen in Figs 14, 16, 17 and 18 provide insights into how the pretrained models with and without dropout perform when fine-tuned on the swapped data. The test results on the original test dataset assess the models' performance on unseen data.

### Experiment 3 (8 MARKS) <ignore>

#### 3. Gradient Flow Analysis

#### 3.1, 3.2, 3.3

This experiment aimed to investigate gradient flow in three different neural network models: BaselineNet (without regularization), DropoutNet (with dropout regularization), and BatchNormNet (with batch normalization). The goal was to compare the mean and standard deviation of the gradients in the first 5 episodes and the last 5 episodes of training for each model, to understand the difference that dropout and batch normalisation might make to how the gradient propagates back through a model during training.

To achieve this, functions were written to extract the raw gradient values for each layer during training for the first 5 training steps and the last 5 training steps. PyTorch conveniently makes these values accessible as a property of the model, and all that was required was to collect the gradients for each layer across the 5 episodes, and then calculate an average for each layer across those 5 episodes. This was done for each of the models specified by each experiment.

The same process was carried out for each model and is specified below. For the batch normalisation experiment, however, a new model including batch normalisation had to be defined. Batch normalisation is a process by which the activations of a layer are normalised by subtracting the batch mean and dividing by the batch standard deviation. After normalization, the activations are scaled and shifted using learnable parameters (`bn.weight` and `bn.bias` in Fig 23). It has been shown to enable higher learning rates, reduce the sensitivity to initialization, and act as a regularizer, improving the overall performance and generalization of the model [1], [6]. It was applied to all layers except the last here. 
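The normalise, scale and shift steps can be sketched for a single feature as follows. This covers only the training-time computation on one mini-batch; the running statistics used at evaluation time are omitted, and the `gamma` and `beta` arguments stand in for the learnable `bn.weight` and `bn.bias`. The activation values are made up.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)  # biased batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

# Hypothetical activations of a single feature across a mini-batch of 4.
activations = [2.0, 4.0, 6.0, 8.0]
normalised = batch_norm(activations)
rescaled = batch_norm(activations, gamma=2.0, beta=0.5)
print([round(v, 3) for v in normalised])
print([round(v, 3) for v in rescaled])
```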

The process for each of the baseline, dropout and batchnormalised models was as follows. 

For this experiment training ran for 30 epochs, as prolonged training was not essential given that performance was not the focus. The original data split was reinstated, and a fixed LR of 0.05 (the best performing static learning rate) was selected. 

For each model the same random seed was set, then the model, criterion and optimiser were initialised as in previous experiments. The models were then trained for 30 epochs, but rather than gathering performance data, gradient data was collected as described above. This data was then plotted in a variety of forms to highlight the trends in the data. 

It should be noted that rather than the raw gradient values being collected, it was the absolute values. The reasons for this are made clear in the results section for this experiment. 
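One consequence of working with absolute values can be shown with a small sketch: signed gradients of opposite sign cancel in a mean, so the raw mean can sit near zero even when individual gradients are large. The numbers and the helper name `layer_gradient_stats` are hypothetical, and this is offered as an illustration rather than as the full reasoning given in the results section.

```python
def layer_gradient_stats(gradient_steps):
    """Mean and standard deviation of absolute gradient values pooled over several steps."""
    magnitudes = [abs(g) for step in gradient_steps for g in step]
    mean = sum(magnitudes) / len(magnitudes)
    var = sum((m - mean) ** 2 for m in magnitudes) / len(magnitudes)
    return mean, var ** 0.5

# Hypothetical gradients for one layer over two training steps.
steps = [[0.5, -0.5, 0.25, -0.25], [0.1, -0.1, 0.3, -0.3]]
raw_mean = sum(g for step in steps for g in step) / 8  # signed values cancel out
abs_mean, abs_std = layer_gradient_stats(steps)
print(raw_mean, round(abs_mean, 4), round(abs_std, 4))
```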

#### 3.4

Finally, a batch normalised model was trained on the original data for 50 epochs as in previous experiments, with performance on the original training, validation and test datasets recorded and plotted. It was compared and analysed in relation to the other models' performance.

# Results (instructions) - 55 MARKS <ignore>
Use the Results section to summarise your findings from the experiments. For each experiment, use the Markdown/text cell to describe and explain your results, and use the code cell (and additional code cells if necessary) to conduct the experiment and produce figures to show your results.

### Experiment 1 (17 MARKS) <ignore>

#### 1.1
As can be seen in Fig 1, initial experimentation established reasonable limits within which to select LRs for further testing. Rates of 0.15 and above led to unusual, erratic behaviour such as that seen in Fig 1.2, where the LR is so high that the model cannot converge to an optimal solution and instead overshoots. On the other hand, Fig 1.1 shows the other extreme, where the LR is so low that no learning can occur.

<figure><center><img src="./figs/e1/lrchaos.png" width=700><img src="./figs/e1/tranval_no_learn.png" width=700><figcaption style="max-width: 600px"> Fig 1. Showing behavioural extremes for different learning rates: unstable learning at a LR of 0.2, and minimal learning at a LR of 0.001 </figcaption></center></figure>

The performances of different LRs can be seen in Fig 2. below and are well summarised in Fig 3. Looking at Fig 2., it can be seen that as LRs get smaller the generalisation gap between the training and validation loss and accuracy is slower to develop, and less extreme. This shows that models trained with higher LRs are able to fit to the training data more quickly, but also overfit to it more quickly. The impact of the LR can also be seen in the volatility of the training loss, which is markedly lower at lower learning rates.

<figure><center><img src="./figs/e1//lr1.png" width=700><img src="./figs/e1/lr2.png" width=700><img src="./figs/e1/lr3.png" width=700><img src="./figs/e1/lr4.png" width=700><img src="./figs/e1/lr5.png" width=700><figcaption style="max-width: 600px"> Fig 2. Performance plots showing individual and averaged training and validation losses and accuracies for models trained with descending LRs across 50 epochs of training </figcaption></center></figure>

In terms of performance on unseen data, the test performances in Fig 4. and the smoothed validation losses and accuracies in Fig 3. give a good overview of how LRs affect this, with lower LRs leading to a reduced validation loss at the end of the 50 epochs owing to slower fitting (and thus overfitting), but also lower accuracy in test and validation. This was in contrast to the quicker rise to high accuracy for those with high learning rates, followed by a plateauing and gradual decline. 

<figure><center><img src="./figs/e1/smoothed loss accuracy.png" width=700><figcaption style="max-width: 600px"> Fig 3. Smoothed averaged results for accuracies and losses across 50 epochs on validation data for models trained with different learning rates</figcaption></center></figure>

<figure><center><img src="./figs/e1/leraning rates test performance.PNG" width=300><figcaption style="max-width: 600px"> Fig 4. Test set performance of models trained with different LRs highlighting the best result for each metric in green</figcaption></center></figure>

#### 1.2

Having observed the different performances above, it was clear the ideal balance would be a LR that began at the highest end of the LRs above (0.1), but that decayed reasonably quickly in order to avoid the onset of overfitting around 10 epochs. 

Different approaches to decay and how they affect LR over the 50 epochs can be seen in Fig 5. The smooth inverse time function with a decay rate of 0.25 seemed to have the ideal combination and was found to perform well relative to the others. 

<figure><center><img src="./figs/e1/lr_scheculer experiments.png" width=350><figcaption style="max-width: 600px"> Fig 5. Effect of different LR decay schedules on the active LR across 50 epochs </figcaption></center></figure>

As can be seen when a model using this scheduler is compared with a model using a static LR (see fig 1.3), there is a slight improvement in overall performance with a scheduler. However, the most substantial difference appears to be in the stability of the validation loss and accuracy, despite the model seeming to overfit to the training data. The LR-scheduled model's validation accuracy stabilises in a way that did not occur with any of the other models that saturated at close to 100% training accuracy. This is likely because the ever-decreasing LR means that after a certain point the parameters settle, as they receive only negligible updates. 

<figure><center><img src="./figs/e1/LR SCHEDULER final results.png" width=700><figcaption style="max-width: 600px"> Fig 6. Performance over 50 epochs of training for model trained with LR scheduler </figcaption></center></figure>
<figure><center><img src="./figs/e1/results accuracy camparison lr and scheduler.png" width=350><figcaption style="max-width: 600px"> Fig 7. Comparison of performance across training of model trained with a LR scheduler, and the best performing model without a scheduler (LR of 0.05)</figcaption></center></figure>
<figure><center><img src="./figs/e1/lr decay comparison.PNG" width=400><figcaption style="max-width: 600px"> Fig 8. Comparison of test results between a model trained with a LR scheduler, and the best performing model trained without a scheduler (LR of 0.05) highlighting the best result for each metric in green</figcaption></center></figure>

That this model achieves 100% on the training set is notable in itself, as none of the earlier models did so. This is likely owing to the earlier models either (in the case of high LRs) being too coarse to home in on a particular point in parameter space that would give 100% accuracy, or (in the case of the very low learning rates) being unable to traverse the loss landscape effectively owing to updates that were too small, possibly getting stuck in sub-optimal minima.  

Overall this experiment demonstrates well the impact that different LRs can have on learning in a NN model. 

### Experiment 2 (19 MARKS) <ignore>

#### 2.1
The effect of increasing dropout rates can clearly be seen in Figs 9.1 to 9.5. All models were initialised with the same parameters other than the dropout rate, and what can be observed demonstrates the effect of dropout as a regularisation technique: as the dropout rate increases from 0 to 0.8 we see a reduction in the speed and extent to which the model fits to the training data. This is reflected in the final accuracy it obtains on the training data and the speed with which it gets there. It can also be seen in the significant decrease in the generalisation gap between training and validation performance: the absence of dropout (9.1) gives the biggest gap, and the highest dropout rate leads to the smallest.

 <figure><center><img src="./figs/e2/dr0.png" width=800><img src="./figs/e2/dr02.png" width=800><img src="./figs/e2/dr04.png" width=800><img src="./figs/e2/dr046.png" width=800><img src="./figs/e2/dr08.png" width=800><figcaption style="max-width: 600px"> Fig 9. Performance plots showing individual and averaged training and validation losses and accuracies for models trained with increasing dropout rates across 50 epochs of training </figcaption></center></figure>
<figure><center><img src="./figs/e2/overall dropout comparisons.png" width=800><figcaption style="max-width: 600px"> Fig 10. Smoothed averaged results for accuracies and losses across 50 epochs on validation data for models trained with different dropout rates </figcaption></center></figure>
<figure><center><img src="./figs/e2/dropout rates test results.PNG" width=300><figcaption style="max-width: 600px"> Fig 11. Test set performance of models trained with different dropout rates highlighting the best result for each metric in green</figcaption></center></figure>
In Fig 10. the comparative performance of models trained with different dropout rates can be seen clearly. Looking at the validation loss, one can see that the onset of its rise is earlier and its development more severe for the lower dropout rates. Despite this, accuracy is relatively well preserved, as the lower dropout rates still attain reasonable performance on both test and validation datasets. That being said, the best test loss and test accuracy belong to those models with higher dropout rates, albeit by a small margin.

#### 2.2
The findings of the above are reiterated in the freshly trained models shown in Figs 12 and 13, where we see the model trained without dropout (model 0 in these experiments) demonstrating poor generalisability and marked overfitting, while the model trained with dropout fits less closely to the training data, as seen in its poorer training accuracy. However, what it does learn mostly generalises to the validation dataset. These findings are also visible in the comparison plot in Fig 14.

<figure><center><img src="./figs/e2/pretrained_0.png" width=800><figcaption style="max-width: 600px"> Fig 12. Performance plots showing individual and averaged training and validation losses and accuracies for Baseline (non-dropout) model (model 0) during training on the original data over 50 epochs. </figcaption></center></figure>

<figure><center><img src="./figs/e2/pretrained_1.png" width=800><figcaption style="max-width: 600px"> Fig 13. Performance plots showing individual and averaged training and validation losses and accuracies for model with dropout implemented (model 1) during training on the original data over 50 epochs. </figcaption></center></figure>

<figure><center><img src="./figs/e2/pretrained comparison.png" width=400><figcaption style="max-width: 600px"> Fig 14. Direct comparison of the averaged and smoothed performance of non-dropout (model 0) and dropout (model 1) models over 50 epochs of training and validation on the original data</figcaption></center></figure>

<figure><center><img src="./figs/e2/pretrained test results.PNG" width=300><figcaption style="max-width: 600px"> Fig 15. Test set performance of models trained without (0) and with (1) dropout implemented, highlighting the best result for each metric in green. </figcaption></center></figure>

Both of the models above then had their fully connected layers retrained on a reversed version of the original dataset (and so essentially a 'new' dataset in terms of a new distribution) whilst their other parameters remained frozen. Their performance during this second phase of training can be seen in Figs 16 and 17.  

It is clear that the dropout-free model fits and then overfits to this new data extremely quickly, with only a very brief period during which it is learning generalisable information from the new data. The model pre-trained and retrained with dropout, on the other hand, still overfits, but the process is smoother and more gradual, with a less severe transition from fitting with generalisation to overfitting.



<figure><center><img src="./figs/e2/retrained_baseline.png" width=800><figcaption width=500> Fig 16. Performance plots showing individual and averaged training and validation losses and accuracies for Baseline (non-dropout) model (model 0) during retraining on swapped data over 50 epochs</figcaption></center></figure>

<figure><center><img src="./figs/e2/retrained_dropout.png" width=800><figcaption style="max-width: 600px">Fig 17. Performance plots showing individual and averaged training and validation losses and accuracies for model with dropout (model 1) during retraining on swapped data over 50 epochs</figcaption></center></figure>

<figure><center><img src="./figs/e2/retrained_comparison.png" width=400><figcaption style="max-width: 600px">Fig 18. Direct comparison of the averaged and smoothed performance of non-dropout (model 0) and dropout (model 1) models over 50 epochs of training and validation on the swapped data </figcaption></center></figure>

<figure><center><img src="./figs/e2/retrained comparison test results.PNG" width=300><figcaption style="max-width: 600px"> Fig 19. Test set performance of models retrained on swapped dataset having been previously trained on original dataset without (0) and with (1) dropout implemented, highlighting the best result for each metric in green.</figcaption></center></figure>

In terms of overall performance on the test set, as can be seen in Fig 19., the model with dropout performs better on both metrics, although not enormously better. It also performs better than any other model so far other than that which used the LR scheduler. 

It can therefore be said that the regularisation effect of a single dropout layer was able to improve performance almost to the same level that basic LR scheduling did.

### Experiment 3 (19 MARKS) <ignore>

Figs. 20, 21 and 22 show gradient flow through the different models that have been tested so far plus a third model which has had batch normalisation added. As batch normalisation brings with it new parameters and new layers, these have been omitted for easier comparison in Fig 22, although the gradient flow in those layers can be seen in Fig 23.

The absolute value of the gradient was used for all statistics as it provided a clearer representation of gradient magnitudes at the different layers. Absolute values show the size of gradient regardless of sign which was found to be more useful for trying to visualise the propagation of those gradients through the layers. 
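
The reasoning above can be illustrated with a minimal sketch. It is not the code used to produce the figures: the function name `abs_gradient_stats`, the layer names and the gradient values are all hypothetical, and the gradients are assumed to have already been flattened into plain lists per layer. It shows how signed gradients can average to zero while their magnitudes do not.

```python
from statistics import fmean, pstdev


def abs_gradient_stats(layer_gradients):
    """Return {layer_name: (mean, std)} computed on absolute gradient values."""
    stats = {}
    for name, grads in layer_gradients.items():
        magnitudes = [abs(g) for g in grads]
        # Population standard deviation of the magnitudes within this layer.
        stats[name] = (fmean(magnitudes), pstdev(magnitudes))
    return stats


# Hypothetical flattened gradients for two layers (illustrative values only).
example_gradients = {
    "conv1.weight": [-0.02, 0.02, -0.01, 0.01],
    "fc1.weight": [0.3, -0.1, 0.2, -0.4],
}

# The signed mean hides the gradient size because positive and negative
# values cancel; the mean of the absolute values does not.
signed_mean = fmean(example_gradients["conv1.weight"])
example_stats = abs_gradient_stats(example_gradients)
```

Here the signed mean for the first layer is zero even though every individual gradient is non-zero, which is why the magnitude-based statistics are easier to compare across layers.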

#### 3.1
The result in Fig 20. shows that for the baseline model the gradient in the first 5 episodes is small overall, and virtually non-existent in the earlier layers. In the last 5 episodes the gradients are higher overall, but also seem to have developed a different spread, with larger gradients in the early layers and smaller gradients in later layers. Variability seems to be in proportion to the size of the gradient.

These results indicate that for the baseline model there were initially very small updates being made to parameters, primarily in the later layers, with little gradient reaching the earliest layers. By the end of training this has changed significantly and there is more information being passed to the earlier layers.

<figure><center><img src="./figs/e3/gradients baseline model.png" width=800><figcaption style="max-width: 600px"> Fig 20. Mean and standard deviation of the gradients of the loss function with respect to the parameters at each layer of the baseline model during training. </figcaption></center></figure>

#### 3.2
Fig 21. shows some marked similarities to Fig 20., indicating similarities in gradient flow between the models with and without dropout. The most significant difference is the magnitude of the gradients, which are higher in both the first and last 5 episodes for the dropout model, though with similar variability and a very similar pattern of propagation to that described above.

<figure><center><img src="./figs/e3/gradients dropout model.png" width=800><figcaption style="max-width: 600px"> Fig 21. Mean and standard deviation of the gradients of the loss function with respect to the parameters at each layer of the model with dropout implemented during training. </figcaption></center></figure>

#### 3.3
The results for gradient propagation in the batch normalised model (Figs 22 and 23) are significantly different. Firstly, all of the bias terms for convolutional layers that have had batch normalisation applied simply disappear. This is because the role of the bias parameter is essentially taken over by the parameters of the batch normalisation layer (as seen in Fig 23.) due to the 'absorption of bias' phenomenon in batch normalisation [x, y].
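
The bias point above can be checked numerically with a small sketch. This is an illustration under simplifying assumptions rather than the model code: `batch_norm` is a hypothetical helper that normalises a single channel held as a plain list, and the pre-activation values and bias are made up. Because batch normalisation subtracts the batch mean, any constant added beforehand is cancelled, leaving the learnable shift `beta` to play the role of the bias.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one channel over a batch, then apply scale (gamma) and shift (beta)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# Hypothetical convolution outputs for one channel across a batch of four.
pre_activations = [0.5, -1.0, 2.0, 0.25]
bias = 3.0  # a constant bias added before batch normalisation

without_bias = batch_norm(pre_activations)
with_bias = batch_norm([v + bias for v in pre_activations])

# Mean subtraction removes the constant, so both outputs match.
bias_is_absorbed = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(without_bias, with_bias)
)
```

Since the output does not depend on the bias, the loss cannot depend on it either, which is consistent with the bias terms carrying no useful gradient in the batch normalised layers.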

In the layers that *are* in common, however, a number of other things are striking. Firstly, the values of the gradients are dramatically higher for all layers in the first 5 episodes, which is especially significant for the earlier layers, which virtually no gradient was reaching in the un-batch-normalised models. In the last 5 episodes the gradients are broadly similar to those of the other models.

The distribution of the gradient is also more consistent within the batch normalised model. Whereas in the non-batch-normalised models it shifts from mostly updating later layers to mostly updating earlier layers, with batch norm it is more evenly distributed (with more going to the earlier layers) throughout.

<figure><center><img src="./figs/e3/gradients batchnorm model (matching others).png" width=800><figcaption style="max-width: 600px"> Fig 22. Mean and standard deviation of the gradients of the loss function with respect to the parameters at each layer of the model with batch normalisation implemented during training. Note: in this plot the batch normalisation layers and their parameter gradients are not represented, to facilitate comparison with previous models  </figcaption></center></figure>
<figure><center><img src="./figs/e3/gradients batchnorm model (not matching others).png" width=800><figcaption style="max-width: 600px"> Fig 23. Mean and standard deviation of the gradients of the loss function with respect to the parameters at each layer of the model with batch normalisation implemented during training. Batch norm parameter gradients included. </figcaption></center></figure>

<figure><center><img src="./figs/e3/gradint flow relative metrics.png" width=800><figcaption style="max-width: 600px"> Fig 24. Comparison grouped by metric. </figcaption></center></figure>

#### 3.4 
The performance of the batch normalised model can be seen below in Fig 25. and on the test dataset in Fig 26. It can be seen that the model overfits quickly and performs quite poorly on the test data, with substantial instability in the validation performance. This is perhaps a surprising result given the regularisation effect batch norm is often associated with [x, y], and shall be discussed more in the analysis below, as there are other properties of batch normalisation which may be responsible for this finding.

<figure><center><img src="./figs/e3/batch norm performance.png" width=800><figcaption style="max-width: 600px"> Fig 25. Performance plots showing individual and averaged training and validation losses and accuracies for a model with batch normalisation applied and trained on original data over 50 epochs. </figcaption></center></figure>
<figure><center><img src="./figs/e3/batch norm test results.PNG" width=300><figcaption style="max-width: 600px"> Fig 26. Test performance of model trained with batch normalisation</figcaption></center></figure>

# Conclusions and Discussion (instructions) - 25 MARKS <ignore>
In this section, you are expected to:
* briefly summarise and describe the conclusions from your experiments (8 MARKS).
* discuss whether or not your results are expected, providing scientific reasons (8 MARKS).
* discuss two or more alternative/additional methods that may enhance your model, with scientific reasons (4 MARKS). 
* Reference two or more relevant academic publications that support your discussion. (4 MARKS)

These experiments demonstrated some of the fundamental properties of artificial neural networks.

Experiment one demonstrated the effect that the learning rate can have on a model's ability to fit to training data, and the impact that this has on generalisation. It showed that too high a LR could lead to coarse updates that cause instability and variability in performance and an inability to reach the true minimum of the loss. It also showed that low LRs lead to slow progress but a closer fit to the training data.

It was shown that a LR scheduler can balance these properties and lead to quick learning with more fine-grained accuracy in later stages of training. However, this translated into only modest benefits in validation and test performance.

Experiment two demonstrated the regularisation effect of dropout both in regular training and in a transfer learning scenario. It was shown to have a significant impact on the generalisation gap, reducing it as the rate increased and reducing the validation loss significantly. It was also found to have a profound regularising effect in the transfer learning example. Although it did lead to improvements in performance, these were fairly small.

Experiment 3 demonstrated clearly the powerful impact of batch normalisation on the propagation of gradient backwards through the layers of a neural network. The stark contrast in average gradients arriving in the early layers in the initial episodes was striking. The impact it had on model performance, however, was perhaps a bit disappointing, but I believe it can be understood.

The results in experiments one and two are very much to be expected.

Learning rates are known to have a significant impact on the training dynamics and convergence of neural networks. High LRs can lead to overshooting the optimal solution and oscillations around the minimum, while low LRs result in slow convergence but more stable updates. The use of LR schedulers, such as reducing the LR over time, allows for faster initial convergence while fine-tuning the model in later stages. This is consistent with the observations in experiment one. 
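
The idea of reducing the LR over time can be sketched as a simple step decay. This is an illustrative example only and is not the scheduler designed in experiment one: the function name `step_decay_lr` and the values chosen for the initial LR, the drop factor and the number of epochs between drops are all assumptions.

```python
def step_decay_lr(initial_lr, epoch, drop_factor=0.5, epochs_per_drop=10):
    """Return the learning rate for a given epoch under step decay.

    The LR is multiplied by drop_factor once every epochs_per_drop epochs.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)


# Hypothetical schedule over 50 epochs starting from an LR of 0.1.
schedule = [step_decay_lr(0.1, epoch) for epoch in range(50)]
```

Early epochs keep the larger LR for fast progress, while later epochs use progressively smaller steps that allow finer adjustments near a minimum.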

The regularization effect of dropout is also well-established. Dropout introduces noise and stochasticity into the network by randomly dropping activations, preventing over-reliance on individual neurons and promoting more robust representations. This leads to improved generalization and reduced overfitting, as demonstrated in experiment two. Experiment 2 also demonstrates the impact that this can have on a network's ability to fit to data - with reduced performance on the training data going along with the improved generalisation.
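
The mechanism described above can be sketched in a few lines. This is a minimal illustration of the commonly used 'inverted' dropout formulation, not the layer used in the experiments: the function name `dropout`, its arguments and the example activations are hypothetical. Surviving activations are rescaled during training so that their expected value matches what the next layer sees at evaluation time, when nothing is dropped.

```python
import random


def dropout(activations, rate, training=True, rng=None):
    """Apply inverted dropout to a list of activations."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At evaluation time the activations pass through unchanged.
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - rate
    # Each activation is kept with probability keep_prob and rescaled,
    # otherwise it is set to zero.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


# Hypothetical activations, all equal to 1.0, with a dropout rate of 0.5.
train_sample = dropout([1.0] * 1000, 0.5, rng=random.Random(0))
eval_sample = dropout([1.0, 2.0, 3.0], 0.5, training=False)
```

Because a different random subset of units is silenced on every forward pass, no single unit can be relied upon, which is the source of the regularising noise discussed above.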

The results of experiment three, however, were less expected. Although the gradient flow analysis clearly showed the powerful effect of batch normalization on the propagation of gradients backward through the layers of the neural network, the batch-normalized model overfitted quickly and generalised poorly to the test set.

This result was surprising as one of the benefits of batch normalization has been shown to be its regularization effect [1], [6] and I was expecting it to *reduce* overfitting, but it did not. However, my understanding is that one of the headline benefits of batch normalisation is how it can speed up learning due to this early propagation of gradient to all layers (as was seen here). Given this, more careful consideration needs to be given to the other hyperparameters, which should complement this drastic change. In these experiments all hyperparameters were fixed other than those being investigated.

There are a number of approaches I would explore to enhance the model. As well as experimenting to find more complementary hyperparameters for use with batch normalisation, I would use a more advanced optimizer such as Adam [2]. I would also like to experiment with skip connections, which were introduced in more recent and successful architectures, most notably ResNets [4]. I would also try augmenting the training data to increase its size and diversity, using techniques such as random rotations, flips, crops, and color jittering. Data augmentation can help improve the model's ability to generalize by exposing it to a wider range of variations and reducing overfitting.
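
Two of the augmentations mentioned above, random horizontal flips and random crops, can be sketched as follows. This is a simplified illustration and not part of the experiments: the function name `random_flip_and_crop` is hypothetical, and the image is assumed to be a single-channel grid stored as a list of rows rather than a full CIFAR-10 tensor.

```python
import random


def random_flip_and_crop(image, pad=1, rng=None):
    """Randomly flip an image horizontally, then take a random crop of the
    original size from a zero-padded copy."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]
    # Zero-pad every side by `pad` pixels.
    padded_width = width + 2 * pad
    padded = [[0] * padded_width for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [[0] * padded_width for _ in range(pad)]
    # Choose the top-left corner of a crop with the original dimensions.
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)
    return [row[left:left + width] for row in padded[top:top + height]]


# Hypothetical 2x2 single-channel image.
example_image = [[1, 2], [3, 4]]
augmented = random_flip_and_crop(example_image, rng=random.Random(0))
```

Each call yields a slightly shifted or mirrored variant of the same image, so the model sees a different version on each epoch while the label stays the same.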

