# Handin Overview

We will be studying how well neural networks learn when data is scarce. To see this, we will be using two datasets:
- **MNIST**: 28x28 black-and-white images of handwritten digits from 0-9. The class label is the digit the image holds.
- **EMNIST**: 28x28 black-and-white images of handwritten letters. Class labels are in 1-27, corresponding to the letter the image holds.

The EMNIST dataset was developed as an extension of MNIST. Thus, we would expect that a network trained on one of these datasets should transfer to the other. We will investigate to what extent this is true.

In order to do this, we will **pre-train** a set of convolutional layers on one dataset and then **generalize** or **finetune** it to the other dataset. The idea is that our convolutional layers serve as a **feature extractor**, whose role is to find information in the image.

Thus, we will create a network with one feature extractor and two prediction 'heads'. Each prediction 'head' is a fully-connected layer responsible for the classification step. Since we pre-train on one dataset and finetune on another, we have one prediction head for each dataset.

An example of this is shown in the following image:
![Multi-Headed Classifier](./misc/multi_headed_model.png)
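
As a rough illustration of the layout described above, the sketch below builds one shared feature extractor and two linear heads. The class name `TwoHeadedNet`, the layer sizes, and the class counts are placeholders chosen for this example; the provided `network.py` defines its own layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoHeadedNet(nn.Module):
    """Illustrative only: one shared feature extractor, two prediction heads."""

    def __init__(self, n_pretrain_classes=10, n_finetune_classes=27):
        super().__init__()
        # Hypothetical layer sizes, not the ones used in the handout code.
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Flatten(),  # [B, 16, 4, 4] -> [B, 256]
        )
        self.pretrainer = nn.Linear(256, n_pretrain_classes)
        self.generalizer = nn.Linear(256, n_finetune_classes)

    def forward(self, x):
        # Head used on the pre-training dataset.
        return F.log_softmax(self.pretrainer(self.feature_extractor(x)), dim=1)

    def generalize(self, x):
        # Head used on the finetuning dataset; shares the same features.
        return F.log_softmax(self.generalizer(self.feature_extractor(x)), dim=1)


net = TwoHeadedNet()
batch = torch.zeros(4, 1, 28, 28)
print(net(batch).shape, net.generalize(batch).shape)
```

Both heads read the same 256-dimensional feature vector, so anything the extractor learns during pre-training is available to the finetuning head.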

## Pre-Training

We run pre-training on dataset $\mathcal{D}_P$. Given an input-target pair $(x, y)$, we train the *entire* network to predict $y$ given $x$. This means we are applying gradient descent to the convolutional layers *and* the prediction head responsible for dataset $\mathcal{D}_P$. After pre-training, we will have a convolutional neural network that has learned to extract information from dataset $\mathcal{D}_P$.

## Finetuning

We then want to evaluate how well our feature extractor trained on $\mathcal{D}_P$ performs on our finetuning dataset $\mathcal{D}_F$. **Note that $\mathcal{D}_P$ and $\mathcal{D}_F$ could be the same dataset or could be different datasets**. In either case, when finetuning we *only* apply the gradients to the prediction head responsible for $\mathcal{D}_F$. This means that we do not allow any change to occur to the feature extraction pipeline and will only train a classifier.

In this sense, we are analyzing how well features from $\mathcal{D}_P$ are represented in $\mathcal{D}_F$. If we obtain high accuracy scores on $\mathcal{D}_F$ after finetuning, this suggests that the two datasets share a lot of similar information.
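
One possible way to update only a prediction head is sketched below: the optimizer is given the head's parameters alone, and the features are computed without building a gradient graph. The modules here are small stand-ins with made-up sizes, and this is not necessarily how the provided codebase achieves the same effect.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for the real modules (hypothetical sizes).
feature_extractor = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 32), nn.ReLU())
finetune_head = nn.Linear(32, 10)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(finetune_head.parameters(), lr=0.1)

x = torch.randn(8, 1, 28, 28)
y = torch.randint(0, 10, (8,))

extractor_before = [p.clone() for p in feature_extractor.parameters()]
head_before = finetune_head.weight.clone()

with torch.no_grad():  # no gradients flow into the extractor
    feats = feature_extractor(x)
loss = F.cross_entropy(finetune_head(feats), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

extractor_unchanged = all(
    torch.equal(b, p) for b, p in zip(extractor_before, feature_extractor.parameters())
)
head_changed = not torch.equal(head_before, finetune_head.weight)
print(extractor_unchanged, head_changed)
```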

## Dataset Augmentations

We will also study how we can augment our datasets to encourage optimal feature extraction. To do this, you will implement two data augmentations: *collage* and *mixup*. Each augmentation accepts two images and outputs some interpolation between them. Each function also returns an *interpolation* scalar $\alpha \in [0, 1]$ that specifies how much of the returned image comes from image $x_i$ and how much from image $x_j$. For example, $\alpha = 0.2$ means the returned image is made up of $20\%$ from $x_i$ and $80\%$ from $x_j$.

Given an augmented image, the network will have to predict that it is $\alpha$ of class $y_i$ and $1 - \alpha$ of class $y_j$.
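
To make the target concrete, the following sketch builds the soft label $\alpha \cdot \text{onehot}(y_i) + (1 - \alpha) \cdot \text{onehot}(y_j)$. The helper name `soft_target` and the choice of 10 classes are assumptions made for this example only.

```python
import torch
import torch.nn.functional as F


def soft_target(y_i, y_j, alpha, n_classes):
    """Blend two one-hot labels with weight alpha on y_i (hypothetical helper)."""
    one_hot_i = F.one_hot(torch.tensor(y_i), n_classes).float()
    one_hot_j = F.one_hot(torch.tensor(y_j), n_classes).float()
    return alpha * one_hot_i + (1 - alpha) * one_hot_j


# An image that is 20% a "3" and 80% a "7".
target = soft_target(y_i=3, y_j=7, alpha=0.2, n_classes=10)
print(target)
```

The entries of the resulting vector sum to 1, so it can be treated as a probability distribution over classes.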

### The *collage* data augmentation

For the collage data augmentation, you accept two images $x_i, x_j$, each with shape $[1, 28, 28]$ (PyTorch image tensors are conventionally laid out as \[colors, height, width\]). You then return the following:
- an image of shape $[1, 28, 28]$ where the top-right and bottom-left quadrants come from image $x_i$ and the bottom-right and top-left quadrants come from image $x_j$
- the interpolation value $\alpha = 0.5$

An example of this augmentation can be seen below:
![Collage](./misc/collage.png)

The augmented images can be seen in the second row. Notice that the top-right and bottom-left quadrants are the same in the top row and the bottom row.
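
The quadrant layout can be sketched with tensor slicing as follows. The function name `collage_sketch` is a placeholder, and the all-ones and all-zeros inputs are used only so that the quadrants are easy to tell apart; the expected signature in `augmentations.py` may differ.

```python
import torch


def collage_sketch(x_i, x_j):
    """Top-right and bottom-left from x_i; top-left and bottom-right from x_j."""
    _, height, width = x_i.shape
    half_h, half_w = height // 2, width // 2
    out = x_j.clone()  # start from x_j, then paste in the x_i quadrants
    out[:, :half_h, half_w:] = x_i[:, :half_h, half_w:]  # top-right
    out[:, half_h:, :half_w] = x_i[:, half_h:, :half_w]  # bottom-left
    return out, 0.5


x_i = torch.ones(1, 28, 28)
x_j = torch.zeros(1, 28, 28)
image, alpha = collage_sketch(x_i, x_j)
print(image.shape, alpha)
```

Since each image contributes exactly two of the four equally sized quadrants, the interpolation value is fixed at $0.5$.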

### The *mixup* data augmentation

In the case of [mixup](https://arxiv.org/pdf/1710.09412.pdf), you take a linear combination of the two images. Thus, you accept two images $x_i, x_j$ and return:
- an image of shape $[1, 28, 28]$ that is $\alpha$ of $x_i$ and $(1 - \alpha)$ of $x_j$.
- the interpolation value $\alpha$

An example of this augmentation can be seen below:
![Mixup](./misc/mixup.png)

The augmented images can be seen in the second row. They use the image from the first row as their $x_i$ and a random other image from the dataset as the $x_j$.
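
A minimal sketch of the linear combination is given below. The name `mixup_sketch` is a placeholder, and drawing $\alpha$ uniformly at random when none is given is an assumption of this example (the mixup paper draws it from a Beta distribution).

```python
import torch


def mixup_sketch(x_i, x_j, alpha=None):
    """Return alpha * x_i + (1 - alpha) * x_j together with alpha."""
    if alpha is None:
        # Assumption for this sketch: sample alpha uniformly from [0, 1).
        alpha = torch.rand(1).item()
    return alpha * x_i + (1 - alpha) * x_j, alpha


x_i = torch.ones(1, 28, 28)
x_j = torch.zeros(1, 28, 28)
mixed, alpha = mixup_sketch(x_i, x_j, alpha=0.2)
print(mixed.shape, alpha)
```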

### Why do we do data augmentations?

We rarely have access to unlimited amounts of labeled training data. However, the more labeled data we have, the better our models will generally perform. Thus, we often use data augmentation to *expand* the size of our dataset. Some data augmentations simply alter a single input and keep the same label. We are instead using ones that combine two images and introduce a new set of labels. Thus, from $n$ data points, we get $O(n^2)$ augmented ones. However, the augmented data may not be fully representative of our true dataset, so there is a tradeoff between having more training data and having training data that is faithful to the true distribution.

## Data Scarcity

Our last point of study is how much data we need in order to learn good representations. For this reason, you will run experiments that sweep over different dataset sizes. Specifically, we will pretrain on datasets that have $s$ samples per class, where $s \in [1, 2, 4, 8, 16, 32, 64]$. We will then always finetune on a dataset of 256 samples per class. The question, then, is how much our performance improves as $s$ grows.
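
One way to draw $s$ samples per class from a vector of labels is sketched below. The helper `sample_per_class` and the toy label vector are hypothetical, and the provided code may select its subsets differently.

```python
import torch


def sample_per_class(labels, s, generator=None):
    """Return indices selecting at most s random examples of each class."""
    chosen = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]  # positions of class c
        perm = torch.randperm(len(idx), generator=generator)[:s]
        chosen.append(idx[perm])
    return torch.cat(chosen)


labels = torch.tensor([0, 1, 0, 2, 1, 0, 2, 2, 1, 0])
subset = sample_per_class(labels, s=2)
print(torch.bincount(labels[subset]))
```

The returned indices can then be used to index both the images and the labels, which gives a dataset with a balanced number of examples per class.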

# The Experiment Script

For your convenience, we have provided almost all the code that you will need. Your task will be to fill in the missing pieces of code and discuss how the results look.

You should start studying the codebase via the provided script that will run your experiments -- `main.py`. This script is responsible for executing the following logic:
"*For every data augmentation in \[no_aug, collage, mixup\] and every $s \in [1, 2, 4, 8, 16, 32, 64]$, pretrain on the (augmented) dataset $\mathcal{D}_P$ with $s$ samples per class and finetune on dataset $\mathcal{D}_F$.*"

At the end, the script will produce a plot that shows accuracies for each data augmentation and each pre-train dataset size. Put these plots in your writeup.

The missing pieces of code have `### YOUR CODE HERE` and `### END CODE` comments surrounding them. We recommend minimally interacting with the code outside of these comment blocks, as it has been set up to make your life easier.

`main.py` accepts the following command-line parameters:
- `finetune-dataset`. Valid inputs are 'mnist' and 'emnist'.
- `pre-train-dataset`. Valid inputs are 'mnist' and 'emnist'.
- `batch-size`. Valid inputs are any integer greater than 0. Defaults to 8.
- `test-during-training`. If this command-line parameter is present, the training script will measure classification performance every 500 training batches. This is useful if the `plot-train-curves` parameter is also present.
- `plot-train-curves`. If this parameter is present, every single training run will end by plotting a curve of the network's scores. Use this to verify that your network is training appropriately.
- `plot-augmentations`. If this parameter is present, it will produce the augmentation plots shown earlier in this document. Use this to visualize what your augmentation implementations look like.
- `n-batches-pre-train`. Defaults to 2000. Number of batches of the pre-train dataset $\mathcal{D}_P$ to pre-train on. The higher this number, the better your network will converge. However, it will also lead to slower runtimes.
- `n-batches-finetune`. Defaults to 2000. Number of batches of the finetune dataset $\mathcal{D}_F$ to finetune on. The higher this number, the better your network will converge. However, it will also lead to slower runtimes.

For example, the command-line call `python main.py --finetune-dataset mnist --pre-train-dataset emnist --batch-size 16 --plot-augmentations` will pre-train on `emnist` and then finetune on `mnist` with batches of size 16. It will also plot your augmentations at the beginning to ensure they look like what you expect.

You can stick to the defaults when generating plots for the report. For example, `python main.py --pre-train-dataset mnist --finetune-dataset emnist`.
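
For readers unfamiliar with command-line flags, the sketch below shows how parameters like the ones listed above could be declared with Python's `argparse`. It is an assumption about the general shape of such a parser, not a copy of what `main.py` does; in particular, no default is assumed here for the two dataset flags.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags described in this section."""
    parser = argparse.ArgumentParser(description="Pre-train / finetune sweep (sketch)")
    parser.add_argument("--finetune-dataset", choices=["mnist", "emnist"])
    parser.add_argument("--pre-train-dataset", choices=["mnist", "emnist"])
    parser.add_argument("--batch-size", type=int, default=8)
    # Boolean switches: True only when the flag is present.
    parser.add_argument("--test-during-training", action="store_true")
    parser.add_argument("--plot-train-curves", action="store_true")
    parser.add_argument("--plot-augmentations", action="store_true")
    parser.add_argument("--n-batches-pre-train", type=int, default=2000)
    parser.add_argument("--n-batches-finetune", type=int, default=2000)
    return parser


# Parse the example call from above without touching sys.argv.
args = build_parser().parse_args(
    ["--finetune-dataset", "mnist", "--pre-train-dataset", "emnist",
     "--batch-size", "16", "--plot-augmentations"]
)
print(args.pre_train_dataset, args.finetune_dataset, args.batch_size, args.plot_augmentations)
```

Note that `argparse` converts the dashes in a flag name to underscores, so `--pre-train-dataset` is read back as `args.pre_train_dataset`.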

# Your tasks

## Short answers:
You are requested to provide (short) answers to the following questions to ensure that you understand how the code is working:
- Where are we ensuring that the finetuning does not affect the feature extractor?
- How does the code work that gets $s$ samples per class for the pre-training dataset?
- What is the `forward_call` parameter responsible for in the `train()` method (located in `network_training.py`)?
- Describe how the `augment()` method works (located in `augmentations.py`).
- If we pre-train and finetune on the same dataset, is there any reason to do the finetuning step?

Additionally, provide (short) predictions for the following questions (you are not graded on correctness here). These are to be done **before** you run the experiments. Take some time to think about what you *expect* will happen.
- Will the collage and mixup data augmentations help achieve higher finetune accuracies? Which do you expect will be more effective?
- What relationship do you expect between the number of samples in the pre-training dataset and the finetuning accuracy? Does this change with data augmentations?

## Code snippets:
There are three main coding sections that you must fill in. At the end, you can check correctness by running `python test_cases.py`. This will test your implementations of the neural network, augmentations and cross-entropy loss. You can also test the individual coding tasks by running `python augmentations.py`, `python network.py` and `python utils.py`. 

You may need to install the Python package `torchvision`, e.g. with `pip install torchvision`.

### The neural network body
The `network.py` file has three incomplete methods: `apply_convs`, `generalize`, and `forward`.
- `apply_convs` is the *feature extractor*. We have instantiated the layers for you in the `__init__` method -- your job is to stitch them together. We want the feature extractor to have two stages. Each should be a convolution layer, followed by a 2x2 max-pool, and finally a ReLU. We suggest using the `torch.nn.functional` [library](https://pytorch.org/docs/stable/nn.functional.html).
    - *Not required* -- consider applying [convolutional dropout](https://pytorch.org/docs/stable/generated/torch.nn.Dropout2d.html) to the second convolutional layer before the maxpool operation. It should improve the network's performance.
- `generalize` is the head responsible for finetuning. It will first extract features (using the feature extractor) and then apply the linear layer responsible for classifying on the finetuning dataset. This linear layer is `self.generalizer`.
- `forward` is the head responsible for pre-training. It will first extract features (using the feature extractor) and then apply the linear layer responsible for classifying on the pre-training dataset. This linear layer is `self.pretrainer`. We call this method `forward` to stay consistent with the PyTorch documentation.
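
The convolution, max-pool, ReLU ordering described for `apply_convs` can be traced on a dummy batch as in the sketch below. The layer sizes, the dropout probability, and the helper name `conv_stage` are assumptions of this example; the layers instantiated in `__init__` may have different shapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical layers; the handout's __init__ defines the real ones.
conv1 = nn.Conv2d(1, 8, kernel_size=5)
conv2 = nn.Conv2d(8, 16, kernel_size=5)
drop = nn.Dropout2d(p=0.5)


def conv_stage(x, conv, dropout=None):
    """One stage: convolution -> (optional dropout) -> 2x2 max-pool -> ReLU."""
    x = conv(x)
    if dropout is not None:
        x = dropout(x)  # optional: applied before the max-pool
    x = F.max_pool2d(x, kernel_size=2)
    return F.relu(x)


x = torch.zeros(4, 1, 28, 28)
h1 = conv_stage(x, conv1)                 # 28 -> 24 (conv) -> 12 (pool)
h2 = conv_stage(h1, conv2, dropout=drop)  # 12 -> 8 (conv) -> 4 (pool)
print(h1.shape, h2.shape)
```

Tracking the shapes this way helps when choosing the input size of the linear heads, since the flattened feature vector must match what the heads expect.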

### The augmentations
As described, you must implement the `collage` and `mixup` augmentations in `augmentations.py`. The expected inputs and outputs are described above.

### The cross-entropy loss
We cannot use the standard PyTorch cross-entropy loss in its usual form, as it expects each target to be a single class. However, when we use our augmentations, we will have labels that are partially one class and partially another. Thus, you must implement the cross-entropy loss function yourself in `utils.py`.

**NOTE** - our network prediction heads return the **log_softmax** of the predictions.
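
For a soft target vector $t$ and predicted log-probabilities $\log p$, the cross-entropy of one example is $-\sum_c t_c \log p_c$. The sketch below is one way to write this, assuming the inputs are already log-softmax outputs and that the loss is averaged over the batch; the function name and the averaging are assumptions, so check what `utils.py` and the test cases expect.

```python
import torch
import torch.nn.functional as F


def soft_cross_entropy(log_probs, soft_targets):
    """Cross-entropy between soft targets and log-probabilities, batch-averaged."""
    return -(soft_targets * log_probs).sum(dim=1).mean()


logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
log_probs = F.log_softmax(logits, dim=1)

# Sanity check: with one-hot targets this reduces to the usual NLL loss.
hard = torch.tensor([0, 2])
one_hot = F.one_hot(hard, num_classes=3).float()
matches_nll = torch.allclose(
    soft_cross_entropy(log_probs, one_hot), F.nll_loss(log_probs, hard)
)
print(matches_nll)
```

Because the heads already apply `log_softmax`, the loss should not apply another softmax or logarithm to its inputs.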

## Report

Now that the code works, you should be able to run `main.py` and obtain training plots. You must now run each combination of pretraining on $\mathcal{D}_P \in [\text{mnist, emnist}]$ and finetuning on $\mathcal{D}_F \in [\text{mnist, emnist}]$ and discuss the results. Put the corresponding training plots in the writeup along with at least one page of discussion on them. Possible topics for discussion include (but are not limited to):
- How does the number of samples per class affect training performance? Is this affected by the augmentations?
- Which augmentation performs better? Why?
- Do pre-training and finetuning on the same dataset achieve better performance than pre-training on one dataset and finetuning on another? Why?