<a href="https://colab.research.google.com/github/gdewael/teaching/blob/main/predmod/CNN/PClab013_CNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://pbs.twimg.com/media/FQK3KkkWUAQsW5L?format=jpg&name=medium" width = 400>

*Image generated by DALL-E 2 when prompted with "robot meditating on mathematics, digital art, keywords: synthwave, transcendent, beautiful mind"*

*For more examples of what this model generates: https://twitter.com/hashtag/dalle*

# PC lab: Convolutional Neural Networks

# 1. Introduction

### 1.1 A brief history

Convolutional neural networks (CNNs) brought a major step forward in the performance of computer vision. They first became popular because of an annual challenge in image classification: the [ImageNet Large Scale Visual Recognition Challenge (ILSVRC)](https://en.wikipedia.org/wiki/ImageNet#History_of_the_ImageNet_challenge), in which models had to classify an image as belonging to one of a thousand different classes. Until 2011, hand-crafted image features combined with more traditional machine learning methods such as SVMs reigned supreme. The best models using such techniques misclassified images about 26\% of the time. In 2012, a breakthrough in performance was made through the successful training of a large-scale CNN. Although convolutions had been known in machine learning for decades, large-scale training of CNNs did not take off until then because it was difficult to train them efficiently (GPU usage was not popular, and no solid platforms for computation on GPUs had been developed). Since then, the field has taken off, with new developments arriving every year. In 2015, human performance in image classification on this dataset was surpassed by a 152-layer CNN model.

<img src="https://i0.wp.com/semiengineering.com/wp-content/uploads/2019/10/Synopsys_computer-vision-processors-EV7-Fig2-ImageNet.jpeg?ssl=1" width = 400>

### 1.2 The Convolution

A convolution slides a kernel of size $ M \times N $ over a given input $ \textbf{X} $ and, at every position, computes a linear combination of the kernel weights $ W $ with the overlapping area of the input. For a standard convolution with a stride of one and no padding, the output $ y_{ij} $ is equal to:

$$ y_{ij} = \sum_{a=0}^{M-1} \sum_{b=0}^{N-1} W_{ab} x_{(i+a)(j+b)} $$

During a convolution, the kernel slides over the input image to produce a new image of outputs. The stride of a kernel defines the horizontal and vertical step size during this iteration. Input data can be padded with one or more layers of a zero-filled border, which increases the output dimensions.
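
To make the formula concrete, here is a minimal pure-Python sketch of a single-channel convolution without padding. The function name `conv2d_single_channel` and the toy input and kernel are made up for illustration; real code would use `torch.nn.Conv2d` instead.

```python
def conv2d_single_channel(x, w, stride=1):
    """2D convolution of one channel without padding, following the formula above."""
    m, n = len(w), len(w[0])                 # kernel height and width
    out_h = (len(x) - m) // stride + 1       # number of vertical kernel positions
    out_w = (len(x[0]) - n) // stride + 1    # number of horizontal kernel positions
    return [
        [
            sum(
                w[a][b] * x[i * stride + a][j * stride + b]
                for a in range(m)
                for b in range(n)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 4 x 4 input and 2 x 2 kernel
x = [[1, 2, 3, 0],
     [4, 5, 6, 1],
     [7, 8, 9, 2],
     [1, 0, 1, 3]]
w = [[1, 0],
     [0, -1]]

y = conv2d_single_channel(x, w)                    # stride 1: 3 x 3 output
y_strided = conv2d_single_channel(x, w, stride=2)  # stride 2: 2 x 2 output
print(y)
print(y_strided)
```

With a stride of one, the 4 x 4 input shrinks to 3 x 3. With a stride of two, the kernel only visits every second position and the output is 2 x 2.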

It is important to understand that a convolution applies the same operation at every local patch in the input. In this sense, convolutions are useful when you expect the input data to contain regularly appearing **local patterns**.


You can find visualizations of common convolutional set-ups [here](https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md).

Several other examples and an extended explanation of different types of convolutions can be found [here](https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d).

### 1.3 Channels

A convolutional neural network usually processes an image with multiple input channels at every pixel. The number of channels can be seen as the dimensionality of every input **token**. In the case of image inputs, the tokens are pixels, and their input dimensionality is three (every pixel is defined by its red/green/blue values). The kernel, although often depicted as only evaluating one channel, actually has separate weights for every input channel and takes **the sum over all channels** to obtain an output. If we only apply one kernel, we effectively compress the information in our image (since the RGB value of the pixel now has to be represented by a single number). For this reason, multiple parallel kernels are used that each output their own channel. Because of this, we can **learn higher-dimensional representations of local regions of pixels**. The following image shows the difference between a convolution with one output channel and a convolution with 8 output channels. (The pictured convolution employs no padding, hence the width and height of the image are also affected.)

<img src="https://raw.githubusercontent.com/gdewael/teaching/main/predmod/CNN/channelsdrawing.png" width = 800>

<div class="alert alert-success">

<b>THOUGHT EXERCISE:</b> 
<p> How many weights do the above two convolutional layers have? One single bias/intercept number/weight is usually added for every output channel.</p>

</div>


The .gif below shows an illustrated example with numbers of how a single convolutional kernel (i.e. one output channel) works.

<img src="https://miro.medium.com/max/2560/1*ciDgQEjViWLnCbmX-EeSrA.gif" width = 800>
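
As a rough aid for counting weights in general, the sketch below counts the parameters of a 2D convolutional layer: every output channel has one kernel slice per input channel, plus one bias. The helper `conv2d_param_count` and the 3 x 3 kernel size are assumptions for illustration, not values taken from the drawing above.

```python
def conv2d_param_count(in_channels, out_channels, kernel_h, kernel_w, bias=True):
    """Number of learnable parameters in a standard 2D convolution."""
    weights = out_channels * in_channels * kernel_h * kernel_w
    biases = out_channels if bias else 0   # one bias per output channel
    return weights + biases


# Assumed example: RGB input (3 channels) and a 3 x 3 kernel
one_kernel = conv2d_param_count(3, 1, 3, 3)      # one output channel
eight_kernels = conv2d_param_count(3, 8, 3, 3)   # eight output channels
print(one_kernel, eight_kernels)
```

Note that the count does not depend on the width or height of the image, because the same kernel is reused at every position.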

### 1.4 The CNN

A convolutional neural network consists of more than convolutions. Just like with MLPs, we usually add non-linearities such as the ReLU after a linear layer.
Another popular operation in convolutional neural networks is pooling, such as **max pooling**. Max pooling reduces the (width $\times$ height) dimensionality of the input, which effectively compresses the image. This can reduce the number of parameters in the network, which in turn reduces overfitting and the computational burden. Like a convolution, max pooling is configured with arguments such as kernel size, stride and padding.

Another advantage of max pooling is that it makes the network somewhat invariant to where in an image a certain pattern appears (*translation invariance*).
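
A small sketch of max pooling in PyTorch on a hand-written 4 x 4 input (the values are arbitrary). With a 2 x 2 kernel and a stride of 2, each non-overlapping 2 x 2 block is replaced by its maximum:

```python
import torch
import torch.nn as nn

# One image, one channel, 4 x 4 pixels: shape (batch, channels, height, width)
x = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
y = pool(x)

print(y.shape)   # width and height are halved
print(y[0, 0])   # the maximum of every 2 x 2 block
```

Max pooling has no learnable parameters and operates on every channel separately, so the number of channels is unchanged.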

<img src="https://computersciencewiki.org/images/8/8a/MaxpoolSample2.png" width = 400>

After many convolutions, we are left with a 3-dimensional object for every image: $channels \times width \times height$. By now, the width and height may not represent pixels anymore but something which may be described as "pseudo"-pixels, since every remaining "pseudo"-pixel has aggregated information from a region of original input pixels. The channels together then represent the vector of information gathered in that "pseudo"-pixel.

If we want to do classification (or regression) with this "pseudo"-image, we need to reshape or flatten this 3-dimensional object into a 1-dimensional vector, which is then mapped to the output classes. Hence, a CNN usually consists of a convolutional part, where we extract higher-dimensional features of local regions, and an MLP part, where we put linear layers on top of the flattened representations of the image (i.e. letting the information between all regions recombine to get to a final class prediction).

Note that this flattening/reshaping is only necessary for classification purposes, as we need to obtain a 1-dimensional vector at the end. If we were doing, for example, image segmentation (classifying every pixel as belonging to a category, not the whole image at once), we would keep our 3-dimensional representation of images. 

Below is an illustration of a toy example of a cat detector that shows how you can visualize what "pseudo"-pixels may signify in a learned neural network.

<img src="https://raw.githubusercontent.com/gdewael/teaching/main/predmod/CNN/catdetectordrawing.png" width = 800>

Another example, showing multi-class classification:

<img src="https://miro.medium.com/max/2510/1*vkQ0hXDaQv57sALXAJquxA.jpeg" width = 600>

One can interpret the convolutional layers as the section of the network in which local patterns are extracted (edges, contours, contrasts, ...). These are used as inputs for the fully connected neural network, which combines these features into a final classification.

The following picture shows a visualization of what a CNN extracts at each layer, starting from the first layers on the left going deeper towards the right. This visualization is obtained by optimizing an input image to maximally activate the convolution filters.

<img src='https://1.bp.blogspot.com/-icbxyuiDoA0/WgEivsyFIgI/AAAAAAAACKo/jsfMgFlfiVA233zXg8xAH3ZAKOchgLb-wCLcBGAs/s1600/image4.png'>

**Final note:** It is important to realize that we have been talking about convolutions on images up until now. These are what we call 2-D convolutions (because they convolve over both width and height). The same concept also applies to 1-D sequences, where the input channels represent the information of every input token in that sequence (e.g. a DNA or protein sequence with bases or amino acids (AAs) as tokens, and one-hot encodings of which base or AA is present at that position as channels). A 1-D convolution would then aggregate patterns from neighboring inputs (DNA bases or AAs).

Even 3-D convolutions are possible on 3-D inputs such as 3-D renders of biological cells (from e.g. electron microscopy) or 3-D MRI scans. Even a sequence of images (what we know as a video) can be seen as a 3-D object. A 3-D convolution on a video would then extract local patterns per image but also aggregate information from the previous and next frames. In the case of MRI or 3-D microscopy, it would also aggregate information from different depth slices.
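
As a small illustration of the 1-D and 3-D cases, the sketch below one-hot encodes a made-up DNA sequence and passes it through `nn.Conv1d`, then applies `nn.Conv3d` to a random toy volume. The sequence, channel counts and volume size are arbitrary choices for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 1-D: a DNA sequence with 4 channels (one-hot encoding of A, C, G, T)
alphabet = "ACGT"
sequence = "ACGTTGCA"   # made-up example sequence
indices = torch.tensor([alphabet.index(base) for base in sequence])
one_hot = F.one_hot(indices, num_classes=4).float()   # (length, channels)
x = one_hot.T.unsqueeze(0)                            # (batch, channels, length)

conv1d = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=3, padding=1)
y = conv1d(x)   # every output position aggregates 3 neighboring bases
print(x.shape, "->", y.shape)

# 3-D: a toy volume of shape (batch, channels, depth, height, width)
volume = torch.randn(1, 1, 8, 16, 16)
conv3d = nn.Conv3d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
z = conv3d(volume)
print(volume.shape, "->", z.shape)
```

In both cases the channel dimension comes right after the batch dimension, exactly as for images.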

### 1.5 The Residual Connection

Most recent neural networks use residual connections of some sort. A residual connection adds the input of some layer(s) to its output: $y = f(x) + x$. This skip-connection helps with vanishing gradient issues as it essentially lets a part of the input signal skip the layer(s) in $f(.)$. A logical requirement for the use of residual connections is that the output of $f(x)$ has the same dimensionality as the input $x$. For convolutions, this means performing a convolution with padding so that the width and height of the image are preserved, and having as many output channels as there are input channels.

It is very popular to define a "Residual Block" as $f(.) = $ `[Layer -> Activation -> Dropout -> Normalization]`
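
The idea $y = f(x) + x$ can be sketched with a generic wrapper around any shape-preserving module. The class name `Residual` is a hypothetical helper, and a small linear layer stands in for $f(.)$ here; the same pattern applies to convolutional layers as long as $f$ preserves the input shape.

```python
import torch
import torch.nn as nn


class Residual(nn.Module):
    """Wraps a module f and returns f(x) + x (hypothetical helper)."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        # f(x) must have the same shape as x, otherwise the addition fails
        return self.fn(x) + x


# f maps 32 features to 32 features, so input and output shapes match
block = Residual(nn.Sequential(nn.Linear(32, 32), nn.ReLU()))

x = torch.randn(4, 32)
y = block(x)
print(x.shape, "->", y.shape)
```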

# 2. Coding

Last PC lab we familiarized ourselves with PyTorch using MLPs and the MNIST dataset, on which we coded a classifier and an autoencoder. This week, we will repeat the same workflow, but work with convolutional neural networks instead of simple MLPs.

In [None]:
import torch
import torch.nn as nn

### 2.1 The convolution

Let's first test some things out with convolutional layers.
In PyTorch, the channel dimension is the second dimension, just after the batch dimension and before the height and width dimensions. The following code shows the dimensions for a batch of 8 toy images of 28 x 28 pixels with 3 channels:

In [None]:
x = torch.randn(8, 3, 28, 28)

PyTorch implements 2D convolutions via the [Conv2d](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html) class. Check out the documentation to see which options you can specify. Below we illustrate the usage of some prominent ones:

Remember to check back [here](https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md) to see visually what all the options mean.

In [None]:
conv = nn.Conv2d(3, 16, 5) # from 3 to 16 channels with kernel size 5
print(x.shape)
y = conv(x)
print(y.shape)

In [None]:
# padding = "same" will perform padding so as to conserve the input width and height
print(x.shape)
conv = nn.Conv2d(3, 16, 5, padding = "same") 
y = conv(x)
print(y.shape)

In [None]:
print(x.shape)
# a 2 x 2 kernel with stride 2: one output per non-overlapping block of 2 x 2 pixels
conv = nn.Conv2d(3, 16, 2, stride = 2) 
y = conv(x)
print(y.shape)

In [None]:
print(x.shape)
conv = nn.Conv2d(3, 8, 3, dilation = 2) 
y = conv(x)
print(y.shape)

PyTorch has similar Conv objects for 1D and 3D tasks. [MaxPool2d](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html) also has very similar arguments.
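
The output shapes printed by the cells above can be checked by hand. The sketch below implements the output-size formula given in the `Conv2d` documentation for one spatial dimension; the helper name `conv_output_size` is our own.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output width (or height) of a convolution along one spatial dimension."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# The four set-ups used above, applied to a 28-pixel-wide input
print(conv_output_size(28, kernel_size=5))              # plain 5 x 5 kernel
print(conv_output_size(28, kernel_size=5, padding=2))   # what padding="same" does for a 5 x 5 kernel
print(conv_output_size(28, kernel_size=2, stride=2))    # 2 x 2 kernel with stride 2
print(conv_output_size(28, kernel_size=3, dilation=2))  # dilated 3 x 3 kernel
```

A dilated 3 x 3 kernel with dilation 2 covers the same area as a 5 x 5 kernel, which is why both give the same output size.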

<div class="alert alert-success">

<b>EXERCISE:</b> 
<p> Let's implement our very own custom PyTorch layer: a convolutional residual block.
You will be able to use this layer in further exercises when implementing a complete CNN.
Refer to the introduction part on residual connections above to see how a Residual connection should be constructed. Remember to use appropriate kernel sizes and padding.
</p>

</div>

In [None]:
class ResidualBlock(nn.Module):
    def __init__(self, hidden_dim = 64, kernel_size = 5):
        super().__init__()

        # YOUR CODE HERE ....

        # TIP: You may want to use these objects:
        #nn.Conv2d(...)
        #nn.ReLU()
        #nn.Dropout()
        #nn.BatchNorm2d(...)

        # YOUR CODE HERE ....

    def forward(self, x):
        return None # CHANGE THIS TO YOUR OUTPUT

In [None]:
ResidualBlock()(torch.randn(2, 64, 16, 16)).shape

### 2.2 A CNN for MNIST classification

With our residual block and PyTorch knowledge from last PC lab at the ready, we can implement our very own convolutional neural network. Let's first recap the most basic building blocks of training PyTorch models:

The most basic blueprint of PyTorch model training consists of
- Get your data
- Wrap your data splits in a [data loader](https://pytorch.org/docs/stable/data.html)
- Instantiate the model
- Instantiate a [loss function](https://pytorch.org/docs/stable/nn.html#loss-functions)
- Instantiate an [optimizer object](https://pytorch.org/docs/stable/optim.html), to which you pass the parameters you want to optimize
- Iterate through your training data, for every batch:
    - reset the gradients
    - do forward pass
    - compute loss
    - backward pass
    - update parameters

(Optionally):
- After every full iteration through all training data samples (called an epoch), loop through all batches of validation data:
    - forward pass
    - compute loss and validation scores
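
As a reminder, the blueprint above can be sketched as follows. This is only a minimal example on random toy data with a single linear layer as the model; the data, model, optimizer choice and learning rate are placeholders, and the "validation" loop simply reuses the toy data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Get your data (random toy data in the MNIST shape) and wrap it in a data loader
X = torch.randn(64, 1, 28, 28)
y = torch.randint(0, 10, (64,))
dataset = torch.utils.data.TensorDataset(X, y)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

# Instantiate the model, loss function and optimizer
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2):
    model.train()
    for xb, yb in loader:
        optimizer.zero_grad()        # reset the gradients
        logits = model(xb)           # forward pass
        loss = loss_fn(logits, yb)   # compute loss
        loss.backward()              # backward pass
        optimizer.step()             # update parameters

    # Validation: no gradients needed (here on the same toy data)
    model.eval()
    correct = 0
    with torch.no_grad():
        for xb, yb in loader:
            correct += (model(xb).argmax(dim=1) == yb).sum().item()
    accuracy = correct / len(dataset)
    print(f"epoch {epoch}: last batch loss {loss.item():.3f}, accuracy {accuracy:.3f}")
```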

Since we have implemented every necessary step of this process last PC lab already, we will focus our remaining time on modeling a proper CNN architecture. In order to test if our model will work, we will need to design a model architecture that works for our input image sizes. For this purpose, we can generate some random data in the same size as the MNIST data. Remember that MNIST consists of grayscale images, so they only have one input channel, as opposed to RGB images. They also have a 28 x 28 resolution:

In [None]:
# some generated data in the same shape that we expect our MNIST data:
x = torch.randn(8, 1, 28, 28)

With this generated toy data ready, we can implement a CNN and test it out. We recommend modeling the CNN and the classifier (MLP) part separately, so we can check the dimensionality of the image after it has gone through the CNN (this determines how many input nodes the subsequent MLP needs to take in).


<div class="alert alert-success">

<b>EXERCISE:</b> 
<p> Implement a CNN backbone with at least 3 convolutional operations. You can optionally make use of the residual blocks for the CNN operations. We also recommend using max pooling layers to reduce dimensionality.
Because the residual block strictly has the same output and input dimensionality, you should start your network with a normal convolutional layer that returns more than one channel. You can also put convolutions (and max pools) between residual blocks to change the number of hidden dimensions.
</p>

</div>

In [None]:
class CNNBackBone(nn.Module):
    def __init__(self):
        super().__init__()

        # YOUR CODE HERE
    def forward(self, x):
        # YOUR CODE HERE
        return None # CHANGE THIS TO YOUR OUTPUT

Let's see the shape of our fake image batch after a forward pass through the CNN:

In [None]:
print(x.shape)
y = CNNBackBone()(x)
print(y.shape)

Inspect the output shapes of your CNN backbone, did you compress the information in your image or expand the total dimensionality?

You will need to reshape it before passing it to an MLP. In PyTorch, you can use `nn.Flatten` for this purpose. Another option is to reshape it manually using `x.view()` (similar to NumPy's `x.reshape()`). To do this correctly, specify the first dimension of the new shape (corresponding to the number of samples in the batch) as `-1`, so that any batch size is reshaped correctly.
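
A short sketch of both options on a made-up backbone output of shape 8 x 16 x 3 x 3 (your own backbone will likely produce a different shape):

```python
import torch
import torch.nn as nn

# Hypothetical CNN backbone output: (batch, channels, height, width)
feats = torch.randn(8, 16, 3, 3)

flat_a = nn.Flatten()(feats)           # flattens everything except the batch dimension
flat_b = feats.view(-1, 16 * 3 * 3)    # manual reshape, -1 lets PyTorch infer the batch size

print(flat_a.shape, flat_b.shape)
print(torch.equal(flat_a, flat_b))
```

The size of the flattened vector (here 16 * 3 * 3 = 144) is the number of input nodes the first linear layer of the MLP needs.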

<div class="alert alert-success">

<b>EXERCISE:</b> 
<p> Combine the CNN backbone with an MLP with 10 output nodes for MNIST classification. At its simplest, the MLP on top of the CNN backbone is a single linear layer that takes in the flattened input and returns 10 output nodes. Alternatively, you can make use of the code from the previous PC lab to quickly implement an MLP.
</p>

</div>

In [None]:
class CNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # YOUR CODE HERE
        
    def forward(self, x):
        # YOUR CODE HERE
        return None # CHANGE THIS TO YOUR OUTPUT

In [None]:
print(x.shape)
y = CNNClassifier()(x)
print(y.shape)

# a smaller (4 instead of 8) batch of images:
x = torch.randn(4, 1, 28, 28)
print(x.shape)
y = CNNClassifier()(x)
print(y.shape)

You can use this code to inspect how many parameters your model has in total.

In [None]:
sum([p.numel() for p in CNNClassifier().parameters()])

Keep in mind that a very large model will take a long time to train. For the purpose of this PC lab, and considering MNIST is quite an "easy" dataset, we recommend staying below a hundred thousand parameters.

With a model ready, let's load in the MNIST data:

Note that, in contrast to last PC lab, we do not reshape the data so that every image is represented by a single vector. Instead, we add an `unsqueeze` statement, which adds an extra dimension for the channels.

In [None]:
from torchvision import datasets
from torchvision.transforms import ToTensor
import numpy as np

train_data = datasets.MNIST(
    root = 'data',
    train = True,                         
    transform = ToTensor(), 
    download = True,            
)
test_data = datasets.MNIST(
    root = 'data', 
    train = False, 
    transform = ToTensor()
)

X_train = train_data.data
y_train = train_data.targets

X_test = test_data.data
y_test = test_data.targets

X_train = X_train / 255
X_test = X_test / 255

X_train = X_train.unsqueeze(1)
X_test = X_test.unsqueeze(1)

np.random.seed(42)
train_indices, val_indices = np.split(np.random.permutation(len(X_train)), [int(len(X_train)*0.8)])
X_val = X_train[val_indices]
y_val = y_train[val_indices]
X_train = X_train[train_indices]
y_train = y_train[train_indices]

train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=16, pin_memory=True, shuffle=True)

val_dataset = torch.utils.data.TensorDataset(X_val, y_val)
val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=16, pin_memory=True, shuffle=True)

test_dataset = torch.utils.data.TensorDataset(X_test, y_test)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=16, pin_memory=True, shuffle=True)


In [None]:
X_train.shape, y_train.shape

<div class="alert alert-success">

<b>EXERCISE:</b> 
<p> Using the training code you made last PC lab, train your own CNN model. How is it performing in comparison to the model of last PC lab?

You should be able to obtain a model with a validation accuracy of roughly 98% or higher.
</p>

</div>

In [None]:
N_EPOCHS = 20

model =  # your model from previous exercises here.

# loss function & optimizer

for i in range(1, N_EPOCHS + 1):
    
    # train loop

    # eval loop

    # record or print some variables that you want to keep track of during training

You may also want to finally check your performance on the test set for comparison with the MLP we trained last PC lab.

### 2.3 Extra: The convolutional autoencoder

As a last step, we can implement a convolutional autoencoder.

Since we will go from an input image to an output image, we don't need to flatten our tensor at any point in our network. If we do this, our latent space will also be a channel $\times$ width $\times$ height tensor, conserving some positional information. The rationale is that by keeping this positional information in our bottleneck, the reconstruction will be made easier.

Example:

<img src='https://user-images.githubusercontent.com/26786663/27525317-b3026976-5a77-11e7-8767-8f4a06e5b696.jpg' width = 600>

We can still choose to flatten our bottleneck and then unflatten again in the decoder. This will make us lose all positional information, however.

Example:

<img src='https://miro.medium.com/max/1400/1*gzJAJDLDavH_W7Zv2M2J7w.png' width = 600>

Our encoder can have exactly the same structure as the `CNNBackBone` we made earlier in this PC lab. For the decoder, we need inverse convolutions (and maybe inverse max pooling) operations to upscale our image again from its compressed representation to its original dimensions.

In PyTorch, we have these options:

- [ConvTranspose2d](https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html), also visualized [here](https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md): the transposed convolution, which reverses the spatial effect of a convolution (it upsamples where a convolution downsamples) and is sometimes loosely called deconvolution.
- [Upsample](https://pytorch.org/docs/stable/generated/torch.nn.Upsample.html#torch.nn.Upsample), performing simple non-parametric upscaling of inputs with any of a number of methods such as bilinear or bicubic upscaling.
- [MaxUnpool2d](https://pytorch.org/docs/stable/generated/torch.nn.MaxUnpool2d.html?highlight=maxunpool#torch.nn.MaxUnpool2d), performing the inverse of **a previous** max pooling operation. It essentially puts the maximal elements back in the location/index where they originally appeared before the previous max pooling. As such, this operation can only be used together with a paired max pooling operation (see the examples in the documentation).

For this PC lab, we simply recommend `ConvTranspose2d` and `Upsample`.
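
The sketch below shows how these layers change tensor shapes. The bottleneck shape of 8 x 16 x 7 x 7 and all layer settings are assumptions for illustration; your own encoder output will likely differ.

```python
import torch
import torch.nn as nn

# Hypothetical bottleneck: (batch, channels, height, width)
z = torch.randn(8, 16, 7, 7)

# Learned upscaling: a 2 x 2 transposed convolution with stride 2 doubles width and height
up_conv = nn.ConvTranspose2d(16, 8, kernel_size=2, stride=2)
a = up_conv(z)
print(z.shape, "->", a.shape)

# Non-parametric upscaling: the channel count stays the same
upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
b = upsample(z)
print(z.shape, "->", b.shape)

# MaxUnpool2d needs the indices returned by a paired max pooling layer
pool = nn.MaxPool2d(2, return_indices=True)
unpool = nn.MaxUnpool2d(2)
pooled, indices = pool(b)
c = unpool(pooled, indices)
print(b.shape, "->", pooled.shape, "->", c.shape)
```

`Upsample` has no weights, so it is usually combined with a regular convolution when the number of channels has to change.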

<div class="alert alert-success">

<b>EXTRA EXERCISE:</b> 
<p> Implement a convolutional autoencoder. Use your previously implemented CNNBackBone as an encoder and create a decoder from scratch using the layers discussed above. Try to make the model as symmetric as possible.</p>

</div>

In [None]:
x = torch.randn(8, 1, 28, 28)
Encoder = CNNBackBone()

encoded = Encoder(x)
print("Your model should decode this shape:")
print(encoded.shape)
print("to 8 x 1 x 28 x 28")

In [None]:
class CNNDecoder(nn.Module):
    def __init__(self):
        super().__init__()

        # YOUR CODE HERE
    def forward(self, x):
        # YOUR CODE HERE
        return None # CHANGE THIS TO YOUR OUTPUT




Testing your decoder:

In [None]:
CNNDecoder()(encoded).shape

<div class="alert alert-success">

<b>EXTRA EXERCISE:</b> 
<p> Train your Autoencoder using the same code as in last PC lab.</p>

</div>

You are free to explore all the same code we explored last PC lab for visualization of the data in 2D space and also to generate new samples.

Just remember that our latent space does not consist of 1-D vectors anymore, but of 3-dimensional tensors.

Keep in mind that the two autoencoders you made may not be comparable if they compress their images to a different number of hidden dimensions. If your MLP autoencoder had a bottleneck of 16 hidden nodes, and your CNN bottleneck has a dimensionality of 8 x 2 x 2 (32 values), then your MLP compresses the input image by twice as much. A fair evaluation would require a CNN bottleneck of 16 x 1 x 1 or 4 x 2 x 2.
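
This comparison can be checked with a few lines of arithmetic. The helper `compression_factor` is a made-up name; it simply divides the number of input values of an MNIST image by the number of values in the bottleneck.

```python
import math

INPUT_SIZE = 1 * 28 * 28   # values in one MNIST image


def compression_factor(bottleneck_shape):
    """How many times smaller the bottleneck is than the input image."""
    return INPUT_SIZE / math.prod(bottleneck_shape)


mlp_factor = compression_factor((16,))        # MLP bottleneck of 16 nodes
cnn_factor = compression_factor((8, 2, 2))    # CNN bottleneck of 8 x 2 x 2
fair_factor = compression_factor((4, 2, 2))   # CNN bottleneck of 4 x 2 x 2

print(mlp_factor, cnn_factor, fair_factor)
```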