<img src="https://drive.google.com/uc?id=1DvKhAzLtk-Hilu7Le73WAOz2EBR5d41G" width="500"/>

---


# **CNNs: convolutional neural networks (part 2)**

#### **Morning contents/agenda**

1. Commonly used datasets in computer vision

2. Important CNN architectures

3. U-nets and upsampling (unpooling & transpose convolutions)

4. Transfer learning

5. Summary of CNNs

#### **Learning outcomes**

1. Awareness of well-established CNN architectures

2. Understand how to upsample data

3. Understand how and why transfer learning is used

#### **Afternoon contents/agenda**

1. Inspection of CNN filters

2. Transfer learning from ImageNet to Bees and Ants

#### **Learning outcomes**

1. Become familiar with the effect that filters have (sometimes you can interpret them, sometimes they have abstracted the data too far to develop intuitions)

2. Hands-on knowledge of how to apply transfer learning


<br/>

---

<br/>

In [8]:
!pip install pycm livelossplot
%pylab inline

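import random  # standard-library RNG (%pylab otherwise binds `random` to numpy.random)

import numpy as np
from sklearn.metrics import accuracy_score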
from sklearn.model_selection import StratifiedShuffleSplit

from livelossplot import PlotLosses
from pycm import *

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
import torchvision.datasets
import torchvision.transforms as transforms
from torchvision.datasets import MNIST
from torchsummary import summary



In [None]:
def set_seed(seed):
    """
    Use this to set ALL the random seeds to a fixed value and take out any randomness from cuda kernels
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Disable the cuDNN auto-tuner: it picks the fastest convolution algorithm at
    # run time, and that choice can differ between runs
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.enabled   = False

    return True

device = 'cpu'
if torch.cuda.device_count() > 0 and torch.cuda.is_available():
    print("Cuda installed! Running on GPU!")
    device = 'cuda'
else:
    print("No GPU available!")

Cuda installed! Running on GPU!


## 1. Commonly used datasets in computer vision

As we saw in the first week, the capacity of a network has to be adjusted to avoid overfitting the data. In other words, very deep networks with a large number of trainable parameters require big datasets, because they have a lot of capacity to accommodate variations in the data.

So far we have seen MNIST and similarly small datasets:

<br>

<p align = "center"><img src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png" width="400"/></p><p align = "center">
<i>MNIST dataset: 60k training & 10k test images</i>
</p>





<br>

It is often desirable to have datasets of natural images, as they can be used for a broader range of applications than MNIST-like datasets. [CIFAR-10 and CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html) are two datasets of natural images with 10 and 100 classes respectively:

<br>

<p align = "center"><img src="https://production-media.paperswithcode.com/datasets/4fdf2b82-2bc3-4f97-ba51-400322b228b1.png" width="400"/></p><p align = "center">
<i>CIFAR-10 dataset: 50k training & 10k test images</i>
</p>

<br>

<p align = "center"><img src="https://miro.medium.com/max/1400/0*fqFMfJeP6CuBTuYc.webp" width="400"/></p><p align = "center">
<i>CIFAR-100 dataset: 50k training & 10k test images</i>
</p>

<br>

But larger datasets exist as well. [ImageNet](https://www.image-net.org/) contains more than 14 million images and 20k classes, and a 1,000-class subset of it has been used in the well-known ILSVRC competitions:

<p align = "center"><img src="https://i0.wp.com/syncedreview.com/wp-content/uploads/2020/06/Imagenet.jpg?fit=1400%2C600&ssl=1" width="800"/></p><p align = "center">
<i>ImageNet: >14M images and 20k classes </i>
</p>


[Here](https://pytorch.org/vision/stable/datasets.html) is a list of the datasets available in `torchvision.datasets`.



## 2. Important CNN architectures

Since their introduction in 1998 with LeNet-5, convolutional neural networks have evolved significantly and competed for the top spot in computer vision tasks.

You can find a [good overview here](https://towardsdatascience.com/illustrated-10-cnn-architectures-95d78ace614d) (the images below are from this website).

<br>

<img src="https://drive.google.com/uc?id=1BGcWnSRGLJmVzfRSQBujnkHtF9wTeQuq" width="600"/>

<br>

<center><img src="https://drive.google.com/uc?id=1PDNr20s96ddbabkX5dfjnUemw2ZMWvU4" width="800"/></center>

<br>

<center><img src="https://drive.google.com/uc?id=1FycJ5amqUL-Z_NXKtfmbejqID6pTFSsQ" width="800"/></center>

<br>

<center><img src="https://drive.google.com/uc?id=1yrpMY7PuyVG68M6gsnQhEHdGCbMZSYzo" width="800"/></center>

<br>

<center><img src="https://drive.google.com/uc?id=16TLC9m8JvZw1V5fVx3cqzDowfGk4TKUT" width="800"/></center>

<br>

<center><img src="https://drive.google.com/uc?id=178gJwbpE12TzKbHQfq3q8CTg4X-12gSZ" width="800"/></center>

<br>

<center><img src="https://drive.google.com/uc?id=14tHV7AMln-qH4DwpjDZUbSrj6o0Jx1YX" width="800"/></center>

<br>

As you can see, network sizes have grown over time thanks to advances in computational power (better GPUs with more memory, etc.):

<br>

<center><img src="https://drive.google.com/uc?id=1Bv0msGB95GXeiuKs5GQI6TMHYQQljz1z" width="600"/></center>

<br>

But even these numbers are small by the standards of modern architectures. For example, **GPT-4** is rumoured to have around 1.7 trillion parameters, although OpenAI has not published an official figure.
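To get a feel for where such parameter counts come from, here is a minimal sketch that counts the weights of a small LeNet-5-style network by hand. The layer sizes are an assumption (the commonly used simplified layout with 5x5 kernels and 120/84/10 fully connected units), and the helper names `conv_params` and `linear_params` are made up for this illustration.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Trainable parameters of a 2D convolution with a k x k kernel."""
    return (k * k * c_in + (1 if bias else 0)) * c_out


def linear_params(n_in, n_out, bias=True):
    """Trainable parameters of a fully connected layer."""
    return (n_in + (1 if bias else 0)) * n_out


# Assumed simplified LeNet-5-style layout for a 1 x 32 x 32 input
layers = {
    "conv1 (1 -> 6, 5x5)": conv_params(1, 6, 5),
    "conv2 (6 -> 16, 5x5)": conv_params(6, 16, 5),
    "fc1 (16*5*5 -> 120)": linear_params(16 * 5 * 5, 120),
    "fc2 (120 -> 84)": linear_params(120, 84),
    "fc3 (84 -> 10)": linear_params(84, 10),
}

for name, n in layers.items():
    print(f"{name:24s} {n:>8,d}")

total_params = sum(layers.values())
print(f"{'total':24s} {total_params:>8,d}")
```

In this layout most of the parameters sit in the first fully connected layer rather than in the convolutions, which is one reason why later architectures reduce the spatial size aggressively before any dense layer.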


<br>

---

<br/>

## 3. U-nets and upsampling (unpooling & transpose convolutions)

What are the outputs of the CNNs we have seen so far?

<br>

<center><img src="https://drive.google.com/uc?id=1xSZ3Tb6mJgHit51fEZF9P5UojA0Tossm" width="600"/></center>

<br>

<br>

<center><img src="https://drive.google.com/uc?id=1EITH6oofcQurxnnXNKriZ1dBkWmerlTX" width="600"/></center>

<br>

But CNNs have other applications. In computer vision, a very common architecture is the **U-Net**, a convolutional encoder-decoder network that is closely related to convolutional autoencoders (we will see autoencoders next week):

<p align = "center"><img src="https://drive.google.com/uc?id=1C5oJBIihVeyditn1I0nIjlZfEpkIhH2F" width="800"/></p><p align = "center">
<i> sources: <a href="https://arxiv.org/pdf/1505.04597.pdf">original unet</a>, <a href="https://www.kaggle.com/c/tgs-salt-identification-challenge"> seismic segmentation</a></i>
</p>

<br>

An important operation in U-Net (and other architectures) is upsampling. The most common methods are:

- nearest neighbour
- "bed of nails"
- Max unpooling

<br>

<center><img src="https://drive.google.com/uc?id=1b1-3ncTi9cO7XFFuFljvfkwFlhKFHltU" width="800"/></center>

<br>

and

- Transposed convolution or up-convolution, but **not deconvolution!**
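The three fixed (non-learned) upsampling methods listed above can be sketched in a few lines of PyTorch. This is only an illustration on a toy 4x4 input: the variable names are made up here, and placing the "bed of nails" value in the top-left corner of each block is an assumption about the convention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A single-channel 4x4 "image" with shape (batch, channels, height, width)
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# Downsample with 2x2 max pooling, remembering where each maximum came from
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
pooled, indices = pool(x)                      # pooled has shape (1, 1, 2, 2)

# Nearest neighbour: every value is copied into a 2x2 block
nearest = F.interpolate(pooled, scale_factor=2, mode="nearest")

# "Bed of nails": every value goes to the top-left corner of its 2x2 block,
# and the remaining positions are filled with zeros
bed_of_nails = torch.zeros(1, 1, 4, 4)
bed_of_nails[:, :, ::2, ::2] = pooled

# Max unpooling: every value goes back to the position of the original maximum
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
max_unpooled = unpool(pooled, indices)

print(nearest[0, 0], bed_of_nails[0, 0], max_unpooled[0, 0], sep="\n")
```

Max unpooling needs the indices saved by the matching pooling layer, which is why it is typically used in encoder-decoder networks where each unpooling step mirrors an earlier pooling step.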



## Transposed convolutions

Transposed convolutions can be computed by following an easy recipe:

<center><img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*54-7typHLLXhdvAhlku9SQ.png" width="900"/></center>


where we know that:
- `s`: stride
- `p`: padding
- `k`: kernel size

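and we use these hyperparameters to calculate:
- `z`: the number of zeros to insert between neighbouring pixels of the input
- `p'`: the amount of zero padding to add around the image

There is one caveat: before convolving with the input, the kernel has to be **flipped along both spatial axes** (rotated by 180°). The name *transposed* convolution comes from writing the convolution as a matrix multiplication: the transposed convolution multiplies by the transpose of that matrix, which is not the same as transposing the kernel itself.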
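The recipe can be sketched in code and compared against PyTorch's own implementation. The sketch assumes the usual form of the recipe, `z = s - 1` and `p' = k - p - 1`, a single channel and a square kernel, and it flips the kernel along both spatial axes before the convolution. The function name `transposed_conv_by_recipe` is made up for this example.

```python
import torch
import torch.nn.functional as F


def transposed_conv_by_recipe(x, kernel, stride=1, padding=0):
    """Transposed convolution written as an ordinary convolution.

    x has shape (batch, 1, H, W) and kernel has shape (1, 1, k, k).
    """
    k = kernel.shape[-1]
    z = stride - 1              # zeros inserted between neighbouring pixels
    p_prime = k - padding - 1   # zero padding added around the image

    # 1. insert z zeros between the pixels of the input
    b, c, h, w = x.shape
    stretched = torch.zeros(b, c, (h - 1) * (z + 1) + 1, (w - 1) * (z + 1) + 1)
    stretched[:, :, ::z + 1, ::z + 1] = x

    # 2. pad with p' zeros on every side
    padded = F.pad(stretched, (p_prime,) * 4)

    # 3. flip the kernel along both spatial axes and apply a stride-1 convolution
    flipped = torch.flip(kernel, dims=(-2, -1))
    return F.conv2d(padded, flipped)


torch.manual_seed(0)
x = torch.rand(1, 1, 3, 3)
kernel = torch.rand(1, 1, 3, 3)

by_recipe = transposed_conv_by_recipe(x, kernel, stride=2, padding=1)
reference = F.conv_transpose2d(x, kernel, stride=2, padding=1)

print(by_recipe.shape)                          # torch.Size([1, 1, 5, 5])
print(torch.allclose(by_recipe, reference))
```

The output size follows from the same quantities: a 3x3 input with `s = 2`, `k = 3` and `p = 1` gives `(3 - 1) * 2 + 3 - 2 * 1 = 5` pixels per side.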

# Exercise:
Let's practice with a couple of examples. First, let's try and calculate a simple case by hand:


<p align = "center"><img src="https://drive.google.com/uc?id=1KPteXRKw7OwUKzZkO95MvGAt1qnm-b51" width="800"/></p>


To check our solution we can use [`conv_transpose2d`](https://pytorch.org/docs/stable/generated/torch.nn.functional.conv_transpose2d.html), which lets us pass both an input and a predefined filter. The [`ConvTranspose2d`](https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html#torch.nn.ConvTranspose2d) layer, in contrast, initialises the kernel weights randomly, which is not what we need here.

In [None]:
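# conv_transpose2d expects an input of shape (batch, channels, height, width),
# so a plain 2D tensor like the one below is not accepted:
# x = torch.tensor([[1, 5.], [2., 3.]])
x = torch.tensor([[[[1, 5.], [2., 3.]]]])      ## how many square brackets do I need, and what do they do?
print(x.shape)

# the kernel needs four dimensions too: (in_channels, out_channels, height, width)
# kernel = torch.tensor([[0, 1.], [2., 3.]])
kernel = torch.tensor([[[[0, 1.], [2., 3.]]]])

torch.nn.functional.conv_transpose2d(x, kernel)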

Transposed convolutions, particularly with a stride greater than 1, can leave checkerboard artifacts in the output:

In [None]:
%%html
<iframe src="https://distill.pub/2016/deconv-checkerboard/" width="1000" height="500"></iframe>
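One explanation discussed in the article above is uneven overlap: when the kernel size is not divisible by the stride, some output pixels receive contributions from more kernel positions than their neighbours. A minimal sketch of this effect, using an all-ones input and all-ones kernels so that each output value simply counts the overlaps (the helper `overlap_map` is made up for this example):

```python
import torch
import torch.nn.functional as F

# Constant input and constant kernel: each output value then simply counts
# how many kernel footprints overlap at that pixel
x = torch.ones(1, 1, 4, 4)


def overlap_map(kernel_size, stride):
    kernel = torch.ones(1, 1, kernel_size, kernel_size)
    return F.conv_transpose2d(x, kernel, stride=stride)[0, 0]


uneven = overlap_map(kernel_size=3, stride=2)   # 3 is not divisible by 2
even = overlap_map(kernel_size=4, stride=2)     # 4 is divisible by 2

print(uneven)            # interior alternates between 1, 2 and 4
print(even[2:-2, 2:-2])  # interior is constant
```

A trained network can in principle learn weights that compensate for the uneven overlap, but choosing a kernel size that is a multiple of the stride removes this particular source of the pattern from the start.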

## 4. Transfer learning

What is transfer learning and why is it useful? A definition from the Deep Learning book by Goodfellow et al. (2016):

*Transfer learning and domain adaptation refer to the situation where what has been learned in one setting ... is exploited to improve generalization in another setting.*

<br>

- The most well-known CNN designs are **available** online and have been **successfully trained** on very large numbers of images (millions).

- In many applications we often work with a relatively **small number of images** (or data in general).

- The idea of transfer learning is to use an existing trained CNN model which tries to solve a problem of similar nature and **tailor the model** to our particular application.

The two main strategies are:

1. **Add one (or more) layers, or retrain the last layer(s) of a pre-trained network**: This strategy assumes that the filters of most of the network do a good job at extracting data features we can use. The last layers then act as a final fine-tuning step that captures the specific features of our data.

2. **Retrain the whole network with small learning rates:** This strategy assumes that as a whole, the network captures data features well, and it only needs a bit of a ‘nudge’ to adapt the network parameters to our particular problem. In this case, we want to keep the underlying abstraction that the network does at different scales, but fine-tune it to our problem.

We will see examples of both in this afternoon's exercise.
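As a preview, both strategies can be sketched in a few lines of PyTorch. The tiny `pretrained` network below is a hypothetical stand-in so that the example runs without downloading weights; in practice it would be a model with loaded weights, for example one from `torchvision.models`. The learning rates are illustrative values, not recommendations.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained network: in practice this would be
# a model with loaded weights, for example one from torchvision.models
pretrained = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 1000),            # original head, e.g. 1000 ImageNet classes
)
num_classes = 2                    # e.g. bees vs ants

# Strategy 1: freeze everything, then replace the last layer with a new head
for param in pretrained.parameters():
    param.requires_grad = False
pretrained[-1] = nn.Linear(8, num_classes)    # new layers are trainable by default

head_params = [p for p in pretrained.parameters() if p.requires_grad]
optimizer_head = torch.optim.SGD(head_params, lr=1e-2, momentum=0.9)

# Strategy 2: unfreeze all layers and retrain with a small learning rate
for param in pretrained.parameters():
    param.requires_grad = True
optimizer_all = torch.optim.SGD(pretrained.parameters(), lr=1e-4, momentum=0.9)

# Sanity check: a batch of four RGB images gives one score per class
scores = pretrained(torch.rand(4, 3, 32, 32))
print(scores.shape)                # torch.Size([4, 2])
```

Passing only the trainable parameters to the optimizer in strategy 1 makes it explicit which weights are being updated; the frozen layers act as a fixed feature extractor.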

## 5. Receptive field

The receptive field is the region of the input that affects a given output value. It can be defined between adjacent or non-adjacent CNN layers.

<br>

<center><img src="https://drive.google.com/uc?id=1GfG26m6Xee9qhyiA-ARvbTnzAxbSYo--" width="800"/></center>

<br>

Why are we interested in the receptive field in our network?

Because it determines which hyperparameters we need to choose so that the receptive field covers the whole input:

- filter size and stride
- number of convolutional layers in the network

[Here](https://www.baeldung.com/cs/cnn-receptive-field-size) you can find a more detailed explanation with a few formulas to compute receptive fields.
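As a rough sketch, the receptive field of a stack of convolution and pooling layers can be computed with a simple recurrence: each layer enlarges the receptive field by `(kernel_size - 1)` times the current spacing between neighbouring units, and each stride multiplies that spacing. The function below is written for this illustration and ignores dilation; padding does not change the size of the receptive field, only where it sits on the input.

```python
def receptive_field(layers):
    """Receptive field of one output unit after a stack of layers.

    layers is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    r = 1      # receptive field size, measured in input pixels
    jump = 1   # distance in input pixels between neighbouring units
    for kernel_size, stride in layers:
        r += (kernel_size - 1) * jump
        jump *= stride
    return r


print(receptive_field([(3, 1)]))                  # 3
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

The last line shows why pooling (or strided convolution) is an efficient way to grow the receptive field: after the stride-2 layer, every additional 3x3 convolution adds four input pixels instead of two.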

# Solution to the transposed convolution exercise:



<p align = "center"><img src="https://drive.google.com/uc?id=1KPteXRKw7OwUKzZkO95MvGAt1qnm-b51" width="800"/></p>

<p align = "center"><img src="https://drive.google.com/uc?id=1Q56nHt_tT6L7YOoNBH7PHduS4CrO41o4" width="800"/></p>

We can also check that a normal convolution gives the same result:

In [None]:
# Input with shape (batch, channels, height, width)
x = torch.tensor([[[[1, 5.], [2., 3.]]]])
print(x.shape)

# Flip the kernel [[0, 1], [2, 3]] along both axes, since we use a 'normal' convolution here
kernel = torch.tensor([[[[3, 2.], [1., 0.]]]])

# padding = k - p - 1 = 2 - 0 - 1 = 1 reproduces the transposed convolution (stride 1, no padding)
torch.nn.functional.conv2d(x, kernel, padding=1)