# Transfer Learning with ImageNet

This notebook is a merged and updated version based on two notebooks initially developed for the [UPC School's Postgraduate in Artificial Intelligence with Deep Learning](https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgrau-artificial-intelligence-deep-learning/) (2019). Both notebooks were originally created by [Daniel Fojo](https://www.linkedin.com/in/daniel-fojo/) and [Xavier Giro-i-Nieto](https://imatge.upc.edu/web/people/xavier-giro) and updated by various contributors.

Based on previous versions by [Miriam Bellver](https://imatge.upc.edu/web/people/miriam-bellver) for the [Barcelona Technology School](https://barcelonatechnologyschool.com/) (2018).

Updated by [Pol Caselles](https://www.linkedin.com/in/pcaselles/) (2022) and [Gerard I. Gállego](https://www.linkedin.com/in/gerard-gallego/) (2022).

Merged and updated by [Laia Albors](https://www.linkedin.com/in/laia-albors-zumel-837a35211) and [Àlex Solé](https://www.linkedin.com/in/alex-sole-gomez/) (2024).

In this session we will work with convolutional neural networks on small datasets. This is a very common situation, as data, and especially labelled data, can be difficult to obtain in certain scenarios.

In [None]:
import os
import torch
import urllib.request
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
from PIL import Image
from torchvision.transforms.functional import to_pil_image
import ast
from typing import Tuple, List
import requests

if not torch.cuda.is_available():
    raise Exception("You should enable GPU in the Runtime menu")
device = torch.device("cuda")

In [None]:
seed = 123
np.random.seed(seed)
_ = torch.manual_seed(seed)
_ = torch.cuda.manual_seed(seed)

## Downloading the database

During this lab session, we will work with a small dataset of images of dogs and cats. The cats vs. dogs dataset that we will use isn't packaged with PyTorch. It was made available by Kaggle.com as part of a computer vision competition in late 2013, back when CNNs weren't quite mainstream.

The following command downloads the Kaggle dataset that we will need for this lab session to your remote machine. It will take a few seconds.

In [None]:
!wget --no-check-certificate "https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip" -P /content/

Now you will need to unzip the dataset that you have just downloaded, with the following line:

In [None]:
!unzip -qq /content/cats_and_dogs_filtered.zip -d /content/ && rm /content/cats_and_dogs_filtered.zip
!mkdir -p /datalab
!mv /content/cats_and_dogs_filtered/* /datalab/

Now you have the raw images on your remote machine.

## Training a ResNet from scratch on a small dataset

Having to train an image classification model using very little data is a common situation, which you will likely encounter in
practice if you ever do computer vision in a professional context.

Having "few" samples can mean anywhere from a few hundreds to a few tens of thousands of images. As a practical example, we will focus on
classifying images as "dogs" or "cats", in a dataset containing only 800 pictures of cats and dogs (400 cats, 400 dogs). We will use 400
pictures for training and 400 for validation.

In this section, we will review one basic strategy to tackle this problem: training a new model from scratch on what little data we have. We will start by naively training a small ResNet-inspired architecture on our 400 training samples, without any regularization, to set a baseline for what can be achieved. Here, our main issue will be overfitting. Then we will introduce *data augmentation*, a powerful technique for mitigating overfitting in computer vision. By leveraging data augmentation, we will improve our network.

ResNet is a deep neural network architecture designed to make very deep networks trainable. It introduces "skip connections", or shortcuts, that bypass one or more layers, so that the network learns residual functions relative to the layer inputs. This mitigates issues such as vanishing gradients and accuracy degradation, and enables the construction of much deeper networks. As a result, ResNet models have achieved remarkable performance in image recognition tasks; on ImageNet, their reported top-5 error is lower than commonly cited estimates of human error.
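
The residual idea mentioned above can be illustrated without any deep learning library. The toy sketch below uses plain Python lists instead of tensors; `relu` and `residual_block` are illustrative names and not part of this notebook. It computes `relu(F(x) + x)` and shows that when the learned part `F` outputs zeros, the block simply passes a non-negative input through unchanged.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def residual_block(x, residual_fn):
    """Return relu(F(x) + x), where F is `residual_fn` (the learned part)."""
    fx = residual_fn(x)
    return relu([f + xi for f, xi in zip(fx, x)])


x = [0.5, 1.0, 2.0]  # stand-in for a non-negative activation vector

# If the learned residual is zero, the block acts as the identity on x.
identity_out = residual_block(x, lambda v: [0.0] * len(v))

# A small residual only nudges the input instead of replacing it.
nudged_out = residual_block(x, lambda v: [0.1 * vi for vi in v])

print(identity_out)
print(nudged_out)
```

Because the identity mapping is this easy to represent, stacking additional residual blocks does not force the network to relearn what earlier layers already computed.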

In the last section of today's session, we will review two more essential techniques for applying deep learning to small datasets: *doing feature extraction
with a pre-trained network* and *fine-tuning a pre-trained network*. Together, these three strategies -- training a small model from scratch, doing feature extraction using a
pre-trained model, and fine-tuning a pre-trained model -- will constitute your future toolbox for tackling the problem of doing computer
vision with small datasets.

### The relevance of deep learning for small-data problems

You will sometimes hear that deep learning only works when lots of data is available. This is in part a valid point: one fundamental
characteristic of deep learning is that it is able to find interesting features in the training data on its own, without any need for manual feature engineering, and this can only be achieved when lots of training examples are available. This is especially true for problems where the input samples are very high-dimensional, like images.

However, what constitutes "lots" of samples is relative -- relative to the size and depth of the network you are trying to train, for
starters. It isn't possible to train a ResNet to solve a complex problem with just a few tens of samples, but a few hundreds can
potentially suffice if the model is small and well-regularized and if the task is simple. Because CNNs learn local, translation-invariant features, they are very data-efficient on perceptual problems compared to MLPs. Training a very small ResNet from scratch on a very small image dataset will still yield reasonable results despite a relative lack of data, without the need for any custom feature engineering. You will see this in action in this section.

But what's more, deep learning models are by nature highly repurposable: you can take, say, an image classification or speech-to-text model trained on a large-scale dataset then reuse it on a significantly different problem with only minor changes. Specifically, in the case of computer vision, many pre-trained models (usually trained on the ImageNet dataset) are now publicly available for download and can be used to bootstrap powerful vision models out of very little data. That's what we will do in the next section.

For now, let's get started by getting our hands on the data.

## Preparing the data

Some sample images of the cats vs. dogs dataset look like this:

![cats_vs_dogs_samples](https://s3.amazonaws.com/book.keras.io/img/ch5/cats_vs_dogs_samples.jpg)

The 2013 Cats vs. Dogs Kaggle competition showcased impressive results, with the top entries achieving up to 95% accuracy. In this example, we will aim to approach that level of accuracy (in the next section), even though we will be training our models on less than 5% of the data available to the competitors.

The original dataset consists of 25,000 images of dogs and cats (12,500 from each class) and is 543MB in size (compressed). After downloading and uncompressing the dataset, we will create a smaller dataset with two subsets: a training set containing 200 samples of each class and a validation set containing 200 samples of each class.

Below are a few lines of code to prepare this dataset:

In [None]:
import os, shutil

In [None]:
# The path to the directory where the original
# dataset was uncompressed
original_dataset_dir = '/datalab/train/'

# The directory where we will
# store our smaller dataset
base_dir = '/content/processed_datalab'
if not os.path.exists(base_dir):
    os.makedirs(base_dir)

# Directories for our training,
# validation and test splits
train_dir = os.path.join(base_dir, 'train')
if not os.path.exists(train_dir):
    os.mkdir(train_dir)
validation_dir = os.path.join(base_dir, 'validation')
if not os.path.exists(validation_dir):
    os.mkdir(validation_dir)
test_dir = os.path.join(base_dir, 'test')
if not os.path.exists(test_dir):
    os.mkdir(test_dir)

# Directory with our training cat pictures
train_cats_dir = os.path.join(train_dir, 'cats')
if not os.path.exists(train_cats_dir):
    os.mkdir(train_cats_dir)

# Directory with our training dog pictures
train_dogs_dir = os.path.join(train_dir, 'dogs')
if not os.path.exists(train_dogs_dir):
    os.mkdir(train_dogs_dir)

# Directory with our validation cat pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')
if not os.path.exists(validation_cats_dir):
    os.mkdir(validation_cats_dir)

# Directory with our validation dog pictures
validation_dogs_dir = os.path.join(validation_dir, 'dogs')
if not os.path.exists(validation_dogs_dir):
    os.mkdir(validation_dogs_dir)

# Directory with our test cat pictures (created but left empty in this session)
test_cats_dir = os.path.join(test_dir, 'cats')
if not os.path.exists(test_cats_dir):
    os.mkdir(test_cats_dir)

# Directory with our test dog pictures (created but left empty in this session)
test_dogs_dir = os.path.join(test_dir, 'dogs')
if not os.path.exists(test_dogs_dir):
    os.mkdir(test_dogs_dir)

# Copy first 200 cat images to train_cats_dir
fnames = ['cat.{}.jpg'.format(i) for i in range(200)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, "cats", fname)
    dst = os.path.join(train_cats_dir, fname)
    shutil.copyfile(src, dst)

# Copy next 200 cat images to validation_cats_dir
fnames = ['cat.{}.jpg'.format(i) for i in range(200, 400)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, "cats", fname)
    dst = os.path.join(validation_cats_dir, fname)
    shutil.copyfile(src, dst)

# Copy first 200 dog images to train_dogs_dir
fnames = ['dog.{}.jpg'.format(i) for i in range(200)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, "dogs", fname)
    dst = os.path.join(train_dogs_dir, fname)
    shutil.copyfile(src, dst)

# Copy next 200 dog images to validation_dogs_dir
fnames = ['dog.{}.jpg'.format(i) for i in range(200, 400)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, "dogs", fname)
    dst = os.path.join(validation_dogs_dir, fname)
    shutil.copyfile(src, dst)

As a sanity check, let's count how many pictures we have in each split (train/validation):

In [None]:
print('total training cat images:', len(os.listdir(train_cats_dir)))
print('total training dog images:', len(os.listdir(train_dogs_dir)))
print('total validation cat images:', len(os.listdir(validation_cats_dir)))
print('total validation dog images:', len(os.listdir(validation_dogs_dir)))
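
The directory setup above is fairly repetitive, and the same layout can be produced with a small helper. The sketch below is one possible refactoring, not the notebook's actual code: `copy_split` is a hypothetical helper, and it runs against a temporary directory filled with empty placeholder files so that it can be tried without the real dataset.

```python
import os
import shutil
import tempfile


def copy_split(src_root, dst_root, split, cls, prefix, indices):
    """Copy `<prefix>.<i>.jpg` for each i in `indices` from src_root/cls to dst_root/split/cls."""
    dst_dir = os.path.join(dst_root, split, cls)
    os.makedirs(dst_dir, exist_ok=True)  # creates parents, no error if it already exists
    for i in indices:
        fname = f"{prefix}.{i}.jpg"
        shutil.copyfile(os.path.join(src_root, cls, fname),
                        os.path.join(dst_dir, fname))
    return dst_dir


classes = [("cats", "cat"), ("dogs", "dog")]
splits = [("train", range(200)), ("validation", range(200, 400))]

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "original")
    dst = os.path.join(tmp, "processed")

    # Empty placeholder files standing in for the real images (demo assumption).
    for cls, prefix in classes:
        os.makedirs(os.path.join(src, cls))
        for i in range(400):
            open(os.path.join(src, cls, f"{prefix}.{i}.jpg"), "w").close()

    # Build the train/validation layout and count the files in each folder.
    counts = {}
    for cls, prefix in classes:
        for split, indices in splits:
            out_dir = copy_split(src, dst, split, cls, prefix, indices)
            counts[(split, cls)] = len(os.listdir(out_dir))

print(counts)
```

Using `os.makedirs(..., exist_ok=True)` replaces the repeated `if not os.path.exists(...)` checks and also creates any missing parent directories.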

## Building our network

### Exercise 0

You need to build a small ResNet-like architecture. To do that, we first define the ResNet block. The block consists of two Conv2D layers with a 3×3 kernel, padding of 1, and no bias. The first Conv2D uses the stride given by the `stride` parameter. It is followed by a [Batch Normalization 2D](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html) layer and a ReLU activation. The block then continues with the second Conv2D, now with stride 1, and another Batch Normalization 2D. At this point we add the identity, i.e. the original input from before the first convolution. If `in_channels` equals `out_channels` and `stride` equals 1, we simply add the original input; otherwise, we first map it to the right dimensions using the shortcut module already provided. Finally, we pass the sum through a ReLU.

Hint: [PyTorch Implementation of the ResNet](https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py)


In [None]:
class SimpleResNetBlock(nn.Module):
        def __init__(self, in_channels, out_channels, stride):
                super(SimpleResNetBlock, self).__init__()

                # TODO: define the layers described above
                ...

                if in_channels != out_channels or stride != 1:
                        self.shortcut = nn.Sequential(nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                                                      nn.BatchNorm2d(out_channels))
                else:
                        self.shortcut = None

        def forward(self, x):
                identity = x
                if self.shortcut is not None:
                        identity = self.shortcut(identity)

                # TODO: pass the input through the defined layers, add the identity (residual connection) and apply a ReLU to the result
                ...

                return x

Once we have created the blocks, we are going to define a ResNet architecture. The first block is given and consists of a `Conv2D` with a 7×7 kernel size and stride 2, followed by `Batch Normalization` and a `ReLU`. The image is then pooled using a `MaxPool`. This is where we start adding our residual blocks. You should add three residual blocks with a stride of 2, doubling the output channels at each block, starting with 64 input channels. At the end, add two fully connected layers with 512 hidden units before the final binary classification between cats and dogs using a `Sigmoid` activation.

In [None]:
#Smaller implementation inspired by the original resnet
class SimpleResNet(nn.Module):
        def __init__(self):
                super(SimpleResNet, self).__init__()
                self.dim_in = 64
                self.conv1 = nn.Conv2d(3, self.dim_in, kernel_size=7, stride=2, padding=3, bias=False)
                self.bn1 = nn.BatchNorm2d(self.dim_in)
                self.relu = nn.ReLU(inplace=True)
                self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
                self.avgpool = nn.AdaptiveAvgPool2d((1, 1)) # hint: https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveAvgPool2d.html

                # TODO: call 3 residuals blocks
                self.layers = nn.ModuleList([...])

                # TODO: define the final classifier
                self.classifier = nn.Sequential(...)


        def forward(self, x):
                x = self.conv1(x)
                x = self.bn1(x)
                x = self.relu(x)
                x = self.maxpool(x)

                # TODO: call the residual blocks
                ...

                x = self.avgpool(x)

                # TODO: call the classifier
                ...

                return x

model = SimpleResNet()

model.to(device)
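
To check the architecture described above, it can help to trace the spatial size of a 150×150 input through the network by hand. The sketch below uses the standard output-size formula for convolution and pooling layers, `floor((n + 2p - k) / s) + 1`. `conv_out` is a hypothetical helper, and the channel counts (64, then doubling in each of the three blocks) follow our reading of the exercise text rather than a reference solution.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a conv/pool layer: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


size, channels = 150, 3
trace = [("input", size, channels)]

# Stem: 7x7 convolution with stride 2 and padding 3, then 3x3 max pooling with stride 2.
size, channels = conv_out(size, kernel=7, stride=2, padding=3), 64
trace.append(("conv1", size, channels))
size = conv_out(size, kernel=3, stride=2, padding=1)
trace.append(("maxpool", size, channels))

# Three residual blocks with stride 2, doubling the channels each time.
shortcut_matches = []
for i in range(3):
    main_path = conv_out(size, kernel=3, stride=2, padding=1)   # first 3x3 conv
    main_path = conv_out(main_path, kernel=3, stride=1, padding=1)  # second 3x3 conv
    shortcut = conv_out(size, kernel=1, stride=2, padding=0)    # 1x1 shortcut conv
    shortcut_matches.append(main_path == shortcut)
    size, channels = main_path, channels * 2
    trace.append((f"block{i + 1}", size, channels))

# Adaptive average pooling to (1, 1) leaves one value per channel.
trace.append(("avgpool", 1, channels))

for name, s, c in trace:
    print(f"{name:8s} {s:3d} x {s:<3d} channels={c}")
```

The main path and the 1×1 shortcut produce the same spatial size in every block, which is what makes the residual addition possible. After pooling, the flattened vector that reaches the classifier has 512 features.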

For the optimizer, we will use Adam. Since we are dealing with binary classification, we will use binary cross entropy as our loss function.

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCELoss()
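
As a reminder of what this loss computes, the sketch below evaluates binary cross entropy by hand on three made-up predictions. It is a plain-Python illustration (the helper name `binary_cross_entropy` is ours) of the mean-reduced formula `-[y log(p) + (1 - y) log(1 - p)]`, which is the default behaviour of `nn.BCELoss`. Note that `ImageFolder` assigns labels alphabetically by folder name, so here 0 means cat and 1 means dog.

```python
import math


def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, targets)]
    return sum(losses) / len(losses)


# Made-up sigmoid outputs (probability of "dog") and their true labels.
probs = [0.9, 0.2, 0.6]
targets = [1.0, 0.0, 0.0]

loss = binary_cross_entropy(probs, targets)

# A confident wrong prediction is penalized much more than a hesitant one.
hesitant_wrong = binary_cross_entropy([0.6], [0.0])
confident_wrong = binary_cross_entropy([0.99], [0.0])

print(loss, hesitant_wrong, confident_wrong)
```

Since the loss expects probabilities in (0, 1), the model's last layer must be a `Sigmoid`, as requested in the exercise.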

## Data preprocessing

Data should be formatted into appropriately pre-processed floating point tensors before being fed into our
network. Currently, our data sits on a drive as JPEG files, so the steps for getting it into our network are roughly:

* Read the picture files.
* Decode the JPEG content to RGB grids of pixels.
* Resize and crop the pictures to the desired size (150x150).
* Convert these into floating point tensors.
* Rescale the pixel values (between 0 and 255) to the [0, 1] interval (neural networks prefer to deal with small input values).

It may seem a bit daunting, but thankfully PyTorch has utilities included in the TorchVision library to take care of these steps automatically. We will use the `ImageFolder` Dataset class, which reads images from different folders, where each folder represents a different category. We will use `torchvision.transforms` to preprocess our data.

In [None]:
transform = transforms.Compose([
                                transforms.Resize(150), # Resize the short side of the image to 150 keeping aspect ratio
                                transforms.CenterCrop(150), # Crop a square in the center of the image
                                transforms.ToTensor(), # Convert the image to a tensor with pixels in the range [0, 1]
                                ])
batch_size = 64

train_dataset = ImageFolder(train_dir, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataset = ImageFolder(validation_dir, transform=transform)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
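
To make the geometry of `Resize(150)` followed by `CenterCrop(150)` concrete, the sketch below works through the arithmetic for a hypothetical 500×375 picture. The two helpers are simplified re-implementations written for this example, not torchvision's code, and they only compute sizes and coordinates rather than touching pixels.

```python
def resize_short_side(width, height, target):
    """New (width, height) after scaling so that the shorter side equals `target`."""
    if width <= height:
        return target, int(target * height / width)
    return int(target * width / height), target


def center_crop_box(width, height, crop):
    """(left, top, right, bottom) of a centered crop x crop square."""
    left = int(round((width - crop) / 2.0))
    top = int(round((height - crop) / 2.0))
    return left, top, left + crop, top + crop


# A hypothetical landscape picture: 500 pixels wide, 375 pixels high.
new_w, new_h = resize_short_side(500, 375, 150)
box = center_crop_box(new_w, new_h, 150)

# ToTensor then rescales 8-bit pixel values from [0, 255] to [0, 1].
scaled = [v / 255.0 for v in (0, 128, 255)]

print((new_w, new_h), box, scaled)
```

The aspect ratio is preserved by the resize, so only the left and right margins of this picture are discarded by the crop.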

Let's take a look at the output of one of these data loaders: it yields batches of 150x150 RGB images (shape `(64, 3, 150, 150)`) and binary
labels (shape `(64,)`). 64 is the number of samples in each batch (the batch size).

In [None]:
for data_batch, labels_batch in train_loader:
    print('data batch shape:', data_batch.shape)
    print('labels batch shape:', labels_batch.shape)
    break

Let's fit our model to the data using the DataLoader.

In [None]:
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def train_model(model, optimizer, loss_fn, train_loader, val_loader, epochs):

    train_accuracies, train_losses, val_accuracies, val_losses = [], [], [], []
    val_loss = AverageMeter()
    val_accuracy = AverageMeter()
    train_loss = AverageMeter()
    train_accuracy = AverageMeter()

    for epoch in range(epochs):
        # train
        model.train()
        train_loss.reset()
        train_accuracy.reset()
        train_loop = tqdm(train_loader, unit=" batches")  # For printing the progress bar
        for data, target in train_loop:
            train_loop.set_description('[TRAIN] Epoch {}/{}'.format(epoch + 1, epochs))
            data, target = data.float().to(device), target.float().to(device)
            target = target.unsqueeze(-1)
            optimizer.zero_grad()
            output = model(data)
            loss = loss_fn(output, target)
            loss.backward()
            optimizer.step()

            train_loss.update(loss.item(), n=len(target))
            pred = output.round()  # get the prediction
            acc = pred.eq(target.view_as(pred)).sum().item()/len(target)
            train_accuracy.update(acc, n=len(target))
            train_loop.set_postfix(loss=train_loss.avg, accuracy=train_accuracy.avg)

        train_losses.append(train_loss.avg)
        train_accuracies.append(train_accuracy.avg)

        # validation
        model.eval()
        val_loss.reset()
        val_accuracy.reset()
        val_loop = tqdm(val_loader, unit=" batches")  # For printing the progress bar
        with torch.no_grad():
            for data, target in val_loop:
                val_loop.set_description('[VAL] Epoch {}/{}'.format(epoch + 1, epochs))
                data, target = data.float().to(device), target.float().to(device)
                target = target.unsqueeze(-1)
                output = model(data)
                loss = loss_fn(output, target)
                val_loss.update(loss.item(), n=len(target))
                pred = output.round()  # get the prediction
                acc = pred.eq(target.view_as(pred)).sum().item()/len(target)
                val_accuracy.update(acc, n=len(target))
                val_loop.set_postfix(loss=val_loss.avg, accuracy=val_accuracy.avg)

        val_losses.append(val_loss.avg)
        val_accuracies.append(val_accuracy.avg)

    return train_accuracies, train_losses, val_accuracies, val_losses
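
The `n=len(target)` argument passed to `AverageMeter.update` matters because the last batch is usually smaller: 400 samples with a batch size of 64 give six batches of 64 and one of 16. The sketch below repeats the class so that it runs on its own and compares the weighted average with a naive mean over batches. The per-batch accuracies are made up for illustration.

```python
class AverageMeter:
    """Computes and stores the average and current value."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


batch_sizes = [64] * 6 + [16]    # 400 samples with batch_size=64
batch_accs = [0.5] * 6 + [1.0]   # made-up per-batch accuracies

meter = AverageMeter()
for acc, n in zip(batch_accs, batch_sizes):
    meter.update(acc, n=n)

# Averaging over batches gives the small last batch too much weight.
naive_mean = sum(batch_accs) / len(batch_accs)

print(meter.count, meter.avg, naive_mean)
```

The weighted value is the true per-sample accuracy, whereas the naive mean over batches overestimates it in this example.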


In [None]:
epochs = 60
train_accuracies, train_losses, val_accuracies, val_losses = train_model(model, optimizer, loss_fn, train_loader, val_loader, epochs)

Let's plot the loss and accuracy of the model over the training and validation data during training:

In [None]:
epochs = range(len(train_accuracies))

plt.plot(epochs, train_accuracies, 'b', label='Training acc')
plt.plot(epochs, val_accuracies, 'r', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, train_losses, 'b', label='Training loss')
plt.plot(epochs, val_losses, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

These plots are characteristic of overfitting. Our training accuracy increases linearly over time, until it reaches nearly 100%, while our
validation accuracy stalls at about 60%. Our validation loss reaches its minimum after only 10 epochs then stalls, while the training loss
keeps decreasing linearly until it reaches nearly 0.

Because we only have very few training samples (400), overfitting is going to be our number one concern. You already know about a
number of techniques that can help mitigate overfitting, such as dropout and weight decay (L2 regularization). We are now going to
introduce a new one, specific to computer vision, and used almost universally when processing images with deep learning models: *data
augmentation*.

# 1 Using data augmentation

Overfitting is caused by having too few samples to learn from, which results in the inability to train a model that generalizes to new data.
Given infinite data, our model would be exposed to every possible aspect of the data distribution at hand: we would never overfit. Data
augmentation takes the approach of generating more training data from existing training samples, by "augmenting" the samples via a number
of random transformations that yield believable-looking images. The goal is that at training time, our model would never see the exact same
picture twice. This helps the model get exposed to more aspects of the data and generalize better.

In PyTorch, this can be done by passing TorchVision transformations to the Dataset. Let's get started with an example:

In [None]:
train_transform = transforms.Compose([transforms.RandomHorizontalFlip(), transforms.RandomRotation(10), transforms.RandomResizedCrop(150), transforms.ToTensor()])
val_transform = transforms.Compose([transforms.Resize(150), transforms.CenterCrop(150), transforms.ToTensor()])

batch_size = 64
augmented_dataset = ImageFolder(train_dir, transform=train_transform)
augmented_loader = DataLoader(augmented_dataset, batch_size=batch_size, shuffle=True)
val_dataset = ImageFolder(validation_dir, transform=val_transform)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

These are just a few of the options available (for more, see the [documentation](https://pytorch.org/vision/stable/transforms.html)). Let's quickly go over what we just wrote:

* `RandomRotation(10)` rotates each picture by a random angle drawn from the range [-10, 10] degrees.
* `RandomHorizontalFlip` flips each image horizontally with probability 0.5 by default -- relevant when there are no assumptions of horizontal
asymmetry (e.g. real-world pictures).
* `RandomResizedCrop(150)` crops a random region of the image, with a random area and a slightly distorted aspect ratio, and resizes it to 150x150. It is a very popular transformation for training on ImageNet.
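
To see why random transformations mean the model rarely sees exactly the same input twice, here is a toy version of a random horizontal flip on a tiny "image" stored as a list of rows. It is a plain-Python sketch (`random_horizontal_flip` is a hypothetical helper, not the torchvision class) that draws a fresh random decision every time the same image is requested.

```python
import random


def random_horizontal_flip(image, rng, p=0.5):
    """Return the image mirrored left-to-right with probability p, else unchanged."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image


image = [[1, 2, 3],
         [4, 5, 6]]
flipped = [[3, 2, 1],
           [6, 5, 4]]

rng = random.Random(123)  # seeded so that the run is reproducible

# Requesting the same image several times yields a mix of both variants.
views = [random_horizontal_flip(image, rng) for _ in range(8)]
distinct_views = {tuple(map(tuple, view)) for view in views}

print(len(distinct_views), "distinct view(s) out of", len(views))
```

Combining several such random transformations, as in `train_transform`, multiplies the number of possible variants of each original picture.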

Let's take a look at our augmented images:

In [None]:
random_indices = [0, 42, 200, 242]
augmented_images = [augmented_dataset[i][0] for i in random_indices]
augmented_images = [np.asarray(transforms.functional.to_pil_image(im)) for im in augmented_images]

images = [train_dataset[i][0] for i in random_indices]
images = [np.asarray(transforms.functional.to_pil_image(im)) for im in images]

fig, axes = plt.subplots(len(random_indices), 2, figsize=(15, 15))
for i in range(len(random_indices)):
    axes[i, 0].imshow(images[i])
    axes[i, 1].imshow(augmented_images[i])

## Dropout regularization

If we train a new network using this data augmentation configuration, our network will never see the same input twice. However, the inputs
that it sees are still heavily intercorrelated, since they come from a small number of original images -- we cannot produce new information,
we can only remix existing information. As such, this might not be quite enough to completely get rid of overfitting.
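
Dropout, the additional regularizer used in this section, randomly zeroes activations during training. The sketch below is a plain-Python illustration of the usual "inverted dropout" behaviour, which is also what `nn.Dropout` does: in training mode each value is dropped with probability `p` and the survivors are scaled by `1 / (1 - p)`, while in evaluation mode the input passes through unchanged. The `dropout` function is a hypothetical helper written for this example.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout on a list of floats."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: identity
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)
features = [1.0] * 10000

train_out = dropout(features, p=0.5, training=True, rng=rng)
eval_out = dropout(features, p=0.5, training=False, rng=rng)

# Scaling the survivors keeps the expected activation unchanged.
mean_train = sum(train_out) / len(train_out)
kept_fraction = sum(1 for v in train_out if v != 0.0) / len(train_out)

print(mean_train, kept_fraction)
```

This is also why calling `model.eval()` before validation matters: it switches dropout off so that predictions are deterministic.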

### Exercise 1

To further fight overfitting, modify the SimpleResNet defined above by adding a [Dropout](https://pytorch.org/docs/stable/nn.html?highlight=dropout#torch.nn.Dropout) layer with probability 0.5 between the convolutional block and the linear classifier.

In [None]:
class SimpleResNetDrop(nn.Module):
        def __init__(self):
                super(SimpleResNetDrop, self).__init__()
                # TODO
                ...


        def forward(self, x):
                # TODO

                for layer in self.layers:
                        x = layer(x)

                x = self.avgpool(x)
                x = torch.flatten(x,1)
                x = self.dropout(x)
                x = self.classifier(x)
                return x    ## ...

model = SimpleResNetDrop()

model.to(device)

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCELoss()

Let's train our network using data augmentation and dropout:

In [None]:
epochs = 60
train_accuracies, train_losses, val_accuracies, val_losses = train_model(model, optimizer, loss_fn, augmented_loader, val_loader, epochs)

Let's save a checkpoint of our model for future reference.

In [None]:
torch.save(model.state_dict(), 'cats_and_dogs_small_2_class.pt')
np.savez('history_cats_and_dogs_small_2_class.npz', train_accuracies=train_accuracies, train_losses=train_losses, val_accuracies=val_accuracies, val_losses=val_losses)

Plot the curves with the following lines:

In [None]:
epochs = range(len(train_accuracies))

plt.plot(epochs, train_accuracies, 'b', label='Training acc')
plt.plot(epochs, val_accuracies, 'r', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, train_losses, 'b', label='Training loss')
plt.plot(epochs, val_losses, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

Thanks to data augmentation and dropout, we are no longer overfitting: the training curves are rather closely tracking the validation
curves. We are now able to reach an accuracy of about 65%.

By leveraging regularization techniques even further and by tuning the network's parameters (such as the number of filters per convolution
layer, or the number of layers in the network), we may be able to get an even better accuracy, likely up to 70%. However, it would prove
very difficult to go any higher just by training our own ResNet from scratch, simply because we have so little data to work with. As a
next step to improve our accuracy on this problem, we will have to leverage a pre-trained model, which will be the focus of the next two
sections.

# 2 Using a pre-trained CNN from ImageNet

A common and highly effective approach to deep learning on small image datasets is to leverage a pre-trained network. A pre-trained network is simply a saved network previously trained on a large dataset, typically on a large-scale image classification task. If this original dataset is large enough and general enough, then the spatial feature hierarchy learned by the pre-trained network can effectively act as a generic model of our visual world, and hence its features can prove useful for many different computer vision problems, even though these new problems might involve completely different classes from those of the original task. For instance, one might train a network on ImageNet (where classes are mostly animals and everyday objects) and then re-purpose this trained network for something as remote as identifying furniture items in images. Such portability of learned features across different problems is a key advantage of deep learning compared to many older shallow learning approaches, and it makes deep learning very effective for small-data problems.

In our case, we will consider a large convnet trained on the ImageNet dataset (1.4 million labeled images and 1000 different classes). ImageNet contains many animal classes, including different species of cats and dogs, and we can thus expect to perform very well on our cat vs. dog classification problem.

In Torchvision there are many models pretrained on ImageNet. You can see the list [here](https://pytorch.org/vision/stable/models.html#table-of-all-available-classification-weights).
We will begin by seeing how to use these already trained networks to classify the images from our dataset. In particular, we will use a ResNet18.


### Exercise 2

Load a pretrained ResNet18 model from Torchvision. Remember to set evaluation mode by calling `.eval()` and to move the model to the GPU.

In [None]:
from torchvision import models

# TODO: load a pretrained ResNet18, set it to evaluation mode and move it to the GPU
pretrained_model = ...


from torchvision.models import ResNet

# Check the specific architecture (number of layers in each block matches ResNet18)
expected_layers = [2, 2, 2, 2]  # ResNet18 has 2 layers in each block
actual_layers = None  # stays None if the model is not a ResNet at all
try:
    assert isinstance(pretrained_model, ResNet), "Model is not a ResNet"
    assert pretrained_model.__class__.__name__ == "ResNet", "Model is not an instance of ResNet"
    actual_layers = [len(layer) for layer in [pretrained_model.layer1, pretrained_model.layer2, pretrained_model.layer3, pretrained_model.layer4]]
    assert actual_layers == expected_layers, "Model is not a ResNet18"
except Exception as e:
    print("Loaded model:", pretrained_model)
    print("Expected ResNet18 architecture with layers:", expected_layers)
    print("Actual layers:", actual_layers)
    raise e
try:
    assert next(pretrained_model.parameters()).is_cuda
except AssertionError:
    raise Exception("Did you forget to move the model to GPU?")

try:
    assert not pretrained_model.training
except AssertionError:
    raise Exception("Did you forget to set evaluation mode?")

print("Well done!")

We will also load the 1000 imagenet labels in a Python dictionary.

In [None]:
urllib.request.urlretrieve("https://gist.githubusercontent.com/yrevar/942d3a0ac09ec9e5eb3a/raw/238f720ff059c1f82f368259d1ca4ffa5dd8f9f5/imagenet1000_clsidx_to_labels.txt", "labels.json")
with open("labels.json") as f:
    labels = ast.literal_eval(f.read())  # safely parse the Python dict literal stored in the file
print(f"We have {len(labels)} labels.")
print(labels)
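
Although the downloaded file is saved as `labels.json`, its content is a Python dictionary literal with integer keys and single-quoted strings, which is why `ast.literal_eval` is used instead of the `json` module. The small sketch below shows the difference on a two-entry excerpt written in the same style. The excerpt string is typed in by hand for illustration and is not read from the file.

```python
import ast
import json

# Two entries in the same style as the labels file (hand-written excerpt).
text = "{0: 'tench, Tinca tinca',\n 1: 'goldfish, Carassius auratus'}"

# literal_eval only accepts Python literals, so it is safe on untrusted text.
labels_demo = ast.literal_eval(text)

# The same text is not valid JSON: keys must be double-quoted strings.
try:
    json.loads(text)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

print(labels_demo[0], "| valid JSON:", json_ok)
```

The resulting dictionary maps the integer class index predicted by the network to a human-readable label.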

We will also group these 1000 labels into three superclasses: dog, cat, and other, to align with the classes of our task.

In [None]:
# read imagenet wordnet ids
import urllib.request

# Download the file
url = "https://gist.githubusercontent.com/yrevar/667fd94b94f1666137f45d1363f60910/raw/5722af8486eb7b152e7431a34b957ded557b5256/imagenet1000_clsid_to_labels.txt"
urllib.request.urlretrieve(url, "imagenet_classes.txt")

# Read the file and manually convert it into a dictionary
imagenet_classes = {}
with open("imagenet_classes.txt", "r") as f:
    lines = f.readlines()

# Fix first and last lines, then process all lines
lines[0] = lines[0].replace("{", "").strip()  # Remove opening '{' from the first line
lines[-1] = lines[-1].replace("'}", "").strip()  # Remove closing "'}" from the last line

for line in lines:
    line = line.strip()
    if line:  # Skip empty lines
        # Extract key and value safely
        try:
            key, value = line.split(": ", 1)
            key = key.strip()  # Remove extra spaces around the key
            value = value.strip().strip(",")  # Remove trailing comma
            imagenet_classes[key] = value.strip("'")  # Remove surrounding quotes
        except ValueError:
            print(f"Skipping problematic line: {line}")

# Build the conversion from WordNet ids to class indices
from collections import defaultdict

labels_idx = defaultdict(list)
labels_id = defaultdict(list)

for idx, v in labels.items():
    labels_idx[v].append(idx)
for id_, v in imagenet_classes.items():
    labels_id[v].append(id_)

id2idx = {id_: idx for v, idxs in labels_idx.items() if len(idxs) == len(labels_id[v]) for id_, idx in zip(labels_id[v], idxs)}


# get superclasses
!pip install nltk

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet as wn

# Define WordNet synsets for the superclasses
dog_synset = wn.synset('dog.n.01')  # Base synset for 'dog'
cat_synset = wn.synset('cat.n.01')  # Base synset for 'cat'

# Function to determine superclass
def get_superclass(wordnet_id):
    try:
        synset = wn.synset_from_pos_and_offset('n', int(wordnet_id[1:]))
        if synset.lowest_common_hypernyms(dog_synset) == [dog_synset]:
            return 'dog'
        elif synset.lowest_common_hypernyms(cat_synset) == [cat_synset]:
            return 'cat'
        else:
            return 'other'
    except Exception as e:
        raise Exception(f"Error: {str(e)}")

# Group labels into superclasses
superclasses = {}
for wordnet_id, label in imagenet_classes.items():
    superclass = get_superclass(wordnet_id)
    superclasses[id2idx[wordnet_id]] = superclass


# group the indices in superclasses
idx_superclasses = {}
for idx, superclass in superclasses.items():
    if superclass not in idx_superclasses:
        idx_superclasses[superclass] = []
    idx_superclasses[superclass].append(idx)
superidx_idx = {0: idx_superclasses['cat'], 1: idx_superclasses['dog'], 2: idx_superclasses['other']}
idx_superidx = {i: super_idx for super_idx,idxs in superidx_idx.items() for i in idxs}


print(f"We have {len(superclasses)} indices.")
print(f"Superclasses: {set(superclasses.values())}")
print(superclasses)
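
The index bookkeeping at the end of the cell above can be sketched on a toy example. The class indices and their superclass assignments below are made up for illustration (they are not real ImageNet indices); only the dictionary manipulations mirror the cell.

```python
from collections import defaultdict

# Hypothetical mapping: class index -> superclass name (toy values, not real ImageNet indices)
toy_superclasses = {0: "other", 1: "dog", 2: "dog", 3: "cat", 4: "other"}

# Group the class indices by superclass name
toy_idx_superclasses = defaultdict(list)
for idx, superclass in toy_superclasses.items():
    toy_idx_superclasses[superclass].append(idx)

# Fix the superclass order (0: cat, 1: dog, 2: other) and invert the mapping
toy_superidx_idx = {
    0: toy_idx_superclasses["cat"],
    1: toy_idx_superclasses["dog"],
    2: toy_idx_superclasses["other"],
}
toy_idx_superidx = {i: super_idx for super_idx, idxs in toy_superidx_idx.items() for i in idxs}

print(dict(toy_idx_superclasses))
print(toy_idx_superidx)
```

Using `defaultdict(list)` is just a shorter way of writing the explicit "create the list if the key is missing" check.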

We can now download a sample image to test our network. When working with PyTorch, we usually load images with Pillow, the maintained fork of PIL (Python Imaging Library), which is the standard Python library for working with images. `Image.open("path/to/image.jpg")` returns an `Image` object, which can then be converted to a NumPy array or a PyTorch tensor. We can look at the images using the `imshow` function from `matplotlib`.

In [None]:
def show_image(pil_image):
    imshow(np.asarray(pil_image))

In [None]:
urllib.request.urlretrieve("https://upload.wikimedia.org/wikipedia/commons/0/04/Labrador_Retriever_(1210559).jpg", "dog_image.jpg")
pil_image = Image.open("dog_image.jpg")
show_image(pil_image)

As you know, when the network was trained, ImageNet images went through some preprocessing transformations. Before feeding an image to the network, we should apply the same transformations.

As in validation, we will:
* Resize the image to size 256 (that is, the smallest side will be 256 pixels)
* Crop the image at the center with size 224
* Convert the image to a torch tensor
* Normalize the pixel values with the ImageNet mean and standard deviation for each channel. These values are: `mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]`

We will use `transforms.Compose` to chain all these transformations into a single callable.
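
To make the arithmetic of these steps concrete, here is a small plain-Python sketch. It deliberately avoids `torchvision`, so it is not the solution to the exercise below. The helper names `resized_size`, `center_crop_box` and `normalize_pixel` are hypothetical, and the exact rounding used by the real library may differ slightly.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def resized_size(width, height, smallest_side=256):
    """Scale (width, height) so that the smallest side equals `smallest_side`."""
    scale = smallest_side / min(width, height)
    return round(width * scale), round(height * scale)

def center_crop_box(width, height, crop=224):
    """Return the (left, top, right, bottom) box of a centered square crop."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

def normalize_pixel(rgb):
    """Normalize one RGB pixel whose values are already scaled to [0, 1]."""
    return [(value - m) / s for value, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]

new_w, new_h = resized_size(640, 480)   # a 640x480 image keeps its aspect ratio
print(new_w, new_h)
print(center_crop_box(new_w, new_h))
print(normalize_pixel(IMAGENET_MEAN))   # a pixel equal to the mean maps to zeros
```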

### Exercise 3

Define the preprocessing transformations.

In [None]:
# TODO : Define the preprocessing transformations defined above
preprocess = transforms.Compose([...])

torch_image = preprocess(pil_image)
try:
    assert isinstance(torch_image, torch.Tensor)
    assert list(torch_image.shape) == [3, 224, 224]
    print("Well done!")
except Exception:
    raise Exception("Did you do the 4 required transformations?")


We can take a look at the image that will go through the network.

In [None]:
def show_torch_image(torch_image: torch.Tensor) -> None:
    img = (torch_image-torch_image.min()) / (torch_image.max() - torch_image.min())
    show_image(to_pil_image(img))

In [None]:
show_torch_image(torch_image)

Now we can get the predictions of our network. Note that our model expects a batch dimension at the beginning, so we should add it with the method `.unsqueeze()`.

### Exercise 4

Complete the predict function. Remember to move the images to GPU, as well as adding the batch dimension.

In [None]:
def predict_image(
        torch_image: torch.Tensor,
        model: torch.nn.Module,
        topk: int = 3,
        ) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:

    x = torch_image.to(device) # move image to GPU
    x = x.unsqueeze(0) # add batch dimension

    # TODO: predict raw outputs
    output = ...

    output = torch.softmax(output, dim=1)  # Compute the softmax to get probabilities
    probs, idxs = output.topk(topk)  # Get the top k predictions
    # Top-k ImageNet classes, with probabilities expressed in percent
    imagenet_preds = [(labels[i.item()], p.item()*100) for p, i in zip(probs[0], idxs[0])]
    # Superclass probabilities (fractions in [0, 1]): sum over the ImageNet classes of each group
    superclasses_preds = [
        ('dog', output[0, idx_superclasses['dog']].sum().item()),
        ('cat', output[0, idx_superclasses['cat']].sum().item()),
        ('other', output[0, idx_superclasses['other']].sum().item())
    ]
    return imagenet_preds, superclasses_preds

imagenet_preds, superclasses_preds = predict_image(torch_image, pretrained_model)

print('ImageNet prediction:', imagenet_preds)
print('Superclass prediction:', superclasses_preds)
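
The superclass probabilities are obtained by applying a softmax to the raw outputs and then summing the probabilities of all classes that belong to each group. A minimal plain-Python sketch of that computation follows; the five raw scores and the grouping are made up for illustration, and `softmax` here is a hypothetical helper rather than the PyTorch function.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs for 5 classes and a made-up class-to-superclass grouping
toy_logits = [2.0, 1.0, 0.5, 3.0, -1.0]
toy_groups = {"cat": [0, 1], "dog": [3], "other": [2, 4]}

toy_probs = softmax(toy_logits)
toy_super_probs = {name: sum(toy_probs[i] for i in idxs) for name, idxs in toy_groups.items()}

print(toy_super_probs)
print("Predicted superclass:", max(toy_super_probs, key=toy_super_probs.get))
```

Because every class belongs to exactly one group, the superclass probabilities still sum to 1.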

Try predicting other images with different networks and compare the results. You can use the following function, `predict_from_url`. The list of pretrained PyTorch networks can be found [here](https://pytorch.org/vision/stable/models.html#classification). When trying different models, remember to move them to GPU and set them to evaluation mode.

In [None]:
def predict_from_url(
        url: str,
        model: torch.nn.Module,
        ) -> None:
    # urllib.request.urlretrieve(url, "image.jpg")
    headers = {'User-Agent': 'Mozilla/5.0'}  # some servers reject requests without a User-Agent
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail early if the download did not succeed
    with open("image.jpg", 'wb') as out_file:
        out_file.write(response.content)
    # Convert to RGB so that images with an alpha channel (e.g. PNG) still have 3 channels
    pil_image = Image.open("image.jpg").convert("RGB")
    show_image(pil_image)
    torch_image = preprocess(pil_image)
    imagenet_preds, superclasses_preds = predict_image(torch_image, model)
    print('ImageNet prediction:', imagenet_preds)
    print('Superclass prediction:', superclasses_preds)

print('resnet18:')
resnet18 = models.resnet18(weights='DEFAULT').to(device).eval()
predict_from_url("https://icatcare.org/app/uploads/2018/07/Thinking-of-getting-a-cat.png", resnet18)

print('\nvgg16:')
vgg16 = models.vgg16(weights='DEFAULT').to(device).eval()
predict_from_url("https://icatcare.org/app/uploads/2018/07/Thinking-of-getting-a-cat.png", vgg16)

Let's now use these superclasses to test different pretrained models on our whole dataset of cats and dogs:

In [None]:
resnet18 = models.resnet18(weights='DEFAULT').to(device).eval()
vgg16 = models.vgg16(weights='DEFAULT').to(device).eval()
densenet = models.densenet161(weights='DEFAULT').to(device).eval()
googlenet = models.googlenet(weights='DEFAULT').to(device).eval()

In [None]:
def test_pretrained_model(model):
    total_correct = 0
    with torch.no_grad():
        for data, target in val_loader:
            data, target = data.float().to(device), target.float().to(device)
            target = target.unsqueeze(-1)
            output = model(data)
            output = torch.softmax(output, dim=1)
            # Sum the class probabilities of each superclass (0: cat, 1: dog, 2: other)
            superclasses_preds = torch.stack([
                output[:, idx_superclasses['cat']].sum(dim=1),
                output[:, idx_superclasses['dog']].sum(dim=1),
                output[:, idx_superclasses['other']].sum(dim=1)
            ], dim=1)
            predicted_classes = superclasses_preds.argmax(dim=1, keepdim=True)
            correct_predictions = (predicted_classes == target).float()
            total_correct += correct_predictions.sum().item()

    acc = total_correct/len(val_dataset)
    return acc

accuracies_pretrained_models = {
    'resnet18': test_pretrained_model(resnet18),
    'vgg16': test_pretrained_model(vgg16),
    'densenet': test_pretrained_model(densenet),
    'googlenet': test_pretrained_model(googlenet)
}

accuracies_pretrained_models

In [None]:
import matplotlib.pyplot as plt

# Data for the bar plot
models_list = list(accuracies_pretrained_models.keys())
accuracies_list = list(accuracies_pretrained_models.values())

# Create the bar plot
plt.figure(figsize=(8, 5))
plt.bar(models_list, accuracies_list)

# Add labels and title
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.title('Accuracy of Pretrained Models on the Cats vs Dogs Dataset')
plt.ylim(0, 1)  # Set y-axis limits to range 0-1
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.show()

As shown in the previous plot, using pre-trained models improves accuracy to between 80% and 89%, a significant increase compared to the 65% achieved when training from scratch. Although none of these models were specifically trained on our data, this improvement is expected because they were originally trained on ImageNet, which includes cats, dogs, and their subclasses. However, in more common scenarios where the source and target classes do not overlap, we can only leverage the early layers of a pre-trained model as a feature extractor and then train a custom classifier for our specific classes. In the next section, we will explore this more common scenario.

### Feature extraction


In a normal scenario (where the target dataset contains classes not present in the source dataset), there are two ways to leverage a pre-trained network: feature extraction and fine-tuning. We will cover both of them. Let's start with feature extraction. Feature extraction consists of using the representations learned by a previous network to extract interesting features from new samples. These features are then run through a new classifier, which is trained from scratch.

As we saw previously, convnets used for image classification comprise two parts: they start with a series of pooling and convolution layers, and they end with a densely-connected classifier. The first part is called the "convolutional base" of the model. In the case of convnets, "feature extraction" will simply consist of taking the convolutional base of a previously-trained network, running the new data through it, and training a new classifier on top of the output.

![image](https://miro.medium.com/v2/resize:fit:1400/1*XbuW8WuRrAY5pC4t-9DZAQ.jpeg)

Why only reuse the convolutional base? Could we reuse the densely-connected classifier as well? In general, it should be avoided. The reason is simply that the representations learned by the convolutional base are likely to be more generic and therefore more reusable: the feature maps of a convnet are presence maps of generic concepts over a picture, which are likely to be useful regardless of the computer vision problem at hand. On the other hand, the representations learned by the classifier will necessarily be very specific to the set of classes that the model was trained on: they will only contain information about the presence probability of this or that class in the entire picture. Additionally, representations found in densely-connected layers no longer contain any information about where objects are located in the input image: these layers get rid of the notion of space, whereas the object location is still described by convolutional feature maps. For problems where object location matters, densely-connected features would be largely useless.

Note that the level of generality (and therefore reusability) of the representations extracted by specific convolution layers depends on the depth of the layer in the model. Layers that come earlier in the model extract local, highly generic feature maps (such as visual edges, colors, and textures), while layers higher-up extract more abstract concepts (such as "cat ear" or "dog eye"). So if your new dataset differs a lot from the dataset that the original model was trained on, you may be better off using only the first few layers of the model to do feature extraction, rather than using the entire convolutional base.

In our case, since the ImageNet class set did contain multiple dog and cat classes, it is likely that it would be beneficial to reuse the information contained in the densely-connected layers of the original model. However, we will choose not to, in order to cover the more general case where the class set of the new problem does not overlap with the class set of the original model.

Let's put this in practice by using the convolutional base of the ResNet18 network, trained on ImageNet, to extract interesting features from our cat and dog images, and then training a cat vs. dog classifier on top of these features.



In [None]:
pretrained_model = models.resnet18(weights='DEFAULT').to(device).eval()  # `pretrained=True` is deprecated in recent torchvision
pretrained_model

### Exercise 5

The feature extractor of ResNet18 consists of all the child layers of the model except the last one.

Hint: Search how to obtain the feature extractor of a ResNet in PyTorch

In [None]:
#TODO: Remove the final fully connected layer
feature_extractor = ...



The final feature map has shape (512, 1, 1). That's the feature on top of which we will stick a densely-connected classifier.

The method we will use consists of running the convolutional base over our dataset, recording its output in NumPy arrays, and then using this data as input to a standalone densely-connected classifier. This solution is very fast and cheap to run, because it only requires running the convolutional base once for every input image, and the convolutional base is by far the most expensive part of the pipeline. However, for the exact same reason, this technique does not allow us to leverage data augmentation at all.
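
The cost argument can be illustrated with a toy sketch that simply counts how often the expensive part runs. Here `fake_base` is a hypothetical stand-in for the frozen convolutional base, and the "images" and "features" are made-up numbers.

```python
calls = {"base": 0}

def fake_base(image):
    """Hypothetical stand-in for the expensive, frozen convolutional base."""
    calls["base"] += 1
    return [sum(image), max(image)]  # a made-up 2-number "feature vector"

toy_images = [[0.1, 0.2], [0.3, 0.9], [0.5, 0.4]]
n_epochs = 10

# Feature extraction: run the base once per image and cache the result
cached_features = [fake_base(img) for img in toy_images]
for _ in range(n_epochs):
    for feats in cached_features:
        pass  # the small classifier would be trained on `feats` here
calls_with_cache = calls["base"]

# End-to-end training: the base runs for every image in every epoch
calls["base"] = 0
for _ in range(n_epochs):
    for img in toy_images:
        fake_base(img)
calls_without_cache = calls["base"]

print(calls_with_cache, calls_without_cache)
```

The cached features are computed from one fixed version of each image, which is why random augmentations cannot be applied with this approach.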


We will start by loading the images with the previously introduced `ImageFolder` dataset and running them through the feature extractor.


In [None]:
base_dir = '/content/processed_datalab'

train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')


def extract_features(directory):
    # Resize, crop and normalize with the ImageNet statistics
    transform = transforms.Compose([
        transforms.Resize(150),
        transforms.CenterCrop(150),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    dataset = ImageFolder(directory, transform=transform)
    batch_size = 100
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)  # keep the order so the slices line up
    # float32 matches the default dtype of PyTorch layers
    features = np.zeros(shape=(len(dataset), 512, 1, 1), dtype=np.float32)
    labels = np.zeros(shape=(len(dataset),), dtype=np.float32)
    feature_extractor.eval()  # make sure BatchNorm layers use their running statistics
    with torch.no_grad():
        for i, (inputs_batch, labels_batch) in enumerate(loader):
            inputs_batch = inputs_batch.to(device)
            features_batch = feature_extractor(inputs_batch)
            features[i * batch_size : (i + 1) * batch_size] = features_batch.cpu().numpy()
            labels[i * batch_size : (i + 1) * batch_size] = labels_batch.numpy()

    return features, labels

train_features, train_labels = extract_features(train_dir)
validation_features, validation_labels = extract_features(validation_dir)


The extracted features are currently of shape (samples, 512, 1, 1). We will feed them to a densely-connected classifier, so first we must flatten them:

In [None]:
train_features = np.reshape(train_features, (-1, 512))
validation_features = np.reshape(validation_features, (-1, 512))

### Exercise 6
At this point, we can define a neural network with a single hidden layer, trained independently on the visual features extracted from the ResNet18 model.

Implement this simple neural network with 256 hidden neurons and a ReLU activation, and a dropout probability of 0.5 for both sets of parameters (input > hidden, hidden > output).

In [None]:
# TODO
feature_classifier = ...

feature_classifier.to(device)

In [None]:
optimizer = optim.Adam(feature_classifier.parameters(), lr=0.001)
loss_fn = nn.BCELoss()
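
As a reminder of what this loss computes: `nn.BCELoss` expects probabilities in [0, 1] (for example, the output of a sigmoid) and averages the binary cross-entropy over the batch. The plain-Python sketch below uses made-up predictions and labels; `binary_cross_entropy` is a hypothetical helper written for illustration, not the PyTorch implementation.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p)) for p, y in zip(probs, targets)]
    return sum(losses) / len(losses)

toy_targets = [1.0, 0.0, 1.0, 0.0]        # made-up binary labels
confident_probs = [0.9, 0.1, 0.8, 0.2]    # mostly correct predictions
uncertain_probs = [0.5, 0.5, 0.5, 0.5]    # a classifier that always outputs 0.5

loss_confident = binary_cross_entropy(confident_probs, toy_targets)
loss_uncertain = binary_cross_entropy(uncertain_probs, toy_targets)  # equals log(2)

print(loss_confident)
print(loss_uncertain)
```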

In [None]:
from torch.utils.data import TensorDataset

batch_size = 64
train_features_dataset = TensorDataset(torch.tensor(train_features), torch.tensor(train_labels))
train_features_loader = DataLoader(train_features_dataset, batch_size=batch_size, shuffle=True)

val_features_dataset = TensorDataset(torch.tensor(validation_features), torch.tensor(validation_labels))
val_features_loader = DataLoader(val_features_dataset, batch_size=batch_size, shuffle=False)


In [None]:
epochs = 40
train_accuracies, train_losses, val_accuracies, val_losses = train_model(feature_classifier, optimizer, loss_fn, train_features_loader, val_features_loader, epochs)

Training is very fast, since we only have to deal with two fully-connected layers: an epoch takes less than one second even on CPU.

Let's take a look at the loss and accuracy curves during training:


In [None]:
epochs = range(len(train_accuracies))

plt.plot(epochs, train_accuracies, 'b', label='Training acc')
plt.plot(epochs, val_accuracies, 'r', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, train_losses, 'b', label='Training loss')
plt.plot(epochs, val_losses, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

We have achieved a validation accuracy of approximately 97%, which is significantly better than what we obtained in the previous section with our small model trained from scratch. This demonstrates how we transferred the learning from the ImageNet dataset to this new domain, taking advantage of a network that has seen and learnt from a wider variety of classes and data. Furthermore, we improved accuracy by up to 17 percentage points compared to grouping the cat and dog subclasses from ImageNet (80% to 89%). This indicates the importance of specialising the final classifier to the target task.

# 3 Fine-tuning

Another widely used technique for model reuse, complementary to feature extraction, is fine-tuning. Fine-tuning consists in unfreezing a few of the top layers of a frozen model base used for feature extraction, and jointly training both the newly added part of the model (in our case, the fully-connected classifier) and these top layers. This is called "fine-tuning" because it slightly adjusts the more abstract representations of the model being reused, in order to make them more relevant for the problem at hand.

![finetuning](https://miro.medium.com/v2/resize:fit:720/format:webp/1*AUI4rH8_tbb7x4xkBsHu2Q.png)

When training a randomly initialized classifier on top of a pretrained convolutional base (ResNet18 in our case), the base must be kept frozen. For the same reason, it is only possible to fine-tune the top layers of the convolutional base once the classifier on top has already been trained. If the classifier wasn't already trained, the error signal propagating through the network during training would be too large, and the representations previously learned by the layers being fine-tuned would be destroyed. Thus the steps for fine-tuning a network are as follows:


1.   Add your custom network on top of an already trained base network.
2.   Freeze the base network.
3.   Train the part you added.
4.   Unfreeze some layers in the base network.
5.   Jointly train both these layers and the part you added.

We have already completed the first 3 steps when doing feature extraction. Let's proceed with the 4th step.

As a reminder, this is what our convolutional base looks like:


In [None]:
feature_extractor

In [None]:
for layer in feature_extractor[:6]:  # Freeze the first 6 blocks
    for param in layer.parameters():
        param.requires_grad = False

for layer in feature_extractor[6:]:  # Train the remaining (last) blocks
    for param in layer.parameters():
        param.requires_grad = True
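
A quick way to check the effect of freezing is to count trainable and frozen parameters. The sketch below does this on a small, hypothetical stack of convolutions (`toy_base` is made up for illustration and is not the ResNet18 base), using the same `requires_grad` pattern as the cell above.

```python
import torch.nn as nn

# Hypothetical stand-in for a convolutional base: four small conv layers
toy_base = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3),
    nn.Conv2d(4, 4, kernel_size=3),
    nn.Conv2d(4, 8, kernel_size=3),
    nn.Conv2d(8, 8, kernel_size=3),
)

# Freeze the first two layers and leave the last two trainable
for layer in toy_base[:2]:
    for param in layer.parameters():
        param.requires_grad = False

n_trainable = sum(p.numel() for p in toy_base.parameters() if p.requires_grad)
n_frozen = sum(p.numel() for p in toy_base.parameters() if not p.requires_grad)
print(f"Trainable: {n_trainable}, frozen: {n_frozen}")
```

The same two `sum(...)` lines can be applied to `feature_extractor` to see how many parameters will actually be updated during fine-tuning.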

We will fine-tune the last blocks of convolutional layers.

Why not fine-tune more layers? Why not fine-tune the entire convolutional base? We could. However, we need to consider that:

1.   Earlier layers in the convolutional base encode more generic, reusable features, while layers higher up encode more specialized features. It is more useful to fine-tune the more specialized features, as these are the ones that need to be repurposed on our new problem. There would be fast-decreasing returns in fine-tuning lower layers.
2.   The more parameters we are training, the more we are at risk of overfitting. The ResNet18 convolutional base has about 11M parameters, so it would be risky to attempt to train all of it on our small dataset.

Thus, in our situation, it is a good strategy to only fine-tune the last few blocks of the convolutional base.

Let's set this up, starting from where we left off in the previous example:


### Exercise 7

Create a model with the `feature_classifier` on top of the `feature_extractor`. Note that you will need to flatten the features.

In [None]:
# TODO
model = ...

model.to(device)

Now we can start fine-tuning our network. We will do this with the Adam optimizer, using a lower learning rate than before (1e-4 instead of 1e-3). The reason for using a low learning rate is that we want to limit the magnitude of the modifications we make to the representations of the layers that we are fine-tuning. Updates that are too large may harm these representations.

In this case, we can do data augmentation, with the same configuration we used in our previous example:

In [None]:
# Training: random augmentations followed by ImageNet normalization
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.RandomResizedCrop(150),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Validation: deterministic resize and crop, same normalization
val_transform = transforms.Compose([
    transforms.Resize(150),
    transforms.CenterCrop(150),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

batch_size = 64
augmented_dataset = ImageFolder(train_dir, transform=train_transform)
augmented_loader = DataLoader(augmented_dataset, batch_size=batch_size, shuffle=True)
val_dataset = ImageFolder(validation_dir, transform=val_transform)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

In [None]:
optimizer = optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCELoss()
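
Passing `model.parameters()` works because frozen parameters never receive a gradient, so the optimizer leaves them untouched. An optional alternative is to hand the optimizer only the parameters that still require gradients. The sketch below shows this on a tiny, hypothetical model with random, made-up data (`toy_model` is not the notebook's model) and checks that one training step does not modify the frozen layer.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical two-layer model: a frozen "base" layer and a trainable "head"
toy_model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1), nn.Sigmoid())
for param in toy_model[0].parameters():
    param.requires_grad = False

# Give the optimizer only the parameters that still require gradients
trainable_params = [p for p in toy_model.parameters() if p.requires_grad]
toy_optimizer = optim.Adam(trainable_params, lr=1e-4)

frozen_before = toy_model[0].weight.clone()

# One training step on random, made-up data
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8, 1)).float()
loss = nn.BCELoss()(toy_model(x), y)
toy_optimizer.zero_grad()
loss.backward()
toy_optimizer.step()

frozen_unchanged = torch.equal(frozen_before, toy_model[0].weight)
print("Trainable tensors:", len(trainable_params))
print("Frozen layer unchanged:", frozen_unchanged)
```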

In [None]:
epochs = 60
train_accuracies, train_losses, val_accuracies, val_losses = train_model(model, optimizer, loss_fn, augmented_loader, val_loader, epochs)

In [None]:
epochs = range(len(train_accuracies))

plt.plot(epochs, train_accuracies, 'b', label='Training acc')
plt.plot(epochs, val_accuracies, 'r', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, train_losses, 'b', label='Training loss')
plt.plot(epochs, val_losses, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

We have achieved a validation accuracy of approximately 99%, which is an increase of 2 percentage points compared to training only the model’s classifier. This might suggest that fine-tuning the top layers, together with the data augmentation used in this training, helps the model to generalise even further.

# Take-aways: using CNN with small datasets

Here's what you should take away from these exercises:


*   CNNs are among the most effective machine learning models for computer vision tasks. It is possible to train one from scratch even on a very small dataset, with decent results.
*   On a small dataset, overfitting will be the main issue. Data augmentation is a powerful way to fight overfitting when working with image data.
*   It is easy to reuse an existing CNN on a new dataset, via feature extraction. This is a very valuable technique for working with small image datasets.
*   As a complement to feature extraction, one may use fine-tuning, which adapts to a new problem some of the representations previously learned by an existing model. This pushes performance a bit further.



Now you have a solid set of tools for dealing with image classification problems, in particular with small datasets.