<a href="https://colab.research.google.com/github/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/04_transfer_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Transfer Learning

So, what is transfer learning?

The idea is quite simple. First, some big tech company, which has access to virtually
infinite amounts of data and computing power, develops and trains a huge model
for their own purpose. 

Next, once it is trained, its architecture and the corresponding trained weights (the pre-trained model) are released. Finally,
everyone else can use these weights as a starting point and fine-tune them
further for a different (but similar) purpose.

That’s transfer learning in a nutshell.

## Setup

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

In [2]:
try:
    import google.colab
    import requests
    url = 'https://raw.githubusercontent.com/dvgodoy/PyTorchStepByStep/master/config.py'
    r = requests.get(url, allow_redirects=True)
    # Writes the downloaded file and closes it properly
    with open('config.py', 'wb') as f:
        f.write(r.content)
except ModuleNotFoundError:
    pass

from config import *
config_chapter7()
# This is needed to render the plots in this chapter
from plots.chapter7 import *

Downloading files from GitHub repo to Colab...
Finished!


In [3]:
import numpy as np
from PIL import Image

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import DataLoader, Dataset, random_split, TensorDataset
from torchvision.transforms import Compose, ToTensor, Normalize, Resize, ToPILImage, CenterCrop, RandomResizedCrop
from torchvision.datasets import ImageFolder
from torchvision.models import alexnet, resnet18, inception_v3
from torchvision.models.alexnet import model_urls
try:
  from torchvision.models.utils import load_state_dict_from_url
except ImportError:
  from torch.hub import load_state_dict_from_url

from stepbystep.v3 import StepByStep
from data_generation.rps import download_rps

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [5]:
import os
# KAGGLE_CONFIG_DIR must point to the Google Drive folder that contains kaggle.json
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/kaggle-keys"

In [6]:
%%shell

# Download the dataset from Kaggle. URL: https://www.kaggle.com/datasets/sanikamal/rock-paper-scissors-dataset
kaggle datasets download -d sanikamal/rock-paper-scissors-dataset

unzip -qq rock-paper-scissors-dataset.zip
rm -rf rock-paper-scissors-dataset.zip

Downloading rock-paper-scissors-dataset.zip to /content
 98% 443M/452M [00:04<00:00, 138MB/s]
100% 452M/452M [00:04<00:00, 98.7MB/s]




## Data Preparation

The data preparation step will be a bit more demanding this time since we’ll be
standardizing the images. Besides, we can use the `ImageFolder` dataset now.

The Rock Paper Scissors dataset is organized like this:

```
rps/paper/paper01-000.png
rps/paper/paper01-001.png

rps/rock/rock01-000.png
rps/rock/rock01-001.png

rps/scissors/scissors01-000.png
rps/scissors/scissors01-001.png
```

The dataset is also perfectly balanced, with each sub-folder containing 840 images
of its particular class.

In [7]:
ROOT_FOLDER = "Rock-Paper-Scissors"

Since we’re using a pre-trained model, we need to use the standardization
parameters used to train the original model. 

In other words, we need to use the
statistics of the original dataset used to train that model.

So, the data preparation step for the Rock Paper Scissors dataset looks like this now:

In [8]:
normalizer = Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
composer = Compose([
  Resize(256),
  CenterCrop(224),
  ToTensor(),
  normalizer
])

train_data = ImageFolder(root=f"{ROOT_FOLDER}/train", transform=composer)
val_data = ImageFolder(root=f"{ROOT_FOLDER}/test", transform=composer)

# Builds a loader of each set
train_loader = DataLoader(train_data, batch_size=16, shuffle=True)
val_loader = DataLoader(val_data, batch_size=16)
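
To make the standardization step concrete, the sketch below applies the same per-channel formula, `(x - mean) / std`, by hand to a made-up image tensor. The tensor `fake_image` is a hypothetical stand-in for a real image after `ToTensor()`; the mean and standard deviation values are the ones passed to `Normalize` above.

```python
import torch

# Per-channel statistics used in the Normalize transform above,
# reshaped so they broadcast over height and width
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

# Hypothetical stand-in for an image already converted by ToTensor():
# shape (channels, height, width), values in [0, 1)
torch.manual_seed(0)
fake_image = torch.rand(3, 224, 224)

# Standardization: subtract the channel mean, divide by the channel std
standardized = (fake_image - mean) / std

# Undoing it recovers the original pixel values (handy for plotting)
recovered = standardized * std + mean

print(standardized.shape)
print(torch.allclose(recovered, fake_image, atol=1e-6))
```

Reversing the operation, as in the last lines, is a common way to display standardized images with their original colors.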

## Pre-Trained Model

Let's start by creating an instance of AlexNet without loading its pre-trained
weights.

In [9]:
alex = alexnet(weights=None)
print(alex)



AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
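    (1): Linear(in_features=9216, out_features=4096, bias=True)
    (2): ReLU(inplace=True)
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=4096, out_features=4096, bias=True)
    (5): ReLU(inplace=True)
    (6): Linear(in_features=4096, out_features=1000, bias=True)
  )
)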

### Adaptive Pooling

`AdaptiveAvgPool2d` is a special kind of pooling: instead of requiring the kernel size
(and stride), it requires the desired output size.

In other words, whatever the size of the image it gets as input,
it will return a tensor with the desired size.

It gives you the freedom to use images of different sizes as inputs.

Let’s verify it.

In [10]:
result1 = F.adaptive_avg_pool2d(torch.randn(16, 32, 32), output_size=(6, 6))
result2 = F.adaptive_avg_pool2d(torch.randn(16, 12, 12), output_size=(6, 6))

result1.shape, result2.shape

(torch.Size([16, 6, 6]), torch.Size([16, 6, 6]))
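
The same idea explains the `in_features=9216` of AlexNet's first linear layer in the printed architecture: the last convolutional layer outputs 256 channels, and adaptive pooling always reduces each of them to 6x6, so flattening yields 256 * 6 * 6 = 9216 features. A minimal sketch, using random tensors as hypothetical stand-ins for feature maps of two different sizes:

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d(output_size=(6, 6))

# Hypothetical feature maps: batch of 2, 256 channels, different spatial sizes
small_maps = torch.randn(2, 256, 6, 6)
large_maps = torch.randn(2, 256, 13, 13)

# Both come out with a 6x6 spatial size, whatever the input height and width
pooled_small = pool(small_maps)
pooled_large = pool(large_maps)

# Flattening everything but the batch dimension gives 256 * 6 * 6 features
flat = torch.flatten(pooled_large, start_dim=1)

print(pooled_small.shape, pooled_large.shape, flat.shape)
```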

### Loading Weights

Next, let’s download the pre-trained weights from a given URL. Loading weights
this way gives you the flexibility to use pre-trained weights from
wherever you want!

In [11]:
URL = model_urls["alexnet"]
URL



'https://download.pytorch.org/models/alexnet-owt-7be5be79.pth'

In [12]:
state_dict = load_state_dict_from_url(URL, model_dir="pretrained", progress=True)

Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to pretrained/alexnet-owt-7be5be79.pth


  0%|          | 0.00/233M [00:00<?, ?B/s]

In [13]:
# Loads the downloaded weights into the model
alex.load_state_dict(state_dict)

<All keys matched successfully>
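
Loading a state dictionary only works if its keys and tensor shapes match those of the model. The sketch below illustrates this with a tiny hypothetical model (`make_tiny`, not part of the notebook) instead of AlexNet, so it runs without downloading anything:

```python
import torch
import torch.nn as nn

def make_tiny():
    # Hypothetical two-layer model standing in for AlexNet
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

torch.manual_seed(0)
source = make_tiny()  # plays the role of the pre-trained model
target = make_tiny()  # freshly initialized, with different random weights

# A state dict maps parameter names to tensors
state = source.state_dict()
print(list(state.keys()))

# Keys and shapes match, so every tensor is copied into the target model
result = target.load_state_dict(state)
print(result)

# After loading, both models hold identical parameters
same = all(
    torch.equal(p, q)
    for p, q in zip(source.parameters(), target.parameters())
)
print(same)
```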

### Model Freezing

Freezing the model means it won’t learn anymore; that is, its
parameters / weights will not be updated anymore.

What best characterizes a tensor representing a learnable parameter? It requires
gradients. 

So, if we’d like to make them stop learning anything, we need to change
exactly that:

In [14]:
def freeze_model(model):
  for parameter in model.parameters():
    parameter.requires_grad = False

freeze_model(alex)
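
One way to confirm the effect of freezing is to count the trainable parameters before and after. A minimal sketch on a small hypothetical model (`count_trainable` and `tiny` are illustrative names, not part of the notebook):

```python
import torch.nn as nn

def freeze_model(model):
    # Same function as above: turns off gradients for every parameter
    for parameter in model.parameters():
        parameter.requires_grad = False

def count_trainable(model):
    # Total number of elements across parameters that still require gradients
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

tiny = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

before = count_trainable(tiny)  # (4 * 8 + 8) + (8 * 3 + 3) = 67
freeze_model(tiny)
after = count_trainable(tiny)   # 0: nothing left to train

print(before, after)
```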

If the model is frozen, how am I supposed to train it for my own
purpose?

We have to unfreeze a small part of the model or, better yet,
replace a small part of the model.

### Top of the Model

The "top" of the model is loosely defined as the last layer(s) of the model, usually
belonging to its classifier part. 

The featurizer part is usually left untouched since
we’re trying to leverage the model’s ability to generate features for us.

In [15]:
print(alex.features)

Sequential(
  (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (1): ReLU(inplace=True)
  (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (4): ReLU(inplace=True)
  (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (7): ReLU(inplace=True)
  (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (9): ReLU(inplace=True)
  (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): ReLU(inplace=True)
  (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
)


In [16]:
print(alex.classifier)

Sequential(
  (0): Dropout(p=0.5, inplace=False)
  (1): Linear(in_features=9216, out_features=4096, bias=True)
  (2): ReLU(inplace=True)
  (3): Dropout(p=0.5, inplace=False)
  (4): Linear(in_features=4096, out_features=4096, bias=True)
  (5): ReLU(inplace=True)
  (6): Linear(in_features=4096, out_features=1000, bias=True)
)


In our Rock Paper Scissors dataset, we have three classes. 

So, we need to replace the
output layer accordingly:

In [17]:
alex.classifier[6] = nn.Linear(in_features=4096, out_features=3)

In [18]:
print(alex.classifier)

Sequential(
  (0): Dropout(p=0.5, inplace=False)
  (1): Linear(in_features=9216, out_features=4096, bias=True)
  (2): ReLU(inplace=True)
  (3): Dropout(p=0.5, inplace=False)
  (4): Linear(in_features=4096, out_features=4096, bias=True)
  (5): ReLU(inplace=True)
  (6): Linear(in_features=4096, out_features=3, bias=True)
)


Notice that the number of input features remains the same, since it still takes the
output from the hidden layer that precedes it. 

The new output layer requires
gradients by default, but we can double-check it:

In [19]:
for name, param in alex.named_parameters():
  if param.requires_grad:
    print(name)

classifier.6.weight
classifier.6.bias


## Model Training

The configuration part is short and straightforward: we need the `alex` model, a loss
function, and an optimizer.

In [20]:
torch.manual_seed(17)

multi_loss_fn = nn.CrossEntropyLoss(reduction="mean")
optimizer_alex = optim.Adam(alex.parameters(), lr=3e-4)
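
Since frozen parameters never receive gradients, the optimizer has nothing to update for them. A more explicit option is to hand the optimizer only the parameters that require gradients, as sketched below on a small hypothetical model (`tiny`, random data, a single training step; none of it is the notebook's actual code):

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(17)

# Hypothetical stand-in: frozen layers followed by a new trainable "top"
tiny = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
for parameter in tiny.parameters():
    parameter.requires_grad = False
tiny[2] = nn.Linear(8, 3)  # new layers require gradients by default

frozen_before = tiny[0].weight.clone()
top_before = tiny[2].weight.clone()

# Keep only the parameters that are actually trainable
trainable = [p for p in tiny.parameters() if p.requires_grad]
optimizer = optim.Adam(trainable, lr=3e-4)
loss_fn = nn.CrossEntropyLoss(reduction="mean")

# One training step on random data
x, y = torch.randn(16, 4), torch.randint(0, 3, (16,))
loss = loss_fn(tiny(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Compare the weights of the frozen layer and of the top layer to their copies
print(torch.equal(tiny[0].weight, frozen_before))
print(torch.equal(tiny[2].weight, top_before))
```

The first comparison shows whether the frozen layer kept its weights; the second shows whether the top layer changed after the step.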

We have everything set to train the "top" layer of our modified version of AlexNet.

In [23]:
sbs_alex = StepByStep(alex, multi_loss_fn, optimizer_alex)
sbs_alex.set_loaders(train_loader, val_loader)
sbs_alex.train(1)

Let’s see how effective transfer learning is by evaluating our model after
having trained it over one epoch only.

In [24]:
StepByStep.loader_apply(val_loader, sbs_alex.correct)

tensor([[112, 124],
        [124, 124],
        [124, 124]])
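
Assuming each row of this tensor holds the number of correct predictions and the number of images for one class (an interpretation of the output, not something stated above), the accuracy can be computed with a few lines:

```python
import torch

# Values copied from the output above; each row is assumed to be
# (correct predictions, total images) for one class
results = torch.tensor([[112, 124],
                        [124, 124],
                        [124, 124]])

correct_per_class = results[:, 0]
total_per_class = results[:, 1]

# Dividing integer tensors with "/" performs true division
per_class_accuracy = correct_per_class / total_per_class
overall_accuracy = correct_per_class.sum() / total_per_class.sum()

print(per_class_accuracy)
print(f"{overall_accuracy.item():.4f}")
```

Under that assumption, 360 out of 372 validation images are classified correctly after a single epoch.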

## Generating Features

Well, since the frozen layers are simply generating features that will be the input
of the trainable layers, why not treat them as a separate feature extractor?

We could do it in
four easy steps:

* Keep only the frozen layers in the model.
* Run the whole dataset through it and collect its outputs as a dataset of
features.
* Train a separate model (that corresponds to the "top" of the original model)
using the dataset of features.
* Attach the trained model to the top of the frozen layers.

This way, we’re effectively splitting the feature extraction and actual training
phases, thus avoiding the overhead of generating features over and over again for
every single forward pass.
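
The four steps can be sketched end to end on a small hypothetical model and random data. Everything here (`featurizer`, `top`, the tensor sizes, the plain training loop) is an illustrative assumption, not the notebook's actual code:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(17)

# Hypothetical data: 64 samples, 10 inputs, 3 classes
x = torch.randn(64, 10)
y = torch.randint(0, 3, (64,))
loader = DataLoader(TensorDataset(x, y), batch_size=16)

# Step 1: keep only the frozen layers
featurizer = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
for parameter in featurizer.parameters():
    parameter.requires_grad = False
featurizer.eval()

# Step 2: run the whole dataset through them once and collect the outputs
features, labels = [], []
with torch.no_grad():
    for x_batch, y_batch in loader:
        features.append(featurizer(x_batch))
        labels.append(y_batch)
feature_data = TensorDataset(torch.cat(features), torch.cat(labels))
feature_loader = DataLoader(feature_data, batch_size=16, shuffle=True)

# Step 3: train a separate "top" model on the dataset of features
top = nn.Linear(32, 3)
optimizer = optim.Adam(top.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()
for f_batch, y_batch in feature_loader:
    loss = loss_fn(top(f_batch), y_batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Step 4: attach the trained top to the frozen layers
full_model = nn.Sequential(featurizer, top)
print(full_model(x).shape)
```

The frozen layers run only once over the data in step 2; the loop in step 3 touches nothing but the small top model.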

To keep only the frozen layers, we need to get rid of the "top" of the original model.


But, since we also want to attach our new layer to the whole model after training,
it is a better idea to simply replace the "top" layer with an identity layer instead of
removing it entirely:

In [25]:
alex.classifier[6] = nn.Identity()
print(alex.classifier)

Sequential(
  (0): Dropout(p=0.5, inplace=False)
  (1): Linear(in_features=9216, out_features=4096, bias=True)
  (2): ReLU(inplace=True)
  (3): Dropout(p=0.5, inplace=False)
  (4): Linear(in_features=4096, out_features=4096, bias=True)
  (5): ReLU(inplace=True)
  (6): Identity()
)


This way, the last effective layer is still `classifier.5`, which will produce the
features we’re interested in. We have a feature extractor in our hands now! 
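
`nn.Identity` simply returns its input, so replacing the output layer with it makes the model return whatever the previous layer produced. A minimal sketch on a hypothetical three-layer model (`tiny` is an illustrative name):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tiny = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
x = torch.randn(5, 4)

# Output of the layers that come before the "top"
hidden = tiny[1](tiny[0](x))

# Swapping the top for an identity layer keeps the layer indices intact
tiny[2] = nn.Identity()
features = tiny(x)

print(features.shape)
print(torch.equal(features, hidden))
```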

Let’s use it to pre-process our dataset.