# Transfer Learning

source: https://cs231n.github.io/transfer-learning/

Typically CNNs are not trained from scratch (with random initialization) because it is relatively rare to have a dataset of sufficient size. Instead, it is common to pretrain a CNN on a very large dataset and then use the CNN either as an initialization or a fixed feature extractor for the task of interest.

The pretrained model in the example is **ResNet-50**, a convolutional neural network that is 50 layers deep. The pretrained version of the network is trained on more than a million images from the **ImageNet database** and can classify images into **1000 object categories**. As a result, the network has learned rich feature representations for a wide range of images. The network has an image **input size of 224x224**.

There are three primary types of transfer learning from a pre-trained CNN model:

1. Pretrained Model
1. Feature Extraction
1. Fine Tuning

## Pre-Trained Model

In [8]:
from torchvision import models
from torch import nn
from torchinfo import summary

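# load the ImageNet-pretrained ResNet50 model
# (torchvision >= 0.13: `weights=` replaces the deprecated `pretrained=True`)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)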

summary(model, (1, 3, 224, 224), row_settings=('depth', 'var_names'), depth=2)

Layer (type (var_name):depth-idx)                  Output Shape              Param #
ResNet                                             --                        --
├─Conv2d (conv1): 1-1                              [1, 64, 112, 112]         9,408
├─BatchNorm2d (bn1): 1-2                           [1, 64, 112, 112]         128
├─ReLU (relu): 1-3                                 [1, 64, 112, 112]         --
├─MaxPool2d (maxpool): 1-4                         [1, 64, 56, 56]           --
├─Sequential (layer1): 1-5                         [1, 256, 56, 56]          --
│    └─Bottleneck (0): 2-1                         [1, 256, 56, 56]          75,008
│    └─Bottleneck (1): 2-2                         [1, 256, 56, 56]          70,400
│    └─Bottleneck (2): 2-3                         [1, 256, 56, 56]          70,400
├─Sequential (layer2): 1-6                         [1, 512, 28, 28]          --
│    └─Bottleneck (0): 2-4                         [1, 512, 28, 28]          379,392
│    └─Bottlen

## Feature Extraction

Here we remove the last fully-connected layer `(fc)`, then treat the rest of the CNN as a **fixed feature extractor** for the new dataset. For ResNet-50, this computes a 2048-D vector for every image that contains the activations of the hidden layer immediately before the classifier. We call these features **CNN codes**. Once you extract the 2048-D codes for all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) for the new dataset.

In this setting, transfer learning means **retraining only the final layer** of a deep network. This is useful not only for problems with **limited training examples**, but also when you lack the **computing resources** to train a network from scratch.

However, if you have sufficient data, retraining only the final layer is usually not the best choice, because the features learned during the original training are unlikely to be ideal for another application.

Feature extraction in the context of a **CNN** is not necessarily an explicit process; it is rather a high-level product of training. The term refers to the portion of the training process by which a CNN learns to map the input space to a latent space that can subsequently be used for classification via the final layer.

In other words, the hidden layers learn discriminative features in the form of weight-adjusted convolutional filters. Thus the term "feature extraction" generally refers to the portion of the training process that occurs before the final layer. That portion is not repeated in transfer learning where only the last layer is trained: the pretrained filters are reused as they are.

### Create ResNet Model for Feature Extraction

In [9]:
from torchvision import models
from torch import nn
from torchinfo import summary

# load the ImageNet-pretrained ResNet50 model to use as a feature extractor
# (torchvision >= 0.13: `weights=` replaces the deprecated `pretrained=True`)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# freeze all parameters so they are not trainable (by default they are trainable)
for param in model.parameters():
    param.requires_grad = False

# replace the classification head; the new layer's parameters are trainable by default
num_features = model.fc.in_features
num_classes = 5
model.fc = nn.Linear(num_features, num_classes)

summary(model, (1, 3, 224, 224), row_settings=('depth', 'var_names'), depth=2)

Layer (type (var_name):depth-idx)                  Output Shape              Param #
ResNet                                             --                        --
├─Conv2d (conv1): 1-1                              [1, 64, 112, 112]         (9,408)
├─BatchNorm2d (bn1): 1-2                           [1, 64, 112, 112]         (128)
├─ReLU (relu): 1-3                                 [1, 64, 112, 112]         --
├─MaxPool2d (maxpool): 1-4                         [1, 64, 56, 56]           --
├─Sequential (layer1): 1-5                         [1, 256, 56, 56]          --
│    └─Bottleneck (0): 2-1                         [1, 256, 56, 56]          (75,008)
│    └─Bottleneck (1): 2-2                         [1, 256, 56, 56]          (70,400)
│    └─Bottleneck (2): 2-3                         [1, 256, 56, 56]          (70,400)
├─Sequential (layer2): 1-6                         [1, 512, 28, 28]          --
│    └─Bottleneck (0): 2-4                         [1, 512, 28, 28]          (379,392)
│ 

## Fine Tuning

Here we not only replace and retrain the classifier on top of the CNN on the new dataset, but also **fine-tune the weights** of the pretrained network by continuing the backpropagation.

It is possible to fine-tune all the layers of the CNN, or to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the **earlier layers** of a CNN contain more **generic features** (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but **later layers** of the CNN become progressively more specific to the **details of** the classes contained in the **original dataset**.

### Create a Fine-Tuned ResNet Model

In [10]:
from torchvision import models
from torch import nn
from torchinfo import summary

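# load the ImageNet-pretrained ResNet50 model for fine tuning
# (torchvision >= 0.13: `weights=` replaces the deprecated `pretrained=True`)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

num_features = model.fc.in_features

# loop over the modules of the model and set the parameters of batch normalization modules as not trainable
# (iterate each module's own parameters; zipping modules with parameters pairs unrelated items)
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        for param in module.parameters():
            param.requires_grad = False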

# define the network head and attach it to the model
num_classes = 5
model.fc = nn.Sequential(
    nn.Linear(num_features, 512),
    nn.ReLU(),
    nn.Dropout(0.25),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(256, num_classes)
)

summary(model, (1, 3, 224, 224), row_settings=('depth', 'var_names'), depth=2)

Layer (type (var_name):depth-idx)                  Output Shape              Param #
ResNet                                             --                        --
├─Conv2d (conv1): 1-1                              [1, 64, 112, 112]         9,408
├─BatchNorm2d (bn1): 1-2                           [1, 64, 112, 112]         128
├─ReLU (relu): 1-3                                 [1, 64, 112, 112]         --
├─MaxPool2d (maxpool): 1-4                         [1, 64, 56, 56]           --
├─Sequential (layer1): 1-5                         [1, 256, 56, 56]          --
│    └─Bottleneck (0): 2-1                         [1, 256, 56, 56]          75,008
│    └─Bottleneck (1): 2-2                         [1, 256, 56, 56]          70,400
│    └─Bottleneck (2): 2-3                         [1, 256, 56, 56]          70,400
├─Sequential (layer2): 1-6                         [1, 512, 28, 28]          --
│    └─Bottleneck (0): 2-4                         [1, 512, 28, 28]          379,392
│    └─Bottlen