# Lab 6A: CNN Architectures and Transfer Learning

The learning objectives for this lab exercise are as follows:
1. Customize a standard CNN for a target task
2. Train a model in different ways, with and without transfer learning:
    1. Train from scratch
    2. Finetune the whole model
    3. Finetune the upper layers of the model
    4. Use the model as a fixed feature extractor

In practice, it is common to build a model on a **standard CNN architecture** such as ResNet, MNASNet, ResNeXt, or EfficientNet. The effectiveness of these network architectures has been well attested for a wide range of applications. The [`torchvision.models`](https://pytorch.org/vision/stable/models.html) subpackage provides these network models together with weights pre-trained on ImageNet. Rather than training from scratch, it is advisable to use **transfer learning**, i.e., to train on top of a standard model that has been **pretrained** on the ImageNet dataset. Transfer learning reduces overfitting and improves the generalization performance of the trained model, especially when the training set for the target task is small. We perform transfer learning in two ways:

1. *Finetuning the convnet*: Instead of random initialization, initialize the network with the weights of the pretrained network, then train it on the new task.

2. *Fixed feature extractor*: Freeze the weights for all of the layers of the network except for the final fully connected (fc) layer. Replace the last fc layer so that the output size is the same as the number of classes for the new task. The new layer is initialized with random weights and only this layer is trained.

> Training a deep architecture using a pre-trained model allows us to train on a small dataset with less overfitting.

Mount Google Drive onto the virtual machine

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Change current directory to Lab 6

In [None]:
cd "/content/gdrive/My Drive/UCCD3074_Labs/UCCD3074_Lab6"

Load required libraries

In [None]:
import numpy as np
import torchvision.models as models

import torch, torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision.transforms as transforms
from torchsummary import summary

from cifar10 import CIFAR10

---
## Helper Functions

Define the train function

In [None]:
loss_iter = 1   # number of times the averaged loss is reported per epoch

def train(net, num_epochs, lr=0.1, momentum=0.9, verbose=True):
    
    history = []
    
    # number of iterations between two consecutive loss reports
    loss_iterations = int(np.ceil(len(trainloader)/loss_iter))
    
    # transfer model to GPU
    if torch.cuda.is_available():
        net = net.cuda()
    
    # set the optimizer
    optimizer = optim.SGD(net.parameters(), lr=lr, momentum=momentum)
    
    # set to training mode
    net.train()

    # train the network
    for e in range(num_epochs):    

        running_loss = 0.0
        running_count = 0.0

        for i, (inputs, labels) in enumerate(trainloader):

            # Clear all the gradient to 0
            optimizer.zero_grad()

            # transfer data to GPU
            if torch.cuda.is_available():
                inputs = inputs.cuda()
                labels = labels.cuda()

            # forward propagation to get h
            outs = net(inputs)

            # compute loss 
            loss = F.cross_entropy(outs, labels)

            # backpropagation to get dw
            loss.backward()

            # update w
            optimizer.step()

            # get the loss
            running_loss += loss.item()
            running_count += 1

            # display the averaged loss value
            if i % loss_iterations == loss_iterations-1 or i == len(trainloader) - 1:                
                train_loss = running_loss / running_count
                running_loss = 0. 
                running_count = 0.
                if verbose:
                    print(f'[Epoch {e+1:2d}/{num_epochs:d} Iter {i+1:5d}/{len(trainloader)}]: train_loss = {train_loss:.4f}')       
                
                history.append(train_loss)
    
    return history

Define the evaluate function

In [None]:
def evaluate(net):
    # set to evaluation mode
    net.eval()
    
    # number of correct predictions accumulated so far
    running_corrects = 0
    
    for inputs, targets in testloader:
        
        # transfer to the GPU
        if torch.cuda.is_available():
            inputs = inputs.cuda()
            targets = targets.cuda()
        
        # perform prediction (no need to compute gradient)
        with torch.no_grad():
            outputs = net(inputs)
            _, predicted = torch.max(outputs, 1)
            running_corrects += (targets == predicted).double().sum()
            
    print('Accuracy = {:.2f}%'.format(100*running_corrects/len(testloader.dataset)))

## 1. Load CIFAR10 dataset

Here, we use a sub-sample of CIFAR10 with 1000 training samples and 1000 testing samples. Because the sample size is small, the model is expected to overfit. Using a pretrained model alleviates the problem.

In [None]:
# transform applied to each image
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# dataset
trainset = CIFAR10(train=True, download=True, transform=transform, num_samples=1000)
testset  = CIFAR10(train=False, download=True, transform=transform, num_samples=1000)

# dataloader
trainloader = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)
testloader  = DataLoader(testset, batch_size=128, shuffle=True, num_workers=2)

## 2. The ResNet18 model

In this section, we shall build our network using a standard network architecture. We customize resnet18 by replacing its classifier layer, i.e., the last fully connected layer, with our own. The original classifier layer has 1000 outputs (ImageNet has 1000 classes) whereas our model has only 10.

### Network Architecture of ResNet18

We shall use resnet18 as our base network. Before we customize it, let's print out the summary of all layers of the model to view its architecture. Bear in mind that to customize the network, we need to replace the last linear layer. 

First, let's review the resnet18 network architecture.

In [None]:
# ... create resnet18 ...

In [None]:
# ... print network ...

We can get the name of the first layer by calling `.named_children()`.

In [None]:
# ... display the name of first layer ...

We will see that:
* `layer1` to `layer4` contain two blocks each. Each block contains two convolutional layers.
* The second last layer (`avgpool`) performs *global average pooling* to average out the spatial dimensions. 
* The last layer (`fc`) is a linear layer and indeed, it functions as a classifier. This is the layer that we want to replace to fit our model.

To customize the network, we need to replace the `fc` layer with our own classifier layer.

### Customizing ResNet18

In the following, we shall replace the last layer with a new classifier layer. The original layer is designed to classify ImageNet's 1000 image categories. The new layer will be used to classify CIFAR10's 10 classes.

In [None]:
# ... create build_network ...

Let's visualize what we have built. Note that the last layer of the network (`fc`) now has 10 instead of 1000 neurons.

In [None]:
print(build_network())

---
### Model 1: Training from scratch

Let's build the network **without** loading the pretrained model. To do this, we set `pretrained=False`.

In [None]:
# ... load the model without pretraining...

Train the model and save the training loss history into `history1`.

In [None]:
history1 = train(resnet18, num_epochs=30, lr=0.01, momentum=0.8)

Evaluate the model

In [None]:
evaluate(resnet18)

---
### Model 2: Finetuning the pretrained model

Typically, a standard network comes with a pretrained model trained on ImageNet's large-scale dataset for the image classification task.
* In the following, we shall load resnet18 with the pretrained model and use it to **initialize** the network. To do this, we set `pretrained=True`.
* The training will update the parameters of **all layers** of the network.

On Windows, the pretrained model is saved to the following directory: `C:\Users\<user name>\.cache\torch\checkpoints` (recent PyTorch versions use `C:\Users\<user name>\.cache\torch\hub\checkpoints`). A saved PyTorch model typically has the extension `.pt` or `.pth`.

In [None]:
# ... load the pretrained model ...

By default, all parameters have `requires_grad=True`, so every layer is updated during training.

In [None]:
# ... Unfreeze all layers ...

Train the model and save into `history2`.

In [None]:
history2 = train(resnet18, num_epochs=30, lr=0.01, momentum=0.8)

Evaluate the network

In [None]:
evaluate(resnet18)

---
### Model 3: As a fixed feature extractor

When the dataset is too small, fine-tuning the whole model may still lead to overfitting. In this case, you may want to use the pretrained model as a fixed feature extractor, where we train only the classifier layer (i.e., the **last layer**) that we have newly inserted into the network.

In [None]:
# ... load the pretrained model ...

We set `requires_grad=False` for all parameters except for those of the newly replaced layer `fc`, i.e., the last two parameters (weight and bias) in `resnet18.parameters()`.

In [None]:
# ... freeze all layers ...

In [None]:
# ... check that all layers are frozen ...

Train the model and save into `history3`.

In [None]:
history3 = train(resnet18, num_epochs=30, lr=0.01, momentum=0.8)

Evaluate the model

In [None]:
evaluate(resnet18)

---
### Model 4: Finetuning the top few layers

We can also tune the top few layers of the network. The following tunes all the layers in the block `layer4` as well as the `fc` layer.


In [None]:
# ... load the pretrained model ...

Then, we freeze all the layers except for `layer4` and `fc`.

In [None]:
# ... freeze the bottom few layers ...

In [None]:
# ... check that only the selected layers are frozen ...

Train the model and save into `history4`.

In [None]:
history4 = train(resnet18, num_epochs=30, lr=0.01, momentum=0.8)

Evaluate the model

In [None]:
evaluate(resnet18)

### Plotting training loss

Lastly, we plot the training loss history for each of the training schemes above.

In [None]:
import matplotlib.pyplot as plt

plt.plot(history1, label='From scratch')
plt.plot(history2, label='Finetuning the pretrained model')
plt.plot(history3, label='As a fixed feature extractor')
plt.plot(history4, label='Finetuning the top few layers')
plt.legend()
plt.show()

## Exercise

You can try different network architectures (e.g., EfficientNet-B0) and see whether they result in higher test accuracy.

In [None]:
def build_network(pretrained=True):
    # ...
    return efficientNet

In [None]:
efficientNet = build_network() 

In [None]:
history5 = train(efficientNet, num_epochs=30, lr=0.01, momentum=0.8)

In [None]:
evaluate(efficientNet)