$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
$$

# CS236605: Deep Learning
# Tutorial 6: Transfer Learning and Domain Adaptation

## Introduction

In this tutorial, we will cover:

- Transfer learning
- Fine tuning and freezing
- Domain adaptation

In [2]:
# Setup
%matplotlib inline
import os
import sys
import torch
import matplotlib.pyplot as plt

plt.rcParams['font.size'] = 20
data_dir = os.path.expanduser('~/.pytorch-datasets')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Theory Reminders

- classification with CNNs

So far, we considered mostly the **supervised learning** setting, where we assumed the
**train** and **test** sets are both from the same **distribution** and both labeled.



## Transfer learning

In the real world, we often don't have the perfect training set for our problem.

What should we do when the supervised learning assumption is invalid?

<img src="img/transfer_learning_digits.png" />

### Domains, targets and tasks

Let's start with some definitions to explain the problem.

- Imagine we have a **feature space**, $\mathcal{X}$
    - For example, $\mathcal{X}$ is the space of color images of size 32x32, each pixel in the range 0-255

In [126]:
import math
# size of this "limited" feature space
math.log10(256**(32**2*3))

7398.1131734380015

- As usual, we have a training set $X=\{\vec{x}^{(i)}\}_{i=1}^{N},\ \vec{x}^{(i)}\in\cset{X}$.
    - For example, CIFAR-10
    
    <img src="img/cifar10.png" width="500"/>

- There exists some **probability distribution** $P(X)$ (aka $P_{X}(\vec{x})$) over our training set.
    - For example, in CIFAR-10, the probability of an all-black image is very low
    - If classes are unbalanced, samples from large and small classes have very different probabilities
    
    <img src="img/data_dist.jpg" />

- Our **label space**, $\cset{Y}$, includes the possible labels for a sample in our problem.
    - For example, $\cset{Y}=\{0,1\}$ in binary classification.
- We may also have $Y = \{y^{(i)}\}_{i=1}^{N}$, the set of labels for our dataset.

- We want to learn the target function $\hat{y}=f(\vec{x})$ which predicts a label given an image.
    - From the probabilistic perspective, we model $P(Y|X)$ and predict $\hat{y}=\arg\max_{y\in\cset{Y}}P(y|\vec{x})$.
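
As a minimal sketch of the probabilistic view: a classifier typically outputs one score (logit) per class, a softmax turns these scores into an estimate of the class probabilities given the input, and the prediction is the most probable class. The logits below are made-up values for illustration, not the output of any model in this tutorial.

```python
import torch

# Hypothetical logits for a batch of 2 samples and 3 classes (made-up values)
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])

# Softmax over the class dimension gives estimated class probabilities per sample
probs = torch.softmax(logits, dim=1)

# The predicted label is the most probable class
y_hat = probs.argmax(dim=1)

print(probs.sum(dim=1))  # each row of probabilities sums to 1
print(y_hat)
```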

Finally,
- A learning **domain** $\cset{D}$, is defined as $\cset{D}=\left\{\mathcal{X},P(X)\right\}$.
- A learning **task** $\cset{T}$ is defined as $\cset{T}=\{\cset{Y},P(Y|X)\}$.

### Transfer learning settings

**Definition** (Pan & Yang, 2010):

Given
- A **source** domain $\cset{D}_S$ and learning task $\cset{T}_S$
- A **target** domain $\cset{D}_T$ and learning task $\cset{T}_T$

*Transfer learning* aims to improve the learning of the target function
using *knowledge* in $\cset{D}_S$ and $\cset{T}_S$, when
- $\cset{D}_S\neq\cset{D}_T$, **or**
- $\cset{T}_S\neq\cset{T}_T$

There are usually additional constraints on the target domain, such as few or no labels being available.

When $\cset{D}_S=\cset{D}_T$ and $\cset{T}_S=\cset{T}_T$ we're in the regular supervised learning setting
we have seen thus far.

For example, splitting CIFAR-10 randomly into a train and test set.
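
A minimal sketch of this regular setting, using a small synthetic dataset as a stand-in for CIFAR-10 (the tensors and sizes are assumptions made for illustration). Because both subsets are drawn at random from the same pool, they share the same domain and task.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic stand-in for a labeled image dataset: 100 samples of shape 3x32x32
x = torch.rand(100, 3, 32, 32)
y = torch.randint(0, 10, (100,))
ds = TensorDataset(x, y)

# Random 80/20 split; a seeded generator makes the split reproducible
gen = torch.Generator().manual_seed(42)
ds_train_split, ds_test_split = random_split(ds, [80, 20], generator=gen)

print(len(ds_train_split), len(ds_test_split))
```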

#### Same domain, different task

Recall, a learning **task** $\cset{T}$ is defined as $\cset{T}=\{\cset{Y},P(Y|X)\}$.

So there are two cases (not mutually exclusive).

Case 1: The label spaces are different, $\cset{Y}_S \neq \cset{Y}_T$

For example, the target domain has more classes.

<img src="img/cifar10_100.png" />

Case 2: The target conditional distributions are different, $P(Y_S|X_S)\neq P(Y_T|X_T)$.

This may be the case when the class-balance is very different in the source and target distributions.
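
One way to see why class balance matters: by Bayes' rule, $P(Y|X)\propto P(X|Y)P(Y)$, so a different class prior $P(Y)$ changes the conditional distribution even when $P(X|Y)$ stays the same. The sketch below compares the empirical label distributions of two made-up label lists. The total variation distance is used here as one possible measure of the gap; it is an assumption of this example, not something the tutorial prescribes, and `label_distribution` is a hypothetical helper.

```python
from collections import Counter

def label_distribution(labels, classes):
    """Empirical probability of each class in a list of labels."""
    counts = Counter(labels)
    n = len(labels)
    return {c: counts[c] / n for c in classes}

# Made-up labels: balanced source, heavily unbalanced target
labels_source = [0] * 50 + [1] * 50
labels_target = [0] * 90 + [1] * 10
classes = [0, 1]

p_source = label_distribution(labels_source, classes)
p_target = label_distribution(labels_target, classes)

# Total variation distance: half the L1 distance between the two distributions
tv_distance = 0.5 * sum(abs(p_source[c] - p_target[c]) for c in classes)
print(p_source, p_target, tv_distance)
```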

#### Same task, different domain

Recall, a learning **domain** $\cset{D}$, is defined as $\cset{D}=\left\{\mathcal{X},P(X)\right\}$.

Again, two cases.

Case 1: Different feature spaces, $\cset{X}_S \neq \cset{X}_T$.

For example: $\cset{X}_S$ is a space of grayscale images while $\cset{X}_T$ is a space of color images;
documents in different languages.
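
When the feature spaces differ only in the number of channels, one common workaround (an assumption here, not something this tutorial prescribes) is to map both domains into a shared space, for instance by repeating the single grayscale channel three times. A minimal sketch, with a random tensor standing in for a grayscale image:

```python
import torch

# Random stand-in for a grayscale image: (channels=1, height=28, width=28)
x_gray = torch.rand(1, 28, 28)

# Repeat the single channel three times to match a color feature space
x_color = x_gray.repeat(3, 1, 1)

print(x_gray.shape, x_color.shape)
```

Note that this only aligns the feature spaces; the data distributions of the two domains can still differ.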

Case 2: Different data distributions, $P(X_S)\neq P(X_T)$.

For example: source domain contains hand-drawn images, while target domain contains photographs;
documents in the same language about different topics.

<img src="img/tl_example.png" width="500"/>

## Example 1: Fine-tuning a pre-trained model

We have trained a model in a source domain,
and now we want to use it to speed up training for a different domain where we have much less data.

Common example: pre-train on ImageNet (1M+ images, 1000 classes), and then classify e.g. medical images.

<img src="img/transfer-learning-medical.png" width="500" />


Why would this work?

CNNs capture hierarchical features, with deeper layers capturing higher-level, class-specific features
(Zeiler & Fergus, 2013).

<img src="img/zf1.png" width="800"/>

<img src="img/zf2.png" width="800"/>

Hence, we can start from a pre-trained model and:
- "Fine-tune" the convolutional filters, mainly in the deeper layers.
- Change the classifier "head" to fit our task and train it from scratch.

In [117]:
import torchvision as tv

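# Load a deep CNN pretrained on ImageNet.
# ResNet18 is used here only to reduce the download size; in practice, consider a deeper model.
# Note: newer torchvision versions replace `pretrained=True` with a `weights=` argument.
resnet18 = tv.models.resnet18(pretrained=True)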
resnet18

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Co

In [118]:
# Freeze all layers: disable gradient tracking
for p in resnet18.parameters():
    p.requires_grad = False

In [119]:
# "Thaw" last layer (or whatever is relevant for you)
for p in resnet18.layer4.parameters():
    p.requires_grad = True
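
A quick sanity check after freezing and thawing is to count how many parameters still require gradients. The sketch below uses a small made-up `nn.Sequential` model instead of `resnet18` so that it runs without downloading weights; the helper `count_params` is hypothetical.

```python
import torch.nn as nn

def count_params(model):
    """Return (trainable, frozen) parameter counts of a model."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen

# Small stand-in model: a "backbone" layer followed by a "last" layer
toy_model = nn.Sequential(
    nn.Linear(10, 5),  # 10*5 weights + 5 biases = 55 parameters
    nn.ReLU(),
    nn.Linear(5, 2),   # 5*2 weights + 2 biases = 12 parameters
)

# Freeze everything, then thaw only the last layer
for p in toy_model.parameters():
    p.requires_grad = False
for p in toy_model[2].parameters():
    p.requires_grad = True

print(count_params(toy_model))
```

The same helper can be applied to `resnet18` to confirm that only the intended layers remain trainable.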

In [120]:
import torch.nn as nn

# Replace fully-connected part by some other classifier, e.g.

cnn_features = resnet18.fc.in_features
num_classes = 13

resnet18.fc = nn.Sequential(
    nn.Linear(cnn_features, 100, bias=True),
    nn.ReLU(),
    nn.Linear(100, num_classes, bias=True),
)

In [122]:
import torchvision.transforms as tvtf

# Important nuance 1: our data must be normalized the same way as the ImageNet training data
tf = tvtf.Compose([
    tvtf.Resize(224),
    tvtf.ToTensor(),
    tvtf.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load our target domain data (CIFAR-10 used just as a simple example)
ds_train = tv.datasets.CIFAR10(root=data_dir, download=True, train=True, transform=tf)
ds_test = tv.datasets.CIFAR10(root=data_dir, download=True, train=False, transform=tf)

batch_size = 8
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=True, num_workers=2)
# No need to shuffle the test set
dl_test = torch.utils.data.DataLoader(ds_test, batch_size, shuffle=False, num_workers=2)

Files already downloaded and verified
Files already downloaded and verified
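
The `Normalize` transform used above computes `(x - mean) / std` per channel, with the channel statistics of the ImageNet training set. A minimal sketch of the same arithmetic in plain `torch`, on a random tensor standing in for an RGB image already scaled to [0, 1]:

```python
import torch

# ImageNet per-channel statistics, reshaped to broadcast over height and width
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

# Random stand-in for an RGB image with values in [0, 1]
x = torch.rand(3, 224, 224)

# Per-channel normalization
x_norm = (x - mean) / std

# The operation is invertible, which is handy when visualizing normalized images
x_back = x_norm * std + mean
print(torch.allclose(x, x_back, atol=1e-5))
```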


In [123]:
resnet18(ds_train[0][0].unsqueeze(dim=0))

tensor([[-0.1018,  0.0296, -0.1904,  0.2327,  0.2346, -0.0646,  0.0659, -0.0181,
         -0.1356, -0.0122,  0.4035, -0.1575, -0.0995]],
       grad_fn=<AddmmBackward>)

In [125]:
import torch.optim as optim
from tut6.train import train_model

# Important nuance 2: Only parameters that track gradients can be passed into the optimizer
params_non_frozen = filter(lambda p: p.requires_grad, resnet18.parameters())
opt = optim.SGD(params_non_frozen, lr=0.05, momentum=0.9)

# Fine-tuning usually calls for smaller-than-usual learning rates,
# decayed over time so that the weights keep improving
lr_sched = optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.05, patience=5)

loss_fn = nn.CrossEntropyLoss()

def train(model, loss_fn, opt, lr_sched, dl_train, dl_test):
    # Same as regular classifier training; just call lr_sched.step() once per epoch.
    # ReduceLROnPlateau expects the monitored metric (e.g. the test loss) as its argument.
    # ...
    pass
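
As a rough sketch of what such a training function could look like (an assumed implementation, not the tutorial's `train_model`), the loop below trains for a few epochs and steps the scheduler with the test loss, which is the metric `ReduceLROnPlateau` monitors. The tiny linear model and random data are made-up stand-ins so that the example runs quickly; the names `train_sketch`, `num_epochs` and the `toy_*` variables are hypothetical.

```python
import math
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def train_sketch(model, loss_fn, opt, lr_sched, dl_train, dl_test, num_epochs=3):
    """Minimal train/eval loop; returns the average test loss per epoch."""
    test_losses = []
    for epoch in range(num_epochs):
        # Training pass: update only the parameters held by the optimizer
        model.train()
        for x, y in dl_train:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

        # Evaluation pass, without tracking gradients
        model.eval()
        total, n = 0.0, 0
        with torch.no_grad():
            for x, y in dl_test:
                total += loss_fn(model(x), y).item() * len(y)
                n += len(y)
        test_losses.append(total / n)

        # ReduceLROnPlateau decides whether to decay based on this metric
        lr_sched.step(test_losses[-1])
    return test_losses

# Made-up data and model, just to exercise the loop
torch.manual_seed(0)
toy_ds = TensorDataset(torch.randn(64, 20), torch.randint(0, 3, (64,)))
toy_dl = DataLoader(toy_ds, batch_size=8)
toy_model = nn.Linear(20, 3)
toy_params = [p for p in toy_model.parameters() if p.requires_grad]
toy_opt = optim.SGD(toy_params, lr=0.05, momentum=0.9)
toy_sched = optim.lr_scheduler.ReduceLROnPlateau(toy_opt, factor=0.05, patience=5)

losses = train_sketch(toy_model, nn.CrossEntropyLoss(), toy_opt, toy_sched, toy_dl, toy_dl)
print(losses)
```

In a real run, the data loaders, model and optimizer defined earlier would be passed in place of the toy objects, and a separate validation set would usually be monitored instead of the test set.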

## Example 2: Unsupervised domain adaptation

In [143]:
from tut6.data import MNISTMDataset

image_size = 28

tf_source = tvtf.Compose([
    tvtf.Resize(image_size),
    tvtf.ToTensor(),
    tvtf.Normalize(mean=(0.1307,), std=(0.3081,))
])

tf_target = tvtf.Compose([
    tvtf.Resize(image_size),
    tvtf.ToTensor(),
    tvtf.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
])

ds_source = tv.datasets.MNIST(root=data_dir, train=True, transform=tf_source, download=True)

ds_target = MNISTMDataset(os.path.join(data_dir, 'mnist_m', 'mnist_m_train'),
                          os.path.join(data_dir, 'mnist_m', 'mnist_m_train_labels.txt'),
                          transform=tf_target)

(tensor([[[-0.4510, -0.4431, -0.4510,  ..., -0.4275, -0.4118, -0.4039],
          [-0.4510, -0.4510, -0.4510,  ..., -0.4353, -0.4118, -0.3882],
          [-0.4431, -0.4510, -0.4510,  ..., -0.4353, -0.4118, -0.3882],
          ...,
          [-0.3490, -0.3412, -0.3333,  ..., -0.3412, -0.3412, -0.3412],
          [-0.3333, -0.3333, -0.3255,  ..., -0.3412, -0.3490, -0.3333],
          [-0.3176, -0.3176, -0.3098,  ..., -0.3490, -0.3490, -0.3412]],
 
         [[-0.3647, -0.3569, -0.3647,  ..., -0.3647, -0.3490, -0.3490],
          [-0.3647, -0.3569, -0.3569,  ..., -0.3647, -0.3569, -0.3333],
          [-0.3490, -0.3569, -0.3569,  ..., -0.3647, -0.3490, -0.3333],
          ...,
          [-0.2863, -0.2784, -0.2784,  ..., -0.2706, -0.2627, -0.2549],
          [-0.2784, -0.2784, -0.2706,  ..., -0.2784, -0.2706, -0.2549],
          [-0.2863, -0.2784, -0.2706,  ..., -0.2863, -0.2784, -0.2627]],
 
         [[-0.4980, -0.4980, -0.4980,  ..., -0.4902, -0.4902, -0.4902],
          [-0.5059, -0.5137,

**Image credits**

Some images in this tutorial were taken and/or adapted from:

- M. Wulfmeier et al., https://arxiv.org/abs/1703.01461v2
- Andrej Karpathy, http://karpathy.github.io
- K. Xu et al. 2015, https://arxiv.org/abs/1502.03044