# Homework 11 - Transfer Learning (Domain Adversarial Training)

> Author: Arvin Liu (r09922071@ntu.edu.tw)

If you have any questions, feel free to email the TA mailbox at kafuchino0410@gmail.com.


# Readme


This assignment covers Domain Adversarial Training, a Transfer Learning technique.

<img src="https://i.imgur.com/iMVIxCH.png" width="500px">

> That is, the bottom-left block.

## Scenario and Why Domain Adversarial Training
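You have labeled Source Data, and the Source Data is probably related to the Target Data, so you want to train a model on the Source Data and use it to predict on the Target Data.

But what is the problem with this? If you have studied Anomaly Detection, you will know: when a sample never appeared in the Source Data (in other words, it is abnormal), the model is unfamiliar with it and will most likely produce unreliable predictions.

Below, we split the model into a Feature Extractor (upper half) and a Classifier (lower half) as an example: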
<img src="https://i.imgur.com/IL0PxCY.png" width="500px">

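While the model learns from the Source Data, the Feature Extractor sees the Source Data many times, so the features it extracts are likely to be meaningful. For example, in the blue distribution in the figure the images have already been separated into clusters, so the Classifier can predict from those clusters.

On Target Data, however, the Feature Extractor has never seen this kind of data, so the resulting Target Features may not fall on the Source Feature distribution. A Classifier fed such features clearly will not predict well.

## Domain Adversarial Training of Neural Networks (DaNN)
Given this, would things work if both Source Data and Target Data landed on the same distribution after passing through the Feature Extractor? That is the core idea of DaNN.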

<img src="https://i.imgur.com/vrOE5a6.png" width="500px">

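We add a Domain Classifier. During training, the Domain Classifier tries to tell which domain a feature produced by the Feature Extractor came from, while the Feature Extractor learns to produce features that **fool** the Domain Classifier. Over time, the Feature Extractor usually beats the Domain Classifier. (This is because the Domain Classifier's input comes from the Feature Extractor, and for the Feature Extractor the domain task and the classification task do not conflict.)

As a result, we can be reasonably confident that the Feature Extractor maps inputs from either domain onto the same feature distribution.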
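
The tug-of-war described above can be written as a single objective. The notation here is an assumption introduced for illustration (it does not appear in the text): $\theta_f$, $\theta_y$, $\theta_d$ are the parameters of the Feature Extractor, Label Predictor, and Domain Classifier; $L_y$ is the label classification loss on source data; $L_d$ is the domain classification loss on both domains; and $\lambda$ weights the adversarial term.

$$
E(\theta_f, \theta_y, \theta_d) = L_y(\theta_f, \theta_y) - \lambda\, L_d(\theta_f, \theta_d)
$$

$$
(\hat\theta_f, \hat\theta_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat\theta_d),
\qquad
\hat\theta_d = \arg\max_{\theta_d} E(\hat\theta_f, \hat\theta_y, \theta_d)
$$

Read this way, the Domain Classifier minimizes $L_d$, while the Feature Extractor minimizes $L_y$ and at the same time pushes $L_d$ up.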

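# Data Introduction

In this task, the Source Data is real photos and the Target Data is hand-drawn doodles.

The model sees the real photos and their labels, and must try to predict the labels of the hand-drawn doodles.

The data is available [here](https://drive.google.com/file/d/1e4CaQ5VUF3F04XRDGXrnRQGogo89TiF8/view?usp=sharing). The code below downloads the data and shows roughly what it looks like.

One thing to note: **both the source and target data are class-balanced, and you may make use of this information.**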

In [None]:
# # Download dataset
# !gdown --id '1e4CaQ5VUF3F04XRDGXrnRQGogo89TiF8' --output real_or_drawing.zip
# # Unzip the files
# !unzip real_or_drawing.zip

In [None]:
import matplotlib.pyplot as plt

def no_axis_show(img, title='', cmap=None):
  # imshow with bicubic interpolation; swap in the commented line below for "nearest".
  # fig = plt.imshow(img, interpolation='nearest', cmap=cmap)
  fig = plt.imshow(img, interpolation='bicubic', cmap=cmap)
  # do not show the axis in the images.
  fig.axes.get_xaxis().set_visible(False)
  fig.axes.get_yaxis().set_visible(False)
  plt.title(title)

titles = ['horse', 'bed', 'clock', 'apple', 'cat', 'plane', 'television', 'dog', 'dolphin', 'spider']
plt.figure(figsize=(18, 18))
for i in range(10):
  plt.subplot(1, 10, i+1)
  fig = no_axis_show(plt.imread(f'./real_or_drawing/train_data/{i}/{500*i}.bmp'), title=titles[i])

In [None]:
plt.figure(figsize=(18, 18))
for i in range(10):
  plt.subplot(1, 10, i+1)
  no_axis_show(plt.imread(f'./real_or_drawing/test_data/0/{i:05d}.bmp'))

# Special Domain Knowledge

Since people usually draw only outlines when doodling, we can apply edge detection to the source data so that it looks more like the target data.

## Canny Edge Detection
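The algorithm is not covered here; we only show how to use it. If you are interested, see Wikipedia or [here](https://medium.com/@pomelyu5199/canny-edge-detector-%E5%AF%A6%E4%BD%9C-opencv-f7d1a0a57d19).

cv2.Canny is very easy to use and only needs two parameters: low_threshold and high_threshold.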

```cv2.Canny(image, low_threshold, high_threshold)```

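Simply put, when the edge strength exceeds high_threshold, we accept the pixel as an edge. If it only exceeds low_threshold, a further check decides whether it is an edge: it is kept only if it is connected to a pixel that has already been accepted as an edge.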
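
To make the two thresholds concrete, here is a minimal pure-Python sketch of the double-threshold (hysteresis) step on a 1D list of edge strengths. It is only an illustration: `hysteresis_1d` is a hypothetical helper, not part of OpenCV, and the real Canny algorithm also computes gradients and thins edges before this step.

```python
def hysteresis_1d(strengths, low, high):
    """Double-threshold step on a 1D list of edge strengths.

    Values above `high` are edges. Values above `low` are kept only when
    they are connected, through neighbours that are also above `low`,
    to a value above `high`.
    """
    n = len(strengths)
    is_edge = [s > high for s in strengths]
    # Grow each strong edge outward through neighbouring weak candidates.
    stack = [i for i in range(n) if is_edge[i]]
    while stack:
        i = stack.pop()
        for j in (i - 1, i + 1):
            if 0 <= j < n and not is_edge[j] and strengths[j] > low:
                is_edge[j] = True
                stack.append(j)
    return is_edge


strengths = [10, 120, 260, 130, 20, 140, 30]
edges = hysteresis_1d(strengths, low=100, high=200)
print(edges)  # [False, True, True, True, False, False, False]
```

The value 140 is above the low threshold but is not connected to any strong edge, so it is dropped, while 120 and 130 survive because they touch the strong value 260.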

Below we try it directly on the source data.

In [None]:
import cv2
import matplotlib.pyplot as plt
titles = ['horse', 'bed', 'clock', 'apple', 'cat', 'plane', 'television', 'dog', 'dolphin', 'spider']
plt.figure(figsize=(18, 18))

original_img = plt.imread(f'./real_or_drawing/train_data/0/0.bmp')
plt.subplot(1, 5, 1)
no_axis_show(original_img, title='original')

gray_img = cv2.cvtColor(original_img, cv2.COLOR_RGB2GRAY)
plt.subplot(1, 5, 2)
no_axis_show(gray_img, title='gray scale', cmap='gray')

canny_50100 = cv2.Canny(gray_img, 50, 100)
plt.subplot(1, 5, 3)
no_axis_show(canny_50100, title='Canny(50, 100)', cmap='gray')

canny_150200 = cv2.Canny(gray_img, 150, 200)
plt.subplot(1, 5, 4)
no_axis_show(canny_150200, title='Canny(150, 200)', cmap='gray')

canny_250300 = cv2.Canny(gray_img, 250, 300)
plt.subplot(1, 5, 5)
no_axis_show(canny_250300, title='Canny(250, 300)', cmap='gray')
  

# Data Process

I deliberately arranged the data so that it can be loaded with torchvision.datasets.ImageFolder, so that class alone is enough to build the datasets.
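
As a rough illustration of what "ImageFolder form" means: each class lives in its own sub-folder of the root, and class indices are assigned by sorted folder name. The sketch below mimics that mapping in pure Python; `find_classes` is a hypothetical helper written for this example, and the folder names are made up.

```python
import os
import tempfile


def find_classes(root):
    """Map each class sub-folder of `root` to an index, in sorted name order."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: idx for idx, name in enumerate(classes)}


with tempfile.TemporaryDirectory() as root:
    # Hypothetical class folders; the real train_data uses folders 0 to 9.
    for name in ['0', '1', '2', '10']:
        os.makedirs(os.path.join(root, name))
    class_to_idx = find_classes(root)

print(class_to_idx)  # {'0': 0, '1': 1, '10': 2, '2': 3}
```

Folder names are sorted as strings, so `'10'` comes before `'2'`. With the single-digit folders used for this assignment's training data, the index should equal the folder name. The test data appears to sit in one folder (`test_data/0`), so the labels that come back for it are placeholders.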

For the transforms, see the comments in the code below.
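<!-- 
#### Some details

In most versions, ```transforms.RandomRotation(15)``` is enough to apply RandomRotation to grayscale images. On Colab, however, you need to add ```fill=(0,)``` for it to run.
On n98 you need to remove ```fill=(0,)``` for it to run. -->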


In [None]:
from tqdm import tqdm
from time import time
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Function

import torch.optim as optim
import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

def same_seeds(seed):
    # Python built-in random module
    random.seed(seed)
    # Numpy
    np.random.seed(seed)
    # Torch
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

same_seeds(1116)

source_transform = transforms.Compose([
    # Turn RGB to grayscale before running Canny.
    transforms.Grayscale(),
    # cv2 does not accept PIL images, so we convert to np.array
    # and then apply the cv2.Canny algorithm.
    transforms.Lambda(lambda x: cv2.Canny(np.array(x), 170, 300)),
    # Convert the np.array back to a PIL image.
    transforms.ToPILImage(),
    # 50% Horizontal Flip. (For Augmentation)
    transforms.RandomHorizontalFlip(),
    # Rotate +- 15 degrees (for augmentation) and fill with zeros
    # any pixels left empty after the rotation.
    transforms.RandomRotation(15, fill=(0,)),
    # Transform to tensor for model inputs.
    transforms.ToTensor(),
])
target_transform = transforms.Compose([
    # Turn RGB to grayscale.
    transforms.Grayscale(),
    # Resize: the source data is 32x32, so we need to
    # enlarge the target data from 28x28 to 32x32.
    transforms.Resize((32, 32)),
    # 50% Horizontal Flip. (For Augmentation)
    transforms.RandomHorizontalFlip(),
    # Rotate +- 15 degrees (for augmentation) and fill with zeros
    # any pixels left empty after the rotation.
    transforms.RandomRotation(15, fill=(0,)),
    # Transform to tensor for model inputs.
    transforms.ToTensor(),
])

source_dataset = ImageFolder('./real_or_drawing/train_data', transform=source_transform)
target_dataset = ImageFolder('./real_or_drawing/test_data', transform=target_transform)

source_dataloader = DataLoader(source_dataset, batch_size=32, shuffle=True)
target_dataloader = DataLoader(target_dataset, batch_size=32, shuffle=True)
test_dataloader = DataLoader(target_dataset, batch_size=128, shuffle=False)

# Model

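Feature Extractor (`Generator` in the code below): a typical VGG-like stack.

Label Predictor / Domain Classifier (`Classifier` in the code below): MLPs all the way.

By this point in the course you should be familiar with the layers below, so they are not explained further.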
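
One detail worth checking by hand is why the MLP input size is 512. The sketch below assumes what the code cell that follows shows: 32x32 inputs, 3x3 convolutions with stride 1 and padding 1 (which keep the spatial size), five 2x2 max-pools with stride 2, and 512 output channels. `pooled_size` is a hypothetical helper for this arithmetic only.

```python
def pooled_size(size, num_pools, kernel=2, stride=2):
    """Spatial side length after `num_pools` max-pool layers without padding."""
    for _ in range(num_pools):
        size = (size - kernel) // stride + 1
    return size


side = pooled_size(32, num_pools=5)   # 32 -> 16 -> 8 -> 4 -> 2 -> 1
feature_dim = 512 * side * side       # channels * height * width
print(side, feature_dim)              # 1 512
```

So each image ends up as a 512-dimensional feature vector, which matches the `nn.Linear(512, ...)` input of the classifier.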

In [None]:
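class GradReverse(Function):
    # Gradient Reversal Layer: identity in the forward pass,
    # gradient multiplied by -lambd in the backward pass.
    @staticmethod
    def forward(ctx , x , lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx , grad_output):
        # One gradient per forward input; lambd itself gets no gradient.
        return -ctx.lambd * grad_output , None

def grad_reverse(x , lambd = 1.0):
    return GradReverse.apply(x , lambd)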
    
class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, 1, 1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(64, 128, 3, 1, 1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(128, 256, 3, 1, 1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),

            nn.Conv2d(256, 256, 3, 1, 1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(256, 512, 3, 1, 1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            
            nn.Conv2d(512, 512, 3, 1, 1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(512, 512, 3, 1, 1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            
            nn.Conv2d(512, 512, 3, 1, 1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
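    def forward(self, x):
        # Flatten (N, 512, 1, 1) to (N, 512); unlike squeeze(), this keeps
        # the batch dimension even when the batch holds a single image.
        x = self.conv(x).flatten(1)
        return x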

class Classifier(nn.Module):
    def __init__(self):
        
        super(Classifier , self).__init__()

        self.fc1 = nn.Linear(512 , 512)
        self.bn1 = nn.BatchNorm1d(512)
        self.fc2 = nn.Linear(512 , 256)
        self.bn2 = nn.BatchNorm1d(256)
        self.fc3 = nn.Linear(256 , 10)
        # Default gradient-reversal coefficient; override with set_lambda().
        self.lambd = 1.0
        # self.layer = nn.Sequential(
        #     nn.Linear(512, 512),
        #     nn.BatchNorm1d(512),
        #     nn.ReLU(),

        #     nn.Linear(512, 256),
        #     nn.BatchNorm1d(256),
        #     nn.ReLU(),

        #     nn.Linear(256, 10),
        # )

    # def forward(self, h):
    #     c = self.layer(h)
    #     return c
    def set_lambda(self , lambd):
        self.lambd = lambd
        return

    def forward(self , x , reverse = False):
        if (reverse):
            x = grad_reverse(x , self.lambd)
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.fc2(x)))
        x = self.fc3(x)
        return x


# Pre-processing

Here we use Adam as the optimizer.
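
The cell below also defines a `discrepancy` function that compares the outputs of the two classifiers. As a hedged, single-sample illustration of what that quantity measures, here is a pure-Python version; the `softmax` helper and the example logits are made up for this sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discrepancy(logits_1, logits_2):
    """Mean absolute difference between two softmax distributions."""
    p1, p2 = softmax(logits_1), softmax(logits_2)
    return sum(abs(a - b) for a, b in zip(p1, p2)) / len(p1)


same = discrepancy([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
different = discrepancy([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
print(same, different)
```

Two classifiers that agree give a discrepancy of zero, and the value grows as their predicted distributions move apart.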

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class_criterion = nn.CrossEntropyLoss()
L1_loss = nn.L1Loss()
def discrepancy(output_1 , output_2):
    # Mean absolute difference between the two classifiers' softmax outputs.
    return torch.mean(torch.abs(F.softmax(output_1 , dim = 1) - F.softmax(output_2 , dim = 1)))


generator = Generator().to(device)
classifier_1 = Classifier().to(device)
classifier_2 = Classifier().to(device)

# generator.load_state_dict(torch.load("./MCDmodel/generator.pt"))
# classifier_1.load_state_dict(torch.load("./MCDmodel/classifier_1.pt"))
# classifier_2.load_state_dict(torch.load("./MCDmodel/classifier_2.pt"))

optimizer_generator = optim.Adam(generator.parameters())
optimizer_classifier_1 = optim.Adam(classifier_1.parameters())
optimizer_classifier_2 = optim.Adam(classifier_2.parameters())

# batch_size = 128
# learning_rate = 0.00002
# weight_decay = 0.0005
# epoch = 2000
# (optimizer_generator , optimizer_classifier_1 , optimizer_classifier_2) = (Adam(generator.parameters() , lr = learning_rate , weight_decay = weight_decay) , Adam(classifier_1.parameters() , lr = learning_rate , weight_decay = weight_decay) , Adam(classifier_2.parameters() , lr = learning_rate , weight_decay = weight_decay))




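## How to implement DaNN?

In theory, the original paper adds a Gradient Reversal Layer and trains the Feature Extractor / Label Predictor / Domain Classifier jointly. However, we can also train the Domain Classifier and the Feature Extractor alternately (just like training the Generator and Discriminator of a GAN), which also works.

In the code, we take the latter approach.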
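
For reference, one alternating update for a plain DaNN setup could look like the sketch below. It is a toy under stated assumptions, not the training cell of this notebook: the three `nn.Linear` models, the random tensors, and the value of `lamb` are placeholders chosen only so the snippet runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the three networks (assumed sizes, for illustration only).
feature_extractor = nn.Linear(8, 4)
label_predictor = nn.Linear(4, 3)
domain_classifier = nn.Linear(4, 1)

opt_F = torch.optim.Adam(feature_extractor.parameters())
opt_C = torch.optim.Adam(label_predictor.parameters())
opt_D = torch.optim.Adam(domain_classifier.parameters())
class_criterion = nn.CrossEntropyLoss()
domain_criterion = nn.BCEWithLogitsLoss()
lamb = 0.1  # weight of the adversarial term (assumed value)

# Random data standing in for one source batch and one target batch.
source_x, source_y = torch.randn(16, 8), torch.randint(0, 3, (16,))
target_x = torch.randn(16, 8)
mixed_x = torch.cat([source_x, target_x])
domain_label = torch.cat([torch.ones(16, 1), torch.zeros(16, 1)])  # 1 = source

# Step 1: train the Domain Classifier on detached features,
# so that no gradient reaches the Feature Extractor.
feature = feature_extractor(mixed_x)
loss_D = domain_criterion(domain_classifier(feature.detach()), domain_label)
opt_D.zero_grad()
loss_D.backward()
opt_D.step()

# Step 2: train the Feature Extractor and Label Predictor to classify
# source data while making the domain loss larger (fooling the classifier).
class_logits = label_predictor(feature[:16])
domain_logits = domain_classifier(feature)
loss_F = class_criterion(class_logits, source_y) - lamb * domain_criterion(domain_logits, domain_label)
opt_F.zero_grad()
opt_C.zero_grad()
loss_F.backward()
opt_F.step()
opt_C.step()

print(loss_D.item(), loss_F.item())
```

The `detach()` in step 1 plays the role that the Discriminator update plays in GAN training. In step 2 the Domain Classifier also receives gradients, but its optimizer is not stepped, so only the other two networks change.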

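## Notes
* In the original paper, lambda (the coefficient controlling the Domain Adversarial Loss) has an adaptive version; see the [original paper](https://arxiv.org/pdf/1505.07818.pdf) if you are interested.
* Since we have no target labels at all, the only way to see how well it does is to submit to Kaggle :)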
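
A minimal sketch of such an adaptive schedule is shown below. It follows the form used in the DaNN paper, where lambda ramps from 0 towards 1 as training progresses; the function name `adaptive_lambda` and the default `gamma=10.0` are choices made for this example.

```python
import math


def adaptive_lambda(epoch, num_epochs, gamma=10.0):
    """Ramp lambda smoothly from 0 to (almost) 1 over the course of training."""
    progress = epoch / num_epochs  # goes from 0.0 to 1.0
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


for epoch in (0, 50, 100):
    print(epoch, round(adaptive_lambda(epoch, 100), 4))
```

The idea is that the adversarial signal is weak early on, when the features are still noisy, and gets stronger once the classifier has something useful to work with.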

In [None]:
def MCD_train_epoch(epoch, num_epochs, source_dataloader, target_dataloader):
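    '''
    Run one training epoch of the three-step scheme below.

    Args:
    epoch: index of the current epoch (used for the progress bar).
    num_epochs: total number of epochs (used for the progress bar).
    source_dataloader: dataloader of the source data.
    target_dataloader: dataloader of the target data.
    '''
    start = time()
    # Step 1 loss: classification loss on source data (generator + both classifiers)
    # Step 2 loss: source classification loss minus classifier discrepancy on target data
    # Step 3 loss: classifier discrepancy on target data (updates the generator only)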
    running_Step1_loss, running_Step2_loss, running_Step3_loss = 0.0, 0.0, 0.0
    total_hit1, total_hit2, total_num = 0.0, 0.0, 0.0

    # for i, ((source_data, source_label), (target_data, _)) in tqdm(enumerate(zip(source_dataloader, target_dataloader))):
    for i, ((source_data, source_label), (target_data, _)) in enumerate(zip(source_dataloader, target_dataloader)):

        source_data = source_data.to(device)
        source_label = source_label.to(device)
        target_data = target_data.to(device)

        # Step 1: train the generator and both classifiers on labeled source data.
        optimizer_generator.zero_grad()
        optimizer_classifier_1.zero_grad()
        optimizer_classifier_2.zero_grad()

        feature = generator(source_data)
        y_1 = classifier_1(feature)
        y_2 = classifier_2(feature)
        loss = class_criterion(y_1 , source_label) + class_criterion(y_2 , source_label)
        running_Step1_loss += loss.item()

        loss.backward()

        optimizer_generator.step()
        optimizer_classifier_1.step()
        optimizer_classifier_2.step()

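        # Step 2: update only the classifiers, keeping source accuracy while
        # maximizing their discrepancy on target data.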
        optimizer_generator.zero_grad()
        optimizer_classifier_1.zero_grad()
        optimizer_classifier_2.zero_grad()

        feature = generator(source_data)
        y_1 = classifier_1(feature)
        y_2 = classifier_2(feature)
        loss_1 = class_criterion(y_1 , source_label) + class_criterion(y_2 , source_label)

        feature = generator(target_data)
        y_1 = classifier_1(feature)
        y_2 = classifier_2(feature)
        loss_2 = discrepancy(y_1 , y_2)

        loss = loss_1 - loss_2
        running_Step2_loss += loss.item()

        loss.backward()

        optimizer_classifier_1.step()
        optimizer_classifier_2.step()
        
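        # Step 3: update only the generator, 4 times per batch, to minimize
        # the classifiers' discrepancy on target data.
        for k in range(4):
            optimizer_generator.zero_grad()
            feature = generator(target_data)
            y_1 = classifier_1(feature)
            y_2 = classifier_2(feature)
            loss_discrepancy = discrepancy(y_1 , y_2)
            # Log the discrepancy once per batch, on the last inner iteration.
            if k == 3:
                running_Step3_loss += loss_discrepancy.item()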

            loss_discrepancy.backward()
            
            optimizer_generator.step()

        if (i < min(len(source_dataloader) , len(target_dataloader)) - 1):
            m = int(50 * (i + 1) / min(len(source_dataloader) , len(target_dataloader)))
            bar = m * '=' + '>' + (49 - m) * ' '
            print('epoch {}/{} [{}]'.format(epoch + 1 , num_epochs , bar) , end = '\r')
        else:
            bar = 50 * '='
            end = time()
            print('epoch {}/{} [{}] ({}s)'.format(epoch+ 1 , num_epochs , bar , int(end - start)))


        # Track training accuracy on the current source batch (no gradients needed).
        with torch.no_grad():
            feature = generator(source_data)
            y_1 = classifier_1(feature)
            y_2 = classifier_2(feature)
            total_hit1 += torch.sum(torch.argmax(y_1, dim=1) == source_label).item()
            total_hit2 += torch.sum(torch.argmax(y_2, dim=1) == source_label).item()
            total_num += source_data.shape[0]
        # print(i, end='\r')

    return running_Step1_loss / (i+1) , running_Step2_loss / (i+1), running_Step3_loss / (i+1), total_hit1 / total_num, total_hit2 / total_num



# Start Training

In [None]:
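# training
import os
num_epochs = 2000

# Make sure the output directories used below exist.
os.makedirs('./MCDmodel', exist_ok=True)
os.makedirs('./MCD_Predict', exist_ok=True)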

# generator.load_state_dict(torch.load("./MCDmodel/generator.pt"))
# classifier_1.load_state_dict(torch.load("./MCDmodel/classifier_1.pt"))
# classifier_2.load_state_dict(torch.load("./MCDmodel/classifier_2.pt"))

generator.train()
classifier_1.train()
classifier_2.train()

for epoch in range(num_epochs):
    # You should choose lambda cleverly.
    # lamb = adaptive_lambda(epoch, num_epochs)
    train_Step1_loss, train_Step2_loss, train_Discrepancy_loss, trainF1_acc, trainF2_acc = MCD_train_epoch(epoch, num_epochs, source_dataloader, target_dataloader)

    torch.save(generator.state_dict() , './MCDmodel/generator.pt')
    torch.save(classifier_1.state_dict() , './MCDmodel/classifier_1.pt')
    torch.save(classifier_2.state_dict() , './MCDmodel/classifier_2.pt')
    print('epoch {:>3d}: train_Step1_loss: {:6.4f}, train_Step2_loss: {:6.4f}, train_Discrepancy_loss: {:6.4f}, F1_acc {:6.4f}, F2_acc {:6.4f}'.format(epoch+1, train_Step1_loss, train_Step2_loss, train_Discrepancy_loss, trainF1_acc, trainF2_acc))

    if epoch != 0 and epoch % 100 == 0 :
        result = []

        generator.eval()
        classifier_1.eval()
        classifier_2.eval()

        with torch.no_grad():
            for i, (test_data, _) in enumerate(test_dataloader):
                test_data = test_data.to(device)

                feature = generator(test_data)
                y_1 = classifier_1(feature)
                y_2 = classifier_2(feature)
                y = (y_1 + y_2) 
                
                answer = torch.argmax(y, dim = 1).cpu().detach().numpy()
                result.append(answer)

        import pandas as pd
        result = np.concatenate(result)

        # Generate your submission
        df = pd.DataFrame({'id': np.arange(0,len(result)), 'label': result})
        df.to_csv('./MCD_Predict/MCD_submission-'+str(epoch)+'.csv',index=False)

        generator.train()
        classifier_1.train()
        classifier_2.train()

# Inference

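Same as in the previous assignments. Here I use pd (pandas) to produce the csv, because it looks cooler (?)

In addition, the accuracy at 200 epochs may be unstable; you can submit a few more times or train longer.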
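
For readers who want to see the submission layout without pandas, here is a small standard-library sketch that writes the same two columns, `id` and `label`. The `write_submission` helper and the three example labels are made up for illustration.

```python
import csv
import io


def write_submission(labels, file):
    """Write predicted labels as (id, label) rows, with ids 0..N-1."""
    writer = csv.writer(file, lineterminator='\n')
    writer.writerow(['id', 'label'])
    for idx, label in enumerate(labels):
        writer.writerow([idx, int(label)])


buffer = io.StringIO()
write_submission([3, 0, 7], buffer)
print(buffer.getvalue())
```

Passing an open file instead of the `StringIO` buffer would write the same rows to disk.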

In [None]:
result = []

generator = Generator().to(device)
classifier_1 = Classifier().to(device)
classifier_2 = Classifier().to(device)

# map_location lets the checkpoints load on CPU-only machines as well.
generator.load_state_dict(torch.load("./MCDmodel/generator.pt", map_location=device))
classifier_1.load_state_dict(torch.load("./MCDmodel/classifier_1.pt", map_location=device))
classifier_2.load_state_dict(torch.load("./MCDmodel/classifier_2.pt", map_location=device))

generator.eval()
classifier_1.eval()
classifier_2.eval()

with torch.no_grad():
    for i, (test_data, _) in enumerate(test_dataloader):
        test_data = test_data.to(device)

        feature = generator(test_data)
        y_1 = classifier_1(feature)
        y_2 = classifier_2(feature)
        y = (y_1 + y_2) 
        
        answer = torch.argmax(y, dim = 1).cpu().detach().numpy()
        result.append(answer)

import pandas as pd
result = np.concatenate(result)

# Generate your submission
df = pd.DataFrame({'id': np.arange(0,len(result)), 'label': result})
df.to_csv('MCD_submission.csv',index=False)

# Training Statistics

- Number of parameters:
  - Feature Extractor: 2,142,336
  - Label Predictor: 530,442
  - Domain Classifier: 1,055,233

- Simple
  - Training time on colab: ~ 1 hr
- Medium
  - Training time on colab: 2 ~ 4 hr
- Strong
  - Training time on colab: 5 ~ 6 hrs
- Boss
  - **Unmeasurable**
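
If you want to check parameter counts for your own model, a small helper like the one below is usually enough. `count_parameters` is a hypothetical name, and the single `nn.Linear` layer is only a stand-in. The numbers listed above may correspond to a different architecture than the cells in this notebook, so it is worth counting your own.

```python
import torch.nn as nn


def count_parameters(model):
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


layer = nn.Linear(512, 10)  # 512 * 10 weights + 10 biases
num_params = count_parameters(layer)
print(f'{num_params:,}')  # 5,130
```

Calling `count_parameters(generator)` or `count_parameters(classifier_1)` gives the counts for the models defined earlier.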

# Learning Curve (Strong Baseline)
* This method is slightly different from colab.

![Loss Curve](https://i.imgur.com/vIujQyo.png)

# Accuracy Curve (Strong Baseline)
* Note that you cannot access testing accuracy. But this plot tells you that even though the model overfits the training data, the testing accuracy is still improving, and that's why you need to train more epochs.

![Acc Curve](https://i.imgur.com/4W1otXG.png)



# Special Thanks
Below are the reference assignments originally provided by the NTU TAs.

[NTU_r08942071_太神啦 / Team leader: 劉正仁](https://drive.google.com/open?id=11uNDcz7_eMS8dMQxvnWsbrdguu9k4c-c)

[NTU_r08921a08_CAT / Team leader: 廖子毅](https://drive.google.com/open?id=1xIkSs8HAShdcfV1E0NEnf4JDbL7POZTf)
