# Assignment 3 -- Transformer

## Goal

1. Learn to design a Transformer architecture.

2. Learn to implement the single-head (and multi-head) attention mechanism.

3. Learn to implement the position-wise feedforward (FFN) layer.


## Score

1. Position-wise feedforward (FFN) Layer 20%

2. Single-head attention 20% (Bonus: multi-head attention 20%)

3. Transformer architecture 20% (Please note that you can modify Part 9)

4. Model size 15%:

* 10%: If your model size is smaller than **1MB**, you will get 10%. Otherwise, no points will be awarded.
* 5%:  The remaining 5% will depend on your ranking within the class.

5. Model accuracy 15%:

* 10%: If your accuracy is higher than **57%**, you will get 10%. Otherwise, no points will be awarded.
* 5%:  The remaining 5% will depend on your ranking within the class.

6. Model accuracy on another dataset 10%: it will depend on your ranking within the class.
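
To get a feel for the 1MB budget in item 4, the parameter count gives a rough lower bound on the file size. The sketch below assumes that "model size" refers to the saved `.pt` file and that parameters are stored as 32-bit floats; `estimate_param_bytes` is a hypothetical helper, not part of the assignment code.

```python
import torch.nn as nn

def estimate_param_bytes(model: nn.Module) -> int:
    """Rough size estimate: total bytes occupied by the model's parameters."""
    return sum(p.numel() * p.element_size() for p in model.parameters())

# Toy example: a 10 -> 5 linear layer has 10*5 weights + 5 biases = 55 parameters.
toy = nn.Linear(10, 5)
num_params = sum(p.numel() for p in toy.parameters())
print(num_params, estimate_param_bytes(toy))  # 55 parameters, 220 bytes in float32
```

The saved file also contains serialization metadata, so the actual file is somewhat larger than this estimate.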


## Rule

1. You can use all torch and einops functions for your implementations.

2. Please avoid using any module not mentioned above. Contact the TA if you are unsure whether a library is allowed.

3. Please do NOT modify the sections marked `DO NOT MODIFY`.

## Submission

Upload your files to NTU Cool.
* This .ipynb file: please rename it using the format `DL_HW3_StudentID.ipynb`
* Model : .pt file
* Output: .csv file

Deadline: 5/20 midnight (23:59)

Please fill in your student ID number below.

In [None]:
# Please fill in your student ID number
student_id = 'xxxxx'

## Part 1

Import the necessary libraries.

`DO NOT MODIFY`

In [None]:
# Model
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dataset
from torchvision import datasets
from torch.utils.data.dataset import Dataset
from torch.utils.data import DataLoader
from scipy.io import loadmat

# Optimizer
from torch.optim.optimizer import Optimizer

# Pre-processing
import torchvision.transforms as trns
from PIL import Image

from math import sqrt

# For Transformer
# you may need to run `pip install einops` to install einops
from einops import rearrange, repeat
from einops.layers.torch import Rearrange

## Part 2

`DO NOT MODIFY`

Global variables.

The code below uses these variables, so please keep them when you modify the code.

In [None]:
batch_size = 16
num_classes = 10
input_size = (32, 32, 3)
patch_size = 4
num_epoch = 10
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Part 3

`DO NOT MODIFY`

Create the dataloaders with dataset pre-processing.

In [None]:
# Create train/test transforms
train_transform = trns.Compose([
    trns.ToTensor(),
])

test_transform = trns.Compose([
    trns.ToTensor(),
])

# Create train/test datasets with pre-processing
# The dataset is downloaded automatically if it does not exist
data_train = datasets.CIFAR10(root='./dataset/', train=True, transform=train_transform, download=True)
data_test = datasets.CIFAR10(root='./dataset/', train=False, transform=test_transform, download=True)

# Create train/test dataloaders for the pre-processed datasets
train_loader = DataLoader(data_train, batch_size=batch_size, shuffle=True)
test_loader  = DataLoader(data_test,  batch_size=batch_size, shuffle=False)

## Part 4

`DO NOT MODIFY`

Convert an image to tokens.

In [None]:
class myTokenization(nn.Module):

    def __init__(self, output_dim, patch_size, channels):

        super().__init__()

        patch_dim = patch_size * patch_size * channels

        self.to_patch_tokens = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_size, p2 = patch_size),
            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, output_dim)
        )

    def forward(self, x):

        return self.to_patch_tokens(x)
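
As a sanity check on the shapes, the `Rearrange` pattern above can be reproduced with plain `torch` reshapes. This is a minimal sketch, assuming the 32x32 RGB input and `patch_size = 4` from Part 2; `to_patches` is a hypothetical helper used only for illustration.

```python
import torch

def to_patches(x: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Plain-torch equivalent of 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)'."""
    b, c, height, width = x.shape
    h, w = height // patch_size, width // patch_size
    x = x.reshape(b, c, h, patch_size, w, patch_size)   # b c h p1 w p2
    x = x.permute(0, 2, 4, 3, 5, 1)                      # b h w p1 p2 c
    return x.reshape(b, h * w, patch_size * patch_size * c)

images = torch.randn(2, 3, 32, 32)       # a dummy batch of two CIFAR-10-sized images
patches = to_patches(images, patch_size=4)
print(patches.shape)  # torch.Size([2, 64, 48])
```

Each image becomes 64 patches of 4 * 4 * 3 = 48 values, which `myTokenization` then normalizes and projects to `output_dim`.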


## Part 5

Please implement the following formula for the position-wise feedforward layer (FFN).

$\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$
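
The formula can be read as two matrix products with a ReLU in between. The sketch below writes it out with explicit weight tensors rather than as an `nn.Module`; the function name `ffn_formula` and all sizes are illustrative assumptions, not the required settings.

```python
import torch

def ffn_formula(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to the last dimension of x."""
    hidden = torch.relu(x @ w1 + b1)   # max(0, .) is the ReLU activation
    return hidden @ w2 + b2

# Illustrative sizes only: 65 tokens of width 64, hidden width 128.
x = torch.randn(2, 65, 64)
w1, b1 = torch.randn(64, 128), torch.zeros(128)
w2, b2 = torch.randn(128, 64), torch.zeros(64)
y = ffn_formula(x, w1, b1, w2, b2)
print(y.shape)  # torch.Size([2, 65, 64])
```

"Position-wise" means the same weights are applied to every token independently, so the output for one token does not depend on the others.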

In [None]:
class myFFN(nn.Module):

    def __init__(self, input_dim, hidden_dim):
        super().__init__()

        # your implementation
        # example: single linear layer
        # self.ffn = nn.Linear(input_dim, hidden_dim)

    def forward(self, x):

        # your implementation
        # example: single linear layer
        # out = self.ffn(x)

        return out

## Part 6

Please implement scaled dot-product attention (single head), including masking (masked attention scores are set to −∞ before the softmax).

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
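
A functional sketch of this formula is shown here for reference. It assumes a boolean mask in which `True` means "keep" (that convention, the name `attention_sketch`, and the causal example mask are assumptions for illustration; the assignment does not fix them).

```python
import math
import torch

def attention_sketch(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional boolean mask (True = keep)."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (..., n_q, n_k)
    if mask is not None:
        scores = scores.masked_fill(~mask, float('-inf'))    # masked scores -> -inf
    weights = torch.softmax(scores, dim=-1)                  # each row sums to 1
    return weights @ v, weights

q = k = v = torch.randn(2, 5, 8)                             # (batch, tokens, d_k)
causal = torch.tril(torch.ones(5, 5, dtype=torch.bool))      # example mask: no look-ahead
out, weights = attention_sketch(q, k, v, mask=causal)
print(out.shape, weights[0, 0])
```

Because exp(−∞) = 0, masked positions receive exactly zero attention weight. Every row must keep at least one unmasked position, otherwise the softmax returns NaN.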

_Bonus: please implement multi-head attention:_

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, ..., head_h)W^O$, where $head_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$

![picture](https://drive.google.com/uc?id=1JF7t1r3s2YBaaESo2UqnOn4ZGXpLSBZC)
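
For the bonus, the per-head projections $W^Q_i$, $W^K_i$, $W^V_i$ can be stacked into one matrix each, so that splitting into heads becomes a reshape. The following is a minimal self-attention sketch under that assumption; `multi_head_sketch`, the explicit weight tensors, and the sizes are hypothetical and chosen only for illustration.

```python
import math
import torch

def multi_head_sketch(x, w_q, w_k, w_v, w_o, heads):
    """Self-attention with `heads` heads.

    w_q / w_k / w_v: (dim, heads * head_dim); w_o: (heads * head_dim, dim).
    """
    b, n, _ = x.shape

    def split(t):  # (b, n, heads * head_dim) -> (b, heads, n, head_dim)
        return t.reshape(b, n, heads, -1).transpose(1, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    out = torch.softmax(scores, dim=-1) @ v                 # (b, heads, n, head_dim)
    out = out.transpose(1, 2).reshape(b, n, -1)             # Concat(head_1, ..., head_h)
    return out @ w_o                                        # final projection W^O

dim, heads, head_dim = 64, 4, 16   # illustrative sizes, not the required settings
x = torch.randn(2, 65, dim)
w_q, w_k, w_v = (torch.randn(dim, heads * head_dim) * dim ** -0.5 for _ in range(3))
w_o = torch.randn(heads * head_dim, dim) * (heads * head_dim) ** -0.5
y = multi_head_sketch(x, w_q, w_k, w_v, w_o, heads)
print(y.shape)  # torch.Size([2, 65, 64])
```

With `heads = 1` this reduces to the single-head formula followed by the output projection.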

In [None]:
class myAttention(nn.Module):

    def __init__(self, input_dim, heads, head_dim):

        super().__init__()

        # your implementation

    def forward(self, x):

        # your implementation
        # Please implement scaled dot-product attention here (bonus: multi-head attention)

        return out

## Part 7

Please design your Transformer architecture using your implementations of `myFFN` and `myAttention`.

You may choose the number of layers, the number of hidden neurons in each layer, and the activation function of each layer.

Note that your score depends on both the size and the accuracy of your model.

Here are the architectures of Transformer and ViT for your reference.

### Transformer
![picture](https://drive.google.com/uc?id=1_M6HgPKyInm1bjGQ81QiqIDR5BDAHXZs)

### ViT
![picture](https://drive.google.com/uc?id=18E82sYrhth9uJAF6ILGnmZlZ5tIcIUIo)
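
One common way to wire a single encoder layer is sketched below. It assumes the pre-norm arrangement used in ViT (LayerNorm before each sublayer, residual connection around it); the original Transformer applies the normalization after the residual sum instead. `ResidualBlockSketch` and the stand-in sublayers are hypothetical, and this is only one possible design.

```python
import torch
import torch.nn as nn

class ResidualBlockSketch(nn.Module):
    """One encoder layer: x + Attn(LN(x)), then x + FFN(LN(x)) (pre-norm, as in ViT)."""

    def __init__(self, dim, attn, ffn):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn, self.ffn = attn, ffn    # any modules mapping (b, n, dim) -> (b, n, dim)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # residual connection around attention
        x = x + self.ffn(self.norm2(x))    # residual connection around the FFN
        return x

# Stand-in sublayers so the sketch runs on its own; swap in myAttention / myFFN.
block = ResidualBlockSketch(64, attn=nn.Identity(), ffn=nn.Linear(64, 64))
out = block(torch.randn(2, 65, 64))
print(out.shape)  # torch.Size([2, 65, 64])
```

The residual sums require each sublayer to return a tensor of the same width as its input, so the FFN has to project back from its hidden width to `dim`.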

In [None]:
class myTransformer(nn.Module):

    def __init__(self, dim, heads, dim_head, mlp_dim):

        super().__init__()

        # your implementation

    def forward(self, x):

        # your implementation

        return out

## Part 8

`DO NOT MODIFY`

Construct the Vision Transformer (ViT) from your implemented Transformer and FFN.

![picture](https://drive.google.com/uc?id=1TKPJyIkGXSnHN9BTJA5XTZ3QtyORqi89)

In [None]:
class myViT(nn.Module):

    def __init__(self, input_size, patch_size, hidden_dim, heads, head_dim, mlp_dim, num_classes):

        super().__init__()

        image_height, image_width, channels = input_size
        patch_height, patch_width = patch_size, patch_size

        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = patch_height * patch_width * channels

        self.to_input_token = myTokenization(hidden_dim, patch_size, channels)
        self.cls_token = nn.Parameter(torch.randn(1, 1, hidden_dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches+1, hidden_dim))

        self.transformer = myTransformer(hidden_dim, heads, head_dim, mlp_dim)

        self.mlp_head = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, num_classes))

    def forward(self, x):

        # convert image to token embeddings
        x = self.to_input_token(x)

        # concatenate cls token and input token and add positional embedding
        b, n, _ = x.shape
        cls_tokens = repeat(self.cls_token, '() n d -> b n d', b = b)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding[:, :(n + 1)]

        # transformer
        x = self.transformer(x)

        # cls token for prediction
        x = x[:, 0]

        return self.mlp_head(x)



## Part 9

You may adjust `hidden_dim`, `heads`, `head_dim`, and `mlp_dim` to construct your ViT.
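
When choosing these values, it may help to know how many parameters are already spent outside `myTransformer`. The arithmetic below follows the layers defined in Parts 4 and 8 (tokenization, class token, positional embedding, classification head); `fixed_param_count` is a hypothetical helper, and the byte figure assumes float32 storage.

```python
def fixed_param_count(hidden_dim, patch_size=4, channels=3, image_size=32, num_classes=10):
    """Parameters of myViT outside myTransformer, following Parts 4 and 8."""
    patch_dim = patch_size * patch_size * channels
    num_patches = (image_size // patch_size) ** 2
    tokenization = 2 * patch_dim + (patch_dim * hidden_dim + hidden_dim)  # LayerNorm + Linear
    cls_and_pos = hidden_dim + (num_patches + 1) * hidden_dim             # cls token + pos embedding
    head = 2 * hidden_dim + (hidden_dim * num_classes + num_classes)      # LayerNorm + Linear
    return tokenization + cls_and_pos + head

count = fixed_param_count(hidden_dim=64)
print(count, count * 4 / 1024)  # 8234 parameters, about 32.2 KiB in float32
```

At 4 bytes per parameter, a 1MB file holds roughly a quarter of a million parameters in total, so the remaining budget is what your Transformer layers can use.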

In [None]:
model = myViT(
    input_size = input_size,
    patch_size = patch_size,
    hidden_dim = 64,
    heads = 4,
    head_dim = 64,
    mlp_dim = 64,
    num_classes = num_classes).to(device)


## Part 10

`DO NOT MODIFY`

Multiclass cross-entropy loss

In [None]:
criterion = nn.CrossEntropyLoss()
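
Note that `nn.CrossEntropyLoss` expects raw logits and integer class labels, so the model should not apply a softmax to its output. The small check below, with made-up logits and labels, compares it against the formula written out by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)                 # raw model outputs for 4 samples, 10 classes
labels = torch.tensor([3, 0, 9, 1])         # integer class indices

builtin = nn.CrossEntropyLoss()(logits, labels)
# Manual version: mean over the batch of -log softmax(logits)[true class]
manual = -F.log_softmax(logits, dim=1)[torch.arange(4), labels].mean()
print(torch.allclose(builtin, manual))  # True
```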

## Part 11

`DO NOT MODIFY`

Adam optimizer.

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

## Part 12

`DO NOT MODIFY`

Model training

In [None]:
model.train()

for epoch in range(num_epoch):

    losses = []

    for batch_num, input_data in enumerate(train_loader):

        optimizer.zero_grad()

        x, y = input_data
        x = x.to(device).float()
        y = y.to(device)

        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        losses.append(loss.item())

        optimizer.step()

        if batch_num % 500 == 0:
            print('\tEpoch %d | Batch %d | Loss %6.4f' % (epoch, batch_num, loss.item()))

    print('Epoch %d | Loss %6.4f' % (epoch, sum(losses)/len(losses)))

torch.save(model, student_id + '_submission.pt')
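
To see how large the saved file will be before submitting, the serialized size can be measured in memory. This is a sketch with a toy module standing in for the trained model; `serialized_size_bytes` is a hypothetical helper, and it assumes the size criterion is applied to the file written by `torch.save`.

```python
import io
import torch
import torch.nn as nn

def serialized_size_bytes(obj) -> int:
    """Size in bytes of what torch.save would write for `obj`."""
    buffer = io.BytesIO()
    torch.save(obj, buffer)          # torch.save accepts file-like objects
    return buffer.getbuffer().nbytes

toy = nn.Linear(10, 5)               # stand-in for the trained model
size = serialized_size_bytes(toy)
print(f'{size} bytes ({size / 1e6:.4f} MB)')
```

For the file already written above, `os.path.getsize(student_id + '_submission.pt')` reports the same kind of figure directly from disk.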

## Part 13

`DO NOT MODIFY`

Model evaluation

In [None]:
import csv
model.eval()

with open(student_id + '_submission.csv', 'w') as f:

    fieldnames = ['ImageId', 'Prediction', 'Label']

    writer = csv.DictWriter(f, fieldnames=fieldnames, lineterminator = '\n')
    writer.writeheader()

    correct = 0
    total = 0

    with torch.no_grad():

        for x, t in test_loader:

            x = x.to(device).float()
            output = model(x).argmax(dim=1)

            for y,l in zip(output, t):

                writer.writerow({fieldnames[0]: (total+1),
                                 fieldnames[1]: y.item(),
                                 fieldnames[2]: l.item()})

                total += 1
                if y.item() == l.item():
                    correct += 1

    print('Accuracy: %6.4f' % (correct / total))