# Neural Machine Translation with Various Sequence Models

## Instructions
- In this project, we will perform neural machine translation with recurrent neural networks and attention-based models on the Multi30k dataset, which contains German-English sentence pairs.
- To this end, you need to implement the necessary network components (e.g. LSTMCell, multi-head attention) using the `nn.Module` class and assemble complete models from those modules. You will then experiment with these architectures and report the Bilingual Evaluation Understudy (BLEU) score on the test set.
- Fill in the sections marked **Px.x** with the appropriate code. **You may only modify those areas, not the skeleton code.**
- To begin, copy this ipynb file into your own Google Drive by clicking `make a copy(사본만들기)`. Find the copy in your drive and rename it to `Translation.ipynb` if its name was changed to, e.g., `Copy of Translation.ipynb` or `Translationipynb의 사본`.
- <font color="red">You'll be training large models. We recommend keeping at least **1GB** of free space on your Google Drive so that everything runs properly.</font>
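
For orientation, the BLEU score mentioned above can be sketched in plain Python. This is a minimal illustration, assuming corpus-level BLEU with uniform weights over 1- to 4-grams and a single reference per candidate; `ngrams` and `corpus_bleu` are hypothetical helper names, not part of the skeleton code, and the notebook itself relies on `torchtext.data.metrics.bleu_score`.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with uniform n-gram weights and one reference per candidate."""
    clipped = [0] * max_n  # matched n-gram counts, clipped by the reference counts
    total = [0] * max_n    # number of candidate n-grams
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
            clipped[n - 1] += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())
    if cand_len == 0 or min(clipped) == 0:
        return 0.0  # some n-gram order has no match, so the geometric mean is zero
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    # Brevity penalty: punish candidates that are shorter than their references.
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)

reference = "a man is riding a bike".split()
perfect = corpus_bleu([reference], [reference])
partial = corpus_bleu(["a man is riding a horse".split()], [reference])
print(perfect, partial)
```

A single wrong word lowers the precision of every n-gram order that contains it, which is why BLEU drops quickly even for near-correct translations.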

---
# Prerequisite: Mount your gdrive.

In [None]:
# Mount Google Drive; see https://datascience.stackexchange.com/questions/29480/uploading-images-folder-from-my-system-into-google-colab
# Log in with your Google account and enter the authorization code to mount your Google Drive.
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


---
# Prerequisite: Setup the `root` directory properly.

In [None]:
# Specify the directory path where `Translation.ipynb` exists.
# For example, if you saved `Translation.ipynb` in `/gdrive/My Drive/samsung_ai` directory,
# then set root = '/gdrive/My Drive/samsung_ai'
root = '/gdrive/My Drive/Projects/samsung_2023'

root_ = root.replace(' ', '\\ ')  # escape spaces for the shell command below
!ls $root_

'Copy of Translation.ipynb'   results  'Translation(Solution).ipynb'


---
# Prerequisite: Install libraries.
You only have to run this cell once per VM at startup.

In [None]:
!pip install torchtext==0.6.0
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm

Collecting torchtext==0.6.0
  Using cached torchtext-0.6.0-py3-none-any.whl (64 kB)
Installing collected packages: torchtext
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.15.2
    Uninstalling torchtext-0.15.2:
      Successfully uninstalled torchtext-0.15.2
Successfully installed torchtext-0.6.0

---
# Basic settings

## Import libraries

In [None]:
import os
import numpy as np
import time
from pathlib import Path
import torch
import torch.nn as nn
from torch.nn.parameter import Parameter
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.optim import SGD
import torchtext
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator
from torchtext.data.utils import get_tokenizer
from torchtext import data
from torchtext.data.metrics import bleu_score
import spacy
from spacy.symbols import ORTH
import math
import random
import tqdm.notebook as tq
import copy

## Set Hyperparameters

In [None]:
# Basic settings
torch.manual_seed(470)
torch.cuda.manual_seed(470)

#!pip install easydict
from easydict import EasyDict as edict

args = edict()
args.batch_size = 32
args.nlayers = 2
args.ninp = 256
args.nhid = 256 #512


args.clip = 1
args.lr_lstm = 0.001
args.dropout = 0.2
args.nhid_attn = 256
args.epochs = 20

##### Transformer
args.nhid_tran = 256
args.nhead = 8
args.nlayers_transformer = 6
args.attn_pdrop = 0.1
args.resid_pdrop = 0.1
args.embd_pdrop = 0.1
args.nff = 4 * args.nhid_tran


args.lr_transformer = 0.0001 #1.0
args.betas = (0.9, 0.98)

args.gpu = True


device = 'cuda:0' if torch.cuda.is_available() and args.gpu else 'cpu'
# Create directory name.
result_dir = Path(root) / 'results'
result_dir.mkdir(parents=True, exist_ok=True)

---
# Utility functions


In [None]:
def word_ids_to_sentence(id_tensor, vocab, join=' '):
    """Converts a sequence of word ids to a sentence"""
    if isinstance(id_tensor, torch.LongTensor):
        ids = id_tensor.transpose(0, 1).contiguous().view(-1)
    elif isinstance(id_tensor, np.ndarray):
        ids = id_tensor.transpose().reshape(-1)
    batch = [vocab.itos[ind] for ind in ids] # denumericalize
    if join is None:
        return batch
    else:
        return join.join(batch)

# Yields the biases of nn.Linear modules if `bias` is True, otherwise all remaining parameters.
def get_parameters(model, bias=False):
    linear_biases = {id(m.bias) for m in model.modules()
                     if isinstance(m, nn.Linear) and m.bias is not None}
    for param in model.parameters():
        if (id(param) in linear_biases) == bias:
            yield param

def run_epoch(epoch, model, optimizer, is_train=True, data_iter=None):
    total_loss = 0
    n_correct = 0
    n_total = 0
    if data_iter is None:
        data_iter = train_iter if is_train else valid_iter
    if is_train:
        model.train()
    else:
        model.eval()
    for batch in data_iter:
        x, y, length = sort_batch(batch.src.to(device), batch.trg.to(device))
        target = y[1:]
        if isinstance(model, Transformer):
            x, y = x.transpose(0, 1), y.transpose(0, 1)
            target = target.transpose(0, 1) #y[:, 1:]
        pred = model(x, y, length)
        loss = criterion(pred.reshape(-1, trg_ntoken), target.reshape(-1))
        n_targets = (target != pad_id).long().sum().item()
        n_total += n_targets
        n_correct += (pred.argmax(-1) == target)[target != pad_id].long().sum().item()
        if is_train:
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
            optimizer.step()


        total_loss += loss.item() * n_targets
    total_loss /= n_total
    print("Epoch", epoch, 'Train' if is_train else 'Valid',
          "Loss", np.mean(total_loss),
          "Acc", n_correct / n_total,
          "PPL", np.exp(total_loss))
    return total_loss

def word_ids_to_sentence_(ids, vocab):
    sentence = []
    for ind in ids:
        if ind == eos_id:
            break
        sentence.append(vocab.itos[ind])
    return sentence

def run_translation(model, data_iter, max_len=100, mode='best'):
    with torch.no_grad():
        model.eval()
        load_model(model, mode)
        src_list = []
        gt_list = []
        pred_list = []
        for batch in data_iter:
            x, y, length = sort_batch(batch.src.to(device), batch.trg.to(device))
            target = y[1:]
            if isinstance(model, Transformer):
                x, y = x.transpose(0, 1), y.transpose(0, 1)
                target = target.transpose(0, 1)
            pred = model(x, y, length, max_len=max_len, teacher_forcing=False)
            pred_token = pred.argmax(-1)
            if not isinstance(model, Transformer):
                pred_token = pred_token.transpose(0, 1).cpu().numpy()
                y = y.transpose(0, 1).cpu().numpy()
                x = x.transpose(0, 1).cpu().numpy()
            # pred_token : batch_size x max_len
            for x_, y_, pred_ in zip(x, y, pred_token):
                src_list.append(word_ids_to_sentence_(x_[1:], SRC.vocab))
                gt_list.append([word_ids_to_sentence_(y_[1:], TRG.vocab)])
                pred_list.append(word_ids_to_sentence_(pred_, TRG.vocab))

        for i in range(5):
            print(f"--------- Translation Example {i+1} ---------")
            print("SRC :", ' '.join(src_list[i]))
            print("TRG :", ' '.join(gt_list[i][0]))
            print("PRED:", ' '.join(pred_list[i]))
        print()
        print("BLEU:", bleu_score(pred_list, gt_list))



def save_model(model, mode="last"):
    torch.save(model.state_dict(),  result_dir / f'{type(model).__name__}_{mode}.ckpt')

def load_model(model, mode="last"):
    if os.path.exists(result_dir / f'{type(model).__name__}_{mode}.ckpt'):
        model.load_state_dict(torch.load(result_dir / f'{type(model).__name__}_{mode}.ckpt'))

def sort_batch(X, y, lengths=None):
    if lengths is None:
        lengths = (X != pad_id_src).long().sum(0)
    lengths, indx = lengths.sort(dim=0, descending=True)
    X = torch.index_select(X, 1, indx)
    y = torch.index_select(y, 1, indx)
    return X, y, lengths

def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

---
# Download Multi30k Dataset

In [None]:
!wget -P .data/multi30k/ https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz
!wget -P .data/multi30k/ https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz
!wget -P .data/multi30k/ https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt_task1_test2016.tar.gz

!tar -xzf .data/multi30k/training.tar.gz
!tar -xzf .data/multi30k/validation.tar.gz
!tar -xzf .data/multi30k/mmt_task1_test2016.tar.gz
!mv train.de train.en val.de val.en test2016.de test2016.en .data/multi30k/

---
# Define `DataLoader` for training & validation set


In [None]:
SRC = Field(tokenize = "spacy",
            tokenizer_language="de_core_news_sm",
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

TRG = Field(tokenize = "spacy",
            tokenizer_language="en_core_web_sm",
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),
                                                    fields = (SRC, TRG))
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

src_ntoken = len(SRC.vocab.stoi)
trg_ntoken = len(TRG.vocab.stoi)

train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = args.batch_size,
    device = device)

In [None]:
pad_id_trg = TRG.vocab.stoi[TRG.pad_token]
pad_id_src = SRC.vocab.stoi[SRC.pad_token]
pad_id = pad_id_src
eos_id = TRG.vocab.stoi[TRG.eos_token]
criterion = nn.CrossEntropyLoss(ignore_index=pad_id)

for batch in train_iter:
    src, trg, length_src = sort_batch(batch.src, batch.trg)
    print(length_src)
    print(src, src.shape)
    print(trg, trg.shape)
    break

print("##### EXAMPLE #####")
print("SRC: ", word_ids_to_sentence(src[:, 1:2].long().cpu(), SRC.vocab))
print("TRG: ", word_ids_to_sentence(trg[:, 1:2].long().cpu(), TRG.vocab))

print("SRC vocab size", len(SRC.vocab.stoi))
print("TRG vocab size", len(TRG.vocab.stoi))
print("Vocab", list(SRC.vocab.stoi.items())[:10])

tensor([22, 19, 18, 18, 17, 16, 16, 15, 15, 14, 14, 14, 13, 13, 13, 13, 13, 12,
        12, 12, 12, 12, 12, 11, 11, 11, 11, 11, 10, 10,  9,  9],
       device='cuda:0')
tensor([[   2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2],
        [  18,    5,    5,    8,    5,    5,    5,    5,   18,    5,    5,    8,
            5, 2475,    5,    5,    5,   18,    8,    5,    5,    5,    5,    5,
          105,   18,   73,    8,    8,    5,    5,   43],
        [  45,  271,   13,   16,   66,  171,   96,  116,   45,   70,    0,   16,
            0,   44,   13, 2912,   70,   25, 1047, 5533,    0, 1049,   49,  435,
           54,  241,  279,   36, 2161,  164,  632,   45],
        [   7,  676,    7,    7,   25,   32,   13,  218,    9,  551,  228, 1182,
           13, 6114,   37,    7,  820,  137, 2475,   11,  116,   69,   11,  228,
         

In [None]:
!python3 -m spacy validate
print(spacy.__version__)

✔ Loaded compatibility table

ℹ spaCy installation: /usr/local/lib/python3.7/dist-packages/spacy

NAME              SPACY            VERSION
de_core_news_sm   >=3.4.0,<3.5.0   3.4.0   ✔
en_core_web_sm    >=3.4.0,<3.5.0   3.4.1   ✔

3.4.2


In [None]:
src_ntoken

7853

---
# Define networks

You should implement the `forward()` method of the given classes. Some classes already come with a `forward()` method; in that case you don't have to change anything. However, **you are not allowed to modify the `__init__` method of any class.**

## P1. Implement LSTM

### (a)  LSTMCell [(illustration)](https://docs.google.com/drawings/d/1ICw_GxDMxkSS5g7D1w6gXDkLm3NAok1otmPcwYzLPEg/edit?usp=sharing)
- LSTMCell is the basic unit of an LSTM. It takes the current input (`x`) and the previous state (composed of the hidden state `hx` and the cell state `cx`) and returns the state for the next time step (`hy` and `cy`). Four gates (the switch variables) control how information flows through time. Implement the forward function with these four gates, following the illustration.
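
For reference, a standard LSTM cell computes the quantities below, where $\sigma$ is the sigmoid, $\odot$ is element-wise multiplication, and $h_{t-1}$, $c_{t-1}$, $h_t$, $c_t$ correspond to `hx`, `cx`, `hy`, `cy`. The weight and bias symbols are notation introduced here for illustration only; the linked illustration remains the reference for this assignment.

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
g_t &= \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here $i_t$, $f_t$, $g_t$, and $o_t$ are the input, forget, cell, and output gates.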

In [None]:
class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(LSTMCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.linear_input = nn.Linear(input_size, 4 * hidden_size)
        self.linear_hidden = nn.Linear(hidden_size, 4 * hidden_size)

    def forward(self, x, state):
        # type: (Tensor, Tuple[Tensor, Tensor]) -> Tuple[Tensor, Tuple[Tensor, Tensor]]
        hx, cx = state
        gates = self.linear_input(x) + self.linear_hidden(hx)
        ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)

        ingate = torch.sigmoid(ingate)
        forgetgate = torch.sigmoid(forgetgate)
        cellgate = torch.tanh(cellgate)
        outgate = torch.sigmoid(outgate)

        cy = (forgetgate * cx) + (ingate * cellgate)
        hy = outgate * torch.tanh(cy)

        return hy, (hy, cy)

### (b)  LSTM[(illustration)](https://docs.google.com/drawings/d/1eiYyY9k6NELizHcRAsOS6jkEi_mISjfFlXntv8gaGZI/edit?usp=sharing)
- LSTMLayer is a single layer that applies an LSTMCell sequentially. While LSTMCell handles a single input, LSTMLayer takes a sequence as input and processes it one time step at a time, carrying the state forward. You don't need to implement LSTMLayer; it is given.
- Using LSTMLayer, implement a full LSTM module by stacking multiple LSTMLayers. Note that `states` now contains one `state` per layer, each serving as the initial state of the corresponding LSTMLayer. The output of each LSTMLayer is fed into the next LSTMLayer as its input.
As a result, LSTM returns an `output` tensor of size (L,B,nhid) and `output_states`, a list with one final state per layer; each state is a tuple of a hidden tensor and a cell tensor of size (B,nhid). Here L, B, and nhid are the maximum sentence length within a batch (equal to `x.size(0)`), the batch size, and the dimension of the hidden states, respectively. Implement the forward function following the given illustration.
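
The stacking pattern described above can be sketched without any tensors. In this toy version, `toy_layer` is a hypothetical stand-in for LSTMLayer (a running sum instead of an LSTM cell) and `toy_stack` plays the role of the stacked module; only the data flow between layers is meant to carry over.

```python
def toy_layer(inputs, state):
    """Hypothetical stand-in for LSTMLayer: a running sum over the sequence."""
    outputs = []
    for x in inputs:
        state = state + x      # the recurrence: new state from old state and input
        outputs.append(state)  # one output per time step
    return outputs, state      # full output sequence and final state

def toy_stack(inputs, init_states):
    """Feed each layer's output sequence into the next layer, collecting final states."""
    output, final_states = inputs, []
    for state in init_states:  # one initial state per layer
        output, last_state = toy_layer(output, state)
        final_states.append(last_state)
    return output, final_states

output, final_states = toy_stack([1, 2, 3], init_states=[0, 10])
print(output, final_states)  # [11, 14, 20] [6, 20]
```

The returned `output` comes from the top layer only, while `final_states` keeps one entry per layer, mirroring the `output` and `output_states` pair described above.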

In [None]:
class LSTMLayer(nn.Module):
    def __init__(self,*cell_args):
        super(LSTMLayer, self).__init__()
        self.cell = LSTMCell(*cell_args)

    def forward(self, x, state, length_x=None):
        # DO NOT MODIFY
        # type: (Tensor, Tuple[Tensor, Tensor]) -> Tuple[Tensor, Tuple[Tensor, Tensor]]
        inputs = x.unbind(0)
        assert (length_x is None) or torch.all(length_x == length_x.sort(descending=True)[0])
        outputs = []
        out_hidden_state = []
        out_cell_state = []
        for i in range(len(inputs)):
            out, state = self.cell(inputs[i] , state)
            outputs += [out]
            if length_x is not None:
                if torch.any(i+1 == length_x):
                    out_hidden_state = [state[0][i+1==length_x]] + out_hidden_state
                    out_cell_state = [state[1][i+1==length_x]] + out_cell_state
        if length_x is not None:
            state = (torch.cat(out_hidden_state, dim=0), torch.cat(out_cell_state, dim=0))
        return torch.stack(outputs), state


class LSTM(nn.Module):
    def __init__(self, ninp, nhid, num_layers, dropout):
        super(LSTM, self).__init__()
        self.layers = []
        self.dropout = nn.Dropout(dropout)
        for i in range(num_layers):
            if i == 0:
                self.layers.append(LSTMLayer(ninp, nhid))
            else:
                self.layers.append(LSTMLayer(nhid, nhid))
        self.layers = nn.ModuleList(self.layers)

    def forward(self, x, states, length_x=None):
        # WRITE YOUR CODE HERE
        output_states = []
        output = x
        i = 0
        for rnn_layer, state in zip(self.layers, states):
            if i > 0:
                output = self.dropout(output)
            output, out_state = rnn_layer(output, state, length_x=length_x)
            output_states.append(out_state)
            i += 1
        return output, output_states

### (c)  Implement LSTMEncoder[(illustration)](https://docs.google.com/drawings/d/1wt5JhHtsx5b28KEem-_RX0FdeDmtUzSm0SJDypoCmtM/edit?usp=sharing)
LSTMEncoder encodes a sequence of tokens into a context vector. It first embeds the tokenized sequence using the embedding layer followed by a dropout layer, and then the LSTM computes `output` and `context_vector`. Implement the forward function following the given illustration.


In [None]:
class LSTMEncoder(nn.Module):
    def __init__(self):
        super(LSTMEncoder, self).__init__()
        ninp = args.ninp
        nhid = args.nhid
        nlayers = args.nlayers
        dropout = args.dropout
        self.embed = nn.Embedding(src_ntoken, ninp, padding_idx=pad_id)
        self.dropout = nn.Dropout(dropout)
        self.lstm = LSTM(ninp, nhid, nlayers, dropout)

    def forward(self, x, states, length_x=None):
        # WRITE YOUR CODE HERE
        x = self.dropout(self.embed(x))
        output, context_vector = self.lstm(x, states, length_x=length_x)
        return output, context_vector

### (d)  Implement LSTMDecoder[(illustration)](https://docs.google.com/drawings/d/151_NavPXUYtxbEDXBPpeMZnn0HlcZ-UVcIbSZFPU2o4/edit?usp=sharing)
LSTMDecoder takes a single token as input and predicts the next token. Like LSTMEncoder, it first embeds the given input (usually the token predicted at the previous time step) using the embedding layer followed by a dropout layer, and then the LSTM computes `output` and `output_states`. Implement the forward function following the given illustration.

In [None]:
class LSTMDecoder(nn.Module):
    def __init__(self):
        super(LSTMDecoder, self).__init__()
        self.embed = nn.Embedding(trg_ntoken, args.ninp, padding_idx=pad_id)
        self.lstm = LSTM(args.ninp, args.nhid, args.nlayers, args.dropout)
        self.fc_out = nn.Linear(args.nhid, trg_ntoken)
        self.dropout = nn.Dropout(args.dropout)
        self.fc_out.weight = self.embed.weight

    def forward(self, x, states):
        # WRITE YOUR CODE HERE
        x = self.dropout(self.embed(x))
        output, output_states = self.lstm(x, states)
        output = self.fc_out(output)
        return output, output_states

### (e)  Implement LSTMSeq2Seq[(illustration)](https://docs.google.com/drawings/d/1xWxCE44_IaQhtxEXSnvA5fzdBjylj3_Maj9B02glMqs/edit?usp=sharing)
LSTMSeq2Seq is the complete model for neural machine translation. It starts with LSTMEncoder encoding a given tokenized sequence into the context vector. LSTMDecoder then decodes the context vector step by step. As mentioned in the description of LSTMDecoder, each input to the decoder is the token predicted at the previous step. During training, however, one noisy prediction can derail all of the following predictions, so teacher forcing is used. With teacher forcing, LSTMDecoder always receives the ground-truth token as input instead of the token predicted at the previous step. Therefore, implement the forward function so that it feeds the ground-truth token to LSTMDecoder if `teacher_forcing` is True (the training case) and the token predicted at the previous time step otherwise (the inference case). Also note that every sentence starts with the `<sos>` token, so the first input token to LSTMDecoder should always be `<sos>`. Implement the forward function following the given illustration.
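
The difference between the two decoding modes can be seen in a toy example. Here `toy_decoder` is a hypothetical one-step decoder implemented as a lookup table that deliberately predicts "dog" where the target says "cat"; it is an illustration of the control flow only, not of the actual model.

```python
def toy_decoder(token):
    """Hypothetical one-step decoder: a lookup table with one deliberate mistake."""
    next_token = {"<sos>": "a", "a": "dog", "dog": "runs", "cat": "sleeps",
                  "runs": "<eos>", "sleeps": "<eos>"}
    return next_token[token]

def decode(target, teacher_forcing):
    """Predict len(target) - 1 tokens, starting from the <sos> token in target[0]."""
    predictions = []
    for i in range(len(target) - 1):
        if teacher_forcing or i == 0:
            dec_input = target[i]        # ground-truth token
        else:
            dec_input = predictions[-1]  # the model's own previous prediction
        predictions.append(toy_decoder(dec_input))
    return predictions

target = ["<sos>", "a", "cat", "sleeps", "<eos>"]
forced = decode(target, teacher_forcing=True)
free = decode(target, teacher_forcing=False)
print(forced)  # ['a', 'dog', 'sleeps', '<eos>']
print(free)    # ['a', 'dog', 'runs', '<eos>']
```

With teacher forcing, the wrong prediction "dog" affects only its own position; without it, the mistake is fed back and changes every later prediction.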

In [None]:
class LSTMSeq2Seq(nn.Module):
    def __init__(self):
        super(LSTMSeq2Seq, self).__init__()
        self.encoder = LSTMEncoder()
        self.decoder = LSTMDecoder()

    def _get_init_states(self, x):
        init_states = [
            (torch.zeros((x.size(1), args.nhid)).to(x.device),
            torch.zeros((x.size(1), args.nhid)).to(x.device))
            for _ in range(args.nlayers)
        ]
        return init_states

    def forward(self, x, y, length, max_len=None, teacher_forcing=True):
        # WRITE YOUR CODE HERE
        ##### Encoding Procedure
        init_states = self._get_init_states(x)
        _, output_states = self.encoder(x, init_states, length)

        ##### Decoding Procedure
        # Decoding Initialize
        trg_len = y.size(0) if max_len is None else max_len
        dec_outputs = []
        for i in range(trg_len-1):
            if teacher_forcing or i==0:
                dec_input = y[i:i+1]
            else:
                dec_input= dec_output.argmax(-1)
            dec_output, output_states = self.decoder(dec_input, output_states)
            dec_outputs.append(dec_output)
        return torch.cat(dec_outputs, dim=0)


## P2. Implement Transformer

**This section does not depend on the LSTM models above; you can implement the Transformer first if you want to.**

### (a) Implement MaskedMultiheadAttention [(Illustration)](https://docs.google.com/drawings/d/1kCsVW-61xHT-riSDxGzF8VBRo2CpqrE0ngG7H21wirs/edit?usp=sharing)
In this module, you will implement a single layer of multi-head attention, which is the key building block of the Transformer model. Each of the query, key, and value inputs first passes through its own linear projection, and then scaled dot-product attention is performed. Additionally, there is an optional causal mask inside the scaled dot-product attention, applied only in the decoder of the Transformer, that prevents the model from seeing future inputs. Implement the forward function following the given illustration.
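
The core computation can be sketched for a single head in plain Python. This is only an illustration on lists, assuming the standard scaling by the square root of the key dimension; `softmax` and `attention` are hypothetical helper names, and the linked illustration remains the reference for the actual module.

```python
import math

def softmax(scores):
    """Numerically stable softmax; assumes at least one finite score."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention for one head, on plain Python lists."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for t, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:  # hide positions to the right of t
            scores = [s if j <= t else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output component at a time.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x, causal=True)
print(weights[0])  # [1.0, 0.0, 0.0]
```

With the causal mask, the first position can attend only to itself, so its weight row is exactly `[1.0, 0.0, 0.0]`, and every masked entry receives zero probability after the softmax.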

In [None]:
MAX_LEN = 100
class MaskedMultiheadAttention(nn.Module):
    """
    A vanilla multi-head masked attention layer with a projection at the end.
    """
    def __init__(self, mask=False):
        super(MaskedMultiheadAttention, self).__init__()
        assert args.nhid_tran % args.nhead == 0
        # mask: whether to apply the causal (lower-triangular) mask
        # key, query, value projections for all heads
        self.key = nn.Linear(args.nhid_tran, args.nhid_tran)
        self.query = nn.Linear(args.nhid_tran, args.nhid_tran)
        self.value = nn.Linear(args.nhid_tran, args.nhid_tran)
        # regularization
        self.attn_drop = nn.Dropout(args.attn_pdrop)
        # output projection
        self.proj = nn.Linear(args.nhid_tran, args.nhid_tran)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        if mask:
            self.register_buffer("mask", torch.tril(torch.ones(MAX_LEN, MAX_LEN)))
        self.nhead = args.nhead
        self.d_k = args.nhid_tran // args.nhead

    def forward(self, q, k, v, mask=None):
        # WRITE YOUR CODE HERE
        B, T_q, C = q.shape
        _, T, _ = k.shape
        q = self.query(q).view(B, T_q, self.nhead, self.d_k).transpose(1, 2)
        k = self.key(k).view(B, T, self.nhead, self.d_k).transpose(1, 2)
        v = self.value(v).view(B, T, self.nhead, self.d_k).transpose(1, 2)

        # MatMul and Scale (scaled dot-product attention divides by sqrt(d_k), the per-head dimension)
        att = (q @ k.transpose(-2, -1)) / (self.d_k ** 0.5)

        # Mask
        if hasattr(self, 'mask'):
            att[:, :, self.mask[:T_q, :T]==0] = float('-inf')
        if mask is not None:
            assert len(mask.shape) == 2 # batch_size x t
            att = att.transpose(0, 2)
            att[:, :, mask == 0] =  float('-inf')
            att = att.transpose(0, 2)

        # SoftMax
        att = F.softmax(att, dim=-1) #(B, nh, T_q, T)
        # Dropout
        att = self.attn_drop(att)
        # MatMul2
        y = att @ v # (B, nh, T_q, T) x (B, nh, T, hs) -> (B, nh, T_q, hs)
        y = y.transpose(1, 2).contiguous().view(B, T_q, C) # re-assemble all head outputs side by side

        # output projection
        y = self.proj(y)
        return y


### (b) Implement TransformerEncLayer [(Illustration)](https://docs.google.com/drawings/d/1DSJmF8z0g79J0EZCY8WyTDR0pxUmsmqNEVU9jxYrrsY/edit?usp=sharing)
This module is a single layer of the Transformer encoder, containing a multi-head attention layer and a feed-forward network. Both sublayers are preceded by LayerNorm and followed by dropout and a skip connection. You will stack this layer multiple times to create the full encoder. Since the attention here is self-attention, you will pass the same values as the query, key, and value inputs of the `MaskedMultiheadAttention` module.


In [None]:
class TransformerEncLayer(nn.Module):
    def __init__(self):
        super(TransformerEncLayer, self).__init__()
        self.ln1 = nn.LayerNorm(args.nhid_tran)
        self.ln2 = nn.LayerNorm(args.nhid_tran)
        self.attn = MaskedMultiheadAttention()
        self.dropout1 = nn.Dropout(args.resid_pdrop)
        self.dropout2 = nn.Dropout(args.resid_pdrop)
        self.ff = nn.Sequential(
            nn.Linear(args.nhid_tran, args.nff),
            nn.ReLU(),
            nn.Linear(args.nff, args.nhid_tran)
        )

    def forward(self, x, mask=None):
        # WRITE YOUR CODE HERE
        x = self.ln1(x)
        o = self.dropout1(self.attn(x, x, x, mask))
        x = x + o

        x = self.ln2(x)
        o = self.dropout2(self.ff(x))
        x = x + o

        return x

### (c) Implement TransformerDecLayer [(Illustration)](https://docs.google.com/drawings/d/1qNP7ibDTWCRhXJdejtO7jRkvj5RrPN90ORwuK-ANT-4/edit?usp=sharing)
This module is a single layer of the Transformer decoder. It contains two multi-head attentions and a feed-forward network, each with a skip connection and a preceding LayerNorm. The first attention is a self-attention like the encoder's, but with the causal mask enabled. The second attention is a cross-attention: the key and value inputs of this layer are the encoded words of the **source** sentence, given as `enc_o`.



In [None]:
class TransformerDecLayer(nn.Module):
    def __init__(self):
        super(TransformerDecLayer, self).__init__()
        self.ln1 = nn.LayerNorm(args.nhid_tran)
        self.ln2 = nn.LayerNorm(args.nhid_tran)
        self.ln3 = nn.LayerNorm(args.nhid_tran)
        self.dropout1 = nn.Dropout(args.resid_pdrop)
        self.dropout2 = nn.Dropout(args.resid_pdrop)
        self.dropout3 = nn.Dropout(args.resid_pdrop)
        self.attn1 = MaskedMultiheadAttention(mask=True) # self-attention
        self.attn2 = MaskedMultiheadAttention() # tgt to src attention
        self.ff = nn.Sequential(
            nn.Linear(args.nhid_tran, args.nff),
            nn.ReLU(),
            nn.Linear(args.nff, args.nhid_tran)
        )

    def forward(self, x, enc_o, enc_mask=None):
        # WRITE YOUR CODE HERE
        x = self.ln1(x)
        o = self.dropout1(self.attn1(x, x, x))
        x = x + o

        x = self.ln2(x)
        o = self.dropout2(self.attn2(x, enc_o, enc_o, enc_mask))
        x = x + o

        x = self.ln3(x)
        o = self.dropout3(self.ff(x))
        x = x + o

        return x

### (d) Implement TransformerEncoder [(Illustration)](https://docs.google.com/drawings/d/1WtbU0xcaAVWsVegSwO0AzZBfk9NNDltTY9P-GiImVqg/edit?usp=sharing)
In this module, you will first embed the input tokens, apply positional encoding (refer to the `PositionalEncoding` class that we've implemented for you), then pass the result through multiple layers of TransformerEncLayer, and conclude with a LayerNorm.
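
To make the positional encoding concrete, the sinusoidal table can be reproduced in plain Python. This is a small sketch of the same formula (sine on even columns, cosine on odd columns, with denominator `10000 ** (2 * i / dim)`); `positional_encoding` is a hypothetical helper name and it assumes an even `dim`.

```python
import math

def positional_encoding(max_len, dim):
    """Sinusoidal table: sin on even columns, cos on odd columns."""
    table = [[0.0] * dim for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(dim // 2):
            angle = pos / (10000 ** (2 * i / dim))
            table[pos][2 * i] = math.sin(angle)
            table[pos][2 * i + 1] = math.cos(angle)
    return table

pe = positional_encoding(max_len=4, dim=6)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Position 0 always maps to alternating zeros and ones, and later positions rotate each sine and cosine pair at a different frequency, which gives every position a distinct vector that is simply added to the token embedding.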


In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, max_len=4096):
        super().__init__()
        dim = args.nhid_tran
        pos = np.arange(0, max_len)[:, None]
        i = np.arange(0, dim // 2)
        denom = 10000 ** (2 * i / dim)

        pe = np.zeros([max_len, dim])
        pe[:, 0::2] = np.sin(pos / denom)
        pe[:, 1::2] = np.cos(pos / denom)
        pe = torch.from_numpy(pe).float()

        self.register_buffer('pe', pe)

    def forward(self, x):
        # DO NOT MODIFY
        return x + self.pe[:x.shape[1]]

class TransformerEncoder(nn.Module):

    def __init__(self):
        super(TransformerEncoder, self).__init__()
        # input embedding stem
        self.tok_emb = nn.Embedding(src_ntoken, args.nhid_tran)
        self.pos_enc = PositionalEncoding()
        self.dropout = nn.Dropout(args.embd_pdrop)
        # transformer
        self.transform = nn.ModuleList([TransformerEncLayer() for _ in range(args.nlayers_transformer)])
        # decoder head
        self.ln_f = nn.LayerNorm(args.nhid_tran)


    def forward(self, x, mask):
        # WRITE YOUR CODE HERE
        x = self.dropout(self.pos_enc(self.tok_emb(x)))

        for m in self.transform:
            x = m(x, mask)
        outputs = self.ln_f(x)

        return outputs

### (e) Implement TransformerDecoder [(Illustration)](https://docs.google.com/drawings/d/1cipGddhtugoqM31H6_5fswHSYCXPSiMrJ85GdhGfz4c/edit?usp=sharing)
TransformerDecoder is nearly identical to TransformerEncoder, with two differences: it uses TransformerDecLayer instead of TransformerEncLayer, and it has an extra linear layer at the very end of the pipeline.


In [None]:
class TransformerDecoder(nn.Module):
    def __init__(self):
        super(TransformerDecoder, self).__init__()
        self.tok_emb = nn.Embedding(trg_ntoken, args.nhid_tran)
        self.pos_enc = PositionalEncoding()
        self.dropout = nn.Dropout(args.embd_pdrop)
        self.transform = nn.ModuleList([TransformerDecLayer() for _ in range(args.nlayers_transformer)])
        self.ln_f = nn.LayerNorm(args.nhid_tran)
        self.lin_out = nn.Linear(args.nhid_tran, trg_ntoken)
        self.lin_out.weight = self.tok_emb.weight


    def forward(self, x, enc_o, enc_mask):
        # WRITE YOUR CODE HERE
        x = self.dropout(self.pos_enc(self.tok_emb(x)))

        for m in self.transform:
            x = m(x, enc_o, enc_mask)
        x = self.ln_f(x)
        logits = self.lin_out(x)

        logits /= args.nhid_tran ** 0.5 # Scaling logits. Do not modify this
        return logits

### (f) Implement Transformer [(Illustration)](https://docs.google.com/drawings/d/18BeRA4Jl--rR5Txvyfr3nA5SA7ve8nESjZJSiD51ULw/edit?usp=sharing)
Finally, we combine everything to construct the full Transformer model. Begin by creating a mask from the `length_x` parameter, and pass the inputs through TransformerEncoder to obtain the encoder output. If the model is in training mode (`self.training == True`) or teacher forcing is enabled, run the decoder exactly once on the target sequence without its last token, which predicts the next token at every position in parallel. Otherwise, run the decoder `max_len - 1` times, appending the newly predicted token to the decoder input each time so that it grows to `max_len` tokens. The first token fed to the decoder is always the first token of `y`.
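
The padding mask built from the sentence lengths can be illustrated in plain Python. `padding_mask` is a hypothetical helper name; it computes one boolean row per sentence that is true for real tokens and false for padding.

```python
def padding_mask(lengths, max_len):
    """True for real tokens, False for padding; one row per sentence."""
    return [[t < n for t in range(max_len)] for n in lengths]

mask = padding_mask(lengths=[4, 2], max_len=4)
print(mask)  # [[True, True, True, True], [True, True, False, False]]
```

Positions where the mask is false are the ones the attention layers should ignore, so padded source tokens never contribute to the encoder output or to the cross-attention.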


In [None]:
class Transformer(nn.Module):
    def __init__(self):
        super(Transformer, self).__init__()
        self.encoder = TransformerEncoder()
        self.decoder = TransformerDecoder()

    def forward(self, x, y, length_x, max_len=None, teacher_forcing=True):
        # WRITE YOUR CODE HERE
        max_len_src = x.size(1)
        if length_x is not None:
            enc_mask = length_x.view(-1, 1) > torch.arange(max_len_src).to(length_x.device)
        else:
            enc_mask = None
        enc_o = self.encoder(x, enc_mask)
        if self.training or teacher_forcing:
            return self.decoder(y[:, :-1], enc_o, enc_mask)
        # Inference: greedy autoregressive decoding, one new token per step
        dec_input = y[:, :1] # batch_size x 1
        if max_len is None:
            max_len = y.shape[1]
        for t in range(1, max_len):
            dec_output = self.decoder(dec_input, enc_o, enc_mask)
            dec_input = torch.cat((dec_input, dec_output[:, -1:].argmax(-1)), dim=-1) # batch_size x (t + 1)
        return dec_output
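
The decoding loop without teacher forcing can be easier to follow on a toy problem. The sketch below is plain Python and makes several assumptions: `toy_decoder` is a hypothetical stand-in for the real decoder (it always prefers "previous token + 1"), there is no batch dimension, and the function returns token ids rather than logits, unlike the model above.

```python
def toy_decoder(prefix, vocab_size=5):
    """Hypothetical stand-in for a decoder: one list of scores per prefix position.

    At every position it gives the highest score to (token + 1) % vocab_size.
    """
    scores = []
    for tok in prefix:
        row = [0.0] * vocab_size
        row[(tok + 1) % vocab_size] = 1.0
        scores.append(row)
    return scores


def greedy_decode(decoder, first_token, max_len):
    """Grow the prefix one token at a time, re-running the decoder on the whole prefix."""
    prefix = [first_token]
    for _ in range(1, max_len):           # max_len - 1 decoder calls
        scores = decoder(prefix)          # len(prefix) x vocab_size
        last = scores[-1]                 # only the newest position is used
        best = max(range(len(last)), key=last.__getitem__)  # argmax over the vocabulary
        prefix.append(best)
    return prefix


print(greedy_decode(toy_decoder, first_token=0, max_len=4))  # [0, 1, 2, 3]
```

Each iteration feeds the entire prefix back in, which is why generation costs `max_len - 1` decoder passes while teacher forcing needs only one.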

## Run Experiment

You can run the experiment once you have finished at least one of the three models. However, please **run every single cell above (even the cells you haven't implemented yet)** so that `run_experiment` works properly. Training a model for 20 epochs should take less than an hour.

In [None]:
def run_experiment(model):
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)

    optimizer = optim.Adam(model.parameters(), lr=args.lr_lstm if not isinstance(model, Transformer) else args.lr_transformer)

    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
            factor=0.25, patience=1, threshold=0.0001, threshold_mode='rel',
            cooldown=0, min_lr=0, eps=1e-08, verbose=False)

    best_val_loss = np.inf
    for epoch in tq.tqdm(range(args.epochs)):
        run_epoch(epoch, model, optimizer, is_train=True)
        with torch.no_grad():
            val_loss = run_epoch(epoch, model, None, is_train=False)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            save_model(model, 'best')
        save_model(model)
        scheduler.step(val_loss)

## P3. Train and Validate models
**About evaluation metrics**

[PPL (Perplexity)](https://en.wikipedia.org/wiki/Perplexity) is the exponential of the average cross-entropy loss. It can be interpreted as "how many words the model is effectively choosing among at each time step". A lower perplexity means the model is more confident in its output.
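
As a quick sanity check, perplexity can be recomputed from a loss value by hand. This is a minimal sketch assuming the loss is a mean cross-entropy measured in nats (natural logarithm); the helper name `perplexity` is hypothetical. The first loss value is copied from the LSTM training log in this notebook, where the logged PPL for that epoch is about 75.46.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity as the exponential of a mean cross-entropy given in nats."""
    return math.exp(mean_cross_entropy)


# Loss logged for LSTM epoch 0 (train) in this notebook.
print(perplexity(4.323539385868659))

# A model that spreads probability uniformly over 10 words has loss ln(10),
# so its perplexity is 10: it is "choosing among" 10 candidates.
print(perplexity(math.log(10)))
```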

[BLEU (Bilingual Evaluation Understudy) Score](https://en.wikipedia.org/wiki/BLEU) is a general metric for the quality of machine translation output. Its calculation combines three elements:
- Precision: What fraction of the n-grams in the predicted sentence also appear in the true sentence?
- Clipping: An n-gram repeated in the predicted sentence is credited at most as many times as it occurs in the true sentence.
- Brevity penalty: A predicted sentence that is shorter than the true sentence is penalized, so very short outputs cannot score well on precision alone.

A higher BLEU score indicates a better-quality translation.
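
The three elements can be sketched in a few lines of plain Python. This is an illustration for a single sentence with a single reference and no smoothing; it is not the scoring function used by `run_translation`, and all function names here are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(candidate, reference):
    """1.0 if the candidate is at least as long as the reference, else exp(1 - r/c)."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)


def sentence_bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of the 1..max_n clipped precisions."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)


reference = "the cat is on the mat".split()
candidate = "the the the the".split()

# "the" occurs 4 times in the candidate but only 2 times in the reference,
# so clipping credits 2 of the 4 unigrams.
print(clipped_precision(candidate, reference, 1))  # 0.5
print(brevity_penalty(candidate, reference))       # exp(1 - 6/4), about 0.61
print(sentence_bleu(reference, reference))         # 1.0 for a perfect match
```

Without clipping, the degenerate candidate above would get a unigram precision of 1.0, which is exactly the failure mode clipping is meant to prevent.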

**Expected BLEU Score**
- LSTMSeq2Seq: 0.245 ±0.02
- Transformer: 0.358 ±0.02

In [None]:
lstm_model = LSTMSeq2Seq().to(device)
lstm_model.apply(init_weights)
run_experiment(lstm_model)
run_translation(lstm_model, test_iter, max_len=100)
print('')

Epoch 0 Train Loss 4.323539385868659 Acc 0.3020846061731714 PPL 75.45522137006377
Epoch 0 Valid Loss 3.5830149840780243 Acc 0.37049861495844877 PPL 35.98186221308135
Epoch 1 Train Loss 3.4227984112167134 Acc 0.40108262665265526 PPL 30.655080622819234
Epoch 1 Valid Loss 3.1426588538445923 Acc 0.43843490304709143 PPL 23.165378402809186
Epoch 2 Train Loss 3.0496699364683684 Acc 0.44974950511987094 PPL 21.108376167435594
Epoch 2 Valid Loss 2.863048366942234 Acc 0.47181440443213296 PPL 17.51483729153387
Epoch 3 Train Loss 2.7622231056929163 Acc 0.48531000268823776 PPL 15.835006740997565
Epoch 3 Valid Loss 2.6588375709558787 Acc 0.5018005540166205 PPL 14.279680332132656
Epoch 4 Train Loss 2.5279890915146 Acc 0.5133141083604194 PPL 12.528287549290305
Epoch 4 Valid Loss 2.5133029524309154 Acc 0.5238227146814405 PPL 12.345639853410463
Epoch 5 Train Loss 2.330103936713999 Acc 0.5388621422810919 PPL 10.27900984403047
Epoch 5 Valid Loss 2.390129504771774 Acc 0.5429362880886427 PPL 10.9149073841060

In [None]:
transformer_model = Transformer().to(device)
run_experiment(transformer_model)
run_translation(transformer_model, test_iter, max_len=100)
print('')

Epoch 0 Train Loss 4.811685344572684 Acc 0.38517803465382827 PPL 122.93863698331013
Epoch 0 Valid Loss 3.7916825116836463 Acc 0.48822714681440443 PPL 44.330924863187555
Epoch 1 Train Loss 3.6758363614471206 Acc 0.4949705515775068 PPL 39.481663993960865
Epoch 1 Valid Loss 3.1721370085124496 Acc 0.5426592797783933 PPL 23.858415553976986
Epoch 2 Train Loss 3.1995236115079346 Acc 0.5348615557564945 PPL 24.52084596537994
Epoch 2 Valid Loss 2.826955975521965 Acc 0.5722299168975069 PPL 16.893956856439342
Epoch 3 Train Loss 2.8972878206807233 Acc 0.5624282118331337 PPL 18.12492061158609
Epoch 3 Valid Loss 2.6268052824811594 Acc 0.5885041551246537 PPL 13.829517850410818
Epoch 4 Train Loss 2.6784492715741273 Acc 0.5822600747818861 PPL 14.562493304866067
Epoch 4 Valid Loss 2.475264404008263 Acc 0.6058171745152354 PPL 11.884849099443697
Epoch 5 Train Loss 2.507999816642146 Acc 0.5990762237591339 PPL 12.280342542153681
Epoch 5 Valid Loss 2.341220359376263 Acc 0.6181440443213296 PPL 10.3939131355223