# Translation with a Sequence to Sequence Network

This project is split into two sections: Neural Machine Translation with (1) RNNs and (2) Transformer. To be more specifically, in Machine Translation, our goal is to convert a sentence from the source language (e.g. Vietnamese) to the target language (e.g. English). In this project, we will implement a sequence-to-sequence (Seq2Seq) network based on two architectures: **RNNs with Attention** and **Transformer**, to build a Neural Machine Translation (NMT) system.

That's a lot to digest, the goal of this project is to break it down into easy to understand parts. In this project we will:

- Prepare the data.
- Implement necessary components:
    - With RNNs and attention architecture:
        - Embedding Layer: to initialize the necessary word embeddings
        - Declare basic components of our model.
        - The Encoder & Decoder

    - With Transformer Architecture:
        - Positional embeddings.
        - Transformer Layer

- Build & train two our models.
- Generate translations.

**Requirements**

Firstly, apart from standards libraries, we need to install some package:

1, *sentencepiece*: To build our own vocabulary \\
2, *sacrebleu*: To evaluate our model using BLUE score metric

In [1]:
%%capture
!pip install sentencepiece==0.1.97
!pip install tqdm==4.29.1
!pip install sacrebleu
!pip install nltk

Below, we import our standard libraries.

In [2]:
# Standard libraries
import sys
import json
import time
import math
import numpy as np
from typing import List, Tuple, Dict, Set, Union
from collections import Counter, namedtuple
from itertools import chain
from dataclasses import dataclass

# to compute BLUE score
import sacrebleu

# Pytorch
import torch
import torch.nn as nn
import torch.nn.utils
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence

# To train vocabulary
import sentencepiece as spm

# Ensure that all operations are deterministic on GPU (if used) for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
cd /content/drive/MyDrive/Colab Notebooks/VietAI/ASM

/content/drive/MyDrive/Colab Notebooks/VietAI/ASM


In [5]:
#@title Default hyperparameters
@dataclass
class Args:
    cuda: str = "cuda:0"
    train_src: str = "data/train.vi"
    train_tgt: str = "data/train.en"
    dev_src: str = "data/dev.vi"
    dev_tgt: str = 'data/dev.en'
    vocab_file: str = 'vocab.json'
    src_vocab_size: int = 15000
    tgt_vocab_size: int = 21000
    seed: int = 0
    batch_size: int = 32
    max_len: int = 320
    embed_size: int = 1024
    hidden_size: int = 768
    clip_grad: float = 5.0                  # gradient clipping
    log_every: int = 10                     # log every
    max_epoch: int = 30                     # max epoch
    patience: int = 5                       # wait for how many iterations to decay learning rate
    max_num_trial: int = 5                  # terminate training after how many trials
    lr_decay: float = 0.5                   # learning rate decay
    beam_size: int = 5                      # beam size
    lr: float = 0.001                       # learning rate
    uniform_init: float = 0.1               # uniformly initialize all parameters
    model_save_path: str = 'lstm_model.bin' # model save path
    valid_niter: int = 2000                 # perform validation after how many iterations
    dropout: float = 0.3    
    max_decoding_time_step: int = 70        # maximum number of decoding time steps

args = Args()
device = torch.device(args.cuda) if torch.cuda.is_available() else torch.device("cpu")

seed = int(args.seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
np.random.seed(seed * 13 // 7)

Loading data files
==================

The data for this project is a set of many thousands of Vietnamese to
English translation pairs. We will download them first then save to 'data' folder.

In [None]:
%%capture
!pip install gdown
!mkdir data
import gdown

data_path = 'https://drive.google.com/file/d/1eq68XlKxWBFCj4YgMRl2N5YdrZvB9FDs/view?usp=sharing'
gdown.download(data_path, args.train_src, quiet=False, fuzzy=True)

data_path = 'https://drive.google.com/file/d/1679j2kIvdl8Oe_WRSX0vi62JtOrhr1GD/view?usp=sharing'
gdown.download(data_path, args.train_tgt, quiet=False, fuzzy=True)

data_path = 'https://drive.google.com/file/d/1p0tBxnD-MVXyve772omfq1nFDraeI_sO/view?usp=sharing'
gdown.download(data_path, args.dev_src, quiet=False, fuzzy=True)

data_path = 'https://drive.google.com/file/d/1ZvBBTUwzYJuN4J8WCZ9-kZiBEm4iPpiL/view?usp=sharing'
gdown.download(data_path, args.dev_tgt, quiet=False, fuzzy=True)

## Padding function
In order to apply tensor operations, we must ensure that the sentences in a given batch are of the same length. Thus, we must identify the longest sentence in a batch and pad others to be the same length. Implement the pad_sents function, which shall produce these padded sentences.

In [6]:
def pad_sents(sents, pad_token):
    """ Pad list of sentences according to the longest sentence in the batch.
        The paddings should be at the end of each sentence.
    @param sents (list[list[str]]): list of sentences, where each sentence
                                    is represented as a list of words
    @param pad_token (str): padding token
    @returns sents_padded (list[list[str]]): list of sentences where sentences shorter
        than the max length sentence are padded out with the pad_token, such that
        each sentences in the batch now has equal length.
    """
    sents_padded = []

    len_list = [len(l) for l in sents]
    max_len = sorted(len_list,reverse=True)[0]
    for l in sents:
        for i in range(max_len-len(l)):
            l.append(pad_token)
        sents_padded.append(l)

    return sents_padded

Below, we define the *VocabEntry* class. The *VocabEntry* class is a vocabulary entry that contains a dictionary that maps words to indices and provides methods to convert words to indices, indices to words, and sentences to tensors. The purpose of this class is to facilitate the management of the vocabulary.

In [7]:
class VocabEntry(object):
    """ Vocabulary Entry, i.e. structure containing either
    src or tgt language terms.
    """
    def __init__(self, word2id=None):
        """ Init VocabEntry Instance.
        @param word2id (dict): dictionary mapping words 2 indices
        """
        if word2id:
            self.word2id = word2id
        else:
            self.word2id = dict()
            self.word2id['<pad>'] = 0   # Pad Token
            self.word2id['<s>'] = 1 # Start Token
            self.word2id['</s>'] = 2    # End Token
            self.word2id['<unk>'] = 3   # Unknown Token
        self.unk_id = self.word2id['<unk>']
        self.id2word = {v: k for k, v in self.word2id.items()}

    def __getitem__(self, word):
        """ Retrieve word's index. Return the index for the unk
        token if the word is out of vocabulary.
        @param word (str): word to look up.
        @returns index (int): index of word 
        """
        return self.word2id.get(word, self.unk_id)

    def __contains__(self, word):
        """ Check if word is captured by VocabEntry.
        @param word (str): word to look up
        @returns contains (bool): whether word is contained    
        """
        return word in self.word2id

    def __setitem__(self, key, value):
        """ Raise error, if one tries to edit the VocabEntry.
        """
        raise ValueError('vocabulary is readonly')

    def __len__(self):
        """ Compute number of words in VocabEntry.
        @returns len (int): number of words in VocabEntry
        """
        return len(self.word2id)

    def __repr__(self):
        """ Representation of VocabEntry to be used
        when printing the object.
        """
        return 'Vocabulary[size=%d]' % len(self)

    def id2word(self, wid):
        """ Return mapping of index to word.
        @param wid (int): word index
        @returns word (str): word corresponding to index
        """
        return self.id2word[wid]

    def add(self, word):
        """ Add word to VocabEntry, if it is previously unseen.
        @param word (str): word to add to VocabEntry
        @return index (int): index that the word has been assigned
        """
        if word not in self:
            wid = self.word2id[word] = len(self)
            self.id2word[wid] = word
            return wid
        else:
            return self[word]

    def words2indices(self, sents):
        """ Convert list of words or list of sentences of words
        into list or list of list of indices.
        @param sents (list[str] or list[list[str]]): sentence(s) in words
        @return word_ids (list[int] or list[list[int]]): sentence(s) in indices
        """
        if type(sents[0]) == list:
            for i in range(len(sents)):
                # set max length
                sents[i] = sents[i][:args.max_len]
            return [[self[w] for w in s] for s in sents]
        else:
            # set max length
            sents = sents[i][:args.max_len]
            return [self[w] for w in sents]

    def indices2words(self, word_ids):
        """ Convert list of indices into words.
        @param word_ids (list[int]): list of word ids
        @return sents (list[str]): list of words
        """
        return [self.id2word[w_id] for w_id in word_ids]

    def to_input_tensor(self, sents: List[List[str]], device: torch.device) -> torch.Tensor:
        """ Convert list of sentences (words) into tensor with necessary padding for 
        shorter sentences.

        @param sents (List[List[str]]): list of sentences (words)
        @param device: device on which to load the tesnor, i.e. CPU or GPU

        @returns sents_var: tensor of (max_sentence_length, batch_size)
        """
        word_ids = self.words2indices(sents)
        sents_t = pad_sents(word_ids, self['<pad>'])
        sents_var = torch.tensor(sents_t, dtype=torch.long, device=device)
        return torch.t(sents_var)

    @staticmethod
    def from_corpus(corpus, size, freq_cutoff=2):
        """ Given a corpus construct a Vocab Entry.
        @param corpus (list[str]): corpus of text produced by read_corpus function
        @param size (int): # of words in vocabulary
        @param freq_cutoff (int): if word occurs n < freq_cutoff times, drop the word
        @returns vocab_entry (VocabEntry): VocabEntry instance produced from provided corpus
        """
        vocab_entry = VocabEntry()
        word_freq = Counter(chain(*corpus))
        valid_words = [w for w, v in word_freq.items() if v >= freq_cutoff]
        print('number of word types: {}, number of word types w/ frequency >= {}: {}'
              .format(len(word_freq), freq_cutoff, len(valid_words)))
        top_k_words = sorted(valid_words, key=lambda w: word_freq[w], reverse=True)[:size]
        for word in top_k_words:
            vocab_entry.add(word)
        return vocab_entry
    
    @staticmethod
    def from_subword_list(subword_list):
        vocab_entry = VocabEntry()
        for subword in subword_list:
            vocab_entry.add(subword)
        return vocab_entry

Afterwards, we use a **Vocab** class to wrap vocabulary used for both the source and target languages in a machine translation task. It is composed of two VocabEntry objects, one for the source language and one for the target language.

The build method is used to construct a **Vocab** object from a list of subwords generated by SentencePiece for both the source and target languages. Then, we save them to a JSON file.

In [8]:
class Vocab(object):
    """ Vocab encapsulating src and target langauges.
    """
    def __init__(self, src_vocab: VocabEntry, tgt_vocab: VocabEntry):
        """ Init Vocab.
        @param src_vocab (VocabEntry): VocabEntry for source language
        @param tgt_vocab (VocabEntry): VocabEntry for target language
        """
        self.src = src_vocab
        self.tgt = tgt_vocab

    @staticmethod
    def build(src_sents, tgt_sents) -> 'Vocab':
        """ Build Vocabulary.
        @param src_sents (list[str]): Source subwords provided by SentencePiece
        @param tgt_sents (list[str]): Target subwords provided by SentencePiece
        """

        print('initialize source vocabulary ..')
        src = VocabEntry.from_subword_list(src_sents)

        print('initialize target vocabulary ..')
        tgt = VocabEntry.from_subword_list(tgt_sents)

        return Vocab(src, tgt)

    def save(self, file_path):
        """ Save Vocab to file as JSON dump.
        @param file_path (str): file path to vocab file
        """
        with open(file_path, 'w') as f:
            json.dump(dict(src_word2id=self.src.word2id, tgt_word2id=self.tgt.word2id), f, indent=2)

    @staticmethod
    def load(file_path):
        """ Load vocabulary from JSON dump.
        @param file_path (str): file path to vocab file
        @returns Vocab object loaded from JSON dump
        """
        entry = json.load(open(file_path, 'r'))
        src_word2id = entry['src_word2id']
        tgt_word2id = entry['tgt_word2id']

        return Vocab(VocabEntry(src_word2id), VocabEntry(tgt_word2id))

    def __repr__(self):
        """ Representation of Vocab to be used
        when printing the object.
        """
        return 'Vocab(source %d words, target %d words)' % (len(self.src), len(self.tgt))


def get_vocab_list(file_path, source, vocab_size):
    """ Use SentencePiece to tokenize and acquire list of unique subwords.
    @param file_path (str): file path to corpus
    @param source (str): tgt or src
    @param vocab_size: desired vocabulary size
    """ 
    spm.SentencePieceTrainer.Train(input=file_path, model_prefix=source, vocab_size=vocab_size)     # train the spm model
    sp = spm.SentencePieceProcessor()   # create an instance; this saves .model and .vocab files 
    sp.Load('{}.model'.format(source))  # loads tgt.model or src.model
    sp_list = [sp.IdToPiece(piece_id) for piece_id in range(sp.GetPieceSize())] # this is the list of subwords
    return sp_list 

### Train and save our vocabulary to a json file

In [9]:
print('read in source sentences: %s' % args.train_src)
print('read in target sentences: %s' % args.train_tgt)

src_sents = get_vocab_list(args.train_src, source='src', vocab_size=args.src_vocab_size)
tgt_sents = get_vocab_list(args.train_tgt, source='tgt', vocab_size=args.tgt_vocab_size)
vocab = Vocab.build(src_sents, tgt_sents)
print('generated vocabulary, source %d words, target %d words' % (len(src_sents), len(tgt_sents)))

vocab.save(args.vocab_file)
print('vocabulary saved to %s' % args.vocab_file)

read in source sentences: data/train.vi
read in target sentences: data/train.en
initialize source vocabulary ..
initialize target vocabulary ..
generated vocabulary, source 15000 words, target 21000 words
vocabulary saved to vocab.json


### Read sentence pairs for training
The full process for preparing the data is:

- Read text file  into pairs
- Encode raw text into subwords
- Add word lists into our data

In [10]:
def read_corpus(file_path, source):
    """ Read file, where each sentence is dilineated by a `\n`.
    @param file_path (str): path to file containing corpus
    @param source (str): "tgt" or "src" indicating whether text
        is of the source language or target language
    """
    data = []
    sp = spm.SentencePieceProcessor()
    sp.load('{}.model'.format(source))

    with open(file_path, 'r', encoding='utf8') as f:
        for line in f:
            subword_tokens = sp.encode_as_pieces(line)
            # only append <s> and </s> to the target sentence
            if source == 'tgt':
                subword_tokens = ['<s>'] + subword_tokens + ['</s>']
            data.append(subword_tokens)

    return data

train_data_src = read_corpus(args.train_src, source='src')
train_data_tgt = read_corpus(args.train_tgt, source='tgt')

dev_data_src = read_corpus(args.dev_src, source='src')
dev_data_tgt = read_corpus(args.dev_tgt, source='tgt')

train_data = list(zip(train_data_src, train_data_tgt))
dev_data = list(zip(dev_data_src, dev_data_tgt))

In [11]:
# We define the batch_iter function to iterate through the given data in batches of a specified size, where each batch contains source and target sentences. 
# The sentences are sorted in reverse order by their length, so that longer sentences come first. 
# The function takes three arguments: the data to iterate through, the batch size, and a flag indicating whether to shuffle the data randomly or not.

def batch_iter(data, batch_size, shuffle=False):
    """ Yield batches of source and target sentences reverse sorted by length (largest to smallest).
    @param data (list of (src_sent, tgt_sent)): list of tuples containing source and target sentence
    @param batch_size (int): batch size
    @param shuffle (boolean): whether to randomly shuffle the dataset
    """
    batch_num = math.ceil(len(data) / batch_size)
    index_array = list(range(len(data)))

    if shuffle:
        np.random.shuffle(index_array)

    for i in range(batch_num):
        indices = index_array[i * batch_size: (i + 1) * batch_size]
        examples = [data[idx] for idx in indices]

        examples = sorted(examples, key=lambda e: len(e[0]), reverse=True)
        src_sents, tgt_sents = list(), list()
        for src_sent, tgt_sent in examples:
            if len(src_sent) > 0 and len(tgt_sent) > 0:
                src_sents.append(src_sent)
                tgt_sents.append(tgt_sent)
        yield src_sents, tgt_sents

## Embedding Layer Initilization

Implement the __init__ function in model_embeddings.py to initialize the necessary source and target embeddings.

In [12]:
class ModelEmbeddings(nn.Module): 
    """
    Class that converts input words to their embeddings.
    """
    def __init__(self, embed_size, vocab):
        """
        Init the Embedding layers.

        @param embed_size (int): Embedding size (dimensionality)
        @param vocab (Vocab): Vocabulary object containing src and tgt languages
                              See vocab.py for documentation.
        """
        super(ModelEmbeddings, self).__init__()
        self.embed_size = embed_size

        # default values

        src_pad_token_idx = vocab.src['<pad>']
        tgt_pad_token_idx = vocab.tgt['<pad>']

        ### TODO - Initialize the following variables:
        ###     self.source (Embedding Layer for source language)
        ###     self.target (Embedding Layer for target langauge)

        self.source = nn.Embedding(len(vocab.src),embed_size,padding_idx=src_pad_token_idx)
        self.target = nn.Embedding(len(vocab.tgt),embed_size,padding_idx=tgt_pad_token_idx)



# The Seq2Seq Model 1: RNNs with global attention

In this section, we describe the training procedure for the proposed NMT system, which uses a Bidirectional LSTM Encoder and a Unidirectional LSTM Decoder. 

<img src="https://i.ibb.co/pjRW6tC/arc.png" alt="arc" border="0" width=70%>

# Model description (training procedure)

Given a sentence in the source language, we look up the character or word embeddings from an **embeddings matrix**, yielding $x_1,...,x_m (x_i \in \mathbb{R}^e)$, where $m$ is the length of the source sentence and e is the embedding size. We feed the embeddings to the bidirectional encoder, yielding hidden states and cell states for both the forwards (→) and backwards (←) LSTMs. The forwards and backwards versions are concatenated to give hidden states $h^{enc}_i$ and cell states $c^{enc}_i$ :

$$ h^{enc}_i = [\overleftarrow{h^{enc}_i}; \overrightarrow{h^{enc}_i}] \:\: \text{where} \:\: h^{enc}_i \in \mathbb{R}^{2h \times 1} $$
$$ c^{enc}_i = [\overleftarrow{c^{enc}_i}; \overrightarrow{c^{enc}_i}] \:\: \text{where} \:\: c^{enc}_i \in \mathbb{R}^{2h \times 1} $$ \\

We then initialize the **decoder**’s first hidden state $h^{enc}_0$ and cell state $c^{enc}_0$  with a linear projection of the encoder’s final hidden state and final cell state.

$$ h^{dec}_0 = W_h[\overleftarrow{h^{enc}_1}; \overrightarrow{h^{enc}_m}] \:\: \text{where} \:\: h^{dec}_0 \in \mathbb{R}^{h \times 1} $$
$$ c^{dec}_0 = W_c[\overleftarrow{c^{enc}_1}; \overrightarrow{c^{enc}_m}] \:\: \text{where} \:\: c^{dec}_0 \in \mathbb{R}^{h \times 1} $$ \\

With the decoder initialized, we must now feed it a target sentence. On the $t^{th}$ step, we look up the embedding for the $t^{th}$ subword, $y_t \in \mathbb{R}^{e \times 1}$ . We then concatenate $y_t$ with the combined-output vector $o_{t-1} \in \mathbb{R}^{h \times 1}$ from the previous timestep (we will explain what this is later!) to produce $\bar{y_t} \in \mathbb{R}^{(e+h) \times 1}$. Note that for the first target subword (i.e. the start token) $o_0$ is a zero-vector. We then feed $\bar{y_t}$ as input to the decoder.


$$ h^{dec}_t , c^{dec}_t = \text{Decoder}(\bar{y_t},  h^{dec}_{t-1} , c^{dec}_{t-1} ) \:\:\: \text{where} \:\:\: h^{dec}_t \in \mathbb{R}^{h \times 1} , c^{dec}_t \in \mathbb{R}^{h \times 1} $$ \\

We then use $h^{dec}_t$ to compute multiplicative attention over $h^{enc}_1,...,, h^{enc}_m$ :

$$ e_{t,i} = (h_t^{dec})^TW_{attProj}h^{enc}_i \:\:\: \text{where} \:\:\: e_t \in \mathbb{R}^{m \times 1}, W_{attProj} \in \mathbb{R}^{h \times 2h} $$

$$ \alpha_t = softmax(e_t) \:\:\: \text{where} \:\:\: \alpha_t \in \mathbb{R}^{m \times 1}$$

$$ a_t = ∑_{i=1}^m \alpha_{t, i} h^{enc}_i \:\:\: \text{where} \:\:\: a_t \in \mathbb{R}^{2h \times 1}$$ \\

We now concatenate the attention output $a_t$ with the decoder hidden state $h^{dec}_t$ and pass this through a linear layer, tanh, and dropout to attain the *combined-output* vector $o_t$.

$$ u_t = [a_t;h^{dec}_t] \:\:\: \text{where} \:\:\: u_t \in \mathbb{R}^{3h \times 1} $$

$$ v_t = W_uu_t \:\:\: where \:\:\: v_t \in \mathbb{R}^{h \times 1},W_u \in \mathbb{R}^{h \times 3h}$$

$$ o_t = dropout(tanh(v_t)) \:\:\: where \:\:\: o_t \in \mathbb{R}^{h \times 1}$$ \\

Then, we produce a probability distribution $P_t$ over target subwords at the $t^{th}$ timestep: 

$$ P_t = softmax(W_{vocab}o_t) \:\:\: where \:\:\: P_t \in \mathbb{R}^{V_t \times 1}, W_{vocab}\in \mathbb{R}^{V_t \times h} $$

Here, $V_t$ is the size of the target vocabulary. Finally, to train the network we then compute the cross entropy loss between $P_t$ and $g_t$, where $g_t$ is the one-hot vector of the target subword at timestep $t$:

$$ J_t(θ) = CrossEntropy(P_t, g_t)$$
Here, $θ$ represents all the parameters of the model and $J_t(θ)$ is the loss on step t of the decoder.

Now that we have described the model, let’s try implementing it Mandarin Vietnamese to English translation!




# Initialize layers in NMT model
Implement the init function to initialize the necessary model layers (LSTM, projection, and dropout) for the NMT system.

# Encoder

Implement the encode function. This function converts the padded source sentences into the tensor $X$, generates $h^{enc}_1 , . . . , h^{enc}_m $, and computes the initial state $h^{dec}_0$ and initial cell  $h^{dec}_0$ for the $\text{Decoder}$

# Decoder
Implement the decode function. This function constructs $\bar{y}$ and runs the step function over every timestep for the input.

# Decoder step
Implement the step function. This function applies the Decoder’s LSTM cell for a single timestep, computing the encoding of the target subword $h^{dec}_t$ , the attention scores $e_t$, attention distribution $\alpha_t$, the attention output $a_t$, and finally the combined output $o_t$.

In [13]:
Hypothesis = namedtuple('Hypothesis', ['value', 'score'])

class NMT(nn.Module):
    """ Simple Neural Machine Translation Model:
        - Bidrectional LSTM Encoder
        - Unidirection LSTM Decoder
        - Global Attention Model (Luong, et al. 2015)
    """

    def __init__(self, embed_size, hidden_size, vocab, dropout_rate=0.2):
        """ Init NMT Model.

        @param embed_size (int): Embedding size (dimensionality)
        @param hidden_size (int): Hidden Size, the size of hidden states (dimensionality)
        @param vocab (Vocab): Vocabulary object containing src and tgt languages
                              See vocab.py for documentation.
        @param dropout_rate (float): Dropout probability, for attention
        """
        super(NMT, self).__init__()
        self.model_embeddings = ModelEmbeddings(embed_size, vocab)
        self.hidden_size = hidden_size
        self.dropout_rate = dropout_rate
        self.vocab = vocab

        # For sanity check only, not relevant to implementation
        self.gen_sanity_check = False
        self.counter = 0

        ### TODO - Initialize the following variables IN THIS ORDER:
        ###     self.post_embed_cnn (Conv1d layer with kernel size 2, input and output channels = embed_size,
        ###         padding = same to preserve output shape )
        ###     self.encoder (Bidirectional LSTM with bias)
        ###     self.decoder (LSTM Cell with bias)
        ###     self.h_projection (Linear Layer with no bias), called W_{h} .
        ###     self.c_projection (Linear Layer with no bias), called W_{c} .
        ###     self.att_projection (Linear Layer with no bias), called W_{attProj}.
        ###     self.combined_output_projection (Linear Layer with no bias), called W_{u}.
        ###     self.target_vocab_projection (Linear Layer with no bias), called W_{vocab}.
        ###     self.dropout (Dropout Layer)

        self.post_embed_cnn = nn.Conv1d(in_channels=embed_size,out_channels=embed_size,kernel_size=2,padding="same")
        self.encoder = nn.LSTM(input_size=embed_size,hidden_size=hidden_size,bidirectional=True,bias=True)
        self.decoder = nn.LSTMCell(input_size=embed_size+hidden_size,hidden_size=hidden_size,bias=True)
        self.h_projection = nn.Linear(in_features=2*hidden_size,out_features=hidden_size,bias=False)
        self.c_projection = nn.Linear(in_features=2*hidden_size,out_features=hidden_size,bias=False)
        self.att_projection = nn.Linear(in_features=2*hidden_size,out_features=hidden_size,bias=False)
        self.combined_output_projection = nn.Linear(in_features=3*hidden_size,out_features=hidden_size,bias=False)
        self.target_vocab_projection = nn.Linear(in_features=hidden_size,out_features=len(vocab.tgt),bias=False)
        self.dropout = nn.Dropout(p=dropout_rate)

    def forward(self, source: List[List[str]], target: List[List[str]]) -> torch.Tensor:
        """ Take a mini-batch of source and target sentences, compute the log-likelihood of
        target sentences under the language models learned by the NMT system.

        @param source (List[List[str]]): list of source sentence tokens
        @param target (List[List[str]]): list of target sentence tokens, wrapped by `<s>` and `</s>`

        @returns scores (Tensor): a variable/tensor of shape (b, ) representing the
                                    log-likelihood of generating the gold-standard target sentence for
                                    each example in the input batch. Here b = batch size.
        """
        # Compute sentence lengths
        # source_lengths = [len(s) for s in source]
        source_lengths = [len(s) if len(s) <= args.max_len else args.max_len for s in source]

        # Convert list of lists into tensors
        source_padded = self.vocab.src.to_input_tensor(source, device=self.device)  # Tensor: (src_len, b)
        target_padded = self.vocab.tgt.to_input_tensor(target, device=self.device)  # Tensor: (tgt_len, b)

        ###     Run the network forward:
        ###     1. Apply the encoder to `source_padded` by calling `self.encode()`
        ###     2. Generate sentence masks for `source_padded` by calling `self.generate_sent_masks()`
        ###     3. Apply the decoder to compute combined-output by calling `self.decode()`
        ###     4. Compute log probability distribution over the target vocabulary using the
        ###        combined_outputs returned by the `self.decode()` function.

        enc_hiddens, dec_init_state = self.encode(source_padded, source_lengths)
        enc_masks = self.generate_sent_masks(enc_hiddens, source_lengths)
        combined_outputs = self.decode(enc_hiddens, enc_masks, dec_init_state, target_padded)
        P = F.log_softmax(self.target_vocab_projection(combined_outputs), dim=-1)

        # Zero out, probabilities for which we have nothing in the target text
        target_masks = (target_padded != self.vocab.tgt['<pad>']).float()

        # Compute log probability of generating true target words
        target_gold_words_log_prob = torch.gather(P, index=target_padded[1:].unsqueeze(-1), dim=-1).squeeze(
            -1) * target_masks[1:]
        scores = target_gold_words_log_prob.sum(dim=0)
        return scores

    def encode(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[
        torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        """ Apply the encoder to source sentences to obtain encoder hidden states.
            Additionally, take the final states of the encoder and project them to obtain initial states for decoder.

        @param source_padded (Tensor): Tensor of padded source sentences with shape (src_len, b), where
                                        b = batch_size, src_len = maximum source sentence length. Note that
                                       these have already been sorted in order of longest to shortest sentence.
        @param source_lengths (List[int]): List of actual lengths for each of the source sentences in the batch
        @returns enc_hiddens (Tensor): Tensor of hidden units with shape (b, src_len, h*2), where
                                        b = batch size, src_len = maximum source sentence length, h = hidden size.
        @returns dec_init_state (tuple(Tensor, Tensor)): Tuple of tensors representing the decoder's initial
                                                hidden state and cell. Both tensors should have shape (2, b, h).
        """

        ### TODO:
        ###     1. Construct Tensor `X` of source sentences with shape (src_len, b, e) using the source model embeddings.
        ###         src_len = maximum source sentence length, b = batch size, e = embedding size. Note
        ###         that there is no initial hidden state or cell for the encoder.
        ###     2. Apply the post_embed_cnn layer. Before feeding X into the CNN, first use torch.permute to change the
        ###         shape of X to (b, e, src_len). After getting the output from the CNN, still stored in the X variable,
        ###         remember to use torch.permute again to revert X back to its original shape.
        ###     3. Compute `enc_hiddens`, `last_hidden`, `last_cell` by applying the encoder to `X`.
        ###         - Before we can apply the encoder, we need to apply the `pack_padded_sequence` function to X.
        ###         - After we apply the encoder, we need to apply the `pad_packed_sequence` function to enc_hiddens.
        ###         - Note that the shape of the tensor output returned by the encoder RNN is (src_len, b, h*2) and we want to
        ###           return a tensor of shape (b, src_len, h*2) as `enc_hiddens`, so we may need to do more permuting.
        ###         - Note on using pad_packed_sequence -> For batched inputs, we need to make sure that each of the
        ###           individual input examples has the same shape.
        ###     4. Compute `dec_init_state` = (init_decoder_hidden, init_decoder_cell):
        ###         - `init_decoder_hidden`:
        ###             `last_hidden` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
        ###             Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
        ###             Apply the h_projection layer to this in order to compute init_decoder_hidden.
        ###             This is h_0^{dec} in the PDF. Here b = batch size, h = hidden size
        ###         - `init_decoder_cell`:
        ###             `last_cell` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
        ###             Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
        ###             Apply the c_projection layer to this in order to compute init_decoder_cell.
        ###             This is c_0^{dec} in the PDF. Here b = batch size, h = hidden size

        X = self.model_embeddings.source(source_padded)
        X = self.post_embed_cnn(X.permute(1,2,0))
        X = X.permute(2,0,1)
        enc_hiddens,(last_hidden,last_cell) = self.encoder(pack_padded_sequence(X,source_lengths))
        enc_hiddens = pad_packed_sequence(enc_hiddens)[0].permute(1,0,2)
        last_hidden = torch.cat((last_hidden[0],last_hidden[1]),dim=1)
        last_cell = torch.cat((last_cell[0],last_cell[1]),dim=1)
        init_decoder_hidden = self.h_projection(last_hidden)
        init_decoder_cell = self.c_projection(last_cell)
        dec_init_state = (init_decoder_hidden,init_decoder_cell)

        return enc_hiddens, dec_init_state

    def decode(self, enc_hiddens: torch.Tensor, enc_masks: torch.Tensor,
               dec_init_state: Tuple[torch.Tensor, torch.Tensor], target_padded: torch.Tensor) -> torch.Tensor:
        """Compute combined output vectors for a batch.

        @param enc_hiddens (Tensor): Hidden states (b, src_len, h*2), where
                                     b = batch size, src_len = maximum source sentence length, h = hidden size.
        @param enc_masks (Tensor): Tensor of sentence masks (b, src_len), where
                                     b = batch size, src_len = maximum source sentence length.
        @param dec_init_state (tuple(Tensor, Tensor)): Initial state and cell for decoder
        @param target_padded (Tensor): Gold-standard padded target sentences (tgt_len, b), where
                                       tgt_len = maximum target sentence length, b = batch size.

        @returns combined_outputs (Tensor): combined output tensor  (tgt_len, b,  h), where
                                        tgt_len = maximum target sentence length, b = batch_size,  h = hidden size
        """
        # Chop off the <END> token for max length sentences.
        target_padded = target_padded[:-1]

        # Initialize the decoder state (hidden and cell)
        dec_state = dec_init_state

        # Initialize previous combined output vector o_{t-1} as zero
        batch_size = enc_hiddens.size(0)
        o_prev = torch.zeros(batch_size, self.hidden_size, device=self.device)

        # Initialize a list we will use to collect the combined output o_t on each step
        combined_outputs = []

        ### TODO:
        ###     1. Apply the attention projection layer to `enc_hiddens` to obtain `enc_hiddens_proj`,
        ###         which should be shape (b, src_len, h),
        ###         where b = batch size, src_len = maximum source length, h = hidden size.
        ###         This is applying W_{attProj} to h^enc, as described in the PDF.
        ###     2. Construct tensor `Y` of target sentences with shape (tgt_len, b, e) using the target model embeddings.
        ###         where tgt_len = maximum target sentence length, b = batch size, e = embedding size.
        ###     3. Use the torch.split function to iterate over the time dimension of Y.
        ###         Within the loop, this will give we Y_t of shape (1, b, e) where b = batch size, e = embedding size.
        ###             - Squeeze Y_t into a tensor of dimension (b, e).
        ###             - Construct Ybar_t by concatenating Y_t with o_prev on their last dimension
        ###             - Use the step function to compute the the Decoder's next (cell, state) values
        ###               as well as the new combined output o_t.
        ###             - Append o_t to combined_outputs
        ###             - Update o_prev to the new o_t.
        ###     4. Use torch.stack to convert combined_outputs from a list length tgt_len of
        ###         tensors shape (b, h), to a single tensor shape (tgt_len, b, h)
        ###         where tgt_len = maximum target sentence length, b = batch size, h = hidden size.

        enc_hiddens_proj = self.att_projection(enc_hiddens)
        Y = self.model_embeddings.target(target_padded)
        Y = torch.split(Y,1,dim=0)
        for y_t in Y:
            y_t = torch.squeeze(y_t)
            ybar_t = torch.cat((y_t,o_prev),dim=1)
            dec_state,o_t,e_t = self.step(ybar_t,dec_state,enc_hiddens,enc_hiddens_proj,enc_masks)
            combined_outputs.append(o_t)
            o_prev = o_t
        combined_outputs=torch.stack(combined_outputs,dim=0)

        return combined_outputs

    def step(self, Ybar_t: torch.Tensor,
             dec_state: Tuple[torch.Tensor, torch.Tensor],
             enc_hiddens: torch.Tensor,
             enc_hiddens_proj: torch.Tensor,
             enc_masks: torch.Tensor) -> Tuple[Tuple, torch.Tensor, torch.Tensor]:
        """ Compute one forward step of the LSTM decoder, including the attention computation.

        @param Ybar_t (Tensor): Concatenated Tensor of [Y_t o_prev], with shape (b, e + h). The input for the decoder,
                                where b = batch size, e = embedding size, h = hidden size.
        @param dec_state (tuple(Tensor, Tensor)): Tuple of tensors both with shape (b, h), where b = batch size, h = hidden size.
                First tensor is decoder's prev hidden state, second tensor is decoder's prev cell.
        @param enc_hiddens (Tensor): Encoder hidden states Tensor, with shape (b, src_len, h * 2), where b = batch size,
                                    src_len = maximum source length, h = hidden size.
        @param enc_hiddens_proj (Tensor): Encoder hidden states Tensor, projected from (h * 2) to h. Tensor is with shape (b, src_len, h),
                                    where b = batch size, src_len = maximum source length, h = hidden size.
        @param enc_masks (Tensor): Tensor of sentence masks shape (b, src_len),
                                    where b = batch size, src_len is maximum source length.

        @returns dec_state (tuple (Tensor, Tensor)): Tuple of tensors both shape (b, h), where b = batch size, h = hidden size.
                First tensor is decoder's new hidden state, second tensor is decoder's new cell.
        @returns combined_output (Tensor): Combined output Tensor at timestep t, shape (b, h), where b = batch size, h = hidden size.
        @returns e_t (Tensor): Tensor of shape (b, src_len). It is attention scores distribution.
                                Note: we will not use this outside of this function.
                                      We are simply returning this value so that we can sanity check
                                      our implementation.
        """

        combined_output = None

        ### TODO:
        ###     1. Apply the decoder to `Ybar_t` and `dec_state`to obtain the new dec_state.
        ###     2. Split dec_state into its two parts (dec_hidden, dec_cell)
        ###     3. Compute the attention scores e_t, a Tensor shape (b, src_len).
        ###        Note: b = batch_size, src_len = maximum source length, h = hidden size.
        ###
        ###       Hints:
        ###         - dec_hidden is shape (b, h) and corresponds to h^dec_t in the PDF (batched)
        ###         - enc_hiddens_proj is shape (b, src_len, h) and corresponds to W_{attProj} h^enc (batched).
        ###         - Use batched matrix multiplication (torch.bmm) to compute e_t (be careful about the input/ output shapes!)
        ###         - To get the tensors into the right shapes for bmm, we will need to do some squeezing and unsqueezing.
        ###         - When using the squeeze() function make sure to specify the dimension we want to squeeze
        ###             over. Otherwise, we will remove the batch dimension accidentally, if batch_size = 1.

        dec_state = self.decoder(Ybar_t,dec_state)
        dec_hidden, dec_cell = dec_state[0], dec_state[1]
        e_t = torch.bmm(enc_hiddens_proj,dec_hidden.unsqueeze(-1)).squeeze(-1)

        # Set e_t to -inf where enc_masks has 1
        if enc_masks is not None:
            e_t.data.masked_fill_(enc_masks.bool(), -float('inf'))

        ### TODO:
        ###     1. Apply softmax to e_t to yield alpha_t
        ###     2. Use batched matrix multiplication between alpha_t and enc_hiddens to obtain the
        ###         attention output vector, a_t.
        # $$     Hints:
        ###           - alpha_t is shape (b, src_len)
        ###           - enc_hiddens is shape (b, src_len, 2h)
        ###           - a_t should be shape (b, 2h)
        ###           - we will need to do some squeezing and unsqueezing.
        ###     Note: b = batch size, src_len = maximum source length, h = hidden size.
        ###
        ###     3. Concatenate dec_hidden with a_t to compute tensor U_t
        ###     4. Apply the combined output projection layer to U_t to compute tensor V_t
        ###     5. Compute tensor O_t by first applying the Tanh function and then the dropout layer.
        
        alpha_t = F.softmax(e_t)
        a_t = torch.bmm(alpha_t.unsqueeze(1),enc_hiddens).squeeze(dim=1)
        U_t = torch.cat((a_t,dec_hidden),dim=1)
        V_t = self.combined_output_projection(U_t)
        O_t = self.dropout(F.tanh(V_t))

        combined_output = O_t
        return dec_state, combined_output, e_t

    def generate_sent_masks(self, enc_hiddens: torch.Tensor, source_lengths: List[int]) -> torch.Tensor:
        """ Generate sentence masks for encoder hidden states.

        @param enc_hiddens (Tensor): encodings of shape (b, src_len, 2*h), where b = batch size,
                                     src_len = max source length, h = hidden size.
        @param source_lengths (List[int]): List of actual lengths for each of the sentences in the batch.

        @returns enc_masks (Tensor): Tensor of sentence masks of shape (b, src_len),
                                    where src_len = max source length, h = hidden size.
        """
        enc_masks = torch.zeros(enc_hiddens.size(0), enc_hiddens.size(1), dtype=torch.float)
        for e_id, src_len in enumerate(source_lengths):
            enc_masks[e_id, src_len:] = 1
        return enc_masks.to(self.device)

    def beam_search(self, src_sent: List[str], beam_size: int = 5, max_decoding_time_step: int = 70) -> List[
        Hypothesis]:
        """ Given a single source sentence, perform beam search, yielding translations in the target language.
        @param src_sent (List[str]): a single source sentence (words)
        @param beam_size (int): beam size
        @param max_decoding_time_step (int): maximum number of time steps to unroll the decoding RNN
        @returns hypotheses (List[Hypothesis]): a list of hypothesis, each hypothesis has two fields:
                value: List[str]: the decoded target sentence, represented as a list of words
                score: float: the log-likelihood of the target sentence
        """
        src_sents_var = self.vocab.src.to_input_tensor([src_sent], self.device)

        src_encodings, dec_init_vec = self.encode(src_sents_var, [len(src_sent)])
        src_encodings_att_linear = self.att_projection(src_encodings)

        h_tm1 = dec_init_vec
        att_tm1 = torch.zeros(1, self.hidden_size, device=self.device)

        eos_id = self.vocab.tgt['</s>']

        hypotheses = [['<s>']]
        hyp_scores = torch.zeros(len(hypotheses), dtype=torch.float, device=self.device)
        completed_hypotheses = []

        t = 0
        while len(completed_hypotheses) < beam_size and t < max_decoding_time_step:
            t += 1
            hyp_num = len(hypotheses)

            exp_src_encodings = src_encodings.expand(hyp_num,
                                                     src_encodings.size(1),
                                                     src_encodings.size(2))

            exp_src_encodings_att_linear = src_encodings_att_linear.expand(hyp_num,
                                                                           src_encodings_att_linear.size(1),
                                                                           src_encodings_att_linear.size(2))

            y_tm1 = torch.tensor([self.vocab.tgt[hyp[-1]] for hyp in hypotheses], dtype=torch.long, device=self.device)
            y_t_embed = self.model_embeddings.target(y_tm1)

            x = torch.cat([y_t_embed, att_tm1], dim=-1)

            (h_t, cell_t), att_t, _ = self.step(x, h_tm1,
                                                exp_src_encodings, exp_src_encodings_att_linear, enc_masks=None)

            # log probabilities over target words
            log_p_t = F.log_softmax(self.target_vocab_projection(att_t), dim=-1)

            live_hyp_num = beam_size - len(completed_hypotheses)
            contiuating_hyp_scores = (hyp_scores.unsqueeze(1).expand_as(log_p_t) + log_p_t).view(-1)
            top_cand_hyp_scores, top_cand_hyp_pos = torch.topk(contiuating_hyp_scores, k=live_hyp_num)

            prev_hyp_ids = torch.div(top_cand_hyp_pos, len(self.vocab.tgt), rounding_mode='floor')
            hyp_word_ids = top_cand_hyp_pos % len(self.vocab.tgt)

            new_hypotheses = []
            live_hyp_ids = []
            new_hyp_scores = []

            for prev_hyp_id, hyp_word_id, cand_new_hyp_score in zip(prev_hyp_ids, hyp_word_ids, top_cand_hyp_scores):
                prev_hyp_id = prev_hyp_id.item()
                hyp_word_id = hyp_word_id.item()
                cand_new_hyp_score = cand_new_hyp_score.item()

                hyp_word = self.vocab.tgt.id2word[hyp_word_id]
                new_hyp_sent = hypotheses[prev_hyp_id] + [hyp_word]
                if hyp_word == '</s>':
                    completed_hypotheses.append(Hypothesis(value=new_hyp_sent[1:-1],
                                                           score=cand_new_hyp_score))
                else:
                    new_hypotheses.append(new_hyp_sent)
                    live_hyp_ids.append(prev_hyp_id)
                    new_hyp_scores.append(cand_new_hyp_score)

            if len(completed_hypotheses) == beam_size:
                break

            live_hyp_ids = torch.tensor(live_hyp_ids, dtype=torch.long, device=self.device)
            h_tm1 = (h_t[live_hyp_ids], cell_t[live_hyp_ids])
            att_tm1 = att_t[live_hyp_ids]

            hypotheses = new_hypotheses
            hyp_scores = torch.tensor(new_hyp_scores, dtype=torch.float, device=self.device)

        if len(completed_hypotheses) == 0:
            completed_hypotheses.append(Hypothesis(value=hypotheses[0][1:],
                                                   score=hyp_scores[0].item()))

        completed_hypotheses.sort(key=lambda hyp: hyp.score, reverse=True)

        return completed_hypotheses

    @property
    def device(self) -> torch.device:
        """ Determine which device to place the Tensors upon, CPU or GPU.
        """
        return self.model_embeddings.source.weight.device

    @staticmethod
    def load(model_path: str):
        """ Load the model from a file.
        @param model_path (str): path to model
        """
        params = torch.load(model_path, map_location=lambda storage, loc: storage)
        args = params['args']
        model = NMT(vocab=params['vocab'], **args)
        model.load_state_dict(params['state_dict'])

        return model

    def save(self, path: str):
        """ Save the odel to a file.
        @param path (str): path to the model
        """
        print('save model parameters to [%s]' % path, file=sys.stderr)

        params = {
            'args': dict(embed_size=self.model_embeddings.embed_size, hidden_size=self.hidden_size,
                         dropout_rate=self.dropout_rate),
            'vocab': self.vocab,
            'state_dict': self.state_dict()
        }

        torch.save(params, path)

Now it’s time to get things running!

In [14]:
#@title Evaluating function
def evaluate_ppl(model, dev_data, batch_size=32):
    """ Evaluate perplexity on dev sentences
    @param model (NMT): NMT Model
    @param dev_data (list of (src_sent, tgt_sent)): list of tuples containing source and target sentence
    @param batch_size (batch size)
    @returns ppl (perplixty on dev sentences)
    """
    was_training = model.training
    model.eval()

    cum_loss = 0.
    cum_tgt_words = 0.

    # no_grad() signals backend to throw away all gradients
    with torch.no_grad():
        for src_sents, tgt_sents in batch_iter(dev_data, batch_size):
            loss = -model(src_sents, tgt_sents).sum()

            cum_loss += loss.item()
            tgt_word_num_to_predict = sum(len(s[1:]) for s in tgt_sents)  # omitting leading `<s>`
            cum_tgt_words += tgt_word_num_to_predict

        ppl = np.exp(cum_loss / cum_tgt_words)

    if was_training:
        model.train()

    return ppl

def compute_corpus_level_bleu_score(references: List[List[str]], hypotheses: List[Hypothesis]) -> float:
    """ Given decoding results and reference sentences, compute corpus-level BLEU score.
    @param references (List[List[str]]): a list of gold-standard reference target sentences
    @param hypotheses (List[Hypothesis]): a list of hypotheses, one for each reference
    @returns bleu_score: corpus-level BLEU score
    """
    # remove the start and end tokens
    if references[0][0] == '<s>':
        references = [ref[1:-1] for ref in references]
    
    # detokenize the subword pieces to get full sentences
    detokened_refs = [''.join(pieces).replace('▁', ' ') for pieces in references]
    detokened_hyps = [''.join(hyp.value).replace('▁', ' ') for hyp in hypotheses]

    # sacreBLEU can take multiple references (golden example per sentence) but we only feed it one
    bleu = sacrebleu.corpus_bleu(detokened_hyps, [detokened_refs])

    return bleu.score

In [None]:
# Initialize our model and optimizer
model = NMT(embed_size=args.embed_size,
            hidden_size=args.hidden_size,
            dropout_rate=float(args.dropout),
            vocab=vocab)
model.train()

uniform_init = float(args.uniform_init)
if np.abs(uniform_init) > 0.:
    print('uniformly initialize parameters [-%f, +%f]' % (uniform_init, uniform_init), file=sys.stderr)
    for p in model.parameters():
        p.data.uniform_(-uniform_init, uniform_init)

model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=float(args.lr))

uniformly initialize parameters [-0.100000, +0.100000]


In [None]:
#@title Training the model
num_trial = 0
train_iter = patience = cum_loss = report_loss = cum_tgt_words = report_tgt_words = 0
cum_examples = report_examples = epoch = valid_num = 0
hist_valid_scores = []
train_time = begin_time = time.time()
print('begin Maximum Likelihood training')

for epoch in range(args.max_epoch):
    for src_sents, tgt_sents in batch_iter(train_data, batch_size=args.batch_size, shuffle=True):
        train_iter += 1

        optimizer.zero_grad()

        batch_size = len(src_sents)

        example_losses = -model(src_sents, tgt_sents) # (batch_size,)
        batch_loss = example_losses.sum()
        loss = batch_loss / batch_size

        loss.backward()

        # clip gradient
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip_grad)

        optimizer.step()

        batch_losses_val = batch_loss.item()
        report_loss += batch_losses_val
        cum_loss += batch_losses_val

        tgt_words_num_to_predict = sum(len(s[1:]) for s in tgt_sents)  # omitting leading `<s>`
        report_tgt_words += tgt_words_num_to_predict
        cum_tgt_words += tgt_words_num_to_predict
        report_examples += batch_size
        cum_examples += batch_size

        if train_iter % (args.log_every*10) == 0:
            print('epoch %d, iter %d, avg. loss %.2f, avg. ppl %.2f ' \
                    'cum. examples %d, speed %.2f words/sec, time elapsed %.2f sec' % (epoch, train_iter,
                                                                                        report_loss / report_examples,
                                                                                        math.exp(report_loss / report_tgt_words),
                                                                                        cum_examples,
                                                                                        report_tgt_words / (time.time() - train_time),
                                                                                        time.time() - begin_time), file=sys.stderr)

            train_time = time.time()
            report_loss = report_tgt_words = report_examples = 0.

        # perform validation
        if train_iter % args.valid_niter == 0:
            print('epoch %d, iter %d, cum. loss %.2f, cum. ppl %.2f cum. examples %d' % (epoch, train_iter,
                                                                                        cum_loss / cum_examples,
                                                                                        np.exp(cum_loss / cum_tgt_words),
                                                                                        cum_examples), file=sys.stderr)

            cum_loss = cum_examples = cum_tgt_words = 0.
            valid_num += 1

            print('begin validation ...', file=sys.stderr)

            # compute dev. ppl and bleu
            dev_ppl = evaluate_ppl(model, dev_data, batch_size=128)   # dev batch size can be a bit larger
            valid_metric = -dev_ppl

            print('validation: iter %d, dev. ppl %f' % (train_iter, dev_ppl), file=sys.stderr)

            is_better = len(hist_valid_scores) == 0 or valid_metric > max(hist_valid_scores)
            hist_valid_scores.append(valid_metric)

            if is_better:
                patience = 0
                print('save currently the best model to [%s]' % args.model_save_path, file=sys.stderr)
                model.save(args.model_save_path)

                # also save the optimizers' state
                torch.save(optimizer.state_dict(), args.model_save_path + '.optim')
            elif patience < int(args.patience):
                patience += 1
                print('hit patience %d' % patience, file=sys.stderr)

                if patience == int(args.patience):
                    num_trial += 1
                    print('hit #%d trial' % num_trial, file=sys.stderr)
                    if num_trial == int(args.max_num_trial):
                        print('early stop!', file=sys.stderr)
                        exit(0)

                    # decay lr, and restore from previously best checkpoint
                    lr = optimizer.param_groups[0]['lr'] * float(args.lr_decay)
                    print('load previously best model and decay learning rate to %f' % lr, file=sys.stderr)

                    # load model
                    params = torch.load(args.model_save_path, map_location=lambda storage, loc: storage)
                    model.load_state_dict(params['state_dict'])
                    model = model.to(device)

                    print('restore parameters of the optimizers', file=sys.stderr)
                    optimizer.load_state_dict(torch.load(args.model_save_path + '.optim'))

                    # set new lr
                    for param_group in optimizer.param_groups:
                        param_group['lr'] = lr

                    # reset patience
                    patience = 0

begin Maximum Likelihood training


  return F.conv1d(input, weight, bias, self.stride,
  alpha_t = F.softmax(e_t)
epoch 0, iter 100, avg. loss 159.51, avg. ppl 315.13 cum. examples 3196, speed 3438.16 words/sec, time elapsed 25.77 sec
epoch 0, iter 200, avg. loss 125.57, avg. ppl 93.24 cum. examples 6392, speed 3973.81 words/sec, time elapsed 48.04 sec
epoch 0, iter 300, avg. loss 118.50, avg. ppl 65.06 cum. examples 9589, speed 4104.45 words/sec, time elapsed 70.15 sec
epoch 0, iter 400, avg. loss 110.94, avg. ppl 51.17 cum. examples 12788, speed 4142.12 words/sec, time elapsed 91.92 sec
epoch 0, iter 500, avg. loss 104.22, avg. ppl 42.32 cum. examples 15988, speed 4108.01 words/sec, time elapsed 113.60 sec
epoch 0, iter 600, avg. loss 102.56, avg. ppl 37.78 cum. examples 19185, speed 3980.23 words/sec, time elapsed 136.28 sec
epoch 0, iter 700, avg. loss 96.84, avg. ppl 32.82 cum. examples 22381, speed 4081.75 words/sec, time elapsed 158.00 sec
epoch 0, iter 800, avg. loss 94.98, avg. ppl 29.56 cum. examples 25578, sp

# The Seq2Seq Model 2: Transformer
In this part, we will train a sequence-to-sequence Transformer model to translate Portuguese into English. The Transformer was originally proposed in "Attention is all you need" by Vaswani et al. (2017).

<img src="https://www.tensorflow.org/images/tutorials/transformer/apply_the_transformer_to_machine_translation.gif" alt="Applying the Transformer to machine translation">

Figure 2: Applying the Transformer to machine translation. Source: [Google AI Blog](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html).

A Transformer is a sequence-to-sequence encoder-decoder model similar to the model in the [NMT with attention tutorial](https://www.tensorflow.org/text/tutorials/nmt_with_attention).
A single-layer Transformer takes a little more code to write, but is almost identical to that encoder-decoder RNN model. The only difference is that the RNN layers are replaced with self attention layers.

<table>
<tr>
  <th>The <a href=https://www.tensorflow.org/text/tutorials/nmt_with_attention>RNN+Attention model</a></th>
  <th>A 1-layer transformer</th>
</tr>
<tr>
  <td>
   <img width=411 src="https://www.tensorflow.org/images/tutorials/transformer/RNN+attention-words.png"/>
  </td>
  <td>
   <img width=400 src="https://www.tensorflow.org/images/tutorials/transformer/Transformer-1layer-words.png"/>
  </td>
</tr>
</table>

### The embedding and positional encoding layer

The inputs to both the encoder and decoder use the same embedding and positional encoding logic. 

<table>
<tr>
  <th colspan=1>The embedding and positional encoding layer</th>
<tr>
<tr>
  <td>
   <img src="https://www.tensorflow.org/images/tutorials/transformer/PositionalEmbedding.png"/>
  </td>
</tr>
</table>

The formula for calculating the positional encoding (implemented in Python below) is as follows:

$$\Large{PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})} $$
$$\Large{PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})} $$

# Transformer Embedding Layer 
Implement embedding layer (consists of lookup embedding & positional encoding) for transformer model. 

In [15]:
class TransformerEmbedding(nn.Module):
    """
    Class that combines token embeddings with positional embeddings.
    """
    def __init__(self, vocab_size, embedding_size, max_len, dropout_rate):
        """
        Init the Transformer Embedding layer.

        @param vocab_size (int): Vocabulary size (number of unique tokens)
        @param embedding_size (int): Embedding size (dimensionality)
        @param max_len (int): Maximum sequence length
        @param dropout_rate (float): Dropout probability
        """
        super().__init__()
        # default values

        ### TODO - Implement the positional embedding and Initialize the following variables :
        ###     self.token_embedding (Embedding Layer)
        ###     self.pos_embedding (Positional Embedding Layer), notes that pos_embedding is not learnable parameters, 
        ###         so we should use the self.register_buffer function to initialize it.
        ###     self.dropout (Dropout Layer)

        # Token Embedding Layer
        self.token_embedding = nn.Embedding(vocab_size, embedding_size)

        # Positional Encoding Layer
        den = torch.exp(- torch.arange(0, embedding_size, 2)* math.log(10000) / embedding_size)
        pos = torch.arange(0, max_len).reshape(max_len, 1)
        pos_embedding = torch.zeros((max_len, embedding_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout_rate)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, x):
        """
        Maps input sequences of tokens to their embeddings.

        param x (Tensor): Input tensor of tokens with shape (batch_size, seq_len)

        returns embedded (Tensor): Tensor of token embeddings with shape (batch_size, seq_len, embedding_size)
        """
        # Retrieve token embeddings
        embedded_tokens = self.token_embedding(x.long())
        
        # Retrieve positional embeddings for the appropriate segment of the input sequence
        embedded_positions = self.pos_embedding[:embedded_tokens.size(0), :]

        # Add token and positional embeddings together, apply dropout, and return
        embedded = self.dropout(embedded_tokens + embedded_positions)
        return embedded
        

### The transformer model:

To be convinient, we will use nn.Transformer layer from PyTorch. We will build a 4-layer Transformer model.

<table>
<tr>
  <th colspan=1>The original Transformer diagram</th>
  <th colspan=1>A representation of a 4-layer Transformer</th>
</tr>
<tr>
  <td>
   <img width=400 src="https://www.tensorflow.org/images/tutorials/transformer/transformer.png"/>
  </td>
  <td>
   <img width=307 src="https://www.tensorflow.org/images/tutorials/transformer/Transformer-4layer-compact.png"/>
  </td>
</tr>
</table>

# Initialize layers in TransformerNMT model
Implement the __init__ function  to initialize the
necessary module for our TransformerNMT model


# Implement the **forward** function
Implement the forward function in the TransformerNMT class



In [16]:
Hypothesis = namedtuple('Hypothesis', ['value', 'score'])

class TransformerNMT(nn.Module):
    """ Neural Machine Translation Model with Transformer:
        - Encoder with stacked self-attention and feedforward layers
        - Decoder with stacked self-attention, encoder-decoder attention, and feedforward layers
    """
    def __init__(self, d_model, n_heads, ff_size, n_layers, max_len, vocab, dropout_rate):
        """ Init TransformerNMT NMT Model.
        @param d_model (int): Hidden Size, the size of hidden states (dimensionality)
        @param n_heads (int): The number of heads in the multiheadattention
        @param n_layers (int): The number of sub-layers in the Encoder and Decoder Transformer
        @param ff_size (int): The dimension of the feedforward network model 
        @param max_len (int) max sequence length
        @param vocab (Vocab): Vocabulary object containing src and tgt languages
                              See vocab.py for documentation.
        @param dropout_rate (float): Dropout probability, for attention
        """
        super(TransformerNMT, self).__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.ff_size = ff_size
        self.n_layers = n_layers
        self.max_len = max_len
        self.vocab = vocab
        self.dropout_rate = dropout_rate

        self.src_embedding = None
        self.tgt_embedding = None
        self.transformer = None
        self.target_vocab_projection = None

        ### TODO - Initialize the following variables IN THIS ORDER:
        ###     self.src_embedding: Transformer Embedding Layer used for source language
        ###     self.tgt_embedding: Transformer Embedding Layer used for target language
        ###     self.transformer: Transformer layer
        ###     self.target_vocab_projection (Linear Layer with no bias), mapping hidden representation to the vocab distribution

        self.src_embedding = TransformerEmbedding(vocab_size=len(vocab.src), embedding_size=d_model, max_len=max_len, dropout_rate=dropout_rate)
        self.tgt_embedding = TransformerEmbedding(vocab_size=len(vocab.tgt), embedding_size=d_model, max_len=max_len, dropout_rate=dropout_rate)

        self.transformer = nn.Transformer(d_model=d_model, nhead=n_heads, num_encoder_layers=n_layers, num_decoder_layers=n_layers, dim_feedforward=ff_size, dropout=dropout_rate)
        self.target_vocab_projection = nn.Linear(d_model, len(vocab.tgt))

    def forward(self, source: List[List[str]], target: List[List[str]]) -> torch.Tensor:
        """ Take a mini-batch of source and target sentences, compute the log-likelihood of
        target sentences under the language models learned by the NMT system.

        @param source (List[List[str]]): list of source sentence tokens
        @param target (List[List[str]]): list of target sentence tokens, wrapped by `<s>` and `</s>`

        @returns scores (Tensor): a variable/tensor of shape (b, ) representing the
                                    log-likelihood of generating the gold-standard target sentence for
                                    each example in the input batch. Here b = batch size.
        """
        # Convert list of lists into tensors
        source_padded = self.vocab.src.to_input_tensor(source, device=self.device)   # Tensor: (src_len, b)
        target_padded = self.vocab.tgt.to_input_tensor(target, device=self.device)   # Tensor: (tgt_len, b)

        # Compute masking
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = self.generate_sent_masks(source_padded, target_padded)

        # Compute embedding
        embedded_src = self.src_embedding(source_padded) 
        embedded_tgt = self.tgt_embedding(target_padded) 

        # 3. Apply the Transformer to compute decoder output
        decoder_output = self.transformer(embedded_src, embedded_tgt, src_mask, tgt_mask)
        P = F.log_softmax(self.target_vocab_projection(decoder_output), dim=-1)

        # Zero out, probabilities for which we have nothing in the target text
        target_masks = (target_padded != self.vocab.tgt['<pad>']).float()
        
        # Compute log probability of generating true target words
        target_gold_words_log_prob = torch.gather(P, index=target_padded.unsqueeze(-1), dim=-1).squeeze(-1) * target_masks
        scores = target_gold_words_log_prob.sum(dim=0)
        return scores
    
    def encode(self, src, src_mask):
        return self.transformer.encoder(self.src_embedding(src), src_mask)
    
    def decode(self, tgt: torch.Tensor, memory: torch.Tensor, tgt_mask: torch.Tensor):
        return self.transformer.decoder(self.tgt_embedding(tgt), memory, tgt_mask)
    
    def generate_sent_masks(self, src_ids, tgt_ids):
        src_seq_len = src_ids.size(0)
        tgt_seq_len = tgt_ids.size(0)

        tgt_mask = self.generate_square_subsequent_mask(tgt_seq_len)
        src_mask = torch.zeros((src_seq_len, src_seq_len), device=self.device).type(torch.bool)

        src_padding_mask = (src_ids == 0).transpose(0, 1)
        tgt_padding_mask = (tgt_ids == 0).transpose(0, 1)
        
        return src_mask.to(self.device), tgt_mask.to(self.device), src_padding_mask.to(self.device), tgt_padding_mask.to(self.device)
    
    def generate_square_subsequent_mask(self, sz):
        mask = (torch.triu(torch.ones((sz, sz), device=self.device)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask
  
    @property
    def device(self) -> torch.device:
        """ Determine which device to place the Tensors upon, CPU or GPU.
        """
        return self.src_embedding.token_embedding.weight.device

    @staticmethod
    def load(model_path: str):
        """ Load the model from a file.
        @param model_path (str): path to model
        """
        params = torch.load(model_path, map_location=lambda storage, loc: storage)
        args = params['args']
        model = TransformerNMT(vocab=params['vocab'], **args)
        model.load_state_dict(params['state_dict'])

        return model

    def save(self, path: str):
        """ Save the odel to a file.
        @param path (str): path to the model
        """
        print('save model parameters to [%s]' % path, file=sys.stderr)

        params = {
            'args': dict(d_model=self.d_model, n_heads=self.n_heads, ff_size=self.ff_size, n_layers=self.n_layers, max_len=self.max_len, dropout_rate=self.dropout_rate),
            'vocab': self.vocab,
            'state_dict': self.state_dict()
        }

        torch.save(params, path)

# Train our full model
Train our Transformer model using training script above and report the results compared with LSTM-Attention model.



In [17]:
args.n_heads = 8
args.ff_size = 768
args.n_layers = 4
args.d_model = 768
args.lr = 1e-4
args.dropout = 0.1
args.model_save_path = "transformer_model.bin"

In [None]:
# Initialize our model and optimizer
model = TransformerNMT(d_model=args.d_model, n_heads=args.n_heads, ff_size=args.ff_size, n_layers=args.n_layers, max_len=args.max_len, vocab=vocab, dropout_rate=float(args.dropout))
model.train()

for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=float(args.lr))

In [None]:
#@title Training the model
num_trial = 0
train_iter = patience = cum_loss = report_loss = cum_tgt_words = report_tgt_words = 0
cum_examples = report_examples = epoch = valid_num = 0
hist_valid_scores = []
train_time = begin_time = time.time()
print('begin Maximum Likelihood training')

for epoch in range(args.max_epoch):
    for src_sents, tgt_sents in batch_iter(train_data, batch_size=args.batch_size, shuffle=True):
        train_iter += 1

        optimizer.zero_grad()

        batch_size = len(src_sents)

        example_losses = -model(src_sents, tgt_sents) # (batch_size,)
        batch_loss = example_losses.sum()
        loss = batch_loss / batch_size

        loss.backward()

        # clip gradient
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip_grad)

        optimizer.step()

        batch_losses_val = batch_loss.item()
        report_loss += batch_losses_val
        cum_loss += batch_losses_val

        tgt_words_num_to_predict = sum(len(s[1:]) for s in tgt_sents)  # omitting leading `<s>`
        report_tgt_words += tgt_words_num_to_predict
        cum_tgt_words += tgt_words_num_to_predict
        report_examples += batch_size
        cum_examples += batch_size

        if train_iter % (args.log_every*10) == 0:
            print('epoch %d, iter %d, avg. loss %.2f, avg. ppl %.2f ' \
                    'cum. examples %d, speed %.2f words/sec, time elapsed %.2f sec' % (epoch, train_iter,
                                                                                        report_loss / report_examples,
                                                                                        math.exp(report_loss / report_tgt_words),
                                                                                        cum_examples,
                                                                                        report_tgt_words / (time.time() - train_time),
                                                                                        time.time() - begin_time), file=sys.stderr)

            train_time = time.time()
            report_loss = report_tgt_words = report_examples = 0.

        # perform validation
        if train_iter % args.valid_niter == 0:
            print('epoch %d, iter %d, cum. loss %.2f, cum. ppl %.2f cum. examples %d' % (epoch, train_iter,
                                                                                        cum_loss / cum_examples,
                                                                                        np.exp(cum_loss / cum_tgt_words),
                                                                                        cum_examples), file=sys.stderr)

            cum_loss = cum_examples = cum_tgt_words = 0.
            valid_num += 1

            print('begin validation ...', file=sys.stderr)

            # compute dev. ppl and bleu
            dev_ppl = evaluate_ppl(model, dev_data, batch_size=128)   # dev batch size can be a bit larger
            valid_metric = -dev_ppl

            print('validation: iter %d, dev. ppl %f' % (train_iter, dev_ppl), file=sys.stderr)

            is_better = len(hist_valid_scores) == 0 or valid_metric > max(hist_valid_scores)
            hist_valid_scores.append(valid_metric)

            if is_better:
                patience = 0
                print('save currently the best model to [%s]' % args.model_save_path, file=sys.stderr)
                model.save(args.model_save_path)

                # also save the optimizers' state
                torch.save(optimizer.state_dict(), args.model_save_path + '.optim')
            elif patience < int(args.patience):
                patience += 1
                print('hit patience %d' % patience, file=sys.stderr)

                if patience == int(args.patience):
                    num_trial += 1
                    print('hit #%d trial' % num_trial, file=sys.stderr)
                    if num_trial == int(args.max_num_trial):
                        print('early stop!', file=sys.stderr)
                        exit(0)

                    # decay lr, and restore from previously best checkpoint
                    lr = optimizer.param_groups[0]['lr'] * float(args.lr_decay)
                    print('load previously best model and decay learning rate to %f' % lr, file=sys.stderr)

                    # load model
                    params = torch.load(args.model_save_path, map_location=lambda storage, loc: storage)
                    model.load_state_dict(params['state_dict'])
                    model = model.to(device)

                    print('restore parameters of the optimizers', file=sys.stderr)
                    optimizer.load_state_dict(torch.load(args.model_save_path + '.optim'))

                    # set new lr
                    for param_group in optimizer.param_groups:
                        param_group['lr'] = lr

                    # reset patience
                    patience = 0

begin Maximum Likelihood training


epoch 0, iter 100, avg. loss 193.43, avg. ppl 1071.13 cum. examples 3196, speed 4195.73 words/sec, time elapsed 21.12 sec
epoch 0, iter 200, avg. loss 164.68, avg. ppl 382.94 cum. examples 6392, speed 4567.86 words/sec, time elapsed 40.49 sec
epoch 0, iter 300, avg. loss 162.75, avg. ppl 309.34 cum. examples 9589, speed 4765.73 words/sec, time elapsed 59.53 sec
epoch 0, iter 400, avg. loss 144.10, avg. ppl 165.87 cum. examples 12788, speed 4747.91 words/sec, time elapsed 78.53 sec
epoch 0, iter 500, avg. loss 129.24, avg. ppl 103.98 cum. examples 15988, speed 4685.28 words/sec, time elapsed 97.53 sec
epoch 0, iter 600, avg. loss 123.36, avg. ppl 78.92 cum. examples 19185, speed 4622.91 words/sec, time elapsed 117.06 sec
epoch 0, iter 700, avg. loss 110.81, avg. ppl 54.32 cum. examples 22381, speed 4711.07 words/sec, time elapsed 135.88 sec
epoch 0, iter 800, avg. loss 102.99, avg. ppl 39.34 cum. examples 25578, speed 4944.84 words/sec, time elapsed 154.01 sec
epoch 0, iter 900, avg. lo

dựa trên các tiêu chí avg. loss, avg. ppl, time elapsed thì có thể thấy mô hình Transformer có kết quả tốt hơn so với mô hình LSTM-Attention về cả độ chính xác và tốc độ huấn luyện

In [30]:
# Load the pre-trained model weights
model_path = "/content/drive/MyDrive/Colab Notebooks/VietAI/ASM/transformer_model.bin"
model_state_dict = torch.load(model_path)

# Create an instance of the TransformerNMT model with the same configuration used during training
model = TransformerNMT(d_model=768, n_heads=8, ff_size=768, n_layers=4, max_len=320, vocab=vocab, dropout_rate=0.1)

# Load the model weights
model.load_state_dict(model_state_dict)

RuntimeError: ignored

In [26]:
import nltk

def tokenize(sentence):
    # Download the necessary resources for tokenization (only required once)
    nltk.download('punkt')
    
    # Tokenize the sentence into individual tokens
    tokens = nltk.word_tokenize(sentence)
    
    return tokens


In [27]:
# Tokenize the source sentence
source_sentence = "xin chào, bạn khỏe không?"
source_tokens = tokenize(source_sentence)  # Implement a tokenization function that splits the sentence into tokens


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [28]:
source_tokens

['xin', 'chào', ',', 'bạn', 'khỏe', 'không', '?']

In [29]:
# Translate the source sentence
translated_tokens = model.translate(source_tokens)

# Join the translated tokens into a readable translation
translation = " ".join(translated_tokens)

# Print the translation
print(translation)

AttributeError: ignored

In [None]:
# function to generate output sequence using greedy algorithm
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(device)
    src_mask = src_mask.to(device)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)
    for i in range(max_len-1):
        memory = memory.to(device)
        tgt_mask = TransformerNMT.generate_square_subsequent_mask(ys.size(0)).type(torch.bool).to(device)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == 2:
            break
    return ys


# actual function to translate input sentence into target language
def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=1).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<s>", "").replace("</s>", "")