# Project 3 (part 2): Language Translation with Recurrent Neural Networks
## CS4740/5740 Fall 2021

Johann Lee



## Dataset
You are given a set of parallel sentences: each pair consists of a sentence in modern English (the "source") and its counterpart in Shakespearean English (the "target"). Your task is to translate modern English into Shakespearean English. This task is usually called Neural Machine Translation, which we abbreviate as NMT throughout the project.

We will minimally preprocess the source/target sentences and handle tokenization in what we release. For this assignment, we do not anticipate any further preprocessing to be done by you. Should you choose to do so, it would be interesting to hear about in the report (along with whether or not it helped performance), but it is not a required aspect of the assignment.

In [None]:
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=True)

#https://www.digitalocean.com/community/tutorials/how-to-set-up-jupyter-notebook-with-python-3-on-ubuntu-18-04
#https://stackoverflow.com/questions/16886179/scp-or-sftp-copy-multiple-files-with-single-command
# jupyter notebook --no-browser

source_path = os.path.join(os.getcwd(), "drive", "My Drive", "sophomore", "nlp 4740", "project_3", "source.txt") 
target_path = os.path.join(os.getcwd(), "drive", "My Drive", "sophomore", "nlp 4740", "project_3", "target.txt") 
test_path = os.path.join(os.getcwd(), "drive", "My Drive", "sophomore", "nlp 4740", "project_3", "test.txt") 


Mounted at /content/drive


## Import libraries and connect to Google Drive

In [None]:
!pip install -U gensim



In [None]:
!pip3 install sentencepiece
from collections import Counter, namedtuple
from itertools import chain
import json
import math
import os
from pathlib import Path
import random
import time
from tqdm.notebook import tqdm, trange
from typing import List, Tuple, Dict, Set, Union


import gensim
import nltk
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction
import numpy as np
import sentencepiece as spm
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
from torch.nn import init
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler
import torch.nn.utils
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence





In [None]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Part 1: Recurrent Neural Network

Below we define the general problem set up of FFNNs and RNNs.

$\textbf{FFNN.}$ \
$Input: \text{We have an input vector }\vec{x} \in \mathcal{R}^d$ \
$Model\text{ }Output: \text{The model has some intermediate output }\vec{z} \in \mathcal{R}^{\mid \mathcal{Y}\mid}$ \
$Final\text{ }Output: \text{ The model outputs a vector } \vec{y} \in \mathcal{R}^{\mid \mathcal{Y}\mid}$ \
$\vec{y}$ satisfies the constraint of being a probability distribution, i.e. $\underset{i \in \mid \mathcal{Y} \mid}{\sum} \vec{y}[i] = 1$ and $\underset{i \in \mid \mathcal{Y} \mid}{min} \text{ }\vec{y}[i] \geq 0$, which is achieved via _Softmax_ applied to $\vec{z}$.
<br></br>
$\textbf{RNN.}$ \
$Input: \text{The model takes as input a sequence of vectors } \vec{x}_1,\vec{x}_2, \dots, \vec{x}_k; \vec{x}_i \in \mathcal{R}^d$ \
$Model\text{ }Output: \text{The model generates some intermediate sequence output } \vec{z}_1,\vec{z}_2, \dots, \vec{z}_k; \vec{z}_i \in \mathcal{R}^{h}, \text{ where } h \text{ is the hidden state size.}$ \
$Final\text{ }Output: \text{The model generates some final sequence output } \vec{y}_1, \dots, \vec{y}_k \in \mathcal{R}^{\mid \mathcal{Y}\mid}$ \
Each $\vec{y}_j$ satisfies the constraint of being a probability distribution, i.e. $\underset{i \in \mid \mathcal{Y} \mid}{\sum} \vec{y}_j[i] = 1$ and $\underset{i \in \mid \mathcal{Y} \mid}{min} \text{ }\vec{y}_j[i] \geq 0$.

Each output $\vec{y}_j$ is obtained by applying a linear classifier followed by Softmax to $\vec{z}_j$:

$$\vec{y}_j = Softmax(W\vec{z}_j); \text{ where }W\in \mathcal{R}^{\mid \mathcal{Y}\mid \times h} $$
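
The probability-distribution constraint can be checked numerically. The following is a minimal sketch in plain Python (so it runs without PyTorch); the vector `z` and its size are made-up values for illustration, not taken from the project data.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)                           # subtract the max to avoid overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, -1.0, 0.5]                     # hypothetical intermediate output with |Y| = 3
y = softmax(z)

print(abs(sum(y) - 1.0) < 1e-9)          # the entries sum to 1
print(min(y) >= 0.0)                     # every entry is non-negative
```

Softmax preserves the ordering of its inputs, so the largest entry of `z` also receives the largest probability.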

Given a sentence in the source language, we look up the word embeddings from an embeddings matrix, yielding $x_1,\dots, x_n$ ($x_i \in R^{e}$), where $n$ is the length of the source sentence and $e$ is the embedding size. We feed these embeddings to the bidirectional encoder, yielding hidden states for both the forward (→) and backward (←) RNNs. The forward and backward versions are concatenated to give hidden states $h_i^{enc}$:


$$h_i^{enc} = [\overrightarrow{h_i^{enc}}; \overleftarrow{h_i^{enc}}] \text{ where }h_i^{enc} \in R^{2h}, \overrightarrow{h_i^{enc}}, \overleftarrow{h_i^{enc}} \in R^{h}$$


We then initialize the decoder's first hidden state $h_0^{dec}$ with a linear projection of the encoder's final hidden states. The forward RNN finishes at position $n$, while the backward RNN reads right to left and finishes at position $1$:

$$h_0^{dec} = W_h[\overrightarrow{h_n^{enc}}; \overleftarrow{h_1^{enc}}] \text{ where }h_0^{dec} \in R^{h}, W_h \in R^{h \times 2h}$$
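
To make the dimensions concrete, here is a small sketch in plain Python of the concatenate-then-project step. The hidden size, the state values, and the entries of `W_h` are arbitrary placeholders chosen for illustration; only the shapes matter.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

h = 2                                  # hypothetical hidden size
fwd_final = [0.1, -0.3]                # final state of the forward RNN, in R^h
bwd_final = [0.7, 0.2]                 # final state of the backward RNN, in R^h

concat = fwd_final + bwd_final         # list concatenation gives a vector in R^{2h}

# W_h has h rows and 2h columns; the values are placeholders.
W_h = [[1.0, 0.0, 1.0, 0.0],
       [0.0, 1.0, 0.0, 1.0]]

h0_dec = matvec(W_h, concat)           # projected back down to R^h
print(len(concat), len(h0_dec))
```

Because `W_h` has `h` rows, the projected vector has the same size as a single decoder hidden state.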

With the decoder initialized, we must now feed it a target sentence. On the $t^{th}$ step, we look up the embedding for the $t^{th}$ word, $y_t \in R^{e}$. We then concatenate $y_t$ with the combined-output vector $o_{t-1} \in R^{h}$ from the previous timestep to produce $\bar{y}_t \in R^{e+h}$. Note that for the first target word (i.e. the start token), $o_0$ is a zero vector in our setup (it could also be random or learned). We then feed $\bar{y}_t$ as input to the decoder.

$$ h_t^{dec} = Decoder(\bar{y}_t, h_{t-1}^{dec})\text{ where }h_{t-1}^{dec} \in R^{h}$$

We can take the decoder hidden state $h_t^{dec}$ and pass this through a linear layer to obtain an intermediate output $v_t$. This is then passed through an activation function (like tanh) to obtain our combined-output vector $o_t$

$$v_t = W_v h_t^{dec} \text{ where } W_v \in R^{h \times h}, v_t \in R^{h}$$
$$o_t = \tanh{(v_t)} \text{ where } o_t \in R^{h}$$

Then, we produce a probability distribution $P_t$ over target words at the $t^{th}$ timestep.

$$P_t = Softmax(W_{v_{target}} o_t) \text{ where }P_t \in R^{V_{target}}, W_{v_{target}}\in R^{V_{target} \times h}$$
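
The output side of one decoder step can be traced with small numbers. This is a sketch in plain Python under stated assumptions: the sizes `e`, `h`, `V`, all weights, and the stand-in value `h_dec` are invented for illustration, and the recurrent cell itself is not modeled.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e_ / total for e_ in exps]

e, h, V = 2, 2, 3                      # hypothetical embedding, hidden, and vocab sizes
y_t = [0.5, -0.5]                      # embedding of the current target word, in R^e
o_prev = [0.0, 0.0]                    # previous combined output; zero at the first step
decoder_input = y_t + o_prev           # concatenation, in R^{e+h}

h_dec = [0.2, -0.4]                    # stand-in for the decoder cell's new hidden state

W_v = [[1.0, 0.5], [-0.5, 1.0]]        # placeholder weights, shape h x h
v_t = matvec(W_v, h_dec)
o_t = [math.tanh(v) for v in v_t]      # combined output, each entry in (-1, 1)

W_vocab = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # placeholder weights, shape V x h
P_t = softmax(matvec(W_vocab, o_t))    # distribution over the V target words
print(len(decoder_input), len(o_t), len(P_t))
```

The combined output `o_t` plays two roles: it is projected to the vocabulary at this step and concatenated with the next word's embedding at the following step.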


Here, $V_{target}$ is the size of the target vocabulary. Finally, to train the network we then compute the softmax cross entropy loss between $P_t$ and $g_t$, where $g_t$ is the one-hot vector of the target word at timestep t:

$$Loss(Model) = CrossEntropy(P_t, g_t)$$
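
Since $g_t$ is one-hot, the cross entropy at one timestep reduces to the negative log-probability of the gold word. A minimal sketch in plain Python, with a made-up distribution `P_t` over a three-word vocabulary:

```python
import math

def cross_entropy(P_t, g_t):
    """-sum_i g_t[i] * log(P_t[i]); terms with g_t[i] == 0 contribute nothing."""
    return -sum(g * math.log(p) for g, p in zip(g_t, P_t) if g > 0)

P_t = [0.7, 0.2, 0.1]                  # hypothetical predicted distribution
g_t = [0, 1, 0]                        # one-hot vector: the gold word has index 1

loss = cross_entropy(P_t, g_t)
print(loss == -math.log(P_t[1]))       # only the gold word's probability matters
```

This is why the model code only needs to gather the log-probability of each gold target word: negating the sum of those values gives this loss accumulated over timesteps.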

Now that we have described the model, we'll implement it for Modern English to Shakespearean English translation, using **BLEU** (Bilingual Evaluation Understudy) as our metric. We calculate BLEU by counting the n-grams in the candidate translation that also appear in the reference text, where a 1-gram (unigram) is a single token and a bigram is a pair of adjacent tokens. Matches are counted regardless of where in the sentence the n-gram occurs. BLEU uses n-grams of size 1-4 in its computation.
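
The n-gram counting step can be sketched in a few lines of plain Python. This illustrates only the clipped n-gram precision for a single n; full BLEU additionally combines the precisions for n = 1 to 4 with a geometric mean and applies a brevity penalty, and the notebook relies on NLTK's `corpus_bleu` for actual scoring. The function names and the example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, where each
    n-gram is credited at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

reference = "the cat is on the mat".split()
candidate = "the the the cat".split()

p1 = clipped_precision(candidate, reference, 1)   # "the" is credited only twice
p2 = clipped_precision(candidate, reference, 2)   # only ("the", "cat") matches
print(p1, p2)
```

Clipping prevents a candidate from scoring well by repeating a common reference word many times.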









## 1.1 RNN Implementation






### 1.1.1 Data loading

In [None]:
Hypothesis = namedtuple('Hypothesis', ['value', 'score'])

In [None]:
def pad_sents(sents, pad_token):
    """ Pad list of sentences according to the longest sentence in the batch.
        The paddings should be at the end of each sentence.
    :param sents: list of sentences, where each sentence
                                    is represented as a list of words
    :type sents: list[list[str]]
    :param pad_token: padding token
    :type pad_token: str
    :returns sents_padded: list of sentences where sentences shorter
        than the max length sentence are padded out with the pad_token, such that
        each sentence in the batch now has equal length.
    :rtype: list[list[str]]
    """
    sents_padded = []

    max_len = max([len(sent) for sent in sents])
    sents_padded = [(sent + ([pad_token] * (max_len - len(sent)))) for sent in sents]

    return sents_padded
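

A quick way to see what padding produces, and how a padded batch looks once it is transposed to the time-major layout `(max_sentence_length, batch_size)` used later by `to_input_tensor`. This sketch is self-contained plain Python: `pad_to_longest` is a hypothetical stand-in that mirrors `pad_sents`, and the sentences are invented.

```python
def pad_to_longest(sents, pad_token):
    """Stand-in for pad_sents: right-pad every sentence to the batch maximum."""
    max_len = max(len(s) for s in sents)
    return [s + [pad_token] * (max_len - len(s)) for s in sents]

batch = [["thou", "art", "kind"], ["hark"]]
padded = pad_to_longest(batch, "<pad>")

# Transpose from (batch, max_len) to (max_len, batch): row t holds the t-th
# token of every sentence, which is what the model consumes at timestep t.
time_major = [list(step) for step in zip(*padded)]

print(padded)
print(time_major)
```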

In [None]:
def read_corpus(file_path, source):
    """ Read file, where each sentence is delineated by a `\n`.
    :param file_path: path to file containing corpus
    :type file_path: str
    :param source: "tgt" or "src" indicating whether text
        is of the source language or target language
    :type source: str
    """
    data = []
    # Use a context manager so the file is closed once reading finishes
    with open(file_path) as f:
        for line in f:
            sent = nltk.word_tokenize(line)
            # only append <s> and </s> to the target sentence
            if source == 'tgt':
                sent = ['<s>'] + sent + ['</s>']
            data.append(sent)

    return data

In [None]:
class Vocab(object):
    """ Vocabulary, i.e. structure containing either
    src or tgt language terms.
    """
    def __init__(self, word2id=None):
        """ Init Vocab Instance.
        
        :param word2id: dictionary mapping words 2 indices
        :type word2id: dict[str, int]
        """
        if word2id:
            self.word2id = word2id
        else:
            self.word2id = dict()
            self.word2id['<pad>'] = 0   # Pad Token
            self.word2id['<s>'] = 1     # Start Token
            self.word2id['</s>'] = 2    # End Token
            self.word2id['<unk>'] = 3   # Unknown Token
        self.unk_id = self.word2id['<unk>']
        self.id2word = {v: k for k, v in self.word2id.items()}

    def __getitem__(self, word):
        """ Retrieve word's index. Return the index for the unk
        token if the word is out of vocabulary.
        
        :param word: word to look up
        :type word: str
        :returns: index of word
        :rtype: int
        """
        return self.word2id.get(word, self.unk_id)

    def __contains__(self, word):
        """ Check if word is captured by Vocab.
        
        :param word: word to look up
        :type word: str
        :returns: whether word is in vocab
        :rtype: bool
        """
        return word in self.word2id

    def __setitem__(self, key, value):
        """ Raise error, if one tries to edit the Vocab directly.
        """
        raise ValueError('vocabulary is readonly')

    def __len__(self):
        """ Compute number of words in Vocab.
        
        :returns: number of words in Vocab
        :rtype: int
        """
        return len(self.word2id)

    def __repr__(self):
        """ Representation of Vocab to be used
        when printing the object.
        """
        return 'Vocabulary[size=%d]' % len(self)

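    def id2word(self, wid):
        """ Return mapping of index to word.

        Note: `__init__` assigns an instance attribute with the same name (the
        `id2word` dict), which shadows this method on instances. Use
        `vocab.id2word[wid]` or `indices2words` for lookups.
        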
        :param wid: word index
        :type wid: int
        :returns: word corresponding to index
        :rtype: str
        """
        return self.id2word[wid]

    def add(self, word):
        """ Add word to Vocab, if it is previously unseen.
        
        :param word: to add to Vocab
        :type word: str
        :returns: index that the word has been assigned
        :rtype: int
        """
        if word not in self:
            wid = self.word2id[word] = len(self)
            self.id2word[wid] = word
            return wid
        else:
            return self[word]

    def words2indices(self, sents):
        """ Convert list of words or list of sentences of words
        into list or list of list of indices.
        
        :param sents: sentence(s) in words
        :type sents: Union[List[str], List[List[str]]]
        :returns: sentence(s) in indices
        :rtype: Union[List[int], List[List[int]]]
        """
        if isinstance(sents[0], list):
            return [[self[w] for w in s] for s in sents]
        else:
            return [self[w] for w in sents]

    def indices2words(self, word_ids):
        """ Convert list of indices into words.
        
        :param word_ids: list of word ids
        :type word_ids: List[int]
        :returns: list of words
        :rtype: List[str]
        """
        return [self.id2word[w_id] for w_id in word_ids]

    def to_input_tensor(self, sents: List[List[str]], device: torch.device) -> torch.Tensor:
        """ Convert list of sentences (words) into tensor with necessary padding for 
        shorter sentences.
        
        :param sents: list of sentences (words)
        :type sents: List[List[str]]
        :param device: Device on which to load the tensor, ie. CPU or GPU
        :type device: torch.device
        :returns: Sentence tensor of (max_sentence_length, batch_size)
        :rtype: torch.Tensor
        """

        word_ids = self.words2indices(sents)
        sents_t = pad_sents(word_ids, self['<pad>'])
        sents_var = torch.tensor(sents_t, dtype=torch.long, device=device)
        return torch.t(sents_var)

    @staticmethod
    def from_corpus(corpus, size, freq_cutoff=2):
        """ Given a corpus construct a Vocab.
        
        :param corpus: corpus of text produced by read_corpus function
        :type corpus: List[List[str]]
        :param size: # of words in vocabulary
        :type size: int
        :param freq_cutoff: if word occurs n < freq_cutoff times, drop the word
        :type freq_cutoff: int
        :returns: Vocab instance produced from provided corpus
        :rtype: Vocab
        """
        vocab_entry = Vocab()
        word_freq = Counter(chain(*corpus))
        valid_words = [w for w, v in word_freq.items() if v >= freq_cutoff]
        print('number of word types: {}, number of word types w/ frequency >= {}: {}'
              .format(len(word_freq), freq_cutoff, len(valid_words)))
        top_k_words = sorted(valid_words, key=lambda w: word_freq[w], reverse=True)[:size]
        for word in top_k_words:
            vocab_entry.add(word)
        return vocab_entry
    
    @staticmethod
    def from_subword_list(subword_list):
        """Given a list of subwords, construct the Vocab.
        
        :param subword_list: list of subwords in corpus
        :type subword_list: List[str]
        :returns: Vocab instance produced from provided list
        :rtype: Vocab
        """
        vocab_entry = Vocab()
        for subword in subword_list:
            vocab_entry.add(subword)
        return vocab_entry
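

The two behaviors of `Vocab` that matter most downstream are the frequency cutoff when building the vocabulary and the fallback to `<unk>` on lookup. The sketch below reproduces both with a plain dictionary so it can be run on its own; the tiny corpus and the `lookup` helper are made up for illustration.

```python
from collections import Counter
from itertools import chain

corpus = [["<s>", "good", "morrow", "</s>"],
          ["<s>", "good", "night", "</s>"]]

# Special tokens get the first four ids, as in Vocab.__init__
word2id = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3}

freq = Counter(chain(*corpus))
freq_cutoff, size = 2, 10
valid = [w for w, c in freq.items() if c >= freq_cutoff]      # drops "morrow" and "night"
top_k = sorted(valid, key=lambda w: freq[w], reverse=True)[:size]
for w in top_k:
    if w not in word2id:                                      # special tokens keep their ids
        word2id[w] = len(word2id)

def lookup(word):
    """Hypothetical helper: return the word's id, or the <unk> id if it is out of vocabulary."""
    return word2id.get(word, word2id["<unk>"])

ids = [lookup(w) for w in ["good", "night"]]
print(word2id)
print(ids)
```

Words below the cutoff are not stored at all, so at lookup time they are indistinguishable from words never seen in training.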

In [None]:
import nltk
nltk.download('punkt')

print('initialize source vocabulary ..')
src_sents = read_corpus(source_path, "src")
src = Vocab.from_corpus(src_sents, 20000, 2) # 7098, 9422

print('initialize target vocabulary ..')
tgt_sents = read_corpus(target_path, "tgt")
tgt = Vocab.from_corpus(tgt_sents, 20000, 2) # 6893, 10956

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
initialize source vocabulary ..
number of word types: 13252, number of word types w/ frequency >= 2: 9167
initialize target vocabulary ..
number of word types: 15216, number of word types w/ frequency >= 2: 10725


In [None]:
# We explicitly choose to do nothing with respect to embeddings here and the embeddings are learned end-to-end during training in the NMT class

In [None]:
# Split into training and validation data
train_data_src, val_data_src, train_data_tgt, val_data_tgt = train_test_split(src_sents, tgt_sents, test_size=0.045922, random_state=42)

In [None]:
train_data = list(zip(train_data_src, train_data_tgt))
val_data = list(zip(val_data_src, val_data_tgt))

### 1.1.2 NMT Model Implementation

In [None]:
def generate_sent_masks(enc_hiddens: torch.Tensor, source_lengths: List[int], device: torch.device) -> torch.Tensor:
    """ Generate sentence masks for encoder hidden states.

    :param enc_hiddens: encodings of shape (b, src_len, 2*h), where b = batch size,
        src_len = max source length, h = hidden size.
    :type enc_hiddens: torch.Tensor
    :param source_lengths: List of actual lengths for each of the sentences in the batch.   
    :type source_lengths: List[int]
    :param device: Device on which to load the tensor, ie. CPU or GPU
    :type device: torch.device
    :returns: Tensor of sentence masks of shape (b, src_len), where b = batch size and
        src_len = max source length. Padded positions hold 1.0 and real tokens hold 0.0.
    :rtype: torch.Tensor
    """
    enc_masks = torch.zeros(enc_hiddens.size(0), enc_hiddens.size(1), dtype=torch.float)
    for e_id, src_len in enumerate(source_lengths):
        enc_masks[e_id, src_len:] = 1
    return enc_masks.to(device)
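

The mask convention is easy to get backwards: here 1 marks padding and 0 marks a real token. A plain-Python sketch of the same logic, using nested lists instead of tensors and a hypothetical function name:

```python
def sentence_masks(source_lengths, max_len):
    """Return one row per sentence: 0.0 for real tokens, 1.0 for padded positions."""
    return [[0.0 if pos < length else 1.0 for pos in range(max_len)]
            for length in source_lengths]

masks = sentence_masks([3, 1], max_len=3)
print(masks)
```

The first sentence fills all three positions, so its row is all zeros; the second has one real token followed by two padded positions.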

In [None]:
class Encoder(nn.Module):
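    def __init__(self, embed_size, hidden_size, source_embeddings):
        """ Bidirectional LSTM encoder.

        :param embed_size: Embedding size (dimensionality)
        :type embed_size: int
        :param hidden_size: Hidden size per direction (dimensionality)
        :type hidden_size: int
        :param source_embeddings: Embedding layer for the source vocabulary
        :type source_embeddings: nn.Embedding
        """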
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.embed_size = embed_size
        self.embedding = source_embeddings
        ###     self.encoder (Bidirectional RNN with bias)
        ###     self.h_projection (Linear Layer with no bias), called W_{h} above.
        # self.n, self.e = self.embedding.size() 
        self.encoder = torch.nn.LSTM(input_size=embed_size, hidden_size=hidden_size, num_layers=1, bidirectional=True)
        # encoder takes embeddings x1,...xn in R^e, therefore input size is embed_size
        self.h_projection = torch.nn.Linear(in_features=2*hidden_size, out_features=hidden_size, bias=False)
        self.c_projection = torch.nn.Linear(in_features= 2*hidden_size, out_features = hidden_size, bias=False)

        
    
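    def forward(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        """ Encode a batch of padded source sentences.

        :param source_padded: Padded source word ids, shape (src_len, b)
        :param source_lengths: Actual sentence lengths, sorted longest to shortest
        :returns: `enc_hiddens` of shape (b, src_len, 2*h) and the decoder's initial
            (hidden, cell) state, each of shape (b, h)
        """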
        enc_hiddens, dec_init_state = None, None

        ###     1. Construct Tensor `X` of source sentences with shape (src_len, b, e) using the source model embeddings.
        ###         src_len = maximum source sentence length, b = batch size, e = embedding size.
        ###     2. Compute `enc_hiddens`, `last_hidden` by applying the encoder to `X`.
        ###     3. Compute `dec_init_state` = init_decoder_hidden:
        ###         - `init_decoder_hidden`:
        ###             `last_hidden` is a tensor shape (2, b, h). The first dimension corresponds to forward and backwards.
        ###             Concatenate the forward and backward tensors to obtain a tensor shape (b, 2*h).
        ###             Apply the h_projection layer to this in order to compute init_decoder_hidden.
        ###             This is h_0^{dec} in above in the writeup. Here b = batch size, h = hidden size
       
        # 1. Embed the padded source ids: (src_len, b) -> (src_len, b, e)
        X = self.embedding(source_padded)
        # 2. Pack so the LSTM skips padding; with the default enforce_sorted=True,
        #    source_lengths must be sorted from longest to shortest.
        X = torch.nn.utils.rnn.pack_padded_sequence(X, source_lengths)
        enc_hiddens, (last_hidden, last_cell) = self.encoder(X)
        # Unpack the encoder outputs (not the packed inputs X) back into a padded tensor
        enc_hiddens, _ = torch.nn.utils.rnn.pad_packed_sequence(enc_hiddens, batch_first=False)
        enc_hiddens = torch.permute(enc_hiddens, (1, 0, 2)) # (src_len, b, h*2) -> (b, src_len, h*2)
        # 3. Concatenate the forward and backward final states: (2, b, h) -> (b, 2*h)
        last_hidden_cat = torch.cat([last_hidden[0], last_hidden[1]], dim=-1)
        last_cell_cat = torch.cat([last_cell[0], last_cell[1]], dim=-1)
        # Project down to (b, h) to initialize the decoder's hidden and cell states
        init_decoder_hidden = self.h_projection(last_hidden_cat)
        init_decoder_cell = self.c_projection(last_cell_cat)

        dec_init_state = (init_decoder_hidden, init_decoder_cell)

        return enc_hiddens, dec_init_state

In [None]:
class Decoder(nn.Module):
    def __init__(self, embed_size, hidden_size, target_embedding, device):
        """
        """
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.device = device
        self.embedding = target_embedding
        output_vocab_size = self.embedding.weight.size(0)
        self.softmax = nn.Softmax(dim=1)

        ###     self.decoder (RNN Cell with bias)
        ###     self.combined_output_projection (Linear Layer with no bias), called W_{v} above.
        ###     self.target_vocab_projection (Linear Layer with no bias), called W_{v_{target}} above.
        
        self.decoder = torch.nn.LSTMCell(input_size=embed_size+hidden_size, hidden_size=hidden_size, bias=True)
        self.combined_output_projection = torch.nn.Linear(in_features=hidden_size, out_features=hidden_size, bias=False) # W_v in R^{h x h}
        self.target_vocab_projection = torch.nn.Linear(in_features=hidden_size, out_features=output_vocab_size, bias=False) # W_{v_target} in R^{V_target x h}

    
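    def forward(self, enc_hiddens: torch.Tensor,
                dec_init_state: Tuple[torch.Tensor, torch.Tensor], target_padded: torch.Tensor) -> torch.Tensor:
        """ Run the decoder over a batch of gold target sentences (teacher forcing)
        and return the combined outputs, shape (tgt_len - 1, b, h).
        """
        # Drop the last position: the </s> token for max-length sentences, padding otherwise.
        target_padded = target_padded[:-1]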

        dec_state = dec_init_state

        # Initialize previous combined output vector o_{t-1} as zero
        batch_size = enc_hiddens.size(0)
        o_prev = torch.zeros(batch_size, self.hidden_size, device=self.device)

        # Initialize a list we will use to collect the combined output o_t on each step
        combined_outputs = []

        ###     1. Construct tensor `Y` of target sentences with shape (tgt_len, b, e) using the target model embeddings.
        ###         where tgt_len = maximum target sentence length, b = batch size, e = embedding size.
        ###     2. Use the torch.split function to iterate over the time dimension of Y.
        ###         Within the loop, this will give you Y_t of shape (1, b, e) where b = batch size, e = embedding size.
        ###             - Squeeze Y_t into a tensor of dimension (b, e). 
        ###             - Construct Ybar_t by concatenating Y_t with o_prev on their last dimension
        ###             - Use the step function to compute the Decoder's next (cell, state) values
        ###               as well as the new combined output o_t.
        ###             - Append o_t to combined_outputs
        ###             - Update o_prev to the new o_t.
        ###     3. Use torch.stack to convert combined_outputs from a list length tgt_len of
        ###         tensors shape (b, h), to a single tensor shape (tgt_len, b, h)
        ###         where tgt_len = maximum target sentence length, b = batch size, h = hidden size.

        # Each decoder input is the gold embedding (teacher forcing) concatenated with the
        # previous combined output, so its size is e + h.

        # 1
        Y = self.embedding(target_padded)

        # 2
        for Y_t in torch.split(Y, 1): # https://edstem.org/us/courses/12801/discussion/861579 ?
            # Squeeze only the time dimension so a batch of size 1 keeps its batch dimension.
            Y_t = torch.squeeze(Y_t, dim=0)
            Ybar_t = torch.cat((Y_t, o_prev), dim=-1)
            dec_state, o_t = self.step(Ybar_t, dec_state, enc_hiddens)
            combined_outputs.append(o_t)
            o_prev = o_t

        # 3
        combined_outputs = torch.stack(combined_outputs)

        return combined_outputs
    
    def step(self, Ybar_t: torch.Tensor,
            dec_state: Tuple[torch.Tensor, torch.Tensor],
            enc_hiddens: torch.Tensor) -> Tuple[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        """ Compute one forward step of the LSTM decoder. This model has no attention,
        so `enc_hiddens` is accepted but not used.

        :param Ybar_t: Concatenated Tensor of [Y_t o_prev], with shape (b, e + h). The input for the decoder,
                                where b = batch size, e = embedding size, h = hidden size.
        :type Ybar_t: torch.Tensor
        :param dec_state: Tuple of the decoder's previous hidden state and cell state,
                each a Tensor with shape (b, h), where b = batch size, h = hidden size.
        :type dec_state: Tuple[torch.Tensor, torch.Tensor]
        :param enc_hiddens: Encoder hidden states Tensor, with shape (b, src_len, h * 2), where b = batch size,
                                    src_len = maximum source length, h = hidden size.
        :type enc_hiddens: torch.Tensor

        :returns dec_state: Tuple of the decoder's new hidden state and cell state,
                each a Tensor with shape (b, h), where b = batch size, h = hidden size.
        :returns combined_output: Combined output Tensor at timestep t, shape (b, h), where b = batch size, h = hidden size.
        """

        ###     1. Apply the decoder to `Ybar_t` and `dec_state` to obtain the new dec_state.
        ###     2. Split the new dec_state into dec_hidden and dec_cell.
        dec_state = self.decoder(Ybar_t, dec_state)
        dec_hidden, dec_cell = dec_state

        ###     3. Apply the combined output projection layer to h^dec_t to compute tensor V_t
        ###     4. Compute tensor O_t by applying the Tanh function.
        V_t = self.combined_output_projection(dec_hidden)
        O_t = torch.tanh(V_t)

        combined_output = O_t
        return dec_state, combined_output
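

The decoder is trained with teacher forcing: at each step it receives the gold word from position t and is scored on predicting the gold word at position t + 1. The plain-Python sketch below shows that one-position shift and how padded positions are excluded from the score. The sentence and the probabilities in `gold_probs` are invented for illustration.

```python
import math

# One padded target sentence, as produced by wrapping with <s>, </s> and padding
target = ["<s>", "thou", "art", "</s>", "<pad>"]

decoder_inputs = target[:-1]     # what the decoder reads at each step
gold_outputs = target[1:]        # what it should predict at each step

# Hypothetical probability the model assigns to each gold output token
gold_probs = [0.5, 0.25, 0.5, 0.9]

# Padded gold positions must not contribute to the sentence score
mask = [0.0 if tok == "<pad>" else 1.0 for tok in gold_outputs]
score = sum(m * math.log(p) for m, p in zip(mask, gold_probs))

for inp, gold, m in zip(decoder_inputs, gold_outputs, mask):
    print(f"input={inp:6s} gold={gold:6s} counted={bool(m)}")
print(score)
```

The last step feeds `</s>` and is asked to predict `<pad>`; the mask removes that step, so the score covers exactly the three real predictions.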

In [None]:
class NMT(nn.Module):
    """ Simple Neural Machine Translation Model:
        - Bidirectional RNN (LSTM) Encoder
        - Unidirectional RNN (LSTM) Decoder
    """
    def __init__(self, embed_size, hidden_size, src_vocab, tgt_vocab, dropout_rate, device=torch.device("cpu"), pretrained_source=None,pretrained_target=None,):
        """ Init NMT Model.

        :param embed_size: Embedding size (dimensionality)
        :type embed_size: int
        :param hidden_size: Hidden Size, the size of hidden states (dimensionality)
        :type hidden_size: int
        :param src_vocab: Vocabulary object containing src language
        :type src_vocab: Vocab
        :param tgt_vocab: Vocabulary object containing tgt language
        :type tgt_vocab: Vocab
        :param dropout_rate: Dropout rate; currently unused, since `self.dropout_rate` is fixed to 0
        :type dropout_rate: float
        :param device: torch device to put all modules on
        :type device: torch.device
        :param pretrained_source: Matrix of pre-trained source word embeddings
        :type pretrained_source: Optional[torch.Tensor]
        :param pretrained_target: Matrix of pre-trained target word embeddings
        :type pretrained_target: Optional[torch.Tensor]
        """
        super(NMT, self).__init__()
        self.device=device
        self.embed_size = embed_size
        self.src_vocab = src_vocab
        self.tgt_vocab = tgt_vocab
        src_pad_token_idx = src_vocab['<pad>']
        tgt_pad_token_idx = tgt_vocab['<pad>']
        self.source_embedding = nn.Embedding(len(src_vocab), embed_size, padding_idx=src_pad_token_idx)
        self.target_embedding = nn.Embedding(len(tgt_vocab), embed_size, padding_idx=tgt_pad_token_idx)
        self.dropout_rate = 0
        
        with torch.no_grad():
            if pretrained_source is not None:
                self.source_embedding.weight.data = pretrained_source
                # TODO: Decide if we want the embeddings to update as we train
                self.source_embedding.weight.requires_grad = False
        
            if pretrained_target is not None:
                self.target_embedding.weight.data = pretrained_target
                # TODO: Decide if we want the embeddings to update as we train
                self.target_embedding.weight.requires_grad = False
        
        self.hidden_size = hidden_size
        # self.dropout_rate = dropout_rate

        self.encoder = Encoder(
            embed_size=embed_size,
            hidden_size=hidden_size,
            source_embeddings=self.source_embedding,
        )
        self.decoder = Decoder(
            embed_size=embed_size,
            hidden_size=hidden_size,
            target_embedding=self.target_embedding,
            # dropout_rate=dropout_rate,
            device=self.device,
        )


    def forward(self, source: List[List[str]], target: List[List[str]]) -> torch.Tensor:
        """ Take a mini-batch of source and target sentences, compute the log-likelihood of
        target sentences under the language models learned by the NMT system.

        :param source: list of source sentence tokens
        :type source: List[List[str]]
        :param target: list of target sentence tokens, wrapped by `<s>` and `</s>`
        :type target: List[List[str]]
        :returns scores: a variable/tensor of shape (b, ) representing the
                                    log-likelihood of generating the gold-standard target sentence for
                                    each example in the input batch. Here b = batch size.
        :rtype: torch.Tensor
        """
        # Compute sentence lengths
        source_lengths = [len(s) for s in source]

        # Convert list of lists into tensors
        source_padded = self.src_vocab.to_input_tensor(source, device=self.device)   # Tensor: (src_len, b)
        target_padded = self.tgt_vocab.to_input_tensor(target, device=self.device)   # Tensor: (tgt_len, b)
        
        ###     1. Apply the encoder to `source_padded` by calling `self.encode()`
        ###     2. Generate sentence masks for `source_padded` by calling `generate_sent_masks()`
        ###     3. Apply the decoder to compute combined-output by calling `self.decode()`
        ###     4. Compute log probability distribution over the target vocabulary using the
        ###        combined_outputs returned by the `self.decode()` function.

        enc_hiddens, dec_init_state = self.encode(source_padded, source_lengths)
        enc_masks = generate_sent_masks(enc_hiddens, source_lengths, self.device)
        combined_outputs = self.decode(enc_hiddens, dec_init_state, target_padded)
        P = F.log_softmax(self.decoder.target_vocab_projection(combined_outputs), dim=-1)

        # Zero out probabilities at padded positions in the target text
        target_masks = (target_padded != self.tgt_vocab['<pad>']).float()
        
        # Compute log probability of generating true target words
        target_gold_words_log_prob = torch.gather(P, index=target_padded[1:].unsqueeze(-1), dim=-1).squeeze(-1) * target_masks[1:]
        scores = target_gold_words_log_prob.sum(dim=0)
        return scores


    def encode(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        """ Apply the encoder to source sentences to obtain encoder hidden states.
            Additionally, take the final states of the encoder and project them to obtain initial states for decoder.

        :param source_padded: Tensor of padded source sentences with shape (src_len, b), where
            b = batch_size, src_len = maximum source sentence length. Note that these have
            already been sorted in order of longest to shortest sentence.
        :type source_padded: torch.Tensor
        :param source_lengths: List of actual lengths for each of the source sentences in the batch
        :type source_lengths: List[int]
        :returns: Tuple of two items. The first is Tensor of hidden units with shape (b, src_len, h*2),
            where b = batch size, src_len = maximum source sentence length, h = hidden size. The second is
            Tuple of tensors representing the decoder's initial hidden state and cell.
        :rtype: Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
        """
        return self.encoder(source_padded, source_lengths)


    def decode(self, enc_hiddens: torch.Tensor,
                dec_init_state: Tuple[torch.Tensor, torch.Tensor], target_padded: torch.Tensor) -> torch.Tensor:
        """Compute combined output vectors for a batch.

        :param enc_hiddens (Tensor): Hidden states (b, src_len, h*2), where
                                     b = batch size, src_len = maximum source sentence length, h = hidden size.
        :param dec_init_state (tuple(Tensor, Tensor)): Initial state and cell for decoder
        :param target_padded: Gold-standard padded target sentences (tgt_len, b), where
                                       tgt_len = maximum target sentence length, b = batch size. 

        :returns combined_outputs: combined output tensor  (tgt_len, b,  h), where
                                    tgt_len = maximum target sentence length, b = batch_size,  h = hidden size
        :rtype: torch.Tensor
        """
        return self.decoder(enc_hiddens, dec_init_state, target_padded)

    def beam_search(self, src_sent: List[str], beam_size: int=5, max_decoding_time_step: int=70) -> List[Hypothesis]:
        """ Given a single source sentence, perform beam search, yielding translations in the target language.
        :param src_sent: a single source sentence (words)
        :type src_sent: List[str]
        :param beam_size: beam size
        :type beam_size: int
        :param max_decoding_time_step: maximum number of time steps to unroll the decoding RNN
        :type max_decoding_time_step: int
        :returns hypotheses: a list of hypotheses, each with two fields:
                value: List[str]: the decoded target sentence, represented as a list of words
                score: float: the log-likelihood of the target sentence
        :rtype: List[Hypothesis]
        """
        src_sents_var = self.src_vocab.to_input_tensor([src_sent], self.device)

        src_encodings, dec_init_vec = self.encode(src_sents_var, [len(src_sent)])

        h_tm1 = dec_init_vec
        att_tm1 = torch.zeros(1, self.hidden_size, device=self.device)

        eos_id = self.tgt_vocab['</s>']

        hypotheses = [['<s>']]
        hyp_scores = torch.zeros(len(hypotheses), dtype=torch.float, device=self.device)
        completed_hypotheses = []

        t = 0
        while len(completed_hypotheses) < beam_size and t < max_decoding_time_step:
            t += 1
            hyp_num = len(hypotheses)

            exp_src_encodings = src_encodings.expand(hyp_num,
                                                     src_encodings.size(1),
                                                     src_encodings.size(2))

            y_tm1 = torch.tensor([self.tgt_vocab[hyp[-1]] for hyp in hypotheses], dtype=torch.long, device=self.device)
            y_t_embed = self.target_embedding(y_tm1)

            x = torch.cat([y_t_embed, att_tm1], dim=-1)

            h_t, att_t = self.decoder.step(x, h_tm1, exp_src_encodings)
          
            h_t, c_t = h_t

            # log probabilities over target words
            log_p_t = F.log_softmax(self.decoder.target_vocab_projection(att_t), dim=-1)

            live_hyp_num = beam_size - len(completed_hypotheses)
            continuing_hyp_scores = (hyp_scores.unsqueeze(1).expand_as(log_p_t) + log_p_t).view(-1)
            top_cand_hyp_scores, top_cand_hyp_pos = torch.topk(continuing_hyp_scores, k=live_hyp_num)

            prev_hyp_ids = torch.div(top_cand_hyp_pos, len(self.tgt_vocab), rounding_mode='floor')
            hyp_word_ids = top_cand_hyp_pos % len(self.tgt_vocab)

            new_hypotheses = []
            live_hyp_ids = []
            new_hyp_scores = []

            for prev_hyp_id, hyp_word_id, cand_new_hyp_score in zip(prev_hyp_ids, hyp_word_ids, top_cand_hyp_scores):
                prev_hyp_id = prev_hyp_id.item()
                hyp_word_id = hyp_word_id.item()
                cand_new_hyp_score = cand_new_hyp_score.item()

                hyp_word = self.tgt_vocab.id2word[hyp_word_id]
                new_hyp_sent = hypotheses[prev_hyp_id] + [hyp_word]
                if hyp_word == '</s>':
                    completed_hypotheses.append(Hypothesis(value=new_hyp_sent[1:-1],
                                                           score=cand_new_hyp_score))
                else:
                    new_hypotheses.append(new_hyp_sent)
                    live_hyp_ids.append(prev_hyp_id)
                    new_hyp_scores.append(cand_new_hyp_score)

            if len(completed_hypotheses) == beam_size:
                break

            live_hyp_ids = torch.tensor(live_hyp_ids, dtype=torch.long, device=self.device)

            h_tm1 = h_t[live_hyp_ids], c_t[live_hyp_ids]
            att_tm1 = att_t[live_hyp_ids]

            hypotheses = new_hypotheses
            hyp_scores = torch.tensor(new_hyp_scores, dtype=torch.float, device=self.device)

        if len(completed_hypotheses) == 0:
            completed_hypotheses.append(Hypothesis(value=hypotheses[0][1:],
                                                   score=hyp_scores[0].item()))

        completed_hypotheses.sort(key=lambda hyp: hyp.score, reverse=True)

        return completed_hypotheses


    def greedy(self, src_sent: List[str], max_decoding_time_step: int=70) -> List[Hypothesis]:
        return self.beam_search(src_sent, beam_size=1, max_decoding_time_step=max_decoding_time_step)


    @staticmethod
    def load(model_path: str):
        """ Load the model from a file.
        :param model_path: path to model
        :type model_path: str
        """
        params = torch.load(model_path, map_location=lambda storage, loc: storage)
        args = params['args']
        model = NMT(
            src_vocab=params['vocab']['source'],
            tgt_vocab=params['vocab']['target'],
            **args
        )
        model.load_state_dict(params['state_dict'])

        return model

    def save(self, path: str):
        """ Save the model to a file.
        :param path: path to the model
        :type path: str
        """
        print('save model parameters to [%s]' % path, file=sys.stderr)

        params = {
            'args': dict(embed_size=self.embed_size, hidden_size=self.hidden_size, dropout_rate=self.dropout_rate),
            'vocab': dict(source=self.src_vocab, target=self.tgt_vocab),
            'state_dict': self.state_dict()
        }

        torch.save(params, path)

In [None]:
def batch_iter(data, batch_size, shuffle=False):
    """ Yield batches of source and target sentences reverse sorted by length (largest to smallest).
    :param data: list of tuples containing source and target sentence. ie.
        (list of (src_sent, tgt_sent))
    :type data: List[Tuple[List[str], List[str]]]
    :param batch_size: batch size
    :type batch_size: int
    :param shuffle: whether to randomly shuffle the dataset
    :type shuffle: boolean
    """
    batch_num = math.ceil(len(data) / batch_size)
    index_array = list(range(len(data)))

    if shuffle:
        np.random.shuffle(index_array)

    for i in range(batch_num):
        indices = index_array[i * batch_size: (i + 1) * batch_size]
        examples = [data[idx] for idx in indices]

        examples = sorted(examples, key=lambda e: len(e[0]), reverse=True)
        src_sents = [e[0] for e in examples]
        tgt_sents = [e[1] for e in examples]

        yield src_sents, tgt_sents

In [None]:
def evaluate_ppl(model, val_data, batch_size=32):
    """ Evaluate perplexity on dev sentences
    :param model: NMT Model
    :type model: NMT
    :param val_data: list of tuples containing source and target sentence.
        i.e. (list of (src_sent, tgt_sent))
    :type val_data: List[Tuple[List[str], List[str]]]
    :param batch_size: size of batches to extract
    :type batch_size: int
    :returns ppl: perplexity on val sentences
    """
    was_training = model.training
    model.eval()

    cum_loss = 0.
    cum_tgt_words = 0.

    # no_grad() disables gradient tracking, which saves memory and time during evaluation
    with torch.no_grad():
        for src_sents, tgt_sents in batch_iter(val_data, batch_size):
            loss = -model(src_sents, tgt_sents).sum()

            cum_loss += loss.item()
            tgt_word_num_to_predict = sum(len(s[1:]) for s in tgt_sents)  # omitting leading `<s>`
            cum_tgt_words += tgt_word_num_to_predict

        ppl = np.exp(cum_loss / cum_tgt_words)

    if was_training:
        model.train()

    return ppl


def compute_corpus_level_bleu_score(references: List[List[str]], hypotheses: List[Hypothesis]) -> float:
    """ Given decoding results and reference sentences, compute corpus-level BLEU score.
    :param references: a list of gold-standard reference target sentences
    :type references: List[List[str]]
    :param hypotheses: a list of hypotheses, one for each reference
    :type hypotheses: List[Hypothesis]
    :returns bleu_score: corpus-level BLEU score
    """
    if references[0][0] == '<s>':
        references = [ref[1:-1] for ref in references]
    bleu_score = corpus_bleu([[ref] for ref in references],
                             [hyp.value for hyp in hypotheses])
    return bleu_score


def evaluate_bleu(references, model, source):
    """Generate decoding results and compute BLEU score.
    :param model: NMT Model
    :type model: NMT
    :param references: a list of gold-standard reference target sentences
    :type references: List[List[str]]
    :param source: a list of source sentences
    :type source: List[List[str]]
    :returns bleu_score: corpus-level BLEU score
    """
    was_training = model.training
    model.eval()  # disable dropout while decoding

    top_hypotheses = []
    with torch.no_grad():
        for s in tqdm(source, leave=False):
            hyps = model.beam_search(s, beam_size=16, max_decoding_time_step=(len(s)+10))
            top_hypotheses.append(hyps[0])

    if was_training:
        model.train()

    return compute_corpus_level_bleu_score(references, top_hypotheses)

In [None]:
def train_and_evaluate(model, train_data, val_data, optimizer, epochs=10, train_batch_size=32, clip_grad=2, log_every = 100, valid_niter = 500, model_save_path="NMT_model.ckpt"):
    num_trail = 0
    cum_examples = report_examples = epoch = valid_num = 0
    hist_valid_scores = []
    train_iter = patience = cum_loss = report_loss = cum_tgt_words = report_tgt_words = 0

    print('Begin Maximum Likelihood training')
    train_time = begin_time = time.time()

    val_data_tgt = [tgt for _, tgt in val_data]
    val_data_src = [src for src, _ in val_data]

    for epoch in tqdm(range(epochs)):
        for src_sents, tgt_sents in batch_iter(train_data, batch_size=train_batch_size, shuffle=True):
            train_iter += 1
            
            optimizer.zero_grad()
            
            batch_size = len(src_sents)
            
            example_losses = -model(src_sents, tgt_sents)
            batch_loss = example_losses.sum()
            loss = batch_loss / batch_size
            loss.backward()
            
            # clip gradient
            grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)
            
            optimizer.step()
            
            batch_losses_val = batch_loss.item()
            report_loss += batch_losses_val
            cum_loss += batch_losses_val
            
            tgt_words_num_to_predict = sum(len(s[1:]) for s in tgt_sents)  # omitting leading `<s>`
            report_tgt_words += tgt_words_num_to_predict
            cum_tgt_words += tgt_words_num_to_predict
            report_examples += batch_size
            cum_examples += batch_size

            if train_iter % log_every == 0:
                print('epoch %d, iter %d, avg. loss %.2f, avg. ppl %.2f ' \
                        'cum. examples %d, speed %.2f words/sec, time elapsed %.2f sec' % (epoch, train_iter,
                                                                                            report_loss / report_examples,
                                                                                            math.exp(report_loss / report_tgt_words),
                                                                                            cum_examples,
                                                                                            report_tgt_words / (time.time() - train_time),
                                                                                            time.time() - begin_time))
                train_time = time.time()
                report_loss = report_tgt_words = report_examples = 0.

                

            # perform validation
            if train_iter % valid_niter == 0:
                print('epoch %d, iter %d, cum. loss %.2f, cum. ppl %.2f cum. examples %d' % (epoch, train_iter,
                                                                                            cum_loss / cum_examples,
                                                                                            np.exp(cum_loss / cum_tgt_words),
                                                                                            cum_examples))
                
                cum_loss = cum_examples = cum_tgt_words = 0.
                valid_num += 1

                print('begin validation ...')

                # compute dev. ppl and bleu
                dev_ppl = evaluate_ppl(model, val_data, batch_size=128)   # dev batch size can be a bit larger
                valid_metric = -dev_ppl
                
                bleu_score = evaluate_bleu(val_data_tgt, model, val_data_src)*100

                print('validation: iter %d, dev. ppl %f, bleu_score %f' % (train_iter, dev_ppl, bleu_score))

                is_better = len(hist_valid_scores) == 0 or valid_metric > max(hist_valid_scores)
                # keep the history in the same units as valid_metric (negative dev. ppl)
                hist_valid_scores.append(valid_metric)

                if is_better:
                    print('save currently the best model to [%s]' % model_save_path)
                    model.save(model_save_path)

                    # also save the optimizers' state
                    torch.save(optimizer.state_dict(), model_save_path + '.optim')


In [None]:
embed_size = 128
hidden_size = 512
src_vocab = src
tgt_vocab = tgt

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

In [None]:
epochs = 30

# epochs = 10
train_batch_size = 16
clip_grad = 2
log_every = 100
valid_niter = 2000
model_save_path="NMT_model.ckpt"

In [None]:
model = NMT(
    embed_size,
    hidden_size,
    src_vocab,
    tgt_vocab,
    device=device,
    pretrained_source=None,
    pretrained_target=None,
)
model.to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

In [None]:
import sys

In [None]:

train_and_evaluate(
    model,
    train_data,
    val_data,
    optimizer,
    epochs,
    train_batch_size,
    clip_grad,
    log_every,
    valid_niter,
    model_save_path
)

Begin Maximum Likelihood training


  0%|          | 0/30 [00:00<?, ?it/s]

epoch 0, iter 100, avg. loss 89.47, avg. ppl 511.12 cum. examples 1600, speed 5061.64 words/sec, time elapsed 4.53 sec
epoch 0, iter 200, avg. loss 80.78, avg. ppl 285.12 cum. examples 3200, speed 5435.52 words/sec, time elapsed 8.74 sec
epoch 0, iter 300, avg. loss 75.71, avg. ppl 221.68 cum. examples 4800, speed 4719.71 words/sec, time elapsed 13.49 sec
epoch 0, iter 400, avg. loss 71.31, avg. ppl 178.76 cum. examples 6400, speed 5276.62 words/sec, time elapsed 17.66 sec
epoch 0, iter 500, avg. loss 70.28, avg. ppl 164.15 cum. examples 8000, speed 5051.50 words/sec, time elapsed 22.03 sec
epoch 0, iter 600, avg. loss 70.45, avg. ppl 144.35 cum. examples 9600, speed 5470.58 words/sec, time elapsed 26.17 sec
epoch 0, iter 700, avg. loss 71.27, avg. ppl 139.91 cum. examples 11200, speed 4809.56 words/sec, time elapsed 30.97 sec
epoch 0, iter 800, avg. loss 64.85, avg. ppl 119.21 cum. examples 12800, speed 4960.40 words/sec, time elapsed 35.34 sec
epoch 0, iter 900, avg. loss 62.98, avg.

  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 2000, dev. ppl 78.598888, bleu_score 2.090935
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 0, iter 2100, avg. loss 61.01, avg. ppl 77.83 cum. examples 1600, speed 337.63 words/sec, time elapsed 154.07 sec
epoch 0, iter 2200, avg. loss 60.86, avg. ppl 72.47 cum. examples 3200, speed 4957.04 words/sec, time elapsed 158.66 sec
epoch 0, iter 2300, avg. loss 60.48, avg. ppl 73.97 cum. examples 4800, speed 4771.24 words/sec, time elapsed 163.37 sec
epoch 1, iter 2400, avg. loss 58.56, avg. ppl 67.54 cum. examples 6387, speed 5784.34 words/sec, time elapsed 167.19 sec
epoch 1, iter 2500, avg. loss 55.79, avg. ppl 56.62 cum. examples 7987, speed 4700.58 words/sec, time elapsed 171.89 sec
epoch 1, iter 2600, avg. loss 59.23, avg. ppl 59.19 cum. examples 9587, speed 5480.93 words/sec, time elapsed 176.13 sec
epoch 1, iter 2700, avg. loss 56.81, avg. ppl 57.61 cum. examples 11187, speed 4245.62 words/sec, time elapsed 181.41 sec
epoch 1, iter 2800, avg. loss 52.87, avg. ppl 52.42 cum. examples 12787, speed 5617.16 words/sec, time elapsed 185.21 sec
epoch 1, iter 2900, avg. loss 5

  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 4000, dev. ppl 53.742877, bleu_score 3.253782
epoch 1, iter 4100, avg. loss 54.71, avg. ppl 49.81 cum. examples 1600, speed 342.95 words/sec, time elapsed 302.88 sec
epoch 1, iter 4200, avg. loss 53.42, avg. ppl 46.09 cum. examples 3200, speed 5302.14 words/sec, time elapsed 307.08 sec
epoch 1, iter 4300, avg. loss 53.19, avg. ppl 45.95 cum. examples 4800, speed 4524.27 words/sec, time elapsed 312.00 sec
epoch 1, iter 4400, avg. loss 54.31, avg. ppl 46.00 cum. examples 6400, speed 5555.04 words/sec, time elapsed 316.08 sec
epoch 1, iter 4500, avg. loss 50.94, avg. ppl 43.42 cum. examples 8000, speed 4850.83 words/sec, time elapsed 320.54 sec
epoch 1, iter 4600, avg. loss 53.79, avg. ppl 45.86 cum. examples 9600, speed 5409.40 words/sec, time elapsed 324.70 sec


In [None]:
# from google.colab import drive
# drive.mount('/content/gdrive')
# with open('/content/gdrive/My Drive/sophomore/nlp 4740/project_3/NMT_statedict.ckpt', 'w') as f:
#   f.write('NMT_statedict.ckpt')

In [None]:
# torch.save(model.state_dict(), 'NMT_statedict.ckpt')

# Part 2: Analysis

## Part 2.1: Within-model comparison: ablation study
We train 4 variants of the RNN model:

1. Baseline model
2. Baseline model made more complex by modification $A$ (e.g. changing the hidden dimensionality from $h$ to $2h$).
3. Baseline model made more complex by modification $B$ (where $B$ is an entirely distinct/different update from $A$).
4. Baseline model with both modifications $A$ and $B$ applied.

Under the framing of an ablation study, we would describe this as beginning with model 4 and then ablating (i.e. removing) each of the two modifications in turn, and then removing both, to see whether they were genuinely necessary for the performance we observed.
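
The four variants form a 2×2 grid over the two modifications. The sketch below is illustrative only: `ablation_grid` and the config keys are hypothetical names, and it assumes that modification $A$ doubles the hidden size and modification $B$ doubles the embedding size, which is how the four configurations in this notebook are set up.

```python
from itertools import product


def ablation_grid(embed_size, hidden_size):
    """Return one config per combination of modifications A and B (hypothetical helper)."""
    configs = {}
    for use_a, use_b in product([False, True], repeat=2):
        name = "baseline" + ("+A" if use_a else "") + ("+B" if use_b else "")
        configs[name] = {
            # modification B (assumed): double the embedding size
            "embed_size": embed_size * (2 if use_b else 1),
            # modification A (assumed): double the hidden size
            "hidden_size": hidden_size * (2 if use_a else 1),
        }
    return configs


grid = ablation_grid(embed_size=128, hidden_size=512)
for name, cfg in grid.items():
    print(name, cfg)
```

Generating the configurations from one function makes it harder to accidentally change a third setting between runs, which would muddy the comparison.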

In [None]:
import re
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

### 2.1.1 Configuration 1

In [None]:
# baseline_nmt = NMT()
model = NMT(
    embed_size,
    hidden_size,
    src_vocab,
    tgt_vocab,
    device=device,
    pretrained_source=None,
    pretrained_target=None,
)
model.to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

train_and_evaluate(
    model,
    train_data,
    val_data,
    optimizer,
    epochs,
    train_batch_size,
    clip_grad,
    log_every,
    valid_niter,
    model_save_path
)

### 2.1.2 Configuration 2


In [None]:
# mod_a_nmt = NMT()
# modification A
model = NMT(
    embed_size,
    2*hidden_size,
    src_vocab,
    tgt_vocab,
    device=device,
    pretrained_source=None,
    pretrained_target=None,
)
model.to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model_save_path = "NMT_modelA.ckpt"
train_and_evaluate(
    model,
    train_data,
    val_data,
    optimizer,
    epochs,
    train_batch_size,
    clip_grad,
    log_every,
    valid_niter,
    model_save_path
)

### 2.1.3 Configuration 3


In [None]:
# mod_b_nmt = NMT()
# modification B
model = NMT(
    2*embed_size,
    hidden_size,
    src_vocab,
    tgt_vocab,
    device=device,
    pretrained_source=None,
    pretrained_target=None,
)
model.to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model_save_path = "NMT_modelB.ckpt"
train_and_evaluate(
    model,
    train_data,
    val_data,
    optimizer,
    epochs,
    train_batch_size,
    clip_grad,
    log_every,
    valid_niter,
    model_save_path
)

### 2.1.4 Configuration 4


In [None]:
# both_mod_nmt = NMT()
# modification A&B
model = NMT(
    2*embed_size,
    2*hidden_size,
    src_vocab,
    tgt_vocab,
    device=device,
    pretrained_source=None,
    pretrained_target=None,
)
model.to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model_save_path = "NMT_modelAB.ckpt"
train_and_evaluate(
    model,
    train_data,
    val_data,
    optimizer,
    epochs,
    train_batch_size,
    clip_grad,
    log_every,
    valid_niter,
    model_save_path
)

### 2.1.5 Report

In [None]:
model = NMT.load("drive/My Drive/sophomore/nlp 4740/p3/NMT_model.ckpt")
modelA = NMT.load("drive/My Drive/sophomore/nlp 4740/p3/NMT_modelA.ckpt")
modelB = NMT.load("drive/My Drive/sophomore/nlp 4740/p3/NMT_modelB.ckpt")
modelAB = NMT.load("drive/My Drive/sophomore/nlp 4740/p3/NMT_modelAB.ckpt")

In [None]:
def evaluate_without_saving(model, train_data, val_data, epochs=10, train_batch_size=32, clip_grad=2, log_every = 100, valid_niter = 500):
  # Only `model` and `val_data` are used; the other arguments are kept so the
  # existing call sites keep working.
  print('begin validation ...')

  # split the validation pairs into sources and references for BLEU
  val_data_src = [src for src, _ in val_data]
  val_data_tgt = [tgt for _, tgt in val_data]

  # compute dev. ppl and bleu
  dev_ppl = evaluate_ppl(model, val_data, batch_size=128)   # dev batch size can be a bit larger
  bleu_score = evaluate_bleu(val_data_tgt, model, val_data_src)*100

  print("ppl", dev_ppl)
  print("bleu", bleu_score)
  print('validation: dev. ppl %f, bleu_score %f' % (dev_ppl, bleu_score))


In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model.to(device)
evaluate_without_saving(
    model,
    train_data,
    val_data,
    epochs,
    train_batch_size,
    clip_grad,
    log_every
)

begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

ppl 78.56417308463188
bleu 2.0913877188792505
validation: dev. ppl 78.564173, bleu_score 2.091388


In [None]:
modelA.to(device)
evaluate_without_saving(
    modelA,
    train_data,
    val_data,
    epochs,
    train_batch_size,
    clip_grad,
    log_every
)

begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

ppl 75.04400906886652
bleu 2.3700046036300892
validation: dev. ppl 75.044009, bleu_score 2.370005


In [None]:
modelB.to(device)
evaluate_without_saving(
    modelB,
    train_data,
    val_data,
    epochs,
    train_batch_size,
    clip_grad,
    log_every
)

begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

ppl 70.19409542425542
bleu 2.56449775292072
validation: dev. ppl 70.194095, bleu_score 2.564498


In [None]:
modelAB.to(device)
evaluate_without_saving(
    modelAB,
    train_data,
    val_data,
    epochs,
    train_batch_size,
    clip_grad,
    log_every
)

begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

ppl 74.08051914103658
bleu 1.9298960568563488
validation: dev. ppl 74.080519, bleu_score 1.929896


### Nuanced Study: Sentence lengths

In [None]:
def evaluate_long_sentences(model, val_data, val_data_src, val_data_tgt, train_batch_size=32):

  # print('begin validation ...')

  # compute dev. ppl and bleu
  dev_ppl = evaluate_ppl(model, val_data, batch_size=2)   # small dev batch size for the length subsets
  valid_metric = -dev_ppl
  
  bleu_score = evaluate_bleu(val_data_tgt, model, val_data_src)*100

  # print("ppl", dev_ppl)
  # print("bleu", bleu_score)
  print('validation: dev. ppl %f, bleu_score %f' % (dev_ppl, bleu_score))


In [None]:
# keep only the pairs whose source sentence is longer than 45 tokens
long_sentences = [(source, target) for source, target in val_data if len(source) > 45]

In [None]:
len(long_sentences)

26

In [None]:
# apply the same filter to sources and targets so the two lists stay aligned
long_source = [source for source, _ in val_data if len(source) > 45]
long_target = [target for source, target in val_data if len(source) > 45]

In [None]:
evaluate_long_sentences(model, long_sentences, long_source, long_target, train_batch_size,)
evaluate_long_sentences(modelA, long_sentences, long_source, long_target, train_batch_size,)
evaluate_long_sentences(modelB, long_sentences, long_source, long_target, train_batch_size,)
evaluate_long_sentences(modelAB, long_sentences, long_source, long_target, train_batch_size,)

begin validation ...


  0%|          | 0/26 [00:00<?, ?it/s]

Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


ppl 184.10522420037125
bleu 7.143017268331987
validation: dev. ppl 184.105224, bleu_score 7.143017
begin validation ...


  0%|          | 0/26 [00:00<?, ?it/s]

ppl 166.80338612522544
bleu 1.6730970086739416
validation: dev. ppl 166.803386, bleu_score 1.673097
begin validation ...


  0%|          | 0/26 [00:00<?, ?it/s]

ppl 158.61309483281323
bleu 1.5272698739238728
validation: dev. ppl 158.613095, bleu_score 1.527270
begin validation ...


  0%|          | 0/26 [00:00<?, ?it/s]

ppl 174.9625787267432
bleu 1.6646726434427883
validation: dev. ppl 174.962579, bleu_score 1.664673


In [None]:
threshold = 3
# keep the pairs whose source has at most `threshold` tokens; filtering the
# pairs first keeps sources and targets aligned
short_sentences = [(source, target) for source, target in val_data if len(source) <= threshold]
short_source = [source for source, _ in short_sentences]
short_target = [target for _, target in short_sentences]
print(len(short_sentences))

118


In [None]:
evaluate_long_sentences(model, short_sentences, short_source, short_target, train_batch_size,)
evaluate_long_sentences(modelA, short_sentences, short_source, short_target, train_batch_size,)
evaluate_long_sentences(modelB, short_sentences, short_source, short_target, train_batch_size,)
evaluate_long_sentences(modelAB, short_sentences, short_source, short_target, train_batch_size,)

  0%|          | 0/118 [00:00<?, ?it/s]

Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


validation: dev. ppl 15.761867, bleu_score 9.929016


  0%|          | 0/118 [00:00<?, ?it/s]

validation: dev. ppl 15.728714, bleu_score 9.954533


  0%|          | 0/118 [00:00<?, ?it/s]

validation: dev. ppl 14.117991, bleu_score 12.572569


  0%|          | 0/118 [00:00<?, ?it/s]

validation: dev. ppl 18.048835, bleu_score 29.297609


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().



**The 4 models:**

Baseline: NMT with an LSTM encoder and decoder, embed size of 256, hidden size of 1024.

Ablation model A: Baseline model with embed size reduced to 128.

Ablation model B: Baseline model with hidden size reduced to 512.

Ablation model A+B: Baseline model with embed size reduced to 128 and hidden size reduced to 512.

**Reasoning for the modifications:**

In Q 1.6, I mentioned that embed size and hidden size both affect how well the model can represent its input and retain that representation. I concluded that large and small sizes each have advantages, depending on the amount of training data and the complexity of the input. However, without empirical results, I could not say which size was best suited to this task.

Running this ablation study helps us figure out whether our baseline model is needlessly complex (e.g., if cutting the embed or hidden size in half leads to similar performance, then we didn’t need that large a size in the first place). If time permitted, I would keep ablating the ablated models until we arrived at a reasonable performance-complexity tradeoff.
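
To make "complexity" concrete, a rough parameter count helps. The sketch below counts the weights of a single unidirectional LSTM layer and of an embedding table, using the parameter layout of PyTorch's `nn.LSTM` (input-to-hidden and hidden-to-hidden matrices plus two bias vectors). It is an estimate under stated assumptions, not the exact size of the NMT model here, which also contains attention and projection layers; `lstm_param_count`, `embedding_param_count`, and the vocabulary size of 10,000 are hypothetical.

```python
def lstm_param_count(input_size, hidden_size):
    """Parameters in one unidirectional LSTM layer (W_ih, W_hh, b_ih, b_hh)."""
    return 4 * hidden_size * (input_size + hidden_size + 2)


def embedding_param_count(vocab_size, embed_size):
    """Parameters in an embedding table."""
    return vocab_size * embed_size


VOCAB_SIZE = 10_000  # hypothetical; the real vocabulary size is not assumed here

# (embed size, hidden size) for baseline, ablation A, ablation B, ablation A+B
for embed, hidden in [(256, 1024), (128, 1024), (256, 512), (128, 512)]:
    lstm = lstm_param_count(embed, hidden)
    emb = embedding_param_count(VOCAB_SIZE, embed)
    print(f"embed={embed:4d} hidden={hidden:5d} lstm={lstm:>10,} embedding={emb:>10,}")
```

Under these assumptions, halving the hidden size removes far more recurrent parameters than halving the embed size, because the hidden-to-hidden matrix grows quadratically with the hidden size.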

**Report of Quantitative Scores (2 dp):**

| Model | Perplexity | BLEU |
| --- | --- | --- |
| Baseline | 74.08 | 1.93 |
| Ablation A | 75.04 | 2.37 |
| Ablation B | 70.19 | 2.56 |
| Ablation A+B | 78.56 | 2.09 |

**Nuanced Quantitative Analysis:**

Before we dive in, let us first look at the scores to form an intuition. Ablation B lowers perplexity relative to the baseline (70.19 vs. 74.08) and raises BLEU (2.56 vs. 1.93). Ablation A also raises BLEU (2.37) while leaving perplexity roughly unchanged (75.04). Applying both ablations gives the highest perplexity (78.56), though its BLEU (2.09) still edges out the baseline. Since neither single ablation hurts BLEU, the original model appears to be too complex for the amount of data I had and the number of training iterations I ran.

Thus we arrive at our hypothesis: the original model is too complex and either overfits or has not yet converged.
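
Since perplexity is compared throughout this analysis, it helps to pin down what is being reported. Following `evaluate_ppl`, the score is the exponential of the average negative log-likelihood per target token. The notation below is introduced here for illustration: $M$ is the number of validation pairs, $x^{(i)}$ and $y^{(i)}$ are the $i$-th source and target sentences, and $T_i$ is the number of target tokens to predict (the leading `<s>` is not counted).

$$
\mathrm{ppl} = \exp\left(-\frac{1}{N}\sum_{i=1}^{M}\sum_{t=1}^{T_i}\log p\left(y^{(i)}_t \mid y^{(i)}_{<t},\, x^{(i)}\right)\right), \qquad N = \sum_{i=1}^{M} T_i
$$

Lower is better: a perplexity of $k$ roughly means the model is as uncertain as if it were choosing uniformly among $k$ words at each step.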

Both failure modes are more evident in longer sentences, where errors compound. If the hypothesis were true, the baseline should therefore do the worst on long sentences, followed by ablations A and B, with ablation A+B doing the best (the simplest model should generalize better when there is not enough data to converge).

On the other hand, for short sentences the complexity should help more than the compounding errors hurt. Therefore we should expect to see the baseline do the best and ablations A and B follow, and A+B the worst. 
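
These two predictions are tested on length-based subsets of the validation set. A minimal sketch of such a split follows, assuming the validation data is a list of `(source_tokens, target_tokens)` pairs; `split_by_source_length` and the toy sentences are hypothetical. Filtering the pairs first and only then separating sources from targets keeps each source aligned with its reference.

```python
def split_by_source_length(pairs, min_len=None, max_len=None):
    """Keep (source, target) pairs whose source length lies in [min_len, max_len]."""
    kept = [
        (src, tgt)
        for src, tgt in pairs
        if (min_len is None or len(src) >= min_len)
        and (max_len is None or len(src) <= max_len)
    ]
    sources = [src for src, _ in kept]
    targets = [tgt for _, tgt in kept]
    return kept, sources, targets


# Toy data standing in for the validation set (hypothetical sentences)
toy_pairs = [
    (["hello", "."], ["<s>", "good", "morrow", ".", "</s>"]),
    (["where", "are", "you", "going", "?"], ["<s>", "whither", "goest", "thou", "?", "</s>"]),
    (["yes"], ["<s>", "ay", "</s>"]),
]

short_pairs, short_src, short_tgt = split_by_source_length(toy_pairs, max_len=3)
long_pairs, long_src, long_tgt = split_by_source_length(toy_pairs, min_len=4)
print(len(short_pairs), len(long_pairs))
```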

**Running on all sentences longer than 45 tokens we get:**

| Model | Perplexity | BLEU |
| --- | --- | --- |
| Baseline | 174.96 | 1.66 |
| Ablation A | 166.80 | 1.67 |
| Ablation B | 158.61 | 1.53 |
| Ablation A+B | 184.11 | 7.14 |

Indeed, the best is A+B, with a BLEU score over 4 times the others at a broadly similar perplexity! Ablations A and B both have lower perplexity than the baseline, and A has similar BLEU.

**Running on all sentences of at most 3 tokens we get:**

| Model | Perplexity | BLEU |
| --- | --- | --- |
| Baseline | 18.05 | 29.30 |
| Ablation A | 15.73 | 9.95 |
| Ablation B | 14.12 | 12.57 |
| Ablation A+B | 15.76 | 9.93 |

Indeed, the baseline's BLEU score is roughly 2.3 to 3 times that of the ablations, with similar perplexity. Note that ablation B seems to retain BLEU score better than ablation A.

Since our results match our predictions, we can conclude that the evidence supports our hypothesis: although the complexity of our baseline model significantly increases our ability to translate very short sentences, it also leads to errors which compound over long sentences, causing its overall BLEU score to drop.
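
The relative gaps quoted above can be checked directly from the validation outputs. The sketch below hard-codes the BLEU values printed by the evaluation cells for the two length subsets and computes the ratio of the best score to each of the others. The dictionary layout and the `bleu_ratios` helper are hypothetical; the labels follow the model definitions in this report (the baseline is the largest model, A+B the smallest).

```python
def bleu_ratios(scores):
    """Return the best-scoring model and the ratio of its BLEU to every other model's."""
    best_name = max(scores, key=scores.get)
    ratios = {
        name: scores[best_name] / value
        for name, value in scores.items()
        if name != best_name
    }
    return best_name, ratios


# BLEU values as printed by the evaluation cells
long_bleu = {"Baseline": 1.664673, "A": 1.673097, "B": 1.527270, "A+B": 7.143017}
short_bleu = {"Baseline": 29.297609, "A": 9.954533, "B": 12.572569, "A+B": 9.929016}

for label, scores in [("long", long_bleu), ("short", short_bleu)]:
    best, ratios = bleu_ratios(scores)
    print(label, "best:", best, {name: round(r, 2) for name, r in ratios.items()})
```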

If one were to choose 1 ablation, ablation B is the most effective ablation (doing similarly to A on long sentences, better than A and A+B on short sentences, and slightly better than A and A+B on average). I reckon this is because the hidden size acts as a bottleneck for the embeddings between the encoding and decoding parts, thus directly reducing this bottleneck’s size is more effective. 


In [None]:
# med_sentences = val_data.copy()
# for source, target in val_data:
#   if len(source) < 12 or len(source) > 15: 
#     med_sentences.remove((source, target))
# med_target = val_data_tgt.copy()
# med_source = val_data_src.copy()
# for source, target in val_data:
#   if len(source) < 12 or len(source) > 15: 
#     med_source.remove(source)
#     med_target.remove(target)
# print(len(med_sentences))

In [None]:
# evaluate_long_sentences(model, med_sentences, med_source, med_target, train_batch_size,)
# evaluate_long_sentences(modelA, med_sentences, med_source, med_target, train_batch_size,)
# evaluate_long_sentences(modelB, med_sentences, med_source, med_target, train_batch_size,)
# evaluate_long_sentences(modelAB, med_sentences, med_source, med_target, train_batch_size,)

# Live running demo

In [None]:
#@title Translation
#@markdown Enter a sentence to see the translation
input_string = "why should i play the roman fool and die on my own sword?" #@param {type:"string"}
model_type = "baseline_nmt" #@param ["baseline_nmt", "mod_a_nmt", "mod_b_nmt", "both_mods_nmt"]
from IPython.display import HTML

import re
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

output = ""

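# Map the dropdown choice to the models loaded in Section 2.1.5
# (an explicit lookup instead of reaching into globals()).
available_models = {
    "baseline_nmt": model,
    "mod_a_nmt": modelA,
    "mod_b_nmt": modelB,
    "both_mods_nmt": modelAB,
}
model_used = available_models[model_type]
model_used.eval()  # disable dropout while decoding

# beam_search expects a list of tokens, not a raw string. Whitespace splitting
# is an assumption; use the same tokenizer as the training data if it differs.
input_tokens = input_string.split()

with torch.no_grad():
    # RUN MODEL
    translation = untokenize(model_used.beam_search(input_tokens, beam_size=64, max_decoding_time_step=len(input_tokens)+10)[0].value)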

# Generate nice display
output += '<p style="font-family:verdana; font-size:110%;">'
output += " Input sequence: "+input_string+"</p>"
output += '<p style="font-family:verdana; font-size:110%;">'
output += f" Translation to Shakespeare: {translation}</p><hr>"
output = "<h3>Results:</h3>" + output

display(HTML(output))