# Neural Machine Translation by Jointly Learning to Align and Translate

In this third notebook on sequence-to-sequence models using PyTorch and TorchText, we'll be implementing the model from [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473). This model achieves our best perplexity yet: ~27, compared to ~34 for the previous model.

## Introduction

Here is the general encoder-decoder model:

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq1.png?raw=1)

In the previous model, our architecture was set up to reduce "information compression" by explicitly passing the context vector, $z$, to the decoder at every time-step and by passing both the context vector and embedded input word, $d(y_t)$, along with the hidden state, $s_t$, to the linear layer, $f$, to make a prediction.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq7.png?raw=1)

Even though we have reduced some of this compression, our context vector still needs to contain all of the information about the source sentence. The model implemented in this notebook avoids this compression by allowing the decoder to look at the entire source sentence (via its hidden states) at each decoding step! How does it do this? It uses *attention*. 

Attention works by first calculating an attention vector, $a$, whose length equals that of the source sentence. Each element of the attention vector is between 0 and 1, and the entire vector sums to 1. We then calculate a weighted sum of our source sentence hidden states, $H$, to get a weighted source vector, $w$.

$$w = \sum_{i}a_ih_i$$

We calculate a new weighted source vector every time-step when decoding, using it as input to our decoder RNN as well as the linear layer to make a prediction. We'll explain how to do all of this during the session.
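
As a minimal sketch of the weighted sum above, written in plain Python rather than PyTorch so the arithmetic stays visible: the attention weights `a` and the hidden states `H` below are invented values, not outputs of the model, and `weighted_source_vector` is a hypothetical helper name.

```python
# Toy illustration of w = sum_i a_i * h_i (all values are invented for the example).

# Three source positions, each with a 2-dimensional hidden state.
H = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
]

# Attention weights: one per source position, each in [0, 1], summing to 1.
a = [0.5, 0.25, 0.25]


def weighted_source_vector(a, H):
    """Return the weighted sum of the rows of H using the weights in a."""
    dim = len(H[0])
    return [sum(a_i * h_i[d] for a_i, h_i in zip(a, H)) for d in range(dim)]


w = weighted_source_vector(a, H)
print(w)
```

The result has the same width as a single hidden state, regardless of how many source positions there are.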


## Import Statements

In [9]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.datasets import Multi30k
from torchtext import data
from torchtext.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time
import json
from tqdm import tqdm

Set the random seeds for reproducibility.

In [3]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [4]:
%%bash
# download the spaCy English model (the cell magic must be the first line of the cell)
# python -m spacy download en


## Previewing the Dataset

Link to the dataset: https://rajpurkar.github.io/SQuAD-explorer/

In [5]:
with open("./datasets/train-v2.0.json") as f:
    squad_data = json.load(f)

In [6]:
squad_data["data"][0]["paragraphs"][0]["qas"]

[{'question': 'When did Beyonce start becoming popular?',
  'id': '56be85543aeaaa14008c9063',
  'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
  'is_impossible': False},
 {'question': 'What areas did Beyonce compete in when she was growing up?',
  'id': '56be85543aeaaa14008c9065',
  'answers': [{'text': 'singing and dancing', 'answer_start': 207}],
  'is_impossible': False},
 {'question': "When did Beyonce leave Destiny's Child and become a solo singer?",
  'id': '56be85543aeaaa14008c9066',
  'answers': [{'text': '2003', 'answer_start': 526}],
  'is_impossible': False},
 {'question': 'In what city and state did Beyonce  grow up? ',
  'id': '56bf6b0f3aeaaa14008c9601',
  'answers': [{'text': 'Houston, Texas', 'answer_start': 166}],
  'is_impossible': False},
 {'question': 'In which decade did Beyonce become famous?',
  'id': '56bf6b0f3aeaaa14008c9602',
  'answers': [{'text': 'late 1990s', 'answer_start': 276}],
  'is_impossible': False},
 {'question': 'In what R&B group

In [7]:
squad_data["data"][0]["paragraphs"][0]["context"]

'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'

## Preparing Data
We now convert the data above into (context + question, answer) pairs that can be fed to the model for training.
When combining the context and the question, we insert the string `<mos>` between them to mark where the context (paragraph) ends and the question begins.

In [8]:
quescontext_ans_pairs = []

for article in tqdm(squad_data["data"]):
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
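            # "<mos>" marks the boundary between the context and the question
            quescontext = context + "<mos> " + qa["question"]
            if not qa["is_impossible"]:
                # use the first reference answer
                answer = qa["answers"][0]["text"]
            else:
                # placeholder target for unanswerable questions
                answer = "is impossible"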
            quescontext_ans_pairs.append([quescontext, answer])

In [9]:
len(quescontext_ans_pairs)

130319

In [10]:
quescontext_ans_pairs[0]

['Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".<mos> When did Beyonce start becoming popular?',
 'in the late 1990s']
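
The pair above shows `<mos>` attached directly to the last character of the context. The sketch below uses a plain whitespace split as a stand-in for the spaCy tokenizer (an assumption made only to keep the example dependency-free) to show why padding the separator with spaces on both sides keeps it a standalone token. The strings are shortened, illustrative inputs.

```python
# Whitespace splitting is only a stand-in for a real tokenizer here.
context = "rose to fame in the late 1990s."
question = "When did Beyonce start becoming popular?"

# Separator glued to the end of the context, as in the pairs built above.
glued = context + "<mos> " + question
# Separator padded with a space on both sides.
spaced = context + " <mos> " + question

glued_tokens = glued.split()
spaced_tokens = spaced.split()

print("<mos>" in glued_tokens)   # the separator is fused with "1990s."
print("<mos>" in spaced_tokens)  # the separator is a token of its own
```

A real tokenizer may split the fused string differently, so it is worth inspecting a tokenized example before training.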

## Defining the Fields

`Question` and `Answer` are standard `Field` objects, for which we have decided to use the spaCy tokenizer and convert all the text to lowercase.

In [11]:
Question = Field(sequential = True,
                      tokenize = 'spacy',
#                       batch_first =True,
                      include_lengths=True,
                      init_token = '<sos>',
                      eos_token = '<eos>',
                      lower = True)

Answer = Field(sequential = True,
                    tokenize ='spacy',
#                     batch_first =True,
                    include_lengths=True,
                    init_token = '<sos>',
                    eos_token = '<eos>',
                    lower = True)



Having defined those fields, we now need to produce a list that maps them onto the rows in the list `quescontext_ans_pairs`.

In [12]:
# fields = [('questions', Question),('answers',Answer)]
fields = [('questions', Question),('answers',Question)]

In [13]:
example = [data.Example.fromlist([quescontext_ans_pairs[i][0],quescontext_ans_pairs[i][1]], fields) for i in tqdm(range(len(quescontext_ans_pairs)))] 



## Removing stopwords

In [14]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
# note: this rebinds the name `stopwords` from the NLTK corpus reader to a plain list of words
stopwords = stopwords.words("english")

def remove_stopwords(word_list):
    return [word for word in word_list if word not in stopwords]


In [15]:
stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [16]:
len(stopwords)

179

In [17]:
for idx, i in tqdm(enumerate(example)):
    example[idx].questions = remove_stopwords(i.questions)

In [18]:
len(example)

130319
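
One side effect worth checking: the stopword list printed above contains words such as 'when', 'what' and 'did', so question words are dropped together with filler words. The sketch below reproduces the filter on a hand-written token list with a small subset of that list; the subset, the tokens, and the two-argument signature are illustrative assumptions, not the full pipeline.

```python
# A small subset of the stopword list printed above (illustrative only).
stop_subset = {"when", "what", "did", "the", "in", "is"}


def remove_stopwords(word_list, stop_words):
    """Keep only the tokens that are not in stop_words, preserving their order."""
    return [word for word in word_list if word not in stop_words]


tokens = ["when", "did", "beyonce", "start", "becoming", "popular", "?"]
filtered = remove_stopwords(tokens, stop_subset)
print(filtered)
```

Storing the stopwords in a set rather than a list makes each membership test constant time, which may matter when filtering all 130,319 examples.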

## Creating dataset

In [19]:
# creating dataset
squadDataset = data.Dataset(example, fields)

Finally, we can split into training, testing, and validation sets by using the split() method:

In [20]:
# seed Python's RNG first, then pass its state so the split is reproducible
random.seed(SEED)
(train, valid, test) = squadDataset.split(split_ratio=[0.70, 0.15, 0.15], random_state=random.getstate())

In [21]:
(len(train), len(valid), len(test))

(91223, 19548, 19548)
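
The sizes above can be sanity-checked with a little arithmetic. The sketch below assumes the train and test sizes are obtained by rounding and that the validation set takes the remainder; this rounding rule is an assumption made for illustration, not a statement of how `split()` is implemented, and `split_sizes` is a hypothetical helper.

```python
def split_sizes(n_examples, train_ratio, test_ratio):
    """Sizes for (train, valid, test): round train and test, valid takes the rest."""
    n_train = int(round(train_ratio * n_examples))
    n_test = int(round(test_ratio * n_examples))
    n_valid = n_examples - n_train - n_test
    return n_train, n_valid, n_test


sizes = split_sizes(130319, 0.70, 0.15)
print(sizes)
print(sum(sizes))  # every example lands in exactly one split
```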

An example from the dataset:

In [22]:
vars(train.examples[10])

{'questions': ['another',
  'issue',
  'use',
  'hypopodium',
  'standing',
  'platform',
  'support',
  'feet',
  ',',
  'given',
  'hands',
  'may',
  'able',
  'support',
  'weight',
  '.',
  '17th',
  'century',
  'rasmus',
  'bartholin',
  'considered',
  'number',
  'analytical',
  'scenarios',
  'topic',
  '.',
  '20th',
  'century',
  ',',
  'forensic',
  'pathologist',
  'frederick',
  'zugibe',
  'performed',
  'number',
  'crucifixion',
  'experiments',
  'using',
  'ropes',
  'hang',
  'human',
  'subjects',
  'various',
  'angles',
  'hand',
  'positions',
  '.',
  'experiments',
  'support',
  'angled',
  'suspension',
  ',',
  'two',
  '-',
  'beamed',
  'cross',
  ',',
  'perhaps',
  'form',
  'foot',
  'support',
  ',',
  'given',
  'aufbinden',
  'form',
  'suspension',
  'straight',
  'stake',
  '(',
  'used',
  'nazis',
  'dachau',
  'concentration',
  'camp',
  'world',
  'war',
  'ii',
  ')',
  ',',
  'death',
  'comes',
  'rather',
  'quickly.<mos',
  '>',
  'sai

## Data Augmentation
Copied from Woncoh1's repository: https://github.com/woncoh1/END/blob/main/Assignment_Session_7_Sentiment_Analysis_using_LSTM_RNN.ipynb

In [23]:
# import libraries and prepare sentence text

import random
# for back translation
import google_trans_new
from google_trans_new import google_translator
# for word tokenization
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /home/tensor/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Back Translation

In [24]:
# translate a sentence to a random language,
# and translate back to original language

def back_translate(sentence, p=0.1):
    # do nothing with probability of 1-p
    if random.uniform(0,1) > p:
        return sentence
    
    # combine tokenized sentence into one string
    sentence = " ".join(sentence)
    
    # instantiate translator
    translator = google_translator()
    
    # choose a target language
    available_langs = list(google_trans_new.LANGUAGES.keys())
    trans_lang = random.choice(available_langs)
    
    # translate to the target language
    translations = translator.translate(sentence, lang_tgt=trans_lang)
    
    # translate back to original language
    translations_en_random = translator.translate(translations, lang_src=trans_lang, lang_tgt="en")
    
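    # the translator may return a list of candidates; keep only the first one
    # (checking the type avoids slicing the first character off a plain string)
    if isinstance(translations_en_random, list):
        translations_en_random = translations_en_random[0]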
        
    return word_tokenize(translations_en_random)

#### Random Deletion

In [1]:
# randomly delete words from a sentence with a given probability

def random_deletion(sentence, p=0.5): 
    # return if single word
    if len(sentence) == 1: 
        return sentence
    # delete words
    remaining = list(filter(lambda x: random.uniform(0,1) > p, sentence)) 
    # if nothing left, sample a random word
    if len(remaining) == 0: 
        return [random.choice(sentence)] 
    else:
        return remaining

#### Random Swap

In [26]:
# randomly swap a pair of words in a sentence for a given # of times

def random_swap(sentence, n=5): 
    if len(sentence) < 2:
        return sentence
    length = range(len(sentence)) 
    for _ in range(n):
        idx1, idx2 = random.sample(length, 2)
        sentence[idx1], sentence[idx2] = sentence[idx2], sentence[idx1] 
    return sentence
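
`random_swap` above swaps tokens in place, so the list passed in is modified. If the original token order needs to be kept (for instance, to compare a sentence before and after augmentation), a copy-based variant is one option. `random_swap_copy` and its `rng` parameter are hypothetical names introduced for this sketch.

```python
import random


def random_swap_copy(sentence, n=1, rng=random):
    """Return a new list with n random pairs of tokens swapped."""
    if len(sentence) < 2:
        return list(sentence)
    swapped = list(sentence)  # work on a copy so the caller's list is untouched
    for _ in range(n):
        idx1, idx2 = rng.sample(range(len(swapped)), 2)
        swapped[idx1], swapped[idx2] = swapped[idx2], swapped[idx1]
    return swapped


original = ["where", "was", "beyonce", "born", "?"]
augmented = random_swap_copy(original, n=1, rng=random.Random(0))
print(original)   # unchanged
print(augmented)  # same tokens, two positions exchanged
```

Passing a seeded `random.Random` instance keeps the augmentation repeatable without touching the global random state.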

#### Carry Out Data Augmentation

In [1]:
# use `ex` as the loop variable so the `example` list created earlier is not overwritten
for ex in train.examples:
    ex.questions = back_translate(ex.questions, p=0.01)
    ex.questions = random_deletion(ex.questions, p=0.1)
    ex.questions = random_swap(ex.questions, n=1)


## Building Vocabulary

In [28]:

MAX_VOCAB_SIZE = 25000
# Build the vocabulary from at most 25,000 of the most frequently occurring words.
# Use pretrained fastText embeddings; words that exist in the data but have no
# pretrained vector are initialised from a normal distribution.
Question.build_vocab(train,
                     max_size = MAX_VOCAB_SIZE,
                     vectors='fasttext.simple.300d',
                     unk_init = torch.Tensor.normal_,
                     min_freq = 2)
# Answer.build_vocab(train,
#                    vectors='fasttext.simple.300d',
#                    min_freq = 2)

In [29]:
# get the vocab instance
vocab = Question.vocab

In [30]:
embedding_vector = Question.vocab.vectors
embedding_vector.shape

torch.Size([25004, 300])

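By default, torchtext adds two special tokens: `<unk>` for unknown words and `<pad>`, a padding token used to pad the sequences in a batch to the same length, which helps with efficient batching on the GPU. Together with the `<sos>` and `<eos>` tokens set on the field, that makes four special tokens, which is why the vocabulary holds 25,000 + 4 = 25,004 entries.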
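
To make the roles of `max_size`, `min_freq` and the special tokens concrete, here is a toy vocabulary builder in plain Python. `build_toy_vocab` is a hypothetical helper that mimics the general idea (specials first, then the most frequent tokens that meet `min_freq`, capped at `max_size`); it is not torchtext's implementation, and the corpus is invented.

```python
from collections import Counter


def build_toy_vocab(token_lists, max_size, min_freq, specials):
    """Map tokens to indices: special tokens first, then frequent tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, freq in ranked if freq >= min_freq][:max_size]
    itos = list(specials) + kept
    return {tok: idx for idx, tok in enumerate(itos)}


corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]
specials = ["<unk>", "<pad>", "<sos>", "<eos>"]
stoi = build_toy_vocab(corpus, max_size=2, min_freq=2, specials=specials)

print(stoi)
print(len(stoi))                       # max_size + number of special tokens
print(stoi.get("dog", stoi["<unk>"]))  # out-of-vocabulary words fall back to <unk>
```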

In [31]:
print('Size of context+question vocab : ', len(Question.vocab))
# print('Size of answer vocab : ', len(Answer.vocab))
print('Top 10 words appeared in context+question:', list(Question.vocab.freqs.most_common(10)))
# print("Top 10 words appeared in answer:", list(Answer.vocab.freqs.most_common(10)))

Size of context+question vocab :  25004
Top 10 words appeared in context+question: [(',', 624240), ('.', 325459), ('"', 104398), ('-', 102183), ('(', 89076), (')', 85255), ('>', 81566), ('?', 81362), ("'s", 67096), ('is', 31131)]


## Create the iterators

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [33]:
BATCH_SIZE = 16
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits((train, valid, test),
                                                            batch_size = BATCH_SIZE, 
                                                            sort_key = lambda x: len(x.questions),
                                                            sort_within_batch=True,
                                                            device = device)



In [34]:
vars(Question.vocab)

{'freqs': Counter({'historical': 1632,
          'reveal': 185,
          'policing': 247,
          'agents': 420,
          'undertaken': 131,
          'cross': 1318,
          '-': 102183,
          'police': 2163,
          'missions': 325,
          'many': 19224,
          'years': 9706,
          '(': 89076,
          'deflem': 9,
          ',': 624240,
          ')': 85255,
          '.': 325459,
          '19th': 2675,
          'ww2': 5,
          'european': 5395,
          'agencies': 650,
          'undertook': 85,
          'border': 1206,
          'surveillance': 123,
          'concerns': 679,
          'anarchist': 20,
          'political': 5745,
          'notable': 1314,
          'example': 5204,
          'prussian': 840,
          'karl': 340,
          'marx': 147,
          'remained': 2028,
          'resident': 322,
          'london': 3596,
          'interests': 540,
          'public': 6321,
          'operation': 1217,
          'control': 4560,
       

Save the vocabulary for later use

In [35]:
import pickle
with open('question_tokenizer.pkl', 'wb') as tokens: 
    pickle.dump(Question.vocab.stoi, tokens)
# with open("answer_tokenizer.pkl", "wb") as tokens:
#     pickle.dump(Answer.vocab.stoi, tokens)

## Building the model

### Encoder

First, we'll build the encoder using a single-layer GRU; however, we now use a *bidirectional RNN*. With a bidirectional RNN, we have two RNNs in each layer: a *forward RNN* going over the embedded sentence from left to right (shown below in green), and a *backward RNN* going over the embedded sentence from right to left (teal). All we need to do in code is set `bidirectional = True` and then pass the embedded sentence to the RNN as before.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq8.png?raw=1)

We now have:

$$\begin{align*}
h_t^\rightarrow &= \text{EncoderGRU}^\rightarrow(e(x_t^\rightarrow),h_{t-1}^\rightarrow)\\
h_t^\leftarrow &= \text{EncoderGRU}^\leftarrow(e(x_t^\leftarrow),h_{t-1}^\leftarrow)
\end{align*}$$

Where $x_0^\rightarrow = \text{<sos>}, x_1^\rightarrow = \text{guten}$ and $x_0^\leftarrow = \text{<eos>}, x_1^\leftarrow = \text{morgen}$.

As before, we only pass an input (`embedded`) to the RNN, which tells PyTorch to initialize both the forward and backward initial hidden states ($h_0^\rightarrow$ and $h_0^\leftarrow$, respectively) to a tensor of all zeros. We'll also get two context vectors, one from the forward RNN after it has seen the final word in the sentence, $z^\rightarrow=h_T^\rightarrow$, and one from the backward RNN after it has seen the first word in the sentence, $z^\leftarrow=h_T^\leftarrow$.

The RNN returns `outputs` and `hidden`. 

`outputs` is of size **[src len, batch size, hid dim * num directions]** where the first `hid_dim` elements in the third axis are the hidden states from the top layer forward RNN, and the last `hid_dim` elements are hidden states from the top layer backward RNN. We can think of the third axis as being the forward and backward hidden states concatenated together, i.e. $h_1 = [h_1^\rightarrow; h_{T}^\leftarrow]$, $h_2 = [h_2^\rightarrow; h_{T-1}^\leftarrow]$, and we can denote all encoder hidden states (forward and backward concatenated together) as $H=\{ h_1, h_2, ..., h_T\}$.

`hidden` is of size **[n layers * num directions, batch size, hid dim]**, where **[-2, :, :]** gives the top layer forward RNN hidden state after the final time-step (i.e. after it has seen the last word in the sentence) and **[-1, :, :]** gives the top layer backward RNN hidden state after the final time-step (i.e. after it has seen the first word in the sentence).
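
The relationship between `outputs` and `hidden` can be sketched with a toy scalar RNN in plain Python. The recurrence `h_t = tanh(x_t + h_{t-1})` and the helper `run_rnn` are stand-ins chosen for brevity (a real GRU has gates and learned weights); only the bookkeeping of forward and backward states is the point here.

```python
import math


def run_rnn(inputs, h0=0.0):
    """Toy scalar RNN: h_t = tanh(x_t + h_{t-1}). Returns every hidden state."""
    states, h = [], h0
    for x in inputs:
        h = math.tanh(x + h)
        states.append(h)
    return states


x = [0.5, -1.0, 2.0]         # stand-in for an embedded sentence of length T = 3

forward = run_rnn(x)         # reads x_1 .. x_T
backward = run_rnn(x[::-1])  # reads x_T .. x_1

# Re-align the backward states with source positions before concatenating,
# so position i pairs the i-th forward step with the (T - i + 1)-th backward step.
outputs = [[f, b] for f, b in zip(forward, backward[::-1])]

# Final states: forward after the last word, backward after the first word.
hidden = [forward[-1], backward[-1]]

print(outputs)
print(hidden)
```

Note that the final forward state sits at the last position of `outputs`, while the final backward state sits at the first position.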

As the decoder is not bidirectional, it only needs a single context vector, $z$, to use as its initial hidden state, $s_0$, and we currently have two, a forward and a backward one ($z^\rightarrow=h_T^\rightarrow$ and $z^\leftarrow=h_T^\leftarrow$, respectively). We solve this by concatenating the two context vectors together, passing them through a linear layer, $g$, and applying the $\tanh$ activation function. 

$$z=\tanh(g(h_T^\rightarrow, h_T^\leftarrow)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0$$

**Note**: this is actually a deviation from the paper. Instead, they feed only the first backward RNN hidden state through a linear layer to get the context vector/decoder initial hidden state. This doesn't seem to make sense to me, so we have changed it.

As we want our model to look back over the whole of the source sentence we return `outputs`, the stacked forward and backward hidden states for every token in the source sentence. We also return `hidden`, which acts as our initial hidden state in the decoder.

In [3]:
class Encoder(nn.Module):
    '''
        Encoder takes the following arguments
            input_dim --> size of the source vocabulary (number of rows in the embedding table)
            emb_dim --> 300, the embedding size of each input word
            enc_hid_dim --> 512, GRU hidden units in the encoder
            dec_hid_dim --> 512, GRU hidden units in the decoder; needed here because the
                            encoder's final hidden state is projected to this size
            dropout --> 0.5, dropout rate applied to the embeddings

        One thing to notice is that we are using a 1-layer bidirectional GRU here.
        forward() receives src of dim [src_len,batch_size]; the GRU's outputs and hidden have the following dims
        # outputs dim [src_len,batch_size,num_dir*hid_dim]
        # hidden dim [num_layer*num_dir,batch_size,hid_dim]
        so for a batch size of 8 and source length of 10
        # outputs dim [10,8,2*512]
        # hidden dim [1*2,8,512]
    '''
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, emb_dim)

        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)

        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)

        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src dim [src_len,batch_size]
        # embedding dim [src_len,batch_size,emb_dim]
        embedding = self.dropout(
            self.embedding(src))  ## TODO check this why use dropout

        # outputs dim [src_len,batch_size,num_dir*hid_dim]
        # hidden dim [num_layer*num_dir,batch_size,hid_dim]
        outputs, hidden = self.rnn(embedding)
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer

        #initial decoder hidden is final hidden state of the forwards and backwards
        #  encoder RNNs fed through a linear layer to merge forward and backward
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))

        # this is the final dim which are being returned

        # outputs dim [src_len,batch_size,2 * enc_hid_dim]
        # hidden dim [batch_size,dec_hid_dim]
        return outputs, hidden


### Attention

Next up is the attention layer. This will take in the previous hidden state of the decoder, $s_{t-1}$, and all of the stacked forward and backward hidden states from the encoder, $H$. The layer will output an attention vector, $a_t$, whose length equals that of the source sentence; each element is between 0 and 1 and the entire vector sums to 1.

Intuitively, this layer takes what we have decoded so far, $s_{t-1}$, and all of what we have encoded, $H$, to produce a vector, $a_t$, that represents which words in the source sentence we should pay the most attention to in order to correctly predict the next word to decode, $\hat{y}_{t+1}$. 

First, we calculate the *energy* between the previous decoder hidden state and the encoder hidden states. As our encoder hidden states are a sequence of $T$ tensors, and our previous decoder hidden state is a single tensor, the first thing we do is `repeat` the previous decoder hidden state $T$ times. We then calculate the energy, $E_t$, between them by concatenating them together and passing them through a linear layer (`attn`) and a $\tanh$ activation function. 

$$E_t = \tanh(\text{attn}(s_{t-1}, H))$$ 

This can be thought of as calculating how well each encoder hidden state "matches" the previous decoder hidden state.

We currently have a **[dec hid dim, src len]** tensor for each example in the batch. We want this to be **[src len]** for each example in the batch as the attention should be over the length of the source sentence. This is achieved by multiplying the `energy` by a **[1, dec hid dim]** tensor, $v$.

$$\hat{a}_t = v E_t$$

We can think of $v$ as the weights for a weighted sum of the energy across all encoder hidden states. These weights tell us how much we should attend to each token in the source sequence. The parameters of $v$ are initialized randomly, but learned with the rest of the model via backpropagation. Note how $v$ is not dependent on time, and the same $v$ is used for each time-step of the decoding. We implement $v$ as a linear layer without a bias.

Finally, we ensure the attention vector fits the constraints of having all elements between 0 and 1 and the vector summing to 1 by passing it through a $\text{softmax}$ layer.

$$a_t = \text{softmax}(\hat{a}_t)$$

This gives us the attention over the source sentence!

Graphically, this looks something like below. This is for calculating the very first attention vector, where $s_{t-1} = s_0 = z$. The green/teal blocks represent the hidden states from both the forward and backward RNNs, and the attention computation is all done within the pink block.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq9.png?raw=1)

In [37]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]---> hidden is basically st-1 of decoder
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]

        attention = self.v(energy).squeeze(2)
        
        #attention= [batch size, src len]
        
        return F.softmax(attention, dim=1)
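
For a single example, the same computation can be traced by hand in plain Python. This is a sketch under stated assumptions: the weights `W` and `v` are invented numbers, bias terms are left out, and `additive_attention` is a hypothetical helper rather than part of the model code.

```python
import math


def additive_attention(s_prev, H, W, v):
    """Return softmax-normalised attention weights over the rows of H.

    s_prev: previous decoder state, length d_s
    H:      encoder states, T rows of length d_h
    W:      weights of the `attn` layer, d_a rows of length d_s + d_h
    v:      weights of the `v` layer, length d_a
    """
    scores = []
    for h_i in H:
        concat = s_prev + h_i  # [s_{t-1}; h_i]
        energy = [math.tanh(sum(w * c for w, c in zip(row, concat))) for row in W]
        scores.append(sum(v_k * e_k for v_k, e_k in zip(v, energy)))
    # Softmax over source positions (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


s_prev = [0.1, -0.2]                      # d_s = 2
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3, d_h = 2
W = [[0.5, 0.5, 1.0, 0.0],                # d_a = 2, columns = d_s + d_h
     [0.0, 1.0, 0.0, 1.0]]
v = [1.0, 1.0]

a = additive_attention(s_prev, H, W, v)
print(a)
```

The output has one weight per source position, which is what the `[batch size, src len]` shape in the class above expresses for a whole batch.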

### Decoder

Next up is the decoder. 

The decoder contains the attention layer, `attention`, which takes the previous hidden state, $s_{t-1}$, all of the encoder hidden states, $H$, and returns the attention vector, $a_t$.

We then use this attention vector to create a weighted source vector, $w_t$, denoted by `weighted`, which is a weighted sum of the encoder hidden states, $H$, using $a_t$ as the weights.

$$w_t = a_t H$$

The embedded input word, $d(y_t)$, the weighted source vector, $w_t$, and the previous decoder hidden state, $s_{t-1}$, are then all passed into the decoder RNN, with $d(y_t)$ and $w_t$ being concatenated together.

$$s_t = \text{DecoderGRU}(d(y_t), w_t, s_{t-1})$$

We then pass $d(y_t)$, $w_t$ and $s_t$ through the linear layer, $f$, to make a prediction of the next word in the target sentence, $\hat{y}_{t+1}$. This is done by concatenating them all together.

$$\hat{y}_{t+1} = f(d(y_t), w_t, s_t)$$

The image below shows decoding the first word in an example translation.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq10.png?raw=1)

The green/teal blocks show the forward/backward encoder RNNs which output $H$, the red block shows the context vector, $z = h_T = \tanh(g(h^\rightarrow_T,h^\leftarrow_T)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0$, the blue block shows the decoder RNN which outputs $s_t$, the purple block shows the linear layer, $f$, which outputs $\hat{y}_{t+1}$ and the orange block shows the calculation of the weighted sum over $H$ by $a_t$ and outputs $w_t$. Not shown is the calculation of $a_t$.

In [38]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()

        self.output_dim = output_dim
        self.attention = attention
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs)
                
        #a = [batch size, src len]
        
        a = a.unsqueeze(1)
        
        #a = [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)
        
        #weighted = [batch size, 1, enc hid dim * 2]
        
        weighted = weighted.permute(1, 0, 2)
        
        #weighted = [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        
        #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
            
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        #this also means that output == hidden
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden.squeeze(0)
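
The layer widths in the decoder follow directly from the concatenations described above. As a quick check with this notebook's settings (300-dimensional embeddings and 512 hidden units in both encoder and decoder), the hypothetical helper below computes the input sizes of the decoder GRU and of `fc_out`.

```python
def decoder_input_sizes(emb_dim, enc_hid_dim, dec_hid_dim):
    """Input widths of the decoder GRU and of the output layer fc_out."""
    weighted_dim = enc_hid_dim * 2                # forward + backward encoder states
    rnn_in = weighted_dim + emb_dim               # [embedded; weighted]
    fc_in = dec_hid_dim + weighted_dim + emb_dim  # [output; weighted; embedded]
    return rnn_in, fc_in


rnn_in, fc_in = decoder_input_sizes(emb_dim=300, enc_hid_dim=512, dec_hid_dim=512)
print(rnn_in, fc_in)
```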

### Seq2Seq

In this model the encoder RNN and decoder RNN don't need to have the same hidden dimensions; however, the encoder has to be bidirectional. This requirement can be removed by changing all occurrences of `enc_hid_dim * 2` to `enc_hid_dim * 2 if encoder_is_bidirectional else enc_hid_dim`.

In this seq2seq wrapper, the `encoder` returns both the final hidden state (the final hidden states of the forward and backward encoder RNNs, concatenated and passed through a linear layer), which is used as the initial hidden state for the decoder, and every hidden state (the forward and backward hidden states stacked on top of each other). We also need to ensure that `hidden` and `encoder_outputs` are passed to the decoder.

Briefly going over all of the steps:
- the `outputs` tensor is created to hold all predictions, $\hat{Y}$
- the source sequence, $X$, is fed into the encoder to receive $z$ and $H$
- the initial decoder hidden state is set to be the `context` vector, $s_0 = z = h_T$
- we use a batch of `<sos>` tokens as the first `input`, $y_1$
- we then decode within a loop:
  - inserting the input token $y_t$, previous hidden state, $s_{t-1}$, and all encoder outputs, $H$, into the decoder
  - receiving a prediction, $\hat{y}_{t+1}$, and a new hidden state, $s_t$
  - we then decide if we are going to teacher force or not, setting the next input as appropriate
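
The decoding loop in the steps above can be sketched without any neural network. In the example below, `step_fn` is a hypothetical stand-in for the decoder (it ignores the hidden state and encoder outputs entirely), and the token ids are invented; the point is only to show which token is fed back in at each step.

```python
import random


def decode_with_teacher_forcing(trg, step_fn, teacher_forcing_ratio, rng):
    """Run the decoding loop and return (predictions, inputs fed to the decoder).

    trg:     target token ids, with trg[0] being the <sos> token
    step_fn: stand-in for the decoder; maps an input token to a predicted token
    rng:     random.Random instance used for the teacher-forcing coin flip
    """
    predictions = [None]  # position 0 is never predicted
    fed_inputs = []
    current = trg[0]      # the first input is <sos>
    for t in range(1, len(trg)):
        fed_inputs.append(current)
        predicted = step_fn(current)
        predictions.append(predicted)
        teacher_force = rng.random() < teacher_forcing_ratio
        # Next input: ground-truth token if teacher forcing, else our own prediction.
        current = trg[t] if teacher_force else predicted
    return predictions, fed_inputs


trg = [2, 10, 11, 12, 3]  # invented ids: <sos> = 2, <eos> = 3


def always_seven(token):
    """Dummy decoder that predicts id 7 whatever the input is."""
    return 7


_, forced = decode_with_teacher_forcing(trg, always_seven, 1.0, random.Random(0))
_, free = decode_with_teacher_forcing(trg, always_seven, 0.0, random.Random(0))
print(forced)  # ground-truth tokens are fed back in
print(free)    # the decoder's own predictions are fed back in
```

With a ratio strictly between 0 and 1, the two behaviours are mixed step by step, which is what `teacher_forcing_ratio = 0.5` does during training.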

In [39]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden state and all encoder hidden states
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1

        return outputs
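
To make the tensor shapes in the decoding loop concrete, here is a minimal sketch that runs the same loop with a stand-in decoder. `ToyDecoder` and `decode` are hypothetical names introduced only for this illustration; the toy decoder accepts `encoder_outputs` to match the call signature but ignores it, so it does not implement attention. The sizes are arbitrary.

```python
import random

import torch
import torch.nn as nn


class ToyDecoder(nn.Module):
    """Hypothetical stand-in: same call signature as the real decoder, no attention."""

    def __init__(self, output_dim, hid_dim):
        super().__init__()
        self.output_dim = output_dim
        self.embedding = nn.Embedding(output_dim, hid_dim)
        self.rnn = nn.GRUCell(hid_dim, hid_dim)
        self.fc_out = nn.Linear(hid_dim, output_dim)

    def forward(self, input, hidden, encoder_outputs):
        # input = [batch size], hidden = [batch size, hid dim]
        # encoder_outputs is accepted but unused in this toy version
        hidden = self.rnn(self.embedding(input), hidden)
        return self.fc_out(hidden), hidden


def decode(decoder, trg, hidden, encoder_outputs, teacher_forcing_ratio=0.5):
    trg_len, batch_size = trg.shape
    outputs = torch.zeros(trg_len, batch_size, decoder.output_dim)
    input = trg[0]  # batch of <sos> tokens
    for t in range(1, trg_len):
        output, hidden = decoder(input, hidden, encoder_outputs)
        outputs[t] = output
        teacher_force = random.random() < teacher_forcing_ratio
        input = trg[t] if teacher_force else output.argmax(1)
    return outputs


torch.manual_seed(0)
trg = torch.randint(0, 10, (5, 3))       # [trg len, batch size]
hidden = torch.zeros(3, 8)               # [batch size, hid dim]
encoder_outputs = torch.zeros(7, 3, 16)  # [src len, batch size, enc hid dim * 2]

outputs = decode(ToyDecoder(10, 8), trg, hidden, encoder_outputs)
print(outputs.shape)                  # [trg len, batch size, output dim]
print(outputs[0].abs().sum().item())  # index 0 is never written by the loop
```

Because the loop starts at `t = 1`, the first slice of `outputs` stays at zero, which is why the first time-step is dropped before the loss is computed.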

## Training the Seq2Seq Model

We initialise our parameters, encoder, decoder and seq2seq model (placing it on the GPU if we have one). 

In [40]:
len(Question.vocab)

25004

In [41]:
INPUT_DIM = len(Question.vocab)
OUTPUT_DIM = len(Question.vocab)
# ENC_EMB_DIM = 256
ENC_EMB_DIM = 300
# DEC_EMB_DIM = 256
DEC_EMB_DIM = 300
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, embedding_vector, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, embedding_vector, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)

We use a simplified version of the weight initialization scheme used in the paper. Here, we will initialize all biases to zero and draw all weights from a normal distribution with mean 0 and standard deviation 0.01, $\mathcal{N}(0, 0.01^2)$.

In [42]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (rnn): GRU(300, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (rnn): GRU(1324, 512)
    (fc_out): Linear(in_features=1836, out_features=25004, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)
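
As a quick sanity check of the initialization scheme described above, the sketch below applies the same `init_weights` function to a small stand-alone GRU and inspects the result. The layer `toy` and its sizes are assumptions made for illustration and are not part of the notebook's model.

```python
import torch
import torch.nn as nn


def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)


torch.manual_seed(0)
toy = nn.GRU(64, 128, bidirectional=True)  # hypothetical toy layer
toy.apply(init_weights)

# gather all weights and all biases into two flat tensors
weights = torch.cat([p.flatten() for n, p in toy.named_parameters() if 'weight' in n])
biases = torch.cat([p.flatten() for n, p in toy.named_parameters() if 'bias' in n])

weight_std = weights.std().item()
max_abs_bias = biases.abs().max().item()
print(f'weight std: {weight_std:.4f}')   # should be close to 0.01
print(f'max |bias|: {max_abs_bias:.1f}')  # should be exactly 0
```

Note that `apply` calls the function on every submodule and `named_parameters` is itself recursive, so on a nested model some parameters are re-initialized more than once. This is harmless here, since each call draws from the same distribution.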

We then calculate the number of trainable parameters.

In [43]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 52,568,380 trainable parameters
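
To see what `count_parameters` actually counts, here is a small sketch on two stand-alone layers. The `nn.Linear(1024, 512)` layer has the same shape as the encoder's `fc` layer printed above. The frozen embedding is a hypothetical case, included only to show that parameters with `requires_grad=False` are excluded.

```python
import torch.nn as nn


def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# weight matrix (512 x 1024) plus bias vector (512)
layer = nn.Linear(1024, 512)
print(count_parameters(layer))  # 1024 * 512 + 512 = 524800

# a frozen layer contributes nothing to the trainable count
emb = nn.Embedding(100, 300)
emb.weight.requires_grad = False
print(count_parameters(emb))  # 0
```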


We create an optimizer.

In [50]:
optimizer = optim.Adam(model.parameters())

We initialize the loss function.

In [45]:
ANSWER_PAD_IDX = Question.vocab.stoi[Question.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = ANSWER_PAD_IDX)
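
The effect of `ignore_index` can be illustrated with a small sketch: positions whose target is the padding index contribute nothing to the loss, and the mean is taken over the remaining positions only. The padding index `1` and the toy tensors below are assumptions for illustration.

```python
import torch
import torch.nn as nn

PAD_IDX = 1  # assumed padding index for this sketch
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

torch.manual_seed(0)
logits = torch.randn(4, 6)           # [n tokens, vocab size]
target = torch.tensor([2, 5, 1, 1])  # the last two targets are padding

# loss over all four positions, with padding ignored
loss_with_pad = criterion(logits, target)

# loss over the two real tokens only, with no ignore_index
loss_real_only = nn.CrossEntropyLoss()(logits[:2], target[:2])

same = torch.isclose(loss_with_pad, loss_real_only).item()
print(same)  # the two losses should match
```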

We then create the training loop...

In [46]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src, src_length = batch.questions
        trg, trg_length = batch.answers
        

        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
        if i % 1000 == 0:
            print(f"{i} steps are done")
        
    return epoch_loss / len(iterator)
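
The gradient clipping step in the training loop can be examined in isolation. The sketch below builds a deliberately large gradient on a toy layer (an assumption for illustration) and shows that `clip_grad_norm_` rescales the gradients so that their total norm does not exceed the `clip` value. The function returns the total norm measured before clipping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(10, 10)  # hypothetical toy layer

# scale the output up so that the gradients are large
loss = (layer(torch.randn(4, 10)) * 100).pow(2).sum()
loss.backward()

# rescales gradients in place and returns the norm before clipping
norm_before = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)

# recompute the total gradient norm after clipping
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))

print(f'before: {norm_before.item():.2f}, after: {norm_after.item():.4f}')
```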

...and the evaluation loop, remembering to set the model to `eval` mode and turn off teacher forcing.

In [47]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src, src_length = batch.questions
            trg, trg_length = batch.answers

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
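
Both loops drop the first time-step and flatten the tensors before computing the loss, because `nn.CrossEntropyLoss` expects 2D predictions with 1D targets. A minimal sketch of the reshaping, using arbitrary toy sizes:

```python
import torch

trg_len, batch_size, output_dim = 5, 3, 10

output = torch.zeros(trg_len, batch_size, output_dim)      # stand-in predictions
trg = torch.randint(0, output_dim, (trg_len, batch_size))  # stand-in targets

# drop index 0 (the <sos> position), then merge the time and batch dimensions
flat_output = output[1:].view(-1, output_dim)
flat_trg = trg[1:].view(-1)

print(flat_output.shape)  # [(trg len - 1) * batch size, output dim]
print(flat_trg.shape)     # [(trg len - 1) * batch size]
```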

Finally, define a timing function.

In [48]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
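
As a small usage example, the function splits an elapsed time in seconds into whole minutes and the remaining whole seconds. The input value below is made up for illustration.

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs


# 611.8 seconds is 10 full minutes plus 11.8 seconds, truncated to 11
result = epoch_time(0.0, 611.8)
print(result)  # (10, 11)
```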

Then, we train our model, saving the parameters that give us the best validation loss.


In [49]:
N_EPOCHS = 20
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'squad-attention.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

0 steps are done
1000 steps are done
2000 steps are done
3000 steps are done
4000 steps are done
5000 steps are done
Epoch: 01 | Time: 10m 11s
	Train Loss: 5.273 | Train PPL: 194.922
	 Val. Loss: 5.427 |  Val. PPL: 227.438
0 steps are done
1000 steps are done
2000 steps are done
3000 steps are done
4000 steps are done
5000 steps are done
Epoch: 02 | Time: 10m 12s
	Train Loss: 5.251 | Train PPL: 190.829
	 Val. Loss: 5.407 |  Val. PPL: 222.884
0 steps are done
1000 steps are done
2000 steps are done
3000 steps are done
4000 steps are done
5000 steps are done
Epoch: 03 | Time: 10m 13s
	Train Loss: 5.252 | Train PPL: 190.955
	 Val. Loss: 5.440 |  Val. PPL: 230.338
0 steps are done
1000 steps are done
2000 steps are done
3000 steps are done
4000 steps are done
5000 steps are done
Epoch: 04 | Time: 10m 10s
	Train Loss: 5.251 | Train PPL: 190.717
	 Val. Loss: 5.551 |  Val. PPL: 257.564
0 steps are done
1000 steps are done
2000 steps are done
3000 steps are done
4000 steps are done
5000 steps are done
Epoch: 0

Finally, we test the model on the test set using these "best" parameters.

In [39]:
model.load_state_dict(torch.load('squad-attention.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 5.178 | Test PPL: 177.363 |
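
The perplexity values printed throughout are the exponential of the average cross-entropy loss. The sketch below recomputes one from a printed loss. Since the printed loss is rounded to three decimals, the result only approximately matches the printed perplexity. The `perplexity` helper is a hypothetical name introduced here. As a point of reference, a model that guesses uniformly over a vocabulary of size $V$ has loss $\log V$ and therefore perplexity $V$.

```python
import math


def perplexity(loss):
    # hypothetical helper: perplexity is exp of the mean cross-entropy loss
    return math.exp(loss)


# recomputed from the rounded test loss printed above
test_ppl = perplexity(5.178)
print(f'{test_ppl:.1f}')

# uniform guessing over the 25,004-token vocabulary
uniform_ppl = perplexity(math.log(25004))
print(f'{uniform_ppl:.0f}')
```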
