# Neural Machine Translation
## Attention Based Bidirectional LSTM

<p align="center">
<b>Attention Model with Bidirectional LSTM</b>
</p>
<p align="center">
<img src="../images/Attention Model with BiLSTM.png" style="width:250px;height:450px;">
</p>


The context vector $c_i$ depends on the sequence of *annotations* $(h_1, h_2, \ldots, h_{T_x})$ (the hidden-state sequence of the input), to which the encoder maps the input sentence. It is computed as a weighted sum of these annotations $h_j$:

$$c_i = \sum_{j=1}^{T_x}\alpha_{ij}h_j$$

The weight $\alpha_{ij}$ of each annotation $h_j$ is computed by a softmax over the scores $e_{ij}$:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}$$

where $e_{ij}$ is calculated by an *alignment model*, which scores how well the inputs around position $j$ and the output at position $i$ match:

$$e_{ij} = a(s_{i-1}, h_j)$$

where:
- $s_{i-1}$: the decoder's hidden state at the previous output step $i-1$
- $h_j$: the encoder's hidden state (annotation) for the $j^{th}$ input token

The alignment model $a(s_{i-1}, h_j)$ is a *feed-forward network* that is trained jointly with all the other components of the system.

<p align="center">
Source: <a href="https://arxiv.org/pdf/1409.0473.pdf">Original Attention model paper</a>
</p>
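The three equations above can be traced on toy numbers. The following is a minimal NumPy sketch, not the notebook's model: the alignment model $a$ is replaced by a hypothetical single linear layer with random weights `w`, and the sizes `T_x`, `enc_dim` and `dec_dim` are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)

T_x, enc_dim, dec_dim = 5, 4, 3          # illustrative sizes (assumptions)
h = rng.normal(size=(T_x, enc_dim))      # annotations h_1 .. h_{T_x}
s_prev = rng.normal(size=(dec_dim,))     # previous decoder state s_{i-1}

# Hypothetical alignment model: one linear layer applied to [s_{i-1}; h_j]
w = rng.normal(size=(dec_dim + enc_dim,))
e = np.array([w @ np.concatenate([s_prev, h_j]) for h_j in h])  # e_{ij}, shape (T_x,)

# Softmax over the input positions j (shifted by the max for numerical stability)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                     # alpha_{ij}, sums to 1 over j

# Context vector: weighted sum of the annotations
c = (alpha[:, None] * h).sum(axis=0)     # c_i, shape (enc_dim,)

print(alpha.shape, c.shape)
```

The weights `alpha` form a probability distribution over the input positions, so `c` always lies in the convex hull of the annotations.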

In [1]:
import torch
import torch.nn as nn
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator
import numpy as np
import random
import time
from torch.utils.tensorboard import SummaryWriter # To print to tensorboard

In [2]:
# spacy_eng = spacy.load('en')
# spacy_ger = spacy.load('de')
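def tokenize(text):
    """Naive tokenizer: split on commas, spaces and periods, dropping empty strings."""
    tokens = []
    for chunk in text.split(','):
        for word in chunk.split(' '):
            # 'day.'.split('.') gives ['day', ''], so empty pieces are filtered out
            tokens.extend(piece for piece in word.split('.') if piece)
    # Unlike repeated np.append, a plain list is safe for empty input as well
    return tokens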

In [3]:
# Example
tokenize("Hello, today is the day.")

['Hello', 'today', 'is', 'the', 'day']

In [4]:
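# Optional spaCy tokenizers. They are not used below and only work if `spacy` is
# imported and the `spacy_eng` / `spacy_ger` pipelines (commented out above) are loaded.
def tokenize_eng(text):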
    return [tok.text for tok in spacy_eng.tokenizer(text)]

def tokenize_ger(text):
    return [tok.text for tok in spacy_ger.tokenizer(text)]

In [5]:
english = Field(
    sequential=True, 
    use_vocab=True, 
    tokenize=tokenize, 
    lower=True,
    init_token='<sos>',
    eos_token='<eos>'
)
german = Field(
    sequential=True, 
    use_vocab=True, 
    tokenize=tokenize, 
    lower=True,
    init_token='<sos>',
    eos_token='<eos>'
)

In [6]:
train_data, val_data, test_data = Multi30k.splits(
    exts = ('.de', '.en'), # (Source language, Target Language)
    fields = (german, english)
)

In [7]:
# Build a vocabulary
english.build_vocab(train_data, max_size = 10000, min_freq = 2) # Keep only words that occur at least twice
german.build_vocab(train_data, max_size = 10000, min_freq = 2) # Same threshold for the source vocabulary

In [8]:
batch_size = 512
device = torch.device('cpu')

In the `BucketIterator`:

Together, `sort_key` and `sort_within_batch` make the iterator put examples of similar length into the same batch (and sort them by length within each batch). This minimizes the amount of padding, and therefore the amount of wasted computation.
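To see why grouping by length reduces padding, here is a small pure-Python sketch. It does not use `BucketIterator`; the sentence lengths are made up and `padding_needed` is a hypothetical helper that counts pad tokens when each batch is padded to its longest sentence.

```python
import random

random.seed(0)
lengths = [random.randint(3, 40) for _ in range(1024)]  # made-up sentence lengths
demo_batch_size = 32                                     # illustrative value

def padding_needed(lengths, batch_size):
    """Total pad tokens when each batch is padded to its longest sentence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

unsorted_pad = padding_needed(lengths, demo_batch_size)
bucketed_pad = padding_needed(sorted(lengths), demo_batch_size)  # similar lengths together

print(f"random batches:   {unsorted_pad} pad tokens")
print(f"bucketed batches: {bucketed_pad} pad tokens")
```

Fully sorting the data is the extreme case. The real iterator also shuffles, so it only approximates this grouping.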

In [9]:
train_iterator, val_iterator, test_iterator = BucketIterator.splits(
    (train_data, val_data, test_data),
    batch_sizes=(batch_size, batch_size, batch_size),
    sort_within_batch = True,
    sort_key = lambda x: len(x.src), # This would prioritise the similar length sentences in the batch
    device = device
)

## Encoder model

In [10]:
class EncoderLSTM(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, dropout_prob) -> None:
        super(EncoderLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.dropout = nn.Dropout(dropout_prob)
        self.embedding = nn.Embedding(num_embeddings=input_size, embedding_dim=embedding_size)
        
        if num_layers == 1:
            self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers, bidirectional = True) # Added Bi-directional
        else:
            self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers, bidirectional = True, dropout=dropout_prob)

        self.hidden_fc = nn.Linear(hidden_size*2, hidden_size)
        self.cell_fc = nn.Linear(hidden_size*2, hidden_size)
    
    def forward(self, x):
        # x shape: (seq_length, batch_size)

        embedding = self.dropout(self.embedding(x)) # Will return the embedding of every word in x
        # embedding shape: (seq_length, batch_size, embedding_size)

        encoder_states, (hidden, cell) = self.lstm(embedding)
        # encoder_states shape: (seq_length, batch_size, hidden_size*2)

        # hidden and cell state shape: (2, batch_size, hidden_size) when num_layers = 1
        hidden = self.hidden_fc(torch.cat((hidden[0:1], hidden[1:2]), dim = 2)) # [0:1]: Forward, [1:2]: Backward
        cell = self.cell_fc(torch.cat((cell[0:1], cell[1:2]), dim = 2))
        # dim = 2 since we need to concatenate in the hidden_size dimension

        
        # hidden, cell: final states only, projected to hidden_size to initialise the decoder
        # encoder_states: the entire hidden state sequence (the annotations h_j)
        # encoder_states shape: (seq_length, batch_size, hidden_size*2)
        return encoder_states, hidden, cell
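The merge of the forward and backward final states done by `hidden_fc` and `cell_fc` can be followed shape by shape. This is a NumPy sketch under assumed toy sizes; `W` and `b` are random stand-ins for the weights of an `nn.Linear(hidden_size*2, hidden_size)` layer, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, hidden_size = 3, 4                      # illustrative sizes (assumptions)

# Final states of a 1-layer bidirectional LSTM: index 0 = forward, 1 = backward
hidden = rng.normal(size=(2, batch, hidden_size))

# Concatenate the two directions along the feature axis: (1, batch, 2*hidden_size)
merged = np.concatenate((hidden[0:1], hidden[1:2]), axis=2)

# Stand-in for a linear layer: y = x W^T + b
W = rng.normal(size=(hidden_size, 2 * hidden_size))
b = np.zeros(hidden_size)
decoder_init = merged @ W.T + b                # (1, batch, hidden_size)

print(merged.shape, decoder_init.shape)
```

Slicing with `0:1` rather than `0` keeps the leading dimension, so the result already has the `(1, batch_size, hidden_size)` layout a single-layer decoder LSTM expects.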

## Decoder model

In [11]:
class DecoderLSTM(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, output_size, num_layers, dropout_prob) -> None:
        """
            input_size: Size of the target vocabulary
            output_size: Equal to input_size (one score per vocabulary word)
            hidden_size: Same size as the hidden state size of the Encoder
        """
        super(DecoderLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.dropout = nn.Dropout(dropout_prob)
        self.embedding = nn.Embedding(num_embeddings=input_size, embedding_dim=embedding_size)
        if num_layers == 1:
            self.lstm = nn.LSTM(hidden_size*2 + embedding_size, hidden_size, num_layers)
        else:
            self.lstm = nn.LSTM(hidden_size*2 + embedding_size, hidden_size, num_layers, dropout=dropout_prob)
        # hidden_size*2: Since Encoder is a Bidirectional LSTM, so two times the hidden_size
        # So here the hidden_size*2: Context Vector
        # embedding_size: Is the same as before

        self.energy_alignment_model = nn.Linear(hidden_size*3, 1)
        # First we will add hidden state from encoder (2*hidden_size), and one from decoder (hidden_size)
        # Hence 3*hidden_size
        self.softmax_energy = nn.Softmax(dim=0)
        self.relu = nn.ReLU()

        self.fc = nn.Linear(hidden_size, output_size) # At the output at each iteration
    
    def forward(self, x, encoder_states, hidden, cell):
        """
            This function predicts ONLY ONE iteration/one word at a time.
            Will need to be iteratively called for the entire translation prediction.
        """
        # Context Vector: hidden and cell state of Encoder
        # x shape: (batch_size), but we want (1, batch_size) - 1 represents 1 word at a time, as a batch of batch_size
        x = x.unsqueeze(0) # Will add 1 dimension

        embedding_vector = self.dropout(self.embedding(x)) # Applies dropout on the embedding values for all words
        # embedding_vector shape: (1, batch_size, embedding_size)

        sequence_length = encoder_states.shape[0] # Shape of encoder_states: (sequence_length, batch_size, hidden_size*2)
        h_reshaped = hidden.repeat(sequence_length, 1, 1) # Decoder hidden state
        # Shape changes from (1, batch_size, hidden_size) --> (sequence_length, batch_size, hidden_size)
        # (this assumes num_layers = 1, so that the first dimension of hidden is 1)
        # It is hidden repeated sequence_length times, once per encoder position
        # The next torch.cat() operation converts the shape to --> (sequence_length, batch_size, hidden_size*3)

        energy = self.relu(self.energy_alignment_model(torch.cat((h_reshaped, encoder_states), dim = 2)))
        # energy shape: (sequence_length, batch_size, 1)
        attention = self.softmax_energy(energy) # Dimension was set 0, since we needed to take softmax on encoder sequence_length dimension
        # attention shape: (sequence_length, batch_size, 1)


        # attention = attention.permute(1,2,0)
        # # attention shape: (batch_size, 1, sequence_length)
        # encoder_states = encoder_states.permute(1,0,2)
        # # encoder_states shape: (sequence_length, batch_size, hidden_size*2) --> (batch_size, sequence_length, hidden_size*2)
        # context_vector = torch.bmm(attention, encoder_states).permute(1,0,2)
        # # context_vector shape: (batch_size, 1, hidden_size*2) --> (permuting to) (1, batch_size, hidden_size*2)

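        context_vector = torch.einsum("snk,snl->knl", attention, encoder_states)
        # s: sequence_length, n: batch_size, k: 1, l: hidden_size*2
        # context_vector shape: (1, batch_size, hidden_size*2)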

        lstm_input = torch.cat((context_vector, embedding_vector), dim = 2)
        # lstm_input shape: (1, batch_size, hidden_size*2 + embedding_size)
        
        outputs, (hidden, cell) = self.lstm(lstm_input, (hidden, cell))
        # Shape of hidden_state: (1, batch_size, hidden_size)

        predictions = self.fc(outputs)
        # Shape of predictions: (1, batch_size, length_of_vocabulary)

        predictions = predictions.squeeze(0) # To remove the dimension 1
        # New shape: (batch_size, length_of_vocabulary)

        return predictions, hidden, cell
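The `"snk,snl->knl"` contraction used for the context vector in the decoder is a weighted sum over the sequence axis. The NumPy sketch below checks this on random data; the sizes are arbitrary assumptions, and `np.einsum` is used here because it accepts the same subscript string.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, batch, enc_dim = 6, 2, 8              # illustrative sizes (assumptions)

attention = rng.random(size=(seq_len, batch, 1))
attention /= attention.sum(axis=0, keepdims=True)   # normalise over seq_len, like Softmax(dim=0)
encoder_states = rng.normal(size=(seq_len, batch, enc_dim))

# Same subscripts as in the decoder: s = seq_len, n = batch, k = 1, l = enc_dim
context_einsum = np.einsum("snk,snl->knl", attention, encoder_states)

# Explicit weighted sum over the sequence axis
context_manual = (attention * encoder_states).sum(axis=0, keepdims=True)

print(context_einsum.shape)
print(np.allclose(context_einsum, context_manual))
```

Because `s` appears in both inputs but not in the output, it is the axis that gets summed over. This is the sum over $j$ in $c_i = \sum_j \alpha_{ij} h_j$.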

In [12]:
# To understand what repeat() is doing
asdf = torch.rand((22,33))
asdf.repeat(10,2,1).shape, asdf.repeat(10,1,1).shape

(torch.Size([10, 44, 33]), torch.Size([10, 22, 33]))

## The overall Seq2Seq model

In [13]:
class Attention(nn.Module):
    def __init__(self, encoder, decoder) -> None:
        super(Attention, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
    
    def forward(self, source, target, teacher_force_ratio = 0.5):
        """
            teacher_force_ratio -
            At each decoding step the decoder needs the previous
            word as its input. With probability teacher_force_ratio
            we feed the ground-truth word; otherwise we feed the
            model's own previous prediction.

            The ratio is kept below 1. With a ratio of 1 the decoder
            would only ever see ground-truth inputs during training,
            whereas at test time it has to consume its own (possibly
            wrong) predictions, which it would never have learned
            to handle.
        """
        # source shape: (source_sentence_len, batch_size)
        # target shape: (target_sentence_len, batch_size)
        # Different lengths in one batch are padded to the length of the longest sentence
        batch_size = source.shape[1]
        target_len = target.shape[0]
        target_vocab_size = len(english.vocab)

        encoder_states, hidden_context, cell_context = self.encoder(source)
        outputs = torch.zeros(target_len, batch_size, target_vocab_size).to(device)
        # Each word will have a vector of entire vocabulary size with a batch size of batch_size
        # Each prediction will be added to target_len dimension (dimension 0)
        
        # Grab start token (<sos>)
        x = target[0] # Shape = (batch_size,)

        for t in range(1, target_len):
            output, hidden_context, cell_context = self.decoder(x, encoder_states, hidden_context, cell_context)
            outputs[t] = output # adding along the first dimension

            # output shape: (batch_size, english_vocab_size)
            best_guess = output.argmax(1)
            x = target[t] if random.random() < teacher_force_ratio else best_guess
        
        return outputs
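The teacher-forcing decision in the loop above is a single biased coin flip per time step, shared by the whole batch. A minimal pure-Python sketch of that choice follows; `choose_next_input` is a hypothetical helper and the string tokens are placeholders.

```python
import random

def choose_next_input(ground_truth, prediction, teacher_force_ratio, rng):
    """Return the ground-truth token with probability teacher_force_ratio,
    otherwise the model's own prediction."""
    return ground_truth if rng.random() < teacher_force_ratio else prediction

rng = random.Random(0)
steps = 10_000
forced = sum(
    choose_next_input("truth", "pred", 0.5, rng) == "truth" for _ in range(steps)
)
print(f"ground truth used in {forced / steps:.1%} of steps")
```

Since `rng.random()` returns values in `[0, 1)`, a ratio of `1.0` always picks the ground truth and a ratio of `0.0` always picks the prediction.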

## Parameters

In [14]:
num_epochs = 50
learning_rate = 6e-4

In [15]:
load_model = False
input_size_encoder = len(german.vocab)
input_size_decoder = len(english.vocab)
output_size = input_size_decoder

In [16]:
encoder_embedding_size = 300
decoder_embedding_size = 300
hidden_size = 1024
num_layers = 1
encoder_dropout = 0.0
decoder_dropout = 0.0

In [19]:
# Tensorboard
writer = SummaryWriter('runs/Attention_Loss_plot')
step = 0

In [18]:
encoder_net = EncoderLSTM(
    input_size_encoder, 
    encoder_embedding_size, 
    hidden_size, 
    num_layers, 
    encoder_dropout
).to(device)

decoder_net = DecoderLSTM(
    input_size_decoder, 
    decoder_embedding_size, 
    hidden_size,
    output_size=output_size,
    num_layers=num_layers,
    dropout_prob=decoder_dropout
).to(device)

In [20]:
model = Attention(encoder=encoder_net, decoder=decoder_net).to(device)

In [21]:
pad_idx = english.vocab.stoi['<pad>'] # To obtain the index for <pad> token
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Since we don't want the loss to be calculated for the padding done on shorter sentences
# During averaging the loss part, the loss on these pad tokens won't be used for calculation
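To make the effect of `ignore_index` concrete, here is a NumPy sketch of a cross-entropy that averages only over non-padding positions. `masked_cross_entropy` is a hypothetical re-implementation for illustration, and the logits, targets and `demo_pad_idx` are made-up values.

```python
import numpy as np

def masked_cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    shifted = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    keep = targets != ignore_index
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked[keep].mean()

demo_pad_idx = 1                               # hypothetical <pad> index for this demo
logits = np.array([[2.0, 0.0, 1.0],
                   [0.5, 0.5, 3.0],
                   [9.0, 0.0, 0.0]])           # last row belongs to a padded position
targets = np.array([0, 2, demo_pad_idx])

loss_all_rows = masked_cross_entropy(logits, targets, ignore_index=-1)  # nothing ignored
loss_no_pad = masked_cross_entropy(logits, targets, ignore_index=demo_pad_idx)
print(loss_all_rows, loss_no_pad)
```

The padded row would otherwise contribute a large, meaningless loss term. Masking it gives the same value as computing the loss on the first two rows alone.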

In [22]:
optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate)

## Utility functions

In [23]:
from torchtext.data.metrics import bleu_score
import sys

In [24]:
def save_checkpoint(state, filename="models_state_dict/Seq2Seq_Attention_checkpoint.pth.tar"):
    print("Saving Checkpoint...")
    torch.save(state, filename)
    print("Saved!")

In [25]:
def load_checkpoint(checkpoint, model, optimizer):
    print("Loading checkpoint...")
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    print("Successfully loaded!")

In [26]:
def translate_sentence(model, sentence, source, target, device, max_length=60):
    # Tokenize a raw string; a pre-tokenized list is only lower-cased
    if isinstance(sentence, str):
        tokens = [token.lower() for token in tokenize(sentence)]
    else:
        tokens = [token.lower() for token in sentence]

    # spacy_ger = spacy.load("de")
    # Create tokens using spacy and everything in lower case (which is what our vocab is)
    # if type(sentence) == str:
    #     tokens = [token.text.lower() for token in spacy_ger(sentence)]
    # else:
    #     tokens = [token.lower() for token in sentence]

    # Add <SOS> and <EOS> in the beginning and end respectively
    tokens.insert(0, source.init_token)
    tokens.append(source.eos_token)

    # Go through each source token and convert to an index
    text_to_indices = [source.vocab.stoi[token] for token in tokens]

    # Convert to tensor and add 1 dimension at the 1st index (2nd dimension)
    sentence_tensor = torch.LongTensor(text_to_indices).unsqueeze(1).to(device)

    # Build encoder hidden and cell state
    with torch.no_grad():
        # Obtain context vectors for decoder (hidden and cell state)
        encoder_states, hidden, cell = model.encoder(sentence_tensor) # Will not build computational graph
    
    outputs = [target.vocab.stoi["<sos>"]] # First word to be inputted

    for _ in range(max_length):
        previous_word = torch.LongTensor([outputs[-1]]).to(device)

        with torch.no_grad():
            prediction, hidden, cell = model.decoder(previous_word, encoder_states, hidden, cell)
            best_guess = torch.argmax(prediction, dim=1).item()
        
        outputs.append(best_guess)

        # Model checks if prediction is an <eos> token or End of Sentence token
        if outputs[-1] == target.vocab.stoi["<eos>"]:
            break
    
    translated_sentence = [target.vocab.itos[index] for index in outputs]

    # Remove start token
    return translated_sentence[1:]

In [27]:
def bleu(data, model, source, target, device):
    targets = []
    outputs = []

    for example in data:
        src = vars(example)["src"]
        trg = vars(example)["trg"]

        prediction = translate_sentence(model, src, source, target, device)
        prediction = prediction[:-1] # Removing the last token (normally <eos>)

        targets.append([trg])
        outputs.append(prediction)
    
    return bleu_score(outputs, targets)
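The core quantity behind a BLEU score is the clipped n-gram precision. The sketch below shows only that part, in pure Python. The full metric also combines several n-gram orders and applies a brevity penalty, which are omitted here. `ngrams` and `clipped_precision` are hypothetical helpers, and the two sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, each counted
    at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

reference = "a boat with several men is pulled to shore".split()
candidate = "a boat with several men is pulled by horses".split()

p1 = clipped_precision(candidate, reference, 1)   # 7 of 9 unigrams match
p2 = clipped_precision(candidate, reference, 2)   # 6 of 8 bigrams match
print(f"unigram precision = {p1:.3f}, bigram precision = {p2:.3f}")
```

Clipping stops a candidate from scoring well by repeating one correct word many times.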

## Training

In [28]:
with torch.no_grad():
    total_params = sum(p.numel() for p in model.parameters() if p.requires_grad) # Number of Parameters
    print(f'Total number of trainable parameters = {total_params}')

Total number of trainable parameters = 39125721


In [29]:
if load_model:
    load_checkpoint(torch.load("models_state_dict/Seq2Seq_Attention_checkpoint.pth.tar"), model, optimizer)

In [30]:
sentence = "ein boot mit mehreren männern darauf wird von einem großen pferdegespann ans ufer gezogen."

In [31]:
max1 = 0

In [34]:
for epoch in range(num_epochs):
    print(f'Epoch {epoch+1}/{num_epochs}')

    model.eval() # Will turn off dropout
    translated_sentence = translate_sentence(model, sentence, german, english, device, max_length=60)
    # Drop the final token (normally <eos>) and join the remaining words with spaces
    translated_sentence_final = ' '.join(translated_sentence[:-1])
    print(f"Translated Example Sentence: \n {translated_sentence_final}")

    model.train()

    tic = time.time()
    for batch_idx, batch in enumerate(train_iterator):
        input_data = batch.src.to(device)
        target = batch.trg.to(device)

        output = model(input_data, target)
        # output shape: (trg_len, batch_size, output_dim)

        output = output[1:].reshape(-1, output.shape[2]) # So that we can send it to a softmax in CrossEntropyLoss function
        target = target[1:].reshape(-1) # Shape (trg_len * batch_size)

        # We are doing this so that the loss of all the words predicted can be done at once

        loss = criterion(output, target)

        optimizer.zero_grad()
        loss.backward()

        nn.utils.clip_grad_norm_(model.parameters(), max_norm = 1)
        optimizer.step()

        writer.add_scalar('Training Loss', loss, global_step = step)
        step+=1
    print(f"Time taken: {(time.time() - tic)//60:.0f}m {(time.time() - tic)%60:.0f}s")
    bleu_score_value = bleu(val_data, model, german, english, device)
    print(f'Bleu Score on Validation set = {bleu_score_value*100:.2f}')
    if max1 < bleu_score_value:
        max1 = bleu_score_value
        checkpoint = {'state_dict':model.state_dict(), 'optimizer':optimizer.state_dict()}
        save_checkpoint(checkpoint)

Epoch 1/30
Translated Example Sentence: 
 a boat with several men is being pulled by a large boat 
Time taken: 3m 30s
Bleu Score on Validation set = 21.91
Epoch 2/30
Translated Example Sentence: 
 a boat carrying several men is pulled by horses by a large of horses 
Time taken: 3m 30s
Bleu Score on Validation set = 22.29
Saving Checkpoint...
Saved!
Epoch 3/30
Translated Example Sentence: 
 a boat with several men is being pulled by a large shore of horses 
Time taken: 3m 27s
Bleu Score on Validation set = 22.34
Saving Checkpoint...
Saved!
Epoch 4/30
Translated Example Sentence: 
 a boat with several men is pulled by a large shore of horses 
Time taken: 3m 25s
Bleu Score on Validation set = 22.07
Epoch 5/30
Translated Example Sentence: 
 a boat with several men is pulled to shore by a large of of horses 
Time taken: 3m 23s
Bleu Score on Validation set = 22.46
Saving Checkpoint...
Saved!
Epoch 6/30
Translated Example Sentence: 
 a boat carrying several men is pulled to shore by a large o
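The training loop calls `nn.utils.clip_grad_norm_(model.parameters(), max_norm = 1)` before each optimizer step. The idea behind clipping by global norm can be sketched in NumPy as follows; `clip_by_global_norm` is a hypothetical helper written for illustration, and the gradients are made-up values.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one common factor so that their combined
    L2 norm is at most max_norm. Small gradients are left unchanged."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]   # made-up gradients, global norm 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float((g ** 2).sum()) for g in clipped))
print(norm_before, norm_after)
```

All tensors share one scale factor, so the direction of the overall update is preserved and only its length is capped.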

## Testing

In [35]:
if load_model:
    load_checkpoint(torch.load("models_state_dict/Seq2Seq_Attention_checkpoint.pth.tar"), model, optimizer)
model.eval()

Attention(
  (encoder): EncoderLSTM(
    (dropout): Dropout(p=0.0, inplace=False)
    (embedding): Embedding(7805, 300)
    (lstm): LSTM(300, 1024, bidirectional=True)
    (hidden_fc): Linear(in_features=2048, out_features=1024, bias=True)
    (cell_fc): Linear(in_features=2048, out_features=1024, bias=True)
  )
  (decoder): DecoderLSTM(
    (dropout): Dropout(p=0.0, inplace=False)
    (embedding): Embedding(5964, 300)
    (lstm): LSTM(2348, 1024)
    (energy_alignment_model): Linear(in_features=3072, out_features=1, bias=True)
    (softmax_energy): Softmax(dim=0)
    (relu): ReLU()
    (fc): Linear(in_features=1024, out_features=5964, bias=True)
  )
)

In [36]:
print(f"Bleu Score on train data = {bleu(train_data, model, german, english, device)*100:.2f}")

Bleu Score on train data = 96.72


In [37]:
print(f"Bleu Score on test data = {bleu(test_data, model, german, english, device)*100:.2f}")

Bleu Score on test data = 23.27


## Trials

In [38]:
sentence = "Es gibt so viele verschiedene Möglichkeiten für Eiscreme"

**Expected translation**: *There is so much variety in the options for icecream available*

In [39]:
with torch.no_grad():
    translated_sentence = translate_sentence(model, sentence, german, english, device, max_length=60)
    # Drop the final token (normally <eos>) and join the remaining words with spaces
    translated_sentence_final = ' '.join(translated_sentence[:-1])
    print(f"Translated Example Sentence: \n {translated_sentence_final}")

Translated Example Sentence: 
 it that many things set very <unk> 
