# Overview
**Assignment 2** focuses on the training on a Neural Machine Translation (NMT) system for English-Irish translation where English is the source language and Irish is the target language. 

**Grading Policy** 
Assignment 2 is graded and will be worth 25% of your overall grade. This assignment is worth a total of 50 points distributed over the tasks below.  Please note that this is an individual assignment and you must not work with other students to complete this assessment. Any copying from other students, from student exercises from previous years, and any internet resources will not be tolerated. Plagiarised assignments will receive zero marks and the students who commit this act will be reported. Feel free to reach out to the TAs and instructors if you have any questions.

## Task 1 - Data Collection and Preprocessing (10 points)
## Task 1a. Data Loading (5 pts)
Dataset: https://www.dropbox.com/s/zkgclwc9hrx7y93/DGT-en-ga.txt.zip?dl=0 
*  Download a English-Irish dataset and decompress it. The `DGT.en-ga.en` file contains a list english sentences and `DGT.en-ga.ga` contains the paralell Irish sentences. Read both files into the Jupyter environment and load them into a pandas dataframe. 
* Randomly sample 12,000 rows.
* Split the sampled data into train (10k), development (1k) and test set (1k)

In [1]:
# Your Code Here
#creating neccesaary imports
import os
import pandas as pd
import numpy as np


# reading in files
enf = []
gaf = []


with open("DGT.en-ga.en", encoding='utf-8') as en:
    #enf = en.read().splitlines()
    for i in en:
        enf.append(en.readline())
     
with open("DGT.en-ga.ga", encoding='utf-8') as ga:
    #gaf = ga.read().splitlines()
    for i in ga:
        gaf.append(ga.readline())

 
#print(len(enf))
#print(len(enf))

#checking if data is loaded correctly
assert len(enf) == len(gaf)


    
#reference->https://stackoverflow.com/questions/43152368/python-unicodedecodeerror-charmap-codec-cant-decode-byte-0x81-in-position
#reference -> https://www.w3schools.com/python/python_file_open.asp

In [2]:
#creating dataframe object
df = pd.DataFrame({"en":enf, "ga":gaf})

#random sampling 12,000 rows
sample_df = df.sample(n = 12000, replace= False, random_state = 42)

# performing splits according to : Split the sampled data into train (10k), development (1k) and test set (1k)

train = sample_df.head(10000)
dev = sample_df.iloc[10000:11000,:]
test = sample_df.iloc[11000:12000,:]

#validating the shapes
assert len(train) == 10000
assert len(dev) == 1000
assert len(test) == 1000


# reference -> https://datatofish.com/random-rows-pandas-dataframe/
# reference -> https://sparkbyexamples.com/pandas/get-first-n-rows-of-pandas/

## Task 1b. Preprocessing (5 pts)
* Add '<bof\>' to denote beginning of sentence and '<eos\>' to denote the end of the sentence to each target line.
* Perform the following pre-processing steps:
  * Lowercase the text
  * Remove all punctuation
  * tokenize the text 
*  Build seperate vocabularies for each language. 
  * Assign each unique word an id value 
*Print statistics on the selected dataset:
  * Number of samples
  * Number of unique source language tokens
  * Number of unique target language tokens
  * Max sequence length of source language
  * Max sequence length of target language



In [3]:
# Your code here
from nltk.tokenize import word_tokenize
from typing import List
import re

class prep:
    def __init__(self, lang: str):
        self.lang = lang
        self.word2index = {"PAD": 0, "BOF":1, "EOS":2}
        self.index2word = {0:"PAD", 1:"BOF", 2:"EOS"}
        self.word2count = {}
        self.max_len_seq = 0
        self.n_words = len(self.index2word)
        
    
    def clean ( self, sentence: str):
        """
        lowercase
        remove punct
        tokenize
        """
        text = sentence.lower()
        clean_text = re.sub(r'[^\w\s]', '', text).strip()
        if len(word_tokenize(clean_text)) > self.max_len_seq:
            self.max_len_seq = len(word_tokenize(clean_text))
            
        for word in word_tokenize(clean_text):
              self.addWord(word)
    
    def addWord(self, word: str):
   
        if word not in self.word2index:
                self.word2index[word] = self.n_words
                self.word2count[word] = 1
                self.index2word[self.n_words] = word
                self.n_words += 1
        else:
                self.word2count[word] += 1
                
    def encodeSentence(self, sentence: str) -> List[int]:
   
        text = sentence.lower()
        clean_text = re.sub(r'[^\w\s]', '',text).strip()
        clean_text = "BOF " + clean_text + " EOS"
        return [self.word2index[word] for word in word_tokenize(clean_text) if word in self.word2index]

    def decodeIds(self, ids: list) -> List[str]:
    
        return " ".join([self.index2word[tok] for tok in ids])
    
#reference -> Lab_08_Neural_NMT, AdvanceNlp module
    

In [36]:
from tqdm.notebook import tqdm 

english = prep("english")
gaelic = prep("gaelic")

for _, row in tqdm(sample_df.iterrows(), total=len(sample_df)):
  english.clean(row["en"])
  gaelic.clean(row["ga"])

print(f"Total number of samples in english language which is equal to unique token  : {english.n_words}")
print(f"Total number of samples in gaelic language which is equal to unique token: {gaelic.n_words}")
print("Max sequence length for English language", english.max_len_seq)
print("Max sequence length for Gaelic language", gaelic.max_len_seq)

  0%|          | 0/12000 [00:00<?, ?it/s]

Total number of samples in english language which is equal to unique token  : 11666
Total number of samples in gaelic language which is equal to unique token: 16258
Max sequence length for English language 175
Max sequence length for Gaelic language 268


In [5]:
# Your code here
import torch 
from tensorflow.keras.utils import pad_sequences
import pandas as pd

def encode_features(
    df: pd.DataFrame, 
    english: prep,
    gaelic: prep,
    pad_token: int = 0,
    max_seq_length = 10
  ):

  source = []
  target = []

  for _, row in df.iterrows():
    source.append(english.encodeSentence(row["en"]))
    target.append(gaelic.encodeSentence(row["ga"]))

  source = pad_sequences(
      source,
      maxlen=max_seq_length,
      padding="post",
      truncating = "post",
      value=pad_token
    )

  target = pad_sequences(
      target,
      maxlen=max_seq_length,
      padding="post",
      truncating = "post",
      value=pad_token
    )
  
  return source, target

train_source, train_target = encode_features(train, english,gaelic)
val_source, val_target = encode_features(dev, english, gaelic)
test_source, test_target = encode_features(test, english, gaelic)

print(f"Shapes of train source {train_source.shape}, and target {train_target.shape}")

#reference -> Lab_08_Neural_NMT, AdvanceNlp module

Shapes of train source (10000, 10), and target (10000, 10)


## Task 2. Model Implementation and Training (30 pts)



## Task 2a. Encoder-Decoder Model Implementation (10 pts)
Implement an Encoder-Decoder model in Pytorch with the following components
* A single layer RNN based encoder. 
* A single layer RNN based decoder
* A Encoder-Decoder model based on the above components that support sequence-to-sequence modelling. For the encoder/decoder you can use RNN, LSTMs or GRU. Use a hidden dimension of 256 or less depending on your compute constraints. 

In [23]:
from torch.utils.data import DataLoader, TensorDataset
import torch
import torch.nn as nn

train_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(train_source),
        torch.LongTensor(train_target)
    ),
    shuffle = True,
    batch_size = 32
)

val_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(val_source),
        torch.LongTensor(val_target)
    ),
    shuffle = False,
    batch_size = 32
)

test_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(test_source),
        torch.LongTensor(test_target)
    ),
    shuffle = False,
    batch_size = 32
)
#reference -> Lab_08_Neural_NMT, AdvanceNlp module

In [24]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


In [25]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden, cell

#reference-> https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb

In [26]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will both always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #context = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.fc_out(output.squeeze(0))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden, cell
    
#reference-> https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb

In [27]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        
        return outputs
    
#reference-> https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb

## Task 2b. Training (10 pts)
Implement the code to train the Encoder-Decoder model on the Irish-English data. You will write code for the following:
* Training, validation and test dataloaders 
* A training loop which trains the model for 5 epoch. Evaluate the loop at the end of each Epoch. Print out the train perplexity and validation perplexity after each epoch.

In [28]:
INPUT_DIM = english.n_words
OUTPUT_DIM = gaelic.n_words
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 1
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

In [29]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(11666, 256)
    (rnn): LSTM(256, 512, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(16258, 256)
    (rnn): LSTM(256, 512, dropout=0.5)
    (fc_out): Linear(in_features=512, out_features=16258, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [30]:
from tqdm.notebook import tqdm
import numpy as np 
import random
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters())

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model.to(device)

EPOCHS = 5
best_val_loss = float('inf')

for epoch in range(EPOCHS):

  model.train()
  epoch_loss = 0
  for batch in tqdm(train_dl, total=len(train_dl)):

     src = batch[0].transpose(1, 0).to(device)
     trg = batch[1].transpose(1, 0).to(device)

     optimizer.zero_grad()

     output = model(src, trg)

     output_dim = output.shape[-1]
     output = output[1:].view(-1, output_dim).to(device)
     trg = trg[1:].reshape(-1)
     
     loss = F.cross_entropy(output, trg)
     loss.backward()

     torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
     optimizer.step()
     epoch_loss += loss.item()

  train_loss = round(epoch_loss / len(train_dl), 3)
  
  eval_loss = 0
  model.eval()
  for batch in tqdm(val_dl, total=len(val_dl)):
    src = batch[0].transpose(1, 0).to(device)
    trg = batch[1].transpose(1, 0).to(device)

    with torch.no_grad():
      output = model(src, trg)
      
      output_dim = output.shape[-1]
      output = output[1:].view(-1, output_dim).to(device)
      trg = trg[1:].reshape(-1)
      
      loss = F.cross_entropy(output, trg)
      
      eval_loss += loss.item()
  
  val_loss = round(eval_loss / len(val_dl), 3)
  print(f"Epoch {epoch} | train loss {train_loss} | train ppl {np.exp(train_loss)} | val ppl {np.exp(val_loss)}")


  if val_loss < best_val_loss:
    best_val_loss = val_loss
    torch.save(model.state_dict(), 'best-model.pt')  

#reference -> Lab_08_Neural_NMT, AdvanceNlp module

  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 0 | train loss 5.982 | train ppl 396.2320402998885 | val ppl 314.8196704067165


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 1 | train loss 5.334 | train ppl 207.26537976093547 | val ppl 263.749555636939


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 2 | train loss 5.027 | train ppl 152.4749011684023 | val ppl 228.14924542400394


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 3 | train loss 4.734 | train ppl 113.74965217224874 | val ppl 230.90352901934028


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 4 | train loss 4.57 | train ppl 96.54410977284468 | val ppl 225.87912250203328


# Task 2c. Evaluation on the Test Set (10 pts)
Use the trained model to translate the text from the source language into the target language on the test set. Evaluate the performance of the model on the test set using the BLEU metric and print out the average the BLEU score.

In [33]:
def translate_sentence(
    text: str, 
    model: EncoderDecoder, 
    english: prep,
    gaelic: prep,
    device: str,
    max_len: int = 10,
  ) -> str:

  # Encode english sentence and convert to tensor
  input_ids = english.encodeSentence(text)
  input_tensor = torch.LongTensor(input_ids).unsqueeze(1).to(device)

  # Get encooder hidden states
  with torch.no_grad():
    encoder_outputs, hidden = model.encoder(input_tensor)

  # Build target holder list
  trg_indexes = [gaelic.word2index["BOF"]]

  # Loop over sequence length of target sentence
  for i in range(max_len):
    trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
    
    # Decode the encoder outputs with respect to current target word
    with torch.no_grad():
      output, hidden, cell = model.decoder(trg_tensor, hidden, encoder_outputs)
    
    # Retrieve most likely word over target distribution
    pred_token = torch.argmax(output).item()
    trg_indexes.append(pred_token)

    if pred_token == gaelic.word2index["EOS"]:
      break

  return "".join(gaelic.decodeIds(trg_indexes))

#reference -> Lab_08_Neural_NMT, AdvanceNlp module

In [34]:
import nltk
model.eval()
score = []
for _,row in test.iterrows():
        translated = translate_sentence(row["en"], model, english, gaelic, device)
        b_score = nltk.translate.bleu_score.sentence_bleu(translated.split(), row["ga"].split())
        score.append(b_score)
print(np.mean(score))

#reference -> Lab_07, AdvanceNlp module

1.1053830046501567e-157


## Task 3. Improving NMT using Attention (10 pts) 
Extend the Encoder-Decoder model from Task 2 with the attention mechanism. Retrain the model and evaluate on test set. Print the updated average BLEU score on the test set. In a few sentences explains which model is the best for translation. 

We can see that BLEU score of model with attention is higher because attention allows the decoder to selectively focus on the parts of the input sequence that are most relevant to generarting next output token, that is , it then computes an attention vector over the length of the sentence which is used by the model to learn which part of the source sentence is most salient when translating the current token. 
Note that since the architecture has been changed from LSTM to GRU for attention based.

In [14]:
import torch 
import torch.nn as nn 
import torch.nn.functional as F

class EncoderGRU(nn.Module):
    def __init__(
        self, 
        input_vocab_size,  # size of source vocabulary  
        hidden_dim,        # hidden dimension of embeddings
        encoder_hid_dim,   # gru hidden dim
        decoder_hid_dim,   # decoder hidden dim 
        dropout_prob = .5
      ):
      
        super().__init__()
        self.embedding = nn.Embedding(input_vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, encoder_hid_dim, bidirectional = True)
        self.fc = nn.Linear(encoder_hid_dim * 2, decoder_hid_dim)
        self.dropout = nn.Dropout(dropout_prob)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        outputs, hidden = self.rnn(embedded)
                
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards GRU
        #hidden [-1, :, : ] is the last of the backwards GRU
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))        
        return outputs, hidden

#reference -> Lab_08_Neural_NMT, AdvanceNlp module

In [15]:
class Attention(nn.Module):
    def __init__(
        self, 
        enc_hid_dim,      # Encoder hidden dimension
        dec_hid_dim       # Decoder hidden dimension 
      ):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]
        attention = self.v(energy).squeeze(2)
        
        #attention output: [batch size, src len]
        return F.softmax(attention, dim=1)
    
#reference -> Lab_08_Neural_NMT, AdvanceNlp module

In [16]:
class DecoderGRU(nn.Module):
    def __init__(
        self, 
        target_vocab_size,    # Size of target vocab 
        hidden_dim,           # hidden size of embedding  
        enc_hid_dim, 
        dec_hid_dim, 
        dropout
      ):
        super().__init__()

        self.output_dim = target_vocab_size
        self.attention = Attention(enc_hid_dim, dec_hid_dim)
        
        self.embedding = nn.Embedding(target_vocab_size, hidden_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 2) + hidden_dim, dec_hid_dim)
        
        self.fc_out = nn.Linear(
            (enc_hid_dim * 2) + dec_hid_dim + hidden_dim, 
            target_vocab_size
          )
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)  # [1, batch size]
        
        embedded = self.dropout(self.embedding(input))  # [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs)     # [batch size, src len]
        a = a.unsqueeze(1)                              # [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2) # [batch size, src len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)           # [batch size, 1, enc hid dim * 2]
        weighted = weighted.permute(1, 0, 2)               # [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2) # [1, batch size, (enc hid dim * 2) + emb dim]

        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]    
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1)) # [batch size, output dim]
        return prediction, hidden.squeeze(0)
    
#reference -> Lab_08_Neural_NMT, AdvanceNlp module

In [17]:
import random 
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time     
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):     
            #insert input token embedding, previous hidden state and all encoder hidden states
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        return outputs
    
#reference -> Lab_08_Neural_NMT, AdvanceNlp module

In [18]:
INPUT_DIM = english.n_words
OUTPUT_DIM = gaelic.n_words
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 128
DEC_HID_DIM = 128
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = EncoderGRU(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = DecoderGRU(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT)

model = EncoderDecoder(enc, dec)

def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

EncoderDecoder(
  (encoder): EncoderGRU(
    (embedding): Embedding(11666, 256)
    (rnn): GRU(256, 128, bidirectional=True)
    (fc): Linear(in_features=256, out_features=128, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): DecoderGRU(
    (attention): Attention(
      (attn): Linear(in_features=384, out_features=128, bias=True)
      (v): Linear(in_features=128, out_features=1, bias=False)
    )
    (embedding): Embedding(16258, 256)
    (rnn): GRU(512, 128)
    (fc_out): Linear(in_features=640, out_features=16258, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [19]:
from tqdm.notebook import tqdm
import numpy as np 
optimizer = torch.optim.Adam(model.parameters())

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model.to(device)

EPOCHS = 5
best_val_loss = float('inf')

for epoch in range(EPOCHS):

  model.train()
  epoch_loss = 0
  for batch in tqdm(train_dl, total=len(train_dl)):

     src = batch[0].transpose(1, 0).to(device)
     trg = batch[1].transpose(1, 0).to(device)

     optimizer.zero_grad()

     output = model(src, trg)

     output_dim = output.shape[-1]
     output = output[1:].view(-1, output_dim).to(device)
     trg = trg[1:].reshape(-1)
     
     loss = F.cross_entropy(output, trg)
     loss.backward()

     torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
     optimizer.step()
     epoch_loss += loss.item()

  train_loss = round(epoch_loss / len(train_dl), 3)
  
  eval_loss = 0
  model.eval()
  for batch in tqdm(val_dl, total=len(val_dl)):
    src = batch[0].transpose(1, 0).to(device)
    trg = batch[1].transpose(1, 0).to(device)

    with torch.no_grad():
      output = model(src, trg)
      
      output_dim = output.shape[-1]
      output = output[1:].view(-1, output_dim).to(device)
      trg = trg[1:].reshape(-1)
      
      loss = F.cross_entropy(output, trg)
      
      eval_loss += loss.item()
  
  val_loss = round(eval_loss / len(val_dl), 3)
  print(f"Epoch {epoch} | train loss {train_loss} | train ppl {np.exp(train_loss)} | val ppl {np.exp(val_loss)}")


  if val_loss < best_val_loss:
    best_val_loss = val_loss
    torch.save(model.state_dict(), 'best-model.pt')  
  
#reference -> Lab_08_Neural_NMT, AdvanceNlp module

  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 0 | train loss 6.135 | train ppl 461.73909401890546 | val ppl 313.8766266683865


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 1 | train loss 5.308 | train ppl 201.94593236216255 | val ppl 265.33680997200577


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 2 | train loss 5.014 | train ppl 150.50555593211624 | val ppl 239.60698055027026


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 3 | train loss 4.828 | train ppl 124.9607889885125 | val ppl 230.21185645987578


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 4 | train loss 4.608 | train ppl 100.28338217150392 | val ppl 212.08772791591244


In [20]:
def translate_sentence(
    text: str, 
    model: EncoderDecoder, 
    english: prep,
    gaelic: prep,
    device: str,
    max_len: int = 10,
  ) -> str:

  # Encode english sentence and convert to tensor
  input_ids = english.encodeSentence(text)
  input_tensor = torch.LongTensor(input_ids).unsqueeze(1).to(device)

  # Get encooder hidden states
  with torch.no_grad():
    encoder_outputs, hidden = model.encoder(input_tensor)

  # Build target holder list
  trg_indexes = [gaelic.word2index["BOF"]]

  # Loop over sequence length of target sentence
  for i in range(max_len):
    trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
    
    # Decode the encoder outputs with respect to current target word
    with torch.no_grad():
      output, hidden = model.decoder(trg_tensor, hidden, encoder_outputs)
    
    # Retrieve most likely word over target distribution
    pred_token = torch.argmax(output).item()
    trg_indexes.append(pred_token)

    if pred_token == gaelic.word2index["EOS"]:
      break

  return "".join(gaelic.decodeIds(trg_indexes))

#reference -> Lab_08_Neural_NMT, AdvanceNlp module

In [22]:
import nltk
model.eval()
score = []
for _,row in test.iterrows():
        translated = translate_sentence(row["en"], model, english, gaelic, device)
        b_score = nltk.translate.bleu_score.sentence_bleu(translated.split(), row["ga"].split())
        score.append(b_score)
print(np.mean(score))

#reference -> Lab_08_Neural_NMT, AdvanceNlp module

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


2.7048101190961166e-158
