# Overview
**Assignment 2** focuses on the training on a Neural Machine Translation (NMT) system for English-Irish translation where English is the source language and Irish is the target language. 

**Grading Policy** 
Assignment 2 is graded and will be worth 25% of your overall grade. This assignment is worth a total of 50 points distributed over the tasks below.  Please note that this is an individual assignment and you must not work with other students to complete this assessment. Any copying from other students, from student exercises from previous years, and any internet resources will not be tolerated. Plagiarised assignments will receive zero marks and the students who commit this act will be reported. Feel free to reach out to the TAs and instructors if you have any questions.

## Task 1 - Data Collection and Preprocessing (10 points)
## Task 1a. Data Loading (5 pts)
Dataset: https://www.dropbox.com/s/zkgclwc9hrx7y93/DGT-en-ga.txt.zip?dl=0 
*  Download a English-Irish dataset and decompress it. The `DGT.en-ga.en` file contains a list english sentences and `DGT.en-ga.ga` contains the paralell Irish sentences. Read both files into the Jupyter environment and load them into a pandas dataframe. 
* Randomly sample 12,000 rows.
* Split the sampled data into train (10k), development (1k) and test set (1k)

In [1]:
import pandas as pd
#reading lines from datasets for english and irish data
english_data = pd.read_csv("DGT-en-ga.txt\DGT.en-ga.en", sep = '\t', names = ["Englishtext"])
irish_data = pd.read_csv("DGT-en-ga.txt\DGT.en-ga.ga", sep = '\t', names = ["Irishtext"])

In [2]:
df = pd.concat([english_data, irish_data], axis="columns")

In [3]:
df = df.head(181143)

In [4]:
df["Irishlen"] = df["Irishtext"].apply(lambda x: len(x.split(" ")))
df["Englishlen"] = df["Englishtext"].apply(lambda x: len(str(x).split(" ")))

In [5]:
df

Unnamed: 0,Englishtext,Irishtext,Irishlen,Englishlen
0,Procès-verbal of rectification to the Conventi...,Miontuairisc cheartaitheach maidir le Coinbhin...,28,27
1,(Official Journal of the European Union L 147 ...,(Iris Oifigiúil an Aontais Eorpaigh L 147 an 1...,20,12
2,This rectification has been carried out by mea...,Rinneadh an ceartúchán seo le miontuairisc che...,28,33
3,"On pages 33-34, Annex I:","Ar leathanaigh 33-34, Iarscríbhinn I:",5,5
4,the entries for the States below are rectified...,maidir leis na hiontrálacha le haghaidh na Stá...,14,10
...,...,...,...,...
181138,For the Council,"Airteagal 2.5 (Coimirce talmhaíochta), agus Ai...",12,3
181139,Position of the European Parliament of 31 Janu...,"ciallaíonn ‘idirthréimhse’, i ndáil le hearra ...",43,23
181140,Regulation (EU) No 1305/2013 of the European P...,Airteagal 18 (Coimirciú) d'Iarscríbhinn 2-C ma...,10,37
181141,Regulation (EU) 2017/2393 of the European Parl...,I rith na 10 mbliana tar éis theacht i bhfeidh...,35,107


In [6]:
df = df[df["Irishlen"] > 28]
df = df.sample(12000, random_state=42)

In [7]:
dataset = df.reset_index()

## Task 1b. Preprocessing (5 pts)
* Add '<bof\>' to denote beginning of sentence and '<eos\>' to denote the end of the sentence to each target line.
* Perform the following pre-processing steps:
  * Lowercase the text
  * Remove all punctuation
  * tokenize the text 
*  Build seperate vocabularies for each language. 
  * Assign each unique word an id value 
*Print statistics on the selected dataset:
  * Number of samples
  * Number of unique source language tokens
  * Number of unique target language tokens
  * Max sequence length of source language
  * Max sequence length of target language



In [8]:
import re
from collections import Counter

# Your code here
class Preprocess:
    def __init__(self, dataframe):
        self.dataframe = dataframe        
        self.englishwords = []
        self.irishwords = []
        self.english_valuecounts = {}
        self.irish_valuecounts = {}
        self.english_dict = {"PAD": 0, "<bof>": 1, "<eos>": 2}
        self.english_word2idx = {0: 'PAD', 1: "<bof>", 2 : "<eof>"}
        self.irish_dict = {"PAD": 0, "<bof>": 1, "<eos>": 2}
        self.irish_word2idx = {0: 'PAD', 1: "<bof>", 2 : "<eof>"}
        self.english_unique_words = 0
        self.irish_unique_words = 0
        self.english_word_count = 0
        self.irish_word_count = 0
        self.encoded_sentence_english = 0
        self.encoded_sentence_irish = 0
        
        
    #1. Add '<bof>' to denote beginning of sentence and '<eos>' to denote the end of the sentence to each target line.
    def sentenceTags(self):
        self.dataframe['Englishtext'] = self.dataframe['Englishtext'].apply(lambda x: '<bof> '+ x + ' <eos>')
        self.dataframe['Irishtext'] = self.dataframe['Irishtext'].apply(lambda x: '<bof> '+ x + ' <eos>')
        
        return self.dataframe
    
    def preprocess(self):
        #Lowercase the text
        self.dataframe['Englishtext'] = self.dataframe['Englishtext'].apply(lambda x: str(x).lower())
        self.dataframe['Irishtext'] = self.dataframe['Irishtext'].apply(lambda x: x.lower())
        
        #Remove all punctuation
        self.dataframe['Englishtext'] = self.dataframe['Englishtext'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
        self.dataframe['Irishtext'] = self.dataframe['Irishtext'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
        
        return self.dataframe
        
    #tokenize the text
    def tokenize(self):
        self.dataframe['EnglishTokens'] = self.dataframe['Englishtext'].apply(lambda x: x.split(" "))
        self.dataframe['IrishTokens'] = self.dataframe['Irishtext'].apply(lambda x: x.split(" "))
        
        return self.dataframe
    
    def create_dict(self):
        self.englishwords = self.dataframe['Englishtext'].str.cat().split(' ')
        self.irishwords = self.dataframe['Irishtext'].str.cat().split(' ')
        
        
        #create a word count dictonary
        #creat a index to word 
        value = 3
        for word in self.englishwords:
            if word not in self.english_dict and word != '':
                self.english_dict[word] = value
                self.english_word2idx[value] = word
                value += 1
        value = 3
        for word in self.irishwords:
            if word not in self.irish_dict and word != '':
                self.irish_dict[word] = value
                self.irish_word2idx[value] = word
                value += 1

        
        self.english_dict = dict(sorted(self.english_dict.items(), key = lambda x:x[1]))
        self.irish_dict = dict(sorted(self.irish_dict.items(), key = lambda x:x[1]))
        
        
        #Assign each unique word an id value 
        self.english_unique_words = set(self.englishwords)
        self.irish_unique_words = set(self.irishwords)
        
        self.english_word_count = len(self.english_unique_words)
        self.irish_word_count = len(self.irish_unique_words)
        
        print("Created dictonaries for indexing unique words and value counts")
        
    
    def encode_sentence(self):
        self.encoded_sentence_english = [[preprocess.english_dict[token] for token in sentence 
                             if token != ''] for sentence in preprocess.dataframe.EnglishTokens]
        self.encoded_sentence_irish = [[preprocess.irish_dict[token] for token in sentence 
                             if token != ''] for sentence in preprocess.dataframe.IrishTokens]
        
    """
        Print statistics on the selected dataset:
        Number of samples
        Number of unique source language tokens
        Number of unique target language tokens
        Max sequence length of source language
        Max sequence length of target language
    """
    
    def print_info(self):
        print("Statistics on dataset\n\n")
        
        print("Number of samples: "+ str(len(self.dataframe)))
        print("Unique words in english set: "+ str(self.english_word_count))
        print("Unique words in irish set: "+ str(self.irish_word_count))
        print("Max sequence length for source language : "+ str(max(self.dataframe.Englishlen)))
        print("Max sequence length for target language : "+ str(max(self.dataframe.Irishlen)))
        

In [9]:
preprocess = Preprocess(dataset)

In [10]:
preprocess.preprocess()

Unnamed: 0,index,Englishtext,Irishtext,Irishlen,Englishlen
0,158873,article 21b and c,gan dochar dfhreagrachtaí an oibreora aerártha...,69,4
1,86645,the commission shall send the information refe...,ceadófar do shoithí mórscála peiligeacha gabhá...,65,57
2,138191,for the purposes referred to in points a b and...,féadfaidh an coimisiún gníomhartha cur chun fe...,31,49
3,22757,adapting the production and output of producer...,cinnfidh na ballstáit an tuaslíon agus an tíos...,35,18
4,111506,the possibility of becoming a member the condi...,ba é an chonclúid a baineadh as an meastóireac...,34,16
...,...,...,...,...,...
11995,147034,steered axles number and position,cuirfidh an tiarratasóir isteach ráiteas ón mo...,39,5
11996,105707,active implantable devices and if appropriate ...,go háirithe tabharfar aird ar an tsábháilteach...,34,36
11997,72904,the supervisory authorities concerned shall no...,i gcás ina mbeidh bunaíochtaí ag an rialaitheo...,58,28
11998,157368,when the information referred to in article 72...,más rud é tar éis an chomhairliúcháin sin go m...,119,48


In [11]:
preprocess.sentenceTags()

Unnamed: 0,index,Englishtext,Irishtext,Irishlen,Englishlen
0,158873,<bof> article 21b and c <eos>,<bof> gan dochar dfhreagrachtaí an oibreora ae...,69,4
1,86645,<bof> the commission shall send the informatio...,<bof> ceadófar do shoithí mórscála peiligeacha...,65,57
2,138191,<bof> for the purposes referred to in points a...,<bof> féadfaidh an coimisiún gníomhartha cur c...,31,49
3,22757,<bof> adapting the production and output of pr...,<bof> cinnfidh na ballstáit an tuaslíon agus a...,35,18
4,111506,<bof> the possibility of becoming a member the...,<bof> ba é an chonclúid a baineadh as an meast...,34,16
...,...,...,...,...,...
11995,147034,<bof> steered axles number and position <eos>,<bof> cuirfidh an tiarratasóir isteach ráiteas...,39,5
11996,105707,<bof> active implantable devices and if approp...,<bof> go háirithe tabharfar aird ar an tsábhái...,34,36
11997,72904,<bof> the supervisory authorities concerned sh...,<bof> i gcás ina mbeidh bunaíochtaí ag an rial...,58,28
11998,157368,<bof> when the information referred to in arti...,<bof> más rud é tar éis an chomhairliúcháin si...,119,48


In [12]:
preprocess.tokenize()

Unnamed: 0,index,Englishtext,Irishtext,Irishlen,Englishlen,EnglishTokens,IrishTokens
0,158873,<bof> article 21b and c <eos>,<bof> gan dochar dfhreagrachtaí an oibreora ae...,69,4,"[<bof>, article, 21b, and, c, <eos>]","[<bof>, gan, dochar, dfhreagrachtaí, an, oibre..."
1,86645,<bof> the commission shall send the informatio...,<bof> ceadófar do shoithí mórscála peiligeacha...,65,57,"[<bof>, the, commission, shall, send, the, inf...","[<bof>, ceadófar, do, shoithí, mórscála, peili..."
2,138191,<bof> for the purposes referred to in points a...,<bof> féadfaidh an coimisiún gníomhartha cur c...,31,49,"[<bof>, for, the, purposes, referred, to, in, ...","[<bof>, féadfaidh, an, coimisiún, gníomhartha,..."
3,22757,<bof> adapting the production and output of pr...,<bof> cinnfidh na ballstáit an tuaslíon agus a...,35,18,"[<bof>, adapting, the, production, and, output...","[<bof>, cinnfidh, na, ballstáit, an, tuaslíon,..."
4,111506,<bof> the possibility of becoming a member the...,<bof> ba é an chonclúid a baineadh as an meast...,34,16,"[<bof>, the, possibility, of, becoming, a, mem...","[<bof>, ba, é, an, chonclúid, a, baineadh, as,..."
...,...,...,...,...,...,...,...
11995,147034,<bof> steered axles number and position <eos>,<bof> cuirfidh an tiarratasóir isteach ráiteas...,39,5,"[<bof>, steered, axles, number, and, position,...","[<bof>, cuirfidh, an, tiarratasóir, isteach, r..."
11996,105707,<bof> active implantable devices and if approp...,<bof> go háirithe tabharfar aird ar an tsábhái...,34,36,"[<bof>, active, implantable, devices, and, if,...","[<bof>, go, háirithe, tabharfar, aird, ar, an,..."
11997,72904,<bof> the supervisory authorities concerned sh...,<bof> i gcás ina mbeidh bunaíochtaí ag an rial...,58,28,"[<bof>, the, supervisory, authorities, concern...","[<bof>, i, gcás, ina, mbeidh, bunaíochtaí, ag,..."
11998,157368,<bof> when the information referred to in arti...,<bof> más rud é tar éis an chomhairliúcháin si...,119,48,"[<bof>, when, the, information, referred, to, ...","[<bof>, más, rud, é, tar, éis, an, chomhairliú..."


In [13]:
preprocess.create_dict()

Created dictonaries for indexing unique words and value counts


In [14]:
preprocess.print_info()

Statistics on dataset


Number of samples: 12000
Unique words in english set: 12173
Unique words in irish set: 22933
Max sequence length for source language : 307
Max sequence length for target language : 301


In [15]:
preprocess.encode_sentence()

In [16]:
from tensorflow.keras.utils import pad_sequences
import numpy as np
def pad_features(english_tokens, irish_tokens):

    source = pad_sequences(english_tokens, maxlen = 310, padding = 'post', truncating = 'post', value = 0)
    target = pad_sequences(irish_tokens, maxlen = 310, padding = 'post', truncating = 'post', value = 0)
        
        
    return source, target
        

In [17]:
data_source, data_target = pad_features(preprocess.encoded_sentence_english, preprocess.encoded_sentence_irish)

In [18]:
print(f"Shapes of train source {data_source.shape}, and target {data_target.shape}")

Shapes of train source (12000, 310), and target (12000, 310)


In [19]:
X_train = data_source[0:8000]
y_train = data_target[0:8000]

X_test = data_source[8000:10000]
y_test = data_target[8000:10000]

X_val = data_source[10000:]
y_val = data_target[10000:]

In [20]:
from torch.utils.data import DataLoader, TensorDataset
import torch

train_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(X_train),
        torch.LongTensor(y_train)
    ),
    shuffle = True,
    batch_size = 64
)

val_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(X_val),
        torch.LongTensor(y_val)
    ),
    shuffle = False,
    batch_size = 64
)

test_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(X_test),
        torch.LongTensor(y_test)
    ),
    shuffle = False,
    batch_size = 64
)

In [21]:
print("Train Data : "+str(len(X_train)) + "   "+ str(len(y_train)))
print("Test Data : "+str(len(X_test)) + "   "+ str(len(y_test)))
print("Validation Data : "+str(len(X_val)) + "   "+ str(len(y_val)))

Train Data : 8000   8000
Test Data : 2000   2000
Validation Data : 2000   2000


In [22]:
for batch in train_dl:
    print( batch[0].shape, batch[1].shape )
    break

torch.Size([64, 310]) torch.Size([64, 310])


## Task 2. Model Implementation and Training (30 pts)



## Task 2a. Encoder-Decoder Model Implementation (10 pts)
Implement an Encoder-Decoder model in Pytorch with the following components
* A single layer RNN based encoder. 
* A single layer RNN based decoder
* A Encoder-Decoder model based on the above components that support sequence-to-sequence modelling. For the encoder/decoder you can use RNN, LSTMs or GRU. Use a hidden dimension of 256 or less depending on your compute constraints. 

In [23]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
# Your code here

class Encoder(nn.Module):
    def __init__(self, input_vocab_size, hidden_dim, encoder_hid_dim, decoder_hid_dim , prob):
        super(Encoder, self).__init__()
        
        self.embedding = nn.Embedding(input_vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, encoder_hid_dim, bidirectional = True)
        self.fc = nn.Linear(encoder_hid_dim * 2, decoder_hid_dim)
        self.dropout = nn.Dropout(prob)
    
    def forward(self, src):
        embedding = self.dropout(self.embedding(src))
        
        outputs, hidden = self.rnn(embedding)
        
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2, :,:], hidden[-1,:,:]), dim = 1)))
        return outputs, hidden
    
    
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2)))
        
        attention = self.v(energy).squeeze(2)
        
        return F.softmax(attention, dim = 1)
    
    
    
class Decoder(nn.Module):
    def __init__(self, target_vocab_size, hidden_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        
        self.output_dim = target_vocab_size
        self.attention = Attention(enc_hid_dim, dec_hid_dim)
        
        self.embedding = nn.Embedding(target_vocab_size, hidden_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + hidden_dim, dec_hid_dim)
        
        self.fc_out = nn.Linear(
                (enc_hid_dim * 2) + dec_hid_dim + hidden_dim,
                target_vocab_size
        )
        
        self.dropout = nn.Dropout(dropout)
        
        
    def forward(self, input, hidden, encoder_outputs):
        input = input.unsqueeze(0)
        
        embedding = self.dropout(self.embedding(input))
        
        a = self.attention(hidden, encoder_outputs)
        a = a.unsqueeze(1)
        
        weighted = torch.bmm(a, encoder_outputs)
        weighted = weighted.permute(1, 0, 2)
        
        rnn_input = torch.cat((embedding, weighted), dim = 2)
        
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        embedding = embedding.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediciton = self.fc_out(torch.cat((output, weighted, embedding), dim = 1))
        
        return predicitons, hidden.squeeze(0)
        

class Model(nn.Module):
    def __init__(self, encoder, decoder):
        super(Model, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        batch_size = src.shape[1]
        target_len = trg.shape[0]
        
        target_vocab_size = self.decoder.output_dim
        
        outputs = torch.zeros(target_len, batch_size, target_vocab_size)
        
        encoder_outputs, hidden = self.encoder(src)
        
        input = trg[0, :]
    
        for t in range(1, target_len):
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            outputs[t] = output
            
            best = outputs.argmax(1)
            
            x = target[1] if random.random() < teacher_forcing_ratio else best
        return outputs


## Task 2b. Training (10 pts)
Implement the code to train the Encoder-Decoder model on the Irish-English data. You will write code for the following:
* Training, validation and test dataloaders 
* A training loop which trains the model for 5 epoch. Evaluate the loop at the end of each Epoch. Print out the train perplexity and validation perplexity after each epoch.

In [24]:
INPUT_DIM = len(preprocess.english_unique_words)
OUTPUT_DIM = len(preprocess.irish_unique_words)

ENC_EMB_DIM = 256
DEC_EMB_DIM = 256

ENC_HID_DIM = 128
DEC_HID_DIM = 128

ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT)

In [25]:
model = Model(enc, dec)
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Model(
  (encoder): Encoder(
    (embedding): Embedding(12173, 256)
    (rnn): GRU(256, 128, bidirectional=True)
    (fc): Linear(in_features=256, out_features=128, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=384, out_features=128, bias=True)
      (v): Linear(in_features=128, out_features=1, bias=False)
    )
    (embedding): Embedding(22933, 256)
    (rnn): GRU(512, 128)
    (fc_out): Linear(in_features=640, out_features=22933, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [26]:
optimizer = torch.optim.Adam(model.parameters())

EPOCHS = 10
best_val_loss = float('inf')

for epoch in range(EPOCHS):
    model.train()
    epoch_loss = 0
    
    for batch in train_dl:
        src = batch[0].transpose(1 , 0)
        trg = batch[1].transpose(1 , 0)
        #trg2 = batch[1].transpose(0 , 1)
        print(src.shape," :1")
        print(trg.shape," :2")
        optimizer.zero_grad()
        output = model(src, trg)
        
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].reshape(-1)
        
        loss = F.cross_entropy(output, trg)
        loss.backward()
        
        torch.nn.utils.clip_grad_norm(model.parameters(), 1)
        optimizer.step()
        epoch_loss += loss.item()
    
    train_loss = round(epoch_loss/ len(train_dl), 3)
    
    eval_loss = 0
    
    model.eval()
    
    for batch in val_dl:
        src = batch[0].transpose(1, 0)
        trg = batch[1].transpose(1, 0)
        
        with torch.no_grad():
            output = model(src, trg)
            
            output_dim = output.shape[-1]
            output = output[-1].view(-1, output_dim)
            trg = trg[1:].reshape(-1)
            
            loss = F.cross_entropy(output, trg)
            
            eval_loss += loss.item()
    val_loss = round(eval_loss / len(val_dl), 3)
    print(f'Epoch {epoch} | train_loss : {train_loss} | train ppl : {np.exp(train_loss)} | val ppl : {np.exp(val_loss)}')
    
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'nest_model.pt')
        

torch.Size([310, 64])  :1
torch.Size([310, 64])  :2


RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [64, 310] but got: [310, 64].

In [None]:
for epoch in range(0, epochs):
    
    model.train()
    
    for batch in train_dl:
        src = batch[0].transpose(0, 1)
        trg = batch[1].transpose(0, 1)
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].reshape(-1)
        
        loss = F.cross_entropy(output, trg)
        
        eval_loss += loss.item()
    val_loss = round(eval_loss / len(val_dl), 3)
    print(f"Epoch {epoch} | train loss {train_loss} | train ppl {np.exp(train_loss)} | val ppl {np.exp(val_loss)}")
        
        

# Task 2c. Evaluation on the Test Set (10 pts)
Use the trained model to translate the text from the source language into the target language on the test set. Evaluate the performance of the model on the test set using the BLEU metric and print out the average the BLEU score.

In [None]:
# Your code here

## Task 3. Improving NMT using Attention (10 pts) 
Extend the Encoder-Decoder model from Task 2 with the attention mechanism. Retrain the model and evaluate on test set. Print the updated average BLEU score on the test set. In a few sentences explains which model is the best for translation. 

In [None]:
import torch 
import torch.nn as nn 
import torch.nn.functional as F

class EncoderGRU(nn.Module):
    def __init__(
        self, 
        input_vocab_size,  # size of source vocabulary  
        hidden_dim,        # hidden dimension of embeddings
        encoder_hid_dim,   # gru hidden dim
        decoder_hid_dim,   # decoder hidden dim 
        dropout_prob = .5
      ):
      
        super().__init__()
        self.embedding = nn.Embedding(input_vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, encoder_hid_dim, bidirectional = True)
        self.fc = nn.Linear(encoder_hid_dim * 2, decoder_hid_dim)
        self.dropout = nn.Dropout(dropout_prob)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        outputs, hidden = self.rnn(embedded)
                
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards GRU
        #hidden [-1, :, : ] is the last of the backwards GRU
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))        
        return outputs, hidden

In [None]:
class Attention(nn.Module):
    def __init__(
        self, 
        enc_hid_dim,      # Encoder hidden dimension
        dec_hid_dim       # Decoder hidden dimension 
      ):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]
        attention = self.v(energy).squeeze(2)
        
        #attention output: [batch size, src len]
        return F.softmax(attention, dim=1)

In [None]:
class DecoderGRU(nn.Module):
    def __init__(
        self, 
        target_vocab_size,    # Size of target vocab 
        hidden_dim,           # hidden size of embedding  
        enc_hid_dim, 
        dec_hid_dim, 
        dropout
      ):
        super().__init__()

        self.output_dim = target_vocab_size
        self.attention = Attention(enc_hid_dim, dec_hid_dim)
        
        self.embedding = nn.Embedding(target_vocab_size, hidden_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 2) + hidden_dim, dec_hid_dim)
        
        self.fc_out = nn.Linear(
            (enc_hid_dim * 2) + dec_hid_dim + hidden_dim, 
            target_vocab_size
          )
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)  # [1, batch size]
        
        embedded = self.dropout(self.embedding(input))  # [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs)     # [batch size, src len]
        a = a.unsqueeze(1)                              # [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2) # [batch size, src len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)           # [batch size, 1, enc hid dim * 2]
        weighted = weighted.permute(1, 0, 2)               # [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2) # [1, batch size, (enc hid dim * 2) + emb dim]

        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]    
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1)) # [batch size, output dim]
        return prediction, hidden.squeeze(0)

In [None]:
import random 
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time     
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):     
            #insert input token embedding, previous hidden state and all encoder hidden states
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        return outputs

In [None]:
INPUT_DIM = len(preprocess.english_unique_words)
OUTPUT_DIM = len(preprocess.irish_unique_words)
ENC_EMB_DIM = 128
DEC_EMB_DIM = 128
ENC_HID_DIM = 64
DEC_HID_DIM = 64
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = EncoderGRU(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = DecoderGRU(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT)

model = EncoderDecoder(enc, dec)

def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

In [None]:
import numpy as np 
from tqdm.notebook import tqdm

optimizer = torch.optim.Adam(model.parameters())

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model.to(device)

EPOCHS = 5
best_val_loss = float('inf')

for epoch in range(EPOCHS):

  model.train()
  epoch_loss = 0
  for batch in tqdm(train_dl):
     src = batch[0].transpose(1, 0).to(device)
     trg = batch[1].transpose(1, 0).to(device)
     optimizer.zero_grad()

     output = model(src, trg)

     output_dim = output.shape[-1]
     output = output[1:].view(-1, output_dim).to(device)
     trg = trg[1:].reshape(-1)
     
     loss = F.cross_entropy(output, trg)
     loss.backward()

     torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
     optimizer.step()
     epoch_loss += loss.item()

  train_loss = round(epoch_loss / len(train_dl), 3)
  
  eval_loss = 0
  model.eval()
  for batch in tqdm(val_dl):
    src = batch[0].transpose(1, 0).to(device)
    trg = batch[1].transpose(1, 0).to(device)

    with torch.no_grad():
      output = model(src, trg)
      
      output_dim = output.shape[-1]
      output = output[1:].view(-1, output_dim).to(device)
      trg = trg[1:].reshape(-1)
      
      loss = F.cross_entropy(output, trg)
      
      eval_loss += loss.item()
  
  val_loss = round(eval_loss / len(val_dl), 3)
  print(f"Epoch {epoch} | train loss {train_loss} | train ppl {np.exp(train_loss)} | val ppl {np.exp(val_loss)}")


  if val_loss < best_val_loss:
    best_val_loss = val_loss
    torch.save(model.state_dict(), 'best-model.pt')  
  