# Neural Machine Translation (NMT) System: Transformer Architecture

## Executive Summary
This project explores the implementation of a Neural Machine Translation (NMT) system designed to translate English sentences into French. It specifically focuses on the application of the Transformer architecture, utilizing mechanisms like Self-Attention and Positional Encoding to overcome the limitations of sequential processing found in traditional RNNs.

## Research Objectives
1.  **Transformer Efficacy Analysis**: To evaluate the performance of Attention-based architectures in handling sequence translation tasks.
2.  **Transformer Pipeline**: To construct a robust pipeline involving data preprocessing, vocabulary handling, and tensor batching tailored for attention-based models.
3.  **Performance Quantification**: To utilize standard BLEU (Bilingual Evaluation Understudy) scores to rigorously assess translation quality.

## Methodology

### 1. Data Engineering
*   **Corpus**: English-French sentence pairs (Anki/ManyThings).
*   **Preprocessing**: Implementation of a Unicode-to-ASCII normalization pipeline, regex cleaning, and sequence padding.
*   **Vectorization**: Custom `DataLoader` and `Vocab_Lang` classes for efficient tensor mapping.

### 2. Model Architecture
*   **Transformer Framework**: Implementation of the "Attention Is All You Need" architecture, completely dispensing with recurrence and convolutions.
*   **Self-Attention**: Utilization of Multi-Head Attention mechanisms to model dependencies between words regardless of their distance in the sentence.
*   **Positional Encodings**: Injection of information about the relative or absolute position of tokens in the sequence.

### 3. Training & Evaluation
*   **Optimization**: Adam optimizer with Cross-Entropy Loss.
*   **Metrics**: Evaluation performed on unseen test data using BLEU-1 through BLEU-4 scores to measure n-gram overlap between predicted and reference translations.

---
*This notebook serves as a focused study on Transformer-based NMT methodologies, representing the state-of-the-art in sequence modeling.*

# Neural Machine Translation

Translation of sentences from source to target language

In [1]:
# run this code when running the code on Google Colab
# from google.colab import drive
# drive.mount('/content/drive')
# import sys
# sys.path.insert(0,'/content/drive/MyDrive/Colab Notebooks/NMT')

## 1. Importing Libraries

In [1]:
# Importing the required libraries
import pandas as pd
import numpy as np
import unicodedata
import re
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import torch
import random
import os
import math

# Importing the required libraries for model training and evalutation
import torch.nn as nn
import torch.nn.functional as F
import time
from tqdm.notebook import tqdm
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction, corpus_bleu

## 2. Downloaing Data

Downloading the dataset and arranging the source and target sentences in a dataframe

In [2]:
# downloading the data
if __name__ == '__main__':
    os.system("wget http://www.manythings.org/anki/fra-eng.zip")
    os.system("unzip -o fra-eng.zip")

# arranding the data in a dataframe
if __name__ == '__main__':
    lines = open('fra.txt', encoding='UTF-8').read().strip().split('\n')
    total_num_examples = 50000 
    original_word_pairs = [[w for w in l.split('\t')][:2] for l in lines[:total_num_examples]]
    random.shuffle(original_word_pairs)

    dat = pd.DataFrame(original_word_pairs, columns=['eng', 'fra'])
    print(dat) # Visualize the data

--2026-01-13 13:44:13--  http://www.manythings.org/anki/fra-eng.zip
Resolving www.manythings.org (www.manythings.org)... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:80... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8186368 (7.8M) [application/zip]
Saving to: ‘fra-eng.zip.3’

     0K .......... .......... .......... .......... ..........  0%  690K 12s
    50K .......... .......... .......... .......... ..........  1% 1.50M 8s
   100K .......... .......... .......... .......... ..........  1% 11.6M 6s
   150K .......... .......... .......... .......... ..........  2% 50.7M 4s
   200K .......... .connected.
HTTP request sent, awaiting response... 200 OK
Length: 8186368 (7.8M) [application/zip]
Saving to: ‘fra-eng.zip.3’

     0K .......... .......... .......... .......... ..........  0%  690K 12s
    50K .......... .......... .......... 

Archive:  fra-eng.zip
  inflating: _about.txt              
  inflating: fra.txt                 
                         eng                              fra
0      How many do you need?        Tu as besoin de combien ?
1             I almost fell.              J'ai failli tomber.
2          What's the point?                     À quoi bon ?
3       This is my father's.                C’est à mon père.
4        She has no manners.  Elle est dépourvue de manières.
...                      ...                              ...
49995               Get out!               Vers l'extérieur !
49996     Give me your hand.               Donne-moi la main.
49997     I don't feel sick.        Je ne me sens pas malade.
49998      You're very wise.           Vous êtes très avisés.
49999  Don't talk like that.           Ne parle pas comme ça.

[50000 rows x 2 columns]

                         eng                              fra
0      How many do you need?        Tu as besoin de combien ?
1      

## 3. Preprocessing the data


In [3]:
# Converts the unicode file to ascii
def unicode_to_ascii(s):
    """Normalizes latin chars with accent to their canonical decomposition"""
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
    '''
    Preprocess the sentence to add the start, end tokens and make them lower-case
    '''
    w = unicode_to_ascii(w.lower().strip())
    w = re.sub(r'([?.!,¿])', r' \1 ', w)
    w = re.sub(r'[" "]+', ' ', w)

    w = re.sub(r'[^a-zA-Z?.!,¿]+', ' ', w)
    
    w = w.rstrip().strip()
    w = '<start> ' + w + ' <end>'
    return w

if __name__ == '__main__':
    data = dat.copy()
    data['eng'] = dat.eng.apply(lambda w: preprocess_sentence(w))
    data['fra'] = dat.fra.apply(lambda w: preprocess_sentence(w))
    print(data) # Visualizing the data

                                        eng  \
0      <start> how many do you need ? <end>   
1             <start> i almost fell . <end>   
2          <start> what s the point ? <end>   
3       <start> this is my father s . <end>   
4        <start> she has no manners . <end>   
...                                     ...   
49995               <start> get out ! <end>   
49996     <start> give me your hand . <end>   
49997     <start> i don t feel sick . <end>   
49998      <start> you re very wise . <end>   
49999  <start> don t talk like that . <end>   

                                                  fra  
0             <start> tu as besoin de combien ? <end>  
1                  <start> j ai failli tomber . <end>  
2                          <start> a quoi bon ? <end>  
3                    <start> c est a mon pere . <end>  
4      <start> elle est depourvue de manieres . <end>  
...                                               ...  
49995                <start> vers l exterie

## 4. Building Vocabulary

Arranging the vocabulary of words from the source and target languages in a list

In [4]:
def build_vocabulary(pd_dataframe):
    '''
    Creating a list to store words forming the vocabulary of a chosen language
    '''
    sentences = [sen.split() for sen in pd_dataframe]
    vocabulary = {}
    for sent in sentences:
        for word in sent:
            if word not in vocabulary:
                vocabulary[word] = 1
    return list(vocabulary.keys())

if __name__ == '__main__':
    src_vocab_list = build_vocabulary(data['eng'])
    trg_vocab_list = build_vocabulary(data['fra'])

print("The source vocabulary is: ", src_vocab_list)
print("The target vocabulary is: ", trg_vocab_list)

The target vocabulary is:  ['<start>', 'tu', 'as', 'besoin', 'de', 'combien', '?', '<end>', 'j', 'ai', 'failli', 'tomber', '.', 'a', 'quoi', 'bon', 'c', 'est', 'mon', 'pere', 'elle', 'depourvue', 'manieres', 'y', 'suis', 'oppose', 'comment', 'les', 'veux', 'je', 'ne', 'peux', 'pas', 'leur', 'en', 'vouloir', 'mes', 'pieds', 'sont', 'douloureux', 'aime', 'skier', 'pourquoi', 'tom', 'demissionne', 't', 'il', 'etait', 'riche', 'elles', 'm', 'ont', 'ignore', 'depasse', 'sortons', 'moi', 'qui', 'eu', 'l', 'idee', 'vous', 'serez', 'pret', 'nous', 'sommes', 'partis', 'train', 'ils', 'aimaient', 'dois', 'partir', 'maintenant', 'ce', 'n', 'encore', 'termine', 'remonte', 'dans', 'ma', 'voiture', 'poursuis', 'et', 'parle', 'etes', 'egoiste', 'arrete', 'geindre', 'qu', 'avez', 'ecrit', 'dit', 'venir', 'seul', 'derriere', 'tout', 'ca', 'un', 'emploi', 'le', 'proprietaire', 'perdu', 'espoir', 'bateau', 'petit', 'expliquer', 'cela', 'fit', 'une', 'vanne', 'faites', 'attention', '!', 'que', 'cette', 't

## 5. Instantiating the training and target data set

1. **Vocabulary Class** -  A separate class has been created for vocabulary. With this, the vocabulary list for each language can be instantiated into a data structure which stores the words from the vocabulary accompanied with a mapping of these words with numbers serving as their indices. This numbers can be used in the training process by the model.

2. **DataLoader Class** - Each sentence is stored as a list of words. The DataLoader will instantiate this into a long tensor.



In [5]:
# vocabulary class
class Vocab_Lang():
    def __init__(self, vocab):
        self.word2idx = {'<pad>': 0, '<unk>': 1}
        self.idx2word = {0: '<pad>', 1: '<unk>'}
        self.vocab = vocab
        
        for index, word in enumerate(vocab):
            self.word2idx[word] = index + 2 # +2 because of <pad> and <unk> token
            self.idx2word[index + 2] = word
    
    def __len__(self):
        return len(self.word2idx)

# data loader class
class MyData(Dataset):
    def __init__(self, X, y):
        self.length = torch.LongTensor([np.sum(1 - np.equal(x, 0)) for x in X])
        self.data = torch.LongTensor(X)
        self.target = torch.LongTensor(y)
    
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        return x, y

    def __len__(self):
        return len(self.data)

In [7]:
def pad_sequences(x, max_len):
    """ 
    Adding padding to sentences of length smaller than the maximum sentence length
    """
    padded = np.zeros((max_len), dtype=np.int64)
    if len(x) > max_len:
        padded[:] = x[:max_len]
    else:
        padded[:len(x)] = x
    return padded


def preprocess_data_to_tensor(dataframe, src_vocab, trg_vocab):
    # Vectorize the input and target languages
    src_tensor = [[src_vocab.word2idx[s if s in src_vocab.vocab else '<unk>'] for s in eng.split(' ')] for eng in dataframe['eng'].values.tolist()]
    trg_tensor = [[trg_vocab.word2idx[s if s in trg_vocab.vocab else '<unk>'] for s in fra.split(' ')] for fra in dataframe['fra'].values.tolist()]

    # Calculate the max_length of input and output tensor for padding
    max_length_src, max_length_trg = max(len(t) for t in src_tensor), max(len(t) for t in trg_tensor)
    print('max_length_src: {}, max_length_trg: {}'.format(max_length_src, max_length_trg))

    # Pad all the sentences in the dataset with the max_length
    src_tensor = [pad_sequences(x, max_length_src) for x in src_tensor]
    trg_tensor = [pad_sequences(x, max_length_trg) for x in trg_tensor]

    return src_tensor, trg_tensor, max_length_src, max_length_trg


def train_test_split(src_tensor, trg_tensor):
    '''
    Create training and test sets.
    '''
    total_num_examples = len(src_tensor) - int(0.2*len(src_tensor))
    src_tensor_train, src_tensor_test = src_tensor[:int(0.75*total_num_examples)], src_tensor[int(0.75*total_num_examples):total_num_examples]
    trg_tensor_train, trg_tensor_test = trg_tensor[:int(0.75*total_num_examples)], trg_tensor[int(0.75*total_num_examples):total_num_examples]

    return src_tensor_train, src_tensor_test, trg_tensor_train, trg_tensor_test

The sentences from the source and target language are stored in form of tensors. The data is split into training and testing data set separately. By setting the appropriate hyperparamters of embedding and batch size the tensor data in smapled into batched using the DataLoader module from Pytorch

In [8]:
if __name__ == '__main__':

    # HYPERPARAMETERS 
    BATCH_SIZE = 64
    EMBEDDING_DIM = 256

    src_vocab = Vocab_Lang(src_vocab_list)
    trg_vocab = Vocab_Lang(trg_vocab_list)

    src_tensor, trg_tensor, max_length_src, max_length_trg = preprocess_data_to_tensor(data, src_vocab, trg_vocab)
    src_tensor_train, src_tensor_val, trg_tensor_train, trg_tensor_val = train_test_split(src_tensor, trg_tensor)
    # Create train and val datasets
    train_dataset = MyData(src_tensor_train, trg_tensor_train)
    train_dataset = DataLoader(train_dataset, batch_size=BATCH_SIZE, drop_last=True, shuffle=True)
    
    test_dataset = MyData(src_tensor_val, trg_tensor_val)
    test_dataset = DataLoader(test_dataset, batch_size=BATCH_SIZE, drop_last=True, shuffle=False)

max_length_src: 10, max_length_trg: 17


  self.data = torch.LongTensor(X)


In [9]:
if __name__ == '__main__':
    idxes = random.choices(range(len(train_dataset.dataset)), k=5)
    src, trg =  train_dataset.dataset[idxes]
    print('Source:', src)
    print('Source Dimensions: ', src.size())
    print('Target:', trg)
    print('Target Dimensions: ', trg.size())

Source: tensor([[   2,   16,  206,   19, 2927,   13,    9,    0,    0,    0],
        [   2,  469,   54,   13,    9,    0,    0,    0,    0,    0],
        [   2,    6, 2590,   47,   13,    9,    0,    0,    0,    0],
        [   2,   48,   23,   89, 1207,   13,    9,    0,    0,    0],
        [   2,  113,   15,   55,   10,   88,   13,    9,    0,    0]])
Source Dimensions:  torch.Size([5, 10])
Target: tensor([[   2,  126,  230,   19, 4274,   14,    9,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0],
        [   2, 1763,   14,    9,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0],
        [   2,    3,   52,    4,  368,   14,    9,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0],
        [   2,   48,   15,  395,   15,  126, 1460,   14,    9,    0,    0,    0,
            0,    0,    0,    0,    0],
        [   2,   18,   19,   96,   74,  114,   10,   11,   14,    9,    0,    0,
            0,    0,    0,  

## 6. Model training
Now we will train a Transformer-based Encoder and Decoder model for learning the translation from the source to target language. We will import the encoder and decoder models and use them in the training process

In [10]:
loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.empty(3, dtype=torch.long).random_(5)
output = loss(input, target)
output.backward()
output

tensor(2.5283, grad_fn=<NllLossBackward0>)

In [11]:
import transformerencoder
import transformerdecoder

def train_transformer_model(encoder, decoder, dataset, optimizer, device, n_epochs):
    """ Model training for machine translation
    """
    encoder.train()
    decoder.train()
    criterion = nn.CrossEntropyLoss(ignore_index=0)
    for epoch in range(n_epochs):
        start = time.time()
        losses = []

        for src, trg in tqdm(train_dataset):
            
            src = src.to(device).transpose(0,1) # [max_src_length, batch_size]
            trg = trg.to(device).transpose(0,1) # [max_trg_length, batch_size]

            enc_out = encoder(src)
            output = decoder(trg[:-1, :], enc_out)

            output = output.reshape(-1, output.shape[2])
            trg = trg[1:].reshape(-1)

            optimizer.zero_grad()

            loss = criterion(output, trg)
            losses.append(loss.item())

            loss.backward()

            # Clip to avoid exploding grading issues
            torch.nn.utils.clip_grad_norm_(encoder.parameters(), max_norm=1)
            torch.nn.utils.clip_grad_norm_(decoder.parameters(), max_norm=1)

            optimizer.step()

        mean_loss = sum(losses) / len(losses)
        print('Epoch:{:2d}/{}\t Loss:{:.4f} ({:.2f}s)'.format(epoch + 1, n_epochs, mean_loss, time.time() - start))


In [12]:
if __name__ == '__main__':
    # HYPERPARAMETERS
    LEARNING_RATE = 0.001
    DIM_FEEDFORWARD=512
    N_EPOCHS=5
    N_HEADS=2
    N_LAYERS=2

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # transformer encoder model
    transformer_encoder = transformerencoder.TransformerEncoder(src_vocab, EMBEDDING_DIM, N_HEADS, 
                                 N_LAYERS,DIM_FEEDFORWARD,
                                 max_length_src, device).to(device)
    # transformer decoder model
    transformer_decoder = transformerdecoder.TransformerDecoder(trg_vocab, EMBEDDING_DIM, N_HEADS, 
                              N_LAYERS,DIM_FEEDFORWARD,
                              max_length_trg, device).to(device)

    transformer_model_params = list(transformer_encoder.parameters()) + list(transformer_decoder.parameters())
    optimizer = torch.optim.Adam(transformer_model_params, lr=LEARNING_RATE)

    print('Encoder and Decoder models have been initialized!')



Encoder and Decoder models have been initialized!


In [13]:
if __name__ == '__main__':
    train_transformer_model(transformer_encoder, transformer_decoder, train_dataset, optimizer, device, N_EPOCHS)

  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 1/5	 Loss:3.1603 (82.30s)


  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 2/5	 Loss:2.1568 (69.74s)


  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 3/5	 Loss:1.7017 (72.89s)


  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 4/5	 Loss:1.3878 (86.05s)


  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 5/5	 Loss:1.1787 (79.10s)


# 7. Decoding output

Decoding the output to predict the sentences for unseen source sentences

In [14]:
import decodingalgorithm
if __name__ == '__main__':
    transformer_encoder.eval()
    transformer_decoder.eval()
    idxes = random.choices(range(len(test_dataset.dataset)), k=5)
    src, trg =  train_dataset.dataset[idxes]
    curr_output, _, _ = decodingalgorithm.decode_transformer_model(transformer_encoder, transformer_decoder, src.transpose(0,1).to(device), trg.size(1), device)
    for i in range(len(src)):
        print("Source sentence:", ' '.join([x for x in [src_vocab.idx2word[j.item()] for j in src[i]] if x != '<pad>']))
        print("Target sentence:", ' '.join([x for x in [trg_vocab.idx2word[j.item()] for j in trg[i]] if x != '<pad>']))
        print("Predicted sentence:", ' '.join([x for x in [trg_vocab.idx2word[j.item()] for j in curr_output[i]] if x != '<pad>']))
        print("----------------")

Source sentence: <start> i m stubborn . <end>
Target sentence: <start> je suis tetue . <end>
Predicted sentence: <start> je suis cense . <end> . <end> . <end> . <end> . <end> . <end> .
----------------
Source sentence: <start> tom surrendered . <end>
Target sentence: <start> tom abandonna . <end>
Predicted sentence: <start> tom a capitule . <end> . <end> . <end> . <end> . <end> . <end> .
----------------
Source sentence: <start> what was taken ? <end>
Target sentence: <start> qu est ce qui a ete pris ? <end>
Predicted sentence: <start> qu etait ce ? <end> ? <end> ? <end> ? <end> ? <end> ? <end> ?
----------------
Source sentence: <start> this is my city now . <end>
Target sentence: <start> c est ma ville desormais . <end>
Predicted sentence: <start> c est mon ville . <end> . <end> . <end> . <end> . <end> . <end>
----------------
Source sentence: <start> speak to me . <end>
Target sentence: <start> parle moi ! <end>
Predicted sentence: <start> parle moi ! <end> . <end> . <end> . <end> .

## Model Evaluation

Evaluation of the model is based on the blue score

In [15]:
def get_reference_candidate(target, pred, trg_vocab):
    def _to_token(sentence):
        lis = []
        for s in sentence[1:]:
            x = trg_vocab.idx2word[s]
            if x == "<end>": break
            lis.append(x)
        return lis
    reference = _to_token(list(target.numpy()))
    candidate = _to_token(list(pred.numpy()))
    return reference, candidate
    
def compute_bleu_scores(target_tensor_val, target_output, final_output, trg_vocab):
    bleu_1 = 0.0
    bleu_2 = 0.0
    bleu_3 = 0.0
    bleu_4 = 0.0

    smoother = SmoothingFunction()
    save_reference = []
    save_candidate = []
    for i in range(len(target_tensor_val)):
        reference, candidate = get_reference_candidate(target_output[i], final_output[i], trg_vocab)
    
        bleu_1 += sentence_bleu(reference, candidate, weights=(1,), smoothing_function=smoother.method1)
        bleu_2 += sentence_bleu(reference, candidate, weights=(1/2, 1/2), smoothing_function=smoother.method1)
        bleu_3 += sentence_bleu(reference, candidate, weights=(1/3, 1/3, 1/3), smoothing_function=smoother.method1)
        bleu_4 += sentence_bleu(reference, candidate, weights=(1/4, 1/4, 1/4, 1/4), smoothing_function=smoother.method1)

        save_reference.append(reference)
        save_candidate.append(candidate)
    
    bleu_1 = bleu_1/len(target_tensor_val)
    bleu_2 = bleu_2/len(target_tensor_val)
    bleu_3 = bleu_3/len(target_tensor_val)
    bleu_4 = bleu_4/len(target_tensor_val)

    scores = {"bleu_1": bleu_1, "bleu_2": bleu_2, "bleu_3": bleu_3, "bleu_4": bleu_4}
    print('BLEU 1-gram: %f' % (bleu_1))
    print('BLEU 2-gram: %f' % (bleu_2))
    print('BLEU 3-gram: %f' % (bleu_3))
    print('BLEU 4-gram: %f' % (bleu_4))

    return save_candidate, scores

def evaluate_model(encoder, decoder, test_dataset, target_tensor_val, device):
    trg_vocab = decoder.trg_vocab
    batch_size = test_dataset.batch_size
    n_batch = 0
    total_loss = 0

    encoder.eval()
    decoder.eval()
    criterion = nn.CrossEntropyLoss(ignore_index=0)

    losses=[]
    final_output, target_output = None, None

    with torch.no_grad():
        for batch, (src, trg) in enumerate(test_dataset):
            n_batch += 1
            loss = 0
            
            src, trg = src.transpose(0,1).to(device), trg.transpose(0,1).to(device)
            curr_output, curr_predictions, enc_out = decodingalgorithm.decode_transformer_model(encoder, decoder, src, trg.size(0), device)

            for t in range(1, trg.size(0)):
                output = decoder(trg[:-1, :], enc_out)
                output = output.reshape(-1, output.shape[2])
                loss_trg = trg[1:].reshape(-1)
                loss += criterion(output, loss_trg)
                # loss += criterion(curr_predictions[:,t,:].to(device), trg[t,:].reshape(-1).to(device))

            if final_output is None:
                final_output = torch.zeros((len(target_tensor_val), trg.size(0)))
                target_output = torch.zeros((len(target_tensor_val), trg.size(0)))

            final_output[batch*batch_size:(batch+1)*batch_size] = curr_output
            target_output[batch*batch_size:(batch+1)*batch_size] = trg.transpose(0,1)
            losses.append(loss.item() / (trg.size(0)-1))

        mean_loss = sum(losses) / len(losses)
        print('Loss {:.4f}'.format(mean_loss))
    
    # Compute Bleu scores
    return compute_bleu_scores(target_tensor_val, target_output, final_output, trg_vocab)

In [16]:
if __name__ == '__main__':
    transformer_save_candidate, transformer_scores = evaluate_model(transformer_encoder, transformer_decoder, test_dataset, trg_tensor_val, device)

Loss 1.6569
BLEU 1-gram: 0.247012
BLEU 2-gram: 0.075584
BLEU 3-gram: 0.057015
BLEU 4-gram: 0.055257
BLEU 1-gram: 0.247012
BLEU 2-gram: 0.075584
BLEU 3-gram: 0.057015
BLEU 4-gram: 0.055257


**Saving Transformer Encoder and Decoder Model**

In [18]:
if __name__=='__main__':
    # from google.colab import drive
    # drive.mount('/content/drive')
    if transformer_encoder is not None and transformer_decoder is not None:
        print("Saving Transformer model....") 
        torch.save(transformer_encoder, 'transformer_encoder.pt')
        torch.save(transformer_decoder, 'transformer_decoder.pt')
        print("Transformer model saved!")

Saving Transformer model....
Transformer model saved!
