# Neural Machine Translation (NMT) with Transformers
## Project Overview

This notebook presents a comprehensive implementation of a Neural Machine Translation system designed to translate English text into French. Moving beyond traditional Recurrent Neural Networks (RNNs), this project leverages the **Transformer** architecture, utilizing self-attention mechanisms to capture long-range dependencies and improve translation accuracy.

### Key Components & Methodology

1.  **Data Acquisition & Preprocessing**:
    *   Utilization of the English-French bilingual dataset from [ManyThings.org](http://www.manythings.org/anki/).
    *   Rigorous text normalization including Unicode-to-ASCII conversion, regex-based cleaning, and tokenization.
    *   Construction of source and target vocabularies and conversion of sentences into padded tensor sequences.

2.  **Model Architecture**:
    *   Implementation of a sequence-to-sequence **Transformer** model.
    *   Incorporates **Positional Embeddings** to retain sequence order information.
    *   Features modular **Encoder** and **Decoder** blocks defined in external modules (`transformerencoder`, `transformerdecoder`).

3.  **Training Regime**:
    *   Model optimization using **Adam** optimizer and **Cross Entropy Loss**.
    *   Implementation of gradient clipping for training stability.
    *   Validation loops with loss tracking.

4.  **Inference & Evaluation**:
    *   Custom decoding algorithm for generating target translations from the trained model.
    *   Quantitative evaluation using **BLEU scores** (1-gram through 4-gram) to assess linguistic precision and fluency against reference data.

---
*Note: This notebook is structured for research experimentation, allowing for hyperparameter tuning and modular component replacement.*

In [1]:
# run this code when running the code on Google Colab
# from google.colab import drive
# drive.mount('/content/drive')
# import sys
# sys.path.insert(0,'/content/drive/MyDrive/Colab Notebooks/NMT')

**1. Importing Libraries**

In [2]:
# Importing the required libraries
import pandas as pd
import numpy as np
import unicodedata
import re
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import torch
import random
import os
import math

# Importing libraries for model training and evalutation
import torch.nn as nn
import torch.nn.functional as F
import time
from tqdm.notebook import tqdm
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction, corpus_bleu

**2. Downloading the datasetand arranging in a dataframe**

In [3]:
if __name__ == '__main__':
    os.system("wget http://www.manythings.org/anki/fra-eng.zip")
    os.system("unzip -o fra-eng.zip")

--2026-01-13 12:25:57--  http://www.manythings.org/anki/fra-eng.zip
Resolving www.manythings.org (www.manythings.org)... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:80... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8186368 (7.8M) [application/zip]
Saving to: ‘fra-eng.zip’

     0K .......... .......... .......... .......... ..........  0%  832K 10s
    50K .......... .......... .......... .......... ..........  1% 1.62M 7s
   100K .......... .......... .......... .......... ..........  1% 4.64M 5s
   150K .......... .......... .......... .......... ..........  2% 4.23M 4s
   200K .......... .......... .......... .......... ..........  3% 3.33M 4s
   250K .......... .......... .......... .......... .connected.
HTTP request sent, awaiting response... 200 OK
Length: 8186368 (7.8M) [application/zip]
Saving to: ‘fra-eng.zip’

     0K .....

Archive:  fra-eng.zip
  inflating: _about.txt              
  inflating: fra.txt                 



In [4]:
if __name__ == '__main__':
    lines = open('fra.txt', encoding='UTF-8').read().strip().split('\n')
    total_num_examples = 50000 
    original_word_pairs = [[w for w in l.split('\t')][:2] for l in lines[:total_num_examples]]
    random.shuffle(original_word_pairs)

    dat = pd.DataFrame(original_word_pairs, columns=['eng', 'fra'])
    print(dat) # Visualize the data

                        eng                             fra
0          Throw it to Tom.                 Lance-le à Tom.
1             That's funny.                  C'est amusant.
2             We loved you.                 Nous t'aimions.
3      Tom, are you asleep?       Est-ce que tu dors, Tom ?
4             They're lazy.         Elles sont paresseuses.
...                     ...                             ...
49995    Don't go in there.        Ne rentre pas là-dedans.
49996     Whatever you say.               Comme tu voudras.
49997     I had to see you.              Je devais te voir.
49998       That's my wish.                C'est mon désir.
49999     Tom is a bad man.  Tom est une mauvaise personne.

[50000 rows x 2 columns]


**3. Preprocessing the data**

In [5]:
# Converting the unicode file to ascii
def unicode_to_ascii(s):
    """Normalizes latin chars with accent to their canonical decomposition"""
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
    '''
    Preprocess the sentence to add the start, end tokens and make them lower-case
    '''
    w = unicode_to_ascii(w.lower().strip())
    w = re.sub(r'([?.!,¿])', r' \1 ', w)
    w = re.sub(r'[" "]+', ' ', w)

    w = re.sub(r'[^a-zA-Z?.!,¿]+', ' ', w)
    
    w = w.rstrip().strip()
    w = '<start> ' + w + ' <end>'
    return w

if __name__ == '__main__':
    data = dat.copy()
    data['eng'] = dat.eng.apply(lambda w: preprocess_sentence(w))
    data['fra'] = dat.fra.apply(lambda w: preprocess_sentence(w))
    print(data) # Visualizing the data

                                        eng  \
0           <start> throw it to tom . <end>   
1              <start> that s funny . <end>   
2              <start> we loved you . <end>   
3      <start> tom , are you asleep ? <end>   
4              <start> they re lazy . <end>   
...                                     ...   
49995     <start> don t go in there . <end>   
49996      <start> whatever you say . <end>   
49997      <start> i had to see you . <end>   
49998        <start> that s my wish . <end>   
49999      <start> tom is a bad man . <end>   

                                                 fra  
0                     <start> lance le a tom . <end>  
1                      <start> c est amusant . <end>  
2                     <start> nous t aimions . <end>  
3           <start> est ce que tu dors , tom ? <end>  
4             <start> elles sont paresseuses . <end>  
...                                              ...  
49995        <start> ne rentre pas la dedans . <en

**4. Building the source and target vocabulary**

In [6]:
def build_vocabulary(pd_dataframe):
    sentences = [sen.split() for sen in pd_dataframe]
    vocab = {}
    for sen in sentences:
        for word in sen:
            if word not in vocab:
                vocab[word] = 1
    return list(vocab.keys())

if __name__ == '__main__':
    src_vocab_list = build_vocabulary(data['eng'])
    trg_vocab_list = build_vocabulary(data['fra'])

print("The source vocabulary is: ", src_vocab_list)
print("The target vocabulary is: ", trg_vocab_list)

The target vocabulary is:  ['<start>', 'lance', 'le', 'a', 'tom', '.', '<end>', 'c', 'est', 'amusant', 'nous', 't', 'aimions', 'ce', 'que', 'tu', 'dors', ',', '?', 'elles', 'sont', 'paresseuses', 'eut', 'un', 'sourire', 'force', 'fatiguees', 's', 'il', 'tue', 'n', 'pas', 'une', 'idee', 'nouvelle', 'bizarre', 'je', 'soupe', 'chien', 'eux', 'en', 'route', '!', 'acceptable', 'voulons', 'des', 'emplois', 'comment', 'ca', 'marche', 'beaucoup', 'trop', 'tot', 'y', 'personne', 'chez', 'moi', 'ne', 'me', 'sens', 'bien', 'arrete', 'de', 'harceler', 'ai', 'fait', 'telle', 'chose', 'peux', 'aider', 'aimes', 'les', 'hot', 'dogs', 'compte', 'pourrais', 'etre', 'utile', 'vas', 'prendre', 'froid', 'relais', 'mets', 'ton', 'chapeau', 'apres', 'demain', 'vous', 'avez', 'problemes', 'possede', 'cheval', 'guet', 'apens', 'ete', 'blesse', 'semble', 'touche', 'essayez', 'gagner', 'du', 'temps', 'beau', 'sais', 'm', 'apprecies', 'faire', 'non', 'plus', 'etes', 'saoules', 'triste', 'mais', 'la', 'verite', 'p

**5. Instantiating the training and target data set**

In [7]:
def pad_sequences(x, max_len):
    padded = np.zeros((max_len), dtype=np.int64)
    if len(x) > max_len:
        padded[:] = x[:max_len]
    else:
        padded[:len(x)] = x
    return padded


def preprocess_data_to_tensor(dataframe, src_vocab, trg_vocab):
    # Vectorize the input and target languages
    src_tensor = [[src_vocab.word2idx[s if s in src_vocab.vocab else '<unk>'] for s in eng.split(' ')] for eng in dataframe['eng'].values.tolist()]
    trg_tensor = [[trg_vocab.word2idx[s if s in trg_vocab.vocab else '<unk>'] for s in fra.split(' ')] for fra in dataframe['fra'].values.tolist()]

    # Calculate the max_length of input and output tensor for padding
    max_length_src, max_length_trg = max(len(t) for t in src_tensor), max(len(t) for t in trg_tensor)
    print('max_length_src: {}, max_length_trg: {}'.format(max_length_src, max_length_trg))

    # Pad all the sentences in the dataset with the max_length
    src_tensor = [pad_sequences(x, max_length_src) for x in src_tensor]
    trg_tensor = [pad_sequences(x, max_length_trg) for x in trg_tensor]

    return src_tensor, trg_tensor, max_length_src, max_length_trg


def train_test_split(src_tensor, trg_tensor):
    '''
    Create training and test sets.
    '''
    total_num_examples = len(src_tensor) - int(0.2*len(src_tensor))
    src_tensor_train, src_tensor_test = src_tensor[:int(0.75*total_num_examples)], src_tensor[int(0.75*total_num_examples):total_num_examples]
    trg_tensor_train, trg_tensor_test = trg_tensor[:int(0.75*total_num_examples)], trg_tensor[int(0.75*total_num_examples):total_num_examples]

    return src_tensor_train, src_tensor_test, trg_tensor_train, trg_tensor_test

In [8]:
# vocabulary class
class Vocab_Lang():
    def __init__(self, vocab):
        self.word2idx = {'<pad>': 0, '<unk>': 1}
        self.idx2word = {0: '<pad>', 1: '<unk>'}
        self.vocab = vocab
        
        for index, word in enumerate(vocab):
            self.word2idx[word] = index + 2 # +2 because of <pad> and <unk> token
            self.idx2word[index + 2] = word
    
    def __len__(self):
        return len(self.word2idx)

In [9]:
# data loader class
class MyData(Dataset):
    def __init__(self, X, y):
        self.length = torch.LongTensor([np.sum(1 - np.equal(x, 0)) for x in X])
        self.data = torch.LongTensor(X)
        self.target = torch.LongTensor(y)
    
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        return x, y

    def __len__(self):
        return len(self.data)

In [10]:
if __name__ == '__main__':

    # HYPERPARAMETERS 
    BATCH_SIZE = 64
    EMBEDDING_DIM = 256

    src_vocab = Vocab_Lang(src_vocab_list)
    trg_vocab = Vocab_Lang(trg_vocab_list)

    src_tensor, trg_tensor, max_length_src, max_length_trg = preprocess_data_to_tensor(data, src_vocab, trg_vocab)
    src_tensor_train, src_tensor_val, trg_tensor_train, trg_tensor_val = train_test_split(src_tensor, trg_tensor)

    # Create train and val datasets
    train_dataset = MyData(src_tensor_train, trg_tensor_train)
    train_dataset = DataLoader(train_dataset, batch_size=BATCH_SIZE, drop_last=True, shuffle=True)

    test_dataset = MyData(src_tensor_val, trg_tensor_val)
    test_dataset = DataLoader(test_dataset, batch_size=BATCH_SIZE, drop_last=True, shuffle=False)

max_length_src: 10, max_length_trg: 17


  self.data = torch.LongTensor(X)


In [None]:
### DO NOT EDIT ###

if __name__ == '__main__':
    idxes = random.choices(range(len(train_dataset.dataset)), k=5)
    src, trg =  train_dataset.dataset[idxes]
    print('Source:', src)
    print('Source Dimensions: ', src.size())
    print('Target:', trg)
    print('Target Dimensions: ', trg.size())

Source: tensor([[  2,   9,  62, 162,  11,   7,   8,   0,   0,   0,   0],
        [  2,   3,  84,  50, 953, 139,   7,   8,   0,   0,   0],
        [  2,   3,  89, 144, 477,   7,   8,   0,   0,   0,   0],
        [  2, 114,  91,  70,  21,  40,  14,   8,   0,   0,   0],
        [  2,  11,  38, 450,   7,   8,   0,   0,   0,   0,   0]])
Source Dimensions:  torch.Size([5, 11])
Target: tensor([[   2,   10,    9, 3900,    7,    8,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0],
        [   2,    3,   91,  174,  151, 1194,    7,    8,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0],
        [   2,   17,   19, 1034,    7,    8,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0],
        [   2,  367,   33,  204,   28,  120, 1313,   16,    8,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0],
        [   2,    9,   26,  534,    7,    8,    0,    0,    0,    0,    0,    0,
     

**6. Training a Transformer**

6.1 Positional Embedding

6.2 Encoder Model

In [11]:
import transformerencoder

6.3 Decoder Model

In [12]:
import transformerdecoder

6.4 Train the transformer model

In [None]:
### DO NOT EDIT ###

def train_transformer_model(encoder, decoder, dataset, optimizer, device, n_epochs):
    encoder.train()
    decoder.train()
    criterion = nn.CrossEntropyLoss(ignore_index=0)
    for epoch in range(n_epochs):
        start = time.time()
        losses = []

        for src, trg in tqdm(train_dataset):
            
            src = src.to(device).transpose(0,1) # [max_src_length, batch_size]
            trg = trg.to(device).transpose(0,1) # [max_trg_length, batch_size]

            enc_out = encoder(src)
            output = decoder(trg[:-1, :], enc_out)

            output = output.reshape(-1, output.shape[2])
            trg = trg[1:].reshape(-1)

            optimizer.zero_grad()

            loss = criterion(output, trg)
            losses.append(loss.item())

            loss.backward()

            # Clip to avoid exploding grading issues
            torch.nn.utils.clip_grad_norm_(encoder.parameters(), max_norm=1)
            torch.nn.utils.clip_grad_norm_(decoder.parameters(), max_norm=1)

            optimizer.step()

        mean_loss = sum(losses) / len(losses)
        print('Epoch:{:2d}/{}\t Loss:{:.4f} ({:.2f}s)'.format(epoch + 1, n_epochs, mean_loss, time.time() - start))


In [None]:
### DO NOT EDIT ###

if __name__ == '__main__':
    # HYPERPARAMETERS - feel free to change
    LEARNING_RATE = 0.001
    DIM_FEEDFORWARD=512
    N_EPOCHS=10
    N_HEADS=2
    N_LAYERS=2

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    transformer_encoder = transformerencoder.TransformerEncoder(src_vocab, EMBEDDING_DIM, N_HEADS, 
                                 N_LAYERS,DIM_FEEDFORWARD,
                                 max_length_src, device).to(device)
    transformer_decoder = transformerdecoder.TransformerDecoder(trg_vocab, EMBEDDING_DIM, N_HEADS, 
                              N_LAYERS,DIM_FEEDFORWARD,
                              max_length_trg, device).to(device)

    transformer_model_params = list(transformer_encoder.parameters()) + list(transformer_decoder.parameters())
    optimizer = torch.optim.Adam(transformer_model_params, lr=LEARNING_RATE)

    print('Encoder and Decoder models initialized!')

Encoder and Decoder models initialized!


In [None]:
### DO NOT EDIT ###

if __name__ == '__main__':
    train_transformer_model(transformer_encoder, transformer_decoder, train_dataset, optimizer, device, N_EPOCHS)

  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 1/10	 Loss:3.2582 (13.73s)


  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 2/10	 Loss:2.2359 (10.49s)


  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 3/10	 Loss:1.7742 (12.11s)


  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 4/10	 Loss:1.4530 (10.55s)


  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 5/10	 Loss:1.2306 (10.55s)


  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 6/10	 Loss:1.0770 (10.45s)


  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 7/10	 Loss:0.9645 (10.54s)


  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 8/10	 Loss:0.8881 (10.64s)


  0%|          | 0/468 [00:00<?, ?it/s]

Epoch: 9/10	 Loss:0.8284 (10.61s)


  0%|          | 0/468 [00:00<?, ?it/s]

Epoch:10/10	 Loss:0.7802 (10.72s)


6.4 Decoding Function

In [None]:
def decode_transformer_model(encoder, decoder, src, max_decode_len, device):
    """
    Args:
        encoder: Your TransformerEncoder object
        decoder: Your TransformerDecoder object
        src: [max_src_length, batch_size] the source sentences you wish to translate
        max_decode_len: The maximum desired length (int) of your target translated sentences
        device: the device your torch tensors are on (you may need to call x.to(device) for some of your tensors)

    Returns:
        curr_output: [batch_size, max_decode_len] containing your predicted translated sentences
        curr_predictions: [batch_size, max_decode_len, trg_vocab_size] containing the (unnormalized) probabilities of each
            token in your vocabulary at each time step

    Pseudo-code:
    - Obtain encoder output by encoding src sentences
    - For 1 ≤ t ≤ max_decode_len:
        - Obtain dec_input as the best words so far for previous time steps (you can get this from curr_output)
        - Obtain your (unnormalized) prediction probabilities by feeding dec_input and encoder output to decoder
        - Save your (unnormalized) prediction probabilities in curr_predictions at index t
        - Calculate the most likely (highest probability) token and save in curr_output at timestep t
    """
    # Initialize variables
    trg_vocab = decoder.trg_vocab
    batch_size = src.size(1)
    curr_output = torch.zeros((batch_size, max_decode_len))
    curr_predictions = torch.zeros((batch_size, max_decode_len, len(trg_vocab.idx2word)))
    enc_output = None

    # We start the decoding with the start token for each example
    dec_input = torch.tensor([[trg_vocab.word2idx['<start>']]] * batch_size).transpose(0,1)
    curr_output[:, 0] = dec_input.squeeze(1)
    ### TODO: Implement decoding algorithm ###
    enc_output = encoder.forward(src)
    for t in range(1, max_decode_len):
      next_token = curr_output[:,:t].to(torch.long)
      decoder_output = decoder.forward(next_token.transpose(0,1), enc_output)
      # print(decoder_output.permute(1, 0, 2).size())
      curr_predictions[:,t,:] = decoder_output.permute(1, 0, 2)[:,-1,:]
      # print(curr_predictions)
      decoder_output = torch.argmax(curr_predictions[:,t,:], dim = -1)

      curr_output[:, t] = decoder_output
    return curr_output, curr_predictions, enc_output

In [None]:
### DO NOT EDIT ###

if __name__ == '__main__':
    transformer_encoder.eval()
    transformer_decoder.eval()
    idxes = random.choices(range(len(test_dataset.dataset)), k=5)
    src, trg =  train_dataset.dataset[idxes]
    curr_output, _, _ = decode_transformer_model(transformer_encoder, transformer_decoder, src.transpose(0,1).to(device), trg.size(1), device)
    for i in range(len(src)):
        print("Source sentence:", ' '.join([x for x in [src_vocab.idx2word[j.item()] for j in src[i]] if x != '<pad>']))
        print("Target sentence:", ' '.join([x for x in [trg_vocab.idx2word[j.item()] for j in trg[i]] if x != '<pad>']))
        print("Predicted sentence:", ' '.join([x for x in [trg_vocab.idx2word[j.item()] for j in curr_output[i]] if x != '<pad>']))
        print("----------------")

Source sentence: <start> how was your trip ? <end>
Target sentence: <start> comment fut votre voyage ? <end>
Predicted sentence: <start> comment s est passee ta voyage ? <end> ? <end> ? <end> ? <end> ? <end> ? <end>
----------------
Source sentence: <start> she is hostile to me . <end>
Target sentence: <start> elle m est hostile . <end>
Predicted sentence: <start> elle est hostile . <end> . <end> . <end> . <end> . <end> . <end> . <end> .
----------------
Source sentence: <start> you re so sweet . <end>
Target sentence: <start> vous etes tellement gentilles ! <end>
Predicted sentence: <start> vous etes tellement gentille ! <end> ! <end> ! <end> ! <end> ! <end> ! <end> ! <end>
----------------
Source sentence: <start> i shivered . <end>
Target sentence: <start> j ai tremble . <end>
Predicted sentence: <start> je tremble . <end> . <end> . <end> . <end> . <end> . <end> . <end> . <end>
----------------
Source sentence: <start> calm down ! <end>
Target sentence: <start> du calme . <end>
Pred

**Model Evaluation**

Evaluation of the model is based on the blue score

In [None]:
### DO NOT EDIT ###
def get_reference_candidate(target, pred, trg_vocab):
    def _to_token(sentence):
        lis = []
        for s in sentence[1:]:
            x = trg_vocab.idx2word[s]
            if x == "<end>": break
            lis.append(x)
        return lis
    reference = _to_token(list(target.numpy()))
    candidate = _to_token(list(pred.numpy()))
    return reference, candidate
    
def compute_bleu_scores(target_tensor_val, target_output, final_output, trg_vocab):
    bleu_1 = 0.0
    bleu_2 = 0.0
    bleu_3 = 0.0
    bleu_4 = 0.0

    smoother = SmoothingFunction()
    save_reference = []
    save_candidate = []
    for i in range(len(target_tensor_val)):
        reference, candidate = get_reference_candidate(target_output[i], final_output[i], trg_vocab)
    
        bleu_1 += sentence_bleu(reference, candidate, weights=(1,), smoothing_function=smoother.method1)
        bleu_2 += sentence_bleu(reference, candidate, weights=(1/2, 1/2), smoothing_function=smoother.method1)
        bleu_3 += sentence_bleu(reference, candidate, weights=(1/3, 1/3, 1/3), smoothing_function=smoother.method1)
        bleu_4 += sentence_bleu(reference, candidate, weights=(1/4, 1/4, 1/4, 1/4), smoothing_function=smoother.method1)

        save_reference.append(reference)
        save_candidate.append(candidate)
    
    bleu_1 = bleu_1/len(target_tensor_val)
    bleu_2 = bleu_2/len(target_tensor_val)
    bleu_3 = bleu_3/len(target_tensor_val)
    bleu_4 = bleu_4/len(target_tensor_val)

    scores = {"bleu_1": bleu_1, "bleu_2": bleu_2, "bleu_3": bleu_3, "bleu_4": bleu_4}
    print('BLEU 1-gram: %f' % (bleu_1))
    print('BLEU 2-gram: %f' % (bleu_2))
    print('BLEU 3-gram: %f' % (bleu_3))
    print('BLEU 4-gram: %f' % (bleu_4))

    return save_candidate, scores

def evaluate_model(encoder, decoder, test_dataset, target_tensor_val, device):
    trg_vocab = decoder.trg_vocab
    batch_size = test_dataset.batch_size
    n_batch = 0
    total_loss = 0

    encoder.eval()
    decoder.eval()
    criterion = nn.CrossEntropyLoss(ignore_index=0)

    losses=[]
    final_output, target_output = None, None

    with torch.no_grad():
        for batch, (src, trg) in enumerate(test_dataset):
            n_batch += 1
            loss = 0
            
            src, trg = src.transpose(0,1).to(device), trg.transpose(0,1).to(device)
            curr_output, curr_predictions, enc_out = decode_transformer_model(encoder, decoder, src, trg.size(0), device)

            for t in range(1, trg.size(0)):
                output = decoder(trg[:-1, :], enc_out)
                output = output.reshape(-1, output.shape[2])
                loss_trg = trg[1:].reshape(-1)
                loss += criterion(output, loss_trg)
                # loss += criterion(curr_predictions[:,t,:].to(device), trg[t,:].reshape(-1).to(device))

            if final_output is None:
                final_output = torch.zeros((len(target_tensor_val), trg.size(0)))
                target_output = torch.zeros((len(target_tensor_val), trg.size(0)))

            final_output[batch*batch_size:(batch+1)*batch_size] = curr_output
            target_output[batch*batch_size:(batch+1)*batch_size] = trg.transpose(0,1)
            losses.append(loss.item() / (trg.size(0)-1))

        mean_loss = sum(losses) / len(losses)
        print('Loss {:.4f}'.format(mean_loss))
    
    # Compute Bleu scores
    return compute_bleu_scores(target_tensor_val, target_output, final_output, trg_vocab)

In [None]:
### DO NOT EDIT ###

if __name__ == '__main__':
    transformer_save_candidate, transformer_scores = evaluate_model(transformer_encoder, transformer_decoder, test_dataset, trg_tensor_val, device)

Loss 1.4622
BLEU 1-gram: 0.303198
BLEU 2-gram: 0.084675
BLEU 3-gram: 0.061755
BLEU 4-gram: 0.058802


**Saving Transformer Encoder and Decoder Model**

In [None]:
### DO NOT EDIT ###

if __name__=='__main__':
    from google.colab import drive
    drive.mount('/content/drive')
    if transformer_encoder is not None and transformer_decoder is not None:
        print("Saving Transformer model....") 
        torch.save(transformer_encoder, 'drive/My Drive/transformer_encoder.pt')
        torch.save(transformer_decoder, 'drive/My Drive/transformer_decoder.pt')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Saving Transformer model....
