# Neural Machine Translation

This project develops a neural machine translator based on Recurrent Neural Networks (RNNs). The objective is to design and train a Seq2Seq architecture in PyTorch, a popular deep learning library, to translate text from one language to another (here, German to English). Specifically, the project uses the Multi30k dataset, which contains approximately 30,000 sentences in English, German, and French.

The AI translator is designed using an encoder-decoder architecture, where the encoder takes the input sentence and produces a fixed-length representation (encoding) of the input sequence. The decoder then uses this encoding to generate the output sentence in the target language. The RNNs, specifically LSTM (Long Short-Term Memory) networks, are used in both the encoder and decoder to capture the temporal dependencies in the input and output sequences.
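
To make the idea of a fixed-length encoding concrete, the following is a minimal pure-Python sketch of a single-unit LSTM cell folded over sequences of different lengths. The helper names (`lstm_step`, `encode`) and the constant weights are illustrative assumptions, not part of this project; the actual model relies on `nn.LSTM` with learned weights and vector-valued states.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, w):
    """One time step of a single-unit LSTM cell (scalar input and state)."""
    i = sigmoid(w["wi"] * x + w["ui"] * h + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h + w["bg"])  # candidate cell value
    c_next = f * c + i * g
    h_next = o * math.tanh(c_next)
    return h_next, c_next


def encode(sequence, w):
    """Fold a whole sequence into a single (hidden, cell) pair."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
    return h, c


# Arbitrary illustrative weights (a trained model would learn these).
names = ("wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg")
weights = {name: 0.5 for name in names}

short = encode([0.1, -0.3], weights)
long = encode([0.1, -0.3, 0.7, 0.2, -0.5], weights)
print(len(short), len(long))  # the encoding size does not depend on sequence length
```

Whatever the input length, the encoder hands the decoder a state of the same size, which is what the decoder is initialized with.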

The intended audience is anyone interested in natural language processing and deep learning, particularly machine translation.

Note: This project is educational and should not be used for commercial or production purposes.

### Step 1: Build the Vocabulary & Create the Word Embeddings
* The most important part of this step is to create the vocabulary object from a corpus of data loaded through TorchText.
* Use NLTK to write the tokenization functions; the vocabulary then maps each token to the index used to look up its embedding.
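
As a rough illustration of what building a vocabulary involves, the sketch below counts tokens, drops those rarer than `min_freq`, and maps the rest to integer indices. The `build_vocab` and `numericalize` helpers and the toy corpus are hypothetical stand-ins; TorchText's own vocabulary handles this internally and may order tokens differently.

```python
from collections import Counter
from typing import Dict, List

SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]


def build_vocab(sentences: List[List[str]], min_freq: int = 2) -> Dict[str, int]:
    """Map every token seen at least min_freq times to an integer index."""
    counts = Counter(tok for sent in sentences for tok in sent)
    # Sort by descending frequency, then alphabetically, for a deterministic order.
    kept = sorted((t for t, n in counts.items() if n >= min_freq),
                  key=lambda t: (-counts[t], t))
    return {tok: idx for idx, tok in enumerate(SPECIALS + kept)}


def numericalize(tokens: List[str], stoi: Dict[str, int]) -> List[int]:
    """Replace each token by its index, falling back to <unk> for rare words."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]


corpus = [
    ["a", "boy", "in", "a", "red", "jacket"],
    ["a", "man", "in", "a", "white", "shirt"],
]
stoi = build_vocab(corpus, min_freq=2)
print(stoi)
print(numericalize(["a", "boy", "in", "red"], stoi))  # [4, 0, 5, 0]
```

Words below the frequency threshold ("boy", "red") collapse onto the `<unk>` index, which is why the vocabulary is smaller than the number of distinct words in the corpus.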






In [None]:
! pip install numpy==1.16.5
! pip install torch==1.3.1
! pip install torchtext==0.4.0

In [1]:
import os
import math
import time
import torch
import nltk
import random
import numpy as np
import torch.nn as nn
import torch.optim as optim
from typing import List
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

In [3]:
nltk.download('punkt')  # download the 'punkt' package from the Natural Language Toolkit (NLTK) library

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
# Define two functions to tokenize German and English text data, respectively.
# The 'nltk.word_tokenize' function is used to split the text into individual words.
# The '[::-1]' slice is used to reverse the order of the words in the German text.
def tokenize_de(text: str) -> List[str]:
    return nltk.word_tokenize(text, language='german')[::-1]

def tokenize_en(text: str) -> List[str]:
    return nltk.word_tokenize(text, language='english')

# Define two fields to represent the source and target text data, respectively.
# The 'tokenize' argument specifies the function to use for tokenizing the text data.
# The 'init_token' and 'eos_token' arguments are used to add special tokens to the beginning and end of each sequence.
# The 'lower' argument is used to convert all text to lowercase.
source = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True)
target = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)
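
The effect of these field settings on a single sentence can be sketched in plain Python. This is only an approximation under stated assumptions: whitespace splitting stands in for `nltk.word_tokenize`, and the `preprocess` helper is hypothetical rather than part of TorchText.

```python
from typing import List


def preprocess(text: str, reverse: bool = False) -> List[str]:
    """Lowercase, tokenize, optionally reverse, then wrap in <sos>/<eos>."""
    # Assumption: str.split() replaces nltk.word_tokenize to keep the sketch dependency-free.
    tokens = text.lower().split()
    if reverse:
        tokens = tokens[::-1]
    return ["<sos>"] + tokens + ["<eos>"]


print(preprocess("Ein Junge in einer roten Jacke", reverse=True))
print(preprocess("A boy in a red jacket"))
```

Note that the special tokens wrap the already reversed source sentence, so `<sos>` still comes first and `<eos>` last.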


In [5]:
# Load the Multi30k dataset, which contains parallel German-English text data.
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(source, target))

# Print the number of examples in each dataset.
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")


Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


In [6]:
# Build the vocabulary for the source and target fields using the training data.
# The 'min_freq' argument specifies the minimum frequency a word must have to be included in the vocabulary.
source.build_vocab(train_data, min_freq=2)
target.build_vocab(train_data, min_freq=2)

# Print the number of unique tokens in the source and target vocabularies.
print(f"Unique tokens in source (de) vocabulary: {len(source.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(target.vocab)}")


Unique tokens in source (de) vocabulary: 7860
Unique tokens in target (en) vocabulary: 5920


In [7]:
# Set the batch size and device for processing the data.
BATCH_SIZE = 128
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Construct the iterator objects for the training, validation, and test data.
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=BATCH_SIZE, device=device)
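
The motivation for a bucketing iterator is that batching sentences of similar length reduces the amount of padding. The sketch below compares the padding needed with and without length-based grouping; the helper names and the toy sentence lengths are assumptions for illustration, and the real `BucketIterator` additionally shuffles batches.

```python
from typing import List

PAD = "<pad>"


def pad_batch(batch: List[List[str]]) -> List[List[str]]:
    """Pad every sentence in the batch to the length of the longest one."""
    width = max(len(sent) for sent in batch)
    return [sent + [PAD] * (width - len(sent)) for sent in batch]


def count_padding(batches: List[List[List[str]]]) -> int:
    """Total number of <pad> tokens needed across all batches."""
    return sum(tok == PAD
               for batch in batches
               for sent in pad_batch(batch)
               for tok in sent)


def chunk(items: list, size: int) -> list:
    """Split a list into consecutive batches of the given size."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Toy sentences of lengths 2, 9, 3, 8, 2, and 10 tokens.
sentences = [["w"] * n for n in (2, 9, 3, 8, 2, 10)]

naive = chunk(sentences, 2)                      # batches in original order
bucketed = chunk(sorted(sentences, key=len), 2)  # batches of similar length
print(count_padding(naive), count_padding(bucketed))  # 20 6
```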


In [8]:
# Set the hyperparameters for the model.

# 'INPUT_DIM' and 'OUTPUT_DIM' specify the sizes of the input and output vocabularies.
# 'EMBEDDING_DIM' specifies the size of the word embeddings.
# 'HIDDEN_DIM' specifies the size of the hidden state of the LSTM cells.
# 'NUM_LAYERS' specifies the number of stacked LSTM layers.
# 'DROPOUT' specifies the dropout probability applied between the stacked LSTM layers.

INPUT_DIM = len(source.vocab)
OUTPUT_DIM = len(target.vocab)
EMBEDDING_DIM = 128
HIDDEN_DIM = 256
NUM_LAYERS = 2
DROPOUT = 0.5
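
To get a feel for what these hyperparameters imply, the following sketch estimates the number of trainable parameters by hand. It assumes the vocabulary sizes printed above (7860 and 5920) and the standard `nn.LSTM` layout of four gates, each with input weights, recurrent weights, and two bias vectors; `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_size: int, hidden_size: int, num_layers: int) -> int:
    """Parameter count of a stacked, unidirectional LSTM."""
    total = 0
    for layer in range(num_layers):
        # Only the first layer sees the embeddings; deeper layers see hidden states.
        in_size = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights, and two bias vectors.
        total += 4 * (in_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total


INPUT_DIM, OUTPUT_DIM = 7860, 5920  # vocabulary sizes reported above
EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS = 128, 256, 2

encoder_params = INPUT_DIM * EMBEDDING_DIM + lstm_params(EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS)
decoder_params = (OUTPUT_DIM * EMBEDDING_DIM
                  + lstm_params(EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS)
                  + HIDDEN_DIM * OUTPUT_DIM + OUTPUT_DIM)  # final linear layer

print(f"encoder: {encoder_params:,}")  # encoder: 1,927,680
print(f"decoder: {decoder_params:,}")  # decoder: 3,200,800
print(f"total:   {encoder_params + decoder_params:,}")  # total:   5,128,480
```

Under these assumptions, the embedding tables and the output projection account for most of the parameters, so the vocabulary size (and therefore `min_freq`) has a large effect on model size.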


### Step 2: Create the Encoder
* A Seq2Seq architecture consists of an encoder and a decoder unit. We will use PyTorch to build a full Seq2Seq model.
* The first step of the architecture is to create an encoder with an LSTM unit.

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, dropout):
        super().__init__()

        # Set the instance variables for the sizes and dropout probability.
        self.input_size = input_size
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.dropout = dropout

        # Initialize the embedding layer with the input size and embedding size.
        self.embedding = nn.Embedding(input_size, embedding_size)

        # Initialize the LSTM layer with the embedding size and hidden size,
        # and set the number of layers and dropout probability.
        self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers=num_layers, dropout=dropout)

    def forward(self, src_batch):

        # Embed the source batch using the embedding layer.
        embedded = self.embedding(src_batch)

        # Pass the embedded batch through the LSTM layer to get the final hidden and cell states.
        outputs, (hidden, cell) = self.lstm(embedded)

        # Return the final hidden and cell states.
        return hidden, cell

        

### Step 3: Create the Decoder
* The second step of the architecture is to create a decoder using a second LSTM unit.

In [9]:
class Decoder(nn.Module):

    def __init__(self, output_dim, embedding_size, hidden_size, num_layers, dropout):
        super().__init__()
        # initialize the decoder class with the required parameters
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.output_dim = output_dim
        self.num_layers = num_layers
        self.dropout = dropout

        # create an embedding layer to embed the target input sequence
        self.embedding = nn.Embedding(self.output_dim, self.embedding_size)

        # create an LSTM layer whose input size is the embedding size and whose hidden size is hidden_size
        # its output is passed to the linear layer below to produce the decoder's prediction
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, num_layers=self.num_layers, dropout=self.dropout)

        # create a linear layer to convert the output of the LSTM layer to the required output dimension
        self.out = nn.Linear(self.hidden_size, self.output_dim)

    def forward(self, trg, hidden, cell):
        # trg: current target token for each sequence in the batch (batch size,)
        # hidden: last hidden state of the encoder (num layers * num directions, batch size, hidden size)
        # cell: last cell state of the encoder (num layers * num directions, batch size, hidden size)

        # embed the target input sequence
        embedded = self.embedding(trg.unsqueeze(0)) # (1, batch size, embedding size)

        # pass the embedded sequence through the LSTM layer along with the hidden and cell states from the encoder
        # the output of the LSTM layer will be the hidden state and cell state of the current time step
        outputs, (hidden, cell) = self.lstm(embedded, (hidden, cell)) 

        # convert the output of the LSTM layer to the required output dimension using the linear layer
        # the prediction of the current time step will be the output of the linear layer
        prediction = self.out(outputs.squeeze(0)) # (batch size, output dim)

        # return the prediction, hidden state and cell state of the current time step
        return prediction, hidden, cell
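
The decoder returns one row of unnormalized scores (logits) per sentence in the batch, with one score per target-vocabulary entry. Picking the highest-scoring index per row is what an argmax over the vocabulary dimension does; the pure-Python sketch below illustrates this with nested lists instead of tensors, and the `greedy_tokens` helper and the toy scores are assumptions.

```python
from typing import List


def greedy_tokens(prediction: List[List[float]]) -> List[int]:
    """Return the index of the highest score in each row (one row per sentence)."""
    return [max(range(len(row)), key=row.__getitem__) for row in prediction]


# Toy batch of 2 sentences over a 3-word vocabulary.
prediction = [
    [0.1, 2.5, -1.0],
    [3.0, 0.2, 0.4],
]
print(greedy_tokens(prediction))  # [1, 0]
```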


### Step 4: Combine them into a Seq2Seq Architecture
* To finalize our model, we will combine the encoder and decoder units into a working model.
* The Seq2Seq model holds the encoder and decoder instances. Its forward pass accepts the source and target batches and manages the interaction between the two units to produce the output.

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src_batch, trg_batch, teacher_forcing_ratio: float=0.5):

        # get maximum sequence length and batch size
        max_len, batch_size = trg_batch.shape
        # get the size of the target vocabulary
        trg_vocab_size = self.decoder.output_dim

        # create tensor to store outputs
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        # get hidden and cell state from encoder
        hidden, cell = self.encoder(src_batch)

        # use the first target token as input to the decoder
        trg = trg_batch[0]
        # iterate over the remaining target tokens
        for i in range(1, max_len):
            # pass the previous target token and hidden state to the decoder to get a prediction
            prediction, hidden, cell = self.decoder(trg, hidden, cell)
            # store the prediction in the outputs tensor
            outputs[i] = prediction

            # decide whether to use teacher forcing or not for the next token
            if random.random() < teacher_forcing_ratio:
                # use the next target token as input to the decoder
                trg = trg_batch[i]
            else:
                # use the predicted token as input to the decoder
                trg = prediction.argmax(1)

        return outputs
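
The teacher-forcing decision in the loop above can be isolated in a small pure-Python sketch. Here the decoder is replaced by a hypothetical `step` function that always predicts the same word, so the only thing that varies is which token gets fed back in; the `decode` helper and the toy target are assumptions.

```python
import random
from typing import Callable, List


def decode(target: List[str], step: Callable[[str], str],
           teacher_forcing_ratio: float, rng: random.Random) -> List[str]:
    """Return the sequence of tokens that were fed to the decoder."""
    inputs_used = []
    token = target[0]  # decoding always starts from <sos>
    for i in range(1, len(target)):
        inputs_used.append(token)
        prediction = step(token)
        # With probability teacher_forcing_ratio, feed the ground-truth token next;
        # otherwise feed the model's own prediction.
        token = target[i] if rng.random() < teacher_forcing_ratio else prediction
    return inputs_used


def always_the(token: str) -> str:
    """Hypothetical stand-in for the decoder: it predicts "the" every time."""
    return "the"


target = ["<sos>", "a", "boy", "<eos>"]
rng = random.Random(0)
print(decode(target, always_the, 1.0, rng))  # ['<sos>', 'a', 'boy']
print(decode(target, always_the, 0.0, rng))  # ['<sos>', 'the', 'the']
```

With a ratio of 1.0 the decoder always sees the reference translation, while a ratio of 0.0 (as used during evaluation) makes it consume its own predictions, so early mistakes can propagate.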



### Step 5: Train & Evaluate the Model
* Finally, we will train and evaluate our model using a PyTorch training loop.

In [10]:
# Instantiate the encoder, the decoder, and the Seq2Seq model, then move the model to the selected device.
encoder = Encoder(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS, DROPOUT)
decoder = Decoder(OUTPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS, DROPOUT)
seq2seq = Seq2Seq(encoder, decoder, device).to(device)

In [11]:
optimizer = optim.Adam(seq2seq.parameters(), lr=0.001)

# ignore the padding index when calculating the loss
PAD_IDX = target.vocab.stoi['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
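
The effect of ignoring the padding index can be illustrated with a hand-written cross-entropy over plain lists. This is a sketch of the idea (mean negative log-likelihood over non-padding targets), not the library implementation; the `cross_entropy` helper, the toy logits, and `PAD_IDX = 1` are assumptions.

```python
import math
from typing import List


def cross_entropy(logits: List[List[float]], targets: List[int], ignore_index: int) -> float:
    """Mean negative log-likelihood over the targets that are not ignore_index."""
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing to the loss
        log_norm = math.log(sum(math.exp(v) for v in row))  # log of the softmax denominator
        total += log_norm - row[target]
        counted += 1
    return total / counted  # assumes at least one non-padding target


PAD_IDX = 1  # illustrative value for this sketch
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [5.0, 0.0, 0.0]]
targets = [0, 2, PAD_IDX]

with_pad_skipped = cross_entropy(logits, targets, ignore_index=PAD_IDX)
without_third_row = cross_entropy(logits[:2], targets[:2], ignore_index=PAD_IDX)
print(with_pad_skipped == without_third_row)  # True
```

Because padded positions are skipped, the loss of a batch does not depend on how much padding it contains.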

In [12]:
def train(seq2seq, iterator, optimizer, criterion):
    # set the model to training mode
    seq2seq.train()

    epoch_loss = 0
    # loop through the data iterator
    for batch in iterator:
        
        # reset the gradients to zero
        optimizer.zero_grad()
        # pass the source and target sentences through the model to get the output
        outputs = seq2seq(batch.src, batch.trg)

        # flatten the output and target sentences (excluding the start token)
        outputs_flatten = outputs[1:].view(-1, outputs.shape[-1])
        trg_flatten = batch.trg[1:].view(-1)
        # compute the loss between the output and target sentences
        loss = criterion(outputs_flatten, trg_flatten)

        # compute the gradients of the loss with respect to the model parameters
        loss.backward()
        # update the parameters using the computed gradients
        optimizer.step()

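        # add the current batch's loss to the epoch loss
        # note: the 0.1 factor is a constant scaling, not a batch-size weighting, so the reported
        # loss is one tenth of the mean cross-entropy and the printed perplexity is exp of that scaled value
        epoch_loss += loss.item()*0.1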

    # return the average epoch loss
    return epoch_loss / len(iterator)


In [13]:
def evaluate(seq2seq, iterator, criterion):
    # set the model to evaluation mode
    seq2seq.eval()

    epoch_loss = 0
    # turn off gradients for evaluation
    with torch.no_grad():
        # loop through the data iterator
        for batch in iterator:
            # generate output using the model, with no teacher forcing
            outputs = seq2seq(batch.src, batch.trg, teacher_forcing_ratio=0) 
            # flatten the output and target sentences (excluding the start token)
            outputs_flatten = outputs[1:].view(-1, outputs.shape[-1])
            trg_flatten = batch.trg[1:].view(-1)
            # compute the loss between the output and target sentences
            loss = criterion(outputs_flatten, trg_flatten)
            # add the current batch's loss to the epoch loss
            # note: the 0.1 factor is a constant scaling, not a batch-size weighting (same as in train)
            epoch_loss += loss.item()*0.1

    # return the average epoch loss
    return epoch_loss / len(iterator)


In [14]:
N_EPOCHS = 20
best_valid_loss = float('inf')

# loop through each epoch
for epoch in range(N_EPOCHS):    
    # train the model on the training data for one epoch
    # (.to(device) also works on CPU-only machines, unlike .cuda())
    train_loss = train(seq2seq.to(device), train_iterator, optimizer, criterion)
    # evaluate the model on the validation data
    valid_loss = evaluate(seq2seq, valid_iterator, criterion)

    # if the current validation loss is better than the best seen so far, save the model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(seq2seq.state_dict(), 'bst-model.pt')
    
    # print the current epoch's loss and perplexity for both training and validation data
    print(f'\n Epoch: {epoch}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')



 Epoch: 0
	Train Loss: 0.520 | Train PPL:   1.682
	 Val. Loss: 0.483 |  Val. PPL:   1.621

 Epoch: 1
	Train Loss: 0.472 | Train PPL:   1.603
	 Val. Loss: 0.490 |  Val. PPL:   1.633

 Epoch: 2
	Train Loss: 0.448 | Train PPL:   1.565
	 Val. Loss: 0.468 |  Val. PPL:   1.597

 Epoch: 3
	Train Loss: 0.427 | Train PPL:   1.533
	 Val. Loss: 0.461 |  Val. PPL:   1.586

 Epoch: 4
	Train Loss: 0.411 | Train PPL:   1.509
	 Val. Loss: 0.456 |  Val. PPL:   1.578

 Epoch: 5
	Train Loss: 0.396 | Train PPL:   1.485
	 Val. Loss: 0.442 |  Val. PPL:   1.556

 Epoch: 6
	Train Loss: 0.385 | Train PPL:   1.470
	 Val. Loss: 0.433 |  Val. PPL:   1.542

 Epoch: 7
	Train Loss: 0.373 | Train PPL:   1.452
	 Val. Loss: 0.424 |  Val. PPL:   1.528

 Epoch: 8
	Train Loss: 0.360 | Train PPL:   1.434
	 Val. Loss: 0.417 |  Val. PPL:   1.517

 Epoch: 9
	Train Loss: 0.350 | Train PPL:   1.419
	 Val. Loss: 0.407 |  Val. PPL:   1.502

 Epoch: 10
	Train Loss: 0.339 | Train PPL:   1.404
	 Val. Loss: 0.398 |  Val. PPL:   1.48
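
Because `train` and `evaluate` multiply each batch loss by 0.1, the printed perplexities are computed from a scaled loss. The sketch below works through the arithmetic for the epoch 0 training loss, assuming the constant 0.1 factor is the only difference from the plain mean cross-entropy; `unscaled_perplexity` is a hypothetical helper.

```python
import math

LOSS_SCALE = 0.1  # constant factor applied to each batch loss in train() and evaluate()


def unscaled_perplexity(reported_loss: float, scale: float = LOSS_SCALE) -> float:
    """Perplexity of the loss before the constant scaling was applied."""
    return math.exp(reported_loss / scale)


reported = 0.520  # epoch 0 training loss as printed above
print(f"reported PPL:  {math.exp(reported):.3f}")
print(f"unscaled loss: {reported / LOSS_SCALE:.2f}")
print(f"unscaled PPL:  {unscaled_perplexity(reported):.1f}")
```

Under that assumption, the low printed perplexities reflect the scaling rather than the quality of the model, so they should not be compared directly with perplexities reported elsewhere.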

In [15]:
# Load the saved best model state
seq2seq.load_state_dict(torch.load('bst-model.pt'))

# Evaluate the model on the test data using the loaded model state
test_loss = evaluate(seq2seq, test_iterator, criterion)

# Print the test loss using a formatted string with appropriate decimal precision
print(f'Test Loss: {test_loss:.3f}')


Test Loss: 0.374


### Step 6: Interact with the System
* Test the system by converting the model's outputs to text and displaying them.
* Pick an example from the training data by index (fixed at 50 here), retrieve the corresponding source and target sentences, and process them into tensors for the model.
* Put the model into evaluation mode, which disables dropout and other layers that behave differently during training, then feed the source and target tensors to the model to generate the output.
* Extract the predicted output indices, convert them to words using the target vocabulary, and join the words into a single string.

In [27]:
# Use a fixed index to select an example sentence pair
sample_idx = 50

# Retrieve a source and target sentence pair from the training data using the selected index
sample = train_data.examples[sample_idx]

# Print the source and target sentences (the source tokens appear reversed because tokenize_de reverses them)
print('source sentence: ', ' '.join(sample.src))
print('target sentence: ', ' '.join(sample.trg))

# Process the source and target sentences as tensors and move them to the selected device
src_tensor = source.process([sample.src]).to(device)
trg_tensor = target.process([sample.trg]).to(device)

# Set the model to evaluation mode and disable gradient computation
seq2seq.eval()
with torch.no_grad():
    # Feed the source and target tensors to the model to generate output
    outputs = seq2seq(src_tensor, trg_tensor, teacher_forcing_ratio=0)
    
# Extract the predicted output indices from the model output
output_idx = outputs[1:].squeeze(1).argmax(1)

# Convert the predicted output indices to words and join them into a single string
output_sentence = ' '.join([target.vocab.itos[idx] for idx in output_idx])

# Print the predicted output sentence
print(output_sentence)




source sentence:  . gießt hemd weißen einem in mann einen auf wasser der , jacke roten einer in junge ein
target sentence:  a boy in a red jacket pouring water on a man in a white shirt


'a boy in a red jacket is a a man with a woman in a red'
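
Two display details are worth noting: the model emits one token per target position, so the printed output has no natural stopping point, and the source sentence is shown in reversed order. The sketch below shows one possible way to post-process both for readability; the helpers and the toy vocabulary are assumptions and are not used by the code above.

```python
from typing import List


def indices_to_sentence(indices: List[int], itos: List[str], eos_token: str = "<eos>") -> str:
    """Convert predicted indices to words, stopping at the first <eos>."""
    words = []
    for idx in indices:
        word = itos[idx]
        if word == eos_token:
            break
        words.append(word)
    return " ".join(words)


def restore_source_order(tokens: List[str]) -> str:
    """Undo the reversal applied by the source tokenizer, for display only."""
    return " ".join(reversed(tokens))


# Toy index-to-string table standing in for target.vocab.itos.
itos = ["<unk>", "<pad>", "<sos>", "<eos>", "a", "boy", "in", "red"]
print(indices_to_sentence([4, 5, 6, 7, 3, 4, 4], itos))  # a boy in red
print(restore_source_order(["junge", "ein"]))            # ein junge
```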