# Machine Translation with Seq2Seq Model

In this notebook, we will explore the fascinating world of **Machine Translation** using a **Seq2Seq model**. Our goal is to build a model that can translate text from one language to another.

Key Highlights:

1. **Seq2Seq Model**: We will be using a Sequence-to-Sequence model, a type of model that converts an input sequence into an output sequence. It's widely used in tasks such as machine translation, speech recognition, and more.

2. **Beam Search**: To improve the quality of our translations, we will implement Beam Search, a heuristic search algorithm that explores the most promising nodes.

3. **BLEU Score**: To evaluate the performance of our model, we will use the Bilingual Evaluation Understudy (BLEU) score. It's a popular metric for machine translation that compares the translated text with the reference text.

Stay tuned as we dive into the code and unravel the intricacies of machine translation!

# Data Source

The data we will be using for this project is the **Bilingual Sentence Pairs** dataset, which can be found at the following link:

[https://www.kaggle.com/datasets/alincijov/bilingual-sentence-pairs](https://www.kaggle.com/datasets/alincijov/bilingual-sentence-pairs)

This dataset contains pairs of sentences in different languages, making it an excellent resource for our machine translation task.

In [90]:
import pandas as pd
import spacy
from tqdm import tqdm
import torch
from torch import nn
import lightning as pl
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator, Vocab, GloVe
from gensim.models import KeyedVectors
from typing import Iterable, List, Callable
from torch.utils.data import Dataset, DataLoader

#### Load the Dataset

In [91]:
def read_text(file_name: str) -> pd.DataFrame:
    """
    The data file contains multiple lines of text. Each line contains a pair of sentences, and an attribution information.
    The three parts are separated by tab characters. This function reads the data file and returns a data frame.
    
    Args:
        file_name (str): the name of the data file
        
    Returns:
        pd.DataFrame: a data frame containing the data
    """
    
    # Read each line, split it by tab characters, and store the result in a list
    with open(file_name, 'r') as f:
        lines = [line.strip().split('\t') for line in f.readlines()]
        
    # Some lines are empty, so we need to remove them
    lines = [line for line in lines if len(line) == 3]
    
    # Convert the list to a data frame
    df = pd.DataFrame(lines, columns=['english', 'french', 'attribution'])
    
    return df

In [92]:
# Read the data file
df = read_text('Data/fra.txt')

In [93]:
# Print the first 5 rows
df.head()

Unnamed: 0,english,french,attribution
0,Go.,Va !,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
1,Go.,Marche.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
2,Go.,Bouge !,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
3,Hi.,Salut !,CC-BY 2.0 (France) Attribution: tatoeba.org #5...
4,Hi.,Salut.,CC-BY 2.0 (France) Attribution: tatoeba.org #5...


In [94]:
print(f'We have total {len(df)} pairs of sentences.')

We have total 185583 pairs of sentences.


# Data Preprocessing

In this step, we will preprocess our data to make it suitable for our Seq2Seq model. We will use the spaCy library, which is a powerful tool for natural language processing. Specifically, we will use two models from Spacy: one for English and one for French. 

The preprocessing steps include:
1. Removing punctuation: Punctuation can introduce unnecessary complexity into our model, so we will remove it.
2. Converting to lower case: This ensures that our model does not treat the same word in different cases as different words.

In [95]:
# Download the models if necessary
if not spacy.util.is_package('en_core_web_md'):
    spacy.cli.download('en_core_web_md')
if not spacy.util.is_package('fr_core_news_md'):
    spacy.cli.download('fr_core_news_md')
    
# Load the models
nlp_en = spacy.load('en_core_web_md')
nlp_fr = spacy.load('fr_core_news_md')

In [96]:
# Register the tqdm function with pandas to show a progress bar when applying the function to a data frame
tqdm.pandas()

# Tokenize the English sentences
df['english'] = df['english'].progress_apply(lambda x: ' '.join([token.text.lower() for token in nlp_en.tokenizer(x) if token.is_alpha]))

# Tokenize the French sentences
df['french'] = df['french'].progress_apply(lambda x: ' '.join([token.text.lower() for token in nlp_fr.tokenizer(x) if token.is_alpha]))

100%|██████████| 185583/185583 [00:02<00:00, 72228.88it/s]
100%|██████████| 185583/185583 [00:03<00:00, 49892.00it/s]


In [97]:
# Print random 5 rows
df.sample(5)

Unnamed: 0,english,french,attribution
34374,i was all by myself,étais absolument seul,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
113286,tom dragged himself out of bed,tom se traîna hors de son lit,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
133457,i wo let anything happen to you,je ne laisserai rien vous arriver,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
61625,i got a date tonight,ai un galant ce soir,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
19659,tom came forward,tom est présenté,CC-BY 2.0 (France) Attribution: tatoeba.org #2...


# TorchText

[TorchText](https://pytorch.org/text/stable/index.html) is a PyTorch package that makes text processing easier and more convenient. It provides essential tools for preprocessing text data, including tokenization, building vocabulary, and batching data for input into a model.

In this notebook, we will use TorchText for the following tasks:

1. **Tokenization**: We will use the `get_tokenizer` function to create a tokenizer that splits our sentences into tokens (words).

2. **Building Vocabulary**: We will use the `build_vocab_from_iterator` function to create a vocabulary from our dataset. This vocabulary will map each token to a unique integer, which our model can work with.

3. **Text to Integer Sequence Conversion**: We will create a custom function `text_transform` that uses our vocabulary to convert our sentences into sequences of integers.

4. **Batching**: When training our model, we will use the `BucketIterator` to create batches of our data. This will automatically handle padding of sequences to the same length within each batch.

By using TorchText, we can greatly simplify the preprocessing of our text data and ensure that it is done in a way that is optimal for our PyTorch model.

In [98]:
# Define the tokenizer
en_tokenizer = get_tokenizer('spacy', language='en_core_web_md')
fr_tokenizer = get_tokenizer('spacy', language='fr_core_news_md')

The `yield_tokens` function is a generator function that tokenizes text data from an iterable (like a list or a DataFrame column) and yields the tokens one by one.

Here's how it works:

1. The function takes two arguments: `data_iter`, which is an iterable of text data, and `tokenizer`, which is a callable (like a function) that takes a string and returns a list of tokens.

2. The function iterates over `data_iter`.

3. It applies the `tokenizer` to `text`, which splits the text into tokens.

4. It then yields these tokens one by one. Because it's a generator function, it doesn't return all tokens at once but yields them one by one. This is memory-efficient when dealing with large amounts of text data.

The reason for using this function is to create a stream of tokens from the text data. These tokens are used to build a vocabulary for text processing. The vocabulary maps each unique token to a unique integer, which can be used as input to a machine learning model.

In [99]:
def yield_tokens(data_iter: Iterable, tokenizer: Callable[[str], List[str]]) -> List[str]:
    """
    Yield the tokens from the data iterator.
    
    Args:
        data_iter (Iterable): the data iterator
        tokenizer (Callable[[str], List[str]]): the tokenizer
        
    Returns:
        List[str]: the tokens
    """
    
    for text in data_iter:
        yield tokenizer(text)

The `build_vocab_from_iterator` function in TorchText is used to build a vocabulary from an iterator that yields list or iterator of tokens. 

Here's a step-by-step explanation with a visualization example:

1. The function takes an iterator of tokenized text data. This iterator could be a list of sentences, where each sentence is a list of tokens.

2. The function iterates over this iterator, and for each list of tokens, it adds each token to the vocabulary.

3. The vocabulary is essentially a dictionary where each unique token is a key and the corresponding value is a unique integer. The integer values are assigned in the order the tokens are encountered.

4. The function returns this vocabulary.

Here's a visualization example:

Suppose we have the following tokenized text data:

```
[
    ['I', 'love', 'coding'],
    ['coding', 'is', 'fun'],
    ['I', 'love', 'AI']
]
```

The `build_vocab_from_iterator` function will build the following vocabulary from this data:

```
{
    'I': 0,
    'love': 1,
    'coding': 2,
    'is': 3,
    'fun': 4,
    'AI': 5
}
```

Note: The actual integer values may be different depending on the special tokens you add to the vocabulary (like `<unk>`, `<pad>`, `<sos>`, and `<eos>`), but the concept is the same.

In [100]:
# Build English vocabulary
en_vocab = build_vocab_from_iterator(yield_tokens(df['english'], en_tokenizer), specials=['<pad>', '<sos>', '<eos>', '<unk>'])
fr_vocab = build_vocab_from_iterator(yield_tokens(df['french'], fr_tokenizer), specials=['<pad>', '<sos>', '<eos>', '<unk>'])

# Default index is the index of <unk>
en_vocab.set_default_index(en_vocab['<unk>'])
fr_vocab.set_default_index(fr_vocab['<unk>'])

In [101]:
# Print the size of the vocabularies and some first tokens
print(f'English vocabulary size: {len(en_vocab)}, first 10 tokens: {list(en_vocab.get_itos())[:10]}')
print(f'French vocabulary size: {len(fr_vocab)}, first 10 tokens: {list(fr_vocab.get_itos())[:10]}')

English vocabulary size: 14372, first 10 tokens: ['<pad>', '<sos>', '<eos>', '<unk>', 'i', 'you', 'to', 'the', 'a', 'do']
French vocabulary size: 24666, first 10 tokens: ['<pad>', '<sos>', '<eos>', '<unk>', 'je', 'de', 'pas', 'est', 'que', 'à']


In [102]:
def text_transform(vocab: Vocab, tokenizer: Callable[[str], List[str]], text: str) -> List[int]:
    """
    Transform a text into a list of integers.
    
    Args:
        vocab (Vocab): the vocabulary
        tokenizer (Callable[[str], List[str]]): the tokenizer
        text (str): the input text
        
    Returns:
        List[int]: the list of integers
    """
    
    return [vocab[token] for token in tokenizer(text)]

In [103]:
# Define the text transforms by adding <sos> and <eos> tokens, and converting the text to a list of integers
text_transform_en = lambda text: [en_vocab['<sos>']] + text_transform(en_vocab, en_tokenizer, text) + [en_vocab['<eos>']]
text_transform_fr = lambda text: [fr_vocab['<sos>']] + text_transform(fr_vocab, fr_tokenizer, text) + [fr_vocab['<eos>']]

# Transform the English and French sentences
df['english_transform'] = df['english'].progress_apply(text_transform_en)
df['french_transform'] = df['french'].progress_apply(text_transform_fr)

100%|██████████| 185583/185583 [00:02<00:00, 63239.09it/s]
100%|██████████| 185583/185583 [00:02<00:00, 71976.65it/s]


In [104]:
# Print random 5 rows
df.sample(5)

Unnamed: 0,english,french,attribution,english_transform,french_transform
101201,would you be friends with me,voudriez être ami avec moi,CC-BY 2.0 (France) Attribution: tatoeba.org #5...,"[1, 60, 5, 28, 187, 35, 19, 2]","[1, 686, 47, 243, 41, 61, 2]"
84048,keep your eyes on the road,garde les yeux sur la route,CC-BY 2.0 (France) Attribution: tatoeba.org #1...,"[1, 178, 26, 428, 34, 7, 752, 2]","[1, 748, 24, 397, 66, 13, 625, 2]"
126839,have you ever kissed another girl,as tu déjà embrassé une autre nana,CC-BY 2.0 (France) Attribution: tatoeba.org #2...,"[1, 17, 5, 192, 876, 300, 366, 2]","[1, 48, 15, 151, 1511, 23, 126, 3061, 2]"
99404,the cliff is almost vertical,la falaise est presque verticale,CC-BY 2.0 (France) Attribution: tatoeba.org #4...,"[1, 7, 3612, 10, 305, 7977, 2]","[1, 13, 5049, 7, 293, 15726, 2]"
66397,he is stronger than i am,il a plus de force que moi,CC-BY 2.0 (France) Attribution: tatoeba.org #2...,"[1, 14, 10, 1699, 98, 4, 114, 2]","[1, 12, 17, 32, 5, 1497, 8, 61, 2]"


In [105]:
# Let's check the maximum length of the English and French sentences
en_max_len = df['english_transform'].apply(len).max()
fr_max_len = df['french_transform'].apply(len).max()

In [106]:
print(f'Maximum length of English sentences: {en_max_len}')
print(f'Maximum length of French sentences: {fr_max_len}')

Maximum length of English sentences: 46
Maximum length of French sentences: 57


The `pad_sequence` function is used to ensure that all sequences in a batch have the same length by padding shorter sequences with a specific value, usually 0.

Here's how it works:

1. The function takes two arguments: `sequence`, which is a list of integers representing a sequence, and `max_length`, which is the desired length for all sequences.

2. The function checks if the length of `sequence` is less than `max_length`.

3. If it is, the function appends the padding value (`<pad>` token, represented by the integer 0) to `sequence` until its length is equal to `max_length`.

4. The function then returns the padded sequence.

Here's a visualization example:

Suppose we have the following sequence and max_length:

```python
sequence = [1, 3, 2]
max_length = 5
```

The `pad_sequence` function will pad the sequence with 0s until its length is 5:

```python
padded_sequence = [0, 0, 1, 3, 2]
```

This is useful in batch processing where all sequences need to have the same length for the computations to work. The padding value 0 is typically ignored by the model during training and inference.

In [107]:
def pad_sequence(sequence: List[int], max_len: int, vocab: Vocab, pad_first: bool = True) -> List[int]:
    """
    Pad a sequence with <pad> tokens.
    
    Args:
        sequence (List[int]): the input sequence
        max_len (int): the maximum length
        vocab (Vocab): the vocabulary
        pad_first (bool): whether to pad at the beginning or the end
        
    Returns:
        torch.Tensor: the padded sequence as a tensor of long integers
    """
    
    # Calculate the number of tokens to pad
    pad_len = max_len - len(sequence)
    
    # Pad the sequence
    if pad_first:
        sequence = [vocab['<pad>']] * pad_len + sequence
    else:
        sequence = sequence + [vocab['<pad>']] * pad_len
        
    return sequence

In LSTM models, the order of the sequence matters because the LSTM maintains an internal state that is updated for each element in the sequence. If you pad at the end of the sequence, the LSTM will update its state based on these padding tokens, which are meaningless and could potentially lead to less accurate predictions.

On the other hand, if you pad at the beginning of the sequence, the LSTM will start updating its state based on the meaningful tokens right away, as soon as it encounters them. The padding tokens at the beginning of the sequence will have less impact on the final state of the LSTM, leading to more accurate predictions.

This is especially important when using LSTM models with a fixed maximum sequence length, where sequences shorter than the maximum length need to be padded. By padding at the beginning of the sequence, you ensure that the LSTM's state is influenced as much as possible by the meaningful tokens in the sequence.

In [108]:
# Pad the English and French sentences
# We pad first for the English sentences, and pad last for the French sentences
df['english_transform'] = df['english_transform'].progress_apply(lambda x: pad_sequence(x, en_max_len, en_vocab, True))
df['french_transform'] = df['french_transform'].progress_apply(lambda x: pad_sequence(x, fr_max_len, fr_vocab, False))

100%|██████████| 185583/185583 [00:00<00:00, 790142.52it/s]
100%|██████████| 185583/185583 [00:00<00:00, 283204.91it/s]


In [109]:
# Print random 5 rows
df.sample(5)

Unnamed: 0,english,french,attribution,english_transform,french_transform
107246,we have the best food in town,nous avons la meilleure nourriture de la ville,CC-BY 2.0 (France) Attribution: tatoeba.org #5...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 21, 79, 13, 691, 400, 5, 13, 297, 2, 0, 0,..."
48730,please take my advice,il vous plaît suivez mon conseil,CC-BY 2.0 (France) Attribution: tatoeba.org #2...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 12, 14, 169, 2792, 44, 693, 2, 0, 0, 0, 0,..."
112857,the tire on my bicycle is flat,le pneu de mon vélo est à plat,CC-BY 2.0 (France) Attribution: tatoeba.org #2...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 11, 2779, 5, 44, 454, 7, 9, 2094, 2, 0, 0,..."
145577,i have done this since high school,je ai plus fait ça depuis le lycée,CC-BY 2.0 (France) Attribution: tatoeba.org #3...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 4, 19, 32, 39, 30, 223, 11, 1371, 2, 0, 0,..."
142422,if i were you i would paint it blue,si étais vous je le peindrais en bleu,CC-BY 2.0 (France) Attribution: tatoeba.org #3...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 49, 97, 14, 4, 11, 9717, 22, 1508, 2, 0, 0..."


In [110]:
class TranslationDataset(Dataset):
    def __init__(self, english, french):        
        # Convert the lists of integers to tensors
        self.english = english.apply(lambda x: torch.tensor(x, dtype=torch.long))
        self.french = french.apply(lambda x: torch.tensor(x, dtype=torch.long))

    def __len__(self):
        return len(self.english)

    def __getitem__(self, idx):
        return {
            'english': self.english[idx],
            'french': self.french[idx]
        }

In [111]:
class TranslationDataModule(pl.LightningDataModule):
    def __init__(self, df: pd.DataFrame, batch_size: int = 128):
        """
        Initialize the data module.
        
        Args:
            df (pd.DataFrame): the data frame
            batch_size (int): the batch size
        """
        
        super().__init__()
        
        self.df = df
        self.batch_size = batch_size
        
    def setup(self, stage=None):
        """
        Setup the data module. This function will run before training.
        """
        # Create datasets
        dataset = TranslationDataset(self.df['english_transform'], self.df['french_transform'])
        
        # Calculate the size of the training and validation sets
        train_size = int(len(dataset) * 0.8)
        val_size = len(dataset) - train_size
        
        # Split the dataset into training and validation sets
        self.train_dataset, self.val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])
        
    def train_dataloader(self):
        """
        Return the training data loader.
        """
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True)
    
    def val_dataloader(self):
        """
        Return the validation data loader.
        """
        return DataLoader(self.val_dataset, batch_size=self.batch_size, shuffle=False)

In [112]:
# Create the data module
translation_data_module = TranslationDataModule(df)

### Seq2Seq Model

The **Seq2Seq model**, also known as the **Sequence-to-Sequence model**, is a type of model that converts an input sequence into an output sequence. It's widely used in tasks such as machine translation, speech recognition, and more.

The Seq2Seq model consists of two main components:

1. **Encoder**: The encoder processes the input sequence and returns its own internal state. For each input element, the encoder updates its state. After processing the entire input sequence, the encoder outputs its final state, which serves as the "context" of the sequence.

2. **Decoder**: The decoder uses the context (the final state of the encoder) to produce the output sequence. The decoder is also a recurrent network, and it produces the output sequence element by element. For each step, the input to the decoder is the previous element, the output of the decoder from the previous step, and the context.

Here's a visualization of the Seq2Seq model:

```
Input Sequence -> | Encoder | -> Context -> | Decoder | -> Output Sequence
```

#### Embedding Layer

The embedding layer in a neural network is used to transform sparse categorical data, like words in a text dataset, into a dense vector representation that the network can work with. There are two main ways to use the embedding layer:

1. **Self-Trained Embeddings**: In this case, the embedding layer is initialized with random weights and learns an embedding for each word in the vocabulary during the training of the network. This is a good option when you don't have a lot of domain-specific knowledge about the relationships between your categories.

2. **Pre-Trained Embeddings**: In this case, the embedding layer is initialized with the weights from a pre-trained embedding, like Word2Vec or GloVe. These embeddings are trained on large corpora and can capture a lot of semantic information about words. This is a good option when your dataset is small and you want to leverage external knowledge.

Here's example code for these two cases:

```python
import torch
from torch import nn

# Vocabulary size and embedding dimension
vocab_size = 5000
embed_dim = 300

# 1. Self-Trained Embeddings
self_trained_embedding = nn.Embedding(vocab_size, embed_dim)

# 2. Pre-Trained Embeddings
# Load pre-trained embeddings (replace with actual code to load your embeddings)
pretrained_embeddings = torch.randn(vocab_size, embed_dim)  # Note: we need to load from a pretrained model
pretrained_embedding = nn.Embedding.from_pretrained(pretrained_embeddings)
```

In the self-trained embeddings example, the `nn.Embedding` layer is initialized with random weights. In the pre-trained embeddings example, the `nn.Embedding` layer is initialized with weights from `pretrained_embeddings`, which is a tensor that you would typically load from a pre-trained embedding file.

#### Freezing the Pre-Trained Embedding Layer

When using pre-trained embeddings, we have two options:

1. **Freeze the Embedding Layer**: In this case, the weights of the pre-trained embedding layer are kept constant during training. This means that the semantic information captured by the pre-trained embeddings is preserved, and the model cannot modify these embeddings to better fit the training data. This is a good option when your dataset is small and you want to leverage the semantic information in the pre-trained embeddings as much as possible.

2. **Fine-Tune the Embedding Layer**: In this case, the weights of the pre-trained embedding layer are updated during training. This means that the model can modify these embeddings to better fit the training data. This is a good option when your dataset is large and you believe that the pre-trained embeddings may not be optimal for your specific task.

You can decide whether to freeze the pre-trained embedding layer by setting the `requires_grad` attribute of the embedding layer's parameters. If `requires_grad` is `False`, the parameters are frozen and will not be updated during training. If `requires_grad` is `True`, the parameters will be updated during training.

Here's example code showing how to freeze and unfreeze the pre-trained embedding layer:

```python
# Freeze the pre-trained embedding layer
for param in pretrained_embedding.parameters():
    param.requires_grad = False

# Unfreeze the pre-trained embedding layer
for param in pretrained_embedding.parameters():
    param.requires_grad = True
```

In the first block of code, the `requires_grad` attribute of the embedding layer's parameters is set to `False`, freezing the parameters. In the second block of code, the `requires_grad` attribute is set to `True`, allowing the parameters to be updated during training.

In [113]:
# For English, we use the existing GloVe embedding from torchtext
en_glove = GloVe(name='6B', dim=300)

# Create the embedding matrix
en_embedding_matrix = en_glove.get_vecs_by_tokens(en_vocab.get_itos())

In [114]:
# Load pre-trained Word2Vec model
word2vec_model = KeyedVectors.load_word2vec_format('Data/wiki.multi.fr.vec')

# Get the number of words in the model's vocabulary and the size of the embeddings
embed_size = word2vec_model.vector_size

# Get the list of words in the vocabulary
fr_vocab_words = fr_vocab.get_itos()

# Initialize embedding matrix
fr_embedding_matrix = torch.zeros(len(fr_vocab_words), embed_size)

# Fill in the embedding matrix
for i, word in enumerate(fr_vocab_words):
    # Check if the word is in the Word2Vec model's vocabulary
    if word in word2vec_model:
        fr_embedding_matrix[i] = torch.tensor(word2vec_model[word])
    else:
        # If the word is not in the Word2Vec model's vocabulary, leave its embedding as zeros
        pass

In [115]:
class Seq2Seq(pl.LightningModule):
    def __init__(self, en_embedding_matrix, fr_embedding_matrix, hidden_size, output_size, max_output_len):
        """
        Initialize the model.
        
        Args:
            en_embedding_matrix (torch.Tensor): the embedding matrix for English
            fr_embedding_matrix (torch.Tensor): the embedding matrix for French
            hidden_size (int): the hidden size
            output_size (int): the output size
            max_output_len (int): the maximum output length
        """
        super(Seq2Seq, self).__init__()
        
        # Special tokens
        self.sos_token = fr_vocab['<sos>']
        self.eos_token = fr_vocab['<eos>']
        self.pad_token = fr_vocab['<pad>']
        
        # Maximum output length
        self.max_output_len = max_output_len
        
        # Embedding layers
        # We allow the model to update the embeddings, so we do not freeze them
        self.en_embedding = nn.Embedding.from_pretrained(en_embedding_matrix, freeze=False)
        self.fr_embedding = nn.Embedding.from_pretrained(fr_embedding_matrix, freeze=False)
        
        # Encoder block
        self.encoder = nn.LSTM(en_embedding_matrix.shape[1], hidden_size, batch_first=True)
        self.encoder_dropout = nn.Dropout(0.2)
        
        # Decoder block
        self.decoder = nn.LSTM(fr_embedding_matrix.shape[1], hidden_size, batch_first=True)
        self.decoder_leaky_relu = nn.LeakyReLU()
        self.decoder_fc = nn.Linear(hidden_size, output_size)
        
    def encoder_forward(self, x):
        """
        Forward pass of the encoder.
        
        Args:
            x (torch.Tensor): the input tensor
            
        Returns:
            torch.Tensor: the output tensor
            torch.Tensor: the hidden state
            torch.Tensor: the cell state
        """
        
        # Embed the input
        en_embedded = self.en_embedding(x)
        
        # Dropout
        en_embedded = self.encoder_dropout(en_embedded)
        
        # Pass through the LSTM layer
        encoder_output, (hidden_state, cell_state) = self.encoder(en_embedded)
        
        # Return the output, hidden state, and cell state
        return encoder_output, (hidden_state, cell_state)
        
    def decoder_forward(self, x, hidden_and_cell):
        """
        Forward pass of the decoder.
        
        Args:
            x (torch.Tensor): the input tensor
            hidden_and_cell (tuple): the hidden state and cell state
            
        Returns:
            torch.Tensor: the output tensor
            torch.Tensor: the hidden state
            torch.Tensor: the cell state
        """
        
        # Pass through the French embedding layer
        fr_embedded = self.fr_embedding(x)
            
        # Unpack the hidden state and cell state
        hidden_state, cell_state = hidden_and_cell
        
        # Pass through the LSTM layer
        decoder_output, (hidden_state, cell_state) = self.decoder(fr_embedded, (hidden_state, cell_state))
        
        # The output shape is currently (batch_size, 1, hidden_size)
        # We want to change it to (batch_size, hidden_size)
        decoder_output = decoder_output.squeeze(1)
        
                # Apply LeakyReLU
        fr_embedded = self.decoder_leaky_relu(fr_embedded)
        
        # Pass through the fully connected layer
        decoder_output = self.decoder_fc(decoder_output)
        
        # Return the output, hidden state, and cell state
        return decoder_output, (hidden_state, cell_state)
        
    def forward(self, x):
        """
        Forward pass of the Seq2Seq model.
        
        Args:
            x (torch.Tensor): the input tensor
            
        Returns:
            torch.Tensor: the output tensor
        """
        
        # Pass the input through the encoder
        encoder_output, (hidden_state, cell_state) = self.encoder_forward(x)
        
        # Get the batch size
        batch_size = encoder_output.shape[0]
        
        # Prepare the input for the decoder
        # The shape of the input should be (batch_size, 1) because we are sending in one token at a time
        decoder_input = torch.tensor([[self.sos_token]] * batch_size).to(x.device)
        
        # Create a list to store the outputs
        decoder_outputs = []
        for _ in range(self.max_output_len):
            # Pass the input through the decoder
            # The shape of the output should be (batch_size, output_size)
            decoder_output, (hidden_state, cell_state) = self.decoder_forward(decoder_input, (hidden_state, cell_state))
        
            # Add the output to the list
            # The purpose of this storage is to calculate the loss later
            decoder_outputs.append(decoder_output)
            
            # Get the predicted token
            # The predicted_token has shape (batch_size,)
            predicted_token = decoder_output.argmax(1)
            
            # Detach the predicted token so that we can use it as input for the next iteration, and the model does not update its gradients
            decoder_input = predicted_token.detach()
            
            # Reshape the predicted token to (batch_size, 1) so that we can use it as input for the next iteration
            decoder_input = decoder_input.unsqueeze(1)
            
        # Stack tensors of shape (batch_size, output_size) in the list to get a tensor of shape (batch_size, max_output_len, output_size)
        decoder_outputs = torch.stack(decoder_outputs, dim=1)
        
        return decoder_outputs
    
    def training_step(self, batch, batch_idx):
        # Get the input and target
        x = batch['english']
        y = batch['french']
        
        # Get the output
        # The shape of the output is (batch_size, max_output_len, output_size)
        output = self(x)
        
        # Reshape the output to (batch_size * max_output_len, output_size)
        # This is because we want to calculate the loss for each word
        output = output.view(-1, output.size(-1))
        
        # Reshape the target to (batch_size * max_output_len)
        # This is because we want to calculate the loss for each word (the target is the next word)
        y = y.view(-1)
        
        # Calculate the loss
        loss = nn.CrossEntropyLoss(ignore_index=self.pad_token)(output, y)
        
        # Log the loss
        self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        
        return loss
    
    def validation_step(self, batch, batch_idx):
        # Get the input and target
        x = batch['english']
        y = batch['french']
        
        # Get the output
        # The shape of the output is (batch_size, max_output_len, output_size)
        output = self(x)
        
        # Reshape the output to (batch_size * max_output_len, output_size)
        # This is because we want to calculate the loss for each word
        output = output.view(-1, output.size(-1))
        
        # Reshape the target to (batch_size * max_output_len)
        # This is because we want to calculate the loss for each word (the target is the next word)
        y = y.view(-1)
        
        # Calculate the loss
        loss = nn.CrossEntropyLoss(ignore_index=self.pad_token)(output, y)
        
        # Log the loss
        self.log('val_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        
        return loss
    
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

In [116]:
# Create the model
seq2seq = Seq2Seq(en_embedding_matrix, fr_embedding_matrix, hidden_size=300, output_size=len(fr_vocab), max_output_len=fr_max_len)

In [117]:
# Early stopping callback
early_stop_callback = pl.pytorch.callbacks.EarlyStopping(monitor='val_loss', patience=3, mode='min', verbose=True)

# Model checkpoint callback
checkpoint_callback = pl.pytorch.callbacks.ModelCheckpoint(monitor='val_loss', mode='min', verbose=True)

In [118]:
# Create the trainer
trainer = pl.Trainer(max_epochs=100, devices=-1, callbacks=[early_stop_callback, checkpoint_callback])

# Train the model
trainer.fit(seq2seq, translation_data_module)

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name               | Type      | Params
-------------------------------------------------
0 | en_embedding       | Embedding | 4.3 M 
1 | fr_embedding       | Embedding | 7.4 M 
2 | encoder            | LSTM      | 722 K 
3 | encoder_dropout    | Dropout   | 0     
4 | decoder            | LSTM      | 722 K 
5 | decoder_leaky_relu | LeakyReLU | 0     
6 | decoder_fc         | Linear    | 7.4 M 
-------------------------------------------------
20.6 M    Trainable params
0         Non-trainable params
20.6 M    Total params
82.323    Total estimated model params size (MB)


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/Users/luanpham/miniconda3/envs/packt_nlp_natural_language_processing_in_python_for_beginners/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=9` in the `DataLoader` to improve performance.
/Users/luanpham/miniconda3/envs/packt_nlp_natural_language_processing_in_python_for_beginners/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=9` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved. New best score: 3.711
Epoch 0, global step 1160: 'val_loss' reached 3.71122 (best 3.71122), saving model to '/Users/luanpham/Insync/minhluan1590@gmail.com/GoogleDrive/working/Teaching/ttth_natural_language_processing_practice/Chapter_05_Machine_Translation/Seq2seq/lightning_logs/version_3/checkpoints/epoch=0-step=1160.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.719 >= min_delta = 0.0. New best score: 2.992
Epoch 1, global step 2320: 'val_loss' reached 2.99199 (best 2.99199), saving model to '/Users/luanpham/Insync/minhluan1590@gmail.com/GoogleDrive/working/Teaching/ttth_natural_language_processing_practice/Chapter_05_Machine_Translation/Seq2seq/lightning_logs/version_3/checkpoints/epoch=1-step=2320.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.367 >= min_delta = 0.0. New best score: 2.625
Epoch 2, global step 3480: 'val_loss' reached 2.62535 (best 2.62535), saving model to '/Users/luanpham/Insync/minhluan1590@gmail.com/GoogleDrive/working/Teaching/ttth_natural_language_processing_practice/Chapter_05_Machine_Translation/Seq2seq/lightning_logs/version_3/checkpoints/epoch=2-step=3480.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.199 >= min_delta = 0.0. New best score: 2.427
Epoch 3, global step 4640: 'val_loss' reached 2.42662 (best 2.42662), saving model to '/Users/luanpham/Insync/minhluan1590@gmail.com/GoogleDrive/working/Teaching/ttth_natural_language_processing_practice/Chapter_05_Machine_Translation/Seq2seq/lightning_logs/version_3/checkpoints/epoch=3-step=4640.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.121 >= min_delta = 0.0. New best score: 2.306
Epoch 4, global step 5800: 'val_loss' reached 2.30573 (best 2.30573), saving model to '/Users/luanpham/Insync/minhluan1590@gmail.com/GoogleDrive/working/Teaching/ttth_natural_language_processing_practice/Chapter_05_Machine_Translation/Seq2seq/lightning_logs/version_3/checkpoints/epoch=4-step=5800.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.073 >= min_delta = 0.0. New best score: 2.233
Epoch 5, global step 6960: 'val_loss' reached 2.23260 (best 2.23260), saving model to '/Users/luanpham/Insync/minhluan1590@gmail.com/GoogleDrive/working/Teaching/ttth_natural_language_processing_practice/Chapter_05_Machine_Translation/Seq2seq/lightning_logs/version_3/checkpoints/epoch=5-step=6960.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.046 >= min_delta = 0.0. New best score: 2.186
Epoch 6, global step 8120: 'val_loss' reached 2.18633 (best 2.18633), saving model to '/Users/luanpham/Insync/minhluan1590@gmail.com/GoogleDrive/working/Teaching/ttth_natural_language_processing_practice/Chapter_05_Machine_Translation/Seq2seq/lightning_logs/version_3/checkpoints/epoch=6-step=8120.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.030 >= min_delta = 0.0. New best score: 2.156
Epoch 7, global step 9280: 'val_loss' reached 2.15640 (best 2.15640), saving model to '/Users/luanpham/Insync/minhluan1590@gmail.com/GoogleDrive/working/Teaching/ttth_natural_language_processing_practice/Chapter_05_Machine_Translation/Seq2seq/lightning_logs/version_3/checkpoints/epoch=7-step=9280.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.018 >= min_delta = 0.0. New best score: 2.139
Epoch 8, global step 10440: 'val_loss' reached 2.13874 (best 2.13874), saving model to '/Users/luanpham/Insync/minhluan1590@gmail.com/GoogleDrive/working/Teaching/ttth_natural_language_processing_practice/Chapter_05_Machine_Translation/Seq2seq/lightning_logs/version_3/checkpoints/epoch=8-step=10440.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.010 >= min_delta = 0.0. New best score: 2.129
Epoch 9, global step 11600: 'val_loss' reached 2.12904 (best 2.12904), saving model to '/Users/luanpham/Insync/minhluan1590@gmail.com/GoogleDrive/working/Teaching/ttth_natural_language_processing_practice/Chapter_05_Machine_Translation/Seq2seq/lightning_logs/version_3/checkpoints/epoch=9-step=11600.ckpt' as top 1
`Trainer.fit` stopped: `max_epochs=10` reached.


In [119]:
# Load the tensorboard notebook extension
%load_ext tensorboard

In [121]:
# Start tensorboard
%tensorboard --logdir lightning_logs/

Reusing TensorBoard on port 6006 (pid 39246), started 0:00:06 ago. (Use '!kill 39246' to kill it.)

In [122]:
def translate_sentence(sentence: str):
    """
    Translate a sentence from English to French.
    
    Args:
        sentence (str): the input sentence
        
    Returns:

    """
    
    # Tokenize the sentence
    tokens = ' '.join([token.text.lower() for token in nlp_en.tokenizer(sentence) if token.is_alpha])
    
    # Transform the sentence
    transformed = text_transform_en(tokens)
    
    # Pad the sentence
    padded = pad_sequence(transformed, en_max_len, en_vocab, pad_first=True)
    
    # Convert the sentence to a tensor
    tensor = torch.tensor(padded, dtype=torch.long).unsqueeze(0).to(seq2seq.device)
    
    # Get the output
    output = seq2seq(tensor)
    
    # Get the predicted words
    predicted = output.argmax(dim=-1).squeeze(0)
    
    # Convert the predicted words to a list of integers
    predicted = predicted.tolist()
    
    # Remove the <sos> token
    predicted = predicted[1:]
    
    # Remove the <eos> token
    predicted = predicted[:predicted.index(fr_vocab['<eos>'])]
    
    # Convert the integers to words
    predicted = [fr_vocab.get_itos()[idx] for idx in predicted]
    
    # Join the words
    predicted = ' '.join(predicted)
    
    return predicted

In [123]:
# Translate some sentences
translate_sentence('I am a student.')

'je suis étudiant'

In [124]:
# Translate some sentences
translate_sentence('There is a cat on the table.')

'il y a un chat sur la table'