## Neural Machine Translation with Transformers

## 🎯 Project Overview

Implementation of a Neural Machine Translation system using the Transformer architecture, focusing on sequence-to-sequence learning for language translation tasks.

## 👥 Team Members

- [Carlos Salguero](https://github.com/salgue441)
- [Diego Perdomo](https://github.com/DiegoPerdomoS)
- [Arturo Rendón](https://github.com/00sen)
- [José Riosmena](https://github.com/Riosmena)
- [Dafne Fernández](https://github.com/Dafne224)

## 🎯 Objectives

1. Implement and understand the Transformer architecture
2. Develop a functional language translation system
3. Apply attention mechanisms in sequence-to-sequence learning
4. Evaluate translation quality using appropriate metrics

## 📚 Dataset

This project uses the Tatoeba dataset:

- Source: Tatoeba, a large dataset of sentences and translations
- Type: parallel corpus for language translation
- Content: sequence pairs in source and target languages

## 📝 Evaluation Criteria

| Criterion | Weight |
| --------- | ------ |
| Code quality & documentation | 40% |
| Model implementation | 30% |
| Translation performance | 30% |

#### Script to convert csv to text file


In [1]:
import pandas as pd

In [2]:
PATH = "/kaggle/input/english-spanish/eng-spa.tsv"

In [3]:
df = pd.read_csv(PATH, sep="\t", on_bad_lines="skip")
print(f"Number of columns: {df.shape[1]}")

Number of columns: 4


In [4]:
df.head()

Unnamed: 0,1276,Let's try something.,2481,¡Intentemos algo!
0,1277,I have to go to sleep.,2482,Tengo que irme a dormir.
1,1280,Today is June 18th and it is Muiriel's birthday!,2485,¡Hoy es 18 de junio y es el cumpleaños de Muir...
2,1280,Today is June 18th and it is Muiriel's birthday!,1130137,¡Hoy es el 18 de junio y es el cumpleaños de M...
3,1282,Muiriel is 20 now.,2487,"Ahora, Muiriel tiene 20 años."
4,1282,Muiriel is 20 now.,1130133,Muiriel tiene 20 años ahora.


In [5]:
eng_spa_cols = df.iloc[:, [1, 3]]
eng_spa_cols["length"] = eng_spa_cols.iloc[:, 0].str.len()
eng_spa_cols = eng_spa_cols.sort_values(by="length")
eng_spa_cols = eng_spa_cols.drop(columns=["length"])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  eng_spa_cols["length"] = eng_spa_cols.iloc[:, 0].str.len()


Saving the output file locally.


In [6]:
output_file_path = "/kaggle/working/eng-spa4.txt"
eng_spa_cols.to_csv(output_file_path, sep="\t", index=False, header=False)

## Transformer - Attention is all you need


In [7]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import math
import numpy as np
import re
from typing import List, Tuple

In [8]:
torch.manual_seed(23)

<torch._C.Generator at 0x7af90a73b450>

In [9]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


Max sequence length


In [10]:
MAX_SEQ_LEN = 128

## Positional Embedding


In [11]:
class PositionalEmbedding(nn.Module):
    """
    Positional embedding module designed to add positional information
    to the input tokens.

    Attributes:
        pos_embeded_matrix (torch.Tensor): The positional embedding matrix
    """

    def __init__(self, d_model, max_seq_len=MAX_SEQ_LEN):
        super().__init__()
        self.pos_embed_matrix = torch.zeros(max_seq_len, d_model, device=device)

        token_pos = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )

        self.pos_embed_matrix[:, 0::2] = torch.sin(token_pos * div_term)
        self.pos_embed_matrix[:, 1::2] = torch.cos(token_pos * div_term)
        self.pos_embed_matrix = self.pos_embed_matrix.unsqueeze(0).transpose(0, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the PositionalEmbedding module.

        Args:
            x (torch.Tensor): The input tensor

        Returns:
            torch.Tensor: The input tensor with positional information added
        """

        return x + self.pos_embed_matrix[: x.size(0), :]

## Multi-Head Attention


In [12]:
class MultiHeadAttention(nn.Module):
    """
    Multi-head attention module designed to compute the attention
    scores between the query, key, and value tensors.

    Attributes:
        d_v (int): The dimension of the value tensor
        d_k (int): The dimension of the key tensor
        num_heads (int): The number of heads
        W_q (nn.Linear): The linear projection for the query tensor
        W_k (nn.Linear): The linear projection for the key tensor
        W_v (nn.Linear): The linear projection for the value tensor
        W_o (nn.Linear): The linear projection for the output tensor
    """

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0, "Embedding size not compatible with num heads"

        # Calculate the dimension of each head
        self.d_v = d_model // num_heads
        self.d_k = self.d_v
        self.num_heads = num_heads

        # Define linear projections for query, key, value, and output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(
        self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor, mask=None
    ) -> torch.Tensor:
        """
        Forward pass of the MultiHeadAttention module.

        Args:
            Q (torch.Tensor): The query tensor
            K (torch.Tensor): The key tensor
            V (torch.Tensor): The value tensor
            mask (torch.Tensor): The mask tensor

        Returns:
            torch.Tensor: The output tensor
        """

        batch_size = Q.size(0)

        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        weighted_values, attention = self.scale_dot_product(Q, K, V, mask)
        weighted_values = (
            weighted_values.transpose(1, 2)
            .contiguous()
            .view(batch_size, -1, self.num_heads * self.d_k)
        )

        weighted_values = self.W_o(weighted_values)
        return weighted_values, attention

    def scale_dot_product(
        self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor, mask=None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Dot product attention with scaling and masking.

        Args:
            Q (torch.Tensor): The query tensor
            K (torch.Tensor): The key tensor
            V (torch.Tensor): The value tensor
            mask (torch.Tensor): The mask tensor

        Returns:
            torch.Tensor: The weighted values
            torch.Tensor: The attention scores
        """

        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attention = F.softmax(scores, dim=-1)
        weighted_values = torch.matmul(attention, V)

        return weighted_values, attention

## Position-wise Feed-Forward Networks


In [13]:
class PositionFeedForward(nn.Module):
    """
    Position-wise feedforward module designed to apply two linear
    transformations with a ReLU activation in between.

    Attributes:
        linear1 (nn.Linear): The first linear transformation
        linear2 (nn.Linear): The second linear transformation
    """

    def __init__(self, d_model, d_ff):
        super().__init__()

        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the PositionFeedForward module.

        Args:
            x (torch.Tensor): The input tensor

        Returns:
            torch.Tensor: The output tensor
        """

        return self.linear2(F.relu(self.linear1(x)))

## Encoder Sublayer


In [14]:
class EncoderSubLayer(nn.Module):
    """
    Encoded sublayer module designed to apply multi-head attention
    and position-wise feedforward operations.

    Attributes:
        self_attn (MultiHeadAttention): The multi-head attention module
        ffn (PositionFeedForward): The position-wise feedforward module
        norm1 (nn.LayerNorm): The first layer normalization module
        norm2 (nn.LayerNorm): The second layer normalization module
        dropout1 (nn.Dropout): The first dropout module
        dropout2 (nn.Dropout): The second dropout
    """

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()

        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = PositionFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask=None) -> torch.Tensor:
        """
        Forward pass of the EncoderSubLayer module.

        Args:
            x (torch.Tensor): The input tensor
            mask (torch.Tensor): The mask tensor

        Returns:
            torch.Tensor: The output tensor
        """

        attention_score, _ = self.self_attn(x, x, x, mask)
        x = x + self.dropout1(attention_score)
        x = self.norm1(x)

        x = x + self.dropout2(self.ffn(x))
        return self.norm2(x)

## Encoder


In [15]:
class Encoder(nn.Module):
    """
    Encoder module designed to apply multiple EncoderSubLayer modules.

    Attributes:
        layers (nn.ModuleList): The list of EncoderSubLayer modules
        norm (nn.LayerNorm): The layer normalization module
    """

    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        super().__init__()

        self.layers = nn.ModuleList(
            [
                EncoderSubLayer(d_model, num_heads, d_ff, dropout)
                for _ in range(num_layers)
            ]
        )

        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, mask=None) -> torch.Tensor:
        """
        Forward pass of the Encoder module.

        Args:
            x (torch.Tensor): The input tensor
            mask (torch.Tensor): The mask tensor

        Returns:
            torch.Tensor: The output tensor
        """

        for layer in self.layers:
            x = layer(x, mask)

        return self.norm(x)

## Decoder Sublayer


In [16]:
class DecoderSubLayer(nn.Module):
    """
    Decode sublayer module designed to apply multi-head attention
    and position-wise feedforward operations.

    Attributes:
        self_attn (MultiHeadAttention): The multi-head attention module
        cross_attn (MultiHeadAttention): The multi-head attention module
        ffn (PositionFeedForward): The position-wise feedforward module
        norm1 (nn.LayerNorm): The first layer normalization module
        norm2 (nn.LayerNorm): The second layer normalization module
        norm3 (nn.LayerNorm): The third layer normalization module
        dropout1 (nn.Dropout): The first dropout module
        dropout2 (nn.Dropout): The second dropout module
        dropout3 (nn.Dropout): The third
    """

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()

        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(
        self, x: torch.Tensor, encoder_output, target_mask=None, encoder_mask=None
    ) -> torch.Tensor:
        """
        Forward pass of the DecoderSubLayer module.

        Args:
            x (torch.Tensor): The input tensor
            encoder_output (torch.Tensor): The encoder output tensor
            target_mask (torch.Tensor): The target mask tensor
            encoder_mask (torch.Tensor): The encoder mask tensor

        Returns:
            torch.Tensor: The output tensor
        """

        attention_score, _ = self.self_attn(x, x, x, target_mask)
        x = x + self.dropout1(attention_score)
        x = self.norm1(x)

        encoder_attn, _ = self.cross_attn(
            x, encoder_output, encoder_output, encoder_mask
        )
        x = x + self.dropout2(encoder_attn)
        x = self.norm2(x)

        ff_output = self.feed_forward(x)
        x = x + self.dropout3(ff_output)
        return self.norm3(x)

## Decoder Module


In [17]:
class Decoder(nn.Module):
    """
    Decoder module designed to apply multiple DecoderSubLayer modules.

    Attributes:
        layers (nn.ModuleList): The list of DecoderSubLayer modules
        norm (nn.LayerNorm): The layer normalization module
    """

    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        super().__init__()

        self.layers = nn.ModuleList(
            [
                DecoderSubLayer(d_model, num_heads, d_ff, dropout)
                for _ in range(num_layers)
            ]
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(
        self, x: torch.Tensor, encoder_output, target_mask, encoder_mask
    ) -> torch.Tensor:
        """
        Forward pass of the Decoder module.

        Args:
            x (torch.Tensor): The input tensor
            encoder_output (torch.Tensor): The encoder output tensor
            target_mask (torch.Tensor): The target mask tensor
            encoder_mask (torch.Tensor): The encoder mask tensor

        Returns:
            torch.Tensor: The output tensor
        """

        for layer in self.layers:
            x = layer(x, encoder_output, target_mask, encoder_mask)

        return self.norm(x)

## Transformer


In [18]:
class Transformer(nn.Module):
    """
    Transformer module designed to translate sequences from one language to
    another using an encoder and decoder architecture.

    Attributes:
        encoder_embedding (nn.Embedding): The embedding layer for the encoder
        decoder_embedding (nn.Embedding): The embedding layer for the decoder
        pos_embedding (PositionalEmbedding): The positional embedding layer
        encoder (Encoder): The encoder module
        decoder (Decoder): The decoder module
        output_layer (nn.Linear): The output layer
    """

    def __init__(
        self,
        d_model,
        num_heads,
        d_ff,
        num_layers,
        input_vocab_size,
        target_vocab_size,
        max_len=MAX_SEQ_LEN,
        dropout=0.1,
    ):
        super().__init__()

        self.encoder_embedding = nn.Embedding(input_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(target_vocab_size, d_model)
        self.pos_embedding = PositionalEmbedding(d_model, max_len)
        self.encoder = Encoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.decoder = Decoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.output_layer = nn.Linear(d_model, target_vocab_size)

    def forward(self, source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the Transformer module.

        Args:
            source (torch.Tensor): The source tensor
            target (torch.Tensor): The target tensor

        Returns:
            torch.Tensor: The output tensor
        """

        source_mask, target_mask = self.mask(source, target)
        source = self.encoder_embedding(source) * math.sqrt(
            self.encoder_embedding.embedding_dim
        )
        source = self.pos_embedding(source)

        encoder_output = self.encoder(source, source_mask)
        target = self.decoder_embedding(target) * math.sqrt(
            self.decoder_embedding.embedding_dim
        )
        target = self.pos_embedding(target)

        output = self.decoder(target, encoder_output, target_mask, source_mask)
        return self.output_layer(output)

    def mask(
        self, source: torch.Tensor, target: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Masks the source and target tensors to prevent the model from
        attending to the padding tokens.

        Args:
            source (torch.Tensor): The source tensor
            target (torch.Tensor): The target tensor

        Returns:
            List[torch.Tensor]: The source and target masks
        """

        source_mask = (source != 0).unsqueeze(1).unsqueeze(2)
        target_mask = (target != 0).unsqueeze(1).unsqueeze(2)

        size = target.size(1)
        no_mask = torch.tril(torch.ones((1, size, size), device=device)).bool()
        target_mask = target_mask & no_mask

        return source_mask, target_mask

## Simple test


### Sequence parameters


In [19]:
seq_len_source = 10
seq_len_target = 10
batch_size = 2
input_vocab_size = 50
target_vocab_size = 50

### Input & Output


In [20]:
source = torch.randint(1, input_vocab_size, (batch_size, seq_len_source))
target = torch.randint(1, target_vocab_size, (batch_size, seq_len_target))

### Model Hyperparameters


In [21]:
d_model = 512
num_heads = 8
d_ff = 2048
num_layers = 6

## Model Execution


In [22]:
model = Transformer(
    d_model,
    num_heads,
    d_ff,
    num_layers,
    input_vocab_size,
    target_vocab_size,
    max_len=MAX_SEQ_LEN,
    dropout=0.1,
)

In [23]:
model = model.to(device)
source = source.to(device)
target = target.to(device)

Computing the model output


In [24]:
output = model(source, target)
print(f"ouput.shape {output.shape}")

ouput.shape torch.Size([2, 10, 50])


## Translator Eng-Spa


### English File Reading


In [25]:
PATH = "/kaggle/working/eng-spa4.txt"

In [26]:
with open(PATH, "r", encoding="utf-8") as f:
    lines = f.readlines()

eng_spa_pairs = [line.strip().split("\t") for line in lines if "\t" in line]

In [27]:
eng_spa_pairs[:10]

[['Hi.', 'Hola.'],
 ['No.', 'No.'],
 ['Go.', 'Vaya.'],
 ['Go!', 'Vete'],
 ['No!', '¡No!'],
 ['Go!', '¡Fuera!'],
 ['Go!', '¡Sal!'],
 ['Go!', '¡Ve!'],
 ['Ah!', '¡Anda!'],
 ['Go!', 'Váyase']]

In [28]:
eng_sentences = [pair[0] for pair in eng_spa_pairs]
spa_sentences = [pair[1] for pair in eng_spa_pairs]

In [29]:
print(eng_sentences[:10])
print(spa_sentences[:10])

['Hi.', 'No.', 'Go.', 'Go!', 'No!', 'Go!', 'Go!', 'Go!', 'Ah!', 'Go!']
['Hola.', 'No.', 'Vaya.', 'Vete', '¡No!', '¡Fuera!', '¡Sal!', '¡Ve!', '¡Anda!', 'Váyase']


In [30]:
def preprocess_sentence(sentence: str) -> str:
    """
    Function to preprocess a sentence by converting to lowercase, removing special characters,
    and adding start and end tokens.

    Args:
        sentence (str): The input sentence

    Returns:
        str: The preprocessed sentence
    """

    sentence = sentence.lower().strip()
    sentence = re.sub(r'[" "]+', " ", sentence)

    sentence = re.sub(r"[á]+", "a", sentence)
    sentence = re.sub(r"[é]+", "e", sentence)
    sentence = re.sub(r"[í]+", "i", sentence)
    sentence = re.sub(r"[ó]+", "o", sentence)
    sentence = re.sub(r"[ú]+", "u", sentence)

    sentence = re.sub(r"[^a-z]+", " ", sentence)
    sentence = sentence.strip()
    sentence = "<sos> " + sentence + " <eos>"

    return sentence

### Sample Processing


In [31]:
s1 = "¿Hola @ cómo estás? 123"

In [32]:
print(s1)
print(preprocess_sentence(s1))

¿Hola @ cómo estás? 123
<sos> hola como estas <eos>


In [33]:
eng_sentences = [preprocess_sentence(sentence) for sentence in eng_sentences]
spa_sentences = [preprocess_sentence(sentence) for sentence in spa_sentences]

In [34]:
spa_sentences[:10]

['<sos> hola <eos>',
 '<sos> no <eos>',
 '<sos> vaya <eos>',
 '<sos> vete <eos>',
 '<sos> no <eos>',
 '<sos> fuera <eos>',
 '<sos> sal <eos>',
 '<sos> ve <eos>',
 '<sos> anda <eos>',
 '<sos> vayase <eos>']

In [35]:
def build_vocab(sentences: List[str]) -> dict:
    """
    Function to build a vocabulary from a list of sentences.

    Args:
        sentences (List[str]): containing input sentences

    Returns:
        word2idx: dict, mapping words to indices
        idx2word: dict, mapping indices to words
    """

    words = [word for sentence in sentences for word in sentence.split()]
    word_count = Counter(words)

    sorted_word_counts = sorted(word_count.items(), key=lambda x: x[1], reverse=True)

    word2idx = {word: idx for idx, (word, _) in enumerate(sorted_word_counts, 2)}

    word2idx["<pad>"] = 0  # Reserved for padding
    word2idx["<unk>"] = 1  # Reserved for unknown words

    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word

## Building the english & spanish vocabularies


In [36]:
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
spa_word2idx, spa_idx2word = build_vocab(spa_sentences)
eng_vocab_size = len(eng_word2idx)
spa_vocab_size = len(spa_word2idx)

In [37]:
print(eng_vocab_size, spa_vocab_size)

27688 46991


## EngSpaDataset


In [38]:
class EngSpaDataset(Dataset):
    """
    English-Spanish dataset class designed to convert sentences into
    word indices using the word-to-index dictionaries.

    Attributes:
        eng_sentences (List[str]): List of English sentences
        spa_sentences (List[str]): List of Spanish sentences
        eng_word2idx (dict): Dictionary to map English words to indices
        spa_word2idx (dict): Dictionary to map Spanish words to indices
    """

    def __init__(
        self,
        eng_sentences: List[str],
        spa_sentences: List[str],
        eng_word2idx: dict,
        spa_word2idx: dict,
    ):
        self.eng_sentences = eng_sentences
        self.spa_sentences = spa_sentences
        self.eng_word2idx = eng_word2idx
        self.spa_word2idx = spa_word2idx

    def __len__(self):
        """
        Computes the length of the dataset.

        Returns:
            int: The length of the dataset
        """

        return len(self.eng_sentences)

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Gets an item from the dataset at the specified index.

        Args:
            idx (int): The index of the item to retrieve

        Returns:
            Tuple[torch.Tensor, torch.Tensor]: The English and Spanish sentence indices
        """

        eng_sentence = self.eng_sentences[idx]
        spa_sentence = self.spa_sentences[idx]

        eng_idxs = [
            self.eng_word2idx.get(word, self.eng_word2idx["<unk>"])
            for word in eng_sentence.split()
        ]

        spa_idxs = [
            self.spa_word2idx.get(word, self.spa_word2idx["<unk>"])
            for word in spa_sentence.split()
        ]

        return torch.tensor(eng_idxs), torch.tensor(spa_idxs)

## Collate Function


In [39]:
def collate_fn(
    batch: List[Tuple[torch.Tensor, torch.Tensor]]
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Function to pad sequences in a batch to the same length.

    Args:
        batch (List[Tuple[torch.Tensor, torch.Tensor]]): The batch of data

    Returns:
        eng_batch: tensor, padded English sentences
        spa_batch: tensor, padded Spanish sentences
    """

    eng_batch, spa_batch = zip(*batch)
    eng_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in eng_batch]
    spa_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in spa_batch]

    eng_batch = torch.nn.utils.rnn.pad_sequence(
        eng_batch, batch_first=True, padding_value=0
    )

    spa_batch = torch.nn.utils.rnn.pad_sequence(
        spa_batch, batch_first=True, padding_value=0
    )

    return eng_batch, spa_batch

## Train Function


In [40]:
def train(model, dataloader, loss_function, optimiser, epochs):
    """
    Training loop for the Transformer model.

    Args:
        model (Transformer): The Transformer model
        dataloader (DataLoader): The DataLoader object
        loss_function (nn.CrossEntropyLoss): The loss function
        optimiser (optim.Adam): The optimiser
        epochs (int): The number of epochs
    """

    model.train()

    for epoch in range(epochs):
        total_loss = 0

        for i, (eng_batch, spa_batch) in enumerate(dataloader):
            eng_batch = eng_batch.to(device)
            spa_batch = spa_batch.to(device)

            target_input = spa_batch[:, :-1]
            target_output = spa_batch[:, 1:].contiguous().view(-1)

            optimiser.zero_grad()

            output = model(eng_batch, target_input)
            output = output.view(-1, output.size(-1))

            loss = loss_function(output, target_output)

            loss.backward()
            optimiser.step()
            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch: {epoch}/{epochs}, Loss: {avg_loss:.4f}")

## Data Loader Parameters


In [41]:
BATCH_SIZE = 64

In [42]:
dataset = EngSpaDataset(eng_sentences, spa_sentences, eng_word2idx, spa_word2idx)

In [43]:
dataloader = DataLoader(
    dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn
)

## Training the model


In [44]:
model = Transformer(
    d_model=512,
    num_heads=8,
    d_ff=2048,
    num_layers=6,
    input_vocab_size=eng_vocab_size,
    target_vocab_size=spa_vocab_size,
    max_len=MAX_SEQ_LEN,
    dropout=0.1,
)

In [45]:
model = model.to(device)
loss_function = nn.CrossEntropyLoss(ignore_index=0)
optimiser = optim.Adam(model.parameters(), lr=0.0001)

In [46]:
train(model, dataloader, loss_function, optimiser, epochs=10)

Epoch: 0/10, Loss: 3.5978
Epoch: 1/10, Loss: 2.2009
Epoch: 2/10, Loss: 1.7008
Epoch: 3/10, Loss: 1.3739
Epoch: 4/10, Loss: 1.1242
Epoch: 5/10, Loss: 0.9223
Epoch: 6/10, Loss: 0.7568
Epoch: 7/10, Loss: 0.6298
Epoch: 8/10, Loss: 0.5343
Epoch: 9/10, Loss: 0.4664


## Auxiliary Functions


In [47]:
def sentence_to_indices(sentence, word2idx):
    """
    Converts a sentence into a list of indices using a word-to-index dictionary.

    Args:
        sentence (str): The sentence to convert.
        word2idx (dict): The dictionary mapping words to indices.

    Returns:to
        list: A list of indices corresponding to the words in the sentence.
    """

    return [word2idx.get(word, word2idx["<unk>"]) for word in sentence.split()]


def indices_to_sentence(indices, idx2word):
    """
    Converts a list of indices back into a sentence using an index-to-word dictionary.

    Args:
        indices (list): The list of indices to convert.
        idx2word (dict): The dictionary mapping indices to words.

    Returns:
        str: The sentence corresponding to the indices.
    """

    return " ".join(
        [
            idx2word[idx]
            for idx in indices
            if idx in idx2word and idx2word[idx] != "<pad>"
        ]
    )


def translate_sentence(
    model, sentence, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device="cpu"
) -> str:
    """
    Translates a sentence from English to Spanish using the trained Transformer model.

    Args:
        model (nn.Module): The trained Transformer model.
        sentence (str): The English sentence to translate.
        eng_word2idx (dict): The dictionary mapping English words to indices.
        spa_idx2word (dict): The dictionary mapping Spanish indices to words.
        max_len (int): The maximum length of the translated sentence.
        device (str): The device to run the model on ('cpu' or 'cuda').

    Returns:
        str: The translated Spanish sentence.
    """

    model.eval()

    sentence = preprocess_sentence(sentence)
    input_indices = sentence_to_indices(sentence, eng_word2idx)
    input_tensor = torch.tensor(input_indices).unsqueeze(0).to(device)

    tgt_indices = [spa_word2idx["<sos>"]]
    tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)

    with torch.no_grad():
        for _ in range(max_len):
            output = model(input_tensor, tgt_tensor)

            output = output.squeeze(0)

            next_token = output.argmax(dim=-1)[-1].item()
            tgt_indices.append(next_token)
            tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)

            if next_token == spa_word2idx["<eos>"]:
                break

    return indices_to_sentence(tgt_indices, spa_idx2word)

## Evaluator


In [48]:
def evaluate_translations(
    model, sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device="cpu"
):
    """
    Evaluates translations for a list of sentences using the trained Transformer model.

    Args:
        model (nn.Module): The trained Transformer model.
        sentences (list): A list of sentences to translate.
        eng_word2idx (dict): The dictionary mapping English words to indices.
        spa_idx2word (dict): The dictionary mapping Spanish indices to words.
        max_len (int): The maximum length of the translated sentence.
        device (str): The device to run the model on ('cpu' or 'cuda').
    """

    for sentence in sentences:
        translation = translate_sentence(
            model, sentence, eng_word2idx, spa_idx2word, max_len, device
        )

        print(f"Input sentence: {sentence}")
        print(f"Traducción: {translation}")
        print()

## Testing the model


In [None]:
test_sentences = [
    "Hello, how are you?",
    "I am learning artificial intelligence.",
    "Artificial intelligence is great.",
    "Good night!",
    "The weather is beautiful today.",
    "Could you please help me find my keys?",
    "She works at a technology company in Silicon Valley.",
    "Remember to drink plenty of water throughout the day.",
    "My favorite season is autumn because of the colorful leaves.",
    "We should meet for coffee next Tuesday afternoon.",
    "The new restaurant downtown serves amazing Italian food.",
    "The children are playing in the park with their friends.",
    "Don't forget to submit your assignment by Friday.",
    "The concert last night was absolutely incredible.",
]

In [50]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

In [53]:
evaluate_translations(
    model,
    test_sentences,
    eng_word2idx,
    spa_idx2word,
    max_len=MAX_SEQ_LEN,
    device=device,
)

Input sentence: Hello, how are you?
Traducción: <sos> hola como estas <eos>

Input sentence: I am learning artificial intelligence.
Traducción: <sos> estoy aprendiendo inteligencia artificial <eos>

Input sentence: Artificial intelligence is great.
Traducción: <sos> la inteligencia artificial es muy artificial <eos>

Input sentence: Good night!
Traducción: <sos> buenas noches <eos>

Input sentence: The weather is beautiful today.
Traducción: <sos> hoy hace un buen tiempo <eos>

Input sentence: Could you please help me find my keys?
Traducción: <sos> podrias ayudarme a encontrar mis llaves por favor <eos>

Input sentence: She works at a technology company in Silicon Valley.
Traducción: <sos> ella trabaja en una empresa de tecnologia de gravedad <eos>

Input sentence: Remember to drink plenty of water throughout the day.
Traducción: <sos> no te olvides de tomar mucha agua por dia <eos>

Input sentence: My favorite season is autumn because of the colorful leaves.
Traducción: <sos> mi oto 