# Team members

| Id        | Student                                 |
|-----------|-----------------------------------------|
| A01795654 | Raul Astorga Castro                     |
| A01795579 | Edson Misael Astorga Castro             |

# TC 5033
## Deep Learning
## Transformers

## Activity 4: Implementing a Translator

- Objective

To understand the Transformer Architecture by Implementing a translator.

- Instructions

    This activity requires submission in teams. While teamwork is encouraged, each member is expected to contribute individually to the assignment. The final submission should feature the best arguments and solutions from each team member. Only one person per team needs to submit the completed work, but it is imperative that the names of all team members are listed in a Markdown cell at the very beginning of the notebook (either the first or second cell). Failure to include all team member names will result in the grade being awarded solely to the individual who submitted the assignment, with zero points given to other team members (no exceptions will be made to this rule).

    Follow the provided code. The code already implements a transformer from scratch as explained in one of [week's 9 videos](https://youtu.be/XefFj4rLHgU)

    Since the provided code already implements a simple translator, your job for this assignment is to understand it fully, and document it using pictures, figures, and markdown cells.  You should test your translator with at least 10 sentences. The dataset used for this task was obtained from [Tatoeba, a large dataset of sentences and translations](https://tatoeba.org/en/downloads).
  
- Evaluation Criteria

    - Code Readability and Comments
    - Traning a translator
    - Translating at least 10 sentences.

- Submission

Submit this Jupyter Notebook in canvas with your complete solution, ensuring your code is well-commented and includes Markdown cells that explain your design choices, results, and any challenges you encountered.



## Import libraries required for the model training

In [22]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import math
import numpy as np
import re

# Set random seed for reproducibility
torch.manual_seed(23)

<torch._C.Generator at 0x79ca385c6ff0>

## Using GPU

In [23]:
if torch.cuda.is_available():
    device = torch.device('cuda') # GPU will be used if available
elif torch.backends.mps.is_available():
    device = torch.device('mps') # GPU will be used in Apple Silicon Macs if available
else:
    device = torch.device('cpu') # CPU will be used if GPU is not available
print(device)

cuda


## Building the Transformer Model

[![](transformer.png)](transformer.png "Transformer Architecture")

The Transformer model, introduced in the “Attention is All You Need” paper, revolutionizes sequence processing by relying entirely on self-attention mechanisms rather than recurrence or convolution. It consists of an encoder-decoder architecture, where both the encoder and decoder are made up of multiple layers of self-attention and feed-forward networks. The key innovation is the self-attention mechanism, which allows the model to weigh the importance of different tokens in the input sequence, regardless of their position. This enables parallelization and significantly improves efficiency, especially for tasks like machine translation. Additionally, positional encodings are used to inject sequence order information, compensating for the model’s lack of inherent positional processing. The Transformer’s architecture has since become the foundation for many state-of-the-art models in natural language processing.

> Vaswani, A. et al. (2017). Attention is All You NeedLinks to an external site. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

### Constants

In [24]:
# TODO: pending
MAX_SEQ_LEN = 128

### Positional Encoding Module

[![](positional_encoding.png)](positional_encoding.png "Positional Encoding")

Positional Encoding is a key concept introduced in the “Attention is All You Need” paper to address the lack of sequential order processing in the Transformer model. Since Transformers don’t inherently process input data in a temporal or spatial order like RNNs or CNNs, positional encodings are added to the input embeddings to inject information about the position of each token in a sequence. Typically, these encodings are generated using sinusoidal functions, where each dimension corresponds to a different frequency. This enables the model to capture the relative or absolute positions of tokens, allowing it to process sequences effectively without relying on traditional recurrence or convolution mechanisms.

> Vaswani, A. et al. (2017). Attention is All You NeedLinks to an external site. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

In [25]:
class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_seq_len = MAX_SEQ_LEN):
        """
        Initializes the positional embedding matrix, which adds positional information to token embeddings
        by using sine and cosine functions as proposed in "Attention is All You Need".

        Args:
            d_model (int): The dimensionality of the embeddings for each token.
            max_seq_len (int, optional): The maximum number of tokens in a sentence. Defaults to MAX_SEQ_LEN.
        """
        super().__init__()
        # Initializing a matrix for positional embeddings with zeros, with shape [max_seq_len, d_model]
        self.pos_embed_matrix = torch.zeros(max_seq_len, d_model, device=device)
        # Token position array with shape [max_seq_len, 1] to store the positions of tokens in sequence
        token_pos = torch.arange(0, max_seq_len, dtype = torch.float).unsqueeze(1)

        # Computing the scaling term for each position, as described in the Transformer model,
        # which controls the frequency of the sine and cosine functions
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0)/d_model))

        # Assigning sine values to even indices and cosine values to odd indices in the positional embedding matrix
        self.pos_embed_matrix[:, 0::2] = torch.sin(token_pos * div_term)
        self.pos_embed_matrix[:, 1::2] = torch.cos(token_pos * div_term)

        # Adding a batch dimension and transposing to match expected input shape [1, max_seq_len, d_model]
        self.pos_embed_matrix = self.pos_embed_matrix.unsqueeze(0).transpose(0,1)

    def forward(self, x):
        """
        Adds positional embeddings to the input embeddings.

        Args:
            x (torch.Tensor): The input embeddings of shape [seq_len, batch_size, d_model].

        Returns:
            torch.Tensor: The input embeddings with positional information added.
        """

        # Adds the positional embedding matrix to the input embeddings, broadcasting over batch and sequence dimensions
        return x + self.pos_embed_matrix[:x.size(0), :]

### Multi-Head Attention Module

Module                                                                          |  Architecture
:------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------:
[![](multiheadattention_.png)](multiheadattention_.png "Multi-Head Attention")  |  [![](multiheadattention_2.png)](multiheadattention_2.png "Multi-Head Attention")


The Multi-Head Attention module in the Transformer model enhances the self-attention mechanism by allowing the model to focus on different parts of the input sequence simultaneously. Instead of performing a single attention operation, it runs multiple attention operations (or “heads”) in parallel, each with different learned attention weights. The outputs of these attention heads are then concatenated and linearly transformed to produce the final result. This approach enables the model to capture a broader range of relationships and dependencies within the data, as each attention head can focus on different aspects of the sequence. Multi-Head Attention thus improves the model’s ability to understand complex patterns and interactions in the input sequence.

> Vaswani, A. et al. (2017). Attention is All You NeedLinks to an external site. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

In [26]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model = 512, num_heads = 8):
        """
        Multi-head attention mechanism that divides attention computation into multiple heads
        for parallelized self-attention.

        Args:
            d_model (int): Dimensionality of the embeddings for each token.
            num_heads (int): Number of attention heads for multi-head attention.

        Raises:
            AssertionError: If d_model is not divisible by num_heads, as each head must
                            have an equal share of d_model for compatibility.
        """
        super().__init__()
        assert d_model % num_heads == 0, 'Embedding size not compatible with num heads'

        # Dimension per head for keys (d_k) and values (d_v)
        self.d_v = d_model // num_heads
        self.d_k = self.d_v
        self.num_heads = num_heads

        # Linear transformations for query, key, and value projection
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask = None):
        """
        Computes multi-head attention for given query, key, and value tensors.

        Args:
            Q (torch.Tensor): Query tensor of shape [batch_size, seq_len, num_heads*d_k].
            K (torch.Tensor): Key tensor of shape [batch_size, seq_len, num_heads*d_k].
            V (torch.Tensor): Value tensor of shape [batch_size, seq_len, num_heads*d_k].
            mask (torch.Tensor, optional): Mask tensor to prevent attention to certain positions. Defaults to None.

        Returns:
            tuple: Contains the following:
                - torch.Tensor: Weighted values after multi-head attention and transformation.
                - torch.Tensor: Attention scores.
        """
        batch_size = Q.size(0)

        # Linear projections of Q, K, V for each head, and reshaping to [batch_size, num_heads, seq_len, d_k]
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2 )
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2 )
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2 )

        # Scaled dot-product attention computation
        weighted_values, attention = self.scale_dot_product(Q, K, V, mask)

        # Concatenation of attention heads and output projection
        weighted_values = weighted_values.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads*self.d_k)
        weighted_values = self.W_o(weighted_values)

        return weighted_values, attention

    def scale_dot_product(self, Q, K, V, mask = None):
        """
        Computes the scaled dot-product attention for the query, key, and value tensors.

        Args:
            Q (torch.Tensor): Query tensor of shape [batch_size, num_heads, seq_len, d_k].
            K (torch.Tensor): Key tensor of shape [batch_size, num_heads, seq_len, d_k].
            V (torch.Tensor): Value tensor of shape [batch_size, num_heads, seq_len, d_k].
            mask (torch.Tensor, optional): Mask tensor to prevent attention to certain positions. Defaults to None.

        Returns:
            tuple: Contains the following:
                - torch.Tensor: Weighted values after applying attention.
                - torch.Tensor: Softmaxed attention scores.
        """
        # Calculate the dot product of Q and K, scaled by sqrt(d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # Apply mask to the scores if provided, setting masked positions to a very low value
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        # Compute attention scores with softmax
        attention = F.softmax(scores, dim = -1)

        # Calculate weighted values by multiplying attention scores with V
        weighted_values = torch.matmul(attention, V)

        return weighted_values, attention


### Feed Forward Module

[![](feed_forward.png)](feed_forward.png "Feed Forward")

The Feed Forward module in the Transformer model is a fully connected layer that follows the Multi-Head Attention in both the encoder and decoder. It consists of two linear transformations with a ReLU activation in between. First, the input is passed through a linear layer, followed by a ReLU activation, and then through another linear layer. This module applies the same transformation independently to each position in the sequence, providing non-linearity and helping the model learn complex relationships. While it operates independently on each token, it allows the Transformer to process information more efficiently by adding depth and expressiveness to the model’s representation of the data.

> Vaswani, A. et al. (2017). Attention is All You NeedLinks to an external site. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

In [27]:
class PositionFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        """
        Feed-forward network used after the self-attention mechanism in each encoder layer.

        Args:
            d_model (int): Dimensionality of the input and output (embedding size).
            d_ff (int): Dimensionality of the inner layer (hidden layer size).
        """
        super().__init__()
        # First linear transformation from d_model to d_ff (hidden layer size)
        self.linear1 = nn.Linear(d_model, d_ff)
         # Second linear transformation back from d_ff to d_model
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        """
        Passes the input through the two linear transformations with a ReLU activation
        in between to introduce non-linearity.

        Args:
            x (torch.Tensor): Input tensor of shape [batch_size, seq_len, d_model].

        Returns:
            torch.Tensor: Output tensor of the same shape [batch_size, seq_len, d_model].
        """
        # Apply the first linear transformation followed by ReLU activation
        x = F.relu(self.linear1(x))
        # Apply the second linear transformation to get the final output
        return self.linear2(x)

### Encoder Module

[![](encoder.png)](encoder.png "Encoder")

The Encoder module of the Transformer model is responsible for processing the input sequence and generating a context-aware representation of each token. It consists of a stack of identical layers, each comprising two main components: a Multi-Head Attention mechanism and a Feed Forward neural network. In each layer, the Multi-Head Attention computes attention scores to capture dependencies between tokens, while the Feed Forward network applies further transformations to each token’s representation. Both components are followed by layer normalization and residual connections to stabilize training. The encoder produces a set of encoded representations that capture the input sequence’s context, which is then passed to the decoder for further processing in tasks like machine translation.

> Vaswani, A. et al. (2017). Attention is All You NeedLinks to an external site. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

In [28]:
class EncoderSubLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout = 0.1):
        """
        Initializes an EncoderSubLayer, a single layer of the encoder, which consists of self-attention and
        feed-forward sublayers with layer normalization and dropout.

        Args:
            d_model (int): Dimensionality of the embeddings for each token.
            num_heads (int): Number of attention heads for multi-head attention.
            d_ff (int): Dimensionality of the feed-forward network in this sublayer.
            dropout (float, optional): Dropout rate for regularization in attention and feed-forward layers. Defaults to 0.1.
        """
        super().__init__()
        # Multi-head self-attention mechanism
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward network
        self.ffn = PositionFeedForward(d_model, d_ff)
        # Layer normalization applied after each sublayer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout layers for regularization
        self.droupout1 = nn.Dropout(dropout)
        self.droupout2 = nn.Dropout(dropout)

    def forward(self, x, mask = None):
        """
        Passes the input through the self-attention, feed-forward layers, with residual connections, normalization, and dropout.

        Args:
            x (torch.Tensor): Input tensor of shape [batch_size, seq_len, d_model].
            mask (torch.Tensor, optional): Mask tensor to prevent attention to certain positions. Defaults to None.

        Returns:
            torch.Tensor: Processed tensor with the same shape as the input [batch_size, seq_len, d_model].
        """
        # Apply self-attention with residual connection, dropout, and layer normalization
        attention_score, _ = self.self_attn(x, x, x, mask)
        x = x + self.droupout1(attention_score)
        x = self.norm1(x)
        # Apply position-wise feed-forward network with residual connection, dropout, and layer normalization
        x = x + self.droupout2(self.ffn(x))
        return self.norm2(x)

In [29]:
class Encoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        """
        Initializes the Encoder, a stack of multiple encoder layers, each with self-attention and feed-forward sublayers.

        Args:
            d_model (int): Dimensionality of the embeddings for each token.
            num_heads (int): Number of attention heads for multi-head attention.
            d_ff (int): Dimensionality of the feed-forward network in each encoder layer.
            num_layers (int): Number of encoder layers in the stack (N in the original Transformer architecture).
            dropout (float, optional): Dropout rate applied to embeddings and feed-forward layers. Defaults to 0.1.
        """
        super().__init__()
        # Creating a list of encoder layers, each with multi-head attention and feed-forward sublayers
        self.layers = nn.ModuleList([EncoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        # Final layer normalization applied after all encoder layers
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        """
        Passes input embeddings through the encoder stack, applying each encoder layer in sequence.

        Args:
            x (torch.Tensor): Input embeddings of shape [batch_size, seq_len, d_model].
            mask (torch.Tensor, optional): Mask tensor to prevent attention to certain positions,
                                           such as padding tokens. Defaults to None.

        Returns:
            torch.Tensor: Encoded representation of the input, with shape [batch_size, seq_len, d_model].
        """
        # Sequentially applies each encoder layer, passing the output to the next layer
        for layer in self.layers:
            x = layer(x, mask)

        # Applies layer normalization to the final encoder output
        return self.norm(x)

### Decoder Module

[![](decoder.png)](decoder.png "Decoder")

The Decoder module of the Transformer model is responsible for generating the output sequence, typically used in tasks like machine translation. Like the encoder, it consists of a stack of identical layers, but with an additional layer of Multi-Head Attention. Each layer in the decoder contains three main components: a Multi-Head Attention mechanism that attends to the encoder’s output, another Multi-Head Attention that attends to the decoder’s previous layer (enabling autoregressive generation), and a Feed Forward neural network. The decoder also incorporates layer normalization and residual connections. The final output of the decoder is passed through a linear layer and softmax function to produce a probability distribution over the target vocabulary, from which the next token is predicted. This process is repeated until the entire sequence is generated.

> Vaswani, A. et al. (2017). Attention is All You NeedLinks to an external site. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

In [30]:
class DecoderSubLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        """
        Initializes the sublayer with self-attention, cross-attention,
        position-wise feed-forward layers, normalization, and dropout.

        Args:
            d_model (int): Dimensionality of embeddings and hidden layers.
            num_heads (int): Number of attention heads in multi-head attention.
            d_ff (int): Dimensionality of the feed-forward layer.
            dropout (float): Dropout rate applied to embeddings and feed-forward layers. Defaults to 0.1.
        """
        super().__init__()
        # Self-attention layer for the target sequence
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Cross-attention layer to attend to encoder outputs
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward layer
        self.feed_forward = PositionFeedForward(d_model, d_ff)
        # Layer normalizations for each subcomponent
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        # Dropout layers to prevent overfitting
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, x, encoder_output, target_mask=None, encoder_mask=None):
        """
        Forward pass through the decoder sublayer, applying self-attention,
        cross-attention, and a feed-forward network, with residual connections
        and normalization.

        Args:
            x (torch.Tensor): Target sequence tensor with shape
                (batch_size, target_seq_len, d_model).
            encoder_output (torch.Tensor): Encoder output with shape
                (batch_size, source_seq_len, d_model).
            target_mask (torch.Tensor, optional): Mask for self-attention
                within the target sequence.
            encoder_mask (torch.Tensor, optional): Mask for cross-attention
                with encoder output.

        Returns:
            torch.Tensor: Processed tensor of shape (batch_size, target_seq_len, d_model).
        """
        # Self-attention over the target sequence with residual connection
        attention_score, _ = self.self_attn(x, x, x, target_mask)
        x = x + self.dropout1(attention_score) # Apply dropout and add residual
        x = self.norm1(x) # Normalize the result

        # Cross-attention with encoder output, allowing decoder to attend to encoded source
        encoder_attn, _ = self.cross_attn(x, encoder_output, encoder_output, encoder_mask)
        x = x + self.dropout2(encoder_attn)  # Apply dropout and add residual
        x = self.norm2(x) # Normalize the result

        # Position-wise feed-forward layer with residual connection
        ff_output = self.feed_forward(x)
        x = x + self.dropout3(ff_output) # Apply dropout and add residual
        return self.norm3(x) # Final layer normalization

In [31]:
class Decoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        """
        Transformer decoder, which decodes the encoded information into target sequence
        representations while attending to the encoder output.

        Args:
            d_model (int): Dimension of embeddings and hidden layers.
            num_heads (int): Number of attention heads in each multi-head attention layer.
            d_ff (int): Dimension of the feed-forward layer.
            num_layers (int): Number of decoder sub-layers.
            dropout (float): Dropout rate applied to the decoder layers.
        """
        super().__init__()
        # Create a list of DecoderSubLayer instances, each representing one layer in the decoder stack
        self.layers = nn.ModuleList([DecoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        # Final layer normalization applied to the decoder output
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output, target_mask, encoder_mask):
        """
        Passes the input through each decoder layer, allowing each layer to attend to
        both the current input and the encoder output, then applies layer normalization.

        Args:
            x (torch.Tensor): Input tensor with shape [batch_size, target_seq_len, d_model].
            encoder_output (torch.Tensor): Output from the encoder with shape [batch_size, source_seq_len, d_model].
            target_mask (torch.Tensor): Mask to prevent attending to future tokens in the target sequence.
            encoder_mask (torch.Tensor): Mask to prevent attending to padding tokens in the encoder output.

        Returns:
            torch.Tensor: Processed tensor with shape [batch_size, target_seq_len, d_model].
        """
        # Pass input through each DecoderSubLayer in the decoder
        for layer in self.layers:
            x = layer(x, encoder_output, target_mask, encoder_mask)
        # Apply final layer normalization to the output
        return self.norm(x)

### Transformer Class

In [32]:
class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers,
                 input_vocab_size, target_vocab_size,
                 max_len=MAX_SEQ_LEN, dropout=0.1):
        """
        Initializes the Transformer model with embeddings, encoder, decoder,
        and an output projection layer.

        Args:
            d_model (int): Dimensionality of embeddings and hidden layers.
            num_heads (int): Number of attention heads in multi-head attention.
            d_ff (int): Dimensionality of the feed-forward network.
            num_layers (int): Number of encoder and decoder layers.
            input_vocab_size (int): Size of the input vocabulary.
            target_vocab_size (int): Size of the target vocabulary.
            max_len (int): Maximum length of input sequence.
            dropout (float): Dropout for regularization. Defaults to 0.1.
        """
        super().__init__()

        # Embedding layer for the input vocabulary, maps each token to a vector of size d_model
        self.encoder_embedding = nn.Embedding(input_vocab_size, d_model)

        # Embedding layer for the target vocabulary, maps target tokens to vectors of size d_model
        self.decoder_embedding = nn.Embedding(target_vocab_size, d_model)

        # Positional encoding layer to add positional information to the token embeddings, helps model learn sequence order and structure
        self.pos_embedding = PositionalEmbedding(d_model, max_len)

        # Encoder module: processes the source sequence to create context-aware embeddings
        self.encoder = Encoder(d_model, num_heads, d_ff, num_layers, dropout)

        # Decoder module: generates target sequence predictions using encoder outputs as context
        self.decoder = Decoder(d_model, num_heads, d_ff, num_layers, dropout)

        # Output layer that projects decoder outputs into the target vocabulary's size, yielding logits over possible output tokens
        self.output_layer = nn.Linear(d_model, target_vocab_size)

    def forward(self, source, target):
        """
        Performs the forward pass through the Transformer model.

        Args:
            source (torch.Tensor): Input tensor representing the source sequence,
                                   shape (batch_size, source_seq_len).
            target (torch.Tensor): Input tensor representing the target sequence (used as context),
                                   shape (batch_size, target_seq_len).

        Returns:
            torch.Tensor: Tensor with logits over the target vocabulary for each token in the target sequence,
                          shape (batch_size, target_seq_len, target_vocab_size).
        """
        # Generate masks to control which tokens the model should attend to, including padding
        # and causal masking (for future token masking) in the target sequence
        source_mask, target_mask = self.mask(source, target)

        # Embedding and positional encoding for the source sequence:
        # scales embeddings by sqrt(d_model) to maintain variance, adds positional encoding
        source = self.encoder_embedding(source) * math.sqrt(self.encoder_embedding.embedding_dim)
        source = self.pos_embedding(source)

        # Pass the processed source sequence through the encoder to obtain encoded representations
        encoder_output = self.encoder(source, source_mask)

        # Embedding and positional encoding for the target sequence, processed similarly to source
        target = self.decoder_embedding(target) * math.sqrt(self.decoder_embedding.embedding_dim)
        target = self.pos_embedding(target)

        # Pass the processed target sequence and encoder outputs to the decoder
        output = self.decoder(target, encoder_output, target_mask, source_mask)

        # Project the decoder outputs to the target vocabulary, returning logits over each possible token
        return self.output_layer(output)

    def mask(self, source, target):
        """
        Creates masks to (1) ignore padding tokens in source and target sequences and
        (2) prevent attention to future tokens in the target sequence (causal masking).

        Args:
            source (torch.Tensor): Tensor for the source sequence.
            target (torch.Tensor): Tensor for the target sequence.

        Returns:
            tuple: source_mask and target_mask.
                   - source_mask (torch.Tensor): Mask for padding tokens in the source sequence,
                     shape (batch_size, 1, 1, source_seq_len).
                   - target_mask (torch.Tensor): Mask for both padding tokens and future tokens in the
                     target sequence, shape (batch_size, 1, target_seq_len, target_seq_len).
        """
        # Mask to prevent attention to padding tokens in the source sequence,
        # true values indicate non-padding tokens
        source_mask = (source != 0).unsqueeze(1).unsqueeze(2)

        # Mask to prevent attention to padding tokens in the target sequence,
        # causal masking is applied to prevent attending to future tokens in sequence
        target_mask = (target != 0).unsqueeze(1).unsqueeze(2)

        # Define a lower-triangular matrix (causal mask) for target sequence, blocking future tokens
        size = target.size(1) # target sequence length
        no_mask = torch.tril(torch.ones((1, size, size), device=device)).bool()
        target_mask = target_mask & no_mask  # combines padding and causal masking

        return source_mask, target_mask


## Running a Simple test

### Explanation
This simple test have the tarjet of creating synthetic data that test model correctness functionality. The main idea is provide random data simulating real words and providing them to the model.

The expected output is a tensor which deliver the word index that would match the best on spanish.

All the required variables are defined to run the correct simalation of this test.

### Simple Test Parameters

In [33]:
# Define the sequence length for both source and target sentences; these represent the maximum
# number of tokens in each sentence for source and target languages
seq_len_source = 10  # Sequence length for source sentences
seq_len_target = 10  # Sequence length for target sentences

# Define the batch size, which is the number of sentence pairs processed in parallel
batch_size = 2  # Number of sequences (sentences) in each batch

# Define the vocabulary sizes for source and target languages. Each integer represents a unique token.
input_vocab_size = 50   # Vocabulary size for source language (input language)
target_vocab_size = 50  # Vocabulary size for target language (output language)

# Generate random sequences of token IDs for the source sentences, with integers ranging
# from 1 to input_vocab_size. This simulates actual sentences from the input language.
source = torch.randint(1, input_vocab_size, (batch_size, seq_len_source))

# Generate random sequences of token IDs for the target sentences, with integers ranging
# from 1 to target_vocab_size. This simulates actual sentences in the target language.
# Shape: (batch_size, seq_len_target)
target = torch.randint(1, target_vocab_size, (batch_size, seq_len_target))

### Running simple test with model

In [34]:
# Transformer model hyperparameters: embedding dimension, attention heads, feed-forward layer size, and layers
d_model = 512        # Dimension of embeddings
num_heads = 8        # Number of attention heads
d_ff = 2048          # Size of feed-forward network
num_layers = 6       # Number of layers in encoder and decoder

# Initialize the Transformer model with the specified configuration, including input and target vocab sizes
model = Transformer(d_model, num_heads, d_ff, num_layers,
                    input_vocab_size, target_vocab_size,
                    max_len=MAX_SEQ_LEN, dropout=0.1)

# Move model and input data to the specified device (e.g., GPU) for processing
model = model.to(device)
source = source.to(device)
target = target.to(device)

### Output of Simple Test

In [35]:
# Run the model forward pass to get predictions
output = model(source, target)

In [36]:
# Expected output shape -> [batch, seq_len_target, target_vocab_size] i.e. [2, 10, 50], where each token in the target sequence
# is represented by its probability distribution across the target vocabulary.
print(f'ouput.shape {output.shape}')

ouput.shape torch.Size([2, 10, 50])


## Translator Eng-Spa

### Explanation

This translator eng-spa is created on training the transformer model written above. The main idea is provide the eng-spa.tsv doc that contains couples of words: first on english and second on the posible translation in spanish.

In order to be able to translate the words, both english and spanish, are corrected removing special signs and setting them on lowercase.

Once the words are cleaned the function build_vocab iterates all the sentences and find all the words on the dictionary. Addionally, index to word dictionaries are created to exchange between tokens and words for english or spanish.

EngSpaDataset is the dataset that returns the tokens index of the words when a sentence is provided.

The train function iterates the model and calculates the loss function to improve the weights of the transformer model.

### Loading parameters and vocabularies

In [37]:
# Path to the file containing English-Spanish sentence pairs
# File downloaded from: https://tatoeba.org/en/downloads
PATH = 'eng-spa.tsv'

In [38]:
# Open the file, read each line, and split English and Spanish sentence pairs by tabs
with open(PATH, 'r', encoding='utf-8') as f:
    lines = f.readlines()
eng_spa_pairs = [line.strip().split('\t') for line in lines if '\t' in line]

In [39]:
# Display the first 10 sentence pairs to examine the data format
eng_spa_pairs[:10]

[['\ufeff1276', "Let's try something.", '2481', '¡Intentemos algo!'],
 ['1277', 'I have to go to sleep.', '2482', 'Tengo que irme a dormir.'],
 ['1280',
  "Today is June 18th and it is Muiriel's birthday!",
  '2485',
  '¡Hoy es 18 de junio y es el cumpleaños de Muiriel!'],
 ['1280',
  "Today is June 18th and it is Muiriel's birthday!",
  '1130137',
  '¡Hoy es el 18 de junio y es el cumpleaños de Muiriel!'],
 ['1282', 'Muiriel is 20 now.', '2487', 'Ahora, Muiriel tiene 20 años.'],
 ['1282', 'Muiriel is 20 now.', '1130133', 'Muiriel tiene 20 años ahora.'],
 ['1283', 'The password is "Muiriel".', '2488', 'La contraseña es "Muiriel".'],
 ['1284', 'I will be back soon.', '2489', 'Volveré pronto.'],
 ['1284', 'I will be back soon.', '586853', 'Vuelvo en seguida.'],
 ['1284', 'I will be back soon.', '1031885', 'Yo regresaré pronto.']]

In [40]:
# Separate English and Spanish sentences into distinct lists
eng_sentences = [pair[1] for pair in eng_spa_pairs]  # English sentences
spa_sentences = [pair[3] for pair in eng_spa_pairs]  # Spanish sentences

In [41]:
# Print the first 10 sentences from each language list for verification
print(eng_sentences[:10])  # First 10 English sentences
print(spa_sentences[:10])  # First 10 Spanish sentences

["Let's try something.", 'I have to go to sleep.', "Today is June 18th and it is Muiriel's birthday!", "Today is June 18th and it is Muiriel's birthday!", 'Muiriel is 20 now.', 'Muiriel is 20 now.', 'The password is "Muiriel".', 'I will be back soon.', 'I will be back soon.', 'I will be back soon.']
['¡Intentemos algo!', 'Tengo que irme a dormir.', '¡Hoy es 18 de junio y es el cumpleaños de Muiriel!', '¡Hoy es el 18 de junio y es el cumpleaños de Muiriel!', 'Ahora, Muiriel tiene 20 años.', 'Muiriel tiene 20 años ahora.', 'La contraseña es "Muiriel".', 'Volveré pronto.', 'Vuelvo en seguida.', 'Yo regresaré pronto.']


### Preprocessing sentences

In [42]:
# Define a function to preprocess a sentence by removing special characters and formatting
def preprocess_sentence(sentence):
    """
    Preprocess a sentence by formatting, normalizing, and adding sequence tags.

    This function cleans and formats an input sentence by:
    - Converting to lowercase.
    - Removing leading and trailing whitespace.
    - Replacing multiple spaces with a single space.
    - Normalizing accented characters.
    - Removing numbers and special characters, retaining only alphabetic characters.
    - Adding start-of-sequence (<sos>) and end-of-sequence (<eos>) tags.

    Parameters:
    ----------
    sentence : str
        The input sentence to be preprocessed.

    Returns:
    -------
    str
        The preprocessed sentence, formatted and with <sos> and <eos> tags.

    Example:
    -------
    >>> preprocess_sentence("¿Hola @ cómo estás? 123")
    '<sos> hola como estas <eos>'
    """
    # Convert sentence to lowercase and strip leading/trailing whitespace
    sentence = sentence.lower().strip()
    # Replace multiple spaces with a single space
    sentence = re.sub(r'[" "]+', " ", sentence)
    # Normalize accented characters and remove numbers or special symbols
    sentence = re.sub(r"[á]+", "a", sentence)
    sentence = re.sub(r"[é]+", "e", sentence)
    sentence = re.sub(r"[í]+", "i", sentence)
    sentence = re.sub(r"[ó]+", "o", sentence)
    sentence = re.sub(r"[ú]+", "u", sentence)
    sentence = re.sub(r"[^a-z]+", " ", sentence)
    # Strip extra spaces and add start-of-sequence and end-of-sequence tags
    sentence = sentence.strip()
    sentence = '<sos> ' + sentence + ' <eos>'

    return sentence

In [43]:
# Create a sample sentence with special characters for demonstration
s1 = '¿Hola @ cómo estás? 123'

In [44]:
# Print the original sample sentence
print(s1)

# Preprocess the sample sentence and print the cleaned result
print(preprocess_sentence(s1))

¿Hola @ cómo estás? 123
<sos> hola como estas <eos>


In [45]:
# Apply the preprocess function to all English and Spanish sentences in their respective lists
eng_sentences = [preprocess_sentence(sentence) for sentence in eng_sentences]
spa_sentences = [preprocess_sentence(sentence) for sentence in spa_sentences]

In [46]:
# Display the first 10 preprocessed sentences of each language as examples
print(spa_sentences[:10])
print(eng_sentences[:10])

['<sos> intentemos algo <eos>', '<sos> tengo que irme a dormir <eos>', '<sos> hoy es de junio y es el cumplea os de muiriel <eos>', '<sos> hoy es el de junio y es el cumplea os de muiriel <eos>', '<sos> ahora muiriel tiene a os <eos>', '<sos> muiriel tiene a os ahora <eos>', '<sos> la contrase a es muiriel <eos>', '<sos> volvere pronto <eos>', '<sos> vuelvo en seguida <eos>', '<sos> yo regresare pronto <eos>']
['<sos> let s try something <eos>', '<sos> i have to go to sleep <eos>', '<sos> today is june th and it is muiriel s birthday <eos>', '<sos> today is june th and it is muiriel s birthday <eos>', '<sos> muiriel is now <eos>', '<sos> muiriel is now <eos>', '<sos> the password is muiriel <eos>', '<sos> i will be back soon <eos>', '<sos> i will be back soon <eos>', '<sos> i will be back soon <eos>']


### Building vocabularies

In [47]:
# Function to create a vocabulary from a list of sentences
def build_vocab(sentences):
    """
    Build a vocabulary from a list of sentences.

    This function creates a vocabulary dictionary that maps each unique word in the
    provided list of sentences to a unique index, sorted by word frequency in
    descending order. Two special tokens are added:
    - '<pad>' at index 0 for padding purposes.
    - '<unk>' at index 1 for unknown or out-of-vocabulary words.

    Parameters
    ----------
    sentences : list of str
        List of sentences used to build the vocabulary.

    Returns
    -------
    tuple of (dict, dict)
        - word2idx : dict
            A dictionary mapping each word to a unique index.
        - idx2word : dict
            A reverse dictionary mapping indexes back to words.

    Example
    -------
    >>> sentences = ["hello world", "hello machine learning"]
    >>> word2idx, idx2word = build_vocab(sentences)
    >>> word2idx["hello"]
    2
    >>> idx2word[2]
    'hello'
    """
    # Spl
    # Split each sentence into words and collect all words in a list
    words = [word for sentence in sentences for word in sentence.split()]
    # Count the occurrences of each word in the list
    word_count = Counter(words)
    # Sort words by frequency in descending order
    sorted_word_counts = sorted(word_count.items(), key=lambda x:x[1], reverse=True)
    # Create a dictionary that maps each word to a unique index
    word2idx = {word: idx for idx, (word, _) in enumerate(sorted_word_counts, 2)}
    # Add special token '<pad>' at index 0 for padding purposes
    word2idx['<pad>'] = 0
    # Add special token '<unk>' at index 1 to represent unknown words
    word2idx['<unk>'] = 1
    # Create a reverse dictionary to map indexes back to words
    idx2word = {idx: word for word, idx in word2idx.items()}

    return word2idx, idx2word

In [48]:
# Create English and Spanish vocabularies from respective sentence lists
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
spa_word2idx, spa_idx2word = build_vocab(spa_sentences)

# Get vocabulary sizes for English and Spanish
eng_vocab_size = len(eng_word2idx)
spa_vocab_size = len(spa_word2idx)

In [49]:
# Print vocabulary sizes to confirm
print(eng_vocab_size, spa_vocab_size)

27649 46928


### Custom Dataset class for English-Spanish sentence pairs

In [50]:
class EngSpaDataset(Dataset):
    """
    Custom Dataset class for handling English-Spanish sentence pairs.

    This dataset takes pairs of sentences in English and Spanish, along with
    respective vocabularies, and converts each sentence into a list of token
    indexes according to the provided vocabularies.

    Parameters
    ----------
    eng_sentences : list of str
        List of sentences in English to be used as input data.
    spa_sentences : list of str
        List of sentences in Spanish to be used as target data.
    eng_word2idx : dict
        Dictionary mapping English words to unique integer indexes.
    spa_word2idx : dict
        Dictionary mapping Spanish words to unique integer indexes.

    Methods
    -------
    __len__():
        Returns the number of sentence pairs in the dataset.

    __getitem__(idx):
        Retrieves the token index tensors for the English and Spanish sentences
        at the specified index.

    Returns
    -------
    tuple of (torch.Tensor, torch.Tensor)
        A tuple containing:
            - A tensor with token indexes for the English sentence.
            - A tensor with token indexes for the Spanish sentence.

    Example
    -------
    >>> eng_sentences = ["hello", "how are you"]
    >>> spa_sentences = ["hola", "como estas"]
    >>> dataset = EngSpaDataset(eng_sentences, spa_sentences, eng_word2idx, spa_word2idx)
    >>> len(dataset)
    2
    >>> eng_tensor, spa_tensor = dataset[0]
    >>> eng_tensor
    tensor([...])
    >>> spa_tensor
    tensor([...])
    """
    def __init__(self, eng_sentences, spa_sentences, eng_word2idx, spa_word2idx):
        self.eng_sentences = eng_sentences  # List of English sentences
        self.spa_sentences = spa_sentences  # List of Spanish sentences
        self.eng_word2idx = eng_word2idx    # Vocabulary mapping for English
        self.spa_word2idx = spa_word2idx    # Vocabulary mapping for Spanish

    def __len__(self):
        # Return the number of sentence pairs in the dataset
        return len(self.eng_sentences)

    def __getitem__(self, idx):
        """
        Get token index tensors for the English and Spanish sentences at the specified index.

        Parameters
        ----------
        idx : int
            The index of the sentence pair to retrieve.

        Returns
        -------
        tuple of (torch.Tensor, torch.Tensor)
            - eng_idxs : torch.Tensor
                A tensor of token indexes representing the English sentence.
            - spa_idxs : torch.Tensor
                A tensor of token indexes representing the Spanish sentence.
        """
        # Get the English and Spanish sentences at the specified index
        eng_sentence = self.eng_sentences[idx]
        spa_sentence = self.spa_sentences[idx]
        # Convert English sentence to list of token indexes, using <unk> for unknown words
        eng_idxs = [self.eng_word2idx.get(word, self.eng_word2idx['<unk>']) for word in eng_sentence.split()]
        # Convert Spanish sentence to list of token indexes, using <unk> for unknown words
        spa_idxs = [self.spa_word2idx.get(word, self.spa_word2idx['<unk>']) for word in spa_sentence.split()]

        return torch.tensor(eng_idxs), torch.tensor(spa_idxs)

### Custom Collate function to prepare batches with padding

In [51]:
def collate_fn(batch):
    """
    Custom collate function for preparing batches of English-Spanish sentence pairs with padding.

    This function processes a batch of token index tensors for English and Spanish sentences,
    truncates them to a maximum sequence length, and applies padding so that all sequences
    in the batch have the same length.

    Parameters
    ----------
    batch : list of tuple of (torch.Tensor, torch.Tensor)
        A list of tuples, where each tuple contains:
            - eng_tensor : torch.Tensor
                A tensor of token indexes representing the English sentence.
            - spa_tensor : torch.Tensor
                A tensor of token indexes representing the Spanish sentence.

    Returns
    -------
    tuple of (torch.Tensor, torch.Tensor)
        A tuple containing:
            - eng_batch : torch.Tensor
                A padded tensor of token indexes for the English sentences, with shape
                (batch_size, padded_seq_len).
            - spa_batch : torch.Tensor
                A padded tensor of token indexes for the Spanish sentences, with shape
                (batch_size, padded_seq_len).

    Example
    -------
    >>> from torch.utils.data import DataLoader
    >>> dataset = EngSpaDataset(eng_sentences, spa_sentences, eng_word2idx, spa_word2idx)
    >>> dataloader = DataLoader(dataset, batch_size=64, collate_fn=collate_fn)
    >>> for eng_batch, spa_batch in dataloader:
    >>>     print(eng_batch.shape)
    >>>     print(spa_batch.shape)
    """
    # Separate English and Spanish sentences in the batch
    eng_batch, spa_batch = zip(*batch)
    # Truncate sentences to MAX_SEQ_LEN if necessary
    eng_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in eng_batch]
    spa_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in spa_batch]
    # Pad English sentences in the batch to the same length with padding value 0
    eng_batch = torch.nn.utils.rnn.pad_sequence(eng_batch, batch_first=True, padding_value=0)
    # Pad Spanish sentences in the batch to the same length with padding value 0
    spa_batch = torch.nn.utils.rnn.pad_sequence(spa_batch, batch_first=True, padding_value=0)
    return eng_batch, spa_batch


### Traning function

In [52]:
# Define the training function for the model
def train(model, dataloader, loss_function, optimiser, epochs):
    """
    Train the given model on English-Spanish sentence pairs over a specified number of epochs.

    This function performs training on the provided model using teacher forcing. For each epoch,
    it iterates through batches from the dataloader, computes the loss, and updates model parameters
    to minimize the loss. The function also displays the average loss for each epoch.

    Parameters
    ----------
    model : torch.nn.Module
        The model to be trained, typically a neural network for sequence-to-sequence tasks.
    dataloader : torch.utils.data.DataLoader
        DataLoader object that provides batches of English and Spanish sentence pairs.
    loss_function : torch.nn.Module
        Loss function to calculate the error between model predictions and target output.
    optimiser : torch.optim.Optimizer
        Optimizer to update model parameters based on gradients computed from the loss.
    epochs : int
        The number of complete passes through the training dataset.

    Returns
    -------
    None
    """
    model.train()  # Set model to training mode
    for epoch in range(epochs):  # Loop over each epoch
        total_loss = 0  # Track cumulative loss for the epoch
        for i, (eng_batch, spa_batch) in enumerate(dataloader):
            eng_batch = eng_batch.to(device)  # Move English batch to device
            spa_batch = spa_batch.to(device)  # Move Spanish batch to device
            # Prepare target input and output for the decoder
            target_input = spa_batch[:, :-1]  # Shifted target input for teacher forcing
            target_output = spa_batch[:, 1:].contiguous().view(-1)  # Shifted target output for prediction
            # Reset gradients to avoid accumulation
            optimiser.zero_grad()
            # Forward pass through the model
            output = model(eng_batch, target_input)
            output = output.view(-1, output.size(-1))  # Flatten output for loss calculation
            # Calculate loss between model output and target output
            loss = loss_function(output, target_output)
            # Backpropagate gradients
            loss.backward()
            # Update model parameters
            optimiser.step()
            # Accumulate loss for this batch
            total_loss += loss.item()

        # Calculate average loss for the epoch and print it
        avg_loss = total_loss / len(dataloader)
        print(f'Epoch: {epoch}/{epochs}, Loss: {avg_loss:.4f}')

### Creating Instances and Hyperparameters

In [53]:
# Set batch size for training
BATCH_SIZE = 64

# Create dataset instance using English and Spanish sentences and their vocabularies
dataset = EngSpaDataset(eng_sentences, spa_sentences, eng_word2idx, spa_word2idx)

# Initialize DataLoader to shuffle and batch the dataset, with custom collation function for padding
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)


In [54]:
# Define the Transformer model with specified dimensions, number of layers, vocab sizes, and dropout
model = Transformer(d_model=512, num_heads=8, d_ff=2048, num_layers=6,
                    input_vocab_size=eng_vocab_size, target_vocab_size=spa_vocab_size,
                    max_len=MAX_SEQ_LEN, dropout=0.1)

In [55]:
# Move the model to the specified device (GPU if available)
model = model.to(device)

# Define the loss function with padding token (<pad>) index ignored in the calculations
loss_function = nn.CrossEntropyLoss(ignore_index=0)

# Use the Adam optimizer with a small learning rate for stable training
optimiser = optim.Adam(model.parameters(), lr=0.0001)


### Training the model

In [56]:
# Train the model using the specified dataloader, loss function, and optimizer for 10 epochs
train(model, dataloader, loss_function, optimiser, epochs = 10)

Epoch: 0/10, Loss: 3.6027
Epoch: 1/10, Loss: 2.2008
Epoch: 2/10, Loss: 1.7015
Epoch: 3/10, Loss: 1.3760
Epoch: 4/10, Loss: 1.1248
Epoch: 5/10, Loss: 0.9226
Epoch: 6/10, Loss: 0.7586
Epoch: 7/10, Loss: 0.6300
Epoch: 8/10, Loss: 0.5348
Epoch: 9/10, Loss: 0.4669


In [57]:
# Save model for future use
torch.save(model.state_dict(), './transformer_model.pt')

### Processing results functions for translation

In [58]:
def sentence_to_indices(sentence, word2idx):
    """
    Convert a sentence into a list of token indices based on the given vocabulary.

    Each word in the sentence is mapped to its corresponding index in the vocabulary. If a word is not
    found in the vocabulary, it is replaced with the index for the unknown token ('<unk>').

    Parameters
    ----------
    sentence : str
        The sentence to be converted into indices.
    word2idx : dict
        A dictionary mapping words to their corresponding indices in the vocabulary.

    Returns
    -------
    list of int
        A list of indices representing the sentence.
    """
    return [word2idx.get(word, word2idx['<unk>']) for word in sentence.split()]

def indices_to_sentence(indices, idx2word):
    """
    Convert a list of token indices back into a sentence using the reverse vocabulary.

    Each index is mapped back to its corresponding word in the vocabulary. The padding token ('<pad>')
    is excluded from the output sentence.

    Parameters
    ----------
    indices : list of int
        The list of token indices to be converted into a sentence.
    idx2word : dict
        A dictionary mapping indices back to their corresponding words in the vocabulary.

    Returns
    -------
    str
        The reconstructed sentence as a string.
    """
    return ' '.join([idx2word[idx] for idx in indices if idx in idx2word and idx2word[idx] != '<pad>'])

def translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    """
    Translate a given English sentence into Spanish using a pre-trained model.

    The function preprocesses the input sentence, converts it to token indices, and feeds it through the model
    in evaluation mode. The model generates the Spanish translation until it reaches either the maximum sequence
    length or the end-of-sentence token ('<eos>').

    Parameters
    ----------
    model : torch.nn.Module
        The trained translation model.
    sentence : str
        The input English sentence to translate.
    eng_word2idx : dict
        Vocabulary dictionary mapping English words to indices.
    spa_idx2word : dict
        Reverse vocabulary dictionary mapping Spanish indices back to words.
    max_len : int, optional
        The maximum length for the generated translation (default is MAX_SEQ_LEN).
    device : str, optional
        The device on which the model is run, e.g., 'cpu' or 'cuda' (default is 'cpu').

    Returns
    -------
    str
        The translated Spanish sentence.
    """
    model.eval()  # Set model to evaluation mode for inference
    sentence = preprocess_sentence(sentence)  # Preprocess sentence (e.g., lowercase, clean)
    input_indices = sentence_to_indices(sentence, eng_word2idx)  # Convert input sentence to indices
    input_tensor = torch.tensor(input_indices).unsqueeze(0).to(device)  # Add batch dimension and move to device

    # Initialize the target tensor with the start-of-sentence (<sos>) token
    tgt_indices = [spa_word2idx['<sos>']]
    tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)

    # Generate tokens until reaching max length or end-of-sentence (<eos>) token
    with torch.no_grad():
        for _ in range(max_len):
            output = model(input_tensor, tgt_tensor)  # Pass source and target through the model
            output = output.squeeze(0)  # Remove batch dimension for easier processing
            next_token = output.argmax(dim=-1)[-1].item()  # Get index of the most likely next token
            tgt_indices.append(next_token)  # Append this token to target indices
            tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)  # Update target tensor for next step
            if next_token == spa_word2idx['<eos>']:  # Stop if <eos> token is generated
                break

    return indices_to_sentence(tgt_indices, spa_idx2word)  # Convert indices to sentence for final output

### Evaluating

In [60]:
# Function to evaluate translations on a list of English sentences
def evaluate_translations(model, sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    """
    Evaluate translations for a list of English sentences using a pre-trained model.

    This function iterates over a list of English sentences, translates each sentence to Spanish
    using the specified model, and prints both the original sentence and its translated output.

    Parameters
    ----------
    model : torch.nn.Module
        The trained translation model.
    sentences : list of str
        A list of English sentences to be translated.
    eng_word2idx : dict
        Vocabulary dictionary mapping English words to indices.
    spa_idx2word : dict
        Reverse vocabulary dictionary mapping Spanish indices back to words.
    max_len : int, optional
        The maximum length for the generated translation (default is MAX_SEQ_LEN).
    device : str, optional
        The device on which the model is run, e.g., 'cpu' or 'cuda' (default is 'cpu').

    Returns
    -------
    None
        This function does not return any value. It prints each input sentence and its translation to the console.
    """
    for sentence in sentences:
        # Translate each sentence using the translate_sentence function
        translation = translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len, device)
        # Print original sentence and its translation for review
        print(f'Input sentence: {sentence}')
        print(f'Translation: {translation}')
        print()

# Example sentences to test the translation model
test_sentences = [
    "I am not a robot.",
    "I like to pet my dog",
    "I can learn artificial intelligence.",
    "The dinner is on the table",
    "I am a human.",
    "Hello friend.",
    "Do you like artificial intelligence?",
    "It's a nice evening.",
    "I did my homework.",
    "We should go out someday.",
    "I will go to my home!.",
    "My car is blue"
]

# Move model to available device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Run evaluation to print translations of test sentences
evaluate_translations(model, test_sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device=device)

Input sentence: I am not a robot.
Translation: <sos> no soy un robot <eos>

Input sentence: I like to pet my dog
Translation: <sos> me gusta mi perro de mascota <eos>

Input sentence: I can learn artificial intelligence.
Translation: <sos> puedo aprender inteligencia artificial <eos>

Input sentence: The dinner is on the table
Translation: <sos> la cena esta sobre la mesa <eos>

Input sentence: I am a human.
Translation: <sos> yo soy un humano <eos>

Input sentence: Hello friend.
Translation: <sos> hola amigo <eos>

Input sentence: Do you like artificial intelligence?
Translation: <sos> te gusta la inteligencia artificial <eos>

Input sentence: It's a nice evening.
Translation: <sos> es una buena noche <eos>

Input sentence: I did my homework.
Translation: <sos> hice mis deberes <eos>

Input sentence: We should go out someday.
Translation: <sos> deberiamos salir algun dia <eos>

Input sentence: I will go to my home!.
Translation: <sos> me ire a mi casa <eos>

Input sentence: My car is 

## Conclusion

In this notebook we demonstrate how we can use a transformer model to build an English to Spanish translator. The input data to train the model is a dataset found on the internet comprising 266,338 English to Spanish sentences.

The training processes in this notebook map all the data to train and validate the operations of the transformer model. Each part of the transformer results in an ingenious engineering implementation starting with the sine operation, the encoder block using a multi-head attention scheme providing useful parameters Q, K, and V running in a parallel procedure. Then the decoder block using the information provided by the encoder block to compare key results.

Even with the lack of a full translator for both languages, and even when training is done only in a few epochs, it is quite interesting how well the translation performs in the test and demonstrate how powerful a transformer can be.