# Team members

| Id        | Student                                 |
|-----------|-----------------------------------------|
| A01795654 | Raul Astorga Castro                     |
| A01795579 | Edson Misael Astorga Castro             |
| A01373679 | Luis Miguel Balderas González de Burgos |
| A01730466 | Sinaí Avalos Rivera                     |
| A01410682 | Carlos Miguel Arvizu Durán              |

# TC 5033
## Deep Learning
## Transformers

## Activity 4: Implementing a Translator

- Objective

To understand the Transformer architecture by implementing a translator.

- Instructions

    This activity requires submission in teams. While teamwork is encouraged, each member is expected to contribute individually to the assignment. The final submission should feature the best arguments and solutions from each team member. Only one person per team needs to submit the completed work, but it is imperative that the names of all team members are listed in a Markdown cell at the very beginning of the notebook (either the first or second cell). Failure to include all team member names will result in the grade being awarded solely to the individual who submitted the assignment, with zero points given to other team members (no exceptions will be made to this rule).

    Follow the provided code. The code already implements a transformer from scratch, as explained in one of [week 9's videos](https://youtu.be/XefFj4rLHgU).

    Since the provided code already implements a simple translator, your job for this assignment is to understand it fully, and document it using pictures, figures, and markdown cells.  You should test your translator with at least 10 sentences. The dataset used for this task was obtained from [Tatoeba, a large dataset of sentences and translations](https://tatoeba.org/en/downloads).
  
- Evaluation Criteria

    - Code Readability and Comments
    - Training a translator
    - Translating at least 10 sentences.

- Submission

Submit this Jupyter Notebook in Canvas with your complete solution, ensuring your code is well-commented and includes Markdown cells that explain your design choices, results, and any challenges you encountered.



## Script to convert CSV to text file

In [1]:
# This script requires converting the TSV file to CSV first;
# the easiest way is to open it in Calc or Excel and save it as CSV
#PATH = '/Users/raulastrga/Library/Mobile Documents/com~apple~CloudDocs/Maestría/Advanced Machine Learning Methods/MNA/AMLM/Activity_4/eng-spa.tsv'
#import pandas as pd
#df = pd.read_csv(PATH, sep='\t', on_bad_lines='skip')

In [2]:
#eng_spa_cols = df.iloc[:, [1, 3]]
#eng_spa_cols['length'] = eng_spa_cols.iloc[:, 0].str.len()  
#eng_spa_cols = eng_spa_cols.sort_values(by='length')  
#eng_spa_cols = eng_spa_cols.drop(columns=['length'])  

#output_file_path = '/media/pepe/DataUbuntu/Databases/spanish_english/eng-spa4.txt'
#eng_spa_cols.to_csv(output_file_path, sep='\t', index=False, header=False)

## Import libraries required for the model training

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import math
import numpy as np
import re

# Set random seed for reproducibility
torch.manual_seed(23)

<torch._C.Generator at 0x1071d90b0>

## Using GPU

In [4]:
if torch.cuda.is_available():
    device = torch.device('cuda') # GPU will be used if available
elif torch.backends.mps.is_available():
    device = torch.device('mps') # GPU will be used in Apple Silicon Macs if available
else:
    device = torch.device('cpu') # CPU will be used if GPU is not available
print(device)

mps


## Building the Transformer Model

[![](transformer.png)](transformer.png "Transformer Architecture")

The Transformer model, introduced in the “Attention is All You Need” paper, revolutionizes sequence processing by relying entirely on self-attention mechanisms rather than recurrence or convolution. It consists of an encoder-decoder architecture, where both the encoder and decoder are made up of multiple layers of self-attention and feed-forward networks. The key innovation is the self-attention mechanism, which allows the model to weigh the importance of different tokens in the input sequence, regardless of their position. This enables parallelization and significantly improves efficiency, especially for tasks like machine translation. Additionally, positional encodings are used to inject sequence order information, compensating for the model’s lack of inherent positional processing. The Transformer’s architecture has since become the foundation for many state-of-the-art models in natural language processing.

> Vaswani, A. et al. (2017). Attention is All You Need. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

### Constants

In [5]:
# Maximum number of tokens per sentence supported by the positional embedding
MAX_SEQ_LEN = 128

### Positional Encoding Module

[![](positional_encoding.png)](positional_encoding.png "Positional Encoding")

Positional Encoding is a key concept introduced in the “Attention is All You Need” paper to address the lack of sequential order processing in the Transformer model. Since Transformers don’t inherently process input data in a temporal or spatial order like RNNs or CNNs, positional encodings are added to the input embeddings to inject information about the position of each token in a sequence. Typically, these encodings are generated using sinusoidal functions, where each dimension corresponds to a different frequency. This enables the model to capture the relative or absolute positions of tokens, allowing it to process sequences effectively without relying on traditional recurrence or convolution mechanisms.

> Vaswani, A. et al. (2017). Attention is All You Need. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
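
As a rough illustration of the sinusoidal scheme described above, the sketch below computes the encoding in plain Python (no PyTorch) so that individual values can be checked by hand. The helper name `sinusoidal_encoding` is hypothetical and not part of the provided code, and the sketch assumes an even `d_model`.

```python
import math

def sinusoidal_encoding(max_seq_len, d_model):
    """Return a [max_seq_len][d_model] table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_seq_len)]
    for pos in range(max_seq_len):
        for i in range(0, d_model, 2):
            # Each pair of dimensions (i, i + 1) shares one frequency
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)      # even index: sine
            table[pos][i + 1] = math.cos(angle)  # odd index: cosine
    return table

pe = sinusoidal_encoding(max_seq_len=4, d_model=6)
print([round(v, 4) for v in pe[0]])  # position 0: every sine is 0, every cosine is 1
print([round(v, 4) for v in pe[1]])
```

Lower dimensions oscillate quickly with the position while higher dimensions change slowly, which is what gives every position a distinct pattern.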

In [None]:
class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_seq_len = MAX_SEQ_LEN):
        """
        Initializes the positional embedding matrix, which adds positional information to token embeddings 
        by using sine and cosine functions as proposed in "Attention is All You Need".

        Args:
            d_model (int): The dimensionality of the embeddings for each token.
            max_seq_len (int, optional): The maximum number of tokens in a sentence. Defaults to MAX_SEQ_LEN.
        """
        super().__init__()
        # Initializing a matrix for positional embeddings with zeros, with shape [max_seq_len, d_model]
        self.pos_embed_matrix = torch.zeros(max_seq_len, d_model, device=device)
        # Token position array with shape [max_seq_len, 1] to store the positions of tokens in sequence
        token_pos = torch.arange(0, max_seq_len, dtype = torch.float).unsqueeze(1)

        # Computing the scaling term for each position, as described in the Transformer model,
        # which controls the frequency of the sine and cosine functions
        div_term = torch.exp(torch.arange(0, d_model, 2).float() 
                             * (-math.log(10000.0)/d_model))
        
        # Assigning sine values to even indices and cosine values to odd indices in the positional embedding matrix
        self.pos_embed_matrix[:, 0::2] = torch.sin(token_pos * div_term)
        self.pos_embed_matrix[:, 1::2] = torch.cos(token_pos * div_term)

        # Adding a batch dimension so the matrix has shape [1, max_seq_len, d_model]
        # and broadcasts over the batch dimension of batch-first inputs
        self.pos_embed_matrix = self.pos_embed_matrix.unsqueeze(0)
        
    def forward(self, x):
        """
        Adds positional embeddings to the input embeddings.

        Args:
            x (torch.Tensor): The input embeddings of shape [batch_size, seq_len, d_model].

        Returns:
            torch.Tensor: The input embeddings with positional information added.
        """
        # Slice the first seq_len positions and add them to the input, broadcasting over the batch dimension
        return x + self.pos_embed_matrix[:, :x.size(1), :]

### Multi-Head Attention Module

[![](multiheadattention_.png)](multiheadattention_.png "Multi-Head Attention")

The Multi-Head Attention module in the Transformer model enhances the self-attention mechanism by allowing the model to focus on different parts of the input sequence simultaneously. Instead of performing a single attention operation, it runs multiple attention operations (or “heads”) in parallel, each with different learned attention weights. The outputs of these attention heads are then concatenated and linearly transformed to produce the final result. This approach enables the model to capture a broader range of relationships and dependencies within the data, as each attention head can focus on different aspects of the sequence. Multi-Head Attention thus improves the model’s ability to understand complex patterns and interactions in the input sequence.

> Vaswani, A. et al. (2017). Attention is All You Need. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
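
To make the attention operation concrete, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions (no batching, no mask, no learned projections), and the function names `softmax` and `scaled_dot_product` are hypothetical helpers rather than the notebook's implementation. Each head in the module below performs this same computation on its own `d_k`-dimensional slice.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product(Q, K, V):
    """Single-head attention for inputs shaped [seq_len][d_k]."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value vectors
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
output, weights = scaled_dot_product(Q, K, V)
print(weights)
print(output)
```

In this toy input each query matches its own key best, so each token places most of its attention weight on itself while still mixing in part of the other token's value.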

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model = 512, num_heads = 8):
        """
        Multi-head attention mechanism that divides attention computation into multiple heads 
        for parallelized self-attention.

        Args:
            d_model (int): Dimensionality of the embeddings for each token.
            num_heads (int): Number of attention heads for multi-head attention.

        Raises:
            AssertionError: If d_model is not divisible by num_heads, as each head must 
                            have an equal share of d_model for compatibility.
        """
        super().__init__()
        assert d_model % num_heads == 0, 'Embedding size not compatible with num heads'
        
        # Dimension per head for keys (d_k) and values (d_v)
        self.d_v = d_model // num_heads
        self.d_k = self.d_v
        self.num_heads = num_heads
        
        # Linear transformations for query, key, and value projection
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
    def forward(self, Q, K, V, mask = None):
        """
        Computes multi-head attention for given query, key, and value tensors.

        Args:
            Q (torch.Tensor): Query tensor of shape [batch_size, seq_len, num_heads*d_k].
            K (torch.Tensor): Key tensor of shape [batch_size, seq_len, num_heads*d_k].
            V (torch.Tensor): Value tensor of shape [batch_size, seq_len, num_heads*d_k].
            mask (torch.Tensor, optional): Mask tensor to prevent attention to certain positions. Defaults to None.

        Returns:
            tuple: Contains the following:
                - torch.Tensor: Weighted values after multi-head attention and transformation.
                - torch.Tensor: Attention scores.
        """
        batch_size = Q.size(0)

        # Linear projections of Q, K, V for each head, and reshaping to [batch_size, num_heads, seq_len, d_k]
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_v).transpose(1, 2)
        
        # Scaled dot-product attention computation
        weighted_values, attention = self.scale_dot_product(Q, K, V, mask)

        # Concatenation of attention heads and output projection
        weighted_values = weighted_values.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads*self.d_k)
        weighted_values = self.W_o(weighted_values)
        
        return weighted_values, attention
    
    def scale_dot_product(self, Q, K, V, mask = None):
        """
        Computes the scaled dot-product attention for the query, key, and value tensors.

        Args:
            Q (torch.Tensor): Query tensor of shape [batch_size, num_heads, seq_len, d_k].
            K (torch.Tensor): Key tensor of shape [batch_size, num_heads, seq_len, d_k].
            V (torch.Tensor): Value tensor of shape [batch_size, num_heads, seq_len, d_k].
            mask (torch.Tensor, optional): Mask tensor to prevent attention to certain positions. Defaults to None.

        Returns:
            tuple: Contains the following:
                - torch.Tensor: Weighted values after applying attention.
                - torch.Tensor: Softmaxed attention scores.
        """
        # Calculate the dot product of Q and K, scaled by sqrt(d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # Apply mask to the scores if provided, setting masked positions to a very low value
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        # Compute attention scores with softmax
        attention = F.softmax(scores, dim = -1)
        
        # Calculate weighted values by multiplying attention scores with V
        weighted_values = torch.matmul(attention, V)
        
        return weighted_values, attention
        

### Feed Forward Module

[![](feed_forward.png)](feed_forward.png "Feed Forward")

The Feed Forward module in the Transformer model is a small fully connected network that follows the attention sublayers in both the encoder and the decoder. It consists of two linear transformations with a ReLU activation in between: the first expands each token’s vector from `d_model` to `d_ff` dimensions, and the second projects it back to `d_model`. The same weights are applied independently to every position in the sequence, so tokens do not interact inside this module. Its role is to add non-linearity and depth to each token’s representation after attention has mixed information across positions.

> Vaswani, A. et al. (2017). Attention is All You Need. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
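
The position-wise behaviour can be sketched with a toy example in plain Python. The weights below are made-up values for `d_model = 2` and `d_ff = 3`, and the helpers `linear` and `feed_forward` are hypothetical stand-ins for the `nn.Linear` layers used in the module that follows.

```python
def linear(x, W, b):
    """Compute x @ W + b for a single vector x, with W shaped [n_in][n_out]."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def feed_forward(x, W1, b1, W2, b2):
    """Two linear maps with a ReLU in between, applied to one token vector."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)

# Toy parameters (assumed values, chosen so the arithmetic is easy to follow)
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, -0.5]

# Three token vectors; the first and third are identical
sequence = [[1.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
outputs = [feed_forward(token, W1, b1, W2, b2) for token in sequence]
print(outputs)
```

Because the same weights are applied to each token separately, the identical first and third tokens produce identical outputs regardless of where they sit in the sequence.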

In [None]:
class PositionFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        """
        Feed-forward network used after the self-attention mechanism in each encoder layer.

        Args:
            d_model (int): Dimensionality of the input and output (embedding size).
            d_ff (int): Dimensionality of the inner layer (hidden layer size).
        """
        super().__init__()
        # First linear transformation from d_model to d_ff (hidden layer size)
        self.linear1 = nn.Linear(d_model, d_ff)
        # Second linear transformation back from d_ff to d_model
        self.linear2 = nn.Linear(d_ff, d_model)
        
    def forward(self, x):
        """
        Passes the input through the two linear transformations with a ReLU activation 
        in between to introduce non-linearity.

        Args:
            x (torch.Tensor): Input tensor of shape [batch_size, seq_len, d_model].

        Returns:
            torch.Tensor: Output tensor of the same shape [batch_size, seq_len, d_model].
        """
        # Apply the first linear transformation followed by ReLU activation
        x = F.relu(self.linear1(x))
        # Apply the second linear transformation to get the final output
        return self.linear2(x)

### Encoder Module

[![](encoder.png)](encoder.png "Encoder")

The Encoder module of the Transformer model is responsible for processing the input sequence and generating a context-aware representation of each token. It consists of a stack of identical layers, each comprising two main components: a Multi-Head Attention mechanism and a Feed Forward neural network. In each layer, the Multi-Head Attention computes attention scores to capture dependencies between tokens, while the Feed Forward network applies further transformations to each token’s representation. Each component is wrapped in a residual connection followed by layer normalization, which stabilizes training. The encoder produces a set of encoded representations that capture the input sequence’s context, which is then passed to the decoder for further processing in tasks like machine translation.

> Vaswani, A. et al. (2017). Attention is All You Need. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
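
The "add and normalize" step that wraps each encoder component can be sketched as follows. This is a simplified illustration in plain Python: it omits dropout and the learned scale and shift of `nn.LayerNorm`, and the names `layer_norm`, `add_and_norm`, and `double` are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def double(v):
    """Stand-in for an attention or feed-forward sublayer."""
    return [2.0 * a for a in v]

token = [1.0, 2.0, 3.0, 4.0]
out = add_and_norm(token, double)
print([round(v, 3) for v in out])
```

The residual path keeps the original token information available to later layers, while the normalization keeps the scale of the activations stable from layer to layer.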

In [None]:
class EncoderSubLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout = 0.1):
        """
        Initializes an EncoderSubLayer, a single layer of the encoder, which consists of self-attention and 
        feed-forward sublayers with layer normalization and dropout.

        Args:
            d_model (int): Dimensionality of the embeddings for each token.
            num_heads (int): Number of attention heads for multi-head attention.
            d_ff (int): Dimensionality of the feed-forward network in this sublayer.
            dropout (float, optional): Dropout rate for regularization in attention and feed-forward layers. Defaults to 0.1.
        """
        super().__init__()
        # Multi-head self-attention mechanism
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward network
        self.ffn = PositionFeedForward(d_model, d_ff)
        # Layer normalization applied after each sublayer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout layers for regularization
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask = None):
        """
        Passes the input through the self-attention, feed-forward layers, with residual connections, normalization, and dropout.

        Args:
            x (torch.Tensor): Input tensor of shape [batch_size, seq_len, d_model].
            mask (torch.Tensor, optional): Mask tensor to prevent attention to certain positions. Defaults to None.

        Returns:
            torch.Tensor: Processed tensor with the same shape as the input [batch_size, seq_len, d_model].
        """
        # Apply self-attention with residual connection, dropout, and layer normalization
        attention_score, _ = self.self_attn(x, x, x, mask)
        x = x + self.dropout1(attention_score)
        x = self.norm1(x)
        # Apply position-wise feed-forward network with residual connection, dropout, and layer normalization
        x = x + self.dropout2(self.ffn(x))
        return self.norm2(x)

In [None]:
class Encoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        """
        Initializes the Encoder, a stack of multiple encoder layers, each with self-attention and feed-forward sublayers.

        Args:
            d_model (int): Dimensionality of the embeddings for each token.
            num_heads (int): Number of attention heads for multi-head attention.
            d_ff (int): Dimensionality of the feed-forward network in each encoder layer.
            num_layers (int): Number of encoder layers in the stack (N in the original Transformer architecture).
            dropout (float, optional): Dropout rate passed to each encoder layer. Defaults to 0.1.
        """
        super().__init__()
        # Creating a list of encoder layers, each with multi-head attention and feed-forward sublayers
        self.layers = nn.ModuleList([EncoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        # Final layer normalization applied after all encoder layers
        self.norm = nn.LayerNorm(d_model)
        
    def forward(self, x, mask=None):
        """
        Passes input embeddings through the encoder stack, applying each encoder layer in sequence.

        Args:
            x (torch.Tensor): Input embeddings of shape [batch_size, seq_len, d_model].
            mask (torch.Tensor, optional): Mask tensor to prevent attention to certain positions, 
                                           such as padding tokens. Defaults to None.

        Returns:
            torch.Tensor: Encoded representation of the input, with shape [batch_size, seq_len, d_model].
        """
        # Sequentially applies each encoder layer, passing the output to the next layer
        for layer in self.layers:
            x = layer(x, mask)
            
        # Applies layer normalization to the final encoder output
        return self.norm(x)

### Decoder Module

[![](decoder.png)](decoder.png "Decoder")

The Decoder module of the Transformer model generates the output sequence in tasks like machine translation. Like the encoder, it consists of a stack of identical layers, but each layer adds a second Multi-Head Attention block. Each decoder layer therefore contains three main components, applied in this order: a masked Multi-Head self-attention over the target tokens (the causal mask hides future positions, which enables autoregressive generation), a Multi-Head cross-attention that attends to the encoder’s output, and a Feed Forward neural network. Each component is wrapped in a residual connection followed by layer normalization. The final output of the decoder is passed through a linear layer and a softmax function to produce a probability distribution over the target vocabulary, from which the next token is predicted. This process is repeated until the entire sequence is generated.

> Vaswani, A. et al. (2017). Attention is All You Need. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
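
Autoregressive generation depends on masking the decoder's self-attention so that a position can only attend to itself and earlier positions, and never to padding. The sketch below builds such a mask in plain Python for a single sentence. It mirrors the idea behind the `mask` method of the `Transformer` class defined later, but the function `target_mask` and the example token ids are hypothetical.

```python
PAD = 0  # assumption: token id 0 marks padding, as in the Transformer.mask method

def target_mask(token_ids):
    """Return a [seq_len][seq_len] boolean mask; True means 'query may attend to key'."""
    size = len(token_ids)
    return [
        # A key is visible if it is not in the future and is not padding
        [key <= query and token_ids[key] != PAD for key in range(size)]
        for query in range(size)
    ]

mask = target_mask([5, 8, 3, PAD])
for row in mask:
    print(" ".join("x" if allowed else "." for allowed in row))
```

Each printed row is a query position. The visible region grows by one token per row (the lower-triangular causal pattern), and the last column stays blocked in every row because that position holds padding.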

In [None]:
class DecoderSubLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        """
        Initializes the sublayer with self-attention, cross-attention, 
        position-wise feed-forward layers, normalization, and dropout.

        Args:
            d_model (int): Dimensionality of embeddings and hidden layers.
            num_heads (int): Number of attention heads in multi-head attention.
            d_ff (int): Dimensionality of the feed-forward layer.
            dropout (float): Dropout rate applied to the output of each sublayer before the residual addition. Defaults to 0.1.
        """
        super().__init__()
        # Self-attention layer for the target sequence
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Cross-attention layer to attend to encoder outputs
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward layer
        self.feed_forward = PositionFeedForward(d_model, d_ff)
        # Layer normalizations for each subcomponent
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        # Dropout layers to prevent overfitting
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)
        
    def forward(self, x, encoder_output, target_mask=None, encoder_mask=None):
        """
        Forward pass through the decoder sublayer, applying self-attention, 
        cross-attention, and a feed-forward network, with residual connections 
        and normalization.

        Args:
            x (torch.Tensor): Target sequence tensor with shape 
                (batch_size, target_seq_len, d_model).
            encoder_output (torch.Tensor): Encoder output with shape 
                (batch_size, source_seq_len, d_model).
            target_mask (torch.Tensor, optional): Mask for self-attention 
                within the target sequence.
            encoder_mask (torch.Tensor, optional): Mask for cross-attention 
                with encoder output.

        Returns:
            torch.Tensor: Processed tensor of shape (batch_size, target_seq_len, d_model).
        """
        # Self-attention over the target sequence with residual connection
        attention_score, _ = self.self_attn(x, x, x, target_mask)
        x = x + self.dropout1(attention_score) # Apply dropout and add residual
        x = self.norm1(x) # Normalize the result
        
        # Cross-attention with encoder output, allowing decoder to attend to encoded source
        encoder_attn, _ = self.cross_attn(x, encoder_output, encoder_output, encoder_mask)
        x = x + self.dropout2(encoder_attn)  # Apply dropout and add residual
        x = self.norm2(x) # Normalize the result
        
        # Position-wise feed-forward layer with residual connection
        ff_output = self.feed_forward(x)
        x = x + self.dropout3(ff_output) # Apply dropout and add residual
        return self.norm3(x) # Final layer normalization

In [None]:
class Decoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        """
        Transformer decoder, which decodes the encoded information into target sequence 
        representations while attending to the encoder output.

        Args:
            d_model (int): Dimension of embeddings and hidden layers.
            num_heads (int): Number of attention heads in each multi-head attention layer.
            d_ff (int): Dimension of the feed-forward layer.
            num_layers (int): Number of decoder sub-layers.
            dropout (float): Dropout rate applied to the decoder layers.
        """
        super().__init__()
        # Create a list of DecoderSubLayer instances, each representing one layer in the decoder stack
        self.layers = nn.ModuleList([DecoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        # Final layer normalization applied to the decoder output
        self.norm = nn.LayerNorm(d_model)
        
    def forward(self, x, encoder_output, target_mask, encoder_mask):
        """
        Passes the input through each decoder layer, allowing each layer to attend to
        both the current input and the encoder output, then applies layer normalization.

        Args:
            x (torch.Tensor): Input tensor with shape [batch_size, target_seq_len, d_model].
            encoder_output (torch.Tensor): Output from the encoder with shape [batch_size, source_seq_len, d_model].
            target_mask (torch.Tensor): Mask to prevent attending to future tokens in the target sequence.
            encoder_mask (torch.Tensor): Mask to prevent attending to padding tokens in the encoder output.

        Returns:
            torch.Tensor: Processed tensor with shape [batch_size, target_seq_len, d_model].
        """
        # Pass input through each DecoderSubLayer in the decoder
        for layer in self.layers:
            x = layer(x, encoder_output, target_mask, encoder_mask)
        # Apply final layer normalization to the output
        return self.norm(x)

### Transformer Class

In [None]:
class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers,
                 input_vocab_size, target_vocab_size, 
                 max_len=MAX_SEQ_LEN, dropout=0.1):
        """
        Initializes the Transformer model with embeddings, encoder, decoder, 
        and an output projection layer.

        Args:
            d_model (int): Dimensionality of embeddings and hidden layers.
            num_heads (int): Number of attention heads in multi-head attention.
            d_ff (int): Dimensionality of the feed-forward network.
            num_layers (int): Number of encoder and decoder layers.
            input_vocab_size (int): Size of the input vocabulary.
            target_vocab_size (int): Size of the target vocabulary.
            max_len (int): Maximum length of input sequence.
            dropout (float): Dropout for regularization. Defaults to 0.1.
        """
        super().__init__()

        # Embedding layer for the input vocabulary, maps each token to a vector of size d_model
        self.encoder_embedding = nn.Embedding(input_vocab_size, d_model)

        # Embedding layer for the target vocabulary, maps target tokens to vectors of size d_model
        self.decoder_embedding = nn.Embedding(target_vocab_size, d_model)

        # Positional encoding layer to add positional information to the token embeddings, helps model learn sequence order and structure
        self.pos_embedding = PositionalEmbedding(d_model, max_len)

        # Encoder module: processes the source sequence to create context-aware embeddings
        self.encoder = Encoder(d_model, num_heads, d_ff, num_layers, dropout)

        # Decoder module: generates target sequence predictions using encoder outputs as context
        self.decoder = Decoder(d_model, num_heads, d_ff, num_layers, dropout)

        # Output layer that projects decoder outputs into the target vocabulary's size, yielding logits over possible output tokens
        self.output_layer = nn.Linear(d_model, target_vocab_size)
        
    def forward(self, source, target):
        """
        Performs the forward pass through the Transformer model.

        Args:
            source (torch.Tensor): Input tensor representing the source sequence, 
                                   shape (batch_size, source_seq_len).
            target (torch.Tensor): Input tensor representing the target sequence (used as context),
                                   shape (batch_size, target_seq_len).

        Returns:
            torch.Tensor: Tensor with logits over the target vocabulary for each token in the target sequence,
                          shape (batch_size, target_seq_len, target_vocab_size).
        """
        # Generate masks to control which tokens the model should attend to, including padding
        # and causal masking (for future token masking) in the target sequence
        source_mask, target_mask = self.mask(source, target)

        # Embedding and positional encoding for the source sequence:
        # scales embeddings by sqrt(d_model) to maintain variance, adds positional encoding
        source = self.encoder_embedding(source) * math.sqrt(self.encoder_embedding.embedding_dim)
        source = self.pos_embedding(source)
        
        # Pass the processed source sequence through the encoder to obtain encoded representations
        encoder_output = self.encoder(source, source_mask)
        
        # Embedding and positional encoding for the target sequence, processed similarly to source
        target = self.decoder_embedding(target) * math.sqrt(self.decoder_embedding.embedding_dim)
        target = self.pos_embedding(target)
        
        # Pass the processed target sequence and encoder outputs to the decoder
        output = self.decoder(target, encoder_output, target_mask, source_mask)
        
        # Project the decoder outputs to the target vocabulary, returning logits over each possible token
        return self.output_layer(output)
    
    def mask(self, source, target):
        """
        Creates masks to (1) ignore padding tokens in source and target sequences and 
        (2) prevent attention to future tokens in the target sequence (causal masking).

        Args:
            source (torch.Tensor): Tensor for the source sequence.
            target (torch.Tensor): Tensor for the target sequence.

        Returns:
            tuple: source_mask and target_mask.
                   - source_mask (torch.Tensor): Mask for padding tokens in the source sequence, 
                     shape (batch_size, 1, 1, source_seq_len).
                   - target_mask (torch.Tensor): Mask for both padding tokens and future tokens in the 
                     target sequence, shape (batch_size, 1, target_seq_len, target_seq_len).
        """
        # Mask to prevent attention to padding tokens in the source sequence,
        # true values indicate non-padding tokens
        source_mask = (source != 0).unsqueeze(1).unsqueeze(2)

        # Mask to prevent attention to padding tokens in the target sequence,
        # true values indicate non-padding tokens; shape (batch_size, 1, 1, target_seq_len)
        target_mask = (target != 0).unsqueeze(1).unsqueeze(2)

        # Define a lower-triangular matrix (causal mask) for the target sequence, blocking future tokens.
        # It is created on the same device as `target`, so the method does not depend on a global `device`
        size = target.size(1)  # target sequence length
        no_mask = torch.tril(torch.ones((1, size, size), device=target.device)).bool()
        # Broadcasting combines padding and causal masking: (batch_size, 1, target_seq_len, target_seq_len)
        target_mask = target_mask & no_mask
        
        return source_mask, target_mask
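
The way the padding mask and the causal mask combine through broadcasting can be checked on a tiny input. The following is a minimal sketch that mirrors the `mask` method outside the class; `PAD_IDX` and the token IDs are illustrative assumptions, not values from the dataset.

```python
import torch

PAD_IDX = 0  # assumed padding index, matching the `target != 0` test in `mask`

# One target sequence of length 4 whose last position is padding
target = torch.tensor([[5, 7, 9, PAD_IDX]])

# Padding mask: shape (batch, 1, 1, seq_len), True for real tokens
padding_mask = (target != PAD_IDX).unsqueeze(1).unsqueeze(2)

# Causal mask: shape (1, seq_len, seq_len), True on and below the diagonal
size = target.size(1)
causal_mask = torch.tril(torch.ones((1, size, size), device=target.device)).bool()

# Broadcasting yields shape (batch, 1, seq_len, seq_len)
combined = padding_mask & causal_mask
print(combined.shape)        # torch.Size([1, 1, 4, 4])
print(combined[0, 0].int())  # row i marks the positions that token i may attend to
```

Each row is lower-triangular, and the last column is zero in every row because that position holds padding.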
        

### Simple test

In [None]:
# Define the sequence length for both source and target sentences; these represent the maximum
# number of tokens in each sentence for source and target languages
seq_len_source = 10  # Sequence length for source sentences
seq_len_target = 10  # Sequence length for target sentences

# Define the batch size, which is the number of sentence pairs processed in parallel
batch_size = 2  # Number of sequences (sentences) in each batch

# Define the vocabulary sizes for source and target languages. Each integer represents a unique token.
input_vocab_size = 50   # Vocabulary size for source language (input language)
target_vocab_size = 50  # Vocabulary size for target language (output language)

# Generate random sequences of token IDs for the source sentences. torch.randint excludes the
# upper bound, so IDs range from 1 to input_vocab_size - 1; 0 is skipped because it is the padding index.
source = torch.randint(1, input_vocab_size, (batch_size, seq_len_source))

# Generate random sequences of token IDs for the target sentences, with IDs ranging
# from 1 to target_vocab_size - 1. This simulates sentences in the target language.
# Shape: (batch_size, seq_len_target)
target = torch.randint(1, target_vocab_size, (batch_size, seq_len_target))

In [None]:
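# Hyperparameters matching the base configuration of the original Transformer paper
d_model = 512     # Embedding / hidden dimension
num_heads = 8     # Attention heads per layer
d_ff = 2048       # Inner dimension of the feed-forward sublayer
num_layers = 6    # Number of stacked layers

model = Transformer(d_model, num_heads, d_ff, num_layers,
                    input_vocab_size, target_vocab_size,
                    max_len=MAX_SEQ_LEN, dropout=0.1)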

model = model.to(device)
source = source.to(device)
target = target.to(device)

In [None]:
output = model(source, target)

In [None]:
# Expected output shape -> [batch, seq_len_target, target_vocab_size] i.e. [2, 10, 50]
print(f'output.shape {output.shape}')

### Translator Eng-Spa

In [None]:
# Path to the tab-separated file of English-Spanish sentence pairs
PATH = '/Users/raulastrga/Library/Mobile Documents/com~apple~CloudDocs/Maestría/Advanced Machine Learning Methods/MNA/AMLM/Activity_4/eng-spa.tsv'

In [None]:
# Read the file and split every line that contains a tab into its tab-separated fields
with open(PATH, 'r', encoding='utf-8') as f:
    lines = f.readlines()
eng_spa_pairs = [line.strip().split('\t') for line in lines if '\t' in line]

In [None]:
# Print the first 10 pairs as an example
eng_spa_pairs[:10]

In [None]:
# Separate English and Spanish sentences (fields 1 and 3 of each line hold the text)
eng_sentences = [pair[1] for pair in eng_spa_pairs]
spa_sentences = [pair[3] for pair in eng_spa_pairs]

In [None]:
# Print the first 10 sentences of each language
print(eng_sentences[:10])
print(spa_sentences[:10])

In [None]:
# Function that preprocesses a sentence: lowercases it and cleans special characters with regular expressions
def preprocess_sentence(sentence):
    # Convert the sentence to lowercase and trim surrounding whitespace
    sentence = sentence.lower().strip()
    # Collapse runs of spaces (and double quotes, which are also in the character class) into one space
    sentence = re.sub(r'[" "]+', " ", sentence)
    # Replace accented vowels with their unaccented equivalents
    sentence = re.sub(r"[á]+", "a", sentence)
    sentence = re.sub(r"[é]+", "e", sentence)
    sentence = re.sub(r"[í]+", "i", sentence)
    sentence = re.sub(r"[ó]+", "o", sentence)
    sentence = re.sub(r"[ú]+", "u", sentence)
    # Replace every remaining run of characters outside a-z (digits, punctuation, and letters
    # such as ñ or ü) with a single space
    sentence = re.sub(r"[^a-z]+", " ", sentence)
    # Remove spaces at the start or end of the sentence
    sentence = sentence.strip()
    # Add the start and end tags to the sentence
    sentence = '<sos> ' + sentence + ' <eos>'
    
    return sentence

In [None]:
# Create an example sentence with special characters
s1 = '¿Hola @ cómo estás? 123'

In [None]:
# Print the example
print(s1)

# Preprocess the previous example and print the result
print(preprocess_sentence(s1))
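
Because `preprocess_sentence` only maps the five accented vowels, the final `[^a-z]` substitution turns letters such as `ñ` and `ü` into spaces, so "niño" becomes "ni o". One possible alternative, shown here only as a sketch, folds every diacritic with the standard `unicodedata` module. The helper names `strip_accents` and `preprocess_sentence_folded` are hypothetical and are not part of the provided code.

```python
import re
import unicodedata

def strip_accents(text):
    # Decompose each character (NFD), then drop the combining marks (category "Mn")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def preprocess_sentence_folded(sentence):
    # Lowercase, fold diacritics (ñ -> n, ü -> u, á -> a), then keep only a-z
    sentence = strip_accents(sentence.lower().strip())
    sentence = re.sub(r"[^a-z]+", " ", sentence).strip()
    return "<sos> " + sentence + " <eos>"

print(preprocess_sentence_folded("¿Dónde está el niño?"))
# <sos> donde esta el nino <eos>
```

Folding `ñ` to `n` merges a few distinct Spanish words (for example "año" and "ano"), so whether this trade-off is acceptable depends on the task.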

In [None]:
# Preprocess the full list of sentences of each language
eng_sentences = [preprocess_sentence(sentence) for sentence in eng_sentences]
spa_sentences = [preprocess_sentence(sentence) for sentence in spa_sentences]

In [None]:
# Print the first 10 examples of the Spanish list
spa_sentences[:10]

In [None]:
# Function that creates the vocabulary
def build_vocab(sentences):
    # Split every sentence into words and collect them in a single list
    words = [word for sentence in sentences for word in sentence.split()]
    # Count how many times each word appears
    word_count = Counter(words)
    # Sort the words by frequency, most frequent first
    sorted_word_counts = sorted(word_count.items(), key=lambda x:x[1], reverse=True)
    # Create the word-to-index dictionary; indexes start at 2 because 0 and 1 are reserved
    word2idx = {word: idx for idx, (word, _) in enumerate(sorted_word_counts, 2)}
    # Reserve index 0 for the padding token
    word2idx['<pad>'] = 0
    # Reserve index 1 for the unknown-word token
    word2idx['<unk>'] = 1
    # Create the reverse dictionary, from index to word
    idx2word = {idx: word for word, idx in word2idx.items()}
    
    return word2idx, idx2word
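
The indexing scheme can be traced on two toy sentences. This sketch reimplements the same scheme in a compact form so that it runs on its own; `build_vocab_sketch` and the toy sentences are illustrative assumptions.

```python
from collections import Counter

def build_vocab_sketch(sentences):
    # Same scheme as build_vocab: 0 = <pad>, 1 = <unk>, then words by descending frequency
    counts = Counter(word for sentence in sentences for word in sentence.split())
    word2idx = {"<pad>": 0, "<unk>": 1}
    for word, _ in counts.most_common():
        word2idx[word] = len(word2idx)
    return word2idx

toy = ["<sos> the cat <eos>", "<sos> the dog <eos>"]
word2idx = build_vocab_sketch(toy)

# "bird" never appeared in the toy corpus, so it maps to the <unk> index
unk = word2idx["<unk>"]
encoded = [word2idx.get(w, unk) for w in "<sos> the bird <eos>".split()]
print(word2idx)
print(encoded)  # [2, 3, 1, 4]
```

Words with equal counts keep the order in which they were first seen, which is why `<sos>`, `the`, and `<eos>` receive indexes 2, 3, and 4.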

In [None]:
# Creating vocabularies for english and spanish
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
spa_word2idx, spa_idx2word = build_vocab(spa_sentences)

# Creating variables with size of both vocabularies
eng_vocab_size = len(eng_word2idx)
spa_vocab_size = len(spa_word2idx)

In [None]:
# Print the size of both vocabularies
print(eng_vocab_size, spa_vocab_size)

In [None]:
class EngSpaDataset(Dataset):
    """Pairs of preprocessed English/Spanish sentences, returned as tensors of token indexes."""
    def __init__(self, eng_sentences, spa_sentences, eng_word2idx, spa_word2idx):
        self.eng_sentences = eng_sentences
        self.spa_sentences = spa_sentences
        self.eng_word2idx = eng_word2idx
        self.spa_word2idx = spa_word2idx
        
    def __len__(self):
        # Number of sentence pairs
        return len(self.eng_sentences)
    
    def __getitem__(self, idx):
        eng_sentence = self.eng_sentences[idx]
        spa_sentence = self.spa_sentences[idx]
        # Convert each word to its index, falling back to <unk> for out-of-vocabulary words
        eng_idxs = [self.eng_word2idx.get(word, self.eng_word2idx['<unk>']) for word in eng_sentence.split()]
        spa_idxs = [self.spa_word2idx.get(word, self.spa_word2idx['<unk>']) for word in spa_sentence.split()]
        
        # Sequences have different lengths; padding happens later in collate_fn
        return torch.tensor(eng_idxs), torch.tensor(spa_idxs)

In [None]:
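def collate_fn(batch):
    # Split the list of (eng, spa) pairs into two tuples of tensors
    eng_batch, spa_batch = zip(*batch)
    # Truncate every sequence to MAX_SEQ_LEN tokens (a longer sentence loses its <eos> tag)
    eng_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in eng_batch]
    spa_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in spa_batch]
    # Pad with 0 (<pad>) up to the longest sequence in the batch; shape (batch_size, max_len_in_batch)
    eng_batch = torch.nn.utils.rnn.pad_sequence(eng_batch, batch_first=True, padding_value=0)
    spa_batch = torch.nn.utils.rnn.pad_sequence(spa_batch, batch_first=True, padding_value=0)
    return eng_batch, spa_batch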
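
The truncate-then-pad step can be seen on two toy sequences. This is a small sketch; `MAX_LEN` is a hypothetical stand-in for `MAX_SEQ_LEN`, and the token IDs are made up (2 for `<sos>`, 3 for `<eos>`).

```python
import torch
from torch.nn.utils.rnn import pad_sequence

MAX_LEN = 4  # hypothetical stand-in for MAX_SEQ_LEN

# Two sequences of different lengths; 2 plays the role of <sos> and 3 of <eos>
seqs = [torch.tensor([2, 5, 3]), torch.tensor([2, 6, 7, 8, 9, 3])]

# Truncate first, then pad to the longest remaining sequence with 0 (<pad>)
truncated = [s[:MAX_LEN] for s in seqs]
batch = pad_sequence(truncated, batch_first=True, padding_value=0)
print(batch)
# tensor([[2, 5, 3, 0],
#         [2, 6, 7, 8]])
```

The second row shows the side effect of truncation: the sentence no longer ends with the `<eos>` ID.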
    

In [None]:
def train(model, dataloader, loss_function, optimiser, epochs):
    model.train()
    for epoch in range(epochs):
        total_loss = 0 
        for i, (eng_batch, spa_batch) in enumerate(dataloader):
            eng_batch = eng_batch.to(device)
            spa_batch = spa_batch.to(device)
            # Teacher forcing: the decoder input drops the last token, and the
            # expected output is the same sequence shifted one position to the left
            target_input = spa_batch[:, :-1]
            target_output = spa_batch[:, 1:].contiguous().view(-1)
            # Zero grads
            optimiser.zero_grad()
            # Run the model and flatten the logits to (batch * seq_len, vocab_size)
            output = model(eng_batch, target_input)
            output = output.view(-1, output.size(-1))
            # Compute the loss
            loss = loss_function(output, target_output)
            # Compute gradients and update parameters
            loss.backward()
            optimiser.step()
            total_loss += loss.item()
            
        # Average loss over all batches of the epoch
        avg_loss = total_loss/len(dataloader)
        print(f'Epoch: {epoch + 1}/{epochs}, Loss: {avg_loss:.4f}')
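
The slicing in the training loop implements teacher forcing: at each position the decoder sees the true previous tokens and is asked to predict the next one. A minimal sketch with made-up token IDs (2 for `<sos>`, 3 for `<eos>`, 0 for `<pad>`) shows how input and expected output line up.

```python
import torch

# Hypothetical token ids: 2 = <sos>, 3 = <eos>, 0 = <pad>
spa_batch = torch.tensor([[2, 10, 11, 3],
                          [2, 12, 3, 0]])

# Decoder input: everything except the last position
target_input = spa_batch[:, :-1]
# Expected output: everything except the first position (shifted left by one)
target_output = spa_batch[:, 1:]

print(target_input.tolist())   # [[2, 10, 11], [2, 12, 3]]
print(target_output.tolist())  # [[10, 11, 3], [12, 3, 0]]

# Flattened form that is compared against logits of shape (batch * seq_len, vocab_size)
flat_target = target_output.reshape(-1)
print(flat_target.tolist())    # [10, 11, 3, 12, 3, 0]
```

In the second row the last expected token is padding; such positions contribute nothing when the loss is built with `ignore_index=0`.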
            
            

In [None]:
BATCH_SIZE = 64
dataset = EngSpaDataset(eng_sentences, spa_sentences, eng_word2idx, spa_word2idx)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)

In [None]:
model = Transformer(d_model=512, num_heads=8, d_ff=2048, num_layers=6,
                    input_vocab_size=eng_vocab_size, target_vocab_size=spa_vocab_size,
                    max_len=MAX_SEQ_LEN, dropout=0.1)

In [None]:
model = model.to(device)
loss_function = nn.CrossEntropyLoss(ignore_index=0)
optimiser = optim.Adam(model.parameters(), lr=0.0001)
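
The `ignore_index=0` argument keeps padded positions out of the loss. The sketch below uses made-up logits and targets to check that changing the logits at a padded position leaves the loss unchanged.

```python
import math
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss(ignore_index=0)

# Uniform logits over 5 classes for 3 positions; the last target is padding (0)
logits = torch.zeros(3, 5)
targets = torch.tensor([2, 4, 0])
loss = loss_fn(logits, targets)

# Changing the logits at the padded position must not change the loss
logits_changed = logits.clone()
logits_changed[2] = torch.tensor([10.0, -3.0, 0.0, 5.0, 1.0])
loss_changed = loss_fn(logits_changed, targets)

print(loss.item(), loss_changed.item())
# Both equal log(5): the mean is taken over the two non-padded positions only
print(math.log(5))
```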


In [None]:
train(model, dataloader, loss_function, optimiser, epochs = 10)

In [None]:
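def sentence_to_indices(sentence, word2idx):
    # Map each word to its index, falling back to <unk> for out-of-vocabulary words
    return [word2idx.get(word, word2idx['<unk>']) for word in sentence.split()]

def indices_to_sentence(indices, idx2word):
    # Map indexes back to words, skipping unknown indexes and padding
    return ' '.join([idx2word[idx] for idx in indices if idx in idx2word and idx2word[idx] != '<pad>'])

def translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    """Greedily translate one English sentence into Spanish, one token at a time."""
    model.eval()
    sentence = preprocess_sentence(sentence)
    # Truncate to max_len tokens, as collate_fn does during training
    input_indices = sentence_to_indices(sentence, eng_word2idx)[:max_len]
    input_tensor = torch.tensor(input_indices).unsqueeze(0).to(device)

    # Look up the special tokens through the spa_idx2word argument rather than a global dictionary
    word2idx = {word: idx for idx, word in spa_idx2word.items()}
    sos_idx, eos_idx = word2idx['<sos>'], word2idx['<eos>']

    # Initialize the target sequence with the <sos> token
    tgt_indices = [sos_idx]
    tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)

    with torch.no_grad():
        for _ in range(max_len):
            # Logits for every target position: (1, tgt_len, vocab) -> (tgt_len, vocab)
            output = model(input_tensor, tgt_tensor)
            output = output.squeeze(0)
            # Greedy choice: the most likely token at the last position
            next_token = output.argmax(dim=-1)[-1].item()
            tgt_indices.append(next_token)
            tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)
            # Stop as soon as the model emits <eos>
            if next_token == eos_idx:
                break

    return indices_to_sentence(tgt_indices, spa_idx2word)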
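
The decoding loop in `translate_sentence` is a greedy search: at every step it keeps only the single most likely next token. The control flow can be isolated from the model with a stub scoring function. In this sketch `greedy_decode` and `toy_scores` are hypothetical names, and the stub simply prefers the token after the last one.

```python
def greedy_decode(next_token_scores, sos_idx, eos_idx, max_len):
    """Grow the output one token at a time, always taking the highest score."""
    tokens = [sos_idx]
    for _ in range(max_len):
        scores = next_token_scores(tokens)  # one score per vocabulary entry
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == eos_idx:
            break  # the sequence is complete
    return tokens

# Hypothetical stand-in for the model: prefers token (last + 1), capped at the last index
def toy_scores(tokens, vocab_size=6):
    preferred = min(tokens[-1] + 1, vocab_size - 1)
    return [1.0 if i == preferred else 0.0 for i in range(vocab_size)]

print(greedy_decode(toy_scores, sos_idx=2, eos_idx=5, max_len=10))  # [2, 3, 4, 5]
print(greedy_decode(toy_scores, sos_idx=2, eos_idx=5, max_len=2))   # [2, 3, 4]
```

The second call shows the role of `max_len`: decoding stops after that many steps even if the end token was never produced.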

In [None]:
def evaluate_translations(model, sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    # Translate each sentence and print it next to its translation
    for sentence in sentences:
        translation = translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len, device)
        print(f'Input sentence: {sentence}')
        print(f'Translation: {translation}')
        print()

# Example sentences to test the translator
test_sentences = [
    "Hello, how are you?",
    "I am learning artificial intelligence.",
    "Artificial intelligence is great.",
    "Good night!"
]

# Assuming the model is trained and loaded
# Set the device to 'cpu' or 'cuda' as needed
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Evaluate translations
evaluate_translations(model, test_sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device=device)
