# Lab 7

Based on [NLP From Scratch: Translation with a Sequence to Sequence Network and Attention](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) by [Sean Robertson](https://github.com/spro)

In this lab we will be making a neural network that can translate from French to English.

Here is an example of what the final system will do.
The line with `>` is the input.
The line with `=` is the correct translation.
The line with `<` is the output of the model.

``` {.sourceCode .sh}
> il est en train de peindre un tableau .
= he is painting a picture .
< he is painting a picture .

> elle n est pas poete mais romanciere .
= she is not a poet but a novelist .
< she not not a poet but a novelist .
```

Note that sometimes the model is right, and sometimes it is wrong.

## Setup


In [None]:
!pip install torch

In [None]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

import numpy as np
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

device = "cpu"

Loading data files
==================

The data for this project is a set of many thousands of English to French translation pairs.

First, download [this data](https://www.manythings.org/anki/fra-eng.zip) from the Tatoeba project (learn more [here](https://tatoeba.org/en) and see more preprocessed data samples [here](https://www.manythings.org/anki/)).

Unzip the archive and place the extracted `fra.txt` file in the same directory as this notebook.

The data contains lines of tab-separated text like this:

``` {.sourceCode .sh}
See you soon!   À bientôt !     CC-BY 2.0 (France) Attribution: tatoeba.org #32672 (CK) & #337862 (sysko)
See you soon!   À tout à l'heure !      CC-BY 2.0 (France) Attribution: tatoeba.org #32672 (CK) & #829076 (Cocorico)
```

First, we'll create an index (`Lang`) that maps words to IDs and vice versa. We'll also have it keep track of how many times a word has been added, which we'll use later to make the vocabulary smaller by ignoring rare words.

In [None]:
# Special token identifiers, used to mark the beginning and end of a sentence respectively.
SOS_token = 0
EOS_token = 1

# Defines a class 'Lang' to manage language-specific data.
class Lang:
    # The class constructor that initializes a new instance of the language data handler.
    def __init__(self, name):
        self.name = name  # The name of the language (e.g., 'English', 'French').
        self.word2index = {}  # A dictionary to map words to their numeric index.
        self.word2count = {}  # A dictionary to count occurrences of each word.
        self.index2word = {0: "SOS", 1: "EOS"}  # A dictionary to map numeric indices back to words, pre-filled with special tokens.
        self.n_words = 2  # The total number of unique words in the vocabulary, starting with 2 to account for the special tokens.

    # Adds a sentence to the language model, incrementing the vocabulary and word counts.
    def addSentence(self, sentence):
        for word in sentence.split(' '):  # Splits the sentence into words and processes each word.
            self.addWord(word)

    # Adds a word to the language model, updating the necessary mappings and counts.
    def addWord(self, word):
        if word not in self.word2index:
            # If the word is new, it is added to all relevant dictionaries and counters.
            self.word2index[word] = self.n_words  # Maps the word to the current count of unique words.
            self.word2count[word] = 1  # Initializes the word's count to 1.
            self.index2word[self.n_words] = word  # Maps the current count of unique words back to the word.
            self.n_words += 1  # Increments the total count of unique words.
        else:
            # If the word already exists, just increments its count.
            self.word2count[word] += 1
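
As a rough illustration of how the per-word counts could later be used to shrink the vocabulary, the sketch below rebuilds the index from only those words seen at least `min_count` times. The `prune_vocab` helper and the `min_count` threshold are hypothetical and are not part of the lab code; only the dictionary layout mirrors `Lang`.

```python
# Hypothetical helper (not part of the lab code): keep only words whose
# count reaches a threshold, and re-number the survivors after SOS and EOS.
def prune_vocab(word2count, min_count=2):
    word2index = {}
    index2word = {0: "SOS", 1: "EOS"}
    for word, count in word2count.items():
        if count >= min_count:
            index = len(index2word)  # next free ID after the special tokens
            word2index[word] = index
            index2word[index] = word
    return word2index, index2word

# Toy counts, as Lang.word2count would hold them after a few addSentence calls.
counts = {"i": 3, "am": 2, "cold": 1, "tired": 2}
word2index, index2word = prune_vocab(counts, min_count=2)
print(word2index)  # {'i': 2, 'am': 3, 'tired': 4}
```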

The files are all in Unicode. To simplify the task, we will turn Unicode
characters to ASCII, make everything lowercase, and trim most
punctuation.


In [None]:
# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427

def unicodeToAscii(s):
    # Convert a Unicode string 's' to plain ASCII.
    # This is done by first normalizing the string into its decomposed form using 'NFD',
    # which separates characters from their accents. Then, it filters out all nonspacing marks (Mn).
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    # First, convert the string to lowercase and strip leading and trailing whitespaces.
    # This helps in reducing the variation between different uses of capitalization and spaces.
    s = s.lower().strip()

    # Convert the string from Unicode to ASCII, removing diacritics (e.g., accents) from characters.
    # This is crucial for languages with accented characters, making the text processing uniform.
    s = unicodeToAscii(s)

    # Insert a space before any punctuation marks (.!?).
    # This ensures punctuation is treated as a separate word, aiding in tokenization for NLP tasks.
    # For example, "hello!" becomes "hello !".
    s = re.sub(r"([.!?])", r" \1", s)

    # Replace any run of characters other than letters, '!' and '?' with a single space.
    # This removes digits, apostrophes, and other symbols. Note that '.' is not in the
    # kept set, so the periods separated out above are dropped here as well.
    s = re.sub(r"[^a-zA-Z!?]+", r" ", s)

    # Finally, strip leading and trailing whitespaces that might have been added
    # during the normalization process, ensuring the output is tidy.
    return s.strip()
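
To see what the normalization does in practice, here is a condensed, self-contained copy of the two functions above applied to two sample strings. The names `strip_accents` and `normalize` are used only for this sketch; the regular expressions are the same ones used in `normalizeString`.

```python
import re
import unicodedata

def strip_accents(s):
    # Decompose accented characters (NFD) and drop the combining marks (Mn).
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )

def normalize(s):
    s = strip_accents(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)     # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z!?]+", r" ", s)  # same pattern as normalizeString
    return s.strip()

print(normalize("À bientôt !"))  # a bientot !
print(normalize("I'm here."))    # i m here
```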

To read the data file we will split the file into lines, and then split lines into pairs.
The file is English → French, so this function has a `reverse` flag that flips the pairs and lets you translate French → English.


In [None]:
def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Open and read the text file named after 'lang2' (e.g., 'fra.txt'). The file is expected
    # to contain sentence pairs in 'lang1' and 'lang2', separated by tabs.
    # The 'with' block closes the file once its contents have been split into lines.
    with open('%s.txt' % (lang2), encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # For each line in 'lines', split the line into parts using the tab delimiter ('\t'),
    # take the first two parts (assuming they are the sentences in 'lang1' and 'lang2'),
    # and apply the 'normalizeString' function to each. The result is a list of lists,
    # where each inner list contains a pair of normalized sentences.
    pairs = [[normalizeString(s) for s in l.split('\t')[:2]] for l in lines]

    # If the 'reverse' flag is set to True, reverse the order of sentences in each pair
    # (i.e., make 'lang2' sentences come first). This is useful when the model needs
    # to translate from 'lang2' to 'lang1' instead of the default 'lang1' to 'lang2'.
    # Additionally, initialize 'Lang' objects for input and output languages accordingly.
    if reverse:
        pairs = [[p[1], p[0]] for p in pairs]  # Swap the sentence order in each pair.
        input_lang = Lang(lang2)  # Initialize 'Lang' object for 'lang2' as the input language.
        output_lang = Lang(lang1)  # Initialize 'Lang' object for 'lang1' as the output language.
    else:
        # If 'reverse' is False, keep the order as is and initialize 'Lang' objects
        # with 'lang1' as the input language and 'lang2' as the output language.
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    # Return the 'Lang' objects for input and output languages, and the list of sentence pairs.
    return input_lang, output_lang, pairs
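
The line-splitting step can be tried on a single raw line without reading the file. This is only a sketch of the split and the swap; normalization is left out, and the sample line is shortened from the data excerpt shown earlier.

```python
# One raw line from the data file: English, French, attribution (tab-separated).
line = "See you soon!\tÀ bientôt !\tCC-BY 2.0 (France) Attribution: tatoeba.org"

# Keep only the first two fields; the attribution column is discarded.
pair = line.split("\t")[:2]
print(pair)           # ['See you soon!', 'À bientôt !']

# Swapping the two fields is all that reverse=True does to each pair.
reversed_pair = [pair[1], pair[0]]
print(reversed_pair)  # ['À bientôt !', 'See you soon!']
```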

Since there are a *lot* of example sentences and we want to train
something quickly, we'll trim the data set to only relatively short and
simple sentences. Here the maximum length is 7 tokens, counting the EOS token
that is appended later, so a sentence may contain at most 6 words (including any
ending punctuation). We also keep only sentences that start with "I am" or "I'm".
Note that the second prefix is written "i m " because `normalizeString` lowercases the text and replaces the apostrophe with a space.


In [None]:
# Set a maximum sentence length. Sentences with this many words or more are excluded,
# which leaves room for the EOS token that is appended later.
MAX_LENGTH = 7

# Define a tuple of English sentence prefixes to consider. Only sentences starting with these
# prefixes will be kept during the filtering process.
eng_prefixes = (
    "i am ", "i m ",
)

def filterPair(p):
    # Determine if a given pair of sentences ('p') should be kept based on length and prefix criteria.
    
    # Check if the first sentence in the pair has MAX_LENGTH or more words.
    if len(p[0].split(' ')) >= MAX_LENGTH:
        return False  # Exclude the pair if the first sentence is too long.
    
    # Check if the second sentence in the pair has MAX_LENGTH or more words.
    elif len(p[1].split(' ')) >= MAX_LENGTH:
        return False  # Exclude the pair if the second sentence is too long.
    
    else:
        # Check if the first sentence starts with any of the specified prefixes.
        for prefix in eng_prefixes:
            if p[0].startswith(prefix):
                return True  # Keep the pair if the first sentence starts with a valid prefix.
    
    # Exclude the pair if none of the prefixes match.
    return False

def filterPairs(pairs):
    # Filter a list of sentence pairs using the filterPair criteria.
    
    keep = []  # Initialize an empty list to store pairs that meet the filtering criteria.
    for pair in pairs:
        # For each pair in the input list, check if it should be kept.
        if filterPair(pair):
            keep.append(pair)  # Add the pair to the 'keep' list if it passes the filter.
    return keep  # Return the list of pairs that meet the filtering criteria.
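
The filter can be checked on a few hand-written pairs. The sketch below is a compact restatement of the same two rules, not a replacement for the functions above; `keep`, `PREFIXES`, and the toy pairs are made up for illustration.

```python
MAX_LENGTH = 7
PREFIXES = ("i am ", "i m ")

def keep(pair):
    # Both sides must have fewer than MAX_LENGTH words, and the
    # first sentence must start with one of the prefixes.
    short = all(len(s.split(" ")) < MAX_LENGTH for s in pair)
    return short and pair[0].startswith(PREFIXES)

toy_pairs = [
    ["i am cold", "j ai froid"],                      # kept
    ["he is cold", "il a froid"],                     # wrong prefix
    ["i m sure i can do that today", "je suis sur"],  # 8 words: too long
]
kept = [p for p in toy_pairs if keep(p)]
print(kept)  # [['i am cold', 'j ai froid']]
```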

The full process for preparing the data is:

-   Read text file and split into lines, split lines into pairs
-   Normalize text, filter by length and content
-   Make word lists from sentences in pairs


In [None]:
def prepareData(lang1, lang2, reverse=False):
    # Reads sentence pairs from a file, optionally reversing the sentence order.
    # 'lang1' and 'lang2' are names of the languages (e.g., 'eng' for English, 'fra' for French).
    # The 'reverse' flag, when set to True, reverses the order in which sentences are read,
    # which can be useful for changing the direction of translation.
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    # Filter the read sentence pairs to remove those that don't meet certain criteria,
    # such as length or specific starting phrases.
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    
    print("Counting words...")
    # Process each sentence pair, adding the words from each sentence to their respective
    # language's vocabulary.
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

# Example usage of the prepareData function.
# Prepares the data for English to French translation (can be reversed).
input_lang, output_lang, pairs = prepareData('eng', 'fra')
# Print a random sentence pair from the prepared data to demonstrate the outcome.
print(random.choice(pairs))

# Model - Encoder

In this lab, we will use a GRU encoder.

In [None]:
import torch.nn as nn

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, dropout_p=0.1):
        # Inherits from nn.Module, a base class for all neural network modules.
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size  # Sets the size of the hidden layers in the GRU.

        # nn.Embedding layer converts token indices to dense vectors of a fixed size,
        # 'input_size' is the size of the input vocabulary, and 'hidden_size' is the
        # dimensionality of the embedding vector.
        self.embedding = nn.Embedding(input_size, hidden_size)
        
        # GRU layer: a type of RNN that can handle sequences of variable length.
        # Here it is configured to have 'hidden_size' units. 'batch_first=True'
        # indicates that the input tensors will have the batch size as the first dimension.
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        
        # Dropout layer: a regularization technique where randomly selected neurons are
        # ignored during training. This helps prevent overfitting. 'dropout_p' specifies
        # the probability of an element to be zeroed.
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input):
        # Defines the forward pass of the encoder.
        # 'input' is the input sequence to the encoder.
        
        # First, the input is passed through the embedding layer.
        embedded = self.embedding(input)
        
        # The embeddings are then passed through a dropout layer to prevent overfitting.
        embedded = self.dropout(embedded)
        
        # The output of the dropout layer is fed into the GRU. No initial hidden state is
        # passed, so it defaults to zeros. The GRU returns the output and a new hidden state.
        output, hidden = self.gru(embedded)
        
        # The function returns the output and the final hidden state of the GRU.
        return output, hidden
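
For reference, the update that a single-layer GRU such as `nn.GRU` applies at each time step can be written as below, following the notation of the PyTorch documentation. Here $x_t$ is the embedded input token, $h_{t-1}$ is the previous hidden state, $\sigma$ is the logistic sigmoid, and $\odot$ is element-wise multiplication. The weight and bias names are PyTorch's, not names used in the lab code.

```latex
\begin{aligned}
r_t &= \sigma\left(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}\right) \\
z_t &= \sigma\left(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}\right) \\
n_t &= \tanh\left(W_{in} x_t + b_{in} + r_t \odot \left(W_{hn} h_{t-1} + b_{hn}\right)\right) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
```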

# Model - Decoder

We'll use a GRU-based decoder too, with attention.

This code includes the option to specify 'no attention', in which case a placeholder module returns an empty context vector and all-zero attention weights. This allows us to vary the attention method without adjusting the rest of the decoder.
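
The additive (Bahdanau) attention used in this lab can be summarized by the equations below. This is a sketch of the standard formulation with the bias terms of the linear layers omitted: $s_{t-1}$ stands for the decoder hidden state used as the query, $h_i$ for the encoder output at input position $i$, and $W_a$, $U_a$, $v_a$ for the layers named `Wa`, `Ua`, and `Va` in the code.

```latex
\begin{aligned}
e_{t,i} &= v_a^{\top} \tanh\left(W_a s_{t-1} + U_a h_i\right) \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})} \\
c_t &= \sum_{i} \alpha_{t,i} \, h_i
\end{aligned}
```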

In [None]:
class AdditiveAttention(nn.Module):
    # Implements an additive (Bahdanau) attention mechanism.
    def __init__(self, hidden_size):
        super(AdditiveAttention, self).__init__()
        # Linear transformations for the attention mechanism.
        self.Wa = nn.Linear(hidden_size, hidden_size)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)
        # The decoder GRU input is the embedded token concatenated with the context vector,
        # so its size is double the hidden size.
        self.out_size = hidden_size * 2

    def forward(self, query, keys):
        # Computes attention scores and weighted sum (context vector) for the given query and keys.
        # Query: decoder's hidden state. Keys: encoder outputs.
        # Scores: raw attention scores for each key given the query.
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(keys)))
        # Adjusting dimensions for softmax operation.
        scores = scores.squeeze(2).unsqueeze(1)

        # Softmax to obtain attention weights.
        weights = F.softmax(scores, dim=-1)
        # Weighted sum of keys to get the context vector.
        context = torch.bmm(weights, keys)
        return context, weights

class NoAttention(nn.Module):
    # A placeholder attention mechanism that does not actually perform attention.
    def __init__(self, hidden_size):
        super(NoAttention, self).__init__()
        self.out_size = hidden_size  # The context is empty, so the GRU input is just the embedded token.

    def forward(self, query, keys):
        # Returns a zero-width context and all-zero weights, mimicking absence of attention.
        context = torch.zeros([query.shape[0], query.shape[1], 0]).to(device)
        weights = torch.zeros(keys.shape).to(device)
        return context, weights

def get_attention_module(name, hidden_size):
    # Factory function to select and return an attention module by name.
    if name == 'none':
        return NoAttention(hidden_size)
    elif name == "additive":
        return AdditiveAttention(hidden_size)
    else:
        raise Exception(f"Attention type {name} is not defined")

class AttnDecoderRNN(nn.Module):
    # Decoder RNN that can use either no attention or additive attention.
    def __init__(self, hidden_size, output_size, attention_type="none", dropout_p=0.1):
        super(AttnDecoderRNN, self).__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.attention = get_attention_module(attention_type, hidden_size)
        self.gru = nn.GRU(self.attention.out_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        # Main forward pass of the decoder.
        # Handles both training mode (with teacher forcing) and inference mode.
        # Loops through each time step, applying attention and GRU updates.

        # Initialization steps for inputs and hidden states.
        batch_size = encoder_outputs.size(0)
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(SOS_token)
        decoder_hidden = encoder_hidden

        # Containers for outputs and attention weights.
        decoder_outputs = []
        attentions = []

        # Iteratively generate sequence.
        for i in range(MAX_LENGTH):
            decoder_output, decoder_hidden, attn_weights = self.forward_step(
                decoder_input, decoder_hidden, encoder_outputs
            )
            decoder_outputs.append(decoder_output)
            attentions.append(attn_weights)

            # Determine next input based on teacher forcing or inference mode.
            if target_tensor is not None:
                decoder_input = target_tensor[:, i].unsqueeze(1)  # Teacher forcing
            else:
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()  # Inference mode

        # Concatenate and finalize outputs.
        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        attentions = torch.cat(attentions, dim=1)

        return decoder_outputs, decoder_hidden, attentions

    def forward_step(self, input, hidden, encoder_outputs):
        # Performs a single decoder step (one time step).
        embedded = self.dropout(self.embedding(input))

        # Generate context vector using attention.
        query = hidden.permute(1, 0, 2)
        context, attn_weights = self.attention(query, encoder_outputs)
        input_gru = torch.cat((embedded, context), dim=2)

        # Update GRU state.
        output, hidden = self.gru(input_gru, hidden)
        output = self.out(output)

        return output, hidden, attn_weights
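
The last two steps of `AdditiveAttention.forward`, the softmax over the scores and the weighted sum of the keys, can be traced by hand on tiny inputs. The sketch below uses plain Python lists instead of tensors and takes the scores as given; `attend` is a made-up name for illustration.

```python
import math

def attend(scores, keys):
    # Softmax over the scores (shifted by the max for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the key vectors.
    dim = len(keys[0])
    context = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
    return context, weights

# Three encoder positions with 2-dimensional outputs.
# Equal scores give equal weights, so the context is the mean of the keys.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attend([0.0, 0.0, 0.0], keys)
print(weights)
print(context)
```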

# Training

## Preparing Training Data

To train, for each pair we will need an input tensor (indexes of the
words in the input sentence) and target tensor (indexes of the words in
the target sentence). While creating these vectors we will append the
EOS token to both sequences.


In [None]:
def indexesFromSentence(lang, sentence):
    # Converts a sentence into a list of word indices according to a given language's vocabulary.
    return [lang.word2index[word] for word in sentence.split(' ')]


def tensorFromSentence(lang, sentence):
    # Converts a sentence into a PyTorch tensor of word indices, appending the EOS (End of Sentence) token.
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)  # Appends the EOS token's index to signify the end of the sentence.
    # Converts the list of indices into a PyTorch tensor and returns it.
    return torch.tensor(indexes, dtype=torch.long, device=device).view(1, -1)


def tensorsFromPair(pair):
    # Given a pair of sentences (input and target), this function converts both into tensors.
    input_tensor = tensorFromSentence(input_lang, pair[0])  # Input sentence tensor.
    target_tensor = tensorFromSentence(output_lang, pair[1])  # Target sentence tensor.
    return (input_tensor, target_tensor)


def get_dataloader(batch_size):
    # Prepares the data and creates a DataLoader for batching during training.
    input_lang, output_lang, pairs = prepareData('eng', 'fra')  # Prepares and returns language data and pairs.

    n = len(pairs)
    input_ids = np.zeros((n, MAX_LENGTH), dtype=np.int32)  # Initializes a numpy array for input sentence indices.
    target_ids = np.zeros((n, MAX_LENGTH), dtype=np.int32)  # Initializes a numpy array for target sentence indices.

    for idx, (inp, tgt) in enumerate(pairs):
        inp_ids = indexesFromSentence(input_lang, inp) + [EOS_token]  # Gets input indices, appends EOS token.
        tgt_ids = indexesFromSentence(output_lang, tgt) + [EOS_token]  # Gets target indices, appends EOS token.
        # Fills the respective numpy arrays with indices.
        input_ids[idx, :len(inp_ids)] = inp_ids
        target_ids[idx, :len(tgt_ids)] = tgt_ids

    # Converts the numpy arrays to PyTorch tensors and moves them to the specified device.
    train_data = TensorDataset(torch.LongTensor(input_ids).to(device),
                               torch.LongTensor(target_ids).to(device))
    # Creates a DataLoader with random sampling for batch generation.
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    return input_lang, output_lang, train_dataloader
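
To make the layout of one row of `input_ids` or `target_ids` concrete, here is a small sketch with a toy vocabulary. The `encode` helper and the word IDs are invented for illustration. Note that the padding value 0 is the same ID as `SOS_token`, because the arrays are initialized with zeros.

```python
EOS_token = 1
MAX_LENGTH = 7

# Toy vocabulary; real IDs come from Lang.word2index (0 and 1 are SOS and EOS).
word2index = {"i": 2, "am": 3, "cold": 4}

def encode(sentence):
    ids = [word2index[w] for w in sentence.split(" ")] + [EOS_token]
    # Right-pad with zeros up to MAX_LENGTH, as the zero-initialized arrays do.
    return ids + [0] * (MAX_LENGTH - len(ids))

print(encode("i am cold"))  # [2, 3, 4, 1, 0, 0, 0]
```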

## Training the Model

To train we run the input sentence through the encoder, and keep track
of every output and the latest hidden state. Then the decoder is given
the `<SOS>` token as its first input, and the last hidden state of the
encoder as its first hidden state. We use "teacher forcing" as described in the lecture.

In [None]:
def train_epoch(dataloader, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion):
    """
    Trains the model for one epoch using the given dataloader, encoder, decoder, and optimizers.

    Parameters:
    - dataloader: DataLoader providing batches of input and target tensors.
    - encoder: The encoder model which processes the input tensors.
    - decoder: The decoder model which generates the output sequence.
    - encoder_optimizer: Optimizer for updating the encoder's weights.
    - decoder_optimizer: Optimizer for updating the decoder's weights.
    - criterion: Loss function to calculate the difference between
                 the decoder's outputs and the target tensors.

    Returns:
    - The average loss over all batches in this epoch.
    """
    total_loss = 0  # Initialize total loss for this epoch.

    # Iterate over batches of data in the dataloader.
    for data in dataloader:
        input_tensor, target_tensor = data  # Unpack the batch into input and target tensors.

        # Clear gradients before processing the batch.
        encoder_optimizer.zero_grad()
        decoder_optimizer.zero_grad()

        # Pass the input tensor through the encoder.
        encoder_outputs, encoder_hidden = encoder(input_tensor)

        # Pass the encoder's outputs and hidden state to the decoder, along with the target tensor.
        decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)

        # Compute the loss between the decoder's output and the actual target tensor.
        # The .view(-1, decoder_outputs.size(-1)) reshapes the decoder's output
        # to a 2D tensor where rows correspond to batch elements concatenated together,
        # and columns correspond to the output size. The target is similarly flattened.
        loss = criterion(decoder_outputs.view(-1, decoder_outputs.size(-1)), target_tensor.view(-1))

        loss.backward()  # Compute the gradient of the loss with respect to model parameters.

        # Update the encoder and decoder parameters based on gradients.
        encoder_optimizer.step()
        decoder_optimizer.step()

        total_loss += loss.item()  # Accumulate the loss.

    # Calculate the average loss per batch for this epoch.
    return total_loss / len(dataloader)
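
With teacher forcing, the input to the decoder at each step is the correct previous token rather than the model's own prediction. The sketch below shows the resulting input sequence for a toy target; `teacher_forced_inputs` is a made-up name, and the real decoder builds the same sequence one step at a time inside its loop.

```python
SOS_token = 0

def teacher_forced_inputs(target_ids):
    # Step 0 gets SOS; step i > 0 gets the true token from step i - 1,
    # regardless of what the model predicted.
    return [SOS_token] + target_ids[:-1]

target = [5, 9, 3, 1]  # toy target IDs ending in EOS (1)
print(teacher_forced_inputs(target))  # [0, 5, 9, 3]
```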


This is a helper function to print time elapsed and estimated time
remaining given the current time and progress %.


In [None]:
import time
import math

def asMinutes(s):
    """
    Converts seconds into a minutes and seconds format.
    
    Parameters:
    - s: The time in seconds.
    
    Returns:
    - A string representing the time in minutes and seconds ('Xm Xs').
    """
    m = math.floor(s / 60)  # Convert seconds to minutes, discarding any remainder.
    s -= m * 60  # Calculate the remaining seconds.
    return '%dm %ds' % (m, s)  # Format and return the string.

def timeSince(since, percent):
    """
    Calculates and formats the time elapsed since a starting point and estimates remaining time.
    
    Parameters:
    - since: The starting time (usually obtained via time.time()).
    - percent: The fraction of the task completed so far, greater than 0 and at most 1.
    
    Returns:
    - A string indicating both the elapsed time and the estimated remaining time.
    """
    now = time.time()  # Get the current time.
    s = now - since  # Calculate elapsed time since the start.
    es = s / (percent)  # Estimate the total time based on the current progress.
    rs = es - s  # Calculate the remaining time by subtracting elapsed time from the total estimated time.
    
    # Format and return the elapsed and remaining times as a string.
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
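
The estimate is a simple proportion: if 30 seconds have covered a quarter of the work, the whole job should take about 120 seconds, leaving 90. The sketch below reproduces that arithmetic with a fixed elapsed time instead of a real clock; `as_minutes` is a local copy of the formatting logic.

```python
import math

def as_minutes(seconds):
    minutes = math.floor(seconds / 60)
    return "%dm %ds" % (minutes, seconds - minutes * 60)

elapsed = 30.0    # seconds spent so far
fraction = 0.25   # fraction of the work completed (0 < fraction <= 1)
estimated_total = elapsed / fraction   # 120.0 seconds
remaining = estimated_total - elapsed  # 90.0 seconds
print("%s (- %s)" % (as_minutes(elapsed), as_minutes(remaining)))  # 0m 30s (- 1m 30s)
```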


The whole training process looks like this:

-   Start a timer
-   Initialize optimizers and criterion
-   Create training pairs

Then we call `train_epoch` once per epoch and occasionally print the progress (epoch
number, % of epochs completed, time so far, estimated time remaining) and the average loss.


In [None]:
def train(train_dataloader, encoder, decoder, n_epochs, learning_rate=0.001, print_every=100):
    """
    Trains an encoder-decoder model.

    Parameters:
    - train_dataloader: DataLoader providing batches of data for training.
    - encoder: The encoder part of the sequence-to-sequence model.
    - decoder: The decoder part of the sequence-to-sequence model.
    - n_epochs: Total number of epochs to train the models.
    - learning_rate: Learning rate for the optimizers.
    - print_every: Frequency of reporting the average loss.
    """
    start = time.time()  # Record the start time for calculating elapsed time.
    print_loss_total = 0  # Sum of losses, reset every 'print_every' epochs.

    # Initialize optimizers for both encoder and decoder with the Adam algorithm.
    encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)

    # Define the loss function. NLLLoss is common for classification problems.
    criterion = nn.NLLLoss()

    # Training loop over the specified number of epochs.
    for epoch in range(1, n_epochs + 1):
        # Perform one epoch of training and return the loss.
        loss = train_epoch(train_dataloader, encoder, decoder,
                           encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss  # Accumulate loss.

        # Every 'print_every' epochs, print the average loss and reset the total loss.
        if epoch % print_every == 0:
            print_loss_avg = print_loss_total / print_every  # Calculate average loss.
            print_loss_total = 0  # Reset total loss for the next 'print_every' epochs.
            # Print a summary: elapsed time, current epoch, progress (%), and average loss.
            print('%s (%d %d%%) %.4f' % (timeSince(start, epoch / n_epochs),
                                         epoch, epoch / n_epochs * 100, print_loss_avg))
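
The loss reported during training is the negative log-likelihood of the correct tokens, averaged over all positions. As a sanity check on the numbers printed above, the sketch below computes that quantity by hand for a model that assigns equal scores to every word; a loss near the log of the vocabulary size therefore means the model is still guessing uniformly. The helper names are invented for this sketch.

```python
import math

def log_softmax(logits):
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def nll_loss(log_probs, targets):
    # Mean of the negative log-probability assigned to each correct token.
    return -sum(row[t] for row, t in zip(log_probs, targets)) / len(targets)

# Two decoding steps over a 3-word vocabulary; uniform scores at both steps.
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [2, 1]
loss = nll_loss([log_softmax(row) for row in logits], targets)
print(loss)  # close to math.log(3), the loss of a uniform guess
```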

Evaluation
==========

Evaluation is mostly the same as training, but there are no targets so
we simply feed the decoder's predictions back to itself for each step.
Every time it predicts a word we add it to the output string, and if it
predicts the EOS token we stop there. We also store the decoder's
attention outputs for display later.

We are going to use greedy decoding (top-K with a value of 1).

In [None]:
def evaluate(encoder, decoder, sentence, input_lang, output_lang):
    # Temporarily disables gradient calculations to save memory and computations since they are not needed.
    with torch.no_grad():
        # Convert the input sentence into a tensor of word indices.
        input_tensor = tensorFromSentence(input_lang, sentence)

        # Pass the input tensor through the encoder to obtain its outputs and final hidden state.
        encoder_outputs, encoder_hidden = encoder(input_tensor)

        # Pass the encoder outputs and final hidden state to the decoder. No target tensor is
        # given, so the decoder starts from the SOS token and feeds its own top prediction
        # back in as the next input at every step (greedy decoding), for MAX_LENGTH steps.
        decoder_outputs, decoder_hidden, decoder_attn = decoder(encoder_outputs, encoder_hidden)

        # Select the top prediction (highest probability) from the decoder's output at each time step.
        _, topi = decoder_outputs.topk(1)
        decoded_ids = topi.squeeze()  # Remove extraneous dimensions.

        decoded_words = []  # To store the decoded words.
        for idx in decoded_ids:
            # Check for the EOS token. If found, append '<EOS>' to the decoded words and stop decoding.
            if idx.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            # Convert each index back to a word and append to the list of decoded words.
            decoded_words.append(output_lang.index2word[idx.item()])

    # Return the list of decoded words and any attention weights from the decoder.
    return decoded_words, decoder_attn
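
The post-processing in `evaluate` can be illustrated without a trained model. The sketch below takes one row of scores per decoding step, picks the highest-scoring word at each step, and stops at EOS. The vocabulary and the scores are invented for illustration.

```python
EOS_token = 1
index2word = {0: "SOS", 1: "EOS", 2: "je", 3: "suis", 4: "fatigue"}

def greedy_decode(step_scores):
    words = []
    for scores in step_scores:
        best = max(range(len(scores)), key=lambda i: scores[i])  # argmax
        if best == EOS_token:
            words.append("<EOS>")
            break
        words.append(index2word[best])
    return words

# One row of scores per decoding step; the fourth step is never reached.
steps = [
    [-9.0, -9.0, -0.1, -3.0, -4.0],  # argmax 2 -> "je"
    [-9.0, -9.0, -3.0, -0.2, -4.0],  # argmax 3 -> "suis"
    [-9.0, -0.1, -5.0, -5.0, -5.0],  # argmax 1 -> EOS
    [-9.0, -9.0, -9.0, -9.0, -0.1],
]
print(greedy_decode(steps))  # ['je', 'suis', '<EOS>']
```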

We can evaluate random sentences from the training set and print out the
input, target, and output to make some subjective quality judgements:


In [None]:
def evaluateRandomly(encoder, decoder, n=5):
    # Sets the encoder and decoder to evaluation mode, which disables dropout
    # and so gives deterministic behavior for inference.
    encoder.eval()
    decoder.eval()

    # Loop over n examples chosen randomly.
    for i in range(n):
        # Randomly select a sentence pair from the global 'pairs' list.
        pair = random.choice(pairs)
        
        # Print the input sentence from the pair.
        print('>', pair[0])
        # Print the target (correct) translation or response.
        print('=', pair[1])
        
        # Use the 'evaluate' function to generate the output sentence for the input sentence.
        output_words, _ = evaluate(encoder, decoder, pair[0], input_lang, output_lang)
        # Join the list of output words into a single sentence.
        output_sentence = ' '.join(output_words)
        
        # Print the model's translation or response.
        print('<', output_sentence)
        print('')  # Print a newline for readability between each evaluated pair.

# Training and Evaluating

Time to initialize a network and start training!

To make this efficient, even on a CPU, we have used a small amount of data, with short sentences, and small models. Even so, this will take a few minutes to run.

First, we'll train without attention.

In [None]:
# Set the size of the hidden layers in the encoder and decoder models.
hidden_size = 128
# Specify the batch size for training, determining how many examples are processed together.
batch_size = 32

# Prepare the dataloader for the training process, which includes loading the dataset,
# processing the text into tensors, and batching the data.
input_lang, output_lang, train_dataloader = get_dataloader(batch_size)

# Initialize the encoder model with the size of the input language vocabulary and the hidden size.
# The model is moved to the 'device', which could be a GPU or CPU.
encoder = EncoderRNN(input_lang.n_words, hidden_size).to(device)
# Initialize the decoder model with the hidden size, the size of the output language vocabulary,
# and the type of attention mechanism to use ("none" in this case).
decoder = AttnDecoderRNN(hidden_size, output_lang.n_words, "none").to(device)

# Train the encoder and decoder models using the prepared dataloader, specifying the number
# of epochs to train for and how frequently to print the training progress.
train(train_dataloader, encoder, decoder, 80, print_every=10)

# After training, randomly select examples from the dataset and translate them with the
# trained models. This provides a quick qualitative assessment of translation quality.
evaluateRandomly(encoder, decoder)


Now let's try with attention.

In [None]:
# Define the size of the hidden layers for the models. This impacts the model's capacity to learn complex patterns.
hidden_size = 128
# Define the batch size, which affects how many examples are processed together in each iteration of training.
batch_size = 32

# Generate the training data loader, which automates the process of loading the data in batches during training.
# It also performs initial preprocessing like tokenizing sentences and converting them into numerical format.
input_lang, output_lang, train_dataloader = get_dataloader(batch_size)

# Initialize the encoder model. This model will process the input sequence and generate a context or a series
# of contextual embeddings representing the input.
# The 'input_lang.n_words' parameter ensures the embedding layer can represent any word in the input vocabulary.
encoder2 = EncoderRNN(input_lang.n_words, hidden_size).to(device)

# Initialize the decoder model with an additive attention mechanism. The decoder uses the context provided by the encoder
# to generate the output sequence. The attention mechanism allows the decoder to focus on different parts of the input
# sequence at each step of the generation process, improving the ability to handle long sequences.
decoder2 = AttnDecoderRNN(hidden_size, output_lang.n_words, "additive").to(device)

# Train the encoder and decoder models using the prepared data loader. This script sets the models to train for 80 epochs,
# and prints out the training progress and loss after every epoch.
train(train_dataloader, encoder2, decoder2, 80, print_every=1)

# After training, evaluate the model performance by randomly selecting examples from the dataset and
# translating them using the trained models. This gives a qualitative measure of how well the model has learned
# to translate from the input language to the output language.
evaluateRandomly(encoder2, decoder2)


# Visualizing Attention

Let's have a look at the attention scores being calculated.

This code will print a table, with one row for each output token, and values in the row indicating the attention score for each input token.
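
To make one row of that table concrete, here is a minimal sketch of how a row of weights could arise. It assumes, as in the tutorial this lab is based on, that raw scores are normalised with a softmax so each row is positive and sums to 1. The `softmax` helper and the raw score values are made up for illustration and are not taken from the trained model.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for ONE output token over the input tokens plus <EOS>.
input_tokens = ["i", "m", "quite", "sure", "<EOS>"]
raw_scores = [0.1, 0.3, 2.0, 0.5, -1.0]
weights = softmax(raw_scores)

# Print the row in the same percentage style as the attention table.
for token, weight in zip(input_tokens, weights):
    print("{:>10}: {:5.1f}%".format(token, weight * 100))
```

The input token with the largest raw score receives the largest share of the row, and the percentages in a row add up to 100.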


In [None]:
def showAttention(input_sentence, output_words, attentions):
    """
    Prints a formatted table of attention weights, showing how much focus the
    decoder put on each input word for each output word.
    """
    # Print the header row with the input sentence words.
    for word in [''] + input_sentence.split() + ["<EOS>"]:
        print("{:>10}".format(word), end='')
    print()
    
    # Convert the attention tensor to a list for easier processing.
    scores = attentions.cpu().tolist()
    
    # For each output word and corresponding attention weights row...
    for word, row in zip(output_words, scores):
        print("{:<10}".format(word), end='')  # Print the output word.
        # Then, print the attention weights for this word against all input words.
        for val in row:
            print("{:>10.1f}".format(val * 100), end='')  # Format the weights as percentages.
        print()

def evaluateAndShowAttention(input_sentence):
    """
    Evaluates an input sentence using the trained encoder and decoder models, then
    displays the attention weights for the generated output.
    """
    # Evaluate the input sentence, returning the output words and attention weights.
    output_words, attentions = evaluate(encoder2, decoder2, input_sentence, input_lang, output_lang)
    
    # Print the original input sentence and the model's output.
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    
    # Display the attention weights between input and output words.
    showAttention(input_sentence, output_words, attentions[0, :len(output_words), :])
    
evaluateAndShowAttention('i m quite sure')



# Task 1

Implement dot-product attention, train a model, and measure its performance.

To help, note that we can do the dot product step like so:

```
# Do a dot product by multiplying the two tensors element-wise and summing over the last dimension
out = (tensor1 * tensor2).sum(-1)
```
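
As a rough illustration of the shapes involved, the sketch below applies that multiply-and-sum step to random tensors and cross-checks it against `torch.bmm`. The names `query` and `keys` and the sizes (batch of 2, 5 input positions, hidden size 4) are assumptions chosen for the example, not the shapes used by the lab's decoder.

```python
import torch

torch.manual_seed(0)

# Hypothetical shapes: batch of 2, 5 encoder positions, hidden size 4.
query = torch.randn(2, 1, 4)  # one decoder state per batch item
keys = torch.randn(2, 5, 4)   # one encoder output per input position

# Element-wise multiply (query broadcasts across the 5 positions),
# then sum over the hidden dimension to get one score per position.
scores = (query * keys).sum(-1)  # shape: (2, 5)

# The same numbers via batched matrix multiplication, for comparison.
scores_bmm = torch.bmm(query, keys.transpose(1, 2)).squeeze(1)  # shape: (2, 5)

print(scores.shape)
print(torch.allclose(scores, scores_bmm, atol=1e-6))
```

Broadcasting is what lets a single query vector be compared against every input position without an explicit loop.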

In [None]:
# TODO

# Task 2

Modify the encoder to be a bi-directional GRU, and use additive attention.


To make the change simpler, use 64 dimensions for the hidden state in the encoder. That way, when you combine the forward and backward states, you get a 128-dimensional vector, the same size the decoder used before.

You may find these two documentation pages helpful:

- GRU, for shifting to bi-directional, https://pytorch.org/docs/stable/generated/torch.nn.GRU.html#torch.nn.GRU
- Reshape, for combining the hidden states from forward and backward, https://pytorch.org/docs/stable/generated/torch.reshape.html
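
The sketch below shows the tensor shapes a bi-directional `nn.GRU` returns, and one possible way to place the forward and backward final states side by side. The sizes (3 sentences, 7 tokens) and the variable names are assumptions for illustration, and this is not the only way to combine the states. Note that reshaping `h_n` directly, without first moving the direction axis, would mix states from different sentences whenever the batch size is larger than 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes: batch of 3 sentences, 7 tokens each, 64-dim embeddings.
batch, seq_len, hidden = 3, 7, 64
embedded = torch.randn(batch, seq_len, hidden)

gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
output, h_n = gru(embedded)

print(output.shape)  # (batch, seq_len, 2 * hidden)
print(h_n.shape)     # (2, batch, hidden): index 0 is forward, index 1 is backward

# Move the direction axis next to the hidden axis before reshaping, so each
# sentence's forward and backward states end up side by side.
combined = h_n.transpose(0, 1).reshape(1, batch, 2 * hidden)

# Cross-check against explicit concatenation of the two directions.
expected = torch.cat([h_n[0], h_n[1]], dim=-1).unsqueeze(0)
print(torch.equal(combined, expected))
```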

In [None]:
# TODO