## Transformers

Wei Li

An example of machine translation from English to German with a Transformer, trained on the Multi30k dataset.

References:
- Attention is all you need: https://arxiv.org/pdf/1706.03762.pdf
- The Annotated Transformer: https://nlp.seas.harvard.edu/annotated-transformer/

This notebook is adapted from and expanded on the original notebook example in CMU 11-785 (Bhiksha Raj & Rita Singh).

See also: embeddings, padded and packed sequences, and language model basics.

<img src="./images/transformer1.png" width="400" height="430">

Source: Figure 1 https://arxiv.org/pdf/1706.03762.pdf

## Imports

In [1]:
# !nvidia-smi

# Uncomment to install
# !pip install -U torchtext
# !pip install -U pip setuptools wheel
# !pip install spacy
# !python -m spacy download "de_core_news_sm"
# !python -m spacy download "en_core_web_sm"
# !pip install "portalocker>=2.0"
# !pip install -U torchdata
# !pip install sacrebleu
# !pip install torchsummaryX
# You may need to restart your runtime after this

In [2]:
#!python -m spacy download en_core_web_sm  # English language model.
#!python -m spacy download de_core_news_sm #  German language model.

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import sacrebleu
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import Multi30k
from typing import Tuple
import torchdata
import spacy
import random

# %pip install watermark
%load_ext watermark
%watermark -u -t -d -v -p numpy,torch,sacrebleu,torchtext,spacy

Last updated: 2024-01-02 20:14:08

Python implementation: CPython
Python version       : 3.8.17
IPython version      : 8.12.2

numpy    : 1.21.5
torch    : 1.12.1
sacrebleu: 2.3.1
torchtext: 0.13.1
spacy    : 3.5.1



In [4]:
from utils_evaluation import (
    set_all_seeds,
    set_deterministic,
    generate_tgt_mask,
    generate_src_mask,
    calculate_bleu,
    inference,
    evaluate_test_set_bleu,
)

In [5]:
## Setting

# RANDOM_SEED = 2022
# device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# set_all_seeds(RANDOM_SEED)
# set_deterministic()

Main components and features of `en_core_web_sm` (this notebook uses only the tokenizer):

- Tokenization: The model can split a given text into individual words or tokens.
- Part-of-speech (POS) Tagging: It assigns grammatical labels to each token, such as noun, verb, adjective, etc.
- Named Entity Recognition (NER): This component identifies and classifies named entities in the text, such as person names, organizations, locations, etc.
- Dependency Parsing: It analyzes the grammatical structure of a sentence and represents it as a dependency tree, showing how different words relate to each other.
- Word Vectors: The small (`sm`) pipelines do not ship static word vectors; the larger `md` and `lg` pipelines provide word embeddings, which are numerical representations of words that capture semantic similarities between them.

#### download data

In [6]:
# Use this cell if you get a UTF Encoding Error
# import locale
# def getpreferredencoding(do_setlocale = True):
#     return "UTF-8"
# locale.getpreferredencoding = getpreferredencoding

In [7]:
# use server http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt

# !mkdir Multi30k
# !mkdir Multi30k/train/
# !mkdir Multi30k/val/
# !mkdir Multi30k/test/

# !wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz
# !wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz
# !wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz

# # Extract archives to respective folders
# !tar -xf training.tar.gz -C Multi30k/train/
# !tar -xf validation.tar.gz -C Multi30k/val/
# !tar -xf mmt16_task1_test.tar.gz -C Multi30k/test/

# # training.tar.gz is extracted into Multi30k/train/
# # validation.tar.gz is extracted into Multi30k/val/
# # mmt16_task1_test.tar.gz is extracted into Multi30k/test/

In [8]:
# note: the server http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt is down
# use the following instead

# !mkdir Multi30k
# !mkdir Multi30k/train/
# !mkdir Multi30k/val/
# !mkdir Multi30k/test/

# !wget https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz
# !wget https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz
# !wget https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz

# # Extract archives to respective folders
# !tar -xf training.tar.gz -C Multi30k/train/
# !tar -xf validation.tar.gz -C Multi30k/val/
# !tar -xf mmt16_task1_test.tar.gz -C Multi30k/test/

# # training.tar.gz is extracted into Multi30k/train/
# # validation.tar.gz is extracted into Multi30k/val/
# # mmt16_task1_test.tar.gz is extracted into Multi30k/test/

## Dataset

In [9]:
from collections import Counter
from tqdm import tqdm

root = "Multi30k/"
DATA_DIR = "../data/"+root


# Initialize tokenizers for English and German using spaCy models
# 'spacy' indicates the type of tokenizer (from spaCy)
# 'en_core_web_sm' is the small English model, 'de_core_news_sm' is the small German model
en_tokenizer = get_tokenizer("spacy", language="en_core_web_sm")
# get_tokenizer returns a callable that maps a string to a list of token strings.
de_tokenizer = get_tokenizer("spacy", language="de_core_news_sm")


# Define a function to tokenize English text
# Uses the English tokenizer to split the text into tokens
def tokenize_en(text):
    doc = en_tokenizer(str(text))
    return [token for token in doc]


# Define a function to tokenize German text
# Uses the German tokenizer to split the text into tokens
def tokenize_de(text):
    doc = de_tokenizer(str(text))
    return [token for token in doc]


# Define a class to create a vocabulary from text
class VOCAB:
    # Initializer for the VOCAB class
    # tokenizer: a function for tokenizing text
    # min_freq: minimum frequency for a word to be included in the vocabulary
    # data: the text data to build the vocabulary from
    # special_tokens: a list of special tokens (like <pad>, <sos>, etc.)

    # Attributes:
    #       .stoi: the string-to-index dictionary
    #       .itos: the list of vocabulary strings (index-to-string)

    # Methods:
    #       build_vocab: build the vocabulary from data (a list of strings)
    #       __len__: returns the size of the vocabulary
    #       __getitem__: returns the index of the token from the stoi dictionary.
    #                    If the token is not found, it returns the index of the special token <unk> instead.

    def __init__(
        self,
        tokenizer,
        min_freq=2,
        data=None,
        special_tokens=["<pad>", "<sos>", "<eos>", "<unk>"],
    ):
        self.tokenizer = tokenizer
        self.min_freq = min_freq
        self.special_tokens = special_tokens
        self.build_vocab(data)

    # Method to build the vocabulary
    def build_vocab(self, data):
        counter = Counter()
        # Iterate over the data, tokenize each text and update the counter
        for text in tqdm(data):
            tokens = self.tokenizer(text)
            counter.update(tokens)

        # Filter tokens that meet the minimum frequency threshold
        tokens = [token for token, freq in counter.items() if freq >= self.min_freq]

        # Add special tokens to the start of the tokens list
        tokens = self.special_tokens + tokens

        # Create string-to-index mapping
        self.stoi = {token: index for index, token in enumerate(tokens)}
        self.itos = tokens  # Also create index-to-string mapping

    # Return the length of the vocabulary
    def __len__(self):
        return len(self.stoi)

    # Retrieve an item index from the vocabulary; return index for '<unk>' for unknown tokens
    def __getitem__(self, token):
        return self.stoi.get(token, self.stoi["<unk>"])
        # This line attempts to retrieve the index corresponding to the token from the stoi dictionary.
        # If the token is not found in the dictionary, it returns the index of a special token <unk> instead.


# File paths for the English and German training data
# en_file = "Multi30k/train/train.en"
# de_file = "Multi30k/train/train.de"

en_file = DATA_DIR+"train/train.en"
de_file = DATA_DIR+"train/train.de"

# Open and read the English training data file
with open(en_file, "r", encoding="utf8") as f:
    train_data_en = [text.strip() for text in f.readlines()]

# Open and read the German training data file
with open(de_file, "r", encoding="utf8") as f:
    train_data_de = [text.strip() for text in f.readlines()]

# Create vocabulary objects for English and German training data
# Here, min_freq is set to 1, meaning all tokens are included
EN_VOCAB = VOCAB(tokenize_en, min_freq=1, data=train_data_en)
DE_VOCAB = VOCAB(tokenize_de, min_freq=1, data=train_data_de)

# Print the sizes of the created English and German vocabularies
print("\nVocab Size English", len(EN_VOCAB))
print("\nVocab Size German", len(DE_VOCAB))

100%|██████████| 29001/29001 [00:00<00:00, 37041.68it/s]
100%|██████████| 29001/29001 [00:01<00:00, 19266.52it/s]


Vocab Size English 10837

Vocab Size German 19214





In [10]:
# Define a custom dataset class for translation tasks
class TranslationDataset(Dataset):
    # Constructor for the dataset class
    def __init__(self, en_data, de_data, src_tokenizer, tgt_tokenizer, src_vocab, tgt_vocab):
        # Store English and German data (each is a list of strings)
        self.en_data = en_data  # English data
        self.de_data = de_data  # German data

        # Store source and target tokenizers
        self.src_tokenizer = src_tokenizer  # Tokenizer for the source language (English)
        self.tgt_tokenizer = tgt_tokenizer  # Tokenizer for the target language (German)

        # Store source and target vocabularies
        self.src_vocab = src_vocab  # Vocabulary for the source language
        self.tgt_vocab = tgt_vocab  # Vocabulary for the target language

    # Method to get a single item from the dataset
    def __getitem__(self, index):
        # get the index-th tuple of source text and target text (both in torch tensor of indices of tokens)
        # <sos> and <eos> are added here
        # 
        # Return:
        #   a tuple of (src_tensor, tgt_tensor)
        #       src_tensor is tensor shape [seq_length] consisting of indices for a source sentence (not padded)
        #       tgt_tensor is tensor shape [seq_length] consisting of indices for a target sentence (not padded)

        # Get the source and target texts for the given index
        src_txt, tgt_txt = self.en_data[index], self.de_data[index]
        # src_txt, tgt_txt is a string

        # Tokenize the source and target texts and convert tokens to indices
        src_tokens = [self.src_vocab[token] for token in self.src_tokenizer(src_txt)]
        # src_tokens: a list of indices for tokens
        tgt_tokens = [self.tgt_vocab[token] for token in self.tgt_tokenizer(tgt_txt)]

        # Add start-of-sequence (<sos>) index and end-of-sequence (<eos>) index 
        src_tokens = [self.src_vocab['<sos>']] + src_tokens + [self.src_vocab['<eos>']]
        tgt_tokens = [self.tgt_vocab['<sos>']] + tgt_tokens + [self.tgt_vocab['<eos>']]

        # Convert the token lists to PyTorch tensors
        src_tensor = torch.LongTensor(src_tokens)  # Tensor for source sentence
        tgt_tensor = torch.LongTensor(tgt_tokens)  # Tensor for target sentence

        return src_tensor, tgt_tensor

    # Method to get the size of the dataset
    def __len__(self):
        # Returns the lengths of the dataset (lengths of the list) 

        # Ensure source and target datasets are of the same length
        assert len(self.en_data) == len(self.de_data)

        return len(self.en_data)  # Return the length of the dataset

    def collate_fn(self, batch):
        # batch: a list of tuples, where each tuple contains a pair of tensors
        # (one from the source language, and one from the target language)
        # padding <pad> is added.
        #
        # Returns:
        #   a tuple of (src_tensors, tgt_tensors)
        #   src_tensors is tensor shape [batch_size, max_seq_length] consisting of indices for source (padded)
        #   tgt_tensors is tensor shape [batch_size, max_seq_length] consisting of indices for target (padded)

        # Unzip the batch to separate source (src) and target (tgt) tensors
        src_tensors, tgt_tensors = zip(*batch)

        # Pad the sequences in the batch for source and target languages.
        # This makes all sequences in the batch the same length by adding padding tokens.
        # 'pad_sequence' is a utility function that handles the padding.
        # 'padding_value' is the token used for padding (index of <pad> token in vocabulary).
        # 'batch_first=True' makes sure that the batch size is the first dimension of the output tensor.
        # After padding, src_tensors and tgt_tensors will be 2D tensors of shape [batch_size, max_seq_length],
        # where 'batch_size' is the number of items in the batch and 'max_seq_length' is the length
        # of the longest sequence in the batch.
        src_tensors = torch.nn.utils.rnn.pad_sequence(src_tensors, padding_value=self.src_vocab['<pad>'], batch_first=True)
        tgt_tensors = torch.nn.utils.rnn.pad_sequence(tgt_tensors, padding_value=self.tgt_vocab['<pad>'], batch_first=True)

        # Return the padded source and target tensors.
        # Now, each tensor in the batch has the same length, and they are suitable
        # for batch processing in models (like RNNs, LSTMs, Transformers, etc.).
        return src_tensors, tgt_tensors



As we shall see later, during training (teacher forcing) the source sequences are fed to the encoder, while the target sequences are used twice: as input to the decoder (the *target-input* sequences) and as the expected output in the loss computation (the *target-output* sequences).
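The split into target-input and target-output sequences can be sketched with plain lists. This is a minimal illustration of the usual shift-by-one convention, not the notebook's training code, which is not shown in this section. The token indices are made up; only the special-token indices (`<pad>`=0, `<sos>`=1, `<eos>`=2) follow the order used in `VOCAB`.

```python
# Hypothetical token indices for one padded batch (0=<pad>, 1=<sos>, 2=<eos>).
tgt_batch = [
    [1, 17, 7, 18, 2],  # <sos> w w w <eos>
    [1, 21, 9, 2, 0],   # shorter sentence, padded with <pad>
]

# Decoder input: drop the last position, so every row starts with <sos>.
tgt_input = [row[:-1] for row in tgt_batch]
# Loss target: drop the first position, so every row is shifted one step left.
tgt_output = [row[1:] for row in tgt_batch]

for inp, out in zip(tgt_input, tgt_output):
    print(inp, "->", out)
```

At each position the decoder sees the tokens up to that point and is asked to predict the next one. With a padded tensor `tgt` of shape `[batch_size, max_seq_length]`, the same split would be `tgt[:, :-1]` and `tgt[:, 1:]`.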

In [11]:
# en_file = "Multi30k/train/train.en"
# de_file = "Multi30k/train/train.de"
en_file = DATA_DIR+"train/train.en"
de_file = DATA_DIR+"train/train.de"

# Open the English text file and read its contents
with open(en_file, "r", encoding="utf8") as f:
    train_data_en = [text.strip() for text in f.readlines()]

# Open the German text file and read its contents
with open(de_file, "r", encoding="utf8") as f:
    train_data_de = [text.strip() for text in f.readlines()]

# en_file = "Multi30k/val/val.en"
# de_file = "Multi30k/val/val.de"
en_file = DATA_DIR+"val/val.en"
de_file = DATA_DIR+"val/val.de"

# Open the English text file and read its contents
with open(en_file, "r", encoding="utf8") as f:
    val_data_en = [text.strip() for text in f.readlines()]

# Open the German text file and read its contents
with open(de_file, "r", encoding="utf8") as f:
    val_data_de = [text.strip() for text in f.readlines()]

# en_file = "Multi30k/test/test.en"
# de_file = "Multi30k/test/test.de"
en_file = DATA_DIR+"test/test.en"
de_file = DATA_DIR+"test/test.de"

# Open the English text file and read its contents
with open(en_file, "r", encoding="utf8") as f:
    test_data_en = [text.strip() for text in f.readlines()]

# Open the German text file and read its contents
with open(de_file, "r", encoding="utf8") as f:
    test_data_de = [text.strip() for text in f.readlines()]

train_dataset = TranslationDataset(
    train_data_en, train_data_de, tokenize_en, tokenize_de, EN_VOCAB, DE_VOCAB
)
val_dataset = TranslationDataset(
    val_data_en, val_data_de, tokenize_en, tokenize_de, EN_VOCAB, DE_VOCAB
)
test_dataset = TranslationDataset(
    test_data_en, test_data_de, tokenize_en, tokenize_de, EN_VOCAB, DE_VOCAB
)

BATCH_SIZE = 128

train_dataloader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=train_dataset.collate_fn,
)
val_dataloader = DataLoader(
    val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=val_dataset.collate_fn
)
test_dataloader = DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=test_dataset.collate_fn,
)

### Example Sequences

In [12]:
train_dataset[1]

(tensor([ 1, 15, 16, 17, 18, 19,  9, 20, 21, 22, 23, 24, 14,  2]),
 tensor([ 1, 17,  7, 18, 19, 20, 21, 22, 16,  2]))

In [13]:
# check the sentence
' '.join([EN_VOCAB.itos[i] for i in train_dataset[0][0]]), ' '.join([DE_VOCAB.itos[i] for i in train_dataset[0][1]])

('<sos> Two young , White males are outside near many bushes . <eos>',
 '<sos> Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche . <eos>')

In [14]:
# original sentences
test_data_en[0], test_data_de[0]

('A man in an orange hat starring at something.',
 'Ein Mann mit einem orangefarbenen Hut, der etwas anstarrt.')

In [15]:
# what processed sentences look like through TranslationDataset class.
' '.join([EN_VOCAB.itos[i] for i in test_dataset[0][0]]), ' '.join([DE_VOCAB.itos[i] for i in test_dataset[0][1]])

('<sos> A man in an orange hat starring at something . <eos>',
 '<sos> Ein Mann mit einem orangefarbenen Hut , der etwas <unk> . <eos>')
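The `<unk>` in the German output above appears where the original sentence has `anstarrt`, presumably because that token does not occur in the training data used to build `DE_VOCAB`. The fallback can be reproduced on toy data. The sketch below is a simplified stand-in for the `VOCAB` class: `build_stoi` and `lookup` are hypothetical names, and whitespace splitting replaces the spaCy tokenizer.

```python
from collections import Counter

def build_stoi(sentences, min_freq=1, specials=("<pad>", "<sos>", "<eos>", "<unk>")):
    """Map tokens to indices; whitespace splitting stands in for spaCy here."""
    counter = Counter(tok for s in sentences for tok in s.split())
    kept = [tok for tok, freq in counter.items() if freq >= min_freq]
    return {tok: i for i, tok in enumerate(list(specials) + kept)}

train_sentences = ["Ein Mann mit einem Hut", "Ein Hund"]  # toy training data
stoi = build_stoi(train_sentences)

def lookup(token):
    # Same fallback idea as VOCAB.__getitem__: unseen tokens map to <unk>.
    return stoi.get(token, stoi["<unk>"])

print(lookup("Mann"))      # an in-vocabulary token gets its own index
print(lookup("anstarrt"))  # not in the toy training data, so the <unk> index
```

Because the vocabulary is built from the training split only, any token that first appears in the validation or test split is mapped to `<unk>`.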

Recall that each iteration over `train_dataloader` returns `(src, tgt)`, where `src` has shape `[batch_size, max_seq_length_in_source_in_the_batch]` and `tgt` has shape `[batch_size, max_seq_length_in_target_in_the_batch]`. We write a function to find the maximum source length and the maximum target length across all batches.

In [16]:
def find_batch_max_lengths(train_dataloader):
    max_lengths_src = []  # List to store max lengths of source sequences for each batch
    max_lengths_tgt = []  # List to store max lengths of target sequences for each batch

    for src, tgt in train_dataloader:
        # Append max lengths for this batch
        max_lengths_src.append(src.shape[1])  # Second dimension is (padded) sequence length for source in the batch
        max_lengths_tgt.append(tgt.shape[1])  # Second dimension is (padded) sequence length for target in the batch

    return max_lengths_src, max_lengths_tgt

# Example usage:
max_lengths_src, max_lengths_tgt = find_batch_max_lengths(train_dataloader)

# Finding the overall maximum length in source and target across all batches
max_length_src = max(max_lengths_src)
max_length_tgt = max(max_lengths_tgt)

print(f"Max length in source across all batches: {max_length_src}")
print(f"Max length in target across all batches: {max_length_tgt}")



Max length in source across all batches: 43
Max length in target across all batches: 46


## Attention mechanism

<img src="./images/transformer_attentions.png" width="400" height="230">

Source: Figure 2 https://arxiv.org/pdf/1706.03762.pdf

Multi-Head Self-Attention is a key component of the Transformer architecture. It is designed to capture the dependencies between words in an input sequence without requiring sequential processing, which is the main drawback of RNNs and LSTMs.

The idea behind self-attention is to compute a weighted sum of all words in the input sequence, where the weights are determined by how relevant each word in the entire sequence is to the current one under consideration. This mechanism allows the model to consider the entire context when processing each word.

In Multi-Head Self-Attention, this process is performed multiple times (in parallel) with different linear projections of the input, which allows the model to capture different types of relationships between words (like Subject-Verb, Adjective-Adverb, Subject-Object, etc). These multiple attention "heads" are then concatenated and projected to create the final output of the self-attention layer.
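The splitting into heads and the final concatenation are only reshapes of the last dimension. The sketch below shows the shape bookkeeping under assumed sizes (`d_model = 8`, `num_heads = 4`); it leaves out the linear projections and the attention computation itself.

```python
import torch

batch_size, seq_length, d_model, num_heads = 2, 5, 8, 4
head_dim = d_model // num_heads  # 2 features per head

x = torch.rand(batch_size, seq_length, d_model)  # stands in for a projected Q, K, or V

# Split the last dimension into heads, then move the head axis forward:
# (batch, seq, d_model) -> (batch, seq, heads, head_dim) -> (batch, heads, seq, head_dim)
heads = x.view(batch_size, seq_length, num_heads, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 4, 5, 2])

# Concatenate the heads again by reversing the two steps.
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_length, d_model)
print(torch.equal(merged, x))  # True: splitting and merging only rearranges values
```

Each head then attends using its own `head_dim`-sized slice of the features, so the total cost stays comparable to a single head with the full `d_model`.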

The computation of the self-attention weights involves three vectors for each word: a Query (Q), a Key (K), and a Value (V), each obtained from the word's representation through a learned linear projection. The dot product between a query and each key, scaled by $\sqrt{d_k}$, gives a relevance score for each word; the scores are then normalized with the softmax function. Finally, the weighted sum of the value vectors produces the output for each word.

(From the Paper)
The Transformer uses multi-head attention in three different ways:

• In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models.

• The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.

When we attend to sequences, attending to padding tokens in the input sequences is undesired for the Encoder, while attending to the future tokens in the target sequences is not desired in the decoder. To prevent this, we create source and target masks that can be used to nullify the influence of these tokens when calculating our context and attention weights. The **source mask** is used to prevent attention to the padding tokens in the source sequence, while the **target mask** ensures that the decoder only attends to the previous tokens in the target sequence during training (causal masking).
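A minimal sketch of how the two masks could be built from a padded batch follows. The helper names `make_src_mask` and `make_tgt_mask` are hypothetical; the notebook imports `generate_src_mask` and `generate_tgt_mask` from `utils_evaluation`, whose implementation is not shown here and may differ. The sketch assumes `<pad>` has index 0 and that `True` means "may attend".

```python
import torch

PAD_IDX = 0  # assumption: <pad> is the first special token

def make_src_mask(src, pad_idx=PAD_IDX):
    """Hypothetical helper: True where the source token is not padding.
    Returns shape (batch_size, 1, 1, src_len)."""
    return (src != pad_idx).unsqueeze(1).unsqueeze(2)

def make_tgt_mask(tgt, pad_idx=PAD_IDX):
    """Hypothetical helper: causal mask combined with the target padding mask.
    Returns shape (batch_size, 1, tgt_len, tgt_len)."""
    tgt_len = tgt.size(1)
    # Lower triangle (including the diagonal) marks the allowed positions.
    causal = torch.tril(torch.ones(tgt_len, tgt_len, device=tgt.device)).bool()
    not_pad = (tgt != pad_idx).unsqueeze(1).unsqueeze(2)  # (batch, 1, 1, tgt_len)
    return causal.unsqueeze(0).unsqueeze(1) & not_pad

src = torch.tensor([[1, 5, 6, 2, 0], [1, 7, 2, 0, 0]])
tgt = torch.tensor([[1, 8, 9, 2], [1, 4, 2, 0]])
src_mask = make_src_mask(src)
tgt_mask = make_tgt_mask(tgt)
print(src_mask.shape, tgt_mask.shape)
print(tgt_mask[1, 0].int())  # second sentence: causal pattern with the padded column zeroed
```

Combining the causal mask with the padding mask is a design choice of this sketch; a purely causal target mask is also common when padded positions are ignored in the loss instead.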

#### masks

In the following code, the attention scores have shape `(batch_size, num_heads, seq_length_q, seq_length_k)`: the third dimension indexes the queries and the fourth indexes the keys (the single-head example drops the `num_heads` dimension). The mask tensor must be broadcastable to this shape for the masking operation to work correctly. Here are the details:

Purpose of Mask:

The mask tensor is used to prevent the model from attending to certain positions within the sequence. This is typically used for two purposes:

1. Padding Mask: To ignore padded positions in the input sequences, ensuring that the model does not treat padding tokens as meaningful input.
2. Look-Ahead Mask: In sequence-to-sequence tasks like language translation, to prevent positions from attending to future positions in the sequence. This is particularly important during *training* to preserve the auto-regressive property.

Shape of Mask:

1. For Padding Mask, the mask is usually created from the input sequences' lengths and has an initial shape of `(batch_size, seq_length_k)`. To align with the attention scores, it is reshaped to `(batch_size, 1, 1, seq_length_k)`, so that the same mask is broadcast across all heads and all query positions.
2. For Look-Ahead Mask, the shape is typically `(batch_size, 1, seq_length_q, seq_length_q)`, or `(1, 1, seq_length_q, seq_length_q)` since the pattern is the same for all sequences and heads. The mask is lower triangular: entries on and below the diagonal are kept, and future positions (strictly above the diagonal) are masked out. It is broadcastable to the shape of the attention scores.

Broadcasting in Masking Operation:

PyTorch (and other deep learning frameworks) often use broadcasting rules, allowing tensors of different shapes to be combined in a meaningful way. For example, a mask of shape `(batch_size, 1, 1, seq_length_k)` can be applied to an attention score tensor of shape `(batch_size, num_heads, seq_length_q, seq_length_k)` due to broadcasting.
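The broadcasting step can be checked directly. The sketch below uses made-up sizes and all-zero scores so that the masked entries are easy to spot; it is an illustration, not part of the model code.

```python
import torch

batch_size, num_heads, seq_length_q, seq_length_k = 2, 4, 3, 5
scores = torch.zeros(batch_size, num_heads, seq_length_q, seq_length_k)

# Padding mask: 1 = real token, 0 = padding, shape (batch_size, seq_length_k)
padding_mask = torch.tensor([[1, 1, 1, 1, 0],
                             [1, 1, 0, 0, 0]])
mask = padding_mask.unsqueeze(1).unsqueeze(2)  # (batch_size, 1, 1, seq_length_k)

# The two singleton dimensions are broadcast over heads and query positions.
masked = scores.masked_fill(mask == 0, float("-inf"))
print(masked.shape)
print(masked[1, 0])  # every query row of the second sequence hides the last three keys
```

No explicit `expand` is needed: `masked_fill` accepts any mask that is broadcastable to the shape of the tensor being filled.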

Application of Mask:

The mask is applied in the attention score computation, usually by setting the scores of masked positions to a large negative value (like -inf) before the softmax operation. This ensures that these positions get an attention weight of zero (exactly zero when -inf is used, very close to zero for a large finite negative value).
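The effect of filling with -inf before the softmax can be seen on a single row of scores. The sketch below uses only the standard library; `masked_softmax` is a hypothetical helper written for this illustration.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over one row of scores; positions with mask == 0 are set to -inf first."""
    filled = [s if m else float("-inf") for s, m in zip(scores, mask)]
    peak = max(filled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]

# The last position has the highest raw score but is masked out.
weights = masked_softmax([2.0, 1.0, 0.5, 3.0], mask=[1, 1, 1, 0])
print([round(w, 4) for w in weights])
```

The masked position receives a weight of exactly 0 and the remaining weights still sum to 1. Note that a row in which every position is masked has no valid softmax; in PyTorch such a row comes out as NaN, so each query needs at least one unmasked key.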

### Attention

We compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:

$$
   \mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V
$$
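A small worked instance of this formula, written with nested lists for a single sequence and no batch dimension, may help before the PyTorch module. The numbers are made up, and the `attention` function below is a sketch of the formula only (no learned projections, mask, or dropout).

```python
import math

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for plain nested lists (one sequence, no batch)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scaled dot product of this query with every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        weights = [e / sum(exps) for e in exps]
        # Weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]              # one query
K = [[1.0, 0.0], [0.0, 1.0]]  # two keys; the first matches the query
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
print(result)
```

The query aligns with the first key, so the output leans toward the first value vector, but the second value still contributes because softmax never assigns an unmasked position a weight of zero.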

In [17]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    """
    A version of single head attention.

    Allowing dropout after calculating attention probability; and applying linear output layer
    after attention values.

    """
    
    def __init__(self, d_model, dropout=0.1):
        """
        Initialize the SingleHeadAttention module.
        d_model (int): The dimensionality of the input embedding and output embedding.
                       i.e., we assume equal dimensionality for input and output        
        """
        super(SingleHeadAttention, self).__init__()
        self.d_model = d_model

        # Linear layers for transforming the queries, keys, and values.
        # Each has input and output dimensions equal to d_model.
        self.wq = nn.Linear(d_model, d_model)  # Query transformation
        self.wk = nn.Linear(d_model, d_model)  # Key transformation
        self.wv = nn.Linear(d_model, d_model)  # Value transformation

        # Output linear transformation, again with input and output dimensions d_model.
        self.wo = nn.Linear(d_model, d_model)

        # Dropout layer for regularization.
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        """
        Forward pass for SingleHeadAttention.

        Args:
            q (Tensor): The input query tensor of shape (batch_size, seq_length_q, d_model).
            k (Tensor): The input key tensor of shape (batch_size, seq_length_k, d_model). 
            v (Tensor): The input value tensor of shape (batch_size, seq_length_v, d_model).
                        seq_length_k = seq_length_v: the sequence length of the keys and values
                        For self-attention, seq_length_q = seq_length_k = seq_length_v.
            mask (Tensor, optional): The mask tensor for ignoring certain elements. Defaults to None.

        Returns:
            a tuple of out, attn_scores, attn
            out: The output of attended values--tensor of shape (batch_size, seq_length_q, d_model).
            attn_scores: attention scores matrix (batch_size, seq_length_q, seq_length_k)
            attn: attention probability matrix (batch_size, seq_length_q, seq_length_k)
        """

        # Extract batch size from the query tensor.
        batch_size = q.size(0)

        # Linearly transform queries, keys, and values.
        # Shapes after transformation: (batch_size, seq_length, d_model)
        q = self.wq(q)  # Transformed queries
        k = self.wk(k)  # Transformed keys
        v = self.wv(v)  # Transformed values

        # Calculate attention scores.
        # Shape of attn_scores: (batch_size, seq_length_q, seq_length_k)
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / (self.d_model ** 0.5)

        # Apply the mask if provided.
        # The mask shapes should be compatible with attn_scores shape.
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))
        
        # Apply softmax to the attention scores, followed by dropout.
        # Shape of attn remains: (batch_size, seq_length_q, seq_length_k)
        attn = self.dropout(F.softmax(attn_scores, dim=-1))

        # Apply the attention to the values.
        # Shape of out: (batch_size, seq_length_q, d_model)
        out = torch.matmul(attn, v)

        # Pass the result through the final linear layer.
        # Shape of out remains: (batch_size, seq_length_q, d_model)
        out = self.wo(out)

        return out, attn_scores, attn


In [18]:
# Source Mask Example
# Create the single head attention model
d_model = 2
model = SingleHeadAttention(d_model, dropout=0.0)

# Define sample query, key, and value tensors
batch_size = 3
seq_length_q = 3
seq_length_k = 4
seq_length_v = 4

q = torch.rand(batch_size, seq_length_q, d_model)
k = torch.rand(batch_size, seq_length_k, d_model)
v = torch.rand(batch_size, seq_length_v, d_model)

# Consider a source mask (assuming padding tokens at the end of the sequence)
# Here, 1 indicates non-padding token, 0 indicates padding token
source_mask = torch.tensor([
    [1, 1, 1, 0],  # First sequence has one padding token at the end
    [1, 1, 0, 0],  # Second sequence has two padding tokens at the end
    [1, 1, 1, 1]   # Third sequence has no padding tokens
])
# print(source_mask.shape) # (3, 4) (batch_size, seq_length_k)

# Reshape and expand the source mask to match the shape of attention scores
source_mask = source_mask.unsqueeze(1).expand(batch_size, seq_length_q, seq_length_k)

# Output the shape of the source mask
print(source_mask.shape) # (3, 3, 4) (batch_size, seq_length_q, seq_length_k)

# Forward pass through the model with the source mask
output_source_masked = model(q, k, v, mask=source_mask)

print()
output_source_masked[2].shape, output_source_masked[2]

torch.Size([3, 3, 4])



(torch.Size([3, 3, 4]),
 tensor([[[0.3358, 0.3484, 0.3158, 0.0000],
          [0.3383, 0.3420, 0.3197, 0.0000],
          [0.3367, 0.3442, 0.3191, 0.0000]],
 
         [[0.5059, 0.4941, 0.0000, 0.0000],
          [0.4982, 0.5018, 0.0000, 0.0000],
          [0.5012, 0.4988, 0.0000, 0.0000]],
 
         [[0.2432, 0.2576, 0.2503, 0.2489],
          [0.2443, 0.2489, 0.2554, 0.2513],
          [0.2429, 0.2540, 0.2530, 0.2500]]], grad_fn=<SoftmaxBackward0>))

In [19]:
# Look-Ahead Mask Example
# Create a Look-Ahead Mask to prevent positions from attending to future positions
# look_ahead_mask: shape (batch_size, seq_length_q, seq_length_q)

# Define sample query, key, and value tensors
batch_size = 3
seq_length_q = 3
seq_length_k = 3  # Adjusted for self-attention (seq_length_k = seq_length_q)
seq_length_v = 3  # Adjusted for self-attention (seq_length_v = seq_length_q)

q = torch.rand(batch_size, seq_length_q, d_model)
k = torch.rand(batch_size, seq_length_k, d_model)
v = torch.rand(batch_size, seq_length_v, d_model)

# Create a Look-Ahead Mask for self-attention
# This mask will be lower triangular (including diagonal), allowing each position
# to attend to itself and past positions, but not future ones
look_ahead_mask = torch.tril(torch.ones((batch_size, seq_length_q, seq_length_q))).bool()

print(look_ahead_mask)

# Forward pass through the model with the look-ahead mask
output_look_ahead_masked = model(q, k, v, mask=look_ahead_mask)

print()
output_look_ahead_masked[2].shape, output_look_ahead_masked[2]


tensor([[[ True, False, False],
         [ True,  True, False],
         [ True,  True,  True]],

        [[ True, False, False],
         [ True,  True, False],
         [ True,  True,  True]],

        [[ True, False, False],
         [ True,  True, False],
         [ True,  True,  True]]])



(torch.Size([3, 3, 3]),
 tensor([[[1.0000, 0.0000, 0.0000],
          [0.4815, 0.5185, 0.0000],
          [0.3263, 0.3553, 0.3184]],
 
         [[1.0000, 0.0000, 0.0000],
          [0.5005, 0.4995, 0.0000],
          [0.3320, 0.3323, 0.3356]],
 
         [[1.0000, 0.0000, 0.0000],
          [0.5092, 0.4908, 0.0000],
          [0.3448, 0.3307, 0.3245]]], grad_fn=<SoftmaxBackward0>))

#### multi-head attentions

In [20]:
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        """
        Initialize the MultiHeadSelfAttention module.

        Args:
            d_model (int): The dimensionality of the input embedding and output embedding.
                        i.e., we assume equal dimensionality for input and output for the attention layer
                            In the self.forward method, 
                            q (Tensor): The input query tensor of shape (batch_size, seq_length_q, d_model).
                            k (Tensor): The input key tensor of shape (batch_size, seq_length_k, d_model). 
                            v (Tensor): The input value tensor of shape (batch_size, seq_length_v, d_model).
            num_heads (int): The number of attention heads.

        Description:
            This module implements the multi-head self-attention mechanism as described in the
            "Attention Is All You Need" paper. It splits the input into multiple heads, allowing the
            model to jointly attend to information from different representation subspaces at different
            positions. Each head processes a portion of the input independently, and their outputs
            are concatenated and linearly transformed into the expected dimensionality.        

            So for each head, dim_k=dim_v=d_model/num_heads,
            When num_heads=1, dim_k=dim_v=d_model.

            ref. Vaswani Fig 2.
        """
        super(MultiHeadSelfAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads  # each head handles an equal share, so the total stays d_model

        assert self.head_dim * num_heads == d_model, "Invalid number of heads or d_model dimensions"

        self.wq = nn.Linear(d_model, d_model) # in_features, out_features = d_model, d_model
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)

        self.wo = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        """
        Forward pass for MultiHeadSelfAttention.
        We assume equal dimensionality for input and output for the attention layer=d_model

        Args:
            q (Tensor): The input query tensor of shape (batch_size, seq_length_q, d_model).
            k (Tensor): The input key tensor of shape (batch_size, seq_length_k, d_model). 
            v (Tensor): The input value tensor of shape (batch_size, seq_length_v, d_model).
                        seq_length_k = seq_length_v: keys and values come from the same sequence.
                        For self-attention, seq_length_q = seq_length_k = seq_length_v.
            mask (Tensor, optional): The mask tensor for ignoring certain elements. Defaults to None.

        Returns:
            Tensor: The output of attended values--tensor of shape (batch_size, seq_length_q, d_model).
        """
        batch_size = q.size(0)

        q = self.wq(q).view(batch_size, -1, self.num_heads, self.head_dim)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.head_dim)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.head_dim)
        # self.wq(q): [batch_size, seq_length_q, d_model].
        # -1 is a placeholder that tells PyTorch to automatically calculate the 
        #   appropriate size for this dimension, which in this context will be seq_length.
        # self.num_heads is the number of attention heads. This dimension is explicitly set, 
        #   splitting the d_model dimension into num_heads smaller heads.
        # self.head_dim is the dimensionality of each head, calculated as d_model // num_heads. 
        #   It indicates how many features each head will process.
        # q shape (batch_size, seq_length_q, num_heads, head_dim)

        # Each head can focus on different features of the input

        # Transpose to get dimensions (batch_size, num_heads, seq_length, head_dim)
        # Now, each head's data is grouped together
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        # do computation head-wise:
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        # q has a shape of (batch_size, num_heads, seq_length_q, head_dim).
        # k.transpose(-2, -1) has a shape of (batch_size, num_heads, head_dim, seq_length_k).
        # The matrix multiplication between q and the transposed k results in a tensor of shape 
        # (batch_size, num_heads, seq_length_q, seq_length_k). 
        # This shape represents attention scores for each query against all keys in all heads for each sequence in the batch.        

        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))

        # Softmax to compute attention weights
        attn = F.softmax(attn_scores, dim=-1)
        # attn: (batch_size, num_heads, seq_length_q, seq_length_k)

        out = torch.matmul(attn, v).transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        # v: (batch_size, num_heads, seq_length_v, head_dim)   
        # seq_length_k=seq_length_v
        # torch.matmul(attn, v): (batch_size, num_heads, seq_length_q, head_dim)
        # .transpose(1, 2) changes the shape to (batch_size, seq_length_q, num_heads, head_dim).
        # .contiguous() is used as a safety measure to ensure that the tensor is 
        # stored in a contiguous block of memory, which is required for the subsequent .view() operation.
        # .view(batch_size, -1, self.d_model) reshapes the tensor back to (batch_size, seq_length_q, d_model).
        # note that d_model = num_heads*head_dim
    
        out = self.wo(out) # (batch_size, seq_length_q, d_model)

        return out
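
The shape comments in `forward` can be condensed into a small bookkeeping helper. This is only a sketch: `attention_shapes` is a hypothetical name, it does no tensor computation, and the batch size and sequence lengths below are assumed values chosen for illustration (a cross-attention call with 29 query positions and 26 key positions).

```python
def attention_shapes(batch_size, len_q, len_k, d_model, num_heads):
    """Return the tensor shapes at each stage of multi-head attention."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    head_dim = d_model // num_heads
    return {
        "q_projected": (batch_size, len_q, d_model),             # after self.wq
        "q_split": (batch_size, num_heads, len_q, head_dim),     # view + transpose
        "k_split": (batch_size, num_heads, len_k, head_dim),
        "scores": (batch_size, num_heads, len_q, len_k),         # q @ k^T
        "per_head_out": (batch_size, num_heads, len_q, head_dim),
        "merged_out": (batch_size, len_q, num_heads * head_dim),  # heads concatenated
    }

shapes = attention_shapes(batch_size=128, len_q=29, len_k=26, d_model=256, num_heads=8)
```

Only the score tensor depends on both sequence lengths; the output always has the query length and `d_model` features.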

## Position-wise Feed Forward Networks

An FFN consists of two linear layers with a non-linear activation function in between, such as ReLU (Rectified Linear Unit). This (two-layer) feed forward network is applied position-wise (i.e., applied to *each position* separately and identically). The output of the first linear layer increases the dimensionality of the input, while the second linear layer reduces it back to the original dimension. This expansion and reduction allow the FFN to learn complex patterns and relationships between features.

The FFN is applied after the attention sub-layers in each block: after the Multi-Head Self-Attention layer in the encoder, and after the encoder-decoder cross-attention layer in the decoder.

$$\mathrm{FFN}(x)=\max(0, xW_1 + b_1) W_2 + b_2$$

While the linear transformations are the same across different
positions, they use different parameters from layer to
layer. 
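
To make "position-wise" concrete, here is a small plain-Python sketch of the formula above applied to each position of a toy sequence. The sizes (`d_model = 2`, `d_ff = 3`) and the weight values are made up for illustration; dropout is omitted.

```python
def linear(x, weight, bias):
    """Compute x @ weight + bias for one position (x is a list of floats)."""
    return [sum(xi * w for xi, w in zip(x, col)) + b
            for col, b in zip(zip(*weight), bias)]

def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

# Toy sizes (assumed for illustration): d_model = 2, d_ff = 3.
w1 = [[1.0, -1.0, 0.5],
      [0.0,  2.0, -1.0]]          # shape (d_model, d_ff)
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]                 # shape (d_ff, d_model)
b2 = [0.0, 0.0]

sequence = [[1.0, 2.0], [3.0, -1.0], [1.0, 2.0]]   # 3 positions
outputs = [ffn(x, w1, b1, w2, b2) for x in sequence]
```

The first and third positions hold the same vector and therefore get the same output: the FFN never looks at neighbouring positions.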

In [21]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        """
        Initialize the PositionwiseFeedForward module. 
        Note that dropout is applied here after the first layer (and its activation) instead of the last layer.

        Args:
            d_model (int): The dimensionality of the input embedding and output embedding.
            d_ff (int): The dimensionality of the hidden layer in the feed-forward network (generally larger than d_model).
            dropout (float, optional): The dropout probability. Defaults to 0.1.
        """
        super(PositionwiseFeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        """
        Forward pass for PositionwiseFeedForward.

        Args:
            x (Tensor): The input tensor of shape (batch_size, seq_length, d_model).

        Returns:
            Tensor: The output tensor of shape (batch_size, seq_length, d_model).
        """
        out = self.linear1(x)
        out = F.relu(out)
        out = self.dropout(out)
        out = self.linear2(out)

        return out

## Add and Norm

The Add & Norm module is crucial for maintaining a stable gradient flow in deep networks, such as the Transformer. It consists of two parts: residual connections and layer normalization.

The Add & Norm module is applied after both the Multi-Head Self-Attention layer and the Position-wise Feed-Forward layer in the Transformer.

In [22]:
class AddNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        """
        Initialize the AddNorm module.

        Args:
            d_model (int): The dimensionality of the input and output.
            eps (float, optional): A small constant for numerical stability. Defaults to 1e-6.
        """
        super(AddNorm, self).__init__()
        self.norm = nn.LayerNorm(d_model, eps=eps)
        # d_model is the dimensionality of the input and output tensors, and 
        # it is also the normalization dimension for LayerNorm. This means the
        # normalization is applied across the d_model features of the input tensor.

    def forward(self, x, residual):
        """
        Forward pass for AddNorm.

        Args:
            x (Tensor): The input tensor of shape (batch_size, seq_length, d_model).
            residual (Tensor): The residual tensor of the same shape as the input tensor.

        Returns:
            Tensor: The output tensor of shape (batch_size, seq_length, d_model).
        """
        out = x + residual
        out = self.norm(out)

        return out
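
As a rough illustration of what `AddNorm` computes for a single position, the sketch below adds a residual and normalizes over the feature axis in plain Python. It assumes the learnable scale and shift of `nn.LayerNorm` are still at their initial values (1 and 0), so they are left out; the input values are made up.

```python
import math

def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual add followed by layer normalization over the feature axis.

    The learnable scale and shift of nn.LayerNorm are assumed to be at
    their initial values (1 and 0), so they are omitted here.
    """
    summed = [a + b for a, b in zip(x, sublayer_out)]
    mean = sum(summed) / len(summed)
    var = sum((s - mean) ** 2 for s in summed) / len(summed)  # biased variance
    return [(s - mean) / math.sqrt(var + eps) for s in summed]

x = [1.0, 2.0, 3.0, 4.0]              # one position, d_model = 4 (toy size)
sublayer_out = [0.5, -0.5, 0.5, -0.5]
y = add_and_norm(x, sublayer_out)
```

After normalization the features of each position have mean close to 0 and variance close to 1, regardless of the scale of the sub-layer output.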

## Positional Encoding

Positional Encoding is a technique used in the Transformer architecture to inject positional information into the input sequence. Since the Multi-Head Self-Attention mechanism is **permutation-equivariant**, it cannot capture the relative position of words in the input sequence by itself. Positional Encoding helps to address this issue by adding a fixed vector to each word's embeddings, which is computed based on its position in the sequence.

The positional encoding function used in the Transformer is based on sine and cosine functions with different frequencies. This choice allows the model to learn to attend to both nearby and distant words, as well as to interpolate the positional encoding for longer sequences than those seen during training.

Note that for $i=0, \ldots, d_{\text{model}}/2 - 1$, the wavelengths form a geometric progression from $2\pi$ up to (just under) $10000 \cdot 2\pi$.

$$PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$$

$$PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$$

Note: $1/10000^{2i/d_{\text{model}}}=\exp(-2i\log(10000)/d_{\text{model}})$
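
The two formulas and the exponential form of the frequency term can be checked numerically with a few lines of plain Python. This is a sketch for a single position; `positional_encoding` is a hypothetical helper name and `d_model = 8` is a toy size.

```python
import math

def positional_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        # i plays the role of 2i in the formula (the even feature index).
        inv_freq = math.exp(-i * math.log(10000.0) / d_model)
        pe[i] = math.sin(pos * inv_freq)
        if i + 1 < d_model:
            pe[i + 1] = math.cos(pos * inv_freq)
    return pe

d_model = 8
# The exponential form equals the power form 1 / 10000^(2i/d_model).
for two_i in range(0, d_model, 2):
    assert math.isclose(math.exp(-two_i * math.log(10000.0) / d_model),
                        1.0 / 10000.0 ** (two_i / d_model))

pe0 = positional_encoding(0, d_model)   # position 0: sin(0) = 0, cos(0) = 1
pe1 = positional_encoding(1, d_model)
```

At position 0 every even feature is 0 and every odd feature is 1; later positions rotate each sine/cosine pair at its own frequency.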

In [23]:
import math
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_len, dropout=0.1):
        """
        Initialize the PositionalEncoding module.
        We apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. 

        Args:
            d_model (int): The dimensionality of the input.
            max_seq_len (int): The maximum length of the input sequence.
            dropout (float, optional): The dropout probability. Defaults to 0.1.
        """
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(dropout)

        pe = torch.zeros(max_seq_len, d_model)

        # Create a tensor representing position indices from 0 to max_seq_len - 1
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)

        # Compute the division term used in the formulas for sine and cosine
        # The division term ensures different wavelengths for different dimensions
        # In torch.arange(0, d_model, 2), it starts at 0, ends before d_model, and increments by 2.
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model))

        # Compute sine values for even indices in the positional encoding matrix
        pe[:, 0::2] = torch.sin(position * div_term)

        # Compute cosine values for odd indices in the positional encoding matrix
        pe[:, 1::2] = torch.cos(position * div_term)

        # Add an extra dimension to the positional encoding for batch size
        pe = pe.unsqueeze(0)
        # pe: (1, max_seq_len, d_model); the leading 1 broadcasts over batch_size

        # Register 'pe' as a buffer that is not a model parameter.
        # Buffers, such as running averages, are not updated by backpropagation.
        # They are, however, saved and restored in the state_dict and moved to
        # GPU along with the model during .to(device) calls.
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Forward pass for PositionalEncoding.

        Args:
            x (Tensor): The input tensor of shape (batch_size, seq_length, d_model).

        Returns:
            Tensor: The output tensor of shape (batch_size, seq_length, d_model).
        """
        x = x + self.pe[:, :x.size(1), :]
        # pe added to x should have same seq_length as x
        x = self.dropout(x)
        return x

## Encoder Block

The Encoder Block in the Transformer architecture consists of the following layers:

- Multi-Head Self-Attention layer
- Add & Norm (Residual connection and Layer Normalization)
- Position-wise Feed-Forward Network layer
- Add & Norm (Residual connection and Layer Normalization)

In the Transformer, multiple encoder blocks (6 according to the paper) are stacked on top of each other to form the complete encoder module.

In [24]:
class EncoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        """
        Initialize the EncoderBlock module.
        (multi_head_attend + add_norm + feed_forward + add_norm ) X 6

        Args:
            d_model (int): The dimensionality of the input (and output).
            num_heads (int): The number of attention heads.
            d_ff (int): The dimensionality of the hidden layer in the feed-forward network (generally larger than d_model)
            dropout (float, optional): The dropout probability (used in PositionwiseFeedForward). Defaults to 0.1.
        """
        super(EncoderBlock, self).__init__()
        self.self_attn = MultiHeadSelfAttention(d_model, num_heads)
        self.norm1 = AddNorm(d_model)
        self.ffn = PositionwiseFeedForward(d_model, d_ff, dropout)
        self.norm2 = AddNorm(d_model)

    def forward(self, x, mask=None):
        """
        Forward pass for EncoderBlock.

        Args:
            x (Tensor): The input tensor of shape (batch_size, seq_length, d_model).
            mask (Tensor, optional): The mask tensor for ignoring certain elements. Defaults to None.

        Returns:
            Tensor: The output tensor of shape (batch_size, seq_length, d_model).
        """
        x1 = self.self_attn(x, x, x, mask)  # q=k=v=x (self attention)
        x = self.norm1(x, x1)
        x1 = self.ffn(x)
        x = self.norm2(x, x1)

        return x

## Decoder Block

The Decoder Block in the Transformer architecture consists of the following layers:

- Masked Multi-Head Self-Attention layer (optionally followed by dropout)
- Add & Norm (Residual connection and Layer Normalization)
- Encoder-Decoder Multi-Head Cross Attention layer (optionally followed by dropout)
- Add & Norm (Residual connection and Layer Normalization)
- Position-wise Feed-Forward Network layer (optionally followed by dropout)
- Add & Norm (Residual connection and Layer Normalization) 

In the Transformer, multiple decoder blocks (6 per the paper) are stacked on top of each other to form the complete decoder module.

In [25]:
class DecoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        """
        Initialize the DecoderBlock module.
        (multi_head_attend + add_norm + encoder_decoder_attend +  add_norm + feed_forward + add_norm ) X 6

        Args:
            d_model (int): The total dimensionality of the input (and output).
            num_heads (int): The number of attention heads.
            d_ff (int): The dimensionality of the hidden layer in the feed-forward network.
            dropout (float, optional): The dropout probability (used in PositionwiseFeedForward). Defaults to 0.1.
        """
        
        super(DecoderBlock, self).__init__()
        self.self_attn = MultiHeadSelfAttention(d_model, num_heads)
        self.norm1 = AddNorm(d_model)
        self.enc_dec_attn = MultiHeadSelfAttention(d_model, num_heads)
        self.norm2 = AddNorm(d_model)
        self.ffn = PositionwiseFeedForward(d_model, d_ff, dropout)
        self.norm3 = AddNorm(d_model)

    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        """
        Forward pass for DecoderBlock (as used in training).

        Args:
            x (Tensor): The "target input" tensor of shape (batch_size, x_seq_length, d_model). 
            enc_output (Tensor): The encoder output tensor of shape (batch_size, enc_output_seq_length, d_model). 
            src_mask (Tensor, optional): The source mask (padding mask on encoder output) tensor for ignoring certain elements. Defaults to None.
            tgt_mask (Tensor, optional): The target mask (look-ahead mask + padding mask on target_input) tensor for ignoring certain elements. Defaults to None.

            Each sequence in x is usually a "target input" sequence for training (teacher forcing).
            Each sequence in enc_output has the length of the corresponding source sequence and
            is the output of the encoder stack.

        Returns:
            Tensor: The output tensor of shape (batch_size, x_seq_length, d_model).
        """
        x1 = self.self_attn(x, x, x, tgt_mask)  # q=k=v=x (self attention)
        x = self.norm1(x, x1)
        x1 = self.enc_dec_attn(x, enc_output, enc_output, src_mask) # q=x, k=v=enc_output (cross attention)
        x = self.norm2(x, x1)
        x1 = self.ffn(x)
        x = self.norm3(x, x1)

        return x

## The Transformer

Now that we have implemented all the building blocks, let's assemble the complete Transformer architecture.

We initialize the following components:

- Source and target embedding layers
- Positional encoding module
- Encoder and decoder layer stacks
- Final linear layer to produce logits over the target vocabulary (a softmax turns them into a probability distribution)

In the forward method, we first pass the source and target input tensors through their respective embedding layers and add the positional encoding. Then, we pass the source input through each encoder layer sequentially, followed by passing the target input and encoder output through each decoder layer sequentially. Finally, we apply the linear layer to produce the output tensor with shape (batch_size, tgt_seq_length, tgt_vocab_size).

In [26]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, D_MODEL, num_heads, d_ff, max_seq_len, num_layers, dropout=0.1):
        """
        Initialize the Transformer module.

        This module applies dropout to position-wise feed forward and in the positional encoding.

        Args:
            src_vocab_size (int): The size of the source vocabulary.
            tgt_vocab_size (int): The size of the target vocabulary.
            D_MODEL (int): The dimensionality of the embeddings (d_model in the paper).
            num_heads (int): The number of attention heads.
            d_ff (int): The dimensionality of the hidden layer in the feed-forward network.
            max_seq_len (int): The maximum length of the input sequence (as needed in PositionalEncoding)
            num_layers (int): The number of layers in the encoder and decoder.
            dropout (float, optional): The dropout probability. Defaults to 0.1.
        """
        super(Transformer, self).__init__()

        # Converts token indices to embeddings of dimension D_MODEL
        self.src_embedding = nn.Embedding(src_vocab_size, D_MODEL)
            # nn.Embedding args (num_embeddings, embedding_dim) = (src_vocab_size, D_MODEL)
            # where num_embeddings =  size of the dictionary of embeddings
            #       embedding_dim = the size of each embedding vector
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, D_MODEL)
        self.pos_encoding = PositionalEncoding(D_MODEL, max_seq_len, dropout)

        self.encoder_layers = nn.ModuleList([EncoderBlock(D_MODEL, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderBlock(D_MODEL, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(D_MODEL, tgt_vocab_size)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        """
        Forward pass for Transformer.

        Args:
            src (Tensor): The source input tensor of shape (batch_size, src_seq_length).
            tgt (Tensor): The target input tensor of shape (batch_size, tgt_seq_length).
            src_mask (Tensor, optional): The source mask (padding mask on encoder output) tensor for ignoring certain elements. Defaults to None.
            tgt_mask (Tensor, optional): The target mask (look-ahead mask + padding mask on target input) tensor for ignoring certain elements. Defaults to None.

        Returns:
            Tensor: The output tensor of logits, shape (batch_size, tgt_seq_length, tgt_vocab_size).
        """
        src = self.src_embedding(src)
        src = self.pos_encoding(src)
        # src after embedding and pos_encoding: (batch_size, src_seq_length, D_MODEL)

        tgt = self.tgt_embedding(tgt)
        tgt = self.pos_encoding(tgt)

        for layer in self.encoder_layers:
            src = layer(src, src_mask)

        for layer in self.decoder_layers:
            tgt = layer(tgt, src, src_mask, tgt_mask)

        out = self.fc(tgt)

        return out

## Define Model and associated Parameters

In [27]:
# Define Hyper Parameters
NUM_EPOCHS      = 5
D_MODEL         = 256
ATTN_HEADS      = 8
NUM_LAYERS      = 3
FEEDFORWARD_DIM = 512
DROPOUT         = 0.1
MAX_SEQ_LEN     = 150 
SRC_VOCAB_SIZE  = len(EN_VOCAB)
TGT_VOCAB_SIZE  = len(DE_VOCAB)
LR              = 0  # initial value only; NoamScheduler sets the actual rate after each optimizer step

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [28]:
class NoamScheduler:
    def __init__(self, optimizer, d_model, warmup_steps=4000):
        """
        Initializes the NoamScheduler. A customized learning rate scheduler.

        Args:
            optimizer: Optimizer whose learning rate will be scheduled.
            d_model (int): Dimensionality of the model embeddings.
            warmup_steps (int): Number of warmup steps for learning rate scheduling.
        """
        self.optimizer = optimizer  # The optimizer to adjust the learning rate for
        self.d_model = d_model  # Dimensionality of model embeddings
        self.warmup_steps = warmup_steps  # Number of warmup steps
        self.current_step = 0  # Counter for tracking the number of optimization steps

    def step(self):
        """
        Updates the learning rate for each parameter group in the optimizer.

        This is called inside the training loop, once per optimizer step.
        """
        self.current_step += 1  # Increment the number of steps
        lr = self.learning_rate()  # Calculate the new learning rate
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr  # Update the learning rate in the optimizer

    def learning_rate(self):
        """
        Computes the learning rate based on the current step.

        Returns:
            float: The computed learning rate.
        """
        step = self.current_step
        # The learning rate formula
        return (self.d_model ** -0.5) * min(step ** -0.5, step * self.warmup_steps ** -1.5)


The learning rate formula used in the `NoamScheduler` can be represented as follows:

\begin{align*}
\text{Learning Rate}=d_{\text{model}}^{-0.5} \cdot \min \left(\operatorname{step}^{-0.5}, \text{step} \cdot \text{warmup\_steps}^{-1.5}\right)
\end{align*}

- During the initial `warmup_steps`, the learning rate increases linearly.
- After `warmup_steps`, the learning rate decreases proportionally to the inverse square root of the step number.
- The `d_model` term is used for scaling the learning rate according to the model size.

This schedule allows for a rapid increase in the learning rate during the initial warmup phase, which is then moderated for a smoother descent, potentially leading to better training stability and performance for Transformer models.
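
The shape of the schedule is easy to inspect with a standalone function. The sketch below restates the formula in plain Python; `noam_lr` is a hypothetical name, and the warmup length of 454 steps is an assumption matching two epochs of 227 batches.

```python
def noam_lr(step, d_model, warmup_steps):
    """Learning rate at a given optimizer step (step must be >= 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

d_model, warmup_steps = 256, 454    # 454 = 2 * 227, an assumed warmup length
schedule = [noam_lr(s, d_model, warmup_steps) for s in range(1, 2001)]

peak_lr = max(schedule)
peak_step = schedule.index(peak_lr) + 1   # steps are 1-based
```

The two branches of `min` meet at `step == warmup_steps`, so the peak rate is `(d_model * warmup_steps) ** -0.5`, roughly 0.0029 for these values. Note that the formula is undefined at step 0, which is why `NoamScheduler.step` increments the counter before computing the rate.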

In [29]:
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR
model = Transformer(SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, D_MODEL, ATTN_HEADS, FEEDFORWARD_DIM, MAX_SEQ_LEN, NUM_LAYERS, DROPOUT).to(DEVICE)
# optimizer = Adam(model.parameters(), lr=0.001, betas=(0.9, 0.98), eps=1e-9)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR, betas=(0.9, 0.98), eps=1e-9, weight_decay=5e-2)

warmup_steps = 2 * len(train_dataloader) # len(train_dataloader) =227 batches; warmup_steps = 2 epochs = total of 2*227 updates
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)
# scheduler = LambdaLR(optimizer, lr_lambda=lambda step: (D_MODEL ** -0.5) * min((step + 1) ** -0.5, (step + 1) * warmup_steps ** -1.5), verbose=True)
scheduler = NoamScheduler(optimizer, d_model=D_MODEL, warmup_steps=warmup_steps)

criterion = torch.nn.CrossEntropyLoss(ignore_index=DE_VOCAB['<pad>'], label_smoothing=0.1)
# The function expects the model output to be raw, unnormalized scores (often called logits) for each class.
scaler = torch.cuda.amp.GradScaler()



On `torch.nn.CrossEntropyLoss(ignore_index=DE_VOCAB['<pad>'], label_smoothing=0.1)`:

`ignore_index=DE_VOCAB['<pad>']`:

`ignore_index` is a parameter that specifies a target value that is ignored in the loss computation.
In this case, `DE_VOCAB['<pad>']` is the index of the padding token (`<pad>`) in the target (German) vocabulary.
By setting `ignore_index` to the index of the padding token, the loss calculation excludes these positions. This matters in tasks like language translation, where sentences have different lengths and padding is used to give every sentence in a batch the same length.

`label_smoothing=0.1`:

`label_smoothing` is a technique used to make the model less confident in its predictions by smoothing the hard (one-hot) labels in the target.
The value 0.1 is the smoothing level: 10% of the target probability mass is spread uniformly across all classes (including the correct one), and the remaining 90% stays on the correct class, so the target distribution becomes slightly softer.
This can lead to better generalization and keeps the model from becoming too confident about its predictions, which helps limit overfitting.
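
The combined effect of the two arguments can be sketched in plain Python. The code below is an illustrative re-derivation, not PyTorch's implementation: the helper names are hypothetical, the logits are made up, and class 1 stands in for `<pad>`.

```python
import math

def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy of one prediction against a label-smoothed target.

    The one-hot target is mixed with a uniform distribution over all K
    classes: (1 - smoothing) * one_hot + smoothing / K.
    """
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    log_probs = [l - log_z for l in logits]
    k = len(logits)
    loss = 0.0
    for cls, lp in enumerate(log_probs):
        weight = smoothing / k + ((1.0 - smoothing) if cls == target else 0.0)
        loss -= weight * lp
    return loss

def batch_loss(all_logits, targets, pad_index, smoothing=0.1):
    """Mean loss over the non-padding targets only (mimics ignore_index)."""
    kept = [smoothed_cross_entropy(lg, t, smoothing)
            for lg, t in zip(all_logits, targets) if t != pad_index]
    return sum(kept) / len(kept)

logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0]]
targets = [0, 2, 1]          # class 1 plays the role of <pad> here
loss = batch_loss(logits, targets, pad_index=1)
```

The third example has a padding target and does not contribute to the mean. With `smoothing=0.0` the per-example loss reduces to the usual negative log-probability of the correct class.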

## Helper Functions

In [30]:
# Example: illustrate generate_tgt_mask() and generate_src_mask()

# Shape: (batch_size, max_seq_length_in_the_batch)= (3, 4)
src = torch.tensor(
    [
        [1, 2, 3, 0],  # First sequence with a padding token
        [4, 5, 0, 0],  # Second sequence with two padding tokens
        [6, 7, 8, 9],  # Third sequence with no padding tokens
    ]
)

pad_idx = 0  # Define the padding index (0 is the padded value)

# Generate the source mask (batch_size, 1, 1, max_seq_length)
src_mask = generate_src_mask(src, pad_idx)

src_mask.shape, src_mask

(torch.Size([3, 1, 1, 4]),
 tensor([[[[ True,  True,  True, False]]],
 
 
         [[[ True,  True, False, False]]],
 
 
         [[[ True,  True,  True,  True]]]]))

In [31]:
# Example
# Define a sample target input tensor with padding tokens (assuming pad_idx=0)
# Shape: (batch_size, max_seq_length)
tgt = torch.tensor([
    [1, 2, 3, 0],  # First sequence with a padding token
    [4, 5, 0, 0],  # Second sequence with two padding tokens
    [6, 7, 8, 9]   # Third sequence with no padding tokens
])

pad_idx = 0  # Define the padding index

# Generate the target mask (batch_size, 1, max_seq_length, max_seq_length)
tgt_mask = generate_tgt_mask(tgt, pad_idx)

tgt_mask.shape, tgt_mask

(torch.Size([3, 1, 4, 4]),
 tensor([[[[ True, False, False, False],
           [ True,  True, False, False],
           [ True,  True,  True, False],
           [ True,  True,  True, False]]],
 
 
         [[[ True, False, False, False],
           [ True,  True, False, False],
           [ True,  True, False, False],
           [ True,  True, False, False]]],
 
 
         [[[ True, False, False, False],
           [ True,  True, False, False],
           [ True,  True,  True, False],
           [ True,  True,  True,  True]]]]))
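
The helpers `generate_src_mask` and `generate_tgt_mask` are imported from `utils_evaluation` and are not shown here. Judging only from the printed outputs, the target mask looks like a look-ahead mask combined with a padding mask over the key positions. The plain-Python sketch below reproduces that pattern with nested lists; the function names are hypothetical and the extra broadcast dimensions of the real tensors are left out.

```python
def src_padding_mask(tokens, pad_idx):
    """True where a key position holds a real token, False where it is padding."""
    return [tok != pad_idx for tok in tokens]

def tgt_mask(tokens, pad_idx):
    """Look-ahead mask AND key padding mask, one row per query position."""
    keep = src_padding_mask(tokens, pad_idx)
    n = len(tokens)
    return [[(j <= i) and keep[j] for j in range(n)] for i in range(n)]

first = tgt_mask([1, 2, 3, 0], pad_idx=0)    # one padding token
second = tgt_mask([4, 5, 0, 0], pad_idx=0)   # two padding tokens
```

Each row answers "which keys may this query attend to": nothing in the future, and nothing that is padding.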

## Model Summary

In [32]:
# Model Summary
from torchsummaryX import summary
src, tgt = next(iter(train_dataloader))
src, tgt = src.to(DEVICE), tgt.to(DEVICE)

tgt_input = tgt[:, :-1]
tgt_output = tgt[:, 1:]
src_mask = generate_src_mask(src, EN_VOCAB['<pad>'])
tgt_mask = generate_tgt_mask(tgt_input, DE_VOCAB['<pad>'])

summary(model, src, tgt_input, src_mask, tgt_mask)

# summary(model, src, tgt_input, src_mask, tgt_mask): 
# This line generates a summary of the model's architecture and 
# how the input data (src, tgt_input, and the masks) flows through it;
# providing detailed insights about the model, 
# such as the size of each layer, the number of parameters, 
# the shape of the output at each stage, and the computational cost.

                                            Kernel Shape      Output Shape  \
Layer                                                                        
0_src_embedding                             [256, 10837]    [128, 26, 256]   
1_pos_encoding.Dropout_dropout                         -    [128, 26, 256]   
2_tgt_embedding                             [256, 19214]    [128, 29, 256]   
3_pos_encoding.Dropout_dropout                         -    [128, 29, 256]   
4_encoder_layers.0.self_attn.Linear_wq        [256, 256]    [128, 26, 256]   
5_encoder_layers.0.self_attn.Linear_wk        [256, 256]    [128, 26, 256]   
6_encoder_layers.0.self_attn.Linear_wv        [256, 256]    [128, 26, 256]   
7_encoder_layers.0.self_attn.Linear_wo        [256, 256]    [128, 26, 256]   
8_encoder_layers.0.norm1.LayerNorm_norm            [256]    [128, 26, 256]   
9_encoder_layers.0.ffn.Linear_linear1         [256, 512]    [128, 26, 512]   
10_encoder_layers.0.ffn.Dropout_dropout                -    [128



Layer                                     Kernel Shape  Output Shape    Params     Mult-Adds
0_src_embedding                           [256, 10837]  [128, 26, 256]  2774272.0  2774272.0
1_pos_encoding.Dropout_dropout            -             [128, 26, 256]
2_tgt_embedding                           [256, 19214]  [128, 29, 256]  4918784.0  4918784.0
3_pos_encoding.Dropout_dropout            -             [128, 29, 256]
4_encoder_layers.0.self_attn.Linear_wq    [256, 256]    [128, 26, 256]  65792.0    65536.0
...                                       ...           ...             ...        ...
69_decoder_layers.2.ffn.Linear_linear1    [256, 512]    [128, 29, 512]  131584.0   131072.0
70_decoder_layers.2.ffn.Dropout_dropout   -             [128, 29, 512]
71_decoder_layers.2.ffn.Linear_linear2    [512, 256]    [128, 29, 256]  131328.0   131072.0
72_decoder_layers.2.norm3.LayerNorm_norm  [256]         [128, 29, 256]  512.0      256.0
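
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer sizes. The helper names below are hypothetical; the vocabulary sizes and dimensions are read off the summary output.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def embedding_params(vocab_size, d_model):
    """One d_model-dimensional vector per vocabulary entry."""
    return vocab_size * d_model

def layer_norm_params(d_model):
    """One scale and one shift per feature."""
    return 2 * d_model

d_model, d_ff = 256, 512
counts = {
    "src_embedding": embedding_params(10837, d_model),
    "tgt_embedding": embedding_params(19214, d_model),
    "attention_linear": linear_params(d_model, d_model),
    "ffn_linear1": linear_params(d_model, d_ff),
    "ffn_linear2": linear_params(d_ff, d_model),
    "layer_norm": layer_norm_params(d_model),
}
```

The two embedding tables dominate the parameter budget of this small model; each attention projection contributes only 65,792 parameters.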


## Train and Validate
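
The training loop below slices each target batch into `tgt[:, :-1]` (decoder input) and `tgt[:, 1:]` (loss target). A toy single-sequence sketch of that shift, with made-up tokens shown as strings for readability:

```python
SOS, EOS, PAD = "<sos>", "<eos>", "<pad>"

# A hypothetical padded target sequence.
tgt = [SOS, "ein", "hund", EOS, PAD]

tgt_input = tgt[:-1]    # what the decoder sees: drops the last column
tgt_output = tgt[1:]    # what the loss compares against: drops <sos>

# Position t of the decoder output is trained to predict tgt_output[t].
pairs = [(seen, target) for seen, target in zip(tgt_input, tgt_output)
         if target != PAD]   # padded targets are skipped, like ignore_index
```

Each remaining pair reads "having seen this token (and everything before it), predict that token", ending with the prediction of `<eos>`.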

In [33]:
def train_epoch(model, dataloader, optimizer, criterion, device):
    """
    Note: The source sequences will be input in the encoder,
    while the target sequences will be input in the decoder (as tgt_input sequences),
    and also target output (as tgt_output sequences) in the loss computation.
    """
    model.train()
    total_loss = 0

    batch_bar = tqdm(
        total=len(dataloader), dynamic_ncols=True, leave=False, position=0, desc="Train"
    )

    for i, (src, tgt) in enumerate(dataloader):
        src, tgt = src.to(device), tgt.to(device)
        # print("Src", src.shape, "Tgt", tgt.shape)

        # target as input (teacher forcing)
        tgt_input = tgt[
            :, :-1
        ]  # (batch_size, max_tgt_seq_len - 1), so excluding the last token <eos>
        # target as output (for loss evaluation)
        tgt_output = tgt[
            :, 1:
        ]  # (batch_size, max_tgt_seq_len - 1), so excluding the first token <sos>

        src_mask = generate_src_mask(src, EN_VOCAB["<pad>"])
        # (batch_size, 1, 1, max_seq_length)
        tgt_mask = generate_tgt_mask(tgt_input, DE_VOCAB["<pad>"])
        # (batch_size, 1, max_tgt_seq_len - 1, max_tgt_seq_len - 1)

        optimizer.zero_grad()

        if device.type == "cpu":
            output = model(src, tgt_input, src_mask, tgt_mask)
            # output shape (batch_size, max_tgt_seq_len - 1, tgt_vocab_size)
            # one prediction per position of tgt_input (batch_size, max_tgt_seq_len - 1):
            # the logits at position t are the prediction for tgt_output[t], i.e. the token
            # that follows tgt_input[t]; positions whose target is <pad> are ignored by the loss.
            loss = criterion(output.reshape(-1, output.size(2)), tgt_output.reshape(-1))
            # output.reshape(-1, output.size(2)): (batch_size * (max_tgt_seq_len - 1), tgt_vocab_size)
            # tgt_output.reshape(-1): (batch_size * (max_tgt_seq_len - 1),)

            # CrossEntropyLoss function takes the reshaped output and
            # tgt_output to compute the loss. It compares the predicted
            # output (logits for each token in the vocabulary) with the actual
            # target token indices.
            loss.backward()
            optimizer.step()
        else:
            with torch.cuda.amp.autocast():
                output = model(src, tgt_input, src_mask, tgt_mask)
                loss = criterion(
                    output.reshape(-1, output.size(2)), tgt_output.reshape(-1)
                )

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

        scheduler.step()

        total_loss += loss.item()
        batch_bar.set_postfix(
            loss="{:.04f}".format(total_loss / (i + 1)),
            lr="{:.09f}".format(float(optimizer.param_groups[0]["lr"])),
        )
        batch_bar.update()

    batch_bar.close()  # release the progress bar so it leaves no residue in the output

    # Return the average loss for the epoch
    return total_loss / len(dataloader)


def validate_epoch(model, dataloader, criterion, device):
    """
    Same procedure as train_epoch, but without gradient computation or parameter updates.
    Returns the validation loss and the BLEU score, both averaged over minibatches.
    """
    model.eval()
    epoch_loss = 0
    epoch_bleu_score = 0

    batch_bar = tqdm(
        total=len(dataloader),
        dynamic_ncols=True,
        leave=False,
        position=0,
        desc="Validate",
    )

    with torch.no_grad():
        for i, (src, tgt) in enumerate(dataloader):
            src, tgt = src.to(device), tgt.to(device)

            # Drop the last position (<eos>, or <pad> for shorter, padded sequences)
            tgt_input = tgt[:, :-1]  # (batch_size, max_tgt_seq_len - 1)
            # Drop the first token <sos>
            tgt_output = tgt[:, 1:]  # (batch_size, max_tgt_seq_len - 1)

            src_mask = generate_src_mask(src, EN_VOCAB["<pad>"])
            tgt_mask = generate_tgt_mask(tgt_input, DE_VOCAB["<pad>"])

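            # Enable mixed precision only on CUDA, mirroring the CPU branch in train_epoch
            with torch.cuda.amp.autocast(enabled=(device.type == "cuda")):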
                output = model(src, tgt_input, src_mask, tgt_mask)
                # output shape (batch_size, max_tgt_seq_len - 1, tgt_vocab_size)
                # One prediction per position of tgt_input: position t predicts tgt_output[:, t],
                # i.e., every token after <sos> (including <eos> and possibly <pad>); <sos> itself is never predicted.
                loss = criterion(
                    output.reshape(-1, output.shape[-1]), tgt_output.reshape(-1)
                )

            epoch_loss += loss.item()  # Accumulate the loss
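            # Calculate and accumulate the BLEU score. Note that the predictions here are
            # teacher-forced (the decoder sees the ground-truth prefix), not free-running decoding.
            epoch_bleu_score += calculate_bleu(tgt_output, output.argmax(-1), DE_VOCAB)
            # output.argmax(-1): (batch_size, max_tgt_seq_len - 1) predicted token indices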

            batch_bar.set_postfix(
                loss="{:.04f}".format(epoch_loss / (i + 1)),
                bleu="{:.04f}".format(epoch_bleu_score / (i + 1)),
            )

            batch_bar.update()

    batch_bar.close()  # release the progress bar so it leaves no residue in the output

    # Average the loss and BLEU score over the validation minibatches
    epoch_loss /= len(dataloader)
    epoch_bleu_score /= len(dataloader)

    return epoch_loss, epoch_bleu_score
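
The target shift used in both functions can be checked on a toy batch. The following is a minimal sketch, not the notebook's actual setup: the special-token ids, the vocabulary size, the random logits standing in for the model output, and the use of `ignore_index` for padding are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical special-token ids, chosen only for this illustration.
PAD, SOS, EOS = 0, 1, 2

# Two padded target sequences:
#   <sos> w w w <eos>   and   <sos> w <eos> <pad> <pad>
tgt = torch.tensor([[SOS, 5, 6, 7, EOS],
                    [SOS, 8, EOS, PAD, PAD]])

tgt_input = tgt[:, :-1]   # decoder input: last position dropped
tgt_output = tgt[:, 1:]   # expected output: leading <sos> dropped

# Random logits stand in for the model output: (batch, tgt_len - 1, vocab_size).
torch.manual_seed(0)
vocab_size = 10
logits = torch.randn(tgt_input.size(0), tgt_input.size(1), vocab_size)

# Assumption: padded positions are excluded from the loss via ignore_index.
criterion = nn.CrossEntropyLoss(ignore_index=PAD)
loss = criterion(logits.reshape(-1, vocab_size), tgt_output.reshape(-1))

print(tgt_input.tolist())   # decoder sees <sos> plus the ground-truth prefix
print(tgt_output.tolist())  # at each position, the next token is the label
print(loss.item())
```

With `ignore_index` set, the mean is taken over the six non-padding labels only, so padded positions do not dilute the loss.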

## Experiments

In [34]:
best_val_loss = float('inf')
train_losses = []
val_losses = []
bleu_scores = [] 

for epoch in range(1, NUM_EPOCHS + 1):
    print(f"Epoch {epoch}/{NUM_EPOCHS}")

    # Training
    train_loss = train_epoch(model, train_dataloader, optimizer, criterion, DEVICE)
    train_losses.append(train_loss)

    # Validation
    val_loss, bleu_score = validate_epoch(model, val_dataloader, criterion, DEVICE)
    val_losses.append(val_loss)
    bleu_scores.append(bleu_score)

    # The learning rate is updated after every batch inside train_epoch,
    # so there is no per-epoch scheduler.step() here.

    # Print results
    print(f"Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | BLEU Score: {bleu_score:.4f}")

    # Save the model with the best validation loss
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pth')

Epoch 1/5
Train Loss: 5.8270 | Val Loss: 4.2420 | BLEU Score: 6.7642
Epoch 2/5
Train Loss: 3.8166 | Val Loss: 3.6280 | BLEU Score: 17.1142
Epoch 3/5
Train Loss: 3.3091 | Val Loss: 3.3153 | BLEU Score: 12.6427
Epoch 4/5
Train Loss: 2.9304 | Val Loss: 3.1390 | BLEU Score: 27.8041
Epoch 5/5
Train Loss: 2.6701 | Val Loss: 3.0689 | BLEU Score: 23.1824
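
The training loop writes the weights with the lowest validation loss to `best_model.pth`, while the in-memory model always holds the last epoch's weights. A minimal sketch of restoring a saved `state_dict` before evaluation follows; the tiny `nn.Linear` stand-in model and the temporary file path are assumptions for illustration, not the notebook's transformer.

```python
import os
import tempfile

import torch
import torch.nn as nn

# A tiny stand-in model; in the notebook this would be the trained transformer.
model = nn.Linear(4, 2)

# Save only the parameters (state_dict), as the training loop does.
ckpt_path = os.path.join(tempfile.mkdtemp(), "best_model.pth")
torch.save(model.state_dict(), ckpt_path)

# Later: rebuild the same architecture, then load the saved parameters into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
restored.eval()  # switch off training-only behavior (e.g. dropout) before evaluation

# Both models now produce identical outputs for the same input.
x = torch.ones(1, 4)
same = torch.equal(model(x), restored(x))
print(same)
```

In the run shown here the last epoch also has the lowest validation loss, so reloading would not change the result, but that need not hold for longer runs.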




## Evaluate Test Set

In [35]:
test_out = evaluate_test_set_bleu(model, test_dataloader, EN_VOCAB, DE_VOCAB, DEVICE)
print("Test BLEU score:", test_out[0].score)

Evaluating: 100%|██████████| 8/8 [00:35<00:00,  4.43s/it]
That's 100 lines that end in a tokenized period ('.')
It looks like you forgot to detokenize your test data, which may hurt your score.
If you insist your data is detokenized, or don't care, you can suppress this message with the `force` parameter.


Test BLEU score: 29.65422317710202
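
The sacrebleu warning above appears because the sentences still carry tokenized punctuation (a space before the final period). One way to address it is to detokenize before scoring. The sketch below uses a hypothetical helper, `simple_detokenize`, that only re-attaches common punctuation; it is a simplification for illustration, not a full detokenizer and not part of the notebook.

```python
import re


def simple_detokenize(sentence: str) -> str:
    """Hypothetical helper: re-attach punctuation split off by tokenization."""
    # Remove whitespace that precedes common closing punctuation marks.
    return re.sub(r"\s+([.,!?;:])", r"\1", sentence).strip()


tokenized = [
    "Ein Mann in einem Karateanzug in der Luft .",
    "Fünf Leute stehen im Schnee , im Hintergrund .",
]
detokenized = [simple_detokenize(s) for s in tokenized]
print(detokenized)
```

The detokenized hypotheses and references could then be passed to `sacrebleu.corpus_bleu(hypotheses, [references])`, which applies its own tokenizer. Alternatively, as the warning itself states, the `force` parameter suppresses the message when the data is intentionally left tokenized.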


In [36]:
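# Randomly select an example from the test set to display.
# random.randrange excludes the upper bound, so the index is always valid
# (random.randint includes it and could return len(test_dataset)).
rand_index = random.randrange(len(test_dataset))  # len(test_dataset): total number of test sentences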
print("\n\nExample Sentence and its Translation")
# Display the source sentence, ground truth, and machine-translated sentence
# for the example, excluding <pad>, <sos> and <eos>
print(
    "Source Sentence in English               :",
    " ".join(
        [
            EN_VOCAB.itos[i]
            for i in test_dataset[rand_index][0]
            if EN_VOCAB.itos[i] not in ["<pad>", "<sos>", "<eos>"]
        ]
    ),
)
print("Ground Truth Sentence in German          :", test_out[1][rand_index])
print("Machine Translated Sentence in German    :", test_out[2][rand_index])



Example Sentence and its Translation
Source Sentence in English               : A man in a martial arts uniform in midair .
Ground Truth Sentence in German          : Ein Mann in einem Karateanzug in der Luft .
Machine Translated Sentence in German    : Ein Mann in einem Kampfsportler mitten in der Luft .
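
The index-to-word conversion used for the source sentence can be sketched in isolation. The small `itos` list and the token ids below are hypothetical stand-ins for `EN_VOCAB.itos` and a padded test example.

```python
# Hypothetical index-to-string table; the notebook's EN_VOCAB.itos plays this role.
itos = ["<unk>", "<pad>", "<sos>", "<eos>", "A", "man", "in", "midair", "."]
SPECIAL_TOKENS = {"<pad>", "<sos>", "<eos>"}


def ids_to_sentence(token_ids, itos):
    """Map token ids to words and drop padding and sentence-boundary markers."""
    words = [itos[i] for i in token_ids]
    return " ".join(w for w in words if w not in SPECIAL_TOKENS)


# <sos> A man in midair . <eos> <pad> <pad>
example_ids = [2, 4, 5, 6, 7, 8, 3, 1, 1]
print(ids_to_sentence(example_ids, itos))
```

`<unk>` is deliberately kept, which matches the ground-truth sentences in this notebook where out-of-vocabulary words appear as `<unk>`.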


In [37]:
# Compare the first five ground-truth German sentences with their machine translations
test_out[1][0:5], test_out[2][0:5]


(['Ein Mann mit einem orangefarbenen Hut , der etwas <unk> .',
  'Ein Boston Terrier läuft über <unk> Gras vor einem weißen Zaun .',
  'Ein Mädchen in einem Karateanzug bricht ein Brett mit einem Tritt .',
  'Fünf Leute in Winterjacken und mit Helmen stehen im Schnee mit <unk> im Hintergrund .',
  'Leute Reparieren das Dach eines Hauses .'],
 ['Ein Mann mit einem orangen Hut starrt auf etwas .',
  'Ein Schäferhund rennt auf einer weißen Wiese vor einem weißen Zaun .',
  'Ein Mädchen in Karateanzügen Kleidung hat einen Stock vor einem Tritt .',
  'Fünf Personen in Jacken und mit Helmen stehen im Schnee , im Hintergrund .',
  'Leute reparieren das Dach eines Hauses .'])