# A3: Make Your Own Machine Translation 

In this assignment, we will explore the domain of neural machine translation. The focus will be on
translating between your native language and English. We will experiment with different types of attention
mechanisms, including general attention, multiplicative attention, and additive attention, to evaluate their
effectiveness in the translation process.

#### Step 0: Prepare Environment - Import Libraries and select device

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

import datasets, math, re, random, time
from collections import Counter
from tqdm import tqdm

In [2]:
# minimum required torch version for MPS support: "1.12+"
torch.__version__

'2.10.0'

In [3]:
# universal device selection: use gpu if available, else cpu
import torch

# def get_device():
#     if torch.cuda.is_available():
#         return torch.device("cuda")      # NVIDIA GPU
#     elif torch.backends.mps.is_available():
#         return torch.device("mps")       # Apple Silicon GPU
#     else:
#         return torch.device("cpu")

device = torch.device("cpu")

print(f"Using device: {device}")

# CPU preferred, as MPS keeps on crashing during training with memory errors.
#RuntimeError: MPS backend out of memory (MPS allocated: 86.95 GiB, 
# other allocations: 1.14 GiB, max allowed: 88.13 GiB). Tried to allocate 42.25 MiB on private pool. 
# Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).


Using device: cpu


## Task 1. Get Language Pair - Based on MT + Transformer.ipynb, modify the dataset as follows:

### 1.1) Find a dataset suitable for translation between your native language and English. Ensure to source this dataset from reputable public databases or repositories. It is imperative to give proper credit to the dataset source in your documentation. (1 points)


Context Behind Nepali Language:

My native language is Nepali. It is spoken by roughly 32 million people around the world as a first or second language, which is about 0.4% of an estimated world population of 8 billion.

A few corpora for Nepali, including Nepali-English translation data, are available on Hugging Face and other public repositories:
1. Opus project [Helsinki NLP Research](https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-ne)

Helsinki-NLP is the language technology research group at the University of Helsinki. The group publishes resources for multilingual NLP, machine translation, and text simplification, among other application areas, with a focus on wide language coverage, open datasets, and public pre-trained models.

2. [IRIIS project](https://huggingface.co/IRIIS-RESEARCH)
IRIIS-Research is a research group that publishes large-scale raw text corpora on Hugging Face, including one of the largest publicly available Nepali text datasets. The data is monolingual Nepali text and is large (~10 GB).

3. [ELRA](https://catalog.elra.info/en-us/repository/search/?q=nepali)

Founded in 1995, ELRA, the ELRA Language Resources Association is a non-profit organisation whose main mission is to make Language Resources (LRs) for Human Language Technologies (HLT) available to the community at large.

4. [FLORES+](https://huggingface.co/datasets/openlanguagedata/flores_plus)
FLORES+ is a multilingual machine translation benchmark released under CC BY-SA 4.0. The dataset was originally released by FAIR researchers at Meta under the name FLORES; further information about those initial releases is available on the dataset card. The data is now managed by OLDI, the Open Language Data Initiative. The + was added to the name to distinguish the original datasets from this new, actively developed version.

Archived Flores : [Flores 200](https://huggingface.co/datasets/facebook/flores/blob/main/README.md#dataset-card-for-flores-200)

Further reading: the review paper [Natural language processing for Nepali text: a review](../resources/Shahi-Sitaula2021_Article_NaturalLanguageProcessingForNe.pdf)

<strong>For this assignment, I am using a subset of the OPUS-100 dataset from Hugging Face.</strong>

OPUS-100 was chosen for:
1. Better quality
2. Proper train/val/test split
3. Manageable size when using a subset
4. More realistic results

** The Tatoeba dataset is smaller and good for fast training, but it consists of simple sentences and may not generalize well.


| Dataset     | Size                | Languages      | Use Case                | Quality                  |
|-------------|---------------------|----------------|-------------------------|--------------------------|
| WMT14       | ~4.5M pairs (de-en) | 2-6 pairs      | Training                | High (news)              |
| WMT16       | ~4.5M pairs         | 6-8 pairs      | Training                | High (news)              |
| WMT19       | ~38M pairs (de-en)  | 10+ pairs      | Training                | High (news)              |
| OPUS-100    | ~55M total          | 100 languages  | Multilingual training   | Medium                   |
| Tatoeba     | ~10M total          | 400+ languages | Evaluation/Small training| Medium (user-contributed)|


#### Step 1: Data preparation - using OPUS dataset

In [4]:
from datasets import load_dataset

EN_LANGUAGE = 'en'
NE_LANGUAGE = 'ne'
# OPUS-100 provides this pair as "en-ne": English is the source and Nepali the target.
LANG_PAIR = f"{EN_LANGUAGE}-{NE_LANGUAGE}"
print("Translation Language Pair:", LANG_PAIR)

dataset = load_dataset("opus100", LANG_PAIR)

Translation Language Pair: en-ne




In [5]:
dataset

DatasetDict({
    test: Dataset({
        features: ['translation'],
        num_rows: 2000
    })
    train: Dataset({
        features: ['translation'],
        num_rows: 406381
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 2000
    })
})

In [6]:
dataset['train'][100]

{'translation': {'en': 'The following item is due:',
  'ne': 'निम्न वस्तुको म्याद समाप्त हुन्छ:'}}

In [7]:
# Selecting smaller subsets for faster training/testing
train = dataset["train"].select(range(100_000))  # 100K samples
val = dataset["validation"]
test = dataset["test"]

print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")

Train: 100000, Val: 2000, Test: 2000


In [8]:
train[0], val[0], test[0]

({'translation': {'en': '_Inv', 'ne': 'Inv'}},
 {'translation': {'en': '%1: the message is displayed silently.',
   'ne': '% 1: सन्देश ध्वनि बिना प्रदर्शित हुन्छ ।'}},
 {'translation': {'en': 'Delete Thread', 'ne': 'थ्रेड मेट्नुहोस्'}})


### 1.2) Describe in detail the process of preparing the dataset for use in your translation model. This
includes steps like text normalization, tokenization, and word segmentation, particularly focusing
on your native language’s specific requirements. Specify the libraries or tools you will use for these
tasks and give appropriate credit to the developers or organizations behind these tools. If your
native language requires special handling in tokenization (e.g., for languages like Chinese, Thai, or
Japanese), mention the libraries (like Jieba, PyThaiNLP, or Mecab) and the procedures used for
word segmentation. (1 points)
Note: proper attribution for both the dataset and the tools used in its processing is essential for maintaining
academic integrity.

#### Step 2: Data preprocessing


For English tokenization, a pretrained spaCy model is used:

```bash
uv add spacy
uv add pip

uv run python3 -m spacy download en_core_web_sm 
```

Instead of downloading the model with `uv run`, the notebook downloads it from a Python script and saves it to a local directory.

There is no spaCy model for Nepali, so there are two options for tokenization:

1. Train a tokenizer on the dataset, for example with SentencePiece, a library that implements subword algorithms such as BPE and unigram and is simple and fast to use.
    
    Pros: Tailored to the data; subword units handle out-of-vocabulary (OOV) words
    
    Cons: Requires training

2. Use a pretrained tokenizer
    
    Pros: Ready to use

    Cons: May not fit the dataset used



```python
from transformers import AutoTokenizer

# Already trained, supports Nepali!
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
```

The tokenizer comes from Meta's pre-trained [NLLB](https://github.com/facebookresearch/fairseq/tree/nllb) model; only the tokenizer is used here, not the translation model itself.

Research Paper : [No Language Left Behind:Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672)



Create two dictionaries: one to hold the tokenizers and one to hold the vocabularies, which assign a number to each unique token.

In [9]:
# Place-holders
token_transform = {}
vocab_transform = {}

In [10]:
import spacy
from spacy.cli import download
import os
import shutil
import glob

from transformers import AutoTokenizer

MODEL_DIRECTORY = "./../models"

# Create models directory
os.makedirs(MODEL_DIRECTORY, exist_ok=True)

SPACY_MODEL_PATH = os.path.join(MODEL_DIRECTORY, "en_core_web_sm")

def load_spacy_model():
    """Load spaCy model from custom directory, download if needed"""
    # Check if config.cfg exists (proper model structure)
    if os.path.exists(os.path.join(SPACY_MODEL_PATH, "config.cfg")):
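        # Disable the same components as the fresh-download path so both branches behave alike
        return spacy.load(SPACY_MODEL_PATH, disable=["parser", "tagger", "ner", "lemmatizer"])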
    
    # Download and copy to custom folder
    print("Downloading spaCy model...")
    download("en_core_web_sm")
    
    import en_core_web_sm
    source_path = en_core_web_sm.__path__[0]
    
    # Find the actual model directory (contains config.cfg)
    # It's usually nested like: en_core_web_sm/en_core_web_sm-3.x.x/
    # Config checks added to fix - OSError: [E053] Could not read config file from ../models/en_core_web_sm/config.cfg
    config_files = glob.glob(os.path.join(source_path, "**", "config.cfg"), recursive=True)
    if config_files:
        actual_model_dir = os.path.dirname(config_files[0])
    else:
        actual_model_dir = source_path
    
    # Copy the actual model files
    os.makedirs(MODEL_DIRECTORY, exist_ok=True)
    if os.path.exists(SPACY_MODEL_PATH):
        shutil.rmtree(SPACY_MODEL_PATH)
    shutil.copytree(actual_model_dir, SPACY_MODEL_PATH)
    
    # Load spaCy models directly (faster than get_tokenizer for batch processing)
    return spacy.load(SPACY_MODEL_PATH, disable=["parser", "tagger", "ner", "lemmatizer"])


def load_nllb():
    """Load NLLB tokenizer from custom directory, download if needed"""
    nllb_path = os.path.join(MODEL_DIRECTORY, "nllb-tokenizer")
    if os.path.exists(nllb_path):
        return AutoTokenizer.from_pretrained(nllb_path)
    
    print("Downloading NLLB tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
    tokenizer.save_pretrained(nllb_path)
    return tokenizer

In [11]:
spacy_model = load_spacy_model()

# Add English tokenizer to token_transform
def spacy_tokenize(text):
    """Tokenize text using spaCy"""
    return [tok.text for tok in spacy_model(text)]

token_transform[EN_LANGUAGE] = spacy_tokenize

In [12]:
from transformers import AutoTokenizer

nllb = load_nllb()
sample_nepali_sentence = "नेपाल सुन्दर छ"

# Add Nepali tokenizer to token_transform
token_transform[NE_LANGUAGE] = nllb.tokenize

print(token_transform[NE_LANGUAGE](sample_nepali_sentence))

['▁नेपाल', '▁सुन्दर', '▁छ']


In [13]:
train_nepali_text = train[100]['translation'][NE_LANGUAGE]
print("Sentence:", train_nepali_text)
print("Tokenization: ", token_transform[NE_LANGUAGE](train_nepali_text))

Sentence: निम्न वस्तुको म्याद समाप्त हुन्छ:
Tokenization:  ['▁निम्न', '▁वस्तु', 'को', '▁म्या', 'द', '▁समाप्त', '▁हुन्छ', ':']


In [14]:
#example of tokenization of the english part
train_english_text = train[100]['translation'][EN_LANGUAGE]
print("Sentence:", train_english_text)
print("Tokenization: ", token_transform[EN_LANGUAGE](train_english_text))

Sentence: The following item is due:
Tokenization:  ['The', 'following', 'item', 'is', 'due', ':']


In [15]:
SRC_LANGUAGE = EN_LANGUAGE
TRG_LANGUAGE = NE_LANGUAGE

Function to tokenize the input

In [16]:
# helper function to yield list of tokens
# here data can be `train` or `val` or `test`
def yield_tokens(data, language):
    for data_sample in data:
        yield token_transform[language](data_sample['translation'][language])

Before building the vocabulary, let's define some special symbols so that the neural network can learn embeddings for them: unknown, padding, start of sentence, and end of sentence.

The special symbols `<unk>`, `<pad>`, `<sos>`, and `<eos>` are assigned indices 0, 1, 2, and 3 respectively, with the following meanings:
>
>   `<unk>`: To represent Unknown
>
>   `<pad>`: Padding, used to ensure all sequences are of same length
>
>   `<sos>`: Start of sentence
>
>   `<eos>`: End of sentence


In [17]:
# Define special symbols and indices
UNK_IDX, PAD_IDX, SOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<sos>', '<eos>']
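
As a minimal sketch of how these symbols are used, the snippet below wraps a tokenized sentence in `<sos>`/`<eos>`, maps an unseen word to `<unk>`, and pads the result to a fixed length. The miniature vocabulary `toy_stoi` and the helper `encode` are hypothetical names for illustration only; the real indices come from the vocabulary built in the next step.

```python
# Toy illustration of the four special symbols (indices as defined above).
UNK_IDX, PAD_IDX, SOS_IDX, EOS_IDX = 0, 1, 2, 3

# Hypothetical miniature vocabulary; real indices come from the training data.
toy_stoi = {"<unk>": 0, "<pad>": 1, "<sos>": 2, "<eos>": 3, "delete": 4, "thread": 5}


def encode(tokens, max_len):
    """Map tokens to ids, wrap them in <sos>/<eos>, then right-pad with <pad>."""
    ids = [SOS_IDX] + [toy_stoi.get(tok, UNK_IDX) for tok in tokens] + [EOS_IDX]
    return ids + [PAD_IDX] * (max_len - len(ids))


# "this" is not in the toy vocabulary, so it maps to <unk>.
encoded = encode(["delete", "this", "thread"], max_len=7)
print(encoded)
```

The printed list starts with the `<sos>` index, ends the sentence with the `<eos>` index, and fills the remaining positions with the `<pad>` index.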

#### Step 4: Numericalization

Next, we create a function that turns these tokens into integers (torchtext calls these vocabs). Here we build our own `Vocab` class, as torchtext.vocab is not supported in Python v3.13+.

In [18]:
# torchtext.vocab replacement - Vocab class to mimic torchtext API
from collections import Counter

special_symbols = ['<unk>', '<pad>', '<sos>', '<eos>']
UNK_IDX, PAD_IDX, SOS_IDX, EOS_IDX = 0, 1, 2, 3

class Vocab:
    """A simple Vocab class to replace torchtext.vocab"""
    def __init__(self, stoi, itos, default_index=0):
        self.stoi = stoi  # string to index
        self.itos = itos  # index to string (list)
        self.default_index = default_index
    
    def __call__(self, tokens):
        """Convert list of tokens to list of indices"""
        return [self.stoi.get(token, self.default_index) for token in tokens]
    
    def __len__(self):
        return len(self.itos)
    
    def __getitem__(self, token):
        return self.stoi.get(token, self.default_index)
    
    def set_default_index(self, index):
        self.default_index = index
    
    def get_itos(self):
        return self.itos

def build_vocab(token_iterator, min_freq=2, specials=None, special_first=True):
    """Build vocabulary from token iterator"""
    if specials is None:
        specials = []
    
    # Count token frequencies
    counter = Counter()
    for tokens in token_iterator:
        counter.update(tokens)
    
    # Build itos (index to string) list
    itos = []
    if special_first:
        itos.extend(specials)
    
    # Add tokens that meet min_freq threshold
    for token, freq in counter.items():
        if freq >= min_freq and token not in specials:
            itos.append(token)
    
    if not special_first:
        itos.extend(specials)
    
    # Build stoi (string to index) dict
    stoi = {token: idx for idx, token in enumerate(itos)}
    
    return Vocab(stoi, itos)
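
The effect of `min_freq` can be seen on a toy corpus. This is a standalone sketch that repeats the counting logic rather than calling `build_vocab`, so `toy_corpus`, `toy_itos`, and `toy_stoi` are illustrative names only.

```python
from collections import Counter

specials = ["<unk>", "<pad>", "<sos>", "<eos>"]
toy_corpus = [["save", "file"], ["save", "as"], ["open", "file"]]

# Count every token across the corpus.
counter = Counter(tok for sentence in toy_corpus for tok in sentence)

# Keep tokens seen at least twice; the specials always occupy indices 0-3.
min_freq = 2
toy_itos = specials + [tok for tok, freq in counter.items() if freq >= min_freq]
toy_stoi = {tok: idx for idx, tok in enumerate(toy_itos)}

# Rare tokens ("open", "as") fall back to the <unk> index 0.
ids = [toy_stoi.get(tok, 0) for tok in ["open", "file", "as"]]
print(toy_itos)
print(ids)
```

Tokens below the threshold never receive their own index, which keeps the embedding table small at the cost of mapping rare words to `<unk>`.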

In [19]:
from concurrent.futures import ThreadPoolExecutor
import time
from tqdm import tqdm

def build_vocab_en_fast(data):
    """Build English vocab using spaCy's fast pipe() method"""
    print(f"Building vocab for {EN_LANGUAGE} (batch mode)...")
    start = time.time()
    
    # Collect all English texts
    texts = [sample['translation'][EN_LANGUAGE] for sample in tqdm(data, desc="Collecting EN texts")]
    
    # Batch tokenize with spaCy pipe() - MUCH faster!
    counter = Counter()
    for doc in tqdm(spacy_model.pipe(texts, batch_size=1000, n_process=1), 
                    total=len(texts), desc="Tokenizing EN"):
        counter.update([tok.text for tok in doc])
    
    # Build vocab
    itos = list(special_symbols)
    for token, freq in counter.items():
        if freq >= 2 and token not in special_symbols:
            itos.append(token)
    stoi = {token: idx for idx, token in enumerate(itos)}
    
    vocab = Vocab(stoi, itos)
    vocab.set_default_index(UNK_IDX)
    print(f"  {EN_LANGUAGE} done in {time.time() - start:.1f}s, vocab size: {len(vocab)}")
    return vocab

def build_vocab_ne(data):
    """Build Nepali vocab (NLLB tokenizer)"""
    print(f"Building vocab for {NE_LANGUAGE}...")
    start = time.time()
    
    counter = Counter()
    for sample in tqdm(data, desc="Tokenizing NE"):
        tokens = token_transform[NE_LANGUAGE](sample['translation'][NE_LANGUAGE])
        counter.update(tokens)
    
    # Build vocab
    itos = list(special_symbols)
    for token, freq in counter.items():
        if freq >= 2 and token not in special_symbols:
            itos.append(token)
    stoi = {token: idx for idx, token in enumerate(itos)}
    
    vocab = Vocab(stoi, itos)
    vocab.set_default_index(UNK_IDX)
    print(f"  {NE_LANGUAGE} done in {time.time() - start:.1f}s, vocab size: {len(vocab)}")
    return vocab

# Build vocabularies in parallel
start_total = time.time()
with ThreadPoolExecutor(max_workers=2) as executor:
    future_en = executor.submit(build_vocab_en_fast, train)
    future_ne = executor.submit(build_vocab_ne, train)
    
    vocab_transform[EN_LANGUAGE] = future_en.result()
    vocab_transform[NE_LANGUAGE] = future_ne.result()

print(f"\nTotal time: {time.time() - start_total:.1f}s")
print(f"{SRC_LANGUAGE} vocab size: {len(vocab_transform[SRC_LANGUAGE])}")
print(f"{TRG_LANGUAGE} vocab size: {len(vocab_transform[TRG_LANGUAGE])}")

# Warning : Token indices sequence length is longer than the specified maximum sequence length for this model (3645 > 1024). Running this sequence through the model will result in indexing errors
print("""NLLB tokenizer warning about max sequence length may be ignored for vocab building. 
It is only used as a tokenizer here. The 1024 limit applies during model training/inference.""")

Building vocab for en (batch mode)...
Building vocab for ne...


Collecting EN texts: 100%|██████████| 100000/100000 [00:00<00:00, 174785.75it/s]
Tokenizing EN:   2%|▏         | 2001/100000 [00:02<01:25, 1144.59it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (3645 > 1024). Running this sequence through the model will result in indexing errors
Tokenizing NE: 100%|██████████| 100000/100000 [00:03<00:00, 30908.52it/s]
Tokenizing EN:   3%|▎         | 3001/100000 [00:02<01:17, 1247.51it/s]

  ne done in 3.3s, vocab size: 8333


Tokenizing EN: 100%|██████████| 100000/100000 [01:02<00:00, 1593.71it/s]

  en done in 63.3s, vocab size: 15836

Total time: 63.3s
en vocab size: 15836
ne vocab size: 8333
NLLB tokenizer warning about max sequence length may be ignored for vocab building. 
It is only used as a tokenizer here. The 1024 limit applies during model training/inference.





The parallelization did not help much: the Nepali subset is not large and the NLLB tokenizer is fast since it is implemented in Rust, so the Nepali side completed in a few seconds.

The main time-consuming step was English tokenization. Using `pipe()` for batch processing reduced that time by more than 50%: it now takes a little over 1 minute, compared to ~3 minutes without batching.

In [20]:
vocab_transform[SRC_LANGUAGE](['here', 'is', 'a', 'unknownword', 'a'])

[2822, 289, 211, 0, 211]

In [21]:
#we can reverse it....
mapping = vocab_transform[SRC_LANGUAGE].get_itos()

#print 2822, for example
mapping[2822]

'here'

In [22]:
mapping[0]


'<unk>'

In [23]:
#let's try special symbols
mapping[1], mapping[2], mapping[3]

('<pad>', '<sos>', '<eos>')

In [24]:
#check unique vocabularies
len(mapping)

15836

#### Step 5: Prepare data loaders

One thing we change here is the <code>collate_fn</code>, which now also returns the length of each source sentence. This is required for <code>pack_padded_sequence</code>.

In [25]:
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

BATCH_SIZE = 64

# helper function to club together sequential operations
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# function to add SOS/EOS and create tensor for input sequence indices
def tensor_transform(token_ids):
    return torch.cat((torch.tensor([SOS_IDX]), 
                      torch.tensor(token_ids), 
                      torch.tensor([EOS_IDX])))

# src and trg language text transforms to convert raw strings into tensors indices
text_transform = {}
for ln in [SRC_LANGUAGE, TRG_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln], #Tokenization
                                               vocab_transform[ln], #Numericalization
                                               tensor_transform) # Add SOS/EOS and create tensor


# function to collate data samples into batch tensors
def collate_batch(batch):
    src_batch, src_len_batch, trg_batch = [], [], []
    for sample in batch:
        # OPUS-100 format: {'translation': {'en': '...', 'ne': '...'}}
        src_sample = sample['translation'][SRC_LANGUAGE]
        trg_sample = sample['translation'][TRG_LANGUAGE]
        
        processed_text = text_transform[SRC_LANGUAGE](src_sample.rstrip("\n"))
        src_batch.append(processed_text)
        trg_batch.append(text_transform[TRG_LANGUAGE](trg_sample.rstrip("\n")))
        src_len_batch.append(processed_text.size(0))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    trg_batch = pad_sequence(trg_batch, padding_value=PAD_IDX)
    return src_batch, torch.tensor(src_len_batch, dtype=torch.int64), trg_batch
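
The padding and length bookkeeping done by `collate_batch` can be sketched in plain Python. This toy version mirrors the `[seq_len, batch_size]` layout that `pad_sequence` produces with its default `batch_first=False`; the token ids are made up.

```python
PAD_IDX = 1

# Three already-numericalized sentences of different lengths (toy ids).
batch = [[2, 7, 8, 9, 3], [2, 5, 3], [2, 6, 4, 3]]

# Lengths are recorded before padding so the encoder can skip pad positions.
lengths = [len(seq) for seq in batch]
max_len = max(lengths)

# Right-pad every sequence, then transpose to [seq_len, batch_size].
padded = [seq + [PAD_IDX] * (max_len - len(seq)) for seq in batch]
time_major = [list(step) for step in zip(*padded)]

print(lengths)
print(len(time_major), len(time_major[0]))
print(time_major[-1])
```

The last time step contains a real token only for the longest sentence; the other two columns hold `PAD_IDX`, which is exactly what the recorded lengths let the packed encoder ignore.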

Create train, val, and test dataloaders

In [26]:
# Reduce batch size for MPS memory constraints (64 -> 32)
batch_size = 32

train_loader = DataLoader(train, batch_size=batch_size, shuffle=True,  collate_fn=collate_batch)
valid_loader = DataLoader(val,   batch_size=batch_size, shuffle=False, collate_fn=collate_batch)
test_loader  = DataLoader(test,  batch_size=batch_size, shuffle=False, collate_fn=collate_batch)

Let's test the train loader.

In [27]:
for en, _, ne in train_loader:
    break

In [28]:
print("English shape: ", en.shape)  # (seq len, batch_size)
print("Nepali shape: ", ne.shape)   # (seq len, batch_size)

English shape:  torch.Size([15, 32])
Nepali shape:  torch.Size([22, 32])


## Task 2. Experiment with Attention Mechanisms

<strong>Implement a sequence-to-sequence neural network for the translation task.</strong> 

Note: For an in-depth exploration of attention mechanisms, you can refer to this paper$^1$.

$^1$ An Attentive Survey of Attention Models https://arxiv.org/pdf/1904.02874.pdf

Your implementation should include the following attention mechanisms, with their respective equations:

#### Step 6: Design Model :: Seq-to-Seq

In [29]:
class Seq2SeqPackedAttention(nn.Module):
    def __init__(self, encoder, decoder, src_pad_idx, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.device  = device
        
    def create_mask(self, src):
        #src: [src len, batch_size]
        mask = (src == self.src_pad_idx).permute(1, 0)  #permute so that it's the same shape as attention
        #mask: [batch_size, src len] #(0, 0, 0, 0, 0, 1, 1)
        return mask
        
    def forward(self, src, src_len, trg, teacher_forcing_ratio = 0.5):
        #src: [src len, batch_size]
        #trg: [trg len, batch_size]
        
        #read the sizes and pre-allocate tensors for the outputs and attention weights
        batch_size = src.shape[1]
        trg_len    = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        outputs    = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        attentions = torch.zeros(trg_len, batch_size, src.shape[0]).to(self.device)
        
        #send our src text into encoder
        encoder_outputs, hidden = self.encoder(src, src_len)
        #encoder_outputs refer to all hidden states (last layer)
        #hidden refer to the last hidden state (of each layer, of each direction)
        
        input_ = trg[0, :]
        
        mask   = self.create_mask(src) #(0, 0, 0, 0, 0, 1, 1)
        
        #for each of the input of the trg text
        for t in range(1, trg_len):
            #send them to the decoder
            output, hidden, attention = self.decoder(input_, hidden, encoder_outputs, mask)
            #output: [batch_size, output_dim] ==> predictions
            #hidden: [batch_size, hid_dim]
            #attention: [batch_size, src len]
            
            #store the prediction and attention weights for time step t
            outputs[t] = output
            attentions[t] = attention
            
            teacher_force = random.random() < teacher_forcing_ratio
            top1          = output.argmax(1)  #autoregressive
            
            input_ = trg[t] if teacher_force else top1
            
        return outputs, attentions
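
The teacher-forcing decision inside the loop can be isolated in a small framework-free sketch. The helper `choose_inputs` and the token ids are hypothetical; unlike the real `forward`, the predictions here are fixed in advance rather than depending on the previous input, which is a simplification made for illustration.

```python
import random


def choose_inputs(targets, predictions, teacher_forcing_ratio, rng):
    """Pick, per time step, the next decoder input from the ground truth or the model."""
    chosen = []
    for truth, pred in zip(targets, predictions):
        teacher_force = rng.random() < teacher_forcing_ratio
        chosen.append(truth if teacher_force else pred)
    return chosen


targets = [11, 12, 13, 14]       # ground-truth token ids (made up)
predictions = [11, 99, 13, 98]   # hypothetical argmax outputs of a decoder

rng = random.Random(0)
always_truth = choose_inputs(targets, predictions, 1.0, rng)
never_truth = choose_inputs(targets, predictions, 0.0, rng)
mixed = choose_inputs(targets, predictions, 0.5, rng)
print(always_truth, never_truth, mixed)
```

A ratio of 1.0 always feeds the ground truth, a ratio of 0.0 always feeds the model's own prediction, and the default of 0.5 mixes the two per time step.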

#### Step 7: Design Encoder

In [30]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn       = nn.GRU(emb_dim, hid_dim, bidirectional=True)
        self.fc        = nn.Linear(hid_dim * 2, hid_dim)
        self.dropout   = nn.Dropout(dropout)
        
    def forward(self, src, src_len):
        #embedding
        embedded = self.dropout(self.embedding(src))
        #packed
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len.to('cpu'), enforce_sorted=False)
        #rnn
        packed_outputs, hidden = self.rnn(packed_embedded)
        #unpacked
        outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs)
        #-1, -2 hidden state
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim = 1)))
        
        #outputs: [src len, batch_size, hid dim * 2]
        #hidden:  [batch_size, hid_dim]
        
        return outputs, hidden

### 2.1) General Attention: (0.5 points)

$$e_i = s^T h_i \in \mathbb{R} \quad \text{where} \quad d_1 = d_2$$

#### Step 8: Design Attention Mechanisms

In [31]:
class GeneralAttention(nn.Module):
    """
    General (Dot-Product) Attention: e_i = s^T h_i
    Requires projection to match dimensions
    """
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        # Project encoder outputs to match decoder hidden dim
        # enc_hid_dim is already the full encoder output dim (hid_dim * 2 for bidirectional)
        self.proj = nn.Linear(enc_hid_dim, dec_hid_dim)
        
    def forward(self, hidden, encoder_outputs, mask):
        # hidden: [batch_size, dec_hid_dim]
        # encoder_outputs: [src_len, batch_size, enc_hid_dim]
        
        # Project encoder outputs to decoder dimension
        encoder_projected = self.proj(encoder_outputs)  # [src_len, batch_size, dec_hid_dim]
        encoder_projected = encoder_projected.permute(1, 0, 2)  # [batch_size, src_len, dec_hid_dim]
        
        hidden = hidden.unsqueeze(2)  # [batch_size, dec_hid_dim, 1]
        
        # Dot product attention: s^T h
        attention = torch.bmm(encoder_projected, hidden).squeeze(2)  # [batch_size, src_len]
        
        # Mask padding tokens
        attention = attention.masked_fill(mask, -1e10)
        
        return torch.softmax(attention, dim=1)
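
A toy numeric check of the scoring and masking steps, written in plain Python rather than PyTorch. The vectors are made up, the learned projection is omitted (so both states already share dimension 2), and `masked_softmax` is a hypothetical helper that mimics `masked_fill(mask, -1e10)` followed by `softmax`.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores; positions where mask is True get (almost) zero weight."""
    filled = [-1e10 if is_pad else s for s, is_pad in zip(scores, mask)]
    peak = max(filled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]


s = [1.0, 0.0]                             # decoder state
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # encoder states; the last one is padding
mask = [False, False, True]

# e_i = s^T h_i
scores = [sum(a * b for a, b in zip(s, h_i)) for h_i in h]
weights = masked_softmax(scores, mask)
print(scores)
print([round(w, 4) for w in weights])
```

Although the padded position has the same raw score as the first one, masking removes it from the distribution, so the weights over the two real positions still sum to 1.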

### 2.2) Multiplicative Attention: (0.5 points)

$$e_i = s^T W h_i \in \mathbb{R} \quad \text{where} \quad W \in \mathbb{R}^{d_2 \times d_1}$$

In [32]:
class MultiplicativeAttention(nn.Module):
    """
    Multiplicative Attention: e_i = s^T W h_i
    W is a learnable weight matrix
    """
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        # W matrix: maps encoder hidden to decoder hidden space
        # enc_hid_dim is already the full encoder output dim (hid_dim * 2 for bidirectional)
        self.W = nn.Linear(enc_hid_dim, dec_hid_dim, bias=False)
        
    def forward(self, hidden, encoder_outputs, mask):
        # hidden: [batch_size, dec_hid_dim]
        # encoder_outputs: [src_len, batch_size, enc_hid_dim]
        
        # Apply W to encoder outputs
        encoder_transformed = self.W(encoder_outputs)  # [src_len, batch_size, dec_hid_dim]
        encoder_transformed = encoder_transformed.permute(1, 0, 2)  # [batch_size, src_len, dec_hid_dim]
        
        hidden = hidden.unsqueeze(2)  # [batch_size, dec_hid_dim, 1]
        
        # s^T W h
        attention = torch.bmm(encoder_transformed, hidden).squeeze(2)  # [batch_size, src_len]
        
        # Mask padding tokens
        attention = attention.masked_fill(mask, -1e10)
        
        return torch.softmax(attention, dim=1)
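
The role of $W$ is easiest to see when the two dimensions differ. The sketch below uses a decoder state with $d_2 = 2$ and encoder states with $d_1 = 3$; all numbers, and the helpers `matvec` and `dot`, are illustrative assumptions rather than values from the trained model.

```python
# Decoder state s has d2 = 2 dims; encoder states h_i have d1 = 3 dims.
s = [1.0, 2.0]
h = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

# W has shape d2 x d1 (2 x 3); the values are arbitrary illustration numbers.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# e_i = s^T (W h_i): W maps h_i into the decoder's space, then a dot product follows.
scores = [dot(s, matvec(W, h_i)) for h_i in h]
print(scores)
```

Without $W$ the dot product between a 2-dimensional and a 3-dimensional vector would not be defined, which is why the class applies `self.W` to the encoder outputs first.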

### 2.3) Additive Attention: (0.5 points)

$$e_i = v^T \tanh(W_1 h_i + W_2 s) \in \mathbb{R}$$

In [33]:
class AdditiveAttention(nn.Module):
    """
    Additive (Bahdanau) Attention: e_i = v^T tanh(W1 h_i + W2 s)
    Most flexible - different dimensions allowed
    """
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        # enc_hid_dim is already the full encoder output dim (hid_dim * 2 for bidirectional)
        self.W1 = nn.Linear(enc_hid_dim, dec_hid_dim)  # for encoder hidden
        self.W2 = nn.Linear(dec_hid_dim, dec_hid_dim)  # for decoder hidden
        self.v = nn.Linear(dec_hid_dim, 1, bias=False) # to get scalar score
        
    def forward(self, hidden, encoder_outputs, mask):
        # hidden: [batch_size, dec_hid_dim]
        # encoder_outputs: [src_len, batch_size, enc_hid_dim]
        
        src_len = encoder_outputs.shape[0]
        
        # Repeat hidden for each source position
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)  # [batch_size, src_len, dec_hid_dim]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)  # [batch_size, src_len, enc_hid_dim]
        
        # v^T tanh(W1 h + W2 s)
        energy = torch.tanh(self.W1(encoder_outputs) + self.W2(hidden))  # [batch_size, src_len, dec_hid_dim]
        attention = self.v(energy).squeeze(2)  # [batch_size, src_len]
        
        # Mask padding tokens
        attention = attention.masked_fill(mask, -1e10)
        
        return torch.softmax(attention, dim=1)
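
The additive score can be worked through by hand on small vectors. In this sketch the matrices `W1`, `W2` and the vector `v` are made-up values, and the bias terms that `nn.Linear` adds in the class above are left out for brevity.

```python
import math

s = [1.0, -1.0]                            # decoder state (d2 = 2)
h = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]     # encoder states (d1 = 3)

W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]    # maps h_i from 3 to 2 dims
W2 = [[1.0, 0.0], [0.0, 1.0]]              # maps s from 2 to 2 dims (identity here)
v = [1.0, 1.0]                             # reduces the hidden vector to a scalar


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def additive_score(h_i):
    """e_i = v^T tanh(W1 h_i + W2 s)."""
    summed = [a + b for a, b in zip(matvec(W1, h_i), matvec(W2, s))]
    return sum(v_k * math.tanh(z) for v_k, z in zip(v, summed))


scores = [additive_score(h_i) for h_i in h]
print(scores)
```

For the first encoder state the pre-activation is `[2, 0]`, so the score equals `tanh(2)`; for the all-zero state the two `tanh` terms cancel and the score is 0.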

### Design Decoder with Attention

#### Step 9: Design decoder to pass attention

In [34]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention  = attention
        self.embedding  = nn.Embedding(output_dim, emb_dim)
        self.rnn        = nn.GRU((hid_dim * 2) + emb_dim, hid_dim)
        self.fc         = nn.Linear((hid_dim * 2) + hid_dim + emb_dim, output_dim)
        self.dropout    = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs, mask):
        #input: [batch_size]
        #hidden: [batch_size, hid_dim]
        #encoder_outputs: [src len, batch_size, hid_dim * 2]
        #mask: [batch_size, src len]
                
        #embed our input
        input    = input.unsqueeze(0)
        embedded = self.dropout(self.embedding(input))
        #embedded = [1, batch_size, emb_dim]
        
        #calculate the attention
        a = self.attention(hidden, encoder_outputs, mask)
        #a = [batch_size, src len]
        a = a.unsqueeze(1)
        #a = [batch_size, 1, src len]
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        #encoder_outputs: [batch_size, src len, hid_dim * 2]
        weighted = torch.bmm(a, encoder_outputs)
        #weighted: [batch_size, 1, hid_dim * 2]
        weighted = weighted.permute(1, 0, 2)
        #weighted: [1, batch_size, hid_dim * 2]
        
        #send the input to decoder rnn
            #concatenate (embed, weighted encoder_outputs)
            #[1, batch_size, emb_dim]; [1, batch_size, hid_dim * 2]
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        #rnn_input: [1, batch_size, emb_dim + hid_dim * 2]
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
            
        #send the output of the decoder rnn to fc layer to predict the word
            #prediction = fc(concatenate (output, weighted, embed))
        embedded = embedded.squeeze(0)
        output   = output.squeeze(0)
        weighted = weighted.squeeze(0)
        prediction = self.fc(torch.cat((embedded, output, weighted), dim = 1))
        #prediction: [batch_size, output_dim]
            
        return prediction, hidden.squeeze(0), a.squeeze(1)
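
The `torch.bmm` call in `forward` is what turns the attention weights into a context vector. As a minimal sketch with random stand-in tensors (the sizes below are illustrative assumptions, not the notebook's real dimensions), the batched matrix product gives the same result as an explicit weighted sum of the encoder states over source positions:

```python
import torch

torch.manual_seed(0)
batch_size, src_len, hid_dim = 2, 5, 4   # toy sizes, chosen only for this sketch

# Stand-ins for the tensors inside Decoder.forward (random values)
encoder_outputs = torch.randn(src_len, batch_size, hid_dim * 2)
scores = torch.randn(batch_size, src_len)
a = torch.softmax(scores, dim=1)          # [batch_size, src_len], each row sums to 1

a3 = a.unsqueeze(1)                       # [batch_size, 1, src_len]
enc = encoder_outputs.permute(1, 0, 2)    # [batch_size, src_len, hid_dim * 2]
weighted = torch.bmm(a3, enc)             # [batch_size, 1, hid_dim * 2]

# The same context vector written as an explicit weighted sum over source positions
manual = (a.unsqueeze(2) * enc).sum(dim=1, keepdim=True)

print(weighted.shape)
print(torch.allclose(weighted, manual, atol=1e-6))
```

Each row of `a` sums to 1, so `weighted` is a convex combination of the encoder states for that sentence.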

### Training

#### Step 10: Model Training

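We use a simplified version of the weight initialization scheme used in the paper. Here, we initialize all biases to zero and draw all weights from $\mathcal{N}(0, 0.01)$, that is, a normal distribution with mean 0 and standard deviation 0.01 (the code passes `std=0.01`).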


In [35]:
def initialize_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
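
To check what this initializer does, the sketch below applies it to a small stand-in network (a hypothetical `toy` model, not the translation model) and inspects the result. Note that `apply` visits every submodule and `named_parameters` is recursive, so some parameters are re-initialized more than once; this is harmless because each pass draws from the same distribution.

```python
import torch
import torch.nn as nn

def initialize_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)

torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(64, 128), nn.Linear(128, 32))  # stand-in model
toy.apply(initialize_weights)

# Every bias should be exactly zero after initialization
biases_zero = all(
    float(p.abs().sum()) == 0.0
    for n, p in toy.named_parameters() if 'bias' in n
)
# The empirical standard deviation of the weights should be close to 0.01
weight_std = float(toy[0].weight.std())

print(biases_zero, round(weight_std, 4))
```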

Design the model to accept a custom attention mechanism

In [36]:
input_dim   = len(vocab_transform[SRC_LANGUAGE])
output_dim  = len(vocab_transform[TRG_LANGUAGE])
emb_dim     = 256  
hid_dim     = 512  
dropout     = 0.5
SRC_PAD_IDX = PAD_IDX

attention_models = {}

attention_models['general']        = GeneralAttention(hid_dim * 2, hid_dim)
attention_models['multiplicative'] = MultiplicativeAttention(hid_dim * 2, hid_dim)
attention_models['additive']       = AdditiveAttention(hid_dim * 2, hid_dim)


def get_model_with_attention(attention_type='general'):
    attn = attention_models.get(attention_type)
    
    if attn is None:
        raise ValueError(f"Unknown attention type: {attention_type}")
    
    enc  = Encoder(input_dim,  emb_dim,  hid_dim, dropout)
    dec  = Decoder(output_dim, emb_dim,  hid_dim, dropout, attn)

    model = Seq2SeqPackedAttention(enc, dec, SRC_PAD_IDX, device).to(device)
    model.apply(initialize_weights)
    
    return model

# Create one model per attention type; after the loop, `model` refers to the last one built (additive)
for attention_type in attention_models.keys():
    print(f"Creating model with {attention_type} attention...")
    model = get_model_with_attention(attention_type)
    print(model)
    print("\n" + "="*80 + "\n")


Creating model with general attention...
Seq2SeqPackedAttention(
  (encoder): Encoder(
    (embedding): Embedding(15836, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): GeneralAttention(
      (proj): Linear(in_features=1024, out_features=512, bias=True)
    )
    (embedding): Embedding(8333, 256)
    (rnn): GRU(1280, 512)
    (fc): Linear(in_features=1792, out_features=8333, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)


Creating model with multiplicative attention...
Seq2SeqPackedAttention(
  (encoder): Encoder(
    (embedding): Embedding(15836, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): MultiplicativeAttention(
      (W): Linear(in_features=1024, out_features

In [37]:
#we can print the complexity by the number of parameters
def count_parameters(model):
    params = [p.numel() for p in model.parameters() if p.requires_grad]
    for item in params:
        print(f'{item:>6}')
    print(f'______\n{sum(params):>6}')
    
count_parameters(model)

4054016
393216
786432
  1536
  1536
393216
786432
  1536
  1536
524288
   512
524288
   512
262144
   512
   512
2133248
1966080
786432
  1536
  1536
14932736
  8333
______
27562125
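
Several entries in this list can be reproduced by hand from the layer shapes in the printed model summary. The sketch below recomputes the embedding, GRU, and output-layer sizes; the helper `gru_direction_params` is a name introduced here for illustration only.

```python
# Sizes read off the printed model summary (source vocab 15836, target vocab 8333)
src_vocab, trg_vocab = 15836, 8333
emb_dim, hid_dim = 256, 512

def gru_direction_params(input_size, hidden_size):
    """Parameter tensors of one GRU direction: 3 gates, input and hidden weights plus two biases."""
    weight_ih = 3 * hidden_size * input_size
    weight_hh = 3 * hidden_size * hidden_size
    bias_ih = bias_hh = 3 * hidden_size
    return [weight_ih, weight_hh, bias_ih, bias_hh]

encoder_embedding = src_vocab * emb_dim
encoder_gru = gru_direction_params(emb_dim, hid_dim)            # repeated once per direction
decoder_embedding = trg_vocab * emb_dim
decoder_gru = gru_direction_params(hid_dim * 2 + emb_dim, hid_dim)
decoder_fc_weight = (hid_dim * 2 + hid_dim + emb_dim) * trg_vocab

print(encoder_embedding)   # 4054016
print(encoder_gru)         # [393216, 786432, 1536, 1536]
print(decoder_embedding)   # 2133248
print(decoder_gru)         # [1966080, 786432, 1536, 1536]
print(decoder_fc_weight)   # 14932736
```

The output projection alone accounts for more than half of the 27.5M parameters, because it maps a 1792-dimensional vector to the full target vocabulary.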


Our loss function calculates the average loss per token. By passing the index of the `<pad>` token as the `ignore_index` argument, we exclude every position whose target is a padding token from that average.

In [38]:
import torch.optim as optim

lr = 0.001

#training hyperparameters
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX) #combines log-softmax with negative log-likelihood loss
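
To make the effect of `ignore_index` concrete, here is a plain-Python sketch of a masked cross-entropy. It is an illustration of the idea, not PyTorch's implementation; `masked_cross_entropy`, the toy logits, and the padding index `PAD = 1` are all assumptions made for this example.

```python
import math

PAD = 1  # hypothetical padding index for this sketch

def masked_cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over targets that are not ignore_index."""
    total, count = 0.0, 0
    for row, t in zip(logits, targets):
        if t == ignore_index:
            continue                      # padded position: contributes nothing
        m = max(row)                      # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[t]           # -log softmax(row)[t]
        count += 1
    return total / count if count else float('nan')

logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0], [1.0, 1.0, 1.0]]
targets = [0, 2, PAD]                     # the last position is padding

loss_with_pad_ignored = masked_cross_entropy(logits, targets, PAD)
loss_first_two = masked_cross_entropy(logits[:2], targets[:2], ignore_index=-100)

print(loss_with_pad_ignored == loss_first_two)
```

The padded position changes neither the numerator nor the denominator of the mean, so the loss equals the loss computed on the two real tokens alone.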

In [39]:
def train(model, loader, optimizer, criterion, clip, loader_length):
    
    model.train()
    epoch_loss = 0
    
    for i, (src, src_length, trg) in enumerate(loader):
        
        src = src.to(device)
        trg = trg.to(device)
        
        optimizer.zero_grad()
        
        output, attentions = model(src, src_length, trg)
        
        #trg    = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        output_dim = output.shape[-1]
        
        #the loss function only works on 2d inputs with 1d targets thus we need to flatten each of them
        output = output[1:].view(-1, output_dim)
        trg    = trg[1:].view(-1)
        #trg    = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        #clip the gradients to prevent them from exploding (a common issue in RNNs)
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
        # Clear MPS cache periodically to prevent memory buildup
        if device.type == 'mps' and i % 50 == 0:
            torch.mps.empty_cache()
        
    return epoch_loss / loader_length
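
The gradient clipping step in `train` rescales all gradients together when their combined norm exceeds `clip`. The following plain-Python sketch illustrates that global-norm rule on nested lists; `clip_by_global_norm` is a hypothetical helper written for this example and is not part of PyTorch.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale every gradient by the same factor so the global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

# Two toy "parameter gradients"; global norm = sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

# Gradients that are already small are left unchanged
small, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)

print(norm_before, round(norm_after, 4), small)
```

Because every component is multiplied by the same factor, clipping changes the step size but not the direction of the update.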

Our evaluation loop is similar to our training loop. However, because we aren't updating any parameters, we don't need to pass an optimizer or a clip value.

In [40]:
def evaluate(model, loader, criterion, loader_length):
        
    #turn off dropout (and batch norm if used)
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for src, src_length, trg in loader:
        
            src = src.to(device)
            trg = trg.to(device)

            output, attentions = model(src, src_length, trg, 0) #turn off teacher forcing

            #trg    = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg    = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            
            epoch_loss += loss.item()
        
    return epoch_loss / loader_length
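
The fourth argument `0` in the `model(...)` call is the teacher forcing ratio. The forward pass of `Seq2SeqPackedAttention` is not shown in this section, so the sketch below assumes the common rule of feeding the gold token with probability equal to the ratio; `next_decoder_input` is a hypothetical helper used only to illustrate that rule.

```python
import random

def next_decoder_input(gold_token, predicted_token, teacher_forcing_ratio, rng):
    """Pick the token fed to the decoder at the next time step."""
    use_gold = rng.random() < teacher_forcing_ratio
    return gold_token if use_gold else predicted_token

rng = random.Random(0)
gold, predicted = "gold", "pred"

# Ratio 0.0 (evaluation): rng.random() is never below 0, so the model always sees its own prediction
eval_choices = [next_decoder_input(gold, predicted, 0.0, rng) for _ in range(1000)]
# Ratio 0.5 (a typical training value): a mix of gold and predicted tokens
train_choices = [next_decoder_input(gold, predicted, 0.5, rng) for _ in range(1000)]

print(set(eval_choices), sorted(set(train_choices)))
```

Under this assumption, evaluating with a ratio of 0 measures the model in the same conditions it faces at inference time, where no gold target is available.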

Putting everything together

In [41]:
train_loader_length = len(list(iter(train_loader)))
val_loader_length   = len(list(iter(valid_loader)))
test_loader_length  = len(list(iter(test_loader)))

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
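
As a quick usage check, the helper can be called with two made-up timestamps (the values below are hypothetical, chosen to be 125.7 seconds apart):

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

mins, secs = epoch_time(100.0, 225.7)
print(f"{mins}m {secs}s")  # 2m 5s

# divmod on the whole-second count gives the same split for non-negative durations
same = divmod(int(225.7 - 100.0), 60)
```

Fractions of a second are truncated rather than rounded, which is fine for epoch-level logging.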


In [None]:
# Clear MPS cache before training
if device.type == 'mps':
    torch.mps.empty_cache()
    import gc
    gc.collect()

best_valid_loss = float('inf')
num_epochs = 3
clip       = 1

save_path = f'{MODEL_DIRECTORY}/{model.__class__.__name__}.pt'

train_losses = []
valid_losses = []

for epoch in tqdm(range(num_epochs), desc="Epochs"):
    
    start_time = time.time()

    train_loss = train(model, train_loader, optimizer, criterion, clip, train_loader_length)
    valid_loss = evaluate(model, valid_loader, criterion, val_loader_length)
    
    #for plotting
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), save_path)
    
    tqdm.write(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    tqdm.write(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    tqdm.write(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')
    
    # Clear cache after each epoch
    if device.type == 'mps':
        torch.mps.empty_cache()
    
    #lower perplexity is better

Epochs:   0%|          | 0/3 [00:00<?, ?it/s]
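
The training log reports perplexity as `math.exp(loss)`. As a small worked example for interpreting those numbers: a model that spreads its probability uniformly over the 8333 target-vocabulary entries has a per-token loss of `log(8333)` and therefore a perplexity of 8333, while a perfect model has a loss of 0 and a perplexity of 1. The `perplexity` helper below is introduced here for illustration.

```python
import math

def perplexity(mean_token_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_token_loss)

trg_vocab = 8333                        # target vocabulary size from the model summary
uniform_loss = math.log(trg_vocab)      # loss of a uniform guess over the vocabulary

print(round(uniform_loss, 3))           # 9.028
print(round(perplexity(uniform_loss)))  # 8333
print(perplexity(0.0))                  # 1.0
```

A validation perplexity well below the vocabulary size therefore indicates the model has learned something beyond uniform guessing.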

## Task 3. Evaluation and Verification - For the final evaluation and verification, perform the following:

### 3.1) Compare the performance of these attention mechanisms in terms of translation accuracy, computational efficiency, and other relevant metrics. (1 points)

### 3.2) Provide performance plots showing training and validation loss for each type of attention mechanism (General, Multiplicative, and Additive). These plots will help in visualizing and comparing the learning curves of different attention models. (0.5 points)

### 3.3) Display the attention maps generated by your model. Attention maps are crucial for understanding how the model focuses on different parts of the input sequence while generating the translation. This visualization will offer insights into the interpretability of your model. (0.5 points)

### 3.4) Analyze the results and discuss the effectiveness of the selected attention mechanism in translating between your native language and English. (0.5 points)

Note: Also add the performance table and plots to the README.md on GitHub.

## Task 4. Machine Translation - Web Application Development - Develop a simple web application that showcases the capabilities of your language model in machine translation. (2 points)

1) The application should feature an input box where users can enter a sentence or phrase in a source language.
2) Based on the input, the model should generate and display the translated version in a target language. For example, if the input is "Hello, how are you?" in English, the model might generate "Hola, ¿cómo estás?" in Spanish.
3) Provide documentation on how the web application interfaces with the language model for machine translation.

Note: Choose the most effective attention mechanism based on your experiments in Task 2.
As always, the example Dash Project in the GitHub repository contains an example that you can follow (if you use the Dash framework).

### Save Model and Vocabularies for Flask App

Note: The UI was polished with AI-assisted ("vibe") coding.

In [None]:
# Save vocabularies for Flask app
import os

save_dir = "models"
os.makedirs(save_dir, exist_ok=True)

vocab_data = {
    'en_stoi': vocab_transform[EN_LANGUAGE].stoi,
    'en_itos': vocab_transform[EN_LANGUAGE].itos,
    'ne_stoi': vocab_transform[NE_LANGUAGE].stoi,
    'ne_itos': vocab_transform[NE_LANGUAGE].itos,
}

torch.save(vocab_data, os.path.join(save_dir, "vocabs.pt"))
print(f"Vocabularies saved to {save_dir}/vocabs.pt")
print(f"  EN vocab size: {len(vocab_data['en_itos'])}")
print(f"  NE vocab size: {len(vocab_data['ne_itos'])}")
print(f"\nModel saved to: {save_path}")
print("\nTo run the Flask app:")