# Summer of Code - Artificial Intelligence

## Week 11: Deep Learning

### Day 01: Machine Translation

In this notebook, we will explore **Machine Translation** using **Attention Mechanism** in PyTorch.


# Neural Machine Translation

Neural Machine Translation (NMT) is another NLP task where we translate text from one language to another using neural networks.


## Encoder-Decoder Architecture

The most common architecture for NMT is the Encoder-Decoder architecture. The encoder processes the input sentence and encodes it into a fixed-length context vector, which is then used by the decoder to generate the translated sentence.

<img src="images/encoder_decoder.png" alt="Encoder-Decoder Architecture" width="600"/>


### English to Urdu Translation Dataset


In [None]:
from zipfile import ZipFile
from urllib.request import urlopen, Request


url = "https://www.manythings.org/anki/urd-eng.zip"
req = Request(
    url,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    },
)
with urlopen(req) as response:
    with open("urd-eng.zip", "wb") as f:
        f.write(response.read())
with ZipFile("urd-eng.zip", "r") as zip_ref:
    zip_ref.extractall("./eng-urd")
print("Dataset downloaded and extracted.")

In [132]:
with open("./eng-urd/urd.txt", "r", encoding="utf-8") as f:
    data = f.readlines()

data[:5]

['Hi.\tسلام۔\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #9020897 (nusrat)\n',
 'Help!\tمدد۔\tCC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #1462368 (nabeel_tahir)\n',
 'Thanks.\tشکریہ۔\tCC-BY 2.0 (France) Attribution: tatoeba.org #2057650 (nava) & #9020893 (nusrat)\n',
 'We won.\tہم جیت گئے۔\tCC-BY 2.0 (France) Attribution: tatoeba.org #2107675 (CK) & #2123755 (nabeel_tahir)\n',
 'Beat it.\tبھاگ جائو۔\tCC-BY 2.0 (France) Attribution: tatoeba.org #37902 (CM) & #1610833 (nabeel_tahir)\n']

In [133]:
source_target = [line.strip().split("\t")[:2] for line in data]
source_target[:3]

[['Hi.', 'سلام۔'], ['Help!', 'مدد۔'], ['Thanks.', 'شکریہ۔']]

In [134]:
import numpy as np


np.random.shuffle(source_target)
source_target[:3]

[["I'll be sixteen on my next birthday.",
  'میں اپنی اگلی سالگرہ پر سولہ سال کا ہو جائو گا۔'],
 ["You shouldn't eat here.", 'تمہیں یہاں نہیں کھانا چاہئیے۔'],
 ['They became citizens of Japan.', 'انہوں نے جاپانی شہریت حاصل کرلی۔']]

In [135]:
source_sentences, target_sentences = zip(*source_target)
for i in range(3):
    print(f"{source_sentences[i]} => {target_sentences[i]}")

I'll be sixteen on my next birthday. => میں اپنی اگلی سالگرہ پر سولہ سال کا ہو جائو گا۔
You shouldn't eat here. => تمہیں یہاں نہیں کھانا چاہئیے۔
They became citizens of Japan. => انہوں نے جاپانی شہریت حاصل کرلی۔


### Build Vocabulary


In [138]:
def tokenize(sentences):
    return [s.lower().split() for s in sentences]

In [139]:
source_tokens = tokenize(source_sentences)
target_tokens = tokenize(target_sentences)

source_tokens[:3], target_tokens[:3]

([["i'll", 'be', 'sixteen', 'on', 'my', 'next', 'birthday.'],
  ['you', "shouldn't", 'eat', 'here.'],
  ['they', 'became', 'citizens', 'of', 'japan.']],
 [['میں',
   'اپنی',
   'اگلی',
   'سالگرہ',
   'پر',
   'سولہ',
   'سال',
   'کا',
   'ہو',
   'جائو',
   'گا۔'],
  ['تمہیں', 'یہاں', 'نہیں', 'کھانا', 'چاہئیے۔'],
  ['انہوں', 'نے', 'جاپانی', 'شہریت', 'حاصل', 'کرلی۔']])

In [140]:
from collections import Counter


def build_vocabulary(tokenized_sentences, source=True, max_vocab_size=1000):
    word_counts = Counter()
    for s in tokenized_sentences:
        word_counts.update(s)

    # Create vocabulary with special tokens
    if source:
        vocab = {"<PAD>": 0, "<UNK>": 1}
        most_common = word_counts.most_common(max_vocab_size - 2)
    else:
        vocab = {"<PAD>": 0, "<UNK>": 1, "<SOS>": 2, "<EOS>": 3}
        most_common = word_counts.most_common(max_vocab_size - 4)
    # Add most common words
    for word, _ in most_common:
        vocab[word] = len(vocab)

    return vocab, word_counts


source_vocab, source_word_counts = build_vocabulary(source_tokens, max_vocab_size=1500)
target_vocab, target_word_counts = build_vocabulary(
    target_tokens, source=False, max_vocab_size=1500
)
print("English Vocabulary Size:", len(source_vocab))
print("Urdu Vocabulary Size:", len(target_vocab))
print("Most common English words:", source_word_counts.most_common(5))
print("Most common Urdu words:", target_word_counts.most_common(5))

English Vocabulary Size: 1500
Urdu Vocabulary Size: 1500
Most common English words: [('i', 284), ('the', 265), ('to', 223), ('you', 195), ('a', 167)]
Most common Urdu words: [('میں', 376), ('ہے۔', 324), ('نے', 182), ('اس', 155), ('وہ', 149)]


In [141]:
print(target_vocab)

{'<PAD>': 0, '<UNK>': 1, '<SOS>': 2, '<EOS>': 3, 'میں': 4, 'ہے۔': 5, 'نے': 6, 'اس': 7, 'وہ': 8, 'کے': 9, 'کو': 10, 'نہیں': 11, 'کی': 12, 'سے': 13, 'ٹام': 14, 'مجھے': 15, 'تم': 16, 'کیا': 17, 'کہ': 18, 'ہوں۔': 19, 'کا': 20, 'ہو': 21, 'ہے': 22, 'یہ': 23, 'آپ': 24, 'ایک': 25, 'کر': 26, 'ہے؟': 27, 'تھا۔': 28, 'میرے': 29, 'اپنی': 30, 'ہیں۔': 31, 'بہت': 32, 'اور': 33, 'رہا': 34, 'گا۔': 35, 'کافی': 36, 'پہ': 37, 'میری': 38, 'کچھ': 39, 'بھی': 40, 'اسے': 41, 'ہم': 42, 'گھر': 43, 'ہو؟': 44, 'گیا': 45, 'مریم': 46, 'تمہیں': 47, 'نہ': 48, 'زیادہ': 49, 'ہوا': 50, 'تھی۔': 51, 'ہو۔': 52, 'ہی': 53, 'کرنا': 54, 'سکتے': 55, 'رہے': 56, 'رہی': 57, 'کرنے': 58, 'اپنے': 59, 'میرا': 60, 'سال': 61, 'وقت': 62, 'گیا۔': 63, 'پاس': 64, 'گی۔': 65, 'پسند': 66, 'تک': 67, 'جا': 68, 'تھا': 69, 'کوئی': 70, 'لئیے': 71, 'ابھی': 72, 'آج': 73, 'تو': 74, 'کبھی': 75, 'پر': 76, 'یہاں': 77, 'ساتھ': 78, 'تمھیں': 79, 'آ': 80, 'جلدی': 81, '۔': 82, 'کسی': 83, 'سب': 84, 'گئے': 85, 'گاڑی': 86, 'جائو': 87, 'جب': 88, 'ٹھیک': 89, 'کرتا':

In [142]:
train_size = int(0.9 * len(source_tokens))

X_train_tokens = list(source_tokens[:train_size])
X_val_tokens = list(source_tokens[train_size:])

X_train_dec_tokens = [
    ["<SOS>"] + sentence.copy()
    for sentence in target_tokens[:train_size]
]
X_val_dec_tokens = [
    ["<SOS>"] + sentence.copy()
    for sentence in target_tokens[train_size:]
]

y_train_tokens = [
    sentence.copy() + ["<EOS>"]
    for sentence in target_tokens[:train_size]
]
y_val_tokens = [
    sentence.copy() + ["<EOS>"]
    for sentence in target_tokens[train_size:]
]

print("X_train sample:", X_train_tokens[0])
print("X_train_dec sample:", X_train_dec_tokens[0])
print("y_train sample:", y_train_tokens[0])

X_train sample: ["i'll", 'be', 'sixteen', 'on', 'my', 'next', 'birthday.']
X_train_dec sample: ['<SOS>', 'میں', 'اپنی', 'اگلی', 'سالگرہ', 'پر', 'سولہ', 'سال', 'کا', 'ہو', 'جائو', 'گا۔']
y_train sample: ['میں', 'اپنی', 'اگلی', 'سالگرہ', 'پر', 'سولہ', 'سال', 'کا', 'ہو', 'جائو', 'گا۔', '<EOS>']


In [143]:
def pad_tokens(sentence_tokens, max_length=15):
    padded_tokens = []
    for tokens in sentence_tokens:
        if len(tokens) > max_length:
            tokens = tokens[:max_length]
        else:
            tokens = tokens + ["<PAD>"] * (max_length - len(tokens))
        padded_tokens.append(tokens)
    return padded_tokens

In [144]:
pad_tokens(y_train_tokens[:2])

[['میں',
  'اپنی',
  'اگلی',
  'سالگرہ',
  'پر',
  'سولہ',
  'سال',
  'کا',
  'ہو',
  'جائو',
  'گا۔',
  '<EOS>',
  '<PAD>',
  '<PAD>',
  '<PAD>'],
 ['تمہیں',
  'یہاں',
  'نہیں',
  'کھانا',
  'چاہئیے۔',
  '<EOS>',
  '<PAD>',
  '<PAD>',
  '<PAD>',
  '<PAD>',
  '<PAD>',
  '<PAD>',
  '<PAD>',
  '<PAD>',
  '<PAD>']]

In [145]:
def to_sequence(sentence_tokens, vocab, max_length=15):
    padded_tokens = pad_tokens(sentence_tokens, max_length)
    sequences = []
    for tokens in padded_tokens:
        sequence = [vocab.get(token, vocab["<UNK>"]) for token in tokens]
        sequences.append(sequence)
    return sequences


sample_text = ["I am happy", "This is a test sentence"]
sample_tokens = tokenize(sample_text)
sample_sequence = to_sequence(sample_tokens, source_vocab, max_length=10)
print("Sample text:", sample_sequence)

Sample text: [[2, 41, 1454, 0, 0, 0, 0, 0, 0, 0], [15, 8, 6, 1, 1, 0, 0, 0, 0, 0]]


In [146]:
X_train = to_sequence(X_train_tokens, source_vocab, max_length=15)
X_val = to_sequence(X_val_tokens, source_vocab, max_length=15)

X_train_dec = to_sequence(X_train_dec_tokens, target_vocab, max_length=15)
X_val_dec = to_sequence(X_val_dec_tokens, target_vocab, max_length=15)

y_train = to_sequence(y_train_tokens, target_vocab, max_length=15)
y_val = to_sequence(y_val_tokens, target_vocab, max_length=15)


print("X_train sample:", X_train[0])
print("y_train_dec sample:", X_train_dec[0])
print("y_train sample:", y_train[0])

X_train sample: [38, 18, 241, 29, 10, 140, 242, 0, 0, 0, 0, 0, 0, 0, 0]
y_train_dec sample: [2, 4, 30, 391, 298, 76, 299, 61, 20, 21, 87, 35, 0, 0, 0]
y_train sample: [4, 30, 391, 298, 76, 299, 61, 20, 21, 87, 35, 3, 0, 0, 0]


In [147]:
import torch
from torch.utils.data import TensorDataset

train_data = TensorDataset(
    torch.tensor(X_train, dtype=torch.long),
    torch.tensor(X_train_dec, dtype=torch.long),
    torch.tensor(y_train, dtype=torch.long),
)

val_data = TensorDataset(
    torch.tensor(X_val, dtype=torch.long),
    torch.tensor(X_val_dec, dtype=torch.long),
    torch.tensor(y_val, dtype=torch.long),
)
print("Number of training samples:", len(train_data))
print("Number of validation samples:", len(val_data))

Number of training samples: 1034
Number of validation samples: 115


In [148]:
from torch.utils.data import DataLoader


batch_size = 16
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_data, batch_size=batch_size, shuffle=False)

In [149]:
import torch.nn as nn
import torch.nn.functional as F

In [150]:
class BahdanauAttention(nn.Module):
    """Additive attention mechanism (Bahdanau et al., 2015)"""

    def __init__(self, hidden_size):
        super(BahdanauAttention, self).__init__()
        self.hidden_size = hidden_size
        # Linear layers for computing attention scores
        self.W_decoder = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_encoder = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, decoder_hidden, encoder_outputs):
        """
        Args:
            decoder_hidden: (num_layers, batch, hidden_size) - current decoder hidden state
            encoder_outputs: (batch, seq_len, hidden_size) - all encoder outputs
        Returns:
            context: (batch, 1, hidden_size) - weighted sum of encoder outputs
            attention_weights: (batch, seq_len) - attention distribution
        """
        # Take only the last layer's hidden state
        decoder_hidden = decoder_hidden[-1].unsqueeze(1)  # (batch, 1, hidden_size)

        # Compute attention scores
        # decoder_hidden: (batch, 1, hidden_size) -> (batch, seq_len, hidden_size)
        decoder_hidden = decoder_hidden.repeat(1, encoder_outputs.size(1), 1)

        # Additive attention: score = v^T * tanh(W_d * h_d + W_e * h_e)
        energy = torch.tanh(
            self.W_decoder(decoder_hidden) + self.W_encoder(encoder_outputs)
        )  # (batch, seq_len, hidden_size)

        attention_scores = self.v(energy).squeeze(2)  # (batch, seq_len)

        # Normalize to get attention weights
        attention_weights = F.softmax(attention_scores, dim=1)  # (batch, seq_len)

        # Compute context vector as weighted sum
        context = torch.bmm(
            attention_weights.unsqueeze(1), encoder_outputs
        )  # (batch, 1, hidden_size)

        return context, attention_weights


class LuongAttention(nn.Module):
    """Multiplicative attention mechanism (Luong et al., 2015)"""

    def __init__(self, hidden_size, attention_type="dot"):
        super(LuongAttention, self).__init__()
        self.hidden_size = hidden_size
        self.attention_type = attention_type

        if attention_type == "general":
            self.W = nn.Linear(hidden_size, hidden_size, bias=False)
        elif attention_type == "concat":
            self.W = nn.Linear(hidden_size * 2, hidden_size, bias=False)
            self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, decoder_hidden, encoder_outputs):
        """
        Args:
            decoder_hidden: (num_layers, batch, hidden_size)
            encoder_outputs: (batch, seq_len, hidden_size)
        Returns:
            context: (batch, 1, hidden_size)
            attention_weights: (batch, seq_len)
        """
        # Take only the last layer's hidden state
        decoder_hidden = decoder_hidden[-1]  # (batch, hidden_size)

        if self.attention_type == "dot":
            # Dot product: score = h_d^T * h_e
            attention_scores = torch.bmm(
                decoder_hidden.unsqueeze(1), encoder_outputs.transpose(1, 2)
            ).squeeze(
                1
            )  # (batch, seq_len)

        elif self.attention_type == "general":
            # General: score = h_d^T * W * h_e
            transformed_encoder = self.W(
                encoder_outputs
            )  # (batch, seq_len, hidden_size)
            attention_scores = torch.bmm(
                decoder_hidden.unsqueeze(1), transformed_encoder.transpose(1, 2)
            ).squeeze(
                1
            )  # (batch, seq_len)

        elif self.attention_type == "concat":
            # Concat: score = v^T * tanh(W * [h_d; h_e])
            decoder_hidden_expanded = decoder_hidden.unsqueeze(1).repeat(
                1, encoder_outputs.size(1), 1
            )  # (batch, seq_len, hidden_size)
            combined = torch.cat(
                [decoder_hidden_expanded, encoder_outputs], dim=2
            )  # (batch, seq_len, 2*hidden_size)
            energy = torch.tanh(self.W(combined))  # (batch, seq_len, hidden_size)
            attention_scores = self.v(energy).squeeze(2)  # (batch, seq_len)

        # Normalize to get attention weights
        attention_weights = F.softmax(attention_scores, dim=1)  # (batch, seq_len)

        # Compute context vector
        context = torch.bmm(
            attention_weights.unsqueeze(1), encoder_outputs
        )  # (batch, 1, hidden_size)

        return context, attention_weights


In [152]:
class Encoder(nn.Module):
    def __init__(
        self,
        input_vocab_size,
        embed_size,
        hidden_size,
        num_layers=2,
        bidirectional=False,
    ):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_vocab_size, embed_size, padding_idx=0)
        self.bidirectional = bidirectional
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        self.gru = nn.GRU(
            embed_size,
            hidden_size,
            num_layers,
            batch_first=True,
            bidirectional=bidirectional,
            dropout=0.5 if num_layers > 1 else 0,
        )
        self.embed_dropout = nn.Dropout(0.3)
        if bidirectional:
            # Project concatenated bidirectional hidden states to decoder size
            self.hidden_projection = nn.Linear(hidden_size * 2, hidden_size)
            # Project bidirectional outputs to match decoder hidden size
            self.output_projection = nn.Linear(hidden_size * 2, hidden_size)

    def forward(self, x):
        embedded = self.embed_dropout(self.embedding(x))
        outputs, hidden = self.gru(embedded)

        if self.bidirectional:
            # Project outputs: (batch, seq_len, hidden_size * 2) -> (batch, seq_len, hidden_size)
            outputs = self.output_projection(outputs)

            # Reshape: (num_layers * 2, batch, hidden) -> (num_layers, 2, batch, hidden)
            hidden = hidden.view(self.num_layers, 2, -1, self.hidden_size)
            # Concatenate forward and backward
            hidden = torch.cat([hidden[:, 0, :, :], hidden[:, 1, :, :]], dim=2)
            # Project to decoder size
            hidden = self.hidden_projection(hidden)

        return outputs, hidden


class Decoder(nn.Module):
    def __init__(
        self,
        output_vocab_size,
        embed_size,
        hidden_size,
        attention_type=None,
        num_layers=2,
    ):
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.attention_type = attention_type
        self.embedding = nn.Embedding(output_vocab_size, embed_size, padding_idx=0)
        self.embed_dropout = nn.Dropout(0.3)

        # Attention mechanism (only if attention_type is specified)
        if attention_type is not None:
            if attention_type == "bahdanau":
                self.attention = BahdanauAttention(hidden_size)
            else:  # luong (dot, general, or concat)
                self.attention = LuongAttention(hidden_size, attention_type)

            # GRU with combined input (embedding + context)
            gru_input_size = embed_size + hidden_size
            # Output projection layer for attention
            self.concat_layer = nn.Linear(hidden_size * 2, hidden_size)
        else:
            # Standard RNN without attention
            gru_input_size = embed_size
            self.concat_layer = None

        self.gru = nn.GRU(
            gru_input_size,
            hidden_size,
            num_layers,
            batch_first=True,
            dropout=0.5 if num_layers > 1 else 0,
        )

        self.dropout = nn.Dropout(0.3)

    def forward(self, x, hidden, encoder_outputs=None):
        """
        Args:
            x: (batch, seq_len) - decoder input tokens
            hidden: (num_layers, batch, hidden_size) - initial hidden state
            encoder_outputs: (batch, src_seq_len, hidden_size) - encoder outputs (MUST be hidden_size, not bidirectional size)
        Returns:
            outputs: (batch, seq_len, hidden_size) - decoder outputs
            hidden: (num_layers, batch, hidden_size) - final hidden state
            attention_weights: (batch, seq_len, src_seq_len) or None - attention weights
        """
        embedded = self.embed_dropout(self.embedding(x))  # (batch, seq_len, embed_size)

        # No attention - standard RNN decoder
        if self.attention_type is None:
            outputs, hidden = self.gru(embedded, hidden)
            outputs = self.dropout(outputs)
            return outputs, hidden, None

        # With attention
        batch_size = x.size(0)
        seq_len = x.size(1)

        outputs = []
        all_attention_weights = []

        # Process each time step
        for t in range(seq_len):
            # Get embedding for current time step
            embed_t = embedded[:, t : t + 1, :]  # (batch, 1, embed_size)

            # Compute attention context
            context, attention_weights = self.attention(hidden, encoder_outputs)
            all_attention_weights.append(attention_weights)

            # Concatenate embedding and context
            gru_input = torch.cat(
                [embed_t, context], dim=2
            )  # (batch, 1, embed_size + hidden_size)

            # Pass through GRU
            output, hidden = self.gru(
                gru_input, hidden
            )  # output: (batch, 1, hidden_size)

            # Combine GRU output with context (Luong's approach)
            combined = torch.cat([output, context], dim=2)  # (batch, 1, 2*hidden_size)
            output = torch.tanh(self.concat_layer(combined))  # (batch, 1, hidden_size)
            output = self.dropout(output)

            outputs.append(output)

        # Concatenate all outputs
        outputs = torch.cat(outputs, dim=1)  # (batch, seq_len, hidden_size)
        attention_weights = torch.stack(
            all_attention_weights, dim=1
        )  # (batch, seq_len, src_seq_len)

        return outputs, hidden, attention_weights


class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.fc = nn.Linear(decoder.hidden_size, len(target_vocab))
        self.dropout = nn.Dropout(0.3)

    def forward(self, source, target):
        # Get encoder outputs and hidden state
        encoder_outputs, encoder_hidden = self.encoder(source)

        # Pass encoder outputs to decoder if attention is used
        if self.decoder.attention_type is not None:
            decoder_outputs, decoder_hidden, attention_weights = self.decoder(
                target, encoder_hidden, encoder_outputs
            )
        else:
            # No attention - only pass hidden state
            decoder_outputs, decoder_hidden, attention_weights = self.decoder(
                target, encoder_hidden, encoder_outputs=None
            )

        decoder_outputs = self.dropout(decoder_outputs)
        return self.fc(decoder_outputs), attention_weights

# Training and Evaluation


In [153]:
import tqdm


def train_epoch(model, dataloader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    progress_bar = tqdm.tqdm(dataloader, desc="Training")
    for enc_inputs, dec_inputs, targets in progress_bar:
        # Move data to device
        enc_inputs, dec_inputs, targets = (
            enc_inputs.to(device),
            dec_inputs.to(device),
            targets.to(device),
        )

        outputs, _ = model(enc_inputs, dec_inputs)
        outputs = outputs.reshape(-1, outputs.size(-1))
        targets = targets.reshape(-1)

        loss = criterion(outputs, targets)
        optimizer.zero_grad()
        loss.backward()

        optimizer.step()

        # Calculate accuracy
        _, predicted = torch.max(outputs.data, 1)
        non_pad_mask = targets != 0  # Create mask for non-padding tokens
        total += non_pad_mask.sum().item()
        correct += ((predicted == targets) & non_pad_mask).sum().item()
        total_loss += loss.item()

        # Update progress bar
        progress_bar.set_postfix(
            {"loss": f"{loss.item():.4f}", "acc": f"{100 * correct / total:.2f}%"}
        )

    avg_loss = total_loss / len(dataloader)
    accuracy = 100 * correct / total

    return avg_loss, accuracy

## Step 11: Validation Function

The validation function evaluates the model without updating weights. This helps us monitor overfitting and select the best model.


In [154]:
def validate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    progress_bar = tqdm.tqdm(dataloader, desc="Validating")
    with torch.no_grad():

        for enc_inputs, dec_inputs, targets in progress_bar:
            # Move data to device
            enc_inputs, dec_inputs, targets = (
                enc_inputs.to(device),
                dec_inputs.to(device),
                targets.to(device),
            )

            outputs, _ = model(enc_inputs, dec_inputs)
            outputs = outputs.reshape(-1, outputs.size(-1))
            targets = targets.reshape(-1)

            loss = criterion(outputs, targets)

            # Calculate accuracy
            _, predicted = torch.max(outputs.data, 1)
            non_pad_mask = targets != 0  # Create mask for non-padding tokens
            total += non_pad_mask.sum().item()
            correct += ((predicted == targets) & non_pad_mask).sum().item()
            total_loss += loss.item()

            # Update progress bar with current accuracy
            progress_bar.set_postfix(
                {"loss": f"{loss.item():.4f}", "acc": f"{100 * correct / total:.2f}%"}
            )

    avg_loss = total_loss / len(dataloader)
    accuracy = 100 * correct / total

    return avg_loss, accuracy

In [168]:
# Instantiate the models
embed_size = 64
hidden_size = 256
attention_type = 'general'
num_layers = 1
encoder = Encoder(
    len(source_vocab), embed_size, hidden_size, num_layers, bidirectional=False
)
decoder = Decoder(
    len(target_vocab), embed_size, hidden_size, attention_type, num_layers
)

model = EncoderDecoder(encoder, decoder)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model

EncoderDecoder(
  (encoder): Encoder(
    (embedding): Embedding(1500, 64, padding_idx=0)
    (gru): GRU(64, 256, batch_first=True)
    (embed_dropout): Dropout(p=0.3, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(1500, 64, padding_idx=0)
    (embed_dropout): Dropout(p=0.3, inplace=False)
    (attention): LuongAttention(
      (W): Linear(in_features=256, out_features=256, bias=False)
    )
    (concat_layer): Linear(in_features=512, out_features=256, bias=True)
    (gru): GRU(320, 256, batch_first=True)
    (dropout): Dropout(p=0.3, inplace=False)
  )
  (fc): Linear(in_features=256, out_features=1500, bias=True)
  (dropout): Dropout(p=0.3, inplace=False)
)

In [169]:
from torchinfo import summary

input_sample = torch.zeros((1, 15), dtype=torch.long).to(device)
summary(model, input_data=(input_sample, input_sample))

Layer (type:depth-idx)                   Output Shape              Param #
EncoderDecoder                           [1, 15, 1500]             --
├─Encoder: 1-1                           [1, 15, 256]              --
│    └─Embedding: 2-1                    [1, 15, 64]               96,000
│    └─Dropout: 2-2                      [1, 15, 64]               --
│    └─GRU: 2-3                          [1, 15, 256]              247,296
├─Decoder: 1-2                           [1, 15, 256]              --
│    └─Embedding: 2-4                    [1, 15, 64]               96,000
│    └─Dropout: 2-5                      [1, 15, 64]               --
│    └─LuongAttention: 2-6               [1, 1, 256]               --
│    │    └─Linear: 3-1                  [1, 15, 256]              65,536
│    └─GRU: 2-7                          [1, 1, 256]               443,904
│    └─Linear: 2-8                       [1, 1, 256]               131,328
│    └─Dropout: 2-9                      [1, 1, 256]      

In [170]:
import torch.optim as optim


# Loss function (ignore padding tokens)
criterion = nn.CrossEntropyLoss(ignore_index=0)

# Optimizer
learning_rate = 0.001
optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5
)

print(f"Number of training batches: {len(train_loader)}")
print(f"Number of validation batches: {len(val_loader)}")

Number of training batches: 65
Number of validation batches: 8


In [171]:
num_epochs = 100
best_val_acc = 0.0
patience = 10
patience_counter = 0

# Track training history
train_losses = []
train_accs = []
val_losses = []
val_accs = []

print("Starting training...")
print(f"Device: {device}")
print(f"Number of epochs: {num_epochs}\n")

for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    # Train
    train_loss, train_acc = train_epoch(
        model, train_loader, criterion, optimizer, device
    )
    # Validate
    val_loss, val_acc = validate(model, val_loader, criterion, device)

    # Track history
    train_losses.append(train_loss)
    train_accs.append(train_acc)
    val_losses.append(val_loss)
    val_accs.append(val_acc)
    
    scheduler.step(val_loss)

    # Save best model
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        patience_counter = 0
        torch.save(model.state_dict(), "nmt_model.pth")
        print(f"  ✓ Saved best model (Val Acc: {val_acc:.2f}%)")
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping triggered after {epoch + 1} epochs")
            break

print("Training complete!")
print(f"Best validation accuracy: {best_val_acc:.2f}%")

Starting training...
Device: cuda
Number of epochs: 100

Epoch 1/100


Training: 100%|██████████| 65/65 [00:09<00:00,  6.82it/s, loss=5.4506, acc=12.50%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 19.26it/s, loss=4.7254, acc=19.02%]


  ✓ Saved best model (Val Acc: 19.02%)
Epoch 2/100


Training: 100%|██████████| 65/65 [00:10<00:00,  6.10it/s, loss=5.2321, acc=16.15%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 23.79it/s, loss=4.3902, acc=21.30%]


  ✓ Saved best model (Val Acc: 21.30%)
Epoch 3/100


Training: 100%|██████████| 65/65 [00:09<00:00,  7.13it/s, loss=5.1133, acc=18.25%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 17.92it/s, loss=4.1457, acc=22.61%]


  ✓ Saved best model (Val Acc: 22.61%)
Epoch 4/100


Training: 100%|██████████| 65/65 [00:09<00:00,  7.19it/s, loss=4.6413, acc=21.08%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 21.26it/s, loss=3.7349, acc=26.30%]


  ✓ Saved best model (Val Acc: 26.30%)
Epoch 5/100


Training: 100%|██████████| 65/65 [00:09<00:00,  6.63it/s, loss=4.3858, acc=23.53%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 22.14it/s, loss=3.4344, acc=26.41%]


  ✓ Saved best model (Val Acc: 26.41%)
Epoch 6/100


Training: 100%|██████████| 65/65 [00:10<00:00,  6.00it/s, loss=4.7023, acc=25.17%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 14.77it/s, loss=3.2521, acc=29.35%]


  ✓ Saved best model (Val Acc: 29.35%)
Epoch 7/100


Training: 100%|██████████| 65/65 [00:10<00:00,  6.28it/s, loss=3.8844, acc=26.99%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 21.54it/s, loss=3.2368, acc=31.09%]


  ✓ Saved best model (Val Acc: 31.09%)
Epoch 8/100


Training: 100%|██████████| 65/65 [00:10<00:00,  5.93it/s, loss=3.7879, acc=28.97%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 20.21it/s, loss=2.9183, acc=32.17%]


  ✓ Saved best model (Val Acc: 32.17%)
Epoch 9/100


Training: 100%|██████████| 65/65 [00:10<00:00,  6.18it/s, loss=3.9889, acc=31.14%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 20.03it/s, loss=2.7032, acc=35.11%]


  ✓ Saved best model (Val Acc: 35.11%)
Epoch 10/100


Training: 100%|██████████| 65/65 [00:10<00:00,  6.29it/s, loss=3.9767, acc=33.53%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 21.97it/s, loss=2.7368, acc=33.91%]


Epoch 11/100


Training: 100%|██████████| 65/65 [00:09<00:00,  7.00it/s, loss=3.2743, acc=35.79%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 22.04it/s, loss=2.5959, acc=36.41%]


  ✓ Saved best model (Val Acc: 36.41%)
Epoch 12/100


Training: 100%|██████████| 65/65 [00:09<00:00,  6.62it/s, loss=3.3248, acc=37.61%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 23.10it/s, loss=2.4527, acc=37.07%]


  ✓ Saved best model (Val Acc: 37.07%)
Epoch 13/100


Training: 100%|██████████| 65/65 [00:09<00:00,  7.09it/s, loss=3.4518, acc=40.69%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 14.81it/s, loss=2.3177, acc=36.30%]


Epoch 14/100


Training: 100%|██████████| 65/65 [00:08<00:00,  7.46it/s, loss=2.5559, acc=42.41%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 21.49it/s, loss=2.3038, acc=37.93%]


  ✓ Saved best model (Val Acc: 37.93%)
Epoch 15/100


Training: 100%|██████████| 65/65 [00:09<00:00,  6.52it/s, loss=2.3374, acc=44.65%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 20.96it/s, loss=2.3881, acc=39.67%]


  ✓ Saved best model (Val Acc: 39.67%)
Epoch 16/100


Training: 100%|██████████| 65/65 [00:10<00:00,  6.47it/s, loss=2.1705, acc=47.49%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 24.40it/s, loss=2.2143, acc=41.09%]


  ✓ Saved best model (Val Acc: 41.09%)
Epoch 17/100


Training: 100%|██████████| 65/65 [00:08<00:00,  7.64it/s, loss=2.2432, acc=49.79%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 25.08it/s, loss=2.3670, acc=41.85%]


  ✓ Saved best model (Val Acc: 41.85%)
Epoch 18/100


Training: 100%|██████████| 65/65 [00:10<00:00,  6.45it/s, loss=2.4011, acc=52.78%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 24.96it/s, loss=2.2521, acc=43.59%]


  ✓ Saved best model (Val Acc: 43.59%)
Epoch 19/100


Training: 100%|██████████| 65/65 [00:10<00:00,  6.20it/s, loss=2.0218, acc=54.96%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 11.46it/s, loss=2.2192, acc=43.26%]


Epoch 20/100


Training: 100%|██████████| 65/65 [00:11<00:00,  5.73it/s, loss=2.2301, acc=56.89%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 28.09it/s, loss=2.2458, acc=44.57%]


  ✓ Saved best model (Val Acc: 44.57%)
Epoch 21/100


Training: 100%|██████████| 65/65 [00:09<00:00,  6.79it/s, loss=1.8818, acc=58.74%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 20.15it/s, loss=2.2822, acc=43.91%]


Epoch 22/100


Training: 100%|██████████| 65/65 [00:09<00:00,  7.04it/s, loss=1.3242, acc=61.58%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 24.22it/s, loss=2.3222, acc=44.24%]


Epoch 23/100


Training: 100%|██████████| 65/65 [00:08<00:00,  7.77it/s, loss=1.8317, acc=63.54%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 29.07it/s, loss=2.1946, acc=45.87%]


  ✓ Saved best model (Val Acc: 45.87%)
Epoch 24/100


Training: 100%|██████████| 65/65 [00:09<00:00,  7.01it/s, loss=1.5315, acc=65.12%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 23.91it/s, loss=2.1833, acc=44.67%]


Epoch 25/100


Training: 100%|██████████| 65/65 [00:09<00:00,  6.94it/s, loss=1.6440, acc=67.12%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 27.37it/s, loss=2.1222, acc=45.98%]


  ✓ Saved best model (Val Acc: 45.98%)
Epoch 26/100


Training: 100%|██████████| 65/65 [00:08<00:00,  8.04it/s, loss=1.2274, acc=68.65%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 25.08it/s, loss=2.0272, acc=46.96%]


  ✓ Saved best model (Val Acc: 46.96%)
Epoch 27/100


Training: 100%|██████████| 65/65 [00:08<00:00,  7.86it/s, loss=1.1804, acc=70.58%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 24.74it/s, loss=2.2235, acc=46.52%]


Epoch 28/100


Training: 100%|██████████| 65/65 [00:08<00:00,  7.25it/s, loss=1.2526, acc=70.37%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 18.90it/s, loss=2.2747, acc=46.85%]


Epoch 29/100


Training: 100%|██████████| 65/65 [00:08<00:00,  7.22it/s, loss=1.3167, acc=73.53%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 25.17it/s, loss=2.2057, acc=48.59%]


  ✓ Saved best model (Val Acc: 48.59%)
Epoch 30/100


Training: 100%|██████████| 65/65 [00:08<00:00,  7.53it/s, loss=1.0080, acc=76.32%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 27.51it/s, loss=2.0836, acc=48.37%]


Epoch 31/100


Training: 100%|██████████| 65/65 [00:08<00:00,  8.01it/s, loss=0.8338, acc=78.17%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 31.12it/s, loss=2.2020, acc=48.26%]


Epoch 32/100


Training: 100%|██████████| 65/65 [00:06<00:00,  9.82it/s, loss=0.8936, acc=78.72%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 28.45it/s, loss=2.0330, acc=48.15%]


Epoch 33/100


Training: 100%|██████████| 65/65 [00:06<00:00, 10.00it/s, loss=0.8187, acc=79.40%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 34.62it/s, loss=2.2066, acc=48.70%]


  ✓ Saved best model (Val Acc: 48.70%)
Epoch 34/100


Training: 100%|██████████| 65/65 [00:07<00:00,  8.56it/s, loss=1.0987, acc=79.96%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 25.99it/s, loss=2.1685, acc=47.83%]


Epoch 35/100


Training: 100%|██████████| 65/65 [00:07<00:00,  8.60it/s, loss=0.9104, acc=80.91%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 25.84it/s, loss=2.0412, acc=48.15%]


Epoch 36/100


Training: 100%|██████████| 65/65 [00:08<00:00,  7.84it/s, loss=0.6543, acc=81.85%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 22.06it/s, loss=2.1216, acc=47.61%]


Epoch 37/100


Training: 100%|██████████| 65/65 [00:10<00:00,  6.41it/s, loss=0.7764, acc=82.56%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 20.25it/s, loss=2.1036, acc=48.37%]


Epoch 38/100


Training: 100%|██████████| 65/65 [00:09<00:00,  6.86it/s, loss=0.7144, acc=83.29%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 15.56it/s, loss=2.1482, acc=47.93%]


Epoch 39/100


Training: 100%|██████████| 65/65 [00:07<00:00,  8.25it/s, loss=0.8981, acc=83.65%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 25.88it/s, loss=2.0802, acc=48.15%]


Epoch 40/100


Training: 100%|██████████| 65/65 [00:07<00:00,  8.94it/s, loss=0.7820, acc=82.76%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 25.58it/s, loss=2.1010, acc=47.50%]


Epoch 41/100


Training: 100%|██████████| 65/65 [00:07<00:00,  8.85it/s, loss=0.6016, acc=84.20%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 25.77it/s, loss=2.1609, acc=48.37%]


Epoch 42/100


Training: 100%|██████████| 65/65 [00:07<00:00,  8.98it/s, loss=0.8675, acc=84.25%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 25.86it/s, loss=2.1583, acc=48.80%]


  ✓ Saved best model (Val Acc: 48.80%)
Epoch 43/100


Training: 100%|██████████| 65/65 [00:08<00:00,  7.54it/s, loss=0.7285, acc=84.59%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 19.01it/s, loss=2.1267, acc=48.37%]


Epoch 44/100


Training: 100%|██████████| 65/65 [00:08<00:00,  7.96it/s, loss=0.6321, acc=84.34%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 25.65it/s, loss=2.1263, acc=48.48%]


Epoch 45/100


Training: 100%|██████████| 65/65 [00:07<00:00,  8.90it/s, loss=0.7260, acc=85.00%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 26.12it/s, loss=2.1724, acc=48.48%]


Epoch 46/100


Training: 100%|██████████| 65/65 [00:07<00:00,  8.99it/s, loss=0.5730, acc=85.37%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 28.10it/s, loss=2.1603, acc=48.48%]


Epoch 47/100


Training: 100%|██████████| 65/65 [00:07<00:00,  8.72it/s, loss=0.7942, acc=85.16%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 26.80it/s, loss=2.1514, acc=48.37%]


Epoch 48/100


Training: 100%|██████████| 65/65 [00:07<00:00,  8.88it/s, loss=0.6889, acc=85.00%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 21.13it/s, loss=2.1646, acc=47.83%]


Epoch 49/100


Training: 100%|██████████| 65/65 [00:07<00:00,  9.03it/s, loss=0.5186, acc=85.60%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 19.49it/s, loss=2.1627, acc=48.26%]


Epoch 50/100


Training: 100%|██████████| 65/65 [00:07<00:00,  8.83it/s, loss=0.6792, acc=85.94%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 26.08it/s, loss=2.1783, acc=48.37%]


Epoch 51/100


Training: 100%|██████████| 65/65 [00:07<00:00,  8.95it/s, loss=0.7499, acc=85.97%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 26.27it/s, loss=2.1934, acc=48.37%]


Epoch 52/100


Training: 100%|██████████| 65/65 [00:07<00:00,  9.00it/s, loss=0.6229, acc=86.18%]
Validating: 100%|██████████| 8/8 [00:00<00:00, 27.68it/s, loss=2.2033, acc=48.80%]


Early stopping triggered after 52 epochs
Training complete!
Best validation accuracy: 48.80%


# Load Best Model


In [172]:
model.load_state_dict(torch.load("nmt_model.pth", weights_only=True))
model = model.to(device)
validate(model, val_loader, criterion, device)

Validating:   0%|          | 0/8 [00:00<?, ?it/s]

Validating: 100%|██████████| 8/8 [00:00<00:00, 27.81it/s, loss=2.1583, acc=48.80%]


(3.3357225358486176, 48.80434782608695)

In [173]:
target_i2t = {idx: token for token, idx in target_vocab.items()}


def translate(sentence, model, source_vocab, target_vocab, max_length=20, device="cpu"):
    model.eval()
    sentence_tokens = sentence.lower().split()
    sequence = to_sequence([sentence_tokens], source_vocab, max_length=max_length)
    encoder_input = torch.tensor(sequence, dtype=torch.long).to(device)

    # Start with <SOS> token
    decoder_input = [target_vocab["<SOS>"]]
    translation = []

    with torch.no_grad():
        # Get encoder outputs and hidden state
        encoder_outputs, encoder_hidden = model.encoder(encoder_input)

        for _ in range(max_length):
            # Prepare decoder input
            dec_input = torch.tensor([decoder_input], dtype=torch.long).to(device)

            # Decode - pass encoder_outputs if attention is used
            if model.decoder.attention_type is not None:
                decoder_outputs, encoder_hidden, _ = model.decoder(
                    dec_input, encoder_hidden, encoder_outputs
                )
            else:
                decoder_outputs, encoder_hidden, _ = model.decoder(
                    dec_input, encoder_hidden, encoder_outputs=None
                )

            # Get prediction for the last token
            output = model.fc(decoder_outputs[:, -1, :])
            predicted_id = output.argmax(dim=-1).item()

            # Check for EOS token
            if predicted_id == target_vocab["<EOS>"]:
                break

            # Get the predicted word
            predicted_word = target_i2t.get(predicted_id, "")

            # Skip special tokens in output
            if predicted_word not in ["<PAD>", "<UNK>", "<SOS>", "<EOS>"]:
                translation.append(predicted_word)

            # Add predicted token to decoder input for next iteration
            decoder_input.append(predicted_id)

    return " ".join(translation)


# Test the translation function
test_sentences = source_sentences[:10]

print("Testing translations:\n")
for sentence in test_sentences:
    translated = translate(sentence, model, source_vocab, target_vocab, device=device)
    print(f"Source: {sentence}")
    print(f"Target: {translated}")
    print()

Testing translations:

Source: I'll be sixteen on my next birthday.
Target: میں اپنی اگلی سالگرہ پر سولہ سال ہو گی۔

Source: You shouldn't eat here.
Target: تم ٹھیک ہو۔

Source: They became citizens of Japan.
Target: انہوں نے جاپانی شہریت حاصل کرلی۔

Source: You need to study more.
Target: تم کیسے ہو۔

Source: I can't do this to Tom.
Target: میں یہ کرنا ہوں۔

Source: Please wait till he comes back.
Target: برائے مہربانی اونچا بولے۔

Source: Wolves don't usually attack people.
Target: بھیڑیے جانے لوگوں

Source: I've been foolish.
Target: مجھے لگ رہا ہے۔

Source: I can't remember the melody of that song.
Target: میں نے اس گیت کا فرانسیسی ہے۔

Source: Tom is only a beginner.
Target: ٹام کو بہت خوبصورت ہے۔

