# **English to Urdu Translation Model Using Transformer and LSTM**

This project demonstrates the implementation of machine translation from **English to Urdu** using two models:
- **Transformer Model**
- **LSTM Model**

The translation models are trained on the **UMC005 English-Urdu Parallel Corpus**, and the system allows users to input English sentences and get translated Urdu sentences. Additionally, the project includes:
- Evaluation metrics (BLEU, ROUGE)
- Attention visualization for the Transformer model
- Pre-trained models that can be used for translation without retraining

## **Project Structure**

The project is structured as follows:

```
/Q1
├── data/
│   ├── bible/
│   ├── quran/
├── models/
│   ├── transformer.py
│   ├── lstm.py
├── utils/
│   ├── tokenizer.py
│   ├── dataset.py
│   ├── metrics.py
│   ├── visualization.py
├── app.py
├── train_transformer.py
├── train_lstm.py
├── evaluate.py
├── requirements.txt
├── README.md
└── saved_models/
```

### **Key Files and Folders**:
- **`data/`**: Contains the English and Urdu sentences for training, validation, and testing.
- **`models/`**: Contains model implementations for both Transformer (`transformer.py`) and LSTM (`lstm.py`).
- **`utils/`**: Contains utility functions:
  - **`tokenizer.py`**: Tokenizer implementation for Byte Pair Encoding (BPE).
  - **`dataset.py`**: Dataset class for loading and processing data.
  - **`metrics.py`**: Functions for calculating evaluation metrics like BLEU and ROUGE.
  - **`visualization.py`**: Function for visualizing attention weights.
- **`app.py`**: Streamlit interface for inputting text and getting translations, along with model evaluation and attention visualization.
- **`train_transformer.py`**: Script to train the Transformer model.
- **`train_lstm.py`**: Script to train the LSTM model.
- **`evaluate.py`**: Script to evaluate the models on the test dataset using BLEU and ROUGE metrics.
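
The Byte Pair Encoding step mentioned for `tokenizer.py` can be illustrated with a minimal sketch. The helper name `learn_bpe_merges` and the toy word list are assumptions made for illustration and are not taken from the project code. The sketch repeatedly merges the most frequent adjacent symbol pair:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (illustrative helper)."""
    # Represent each distinct word as a tuple of symbols, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the weighted vocabulary.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
```

A full tokenizer would also keep the base characters in its vocabulary and record word boundaries so that decoding can restore spaces.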

---

## **Installation and Setup**

### **1. Install Required Libraries**

Before running the project, you need to install the required dependencies. The project uses **PyTorch**, **Streamlit**, and several other libraries.

To install the required packages, run the following command:

```bash
pip install -r requirements.txt
```

### **2. Download the Data**

You will need to download the **UMC005 English-Urdu Parallel Corpus** for training and evaluation. The corpus is distributed by ÚFAL (Charles University); download it from the official UMC005 page and place it in the `data/` directory.

After downloading, the folder structure will look like:

```
/data
├── bible
│   ├── train.en
│   ├── train.ur
│   ├── dev.en
│   ├── dev.ur
│   ├── test.en
│   ├── test.ur
├── quran
│   ├── train.en
│   ├── train.ur
│   ├── dev.en
│   ├── dev.ur
│   ├── test.en
│   ├── test.ur
```
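
Given the folder layout above, each `.en` file is expected to be line-aligned with its `.ur` counterpart. The following is a minimal sketch of reading one such pair of files. The function name `read_parallel` is hypothetical, and the demo writes throwaway files so it runs without the real corpus:

```python
import os
import tempfile

def read_parallel(en_path, ur_path):
    """Return a list of (english, urdu) sentence pairs, skipping empty lines."""
    with open(en_path, encoding="utf-8") as f_en, open(ur_path, encoding="utf-8") as f_ur:
        en_lines = [line.strip() for line in f_en]
        ur_lines = [line.strip() for line in f_ur]
    # Line-aligned files must have the same number of lines.
    if len(en_lines) != len(ur_lines):
        raise ValueError(f"Line count mismatch: {len(en_lines)} vs {len(ur_lines)}")
    return [(en, ur) for en, ur in zip(en_lines, ur_lines) if en and ur]

# Demo on a temporary directory (stand-in for data/bible/).
with tempfile.TemporaryDirectory() as tmp:
    en_path = os.path.join(tmp, "train.en")
    ur_path = os.path.join(tmp, "train.ur")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("I love books.\n\n")
    with open(ur_path, "w", encoding="utf-8") as f:
        f.write("مجھے کتابیں پسند ہیں۔\n\n")
    pairs = read_parallel(en_path, ur_path)

print(pairs)
```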

### **3. Pre-trained Models**

You can use the pre-trained models stored in the `saved_models/` directory:
- **`tokenizer_en.pt`**: English tokenizer
- **`tokenizer_ur.pt`**: Urdu tokenizer
- **`transformer_model.pt`**: Pre-trained Transformer model
- **`lstm_model.pt`**: Pre-trained LSTM model

`app.py` loads these files for translation. If you don't have them, train the models with the provided `train_transformer.py` and `train_lstm.py` scripts, which write their outputs to `saved_models/`.

---

## **Running the App**

### **1. Launching the Streamlit App**

The core of this project is the **Streamlit interface** (`app.py`), which allows users to input English sentences and get their corresponding Urdu translations.

To run the app:

```bash
streamlit run app.py
```

This will open a web browser with the interface where you can:
- Enter English text to get the Urdu translation.
- Visualize the attention weights for the Transformer model.
- Evaluate the models on the test set and display BLEU and ROUGE scores.

### **2. Training the Models**

If you want to train the **Transformer** or **LSTM** models from scratch, you can run the following scripts:

- **Training the Transformer model**:

  ```bash
  python train_transformer.py
  ```

- **Training the LSTM model**:

  ```bash
  python train_lstm.py
  ```

This will train the models on the UMC005 English-Urdu dataset and save the trained models in the `saved_models/` directory.

### **3. Evaluating the Models**

You can evaluate the models' performance (on BLEU and ROUGE metrics) by running the `evaluate.py` script:

```bash
python evaluate.py
```

This will evaluate both the **Transformer** and **LSTM** models on the test dataset and print the BLEU and ROUGE scores.

### **4. Visualizing Attention**

In the **Streamlit interface** (`app.py`), you can click the **"Visualize Attention"** button to see the attention heatmap for the Transformer model. This shows which parts of the input sentence the model focused on while generating the translation.

---

## **Evaluation Metrics**

### **1. BLEU Score**

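The **BLEU (Bilingual Evaluation Understudy)** score measures clipped n-gram precision (typically for n = 1 to 4) of the predicted translation against the reference. The precisions are combined as a geometric mean and multiplied by a brevity penalty that penalizes overly short outputs. A higher BLEU score indicates closer agreement with the reference.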
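
To make the metric concrete, here is a simplified single-reference, sentence-level sketch of BLEU without smoothing. The function names are illustrative, and this is not necessarily how `utils/metrics.py` computes the score (a library implementation would normally be used for reported numbers):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of `candidate` against a single `reference`."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate n-gram count by its count in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

def sentence_bleu(candidate, reference, max_n=4):
    """Geometric mean of 1..max_n precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_avg = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_avg)

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(sentence_bleu(candidate, reference, max_n=2))
```

Without smoothing, a single missing n-gram order drives the score to zero, which is why corpus-level BLEU or a smoothing function is usually preferred for short sentences.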

### **2. ROUGE Score**

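**ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** also measures overlap between the predicted and reference sentences, but emphasizes recall: how much of the reference is recovered. ROUGE-1 and ROUGE-2 count unigram and bigram overlap, and ROUGE-L is based on the longest common subsequence. ROUGE was designed for summarization, so here it complements BLEU rather than replacing it.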
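
As an illustration of ROUGE-L, the sketch below computes precision, recall, and F1 from the longest common subsequence of two token lists. The function names are assumptions for this example; the project itself may rely on a library for the reported scores:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Classic dynamic-programming table, one row per token of `a`.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l(candidate, reference):
    """ROUGE-L precision, recall, and F1 for one candidate/reference pair."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_l("the cat sat".split(), "the cat sat on the mat".split())
print(scores)
```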

Both **BLEU** and **ROUGE** scores are computed in `evaluate.py` and displayed in the Streamlit interface.

---

## **Attention Visualization**

The **Transformer model** includes an attention mechanism, which helps the model decide which parts of the input to focus on during translation. The attention weights can be visualized as a heatmap, showing the alignment between words in the input and output sentences. This is done using the **`VizAttention()`** function in the `utils/visualization.py` file.

### Example of Attention Visualization:

- **Input**: "I love books."
- **Output**: "مجھے کتابیں پسند ہیں۔"
- The heatmap will show how much attention was given to each word in the input sentence when generating each word in the output sentence.
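
The values shown in such a heatmap are rows of attention weights. As a rough sketch of where they come from, the code below computes scaled dot-product attention weights in plain Python. The toy query and key vectors are made up for illustration and do not come from the trained model:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return one row of weights per query: softmax(q . k / sqrt(d_k)) over all keys."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights.append(softmax(scores))
    return weights

# Three input (source) tokens as keys, two output (target) tokens as queries.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
queries = [[2.0, 0.0], [0.0, 2.0]]
weights = attention_weights(queries, keys)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row corresponds to one output token and sums to 1, so a heatmap of `weights` has output tokens on one axis and input tokens on the other.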

---

## **Future Improvements**

### **1. Fine-tuning Pre-trained Models**
Currently, the models are trained from scratch. Fine-tuning a **pre-trained multilingual sequence-to-sequence model** (for example **mBART** or **mT5**) on the English-Urdu dataset might improve performance; encoder-only or decoder-only models such as **BERT** or **GPT** are not translation models out of the box. You could experiment with this kind of **transfer learning**.

### **2. Hyperparameter Tuning**
Although hyperparameters like learning rate, batch size, and model depth are set, further **hyperparameter tuning** using techniques like **Grid Search** or **Bayesian Optimization** could yield better results.
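
A grid search of this kind can be sketched as follows. The `grid_search` helper and the `toy_score` objective are placeholders invented for this example. In a real run the objective would train a model with the given configuration and return a validation metric such as BLEU:

```python
from itertools import product

def grid_search(search_space, score_fn):
    """Try every combination in `search_space` and return the best (config, score)."""
    names = list(search_space)
    best_config, best_score = None, float("-inf")
    for values in product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_score(config):
    # Placeholder objective, NOT a real training run: it simply peaks at
    # learning_rate=1e-4 and batch_size=64 so the sketch is runnable.
    return -abs(config["learning_rate"] - 1e-4) - 0.001 * abs(config["batch_size"] - 64)

search_space = {"learning_rate": [1e-3, 1e-4, 1e-5], "batch_size": [32, 64]}
best_config, best_score = grid_search(search_space, toy_score)
print(best_config)
```

The number of combinations grows multiplicatively with each added hyperparameter, which is the usual reason to move to random or Bayesian search for larger spaces.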

### **3. Multi-Lingual Models**
This project focuses on **English to Urdu** translation. However, you could extend it to **multi-lingual translation** by training on additional language pairs.

---

## **Conclusion**

This project demonstrates the implementation of machine translation from **English to Urdu** using both **Transformer** and **LSTM** models. It provides a **Streamlit-based user interface** for easy text input and translation, along with model evaluation and attention visualization. You can either use pre-trained models or train your own on the **UMC005 English-Urdu Parallel Corpus**.

Feel free to experiment with the models, fine-tune them, and extend the project for additional language pairs and features!

---

### **Acknowledgments**
- The project uses the **UMC005 English-Urdu Parallel Corpus** for training and evaluation.
- The models are based on the Transformer architecture described in the paper **"Attention Is All You Need"** and on the LSTM-based sequence-to-sequence model.

---

### **License**

This project is open-source and free for non-commercial educational and research purposes. For commercial use, please reach out to the authors for licensing details.

Keeping this and the objectives above in mind, write the complete code.
Input folder format:
bible:
Bible-EN
Bible-UR
Bible-UR-normalized
dev.en
dev.ur
test.en
test.ur
train.en
train.ur

quran:
Quran-EN
Quran-UR
Quran-UR-normalized
dev.en
dev.ur
test.en
test.ur
train.en
train.ur

In [1]:
# import os
# import re
# import torch
# import torch.nn as nn
# import torch.optim as optim
# import numpy as np
# import pandas as pd
# import matplotlib.pyplot as plt
# import random
# from torch.utils.data import Dataset, DataLoader
# from torch.nn.utils.rnn import pad_sequence
# import math
# import time
# from sklearn.model_selection import train_test_split
# import nltk
# from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
# from rouge_score import rouge_scorer
# import tkinter as tk
# from tkinter import scrolledtext
# import unicodedata
# from torchtext.data.metrics import bleu_score
# from collections import Counter, defaultdict

# # Download NLTK resources
# nltk.download('punkt')

# # Set seeds for reproducibility
# SEED = 42
# torch.manual_seed(SEED)
# torch.cuda.manual_seed(SEED)
# np.random.seed(SEED)
# random.seed(SEED)
# torch.backends.cudnn.deterministic = True

# # Device configuration
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# print(f"Using device: {device}")

# # Define hyperparameters
# BATCH_SIZE = 64
# EMBEDDING_SIZE = 512
# NUM_HEADS = 8
# NUM_ENCODER_LAYERS = 6
# NUM_DECODER_LAYERS = 6
# FFN_DIM = 2048
# DROPOUT = 0.1
# LEARNING_RATE = 0.0001
# NUM_EPOCHS = 20
# MAX_SEQ_LENGTH = 100
# VOCAB_SIZE = 32000  # For BPE tokenization

# # Data path configuration - adjust based on your file structure
# DATA_DIR = "data"  # Base directory for the corpus
# TRAIN_EN_PATH = os.path.join(DATA_DIR, "bible/train.en")
# TRAIN_UR_PATH = os.path.join(DATA_DIR, "bible/train.ur")
# DEV_EN_PATH = os.path.join(DATA_DIR, "bible/dev.en")
# DEV_UR_PATH = os.path.join(DATA_DIR, "bible/dev.ur")
# TEST_EN_PATH = os.path.join(DATA_DIR, "bible/test.en")
# TEST_UR_PATH = os.path.join(DATA_DIR, "bible/test.ur")

# # Add Quran data paths
# QURAN_TRAIN_EN_PATH = os.path.join(DATA_DIR, "quran/train.en")
# QURAN_TRAIN_UR_PATH = os.path.join(DATA_DIR, "quran/train.ur")
# QURAN_DEV_EN_PATH = os.path.join(DATA_DIR, "quran/dev.en")
# QURAN_DEV_UR_PATH = os.path.join(DATA_DIR, "quran/dev.ur")
# QURAN_TEST_EN_PATH = os.path.join(DATA_DIR, "quran/test.en")
# QURAN_TEST_UR_PATH = os.path.join(DATA_DIR, "quran/test.ur")

# # Add model save paths
# MODEL_SAVE_DIR = "saved_models"
# TRANSFORMER_MODEL_PATH = os.path.join(MODEL_SAVE_DIR, "transformer_model.pt")
# LSTM_MODEL_PATH = os.path.join(MODEL_SAVE_DIR, "lstm_model.pt")
# TOKENIZER_EN_PATH = os.path.join(MODEL_SAVE_DIR, "tokenizer_en.pt")
# TOKENIZER_UR_PATH = os.path.join(MODEL_SAVE_DIR, "tokenizer_ur.pt")

# os.makedirs(MODEL_SAVE_DIR, exist_ok=True)

# # Part 1: Data Preprocessing and Tokenization
# #############################################################

# class BPETokenizer:
#     """
#     Byte Pair Encoding (BPE) tokenizer for both English and Urdu text
#     """
#     def __init__(self, vocab_size=VOCAB_SIZE):
#         self.vocab_size = vocab_size
#         self.special_tokens = {
#             "<PAD>": 0,  # Padding token
#             "<SOS>": 1,  # Start of sentence token
#             "<EOS>": 2,  # End of sentence token
#             "<UNK>": 3   # Unknown token
#         }
#         self.token_to_id = {token: idx for token, idx in self.special_tokens.items()}
#         self.id_to_token = {idx: token for token, idx in self.special_tokens.items()}
#         self.merges = {}
#         self.vocab = {}
        
#     def train(self, texts):
#         """
#         Train the BPE tokenizer on a list of texts
#         """
#         # Initialize vocabulary with characters
#         word_freqs = Counter()
#         for text in texts:
#             word_freqs.update(text.split())
        
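#         # Initialize each word as a sequence of characters
#         self.vocab = {token: list(token) for token in word_freqs.keys()}

#         # Register base characters so single-character tokens do not map to <UNK>
#         for char in sorted({ch for word in word_freqs for ch in word}):
#             if char not in self.token_to_id:
#                 self.token_to_id[char] = len(self.token_to_id)
#                 self.id_to_token[self.token_to_id[char]] = char

#         # Count pairs
#         pairs = self._count_pairs(self.vocab, word_freqs)

#         # Merge pairs until vocab_size is reached or no more pairs
#         num_merges = self.vocab_size - len(self.token_to_id)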
#         for i in range(num_merges):
#             if not pairs:
#                 break
                
#             # Find most frequent pair
#             best_pair = max(pairs, key=pairs.get)
            
#             # Update vocabulary and merges
#             self._merge_pair(best_pair, self.vocab, word_freqs)
#             self.merges[best_pair] = len(self.token_to_id)
            
#             # Add new token to vocabulary
#             new_token = best_pair[0] + best_pair[1]
#             self.token_to_id[new_token] = len(self.token_to_id)
#             self.id_to_token[self.token_to_id[new_token]] = new_token
            
#             # Update pair counts
#             pairs = self._count_pairs(self.vocab, word_freqs)
            
#             if i % 1000 == 0:
#                 print(f"Merge {i}/{num_merges}: {best_pair} -> {new_token}")
    
#     def _count_pairs(self, vocab, word_freqs):
#         """
#         Count the frequency of adjacent pairs in the vocabulary
#         """
#         pairs = defaultdict(int)
#         for word, freq in word_freqs.items():
#             word_tokens = vocab[word]
#             for i in range(len(word_tokens) - 1):
#                 pairs[(word_tokens[i], word_tokens[i+1])] += freq
#         return pairs
    
#     def _merge_pair(self, pair, vocab, word_freqs):
#         """
#         Merge a pair of tokens in the vocabulary
#         """
#         for word in vocab:
#             word_tokens = vocab[word]
#             new_tokens = []
#             i = 0
#             while i < len(word_tokens):
#                 if i < len(word_tokens) - 1 and (word_tokens[i], word_tokens[i+1]) == pair:
#                     new_tokens.append(word_tokens[i] + word_tokens[i+1])
#                     i += 2
#                 else:
#                     new_tokens.append(word_tokens[i])
#                     i += 1
#             vocab[word] = new_tokens
    
#     def tokenize(self, text):
#         """
#         Tokenize a text using the learned BPE merges
#         """
#         words = text.split()
#         tokens = []
        
#         for word in words:
#             # Start with characters
#             word_tokens = list(word)
            
#             # Apply merges
#             while len(word_tokens) > 1:
#                 pairs = [(word_tokens[i], word_tokens[i+1]) for i in range(len(word_tokens)-1)]
                
#                 # Find the pair with the highest priority (first occurrence in merges)
#                 pair_scores = [(pair, self.merges.get(pair, float('inf'))) for pair in pairs]
#                 best_pair, best_score = min(pair_scores, key=lambda x: x[1])
                
#                 if best_pair not in self.merges:
#                     break
                    
#                 # Apply the merge
#                 new_tokens = []
#                 i = 0
#                 while i < len(word_tokens):
#                     if i < len(word_tokens) - 1 and (word_tokens[i], word_tokens[i+1]) == best_pair:
#                         new_tokens.append(word_tokens[i] + word_tokens[i+1])
#                         i += 2
#                     else:
#                         new_tokens.append(word_tokens[i])
#                         i += 1
#                 word_tokens = new_tokens
            
#             tokens.extend(word_tokens)
        
#         # Add EOS token
#         tokens.append("<EOS>")
        
#         # Convert tokens to IDs
#         ids = []
#         for token in tokens:
#             if token in self.token_to_id:
#                 ids.append(self.token_to_id[token])
#             else:
#                 ids.append(self.special_tokens["<UNK>"])
        
#         return ids
    
#     def detokenize(self, ids):
#         """
#         Convert token IDs back to text
#         """
#         tokens = [self.id_to_token.get(id, "<UNK>") for id in ids if id != self.special_tokens["<PAD>"]]
        
#         # Remove special tokens
#         tokens = [token for token in tokens if token not in ["<SOS>", "<EOS>", "<PAD>", "<UNK>"]]
        
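#         # Join tokens to form the text.
#         # NOTE: tokenize() stores no end-of-word marker, so word boundaries are
#         # lost here and the output comes back without spaces.
#         text = ''.join(tokens)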
        
#         return text
    
#     def save(self, path):
#         """
#         Save tokenizer to a file
#         """
#         torch.save({
#             'token_to_id': self.token_to_id,
#             'id_to_token': self.id_to_token,
#             'merges': self.merges,
#             'vocab': self.vocab,
#             'vocab_size': self.vocab_size
#         }, path)
    
#     def load(self, path):
#         """
#         Load tokenizer from a file
#         """
#         data = torch.load(path)
#         self.token_to_id = data['token_to_id']
#         self.id_to_token = data['id_to_token']
#         self.merges = data['merges']
#         self.vocab = data['vocab']
#         self.vocab_size = data['vocab_size']


# def preprocess_english(text):
#     """
#     Preprocess English text
#     """
#     # Convert to lowercase
#     text = text.lower()
    
#     # Add spaces around punctuation
#     text = re.sub(r'([.,!?;:])', r' \1 ', text)
    
#     # Remove multiple spaces
#     text = re.sub(r'\s+', ' ', text).strip()
    
#     return text

# def preprocess_urdu(text):
#     """
#     Preprocess Urdu text
#     """
#     # Add spaces around punctuation (including the Arabic question mark and semicolon)
#     text = re.sub(r'([۔،؟؛!?;:])', r' \1 ', text)
    
#     # Normalize Unicode
#     text = unicodedata.normalize('NFKC', text)
    
#     # Remove multiple spaces
#     text = re.sub(r'\s+', ' ', text).strip()
    
#     return text

# def load_corpus(en_path, ur_path):
#     """
#     Load and preprocess English-Urdu corpus
#     """
#     with open(en_path, 'r', encoding='utf-8') as f:
#         en_lines = [preprocess_english(line.strip()) for line in f.readlines()]
    
#     with open(ur_path, 'r', encoding='utf-8') as f:
#         ur_lines = [preprocess_urdu(line.strip()) for line in f.readlines()]
    
#     # Ensure equal length
#     assert len(en_lines) == len(ur_lines), f"Mismatch in corpus lengths: {len(en_lines)} vs {len(ur_lines)}"
    
#     # Filter out empty lines and lines that are too long
#     filtered_pairs = [(en, ur) for en, ur in zip(en_lines, ur_lines)
#                      if en and ur
#                      and len(en.split()) <= MAX_SEQ_LENGTH and len(ur.split()) <= MAX_SEQ_LENGTH]
    
#     en_filtered, ur_filtered = zip(*filtered_pairs) if filtered_pairs else ([], [])
    
#     return list(en_filtered), list(ur_filtered)

# def train_tokenizers(en_texts, ur_texts):
#     """
#     Train BPE tokenizers for English and Urdu
#     """
#     # Initialize tokenizers
#     en_tokenizer = BPETokenizer(vocab_size=VOCAB_SIZE)
#     ur_tokenizer = BPETokenizer(vocab_size=VOCAB_SIZE)
    
#     # Train tokenizers
#     print("Training English tokenizer...")
#     en_tokenizer.train(en_texts)
    
#     print("Training Urdu tokenizer...")
#     ur_tokenizer.train(ur_texts)
    
#     # Save tokenizers
#     en_tokenizer.save(TOKENIZER_EN_PATH)
#     ur_tokenizer.save(TOKENIZER_UR_PATH)
    
#     return en_tokenizer, ur_tokenizer

# class TranslationDataset(Dataset):
#     """
#     Dataset for machine translation
#     """
#     def __init__(self, en_texts, ur_texts, en_tokenizer, ur_tokenizer):
#         self.en_texts = en_texts
#         self.ur_texts = ur_texts
#         self.en_tokenizer = en_tokenizer
#         self.ur_tokenizer = ur_tokenizer
    
#     def __len__(self):
#         return len(self.en_texts)
    
#     def __getitem__(self, idx):
#         en_text = self.en_texts[idx]
#         ur_text = self.ur_texts[idx]
        
#         # Tokenize texts
#         en_tokens = [self.en_tokenizer.special_tokens["<SOS>"]] + self.en_tokenizer.tokenize(en_text)
#         ur_tokens = [self.ur_tokenizer.special_tokens["<SOS>"]] + self.ur_tokenizer.tokenize(ur_text)
        
#         return {
#             'en_text': en_text,
#             'ur_text': ur_text,
#             'en_tokens': torch.tensor(en_tokens),
#             'ur_tokens': torch.tensor(ur_tokens)
#         }

# def collate_fn(batch):
#     """
#     Collate function for batching
#     """
#     en_texts = [item['en_text'] for item in batch]
#     ur_texts = [item['ur_text'] for item in batch]
    
#     en_tokens = [item['en_tokens'] for item in batch]
#     ur_tokens = [item['ur_tokens'] for item in batch]
    
#     # Pad sequences
#     en_tokens_padded = pad_sequence(en_tokens, batch_first=True, padding_value=0)
#     ur_tokens_padded = pad_sequence(ur_tokens, batch_first=True, padding_value=0)
    
#     return {
#         'en_texts': en_texts,
#         'ur_texts': ur_texts,
#         'en_tokens': en_tokens_padded,
#         'ur_tokens': ur_tokens_padded
#     }

# # Part 2: Transformer Model Implementation
# #############################################################

# class PositionalEncoding(nn.Module):
#     """
#     Positional encoding for transformer model
#     """
#     def __init__(self, d_model, dropout=0.1, max_len=5000):
#         super(PositionalEncoding, self).__init__()
#         self.dropout = nn.Dropout(p=dropout)
        
#         # Create positional encoding
#         pe = torch.zeros(max_len, d_model)
#         position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
#         div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
#         pe[:, 0::2] = torch.sin(position * div_term)
#         pe[:, 1::2] = torch.cos(position * div_term)
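#         pe = pe.unsqueeze(0)  # [1, max_len, d_model], matching batch_first=True inputs
#         self.register_buffer('pe', pe)

#     def forward(self, x):
#         """
#         Args:
#             x: Tensor, shape [batch_size, seq_len, embedding_dim]
#         """
#         # Slice along the sequence dimension so each position gets its own encoding
#         x = x + self.pe[:, :x.size(1), :]
#         return self.dropout(x)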

# class Transformer(nn.Module):
#     """
#     Transformer model for machine translation
#     """
#     def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_encoder_layers,
#                  num_decoder_layers, dim_feedforward, dropout=0.1):
#         super(Transformer, self).__init__()
        
#         # Embedding layers
#         self.src_embedding = nn.Embedding(src_vocab_size, d_model)
#         self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
        
#         # Positional encoding
#         self.positional_encoding = PositionalEncoding(d_model, dropout)
        
#         # Transformer layers
#         self.transformer = nn.Transformer(
#             d_model=d_model,
#             nhead=nhead,
#             num_encoder_layers=num_encoder_layers,
#             num_decoder_layers=num_decoder_layers,
#             dim_feedforward=dim_feedforward,
#             dropout=dropout,
#             batch_first=True
#         )
        
#         # Output layer
#         self.fc_out = nn.Linear(d_model, tgt_vocab_size)
        
#         # Initialize parameters
#         self._init_parameters()
        
#     def _init_parameters(self):
#         """
#         Initialize parameters with Xavier uniform distribution
#         """
#         for p in self.parameters():
#             if p.dim() > 1:
#                 nn.init.xavier_uniform_(p)
    
#     def forward(self, src, tgt, src_mask=None, tgt_mask=None, 
#                 memory_mask=None, src_key_padding_mask=None, 
#                 tgt_key_padding_mask=None, memory_key_padding_mask=None):
#         """
#         Args:
#             src: Source sequence [batch_size, src_len]
#             tgt: Target sequence [batch_size, tgt_len]
#             *_mask, *_key_padding_mask: Masks for attention mechanism
#         """
#         # Create padding masks
#         if src_key_padding_mask is None:
#             src_key_padding_mask = (src == 0)
        
#         if tgt_key_padding_mask is None:
#             tgt_key_padding_mask = (tgt == 0)
        
#         # Create causal mask for decoder
#         if tgt_mask is None:
#             tgt_len = tgt.size(1)
#             tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_len).to(tgt.device)
        
#         # Embed tokens
#         src_embedded = self.src_embedding(src)
#         tgt_embedded = self.tgt_embedding(tgt)
        
#         # Add positional encoding
#         src_embedded = self.positional_encoding(src_embedded)
#         tgt_embedded = self.positional_encoding(tgt_embedded)
        
#         # Forward pass through transformer
#         output = self.transformer(
#             src=src_embedded,
#             tgt=tgt_embedded,
#             src_mask=src_mask,
#             tgt_mask=tgt_mask,
#             memory_mask=memory_mask,
#             src_key_padding_mask=src_key_padding_mask,
#             tgt_key_padding_mask=tgt_key_padding_mask,
#             memory_key_padding_mask=(memory_key_padding_mask
#                                      if memory_key_padding_mask is not None
#                                      else src_key_padding_mask)
#         )
        
#         # Project to vocabulary
#         output = self.fc_out(output)
        
#         return output
    
#     def get_attention_weights(self, src, tgt, src_mask=None, tgt_mask=None, 
#                               memory_mask=None, src_key_padding_mask=None, 
#                               tgt_key_padding_mask=None, memory_key_padding_mask=None):
#         """
#         Retrieve attention weights for visualization
#         """
#         # Create padding masks
#         if src_key_padding_mask is None:
#             src_key_padding_mask = (src == 0)
        
#         if tgt_key_padding_mask is None:
#             tgt_key_padding_mask = (tgt == 0)
        
#         # Create causal mask for decoder
#         if tgt_mask is None:
#             tgt_len = tgt.size(1)
#             tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_len).to(tgt.device)
        
#         # Embed tokens
#         src_embedded = self.src_embedding(src)
#         tgt_embedded = self.tgt_embedding(tgt)
        
#         # Add positional encoding
#         src_embedded = self.positional_encoding(src_embedded)
#         tgt_embedded = self.positional_encoding(tgt_embedded)
        
#         # Store attention weights and the handles of every registered hook
#         attention_weights = {}
#         handles = []

#         def make_hook(name):
#             # The hook runs on an nn.MultiheadAttention module, whose output is
#             # (attn_output, attn_weights). The stock nn.Transformer layers call it
#             # with need_weights=False, so attn_weights is None unless that flag
#             # is overridden elsewhere.
#             def hook(module, inputs, output):
#                 attention_weights[name] = output[1]
#             return hook

#         # Encoder self-attention
#         for i, layer in enumerate(self.transformer.encoder.layers):
#             handles.append(layer.self_attn.register_forward_hook(
#                 make_hook(f'encoder_layer_{i}')))

#         # Decoder self-attention and cross-attention
#         for i, layer in enumerate(self.transformer.decoder.layers):
#             handles.append(layer.self_attn.register_forward_hook(
#                 make_hook(f'decoder_self_attn_layer_{i}')))
#             handles.append(layer.multihead_attn.register_forward_hook(
#                 make_hook(f'decoder_cross_attn_layer_{i}')))

#         # Forward pass through transformer
#         output = self.transformer(
#             src=src_embedded,
#             tgt=tgt_embedded,
#             src_mask=src_mask,
#             tgt_mask=tgt_mask,
#             memory_mask=memory_mask,
#             src_key_padding_mask=src_key_padding_mask,
#             tgt_key_padding_mask=tgt_key_padding_mask,
#             memory_key_padding_mask=(memory_key_padding_mask
#                                      if memory_key_padding_mask is not None
#                                      else src_key_padding_mask)
#         )

#         # Remove every hook, not only the last one registered in each loop
#         for handle in handles:
#             handle.remove()
        
#         return attention_weights

# # LSTM Model Implementation
# class Encoder(nn.Module):
#     """
#     Bidirectional Encoder for LSTM model
#     """
#     def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
#         super().__init__()
        
#         self.hid_dim = hid_dim
#         self.n_layers = n_layers
        
#         self.embedding = nn.Embedding(input_dim, emb_dim)
#         self.rnn = nn.LSTM(
#             emb_dim,
#             hid_dim,
#             n_layers,
#             dropout=dropout,
#             batch_first=True,
#             bidirectional=True   # run the LSTM in both directions
#         )
#         self.dropout = nn.Dropout(dropout)
        
#     def forward(self, src):
#         """
#         src: [batch_size, src_len]
#         returns:
#           - outputs: [batch_size, src_len, hid_dim * 2]  (forward + backward)
#           - hidden: [n_layers * 2, batch_size, hid_dim]
#           - cell:   [n_layers * 2, batch_size, hid_dim]
#         """
#         embedded = self.dropout(self.embedding(src))
#         outputs, (hidden, cell) = self.rnn(embedded)
        
#         return outputs, hidden, cell

# class Attention(nn.Module):
#     """
#     Attention mechanism for Bidirectional Encoder
#     """
#     def __init__(self, enc_hid_dim, dec_hid_dim):
#         super().__init__()
        
#         self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)  # encoder outputs are bidirectional
#         self.v = nn.Linear(dec_hid_dim, 1, bias=False)
        
#     def forward(self, hidden, encoder_outputs, mask=None):
#         """
#         hidden: [batch_size, dec_hid_dim]
#         encoder_outputs: [batch_size, src_len, enc_hid_dim * 2]
#         mask: [batch_size, src_len] (optional)
#         """
#         batch_size = encoder_outputs.shape[0]
#         src_len = encoder_outputs.shape[1]
        
#         # Repeat hidden for each time step
#         hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
#         # Calculate energy
#         energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        
#         # Calculate attention scores
#         attention = self.v(energy).squeeze(2)
        
#         # Apply mask if given
#         if mask is not None:
#             attention = attention.masked_fill(mask == 0, -1e10)
        
#         attention_weights = torch.softmax(attention, dim=1)
        
#         # Calculate context vector
#         context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs).squeeze(1)
        
#         return context, attention_weights


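# class Decoder(nn.Module):
#     """
#     Attention decoder for the LSTM model. LSTMSeq2Seq concatenates the encoder's
#     forward and backward states, so dec_hid_dim must equal enc_hid_dim * 2 and
#     n_layers must match the encoder.
#     """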
#     def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, n_layers, dropout, attention):
#         super().__init__()

#         self.output_dim = output_dim
#         self.attention = attention

#         self.embedding = nn.Embedding(output_dim, emb_dim)

#         self.rnn = nn.LSTM(
#             (enc_hid_dim * 2) + emb_dim,  # Update to match encoder's bidirectional output size
#             dec_hid_dim,
#             n_layers,
#             dropout=dropout,
#             batch_first=True
#         )

#         self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)  # Update input size
#         self.dropout = nn.Dropout(dropout)

#     def forward(self, input, hidden, cell, encoder_outputs, mask=None):
#         embedded = self.dropout(self.embedding(input))

#         context, attention_weights = self.attention(hidden[-1], encoder_outputs, mask)

#         rnn_input = torch.cat((embedded, context.unsqueeze(1)), dim=2)

#         output, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))

#         output = torch.cat((output.squeeze(1), context, embedded.squeeze(1)), dim=1)

#         prediction = self.fc_out(output)

#         return prediction, hidden, cell, attention_weights

# class LSTMSeq2Seq(nn.Module):
#     def __init__(self, encoder, decoder, device):
#         super().__init__()

#         self.encoder = encoder
#         self.decoder = decoder
#         self.device = device

#     def forward(self, src, tgt, teacher_forcing_ratio=0.5):
#         batch_size = src.shape[0]
#         tgt_len = tgt.shape[1]
#         tgt_vocab_size = self.decoder.output_dim

#         outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)

#         src_mask = (src != 0).float().to(self.device)

#         # Encode source sentence
#         encoder_outputs, hidden, cell = self.encoder(src)

#         # Merge forward and backward hidden and cell states
#         hidden = self._cat_directions(hidden)
#         cell = self._cat_directions(cell)

#         # First input to the decoder is the <SOS> token
#         input = tgt[:, 0].unsqueeze(1)

#         attention_weights = []

#         for t in range(1, tgt_len):
#             output, hidden, cell, attn_weights = self.decoder(input, hidden, cell, encoder_outputs, src_mask)
#             outputs[:, t, :] = output
#             attention_weights.append(attn_weights)

#             teacher_force = random.random() < teacher_forcing_ratio
#             top1 = output.argmax(1).unsqueeze(1)
#             input = tgt[:, t].unsqueeze(1) if teacher_force else top1

#         return outputs, torch.stack(attention_weights, dim=1)

#     def _cat_directions(self, h):
#         new_h = torch.cat((h[0:h.size(0):2], h[1:h.size(0):2]), dim=2)  # Concatenate forward and backward
#         return new_h


In [2]:
import os
import re
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
import math
import time
from sklearn.model_selection import train_test_split
import nltk
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from rouge_score import rouge_scorer
import tkinter as tk
from tkinter import scrolledtext
import unicodedata
from torchtext.data.metrics import bleu_score
from collections import Counter, defaultdict

# Download NLTK resources
nltk.download('punkt')

# Set seeds for reproducibility
SEED = 42
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.backends.cudnn.deterministic = True

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Define hyperparameters
BATCH_SIZE = 64
EMBEDDING_SIZE = 512
NUM_HEADS = 8
NUM_ENCODER_LAYERS = 6
NUM_DECODER_LAYERS = 6
FFN_DIM = 2048
DROPOUT = 0.1
LEARNING_RATE = 0.0001
NUM_EPOCHS = 20
MAX_SEQ_LENGTH = 100
VOCAB_SIZE = 32000  # For BPE tokenization

# Data path configuration - adjust based on your file structure
DATA_DIR = "data"  # Base directory for the corpus
TRAIN_EN_PATH = os.path.join(DATA_DIR, "bible/train.en")
TRAIN_UR_PATH = os.path.join(DATA_DIR, "bible/train.ur")
DEV_EN_PATH = os.path.join(DATA_DIR, "bible/dev.en")
DEV_UR_PATH = os.path.join(DATA_DIR, "bible/dev.ur")
TEST_EN_PATH = os.path.join(DATA_DIR, "bible/test.en")
TEST_UR_PATH = os.path.join(DATA_DIR, "bible/test.ur")

# Add Quran data paths
QURAN_TRAIN_EN_PATH = os.path.join(DATA_DIR, "quran/train.en")
QURAN_TRAIN_UR_PATH = os.path.join(DATA_DIR, "quran/train.ur")
QURAN_DEV_EN_PATH = os.path.join(DATA_DIR, "quran/dev.en")
QURAN_DEV_UR_PATH = os.path.join(DATA_DIR, "quran/dev.ur")
QURAN_TEST_EN_PATH = os.path.join(DATA_DIR, "quran/test.en")
QURAN_TEST_UR_PATH = os.path.join(DATA_DIR, "quran/test.ur")

# Add model save paths
MODEL_SAVE_DIR = "saved_models"
TRANSFORMER_MODEL_PATH = os.path.join(MODEL_SAVE_DIR, "transformer_model.pt")
LSTM_MODEL_PATH = os.path.join(MODEL_SAVE_DIR, "lstm_model.pt")
TOKENIZER_EN_PATH = os.path.join(MODEL_SAVE_DIR, "tokenizer_en.pt")
TOKENIZER_UR_PATH = os.path.join(MODEL_SAVE_DIR, "tokenizer_ur.pt")

os.makedirs(MODEL_SAVE_DIR, exist_ok=True)

# Part 1: Data Preprocessing and Tokenization
#############################################################

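# Tokenizer and preprocessing helpers are the same as in the original code, with no changes needed.
# The functions below expect preprocess_english, load_corpus, train_tokenizers,
# TranslationDataset, collate_fn, Encoder and Decoder to be defined before they run.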

# Part 2: Transformer Model Implementation
#############################################################

class PositionalEncoding(nn.Module):
    """
    Positional encoding for transformer model
    """
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # Create positional encoding
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
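        # Shape [1, max_len, d_model] so the table broadcasts over the batch dimension
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        """
        Args:
            x: Tensor, shape [batch_size, seq_len, embedding_dim] (batch_first layout)
        """
        # Index by sequence position (dim 1); the model uses batch_first=True,
        # so dim 0 is the batch and must not be used to select positions
        x = x + self.pe[:, :x.size(1)]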
        return self.dropout(x)
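
As a sanity check on the table built in `PositionalEncoding`, the same sinusoidal formula can be written in plain Python. This is a minimal sketch rather than the class above: `sinusoidal_table` is a hypothetical helper name, and it assumes an even `d_model`.

```python
import math

def sinusoidal_table(max_len, d_model):
    """Return a max_len x d_model list of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency term as div_term: exp(i * -ln(10000) / d_model)
            freq = math.exp(i * (-math.log(10000.0) / d_model))
            table[pos][i] = math.sin(pos * freq)      # even dimensions
            table[pos][i + 1] = math.cos(pos * freq)  # odd dimensions
    return table

table = sinusoidal_table(max_len=4, d_model=6)
print(table[0])  # position 0: every sine term is 0.0, every cosine term is 1.0
```

Each row depends only on the token position, which is why the table has to be indexed along the sequence dimension and broadcast across the batch.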

class Transformer(nn.Module):
    """
    Transformer model for machine translation
    """
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_encoder_layers,
                 num_decoder_layers, dim_feedforward, dropout=0.1):
        super(Transformer, self).__init__()
        
        # Embedding layers
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
        
        # Positional encoding
        self.positional_encoding = PositionalEncoding(d_model, dropout)
        
        # Transformer layers
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )
        
        # Output layer
        self.fc_out = nn.Linear(d_model, tgt_vocab_size)
        
        # Initialize parameters
        self._init_parameters()
        
    def _init_parameters(self):
        """
        Initialize parameters with Xavier uniform distribution
        """
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)
    
    def forward(self, src, tgt, src_mask=None, tgt_mask=None, 
                memory_mask=None, src_key_padding_mask=None, 
                tgt_key_padding_mask=None, memory_key_padding_mask=None):
        """
        Args:
            src: Source sequence [batch_size, src_len]
            tgt: Target sequence [batch_size, tgt_len]
            *_mask, *_key_padding_mask: Masks for attention mechanism
        """
        # Create padding masks
        if src_key_padding_mask is None:
            src_key_padding_mask = (src == 0)
        
        if tgt_key_padding_mask is None:
            tgt_key_padding_mask = (tgt == 0)
        
        # Create causal mask for decoder
        if tgt_mask is None:
            tgt_len = tgt.size(1)
            tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_len).to(tgt.device)
        
        # Embed tokens
        src_embedded = self.src_embedding(src)
        tgt_embedded = self.tgt_embedding(tgt)
        
        # Add positional encoding
        src_embedded = self.positional_encoding(src_embedded)
        tgt_embedded = self.positional_encoding(tgt_embedded)
        
        # Forward pass through transformer
        output = self.transformer(
            src=src_embedded,
            tgt=tgt_embedded,
            src_mask=src_mask,
            tgt_mask=tgt_mask,
            memory_mask=memory_mask,
            src_key_padding_mask=src_key_padding_mask,
            tgt_key_padding_mask=tgt_key_padding_mask,
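            # Honour an explicit memory_key_padding_mask; default to the source padding mask
            memory_key_padding_mask=(
                src_key_padding_mask if memory_key_padding_mask is None
                else memory_key_padding_mask
            )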
        )
        
        # Project to vocabulary
        output = self.fc_out(output)
        
        return output
    
    def get_attention_weights(self, src, tgt, src_mask=None, tgt_mask=None, 
                              memory_mask=None, src_key_padding_mask=None, 
                              tgt_key_padding_mask=None, memory_key_padding_mask=None):
        """
        Retrieve attention weights for visualization
        """
        # Create padding masks
        if src_key_padding_mask is None:
            src_key_padding_mask = (src == 0)
        
        if tgt_key_padding_mask is None:
            tgt_key_padding_mask = (tgt == 0)
        
        # Create causal mask for decoder
        if tgt_mask is None:
            tgt_len = tgt.size(1)
            tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_len).to(tgt.device)
        
        # Embed tokens
        src_embedded = self.src_embedding(src)
        tgt_embedded = self.tgt_embedding(tgt)
        
        # Add positional encoding
        src_embedded = self.positional_encoding(src_embedded)
        tgt_embedded = self.positional_encoding(tgt_embedded)
        
        # Store attention weights, plus every hook handle so all hooks can be removed
        attention_weights = {}
        handles = []
        
        def force_weights(module, args, kwargs):
            # The built-in layers call attention with need_weights=False;
            # override it so the forward hook below receives the weights.
            kwargs['need_weights'] = True
            return args, kwargs
        
        def save_weights(name):
            def hook(module, inputs, output):
                # nn.MultiheadAttention returns (attn_output, attn_output_weights)
                attention_weights[name] = output[1]
            return hook
        
        def attach(attn_module, name):
            # with_kwargs=True requires PyTorch 2.0 or later
            handles.append(attn_module.register_forward_pre_hook(force_weights, with_kwargs=True))
            handles.append(attn_module.register_forward_hook(save_weights(name)))
        
        # Encoder self-attention
        for i, layer in enumerate(self.transformer.encoder.layers):
            attach(layer.self_attn, f'encoder_layer_{i}')
        
        # Decoder self-attention and cross-attention
        for i, layer in enumerate(self.transformer.decoder.layers):
            attach(layer.self_attn, f'decoder_self_attn_layer_{i}')
            attach(layer.multihead_attn, f'decoder_cross_attn_layer_{i}')
        
        # Forward pass through transformer; the hooks fill attention_weights
        try:
            self.transformer(
                src=src_embedded,
                tgt=tgt_embedded,
                src_mask=src_mask,
                tgt_mask=tgt_mask,
                memory_mask=memory_mask,
                src_key_padding_mask=src_key_padding_mask,
                tgt_key_padding_mask=tgt_key_padding_mask,
                memory_key_padding_mask=src_key_padding_mask
            )
        finally:
            # Remove every hook, even if the forward pass raises
            for handle in handles:
                handle.remove()
        
        return attention_weights
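
The causal mask that `forward` requests from `generate_square_subsequent_mask` lets target position `i` attend only to positions up to and including `i`. The pattern can be sketched in plain Python; `causal_mask` is a hypothetical name used only for illustration.

```python
def causal_mask(size):
    """Return a size x size additive mask: 0.0 where attention is allowed,
    -inf above the diagonal (future positions)."""
    neg_inf = float("-inf")
    return [[0.0 if col <= row else neg_inf for col in range(size)]
            for row in range(size)]

for row in causal_mask(3):
    print(row)
```

Because the mask is added to the attention scores before the softmax, a `-inf` entry gives the corresponding future token a weight of zero.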

class Attention(nn.Module):
    """
    Attention mechanism for Bidirectional Encoder
    """
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        # The encoder is bidirectional, so its outputs have size enc_hid_dim * 2
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias=False)
        
    def forward(self, hidden, encoder_outputs, mask=None):
        """
        hidden: [batch_size, dec_hid_dim]
        encoder_outputs: [batch_size, src_len, enc_hid_dim * 2]
        mask: [batch_size, src_len] (optional)
        """
        batch_size = encoder_outputs.shape[0]
        src_len = encoder_outputs.shape[1]

        # Repeat hidden for each time step to match encoder outputs
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)  # [batch_size, src_len, dec_hid_dim]
        
        # Concatenate hidden state and encoder outputs
        combined = torch.cat((hidden, encoder_outputs), dim=2)  # [batch_size, src_len, dec_hid_dim + enc_hid_dim * 2]
        
        # Calculate energy
        energy = torch.tanh(self.attn(combined))  # [batch_size, src_len, dec_hid_dim]
        
        # Calculate attention scores
        attention = self.v(energy).squeeze(2)  # [batch_size, src_len]
        
        # Apply mask if given
        if mask is not None:
            attention = attention.masked_fill(mask == 0, -1e10)
        
        attention_weights = torch.softmax(attention, dim=1)  # [batch_size, src_len]
        
        # Calculate context vector
        context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs).squeeze(1)  # [batch_size, enc_hid_dim * 2]
        
        return context, attention_weights
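
The masking step in `Attention.forward` can be illustrated without tensors. The following is a minimal sketch, assuming one sentence and a list of raw scores; `masked_softmax` is a hypothetical helper, not part of the model.

```python
import math

def masked_softmax(scores, mask, fill=-1e10):
    """Softmax over scores, giving padded positions (mask == 0) near-zero weight."""
    filled = [s if m != 0 else fill for s, m in zip(scores, mask)]
    peak = max(filled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]

# Third position is padding, so it should receive no attention
weights = masked_softmax([2.0, 1.0, 0.5], mask=[1, 1, 0])
print(weights)
```

Filling with a large negative number before the softmax, rather than zeroing weights afterwards, keeps the remaining weights summing to one.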


class LSTMSeq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, tgt, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]
        tgt_len = tgt.shape[1]
        tgt_vocab_size = self.decoder.output_dim

        outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)

        src_mask = (src != 0).float().to(self.device)

        # Encode source sentence
        encoder_outputs, hidden, cell = self.encoder(src)

        # Merge forward and backward hidden and cell states
        hidden = self._cat_directions(hidden)
        cell = self._cat_directions(cell)

        # First input to the decoder is the <SOS> token
        input = tgt[:, 0].unsqueeze(1)

        attention_weights = []

        for t in range(1, tgt_len):
            output, hidden, cell, attn_weights = self.decoder(input, hidden, cell, encoder_outputs, src_mask)
            outputs[:, t, :] = output
            attention_weights.append(attn_weights)

            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1).unsqueeze(1)
            input = tgt[:, t].unsqueeze(1) if teacher_force else top1

        return outputs, torch.stack(attention_weights, dim=1)

    def _cat_directions(self, h):
        new_h = torch.cat((h[0:h.size(0):2], h[1:h.size(0):2]), dim=2)  # Concatenate forward and backward
        return new_h

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\97156\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Using device: cuda
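
The `_cat_directions` helper in `LSTMSeq2Seq` relies on how PyTorch orders the final states of a bidirectional LSTM: forward and backward states alternate layer by layer. A plain-Python sketch of the same slicing, using short lists in place of tensors (`cat_directions` and the values are illustrative only):

```python
def cat_directions(states):
    """Pair each layer's forward state with its backward state."""
    forward = states[0::2]   # layer0_fwd, layer1_fwd, ...
    backward = states[1::2]  # layer0_bwd, layer1_bwd, ...
    return [f + b for f, b in zip(forward, backward)]

# Two layers, hidden size 2, one sentence
h_n = [[1, 1], [2, 2], [3, 3], [4, 4]]  # l0_fwd, l0_bwd, l1_fwd, l1_bwd
print(cat_directions(h_n))
```

The merged state is twice as wide as a single direction, so the decoder's hidden size has to match that doubled width. The `Encoder` class is not shown in this section, so how it sizes its hidden state is not confirmed here.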


In [3]:
def train_epoch(model, dataloader, optimizer, criterion, clip, device, teacher_forcing_ratio=0.5):
    """
    Train the model for one epoch
    """
    model.train()
    epoch_loss = 0
    
    for batch in dataloader:
        # Get input and target sequences
        src = batch['en_tokens'].to(device)
        tgt = batch['ur_tokens'].to(device)
        
        # Forward pass
        if isinstance(model, Transformer):
            # For transformer model
            # Shift target for teacher forcing
            tgt_input = tgt[:, :-1]
            tgt_output = tgt[:, 1:]
            
            # Forward pass
            output = model(src, tgt_input)
            
            # Reshape output and target for loss calculation
            output_flat = output.contiguous().view(-1, output.shape[-1])
            tgt_output_flat = tgt_output.contiguous().view(-1)
            
            # Calculate loss
            loss = criterion(output_flat, tgt_output_flat)
        else:
            # For LSTM model
            output, _ = model(src, tgt, teacher_forcing_ratio)
            
            # Reshape output and target for loss calculation
            output_flat = output[:, 1:].contiguous().view(-1, output.shape[-1])
            tgt_output_flat = tgt[:, 1:].contiguous().view(-1)
            
            # Calculate loss
            loss = criterion(output_flat, tgt_output_flat)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        
        # Clip gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        # Update parameters
        optimizer.step()
        
        # Update epoch loss
        epoch_loss += loss.item()
    
    return epoch_loss / len(dataloader)
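
The Transformer branch of `train_epoch` shifts the target by one position so that each decoder input token is paired with the token that follows it. A minimal sketch on a plain list, with hypothetical token ids (1 for the start token, 2 for the end token, 0 for padding):

```python
def shift_target(tgt_ids):
    """Split a target sequence into decoder input and the labels to predict."""
    decoder_input = tgt_ids[:-1]  # everything except the last token
    labels = tgt_ids[1:]          # everything except the leading start token
    return decoder_input, labels

decoder_input, labels = shift_target([1, 17, 42, 2, 0])
print(decoder_input)  # [1, 17, 42, 2]
print(labels)         # [17, 42, 2, 0]
```

The trailing padding label contributes nothing to the loss when the criterion is built with `ignore_index=0`, as it is in the training functions in this notebook.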

def evaluate(model, dataloader, criterion, device):
    """
    Evaluate the model
    """
    model.eval()
    epoch_loss = 0
    
    with torch.no_grad():
        for batch in dataloader:
            # Get input and target sequences
            src = batch['en_tokens'].to(device)
            tgt = batch['ur_tokens'].to(device)
            
            # Forward pass
            if isinstance(model, Transformer):
                # For transformer model
                # Shift target for teacher forcing
                tgt_input = tgt[:, :-1]
                tgt_output = tgt[:, 1:]
                
                # Forward pass
                output = model(src, tgt_input)
                
                # Reshape output and target for loss calculation
                output_flat = output.contiguous().view(-1, output.shape[-1])
                tgt_output_flat = tgt_output.contiguous().view(-1)
                
                # Calculate loss
                loss = criterion(output_flat, tgt_output_flat)
            else:
                # For LSTM model
                output, _ = model(src, tgt, 0)  # No teacher forcing during evaluation
                
                # Reshape output and target for loss calculation
                output_flat = output[:, 1:].contiguous().view(-1, output.shape[-1])
                tgt_output_flat = tgt[:, 1:].contiguous().view(-1)
                
                # Calculate loss
                loss = criterion(output_flat, tgt_output_flat)
            
            # Update epoch loss
            epoch_loss += loss.item()
    
    return epoch_loss / len(dataloader)

def translate_sentence(model, sentence, en_tokenizer, ur_tokenizer, device, max_length=MAX_SEQ_LENGTH):
    """
    Translate a single English sentence to Urdu
    """
    model.eval()
    
    # Preprocess and tokenize sentence
    processed_sentence = preprocess_english(sentence)
    tokens = [en_tokenizer.special_tokens["< SOS >"]] + en_tokenizer.tokenize(processed_sentence)
    tokens_tensor = torch.tensor(tokens).unsqueeze(0).to(device)
    
    # Initialize target with SOS token
    if isinstance(model, Transformer):
        # For transformer model
        # Initialize target tensor with SOS token
        tgt = torch.tensor([ur_tokenizer.special_tokens["< SOS >"]]).unsqueeze(0).to(device)
        
        # Generate translation one token at a time
        for _ in range(max_length):
            # Forward pass (no gradients are needed at inference time)
            with torch.no_grad():
                output = model(tokens_tensor, tgt)
            
            # Get the next token
            next_token = output[:, -1, :].argmax(dim=1).item()
            
            # Break if EOS token
            if next_token == ur_tokenizer.special_tokens["<EOS>"]:
                break
            
            # Add token to target
            tgt = torch.cat([tgt, torch.tensor([[next_token]]).to(device)], dim=1)
        
        # Convert to list and remove SOS token
        tgt = tgt.squeeze(0).tolist()[1:]
    else:
        # For LSTM model
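        # Encode source sentence
        with torch.no_grad():
            encoder_outputs, hidden, cell = model.encoder(tokens_tensor)
        
        # Merge forward and backward states, as LSTMSeq2Seq.forward does,
        # so they match the shape the decoder expects
        hidden = model._cat_directions(hidden)
        cell = model._cat_directions(cell)
        
        # Initialize target with SOS token
        tgt = [ur_tokenizer.special_tokens["< SOS >"]]
        
        # Generate translation one token at a time
        for _ in range(max_length):
            # Convert the most recent token to a tensor
            tgt_tensor = torch.tensor([tgt[-1]]).unsqueeze(0).to(device)
            
            # Forward pass
            with torch.no_grad():
                output, hidden, cell, _ = model.decoder(tgt_tensor, hidden, cell, encoder_outputs)
            
            # Get the next token
            next_token = output.argmax(1).item()
            
            # Break if EOS token (checked before appending, to match the
            # Transformer branch and keep EOS out of the translation)
            if next_token == ur_tokenizer.special_tokens["<EOS>"]:
                break
            
            # Add token to target
            tgt.append(next_token)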
        
        # Remove SOS token
        tgt = tgt[1:]
    
    # Convert tokens to text
    translated_text = ur_tokenizer.detokenize(tgt)
    
    return translated_text
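
Both branches of `translate_sentence` use greedy decoding: at each step the highest-scoring token is appended until the end token appears or the length limit is reached. The loop can be sketched independently of either model. Here `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` merely stands in for a model's scoring step.

```python
def greedy_decode(step_fn, sos_id, eos_id, max_length):
    """Repeatedly pick the highest-scoring next token until EOS or max_length."""
    tokens = [sos_id]
    for _ in range(max_length):
        scores = step_fn(tokens)  # one score per vocabulary id
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start token

def toy_step(prefix):
    """Toy scorer: emits id 3, then id 4, then the end token (id 2)."""
    plan = {1: 3, 2: 4}
    target = plan.get(len(prefix), 2)
    return [1.0 if i == target else 0.0 for i in range(5)]

print(greedy_decode(toy_step, sos_id=1, eos_id=2, max_length=10))
```

Greedy decoding is simple and fast but can miss translations that a beam search would find; the notebook uses the greedy form throughout.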

def calculate_bleu(translation_pairs, en_tokenizer, ur_tokenizer, model, device):
    """
    Calculate BLEU score
    """
    references = []
    hypotheses = []
    
    for en_text, ur_text in translation_pairs:
        # Translate English text
        translation = translate_sentence(model, en_text, en_tokenizer, ur_tokenizer, device)
        
        # Tokenize reference and hypothesis
        reference_tokens = ur_text.split()
        hypothesis_tokens = translation.split()
        
        # Add to lists
        references.append([reference_tokens])
        hypotheses.append(hypothesis_tokens)
    
    # Calculate BLEU score
    smoothing = SmoothingFunction().method1
    bleu_score = corpus_bleu(references, hypotheses, smoothing_function=smoothing)
    
    return bleu_score
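
BLEU is built from clipped n-gram precisions: each hypothesis n-gram counts only as often as it occurs in the reference. `corpus_bleu` performs the full calculation, including the brevity penalty; the sketch below shows only the clipping step for a single sentence pair, with hypothetical helper names.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision for one reference/hypothesis pair."""
    hyp = ngram_counts(hypothesis, n)
    ref = ngram_counts(reference, n)
    if not hyp:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return clipped / sum(hyp.values())

reference = "the cat is on the mat".split()
hypothesis = "the the the cat".split()
print(modified_precision(reference, hypothesis, 1))  # 0.75
```

Without clipping, the repeated "the" would score a perfect unigram precision; clipping limits it to the two occurrences present in the reference.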

def calculate_rouge(translation_pairs, en_tokenizer, ur_tokenizer, model, device):
    """
    Calculate ROUGE score
    """
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=False)
    rouge_scores = {'rouge1': 0, 'rouge2': 0, 'rougeL': 0}
    
    for en_text, ur_text in translation_pairs:
        # Translate English text
        translation = translate_sentence(model, en_text, en_tokenizer, ur_tokenizer, device)
        
        # Calculate ROUGE score
        scores = scorer.score(ur_text, translation)
        
        # Update scores
        for key in rouge_scores:
            rouge_scores[key] += scores[key].fmeasure
    
    # Calculate average
    for key in rouge_scores:
        rouge_scores[key] /= len(translation_pairs)
    
    return rouge_scores
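
One caveat for `calculate_rouge`: the default tokenizer in `rouge_score` appears to keep only lowercase ASCII letters and digits, which would leave Urdu text with no tokens and every score at zero. This is worth verifying against the installed version. As a point of comparison, a minimal sketch of ROUGE-1 F-measure over whitespace tokens, which works for any script (`rouge1_f` is a hypothetical helper, not the library's implementation):

```python
from collections import Counter

def rouge1_f(reference, hypothesis):
    """ROUGE-1 F-measure over whitespace-separated tokens."""
    ref = Counter(reference.split())
    hyp = Counter(hypothesis.split())
    overlap = sum((ref & hyp).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("یہ ایک کتاب ہے", "یہ کتاب ہے")
print(round(score, 4))
```

If the library scores do come out as zero on Urdu references, comparing them against a whitespace-based figure like this one is a quick way to confirm that tokenization is the cause.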

def plot_learning_curves(train_losses, valid_losses, save_path):
    """
    Plot training and validation loss curves
    """
    plt.figure(figsize=(10, 6))
    plt.plot(train_losses, label='Training Loss')
    plt.plot(valid_losses, label='Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Training and Validation Loss Curves')
    plt.legend()
    plt.savefig(save_path)
    plt.close()

def visualize_attention(model, sentence, en_tokenizer, ur_tokenizer, device):
    """
    Visualize attention weights
    """
    model.eval()
    
    # Preprocess and tokenize sentence
    processed_sentence = preprocess_english(sentence)
    tokens = [en_tokenizer.special_tokens["< SOS >"]] + en_tokenizer.tokenize(processed_sentence)
    tokens_tensor = torch.tensor(tokens).unsqueeze(0).to(device)
    
    # For transformer model
    if isinstance(model, Transformer):
        # Initialize target tensor with SOS token
        tgt = torch.tensor([ur_tokenizer.special_tokens["< SOS >"]]).unsqueeze(0).to(device)
        
        # Generate translation one token at a time
        cross_attention = None
        translated_tokens = []
        
        for _ in range(MAX_SEQ_LENGTH):
            # Get attention weights
            attention_weights = model.get_attention_weights(tokens_tensor, tgt)
            
            # Get cross-attention from the last decoder layer: [1, tgt_len, src_len].
            # Decoding is causal, so the matrix from the final step already holds
            # one row per generated token; earlier matrices need not be kept.
            cross_attention = attention_weights[f'decoder_cross_attn_layer_{NUM_DECODER_LAYERS-1}'].detach().cpu().numpy()
            
            # Forward pass
            output = model(tokens_tensor, tgt)
            
            # Get the next token
            next_token = output[:, -1, :].argmax(dim=1).item()
            translated_tokens.append(next_token)
            
            # Break if EOS token
            if next_token == ur_tokenizer.special_tokens["<EOS>"]:
                break
            
            # Add token to target
            tgt = torch.cat([tgt, torch.tensor([[next_token]]).to(device)], dim=1)
        
        # Convert tokens to text. Columns are labelled with source tokens rather
        # than words, because the attention matrix has one column per source token.
        # Assumes en_tokenizer exposes id_to_token in the same way as ur_tokenizer.
        en_text = [en_tokenizer.id_to_token.get(token, "<UNK>") for token in tokens]
        ur_text = [ur_tokenizer.id_to_token.get(token, "<UNK>") for token in translated_tokens]
        
        # Plot attention heatmap
        fig, ax = plt.subplots(figsize=(12, 8))
        attention = cross_attention[0]  # drop the batch dimension: [tgt_len, src_len]
        
        ax.imshow(attention[:len(ur_text), :len(en_text)], cmap='viridis')
        
        # Set axis labels
        ax.set_xticks(range(len(en_text)))
        ax.set_yticks(range(len(ur_text)))
        ax.set_xticklabels(en_text, rotation=90)
        ax.set_yticklabels(ur_text)
        
        # Set title
        ax.set_title('Attention Visualization')
        
        plt.tight_layout()
        return fig
    else:
        # For LSTM model (if needed)
        return None

# Part 4: Main Training Loop and GUI Implementation
#############################################################

def load_and_preprocess_data():
    """
    Load and preprocess the corpus data
    """
    print("Loading and preprocessing data...")
    
    # Load Bible corpus
    en_bible_train, ur_bible_train = load_corpus(TRAIN_EN_PATH, TRAIN_UR_PATH)
    en_bible_dev, ur_bible_dev = load_corpus(DEV_EN_PATH, DEV_UR_PATH)
    en_bible_test, ur_bible_test = load_corpus(TEST_EN_PATH, TEST_UR_PATH)
    
    # Load Quran corpus
    en_quran_train, ur_quran_train = load_corpus(QURAN_TRAIN_EN_PATH, QURAN_TRAIN_UR_PATH)
    en_quran_dev, ur_quran_dev = load_corpus(QURAN_DEV_EN_PATH, QURAN_DEV_UR_PATH)
    en_quran_test, ur_quran_test = load_corpus(QURAN_TEST_EN_PATH, QURAN_TEST_UR_PATH)
    
    # Combine corpus
    en_train = en_bible_train + en_quran_train
    ur_train = ur_bible_train + ur_quran_train
    en_dev = en_bible_dev + en_quran_dev
    ur_dev = ur_bible_dev + ur_quran_dev
    en_test = en_bible_test + en_quran_test
    ur_test = ur_bible_test + ur_quran_test
    
    print(f"Training set size: {len(en_train)}")
    print(f"Validation set size: {len(en_dev)}")
    print(f"Test set size: {len(en_test)}")
    
    # Train tokenizers
    en_tokenizer, ur_tokenizer = train_tokenizers(en_train, ur_train)
    
    # Create datasets
    train_dataset = TranslationDataset(en_train, ur_train, en_tokenizer, ur_tokenizer)
    dev_dataset = TranslationDataset(en_dev, ur_dev, en_tokenizer, ur_tokenizer)
    test_dataset = TranslationDataset(en_test, ur_test, en_tokenizer, ur_tokenizer)
    
    # Create data loaders
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
    dev_loader = DataLoader(dev_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn)
    test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn)
    
    # Create test pairs for BLEU and ROUGE calculations
    test_pairs = list(zip(en_test, ur_test))
    
    return train_loader, dev_loader, test_loader, test_pairs, en_tokenizer, ur_tokenizer

def train_transformer_model(train_loader, dev_loader, en_tokenizer, ur_tokenizer):
    """
    Train the transformer model
    """
    print("Training transformer model...")
    
    # Initialize model
    src_vocab_size = len(en_tokenizer.token_to_id)
    tgt_vocab_size = len(ur_tokenizer.token_to_id)
    
    transformer = Transformer(
        src_vocab_size=src_vocab_size,
        tgt_vocab_size=tgt_vocab_size,
        d_model=EMBEDDING_SIZE,
        nhead=NUM_HEADS,
        num_encoder_layers=NUM_ENCODER_LAYERS,
        num_decoder_layers=NUM_DECODER_LAYERS,
        dim_feedforward=FFN_DIM,
        dropout=DROPOUT
    ).to(device)
    
    # Initialize optimizer and scheduler
    optimizer = optim.Adam(transformer.parameters(), lr=LEARNING_RATE)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2, verbose=True)
    
    # Initialize criterion
    criterion = nn.CrossEntropyLoss(ignore_index=0)
    
    # Initialize variables for early stopping
    best_valid_loss = float('inf')
    patience = 5
    patience_counter = 0
    
    # Initialize lists for learning curves
    train_losses = []
    valid_losses = []
    
    # Training loop
    for epoch in range(NUM_EPOCHS):
        start_time = time.time()
        
        # Train
        train_loss = train_epoch(transformer, train_loader, optimizer, criterion, 1.0, device)
        
        # Evaluate
        valid_loss = evaluate(transformer, dev_loader, criterion, device)
        
        # Update learning rate
        scheduler.step(valid_loss)
        
        # Save losses for plotting
        train_losses.append(train_loss)
        valid_losses.append(valid_loss)
        
        # Calculate time elapsed
        end_time = time.time()
        epoch_mins, epoch_secs = divmod(end_time - start_time, 60)
        
        # Print epoch results
        print(f'Epoch: {epoch+1:02} | Time: {int(epoch_mins)}m {epoch_secs:.2f}s')
        print(f'\tTrain Loss: {train_loss:.4f} | Train PPL: {math.exp(train_loss):.4f}')
        print(f'\t Val. Loss: {valid_loss:.4f} |  Val. PPL: {math.exp(valid_loss):.4f}')
        
        # Save the best model
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(transformer.state_dict(), TRANSFORMER_MODEL_PATH)
            patience_counter = 0
        else:
            patience_counter += 1
        
        # Early stopping
        if patience_counter >= patience:
            print("Early stopping!")
            break
    
    # Plot learning curves
    plot_learning_curves(train_losses, valid_losses, 'transformer_learning_curves.png')
    
    # Load best model
    transformer.load_state_dict(torch.load(TRANSFORMER_MODEL_PATH))
    
    return transformer
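
The early-stopping rule in the training loop stops once the validation loss has failed to improve for `patience` consecutive epochs. It can be isolated as a small function; `early_stopping_epoch` is a hypothetical name and the loss values are illustrative, not results from this project.

```python
def early_stopping_epoch(valid_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            best, waited = loss, 0  # improvement: a checkpoint would be saved here
        else:
            waited += 1
        if waited >= patience:
            return epoch
    return None

losses = [3.0, 2.5, 2.6, 2.7, 2.4, 2.5, 2.6, 2.7]
print(early_stopping_epoch(losses, patience=3))
```

Because the counter resets on every improvement, a late recovery (epoch 5 here) postpones the stop, and the saved checkpoint always corresponds to the lowest validation loss seen.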

def train_lstm_model(train_loader, dev_loader, en_tokenizer, ur_tokenizer):
    """
    Train the LSTM model
    """
    print("Training LSTM model...")
    
    # Initialize model parameters
    src_vocab_size = len(en_tokenizer.token_to_id)
    tgt_vocab_size = len(ur_tokenizer.token_to_id)
    emb_dim = 256
    enc_hid_dim = 512
    dec_hid_dim = 512
    n_layers = 2
    enc_dropout = 0.5
    dec_dropout = 0.5
    
    # Initialize encoder, attention, and decoder
    encoder = Encoder(src_vocab_size, emb_dim, enc_hid_dim, n_layers, enc_dropout).to(device)
    attention = Attention(enc_hid_dim, dec_hid_dim).to(device)
    decoder = Decoder(tgt_vocab_size, emb_dim, enc_hid_dim, dec_hid_dim, n_layers, dec_dropout, attention).to(device)
    
    # Initialize model
    lstm_model = LSTMSeq2Seq(encoder, decoder, device).to(device)
    
    # Initialize optimizer and scheduler
    optimizer = optim.Adam(lstm_model.parameters(), lr=LEARNING_RATE)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2, verbose=True)
    
    # Initialize criterion
    criterion = nn.CrossEntropyLoss(ignore_index=0)
    
    # Initialize variables for early stopping
    best_valid_loss = float('inf')
    patience = 5
    patience_counter = 0
    
    # Initialize lists for learning curves
    train_losses = []
    valid_losses = []
    
    # Training loop
    for epoch in range(NUM_EPOCHS):
        start_time = time.time()
        
        # Train
        train_loss = train_epoch(lstm_model, train_loader, optimizer, criterion, 1.0, device, 0.5)
        
        # Evaluate
        valid_loss = evaluate(lstm_model, dev_loader, criterion, device)
        
        # Update learning rate
        scheduler.step(valid_loss)
        
        # Save losses for plotting
        train_losses.append(train_loss)
        valid_losses.append(valid_loss)
        
        # Calculate time elapsed
        end_time = time.time()
        epoch_mins, epoch_secs = divmod(end_time - start_time, 60)
        
        # Print epoch results
        # divmod returns floats here, so cast the minutes to int for a clean "1m 23.45s" display
        print(f'Epoch: {epoch+1:02} | Time: {int(epoch_mins)}m {epoch_secs:.2f}s')
        print(f'\tTrain Loss: {train_loss:.4f} | Train PPL: {math.exp(train_loss):.4f}')
        print(f'\t Val. Loss: {valid_loss:.4f} |  Val. PPL: {math.exp(valid_loss):.4f}')
        
        # Save the best model
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(lstm_model.state_dict(), LSTM_MODEL_PATH)
            patience_counter = 0
        else:
            patience_counter += 1
        
        # Early stopping
        if patience_counter >= patience:
            print("Early stopping!")
            break
    
    # Plot learning curves
    plot_learning_curves(train_losses, valid_losses, 'lstm_learning_curves.png')
    
    # Load best model
    lstm_model.load_state_dict(torch.load(LSTM_MODEL_PATH, map_location=device))
    
    return lstm_model

def evaluate_models(transformer, lstm_model, test_loader, test_pairs, en_tokenizer, ur_tokenizer):
    """
    Evaluate models on test set
    """
    print("Evaluating models...")
    
    # Initialize criterion
    criterion = nn.CrossEntropyLoss(ignore_index=0)
    
    # Evaluate transformer model (BLEU and ROUGE are computed on the first 100 test pairs only)
    transformer_loss = evaluate(transformer, test_loader, criterion, device)
    transformer_ppl = math.exp(transformer_loss)
    transformer_bleu = calculate_bleu(test_pairs[:100], en_tokenizer, ur_tokenizer, transformer, device)
    transformer_rouge = calculate_rouge(test_pairs[:100], en_tokenizer, ur_tokenizer, transformer, device)
    
    # Evaluate LSTM model
    lstm_loss = evaluate(lstm_model, test_loader, criterion, device)
    lstm_ppl = math.exp(lstm_loss)
    lstm_bleu = calculate_bleu(test_pairs[:100], en_tokenizer, ur_tokenizer, lstm_model, device)
    lstm_rouge = calculate_rouge(test_pairs[:100], en_tokenizer, ur_tokenizer, lstm_model, device)
    
    # Print results
    print("Transformer Model:")
    print(f"Test Loss: {transformer_loss:.4f} | Test PPL: {transformer_ppl:.4f}")
    print(f"BLEU Score: {transformer_bleu:.4f}")
    print(f"ROUGE Scores: {transformer_rouge}")
    
    print("\nLSTM Model:")
    print(f"Test Loss: {lstm_loss:.4f} | Test PPL: {lstm_ppl:.4f}")
    print(f"BLEU Score: {lstm_bleu:.4f}")
    print(f"ROUGE Scores: {lstm_rouge}")
    
    # Compare models
    print("\nModel Comparison:")
    print(f"Loss Difference: {abs(transformer_loss - lstm_loss):.4f}")
    print(f"PPL Difference: {abs(transformer_ppl - lstm_ppl):.4f}")
    print(f"BLEU Difference: {abs(transformer_bleu - lstm_bleu):.4f}")
    
    # Create results table
    results = {
        'Model': ['Transformer', 'LSTM'],
        'Loss': [transformer_loss, lstm_loss],
        'Perplexity': [transformer_ppl, lstm_ppl],
        'BLEU': [transformer_bleu, lstm_bleu],
        'ROUGE-1': [transformer_rouge['rouge1'], lstm_rouge['rouge1']],
        'ROUGE-2': [transformer_rouge['rouge2'], lstm_rouge['rouge2']],
        'ROUGE-L': [transformer_rouge['rougeL'], lstm_rouge['rougeL']]
    }
    
    results_df = pd.DataFrame(results)
    print("\nResults Table:")
    print(results_df)
    
    # Save results to CSV
    results_df.to_csv('model_comparison.csv', index=False)

class TranslationGUI:
    """
    GUI for English to Urdu translation
    """
    def __init__(self, transformer, lstm_model, en_tokenizer, ur_tokenizer):
        self.transformer = transformer
        self.lstm_model = lstm_model
        self.en_tokenizer = en_tokenizer
        self.ur_tokenizer = ur_tokenizer
        self.current_model = transformer  # Default to transformer
        
        # Initialize Tkinter root
        self.root = tk.Tk()
        self.root.title("English to Urdu Translator")
        self.root.geometry("800x600")
        
        # Create frame for model selection
        self.model_frame = tk.Frame(self.root)
        self.model_frame.pack(pady=10)
        
        # Create model selection radio buttons
        self.model_var = tk.StringVar(value="transformer")
        tk.Radiobutton(self.model_frame, text="Transformer", variable=self.model_var, value="transformer", command=self.update_model).pack(side=tk.LEFT, padx=10)
        tk.Radiobutton(self.model_frame, text="LSTM", variable=self.model_var, value="lstm", command=self.update_model).pack(side=tk.LEFT, padx=10)
        
        # Create input label and text entry
        tk.Label(self.root, text="Enter English text:").pack(anchor=tk.W, padx=10, pady=5)
        self.input_entry = tk.Entry(self.root, width=80)
        self.input_entry.pack(padx=10, pady=5, fill=tk.X)
        self.input_entry.bind("<Return>", self.translate)
        
        # Create translate button
        self.translate_button = tk.Button(self.root, text="Translate", command=self.translate)
        self.translate_button.pack(pady=10)
        
        # Create conversation history text widget
        self.conversation = scrolledtext.ScrolledText(self.root, wrap=tk.WORD, width=80, height=20)
        self.conversation.pack(padx=10, pady=10, fill=tk.BOTH, expand=True)
        self.conversation.tag_configure("english", justify=tk.LEFT)
        self.conversation.tag_configure("urdu", justify=tk.RIGHT)
        
        # Create attention visualization button
        self.attention_button = tk.Button(self.root, text="Visualize Attention", command=self.visualize_attention)
        self.attention_button.pack(pady=10)
    
    def update_model(self):
        """
        Update the current model based on radio button selection
        """
        if self.model_var.get() == "transformer":
            self.current_model = self.transformer
        else:
            self.current_model = self.lstm_model
    
    def translate(self, event=None):
        """
        Translate input text and update conversation history
        """
        # Get input text
        input_text = self.input_entry.get()
        
        if not input_text:
            return
        
        # Add input text to conversation
        self.conversation.insert(tk.END, f"{input_text}\n", "english")
        
        # Translate text
        translation = translate_sentence(self.current_model, input_text, self.en_tokenizer, self.ur_tokenizer, device)
        
        # Add translation to conversation
        self.conversation.insert(tk.END, f"{translation}\n\n", "urdu")
        
        # Clear input entry
        self.input_entry.delete(0, tk.END)
        
        # Scroll to bottom
        self.conversation.see(tk.END)
    
    def visualize_attention(self):
        """
        Visualize attention weights for the last translated sentence
        """
        # Get the last English text from conversation
        text = self.conversation.get("1.0", tk.END)
        lines = text.strip().split("\n")
        
        if len(lines) >= 2:
            # After strip(), the trailing blank line is gone, so the last two
            # lines are the English input and its translation
            last_english = lines[-2]
            
            # Only works with Transformer model
            if isinstance(self.current_model, Transformer):
                fig = visualize_attention(self.current_model, last_english, self.en_tokenizer, self.ur_tokenizer, device)
                
                if fig:
                    fig.show()
                else:
                    print("Error: Could not visualize attention")
            else:
                print("Attention visualization is only available for Transformer model")
    
    def run(self):
        """
        Run the GUI main loop
        """
        self.root.mainloop()

def main():
    """
    Main function
    """
    # Create directory for saving models
    os.makedirs(MODEL_SAVE_DIR, exist_ok=True)
    
    # Load and preprocess data
    train_loader, dev_loader, test_loader, test_pairs, en_tokenizer, ur_tokenizer = load_and_preprocess_data()
    
    # Check if models exist
    transformer_exists = os.path.exists(TRANSFORMER_MODEL_PATH)
    lstm_exists = os.path.exists(LSTM_MODEL_PATH)
    
    # Initialize models
    src_vocab_size = len(en_tokenizer.token_to_id)
    tgt_vocab_size = len(ur_tokenizer.token_to_id)
    
    # Initialize transformer model
    transformer = Transformer(
        src_vocab_size=src_vocab_size,
        tgt_vocab_size=tgt_vocab_size,
        d_model=EMBEDDING_SIZE,
        nhead=NUM_HEADS,
        num_encoder_layers=NUM_ENCODER_LAYERS,
        num_decoder_layers=NUM_DECODER_LAYERS,
        dim_feedforward=FFN_DIM,
        dropout=DROPOUT
    ).to(device)
    
    # Initialize LSTM model
    emb_dim = 256
    enc_hid_dim = 512
    dec_hid_dim = 512
    n_layers = 2
    enc_dropout = 0.5
    dec_dropout = 0.5
    
    encoder = Encoder(src_vocab_size, emb_dim, enc_hid_dim, n_layers, enc_dropout).to(device)
    attention = Attention(enc_hid_dim, dec_hid_dim).to(device)
    decoder = Decoder(tgt_vocab_size, emb_dim, enc_hid_dim, dec_hid_dim, n_layers, dec_dropout, attention).to(device)
    lstm_model = LSTMSeq2Seq(encoder, decoder, device).to(device)
    
    # Load trained models if they exist, otherwise train them
    if transformer_exists:
        print("Loading trained transformer model...")
        transformer.load_state_dict(torch.load(TRANSFORMER_MODEL_PATH, map_location=device))
    else:
        print("Training transformer model...")
        transformer = train_transformer_model(train_loader, dev_loader, en_tokenizer, ur_tokenizer)
    
    if lstm_exists:
        print("Loading trained LSTM model...")
        lstm_model.load_state_dict(torch.load(LSTM_MODEL_PATH, map_location=device))
    else:
        print("Training LSTM model...")
        lstm_model = train_lstm_model(train_loader, dev_loader, en_tokenizer, ur_tokenizer)
    
    # Evaluate models on test set
    print("Evaluating models on test set...")
    evaluate_models(transformer, lstm_model, test_loader, test_pairs, en_tokenizer, ur_tokenizer)
    
    # Create and run GUI
    print("Launching translation GUI...")
    gui = TranslationGUI(transformer, lstm_model, en_tokenizer, ur_tokenizer)
    gui.run()

if __name__ == "__main__":
    main()

Loading and preprocessing data...


NameError: name 'load_corpus' is not defined
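
The `NameError` above means that `load_corpus` was not defined (or the code defining it was not run) before `load_and_preprocess_data()` called it. Its real signature is not shown here, so the following is only a minimal sketch under stated assumptions: the corpus is stored as two line-aligned UTF-8 text files with one sentence per line, and the function returns a list of `(english, urdu)` pairs. The parameter names `en_path` and `ur_path` are hypothetical and may not match what `load_and_preprocess_data()` actually passes.

```python
def load_corpus(en_path, ur_path):
    """Hypothetical loader: read two line-aligned files into (en, ur) pairs.

    Assumes one sentence per line and that line i of the English file
    corresponds to line i of the Urdu file.
    """
    # Iterate over the file objects so that only "\n" counts as a line break
    with open(en_path, encoding="utf-8") as f:
        en_lines = [line.rstrip("\n") for line in f]
    with open(ur_path, encoding="utf-8") as f:
        ur_lines = [line.rstrip("\n") for line in f]

    # Misaligned files would silently pair the wrong sentences, so fail early
    if len(en_lines) != len(ur_lines):
        raise ValueError(
            f"Line count mismatch: {len(en_lines)} English vs {len(ur_lines)} Urdu"
        )

    pairs = []
    for en, ur in zip(en_lines, ur_lines):
        en, ur = en.strip(), ur.strip()
        # Skip pairs where either side is empty
        if en and ur:
            pairs.append((en, ur))
    return pairs
```

If this matches the intended behavior, the definition would need to appear before `load_and_preprocess_data()` is called, for example near the other data utilities.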