# Transformer-based Neural Machine Translation

**Author:** Molla Samser  
**Website:** https://rskworld.in  
**Email:** help@rskworld.in, support@rskworld.in  
**Phone:** +91 93305 39277  
**Designer & Tester:** Rima Khatun

This notebook demonstrates a complete Transformer-based neural machine translation system implementing:
- Multi-head self-attention mechanism
- Positional encoding
- Encoder-decoder architecture
- Beam search for high-quality translation generation


In [None]:
# Import necessary libraries

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from transformer_model import Transformer
from data_preprocessing import Vocabulary, normalize_string, load_data
from inference import translate_sentence, greedy_decode, load_model
import os

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")


## 1. Model Architecture Visualization

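Let's create a transformer with the base configuration from the original Transformer paper (`d_model=512`, 8 heads, 6 layers, `d_ff=2048`) and examine its architecture and parameter count.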


In [None]:
# Initialize a sample transformer model

# Model hyperparameters
src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_len = 100
dropout = 0.1

# Create model
model = Transformer(
    src_vocab_size=src_vocab_size,
    tgt_vocab_size=tgt_vocab_size,
    d_model=d_model,
    num_heads=num_heads,
    num_encoder_layers=num_layers,
    num_decoder_layers=num_layers,
    d_ff=d_ff,
    max_len=max_len,
    dropout=dropout
)

print("Model Architecture:")
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")


## 2. Understanding Self-Attention Mechanism

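Let's see how self-attention behaves on a simple example: a random tensor is passed through `MultiHeadAttention` as query, key, and value, and the output shape is compared with the input shape.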


In [None]:
# Demonstrate self-attention mechanism
"""
This cell runs a random input through MultiHeadAttention, using the same
tensor as query, key, and value, so every position attends to every other
position in the sequence.
"""

from transformer_model import MultiHeadAttention

# Example: self-attention on a small random input
d_model = 128
num_heads = 4
batch_size = 1
seq_len = 5

# Create sample input (batch_size, seq_len, d_model)
x = torch.randn(batch_size, seq_len, d_model)

# Create attention module
attn = MultiHeadAttention(d_model, num_heads)

# Compute self-attention: query, key, and value are all the same tensor
output, attention_weights = attn(x, x, x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print("\nSelf-attention allows each word to attend to all words in the sequence,")
print("capturing relationships regardless of distance.")
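
The computation inside a single attention head can be sketched without any framework. The snippet below is a minimal, dependency-free illustration of scaled dot-product attention, softmax(QKᵀ / sqrt(d_k)) V, on plain Python lists. It is not the `MultiHeadAttention` class from `transformer_model.py`; the function names `softmax` and `scaled_dot_product_attention` and the toy input are chosen here purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention on lists of vectors.

    queries: seq_len_q vectors of size d_k
    keys:    seq_len_k vectors of size d_k
    values:  seq_len_k vectors of size d_v
    Returns (outputs, weights); each row of weights sums to 1.
    """
    d_k = len(keys[0])
    scale = math.sqrt(d_k)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        row = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(row, values))
               for j in range(len(values[0]))]
        weights.append(row)
        outputs.append(out)
    return outputs, weights


# Self-attention: queries, keys, and values all come from the same sequence
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Multi-head attention typically runs several such heads in parallel on learned linear projections of the input and concatenates their outputs; that projection step is omitted from this sketch.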


## 3. Training the Model (Example)

Note: This is a demonstration. For actual training, use the `train.py` script with your parallel corpus.


In [None]:
# Print an example training command (no training runs in this cell)
"""
To train the model, you would:
1. Prepare a parallel corpus with one sentence pair per line: source_sentence ||| target_sentence
2. Run: python train.py --data_path your_data.txt --num_epochs 50 --batch_size 32
"""

print("Training example:")
print("=" * 50)
print("python train.py \\")
print("    --data_path data/parallel_corpus.txt \\")
print("    --num_epochs 50 \\")
print("    --batch_size 32 \\")
print("    --d_model 512 \\")
print("    --num_heads 8 \\")
print("    --num_layers 6 \\")
print("    --lr 0.0001 \\")
print("    --save_dir ./models")
print("\nThis will:")
print("1. Load and preprocess the parallel corpus")
print("2. Build source and target vocabularies")
print("3. Train the transformer model")
print("4. Save the best model checkpoint")
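
The `source_sentence ||| target_sentence` corpus format can be parsed in a few lines. The snippet below is a sketch of one plausible reader, not the `load_data` function from `data_preprocessing.py`, whose behaviour is not shown in this notebook. The name `parse_parallel_lines` and the choice to silently skip malformed lines are assumptions made for this example.

```python
def parse_parallel_lines(lines, separator="|||"):
    """Split 'source ||| target' lines into (source, target) pairs.

    Blank lines and lines without exactly one separator are skipped
    (an assumption; a stricter loader might raise an error instead).
    """
    pairs = []
    for line in lines:
        parts = line.split(separator)
        if len(parts) != 2:
            continue
        source, target = (part.strip() for part in parts)
        if source and target:
            pairs.append((source, target))
    return pairs


# Hypothetical sample lines, including two that should be skipped
sample = [
    "hello world ||| bonjour le monde",
    "",
    "a line without a separator",
    "good morning ||| bonjour",
]
pairs = parse_parallel_lines(sample)
print(pairs)
```

Because the function accepts any iterable of strings, an open file object can be passed directly, for example `with open(path, encoding="utf-8") as f: pairs = parse_parallel_lines(f)`.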


## 4. Translation Inference

If you have a trained model, you can use it for translation:


In [None]:
# Translation example
"""
This demonstrates how to use a trained model for translation.
Uncomment and modify if you have a trained model.
"""

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Example usage (commented out - requires trained model):
"""
# Load vocabularies
from data_preprocessing import load_vocabularies
src_vocab, tgt_vocab = load_vocabularies('./models')

# Load model
model = load_model('./models/best_model.pt', device)

# Translate a sentence
source_sentence = "Hello, how are you?"
translation = translate_sentence(
    source_sentence,
    './models/best_model.pt',
    './models',
    device=device,
    use_beam_search=True,
    beam_width=5
)

print(f"Source: {source_sentence}")
print(f"Translation: {translation}")
"""

print("Translation Inference:")
print("=" * 50)
print("To translate sentences, use:")
print("1. Single sentence: python inference.py --model_path models/best_model.pt --sentence 'Your sentence here'")
print("2. File translation: python inference.py --model_path models/best_model.pt --input_file input.txt --output_file output.txt")
print("\nOptions:")
print("- --use_beam_search: Use beam search (better quality, slower)")
print("- --beam_width: Number of beams (default: 5)")


## 5. Model Components Explanation

### 5.1 Positional Encoding

Positional encoding adds information about each word's position in the sequence, since self-attention on its own has no inherent notion of order.


In [None]:
# Visualize positional encoding

from transformer_model import PositionalEncoding
import matplotlib.pyplot as plt

# Create positional encoding
d_model = 128
max_len = 50
pos_encoding = PositionalEncoding(d_model, max_len, dropout=0)

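# Generate encoding for visualization. Assuming the module adds the encoding
# to its input, an all-zero input with dropout=0 returns the encoding itself.
pos_encoding.eval()
x = torch.zeros(1, max_len, d_model)
with torch.no_grad():
    pe = pos_encoding(x)

# Plot (transposed so that positions run along the x-axis)
plt.figure(figsize=(12, 6))
plt.imshow(pe[0].numpy().T, aspect='auto', cmap='RdYlGn')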
plt.colorbar()
plt.xlabel('Position')
plt.ylabel('Dimension')
plt.title('Positional Encoding Pattern')
plt.tight_layout()
plt.show()

print("Positional encoding adds unique patterns for each position,")
print("allowing the model to understand word order.")
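
The pattern in a plot like this can also be reproduced by hand. The snippet below is a minimal sketch assuming the standard sinusoidal scheme, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Whether `PositionalEncoding` in `transformer_model.py` uses exactly this formula is an assumption, and `sinusoidal_encoding` is an illustrative name, not part of the project.

```python
import math


def sinusoidal_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings.

    Assumes the standard formulation: even dimensions use sine, odd
    dimensions use cosine, with wavelengths growing geometrically.
    """
    table = []
    for pos in range(max_len):
        row = []
        for dim in range(d_model):
            pair_index = dim // 2  # dims 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * pair_index / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe_table = sinusoidal_encoding(max_len=50, d_model=128)
print(len(pe_table), len(pe_table[0]))
print([round(v, 3) for v in pe_table[1][:4]])
```

Low dimensions oscillate quickly with position while high dimensions change slowly, so every position receives a distinct vector.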


## 6. Key Features

1. **Self-Attention**: Each word can attend to all words in the sequence
2. **Multi-Head Attention**: Multiple attention heads capture different types of relationships
3. **Positional Encoding**: Adds positional information to embeddings
4. **Encoder-Decoder**: Separate encoder and decoder for source and target languages
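5. **Residual Connections**: Help gradients flow when training deep networks
6. **Layer Normalization**: Stabilizes training
7. **Beam Search**: Keeps the `beam_width` most probable partial translations at each decoding step instead of committing to the single best token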
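
Beam search (item 7) can be illustrated on a toy problem. The sketch below keeps the `beam_width` highest-scoring partial sequences at each step and ranks them by summed log-probability. It is not the routine in `inference.py`; `beam_search`, `toy_step`, and the hard-coded `TOY_TABLE` are hypothetical stand-ins for a decoder's next-token distribution.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_width=3, max_steps=10):
    """Generic beam search.

    step_fn(sequence) must return a dict {next_token: probability}.
    Returns the (sequence, log-probability) pair with the highest score.
    """
    beams = [([start_token], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                candidates.append((seq, score))  # finished: carry over unchanged
                continue
            for token, prob in step_fn(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        # Keep only the best beam_width hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams[0]


# Hypothetical next-token probabilities, keyed on the last token only
TOY_TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}


def toy_step(seq):
    # Any token not in the table is followed by the end token
    return TOY_TABLE.get(seq[-1], {"</s>": 1.0})


# beam_width=1 is equivalent to greedy decoding
greedy_seq, greedy_logp = beam_search(toy_step, "<s>", "</s>", beam_width=1)
beam_seq, beam_logp = beam_search(toy_step, "<s>", "</s>", beam_width=3)
print(greedy_seq, round(math.exp(greedy_logp), 2))
print(beam_seq, round(math.exp(beam_logp), 2))
```

In this toy table the greedy path commits to `a` first and ends with a lower overall probability than the path through `b`, which the wider beam recovers. Practical implementations often also normalize scores by sequence length so that shorter hypotheses are not favored; that is left out here.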

## 7. Project Structure

```
transformer-nmt/
├── transformer_model.py      # Core transformer architecture
├── data_preprocessing.py     # Data loading and preprocessing
├── train.py                  # Training script
├── inference.py              # Inference and translation
├── transformer_nmt_demo.ipynb # This notebook
├── requirements.txt          # Python dependencies
├── README.md                 # Project documentation
└── models/                   # Saved models and vocabularies
```

## 8. Contact Information

For more projects and resources, visit: https://rskworld.in
