# NLP From Scratch: Translation with a Sequence to Sequence Network and Attention 🇦🇺

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/pytorch-mastery/blob/main/examples/pytorch-nlp/translation-seq2seq-network-attention.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/pytorch-mastery/blob/main/examples/pytorch-nlp/translation-seq2seq-network-attention.ipynb)

Build a sequence-to-sequence neural machine translation model from scratch in PyTorch. The model uses an attention mechanism to translate English into Vietnamese on Australian tourism sentences, and the code notes how each PyTorch pattern compares with its TensorFlow counterpart.

## Learning Objectives

By the end of this tutorial, you will:

- 🔄 **Master Seq2Seq Architecture** - Understand encoder-decoder neural translation models
- 🎯 **Implement Attention Mechanisms** - Build attention layers for better translation quality
- 🇦🇺 **Handle Australian Tourism Data** - Process English-Vietnamese translation pairs
- 📊 **Visualize Attention Weights** - Understand what the model focuses on during translation
- 🔧 **Compare with TensorFlow** - Learn PyTorch vs TensorFlow implementation differences
- 📈 **Monitor Training Progress** - Use TensorBoard for comprehensive training visualization
- 🌐 **Evaluate Translation Quality** - Implement BLEU score and other translation metrics

## What You'll Build

1. **English-Vietnamese Seq2Seq Translator** - Neural machine translation for Australian tourism content
2. **Attention Visualization** - See what parts of source sentences the model focuses on
3. **Interactive Translation Demo** - Test translations with real Australian tourism examples
4. **Training Monitoring** - Complete TensorBoard integration for loss and attention visualization
5. **Performance Evaluation** - BLEU scores and translation quality assessment

---

In [None]:
# Environment Detection and Setup
import sys
import subprocess
import os
import time

# Detect the runtime environment
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules or "kaggle" in os.environ.get('KAGGLE_URL_BASE', '')
IS_LOCAL = not (IS_COLAB or IS_KAGGLE)

print(f"🔍 Environment Detection:")
print(f"   Local Development: {IS_LOCAL}")
print(f"   Google Colab: {IS_COLAB}")
print(f"   Kaggle Notebooks: {IS_KAGGLE}")

# Platform-specific system setup
if IS_COLAB:
    print("\n⚙️  Setting up Google Colab environment...")
    !apt update -qq
    !apt install -y -qq software-properties-common
elif IS_KAGGLE:
    print("\n⚙️  Setting up Kaggle environment...")
else:
    print("\n⚙️  Setting up local environment...")

In [None]:
# Install required packages for sequence-to-sequence translation
required_packages = [
    "torch",
    "pandas",
    "seaborn", 
    "matplotlib",
    "tensorboard",
    "scikit-learn",
    "nltk",  # For BLEU score evaluation
    "plotly",  # For attention visualization
]

print("📦 Installing packages for seq2seq translation...")
for package in required_packages:
    if IS_COLAB or IS_KAGGLE:
        !pip install -q {package}
    else:
        try:
            subprocess.run([sys.executable, "-m", "pip", "install", "-q", package], 
                          capture_output=True, check=True)
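        except subprocess.CalledProcessError as err:
            # pip exits with 0 when a package is already installed, so a non-zero
            # exit code signals a real failure (network, permissions, bad name).
            print(f"   ⚠️  {package} installation failed (pip exit code {err.returncode})")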

print("✅ Package installation completed!")

In [None]:
# Core PyTorch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.utils.tensorboard import SummaryWriter

# Standard data science stack
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Text processing and evaluation
import re
import string
import random
from collections import Counter, defaultdict
import unicodedata

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# NLTK for BLEU score evaluation
try:
    import nltk
    nltk.download('punkt', quiet=True)
    from nltk.translate.bleu_score import sentence_bleu, corpus_bleu
    from nltk.tokenize import word_tokenize
    print("✅ NLTK imported successfully for BLEU evaluation")
except ImportError:
    print("⚠️  NLTK not available - BLEU scores will be computed manually")

# Set visualization style
sns.set_style("whitegrid")
sns.set_palette("Set2")
plt.rcParams['figure.figsize'] = (14, 8)

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

print("📚 Successfully imported all libraries for seq2seq translation!")

In [None]:
import torch
import platform

def detect_device():
    """
    Detect the best available PyTorch device with comprehensive hardware support.
    
    Priority order:
    1. CUDA (NVIDIA GPUs) - Best performance for deep learning
    2. MPS (Apple Silicon) - Optimized for M1/M2/M3 Macs  
    3. CPU (Universal) - Always available fallback
    
    Returns:
        torch.device: The optimal device for PyTorch operations
        str: Human-readable device description for logging
    """
    # Check for CUDA (NVIDIA GPU)
    if torch.cuda.is_available():
        device = torch.device("cuda")
        gpu_name = torch.cuda.get_device_name(0)
        device_info = f"CUDA GPU: {gpu_name}"
        
        # Additional CUDA info for optimization
        cuda_version = torch.version.cuda
        gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
        
        print(f"🚀 Using CUDA acceleration")
        print(f"   GPU: {gpu_name}")
        print(f"   CUDA Version: {cuda_version}")
        print(f"   GPU Memory: {gpu_memory:.1f} GB")
        
        return device, device_info
    
    # Check for MPS (Apple Silicon)
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        device = torch.device("mps")
        device_info = "Apple Silicon MPS"
        
        # Get system info for Apple Silicon
        system_info = platform.uname()
        
        print(f"🍎 Using Apple Silicon MPS acceleration")
        print(f"   System: {system_info.system} {system_info.release}")
        print(f"   Machine: {system_info.machine}")
        print(f"   Processor: {system_info.processor}")
        
        return device, device_info
    
    # Fallback to CPU
    else:
        device = torch.device("cpu")
        device_info = "CPU (No GPU acceleration available)"
        
        # Get CPU info for optimization guidance
        cpu_count = torch.get_num_threads()
        system_info = platform.uname()
        
        print(f"💻 Using CPU (no GPU acceleration detected)")
        print(f"   Processor: {system_info.processor}")
        print(f"   PyTorch Threads: {cpu_count}")
        print(f"   System: {system_info.system} {system_info.release}")
        
        # Provide optimization suggestions for CPU-only setups
        print(f"\n💡 CPU Optimization Tips:")
        print(f"   • Reduce batch size to prevent memory issues")
        print(f"   • Consider using smaller models for faster training")
        print(f"   • Tune the thread count with torch.set_num_threads(n); currently {cpu_count}")
        
        return device, device_info

# Usage in the notebook
device, device_info = detect_device()
print(f"\n✅ PyTorch device selected: {device}")
print(f"📊 Device info: {device_info}")

# Set global device for the notebook
DEVICE = device

## 🗃️ Dataset: Australian Tourism English-Vietnamese Translation

We'll create a small, hard-coded dataset of 36 English-Vietnamese translation pairs focused on Australian tourism content, enough to exercise the pipeline end to end. This dataset includes:

- **Tourist attractions** in major Australian cities
- **Travel experiences** and recommendations
- **Cultural information** about Australian destinations
- **Practical travel advice** for visitors

This approach follows the repository's policy of using Australian context and Vietnamese as the secondary language for all multilingual examples.

In [None]:
# Australian Tourism English-Vietnamese Translation Dataset
class AustralianTourismTranslationDataset:
    """
    English-Vietnamese translation pairs focused on Australian tourism content.
    
    This dataset follows repository guidelines:
    - Australian context for all examples
    - Vietnamese as the secondary language
    - Tourism and cultural content focus
    """
    
    def __init__(self):
        self.translation_pairs = [
            # Sydney attractions and experiences
            ("The Sydney Opera House is a masterpiece of modern architecture.", 
             "Nhà hát Opera Sydney là kiệt tác kiến trúc hiện đại."),
            ("You can climb the Sydney Harbour Bridge for stunning city views.", 
             "Bạn có thể leo cầu Cảng Sydney để ngắm nhìn thành phố tuyệt đẹp."),
            ("Bondi Beach is perfect for surfing and sunbathing.", 
             "Bãi biển Bondi hoàn hảo cho lướt sóng và tắm nắng."),
            ("The Royal Botanic Gardens offer peaceful walks near the harbor.", 
             "Vườn Bách thảo Hoàng gia cung cấp những con đường yên tĩnh gần cảng."),
            
            # Melbourne culture and attractions
            ("Melbourne is famous for its coffee culture and street art.", 
             "Melbourne nổi tiếng với văn hóa cà phê và nghệ thuật đường phố."),
            ("The laneways of Melbourne hide amazing cafes and galleries.", 
             "Những con hẻm Melbourne ẩn chứa những quán cà phê và phòng tranh tuyệt vời."),
            ("Queen Victoria Market is the largest open-air market in the Southern Hemisphere.", 
             "Chợ Queen Victoria là chợ trời lớn nhất ở Nam bán cầu."),
            ("The Great Ocean Road starts from Melbourne and offers spectacular coastal views.", 
             "Con đường Great Ocean bắt đầu từ Melbourne và mang đến tầm nhìn ven biển ngoạn mục."),
            
            # Brisbane and Queensland
            ("Brisbane is the gateway to the Gold Coast and Sunshine Coast.", 
             "Brisbane là cửa ngõ đến Gold Coast và Sunshine Coast."),
            ("The Great Barrier Reef is accessible from Cairns in tropical Queensland.", 
             "Rạn san hô Great Barrier có thể đến từ Cairns ở Queensland nhiệt đới."),
            ("Fraser Island is the world's largest sand island.", 
             "Đảo Fraser là hòn đảo cát lớn nhất thế giới."),
            ("The Gold Coast theme parks offer thrilling rides and entertainment.", 
             "Các công viên giải trí Gold Coast cung cấp trò chơi ly kỳ và giải trí."),
            
            # Perth and Western Australia
            ("Perth has beautiful beaches and is one of the sunniest cities in the world.", 
             "Perth có những bãi biển đẹp và là một trong những thành phố nắng nhất thế giới."),
            ("Rottnest Island near Perth is home to the friendly quokkas.", 
             "Đảo Rottnest gần Perth là nhà của những chú quokka thân thiện."),
            ("The Pinnacles Desert offers a unique landscape of limestone pillars.", 
             "Sa mạc Pinnacles mang đến cảnh quan độc đáo với những cột đá vôi."),
            
            # Adelaide and South Australia
            ("Adelaide is surrounded by world-class wine regions.", 
             "Adelaide được bao quanh bởi các vùng rượu vang đẳng cấp thế giới."),
            ("Kangaroo Island is famous for its wildlife and natural beauty.", 
             "Đảo Kangaroo nổi tiếng với động vật hoang dã và vẻ đẹp tự nhiên."),
            ("The Barossa Valley produces some of Australia's finest wines.", 
             "Thung lũng Barossa sản xuất một số loại rượu vang tốt nhất Australia."),
            
            # Tasmania (Hobart)
            ("Tasmania offers pristine wilderness and clean air.", 
             "Tasmania mang đến thiên nhiên hoang sơ và không khí trong lành."),
            ("Cradle Mountain-Lake St Clair is perfect for hiking and nature photography.", 
             "Núi Cradle-Hồ St Clair hoàn hảo cho đi bộ đường dài và chụp ảnh thiên nhiên."),
            ("MONA in Hobart is one of the world's most provocative art museums.", 
             "MONA ở Hobart là một trong những bảo tàng nghệ thuật khiêu khích nhất thế giới."),
            
            # Darwin and Northern Territory
            ("Darwin is the gateway to Kakadu National Park.", 
             "Darwin là cửa ngõ đến Công viên Quốc gia Kakadu."),
            ("Uluru is sacred to Aboriginal people and a UNESCO World Heritage site.", 
             "Uluru là thiêng liêng đối với người thổ dân và là di sản thế giới UNESCO."),
            ("Katherine Gorge offers spectacular boat cruises through ancient landscapes.", 
             "Hẻm núi Katherine cung cấp những chuyến du thuyền ngoạn mục qua cảnh quan cổ xưa."),
            
            # Canberra (Capital)
            ("Canberra is Australia's capital and home to important national institutions.", 
             "Canberra là thủ đô Australia và là nơi có các tổ chức quốc gia quan trọng."),
            ("The Australian War Memorial honors the nation's military history.", 
             "Đài tưởng niệm Chiến tranh Australia tôn vinh lịch sử quân sự của đất nước."),
            
            # General travel advice
            ("Australia uses the Australian dollar as its currency.", 
             "Australia sử dụng đô la Australia làm tiền tệ."),
            ("The best time to visit Australia is during the spring and autumn months.", 
             "Thời gian tốt nhất để thăm Australia là vào các tháng mùa xuân và mùa thu."),
            ("Tipping is not mandatory in Australia but is appreciated for good service.", 
             "Tiền boa không bắt buộc ở Australia nhưng được đánh giá cao cho dịch vụ tốt."),
            ("Public transport cards make it easy to travel around Australian cities.", 
             "Thẻ giao thông công cộng giúp di chuyển dễ dàng quanh các thành phố Australia."),
            
            # Food and culture
            ("Try the famous Australian meat pies and sausage rolls.", 
             "Hãy thử những chiếc bánh thịt và bánh cuốn xúc xích nổi tiếng của Australia."),
            ("Vegemite is a unique Australian spread that locals love.", 
             "Vegemite là một loại mứt độc đáo của Australia mà người địa phương yêu thích."),
            ("Fish and chips by the beach is a classic Australian experience.", 
             "Cá và khoai tây chiên bên bãi biển là trải nghiệm Australia kinh điển."),
            
            # Wildlife
            ("Kangaroos and koalas are Australia's most famous native animals.", 
             "Kangaroo và koala là những động vật bản địa nổi tiếng nhất của Australia."),
            ("The Tasmanian devil is found only in Tasmania.", 
             "Quỷ Tasmania chỉ có thể tìm thấy ở Tasmania."),
            ("Wombats are sturdy marsupials that dig extensive burrow systems.", 
             "Wombat là loài thú có túi chắc chắn đào hệ thống hang rộng lớn.")
        ]
        
        print(f"📊 Australian Tourism Translation Dataset Loaded")
        print(f"   Total translation pairs: {len(self.translation_pairs)}")
        print(f"   Languages: English → Vietnamese")
        print(f"   Context: Australian tourism and culture")
    
    def get_pairs(self):
        """Return all translation pairs."""
        return self.translation_pairs
    
    def get_sample(self, n=5):
        """Get a random sample of translation pairs."""
        return random.sample(self.translation_pairs, min(n, len(self.translation_pairs)))
    
    def get_source_sentences(self):
        """Get all English source sentences."""
        return [pair[0] for pair in self.translation_pairs]
    
    def get_target_sentences(self):
        """Get all Vietnamese target sentences."""
        return [pair[1] for pair in self.translation_pairs]

# Create the dataset
dataset = AustralianTourismTranslationDataset()

# Display some examples
print("\n🌏 Sample Translation Pairs:")
print("=" * 70)
for i, (eng, vie) in enumerate(dataset.get_sample(5), 1):
    print(f"{i}. 🇬🇧 English: {eng}")
    print(f"   🇻🇳 Vietnamese: {vie}")
    print()

## 🔤 Text Preprocessing and Vocabulary Building

For sequence-to-sequence translation, we need to:

1. **Normalize text** - Handle Unicode characters, punctuation, and case
2. **Tokenize sentences** - Split into words/subwords
3. **Build vocabularies** - Create word-to-index mappings for both languages
4. **Add special tokens** - `<SOS>`, `<EOS>`, `<PAD>`, `<UNK>`
5. **Convert to tensors** - Transform text into numerical sequences

This preprocessing pipeline handles both English and Vietnamese text with proper Unicode support.
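
As a rough illustration of why the choice of Unicode normalization form matters for Vietnamese, the sketch below runs one word through a character filter under both NFC and NFD. The helper `keep_letters` and its simplified character class are assumptions made for this example, not the notebook's own preprocessing code.

```python
import re
import unicodedata

# Hypothetical, simplified filter: keep ASCII letters, Latin-1 accented
# letters, a few extra Vietnamese letters, and the precomposed block
# U+1EA0..U+1EF9. Anything else becomes a space.
NOT_LETTER = re.compile(
    r"[^a-z\u00e0-\u00ff\u0103\u0111\u0129\u0169\u01a1\u01b0\u1ea0-\u1ef9]+"
)

def keep_letters(text, form):
    """Normalize `text` to the given Unicode form, lowercase it, then filter."""
    text = unicodedata.normalize(form, text).lower()
    return NOT_LETTER.sub(" ", text).strip()

word = "h\u00e1t"  # "hát", written with an escape so the code point is unambiguous

# NFC keeps "á" as one precomposed code point; NFD splits it into "a" plus
# the combining acute accent U+0301.
print(len(unicodedata.normalize("NFC", word)))  # 3
print(len(unicodedata.normalize("NFD", word)))  # 4

nfc = keep_letters(word, "NFC")
nfd = keep_letters(word, "NFD")
print(repr(nfc))  # the word survives intact
print(repr(nfd))  # 'ha t': the combining mark was replaced by a space
```

Under NFD the filter sees the combining mark as a disallowed character, so a single word turns into two tokens once the text is split on whitespace. A filter that lists precomposed letters therefore pairs naturally with NFC.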

In [None]:
class TextPreprocessor:
    """
    Text preprocessing pipeline for English-Vietnamese translation.
    
    Handles Unicode normalization, tokenization, and vocabulary building
    for both languages with special token support.
    """
    
    def __init__(self, max_vocab_size=10000, min_freq=1):
        self.max_vocab_size = max_vocab_size
        self.min_freq = min_freq
        
        # Special tokens
        self.SOS_TOKEN = 0  # Start of sequence
        self.EOS_TOKEN = 1  # End of sequence
        self.PAD_TOKEN = 2  # Padding
        self.UNK_TOKEN = 3  # Unknown word
        
        self.special_tokens = {
            '<SOS>': self.SOS_TOKEN,
            '<EOS>': self.EOS_TOKEN,
            '<PAD>': self.PAD_TOKEN,
            '<UNK>': self.UNK_TOKEN
        }
        
        # Vocabularies will be built from data
        self.src_vocab = {}  # English
        self.tgt_vocab = {}  # Vietnamese
        self.src_vocab_inv = {}  # Index to word
        self.tgt_vocab_inv = {}  # Index to word
    
    def normalize_string(self, s):
        """
        Normalize text for translation preprocessing.
        
        Steps:
        1. Unicode normalization
        2. Lowercase conversion
        3. Remove extra whitespace
        4. Handle punctuation spacing
        """
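        # Unicode normalization. Use NFC so Vietnamese letters stay precomposed;
        # NFD would split them into base letters plus combining marks, which the
        # character filter below would then replace with spaces.
        s = unicodedata.normalize('NFC', s)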
        
        # Convert to lowercase
        s = s.lower().strip()
        
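        # Add a space before sentence-final punctuation, then replace every other
        # disallowed character (digits, commas, hyphens, apostrophes) with a space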
        s = re.sub(r"([.!?])", r" \1", s)
        s = re.sub(r"[^a-zA-Z.!?ÀÁÂÃÈÉÊÌÍÒÓÔÕÙÚÝàáâãèéêìíòóôõùúýĂăĐđĨĩŨũƠơƯưẠ-ỹ]+", r" ", s)
        
        # Remove multiple spaces
        s = re.sub(r"\s+", " ", s).strip()
        
        return s
    
    def tokenize(self, text):
        """
        Simple tokenization by splitting on whitespace.
        
        For production use, consider:
        - Subword tokenization (BPE, SentencePiece)
        - Language-specific tokenizers
        - Handling of Vietnamese word segmentation
        """
        return self.normalize_string(text).split()
    
    def build_vocabulary(self, sentences, is_source=True):
        """
        Build vocabulary from list of sentences.
        
        Args:
            sentences: List of sentences to process
            is_source: True for source language (English), False for target (Vietnamese)
        """
        print(f"📝 Building {'source (English)' if is_source else 'target (Vietnamese)'} vocabulary...")
        
        # Count word frequencies
        word_counts = Counter()
        for sentence in sentences:
            tokens = self.tokenize(sentence)
            word_counts.update(tokens)
        
        # Filter by frequency and limit vocabulary size
        filtered_words = [word for word, count in word_counts.items() if count >= self.min_freq]
        most_common = sorted(filtered_words, key=lambda w: word_counts[w], reverse=True)[:self.max_vocab_size-4]
        
        # Create vocabulary dictionaries
        vocab = dict(self.special_tokens)  # Start with special tokens
        vocab_inv = {idx: token for token, idx in self.special_tokens.items()}
        
        for i, word in enumerate(most_common, start=4):  # Start after special tokens
            vocab[word] = i
            vocab_inv[i] = word
        
        if is_source:
            self.src_vocab = vocab
            self.src_vocab_inv = vocab_inv
        else:
            self.tgt_vocab = vocab
            self.tgt_vocab_inv = vocab_inv
        
        vocab_name = "source" if is_source else "target"
        print(f"   {vocab_name} vocabulary size: {len(vocab)}")
        print(f"   Most common words: {most_common[:10]}")
        
        return vocab
    
    def encode_sentence(self, sentence, is_source=True, add_eos=True):
        """
        Convert sentence to sequence of token indices.
        
        Args:
            sentence: Input sentence string
            is_source: True for source language, False for target
            add_eos: Whether to add end-of-sequence token
        """
        vocab = self.src_vocab if is_source else self.tgt_vocab
        tokens = self.tokenize(sentence)
        
        # Convert tokens to indices
        indices = [vocab.get(token, self.UNK_TOKEN) for token in tokens]
        
        # Append the EOS token when requested (applies to either language)
        if add_eos:
            indices.append(self.EOS_TOKEN)
        
        return indices
    
    def decode_sequence(self, indices, is_source=True):
        """
        Convert sequence of indices back to sentence.
        """
        vocab_inv = self.src_vocab_inv if is_source else self.tgt_vocab_inv
        
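        # Convert indices to tokens, stop at EOS
        tokens = []
        for idx in indices:
            # Accept tensor elements as well as plain ints: a 0-dim tensor
            # hashes by identity and would miss every dictionary lookup.
            idx = int(idx)
            if idx == self.EOS_TOKEN:
                break
            elif idx == self.PAD_TOKEN:
                continue
            else:
                tokens.append(vocab_inv.get(idx, '<UNK>'))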
        
        return ' '.join(tokens)

# Initialize preprocessor and build vocabularies
preprocessor = TextPreprocessor(max_vocab_size=8000, min_freq=1)

# Get all sentences
source_sentences = dataset.get_source_sentences()
target_sentences = dataset.get_target_sentences()

# Build vocabularies
preprocessor.build_vocabulary(source_sentences, is_source=True)
preprocessor.build_vocabulary(target_sentences, is_source=False)

print(f"\n📊 Vocabulary Statistics:")
print(f"   English vocabulary size: {len(preprocessor.src_vocab)}")
print(f"   Vietnamese vocabulary size: {len(preprocessor.tgt_vocab)}")

# Test encoding/decoding
test_sentence = "The Sydney Opera House is beautiful."
encoded = preprocessor.encode_sentence(test_sentence, is_source=True)
decoded = preprocessor.decode_sequence(encoded, is_source=True)

print(f"\n🔄 Encoding/Decoding Test:")
print(f"   Original: {test_sentence}")
print(f"   Encoded: {encoded}")
print(f"   Decoded: {decoded}")

## 🔄 Sequence-to-Sequence Dataset and DataLoader

We need a PyTorch `Dataset` class that:

1. **Handles variable-length sequences** with proper padding
2. **Provides batch processing** with source and target sequences
3. **Supports different sequence lengths** for encoder and decoder
4. **Includes teacher forcing** setup for training

The dataset will return source sequences, target input sequences (with SOS), and target output sequences (with EOS) for proper seq2seq training.
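
The relationship between the decoder input and the decoder target can be sketched in plain Python. The function name `make_decoder_pair` and the word ids are hypothetical; the special-token ids follow the values used in this tutorial (`SOS=0`, `EOS=1`, `PAD=2`).

```python
# Special-token ids as used in this tutorial; the word ids 7, 8, 9 are made up.
SOS, EOS, PAD = 0, 1, 2

def make_decoder_pair(target_ids, max_len):
    """Return (decoder_input, decoder_target), both padded to max_len."""
    if len(target_ids) + 1 > max_len:
        raise ValueError("target does not fit in max_len with its special token")
    decoder_input = [SOS] + target_ids    # what the decoder reads
    decoder_target = target_ids + [EOS]   # what the loss compares against
    # Both sequences have the same length, so they share one padding tail.
    pad = [PAD] * (max_len - len(decoder_input))
    return decoder_input + pad, decoder_target + pad

dec_in, dec_out = make_decoder_pair([7, 8, 9], max_len=6)
print(dec_in)   # [0, 7, 8, 9, 2, 2]
print(dec_out)  # [7, 8, 9, 1, 2, 2]
```

At every position the target holds the token that follows the corresponding input token, which is the shift that teacher forcing relies on.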

In [None]:
class Seq2SeqDataset(Dataset):
    """
    PyTorch Dataset for sequence-to-sequence translation.
    
    Returns:
    - source_seq: Encoded source sentence (English)
    - target_input_seq: Target sentence with SOS token (for decoder input)
    - target_output_seq: Target sentence with EOS token (for loss calculation)
    """
    
    def __init__(self, translation_pairs, preprocessor, max_src_len=50, max_tgt_len=50):
        self.pairs = translation_pairs
        self.preprocessor = preprocessor
        self.max_src_len = max_src_len
        self.max_tgt_len = max_tgt_len
        
        # Pre-process all pairs
        self.processed_pairs = []
        for src_text, tgt_text in translation_pairs:
            # Encode source sentence
            src_indices = preprocessor.encode_sentence(src_text, is_source=True, add_eos=False)
            
            # Encode target sentence
            tgt_indices = preprocessor.encode_sentence(tgt_text, is_source=False, add_eos=False)
            
            # Skip sequences that are too long, reserving room for the special
            # tokens (SOS/EOS) that __getitem__ adds before padding
            if len(src_indices) <= max_src_len - 1 and len(tgt_indices) <= max_tgt_len - 2:
                self.processed_pairs.append((src_indices, tgt_indices))
        
        print(f"📦 Seq2Seq Dataset Created:")
        print(f"   Original pairs: {len(translation_pairs)}")
        print(f"   Filtered pairs: {len(self.processed_pairs)}")
        print(f"   Max source length: {max_src_len}")
        print(f"   Max target length: {max_tgt_len}")
    
    def __len__(self):
        return len(self.processed_pairs)
    
    def __getitem__(self, idx):
        src_indices, tgt_indices = self.processed_pairs[idx]
        
        # Pad source sequence and add EOS
        src_seq = src_indices + [self.preprocessor.EOS_TOKEN]
        src_seq += [self.preprocessor.PAD_TOKEN] * (self.max_src_len - len(src_seq))
        
        # Target input: SOS + target sequence (for decoder input)
        tgt_input_seq = [self.preprocessor.SOS_TOKEN] + tgt_indices
        tgt_input_seq += [self.preprocessor.PAD_TOKEN] * (self.max_tgt_len - len(tgt_input_seq))
        
        # Target output: target sequence + EOS (for loss calculation)
        tgt_output_seq = tgt_indices + [self.preprocessor.EOS_TOKEN]
        tgt_output_seq += [self.preprocessor.PAD_TOKEN] * (self.max_tgt_len - len(tgt_output_seq))
        
        return {
            'source': torch.tensor(src_seq[:self.max_src_len], dtype=torch.long),
            'target_input': torch.tensor(tgt_input_seq[:self.max_tgt_len], dtype=torch.long),
            'target_output': torch.tensor(tgt_output_seq[:self.max_tgt_len], dtype=torch.long),
            'src_len': len(src_indices) + 1,  # +1 for EOS
            'tgt_len': len(tgt_indices) + 1   # +1 for EOS
        }

def create_data_loaders(translation_pairs, preprocessor, batch_size=16, train_split=0.8):
    """
    Create train and validation DataLoaders for seq2seq training.
    """
    # Split data
    train_pairs, val_pairs = train_test_split(
        translation_pairs, 
        train_size=train_split, 
        random_state=42
    )
    
    # Create datasets
    train_dataset = Seq2SeqDataset(train_pairs, preprocessor, max_src_len=50, max_tgt_len=50)
    val_dataset = Seq2SeqDataset(val_pairs, preprocessor, max_src_len=50, max_tgt_len=50)
    
    # Create data loaders
    train_loader = DataLoader(
        train_dataset, 
        batch_size=batch_size, 
        shuffle=True, 
        pin_memory=torch.cuda.is_available()
    )
    
    val_loader = DataLoader(
        val_dataset, 
        batch_size=batch_size, 
        shuffle=False, 
        pin_memory=torch.cuda.is_available()
    )
    
    return train_loader, val_loader, train_dataset, val_dataset

# Create data loaders
translation_pairs = dataset.get_pairs()
train_loader, val_loader, train_dataset, val_dataset = create_data_loaders(
    translation_pairs, 
    preprocessor, 
    batch_size=8  # Smaller batch size for CPU training
)

print(f"\n📊 Data Loader Statistics:")
print(f"   Training batches: {len(train_loader)}")
print(f"   Validation batches: {len(val_loader)}")
print(f"   Batch size: {train_loader.batch_size}")

# Test data loader
sample_batch = next(iter(train_loader))
print(f"\n🔍 Sample Batch Structure:")
print(f"   Source shape: {sample_batch['source'].shape}")
print(f"   Target input shape: {sample_batch['target_input'].shape}")
print(f"   Target output shape: {sample_batch['target_output'].shape}")

# Show example
idx = 0
src_text = preprocessor.decode_sequence(sample_batch['source'][idx], is_source=True)
tgt_text = preprocessor.decode_sequence(sample_batch['target_output'][idx], is_source=False)
print(f"\n📝 Example from batch:")
print(f"   🇬🇧 Source: {src_text}")
print(f"   🇻🇳 Target: {tgt_text}")

## 🧠 Sequence-to-Sequence Model Architecture

We'll implement a complete seq2seq model with an attention mechanism:

### 1. **Encoder**
- **Embedding layer** for source language tokens
- **Bidirectional LSTM** to process source sequences
- **Hidden state combination** from forward and backward passes

### 2. **Attention Mechanism**
- **Additive attention** (Bahdanau attention)
- **Context vector computation** from encoder hidden states
- **Attention weight visualization** capability

### 3. **Decoder**
- **Embedding layer** for target language tokens
- **LSTM with attention context** for sequence generation
- **Output projection** to target vocabulary

This architecture follows the additive attention design of Bahdanau et al. (2015), applied here to our English-Vietnamese translation task.
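
To make the attention step concrete, here is a minimal pure-Python sketch of additive (Bahdanau-style) scoring with a padding mask. It uses the standard textbook form, score = v · tanh(W_dec·s + W_enc·h_i), and assumes at least one unmasked source position. The function name, argument names, and toy weights are assumptions for illustration, and the PyTorch module in this notebook may organize the computation differently.

```python
import math

def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v, mask):
    """Return (context, weights) for one decoder step over one sentence."""
    def matvec(W, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]

    dec_proj = matvec(W_dec, decoder_state)
    scores = []
    for h, keep in zip(encoder_states, mask):
        if not keep:
            scores.append(float("-inf"))  # padding positions get zero weight
            continue
        enc_proj = matvec(W_enc, h)
        hidden = [math.tanh(d + e) for d, e in zip(dec_proj, enc_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Softmax over source positions, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[j] for w, h in zip(weights, encoder_states))
               for j in range(dim)]
    return context, weights

# Toy values chosen by hand; the third source position is padding.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
mask = [True, True, False]
decoder_state = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

context, weights = additive_attention(
    decoder_state, encoder_states, identity, identity, v, mask
)
print(weights)  # three weights summing to 1, the last one exactly 0.0
print(context)
```

The weights returned here are the quantities that an attention heatmap plots: one row per decoder step, one column per source token.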

In [None]:
class Encoder(nn.Module):
    """
    Encoder for sequence-to-sequence model.
    
    Uses bidirectional LSTM to encode source sequences.
    Returns all hidden states for attention mechanism.
    
    TensorFlow equivalent:
        encoder = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(hidden_size, return_sequences=True, return_state=True)
        )
    """
    
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=1, dropout=0.1):
        super(Encoder, self).__init__()
        
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=2)  # PAD_TOKEN = 2
        self.dropout = nn.Dropout(dropout)
        
        # Bidirectional LSTM
        self.lstm = nn.LSTM(
            embed_dim, 
            hidden_dim, 
            num_layers, 
            batch_first=True,
            bidirectional=True,
            dropout=dropout if num_layers > 1 else 0
        )
        
        # Linear layer to combine bidirectional hidden states
        self.hidden_projection = nn.Linear(hidden_dim * 2, hidden_dim)
        self.cell_projection = nn.Linear(hidden_dim * 2, hidden_dim)
    
    def forward(self, src_seq, src_lengths=None):
        """
        Forward pass through encoder.
        
        Args:
            src_seq: Source sequence tensor (batch_size, max_src_len)
            src_lengths: Actual lengths of sequences (for packing)
        
        Returns:
            encoder_outputs: All hidden states (batch_size, max_src_len, hidden_dim * 2)
            final_hidden: Final hidden state (batch_size, hidden_dim)
            final_cell: Final cell state (batch_size, hidden_dim)
        """
        batch_size = src_seq.size(0)
        
        # Embedding
        embedded = self.embedding(src_seq)  # (batch_size, max_src_len, embed_dim)
        embedded = self.dropout(embedded)
        
        # Pack sequences if lengths are provided (for efficiency)
        if src_lengths is not None:
            packed_embedded = nn.utils.rnn.pack_padded_sequence(
                embedded, src_lengths.cpu(), batch_first=True, enforce_sorted=False
            )
            packed_outputs, (hidden, cell) = self.lstm(packed_embedded)
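            # total_length keeps the output at max_src_len even when every
            # sequence in the batch is shorter, matching the documented shape
            encoder_outputs, _ = nn.utils.rnn.pad_packed_sequence(
                packed_outputs, batch_first=True, total_length=src_seq.size(1)
            )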
        else:
            encoder_outputs, (hidden, cell) = self.lstm(embedded)
        
        # encoder_outputs: (batch_size, max_src_len, hidden_dim * 2)
        # hidden: (num_layers * 2, batch_size, hidden_dim)
        # cell: (num_layers * 2, batch_size, hidden_dim)
        
        # Combine bidirectional hidden states
        # Take the last layer's forward and backward hidden states
        forward_hidden = hidden[-2]  # Forward direction
        backward_hidden = hidden[-1]  # Backward direction
        final_hidden = torch.cat([forward_hidden, backward_hidden], dim=1)
        final_hidden = self.hidden_projection(final_hidden)  # (batch_size, hidden_dim)
        
        # Same for cell states
        forward_cell = cell[-2]
        backward_cell = cell[-1]
        final_cell = torch.cat([forward_cell, backward_cell], dim=1)
        final_cell = self.cell_projection(final_cell)  # (batch_size, hidden_dim)
        
        return encoder_outputs, final_hidden, final_cell


class Attention(nn.Module):
    """
    Additive attention mechanism (Bahdanau attention).
    
    Computes attention weights between decoder hidden state and encoder outputs.
    Returns context vector as weighted sum of encoder hidden states.
    
    TensorFlow equivalent would use tf.keras.layers.AdditiveAttention or a custom implementation
    with tf.keras.layers.Dense layers for the same computations.
    """
    
    def __init__(self, encoder_hidden_dim, decoder_hidden_dim, attention_dim):
        super(Attention, self).__init__()
        
        self.encoder_hidden_dim = encoder_hidden_dim  # 2 * hidden_dim (bidirectional)
        self.decoder_hidden_dim = decoder_hidden_dim  # hidden_dim
        self.attention_dim = attention_dim
        
        # Linear layers for attention computation
        self.encoder_projection = nn.Linear(encoder_hidden_dim, attention_dim)
        self.decoder_projection = nn.Linear(decoder_hidden_dim, attention_dim)
        self.attention_vector = nn.Linear(attention_dim, 1, bias=False)
        
    def forward(self, decoder_hidden, encoder_outputs, mask=None):
        """
        Compute attention weights and context vector.
        
        Args:
            decoder_hidden: Current decoder hidden state (batch_size, decoder_hidden_dim)
            encoder_outputs: All encoder hidden states (batch_size, src_len, encoder_hidden_dim)
            mask: Padding mask (batch_size, src_len)
        
        Returns:
            context: Context vector (batch_size, encoder_hidden_dim)
            attention_weights: Attention weights (batch_size, src_len)
        """
        batch_size, src_len, _ = encoder_outputs.size()
        
        # Project encoder outputs
        encoder_proj = self.encoder_projection(encoder_outputs)  # (batch_size, src_len, attention_dim)
        
        # Project decoder hidden state and expand
        decoder_proj = self.decoder_projection(decoder_hidden).unsqueeze(1)  # (batch_size, 1, attention_dim)
        decoder_proj = decoder_proj.expand(batch_size, src_len, -1)  # (batch_size, src_len, attention_dim)
        
        # Compute attention scores
        attention_scores = self.attention_vector(torch.tanh(encoder_proj + decoder_proj))  # (batch_size, src_len, 1)
        attention_scores = attention_scores.squeeze(2)  # (batch_size, src_len)
        
        # Apply mask if provided (padding positions get -inf so softmax gives them zero weight)
        if mask is not None:
            attention_scores = attention_scores.masked_fill(mask == 0, float('-inf'))
        
        # Compute attention weights
        attention_weights = F.softmax(attention_scores, dim=1)  # (batch_size, src_len)
        
        # Compute context vector
        context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs)  # (batch_size, 1, encoder_hidden_dim)
        context = context.squeeze(1)  # (batch_size, encoder_hidden_dim)
        
        return context, attention_weights


class Decoder(nn.Module):
    """
    Decoder with attention mechanism for sequence-to-sequence model.
    
    Uses attention to focus on relevant parts of the source sequence
    during target sequence generation.
    
    TensorFlow equivalent would use tf.keras.layers.LSTM with
    custom attention mechanism integration.
    """
    
    def __init__(self, vocab_size, embed_dim, hidden_dim, encoder_hidden_dim, 
                 attention_dim, num_layers=1, dropout=0.1):
        super(Decoder, self).__init__()
        
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=2)  # PAD_TOKEN = 2
        self.dropout = nn.Dropout(dropout)
        
        # Attention mechanism
        self.attention = Attention(encoder_hidden_dim, hidden_dim, attention_dim)
        
        # LSTM (input: embedding + context)
        self.lstm = nn.LSTM(
            embed_dim + encoder_hidden_dim,  # embedding + context vector
            hidden_dim,
            num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        
        # Output projection
        self.output_projection = nn.Linear(
            hidden_dim + encoder_hidden_dim,  # hidden + context
            vocab_size
        )
    
    def forward(self, target_token, hidden, cell, encoder_outputs, mask=None):
        """
        Single step forward pass of decoder.
        
        Args:
            target_token: Current target token (batch_size, 1)
            hidden: Previous decoder hidden state (batch_size, hidden_dim)
            cell: Previous decoder cell state (batch_size, hidden_dim)
            encoder_outputs: All encoder hidden states (batch_size, src_len, encoder_hidden_dim)
            mask: Source sequence mask (batch_size, src_len)
        
        Returns:
            output: Output logits (batch_size, vocab_size)
            hidden: New hidden state (batch_size, hidden_dim)
            cell: New cell state (batch_size, hidden_dim)
            attention_weights: Attention weights (batch_size, src_len)
        """
        # Embedding
        embedded = self.embedding(target_token)  # (batch_size, 1, embed_dim)
        embedded = self.dropout(embedded)
        
        # Compute attention
        context, attention_weights = self.attention(hidden, encoder_outputs, mask)
        
        # Concatenate embedding and context
        lstm_input = torch.cat([embedded, context.unsqueeze(1)], dim=2)  # (batch_size, 1, embed_dim + encoder_hidden_dim)
        
        # LSTM forward pass (unsqueeze(0) adds the layer dimension, so this step assumes num_layers == 1)
        lstm_output, (new_hidden, new_cell) = self.lstm(lstm_input, (hidden.unsqueeze(0), cell.unsqueeze(0)))
        
        # Remove sequence dimension and layer dimension
        lstm_output = lstm_output.squeeze(1)  # (batch_size, hidden_dim)
        new_hidden = new_hidden.squeeze(0)  # (batch_size, hidden_dim)
        new_cell = new_cell.squeeze(0)  # (batch_size, hidden_dim)
        
        # Concatenate LSTM output and context for final projection
        output_input = torch.cat([lstm_output, context], dim=1)  # (batch_size, hidden_dim + encoder_hidden_dim)
        
        # Output projection
        output = self.output_projection(output_input)  # (batch_size, vocab_size)
        
        return output, new_hidden, new_cell, attention_weights


class Seq2SeqModel(nn.Module):
    """
    Complete sequence-to-sequence model with attention.
    
    Combines encoder, decoder, and attention mechanism for neural machine translation.
    Supports both training (teacher forcing) and inference modes.
    
    Architecture:
    1. Encoder processes source sequence
    2. Decoder generates target sequence step-by-step with attention
    3. Attention mechanism allows decoder to focus on relevant source positions
    """
    
    def __init__(self, src_vocab_size, tgt_vocab_size, embed_dim=256, hidden_dim=512, 
                 attention_dim=256, num_layers=1, dropout=0.1):
        super(Seq2SeqModel, self).__init__()
        
        self.src_vocab_size = src_vocab_size
        self.tgt_vocab_size = tgt_vocab_size
        self.hidden_dim = hidden_dim
        
        # Encoder
        self.encoder = Encoder(
            vocab_size=src_vocab_size,
            embed_dim=embed_dim,
            hidden_dim=hidden_dim,
            num_layers=num_layers,
            dropout=dropout
        )
        
        # Decoder
        self.decoder = Decoder(
            vocab_size=tgt_vocab_size,
            embed_dim=embed_dim,
            hidden_dim=hidden_dim,
            encoder_hidden_dim=hidden_dim * 2,  # Bidirectional encoder
            attention_dim=attention_dim,
            num_layers=num_layers,
            dropout=dropout
        )
    
    def create_mask(self, src_seq, src_lengths):
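        """
        Create mask for padding tokens in source sequence.
        
        Returns a bool tensor (batch_size, max_len) that is True at real tokens
        and False at padding positions.
        """
        max_len = src_seq.size(1)
        lengths = torch.as_tensor(src_lengths, device=src_seq.device)
        positions = torch.arange(max_len, device=src_seq.device).unsqueeze(0)  # (1, max_len)
        
        # Position j of row i is valid when j < length of sequence i
        return positions < lengths.unsqueeze(1)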
    
    def forward(self, src_seq, tgt_seq, src_lengths=None, teacher_forcing_ratio=1.0):
        """
        Forward pass for training with teacher forcing.
        
        Args:
            src_seq: Source sequences (batch_size, src_len)
            tgt_seq: Target sequences (batch_size, tgt_len)
            src_lengths: Source sequence lengths
            teacher_forcing_ratio: Probability of using teacher forcing
        
        Returns:
            outputs: Decoder outputs (batch_size, tgt_len, vocab_size)
            attention_weights: Attention weights (batch_size, tgt_len, src_len)
        """
        batch_size = src_seq.size(0)
        tgt_len = tgt_seq.size(1)
        
        # Encode source sequence
        encoder_outputs, hidden, cell = self.encoder(src_seq, src_lengths)
        
        # Create source mask
        mask = self.create_mask(src_seq, src_lengths) if src_lengths is not None else None
        
        # Initialize outputs and attention weights
        outputs = torch.zeros(batch_size, tgt_len, self.tgt_vocab_size, device=src_seq.device)
        attention_weights = torch.zeros(batch_size, tgt_len, encoder_outputs.size(1), device=src_seq.device)
        
        # First decoder input is SOS token
        decoder_input = tgt_seq[:, 0:1]  # (batch_size, 1)
        
        # Decode step by step
        for t in range(tgt_len):
            # Decoder forward pass
            output, hidden, cell, attn_weights = self.decoder(
                decoder_input, hidden, cell, encoder_outputs, mask
            )
            
            # Store output and attention weights
            outputs[:, t, :] = output
            attention_weights[:, t, :] = attn_weights
            
            # Decide next input (teacher forcing or previous prediction)
            if t < tgt_len - 1:
                if random.random() < teacher_forcing_ratio:
                    # Use teacher forcing: next input is next target token
                    decoder_input = tgt_seq[:, t+1:t+2]
                else:
                    # Use previous prediction
                    decoder_input = output.argmax(dim=1).unsqueeze(1)
        
        return outputs, attention_weights
    
    def translate(self, src_seq, src_lengths=None, max_length=50, eos_token=1):
        """
        Inference mode: translate source sequence to target sequence.
        
        Args:
            src_seq: Source sequence (1, src_len) - single sequence
            src_lengths: Source sequence length
            max_length: Maximum target sequence length
            eos_token: End-of-sequence token ID
        
        Returns:
            translation: Generated target sequence
            attention_weights: Attention weights for visualization
        """
        self.eval()
        
        with torch.no_grad():
            batch_size = src_seq.size(0)
            
            # Encode source sequence
            encoder_outputs, hidden, cell = self.encoder(src_seq, src_lengths)
            
            # Create source mask
            mask = self.create_mask(src_seq, src_lengths) if src_lengths is not None else None
            
            # Initialize with SOS token
            decoder_input = torch.tensor([[0]], device=src_seq.device)  # SOS_TOKEN = 0
            
            # Store results
            translation = []
            attention_weights = []
            
            # Generate tokens one by one
            for _ in range(max_length):
                output, hidden, cell, attn_weights = self.decoder(
                    decoder_input, hidden, cell, encoder_outputs, mask
                )
                
                # Get predicted token
                predicted_token = output.argmax(dim=1)
                translation.append(predicted_token.item())
                attention_weights.append(attn_weights.squeeze(0).cpu().numpy())
                
                # Stop if EOS token is generated
                if predicted_token.item() == eos_token:
                    break
                
                # Use predicted token as next input
                decoder_input = predicted_token.unsqueeze(1)
        
        return translation, np.array(attention_weights)

# Create model instance
model = Seq2SeqModel(
    src_vocab_size=len(preprocessor.src_vocab),
    tgt_vocab_size=len(preprocessor.tgt_vocab),
    embed_dim=256,
    hidden_dim=256,  # Smaller for CPU training
    attention_dim=128,
    num_layers=1,
    dropout=0.1
)

# Move model to device
model = model.to(DEVICE)

# Count parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

total_params = count_parameters(model)
print(f"\n🏗️  Seq2Seq Model Architecture:")
print(f"   Source vocabulary size: {len(preprocessor.src_vocab)}")
print(f"   Target vocabulary size: {len(preprocessor.tgt_vocab)}")
print(f"   Embedding dimension: 256")
print(f"   Hidden dimension: 256")
print(f"   Attention dimension: 128")
print(f"   Total parameters: {total_params:,}")
print(f"   Device: {DEVICE}")

# Test model with sample data
sample_batch = next(iter(train_loader))
src_seq = sample_batch['source'].to(DEVICE)
tgt_input = sample_batch['target_input'].to(DEVICE)

print(f"\n🧪 Model Forward Pass Test:")
print(f"   Input shape: {src_seq.shape}")
print(f"   Target shape: {tgt_input.shape}")

# Forward pass
with torch.no_grad():
    outputs, attention = model(src_seq, tgt_input, teacher_forcing_ratio=1.0)
    print(f"   Output shape: {outputs.shape}")
    print(f"   Attention shape: {attention.shape}")
    print(f"   ✅ Model forward pass successful!")

## 🏋️ Training Loop with TensorBoard Integration

We'll implement a comprehensive training loop that includes:

- **Loss computation** with padding token masking
- **Teacher forcing scheduling** with decreasing ratio over epochs
- **TensorBoard logging** for loss, perplexity, learning rate, and sample translations
- **Gradient clipping** to prevent exploding gradients
- **Learning rate scheduling** for better convergence
- **Model checkpointing** to save best model

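Following repository standards, each batch is moved to `DEVICE` before the forward pass, and loss, perplexity, the teacher forcing ratio, and the learning rate are written to TensorBoard as training runs.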
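
The teacher forcing schedule in the list above can be illustrated in isolation. The sketch below assumes a linear decay from a start ratio to an end ratio, using the same defaults as the training function in this notebook (1.0 and 0.5). The helper name `linear_teacher_forcing` is hypothetical and exists only for this example.

```python
def linear_teacher_forcing(epoch, num_epochs, start=1.0, end=0.5):
    """Linearly interpolate the teacher forcing ratio for a zero-based epoch."""
    return start - (start - end) * (epoch / num_epochs)

# Ratios for a 15-epoch run
ratios = [linear_teacher_forcing(e, 15) for e in range(15)]
print(f"first epoch: {ratios[0]:.3f}, last epoch: {ratios[-1]:.3f}")
```

Because the epoch index is divided by `num_epochs`, the last epoch stops slightly above the end ratio (about 0.533 here). Dividing by `num_epochs - 1` instead would land exactly on it.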

In [None]:
import os
import time
from datetime import datetime

def get_run_logdir(run_name="seq2seq_translation"):
    """Generate unique log directory for TensorBoard."""
    
    # Platform-specific TensorBoard log directory setup:
    # Colab gets an absolute path under /content; Kaggle and local runs
    # both log relative to the working directory
    if IS_COLAB:
        root_logdir = "/content/tensorboard_logs"
    else:
        root_logdir = "./tensorboard_logs"
    
    # Create root directory if it doesn't exist
    os.makedirs(root_logdir, exist_ok=True)
    
    # Generate unique run directory
    timestamp = datetime.now().strftime("%Y_%m_%d-%H_%M_%S")
    run_logdir = os.path.join(root_logdir, f"{run_name}_{timestamp}")
    
    return run_logdir

def masked_cross_entropy_loss(outputs, targets, pad_token=2):
    """
    Compute cross-entropy loss while ignoring padding tokens.
    
    Args:
        outputs: Model predictions (batch_size, seq_len, vocab_size)
        targets: Target sequences (batch_size, seq_len)
        pad_token: Padding token ID to ignore
    
    Returns:
        loss: Masked cross-entropy loss
        num_tokens: Number of non-padding tokens
    """
    # Flatten predictions and targets
    outputs_flat = outputs.view(-1, outputs.size(-1))  # (batch_size * seq_len, vocab_size)
    targets_flat = targets.view(-1)  # (batch_size * seq_len)
    
    # Create mask for non-padding tokens
    mask = (targets_flat != pad_token)
    
    # Compute loss only for non-padding tokens
    loss = F.cross_entropy(outputs_flat, targets_flat, reduction='none')
    masked_loss = loss * mask.float()
    
    # Average over non-padding tokens
    num_tokens = mask.sum().item()
    if num_tokens > 0:
        avg_loss = masked_loss.sum() / num_tokens
    else:
        avg_loss = masked_loss.sum()  # Should be 0
    
    return avg_loss, num_tokens

def train_seq2seq_model(model, train_loader, val_loader, preprocessor, 
                       num_epochs=20, learning_rate=0.001, 
                       teacher_forcing_start=1.0, teacher_forcing_end=0.5,
                       clip_grad_norm=1.0, save_best=True):
    """
    Train the sequence-to-sequence model with comprehensive monitoring.
    
    Features:
    - Teacher forcing ratio decay
    - Gradient clipping
    - TensorBoard logging
    - Model checkpointing
    - Translation examples during training
    """
    
    # Setup optimizer and scheduler
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=3, factor=0.5)
    
    # TensorBoard setup
    log_dir = get_run_logdir("australian_translation")
    writer = SummaryWriter(log_dir)
    
    # Training state
    best_val_loss = float('inf')
    train_losses = []
    val_losses = []
    
    print(f"🚀 Starting Australian Tourism Translation Training")
    print(f"📊 Device: {DEVICE}")
    print(f"🎯 Target: English → Vietnamese translation")
    print(f"🌏 Context: Australian tourism content")
    print(f"📈 Epochs: {num_epochs}")
    print(f"🔢 Learning Rate: {learning_rate}")
    print(f"📝 TensorBoard logs: {log_dir}")
    print("=" * 70)
    
    for epoch in range(num_epochs):
        epoch_start_time = time.time()
        
        # Calculate teacher forcing ratio (linear decay; the last epoch stops just short of teacher_forcing_end)
        teacher_forcing_ratio = teacher_forcing_start - (teacher_forcing_start - teacher_forcing_end) * (epoch / num_epochs)
        
        # Training phase
        model.train()
        train_loss = 0.0
        train_tokens = 0
        
        for batch_idx, batch in enumerate(train_loader):
            # Move data to device
            src_seq = batch['source'].to(DEVICE)
            tgt_input = batch['target_input'].to(DEVICE)
            tgt_output = batch['target_output'].to(DEVICE)
            src_lengths = batch['src_len'].to(DEVICE)
            
            # Forward pass
            optimizer.zero_grad()
            outputs, attention_weights = model(src_seq, tgt_input, src_lengths, teacher_forcing_ratio)
            
            # Compute loss
            loss, num_tokens = masked_cross_entropy_loss(outputs, tgt_output)
            
            # Backward pass
            loss.backward()
            
            # Gradient clipping
            if clip_grad_norm > 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad_norm)
            
            optimizer.step()
            
            # Accumulate metrics
            train_loss += loss.item() * num_tokens
            train_tokens += num_tokens
            
            # Log batch metrics
            if batch_idx % 10 == 0:
                step = epoch * len(train_loader) + batch_idx
                writer.add_scalar('Loss/Train_Batch', loss.item(), step)
                writer.add_scalar('Meta/Teacher_Forcing_Ratio', teacher_forcing_ratio, step)
                writer.add_scalar('Meta/Learning_Rate', optimizer.param_groups[0]['lr'], step)
        
        # Calculate average training loss
        avg_train_loss = train_loss / train_tokens if train_tokens > 0 else 0
        train_losses.append(avg_train_loss)
        
        # Validation phase
        model.eval()
        val_loss = 0.0
        val_tokens = 0
        
        with torch.no_grad():
            for batch in val_loader:
                src_seq = batch['source'].to(DEVICE)
                tgt_input = batch['target_input'].to(DEVICE)
                tgt_output = batch['target_output'].to(DEVICE)
                src_lengths = batch['src_len'].to(DEVICE)
                
                outputs, attention_weights = model(src_seq, tgt_input, src_lengths, 1.0)
                loss, num_tokens = masked_cross_entropy_loss(outputs, tgt_output)
                
                val_loss += loss.item() * num_tokens
                val_tokens += num_tokens
        
        avg_val_loss = val_loss / val_tokens if val_tokens > 0 else 0
        val_losses.append(avg_val_loss)
        
        # Learning rate scheduling
        scheduler.step(avg_val_loss)
        
        # Log epoch metrics
        writer.add_scalar('Loss/Train_Epoch', avg_train_loss, epoch)
        writer.add_scalar('Loss/Validation', avg_val_loss, epoch)
        writer.add_scalar('Perplexity/Train', np.exp(avg_train_loss), epoch)
        writer.add_scalar('Perplexity/Validation', np.exp(avg_val_loss), epoch)
        
        # Save best model
        if save_best and avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'train_loss': avg_train_loss,
                'val_loss': avg_val_loss
            }, 'best_seq2seq_model.pth')
        
        # Sample translation for monitoring (uses the dataset behind val_loader
        # instead of relying on a global val_dataset variable)
        if epoch % 5 == 0:
            sample_translation_logging(model, val_loader.dataset, preprocessor, writer, epoch)
        
        # Print epoch summary
        epoch_time = time.time() - epoch_start_time
        print(f"Epoch {epoch+1:2d}/{num_epochs} | "
              f"Train Loss: {avg_train_loss:.4f} | "
              f"Val Loss: {avg_val_loss:.4f} | "
              f"TF Ratio: {teacher_forcing_ratio:.3f} | "
              f"Time: {epoch_time:.1f}s")
    
    writer.close()
    
    print(f"\n🎯 Training completed!")
    print(f"📊 Best validation loss: {best_val_loss:.4f}")
    print(f"📈 TensorBoard logs saved to: {log_dir}")
    
    return train_losses, val_losses, log_dir

def sample_translation_logging(model, dataset, preprocessor, writer, epoch):
    """
    Generate sample translations and log to TensorBoard.
    """
    model.eval()
    
    # Get a few samples
    sample_indices = [0, 5, 10]
    translations_text = []
    
    with torch.no_grad():
        for idx in sample_indices:
            if idx < len(dataset):
                sample = dataset[idx]
                src_seq = sample['source'].unsqueeze(0).to(DEVICE)
                src_len = torch.tensor([sample['src_len']], device=DEVICE)
                
                # Generate translation
                translation, attention = model.translate(src_seq, src_len, max_length=50)
                
                # Decode sequences
                src_text = preprocessor.decode_sequence(sample['source'], is_source=True)
                tgt_text = preprocessor.decode_sequence(sample['target_output'], is_source=False)
                pred_text = preprocessor.decode_sequence(translation, is_source=False)
                
                translation_example = f"Source: {src_text}\nTarget: {tgt_text}\nPrediction: {pred_text}\n"
                translations_text.append(translation_example)
    
    # Log translations as text
    writer.add_text('Translations/Sample', '\n'.join(translations_text), epoch)

print("🏋️ Training setup complete! Ready to start training.")

In [None]:
# Start training the model
print("🎬 Starting seq2seq model training...")

# Train the model
train_losses, val_losses, log_directory = train_seq2seq_model(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    preprocessor=preprocessor,
    num_epochs=15,  # Moderate number for demonstration
    learning_rate=0.001,
    teacher_forcing_start=1.0,
    teacher_forcing_end=0.5,
    clip_grad_norm=1.0,
    save_best=True
)

print("\n" + "=" * 60)
print("📊 TENSORBOARD VISUALIZATION")
print("=" * 60)
print(f"Log directory: {log_directory}")
print("\n🚀 To view TensorBoard:")

if IS_COLAB:
    print("   In Google Colab:")
    print("   1. Run: %load_ext tensorboard")
    print(f"   2. Run: %tensorboard --logdir {log_directory}")
    print("   3. TensorBoard will appear inline in the notebook")
elif IS_KAGGLE:
    print("   In Kaggle:")
    print(f"   1. Download logs from: {log_directory}")
    print("   2. Run locally: tensorboard --logdir ./tensorboard_logs")
    print("   3. Open http://localhost:6006 in browser")
else:
    print("   Locally:")
    print(f"   1. Run: tensorboard --logdir {log_directory}")
    print("   2. Open http://localhost:6006 in browser")

print("\n📈 Available visualizations:")
print("   • Scalars: Training and validation loss, perplexity")
print("   • Text: Sample translations during training")
print("   • Meta: Teacher forcing ratio, learning rate")
print("=" * 60)

## 📊 Training Results Visualization

Let's plot the training and validation loss, together with the corresponding perplexity, using seaborn (following repository visualization standards) to see how the model performed during training.
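
Perplexity is the exponential of the average per-token cross-entropy loss (measured in nats), so it can be read as the effective number of equally likely tokens the model is choosing between at each step. A small sketch with made-up loss values, not results from this notebook:

```python
import math

def perplexity(avg_cross_entropy):
    """Convert an average per-token cross-entropy (natural log) to perplexity."""
    return math.exp(avg_cross_entropy)

# Hypothetical loss values for illustration only
for loss in (4.0, 2.0, 1.0):
    print(f"loss {loss:.1f} -> perplexity {perplexity(loss):.2f}")
```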

In [None]:
# Visualize training progress
def plot_training_metrics(train_losses, val_losses):
    """
    Plot training metrics with seaborn styling for Australian tourism model.
    """
    
    # Create DataFrame for seaborn
    epochs = range(1, len(train_losses) + 1)
    
    # Combine losses for plotting
    metrics_df = pd.DataFrame({
        'Epoch': list(epochs) * 2,
        'Loss': train_losses + val_losses,
        'Phase': ['Training'] * len(epochs) + ['Validation'] * len(epochs)
    })
    
    # Create subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Loss plot
    sns.lineplot(data=metrics_df, x='Epoch', y='Loss', hue='Phase', ax=ax1, marker='o')
    ax1.set_title('Australian Tourism Translation - Training Loss', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Cross-Entropy Loss')
    ax1.grid(True, alpha=0.3)
    ax1.legend(title='Phase')
    
    # Perplexity plot
    perplexity_df = metrics_df.copy()
    perplexity_df['Perplexity'] = np.exp(perplexity_df['Loss'])
    
    sns.lineplot(data=perplexity_df, x='Epoch', y='Perplexity', hue='Phase', ax=ax2, marker='s')
    ax2.set_title('Australian Tourism Translation - Perplexity', fontsize=14, fontweight='bold')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Perplexity')
    ax2.grid(True, alpha=0.3)
    ax2.legend(title='Phase')
    
    plt.tight_layout()
    plt.show()
    
    # Print final metrics
    final_train_loss = train_losses[-1]
    final_val_loss = val_losses[-1]
    final_train_perplexity = np.exp(final_train_loss)
    final_val_perplexity = np.exp(final_val_loss)
    
    print(f"📊 Final Training Metrics:")
    print(f"   Training Loss: {final_train_loss:.4f} (Perplexity: {final_train_perplexity:.2f})")
    print(f"   Validation Loss: {final_val_loss:.4f} (Perplexity: {final_val_perplexity:.2f})")
    
    # Convergence analysis
    improvement = train_losses[0] - train_losses[-1]
    print(f"   Training improvement: {improvement:.4f}")
    
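    # Rough heuristic: validation always uses full teacher forcing while training
    # uses a decaying ratio, so the two losses are not strictly comparable
    if final_val_loss < final_train_loss + 0.1: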
        print("   ✅ Model shows good generalization (low overfitting)")
    else:
        print("   ⚠️  Model may be overfitting - consider regularization")

# Plot the training results
plot_training_metrics(train_losses, val_losses)

## 🎯 Model Evaluation and BLEU Scores

Let's evaluate our trained model using standard translation metrics:

- **BLEU Score** - Measures n-gram overlap between predictions and references
- **Translation Examples** - Qualitative analysis of translation quality
- **Error Analysis** - Common translation patterns and mistakes

We'll test on both our validation set and some new Australian tourism examples.
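
To make the n-gram overlap idea concrete, the sketch below computes the two ingredients of BLEU for a single sentence pair: the clipped (modified) n-gram precision and the brevity penalty. It is a minimal illustration with hypothetical helper names (`modified_precision`, `brevity_penalty`) and a toy English sentence pair, not the evaluation code used in this notebook.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(pred_tokens, ref_tokens, n):
    """Clipped precision: a predicted n-gram counts at most as often as it occurs in the reference."""
    pred_counts = Counter(ngrams(pred_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[ng]) for ng, count in pred_counts.items())
    return clipped / total

def brevity_penalty(pred_len, ref_len):
    """Penalize predictions shorter than the reference; longer ones get no bonus."""
    if pred_len == 0:
        return 0.0
    if pred_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / pred_len)

pred = "the the the cat".split()
ref = "the cat sat".split()
print(modified_precision(pred, ref, 1))
print(brevity_penalty(2, 3))
```

Without clipping, the unigram precision of the toy prediction would be 4/4, because every predicted word appears somewhere in the reference. Clipping reduces it to 2/4, since "the" occurs only once in the reference.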

In [None]:
def compute_bleu_score(predictions, references, max_n=4):
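    """
    Compute the mean sentence-level BLEU score over prediction/reference pairs.
    
    Uses NLTK's sentence_bleu when nltk has been imported; otherwise falls back
    to a simple single-reference implementation.
    For production, use NLTK or sacrebleu for more accurate scoring.
    """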
    try:
        # Use NLTK if available
        if 'nltk' in globals():
            # Convert to required format for NLTK
            references_nltk = [[ref.split()] for ref in references]
            predictions_nltk = [pred.split() for pred in predictions]
            
            bleu_scores = []
            for pred, ref in zip(predictions_nltk, references_nltk):
                score = sentence_bleu(ref, pred, weights=(0.25, 0.25, 0.25, 0.25))
                bleu_scores.append(score)
            
            return np.mean(bleu_scores)
    except Exception:
        # NLTK scoring failed (e.g. sentence_bleu was never imported); use the fallback below
        pass
    
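    # Simple BLEU implementation (single reference, no smoothing)
    from collections import Counter
    
    def get_ngrams(tokens, n):
        return [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
    
    total_score = 0
    for pred, ref in zip(predictions, references):
        pred_tokens = pred.split()
        ref_tokens = ref.split()
        
        scores = []
        for n in range(1, max_n + 1):
            pred_counts = Counter(get_ngrams(pred_tokens, n))
            ref_counts = Counter(get_ngrams(ref_tokens, n))
            total = sum(pred_counts.values())
            
            if total == 0:
                scores.append(0)
            else:
                # Clip each n-gram count by its count in the reference so
                # repeated words are not rewarded more than once
                matches = sum(min(count, ref_counts[ng]) for ng, count in pred_counts.items())
                scores.append(matches / total)
        
        # Geometric mean of n-gram scores
        if all(s > 0 for s in scores):
            bleu = np.exp(np.mean(np.log(scores)))
        else:
            bleu = 0
        
        # Brevity penalty: 1 when the prediction is at least as long as the
        # reference, exp(1 - ref_len / pred_len) when it is shorter
        if len(pred_tokens) == 0:
            bp = 0
        elif len(pred_tokens) >= len(ref_tokens):
            bp = 1.0
        else:
            bp = np.exp(1 - len(ref_tokens) / len(pred_tokens))
        total_score += bleu * bp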
    
    return total_score / len(predictions) if predictions else 0

def evaluate_model(model, dataset, preprocessor, num_samples=50, device=DEVICE):
    """
    Evaluate the trained model on validation data.
    
    Returns BLEU scores and example translations.
    """
    model.eval()
    
    predictions = []
    references = []
    examples = []
    
    print(f"🔍 Evaluating model on {num_samples} samples...")
    
    with torch.no_grad():
        for i in range(min(num_samples, len(dataset))):
            sample = dataset[i]
            
            # Prepare input
            src_seq = sample['source'].unsqueeze(0).to(device)
            src_len = torch.tensor([sample['src_len']], device=device)
            
            # Generate translation
            translation, attention = model.translate(src_seq, src_len, max_length=50)
            
            # Decode sequences
            src_text = preprocessor.decode_sequence(sample['source'], is_source=True)
            ref_text = preprocessor.decode_sequence(sample['target_output'], is_source=False)
            pred_text = preprocessor.decode_sequence(translation, is_source=False)
            
            predictions.append(pred_text)
            references.append(ref_text)
            
            # Store examples for display
            if len(examples) < 10:
                examples.append({
                    'source': src_text,
                    'reference': ref_text,
                    'prediction': pred_text,
                    'attention': attention
                })
    
    # Compute BLEU score
    bleu_score = compute_bleu_score(predictions, references)
    
    return bleu_score, examples, predictions, references

# Load best model if available
try:
    checkpoint = torch.load('best_seq2seq_model.pth', map_location=DEVICE)
    model.load_state_dict(checkpoint['model_state_dict'])
    print("✅ Loaded best model from checkpoint")
except FileNotFoundError:
    print("ℹ️  Using current model state (no checkpoint found)")

# Evaluate the model
bleu_score, examples, predictions, references = evaluate_model(
    model, val_dataset, preprocessor, num_samples=30
)

print(f"\n📊 Evaluation Results:")
print(f"   BLEU Score: {bleu_score:.4f}")
print(f"   Samples evaluated: {len(predictions)}")

# Interpret BLEU score (rough, illustrative thresholds for this small dataset; not a standard scale)
if bleu_score > 0.3:
    quality = "Excellent"
elif bleu_score > 0.2:
    quality = "Good"
elif bleu_score > 0.1:
    quality = "Fair"
else:
    quality = "Needs Improvement"

print(f"   Translation Quality: {quality}")

# Display example translations
print(f"\n🌏 Sample Translations:")
print("=" * 80)

for i, example in enumerate(examples[:5], 1):
    print(f"{i}. 🇬🇧 Source:     {example['source']}")
    print(f"   🇻🇳 Reference:  {example['reference']}")
    print(f"   🤖 Prediction: {example['prediction']}")
    print()
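
To make the reported score less of a black box, here is a minimal sketch of how a sentence-level BLEU can be assembled from clipped n-gram precisions and a brevity penalty. It is an illustration under stated assumptions: `sentence_bleu_sketch` and its helpers are hypothetical names, it only uses n-grams up to length 2 with no smoothing, and it is not necessarily how this notebook's `compute_bleu_score` is implemented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(pred, ref, n):
    """Clipped n-gram precision: each predicted n-gram counts at most as often as it occurs in the reference."""
    pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return overlap / total if total else 0.0

def sentence_bleu_sketch(pred, ref, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty (no smoothing)."""
    precisions = [modified_precision(pred, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    # Penalise predictions that are shorter than the reference
    brevity_penalty = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Toy example (made-up sentences, not taken from the dataset)
reference = "the opera house is in sydney".split()
candidate = "the opera house in sydney".split()
print(f"Toy BLEU-2: {sentence_bleu_sketch(candidate, reference):.4f}")
```

In the toy example every unigram matches, three of the four bigrams match, and the candidate is one token short, so the score is the brevity penalty `exp(1 - 6/5)` times `sqrt(1.0 * 0.75)`. Real evaluations usually aggregate counts over the whole corpus and use n-grams up to 4.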

## 🎨 Attention Visualization

One of the key advantages of attention mechanisms is interpretability. We can visualize which parts of the source sentence the model "pays attention to" when generating each word in the target sentence.

This helps us understand:
- **Alignment quality** between source and target languages
- **Translation patterns** the model has learned
- **Potential issues** like attention collapse or misalignment
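
Alongside the heatmap, a couple of simple numbers can hint at the issues listed above. The sketch below is one possible diagnostic, not part of the notebook's pipeline: `attention_row_entropy` and `looks_collapsed` are hypothetical helper names, and the "same peak for every target step" rule is only a rough heuristic for attention collapse. It is written in plain Python, so it accepts nested lists as well as the rows of a NumPy array.

```python
import math

def attention_row_entropy(attention_rows):
    """Shannon entropy (in nats) of each target step's attention distribution."""
    entropies = []
    for row in attention_rows:
        weights = [float(w) for w in row]
        total = sum(weights)
        # Renormalise so that trimmed rows still sum to 1
        probs = [w / total for w in weights] if total > 0 else weights
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return entropies

def looks_collapsed(attention_rows):
    """Heuristic: every target step puts its largest weight on the same source position."""
    peaks = [max(range(len(row)), key=lambda j: row[j]) for row in attention_rows]
    return len(peaks) > 1 and len(set(peaks)) == 1

# Toy matrices with shape (target_len, source_len)
aligned = [[0.90, 0.05, 0.05],
           [0.05, 0.90, 0.05],
           [0.05, 0.05, 0.90]]
collapsed = [[0.80, 0.10, 0.10],
             [0.70, 0.20, 0.10],
             [0.90, 0.05, 0.05]]

print("Aligned entropies:  ", [round(h, 3) for h in attention_row_entropy(aligned)])
print("Collapsed heuristic:", looks_collapsed(aligned), looks_collapsed(collapsed))
```

Low entropy means a sharp, confident alignment for that target token; entropy close to `log(source_len)` means the weights are nearly uniform and carry little alignment information.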

In [None]:
def visualize_attention(source_text, target_text, attention_weights, preprocessor):
    """
    Create attention heatmap using seaborn.
    
    Args:
        source_text: Source sentence
        target_text: Target sentence  
        attention_weights: Attention weights array (target_len, source_len)
        preprocessor: Text preprocessor for tokenization
    """
    # Tokenize sentences
    src_tokens = preprocessor.tokenize(source_text)
    tgt_tokens = target_text.split()
    
    # Trim attention weights to match actual tokens
    attention_trimmed = attention_weights[:len(tgt_tokens), :len(src_tokens)]
    
    # Create heatmap
    plt.figure(figsize=(12, 8))
    
    # Use seaborn for better aesthetics
    sns.heatmap(
        attention_trimmed,
        xticklabels=src_tokens,
        yticklabels=tgt_tokens,
        cmap='Blues',
        cbar_kws={'label': 'Attention Weight'},
        square=False,
        linewidths=0.5
    )
    
    plt.title('Attention Weights: English → Vietnamese Translation\n' + 
              f'Source: {source_text[:50]}...\n' +
              f'Target: {target_text[:50]}...', 
              fontsize=12, pad=20)
    plt.xlabel('Source Tokens (English)', fontsize=11)
    plt.ylabel('Target Tokens (Vietnamese)', fontsize=11)
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    
    plt.tight_layout()
    plt.show()
    
    # Print attention analysis
    print(f"🔍 Attention Analysis:")
    print(f"   Source length: {len(src_tokens)} tokens")
    print(f"   Target length: {len(tgt_tokens)} tokens")
    print(f"   Attention matrix shape: {attention_trimmed.shape}")
    
    # Find strongest attention weights
    max_attention = np.max(attention_trimmed)
    max_pos = np.unravel_index(np.argmax(attention_trimmed), attention_trimmed.shape)
    print(f"   Strongest attention: {max_attention:.3f} at position {max_pos}")
    print(f"   Target '{tgt_tokens[max_pos[0]]}' ← Source '{src_tokens[max_pos[1]]}'")

def create_interactive_translation_demo(model, preprocessor, device=DEVICE):
    """
    Create an interactive translation demonstration.
    
    Shows translations for new Australian tourism sentences.
    """
    model.eval()
    
    # New test sentences (not in training data)
    test_sentences = [
        "Australia has unique wildlife like koalas and kangaroos.",
        "The weather in Sydney is perfect for outdoor activities.",
        "Melbourne's food scene is diverse and exciting.",
        "The Great Ocean Road offers stunning coastal scenery.",
        "Adelaide is known for its wine and festivals.",
        "Tasmania's wilderness areas are pristine and beautiful.",
        "Perth has some of the world's most beautiful beaches.",
        "The Australian Outback is vast and mysterious."
    ]
    
    print("🎭 Interactive Australian Tourism Translation Demo")
    print("=" * 60)
    
    translations = []
    
    with torch.no_grad():
        for i, sentence in enumerate(test_sentences, 1):
            # Encode sentence
            src_indices = preprocessor.encode_sentence(sentence, is_source=True, add_eos=True)
            src_seq = torch.tensor([src_indices], dtype=torch.long, device=device)
            src_len = torch.tensor([len(src_indices)], device=device)
            
            # Generate translation
            translation, attention = model.translate(src_seq, src_len, max_length=50)
            
            # Decode translation
            translated_text = preprocessor.decode_sequence(translation, is_source=False)
            
            # Store for visualization
            translations.append({
                'source': sentence,
                'translation': translated_text,
                'attention': attention
            })
            
            print(f"{i}. 🇬🇧 English:    {sentence}")
            print(f"   🇻🇳 Vietnamese: {translated_text}")
            print()
    
    return translations

# Run interactive demo
demo_translations = create_interactive_translation_demo(model, preprocessor)

# Visualize attention for a couple of examples
print("\n🎨 Attention Visualizations:")
print("=" * 60)

# Show attention for first two examples
for i, example in enumerate(demo_translations[:2]):
    print(f"\nExample {i+1}:")
    visualize_attention(
        example['source'], 
        example['translation'], 
        example['attention'], 
        preprocessor
    )

## 🔬 TensorFlow vs PyTorch Implementation Comparison

Let's compare key differences between implementing seq2seq models in TensorFlow and PyTorch:

### **Key Architectural Differences**

| Component | TensorFlow/Keras | PyTorch |
|-----------|------------------|----------|
| **Model Definition** | Functional/Sequential API | `nn.Module` subclassing |
| **Training Loop** | `model.fit()` (or a custom `tf.GradientTape` loop) | Manual loop with optimizer steps |
| **Attention** | `tf.keras.layers.Attention`, `AdditiveAttention` | Custom `nn.Module` here; `nn.MultiheadAttention` is built in |
| **Data Loading** | `tf.data.Dataset` | `torch.utils.data.DataLoader` |
| **Device Management** | Mostly automatic | Explicit `.to(device)` calls |

### **Code Comparison Examples**

#### **Model Training:**
```python
# TensorFlow/Keras approach
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
history = model.fit(train_data, validation_data=val_data, epochs=20)

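# PyTorch approach (what we implemented, simplified)
optimizer = optim.Adam(model.parameters(), lr=0.001)
for epoch in range(20):
    for batch in train_loader:
        # Unpack the batch; 'target_input' is an assumed key name for the decoder input
        src_seq, tgt_seq = batch['source'], batch['target_input']
        targets = batch['target_output']
        optimizer.zero_grad()
        outputs, attention = model(src_seq, tgt_seq)
        # Flatten outputs/targets first if the criterion expects (N, vocab) logits
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()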
```
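
For a version of the manual loop that actually runs, the sketch below trains a toy next-token predictor on random data. It is only meant to show the mechanics that `model.fit()` hides: flattening logits for `nn.CrossEntropyLoss`, ignoring padded positions, and clipping gradients. The toy model, the sizes, `PAD_IDX = 0` and the clipping threshold are all assumptions made for illustration, not the settings used in this notebook.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

PAD_IDX, VOCAB, BATCH, SEQ_LEN = 0, 12, 4, 5  # illustrative sizes

# Toy stand-in for a decoder: embed each input token and predict the next-token logits
toy_model = nn.Sequential(
    nn.Embedding(VOCAB, 16, padding_idx=PAD_IDX),
    nn.Linear(16, VOCAB),
)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)  # padded targets add no loss
optimizer = optim.Adam(toy_model.parameters(), lr=0.01)

# Random "decoder input" and "target output" batches
tgt_input = torch.randint(1, VOCAB, (BATCH, SEQ_LEN))
targets = torch.randint(1, VOCAB, (BATCH, SEQ_LEN))
targets[:, -1] = PAD_IDX  # pretend the last position is padding

losses = []
for step in range(50):
    optimizer.zero_grad()
    logits = toy_model(tgt_input)  # (batch, seq_len, vocab)
    # CrossEntropyLoss expects (N, vocab) logits and (N,) class indices
    loss = criterion(logits.reshape(-1, VOCAB), targets.reshape(-1))
    loss.backward()
    # Gradient clipping is a common safeguard for recurrent seq2seq models
    torch.nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm=1.0)
    optimizer.step()
    losses.append(loss.item())

print(f"Loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Every step here (zeroing gradients, forward pass, loss, backward pass, clipping, parameter update) is explicit, which is where PyTorch's fine-grained control comes from.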

#### **Attention Mechanism:**
```python
# TensorFlow/Keras
attention_layer = tf.keras.layers.Attention()
context = attention_layer([query, value, key])

# PyTorch (our implementation)
class Attention(nn.Module):
    def __init__(self, encoder_hidden_dim, decoder_hidden_dim, attention_dim):
        super().__init__()  # required before registering submodules
        # Custom attention implementation with linear layers
        self.attention_vector = nn.Linear(attention_dim, 1, bias=False)
```
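
The PyTorch fragment above shows only one layer, so here is a minimal, self-contained sketch of an additive (Bahdanau-style) attention module. Treat it as one plausible layout rather than this notebook's exact class: apart from `attention_vector`, the layer names (`encoder_proj`, `decoder_proj`), the class name and the boolean `mask` argument are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttentionSketch(nn.Module):
    """score(s, h_i) = v^T tanh(W_s s + W_h h_i), followed by a softmax over source positions."""

    def __init__(self, encoder_hidden_dim, decoder_hidden_dim, attention_dim):
        super().__init__()
        self.encoder_proj = nn.Linear(encoder_hidden_dim, attention_dim, bias=False)
        self.decoder_proj = nn.Linear(decoder_hidden_dim, attention_dim)
        self.attention_vector = nn.Linear(attention_dim, 1, bias=False)

    def forward(self, decoder_hidden, encoder_outputs, mask=None):
        # decoder_hidden:  (batch, decoder_hidden_dim)
        # encoder_outputs: (batch, src_len, encoder_hidden_dim)
        # mask:            (batch, src_len) bool, True for real tokens, False for padding
        energy = torch.tanh(
            self.encoder_proj(encoder_outputs)
            + self.decoder_proj(decoder_hidden).unsqueeze(1)
        )                                                    # (batch, src_len, attention_dim)
        scores = self.attention_vector(energy).squeeze(-1)   # (batch, src_len)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))  # padding gets zero weight
        weights = F.softmax(scores, dim=-1)                  # (batch, src_len)
        # Weighted sum of the encoder states
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, weights

torch.manual_seed(0)
attn = AdditiveAttentionSketch(encoder_hidden_dim=8, decoder_hidden_dim=6, attention_dim=5)
encoder_outputs = torch.randn(2, 4, 8)
decoder_hidden = torch.randn(2, 6)
mask = torch.tensor([[True, True, True, False],
                     [True, True, True, True]])
context, weights = attn(decoder_hidden, encoder_outputs, mask)
print(context.shape, weights.shape)
```

The context vector has shape `(batch, encoder_hidden_dim)` and the weights have shape `(batch, src_len)`; stacking the weights over decoding steps gives the `(target_len, source_len)` matrix that the heatmap visualizes.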

### **Advantages of Each Framework**

**TensorFlow/Keras Advantages:**
- 🚀 **Faster prototyping** with high-level APIs
- 🔧 **Built-in training loops** with callbacks and metrics
- 📊 **Native TensorBoard integration** through Keras callbacks (PyTorch can also log to it via `torch.utils.tensorboard`)
- 🏭 **Production deployment** tools (TensorFlow Serving, TensorFlow Lite)

**PyTorch Advantages:**
- 🔬 **Research flexibility** with dynamic computation graphs
- 🛠️ **Fine-grained control** over training process
- 🐍 **Pythonic design** built on ordinary Python classes and control flow
- 🔍 **Easier debugging** with standard Python debugging tools
- 📚 **Helpful for learning** because hand-written loops expose the underlying ML concepts

## 🎓 Summary and Next Steps

Congratulations! You've successfully implemented a complete sequence-to-sequence neural machine translation model from scratch. Here's what we accomplished:

### **✅ What We Built**

1. **📚 Australian Tourism Translation Dataset** - English-Vietnamese pairs with cultural context
2. **🔤 Text Preprocessing Pipeline** - Unicode normalization, tokenization, vocabulary building
3. **🏗️ Seq2Seq Architecture** - Bidirectional LSTM encoder with attention mechanism
4. **🎯 Attention Mechanism** - Additive (Bahdanau) attention for better translations
5. **🏋️ Training Pipeline** - Complete training loop with TensorBoard integration
6. **📊 Evaluation Framework** - BLEU scores and qualitative analysis
7. **🎨 Attention Visualization** - Interpretable attention weight heatmaps
8. **🎭 Interactive Demo** - Real-time translation of Australian tourism content

### **🔑 Key Learning Points**

- **Encoder-Decoder Architecture**: How to encode source sequences and decode target sequences
- **Attention Mechanisms**: Why attention helps and how to implement it
- **Teacher Forcing**: Training technique for sequence generation models
- **PyTorch Training Loops**: Manual training vs TensorFlow's automated approach
- **Translation Evaluation**: BLEU scores and qualitative assessment methods
- **Attention Visualization**: Making neural networks more interpretable

### **🚀 Next Steps for Improvement**

1. **🔤 Advanced Tokenization**
   - Implement subword tokenization (BPE, SentencePiece)
   - Handle out-of-vocabulary words better
   - Add proper Vietnamese word segmentation

2. **🧠 Model Architecture**
   - Try Transformer models (self-attention)
   - Experiment with different attention mechanisms
   - Add copy mechanisms for handling proper nouns

3. **📊 Data and Evaluation**
   - Collect larger, more diverse datasets
   - Implement more evaluation metrics (METEOR, chrF, TER)
   - Add human evaluation studies

4. **🏭 Production Deployment**
   - Model quantization for faster inference
   - Beam search decoding for better translations
   - REST API deployment with FastAPI/Flask

5. **🔬 Research Extensions**
   - Multi-language support beyond English-Vietnamese
   - Domain adaptation for different types of content
   - Zero-shot translation capabilities
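
As a starting point for the beam search item above, here is a minimal, framework-free sketch. It assumes a hypothetical `step_log_probs(prefix)` callable that returns next-token log-probabilities; in this notebook that role would be played by one decoder step, which is not wired in here. The toy probability table is made up so that greedy decoding and beam search disagree.

```python
import math

def beam_search(step_log_probs, sos, eos, beam_width=3, max_length=10):
    """Keep the `beam_width` best partial hypotheses at every step."""
    beams = [([sos], 0.0)]   # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_length):
        # Expand every live hypothesis by every possible next token
        candidates = [
            (tokens + [token], score + logp)
            for tokens, score in beams
            for token, logp in step_log_probs(tokens).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)   # hypotheses cut off at max_length
    # Length-normalised score avoids always favouring short outputs
    return max(finished, key=lambda c: c[1] / len(c[0]))

# Toy next-token distribution keyed by the previous token (illustrative only)
TOY_MODEL = {
    "<sos>": {"a": 0.6, "b": 0.4},
    "a": {"c": 0.5, "d": 0.5},
    "b": {"e": 0.95, "d": 0.05},
    "c": {"<eos>": 1.0},
    "d": {"<eos>": 1.0},
    "e": {"<eos>": 1.0},
}

def toy_step(prefix):
    return {tok: math.log(p) for tok, p in TOY_MODEL[prefix[-1]].items()}

greedy_tokens, _ = beam_search(toy_step, "<sos>", "<eos>", beam_width=1)
beam_tokens, beam_score = beam_search(toy_step, "<sos>", "<eos>", beam_width=2)
print("Greedy:", greedy_tokens)
print("Beam-2:", beam_tokens, f"(p = {math.exp(beam_score):.2f})")
```

With `beam_width=1` the search reduces to greedy decoding and commits to `a` (probability 0.6), ending with a sequence of probability 0.30. With `beam_width=2` it keeps `b` alive and finds the sequence `b e` with probability 0.38.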

### **📚 Additional Resources**

- **Papers**: "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2015)
- **Books**: "Natural Language Processing with Python" (Bird et al.) and "Deep Learning" (Goodfellow et al.)
- **Courses**: Stanford CS224N, Fast.ai NLP course
- **Datasets**: WMT translation shared tasks, OpenSubtitles corpus

### **🎯 Repository Integration**

This notebook demonstrates:
- ✅ **Australian context** in all examples and data
- ✅ **Vietnamese as secondary language** for multilingual tasks
- ✅ **PyTorch vs TensorFlow comparisons** for learning transition
- ✅ **Device-aware implementation** with CPU/GPU/MPS support
- ✅ **TensorBoard integration** following repository standards
- ✅ **Seaborn visualizations** for training metrics
- ✅ **Comprehensive documentation** with learning objectives

You now have a solid foundation in sequence-to-sequence models and attention mechanisms. The skills you've learned here transfer directly to modern Transformer architectures and can be applied to many other sequence modeling tasks beyond translation!

**Happy coding and keep exploring! 🚀🇦🇺🇻🇳**