[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.6/machine-translation.ipynb)
[![Open with SageMaker](https://img.shields.io/badge/Open%20with-SageMaker-orange?logo=amazonaws)](https://studiolab.sagemaker.aws/import/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.6/machine-translation.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.6/machine-translation.ipynb)

# 🌐 Machine Translation: Converting Text Between Languages

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- What machine translation is and its practical applications
- How to use Marian models for neural machine translation
- How to use T5 models for text-to-text translation
- How to translate Vietnamese to English with modern transformer models
- How different translation model architectures compare

## 📋 Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with Python and PyTorch
- Knowledge of NLP fundamentals (refer to [NLP Learning Journey](https://github.com/vuhung16au/nlp-learning-journey))

## 📚 What We'll Cover
1. **Introduction**: Machine translation concepts
2. **Environment Setup**: Library imports and device detection
3. **Marian Models**: Using Helsinki-NLP Marian models
4. **T5 Models**: Text-to-text translation with T5
5. **Translation Examples**: Vietnamese to English examples
6. **Translation Results and Comparison**: Side-by-side model outputs
7. **Model Analysis and Best Practices**: Strengths, use cases, and practical tips
8. **Summary**: Key takeaways and next steps

## Part 1: Introduction to Machine Translation

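**Machine Translation (MT)** is the task of automatically translating text from one language to another using computational methods. Modern neural machine translation (NMT) systems are typically transformer encoder-decoder models trained on parallel sentence pairs: the encoder reads the source sentence and the decoder generates the target sentence token by token.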

### 🔍 Key Model Types:

**Marian Models:**
- Specialized encoder-decoder architecture for translation
- Trained on parallel corpora from the OPUS collection
- Fast inference and good quality for specific language pairs

**T5 Models:**
- Text-to-Text Transfer Transformer approach
- Treats translation as a text generation task
- More flexible, but each task is selected with a text prefix the model saw during training (for example, `translate English to German:`)
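
To make the difference in input format concrete, here is a minimal sketch of how the text handed to each model family might be prepared. The helper names `build_t5_prompt` and `build_marian_input` are hypothetical and not part of any library; the prefix wording follows the original T5 convention and is only meaningful for checkpoints trained with it.

```python
def build_t5_prompt(text: str, source_lang: str, target_lang: str) -> str:
    """Prepend a T5-style translation prefix to the input text (hypothetical helper)."""
    return f"translate {source_lang} to {target_lang}: {text.strip()}"


def build_marian_input(text: str) -> str:
    """Single-pair Marian checkpoints encode the language pair in the model itself,
    so the input is just the source sentence (hypothetical helper)."""
    return text.strip()


sentence = "Xin chào!"
t5_input = build_t5_prompt(sentence, "Vietnamese", "English")
marian_input = build_marian_input(sentence)

print(t5_input)
print(marian_input)
```

The T5-style input carries the task description inside the text, while the Marian input is the bare sentence.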

### 🇻🇳 Why Vietnamese to English?
- Vietnamese is a tonal, analytic language written in a Latin script with many diacritics
- Demonstrates the challenges of translating between different language families (Austroasiatic and Indo-European)
- Growing importance in Southeast Asian technology markets

## Part 2: Environment Setup

In [None]:
# Import necessary libraries
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM, MarianMTModel, MarianTokenizer
import torch
import time
from typing import List, Dict, Optional

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("📦 Libraries imported successfully!")

In [None]:
# Device detection for optimal performance
def get_device() -> torch.device:
    """
    Automatically detect and return the best available device.
    
    Priority: CUDA > MPS (Apple Silicon) > CPU
    
    Returns:
        torch.device: The optimal device for current hardware
    """
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS (Apple Silicon)")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU (consider GPU for better performance)")
    
    return device

# Get optimal device
device = get_device()

## Part 3: Marian Models for Translation

Marian models are optimized for machine translation tasks. Let's load a Vietnamese to English Marian model.

In [None]:
# Load Marian model for Vietnamese to English translation
print("🔄 Loading Marian model for Vietnamese → English translation...")
print("This may take a few minutes on first run (downloading model)")

try:
    # Using Helsinki-NLP OPUS Marian model
    marian_model_name = "Helsinki-NLP/opus-mt-vi-en"
    
    # Load tokenizer and model
    marian_tokenizer = MarianTokenizer.from_pretrained(marian_model_name)
    marian_model = MarianMTModel.from_pretrained(marian_model_name)
    
    # Move model to optimal device
    marian_model.to(device)
    
    # Create translation pipeline
    marian_translator = pipeline(
        "translation",
        model=marian_model,
        tokenizer=marian_tokenizer,
        device=device  # pass the torch.device directly so both CUDA and MPS are honored
    )
    
    print(f"✅ Marian model loaded successfully: {marian_model_name}")
    print(f"📊 Model size: ~{marian_model.num_parameters():,} parameters")
    
except Exception as e:
    print(f"❌ Error loading Marian model: {e}")
    print("💡 Continuing without Marian model...")
    marian_translator = None
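
The `try`/`except` above falls back to `None` when loading fails. A slightly more general pattern tries several candidate loaders in order and keeps the first one that works. The sketch below is an illustration only: `load_first_available` and the two stand-in loaders are hypothetical names, and in practice each loader would wrap a real `from_pretrained` call.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def load_first_available(
    candidates: Iterable[Tuple[str, Callable[[], T]]],
) -> Tuple[Optional[str], Optional[T]]:
    """Try each (name, loader) pair in order; return the first that succeeds."""
    for name, loader in candidates:
        try:
            return name, loader()
        except Exception as exc:  # broad on purpose: network, memory, and config errors all count
            print(f"Skipping {name}: {exc}")
    return None, None


# Hypothetical loaders standing in for real model-loading calls.
def broken_loader() -> str:
    raise OSError("checkpoint not reachable")


def working_loader() -> str:
    return "model-object"


name, model = load_first_available([
    ("primary-checkpoint", broken_loader),
    ("backup-checkpoint", working_loader),
])
print(name, model)
```

Callers still need to handle the `(None, None)` case, just as the notebook checks for `marian_translator is None` later on.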

## Part 4: T5 Models for Translation

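T5 treats every NLP task as a text-to-text problem. We'll load `google/mt5-small`, a multilingual T5 (mT5) checkpoint. Note that mT5 checkpoints are pretrained only on the unlabeled mC4 corpus, with no supervised tasks, so this model needs fine-tuning on Vietnamese-English pairs before it can produce usable translations. Without fine-tuning, its outputs in this notebook mainly illustrate the text-to-text workflow.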

In [None]:
# Load T5 model for Vietnamese to English translation
print("🔄 Loading T5 model for Vietnamese → English translation...")
print("This may take a few minutes on first run (downloading model)")

try:
    # Using multilingual T5 model
    t5_model_name = "google/mt5-small"
    
    # Load tokenizer and model
    t5_tokenizer = AutoTokenizer.from_pretrained(t5_model_name)
    t5_model = AutoModelForSeq2SeqLM.from_pretrained(t5_model_name)
    
    # Move model to optimal device
    t5_model.to(device)
    
    print(f"✅ T5 model loaded successfully: {t5_model_name}")
    print(f"📊 Model size: ~{t5_model.num_parameters():,} parameters")
    
except Exception as e:
    print(f"❌ Error loading T5 model: {e}")
    print("💡 Continuing without T5 model...")
    t5_model = None
    t5_tokenizer = None

## Part 5: Translation Examples

Let's translate some Vietnamese texts to English using our loaded models.

In [None]:
# Sample Vietnamese texts for translation
vietnamese_texts = [
    "Xin chào! Hôm nay bạn có khỏe không?",
    "Tôi đang học về trí tuệ nhân tạo và học máy tại Đại học Sydney.",
    "Thời tiết ở Sydney hôm nay rất đẹp. Bạn có muốn đi dạo quanh cảng không?",
    "Việt Nam là một đất nước xinh đẹp với văn hóa phong phú.",
    "Công nghệ máy học đang phát triển rất nhanh trong những năm gần đây."
]

print("📝 Vietnamese Texts to Translate:")
print("=" * 50)
for i, text in enumerate(vietnamese_texts, 1):
    print(f"{i}. {text}")

In [None]:
# Function to translate using Marian model
def translate_with_marian(texts: List[str]) -> List[str]:
    """
    Translate Vietnamese texts to English using Marian model.
    
    Args:
        texts: List of Vietnamese texts to translate
        
    Returns:
        List of English translations
    """
    if marian_translator is None:
        return ["Marian model not available"] * len(texts)
    
    print("🔄 Translating with Marian model...")
    start_time = time.time()
    
    translations = []
    for text in texts:
        try:
            result = marian_translator(text, max_length=512)
            translation = result[0]['translation_text'] if result else "Translation failed"
            translations.append(translation)
        except Exception as e:
            translations.append(f"Error: {str(e)[:50]}...")
    
    duration = time.time() - start_time
    print(f"⏱️ Marian translation completed in {duration:.2f} seconds")
    
    return translations

# Translate using Marian model
marian_translations = translate_with_marian(vietnamese_texts)
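
The loop above sends one sentence at a time. Grouping inputs into batches usually reduces per-call overhead, especially on a GPU. The following is a minimal sketch of the batching logic on its own; `chunked`, `translate_in_batches`, and `fake_translate_batch` are hypothetical names, and the fake translator only stands in for a real model call.

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_in_batches(
    texts: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 2,
) -> List[str]:
    """Translate texts batch by batch, preserving the input order."""
    translations: List[str] = []
    for batch in chunked(texts, batch_size):
        translations.extend(translate_batch(batch))
    return translations


# Hypothetical stand-in for a real model call; it only tags each input.
def fake_translate_batch(batch: List[str]) -> List[str]:
    return [f"<en> {text}" for text in batch]


demo = translate_in_batches(["a", "b", "c"], fake_translate_batch, batch_size=2)
print(demo)
```

With the real pipeline, `translate_batch` could plausibly wrap a call such as `marian_translator(batch, max_length=512)` and collect each result's `translation_text`, since Hugging Face pipelines accept a list of strings; treat that wiring as an assumption to verify against your installed version.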

In [None]:
# Function to translate using T5 model (manual implementation)
def translate_with_t5(texts: List[str]) -> List[str]:
    """
    Translate Vietnamese texts to English using T5 model.
    
    Args:
        texts: List of Vietnamese texts to translate
        
    Returns:
        List of English translations
    """
    if t5_model is None or t5_tokenizer is None:
        return ["T5 model not available"] * len(texts)
    
    print("🔄 Translating with T5 model...")
    start_time = time.time()
    
    translations = []
    for text in texts:
        try:
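            # T5-style task prefix. The original T5 learned such prefixes during training;
            # mT5 did not, so the prefix only becomes meaningful after fine-tuning.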
            input_text = f"translate Vietnamese to English: {text}"
            
            # Tokenize input
            inputs = t5_tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
            inputs = {k: v.to(device) for k, v in inputs.items()}
            
            # Generate translation
            with torch.no_grad():
                outputs = t5_model.generate(
                    **inputs,
                    max_length=256,
                    num_beams=4,
                    early_stopping=True
                )
            
            # Decode translation
            translation = t5_tokenizer.decode(outputs[0], skip_special_tokens=True)
            translations.append(translation)
            
        except Exception as e:
            translations.append(f"Error: {str(e)[:50]}...")
    
    duration = time.time() - start_time
    print(f"⏱️ T5 translation completed in {duration:.2f} seconds")
    
    return translations

# Translate using T5 model
t5_translations = translate_with_t5(vietnamese_texts)
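
Because the tokenizer call above uses `truncation=True`, any input longer than 512 tokens is silently cut off. One way to avoid losing text is to split long inputs into sentences and translate them in smaller chunks. This is a rough sketch under simplifying assumptions: the regular expression is a naive sentence splitter, a character budget is used as a stand-in for a token budget, and the names `split_sentences` and `pack_sentences` are hypothetical.

```python
import re
from typing import List

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text: str) -> List[str]:
    """Naively split text into sentences, keeping the punctuation."""
    return [part for part in _SENTENCE_END.split(text.strip()) if part]


def pack_sentences(sentences: List[str], max_chars: int = 200) -> List[str]:
    """Greedily join sentences into chunks of at most max_chars characters.
    A single sentence longer than max_chars is kept whole."""
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sentences = split_sentences("A b. C d? E f!")
print(sentences)
print(pack_sentences(sentences, max_chars=9))
```

Each chunk can then be translated separately and the results joined. A production system would likely measure length in tokens with the model's own tokenizer rather than in characters.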

## Part 6: Translation Results and Comparison

Let's compare the translations from both models.

In [None]:
# Display translation results
print("🔍 TRANSLATION RESULTS COMPARISON")
print("=" * 60)

for i, (original, marian_trans, t5_trans) in enumerate(zip(vietnamese_texts, marian_translations, t5_translations), 1):
    print(f"\n📝 Example {i}:")
    print(f"Vietnamese:  {original}")
    print(f"Marian:      {marian_trans}")
    print(f"T5:          {t5_trans}")
    print("-" * 40)
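
The side-by-side printout is a qualitative comparison. If reference translations are available, a simple quantitative signal is clipped n-gram precision, the core ingredient of BLEU. The sketch below is a simplified illustration, not full BLEU: it uses whitespace tokenization, a single n-gram order, and no brevity penalty. The function names are hypothetical, and reference translations are something you would need to supply yourself.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams that also occur in the reference,
    with each n-gram's count clipped to its count in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Repeating a correct word does not inflate the score: only one "the" is credited.
print(clipped_precision("the the the", "the cat"))
print(clipped_precision("hello how are you", "hello how are you today", n=2))
```

For real evaluation, an established implementation such as the `sacrebleu` package is a safer choice than a hand-rolled metric.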

## Part 7: Model Analysis and Best Practices

Let's summarize the strengths and typical use cases of each model family, followed by some practical tips.

In [None]:
# Model comparison analysis
def analyze_models():
    """
    Provide analysis of different translation models.
    """
    print("📊 MODEL ANALYSIS")
    print("=" * 40)
    
    models_info = {
        "🎯 Marian Models (Helsinki-NLP/opus-mt-vi-en)": {
            "Strengths": [
                "• Specialized for translation tasks",
                "• Fast inference speed",
                "• Trained on OPUS parallel corpora",
                "• Good performance for specific language pairs"
            ],
            "Use Cases": [
                "• Production translation systems",
                "• Real-time translation applications",
                "• When speed is more important than flexibility"
            ]
        },
        "🔄 T5 Models (google/mt5-small)": {
            "Strengths": [
                "• Unified text-to-text framework",
                "• Multilingual capabilities",
                "• Must be fine-tuned before use; can then target specific domains",
                "• More flexible for various NLP tasks"
            ],
            "Use Cases": [
                "• Research and experimentation",
                "• Multi-task applications",
                "• When fine-tuning is needed",
                "• Custom domain translation"
            ]
        }
    }
    
    for model_name, info in models_info.items():
        print(f"\n{model_name}:")
        print("\nStrengths:")
        for strength in info["Strengths"]:
            print(f"  {strength}")
        print("\nBest Use Cases:")
        for use_case in info["Use Cases"]:
            print(f"  {use_case}")
        print("-" * 30)

analyze_models()

In [None]:
# Best practices for machine translation
def display_best_practices():
    """
    Display best practices for machine translation.
    """
    print("💡 MACHINE TRANSLATION BEST PRACTICES")
    print("=" * 45)
    
    practices = [
        {
            "title": "🎯 Model Selection",
            "tips": [
                "• Use Marian for fast, production-ready translations",
                "• Use T5 for research and when fine-tuning is needed",
                "• Consider model size vs. performance trade-offs",
                "• Test multiple models for your specific use case"
            ]
        },
        {
            "title": "📝 Text Preprocessing",
            "tips": [
                "• Clean and normalize input text",
                "• Handle special characters and punctuation carefully",
                "• Consider sentence segmentation for long texts",
                "• Maintain formatting when possible"
            ]
        },
        {
            "title": "⚡ Performance Optimization",
            "tips": [
                "• Use GPU acceleration when available",
                "• Batch multiple translations together",
                "• Adjust max_length based on typical text length",
                "• Consider model quantization for deployment"
            ]
        },
        {
            "title": "🔍 Quality Assurance",
            "tips": [
                "• Test with diverse text types and domains",
                "• Compare outputs from multiple models",
                "• Consider human evaluation for critical applications",
                "• Monitor translation quality over time"
            ]
        }
    ]
    
    for practice in practices:
        print(f"\n{practice['title']}:")
        for tip in practice['tips']:
            print(f"  {tip}")

display_best_practices()
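
The "clean and normalize input text" tip matters in particular for Vietnamese, where the same visible character can be stored either as one precomposed code point or as a base letter plus combining marks. The two forms may tokenize differently. Here is a minimal sketch using only the standard library; the function name `normalize_vietnamese` is hypothetical, and NFC is one reasonable choice of normal form rather than a requirement of either model.

```python
import unicodedata


def normalize_vietnamese(text: str) -> str:
    """Compose characters to NFC and collapse runs of whitespace."""
    composed = unicodedata.normalize("NFC", text)
    return " ".join(composed.split())


# "e" + combining dot below + combining circumflex, with extra spaces
decomposed = "Vie\u0323\u0302t   Nam"
# Precomposed LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW (U+1EC7)
precomposed = "Vi\u1ec7t Nam"

print(decomposed == precomposed)
print(normalize_vietnamese(decomposed) == precomposed)
```

Both strings render as "Việt Nam", but only after normalization do they compare equal.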

---

## 📋 Summary

### 🔑 Key Concepts Mastered
- **Machine Translation Fundamentals**: Understanding how neural machine translation works
- **Marian Models**: Using specialized encoder-decoder models for fast, high-quality translation
- **T5 Models**: Leveraging text-to-text frameworks for flexible translation tasks
- **Vietnamese-English Translation**: Practical examples of translating between different language families

### 📈 Best Practices Learned
- Choose the right model architecture based on your specific needs and constraints
- Consider performance trade-offs between speed, quality, and resource requirements
- Implement proper error handling and fallback mechanisms
- Test translations with diverse text types to ensure robust performance

### 🚀 Next Steps
- **Advanced Models**: Explore larger multilingual models like NLLB-200
- **Fine-tuning**: Learn to fine-tune models on domain-specific data
- **Bidirectional Translation**: Support both directions, Vietnamese → English and English → Vietnamese
- **Evaluation Metrics**: Study BLEU, METEOR, and other translation quality metrics
- **Production Deployment**: Build scalable translation APIs and services

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*