[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.2/08-translation.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.2/08-translation.ipynb)

# 🌐 Machine Translation: English to Vietnamese

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- What machine translation is and why it matters
- How modern neural machine translation works
- How to use pre-trained translation models with Hugging Face
- How to work with multilingual models and language pairs
- Best practices for translation systems

## 📋 Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with Python and PyTorch
- Knowledge of NLP fundamentals

## 📚 What We'll Cover
1. Introduction to Machine Translation
2. Setting up the Environment
3. Using Pre-trained Translation Models
4. Your First Translation
5. Building an Advanced Translation System
6. Practical Translation Examples
7. Advanced Translation Features
8. Translation Quality Analysis
9. Best Practices and Tips
10. Interactive Translation Demo
11. Summary and Next Steps

## Part 1: Introduction to Machine Translation

**Machine Translation (MT)** is the task of automatically translating text from one language to another using computational methods. It's one of the most challenging and useful applications in NLP.

### 🔍 Types of Machine Translation:

**Statistical Machine Translation (SMT):**
- Uses statistical models based on bilingual text corpora
- Learns translation probabilities from aligned text pairs
- Earlier approach, largely superseded by neural methods

**Neural Machine Translation (NMT):**
- Uses deep neural networks (typically Transformers)
- End-to-end learning from source to target language
- Current state-of-the-art approach

### 🌏 Why English to Vietnamese?
- Vietnamese is a tonal language with unique characteristics
- Growing importance in Southeast Asian markets
- Interesting linguistic challenges for translation models
- Good example of translating between different language families

### 🌟 Real-world Applications:
- Global communication and business
- Educational content localization
- Tourism and travel assistance
- Cross-cultural content sharing
- International news and media

## Part 2: Setting up the Environment

In [None]:
# Import necessary libraries
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM, MarianMTModel, MarianTokenizer
import torch
import time
from typing import List, Dict, Optional, Union

# For visualization and analysis
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

print("📦 Libraries imported successfully!")

In [None]:
# Device detection for optimal performance
def get_device() -> torch.device:
    """
    Automatically detect and return the best available device.
    
    Priority: CUDA > MPS (Apple Silicon) > CPU
    
    Returns:
        torch.device: The optimal device for current hardware
    """
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS (Apple Silicon)")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU (consider GPU for better performance)")
    
    return device

# Get optimal device
device = get_device()

## Part 3: Using Pre-trained Translation Models

Let's explore different models available for English to Vietnamese translation.

In [None]:
# Available translation models for English -> Vietnamese
translation_models = {
    "Helsinki-NLP/opus-mt-en-vi": {
        "name": "OPUS-MT English-Vietnamese",
        "description": "Specialized model trained on OPUS dataset",
        "size": "~300MB",
        "quality": "Good for general text"
    },
    "VietAI/envit5-translation": {
        "name": "EnViT5 Translation", 
        "description": "T5-based model fine-tuned for EN-VI translation",
        "size": "~500MB",
        "quality": "High quality for Vietnamese"
    },
    "facebook/nllb-200-distilled-600M": {
        "name": "NLLB-200 (Multilingual)",
        "description": "No Language Left Behind - supports 200 languages",
        "size": "~2.4GB",
        "quality": "Very high quality, multilingual"
    }
}

print("🌐 Available Translation Models:")
print("=" * 40)

for model_id, info in translation_models.items():
    print(f"\n📖 {info['name']}")
    print(f"   Model ID: {model_id}")
    print(f"   Description: {info['description']}")
    print(f"   Size: {info['size']}")
    print(f"   Quality: {info['quality']}")

In [None]:
# Initialize translation pipeline
print("🔄 Loading translation model...")
print("This may take a few minutes on first run (downloading model)")

try:
    # Start with the OPUS-MT model (smallest and fastest)
    model_name = "Helsinki-NLP/opus-mt-en-vi"
    
    translator = pipeline(
        "translation",
        model=model_name,
        device=device,  # torch.device from get_device(): CUDA, MPS, or CPU
        max_length=512  # Maximum translation length
    )
    
    print(f"✅ Translation model loaded successfully: {translation_models[model_name]['name']}")
    
except Exception as e:
    print(f"❌ Error loading primary model: {e}")
    print("💡 This model might not be available. Translation features may be limited.")
    translator = None
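
The model table above also lists NLLB-200, which is multilingual and therefore needs to be told which language pair to use. The sketch below shows one way to prepare the pipeline arguments. It assumes the FLORES-200 codes `eng_Latn` and `vie_Latn` and that the translation pipeline accepts `src_lang` and `tgt_lang`; the helper names `build_pipeline_kwargs` and `load_translator` are hypothetical and not part of this notebook.

```python
from typing import Any, Dict

# FLORES-200 language codes used by NLLB checkpoints.
NLLB_ENGLISH = "eng_Latn"
NLLB_VIETNAMESE = "vie_Latn"


def build_pipeline_kwargs(model_id: str) -> Dict[str, Any]:
    """Return keyword arguments for transformers.pipeline() (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"task": "translation", "model": model_id}
    # Single-pair models such as OPUS-MT en-vi need no language codes;
    # multilingual NLLB models need explicit source and target codes.
    if "nllb" in model_id.lower():
        kwargs["src_lang"] = NLLB_ENGLISH
        kwargs["tgt_lang"] = NLLB_VIETNAMESE
    return kwargs


def load_translator(model_id: str):
    """Build the pipeline; this downloads the model on first use."""
    from transformers import pipeline  # imported lazily: heavy dependency
    return pipeline(**build_pipeline_kwargs(model_id))


print(build_pipeline_kwargs("Helsinki-NLP/opus-mt-en-vi"))
print(build_pipeline_kwargs("facebook/nllb-200-distilled-600M"))
```

Keeping the argument construction separate from the download makes it easy to inspect what will be passed before committing to a multi-gigabyte model load.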

## Part 4: Your First Translation

Let's try translating some English text to Vietnamese!

In [None]:
# Sample English texts for translation
sample_texts = [
    "Hello, how are you today?",
    "I love learning about artificial intelligence and machine learning at University of Sydney.",
    "The weather is beautiful in Sydney today. Would you like to walk around the Harbour?",
    "Technology is rapidly changing the way we live and work in Australia.",
    "Welcome to Sydney! I hope you enjoy your visit to our beautiful harbour city."
]

print("🌐 English to Vietnamese Translation Examples")
print("=" * 50)

if translator:
    for i, text in enumerate(sample_texts, 1):
        print(f"\n{i}. 🇺🇸 English: {text}")
        
        try:
            # Translate the text
            start_time = time.time()
            result = translator(text)
            translation_time = time.time() - start_time
            
            # Extract translated text
            vietnamese_text = result[0]['translation_text']
            
            print(f"   🇻🇳 Vietnamese: {vietnamese_text}")
            print(f"   ⏱️ Translation time: {translation_time:.2f}s")
            
        except Exception as e:
            print(f"   ❌ Translation error: {e}")
else:
    print("⚠️ Translation model not available. Please check model loading.")

## Part 5: Building an Advanced Translation System

Let's create a more sophisticated translation system with better error handling and analysis features.

In [None]:
class EnglishVietnameseTranslator:
    """
    A comprehensive English to Vietnamese translation system.
    
    This class provides an easy-to-use interface for translation
    with built-in error handling, preprocessing, and analysis features.
    """
    
    def __init__(self, model_name: str = "Helsinki-NLP/opus-mt-en-vi"):
        """
        Initialize the translator with a specific model.
        
        Args:
            model_name: Hugging Face model identifier for EN->VI translation
        """
        self.model_name = model_name
        self.pipeline = None
        self.tokenizer = None
        self.model = None
        self._load_model()
    
    def _load_model(self):
        """Load the translation model with error handling."""
        try:
            print(f"🔄 Loading model: {self.model_name}")
            
            # Load model and tokenizer
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
            self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_name)
            
            # Move model to optimal device
            self.model.to(device)
            
            # Create pipeline
            self.pipeline = pipeline(
                "translation",
                model=self.model,
                tokenizer=self.tokenizer,
                device=device  # same torch.device the model was moved to above
            )
            
            print("✅ Model loaded successfully")
            
        except Exception as e:
            print(f"❌ Error loading model: {e}")
            print("💡 Model may not be available or require different installation.")
            self.pipeline = None
    
    def preprocess_text(self, text: str) -> str:
        """
        Preprocess English text before translation.
        
        Args:
            text: Input English text to preprocess
            
        Returns:
            Preprocessed text ready for translation
        """
        if not text or not text.strip():
            raise ValueError("Input text is empty")
        
        # Basic preprocessing
        text = text.strip()
        
        # Check text length
        if self.tokenizer:
            tokens = self.tokenizer.encode(text)
            if len(tokens) > 400:  # Most models have ~512 token limit
                print(f"⚠️ Warning: Text is quite long ({len(tokens)} tokens). "
                      f"Consider splitting into smaller segments.")
        
        return text
    
    def translate(
        self, 
        text: str, 
        max_length: int = 512,
        num_beams: int = 5,
        early_stopping: bool = True
    ) -> Dict:
        """
        Translate English text to Vietnamese.
        
        Args:
            text: English text to translate
            max_length: Maximum length of translation
            num_beams: Number of beams for beam search
            early_stopping: Whether to stop early when all beams finish
            
        Returns:
            Dictionary containing translation and analysis
        """
        if not self.pipeline:
            return {
                'error': 'Translation model not loaded',
                'english_text': text,
                'vietnamese_text': None
            }
        
        # Preprocess input
        try:
            processed_text = self.preprocess_text(text)
        except ValueError as e:
            return {
                'error': str(e),
                'english_text': text,
                'vietnamese_text': None
            }
        
        try:
            # Generate translation
            start_time = time.time()
            
            result = self.pipeline(
                processed_text,
                max_length=max_length,
                num_beams=num_beams,
                early_stopping=early_stopping,
                do_sample=False
            )
            
            translation_time = time.time() - start_time
            vietnamese_text = result[0]['translation_text']
            
            # Calculate statistics
            english_words = len(text.split())
            vietnamese_words = len(vietnamese_text.split()) if vietnamese_text else 0
            
            return {
                'english_text': text,
                'vietnamese_text': vietnamese_text,
                'english_words': english_words,
                'vietnamese_words': vietnamese_words,
                'translation_time': translation_time,
                'model_used': self.model_name
            }
            
        except Exception as e:
            return {
                'error': str(e),
                'english_text': text,
                'vietnamese_text': None
            }
    
    def translate_batch(self, texts: List[str]) -> List[Dict]:
        """
        Translate multiple texts one at a time, printing progress as it goes.
        
        Args:
            texts: List of English texts to translate
            
        Returns:
            List of translation results
        """
        results = []
        
        print(f"🔄 Translating {len(texts)} texts...")
        
        for i, text in enumerate(texts, 1):
            print(f"  Processing {i}/{len(texts)}...", end=' ')
            result = self.translate(text)
            results.append(result)
            
            if 'error' not in result:
                print("✅")
            else:
                print(f"❌ {result['error']}")
        
        return results

# Initialize our advanced translator
advanced_translator = EnglishVietnameseTranslator()
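
`translate_batch` above calls the model once per text. As a possible alternative, a Hugging Face pipeline can usually be given a list of strings together with `batch_size`, so that several inputs share one forward pass. The sketch below assumes the pipeline returns one `{"translation_text": ...}` dict per input; `translate_many` and `fake_translator` are hypothetical names, and the fake translator exists only so the example runs offline.

```python
from typing import Callable, Dict, List, Sequence


def translate_many(
    translator: Callable[..., List[Dict[str, str]]],
    texts: Sequence[str],
    batch_size: int = 8,
) -> List[str]:
    """Translate all texts in one pipeline call and return plain strings."""
    if not texts:
        return []
    # Assumption: the callable accepts a list of inputs plus batch_size and
    # returns one {"translation_text": ...} dict per input, in order.
    outputs = translator(list(texts), batch_size=batch_size)
    return [item["translation_text"] for item in outputs]


def fake_translator(texts, batch_size=1):
    """Stand-in for a real pipeline so the sketch runs offline (hypothetical)."""
    return [{"translation_text": f"<vi> {t}"} for t in texts]


print(translate_many(fake_translator, ["Hello", "Thank you"]))
```

With a loaded model, something like `translate_many(advanced_translator.pipeline, texts)` would be the equivalent call. Whether batching is actually faster depends on hardware and input lengths, so it is worth measuring.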

## Part 6: Practical Translation Examples

Let's test our translator with different types of content.

In [None]:
# Different categories of text for testing
test_categories = {
    "Greetings & Politeness": [
        "Hello, nice to meet you!",
        "Thank you very much for your help.",
        "Have a wonderful day!",
        "Excuse me, could you please help me?"
    ],
    
    "Travel & Tourism": [
        "Where is the nearest café in Sydney CBD?",
        "I would like to book a hotel room near Circular Quay.",
        "How much does a ferry trip to Manly cost?",
        "Can you recommend some Sydney Harbour attractions?"
    ],
    
    "Technology & Business": [
        "We are developing a new mobile application in Sydney.",
        "The meeting is scheduled for next Tuesday.",
        "Please send me the quarterly report.",
        "Artificial intelligence is transforming many industries across Australia."
    ],
    
    "Education & Learning": [
        "I am studying computer science at University of Sydney.",
        "This textbook explains the concepts very clearly.",
        "Students should practice regularly to improve their skills.",
        "Online learning has become increasingly popular."
    ]
}

print("🧪 Testing Translation with Different Content Categories")
print("=" * 60)

In [None]:
# Test translations for each category
category_results = {}

for category, texts in test_categories.items():
    print(f"\n📚 Category: {category}")
    print("-" * 40)
    
    # Translate all texts in this category
    results = advanced_translator.translate_batch(texts)
    category_results[category] = results
    
    # Display results
    successful_translations = 0
    total_time = 0
    
    for i, result in enumerate(results):
        if 'error' not in result:
            print(f"\n{i+1}. 🇺🇸 {result['english_text']}")
            print(f"   🇻🇳 {result['vietnamese_text']}")
            print(f"   📊 {result['english_words']} EN words → {result['vietnamese_words']} VI words")
            successful_translations += 1
            total_time += result['translation_time']
        else:
            print(f"\n{i+1}. ❌ Error: {result['error']}")
    
    print(f"\n📈 Category Statistics:")
    print(f"   ✅ Successful: {successful_translations}/{len(texts)}")
    if successful_translations > 0:
        print(f"   ⏱️ Average time: {total_time/successful_translations:.2f}s")

print(f"\n🎉 Translation testing completed!")
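
The per-category statistics printed above can also be collected into a single structure for later comparison. The following is a minimal sketch, assuming results shaped like the dictionaries returned by `translate()`; `summarize_results` is a hypothetical helper and the example data, including the timing, is made up for illustration.

```python
from statistics import mean
from typing import Dict, List


def summarize_results(results_by_category: Dict[str, List[dict]]) -> Dict[str, dict]:
    """Compute success counts and mean latency per category."""
    summary = {}
    for category, results in results_by_category.items():
        ok = [r for r in results if "error" not in r]
        summary[category] = {
            "total": len(results),
            "successful": len(ok),
            # None when nothing succeeded, to avoid dividing by zero.
            "mean_time_s": mean(r["translation_time"] for r in ok) if ok else None,
        }
    return summary


# Hypothetical results with made-up timings, shaped like translate() output.
example = {
    "Greetings": [
        {"english_text": "Hello", "vietnamese_text": "Xin chào", "translation_time": 0.2},
        {"english_text": "", "vietnamese_text": None, "error": "Input text is empty"},
    ],
}
print(summarize_results(example))
```

Passing `category_results` from the cell above to this function would give one summary row per category.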

## Part 7: Advanced Translation Features

Let's explore some advanced features and analyze translation quality.

In [None]:
# Test different translation parameters
parameter_tests = [
    {"name": "Fast (Few Beams)", "num_beams": 2, "max_length": 256},
    {"name": "Balanced", "num_beams": 5, "max_length": 512},
    {"name": "High Quality (More Beams)", "num_beams": 8, "max_length": 512},
]

# Test sentence for parameter comparison
test_sentence = "Machine learning and artificial intelligence are revolutionizing how we solve complex problems and make decisions in various fields including healthcare, finance, and education."

print("🎛️ Testing Different Translation Parameters")
print("=" * 50)
print(f"📝 Test sentence: {test_sentence}")
print()

for config in parameter_tests:
    print(f"\n🔧 Configuration: {config['name']}")
    print(f"   Beams: {config['num_beams']}, Max length: {config['max_length']}")
    
    result = advanced_translator.translate(
        test_sentence,
        num_beams=config['num_beams'],
        max_length=config['max_length']
    )
    
    if 'error' not in result:
        print(f"🇻🇳 Translation: {result['vietnamese_text']}")
        print(f"⏱️ Time: {result['translation_time']:.2f}s")
        print(f"📊 {result['english_words']} → {result['vietnamese_words']} words")
    else:
        print(f"❌ Error: {result['error']}")
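
A single timed call per configuration, as in the loop above, can be noisy, and the first call often includes one-off setup cost. One way to get steadier numbers is to discard a warm-up run and report the median of several repeats. This is a sketch only: `time_call` is a hypothetical helper, and the workload below is a stand-in so that the example runs without a model.

```python
import statistics
import time
from typing import Callable, Dict


def time_call(fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> Dict[str, float]:
    """Time fn() several times and report median and spread in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded (caches, lazy initialization)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


# Stand-in workload so the sketch runs without a model (hypothetical).
stats = time_call(lambda: sum(range(10_000)), repeats=5)
print(stats)
```

To compare beam settings, the lambda could wrap a real call such as `advanced_translator.translate(test_sentence, num_beams=2)`.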

## Part 8: Translation Quality Analysis

Let's analyze some interesting aspects of English-Vietnamese translation.

In [None]:
def analyze_translation_patterns():
    """
    Analyze patterns and characteristics of English-Vietnamese translation.
    """
    print("🔍 English-Vietnamese Translation Analysis")
    print("=" * 45)
    
    # Linguistic differences to highlight
    patterns = {
        "Word Order": {
            "description": "Vietnamese generally follows Subject-Verb-Object order like English",
            "example_en": "I love learning languages",
            "example_vi": "Tôi yêu thích học ngôn ngữ",
            "note": "Both languages have similar basic word order"
        },
        "Tones": {
            "description": "Vietnamese is a tonal language with 6 tones",
            "example_en": "ghost/mother/but/horse/tomb/rice seedling",
            "example_vi": "ma/má/mà/mã/mả/mạ",
            "note": "Same base letters, different meanings depending on the tone mark"
        },
        "Articles": {
            "description": "Vietnamese doesn't have articles like 'a', 'an', 'the'",
            "example_en": "The cat is sleeping",
            "example_vi": "Con mèo đang ngủ",
            "note": "'Con' is a classifier, not an article"
        },
        "Classifiers": {
            "description": "Vietnamese uses classifiers with nouns",
            "example_en": "Two books",
            "example_vi": "Hai quyển sách",
            "note": "'Quyển' is a classifier for books"
        }
    }
    
    for pattern_name, details in patterns.items():
        print(f"\n📖 {pattern_name}")
        print(f"   {details['description']}")
        print(f"   🇺🇸 English: {details['example_en']}")
        print(f"   🇻🇳 Vietnamese: {details['example_vi']}")
        print(f"   💡 Note: {details['note']}")
    
    print("\n🎯 Translation Challenges:")
    challenges = [
        "Cultural context and idiomatic expressions",
        "Formal vs informal language levels",
        "Technical terminology and neologisms",
        "Proper names and transliteration",
        "Handling of English compound words"
    ]
    
    for i, challenge in enumerate(challenges, 1):
        print(f"   {i}. {challenge}")

analyze_translation_patterns()
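
Because tone marks carry meaning, it matters how they are stored. The same visible Vietnamese syllable can be encoded either as one precomposed code point or as a base letter plus combining marks, and the two forms do not compare equal as strings. A minimal sketch using Python's standard `unicodedata` module is shown below; `normalize_vietnamese` and `strip_diacritics` are hypothetical helper names.

```python
import unicodedata


def normalize_vietnamese(text: str) -> str:
    """Return text in NFC form so each accented letter has one representation."""
    return unicodedata.normalize("NFC", text)


def strip_diacritics(text: str) -> str:
    """Remove all combining marks (illustration only; this destroys meaning)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


composed = "m\u1ea1"      # "mạ" with a single precomposed code point
decomposed = "ma\u0323"   # "a" followed by COMBINING DOT BELOW

print(composed == decomposed)                                              # False
print(normalize_vietnamese(composed) == normalize_vietnamese(decomposed))  # True
print(strip_diacritics("má mà mã mả mạ"))                                  # ma ma ma ma ma
```

Normalizing both model output and reference text to the same form before comparing or evaluating them avoids spurious mismatches. The last line shows why diacritics should never be stripped: five different words collapse into one.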

## Part 9: Best Practices and Tips

Here are essential guidelines for effective English-Vietnamese translation:

In [None]:
def demonstrate_translation_best_practices():
    """
    Demonstrate best practices for machine translation.
    """
    print("💡 Best Practices for English-Vietnamese Translation")
    print("=" * 55)
    
    practices = [
        {
            "title": "🎯 Choose the Right Model",
            "description": "Select models based on your specific needs",
            "tips": [
                "• OPUS-MT: Good for general text, fast inference",
                "• VietAI models: Better understanding of Vietnamese context",
                "• NLLB-200: Highest quality but resource-intensive",
                "• Consider fine-tuning for domain-specific content"
            ]
        },
        {
            "title": "📝 Text Preprocessing",
            "description": "Prepare text properly for better translation",
            "tips": [
                "• Break long texts into sentences for better quality",
                "• Handle special characters and formatting carefully",
                "• Consider context when translating technical terms",
                "• Preserve proper names and acronyms appropriately"
            ]
        },
        {
            "title": "⚙️ Parameter Optimization",
            "description": "Tune parameters for quality vs speed",
            "tips": [
                "• More beams (5-8): Often higher quality, slower generation",
                "• Fewer beams (2-3): Faster but potentially lower quality",
                "• Adjust max_length based on expected output length",
                "• Use early_stopping=True for efficiency"
            ]
        },
        {
            "title": "🌏 Cultural Awareness",
            "description": "Consider cultural and linguistic nuances",
            "tips": [
                "• Be aware of formal vs informal language levels",
                "• Consider regional variations in Vietnamese",
                "• Handle cultural references appropriately",
                "• Validate translations with native speakers when possible"
            ]
        }
    ]
    
    for practice in practices:
        print(f"\n{practice['title']}")
        print(f"{practice['description']}")
        for tip in practice['tips']:
            print(f"  {tip}")
    
    print("\n🚨 Common Pitfalls to Avoid:")
    pitfalls = [
        "• Translating without considering context or domain",
        "• Ignoring Vietnamese tone marks and diacritics",
        "• Over-relying on direct word-for-word translation",
        "• Not validating translations for cultural appropriateness",
        "• Using overly formal language for casual content",
        "• Forgetting to handle Vietnamese-specific punctuation"
    ]
    
    for pitfall in pitfalls:
        print(f"  {pitfall}")
    
    print("\n✨ Quality Improvement Tips:")
    tips = [
        "• Use context clues to disambiguate meaning",
        "• Post-process translations for consistency",
        "• Implement feedback loops for continuous improvement",
        "• Consider hybrid human-AI translation workflows",
        "• Regular evaluation with native Vietnamese speakers"
    ]
    
    for tip in tips:
        print(f"  {tip}")

demonstrate_translation_best_practices()
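
The preprocessing tip about breaking long texts into sentences can be sketched as follows. This is a naive illustration, not a robust sentence segmenter: it splits on `.`, `!` and `?` followed by whitespace, then greedily packs sentences into chunks under a token budget. `split_sentences` and `pack_chunks` are hypothetical helpers, and a whitespace word count stands in for a real tokenizer.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def pack_chunks(
    sentences: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 400,
) -> List[str]:
    """Greedily join consecutive sentences while staying within max_tokens."""
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Whitespace word count stands in for a real tokenizer here (an assumption);
# with Hugging Face you could pass: lambda s: len(tokenizer.encode(s))
text = "Hello there. How are you today? I hope you enjoy Sydney!"
sentences = split_sentences(text)
print(sentences)
print(pack_chunks(sentences, lambda s: len(s.split()), max_tokens=6))
```

Note that a single sentence longer than the budget is kept as its own chunk rather than being cut, and abbreviations such as "Dr." will be split incorrectly by this simple pattern.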

## Part 10: Interactive Translation Demo

Let's create a simple interactive demo for trying different translations.

In [None]:
# Interactive translation function
def interactive_translation_demo():
    """
    Simple demo for trying custom translations.
    """
    print("🎮 Interactive Translation Demo")
    print("=" * 35)
    print("Try translating your own English text to Vietnamese!")
    print("(Note: This is a demo - in Jupyter, you'd need actual input widgets)")
    
    # Sample user inputs for demonstration
    demo_inputs = [
        "Good morning! How can I help you today?",
        "I'm excited to learn about Vietnamese culture.",
        "Could you please show me the way to the museum?",
        "The food in Vietnam is absolutely delicious!",
        "Thank you for your patience and understanding."
    ]
    
    print("\n🎯 Demo Translations:")
    
    for i, demo_input in enumerate(demo_inputs, 1):
        print(f"\n--- Translation {i} ---")
        print(f"📝 Input: {demo_input}")
        
        result = advanced_translator.translate(demo_input)
        
        if 'error' not in result:
            print(f"🇻🇳 Translation: {result['vietnamese_text']}")
            print(f"📊 {result['english_words']} words → {result['vietnamese_words']} words")
            print(f"⏱️ Time: {result['translation_time']:.2f}s")
        else:
            print(f"❌ Error: {result['error']}")
    
    print("\n💡 Try modifying the demo_inputs list above to test your own sentences!")

# Run the interactive demo
interactive_translation_demo()

## Summary

Congratulations! You've successfully learned the fundamentals of machine translation from English to Vietnamese using Hugging Face transformers.

### 🔑 Key Concepts Mastered
- **Machine Translation Basics**: Understanding neural machine translation vs statistical approaches
- **Model Usage**: Working with pre-trained translation models for English-Vietnamese
- **Pipeline API**: Using Hugging Face's translation pipeline efficiently
- **Language Pairs**: Understanding challenges in cross-language translation
- **Cultural Awareness**: Recognizing the importance of cultural context in translation

### 📈 Best Practices Learned
- Choose appropriate models based on quality vs speed requirements
- Preprocess text properly for optimal translation quality
- Consider cultural and linguistic nuances in Vietnamese
- Implement proper error handling for production systems
- Validate translations with native speakers when possible

### 🌏 Understanding Vietnamese Translation
- Vietnamese is a tonal language with unique grammatical structures
- Classifiers have no direct English equivalent, while basic Subject-Verb-Object word order is shared
- Importance of formal vs informal language levels
- Cultural context significantly impacts translation quality

### 🚀 Next Steps
- **Advanced Models**: Explore more sophisticated multilingual models like NLLB
- **Fine-tuning**: Learn to fine-tune models on domain-specific Vietnamese data
- **Evaluation**: Study translation quality metrics (BLEU, METEOR, BERTScore)
- **Bidirectional**: Implement Vietnamese to English translation
- **Production**: Build scalable translation APIs and services
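
As a starting point for the evaluation step, the core idea behind BLEU is clipped (modified) n-gram precision: each n-gram in the candidate is credited at most as many times as it appears in the reference. The sketch below implements only that component. Full BLEU additionally combines several n-gram orders and applies a brevity penalty, so for real evaluation an established library is the safer choice. `ngrams` and `clipped_precision` are hypothetical helper names.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Modified n-gram precision: candidate counts are clipped by reference counts."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "tôi yêu thích học ngôn ngữ"
print(clipped_precision("tôi yêu thích học ngôn ngữ", reference))  # 1.0
print(clipped_precision("tôi tôi tôi", reference))                  # one match out of three
print(clipped_precision("tôi yêu học", reference, n=2))             # 0.5
```

One caveat for Vietnamese: splitting on whitespace yields syllables rather than words, so scores computed this way are syllable-level.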

### 📚 Additional Resources
- [Hugging Face Translation Models](https://huggingface.co/models?pipeline_tag=translation)
- [Vietnamese Language Resources](https://en.wikipedia.org/wiki/Vietnamese_language)
- [Neural Machine Translation Research](https://arxiv.org/list/cs.CL/recent)

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*