[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.6/grammar-correction.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.6/grammar-correction.ipynb)

# Grammar Correction with T5 Models

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- How to use T5 models for grammar correction tasks
- The text-to-text approach for fixing grammatical errors
- Working with grammar correction across multiple languages (English, Vietnamese, Japanese)
- Best practices for grammar correction with transformers
- How to evaluate grammar correction quality

## 📋 Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with Python and PyTorch
- Knowledge of NLP fundamentals (refer to [NLP Learning Journey](https://github.com/vuhung16au/nlp-learning-journey))
- Understanding of encoder-decoder architectures (refer to [docs/encoder-decoder.md](../../docs/encoder-decoder.md))

## 📚 What We'll Cover
1. **Introduction and Setup**: Grammar correction with T5, loading models and tokenizers
2. **English Grammar Correction**: Common error patterns
3. **Multilingual Correction**: Vietnamese and Japanese examples
4. **Batch Processing**: Correcting several texts at once
5. **Evaluation**: Simple quality assessment methods
6. **Best Practices and Demo**: Practical tips and a small demo

## What is Grammar Correction?

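Grammar correction is a **sequence-to-sequence task**: the model receives text that may contain grammatical errors and generates a corrected version. T5 (Text-to-Text Transfer Transformer) suits this task because it treats every problem as text generation, with a task prefix such as **"grammar: "** telling the model what to do. Note that the stock `t5-small` and `t5-base` checkpoints were not trained on a grammar correction task, so the prefix only yields reliable corrections with a checkpoint that was fine-tuned to use it.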

### Why T5 for Grammar Correction?

- **Unified Framework**: T5 treats all NLP tasks as text-to-text problems
- **Context Understanding**: Encoder-decoder architecture captures full sentence context
- **Flexible Output**: Can handle various types of corrections simultaneously
- **Multilingual Support**: The original T5 checkpoints are English-centric, but variants such as mT5 extend the same approach to other languages
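
The text-to-text framing can be sketched without loading a model. The snippet below is a minimal illustration, not the library's API: `TASK_PREFIXES` and `build_t5_input` are hypothetical helper names. The first three prefixes are the ones described for the original T5 training mixture, while `grammar: ` is the prefix used in this notebook and is assumed to match whatever checkpoint you fine-tune or download.

```python
# Task prefixes: the first three come from the original T5 setup; "grammar: "
# is the convention used in this notebook (an assumption about the checkpoint).
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "acceptability": "cola sentence: ",
    "grammar": "grammar: ",
}


def build_t5_input(task: str, text: str) -> str:
    """Prepend the task prefix so a single model can tell tasks apart."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"Unknown task: {task!r}")
    return TASK_PREFIXES[task] + text.strip()


print(build_t5_input("grammar", "She don't like coffee."))
print(build_t5_input("summarize", "T5 casts every NLP task as text generation."))
```

The same sentence with a different prefix becomes a different task, which is why the prefix must match the one seen during training.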

In [None]:
# Install required packages (run only once)
# !pip install transformers sentencepiece torch datasets numpy pandas matplotlib seaborn tqdm

# Import essential libraries
import torch
import numpy as np
import pandas as pd
from transformers import (
    T5ForConditionalGeneration, 
    T5Tokenizer,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    pipeline
)
import time
from typing import List, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")

print("📚 Libraries loaded successfully!")
print(f"PyTorch version: {torch.__version__}")

In [None]:
def get_device() -> torch.device:
    """
    Get the best available device for PyTorch operations.
    
    Priority order: CUDA > MPS (Apple Silicon) > CPU
    
    Returns:
        torch.device: The optimal device for current hardware
    """
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS for Apple Silicon optimization")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU - consider GPU for better performance")
    
    return device

# Set up device
device = get_device()

## Part 1: Basic Grammar Correction with T5

Let's start with a simple T5 model for grammar correction. The loader below defaults to `t5-base`, which offers a reasonable balance between quality and compute, but this notebook loads `t5-small` so that it runs on modest hardware.
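
To make the size trade-off concrete, here is a rough back-of-the-envelope sketch. The parameter counts are the rounded figures reported for the T5 family, and `weight_memory_gb` is a hypothetical helper that counts the weights only (activations, beam search state, and framework overhead come on top).

```python
# Rounded parameter counts for the T5 family (approximate).
T5_PARAMS = {
    "t5-small": 60_000_000,
    "t5-base": 220_000_000,
    "t5-large": 770_000_000,
}


def weight_memory_gb(num_params: int, bytes_per_param: int = 4) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, n_params in T5_PARAMS.items():
    fp32 = weight_memory_gb(n_params, 4)   # 32-bit floats
    fp16 = weight_memory_gb(n_params, 2)   # 16-bit floats
    print(f"{name}: ~{fp32:.2f} GB in fp32, ~{fp16:.2f} GB in fp16")
```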

In [None]:
# Load T5 model for grammar correction
def load_grammar_correction_model(model_name: str = "t5-base"):
    """
    Load T5 model and tokenizer for grammar correction.
    
    Args:
        model_name: HuggingFace model identifier
        
    Returns:
        Tuple of (model, tokenizer)
    """
    print(f"📥 Loading {model_name} for grammar correction...")
    
    try:
        # Load tokenizer and model
        tokenizer = T5Tokenizer.from_pretrained(model_name)
        model = T5ForConditionalGeneration.from_pretrained(model_name)
        
        # Move to optimal device
        model = model.to(device)
        
        # Enable evaluation mode for inference
        model.eval()
        
        print(f"✅ Model loaded successfully")
        print(f"📊 Model size: {model.num_parameters():,} parameters")
        
        return model, tokenizer
        
    except Exception as e:
        print(f"❌ Error loading model: {e}")
        print("💡 Try using t5-small for lower memory requirements")
        raise

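# Load the model
# Note: stock T5 checkpoints are not fine-tuned for grammar correction; swap in a
# grammar-correction checkpoint from the Hugging Face Hub for meaningful output
model, tokenizer = load_grammar_correction_model("t5-small")  # Using t5-small for better compatibility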

In [None]:
def correct_grammar(text: str, model, tokenizer, max_length: int = 128) -> str:
    """
    Correct grammar in the given text using T5 model.
    
    Args:
        text: Input text with potential grammar errors
        model: T5 model for grammar correction
        tokenizer: T5 tokenizer
        max_length: Maximum output length
        
    Returns:
        Corrected text
    """
    # Format input for T5 with grammar correction prefix
    input_text = f"grammar: {text}"
    
    # Tokenize input
    input_ids = tokenizer.encode(
        input_text, 
        return_tensors="pt", 
        max_length=512, 
        truncation=True
    ).to(device)
    
    # Generate correction
    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            max_length=max_length,
            num_beams=4,  # Use beam search for better quality
            length_penalty=0.6,
            early_stopping=True,
            do_sample=False  # Deterministic output
        )
    
    # Decode the output
    corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    return corrected_text

# Test with a simple example
test_sentence = "I are going to the store yesterday."
corrected = correct_grammar(test_sentence, model, tokenizer)

print("🧪 BASIC GRAMMAR CORRECTION TEST")
print("=" * 40)
print(f"Original:  {test_sentence}")
print(f"Corrected: {corrected}")
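
The `length_penalty` argument above is easy to misread. As a minimal sketch, assuming beam candidates are ranked by their summed log-probability divided by `length ** length_penalty` (which is how the Hugging Face generation documentation describes the parameter), the toy numbers below show how the penalty changes which candidate wins. The two candidates are made up for illustration.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized score: higher (closer to zero) is better."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical beam candidates: (summed log-probability, length in tokens)
shorter = (-4.0, 5)
longer = (-6.0, 10)

for penalty in (0.0, 0.6, 1.0):
    s = beam_score(*shorter, penalty)
    l = beam_score(*longer, penalty)
    winner = "shorter" if s > l else "longer"
    print(f"length_penalty={penalty}: shorter={s:.3f}, longer={l:.3f} -> {winner} wins")
```

With no normalization the shorter candidate wins simply because it accumulates fewer negative log-probabilities; raising the penalty shifts the ranking toward the longer one.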

## Part 2: English Grammar Correction Examples

Let's test our grammar correction system with various types of English grammar errors:

In [None]:
# English grammar correction examples
english_examples = {
    "Subject-Verb Agreement": [
        "The dogs runs in the park every day.",
        "She don't like coffee in the morning.",
        "There is many books on the shelf."
    ],
    "Verb Tenses": [
        "Yesterday, I go to the cinema with my friends.",
        "She has went to the store already.",
        "I will going to the meeting tomorrow."
    ],
    "Articles": [
        "I need a advice about this situation.",
        "She is a honest person.",
        "The cats are playing in a garden."
    ],
    "Prepositions": [
        "I am good in mathematics.",
        "She arrived to the party late.",
        "The book is laying on the table."  # lay/lie verb confusion rather than a preposition error
    ]
}

print("🇺🇸 ENGLISH GRAMMAR CORRECTION EXAMPLES")
print("=" * 50)

for category, examples in english_examples.items():
    print(f"\n📚 {category.upper()}:")
    print("-" * 30)
    
    for i, example in enumerate(examples, 1):
        print(f"\n{i}. Original:  {example}")
        
        try:
            start_time = time.time()
            corrected = correct_grammar(example, model, tokenizer)
            correction_time = time.time() - start_time
            
            print(f"   Corrected: {corrected}")
            print(f"   ⏱️  Time: {correction_time:.2f}s")
            
        except Exception as e:
            print(f"   ❌ Error: {e}")

## Part 3: Vietnamese Grammar Correction

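Now let's try grammar correction with Vietnamese text. The original T5 checkpoints were trained mostly on English, and their vocabulary covers little beyond English and a few European languages, so expect poor output here. The examples mainly show why a multilingual or Vietnamese-specific model is needed: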

In [None]:
# Vietnamese grammar correction examples
vietnamese_examples = {
    "Word Order": [
        "Tôi rất thích ăn phở.",  # Correct: "I really like eating pho"
        "Ăn phở tôi thích rất.",  # Incorrect word order
        "Hôm nay thời tiết đẹp rất."
    ],
    "Classifiers": [
        "Tôi có hai con chó.",  # Correct: "I have two dogs"
        "Tôi có hai chó.",      # Missing classifier
        "Ba cuốn sách này rất hay."
    ],
    "Common Errors": [
        "Tôi sẽ đi học ở trường đại học.",  # Going to study at university
        "Việc học tiếng Anh rất khó.",       # Learning English is difficult
        "Chúng tôi đến từ Việt Nam."
    ]
}

print("🇻🇳 VIETNAMESE GRAMMAR CORRECTION EXAMPLES")
print("=" * 50)

for category, examples in vietnamese_examples.items():
    print(f"\n📚 {category.upper()}:")
    print("-" * 30)
    
    for i, example in enumerate(examples, 1):
        print(f"\n{i}. 🇻🇳 Vietnamese: {example}")
        
        try:
            start_time = time.time()
            # Note: Using the same model - in practice, you'd want a Vietnamese-specific model
            corrected = correct_grammar(example, model, tokenizer)
            correction_time = time.time() - start_time
            
            print(f"   Processed: {corrected}")
            print(f"   ⏱️  Time: {correction_time:.2f}s")
            print(f"   💡 Note: Consider using mT5 for better multilingual support")
            
        except Exception as e:
            print(f"   ❌ Error: {e}")

print("\n🔧 FOR BETTER VIETNAMESE SUPPORT:")
print("Consider using multilingual models like:")
print("- google/mt5-base")
print("- VietAI/vit5-base")
print("- facebook/mbart-large-50-many-to-many-mmt")
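
Before trusting output on Vietnamese or Japanese text, it can help to check how much of the input the tokenizer cannot represent. The sketch below assumes a Hugging Face style tokenizer that is callable and exposes `unk_token_id`; `unknown_token_ratio` is a hypothetical helper, and `WhitespaceStubTokenizer` is a stand-in used only so the example runs without downloading a model.

```python
def unknown_token_ratio(tokenizer, text: str) -> float:
    """Fraction of tokens in `text` that map to the tokenizer's unknown token."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if not ids or tokenizer.unk_token_id is None:
        return 0.0
    return sum(1 for token_id in ids if token_id == tokenizer.unk_token_id) / len(ids)


class WhitespaceStubTokenizer:
    """Stand-in tokenizer: any word containing non-ASCII characters becomes <unk>."""
    unk_token_id = 2

    def __call__(self, text, add_special_tokens=False):
        ids = [5 if word.isascii() else self.unk_token_id for word in text.split()]
        return {"input_ids": ids}


stub = WhitespaceStubTokenizer()
print(unknown_token_ratio(stub, "I have two dogs."))
print(unknown_token_ratio(stub, "Tôi có hai con chó."))

# With a real tokenizer (requires a download), the call would look like:
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("t5-small")
# print(unknown_token_ratio(tok, "Tôi có hai con chó."))
```

A high ratio suggests the model never sees most of the input, so any "correction" it produces is unlikely to be meaningful.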

## Part 4: Japanese Grammar Correction

Let's explore Japanese text processing. Similar to Vietnamese, specialized multilingual models work better for Japanese:

In [None]:
# Japanese grammar correction examples
japanese_examples = {
    "Particle Usage": [
        "私は学生です。",         # I am a student (correct)
        "私が学生です。",         # Also grammatical; が shifts the emphasis to the subject
        "本を読みます。"           # I read books
    ],
    "Verb Conjugation": [
        "昨日映画を見ました。",     # I watched a movie yesterday
        "明日東京に行きます。",     # I will go to Tokyo tomorrow
        "毎日日本語を勉強します。"   # I study Japanese every day
    ],
    "Politeness Levels": [
        "ありがとうございます。",   # Thank you (polite)
        "すみません。",           # Excuse me / Sorry
        "おはようございます。"     # Good morning (polite)
    ]
}

print("🇯🇵 JAPANESE GRAMMAR CORRECTION EXAMPLES")
print("=" * 50)

for category, examples in japanese_examples.items():
    print(f"\n📚 {category.upper()}:")
    print("-" * 30)
    
    for i, example in enumerate(examples, 1):
        print(f"\n{i}. 🇯🇵 Japanese: {example}")
        
        try:
            start_time = time.time()
            # Note: the loaded t5-small model has very limited Japanese capability
            corrected = correct_grammar(example, model, tokenizer)
            correction_time = time.time() - start_time
            
            print(f"   Processed: {corrected}")
            print(f"   ⏱️  Time: {correction_time:.2f}s")
            print(f"   💡 Note: t5-small has very limited Japanese support")
            
        except Exception as e:
            print(f"   ❌ Error: {e}")

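print("\n🔧 FOR BETTER JAPANESE SUPPORT:")
print("Consider sequence-to-sequence models that cover Japanese, such as:")
print("- google/mt5-base (multilingual)")
print("- megagonlabs/t5-base-japanese-web")
print("Note: rinna/japanese-gpt2-medium (decoder-only) and")
print("cl-tohoku/bert-base-japanese (encoder-only) are not seq2seq models,")
print("so they cannot be passed to correct_grammar() as-is.")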
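
One simple way to act on these suggestions is to route each text to a checkpoint by language. The mapping below is a hypothetical sketch: the checkpoint names are the ones mentioned in this notebook, `pick_checkpoint` is an illustrative helper, and every base checkpoint listed would still need fine-tuning on grammar correction data before it produces useful corrections.

```python
from typing import Dict

# Hypothetical routing table built from checkpoints named in this notebook.
MODEL_BY_LANGUAGE: Dict[str, str] = {
    "en": "t5-small",
    "vi": "VietAI/vit5-base",
    "ja": "megagonlabs/t5-base-japanese-web",
}

# Multilingual fallback for languages without a dedicated entry.
DEFAULT_MODEL = "google/mt5-base"


def pick_checkpoint(language_code: str) -> str:
    """Return the checkpoint name to load for an ISO 639-1 language code."""
    return MODEL_BY_LANGUAGE.get(language_code.lower(), DEFAULT_MODEL)


for code in ("en", "vi", "ja", "fr"):
    print(f"{code} -> {pick_checkpoint(code)}")
```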

## Part 5: Advanced Grammar Correction Techniques

Let's explore more advanced techniques for grammar correction:

In [None]:
def batch_grammar_correction(texts: List[str], model, tokenizer, batch_size: int = 4) -> List[str]:
    """
    Correct grammar for multiple texts efficiently using batching.
    
    Args:
        texts: List of input texts
        model: T5 model
        tokenizer: T5 tokenizer
        batch_size: Number of texts to process simultaneously
        
    Returns:
        List of corrected texts
    """
    corrected_texts = []
    
    for i in tqdm(range(0, len(texts), batch_size), desc="Processing batches"):
        batch = texts[i:i + batch_size]
        
        # Format inputs for T5
        input_texts = [f"grammar: {text}" for text in batch]
        
        # Tokenize batch
        inputs = tokenizer(
            input_texts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to(device)
        
        # Generate corrections
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=128,
                num_beams=2,
                length_penalty=0.6,
                early_stopping=True
            )
        
        # Decode outputs
        batch_corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        corrected_texts.extend(batch_corrected)
    
    return corrected_texts

# Test batch processing
test_batch = [
    "She don't like coffee.",
    "I are going to school.",
    "The book are on the table.",
    "He have many friends.",
    "They was happy yesterday."
]

print("⚡ BATCH GRAMMAR CORRECTION")
print("=" * 35)

start_time = time.time()
batch_results = batch_grammar_correction(test_batch, model, tokenizer, batch_size=2)
total_time = time.time() - start_time

for i, (original, corrected) in enumerate(zip(test_batch, batch_results), 1):
    print(f"\n{i}. Original:  {original}")
    print(f"   Corrected: {corrected}")

print(f"\n⏱️  Total time: {total_time:.2f}s")
print(f"🚀 Speed: {len(test_batch)/total_time:.1f} corrections/second")
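
Batches pad every text to the length of the longest one, so mixing short and long sentences wastes computation. A common refinement, sketched here under the assumption that word count is a good enough proxy for token count, is to sort texts by length before batching and then restore the original order. `correct_in_sorted_batches` is a hypothetical helper; `process_batch` stands for any function that maps a list of texts to a list of outputs, such as a wrapper around `model.generate`.

```python
from typing import Callable, List


def correct_in_sorted_batches(
    texts: List[str],
    batch_size: int,
    process_batch: Callable[[List[str]], List[str]],
) -> List[str]:
    """Process texts in length-sorted batches, returning results in input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Indices of the texts, ordered from shortest to longest (by word count)
    order = sorted(range(len(texts)), key=lambda i: len(texts[i].split()))
    results = [""] * len(texts)

    for start in range(0, len(order), batch_size):
        indices = order[start:start + batch_size]
        outputs = process_batch([texts[i] for i in indices])
        # Write each output back to the position of its source text
        for i, output in zip(indices, outputs):
            results[i] = output

    return results


# Stand-in for the model: upper-cases each text and records the batches it saw
seen_batches = []

def fake_model(batch: List[str]) -> List[str]:
    seen_batches.append(list(batch))
    return [text.upper() for text in batch]


demo = ["a b c d", "a", "a b c", "a b"]
print(correct_in_sorted_batches(demo, 2, fake_model))
print(seen_batches)
```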

## Part 6: Grammar Correction Quality Assessment

Let's implement some basic methods to assess the quality of our grammar corrections:

In [None]:
def assess_correction_quality(original: str, corrected: str, reference: Optional[str] = None) -> Dict[str, float]:
    """
    Assess the quality of grammar correction.
    
    Args:
        original: Original text with errors
        corrected: Model-corrected text
        reference: Reference correct text (optional)
        
    Returns:
        Dictionary with quality metrics
    """
    metrics = {}
    
    # Basic metrics
    original_words = original.split()
    corrected_words = corrected.split()
    
    metrics['length_change'] = len(corrected_words) - len(original_words)
    metrics['length_ratio'] = len(corrected_words) / len(original_words) if original_words else 0
    
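    # Word-level changes (position-by-position comparison; an inserted or
    # deleted word shifts every later position, so treat this as a rough count)
    changed_words = sum(1 for o, c in zip(original_words, corrected_words) if o != c)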
    metrics['words_changed'] = changed_words
    metrics['change_ratio'] = changed_words / len(original_words) if original_words else 0
    
    # If reference is provided, calculate similarity
    if reference:
        ref_words = reference.split()
        
        # Simple word overlap with reference
        corrected_set = set(corrected_words)
        reference_set = set(ref_words)
        
        overlap = len(corrected_set & reference_set)
        union = len(corrected_set | reference_set)
        
        metrics['word_overlap_with_reference'] = overlap / union if union else 0
    
    return metrics

# Test quality assessment
test_cases = [
    {
        'original': "She don't like coffee.",
        'reference': "She doesn't like coffee."
    },
    {
        'original': "I are going to school.",
        'reference': "I am going to school."
    },
    {
        'original': "The books is on the table.",
        'reference': "The books are on the table."
    }
]

print("📊 GRAMMAR CORRECTION QUALITY ASSESSMENT")
print("=" * 50)

for i, test_case in enumerate(test_cases, 1):
    original = test_case['original']
    reference = test_case['reference']
    
    # Get model correction
    corrected = correct_grammar(original, model, tokenizer)
    
    # Assess quality
    quality_metrics = assess_correction_quality(original, corrected, reference)
    
    print(f"\n{i}. Test Case:")
    print(f"   Original:  {original}")
    print(f"   Corrected: {corrected}")
    print(f"   Reference: {reference}")
    print(f"   📈 Quality Metrics:")
    for metric, value in quality_metrics.items():
        if isinstance(value, float):
            print(f"      {metric}: {value:.3f}")
        else:
            print(f"      {metric}: {value}")
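
The position-by-position comparison above miscounts as soon as a word is inserted or deleted. A more robust sketch aligns the word sequences with Python's `difflib` and scores the model's edits against the reference edits with an F0.5 measure, which weights precision above recall as is common in grammar correction evaluation. `word_edits` and `f_beta` are hypothetical helpers and only a rough approximation of dedicated tools such as ERRANT.

```python
from difflib import SequenceMatcher
from typing import Set, Tuple

# (start, end) word span in the original text plus the replacement words
Edit = Tuple[int, int, str]


def word_edits(source: str, target: str) -> Set[Edit]:
    """Extract word-level edits that turn `source` into `target`."""
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.add((i1, i2, " ".join(tgt[j1:j2])))
    return edits


def f_beta(hyp_edits: Set[Edit], ref_edits: Set[Edit], beta: float = 0.5) -> float:
    """F-beta over edits; beta < 1 weights precision more than recall."""
    true_positives = len(hyp_edits & ref_edits)
    precision = true_positives / len(hyp_edits) if hyp_edits else 1.0
    recall = true_positives / len(ref_edits) if ref_edits else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


original = "He have many friend."
reference = "He has many friends."
hypothesis = "He has many friend."   # fixes the verb, misses the plural

ref_edits = word_edits(original, reference)
hyp_edits = word_edits(original, hypothesis)
print("Reference edits: ", sorted(ref_edits))
print("Hypothesis edits:", sorted(hyp_edits))
print(f"F0.5: {f_beta(hyp_edits, ref_edits):.3f}")
```

Here the hypothesis makes one of the two required edits and no spurious ones, so precision is 1.0 and recall is 0.5.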

## Part 7: Best Practices and Tips

Here are some best practices for effective grammar correction with T5:

In [None]:
def demonstrate_grammar_correction_best_practices():
    """
    Demonstrate best practices for grammar correction.
    """
    
    best_practices = {
        "🎯 Model Selection": [
            "Use a T5-base or T5-large checkpoint fine-tuned for grammar correction",
            "Consider mT5 for multilingual applications",
            "Fine-tune on domain-specific data when possible",
            "Start with smaller models (T5-small) for development"
        ],
        
        "⚙️ Generation Parameters": [
            "Use num_beams=4-8 for better quality (slower)",
            "Tune length_penalty for beam search: higher favors longer outputs, lower shorter",
            "early_stopping=True ends beam search early; use no_repeat_ngram_size against repetition",
            "Use do_sample=False for consistent corrections"
        ],
        
        "📝 Input Formatting": [
            "Use the exact task prefix the checkpoint was fine-tuned with (here 'grammar:')",
            "Keep input sentences reasonably short (<100 words)",
            "Handle special characters and punctuation carefully",
            "Consider sentence segmentation for long texts"
        ],
        
        "⚡ Performance Optimization": [
            "Use batch processing for multiple texts",
            "Cache model and tokenizer to avoid reloading",
            "Use GPU acceleration when available",
            "Consider model quantization for deployment"
        ],
        
        "🌍 Multilingual Support": [
            "Use language-specific models for better results",
            "Consider mT5 or mBART for multilingual tasks",
            "Test thoroughly with target language examples",
            "Be aware of cultural and linguistic nuances"
        ],
        
        "📊 Quality Evaluation": [
            "Compare against human-annotated references",
            "Use GEC-specific metrics (ERRANT F0.5, GLEU) alongside general ones such as BLEU",
            "Test on diverse error types and domains",
            "Monitor for over-correction or under-correction"
        ]
    }
    
    print("💡 GRAMMAR CORRECTION BEST PRACTICES")
    print("=" * 50)
    
    for category, tips in best_practices.items():
        print(f"\n{category}:")
        for tip in tips:
            print(f"  • {tip}")
    
    print("\n" + "=" * 50)
    print("🚨 Common Pitfalls to Avoid:")
    
    pitfalls = [
        "Don't rely solely on T5-base for non-English languages",
        "Avoid processing very long texts without segmentation",
        "Don't ignore context when correcting isolated sentences",
        "Avoid over-correcting stylistic choices vs. actual errors",
        "Don't assume all model outputs are improvements"
    ]
    
    for pitfall in pitfalls:
        print(f"  ❌ {pitfall}")

demonstrate_grammar_correction_best_practices()
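
The advice to segment long texts can be sketched with the standard library. This is a deliberately naive approach: the regular expression splits after `.`, `!`, or `?` followed by whitespace, so it will break on abbreviations such as "Dr." and does not handle Japanese text, where sentences end in `。` without a following space. `split_sentences` and `chunk_sentences` are hypothetical helpers, and the 100-word limit mirrors the tip above.

```python
import re
from typing import List

# Split at whitespace that follows sentence-ending punctuation
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter for English-like punctuation."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text.strip()) if s.strip()]


def chunk_sentences(sentences: List[str], max_words: int = 100) -> List[str]:
    """Group whole sentences into chunks of at most `max_words` words.

    A single sentence longer than `max_words` is kept intact as its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "I has a dog. She going home! Why they was late?"
sentences = split_sentences(text)
print(sentences)
print(chunk_sentences(sentences, max_words=8))
```

Each chunk can then be passed to `correct_grammar` separately and the outputs joined back together.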

## Part 8: Interactive Grammar Correction Demo

Let's create a simple interactive demo for trying custom grammar corrections:

In [None]:
def interactive_grammar_correction_demo():
    """
    Simple demo for trying custom grammar corrections.
    """
    print("🎮 Interactive Grammar Correction Demo")
    print("=" * 40)
    print("Try correcting your own text!")
    print("(Note: In a real notebook, you could use input() for interaction)")
    
    # Demo examples instead of interactive input
    demo_texts = [
        "I has a dog and two cats.",
        "She going to the market yesterday.",
        "The childrens was playing in the park.",
        "We seen that movie last week.",
        "He don't know the answer."
    ]
    
    print("\n📝 Demo Corrections:")
    
    for i, demo_text in enumerate(demo_texts, 1):
        print(f"\n{i}. Input: {demo_text}")
        
        try:
            corrected = correct_grammar(demo_text, model, tokenizer)
            print(f"   Output: {corrected}")
            
            # Simple feedback (a change is not necessarily an improvement)
            if demo_text.lower() != corrected.lower():
                print("   ✅ Model changed the text (verify that it is an improvement)")
            else:
                print("   ℹ️  No changes suggested")
                
        except Exception as e:
            print(f"   ❌ Error: {e}")
    
    print("\n💡 Tips for better results:")
    print("  • Keep sentences reasonably short")
    print("  • Focus on grammatical errors vs. style preferences")
    print("  • Consider context when evaluating corrections")

interactive_grammar_correction_demo()
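
Since not every model output is an improvement, a simple safeguard is to reject outputs that drift too far from the input. The sketch below uses word-level similarity from `difflib`; `accept_correction` is a hypothetical helper and the `0.6` threshold is an arbitrary starting point that should be tuned on your own data.

```python
from difflib import SequenceMatcher


def accept_correction(original: str, corrected: str, min_similarity: float = 0.6) -> str:
    """Return `corrected` only if it stays close to `original`; else keep the original."""
    if not corrected.strip():
        return original
    similarity = SequenceMatcher(
        a=original.split(), b=corrected.split(), autojunk=False
    ).ratio()
    return corrected if similarity >= min_similarity else original


# A small edit is accepted
print(accept_correction("He don't know the answer.", "He doesn't know the answer."))

# An unrelated output (for example, an accidental translation) is rejected
print(accept_correction("He don't know the answer.", "Er kennt die Antwort nicht."))
```

A guard like this catches gross failures such as empty, translated, or hallucinated output; it cannot tell whether a small edit is actually correct.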

---

## 📋 Summary

### 🔑 Key Concepts Mastered
- **T5 for Grammar Correction**: Using text-to-text approach with "grammar:" prefix
- **Multilingual Processing**: Working with English, Vietnamese, and Japanese text
- **Quality Assessment**: Methods to evaluate correction effectiveness
- **Batch Processing**: Efficient handling of multiple texts
- **Best Practices**: Practical techniques and optimization strategies

### 📈 Best Practices Learned
- Use appropriate model sizes based on computational constraints
- Apply proper input formatting with task prefixes
- Implement batch processing for efficiency
- Consider language-specific models for non-English text
- Evaluate corrections against reference standards

### 🚀 Next Steps
- **Advanced Models**: Explore mT5, T5-large, or domain-specific models
- **Fine-tuning**: Train on your specific grammar correction dataset
- **Multilingual Support**: Implement with mT5 or language-specific models
- **Production Deployment**: Build scalable grammar correction APIs
- **Evaluation Metrics**: Implement GEC-specific metrics such as GLEU or ERRANT-style F0.5, optionally alongside BLEU, METEOR, or BERTScore

### 📚 Further Reading
- [T5 Paper: "Exploring the Limits of Transfer Learning"](https://arxiv.org/abs/1910.10683)
- [Hugging Face T5 Documentation](https://huggingface.co/docs/transformers/model_doc/t5)
- [Grammar Error Correction Survey](https://arxiv.org/abs/2005.06600)
- [Multilingual T5 (mT5) Paper](https://arxiv.org/abs/2010.11934)

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*