[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.2/04-mask-filling.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.2/04-mask-filling.ipynb)

# 04 - Mask Filling with Hugging Face Transformers

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- What mask filling is and how it works
- How to use the fill-mask pipeline in Hugging Face
- Different models suitable for mask filling tasks
- How to interpret and work with prediction results
- Advanced mask filling techniques and use cases

## 📋 Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with Python and PyTorch
- Knowledge of NLP fundamentals (refer to [NLP Learning Journey](https://github.com/vuhung16au/nlp-learning-journey))
- Understanding of transformer architectures

## 📚 What We'll Cover
1. Basic Fill-Mask Pipeline
2. Exploring Different Contexts
3. Working with Different Models
4. Advanced Mask Filling Techniques
5. Practical Applications
6. Performance and Optimization

## What is Mask Filling?

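Mask filling, the inference-time use of masked language modeling, is a task where the model predicts missing tokens in a sentence from the surrounding context. Each gap is marked with a special mask token whose spelling depends on the model: RoBERTa-style models use `<mask>`, while BERT-style models use `[MASK]`. This notebook writes `<mask>` in its examples. Masked language modeling is the objective BERT was pre-trained with, and mask filling is useful for: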

- **Text completion**: Suggesting words to complete sentences
- **Error correction**: Finding and correcting typos or grammatical errors
- **Creative writing**: Generating alternative word choices
- **Language understanding**: Testing model comprehension of context
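
Because the mask token is not the same string in every checkpoint, it helps to keep examples model-agnostic. The sketch below is a minimal helper under that assumption: `to_model_mask` is a hypothetical name introduced here, and in practice the second argument would come from `tokenizer.mask_token` of whichever model you load.

```python
def to_model_mask(text: str, mask_token: str, placeholder: str = "<mask>") -> str:
    """Rewrite a generic placeholder to the mask token a given model expects."""
    if placeholder not in text:
        raise ValueError(f"No {placeholder!r} placeholder found in: {text!r}")
    return text.replace(placeholder, mask_token)


sentence = "The capital of France is <mask>."

# BERT-style tokenizers use [MASK]; RoBERTa-style tokenizers use <mask>.
bert_text = to_model_mask(sentence, "[MASK]")
roberta_text = to_model_mask(sentence, "<mask>")
print(bert_text)
print(roberta_text)
```

Raising on a missing placeholder is a design choice: it surfaces a malformed example before any model is called.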

In [None]:
# Install required packages (run only once)
# !pip install transformers torch datasets

# Import essential libraries
import torch
import numpy as np
import pandas as pd
from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM
import warnings
warnings.filterwarnings('ignore')

# Device detection function
def get_device() -> torch.device:
    """
    Automatically detect and return the best available device.
    
    Priority: CUDA > MPS (Apple Silicon) > CPU
    
    Returns:
        torch.device: The optimal device for current hardware
    """
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps") 
        print("🍎 Using Apple MPS (Apple Silicon)")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU (consider GPU for better performance)")
    
    return device

# Get the optimal device
device = get_device()
print(f"Selected device: {device}")

## Part 1: Basic Fill-Mask Pipeline

The simplest way to perform mask filling is the Hugging Face `pipeline` API. Let's start with a basic example:

In [None]:
# Create a fill-mask pipeline
# With no model specified, this downloads a default checkpoint
# (distilroberta-base at the time of writing, a RoBERTa-style model that uses <mask>)
unmasker = pipeline("fill-mask")

# Basic mask filling example
text = "This course will teach you all about <mask> models."
result = unmasker(text, top_k=2)

print("Basic Mask Filling Example:")
print(f"Input: {text}")
print("\nPredictions:")
for i, prediction in enumerate(result, 1):
    print(f"{i}. Token: '{prediction['token_str']}' | Score: {prediction['score']:.4f}")
    print(f"   Complete sentence: {prediction['sequence']}")

### Understanding the Results

Let's break down what each field in the result means:

In [None]:
# Detailed analysis of results
def analyze_mask_results(results):
    """
    Analyze and explain mask filling results in detail.
    
    Args:
        results: Output from fill-mask pipeline
    """
    print("📊 DETAILED RESULT ANALYSIS")
    print("=" * 50)
    
    for i, result in enumerate(results, 1):
        print(f"\nPrediction #{i}:")
        print(f"  🎯 Token: '{result['token_str']}'")
        print(f"  📈 Confidence Score: {result['score']:.4f} ({result['score']*100:.2f}%)")
        print(f"  🔢 Token ID: {result['token']}")
        print(f"  📝 Complete Sentence: {result['sequence']}")
        
        # Confidence level interpretation (rough heuristic thresholds chosen for this notebook)
        if result['score'] > 0.5:
            confidence = "Very High"
        elif result['score'] > 0.2:
            confidence = "High"
        elif result['score'] > 0.1:
            confidence = "Medium"
        else:
            confidence = "Low"
        
        print(f"  💡 Confidence Level: {confidence}")

# Analyze our previous results
analyze_mask_results(result)
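
To make the `score` field less mysterious: it is the softmax probability of a token at the mask position, computed over the whole vocabulary, so the top-k scores generally sum to less than 1. The sketch below reproduces that calculation in plain Python. The five-word vocabulary and the logits are made up for illustration; a real model scores tens of thousands of tokens.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(vocab, logits, k):
    """Return the k most probable (token, score) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical five-word vocabulary and made-up logits, for illustration only.
toy_vocab = ["sunny", "cold", "hot", "blue", "the"]
toy_logits = [3.2, 2.9, 2.7, 0.1, -1.5]

for token, score in top_k(toy_vocab, toy_logits, k=3):
    print(f"{token}: {score:.4f}")
```

When many words fit the context equally well, the probability mass is spread thin, so a low top score does not by itself mean the prediction is poor.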

## Part 2: Exploring Different Contexts

Let's try mask filling with various types of sentences to see how context affects predictions:

In [None]:
# Different context examples
examples = [
    "The weather today is very <mask>.",
    "I need to buy some <mask> from the grocery store.",
    "The capital of France is <mask>.",
    "Machine learning is a subset of <mask> intelligence.",
    "The <mask> is the largest planet in our solar system.",
    "Python is a popular <mask> language."
]

print("🔍 EXPLORING DIFFERENT CONTEXTS")
print("=" * 60)

for i, example in enumerate(examples, 1):
    print(f"\n{i}. Input: {example}")
    
    try:
        predictions = unmasker(example, top_k=3)
        print("   Top predictions:")
        
        for j, pred in enumerate(predictions, 1):
            print(f"      {j}. {pred['token_str']} (score: {pred['score']:.3f})")
            
    except Exception as e:
        print(f"   Error: {e}")

## Part 3: Working with Different Models

Different models have different strengths. Let's compare a few popular models for mask filling:

In [None]:
# List of models to compare
models_to_try = [
    "bert-base-uncased",
    "distilbert-base-uncased", 
    "roberta-base",
    "albert-base-v2"
]

test_sentence = "The best way to learn <mask> is through practice."

print("🏆 MODEL COMPARISON")
print("=" * 50)
print(f"Test sentence: {test_sentence}")

model_results = {}

for model_name in models_to_try:
    print(f"\n📱 Testing {model_name}:")
    
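    try:
        # Create pipeline with specific model
        model_unmasker = pipeline(
            "fill-mask", 
            model=model_name,
            device=device  # Use the device selected by get_device()
        )
        
        # Mask tokens differ by model ([MASK] for BERT-style, <mask> for RoBERTa-style),
        # so rewrite the placeholder to the token this model's tokenizer expects
        model_sentence = test_sentence.replace("<mask>", model_unmasker.tokenizer.mask_token)
        
        # Get predictions
        predictions = model_unmasker(model_sentence, top_k=2)
        model_results[model_name] = predictions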
        
        print("   Top predictions:")
        for i, pred in enumerate(predictions, 1):
            print(f"      {i}. '{pred['token_str']}' (score: {pred['score']:.4f})")
            
    except Exception as e:
        print(f"   ❌ Error with {model_name}: {e}")
        print("   💡 This might be due to memory constraints or model availability")

### Model Performance Analysis

Let's analyze the differences between models:

In [None]:
# Compare model predictions
if model_results:
    print("📊 MODEL PREDICTION COMPARISON")
    print("=" * 60)
    
    # Create comparison table
    comparison_data = []
    
    for model_name, predictions in model_results.items():
        if predictions:  # Check if we got results
            top_prediction = predictions[0]
            comparison_data.append({
                'Model': model_name,
                'Top Prediction': top_prediction['token_str'],
                'Confidence': f"{top_prediction['score']:.4f}",
                'Percentage': f"{top_prediction['score']*100:.2f}%"
            })
    
    if comparison_data:
        df_comparison = pd.DataFrame(comparison_data)
        print(df_comparison.to_string(index=False))
        
        print("\n🎯 Key Observations:")
        print("- Different models may predict different words")
        print("- Confidence scores vary between models")
        print("- Some models are more confident in their predictions")
        print("- Context understanding differs across architectures")
else:
    print("No model results to compare. This might be due to resource constraints.")
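
One way to quantify how much two models agree is to compare their top-k token sets. The sketch below is a minimal version of that idea, assuming prediction lists shaped like the pipeline output. `normalize_token` and `top_token_overlap` are hypothetical helper names, and the sample predictions are made up rather than real model results. Normalizing matters because RoBERTa-style tokenizers often decode tokens with a leading space and cased text, while uncased BERT models return lowercase tokens.

```python
def normalize_token(token_str: str) -> str:
    """Strip whitespace and lowercase so tokens from different tokenizers compare equal."""
    return token_str.strip().lower()


def top_token_overlap(preds_a, preds_b) -> float:
    """Jaccard overlap between the normalized token sets of two prediction lists."""
    set_a = {normalize_token(p["token_str"]) for p in preds_a}
    set_b = {normalize_token(p["token_str"]) for p in preds_b}
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up predictions shaped like fill-mask output (not real model results).
sample_bert = [{"token_str": "math", "score": 0.12}, {"token_str": "english", "score": 0.08}]
sample_roberta = [{"token_str": " Math", "score": 0.10}, {"token_str": " Spanish", "score": 0.07}]

print(f"Overlap: {top_token_overlap(sample_bert, sample_roberta):.2f}")
```

With the notebook's own results, the call would look like `top_token_overlap(model_results["bert-base-uncased"], model_results["roberta-base"])`, provided both models loaded successfully.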

## Part 4: Advanced Mask Filling Techniques

### Multiple Masks

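The fill-mask pipeline has experimental support for inputs with more than one mask token, but it scores each mask independently rather than jointly, so the combined result may not be coherent: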

In [None]:
# Multiple masks example (note: not all models support this well)
print("🎭 MULTIPLE MASKS EXAMPLE")
print("=" * 40)

# Single mask first for comparison
single_mask = "I love <mask> programming."
print(f"Single mask: {single_mask}")

try:
    single_result = unmasker(single_mask, top_k=3)
    print("Predictions:")
    for i, pred in enumerate(single_result, 1):
        print(f"  {i}. {pred['token_str']} (score: {pred['score']:.3f})")
except Exception as e:
    print(f"Error: {e}")

# Note about multiple masks
print("\n💡 Note about multiple masks:")
print("Masked language models score each mask position independently.")
print("For coherent multi-word fills, you typically need to:")
print("1. Fill one mask at a time, feeding each result back in")
print("2. Use a model trained for span infilling (for example T5 or BART)")

### Custom Processing Function

Let's create a more sophisticated mask filling function with additional features:

In [None]:
def advanced_mask_filling(text, model_name="bert-base-uncased", top_k=5, min_score=0.01):
    """
    Advanced mask filling with filtering and analysis.
    
    Args:
        text: Input text with <mask> token
        model_name: Model to use for prediction
        top_k: Number of top predictions to return
        min_score: Minimum confidence score to include
    
    Returns:
        Filtered and analyzed predictions
    """
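    try:
        # Create pipeline
        pipe = pipeline(
            "fill-mask", 
            model=model_name,
            device=device  # Use the device selected by get_device()
        )
        
        # Rewrite the <mask> placeholder to this model's own mask token
        # (bert-base-uncased expects [MASK], not <mask>)
        model_text = text.replace("<mask>", pipe.tokenizer.mask_token)
        
        # Get predictions
        predictions = pipe(model_text, top_k=top_k)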
        
        # Filter by minimum score
        filtered_predictions = [
            pred for pred in predictions 
            if pred['score'] >= min_score
        ]
        
        # Add analysis
        total_score = sum(pred['score'] for pred in filtered_predictions)
        
        results = {
            'input': text,
            'model': model_name,
            'predictions': filtered_predictions,
            'total_predictions': len(filtered_predictions),
            'confidence_sum': total_score,
            'top_prediction_confidence': filtered_predictions[0]['score'] if filtered_predictions else 0
        }
        
        return results
        
    except Exception as e:
        return {'error': str(e)}

# Test the advanced function
test_text = "Artificial <mask> is transforming many industries."
advanced_result = advanced_mask_filling(test_text, top_k=5, min_score=0.05)

print("🚀 ADVANCED MASK FILLING RESULTS")
print("=" * 50)

if 'error' not in advanced_result:
    print(f"Input: {advanced_result['input']}")
    print(f"Model: {advanced_result['model']}")
    print(f"Total valid predictions: {advanced_result['total_predictions']}")
    print(f"Sum of confidence scores: {advanced_result['confidence_sum']:.4f}")
    
    print("\nFiltered predictions (min score 0.05):")
    for i, pred in enumerate(advanced_result['predictions'], 1):
        print(f"{i:2d}. '{pred['token_str']:12}' | Score: {pred['score']:.4f} | {pred['sequence']}")
else:
    print(f"Error: {advanced_result['error']}")

## Part 5: Practical Applications

Let's explore some real-world applications of mask filling:

In [None]:
# Practical application examples
practical_examples = {
    "Text Completion": [
        "The meeting is scheduled for <mask> morning.",
        "Please send the report by <mask>.",
        "The new <mask> will improve our productivity."
    ],
    "Creative Writing": [
        "The mysterious <mask> appeared at midnight.",
        "She walked through the <mask> forest.",
        "The ancient <mask> held many secrets."
    ],
    "Educational Context": [
        "The process of photosynthesis occurs in <mask>.",
        "Newton's <mask> law states that every action has an equal and opposite reaction.",
        "The <mask> War lasted from 1914 to 1918."
    ]
}

print("🎯 PRACTICAL APPLICATIONS")
print("=" * 60)

for category, examples in practical_examples.items():
    print(f"\n📚 {category.upper()}:")
    print("-" * 30)
    
    for i, example in enumerate(examples, 1):
        print(f"\n{i}. {example}")
        
        try:
            predictions = unmasker(example, top_k=2)
            print("   Suggestions:")
            for j, pred in enumerate(predictions, 1):
                print(f"      {j}. {pred['token_str']} (confidence: {pred['score']:.3f})")
        except Exception as e:
            print(f"   Error: {e}")

### Word Sense Disambiguation

Mask filling can help understand how context affects word meaning:

In [None]:
# Word sense disambiguation examples
context_examples = [
    ("The bank of the river was muddy.", "Financial vs. Geographic"),
    ("He went to the <mask> to deposit money.", "Should predict 'bank' (financial)"),
    ("They sat on the <mask> of the river.", "Should predict 'bank' (geographic)"),
    ("The bat flew through the cave.", "Animal vs. Sports equipment"),
    ("He swung the <mask> at the cricket ball.", "Should predict 'bat' (cricket)"),
    ("The <mask> hung upside down from the tree.", "Should predict 'bat' (animal)")
]

print("🧠 WORD SENSE DISAMBIGUATION")
print("=" * 50)

for example, explanation in context_examples:
    if "<mask>" in example:
        print(f"\nExample: {example}")
        print(f"Expected: {explanation}")
        
        try:
            predictions = unmasker(example, top_k=3)
            print("Predictions:")
            for i, pred in enumerate(predictions, 1):
                print(f"  {i}. {pred['token_str']} (score: {pred['score']:.3f})")
        except Exception as e:
            print(f"Error: {e}")
    else:
        print(f"\nReference: {example} | {explanation}")
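
The "Should predict" notes above can be checked programmatically instead of by eye. The sketch below is one hedged way to do it: `rank_of` is a hypothetical helper that looks for the expected word among the returned predictions, and the sample predictions are made up to show the shape of the data, not real model output.

```python
def rank_of(expected, predictions):
    """Return the 1-based rank of `expected` in the predictions, or None if absent."""
    target = expected.strip().lower()
    for rank, pred in enumerate(predictions, 1):
        if pred["token_str"].strip().lower() == target:
            return rank
    return None


# Made-up predictions shaped like fill-mask output (not real model results).
sample_predictions = [
    {"token_str": " store", "score": 0.31},
    {"token_str": " bank", "score": 0.27},
    {"token_str": " office", "score": 0.05},
]

print(rank_of("bank", sample_predictions))   # 2
print(rank_of("river", sample_predictions))  # None
```

A result of `None` only means the word is outside the requested `top_k`; raising `top_k` shows how far down the list it actually sits.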

## Part 6: Performance and Optimization

### Batch Processing

For multiple texts, batch processing can be more efficient:

In [None]:
import time

# Batch processing example
batch_texts = [
    "The weather is <mask> today.",
    "I love eating <mask> for breakfast.",
    "The movie was <mask> entertaining.",
    "She works as a <mask> engineer.",
    "The book is very <mask> to read."
]

print("⚡ BATCH PROCESSING PERFORMANCE")
print("=" * 50)

# Individual processing
print("\n1. Individual Processing:")
start_time = time.time()

individual_results = []
for text in batch_texts:
    try:
        result = unmasker(text, top_k=1)
        individual_results.append((text, result[0]['token_str'], result[0]['score']))
    except Exception as e:
        individual_results.append((text, "ERROR", 0.0))

individual_time = time.time() - start_time

print(f"   Time taken: {individual_time:.3f} seconds")
print("   Results:")
for text, prediction, score in individual_results:
    print(f"     '{prediction}' for '{text}' (score: {score:.3f})")

# Note about batch processing
print("\n💡 Note about batch processing:")
print("- The loop above calls the pipeline once per text")
print("- The pipeline also accepts a list of texts and a batch_size argument")
print("- Batching usually raises throughput on GPU, at the cost of more memory per step")
print(f"- Current processing speed: {len(batch_texts)/individual_time:.1f} texts/second")
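
For comparison with the loop above, here is a sketch of a single batched call. It assumes the callable behaves like the fill-mask pipeline, which accepts a list of texts plus a `batch_size` argument and returns one prediction list per text. `run_batched` is a hypothetical helper, and `fake_batch_unmasker` is a stand-in with made-up outputs so the block runs offline.

```python
import time


def run_batched(unmasker, texts, batch_size=8):
    """Send all texts to the pipeline in one call and return (top tokens, seconds)."""
    start = time.perf_counter()
    outputs = unmasker(texts, top_k=1, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    # A one-element input list may come back as a flat list of predictions,
    # so wrap it to keep one prediction list per text.
    if outputs and isinstance(outputs[0], dict):
        outputs = [outputs]
    top_tokens = [preds[0]["token_str"].strip() for preds in outputs]
    return top_tokens, elapsed


def fake_batch_unmasker(texts, top_k=1, batch_size=1):
    """Stand-in for a real pipeline so the sketch runs offline (hypothetical outputs)."""
    return [[{"token_str": " nice", "score": 0.5}] for _ in texts]


tokens, seconds = run_batched(fake_batch_unmasker, ["It is <mask>.", "So <mask>!"])
print(tokens)
```

With the real pipeline, `run_batched(unmasker, batch_texts)` gives a timing to set against `individual_time`. On CPU, or with only a handful of short texts, batching may bring little or no speedup, so it is worth measuring on your own hardware.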

### Memory and Performance Tips

In [None]:
# Performance tips and memory management
print("🔧 PERFORMANCE OPTIMIZATION TIPS")
print("=" * 50)

print("\n1. 🚀 Model Selection:")
print("   - DistilBERT: Faster, smaller (66M params)")
print("   - BERT-base: Better accuracy (110M params)")
print("   - RoBERTa: Often better performance (125M params)")
print("   - ALBERT: Far fewer parameters via weight sharing (12M params), but similar compute to BERT-base")

print("\n2. 💾 Memory Management:")
print("   - Use device=0 for GPU acceleration")
print("   - Use torch_dtype=torch.float16 for half precision")
print("   - Clear GPU cache: torch.cuda.empty_cache()")

print("\n3. ⚡ Speed Optimization:")
print("   - Batch inputs with batch_size on GPU (top_k mainly affects post-processing, not the forward pass)")
print("   - Use smaller models for real-time applications")
print("   - Cache models to avoid reloading")

# Memory usage check
if torch.cuda.is_available():
    print("\n📊 Current GPU Memory Usage:")
    allocated = torch.cuda.memory_allocated() / 1024**3
    cached = torch.cuda.memory_reserved() / 1024**3
    print(f"   Allocated: {allocated:.2f}GB")
    print(f"   Cached: {cached:.2f}GB")
    
    # Clear cache if needed
    if cached > 1.0:  # If using more than 1GB
        torch.cuda.empty_cache()
        print("   🧹 Cleared GPU cache")
else:
    print("\n📊 CUDA not available, so GPU memory statistics are skipped")

## Summary

### 🔑 Key Concepts Mastered

- **Mask Filling Fundamentals**: Understanding how masked language models predict missing words using context
- **Pipeline Usage**: Using the `fill-mask` pipeline for quick and easy mask filling
- **Model Comparison**: Different models (BERT, DistilBERT, RoBERTa, ALBERT) have different strengths and performance characteristics
- **Result Interpretation**: Understanding confidence scores and how to filter and analyze predictions
- **Context Sensitivity**: How surrounding words influence predictions and word sense disambiguation

### 📈 Best Practices Learned

- **Model Selection**: Choose models based on your speed vs. accuracy requirements
- **Device Optimization**: Use GPU acceleration when available for better performance
- **Result Filtering**: Apply minimum confidence thresholds to get more reliable predictions
- **Error Handling**: Always implement proper error handling for robust applications
- **Memory Management**: Monitor and optimize memory usage, especially with large models

### 🚀 Next Steps

- **Advanced Applications**: Explore using mask filling for text correction and completion systems
- **Custom Training**: Learn how to fine-tune models for domain-specific mask filling
- **Integration**: Combine mask filling with other NLP tasks for comprehensive applications
- **Production Deployment**: Scale mask filling systems for real-world applications

### 💡 Key Takeaways

- Mask filling is a powerful technique for text completion, error correction, and understanding language context
- Different models excel in different scenarios - choose based on your specific needs
- Context is crucial - the same mask can have very different predictions based on surrounding words
- Performance optimization involves balancing model size, accuracy, and computational resources
- Always consider confidence scores when using predictions in real applications

Mask filling is a fundamental NLP technique that demonstrates the power of transformer models to understand and generate human language. The concepts learned here form the foundation for more advanced NLP applications!

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*