[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic2.5/multiple-sequences.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic2.5/multiple-sequences.ipynb)

# Handling Multiple Sequences: Padding, Attention Masks, and Long Context

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- How to handle sequences of different lengths in batch processing
- The importance and mechanics of padding and attention masks
- Strategies for dealing with longer sequences and truncation
- Advanced techniques with Longformer for extended context understanding
- Visual analysis of attention patterns across different sequence lengths

## 📋 Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with Python and PyTorch
- Knowledge of NLP fundamentals (refer to [NLP Learning Journey](https://github.com/vuhung16au/nlp-learning-journey))
- Understanding of tokenization concepts (refer to `02_tokenizers.ipynb`)

## 📚 What We'll Cover
1. Section 1: Understanding the Multiple Sequences Problem
2. Section 2: Padding and Truncation Strategies
3. Section 3: Attention Masks Deep Dive
4. Section 4: Handling Longer Sequences
5. Section 5: Longformer for Extended Context
6. Section 6: Visualizing Attention Patterns
7. Section 7: Best Practices and Performance Optimization
8. Section 8: Summary and Next Steps

In [None]:
# Import essential libraries for this comprehensive tutorial
import torch
import torch.nn.functional as F
from transformers import (
    AutoTokenizer, AutoModel, AutoModelForSequenceClassification,
    LongformerTokenizer, LongformerModel, LongformerForSequenceClassification,
    AutoConfig, BertTokenizer, BertModel, RobertaTokenizer, RobertaModel
)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from typing import List, Dict, Optional, Union, Tuple
import warnings
warnings.filterwarnings('ignore')

# Load environment variables from .env.local for local development
import os
try:
    from dotenv import load_dotenv
    load_dotenv('.env.local', override=True)
    print("Environment variables loaded from .env.local")
except ImportError:
    print("python-dotenv not installed, skipping .env.local loading")

# For Google Colab compatibility
try:
    from google.colab import userdata
    COLAB_AVAILABLE = True
except ImportError:
    COLAB_AVAILABLE = False

def get_api_key(key_name: str, required: bool = False) -> Optional[str]:
    """
    Load API key from environment or Google Colab secrets.
    
    Args:
        key_name: Environment variable name
        required: Whether to raise error if not found
        
    Returns:
        API key string or None
    """
    # Try Colab secrets first
    if COLAB_AVAILABLE:
        try:
            return userdata.get(key_name)
        except Exception:
            # Secret missing or access denied: fall back to environment variables
            pass
    
    # Try environment variable
    api_key = os.getenv(key_name)
    
    if required and not api_key:
        raise ValueError(
            f"{key_name} not found. Set it in:\n"
            f"- Local: .env.local file\n"
            f"- Colab: Secrets manager"
        )
    
    return api_key

def get_device() -> torch.device:
    """
    Automatically detect and return the best available device.
    
    Priority: CUDA > MPS (Apple Silicon) > CPU
    
    Returns:
        torch.device: The optimal device for current hardware
    """
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
    elif getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
        device = torch.device("mps") 
        print("🍎 Using Apple MPS (Apple Silicon)")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU (consider GPU for better performance)")
    
    return device

# Setup authentication and device
hf_token = get_api_key('HF_TOKEN', required=False)
if hf_token:
    os.environ['HF_TOKEN'] = hf_token
    print("✅ Hugging Face token configured")

device = get_device()

# Set up plotting style for educational visualizations
plt.style.use('default')  # Use default style for better compatibility
sns.set_palette("husl")

print(f"\n=== Setup Information ===")
print(f"PyTorch version: {torch.__version__}")
print(f"Device: {device}")
print(f"Ready for multiple sequence processing! 🎯")

## Section 1: Understanding the Multiple Sequences Problem

Real-world NLP workloads rarely process one sequence at a time. Models are fed batches of texts whose lengths vary widely, which creates several challenges:

### Key Challenges:
- **Variable Length**: Different texts tokenize to different numbers of tokens
- **Batch Processing**: A batch is stored as one rectangular tensor, so every sequence in it must have the same length
- **Memory Efficiency**: Self-attention cost grows quadratically with sequence length, and every padded position adds wasted computation
- **Attention Mechanics**: The model needs to know which positions hold real tokens and which hold padding
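
Before loading a real tokenizer, the challenges above can be sketched in plain Python. The token IDs below are made up for illustration, and `is_rectangular` and `padding_overhead` are hypothetical helper names, not part of any library.

```python
# Hypothetical token-ID lists standing in for tokenizer output (IDs are made up).
batch = [
    [101, 2307, 2695, 102],                          # 4 tokens
    [101, 9932, 102],                                # 3 tokens
    [101, 1045, 2428, 5632, 3752, 2023, 3720, 102],  # 8 tokens
]

def is_rectangular(rows):
    """Return True if all rows have the same length, i.e. they could form one 2-D tensor."""
    return len({len(row) for row in rows}) <= 1

def padding_overhead(rows):
    """Fraction of positions that are padding once every row is padded to the longest row."""
    longest = max(len(row) for row in rows)
    total_slots = longest * len(rows)
    real_tokens = sum(len(row) for row in rows)
    return (total_slots - real_tokens) / total_slots

print("Rectangular:", is_rectangular(batch))
print(f"Padding overhead: {padding_overhead(batch):.1%}")
```

For this toy batch, padding every row to 8 tokens leaves 9 of the 24 positions (37.5%) as padding, which is the waste that attention masks and smarter batching aim to contain.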

Let's start by demonstrating this problem with real examples.

In [None]:
def demonstrate_sequence_length_problem():
    """
    Demonstrate the variable sequence length problem with real text examples.
    """
    print("🔍 DEMONSTRATING THE MULTIPLE SEQUENCES PROBLEM")
    print("=" * 55)
    
    # Example texts with varying lengths (word counts noted per line)
    example_texts = [
        "Great post!",  # Short: 2 words
        "I really enjoyed reading this article about machine learning.",  # Medium: 9 words
        "This comprehensive tutorial on natural language processing and transformer models provides detailed insights into modern NLP techniques and their practical applications in industry.",  # Long: 23 words
        "AI",  # Very short: 1 word
        "The development of large language models has revolutionized how we approach various NLP tasks, from text classification to question answering, enabling more sophisticated and nuanced understanding of human language patterns."  # Very long: 30 words
    ]
    
    # Load a BERT tokenizer for demonstration
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    
    print("📊 Text Length Analysis:")
    print("-" * 40)
    
    tokenized_lengths = []
    for i, text in enumerate(example_texts):
        # Tokenize without any padding or truncation
        tokens = tokenizer.encode(text, add_special_tokens=True)
        word_count = len(text.split())
        token_count = len(tokens)
        
        tokenized_lengths.append(token_count)
        
        print(f"{i+1}. Words: {word_count:2d} | Tokens: {token_count:2d} | Text: '{text[:50]}{'...' if len(text) > 50 else ''}'")
    
    print(f"\n📈 Token Length Statistics:")
    print(f"   Min length: {min(tokenized_lengths)} tokens")
    print(f"   Max length: {max(tokenized_lengths)} tokens")
    print(f"   Range: {max(tokenized_lengths) - min(tokenized_lengths)} tokens")
    print(f"   Average: {np.mean(tokenized_lengths):.1f} tokens")
    
    # Visualize the length distribution
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    bars = plt.bar(range(1, len(tokenized_lengths) + 1), tokenized_lengths, color='skyblue', alpha=0.8)
    plt.title('Token Count per Sequence')
    plt.xlabel('Sequence Number')
    plt.ylabel('Number of Tokens')
    plt.grid(axis='y', alpha=0.3)
    
    # Add value labels on bars
    for bar, length in zip(bars, tokenized_lengths):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5, 
                str(length), ha='center', va='bottom')
    
    plt.subplot(1, 2, 2)
    # Create a visual representation of sequence lengths
    sequences_visual = np.zeros((len(tokenized_lengths), max(tokenized_lengths)))
    for i, length in enumerate(tokenized_lengths):
        sequences_visual[i, :length] = 1
    
    # 'Blues' maps 1 (token) to dark blue and 0 (padding needed) to near-white, matching the title
    plt.imshow(sequences_visual, cmap='Blues', aspect='auto')
    plt.title('Sequence Length Visualization\n(Blue = tokens, White = padding needed)')
    plt.xlabel('Token Position')
    plt.ylabel('Sequence Number')
    plt.yticks(range(len(tokenized_lengths)), [f'Seq {i+1}' for i in range(len(tokenized_lengths))])
    
    plt.tight_layout()
    plt.show()
    
    print("\n🚨 THE PROBLEM:")
    print("   • Neural networks need fixed-size inputs for batch processing")
    print("   • Variable sequence lengths prevent efficient batching")
    print("   • Need padding to make all sequences the same length")
    print("   • Need attention masks to ignore padded positions")
    
    return example_texts, tokenized_lengths

# Run the demonstration
example_texts, tokenized_lengths = demonstrate_sequence_length_problem()

## Section 2: Padding and Truncation Strategies

To solve the variable sequence length problem, we use **padding** and **truncation**:

### Padding Strategies:
- **Right Padding**: Add padding tokens to the end of sequences (most common)
- **Left Padding**: Add padding tokens to the beginning (typical for decoder-only models during batched generation, so that newly generated tokens directly follow the real text)
- **Dynamic Padding**: Pad only to the length of the longest sequence in the batch
- **Fixed Padding**: Pad all sequences to a predetermined maximum length
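
The four padding strategies above differ only in which side receives the padding and what target length is used. A minimal sketch in plain Python, assuming a pad ID of 0 and made-up token IDs (`pad_batch` is a hypothetical helper, not a Transformers API):

```python
from typing import Dict, List, Optional

def pad_batch(
    sequences: List[List[int]],
    pad_id: int = 0,               # assumption: 0 is the pad token ID
    side: str = "right",
    max_length: Optional[int] = None,
) -> Dict[str, List[List[int]]]:
    """Pad token-ID lists to a common length and build matching attention masks.

    max_length=None -> dynamic padding (pad to the longest sequence in this batch)
    max_length=N    -> fixed padding (pad every sequence to N; no truncation here)
    """
    if side not in ("right", "left"):
        raise ValueError("side must be 'right' or 'left'")
    longest = max(len(seq) for seq in sequences)
    target = longest if max_length is None else max_length
    if target < longest:
        raise ValueError("max_length is shorter than the longest sequence; truncate first")

    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (target - len(seq))
        real = [1] * len(seq)      # 1 marks a real token
        zeros = [0] * len(pad)     # 0 marks a padded position
        if side == "right":
            input_ids.append(seq + pad)
            attention_mask.append(real + zeros)
        else:
            input_ids.append(pad + seq)
            attention_mask.append(zeros + real)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy = [[5, 6, 7], [8]]                 # made-up token IDs
print(pad_batch(toy))                  # dynamic, right padding
print(pad_batch(toy, side="left"))     # dynamic, left padding
print(pad_batch(toy, max_length=5))    # fixed-length padding
```

The attention mask is built alongside the padded IDs, so the two can never drift apart; this mirrors the `input_ids` / `attention_mask` pair that a tokenizer call returns.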

### Truncation Strategies:
- **Right Truncation**: Remove tokens from the end
- **Left Truncation**: Remove tokens from the beginning  
- **Middle Truncation**: Remove tokens from the middle (less common)
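
The three truncation strategies can be illustrated on a plain list of token IDs. This is a simplified sketch: `truncate` is a hypothetical helper, and unlike a real tokenizer it does not preserve special tokens such as the start and end markers.

```python
from typing import List

def truncate(ids: List[int], max_length: int, side: str = "right") -> List[int]:
    """Shorten a token-ID list to at most max_length tokens."""
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    if len(ids) <= max_length:
        return list(ids)                       # already short enough
    if side == "right":
        return ids[:max_length]                # keep the beginning
    if side == "left":
        return ids[len(ids) - max_length:]     # keep the end
    if side == "middle":
        head = (max_length + 1) // 2           # the head gets the extra token when odd
        tail = max_length - head
        return ids[:head] + ids[len(ids) - tail:]
    raise ValueError("side must be 'right', 'left', or 'middle'")

ids = list(range(10))                          # stand-in for ten token IDs
print(truncate(ids, 4, side="right"))          # [0, 1, 2, 3]
print(truncate(ids, 4, side="left"))           # [6, 7, 8, 9]
print(truncate(ids, 4, side="middle"))         # [0, 1, 8, 9]
```

Which side to keep depends on where the informative content sits: right truncation suits texts whose key content comes first, left truncation suits dialogue histories where the latest turns matter most, and middle truncation keeps both the opening and the closing of a document.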

Let's implement and compare these strategies:

In [None]:
def demonstrate_padding_strategies():
    """
    Demonstrate different padding and truncation strategies.
    """
    print("🛠️ PADDING AND TRUNCATION STRATEGIES")
    print("=" * 45)
    
    # Load the tokenizer of a RoBERTa-based hate speech detection model
    tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-hate-latest")

    # Benign example texts of varying lengths, typical of inputs to a moderation model
    texts = [
        "This message promotes understanding and respect.",
        "Great work!",
        "I completely disagree with this perspective, but I respect your right to express it and welcome civil discourse on the topic.",
        "Thanks"
    ]
    
    print("📝 Original texts:")
    for i, text in enumerate(texts):
        print(f"{i+1}. '{text}'")
    
    print("\n" + "=" * 60)
    
    # Strategy 1: No padding/truncation (shows the problem)
    print("\n1️⃣ NO PADDING/TRUNCATION (The Problem):")
    try:
        unpadded = tokenizer(texts, return_tensors="pt")
        print("   ✅ Successful batching")
    except Exception as e:
        print(f"   ❌ Error: {str(e)[:100]}...")
        print("   This is why we need padding!")
    
    # Strategy 2: Right padding (most common)
    print("\n2️⃣ RIGHT PADDING:")
    right_padded = tokenizer(texts, padding=True, return_tensors="pt")
    print(f"   Shape: {right_padded['input_ids'].shape}")
    print(f"   Attention mask shape: {right_padded['attention_mask'].shape}")
    
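    # Strategy 3: Fixed max length padding
    # truncation=True is needed here: the third text exceeds 20 tokens, and without
    # truncation the rows would have unequal lengths and tensor conversion would fail
    print("\n3️⃣ FIXED LENGTH PADDING (max_length=20, longer texts truncated):")
    fixed_padded = tokenizer(texts, padding='max_length', truncation=True, max_length=20, return_tensors="pt")
    print(f"   Shape: {fixed_padded['input_ids'].shape}")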
    
    # Strategy 4: Truncation with padding
    print("\n4️⃣ TRUNCATION + PADDING (max_length=15):")
    truncated_padded = tokenizer(texts, padding=True, truncation=True, max_length=15, return_tensors="pt")
    print(f"   Shape: {truncated_padded['input_ids'].shape}")
    
    # Visualize the different strategies
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Padding and Truncation Strategies Visualization', fontsize=16)
    
    strategies = [
        ("Right Padding", right_padded),
        ("Fixed Length (20)", fixed_padded),
        ("Truncated + Padded (15)", truncated_padded),
        ("Attention Masks", right_padded)  # Show attention masks
    ]
    
    for idx, (title, encoded) in enumerate(strategies):
        row, col = idx // 2, idx % 2
        ax = axes[row, col]
        
        if title == "Attention Masks":
            # Show attention masks
            data = encoded['attention_mask'].numpy()
            cmap = 'RdBu_r'
        else:
            # Show input IDs (replace pad tokens with -1 for visualization)
            data = encoded['input_ids'].numpy()
            pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
            data = np.where(data == pad_token_id, -1, 1)  # 1 for real tokens, -1 for padding
            cmap = 'RdYlBu_r'
        
        im = ax.imshow(data, cmap=cmap, aspect='auto')
        ax.set_title(title)
        ax.set_xlabel('Token Position')
        ax.set_ylabel('Sequence Number')
        ax.set_yticks(range(len(texts)))
        ax.set_yticklabels([f'Text {i+1}' for i in range(len(texts))])
        
        # Add colorbar
        plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
    
    plt.tight_layout()
    plt.show()
    
    # Show detailed token analysis for first text
    print("\n🔍 DETAILED ANALYSIS - First Text:")
    print(f"   Original: '{texts[0]}'")
    
    # Show tokens for different strategies
    strategies_detail = [
        ("Right Padded", right_padded['input_ids'][0]),
        ("Fixed Length", fixed_padded['input_ids'][0]),  
        ("Truncated", truncated_padded['input_ids'][0])
    ]
    
    for name, token_ids in strategies_detail:
        tokens = tokenizer.convert_ids_to_tokens(token_ids)
        print(f"   {name:15}: {tokens}")
        print(f"   {'Length':15}: {len([t for t in tokens if t != tokenizer.pad_token])} real tokens, {len(tokens)} total")
    
    return {
        'right_padded': right_padded,
        'fixed_padded': fixed_padded,
        'truncated_padded': truncated_padded
    }

# Run the demonstration
padding_results = demonstrate_padding_strategies()
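
The attention masks plotted above only matter because of what happens inside the attention layer. A common way to apply a mask, sketched here in plain Python with made-up scores, is to set the scores of padded positions to negative infinity before the softmax so that they receive exactly zero weight. `masked_softmax` is a hypothetical helper for illustration, not the code path of any particular model.

```python
import math
from typing import List

def masked_softmax(scores: List[float], mask: List[int]) -> List[float]:
    """Softmax over attention scores where positions with mask == 0 get zero weight."""
    if len(scores) != len(mask):
        raise ValueError("scores and mask must have the same length")
    if not any(mask):
        raise ValueError("at least one position must be unmasked")
    # Padded positions are pushed to -inf so that exp() maps them to 0.0
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 0.5]   # made-up raw attention scores for one query
mask = [1, 1, 0, 0]             # the last two positions are padding
weights = masked_softmax(scores, mask)
print([round(w, 4) for w in weights])
```

Without the mask, the two padded positions would absorb part of the attention weight and leak padding into the output representation.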

---

## 📋 Summary

### 🔑 Key Concepts Mastered
- **Multiple Sequence Challenge**: Understanding why variable-length sequences create processing difficulties
- **Padding and Truncation**: Strategic approaches to normalize sequence lengths for batch processing
- **Attention Masks**: Critical mechanism for distinguishing real content from padding tokens
- **Long Sequence Handling**: Techniques for processing sequences beyond standard model limits
- **Longformer Architecture**: Specialized model design for extended context understanding
- **Attention Visualization**: Methods to understand model behavior across different sequence types
- **Performance Optimization**: Best practices for efficient multiple sequence processing

### 📈 Best Practices Learned
- Use dynamic padding instead of fixed max_length for memory efficiency
- Always implement proper attention masking to prevent padding interference
- Group sequences by similar lengths for optimal batching efficiency
- Choose a long-context model such as Longformer when inputs are consistently long
- Profile your data's length distribution to inform processing strategies
- Monitor memory usage and adjust batch sizes accordingly
- Visualize attention patterns to understand model behavior and debug issues
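
The benefit of grouping sequences by length can be estimated without running a model. The sketch below uses made-up token counts and hypothetical helpers (`make_batches`, `padded_positions`) to count how many padding tokens dynamic padding would add with and without sorting by length.

```python
from typing import List

def make_batches(lengths: List[int], batch_size: int, sort_by_length: bool) -> List[List[int]]:
    """Split sequence lengths into batches, optionally after sorting them by length."""
    ordered = sorted(lengths) if sort_by_length else list(lengths)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padded_positions(batches: List[List[int]]) -> int:
    """Total padding tokens when each batch is padded to its own longest sequence."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)

# Made-up token counts for eight texts, alternating short and long
lengths = [4, 30, 6, 28, 5, 31, 7, 29]

unsorted_cost = padded_positions(make_batches(lengths, batch_size=4, sort_by_length=False))
sorted_cost = padded_positions(make_batches(lengths, batch_size=4, sort_by_length=True))
print(f"Padding tokens without grouping: {unsorted_cost}")  # 104
print(f"Padding tokens with grouping:    {sorted_cost}")    # 12
```

Fully sorting a training set removes the randomness that shuffling provides, so a common compromise is to shuffle first and then sort only within larger chunks of the data.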

### 🚀 Next Steps
- **Advanced Training**: Explore gradient accumulation and mixed precision training
- **Custom Models**: Implement custom attention mechanisms for specific use cases
- **Production Deployment**: Learn about model serving and inference optimization
- **Evaluation Metrics**: Understand how sequence length affects model performance
- **Data Preprocessing**: Study advanced techniques for handling diverse text sources

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*