[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.2/07-summarization.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.2/07-summarization.ipynb)

# 📝 Text Summarization: Creating Concise Summaries

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- What text summarization is and why it matters
- The difference between extractive and abstractive summarization
- How to use pre-trained summarization models with Hugging Face
- Practical applications of text summarization
- Best practices for implementing summarization systems

## 📋 Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with Python and PyTorch
- Knowledge of NLP fundamentals

## 📚 What We'll Cover
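1. Introduction to Text Summarization
2. Setting up the Environment
3. Using Pre-trained Summarization Models
4. Your First Summarization
5. Building a More Advanced Summarization System
6. Practical Examples with Different Text Types
7. Understanding Summarization Parameters
8. Best Practices and Tips
9. Summary and Next Steps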

## Part 1: Introduction to Text Summarization

**Text summarization** is the task of producing a shorter version of a text while preserving its most important information. It is a common NLP task because it lets readers grasp long documents quickly.

### 🔍 Types of Summarization:

**Extractive Summarization:**
- Selects and combines existing sentences from the original text
- Maintains original phrasing and vocabulary
- Example: Highlighting key sentences in a document

**Abstractive Summarization:**
- Generates new text that captures the essence of the original
- Can rephrase and restructure information
- Reads more like human writing, but is more computationally expensive and can introduce details that are not in the source
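
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top-scoring sentences in their original order. The function names, the tiny stopword list, and the scoring rule are illustrative assumptions, not part of any library, and real extractive systems are considerably more careful.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stopword list (an assumption; real systems use larger lists)
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
             "that", "for", "on", "with", "as", "are", "was", "by"}


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def content_words(text: str) -> List[str]:
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text: str, num_sentences: int = 2) -> str:
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence: str) -> float:
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favored automatically
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


demo_text = (
    "Solar panels convert sunlight into electricity. "
    "My neighbor owns a red bicycle. "
    "Cheaper solar panels make solar electricity affordable for more homes."
)
print(extractive_summary(demo_text, num_sentences=2))
```

Every sentence in the output is copied verbatim from the input, which is the defining property of extractive summarization. The abstractive models used in the rest of this notebook generate new wording instead.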

### 🌟 Real-world Applications:
- News article summaries
- Research paper abstracts
- Meeting notes condensation
- Email and message summaries
- Document analysis for legal/medical fields

## Part 2: Setting up the Environment

In [None]:
# Import necessary libraries
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import time
from typing import List, Dict, Optional

# For visualization and analysis
import matplotlib.pyplot as plt
import pandas as pd

print("📦 Libraries imported successfully!")

In [None]:
# Device detection for optimal performance
def get_device() -> torch.device:
    """
    Automatically detect and return the best available device.
    
    Priority: CUDA > MPS (Apple Silicon) > CPU
    
    Returns:
        torch.device: The optimal device for current hardware
    """
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS (Apple Silicon)")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU (consider GPU for better performance)")
    
    return device

# Get optimal device
device = get_device()

## Part 3: Using Pre-trained Summarization Models

Let's start with the easiest approach - using Hugging Face's pipeline API for summarization.

In [None]:
# Initialize summarization pipeline
print("🔄 Loading summarization model...")
print("This may take a few minutes on first run (downloading model)")

try:
    # Use the BART model fine-tuned on the CNN/DailyMail dataset,
    # a widely used baseline for news summarization
    summarizer = pipeline(
        "summarization",
        model="facebook/bart-large-cnn",
        device=device  # Run on the detected device (CUDA, MPS, or CPU)
    )
    print("✅ Summarization model loaded successfully!")
    
except Exception as e:
    print(f"❌ Error loading model: {e}")
    print("💡 Trying alternative model...")
    
    # Fallback to a smaller model
    summarizer = pipeline(
        "summarization",
        model="sshleifer/distilbart-cnn-12-6",
        device=device  # Run on the detected device (CUDA, MPS, or CPU)
    )
    print("✅ Alternative summarization model loaded!")

## Part 4: Your First Summarization

Let's try summarizing a sample article to see how it works!

In [None]:
# Sample article for demonstration
sample_article = """
Artificial Intelligence (AI) has become one of the most transformative technologies of the 21st century, 
revolutionizing industries from healthcare to transportation. Machine learning algorithms, particularly 
deep neural networks, have achieved remarkable breakthroughs in tasks that were once considered 
impossible for computers to perform.

In healthcare, AI systems can now diagnose diseases from medical images with accuracy matching or 
exceeding human specialists. For instance, Google's AI can detect diabetic retinopathy from eye scans, 
potentially preventing blindness in millions of people worldwide. Similarly, AI-powered drug discovery 
platforms are accelerating the development of new medications, reducing the time and cost required 
to bring life-saving treatments to market.

The transportation industry has been equally transformed by autonomous vehicle technology. Companies 
like Tesla, Waymo, and Uber have invested billions in developing self-driving cars that promise to 
reduce traffic accidents, improve fuel efficiency, and provide mobility solutions for elderly and 
disabled populations.

However, the rapid advancement of AI also raises important concerns about job displacement, privacy, 
and the need for proper regulation. As AI systems become more capable, society must carefully 
consider how to harness their benefits while mitigating potential risks.
""".strip()

print("📄 Original Article:")
print("-" * 50)
print(sample_article)
print(f"\n📊 Article length: {len(sample_article.split())} words")

In [None]:
# Generate summary
print("🔄 Generating summary...")
start_time = time.time()

try:
    # Generate summary with specific parameters
    summary_result = summarizer(
        sample_article,
        max_length=100,      # Maximum summary length, in tokens (not words)
        min_length=30,       # Minimum summary length, in tokens (not words)
        do_sample=False,     # Use deterministic generation
        truncation=True      # Truncate if input is too long
    )
    
    # Extract summary text
    summary = summary_result[0]['summary_text']
    
    generation_time = time.time() - start_time
    
    print("✅ Summary generated successfully!")
    print(f"⏱️ Generation time: {generation_time:.2f} seconds")
    
except Exception as e:
    print(f"❌ Error generating summary: {e}")
    summary = "Error occurred during summarization"

In [None]:
# Display and analyze the summary
print("📝 Generated Summary:")
print("-" * 50)
print(summary)

# Calculate summary statistics
original_words = len(sample_article.split())
summary_words = len(summary.split())
compression_ratio = original_words / summary_words if summary_words > 0 else 0

print(f"\n📊 Summary Statistics:")
print(f"  Original length: {original_words} words")
print(f"  Summary length: {summary_words} words")
print(f"  Compression ratio: {compression_ratio:.1f}x")
print(f"  Reduction: {((original_words - summary_words) / original_words * 100):.1f}%")

## Part 5: Building a More Advanced Summarization System

Let's create a more sophisticated summarization system with better error handling and customization options.

In [None]:
class TextSummarizer:
    """
    A comprehensive text summarization system with educational focus.
    
    This class provides an easy-to-use interface for text summarization
    with built-in error handling, preprocessing, and analysis features.
    """
    
    def __init__(self, model_name: str = "facebook/bart-large-cnn"):
        """
        Initialize the summarizer with a specific model.
        
        Args:
            model_name: Hugging Face model identifier
        """
        self.model_name = model_name
        self.pipeline = None
        self._load_model()
    
    def _load_model(self):
        """Load the summarization model with error handling."""
        try:
            print(f"🔄 Loading model: {self.model_name}")
            self.pipeline = pipeline(
                "summarization",
                model=self.model_name,
                device=device  # Run on the detected device (CUDA, MPS, or CPU)
            )
            print("✅ Model loaded successfully")
            
        except Exception as e:
            print(f"❌ Error loading model: {e}")
            print("💡 Trying fallback model...")
            
            # Fallback to smaller model
            self.model_name = "sshleifer/distilbart-cnn-12-6"
            self.pipeline = pipeline(
                "summarization",
                model=self.model_name,
                device=device  # Run on the detected device (CUDA, MPS, or CPU)
            )
            print("✅ Fallback model loaded successfully")
    
    def preprocess_text(self, text: str) -> str:
        """
        Preprocess text before summarization.
        
        Args:
            text: Input text to preprocess
            
        Returns:
            Preprocessed text ready for summarization
        """
        if not text or not text.strip():
            raise ValueError("Input text is empty")
        
        # Basic preprocessing
        text = text.strip()
        
        # Check minimum length (30 words is a heuristic; very short inputs leave little to condense)
        words = text.split()
        if len(words) < 30:
            print(f"⚠️ Warning: Text is quite short ({len(words)} words). "
                  f"Summarization works best with longer texts.")
        
        return text
    
    def summarize(
        self, 
        text: str, 
        max_length: int = 130, 
        min_length: int = 30,
        num_beams: int = 4
    ) -> Dict:
        """
        Generate a summary of the input text.
        
        Args:
            text: Input text to summarize
            max_length: Maximum length of the summary, in tokens
            min_length: Minimum length of the summary, in tokens
            num_beams: Number of beams for beam search (higher = wider search, slower)
            
        Returns:
            Dictionary containing summary and analysis
        """
        # Preprocess input
        processed_text = self.preprocess_text(text)
        
        try:
            # Generate summary
            start_time = time.time()
            
            result = self.pipeline(
                processed_text,
                max_length=max_length,
                min_length=min_length,
                num_beams=num_beams,
                do_sample=False,
                truncation=True
            )
            
            generation_time = time.time() - start_time
            summary = result[0]['summary_text']
            
            # Calculate statistics
            original_words = len(text.split())
            summary_words = len(summary.split())
            compression_ratio = original_words / summary_words if summary_words > 0 else 0
            
            return {
                'original_text': text,
                'summary': summary,
                'original_length': original_words,
                'summary_length': summary_words,
                'compression_ratio': compression_ratio,
                'generation_time': generation_time,
                'model_used': self.model_name
            }
            
        except Exception as e:
            print(f"❌ Error during summarization: {e}")
            return {
                'error': str(e),
                'original_text': text,
                'summary': None
            }

# Initialize our advanced summarizer
advanced_summarizer = TextSummarizer()
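
Note that `truncation=True` silently drops any input beyond the model's limit, so the end of a long document never reaches the model. One common workaround is to split the text into chunks and summarize each chunk separately. The sketch below packs whole sentences into chunks under a word budget; `chunk_text` and its `max_words` default are hypothetical, and word count is only a rough proxy for token count (the tokenizer usually produces more tokens than words).

```python
import re
from typing import List


def chunk_text(text: str, max_words: int = 400) -> List[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0

    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and current_len + n > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n

    if current:
        chunks.append(" ".join(current))
    return chunks


long_text = "One two three. Four five six seven. Eight nine. Ten."
for i, chunk in enumerate(chunk_text(long_text, max_words=5), start=1):
    print(i, chunk)
```

Each chunk could then be passed to `advanced_summarizer.summarize(...)` and the partial summaries joined, or summarized once more for a shorter result. Chunking loses context that spans chunk boundaries, so treat it as a pragmatic workaround rather than a full solution.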

## Part 6: Practical Examples with Different Text Types

Let's test our summarizer with different types of content to see how it performs.

In [None]:
# Different types of content for testing
test_texts = {
    "news": """
    Scientists at MIT have developed a revolutionary new battery technology that could dramatically 
    extend the range of electric vehicles while reducing charging time to just five minutes. The 
    breakthrough involves using lithium metal anodes combined with a solid ceramic electrolyte, 
    which eliminates the safety concerns and performance limitations of current lithium-ion batteries.
    
    The new battery design can store up to three times more energy than conventional batteries while 
    maintaining stability over thousands of charge cycles. Initial testing shows that electric vehicles 
    equipped with these batteries could travel over 1,000 miles on a single charge, compared to the 
    300-400 mile range of current EVs.
    
    The research team, led by Dr. Sarah Chen from the University of Sydney, expects the technology to be commercially available 
    within five years. Major automakers including Tesla, Ford, and Toyota have already expressed 
    interest in licensing the technology. The development could accelerate the global transition 
    to electric vehicles and significantly reduce carbon emissions from transportation.
    """,
    
    "scientific": """
    Recent advances in quantum computing have brought us closer to achieving quantum supremacy in 
    practical applications. Researchers have successfully demonstrated quantum error correction using 
    a 70-qubit processor, marking a significant milestone in fault-tolerant quantum computing.
    
    The experiment involved creating logical qubits that are protected from environmental noise 
    through sophisticated error correction codes. By encoding quantum information across multiple 
    physical qubits, the system can detect and correct errors without destroying the quantum state.
    
    This breakthrough addresses one of the fundamental challenges in quantum computing: maintaining 
    coherent quantum states long enough to perform complex calculations. The implications for 
    cryptography, drug discovery, and materials science are profound, as quantum computers could 
    solve certain problems exponentially faster than classical computers.
    """,
    
    "business": """
    Global e-commerce giant Amazon announced record-breaking quarterly profits, driven by strong 
    growth in cloud computing services and advertising revenue. The company reported net income of 
    $15.3 billion for the third quarter, exceeding analyst expectations by 22%.
    
    Amazon Web Services (AWS), the company's cloud computing division, saw revenue increase by 35% 
    year-over-year as more businesses accelerate their digital transformation efforts. The advertising 
    business also performed exceptionally well, with revenue growing 58% as retailers increased their 
    marketing spend during the peak shopping season.
    
    CEO Andy Jassy attributed the success to continued innovation in artificial intelligence and 
    machine learning capabilities, which have improved operational efficiency and customer experience. 
    The company plans to invest heavily in expanding its logistics network and developing new AI-powered 
    services for business customers.
    """
}

print("🧪 Testing summarization on different content types...")
print("=" * 60)

In [None]:
# Test summarization on different content types
results = {}

for content_type, text in test_texts.items():
    print(f"\n📝 {content_type.upper()} ARTICLE:")
    print("-" * 40)
    
    # Generate summary
    result = advanced_summarizer.summarize(
        text.strip(),
        max_length=80,  # Shorter summaries for comparison
        min_length=25
    )
    
    if 'error' not in result:
        print(f"📄 Original ({result['original_length']} words):")
        print((text.strip()[:200] + "...") if len(text.strip()) > 200 else text.strip())
        
        print(f"\n📝 Summary ({result['summary_length']} words):")
        print(result['summary'])
        
        print(f"\n📊 Statistics:")
        print(f"  • Compression: {result['compression_ratio']:.1f}x")
        print(f"  • Generation time: {result['generation_time']:.2f}s")
        
        results[content_type] = result
    else:
        print(f"❌ Error: {result['error']}")

print(f"\n✅ Summarized {len(results)} of {len(test_texts)} content types.")

## Part 7: Understanding Summarization Parameters

Let's explore how different parameters affect the quality and style of summaries.

In [None]:
# Test different parameter configurations
parameter_tests = [
    {"name": "Short & Concise", "max_length": 50, "min_length": 20, "num_beams": 4},
    {"name": "Balanced", "max_length": 100, "min_length": 40, "num_beams": 4},
    {"name": "Detailed", "max_length": 150, "min_length": 60, "num_beams": 6},
    {"name": "High Quality (More Beams)", "max_length": 100, "min_length": 40, "num_beams": 8}
]

# Use the first test text for comparison
test_text = test_texts["news"]

print("🎛️ Testing Different Summarization Parameters")
print("=" * 50)
print(f"📄 Source text: {len(test_text.split())} words")
print()

for config in parameter_tests:
    print(f"\n🔧 Configuration: {config['name']}")
    print(f"   Max length: {config['max_length']}, Min length: {config['min_length']}, Beams: {config['num_beams']}")
    
    result = advanced_summarizer.summarize(
        test_text.strip(),
        max_length=config['max_length'],
        min_length=config['min_length'],
        num_beams=config['num_beams']
    )
    
    if 'error' not in result:
        print(f"📝 Summary ({result['summary_length']} words): {result['summary']}")
        print(f"⏱️ Time: {result['generation_time']:.2f}s, Ratio: {result['compression_ratio']:.1f}x")
    else:
        print(f"❌ Error: {result['error']}")
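
To build intuition for what `num_beams` changes, the toy sketch below runs beam search over a hand-written table of next-token probabilities. The table and the `beam_search` function are hypothetical stand-ins for a real model; the actual generation code in Transformers also handles end-of-sequence tokens, length penalties, and more.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy "model": next-token probabilities given the previous token
NEXT_TOKEN_PROBS: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.3, "end": 0.3},
    "a": {"summary": 0.9, "end": 0.1},
}


def beam_search(num_beams: int, steps: int = 2) -> Tuple[List[str], float]:
    """Return the most probable token sequence found and its probability."""
    # Each beam is (tokens so far, cumulative log-probability)
    beams: List[Tuple[List[str], float]] = [(["<s>"], 0.0)]

    for _ in range(steps):
        candidates = []
        for tokens, logp in beams:
            for token, p in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], logp + math.log(p)))
        # Keep only the num_beams most probable partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]

    best_tokens, best_logp = beams[0]
    return best_tokens[1:], math.exp(best_logp)


for k in (1, 2):
    tokens, prob = beam_search(num_beams=k)
    print(f"num_beams={k}: {' '.join(tokens)} (p={prob:.2f})")
```

With one beam (greedy search) the locally best first token "the" wins and the result is "the cat" with probability 0.24. With two beams the less likely first token "a" is kept alive and leads to "a summary" with probability 0.36. Wider beams explore more alternatives at a higher compute cost, which is why larger `num_beams` values run slower.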

## Part 8: Best Practices and Tips

Here are some important guidelines for effective text summarization:

In [None]:
def demonstrate_best_practices():
    """
    Demonstrate best practices for text summarization.
    """
    print("💡 Best Practices for Text Summarization")
    print("=" * 40)
    
    practices = [
        {
            "title": "🎯 Choose the Right Model",
            "description": "Different models work better for different content types",
            "examples": [
                "• facebook/bart-large-cnn: Best for news articles",
                "• google/pegasus-xsum: Good for very short summaries",
                "• t5-small: Lighter model for resource-constrained environments"
            ]
        },
        {
            "title": "📏 Optimize Length Parameters",
            "description": "Set appropriate length (measured in tokens) based on your use case",
            "examples": [
                "• Headlines: max_length=30-50 tokens",
                "• Social media: max_length=50-100 tokens",
                "• Executive summaries: max_length=100-200 tokens"
            ]
        },
        {
            "title": "🔧 Tune Generation Parameters",
            "description": "Adjust parameters for quality vs speed tradeoffs",
            "examples": [
                "• More beams (4-8): Wider search, often better summaries, slower generation",
                "• Fewer beams (1-3): Faster generation, potentially lower quality",
                "• length_penalty: With beam search, values > 0 favor longer summaries and values < 0 favor shorter ones"
            ]
        },
        {
            "title": "⚠️ Handle Edge Cases",
            "description": "Always validate inputs and handle errors gracefully",
            "examples": [
                "• Check minimum text length (30+ words recommended)",
                "• Handle empty or very short inputs",
                "• Implement fallback models for reliability"
            ]
        }
    ]
    
    for practice in practices:
        print(f"\n{practice['title']}")
        print(f"{practice['description']}")
        for example in practice['examples']:
            print(f"  {example}")
    
    print("\n🚨 Common Pitfalls to Avoid:")
    pitfalls = [
        "• Using summarization on very short texts (< 30 words)",
        "• Setting min_length too close to max_length",
        "• Ignoring model input limits (1024 tokens for facebook/bart-large-cnn; longer inputs are truncated)",
        "• Not preprocessing text (removing unnecessary formatting)",
        "• Expecting perfect summaries without domain-specific fine-tuning"
    ]
    
    for pitfall in pitfalls:
        print(f"  {pitfall}")

demonstrate_best_practices()
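
Several of the pitfalls above can be caught before the model is ever called. The sketch below is one possible pre-flight check: `check_summary_request` is a hypothetical helper, the 30-word threshold comes from this notebook, and the `min_gap` margin of 10 is an assumption you would tune for your own use case.

```python
from typing import List


def check_summary_request(
    text: str,
    min_length: int,
    max_length: int,
    min_words: int = 30,  # threshold recommended in this notebook
    min_gap: int = 10,    # assumed margin between min_length and max_length
) -> List[str]:
    """Return a list of issues; an empty list means nothing suspicious was found."""
    issues: List[str] = []
    num_words = len(text.split())

    if num_words == 0:
        issues.append("input text is empty")
    elif num_words < min_words:
        issues.append(f"input has only {num_words} words (fewer than {min_words})")

    if min_length >= max_length:
        issues.append("min_length must be smaller than max_length")
    elif max_length - min_length < min_gap:
        issues.append(f"min_length and max_length differ by less than {min_gap}")

    # Rough check only: word count is a proxy for token count
    if 0 < num_words <= max_length:
        issues.append("max_length is not shorter than the input")

    return issues


print(check_summary_request("too short to summarize", min_length=45, max_length=50))
```

Returning a list of issues instead of raising lets the caller decide whether to warn, adjust the parameters, or reject the request.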

## Summary

Congratulations! You've successfully learned the fundamentals of text summarization with Hugging Face transformers.

### 🔑 Key Concepts Mastered
- **Text Summarization Basics**: Understanding extractive vs abstractive approaches
- **Model Usage**: Working with pre-trained summarization models
- **Pipeline API**: Using Hugging Face's simple pipeline interface
- **Parameter Tuning**: Optimizing length, quality, and performance
- **Best Practices**: Handling different content types and edge cases

### 📈 Best Practices Learned
- Choose models appropriate for your content type and use case
- Optimize length parameters based on desired output format
- Implement proper error handling for production systems
- Consider computational resources when selecting models and parameters

### 🚀 Next Steps
- **Advanced Fine-tuning**: Learn to fine-tune models on domain-specific data
- **Evaluation Metrics**: Explore ROUGE scores and other summarization metrics  
- **Multi-document Summarization**: Summarize multiple documents together
- **Real-time Applications**: Build web services and APIs for summarization
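
As a first taste of the evaluation metrics mentioned above, here is a minimal sketch of ROUGE-1, which measures unigram overlap between a generated summary and a reference summary. This simplified version splits on whitespace and lowercases only; established implementations, such as the one available through the Hugging Face `evaluate` library, add proper tokenization, optional stemming, and further variants like ROUGE-2 and ROUGE-L.

```python
from collections import Counter
from typing import Dict


def rouge1(candidate: str, reference: str) -> Dict[str, float]:
    """Simplified ROUGE-1: precision, recall and F1 over lowercase unigrams."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((cand_counts & ref_counts).values())

    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print({name: round(value, 3) for name, value in scores.items()})
```

In this example five of the six words overlap, so precision, recall, and F1 all equal 5/6. High ROUGE does not guarantee a faithful summary, so it is best read alongside a manual review of the outputs.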

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*