# Text Summarization for Long Documents using Hugging Face

## Problem Statement

Summarization models such as BART and T5 accept only a limited number of input tokens (1,024 for facebook/bart-large-cnn), so they cannot process long documents directly. This notebook implements a sliding window approach: it breaks a long text into overlapping chunks that fit the model, summarizes each chunk, and combines the results into a final summary.

## Setup and Installation

In [None]:
# Install required libraries
!pip install transformers datasets torch

## Import Required Modules

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer, pipeline
import torch
import numpy as np
import textwrap

## Load Pre-trained BART Model

We'll use Facebook's BART model fine-tuned on the CNN/Daily Mail dataset for summarization.

In [None]:
# Load pre-trained BART model and tokenizer
model_name = 'facebook/bart-large-cnn'
print(f"Loading model: {model_name}")

model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

# Create summarization pipeline
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)
print("Model loaded successfully!")

## Implement Sliding Window Summarization

This function handles long documents by splitting them into overlapping chunks.

In [None]:
def sliding_window_summarize(text, max_input_length=1024, max_output_length=200, stride=512):
    """
    Summarizes a long document using a sliding window approach.
    
    Parameters:
    - text: The input long text that needs to be summarized.
    - max_input_length: The max token length for the model's input (default: 1024).
    - max_output_length: The max token length for the model's output (default: 200).
    - stride: The number of tokens shared by consecutive chunks, i.e. the overlap
      (default: 512). Each window advances by (window size - stride) tokens.
    
    Returns:
    - final_summary: The combined summary from all chunks.
    """
    # Reserve two positions per chunk: the pipeline adds <s> and </s> itself
    window = max_input_length - 2
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < max_input_length - 2")
    
    # Tokenize (without special tokens) to get an accurate length
    tokens = tokenizer.encode(text, truncation=False, add_special_tokens=False)
    
    # If text is short enough, summarize directly
    if len(tokens) <= window:
        summary = summarizer(text, max_length=max_output_length, min_length=50, do_sample=False)
        return summary[0]['summary_text']
    
    # Split the text into overlapping chunks
    chunk_texts = []
    step = window - stride
    
    for i in range(0, len(tokens), step):
        chunk_tokens = tokens[i:i + window]
        chunk_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True)
        chunk_texts.append(chunk_text)
        if i + window >= len(tokens):
            break  # This chunk reaches the end; later windows would only repeat it
    
    print(f"Document split into {len(chunk_texts)} chunks")
    
    # Summarize each chunk. truncation=True is a safety net in case decoding and
    # re-tokenizing a chunk yields a few more tokens than the window allows.
    summaries = []
    for i, chunk in enumerate(chunk_texts):
        print(f"Summarizing chunk {i+1}/{len(chunk_texts)}...")
        summary = summarizer(chunk, max_length=max_output_length, min_length=50, do_sample=False, truncation=True)
        summaries.append(summary[0]['summary_text'])
    
    # Join summaries from all chunks
    intermediate_summary = ' '.join(summaries)
    
    # If the combined summary is still too long, summarize it again
    intermediate_tokens = tokenizer.encode(intermediate_summary, truncation=False, add_special_tokens=True)
    if len(intermediate_tokens) > max_input_length:
        print("Combined summary is long, performing final summarization...")
        # truncation=True keeps the input within the model limit; text beyond it is dropped
        final_summary = summarizer(intermediate_summary, max_length=max_output_length*2, min_length=100, do_sample=False, truncation=True)
        return final_summary[0]['summary_text']
    
    return intermediate_summary
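
To make the windowing in the function above easier to follow, here is a minimal sketch that computes only the token offsets of each chunk, without loading a model. The helper name `window_bounds` and its `overlap` parameter are illustrative assumptions, not part of the notebook; `overlap` plays the role that `stride` plays in `sliding_window_summarize`.

```python
def window_bounds(n_tokens, window, overlap):
    """Return (start, end) token offsets of overlapping windows covering n_tokens.

    Hypothetical helper for illustration: consecutive windows share `overlap`
    tokens, so each window starts `window - overlap` tokens after the previous one.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    bounds = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        bounds.append((start, end))
        if end >= n_tokens:
            break  # the current window already reaches the end of the document
        start += step
    return bounds


# A 2,500-token document with 1,024-token windows that share 512 tokens
for start, end in window_bounds(2500, window=1024, overlap=512):
    print(f"tokens {start:>5} to {end:>5}")
```

Stopping as soon as a window reaches the end of the document avoids emitting a short trailing chunk whose content the previous window has already covered.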

## Helper Function to Display Results

In [None]:
def display_summary(original_text, summary):
    """Display the original text and summary in a formatted way."""
    print("=" * 80)
    print("ORIGINAL TEXT LENGTH:", len(original_text), "characters")
    print("=" * 80)
    print(textwrap.fill(original_text[:500] + "...", width=80))
    print("\n" + "=" * 80)
    print("SUMMARY LENGTH:", len(summary), "characters")
    print("=" * 80)
    print(textwrap.fill(summary, width=80))
    print("=" * 80)

## Test with Sample Long Document

In [None]:
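# Sample long text - Replace with your actual document.
# Note: at roughly 600 words this sample may fit in a single 1024-token window,
# in which case it is summarized directly; use a longer document to exercise chunking.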
long_text = """
The field of artificial intelligence has undergone remarkable transformations since its inception in the 1950s. 
What began as theoretical discussions among mathematicians and computer scientists has evolved into one of the 
most influential technologies of the 21st century. The journey from simple rule-based systems to today's 
sophisticated deep learning models represents decades of innovation, setbacks, and breakthroughs.

In the early days, AI researchers were optimistic about creating machines that could think like humans within 
a few decades. The Dartmouth Conference of 1956, often considered the birth of AI as a field, brought together 
pioneers like John McCarthy, Marvin Minsky, and Claude Shannon. They believed that "every aspect of learning 
or any other feature of intelligence can in principle be so precisely described that a machine can be made to 
simulate it."

The 1960s and 1970s saw the development of expert systems and symbolic AI. These systems used predefined rules 
and logic to solve specific problems. ELIZA, created by Joseph Weizenbaum in 1964, was one of the first 
chatbots that could engage in seemingly intelligent conversation. However, it quickly became apparent that 
these rule-based approaches had significant limitations.

The first "AI winter" occurred in the 1970s when funding dried up due to unmet expectations. Researchers had 
promised too much too soon, and the technology couldn't deliver on those promises. This period taught the AI 
community valuable lessons about managing expectations and the complexity of human intelligence.

The 1980s brought a resurgence with the popularity of expert systems in business applications. Companies 
invested heavily in AI systems for tasks like financial analysis and medical diagnosis. Japan's Fifth 
Generation Computer Project aimed to create intelligent computers by the 1990s. However, these systems were 
brittle and couldn't handle situations outside their programmed expertise.

The second AI winter in the late 1980s and early 1990s again saw reduced funding and interest. However, this 
period also saw important theoretical developments. Researchers began exploring neural networks more seriously, 
inspired by the structure of the human brain. Geoffrey Hinton, Yann LeCun, and others laid the groundwork for 
what would become deep learning.

The late 1990s and 2000s marked a turning point. The internet provided vast amounts of data, computational 
power increased exponentially following Moore's Law, and new algorithms made training neural networks more 
practical. IBM's Deep Blue defeating world chess champion Garry Kasparov in 1997 demonstrated AI's potential 
in specialized domains.

The real breakthrough came in the 2010s with deep learning. In 2012, AlexNet dramatically improved image 
recognition performance in the ImageNet competition. This success sparked renewed interest and investment in AI. 
Companies like Google, Facebook, and Microsoft established AI research labs and began integrating AI into their 
products.

The transformer architecture, introduced in 2017, revolutionized natural language processing. Models like BERT 
and GPT demonstrated unprecedented language understanding capabilities. These models could generate coherent 
text, answer questions, and even write code. The scale of these models grew rapidly, from millions to billions 
and now trillions of parameters.

Today, AI is ubiquitous. It powers recommendation systems, virtual assistants, autonomous vehicles, and 
medical diagnostic tools. The COVID-19 pandemic accelerated AI adoption in healthcare, with AI systems helping 
to develop vaccines and predict disease spread. However, this widespread deployment has also raised important 
ethical questions about bias, privacy, and the societal impact of automation.

Looking forward, the field faces both exciting opportunities and significant challenges. Researchers are 
working on making AI more explainable, robust, and aligned with human values. The goal of artificial general 
intelligence (AGI) remains elusive but continues to drive research. As we stand at this inflection point, it's 
clear that AI will play an increasingly central role in shaping our future, making it crucial that we develop 
and deploy these technologies responsibly.
"""

# Generate summary
print("Starting summarization process...")
final_summary = sliding_window_summarize(long_text, max_input_length=1024, max_output_length=150, stride=512)

# Display results
display_summary(long_text, final_summary)

## Advanced Usage: Customizing Parameters

In [None]:
# Example with different parameters for more detailed summaries
detailed_summary = sliding_window_summarize(
    long_text, 
    max_input_length=1024,    # Maximum tokens per chunk
    max_output_length=250,    # Longer summaries per chunk
    stride=256               # Smaller overlap: consecutive chunks share 256 tokens instead of 512
)

print("DETAILED SUMMARY:")
print("=" * 80)
print(textwrap.fill(detailed_summary, width=80))
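
When tuning these parameters it helps to know how many model calls a given overlap implies, since each chunk costs one summarization pass. The sketch below is a rough estimate under stated assumptions: `estimate_chunk_count` is a hypothetical helper, it ignores the special tokens the tokenizer adds, and it assumes windowing stops once a window reaches the end of the document.

```python
import math


def estimate_chunk_count(n_tokens, window=1024, overlap=512):
    """Estimate how many windows are needed to cover n_tokens (illustrative only)."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if n_tokens <= window:
        return 1  # the whole document fits in one window
    step = window - overlap
    # One window for the first `window` tokens, then one per `step` of the remainder
    return math.ceil((n_tokens - window) / step) + 1


# Compare the cost of different overlaps for a 10,000-token document
for overlap in (0, 256, 512, 768):
    print(f"overlap={overlap:>3}: {estimate_chunk_count(10_000, overlap=overlap)} chunks")
```

A larger overlap gives each chunk more shared context but increases the number of chunks, and therefore the runtime, roughly in proportion.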

## Process Multiple Documents

In [None]:
def batch_summarize_documents(documents, **kwargs):
    """Summarize multiple documents and return results."""
    summaries = {}
    
    for i, (doc_name, doc_text) in enumerate(documents.items()):
        print(f"\nProcessing document {i+1}/{len(documents)}: {doc_name}")
        summaries[doc_name] = sliding_window_summarize(doc_text, **kwargs)
    
    return summaries

# Example usage
documents = {
    "Document 1": long_text,
    # Add more documents here
}

all_summaries = batch_summarize_documents(documents, max_output_length=200)

# Display all summaries
for doc_name, summary in all_summaries.items():
    print(f"\n{doc_name} Summary:")
    print("-" * 40)
    print(textwrap.fill(summary, width=80))
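
With many documents, one bad input (an empty file, for example) would abort the whole loop above. The following sketch shows one possible way to isolate failures. The name `summarize_many` and the stand-in summarizer `first_sentence` are assumptions made for illustration; in the notebook you would pass `sliding_window_summarize` as `summarize_fn`.

```python
def summarize_many(documents, summarize_fn, **kwargs):
    """Summarize each document, collecting failures instead of stopping at the first one.

    Hypothetical variant of batch_summarize_documents: `summarize_fn` is any
    callable that takes a text (plus keyword arguments) and returns a summary.
    """
    summaries, failures = {}, {}
    for name, text in documents.items():
        if not text or not text.strip():
            failures[name] = "empty document"
            continue
        try:
            summaries[name] = summarize_fn(text, **kwargs)
        except Exception as exc:  # record the error and keep going
            failures[name] = f"{type(exc).__name__}: {exc}"
    return summaries, failures


def first_sentence(text, **kwargs):
    """Stand-in summarizer so the sketch runs without loading a model."""
    return text.strip().split(". ")[0].rstrip(".") + "."


docs = {"a": "Short note. More detail follows.", "b": "   "}
summaries, failures = summarize_many(docs, first_sentence)
print(summaries)
print(failures)
```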

## Saving Results

In [None]:
import json
from datetime import datetime

# Save summaries to file
def save_summaries(summaries, filename=None):
    """Save summaries to a JSON file."""
    if filename is None:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"summaries_{timestamp}.json"
    
    output = {
        "timestamp": datetime.now().isoformat(),
        "model": model_name,
        "summaries": summaries
    }
    
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(output, f, indent=2, ensure_ascii=False)
    
    print(f"Summaries saved to {filename}")

# Example
save_summaries({"example_document": final_summary})
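
To read saved results back, a small loader along these lines should suffice. It is a sketch that assumes the file has the layout written by `save_summaries` above (top-level `timestamp`, `model`, and `summaries` keys); `load_summaries` is a hypothetical name, and the payload values below are placeholder data used only to make the example self-contained.

```python
import json
import os
import tempfile


def load_summaries(filename):
    """Return the 'summaries' mapping from a JSON file in the save_summaries layout."""
    with open(filename, encoding="utf-8") as f:
        payload = json.load(f)
    return payload["summaries"]


# Placeholder payload in the same layout that save_summaries writes
payload = {
    "timestamp": "2024-01-01T00:00:00",
    "model": "facebook/bart-large-cnn",
    "summaries": {"example_document": "A short summary."},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "summaries_example.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)
    restored = load_summaries(path)

print(restored)
```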

## Performance Optimization Tips

In [None]:
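# For GPU acceleration (if available)
if torch.cuda.is_available():
    # Rebuild the pipeline on the GPU. Moving only the model would leave the
    # existing pipeline placing its inputs on the CPU, causing a device mismatch.
    summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, device=0)
    print("Using GPU for acceleration")
else:
    print("Using CPU")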

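# Batch processing for multiple chunks (more efficient)
def optimized_sliding_window_summarize(text, batch_size=2, max_input_length=1024,
                                       max_output_length=200, stride=512):
    """Summarize chunks in batches of `batch_size` and join the chunk summaries."""
    window = max_input_length - 2  # leave room for the <s> and </s> special tokens
    tokens = tokenizer.encode(text, truncation=False, add_special_tokens=False)
    step = max(1, window - stride)
    # Window starts; the last window is the first one that reaches the end of the text
    starts = range(0, max(len(tokens) - stride, 1), step)
    chunk_texts = [tokenizer.decode(tokens[i:i + window]) for i in starts]
    # Passing a list lets the pipeline run several chunks through the model together
    results = summarizer(chunk_texts, batch_size=batch_size, max_length=max_output_length,
                         min_length=50, do_sample=False, truncation=True)
    return ' '.join(result['summary_text'] for result in results)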

## Conclusion and Next Steps

This notebook demonstrates how to:
1. Handle long documents that exceed model token limits
2. Use a sliding window approach with overlapping chunks
3. Combine chunk summaries into coherent final summaries
4. Customize parameters for different use cases

### Potential Improvements:
- **Fine-tuning**: Train the model on domain-specific data for better summaries
- **Preprocessing**: Clean text (remove URLs, formatting, etc.) before summarization
- **Post-processing**: Apply additional NLP techniques to improve coherence
- **Alternative Models**: Try T5, Pegasus, or other summarization models
- **Evaluation**: Implement ROUGE scores to measure summary quality
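
As a starting point for the evaluation item above, here is a deliberately simplified sketch of ROUGE-1 (unigram overlap). It is not an official implementation: the function name `rouge1`, the lowercase word tokenization, and the absence of stemming are all assumptions, so its scores will not match published ROUGE numbers exactly. An established ROUGE package is the better choice for reported results.

```python
import re
from collections import Counter


def rouge1(reference, candidate):
    """Simplified ROUGE-1: unigram precision, recall, and F1 between two texts."""
    ref_counts = Counter(re.findall(r"\w+", reference.lower()))
    cand_counts = Counter(re.findall(r"\w+", candidate.lower()))
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print({name: round(value, 3) for name, value in scores.items()})
```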

### Resources:
- [Hugging Face Documentation](https://huggingface.co/docs/transformers)
- [BART Paper](https://arxiv.org/abs/1910.13461)
- [Summarization Task Guide](https://huggingface.co/tasks/summarization)