# **Demo: Text Summarizer with Error Handling**

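This notebook uses an open-source Hugging Face model (`facebook/bart-large-cnn`) to summarize the content of a PDF document. The extracted text is split into 500-token chunks so that long documents can be summarized piece by piece, and an error on any single chunk is reported without stopping the run.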
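
The long-text handling described above comes down to cutting a token sequence into fixed-size windows. Here is a minimal sketch of that idea, using a plain list of integers in place of real tokenizer output. The helper name `chunk_token_ids` is hypothetical and is not part of the notebook or of `transformers`.

```python
def chunk_token_ids(token_ids, max_tokens=500):
    """Split a list of token ids into consecutive windows of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [token_ids[i:i + max_tokens] for i in range(0, len(token_ids), max_tokens)]


# Stand-in for tokenizer output: 1,200 fake token ids.
fake_ids = list(range(1200))
windows = chunk_token_ids(fake_ids, max_tokens=500)
print([len(w) for w in windows])  # [500, 500, 200]
```

Only the last window can be shorter than `max_tokens`, which is why the final chunk of a document is often much smaller than the others.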

In [2]:
!pip install transformers PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [3]:
# Import necessary libraries
from PyPDF2 import PdfReader
from transformers import AutoTokenizer, pipeline

# Step 1: Read the PDF and Extract Text
pdf_path = 'arxiv_impact_of_GENAI.pdf'  # Path to your PDF file
reader = PdfReader(pdf_path)

# Extract text from all pages (fall back to '' if a page yields no text)
pdf_text = ''
for page in reader.pages:
    pdf_text += page.extract_text() or ''

print(f'Extracted {len(pdf_text.split())} words from the PDF.')

# Step 2: Summarize the Text using Hugging Face Model with Tokenizer-Based Chunking
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
summarizer = pipeline('summarization', model='facebook/bart-large-cnn')

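# Function to chunk text based on tokenization
def tokenize_and_chunk(text, tokenizer, max_tokens=500):
    # Encode without special tokens so they do not count toward the chunk size
    tokens = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    # Decode each window of at most max_tokens tokens back into text
    for i in range(0, len(tokens), max_tokens):
        chunk = tokenizer.decode(tokens[i:i + max_tokens], skip_special_tokens=True)
        chunks.append(chunk)
    return chunks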

# Tokenize and chunk the PDF text
chunks = tokenize_and_chunk(pdf_text, tokenizer, max_tokens=500)

# Summarize each chunk with error handling
summaries = []
for i, chunk in enumerate(chunks):
    try:
        if chunk.strip():  # Skip empty chunks
            summary = summarizer(chunk, max_length=130, min_length=30, do_sample=False)
            summaries.append(summary[0]['summary_text'])
    except Exception as e:
        print(f'Error summarizing chunk {i}: {e}')

# Combine summaries
final_summary = ' '.join(summaries)

print("Final Summary:")
print(final_summary)



Extracted 6516 words from the PDF.


Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Your max_length is set to 130, but your input_length is only 63. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=31)


Final Summary:
The rise of generative artificial intelligence (AI) has sparked concerns about its potential influence on unemployment and market depression. This study addresses this concern by ex-amining the impact of Generative AI on product markets. While generativeAI lowers average prices, it substantially boosts order volume and overall revenue. In response to generative AI, artists have staged massprotests with the slogan ”NO to Ai generated images,” a sentiment depicted in Figure 1. Elon Musk has appealed to the AI community to pause the development of such technology. The challenge of assessing the impact of generativeAI lies in achieving causal inference. Hackers attacked Novelai and leaked the AI model and training codes used by Novelai. The post-leak landscape is intriguing. While the ”tachie’ market experiences a dip in price, it experiences a surge in volume and turnover. We collected 197,110 records from a major Chinese paint-consuming outsourcing platform from January 20
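
The `max_length` warning in the output above appears because one chunk (63 tokens) is shorter than the fixed `max_length=130`. One possible way to avoid it is to derive both limits from each chunk's token count before calling the summarizer. The helper below is a hypothetical sketch and not part of the notebook. The name `pick_summary_lengths` and the 0.5 ratio are assumptions chosen for illustration.

```python
def pick_summary_lengths(n_input_tokens, max_cap=130, min_floor=30, ratio=0.5):
    """Return (max_length, min_length) scaled to the size of one chunk.

    ratio is an assumed target compression factor, not a library default.
    """
    # Never ask for a summary longer than the cap or than ~ratio of the input.
    max_length = max(1, min(max_cap, int(n_input_tokens * ratio)))
    # Keep min_length strictly below max_length so both limits can be met.
    min_length = max(0, min(min_floor, max_length - 1))
    return max_length, min_length


print(pick_summary_lengths(500))  # (130, 30)
print(pick_summary_lengths(63))   # (31, 30)
```

In the notebook's loop, the token count could come from `len(tokenizer.encode(chunk))`, and the two returned values would be passed as `max_length` and `min_length`.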

In [5]:
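summary_file_path = 'Summary.txt'
with open(summary_file_path, 'w', encoding='utf-8') as file:
    for sentence in final_summary.split('. '):  # Naive split on sentence boundaries
        sentence = sentence.strip().rstrip('.')  # Drop a trailing period so it is not doubled
        if sentence:  # Skip empty pieces
            file.write(sentence + '.\n')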

print(f"Final Summary saved to {summary_file_path}.")


Final Summary saved to Summary.txt.


### **Conclusion**

This notebook demonstrates how to read a PDF file and summarize its contents using an open-source Hugging Face model, with tokenizer-based chunking for long inputs and per-chunk error handling.
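
The per-chunk error handling can also be exercised without loading a model. The sketch below assumes a hypothetical `summarize_all` helper and a stand-in summarizer function. It mirrors the notebook's try/except loop but also records which chunks failed, so skipped content is visible rather than silently missing.

```python
def summarize_all(chunks, summarize_fn):
    """Apply summarize_fn to each non-empty chunk; collect results and failures."""
    summaries, failures = [], []
    for i, chunk in enumerate(chunks):
        if not chunk.strip():
            continue  # Skip empty chunks, as the notebook does.
        try:
            summaries.append(summarize_fn(chunk))
        except Exception as exc:  # Broad on purpose: one bad chunk should not stop the run.
            failures.append((i, str(exc)))
    return summaries, failures


def fake_summarizer(text):
    """Stand-in for the real pipeline: returns the first three words, fails on 'BAD'."""
    if "BAD" in text:
        raise ValueError("cannot summarize this chunk")
    return " ".join(text.split()[:3])


summaries, failures = summarize_all(
    ["one two three four", "   ", "BAD input here", "five six"], fake_summarizer
)
print(summaries)  # ['one two three', 'five six']
print(failures)   # [(2, 'cannot summarize this chunk')]
```

With the real pipeline, `summarize_fn` could be a small wrapper such as `lambda c: summarizer(c, max_length=130, min_length=30, do_sample=False)[0]['summary_text']`.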