# Document Summarization with IDP Common Package

This notebook demonstrates how to use the document summarization capability of the IDP Common Package to generate concise summaries of documents. The summarization service uses LLMs to analyze document content and produce both brief and detailed summaries, along with a formatted markdown report.

**Key Benefits of Document Summarization:**
1. Quickly understand the main points of lengthy documents
2. Extract essential information without reading the entire document
3. Generate consistent summaries across different document types
4. Preserve important facts while reducing document length
5. Create shareable summary reports in markdown format

> **Note**: This notebook uses real AWS services including S3, Textract, and Bedrock. You need valid AWS credentials with appropriate permissions to run this notebook.

## 1. Install Dependencies

Let's install the IDP common package in development mode.

In [1]:
# First uninstall existing package (to ensure we get the latest version)
%pip uninstall -y idp_common

# Install the IDP common package with all components in development mode
%pip install -q -e "../lib/idp_common_pkg[all]"

# Check installed version
%pip show idp_common | grep -E "Version|Location"

Found existing installation: idp_common 0.2.19
Uninstalling idp_common-0.2.19:
  Successfully uninstalled idp_common-0.2.19
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Version: 0.2.19
Location: /home/ec2-user/miniconda/lib/python3.12/site-packages
Note: you may need to restart the kernel to use updated packages.
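
The `%pip show ... | grep` check above relies on a shell with `grep`. As a portable alternative, the following minimal sketch reads the installed version through the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical and not part of `idp_common`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version if idp_common is installed in this kernel, otherwise None
print(installed_version("idp_common"))
```

If this prints a stale version right after reinstalling, restart the kernel as the pip notes suggest.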


## 2. Import Libraries and Set Up Environment

In [2]:
import os
import json
import time
import boto3
import logging
import datetime

# Import base libraries
from idp_common.models import Document, Status
from idp_common import ocr, summarization

# Configure logging
logging.basicConfig(level=logging.WARNING)  # Set root logger to WARNING (less verbose)
logging.getLogger('idp_common.ocr.service').setLevel(logging.INFO)
logging.getLogger('idp_common.summarization.service').setLevel(logging.INFO)
logging.getLogger('textractor').setLevel(logging.WARNING)  # Suppress textractor logs

# Set environment variables
os.environ['METRIC_NAMESPACE'] = 'IDP-Notebook-Example'
os.environ['AWS_REGION'] = boto3.session.Session().region_name or 'us-east-1'
os.environ['CONFIGURATION_TABLE_NAME'] = 'mock-config-table'

# Get AWS account ID for unique bucket names
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
region = os.environ['AWS_REGION']

# Define sample PDF path 
SAMPLE_PDF_PATH = "../samples/rvl_cdip_package.pdf"

# Create unique bucket names based on account ID and region
input_bucket_name = f"idp-notebook-input-{account_id}-{region}"
output_bucket_name = f"idp-notebook-output-{account_id}-{region}"

# Helper function to parse S3 URIs
def parse_s3_uri(uri):
    parts = uri.replace("s3://", "").split("/")
    bucket = parts[0]
    key = "/".join(parts[1:])
    return bucket, key

# Helper function to load JSON from S3
# Note: s3_client is created in the next cell, so call this helper only after that cell has run
def load_json_from_s3(uri):
    bucket, key = parse_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')
    return json.loads(content)

print("Environment setup:")
print(f"METRIC_NAMESPACE: {os.environ.get('METRIC_NAMESPACE')}")
print(f"AWS_REGION: {os.environ.get('AWS_REGION')}")
print(f"Input bucket: {input_bucket_name}")
print(f"Output bucket: {output_bucket_name}")
print(f"SAMPLE_PDF_PATH: {SAMPLE_PDF_PATH}")

Environment setup:
METRIC_NAMESPACE: IDP-Notebook-Example
AWS_REGION: us-west-2
Input bucket: idp-notebook-input-912625584728-us-west-2
Output bucket: idp-notebook-output-912625584728-us-west-2
SAMPLE_PDF_PATH: ../samples/rvl_cdip_package.pdf
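
The `parse_s3_uri` helper above strips the scheme and splits on `/`. A slightly more defensive variant, sketched here with `urllib.parse.urlsplit`, also rejects strings that are not S3 URIs. The name `parse_s3_uri_strict` is hypothetical and not part of `idp_common`.

```python
from urllib.parse import urlsplit


def parse_s3_uri_strict(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key), validating the scheme."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    # The path keeps its leading slash, which is not part of the object key
    return parts.netloc, parts.path.lstrip("/")


bucket, key = parse_s3_uri_strict("s3://my-bucket/sample.pdf/pages/1/result.json")
print(bucket)
print(key)
```

One caveat of this sketch: `urlsplit` treats `?` and `#` as query and fragment delimiters, so keys containing those characters would need extra handling.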


## 3. Set Up S3 Buckets and Upload Sample File

In [3]:
# Create S3 client
s3_client = boto3.client('s3')

# Function to create a bucket if it doesn't exist
def ensure_bucket_exists(bucket_name):
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket {bucket_name} already exists")
    except Exception:
        try:
            if region == 'us-east-1':
                s3_client.create_bucket(Bucket=bucket_name)
            else:
                s3_client.create_bucket(
                    Bucket=bucket_name,
                    CreateBucketConfiguration={'LocationConstraint': region}
                )
            print(f"Created bucket: {bucket_name}")
            
            # Wait for bucket to be accessible
            waiter = s3_client.get_waiter('bucket_exists')
            waiter.wait(Bucket=bucket_name)
        except Exception as e:
            print(f"Error creating bucket {bucket_name}: {str(e)}")
            raise

# Ensure both buckets exist
ensure_bucket_exists(input_bucket_name)
ensure_bucket_exists(output_bucket_name)

# Upload the sample file to S3
sample_file_key = "sample-" + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + ".pdf"
try:
    with open(SAMPLE_PDF_PATH, 'rb') as file_data:
        s3_client.upload_fileobj(file_data, input_bucket_name, sample_file_key)
    print(f"Uploaded sample file to: s3://{input_bucket_name}/{sample_file_key}")
except FileNotFoundError:
    print(f"Sample file not found at {SAMPLE_PDF_PATH}. Please ensure the sample file exists.")
    # Use a default sample if available
    alt_sample_path = "../samples/sample.pdf"
    try:
        with open(alt_sample_path, 'rb') as file_data:
            s3_client.upload_fileobj(file_data, input_bucket_name, sample_file_key)
        print(f"Used alternative sample file: s3://{input_bucket_name}/{sample_file_key}")
    except FileNotFoundError:
        print("No sample files found. Please create a samples directory with PDF files.")

Bucket idp-notebook-input-912625584728-us-west-2 already exists
Bucket idp-notebook-output-912625584728-us-west-2 already exists
Uploaded sample file to: s3://idp-notebook-input-912625584728-us-west-2/sample-2025-04-19_21-52-32.pdf
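
The cell above derives bucket names from the account ID and region and stamps the object key with the current time. The sketch below isolates those two ideas so they can be checked without calling AWS. Both helper names are hypothetical, and the bucket-name pattern is a simplified approximation of the S3 naming rules (3 to 63 characters, lowercase letters, digits, dots, and hyphens), not a complete validator.

```python
import datetime
import re
from typing import Optional

# Simplified pattern: starts and ends with a letter or digit, 3-63 characters overall
BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Rough pre-check before attempting to create a bucket."""
    return bool(BUCKET_NAME_RE.match(name))


def make_sample_key(now: Optional[datetime.datetime] = None) -> str:
    """Build a timestamped object key in the same format the notebook uses."""
    now = now or datetime.datetime.now()
    return "sample-" + now.strftime("%Y-%m-%d_%H-%M-%S") + ".pdf"


print(looks_like_valid_bucket_name("idp-notebook-input-912625584728-us-west-2"))
print(make_sample_key(datetime.datetime(2025, 4, 19, 21, 52, 32)))
```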


## 4. Set Up Configuration

In [4]:
# Set up configuration for the summarization service - standard format
CONFIG = {
    "summarization": {
        "model": "us.amazon.nova-pro-v1:0",
        "temperature": 0,
        "top_k": 0.5,
        "system_prompt": "You are a document summarization expert who can analyze and summarize documents from various domains including medical, financial, legal, and general business documents. Your task is to create a summary that captures the key information, main points, and important details from the document. Your output must be in valid JSON format.",
        "task_prompt": "Analyze the provided document and create a comprehensive summary.\n\n<document>\n{DOCUMENT_TEXT}\n</document>\n\nCreate a summary that captures the essential information from the document. Your summary should:\n1. Extract key information, main points, and important details\n2. Maintain the original document's organizational structure where appropriate\n3. Preserve important facts, figures, dates, and entities\n4. Reduce the length while retaining all critical information\n5. The detailed summary should be approximately 1000 words in length\n\nReturn your response as valid JSON with the following structure:\n```json\n{\n  \"brief_summary\": \"A 1-2 sentence overview of what this document is about\",\n  \"detailed_summary\": \"A comprehensive summary following the summarization style\"\n}\n```"
    }
}


In [5]:
# Optionally, explore or test with different models
alternative_models = [
    "us.anthropic.claude-3-5-haiku-20241022-v1:0",
    "us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
]

## 5. Process the Document with OCR

In [6]:
# Initialize a new Document
document = Document(
    id="doc-for-summarization",
    input_bucket=input_bucket_name,
    input_key=sample_file_key,
    output_bucket=output_bucket_name,
    status=Status.QUEUED
)

print(f"Created document with ID: {document.id}")
print(f"Status: {document.status.value}")

# Create OCR service with Textract
# Valid features are 'LAYOUT', 'FORMS', 'SIGNATURES', 'TABLES' (uses analyze_document API)
# or leave it empty (to use basic detect_document_text API)
ocr_service = ocr.OcrService(
    region=region,
    enhanced_features=['LAYOUT']
)

# Process document with OCR
print("\nProcessing document with OCR...")
start_time = time.time()
document = ocr_service.process_document(document)
ocr_time = time.time() - start_time

print(f"OCR processing completed in {ocr_time:.2f} seconds")
print(f"Document status: {document.status.value}")
print(f"Number of pages processed: {document.num_pages}")

# Show pages information
print("\nProcessed pages:")
for page_id, page in document.pages.items():
    print(f"Page {page_id}:")
    print(f"  Image URI: {page.image_uri}")
    print(f"  Raw Text URI: {page.raw_text_uri}")
    print(f"  Parsed Text URI: {page.parsed_text_uri}")

INFO:idp_common.ocr.service:OCR Service initialized with features: ['LAYOUT']


Created document with ID: doc-for-summarization
Status: QUEUED

Processing document with OCR...


INFO:idp_common.ocr.service:Successfully extracted markdown text for page 5
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 10
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 2
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 3
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 7
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 9
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 4
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 6
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 8
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 1
INFO:idp_common.ocr.service:Sorting 10 pages by page number
INFO:idp_common.ocr.service:OCR processing completed in 5.32 seconds
INFO:idp_common.ocr.service:Processed 10 pages, with 0 errors


OCR processing completed in 5.32 seconds
Document status: OCR_COMPLETED
Number of pages processed: 10

Processed pages:
Page 1:
  Image URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-19_21-52-32.pdf/pages/1/image.jpg
  Raw Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-19_21-52-32.pdf/pages/1/rawText.json
  Parsed Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-19_21-52-32.pdf/pages/1/result.json
Page 2:
  Image URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-19_21-52-32.pdf/pages/2/image.jpg
  Raw Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-19_21-52-32.pdf/pages/2/rawText.json
  Parsed Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-19_21-52-32.pdf/pages/2/result.json
Page 3:
  Image URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-19_21-52-32.pdf/pages/3/image.jpg
  Raw Text URI: s3://idp-notebook-output-912625584728-us-w
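
The log above shows pages finishing out of order (5, 10, 2, ...) before a "Sorting 10 pages by page number" step, which suggests the pages are processed concurrently. The sketch below illustrates that general pattern with `concurrent.futures`. It is not the `OcrService` implementation: `fake_ocr_page` is a hypothetical stand-in for a per-page OCR call, and string page IDs are an assumption.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_ocr_page(page_id: str) -> str:
    """Hypothetical stand-in for a per-page OCR call with variable latency."""
    time.sleep(random.uniform(0.0, 0.05))
    return f"text of page {page_id}"


def process_pages(page_ids, max_workers=4):
    """Run pages concurrently, then return results ordered by page number."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fake_ocr_page, pid): pid for pid in page_ids}
        # as_completed yields futures in completion order, not submission order
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    # Sort numerically so that page "10" follows page "9" rather than page "1"
    return dict(sorted(results.items(), key=lambda item: int(item[0])))


pages = process_pages([str(n) for n in range(1, 11)])
print(list(pages))
```

Sorting with `int(...)` matters when page IDs are strings, because a plain string sort would place `"10"` before `"2"`.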

## 6. Summarize the Document

In [7]:
# Create summarization service with Bedrock backend
summarization_service = summarization.SummarizationService(
    config=CONFIG, 
    backend="bedrock"
)

# Summarize the document
print("\nSummarizing document...")
start_time = time.time()
document = summarization_service.process_document(document)
summarization_time = time.time() - start_time

print(f"Summarization completed in {summarization_time:.2f} seconds")


INFO:idp_common.summarization.service:Initialized summarization service with Bedrock backend using model us.amazon.nova-pro-v1:0



Summarizing document...


INFO:idp_common.summarization.service:Summarizing text with Bedrock
INFO:idp_common.summarization.service:Document summarized successfully. Summary report stored at: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-19_21-52-32.pdf/summary/summary.md


Summarization completed in 9.25 seconds
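
The task prompt asks the model to return JSON inside a fenced `json` block, so whatever consumes the response has to cope with the fence markers. How the summarization service handles this is not shown in the notebook. The following is a minimal sketch of one way to do it, with `extract_json` as a hypothetical helper.

```python
import json
import re

# Three backticks, built dynamically so this example nests cleanly inside markdown
FENCE = "`" * 3


def extract_json(response_text: str) -> dict:
    """Parse JSON from a model response, with or without a surrounding code fence."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response_text, flags=re.DOTALL)
    # Fall back to the whole response when no fence is present
    candidate = match.group(1) if match else response_text
    return json.loads(candidate.strip())


raw = "Here is the summary:\n" + FENCE + 'json\n{"brief_summary": "x"}\n' + FENCE
print(extract_json(raw))
```

`json.loads` still raises `json.JSONDecodeError` on malformed output, so callers may want to catch that and retry or fall back to the raw text.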


In [8]:
# Display the summary
print("\n" + "=" * 80)
print("DOCUMENT SUMMARY")
print("=" * 80)

# Access the summary content dynamically
summary = document.summarization_result.summary
print("\nAvailable fields / values in summary:")
for key in summary.keys():
    print("\n" + "=" * 40)
    print(f"{key}:")
    print("\n" + "=" * 40)
    print(summary.get(key, 'N/A'))

# Display summary report URI
print("\nSummary Report URI:")
print(document.summary_report_uri)

# Document status should already be updated by the service
print(f"\nDocument status: {document.status.value}")


DOCUMENT SUMMARY

Available fields / values in summary:

brief_summary:

The document contains various communications and reports related to the tobacco industry, including opposition to FDA regulation, market promotion strategies, lab testing procedures, and customer satisfaction surveys.

detailed_summary:

The document comprises several distinct sections related to the tobacco industry:

1. **Opposition to FDA Regulation**: A letter from Will E. Clark, General Manager of the Western Dark Fired Tobacco Growers' Association, dated October 31, 1995, expresses strong opposition to the 'Commitment to our Children' petition. The association, representing 9,000 tobacco producers, argues against FDA regulation of tobacco, citing existing age restriction laws and the potential negative impact on family farms and adult consumer rights.

2. **Market Promotion Strategies**: A memorandum from Mel Fallis dated April 3, 1987, outlines a project development timetable for the Situation Analysis and

In [9]:
# Display rendered markdown summary
from IPython.display import Markdown, display

# Get the summary report content
if document.summary_report_uri:
    bucket, key = parse_s3_uri(document.summary_report_uri)
    try:
        response = s3_client.get_object(Bucket=bucket, Key=key)
        report_content = response['Body'].read().decode('utf-8')
        # Display the markdown content
        display(Markdown(report_content))
    except Exception as e:
        print(f"Error retrieving summary report: {str(e)}")

# Document Summary: doc-for-summarization

## Brief Summary
The document contains various communications and reports related to the tobacco industry, including opposition to FDA regulation, market promotion strategies, lab testing procedures, and customer satisfaction surveys.

## Detailed Summary
The document comprises several distinct sections related to the tobacco industry:

1. **Opposition to FDA Regulation**: A letter from Will E. Clark, General Manager of the Western Dark Fired Tobacco Growers' Association, dated October 31, 1995, expresses strong opposition to the 'Commitment to our Children' petition. The association, representing 9,000 tobacco producers, argues against FDA regulation of tobacco, citing existing age restriction laws and the potential negative impact on family farms and adult consumer rights.

2. **Market Promotion Strategies**: A memorandum from Mel Fallis dated April 3, 1987, outlines a project development timetable for the Situation Analysis and Campaign Development phases for Benson & Hedges and Virginia Slims. The focus is on gathering industry, category, consumer dynamics, and promotion activities information.

3. **Lab Services Consistency Report**: A report dated February 28, 1993, details lab services consistency, including sample weights, dilution factors, and consistency percentages for various samples.

4. **Tobacco Industry Updates**: Notes from April 20, 1998, summarize recent events in the tobacco industry, including defeated smoking bans in Falmouth, MA, and tabled tobacco retailing ordinances in Waseca and Wadena Counties, MN.

5. **Mutation Assay Report**: A detailed report on a mutation assay conducted on Algoral 40 LF, describing the objectives, methods, and results of the test on Chinese Hamster Ovary (CHO) cells.

6. **Invoice from Peake Printers**: An invoice dated November 12, 1992, from Peake Printers to The Tobacco Institute for the production of 5,000 two-sided decals.

7. **FTC Cigarette Testing Announcement**: A news release from August 1, 1967, announcing the Federal Trade Commission's (FTC) commencement of formal cigarette testing under specified conditions, including butt length, random selection, and reporting of tar and nicotine content.

8. **Customer Satisfaction Survey**: A survey form assessing customer satisfaction with R. J. Reynolds' customer service, including questions on representative courtesy, knowledge, and overall satisfaction with the brand.

9. **Biographical Sketches**: Brief biographical sketches of key personnel, including Mario Stevenson, an Assistant Professor with a background in Biochemistry, and Moshe Kalina, an Associate Professor specializing in cytochemistry and the surfactant system.

Each section provides specific insights into different aspects of the tobacco industry, from regulatory opposition and market strategies to laboratory testing and customer feedback.

## Metadata
| Key | Value |
| --- | --- |


Execution time: 8.94 seconds

## 7. Test Serialization and Alternative Summary Formats

In [10]:
# Create a simple example to demonstrate the Document model serialization
print("\n" + "=" * 80)
print("DOCUMENT SERIALIZATION EXAMPLE")
print("=" * 80)

# Convert document to dictionary
doc_dict = document.to_dict()

# Check that summary_report_uri is included
print(f"summary_report_uri in document dict: {'summary_report_uri' in doc_dict}")

# Show the summary fields in the dictionary
print("\nSummary report URI in document dictionary:")
print(f"  - summary_report_uri: {doc_dict.get('summary_report_uri')}")

# Create a new document from the dictionary
new_document = Document.from_dict(doc_dict)

# Verify that fields were preserved
print("\nVerifying fields were preserved after serialization:")
print(f"  - summary_report_uri matches: {new_document.summary_report_uri == document.summary_report_uri}")

# The summarization_result object is not included in serialization
print("\nNote: The summarization_result object is not included in the serialization")
print(f"  - Original document has summarization_result: {document.summarization_result is not None}")
print(f"  - New document from dict has summarization_result: {new_document.summarization_result is not None}")


DOCUMENT SERIALIZATION EXAMPLE
summary_report_uri in document dict: True

Summary report URI in document dictionary:
  - summary_report_uri: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-19_21-52-32.pdf/summary/summary.md

Verifying fields were preserved after serialization:
  - summary_report_uri matches: True

Note: The summarization_result object is not included in the serialization
  - Original document has summarization_result: True
  - New document from dict has summarization_result: False
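
The output above shows that `summary_report_uri` survives a `to_dict` / `from_dict` round trip while `summarization_result` does not. One common way to get that behavior is to mark in-memory-only fields and skip them during serialization. The sketch below illustrates the idea with a hypothetical `MiniDocument` dataclass. It is not the `idp_common` `Document` model, whose internals are not shown here.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class MiniDocument:
    """Hypothetical, simplified stand-in for a document model."""
    id: str
    summary_report_uri: Optional[str] = None
    # Marked transient: kept in memory only and skipped by to_dict
    summarization_result: Optional[Any] = field(
        default=None, metadata={"transient": True}
    )

    def to_dict(self) -> dict:
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if not f.metadata.get("transient")
        }

    @classmethod
    def from_dict(cls, data: dict) -> "MiniDocument":
        known = {f.name for f in fields(cls) if not f.metadata.get("transient")}
        return cls(**{k: v for k, v in data.items() if k in known})


doc = MiniDocument("doc-1", "s3://bucket/summary/summary.md", {"brief_summary": "x"})
restored = MiniDocument.from_dict(doc.to_dict())
print(restored.summary_report_uri == doc.summary_report_uri)
print(restored.summarization_result is not None)
```

After a round trip, the full summary has to be reloaded from the report URI rather than read from the restored object.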


In [11]:
# Test with a different format configuration
print("\n" + "=" * 80)
print("TESTING FLEXIBLE OUTPUT STRUCTURE")
print("=" * 80)

# Alternative configuration with flexible format
FLEXIBLE_CONFIG = {
    "summarization": {
        "model": "us.amazon.nova-pro-v1:0",
        "temperature": 0,
        "top_k": 0.5,
        "system_prompt": "You are a document summarization expert who can analyze and summarize documents from various domains including medical, financial, legal, and general business documents. Your task is to create a structured summary that captures the key information, main points, and important details from the document. Your output must be in valid JSON format.",
        "task_prompt": "Analyze the provided document and create a structured summary.\n\n<document>\n{DOCUMENT_TEXT}\n</document>\n\nCreate a summary that captures the essential information from the document. Your summary should be structured and comprehensive.\n\nReturn your response as valid JSON with the following structure:\n```json\n{\n  \"overview\": \"A 1-2 sentence overview of what this document is about\",\n  \"key_points\": [\"List of 3-5 key points as bullet points\"],\n  \"sections\": {\n    \"section1_name\": \"Summary of section 1\",\n    \"section2_name\": \"Summary of section 2\"\n  },\n  \"entities\": [\"List of important entities (people, organizations, dates, etc.)\"],\n  \"conclusion\": \"A brief conclusion statement\"\n}\n```"
    }
}

# Create summarization service with Bedrock backend (set explicitly, as in the earlier cell)
summarization_service = summarization.SummarizationService(
    config=FLEXIBLE_CONFIG,
    backend="bedrock"
)

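# Summarize the document again with the flexible configuration.
# Note: the same Document object is passed in, and the new report is written to the
# same S3 key as before (see the log output), replacing the standard summary report.
print("\nSummarizing document...")
start_time = time.time()
flexible_document = summarization_service.process_document(document)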
summarization_time = time.time() - start_time

print(f"Summarization completed in {summarization_time:.2f} seconds")

# Show available fields in the flexible summary
summary = flexible_document.summarization_result.summary
print("\nAvailable fields in flexible summary:")
for key in summary.keys():
    print(f"  - {key}")

# Show the summary report URI
print("\nSummary Report URI:")
print(flexible_document.summary_report_uri)

# Fetch and prepare the markdown content for rendering
flexible_markdown_content = "No flexible summary report available"
if flexible_document.summary_report_uri:
    bucket, key = parse_s3_uri(flexible_document.summary_report_uri)
    try:
        response = s3_client.get_object(Bucket=bucket, Key=key)
        flexible_markdown_content = response['Body'].read().decode('utf-8')
        print("\nSuccessfully retrieved flexible summary report for rendering.")
    except Exception as e:
        print(f"Error retrieving flexible summary report: {str(e)}")

INFO:idp_common.summarization.service:Initialized summarization service with Bedrock backend using model us.amazon.nova-pro-v1:0



TESTING FLEXIBLE OUTPUT STRUCTURE

Summarizing document...


INFO:idp_common.summarization.service:Summarizing text with Bedrock
INFO:idp_common.summarization.service:Document summarized successfully. Summary report stored at: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-19_21-52-32.pdf/summary/summary.md


Summarization completed in 5.71 seconds

Available fields in flexible summary:
  - overview
  - key_points
  - sections
  - entities
  - conclusion

Summary Report URI:
s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-19_21-52-32.pdf/summary/summary.md

Successfully retrieved flexible summary report for rendering.
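
Both configurations place a `{DOCUMENT_TEXT}` placeholder in `task_prompt` next to literal JSON braces. How the service fills the placeholder is not shown in this notebook. The sketch below assumes plain string replacement and shows why `str.format` would not work on such a template. `fill_prompt` and the shortened `TEMPLATE` are hypothetical.

```python
# Shortened, hypothetical template in the same style as task_prompt
TEMPLATE = (
    "<document>\n{DOCUMENT_TEXT}\n</document>\n\n"
    'Return JSON like {"overview": "..."}'
)


def fill_prompt(template: str, document_text: str) -> str:
    """Substitute only the named placeholder and leave other braces untouched."""
    return template.replace("{DOCUMENT_TEXT}", document_text)


prompt = fill_prompt(TEMPLATE, "Page one text")
print(prompt)

# str.format treats the JSON braces as replacement fields and raises an error
try:
    TEMPLATE.format(DOCUMENT_TEXT="Page one text")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True
print(format_failed)
```

This is why prompt templates that embed JSON examples are usually filled with targeted replacement rather than general-purpose formatting.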


In [12]:
# Display rendered flexible markdown summary
from IPython.display import Markdown, display

print("\n" + "=" * 80)
print("RENDERED FLEXIBLE SUMMARY REPORT")
print("=" * 80 + "\n")

# Display the markdown content
display(Markdown(flexible_markdown_content))


RENDERED FLEXIBLE SUMMARY REPORT



# Document Summary: doc-for-summarization

## Overview
The document contains various sections related to the tobacco industry, including opposition to FDA regulation, marketing strategies, lab reports, and biographical sketches of researchers.

## Key Points
- Opposition to FDA regulation of tobacco by the Western Dark Fired Tobacco Growers' Association.
- Marketing strategies and project development timetables for tobacco brands.
- Lab reports and testing procedures for tobacco products.
- Biographical sketches of key personnel involved in tobacco-related research.

## Sections
### Opposition To Fda Regulation
A letter from the Western Dark Fired Tobacco Growers' Association expressing strong opposition to FDA regulation of tobacco, arguing it would create bureaucracy and harm family farms.

### Marketing Strategies
A memorandum outlining the proposed project development timetable for the Situation Analysis and Campaign Development phases for Benson & Hedges and Virginia Slims.

### Lab Reports
Various lab reports and testing procedures, including consistency reports, mutation assays, and cigarette testing protocols by the Federal Trade Commission.

### Biographical Sketches
Biographical sketches of key personnel, including Mario Stevenson and Moshe Kalina, detailing their education, professional experience, and publications.


## Entities
- Western Dark Fired Tobacco Growers' Association
- FDA
- Benson & Hedges
- Virginia Slims
- Federal Trade Commission
- Mario Stevenson
- Moshe Kalina

## Conclusion
The document provides a comprehensive view of various aspects of the tobacco industry, including regulatory opposition, marketing strategies, laboratory testing, and biographical information of key researchers.

## Metadata
| Key | Value |
| --- | --- |


Execution time: 5.57 seconds
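
The report above turns each top-level key into a `##` heading, lists into bullets, and nested objects into `###` subsections. The heading `Opposition To Fda Regulation` looks like a title-cased snake_case key. The sketch below re-creates that mapping for an arbitrary summary dictionary. It is a hypothetical illustration, not the library's report generator, and it omits the metadata table and execution time.

```python
def title_from_key(key: str) -> str:
    """Turn a snake_case key into a title-cased heading."""
    return key.replace("_", " ").title()


def render_summary_markdown(doc_id: str, summary: dict) -> str:
    """Render a summary dict as markdown: strings, lists, and one level of nesting."""
    lines = [f"# Document Summary: {doc_id}", ""]
    for key, value in summary.items():
        lines.append(f"## {title_from_key(key)}")
        if isinstance(value, list):
            lines.extend(f"- {item}" for item in value)
        elif isinstance(value, dict):
            for sub_key, sub_value in value.items():
                lines.append(f"### {title_from_key(sub_key)}")
                lines.append(str(sub_value))
                lines.append("")
        else:
            lines.append(str(value))
        lines.append("")
    return "\n".join(lines)


example = {
    "overview": "A short overview.",
    "key_points": ["first point", "second point"],
    "sections": {"opposition_to_fda_regulation": "Summary of the letter."},
}
md = render_summary_markdown("doc-1", example)
print(md)
```

Because the renderer walks whatever keys are present, the same function handles both the standard and the flexible summary structures.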

In [13]:
# Conclusion (no cleanup is performed; the S3 buckets and objects created above remain)
print("\n" + "=" * 80)
print("CONCLUSION")
print("=" * 80)

print("""
This notebook has demonstrated the flexible document summarization capabilities of the IDP Common Package:

1. **Standard Summary Format** - Using the default structure with brief and detailed summaries
2. **Flexible Summary Format** - Using a more complex structure with sections and entities

Key benefits of the approach:

- Adapt to any summary structure based on document requirements
- Configure prompts to generate exactly the fields you need
- View rendered markdown reports for better readability
- Store and retrieve summaries with consistent interfaces
- Works with any LLM supported by the backend

Learn more in the documentation: /lib/idp_common_pkg/idp_common/summarization/README.md
""")



CONCLUSION

This notebook has demonstrated the flexible document summarization capabilities of the IDP Common Package:

1. **Standard Summary Format** - Using the default structure with brief and detailed summaries
2. **Flexible Summary Format** - Using a more complex structure with sections and entities

Key benefits of the approach:

- Adapt to any summary structure based on document requirements
- Configure prompts to generate exactly the fields you need
- View rendered markdown reports for better readability
- Store and retrieve summaries with consistent interfaces
- Works with any LLM supported by the backend

Learn more in the documentation: /lib/idp_common_pkg/idp_common/summarization/README.md

