# Document Summarization with IDP Common Package

This notebook demonstrates how to use the document summarization capability of the IDP Common Package to generate concise summaries of documents. The summarization service uses LLMs to analyze document content and produce both brief and detailed summaries, along with a formatted markdown report.

**Key Benefits of Document Summarization:**
1. Quickly understand the main points of lengthy documents
2. Extract essential information without reading the entire document
3. Generate consistent summaries across different document types
4. Preserve important facts while reducing document length
5. Create shareable summary reports in markdown format

> **Note**: This notebook uses real AWS services including S3, Textract, and Bedrock. You need valid AWS credentials with appropriate permissions to run this notebook.

## 1. Install Dependencies

Let's install the IDP common package in development mode.

In [13]:
# Let's make sure that modules are autoreloaded
%load_ext autoreload
%autoreload 2

# First uninstall existing package (to ensure we get the latest version)
%pip uninstall -y idp_common

# Install the IDP common package with all components in development mode
%pip install -q -e "../lib/idp_common_pkg[dev, all]"

# Check installed version
%pip show idp_common | grep -E "Version|Location"

# Optionally use a .env file for environment variables
try:
    from dotenv import load_dotenv
    load_dotenv()  
except ImportError:
    pass  

Note: you may need to restart the kernel to use updated packages.
Version: 0.2.21
Location: /home/ec2-user/miniconda/lib/python3.12/site-packages
Note: you may need to restart the kernel to use updated packages.


## 2. Import Libraries and Set Up Environment

In [14]:
import os
import json
import time
import boto3
import logging
import datetime

# Import base libraries
from idp_common.models import Document, Status
from idp_common import ocr, summarization

# Configure logging
logging.basicConfig(level=logging.WARNING)  # Set root logger to WARNING (less verbose)
logging.getLogger('idp_common.ocr.service').setLevel(logging.INFO)
logging.getLogger('idp_common.summarization.service').setLevel(logging.INFO)
logging.getLogger('textractor').setLevel(logging.WARNING)  # Suppress textractor logs
logging.getLogger('idp_common.bedrock.client').setLevel(logging.DEBUG)  # show prompts

# Set environment variables
os.environ['METRIC_NAMESPACE'] = 'IDP-Notebook-Example'
os.environ['AWS_REGION'] = boto3.session.Session().region_name or 'us-east-1'
os.environ['CONFIGURATION_TABLE_NAME'] = 'mock-config-table'

# Get AWS account ID for unique bucket names
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
region = os.environ['AWS_REGION']

# Define sample PDF path 
SAMPLE_PDF_PATH = "../samples/rvl_cdip_package.pdf"

# Create unique bucket names based on account ID and region
input_bucket_name =  os.getenv("IDP_INPUT_BUCKET_NAME", f"idp-notebook-input-{account_id}-{region}")
output_bucket_name = os.getenv("IDP_OUTPUT_BUCKET_NAME", f"idp-notebook-output-{account_id}-{region}")

# Helper function to parse S3 URIs of the form s3://bucket/key
def parse_s3_uri(uri):
    # Strip the scheme once, then split on the first "/" only
    bucket, _, key = uri.removeprefix("s3://").partition("/")
    return bucket, key

# Helper function to load JSON from S3
# (relies on the global s3_client, which is created in the next section)
def load_json_from_s3(uri):
    bucket, key = parse_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')
    return json.loads(content)

print("Environment setup:")
print(f"METRIC_NAMESPACE: {os.environ.get('METRIC_NAMESPACE')}")
print(f"AWS_REGION: {os.environ.get('AWS_REGION')}")
print(f"Input bucket: {input_bucket_name}")
print(f"Output bucket: {output_bucket_name}")
print(f"SAMPLE_PDF_PATH: {SAMPLE_PDF_PATH}")

Environment setup:
METRIC_NAMESPACE: IDP-Notebook-Example
AWS_REGION: us-west-2
Input bucket: idp-notebook-input-912625584728-us-west-2
Output bucket: idp-notebook-output-912625584728-us-west-2
SAMPLE_PDF_PATH: ../samples/rvl_cdip_package.pdf
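
The bucket names above are derived from the account ID and region, but a name supplied through `IDP_INPUT_BUCKET_NAME` or `IDP_OUTPUT_BUCKET_NAME` could be anything. The sketch below checks a name against a subset of the S3 naming rules for general purpose buckets (3 to 63 characters, lowercase letters, digits, dots and hyphens, starting and ending with a letter or digit, no adjacent dots, not shaped like an IPv4 address). The helper name `is_valid_bucket_name` is hypothetical and the account ID is a placeholder.

```python
import re

# Subset of the S3 naming rules for general purpose buckets.
# The helper name is hypothetical and not part of idp_common.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")
_IP_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes basic S3 bucket naming checks."""
    if not _BUCKET_RE.match(name):
        return False  # wrong length, characters, or first/last character
    if ".." in name:
        return False  # adjacent periods are not allowed
    if _IP_RE.match(name):
        return False  # must not look like an IPv4 address
    return True


# Placeholder account ID, not a real account
print(is_valid_bucket_name("idp-notebook-input-123456789012-us-west-2"))  # True
print(is_valid_bucket_name("IDP_Notebook_Input"))  # False
```

Running a check like this before `create_bucket` gives a clearer error than the one returned by the S3 API.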


## 3. Set Up S3 Buckets and Upload Sample File

In [15]:
# Create S3 client
s3_client = boto3.client('s3')

# Function to create a bucket if it doesn't exist
def ensure_bucket_exists(bucket_name):
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket {bucket_name} already exists")
    except Exception:
        try:
            if region == 'us-east-1':
                s3_client.create_bucket(Bucket=bucket_name)
            else:
                s3_client.create_bucket(
                    Bucket=bucket_name,
                    CreateBucketConfiguration={'LocationConstraint': region}
                )
            print(f"Created bucket: {bucket_name}")
            
            # Wait for bucket to be accessible
            waiter = s3_client.get_waiter('bucket_exists')
            waiter.wait(Bucket=bucket_name)
        except Exception as e:
            print(f"Error creating bucket {bucket_name}: {str(e)}")
            raise

# Ensure both buckets exist
ensure_bucket_exists(input_bucket_name)
ensure_bucket_exists(output_bucket_name)

# Upload the sample file to S3
sample_file_key = "sample-" + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + ".pdf"
try:
    with open(SAMPLE_PDF_PATH, 'rb') as file_data:
        s3_client.upload_fileobj(file_data, input_bucket_name, sample_file_key)
    print(f"Uploaded sample file to: s3://{input_bucket_name}/{sample_file_key}")
except FileNotFoundError:
    print(f"Sample file not found at {SAMPLE_PDF_PATH}. Please ensure the sample file exists.")
    # Use a default sample if available
    alt_sample_path = "../samples/sample.pdf"
    try:
        with open(alt_sample_path, 'rb') as file_data:
            s3_client.upload_fileobj(file_data, input_bucket_name, sample_file_key)
        print(f"Used alternative sample file: s3://{input_bucket_name}/{sample_file_key}")
    except FileNotFoundError:
        print("No sample files found. Please create a samples directory with PDF files.")

Bucket idp-notebook-input-912625584728-us-west-2 already exists
Bucket idp-notebook-output-912625584728-us-west-2 already exists
Uploaded sample file to: s3://idp-notebook-input-912625584728-us-west-2/sample-2025-05-22_17-58-07.pdf


## 4. Set Up Configuration

In [16]:
# Set up configuration for the summarization service - standard format
CONFIG = {
    "summarization": {
        "model": "us.amazon.nova-pro-v1:0",
        "temperature": 0,
        "top_k": 5,
        "system_prompt": "You are a document summarization expert who can analyze and summarize documents from various domains including medical, financial, legal, and general business documents. Your task is to create a summary that captures the key information, main points, and important details from the document. Your output must be in valid JSON format.\n\nSummarization Style: Balanced\nCreate a balanced summary that provides a moderate level of detail. Include the main points and key supporting information, while maintaining the document's overall structure. Aim for a comprehensive yet concise summary.",
        "task_prompt": """
            Analyze the provided document and create a comprehensive summary.

            CRITICAL INSTRUCTION: You MUST return your response as valid JSON with the EXACT structure shown at the end of these instructions. Do not include any explanations, notes, or text outside of the JSON structure.

            Create a summary that captures the essential information from the document. Your summary should:
            1. Extract key information, main points, and important details
            2. Maintain the original document's organizational structure where appropriate
            3. Preserve important facts, figures, dates, and entities
            4. Reduce the length while retaining all critical information
            5. Use markdown formatting for better readability (headings, lists, emphasis, etc.)
            6. Cite all relevant facts from the source document using the format [Cite-X, Page-Y] where X is a sequential citation number and Y is the page number
            7. For each citation, include a hover-enabled reference using HTML span tags with title attributes that contain the exact text snippet from which the fact is derived
            Example: <span title=\"Original text from document: The company reported $10M in revenue\" class=\"citation\">[Cite-1, Page-3]</span>

            Output Format:
            You MUST return ONLY valid JSON with the following structure and nothing else:
            ```json
            {
            \"summary\": \"A comprehensive summary in markdown format with citations and hover functionality\"
            }
            ```

            Do not include any text, explanations, or notes outside of this JSON structure. The JSON must be properly formatted and parseable.

            <<CACHEPOINT>>

            <document>
            {DOCUMENT_TEXT}
            </document>
        """
    }
}
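
The task prompt mixes literal JSON braces with the `{DOCUMENT_TEXT}` placeholder. This notebook does not show how the package fills the placeholder, so the sketch below simply assumes plain string replacement. It illustrates why `str.format()` would be a poor fit for a prompt like this one. The shortened prompt and the sample text are stand-ins.

```python
# A shortened stand-in for the task prompt: literal JSON braces plus
# the {DOCUMENT_TEXT} placeholder, as in the CONFIG above.
task_prompt = """
Return ONLY valid JSON:
{
"summary": "..."
}

<document>
{DOCUMENT_TEXT}
</document>
"""

document_text = "Invoice No: 86239"

# str.format() treats every brace pair as a replacement field, so the
# literal JSON braces make it raise.
try:
    task_prompt.format(DOCUMENT_TEXT=document_text)
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

# Plain replacement (assumed here) only touches the placeholder itself.
filled_prompt = task_prompt.replace("{DOCUMENT_TEXT}", document_text)

print(f"str.format failed: {format_failed}")
print(filled_prompt)
```

If you write your own prompts, keeping the placeholder spelled exactly as `{DOCUMENT_TEXT}` avoids any need to escape the JSON braces.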


In [17]:
# Optionally, explore or test with different models
alternative_models = [
    "us.anthropic.claude-3-5-haiku-20241022-v1:0",
    "us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
]
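
To try one of the alternative models, only the `model` value in the configuration needs to change. The sketch below builds one configuration per model from a deep copy, so the base configuration stays untouched. `base_config` is a trimmed stand-in for `CONFIG`, and `config_for_model` is a hypothetical helper.

```python
import copy

# Trimmed stand-in for the notebook's CONFIG; only the keys used here.
base_config = {
    "summarization": {
        "model": "us.amazon.nova-pro-v1:0",
        "temperature": 0,
    }
}

alternative_models = [
    "us.anthropic.claude-3-5-haiku-20241022-v1:0",
    "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
]


def config_for_model(config: dict, model_id: str) -> dict:
    """Return a deep copy of `config` with the summarization model swapped."""
    variant = copy.deepcopy(config)
    variant["summarization"]["model"] = model_id
    return variant


variants = {m: config_for_model(base_config, m) for m in alternative_models}
for model_id, cfg in variants.items():
    print(model_id, "->", cfg["summarization"]["model"])
```

Each variant could then be passed as `config=` when creating a summarization service, provided the model is enabled for your account and region.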

## 5. Process the Document with OCR

In [18]:
# Initialize a new Document
document = Document(
    id="doc-for-summarization",
    input_bucket=input_bucket_name,
    input_key=sample_file_key,
    output_bucket=output_bucket_name,
    status=Status.QUEUED
)

print(f"Created document with ID: {document.id}")
print(f"Status: {document.status.value}")

# Create OCR service with Textract
# Valid features are 'LAYOUT', 'FORMS', 'SIGNATURES', 'TABLES' (uses analyze_document API)
# or leave it empty (to use basic detect_document_text API)
ocr_service = ocr.OcrService(
    region=region,
    enhanced_features=['LAYOUT']
)

# Process document with OCR
print("\nProcessing document with OCR...")
start_time = time.time()
document = ocr_service.process_document(document)
ocr_time = time.time() - start_time

print(f"OCR processing completed in {ocr_time:.2f} seconds")
print(f"Document status: {document.status.value}")
print(f"Number of pages processed: {document.num_pages}")

# Show pages information
print("\nProcessed pages:")
for page_id, page in document.pages.items():
    print(f"Page {page_id}:")
    print(f"  Image URI: {page.image_uri}")
    print(f"  Raw Text URI: {page.raw_text_uri}")
    print(f"  Parsed Text URI: {page.parsed_text_uri}")

INFO:idp_common.ocr.service:OCR Service initialized with features: ['LAYOUT']


Created document with ID: doc-for-summarization
Status: QUEUED

Processing document with OCR...


INFO:idp_common.ocr.service:Successfully extracted markdown text for page 10
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 1
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 2
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 5
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 7
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 9
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 4
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 3
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 8
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 6
INFO:idp_common.ocr.service:Sorting 10 pages by page number
INFO:idp_common.ocr.service:OCR processing completed in 6.73 seconds
INFO:idp_common.ocr.service:Processed 10 pages, with 0 errors


OCR processing completed in 6.73 seconds
Document status: QUEUED
Number of pages processed: 10

Processed pages:
Page 1:
  Image URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-58-07.pdf/pages/1/image.jpg
  Raw Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-58-07.pdf/pages/1/rawText.json
  Parsed Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-58-07.pdf/pages/1/result.json
Page 2:
  Image URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-58-07.pdf/pages/2/image.jpg
  Raw Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-58-07.pdf/pages/2/rawText.json
  Parsed Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-58-07.pdf/pages/2/result.json
Page 3:
  Image URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-58-07.pdf/pages/3/image.jpg
  Raw Text URI: s3://idp-notebook-output-912625584728-us-west-2/s
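
The OCR log shows pages finishing out of order (10, 1, 2, 5, ...) before the service sorts them by page number. If you ever sort page IDs yourself, note that string IDs sort lexicographically unless you convert them. The sketch assumes page IDs are numeric strings, which this notebook's output suggests but does not guarantee.

```python
# Hypothetical page IDs in the order the OCR log reports them.
page_ids = ["10", "1", "2", "5", "7", "9", "4", "3", "8", "6"]

# String order: "10" sorts before "2"
lexicographic = sorted(page_ids)

# Page order: convert each ID to int for the comparison only
numeric = sorted(page_ids, key=int)

print(lexicographic)
print(numeric)
```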

## 6. Summarize the Document

In [19]:
# Create summarization service with Bedrock backend
summarization_service = summarization.SummarizationService(
    config=CONFIG, 
    backend="bedrock"
)

# Summarize the document
print("\nSummarizing document...")
start_time = time.time()
document = summarization_service.process_document(document)
summarization_time = time.time() - start_time

print(f"Summarization completed in {summarization_time:.2f} seconds")


INFO:idp_common.summarization.service:Initialized summarization service with Bedrock backend using model us.amazon.nova-pro-v1:0
INFO:idp_common.summarization.service:No sections defined, summarizing entire document at once



Summarizing document...


INFO:idp_common.summarization.service:Summarizing text with Bedrock
DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: 
            Analyze the provided document and cre...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 223 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 2474 words
INFO:idp_common.bedrock.client:Processed content with 1 cachepoint insertions
INFO:idp_common.bedrock.client:Bedrock request attempt 1/8:
DEBUG:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
DEBUG:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1}
DEBUG:idp_common.bedrock.client:  - system: [{'text': "You are a document summarization expert who can analyze and summarize documents from various domains including medical, financial, legal, and general business documents. Your task is to create a s

Summarization completed in 16.76 seconds
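
The debug log above reports that the prompt was split into two parts at the `<<CACHEPOINT>>` tag, with a cache point inserted after the first part. The sketch below shows one way such a split could produce Bedrock Converse-style content blocks. It is an illustration only: the function name is hypothetical and the package's actual implementation is not shown in this notebook.

```python
CACHEPOINT_TAG = "<<CACHEPOINT>>"


def split_at_cachepoints(text: str) -> list[dict]:
    """Turn tagged text into Converse-style content blocks (sketch only)."""
    parts = text.split(CACHEPOINT_TAG)
    content = []
    for index, part in enumerate(parts):
        if part.strip():
            content.append({"text": part})
        # Insert a cache point after every part except the last one
        if index < len(parts) - 1:
            content.append({"cachePoint": {"type": "default"}})
    return content


prompt = "Static instructions.\n<<CACHEPOINT>>\n<document>\npage text\n</document>"
for block in split_at_cachepoints(prompt):
    print(block)
```

Placing the static instructions before the tag and the document text after it, as `CONFIG` does, means the part that repeats across documents is the part that can be cached.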


In [20]:
# Display the summary
print("\n" + "=" * 80)
print("DOCUMENT SUMMARY")
print("=" * 80)

# Access the summary content dynamically
summary = document.summarization_result.summary
print("\nAvailable fields / values in summary:")
for key in summary.keys():
    print("\n" + "=" * 40)
    print(f"{key}:")
    print("\n" + "=" * 40)
    print(summary.get(key, 'N/A'))

# Display summary report URI
print("\nSummary Report URI:")
print(document.summary_report_uri)

# Document status should already be updated by the service
print(f"\nDocument status: {document.status.value}")


DOCUMENT SUMMARY

Available fields / values in summary:

summary:

## WESTERN DARK FIRED TOBACCO GROWERS' ASSOCIATION

**Address:** 206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056
**Contact:** (502) 753-3341 FAX (502) 753-0069/3342

**Date:** October 31, 1995

**Recipient:** The Honorable Wendell H. Ford, United States Senate, Washington, D.C. 20510

**Summary:** 
- The Western Dark Fired Tobacco Growers' Association, representing 9,000 tobacco producers, opposes the "Commitment to our Children" petition.
- They argue against FDA regulation of tobacco, citing existing age restriction laws and the potential negative impact on family farms and adult freedoms.
- The association urges Senator Ford to prevent FDA from becoming a federal tobacco regulator.

<span title="Original text from document: On behalf of the Western Dark Fired Tobacco Growers' Association and the 9,000 tobacco producers it represents, I an obligated to convey our strong opposition to the 'Commitment to our 
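
The task prompt asks the model to return JSON inside a fenced `json` block, so the raw response has to be unwrapped before it can be used as a dictionary like `summary` above. The sketch below shows one tolerant way to do that. `parse_model_json` is a hypothetical helper, not the package's parser, and the sample response is made up for illustration.

```python
import json
import re

# Build the fence marker from single characters so this example does not
# need to nest literal code fences.
FENCE = "`" * 3
_FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_model_json(raw: str) -> dict:
    """Parse JSON from a model response, with or without a code fence."""
    match = _FENCED_JSON.search(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload)


# Made-up response in the shape the task prompt requests
raw_response = FENCE + 'json\n{"summary": "## INVOICE\\n**Invoice No:** 86239"}\n' + FENCE
parsed = parse_model_json(raw_response)
print(list(parsed.keys()))
print(parsed["summary"])
```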

In [21]:
# Display rendered markdown summary
from IPython.display import Markdown, display

# Get the summary report content
if document.summary_report_uri:
    bucket, key = parse_s3_uri(document.summary_report_uri)
    try:
        response = s3_client.get_object(Bucket=bucket, Key=key)
        report_content = response['Body'].read().decode('utf-8')
        # Display the markdown content
        display(Markdown(report_content))
    except Exception as e:
        print(f"Error retrieving summary report: {str(e)}")

## WESTERN DARK FIRED TOBACCO GROWERS' ASSOCIATION

**Address:** 206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056
**Contact:** (502) 753-3341 FAX (502) 753-0069/3342

**Date:** October 31, 1995

**Recipient:** The Honorable Wendell H. Ford, United States Senate, Washington, D.C. 20510

**Summary:** 
- The Western Dark Fired Tobacco Growers' Association, representing 9,000 tobacco producers, opposes the "Commitment to our Children" petition.
- They argue against FDA regulation of tobacco, citing existing age restriction laws and the potential negative impact on family farms and adult freedoms.
- The association urges Senator Ford to prevent FDA from becoming a federal tobacco regulator.

<span title="Original text from document: On behalf of the Western Dark Fired Tobacco Growers' Association and the 9,000 tobacco producers it represents, I an obligated to convey our strong opposition to the 'Commitment to our Children' petition being circulated by several Members of Congress." class="citation">[Cite-1, Page-1]</span>

---

## MEMORANDUM

**To:** Howard Goldfrach, Kathy Leiber
**Date:** April 3, 1987
**From:** Mel Fallis
**Re:** Black Consumer Market Promotion Development

**Summary:** 
- Proposed project development timetable for Benson & Hedges and Virginia Slims.
- Focus on Situation Analysis and Campaign Development phases.
- Meetings scheduled for April 9th with Howard and Kathy.
- Request for industry, category, consumer dynamics, and promotion activities information.

<span title="Original text from document: Attached you will find the proposed project development timetable for the Situation Analysis and Campaign Development phases for Benson & Hedges and Virginia Slims." class="citation">[Cite-2, Page-2]</span>

---

## LAB SERVICES CONSISTENCY REPORT

**Date:** 2/28/93
**Technician:** CC
**Shift:** A
**Trial:** 8
**Line:** 2
**Area:** 52

**Summary:** 
- Details of a lab services consistency report including sample ID, reason for request, and data communication.

<span title="Original text from document: DATE: 2/28/93 TECHNICIAN: CC SHIFT: A Trial 8 LINE: 2 AREA: 52" class="citation">[Cite-3, Page-2]</span>

---

## MORNING TEAM NOTES

**Date:** April 18, 1998
**From:** Kelahan, Ben
**Subject:** FW: Morning Team Notes 4/20

**Summary:** 
- Updates on local smoking bans and tobacco retailing ordinances in various locations.

<span title="Original text from document: Original Message From: Byron Nelson (SMTP:bnelson@wka.com] Sent: Friday, April 17. 1998 5:25 PM" class="citation">[Cite-4, Page-3]</span>

---

## MUTATION ASSAY

**Objective:** To measure the ability of a test substance to induce mutation at the hypoxanthine guanine phosphoribosyl transferase (HGPRT) locus in Chinese Hamster Ovary (CHO) cells.

**Summary:** 
- Details of the mutation assay including materials, methods, and results.

<span title="Original text from document: Objectives To measure the ability of a test substance to induse mutation at the hypoxanthine gunnine prosphonloryl transferes (hgpat) lear in Chinese Hamster very (CHO) cells" class="citation">[Cite-5, Page-4]</span>

---

## INVOICE

**Invoice No:** 86239
**Invoice Date:** 11/12/92
**Ship Date:** 10/13/92
**Bill To:** The Tobacco Institute, Anne Cannell

**Summary:** 
- Invoice details for two-sided decal prints ordered by The Tobacco Institute.

<span title="Original text from document: Invoice No: 86239 Invoice Date: 11/12/92 Ship Date: 10/13/92" class="citation">[Cite-6, Page-5]</span>

---

## NEWS RELEASE

**Date:** August 1, 1967
**Summary:** 
- FTC to begin cigarette testing under specific conditions including smoke length, random selection, and reporting of tar and nicotine content.

<span title="Original text from document: The Federal Trade Commission, having been advised by the staff that the cigarette testing laboratory has satisfactorily completed its trial tests, has now issued directions to commence the first formal test, under the follow- ing conditions:" class="citation">[Cite-7, Page-6]</span>

---

## CUSTOMER SATISFACTION SURVEY

**Summary:** 
- Survey questions regarding satisfaction with phone call representatives and R. J. Reynolds' response to requests.

<span title="Original text from document: How satisfied were you in each of the following areas: Neither Very Somewhat Satisfied Nor Somewhat Very Satisfied Satisfied Dissatisfied Dissatisfied Dissatisfied" class="citation">[Cite-8, Page-7]</span>

---

## BIOGRAPHICAL SKETCH

**Name:** Mario Stevenson
**Position Title:** Assistant Professor
**Birthdate:** May 11, 1957

**Summary:** 
- Education and professional experience including research positions and publications.

<span title="Original text from document: NAME POSITION TITLE BIRTHDATE (Mo. Day, Yr.) Mario Stevenson Assistant Professor May 11, 1957" class="citation">[Cite-9, Page-8]</span>

---

## CURRICULUM VITAE

**Surname:** Kalina
**First Name:** Moshe
**Birthdate:** January 21, 1938

**Summary:** 
- Education background and employment history in various research areas.

<span title="Original text from document: SURNAME: Kalina BIRTHDATE January 21, 1938 FIRST NAME: Moshe" class="citation">[Cite-10, Page-9]</span>
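
Each citation in the report is an HTML `span` whose `title` holds the source snippet, followed by a `[Cite-X, Page-Y]` label. If you want the citations as data, for example to check them against the OCR text, a regular expression is enough for spans in exactly this shape. The sketch assumes the `title` text contains no double quotes. `extract_citations` is a hypothetical helper.

```python
import html
import re

CITATION_RE = re.compile(
    r'<span title="(?P<title>[^"]*)" class="citation">'
    r"\[Cite-(?P<cite>\d+), Page-(?P<page>\d+)\]</span>"
)
SNIPPET_PREFIX = "Original text from document: "


def extract_citations(markdown_text: str) -> list[dict]:
    """Return citation number, page number and snippet for each span."""
    citations = []
    for m in CITATION_RE.finditer(markdown_text):
        snippet = html.unescape(m.group("title")).removeprefix(SNIPPET_PREFIX)
        citations.append({
            "cite": int(m.group("cite")),
            "page": int(m.group("page")),
            "snippet": snippet,
        })
    return citations


# One citation copied from the report above
sample = (
    '<span title="Original text from document: Invoice No: 86239 Invoice Date: '
    '11/12/92 Ship Date: 10/13/92" class="citation">[Cite-6, Page-5]</span>'
)
print(extract_citations(sample))
```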

## 7. Test Using the Summary Report

In [22]:
# Create a simple example to demonstrate the Document model serialization
print("\n" + "=" * 80)
print("DOCUMENT SERIALIZATION EXAMPLE")
print("=" * 80)

# Convert document to dictionary
doc_dict = document.to_dict()

# Check that summary_report_uri is included
print(f"summary_report_uri in document dict: {'summary_report_uri' in doc_dict}")

# Show the summary fields in the dictionary
print("\nSummary report URI in document dictionary:")
print(f"  - summary_report_uri: {doc_dict.get('summary_report_uri')}")

# Create a new document from the dictionary
new_document = Document.from_dict(doc_dict)

# Verify that fields were preserved
print("\nVerifying fields were preserved after serialization:")
print(f"  - summary_report_uri matches: {new_document.summary_report_uri == document.summary_report_uri}")

# The summarization_result object is not included in serialization
print("\nNote: The summarization_result object is not included in the serialization")
print(f"  - Original document has summarization_result: {document.summarization_result is not None}")
print(f"  - New document from dict has summarization_result: {new_document.summarization_result is not None}")


DOCUMENT SERIALIZATION EXAMPLE
summary_report_uri in document dict: True

Summary report URI in document dictionary:
  - summary_report_uri: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-58-07.pdf/summary/summary.md

Verifying fields were preserved after serialization:
  - summary_report_uri matches: True

Note: The summarization_result object is not included in the serialization
  - Original document has summarization_result: True
  - New document from dict has summarization_result: False
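
The output above shows that `summary_report_uri` survives a `to_dict()` / `from_dict()` round trip while `summarization_result` does not. The toy class below reproduces that behaviour in isolation. `ToyDocument` is a hypothetical stand-in written for this illustration, not the `idp_common` `Document` class, and the bucket name is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToyDocument:
    """Hypothetical stand-in for Document; not the idp_common class."""
    id: str
    summary_report_uri: Optional[str] = None
    # In-memory only: deliberately left out of to_dict()
    summarization_result: Optional[Any] = field(default=None, repr=False)

    def to_dict(self) -> dict:
        return {"id": self.id, "summary_report_uri": self.summary_report_uri}

    @classmethod
    def from_dict(cls, data: dict) -> "ToyDocument":
        return cls(id=data["id"], summary_report_uri=data.get("summary_report_uri"))


original = ToyDocument(
    id="doc-for-summarization",
    summary_report_uri="s3://example-bucket/doc/summary/summary.md",
    summarization_result={"summary": "..."},
)
restored = ToyDocument.from_dict(original.to_dict())

print(restored.summary_report_uri == original.summary_report_uri)  # True
print(restored.summarization_result is None)  # True
```

In practice this means a downstream step that receives only the serialized document should read the summary from the report URI instead of expecting the in-memory result.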


In [23]:
# Test with a different output format configuration
print("\n" + "=" * 80)
print("TESTING FLEXIBLE OUTPUT STRUCTURE")
print("=" * 80)

# Alternative configuration with flexible format
FLEXIBLE_CONFIG = {
    "summarization": {
        "model": "us.amazon.nova-pro-v1:0",
        "temperature": 0,
        "top_k": 5,
        "system_prompt": "You are a document summarization expert who can analyze and summarize documents from various domains including medical, financial, legal, and general business documents. Your task is to create a structured summary that captures the key information, main points, and important details from the document. Your output must be in valid JSON format.",
        "task_prompt": "Analyze the provided document and create a structured summary.\n\n<document>\n{DOCUMENT_TEXT}\n</document>\n\nCreate a summary that captures the essential information from the document. Your summary should be structured and comprehensive.\n\nReturn your response as valid JSON with the following structure:\n```json\n{\n  \"overview\": \"A 1-2 sentence overview of what this document is about\",\n  \"key_points\": [\"List of 3-5 key points as bullet points\"],\n  \"sections\": {\n    \"section1_name\": \"Summary of section 1\",\n    \"section2_name\": \"Summary of section 2\"\n  },\n  \"entities\": [\"List of important entities (people, organizations, dates, etc.)\"],\n  \"conclusion\": \"A brief conclusion statement\"\n}\n```"
    }
}

# Create summarization service (no backend argument; the log below shows it defaults to Bedrock)
summarization_service = summarization.SummarizationService(
    config=FLEXIBLE_CONFIG, 
)

# Summarize the document
print("\nSummarizing document...")
start_time = time.time()
flexible_document = summarization_service.process_document(document)
summarization_time = time.time() - start_time

print(f"Summarization completed in {summarization_time:.2f} seconds")

# Show available fields in the flexible summary
summary = flexible_document.summarization_result.summary
print("\nAvailable fields in flexible summary:")
for key in summary.keys():
    print(f"  - {key}")

# Show the summary report URI
print("\nSummary Report URI:")
print(flexible_document.summary_report_uri)

# Fetch and prepare the markdown content for rendering
flexible_markdown_content = "No flexible summary report available"
if flexible_document.summary_report_uri:
    bucket, key = parse_s3_uri(flexible_document.summary_report_uri)
    try:
        response = s3_client.get_object(Bucket=bucket, Key=key)
        flexible_markdown_content = response['Body'].read().decode('utf-8')
        print("\nSuccessfully retrieved flexible summary report for rendering.")
    except Exception as e:
        print(f"Error retrieving flexible summary report: {str(e)}")

INFO:idp_common.summarization.service:Initialized summarization service with Bedrock backend using model us.amazon.nova-pro-v1:0
INFO:idp_common.summarization.service:No sections defined, summarizing entire document at once



TESTING FLEXIBLE OUTPUT STRUCTURE

Summarizing document...


INFO:idp_common.summarization.service:Summarizing text with Bedrock
DEBUG:idp_common.bedrock.client:No cachepoint tags in text content, passing through unchanged
INFO:idp_common.bedrock.client:Bedrock request attempt 1/8:
DEBUG:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
DEBUG:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1}
DEBUG:idp_common.bedrock.client:  - system: [{'text': 'You are a document summarization expert who can analyze and summarize documents from various domains including medical, financial, legal, and general business documents. Your task is to create a structured summary that captures the key information, main points, and important details from the document. Your output must be in valid JSON format.'}]
DEBUG:idp_common.bedrock.client:  - messages: [{'role': 'user', 'content': [{'text': 'Analyze the provided document and create a structured summary.\n\n<document>\n<page-number>1</page-number>\n\n\nWESTERN DARK FIRED TOBA

Summarization completed in 5.58 seconds

Available fields in flexible summary:
  - overview
  - key_points
  - sections
  - entities
  - conclusion

Summary Report URI:
s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-58-07.pdf/summary/summary.md

Successfully retrieved flexible summary report for rendering.
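
The log line `Bedrock request attempt 1/8` suggests that the Bedrock client retries failed requests up to eight times. The retry policy itself is not shown in this notebook, so the sketch below is only a generic example of retrying with exponential backoff and jitter. The function name, delays and the fake call are all assumptions.

```python
import random
import time


def call_with_retries(func, max_attempts=8, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff with full jitter
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


# Usage with a fake call that fails twice before succeeding
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("throttled")
    return "ok"


# sleep is replaced with a no-op so the example runs instantly
result = call_with_retries(flaky_call, sleep=lambda seconds: None)
print(result, calls["count"])
```

A real client would normally retry only on throttling or transient errors instead of on every exception.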


In [24]:
# Display rendered flexible markdown summary
from IPython.display import Markdown, display

print("\n" + "=" * 80)
print("RENDERED FLEXIBLE SUMMARY REPORT")
print("=" * 80 + "\n")

# Display the markdown content
display(Markdown(flexible_markdown_content))


RENDERED FLEXIBLE SUMMARY REPORT



## overview

This document contains various sections related to the tobacco industry, including opposition to FDA regulation, market promotion strategies, lab services reports, and biographical sketches of key personnel.

## sections

### opposition_to_fda_regulation

The Western Dark Fired Tobacco Growers' Association expresses strong opposition to FDA regulation of tobacco, arguing it would create bureaucracy and hamper farmers' rights.

### market_promotion_strategies

A memorandum outlines the proposed project development timetable for market promotion of Benson & Hedges and Virginia Slims, emphasizing the need for industry information and consumer dynamics.

### lab_services_report

A lab services consistency report details the procedures and results of a test, including sample handling, dilution factors, and consistency measurements.

### biographical_sketches

Biographical sketches provide information on the education, research experience, honors, and publications of key personnel involved in tobacco-related research.

## conclusion

The document highlights various aspects of the tobacco industry, including regulatory opposition, market strategies, lab testing, and personnel backgrounds.
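
The report above turns top-level keys into `##` headings and nested keys into `###` headings. The sketch below shows how a nested summary dictionary could be rendered that way. It is not the package's renderer: the report above shows only string and nested-dictionary fields, so rendering lists as bullets here is a guess at reasonable behaviour.

```python
def summary_to_markdown(summary: dict, level: int = 2) -> str:
    """Render a nested summary dict as markdown (illustrative sketch)."""
    lines = []
    for key, value in summary.items():
        lines.append(f"{'#' * level} {key}")
        lines.append("")
        if isinstance(value, dict):
            # Nested dicts become sub-headings one level deeper
            lines.append(summary_to_markdown(value, level + 1))
        elif isinstance(value, list):
            lines.extend(f"- {item}" for item in value)
            lines.append("")
        else:
            lines.append(str(value))
            lines.append("")
    return "\n".join(lines)


# Made-up summary in the shape requested by FLEXIBLE_CONFIG
example_summary = {
    "overview": "A package of tobacco industry documents.",
    "key_points": ["point one", "point two"],
    "sections": {"lab_services_report": "Details of a consistency test."},
}
print(summary_to_markdown(example_summary))
```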

## 8. Conclusion and Cleanup

This notebook has demonstrated the flexible document summarization capabilities of the IDP Common Package:

1. **Standard Summary Format** - Using a single `summary` field that holds a markdown summary with citations
2. **Flexible Summary Format** - Using a more complex structure with sections and entities

Key benefits of the approach:

- Adapt to any summary structure based on document requirements
- Configure prompts to generate exactly the fields you need
- View rendered markdown reports for better readability
- Store and retrieve summaries with consistent interfaces
- Work with any LLM supported by the backend

Learn more in the documentation: /lib/idp_common_pkg/idp_common/summarization/README.md
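
For cleanup, the objects written during this run can be removed from S3. The sketch below deletes everything under a key prefix using the boto3 `list_objects_v2` paginator and `delete_objects`. `delete_prefix` is a hypothetical helper, and the commented calls assume the variables defined earlier in this notebook. Double-check the bucket and prefix first, because deletion cannot be undone on unversioned buckets.

```python
def delete_prefix(s3_client, bucket: str, prefix: str) -> int:
    """Delete every object under `prefix` in `bucket`; return the count."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # Each page holds at most 1,000 keys, the delete_objects limit
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted


# In this notebook the helper could be called as follows
# (uncomment to run against AWS):
# delete_prefix(s3_client, output_bucket_name, sample_file_key + "/")
# s3_client.delete_object(Bucket=input_bucket_name, Key=sample_file_key)
```

The buckets themselves are left in place so the notebook can be rerun without recreating them.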