# Holistic Packet Classification with IDP Common Package

This notebook demonstrates how to use the holistic packet classification capability of the IDP Common Package to classify multi-document packets, in which each document may span multiple pages. The holistic approach examines the packet as a whole to identify the boundaries between documents of different types.

**Key Benefits of Holistic Packet Classification:**
1. Properly handles multi-page documents within a packet
2. Detects logical document boundaries
3. Identifies document types in the context of the whole packet
4. Handles documents where individual pages may not be clearly classifiable on their own

The notebook demonstrates how to process a document with:

1. OCR Service - Convert a PDF document to text using Amazon Textract
2. Classification Service - Classify document pages into sections using Amazon Bedrock with the text-based holistic method (`textbasedHolisticClassification`)
3. Extraction Service - Extract structured information from sections using Bedrock
4. Evaluation Service - Evaluate accuracy of extracted information

Each step uses the unified Document object model for data flow and consistency.

> **Note**: This notebook uses AWS services including S3, Textract, and Bedrock. You need valid AWS credentials with appropriate permissions to run this notebook.

## 1. Install Dependencies

The IDP Common Package supports granular installation through extras. You can install:
- `[core]` - Just core functionality
- `[ocr]` - OCR service with Textract dependencies
- `[classification]` - Classification service dependencies
- `[extraction]` - Extraction service dependencies
- `[evaluation]` - Evaluation service dependencies
- `[all]` - All of the above

In [1]:
# Let's make sure that modules are autoreloaded
%load_ext autoreload
%autoreload 2

# First uninstall existing package (to ensure we get the latest version)
%pip uninstall -y idp_common

# Install the IDP common package with all components in development mode
%pip install -q -e "../lib/idp_common_pkg[dev, all]"

# Note: We can also install specific components like:
# %pip install -q -e "../lib/idp_common_pkg[ocr,classification,extraction,evaluation]"

# Check installed version
%pip show idp_common | grep -E "Version|Location"

# Optionally use a .env file for environment variables
try:
    from dotenv import load_dotenv
    load_dotenv()  
except ImportError:
    pass  

Found existing installation: idp_common 0.2.21
Uninstalling idp_common-0.2.21:
  Successfully uninstalled idp_common-0.2.21
Note: you may need to restart the kernel to use updated packages.
Version: 0.2.21
Location: /home/ec2-user/miniconda/lib/python3.12/site-packages
Note: you may need to restart the kernel to use updated packages.


## 2. Import Libraries and Set Up Environment

In [2]:
import os
import json
import time
import boto3
import logging
import datetime
import copy

# Import base libraries
from idp_common.models import Document, Status, Section, Page
from idp_common import ocr, classification, extraction, evaluation, summarization
from idp_common import s3
# Optionally load environment variables from a .env file
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

# Configure logging 
logging.basicConfig(level=logging.WARNING)  # Set root logger to WARNING (less verbose)
logging.getLogger('idp_common.ocr.service').setLevel(logging.INFO)  # Focus on service logs
logging.getLogger('textractor').setLevel(logging.WARNING)  # Suppress textractor logs
logging.getLogger('idp_common.evaluation.service').setLevel(logging.DEBUG)  # Enable evaluation logs
logging.getLogger('idp_common.bedrock.client').setLevel(logging.DEBUG)  # show prompts


# Set environment variables
os.environ['METRIC_NAMESPACE'] = 'IDP-Notebook-Example'
os.environ['AWS_REGION'] = boto3.session.Session().region_name or 'us-east-1'

# Get AWS account ID for unique bucket names
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
region = os.environ['AWS_REGION']

# Define sample PDF path 
SAMPLE_PDF_PATH = "../samples/rvl_cdip_package.pdf"

# Create unique bucket names based on account ID and region
input_bucket_name = os.getenv("IDP_INPUT_BUCKET_NAME", f"idp-notebook-input-{account_id}-{region}")
output_bucket_name = os.getenv("IDP_OUTPUT_BUCKET_NAME", f"idp-notebook-output-{account_id}-{region}")

# Helper function to parse S3 URIs
def parse_s3_uri(uri):
    parts = uri.replace("s3://", "").split("/")
    bucket = parts[0]
    key = "/".join(parts[1:])
    return bucket, key

# Helper function to load JSON from S3
# (relies on the global s3_client created in section 3, so call it only after that cell has run)
def load_json_from_s3(uri):
    bucket, key = parse_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')
    return json.loads(content)

print("Environment setup:")
print(f"METRIC_NAMESPACE: {os.environ.get('METRIC_NAMESPACE')}")
print(f"AWS_REGION: {os.environ.get('AWS_REGION')}")
print(f"Input bucket: {input_bucket_name}")
print(f"Output bucket: {output_bucket_name}")
print(f"SAMPLE_PDF_PATH: {SAMPLE_PDF_PATH}")

Environment setup:
METRIC_NAMESPACE: IDP-Notebook-Example
AWS_REGION: us-west-2
Input bucket: idp-notebook-input-912625584728-us-west-2
Output bucket: idp-notebook-output-912625584728-us-west-2
SAMPLE_PDF_PATH: ../samples/rvl_cdip_package.pdf
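
The `parse_s3_uri` helper in the cell above splits a URI with plain string operations and accepts any input. As a minimal sketch of a stricter variant, the function below uses the standard library's `urlparse` and rejects anything that is not an `s3://` URI. The name `split_s3_uri` and the example bucket are illustrative and not part of `idp_common`. Keys containing `?` or `#` would need extra handling, because `urlparse` treats those characters as query and fragment separators.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key), rejecting other schemes."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://example-bucket/pages/1/result.json")
print(bucket)  # example-bucket
print(key)     # pages/1/result.json
```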


## 3. Set Up S3 Buckets and Upload Sample File

In [3]:
# Create S3 client
s3_client = boto3.client('s3')

# Function to create a bucket if it doesn't exist
def ensure_bucket_exists(bucket_name):
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket {bucket_name} already exists")
    except Exception:
        try:
            if region == 'us-east-1':
                s3_client.create_bucket(Bucket=bucket_name)
            else:
                s3_client.create_bucket(
                    Bucket=bucket_name,
                    CreateBucketConfiguration={'LocationConstraint': region}
                )
            print(f"Created bucket: {bucket_name}")
            
            # Wait for bucket to be accessible
            waiter = s3_client.get_waiter('bucket_exists')
            waiter.wait(Bucket=bucket_name)
        except Exception as e:
            print(f"Error creating bucket {bucket_name}: {str(e)}")
            raise

# Ensure both buckets exist
ensure_bucket_exists(input_bucket_name)
ensure_bucket_exists(output_bucket_name)

# Upload the sample file to S3
sample_file_key = "sample-" + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + ".pdf"
with open(SAMPLE_PDF_PATH, 'rb') as file_data:
    s3_client.upload_fileobj(file_data, input_bucket_name, sample_file_key)

print(f"Uploaded sample file to: s3://{input_bucket_name}/{sample_file_key}")

Bucket idp-notebook-input-912625584728-us-west-2 already exists
Bucket idp-notebook-output-912625584728-us-west-2 already exists
Uploaded sample file to: s3://idp-notebook-input-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf
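
`ensure_bucket_exists` treats every `head_bucket` failure as "the bucket does not exist". Because bucket names are global, a failure can also mean the name is taken by another account or that the caller lacks permission. The sketch below shows one way to tell these cases apart, assuming a boto3-style client whose errors carry a `response` dictionary (as botocore's `ClientError` does). `bucket_status` and the fake client classes are hypothetical stand-ins so the example runs without AWS access.

```python
def bucket_status(s3_client, bucket_name: str) -> str:
    """Return 'exists', 'missing', or 'forbidden' for a bucket."""
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        return "exists"
    except Exception as exc:
        code = getattr(exc, "response", {}).get("Error", {}).get("Code", "")
        if code in ("404", "NoSuchBucket"):
            return "missing"    # safe to attempt create_bucket
        if code == "403":
            return "forbidden"  # bucket exists but is not accessible to us
        raise                   # anything else: surface the original error


# Hypothetical stand-ins for a real client and its error type.
class _FakeError(Exception):
    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


class _FakeS3:
    def __init__(self, buckets):
        # Maps bucket name -> error code, or None if the bucket is accessible.
        self.buckets = buckets

    def head_bucket(self, Bucket):
        code = self.buckets.get(Bucket, "404")
        if code is not None:
            raise _FakeError(code)


fake = _FakeS3({"mine": None, "someone-elses": "403"})
print(bucket_status(fake, "mine"))           # exists
print(bucket_status(fake, "someone-elses"))  # forbidden
print(bucket_status(fake, "brand-new"))      # missing
```

With a real client, only the `missing` result would justify calling `create_bucket`.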


## 4. Set Up Configuration

In [4]:
import yaml
with open("./config.yml", 'r') as file:
    CONFIG = yaml.safe_load(file)
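
Later cells read `CONFIG["classification"]["classificationMethod"]` and expect the value `textbasedHolisticClassification`. A small fail-fast check can catch a misconfigured file before any Bedrock call is made. This is a sketch only: the helper name and the sample dictionary are illustrative, and the rest of the `config.yml` schema is not shown here.

```python
EXPECTED_METHOD = "textbasedHolisticClassification"


def classification_method(config: dict) -> str:
    """Return the configured classification method, or raise if it is absent."""
    method = config.get("classification", {}).get("classificationMethod")
    if method is None:
        raise KeyError("classification.classificationMethod is not set")
    return method


# Illustrative stand-in for the dictionary loaded from config.yml.
sample_config = {"classification": {"classificationMethod": EXPECTED_METHOD}}
method = classification_method(sample_config)
print(method == EXPECTED_METHOD)  # True
```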

## 5. Process Document with OCR

In [5]:
# Initialize a new Document
document = Document(
    id="rvl-cdip-package",
    input_bucket=input_bucket_name,
    input_key=sample_file_key,
    output_bucket=output_bucket_name,
    status=Status.QUEUED
)

print(f"Created document with ID: {document.id}")
print(f"Status: {document.status.value}")

# Create OCR service with Textract
# Valid features are 'LAYOUT', 'FORMS', 'SIGNATURES', 'TABLES' (uses analyze_document API)
# or leave it empty (to use basic detect_document_text API)
ocr_service = ocr.OcrService(
    region=region,
    enhanced_features=['LAYOUT']
)

# Process document with OCR
print("\nProcessing document with OCR...")
start_time = time.time()
document = ocr_service.process_document(document)
ocr_time = time.time() - start_time

print(f"OCR processing completed in {ocr_time:.2f} seconds")
print(f"Document status: {document.status.value}")
print(f"Number of pages processed: {document.num_pages}")

# Show pages information
print("\nProcessed pages:")
for page_id, page in document.pages.items():
    print(f"Page {page_id}:")
    print(f"  Image URI: {page.image_uri}")
    print(f"  Raw Text URI: {page.raw_text_uri}")
    print(f"  Parsed Text URI: {page.parsed_text_uri}")
print("\nMetering:")
print(json.dumps(document.metering))

INFO:idp_common.ocr.service:OCR Service initialized with features: ['LAYOUT']


Created document with ID: rvl-cdip-package
Status: QUEUED

Processing document with OCR...


INFO:idp_common.ocr.service:Successfully extracted markdown text for page 10
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 7
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 2
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 3
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 9
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 5
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 6
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 4
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 1
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 8
INFO:idp_common.ocr.service:Sorting 10 pages by page number
INFO:idp_common.ocr.service:OCR processing completed in 6.13 seconds
INFO:idp_common.ocr.service:Processed 10 pages, with 0 errors


OCR processing completed in 6.13 seconds
Document status: QUEUED
Number of pages processed: 10

Processed pages:
Page 1:
  Image URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/pages/1/image.jpg
  Raw Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/pages/1/rawText.json
  Parsed Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/pages/1/result.json
Page 2:
  Image URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/pages/2/image.jpg
  Raw Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/pages/2/rawText.json
  Parsed Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/pages/2/result.json
Page 3:
  Image URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/pages/3/image.jpg
  Raw Text URI: s3://idp-notebook-output-912625584728-us-west-2/s
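
The comment in the OCR cell notes that requesting enhanced features selects Textract's `analyze_document` API, while an empty list falls back to `detect_document_text`. The sketch below illustrates that decision as a pure function. How `OcrService` implements it internally is an assumption, and `textract_call_for` is a hypothetical name.

```python
# Feature names as listed in the OCR cell's comment.
VALID_FEATURES = {"LAYOUT", "FORMS", "SIGNATURES", "TABLES"}


def textract_call_for(features):
    """Return (api_name, extra_kwargs) for a list of enhanced features."""
    features = list(features or [])
    unknown = set(features) - VALID_FEATURES
    if unknown:
        raise ValueError(f"Unsupported features: {sorted(unknown)}")
    if features:
        # analyze_document receives the requested analyses as FeatureTypes.
        return "analyze_document", {"FeatureTypes": features}
    # No features requested: plain text detection is sufficient.
    return "detect_document_text", {}


print(textract_call_for(["LAYOUT"]))
print(textract_call_for([]))
```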

## 6. Classify the Document

In [6]:
# Verify that Config specifies => "classificationMethod": "textbasedHolisticClassification"
print("*****************************************************************")
print(f'CONFIG classificationMethod: {CONFIG["classification"].get("classificationMethod")}')
print("*****************************************************************")

# Create classification service with Bedrock backend
# The classification method is set in the config
classification_service = classification.ClassificationService(
    config=CONFIG, 
    backend="bedrock" 
)

# Classify the document
print("\nClassifying document...")
start_time = time.time()
document = classification_service.classify_document(document)
classification_time = time.time() - start_time
print(f"Classification completed in {classification_time:.2f} seconds")
print(f"Document status: {document.status.value}")

*****************************************************************
CONFIG classificationMethod: textbasedHolisticClassification
*****************************************************************

Classifying document...


DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: The <document-types> XML tags contain a markdown t...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 666 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 2475 words
INFO:idp_common.bedrock.client:Processed content with 1 cachepoint insertions
INFO:idp_common.bedrock.client:Bedrock request attempt 1/8:
DEBUG:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
DEBUG:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1, 'maxTokens': 4096}
DEBUG:idp_common.bedrock.client:  - system: [{'text': 'You are a document classification expert who can analyze and classify multiple documents and their page boundaries within a document package from various domains. Your task is to determine the document type based on its content and structure, us

Classification completed in 3.67 seconds
Document status: QUEUED
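
The debug log above shows the prompt being split at `<<CACHEPOINT>>` tags, with a cache point inserted after the first part. The following sketch shows how such a split could produce content blocks for the Bedrock Converse API, where a `cachePoint` block marks the end of a cacheable prefix. It is an illustration of the idea, not the code in `idp_common.bedrock.client`, and `split_at_cachepoints` is a hypothetical name.

```python
CACHEPOINT_TAG = "<<CACHEPOINT>>"


def split_at_cachepoints(prompt: str) -> list[dict]:
    """Turn a tagged prompt into Converse-style content blocks."""
    parts = prompt.split(CACHEPOINT_TAG)
    content = []
    for index, part in enumerate(parts):
        if part.strip():
            content.append({"text": part})
        # Insert a cache point after every part except the last.
        if index < len(parts) - 1:
            content.append({"cachePoint": {"type": "default"}})
    return content


blocks = split_at_cachepoints("static instructions<<CACHEPOINT>>page text")
print(len(blocks))  # 3
print(blocks[1])    # {'cachePoint': {'type': 'default'}}
```

Placing the static instructions before the tag means the part that changes per document (the page text) stays outside the cached prefix.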


### Show classification results

In [7]:
if document.sections:
    print("\nDetected sections:")
    for section in document.sections:
        print(f"Section {section.section_id}: {section.classification}")
        print(f"  Pages: {section.page_ids}")
else:
    print("\nNo sections detected")

# Show page classification
print("\nPage-level classifications:")
# Page IDs are numeric strings, so sort them as integers (otherwise '10' sorts before '2')
for page_id, page in sorted(document.pages.items(), key=lambda item: int(item[0])):
    print(f"Page {page_id}: {page.classification}")


Detected sections:
Section 1: letter
  Pages: ['1', '2']
Section 2: memo
  Pages: ['3', '4']
Section 3: invoice
  Pages: ['5']
Section 4: advertisement
  Pages: ['6']
Section 5: questionnaire
  Pages: ['7']
Section 6: resume
  Pages: ['8', '9']
Section 7: memo
  Pages: ['10']

Page-level classifications:
Page 1: letter
Page 2: letter
Page 3: memo
Page 4: memo
Page 5: invoice
Page 6: advertisement
Page 7: questionnaire
Page 8: resume
Page 9: resume
Page 10: memo
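
In the results above, every page carries the label of the section that contains it, and two sections can share a label (sections 2 and 7 are both `memo`) without being merged because their pages are not adjacent. The sketch below reproduces that relationship with plain dictionaries. The dictionary layout is illustrative and is not the `Section` model from `idp_common`.

```python
def page_labels_from_sections(sections):
    """Map each page ID to the label of the section that contains it."""
    labels = {}
    for section in sections:
        for page_id in section["page_ids"]:
            if page_id in labels:
                raise ValueError(f"Page {page_id} appears in two sections")
            labels[page_id] = section["classification"]
    return labels


# A subset of the detected sections, written as plain dictionaries.
sections = [
    {"section_id": "1", "classification": "letter", "page_ids": ["1", "2"]},
    {"section_id": "2", "classification": "memo", "page_ids": ["3", "4"]},
    {"section_id": "7", "classification": "memo", "page_ids": ["10"]},
]
labels = page_labels_from_sections(sections)
# Page IDs are strings, so sort numerically to keep page 10 last.
for page_id in sorted(labels, key=int):
    print(page_id, labels[page_id])
```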


# Summarization service

## PART 1: Processing Individual Sections

In [8]:
summarization_service = summarization.SummarizationService(config=CONFIG)

print("=== PART 1: Processing Individual Sections ===")
n = 3  # Only process first 3 sections to save time
# Process each section directly using the section_id
for section in document.sections[:n]:  
    print(f"\nProcessing section {section.section_id} (class: {section.classification})")
    
    # Process section directly with the original document
    start_time = time.time()
    document, section_metering = summarization_service.process_document_section(
        document=document,
        section_id=section.section_id
    )
    summarization_time = time.time() - start_time
    print(f"Summarization for section {section.section_id} completed in {summarization_time:.2f} seconds")
    
    # Print the summary content if available
    if section.attributes and 'summary_uri' in section.attributes:
        summary_uri = section.attributes['summary_uri']
        summary_md_uri = section.attributes.get('summary_md_uri')
        
        print(f"\nJSON Summary URI: {summary_uri}")
        if summary_md_uri:
            print(f"Markdown Summary URI: {summary_md_uri}")
        
        # Get and display JSON summary
        try:
            # Get the JSON summary content from S3
            summary_content = s3.get_json_content(summary_uri)
            print("\nJSON Summary Content:")
            
            # Check if there's a specific summary field in the content
            if isinstance(summary_content, dict):
                if 'summary' in summary_content:
                    print("Summary field found in JSON:")
                    print(summary_content['summary'][:300] + "..." if len(summary_content['summary']) > 300 else summary_content['summary'])
                elif 'content' in summary_content:
                    print("Content field found in JSON:")
                    print(summary_content['content'][:300] + "..." if len(str(summary_content['content'])) > 300 else summary_content['content'])
                else:
                    # Print the whole content if no specific summary field
                    print("Full JSON content (truncated):")
                    print(json.dumps(summary_content, indent=2)[:300] + "..." if len(json.dumps(summary_content)) > 300 else json.dumps(summary_content, indent=2))
            else:
                print(summary_content)
        except Exception as e:
            print(f"Error retrieving JSON summary: {e}")
            
        # Get and display Markdown summary if available
        if summary_md_uri:
            try:
                # Get the markdown summary content from S3
                markdown_content = s3.get_text_content(summary_md_uri)
                print("\nMarkdown Summary Content (first 300 chars):")
                print(markdown_content[:300] + "..." if len(markdown_content) > 300 else markdown_content)
                
                # Display the rendered markdown
                from IPython.display import Markdown, display
                print("\nRendered Markdown Summary:")
                display(Markdown(markdown_content))
            except Exception as e:
                print(f"Error retrieving markdown summary: {e}")
    else:
        print("No summary available for this section")
    
print(f"\nSummarization for first {n} sections complete.")

=== PART 1: Processing Individual Sections ===

Processing section 1 (class: letter)


DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: Analyze the provided document and create a compreh...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 223 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 471 words
INFO:idp_common.bedrock.client:Processed content with 1 cachepoint insertions
INFO:idp_common.bedrock.client:Bedrock request attempt 1/8:
DEBUG:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
DEBUG:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1, 'maxTokens': 4096}
DEBUG:idp_common.bedrock.client:  - system: [{'text': "You are a document summarization expert who can analyze and summarize documents from various domains including medical, financial, legal, and general business documents. Your task is to create a summary that captures the key information, main poi

Summarization for section 1 completed in 5.89 seconds

JSON Summary URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/sections/1/summary.json
Markdown Summary URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/sections/1/summary.md

JSON Summary Content:
Summary field found in JSON:
## WESTERN DARK FIRED TOBACCO GROWERS' ASSOCIATION

**Address:** 206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056
**Contact:** (502) 753-3341 FAX (502) 753-0069/3342

**Date:** October 31, 1995

**Recipient:** The Honorable Wendell H. Ford, United States Senate, Washington, D.C. 20510

**Su...

Markdown Summary Content (first 300 chars):
## WESTERN DARK FIRED TOBACCO GROWERS' ASSOCIATION

**Address:** 206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056
**Contact:** (502) 753-3341 FAX (502) 753-0069/3342

**Date:** October 31, 1995

**Recipient:** The Honorable Wendell H. Ford, United States Senate, Washington, D.C. 20510

**Su

## WESTERN DARK FIRED TOBACCO GROWERS' ASSOCIATION

**Address:** 206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056
**Contact:** (502) 753-3341 FAX (502) 753-0069/3342

**Date:** October 31, 1995

**Recipient:** The Honorable Wendell H. Ford, United States Senate, Washington, D.C. 20510

**Summary:**

- The Western Dark Fired Tobacco Growers' Association, representing 9,000 tobacco producers, expresses strong opposition to the "Commitment to our Children" petition circulated by several Members of Congress [Cite-1, Page-1].
- The association argues against additional government bureaucracy, emphasizing the existing age restriction laws and the need for better enforcement rather than more regulation [Cite-2, Page-1].
- They warn against FDA regulation of tobacco, fearing it would create inefficient bureaucracy and negatively impact family farms and the freedom of choice for adults [Cite-3, Page-1].
- The association urges Senator Ford to prevent FDA from becoming a federal tobacco regulator and to address youth smoking prevention responsibly, starting with parental involvement [Cite-4, Page-1].

## LAB SERVICES CONSISTENCY REPORT

**Date:** 2/28/93
**Technician:** CC
**Shift:** A
**Trial:** 8
**Line:** 2
**Area:** 52
**Product Unit Code:** 0728
**Sample ID:** stuff box 2
**Reason for Request:** test

**Summary:**

- The report details lab services consistency for a specific trial, including sample weights, dilution factors, and consistency percentages [Cite-5, Page-2].
- Key data points include average consistency, sample weights, and specific measurements for various samples [Cite-6, Page-2].


Processing section 2 (class: memo)


DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: Analyze the provided document and create a compreh...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 223 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 465 words
INFO:idp_common.bedrock.client:Processed content with 1 cachepoint insertions
INFO:idp_common.bedrock.client:Bedrock request attempt 1/8:
DEBUG:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
DEBUG:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1, 'maxTokens': 4096}
DEBUG:idp_common.bedrock.client:  - system: [{'text': "You are a document summarization expert who can analyze and summarize documents from various domains including medical, financial, legal, and general business documents. Your task is to create a summary that captures the key information, main poi

Summarization for section 2 completed in 9.38 seconds

JSON Summary URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/sections/2/summary.json
Markdown Summary URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/sections/2/summary.md

JSON Summary Content:
Summary field found in JSON:
## Summary

### Email Summary
- **From:** Kelahan, Ben
- **To:** TI New York, TI Minnesota
- **CC:** Ashley Bratich (MSMAIL)
- **Subject:** FW: Morning Team Notes 4/20
- **Date:** Saturday, April 18, 1998 2:09 PM

#### Original Message
- **From:** Byron Nelson
- **Sent:** Friday, April 17, 1998 5:25...

Markdown Summary Content (first 300 chars):
## Summary

### Email Summary
- **From:** Kelahan, Ben
- **To:** TI New York, TI Minnesota
- **CC:** Ashley Bratich (MSMAIL)
- **Subject:** FW: Morning Team Notes 4/20
- **Date:** Saturday, April 18, 1998 2:09 PM

#### Original Message
- **From:** Byron Nelson
- **Sent:** Friday, April 17, 1998 5:25

## Summary

### Email Summary
- **From:** Kelahan, Ben
- **To:** TI New York, TI Minnesota
- **CC:** Ashley Bratich (MSMAIL)
- **Subject:** FW: Morning Team Notes 4/20
- **Date:** Saturday, April 18, 1998 2:09 PM

#### Original Message
- **From:** Byron Nelson
- **Sent:** Friday, April 17, 1998 5:25 PM
- **To:** Multiple recipients including Judy Albert, Carolyn, Jackie Cohen, etc.
- **Subject:** Morning Team Notes 4/20

### Key Points

#### Falmouth, MA
- On 4/15, a warrant article for a 100 percent ban on smoking in restaurants was defeated 84-77. [<span title="town meeting representatives defeated by a 84-77 vote a warrant article calling for a 100 percent ban on smoking in restaurants" class="citation">[Cite-1, Page-3]</span>]
- A motion to reconsider was rejected 104-49 on 4/16. [<span title="a motion to reconsider the vote was soundly rejected 104-49" class="citation">[Cite-2, Page-3]</span>]
- The restaurant owner's alternative was not considered due to constitutional issues. [<span title="the town counsel found the article to be unconstitutional" class="citation">[Cite-3, Page-3]</span>]

#### Waseca County, MN
- On 4/7, county commissioners tabled a new tobacco retailing ordinance. [<span title="the county commissioners once again tabled consideration of a new tobacco retailing ordinance" class="citation">[Cite-4, Page-3]</span>]
- Waseca is the 11th Minnesota community to delay the issue. [<span title="Waseca is the 11th Minnesota community to put the issue on hold" class="citation">[Cite-5, Page-3]</span>]

#### Wadena County, MN
- County commissioners tabled a new tobacco retailing ordinance in mid-March until 4/23. [<span title="the county commissioners tabled consideration of a new tobacco retailing ordinance until 4/23" class="citation">[Cite-6, Page-3]</span>]
- They will consider a model ordinance mirroring state law. [<span title="they will take up a model ordinance that mirrors the state law" class="citation">[Cite-7, Page-3]</span>]
- Bob Fackler requests calls to retailers to attend. [<span title="Bob Fackler requests calls to retailers to alert them to attend" class="citation">[Cite-8, Page-3]</span>]

### Mutation Assay Summary
- **Objective:** Measure the ability of a test substance to induce mutation at the HGPRT locus in CHO cells. [<span title="To measure the ability of a test substance to induce mutation at the hypoxanthine gunnine prosphonloryl transferes (hgpat) lear in Chinese Hamster very (CHO) cells" class="citation">[Cite-9, Page-4]</span>]
- **Methods:** Refer to Standard Operating Procedure PH314. [<span title="Refer to Standard Operating Procedure PH314" class="citation">[Cite-10, Page-4]</span>]
- **Sponsor:** American Cyanamid Company
- **Test Chemical:** Algoral 40 LF
- **Description:** Clear liquid
- **Dates:**
  - Preliminary Cytotoxicity Instituted: 6/3/82
  - CHO/HGPRT forward gene Mutation Assay Instituted: 8/26/82
- **Cell Culture:**
  - CHO-KI-BHY cells received from Oak Ridge National Laboratories on 7/1/82
  - Routine subculture every Friday (a.m.) and Monday (p.m.)
  - Subcultured into 3-75cm² flasks containing 15 ml of media F12/FCMSIO
- **Treatment:**
  - CHO-KI-BHY cells subcultured into 10-T75cm² flasks on 8/23/82
  - Further subcultured into 36 - T25 cm² flasks on 8/25/82 in preparation for treatment
- **Recorded and Edited by:** D. Good on 8/25/82


Processing section 3 (class: invoice)


DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: Analyze the provided document and create a compreh...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 223 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 175 words
INFO:idp_common.bedrock.client:Processed content with 1 cachepoint insertions
INFO:idp_common.bedrock.client:Bedrock request attempt 1/8:
DEBUG:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
DEBUG:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1, 'maxTokens': 4096}
DEBUG:idp_common.bedrock.client:  - system: [{'text': "You are a document summarization expert who can analyze and summarize documents from various domains including medical, financial, legal, and general business documents. Your task is to create a summary that captures the key information, main poi

Summarization for section 3 completed in 9.08 seconds

JSON Summary URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/sections/3/summary.json
Markdown Summary URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/sections/3/summary.md

JSON Summary Content:
Summary field found in JSON:
## Invoice Summary

- **Invoice Number:** 86239 [Cite-1, Page-5](<span title="Invoice No: 86239" class="citation">[Cite-1, Page-5]</span>)
- **Invoice Date:** 11/12/92 [Cite-2, Page-5](<span title="Invoice Date: 11/12/92" class="citation">[Cite-2, Page-5]</span>)
- **Ship Date:** 10/13/92 [Cite-3, P...

Markdown Summary Content (first 300 chars):
## Invoice Summary

- **Invoice Number:** 86239 [Cite-1, Page-5](<span title="Invoice No: 86239" class="citation">[Cite-1, Page-5]</span>)
- **Invoice Date:** 11/12/92 [Cite-2, Page-5](<span title="Invoice Date: 11/12/92" class="citation">[Cite-2, Page-5]</span>)
- **Ship Date:** 10/13/92 [Cite-3, P

## Invoice Summary

- **Invoice Number:** 86239 [Cite-1, Page-5](<span title="Invoice No: 86239" class="citation">[Cite-1, Page-5]</span>)
- **Invoice Date:** 11/12/92 [Cite-2, Page-5](<span title="Invoice Date: 11/12/92" class="citation">[Cite-2, Page-5]</span>)
- **Ship Date:** 10/13/92 [Cite-3, Page-5](<span title="Ship Date: 10/13/92" class="citation">[Cite-3, Page-5]</span>)
- **Bill To:** The Tobacco Institute, Attn: Anne Cannell, 1875 I Street, N.W., Washington DC 20006 [Cite-4, Page-5](<span title="BILL TO 20050 THE TOBACCO INSTITUTE ATTN: ANNE CANNELL 1875 I STREET, N.W. WASHINGTON DC 20006" class="citation">[Cite-4, Page-5]</span>)
- **Salesman:** Michael J McKillips [Cite-5, Page-5](<span title="Salesman MICHAEL J MCKILLIPS" class="citation">[Cite-5, Page-5]</span>)
- **Job Number:** 86239 [Cite-6, Page-5](<span title="Job Number: 86239" class="citation">[Cite-6, Page-5]</span>)
- **Terms:** NET 30 DAYS [Cite-7, Page-5](<span title="Terms: NET 30 DAYS" class="citation">[Cite-7, Page-5]</span>)

**Service Charge:** A service charge of 2% per month (24% per year) will be applied if payment is not received by the end of the first month after the invoice date [Cite-8, Page-5](<span title="A Service Charge of 2% per month (24% per year) will be charged if payment not received by end of first month after invoice date" class="citation">[Cite-8, Page-5]</span>).

### Items

- **Description:** TWO SIDED DECAL "IT'S THE LAW--UNDER 18" PRINTS 2/2, 5 1/2 x 7 1/2" [Cite-9, Page-5](<span title="DESCRIPTION TWO SIDED DECAL \"IT'S THE LAW--UNDER 18\" PRINTS 2/2, 5 1/2 x 7 1/2\"" class="citation">[Cite-9, Page-5]</span>)
- **Quantity:** 5000 [Cite-10, Page-5](<span title="QUANTITY 5000" class="citation">[Cite-10, Page-5]</span>)
- **Unit Price:** $5145.00 [Cite-11, Page-5](<span title="UNIT PRICE 5145.000" class="citation">[Cite-11, Page-5]</span>)
- **Amount:** $5145.00 [Cite-12, Page-5](<span title="AMOUNT 5145.00" class="citation">[Cite-12, Page-5]</span>)

### Totals

- **Sub Total:** $5145.00 [Cite-13, Page-5](<span title="SUB TOTAL 5145.00" class="citation">[Cite-13, Page-5]</span>)
- **Tax:** $308.70 [Cite-14, Page-5](<span title="TAX 308.70" class="citation">[Cite-14, Page-5]</span>)
- **Total Invoice:** $5453.70 [Cite-15, Page-5](<span title="TOTAL INVOICE 5453.70" class="citation">[Cite-15, Page-5]</span>)
- **Amount Due:** $5453.70 [Cite-16, Page-5](<span title="Invoice AMT DUE 5453.70" class="citation">[Cite-16, Page-5]</span>)
- **Less Deposit:** $(11000.00) [Cite-17, Page-5](<span title="LESS DEPOSIT (11000.00)" class="citation">[Cite-17, Page-5]</span>)
- **Credit Balance:** $5546.30 [Cite-18, Page-5](<span title="CREDIT BALANCE $5546.30" class="citation">[Cite-18, Page-5]</span>)

**Confidential Note:** Minnesota Tobacco Litigation [Cite-19, Page-5](<span title="CONFIDENTIAL: MINNESOTA TOBACCO LITIGATION" class="citation">[Cite-19, Page-5]</span>)


Summarization for first 3 sections complete.
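
Several of the summaries above embed citations as `<span title="..." class="citation">[Cite-N, Page-M]</span>` elements, where the `title` attribute holds the quoted source text. As a minimal sketch, the regular expression below pulls those citations out of a summary for inspection. It handles only the span form seen in this output (not the bare `[Cite-1, Page-1]` form in section 1), and `extract_citations` is a hypothetical helper, not part of `idp_common`.

```python
import re

# The title may contain backslash-escaped quotes, so allow escaped characters.
CITATION_RE = re.compile(
    r'<span title="((?:[^"\\]|\\.)*)" class="citation">'
    r'\[Cite-(\d+), Page-(\d+)\]</span>'
)


def extract_citations(markdown: str) -> list[dict]:
    """Return one dict per citation span found in a summary."""
    return [
        {"cite": int(cite), "page": int(page), "quote": quote}
        for quote, cite, page in CITATION_RE.findall(markdown)
    ]


sample = (
    '- **Invoice Number:** 86239 [Cite-1, Page-5]'
    '(<span title="Invoice No: 86239" class="citation">[Cite-1, Page-5]</span>)'
)
print(extract_citations(sample))
```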


## PART 2: Processing Document with Sections

In [9]:
document_with_sections = copy.deepcopy(document)

# Process the entire document using the section-based approach
start_time = time.time()
document_with_sections = summarization_service.process_document(
    document=document_with_sections,
    store_results=True
)
summarization_time = time.time() - start_time
print(f"Document summarization with sections completed in {summarization_time:.2f} seconds")

DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: Analyze the provided document and create a compreh...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 223 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 498 words
INFO:idp_common.bedrock.client:Processed content with 1 cachepoint insertions
INFO:idp_common.bedrock.client:Bedrock request attempt 1/8:
DEBUG:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
DEBUG:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1, 'maxTokens': 4096}
DEBUG:idp_common.bedrock.client:  - system: [{'text': "You are a document summarization expert who can analyze and summarize documents from various domains including medical, financial, legal, and general business documents. Your task is to create a summary that captures the key information, main poi

Document summarization with sections completed in 112.83 seconds
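
The cell above brackets the `process_document` call with two `time.time()` readings, and the same pattern recurs in Part 3. One way to avoid repeating it is a small context manager. This is a minimal sketch: the `Timer` class is a hypothetical helper, not part of `idp_common`, and it uses `time.perf_counter()`, the standard-library clock intended for measuring intervals.

```python
import time


class Timer:
    """Hypothetical helper: measure the wall-clock time of a `with` block."""

    def __init__(self, label):
        self.label = label
        self.elapsed = 0.0

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.label} completed in {self.elapsed:.2f} seconds")
        return False  # never swallow exceptions raised inside the block


# Stand-in workload; in the notebook this would be the
# summarization_service.process_document(...) call.
with Timer("Demo workload") as timer:
    total = sum(range(1000))
```

The elapsed time stays available as `timer.elapsed` after the block, so the two runs can be compared later without re-reading the printed output.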


In [10]:
# Print the combined summary report URI
if document_with_sections.summary_report_uri:
    print(f"\nCombined Summary Report URI: {document_with_sections.summary_report_uri}")
    
    # Try to get and display the markdown summary
    try:
        # Extract bucket and key from the s3 URI
        uri_parts = document_with_sections.summary_report_uri.replace("s3://", "").split("/", 1)
        bucket = uri_parts[0]
        key = uri_parts[1]
        
        # Use boto3 to get the object directly
        s3_client = boto3.client('s3')
        response = s3_client.get_object(Bucket=bucket, Key=key)
        markdown_content = response['Body'].read().decode('utf-8')
        
        # Display a preview of the summary
        print("\nSummary Preview (first 500 chars):")
        print((markdown_content[:500] + "...") if len(markdown_content) > 500 else markdown_content)
        
        # Display the full markdown summary in a rendered cell
        from IPython.display import Markdown, display
        print("\nFull Rendered Summary:")
        display(Markdown(markdown_content))
        
        # Also check if JSON summary exists
        json_key = key.replace("summary.md", "summary.json")
        try:
            json_response = s3_client.get_object(Bucket=bucket, Key=json_key)
            summary_json = json.loads(json_response['Body'].read().decode('utf-8'))
            # print("\nJSON Summary Structure:")
            # print(f"Keys: {list(summary_json.keys())}")
            
            # Check for section summaries
            if 'metadata' in summary_json and 'section_summaries' in summary_json['metadata']:
                print(f"\nSection Summaries: {list(summary_json['metadata']['section_summaries'].keys())}")
        except Exception as e:
            print(f"Note: JSON summary not found or couldn't be parsed: {e}")
            
    except Exception as e:
        print(f"Error retrieving summary: {e}")
else:
    print("No summary available")

# Check individual section summaries if available
# if document_with_sections.sections:
#     print("\nIndividual Section Summaries:")
#     for section in document_with_sections.sections:
#         if section.attributes and 'summary_md_uri' in section.attributes:
#             # print(f"Section {section.section_id} ({section.classification}) Summary: {section.attributes['summary_md_uri']}")
#             print(f"{section.attributes['summary_md_uri']}")



Combined Summary Report URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/summary/summary.md

Summary Preview (first 500 chars):
# Document Summary

## advertisement

## FTC to Begin Cigarette Testing

The Federal Trade Commission (FTC) has issued directions to commence the first formal test on cigarettes following the satisfactory completion of trial tests by the cigarette testing laboratory [Cite-1, Page-6].

### Test Conditions
- Smoke cigarettes to a 23 mm butt length, or to the length of the filter and overwrap plus 3 mm if in excess of 23 mm [Cite-2, Page-6].
- Base results on a test of 100 cigarettes per brand or t...

Full Rendered Summary:


# Document Summary

## advertisement

## FTC to Begin Cigarette Testing

The Federal Trade Commission (FTC) has issued directions to commence the first formal test on cigarettes following the satisfactory completion of trial tests by the cigarette testing laboratory [Cite-1, Page-6].

### Test Conditions
- Smoke cigarettes to a 23 mm butt length, or to the length of the filter and overwrap plus 3 mm if in excess of 23 mm [Cite-2, Page-6].
- Base results on a test of 100 cigarettes per brand or type [Cite-3, Page-6].
- Cigarettes will be selected on a random basis, as opposed to "weight selection" [Cite-4, Page-6].
- Determine particulate matter on a "dry" basis using the gas chromatography method by C. H. Sloan and B. J. Sublett, modified by F. J. Schultz and A. W. Spears to determine moisture content [Cite-5, Page-6].
- Determine and report the "tar" content after subtracting moisture and alkaloids (as nicotine) from particulate matter [Cite-6, Page-6].
- Report tar content to the nearest whole milligram and nicotine content to the nearest 1/10 milligrams [Cite-7, Page-6].

### Scope of Testing
The test will cover approximately 50 major brands and types, and all brands with tar or nicotine statements on labels or in advertising to verify the accuracy of such statements [Cite-8, Page-6]. Cigarettes will be purchased on the open market in 50 localities throughout the United States [Cite-9, Page-6].

### Basis for Procedures
The Commission relied on a record including written presentations and oral testimony from a public hearing on November 30, 1966 [Cite-10, Page-6]. The hearing aimed to determine what action should be taken regarding the Cambridge Filter Method and how test results should be expressed [Cite-11, Page-6]. The Commission emphasized the importance of a reasonable standardized method and presenting results in an understandable manner to the public [Cite-12, Page-6].


## invoice

## Invoice Summary

- **Invoice Number:** 86239 [Cite-1, Page-5](<span title="Invoice No: 86239" class="citation">[Cite-1, Page-5]</span>)
- **Invoice Date:** 11/12/92 [Cite-2, Page-5](<span title="Invoice Date: 11/12/92" class="citation">[Cite-2, Page-5]</span>)
- **Ship Date:** 10/13/92 [Cite-3, Page-5](<span title="Ship Date: 10/13/92" class="citation">[Cite-3, Page-5]</span>)
- **Bill To:** The Tobacco Institute, Attn: Anne Cannell, 1875 I Street, N.W., Washington DC 20006 [Cite-4, Page-5](<span title="BILL TO 20050 THE TOBACCO INSTITUTE ATTN: ANNE CANNELL 1875 I STREET, N.W. WASHINGTON DC 20006" class="citation">[Cite-4, Page-5]</span>)
- **Salesman:** Michael J McKillips [Cite-5, Page-5](<span title="Salesman MICHAEL J MCKILLIPS" class="citation">[Cite-5, Page-5]</span>)
- **Job Number:** 86239 [Cite-6, Page-5](<span title="Job Number: 86239" class="citation">[Cite-6, Page-5]</span>)
- **Terms:** NET 30 DAYS [Cite-7, Page-5](<span title="Terms: NET 30 DAYS" class="citation">[Cite-7, Page-5]</span>)

### Items

- **Description:** TWO SIDED DECAL "IT'S THE LAW--UNDER 18" PRINTS 2/2, 5 1/2 x 7 1/2" [Cite-8, Page-5](<span title="DESCRIPTION TWO SIDED DECAL \"IT'S THE LAW--UNDER 18\" PRINTS 2/2, 5 1/2 x 7 1/2\"" class="citation">[Cite-8, Page-5]</span>)
- **Quantity:** 5000 [Cite-9, Page-5](<span title="QUANTITY 5000" class="citation">[Cite-9, Page-5]</span>)
- **Unit Price:** $5145.00 [Cite-10, Page-5](<span title="UNIT PRICE 5145.000" class="citation">[Cite-10, Page-5]</span>)
- **Amount:** $5145.00 [Cite-11, Page-5](<span title="AMOUNT 5145.00" class="citation">[Cite-11, Page-5]</span>)

### Totals

- **Sub Total:** $5145.00 [Cite-12, Page-5](<span title="SUB TOTAL 5145.00" class="citation">[Cite-12, Page-5]</span>)
- **Tax:** $308.70 [Cite-13, Page-5](<span title="TAX 308.70" class="citation">[Cite-13, Page-5]</span>)
- **Total Invoice:** $5453.70 [Cite-14, Page-5](<span title="TOTAL INVOICE 5453.70" class="citation">[Cite-14, Page-5]</span>)
- **Amount Due:** $5453.70 [Cite-15, Page-5](<span title="AMT DUE 5453.70" class="citation">[Cite-15, Page-5]</span>)
- **Less Deposit:** $(11000.00) [Cite-16, Page-5](<span title="LESS DEPOSIT (11000.00)" class="citation">[Cite-16, Page-5]</span>)
- **Credit Balance:** $5546.30 [Cite-17, Page-5](<span title="CREDIT BALANCE $5546.30" class="citation">[Cite-17, Page-5]</span>)

### Notes
- A service charge of 2% per month (24% per year) will be charged if payment not received by end of first month after invoice date. [Cite-18, Page-5](<span title="A Service Charge of 2% per month (24% per year) will be charged if payment not received by end of first month after invoice date" class="citation">[Cite-18, Page-5]</span>)
- **Confidential:** Minnesota Tobacco Litigation [Cite-19, Page-5](<span title="CONFIDENTIAL: MINNESOTA TOBACCO LITIGATION" class="citation">[Cite-19, Page-5]</span>)


## memo

## Summary of Document

### Email Correspondence
- **From:** Kelahan, Ben
- **To:** TI New York, TI Minnesota
- **CC:** Ashley Bratich (MSMAIL)
- **Subject:** FW: Morning Team Notes 4/20
- **Date:** Saturday, April 18, 1998 2:09 PM

**Original Message:**
- **From:** Byron Nelson
- **Sent:** Friday, April 17, 1998 5:25 PM
- **To:** Multiple recipients including Judy Albert, Carolyn, Jackie Cohen, etc.
- **Subject:** Morning Team Notes 4/20

### Key Points from Email:

#### Falmouth, MA
- On 4/15, a warrant article for a 100 percent ban on smoking in restaurants was defeated 84-77. [<span title="town meeting representatives defeated by a 84-77 vote a warrant article calling for a 100 percent ban on smoking in restaurants" class="citation">[Cite-1, Page-3]</span>]
- A motion to reconsider was rejected 104-49 on 4/16. [<span title="a motion to reconsider the vote was soundly rejected 104-49." class="citation">[Cite-2, Page-3]</span>]
- The restaurant owner's alternative was not considered due to constitutional issues. [<span title="The restaurant owner's moderate alternative was not considered because the town counsel found the article to be unconstitutional" class="citation">[Cite-3, Page-3]</span>]

#### Waseca County, MN
- On 4/7, county commissioners tabled a new tobacco retailing ordinance. [<span title="the county commissioners once again tabled consideration of a new tobacco retailing ordinance" class="citation">[Cite-4, Page-3]</span>]
- Waseca is the 11th Minnesota community to delay the issue. [<span title="Waseca is the 11th Minnesota community to put the issue on hold." class="citation">[Cite-5, Page-3]</span>]

#### Wadena County, MN
- County commissioners tabled a new tobacco retailing ordinance until 4/23. [<span title="the county commissioners tabled consideration of a new tobacco retailing ordinance until 4/23" class="citation">[Cite-6, Page-3]</span>]
- They will consider a model ordinance mirroring state law. [<span title="they will take up a model ordinance that mirrors the state law" class="citation">[Cite-7, Page-3]</span>]
- Bob Fackler requests calls to retailers to attend. [<span title="Bob Fackler requests calls to retailers to alert them to attend" class="citation">[Cite-8, Page-3]</span>]

### Scientific Document

#### Mutation Assay
- **Objective:** Measure the ability of a test substance to induce mutation at the HGPRT locus in CHO cells. [<span title="To measure the ability of a test substance to induce mutation at the hypoxanthine gunnine prosphonloryl transferes (hgpat) lear in Chinese Hamster very (CHO) cells" class="citation">[Cite-9, Page-4]</span>]
- **Methods:** Refer to Standard Operating Procedure PH314. [<span title="Methods: Refer to Standard Operating Procedure PH314." class="citation">[Cite-10, Page-4]</span>]
- **Sponsor:** American Cyanamid Company
- **Test Chemical:** Algoral 40 LF
- **Description:** Clear liquid
- **Dates:**
  - Preliminary Cytotoxicity: Instituted 6/3/82
  - CHO/HGPRT forward gene Mutation Assay: Instituted 8/26/82
- **Cell Culture:**
  - CHO-KI-BHY cells received from Oak Ridge National Laboratories on 7/1/82
  - Routine subculture every Friday (a.m.) and Monday (p.m.)
  - Subcultured into various flasks with specified media
- **Treatment:**
  - CHO-KI-BHY cells treated on 7/23/82 and 8/23/82
  - Subcultured into T25 cm² flasks on 8/25/82 in preparation for treatment

#### Witness and Record
- **Witnessed & Understood by:** [Name Redacted]
- **Recorded & Edited by:** D. Good on 8/25/82


## questionnaire

## Customer Satisfaction Survey Summary

### Phone Call Evaluation
- **Courtesy and Politeness**: Customers were asked to rate the courtesy and politeness of the representative.
- **Knowledgeability**: Customers evaluated the representative's knowledge.
- **Request Handling**: Customers assessed whether their question or request was handled effectively.

### Overall Satisfaction with R. J. Reynolds' Response
- Customers selected one statement that best described their satisfaction with R. J. Reynolds' response to their request for assistance:
  - I was very satisfied
  - I was somewhat satisfied
  - I was neither satisfied nor dissatisfied
  - I was somewhat dissatisfied
  - I was very dissatisfied [<span title="I was very dissatisfied" class="citation">[Cite-1, Page-7]</span>]

### Future Purchase Intention
- Customers indicated their likelihood of continuing to purchase the brand of cigarettes they contacted about:
  - I Definitely Would
  - I Probably Would
  - I Might or Might Not
  - I Probably Would Not
  - I Definitely Would Not

### Recommendation Likelihood
- Customers were asked if they would recommend the brand to an adult smoker (21 years of age or older) who currently smokes a competitive brand:
  - I Definitely Would
  - I Probably Would
  - I Might or Might Not
  - I Probably Would Not
  - I Definitely Would Not

### Additional Information
- Document reference number: 52435 9399 [<span title="52435 9399" class="citation">[Cite-2, Page-7]</span>]


## resume

## Biographical Sketch and Curriculum Vitae Summary

### Mario Stevenson
- **Position Title**: Assistant Professor
- **Birthdate**: May 11, 1957 [<span title="Mario Stevenson, Assistant Professor, was born on May 11, 1957." class="citation">[Cite-1, Page-8]</span>]

#### Education
- **Glasgow College of Technology**, Glasgow, Scotland
  - **Degree**: B.Sc.
  - **Year**: 1979
  - **Field of Study**: Biochemistry [<span title="Mario Stevenson earned a B.Sc. in Biochemistry from Glasgow College of Technology in 1979." class="citation">[Cite-2, Page-8]</span>]
- **University of Strathclyde**, Glasgow, Scotland
  - **Degree**: Ph.D.
  - **Year**: 1984
  - **Field of Study**: Biochemistry [<span title="Mario Stevenson obtained a Ph.D. in Biochemistry from the University of Strathclyde in 1984." class="citation">[Cite-3, Page-8]</span>]

#### Research and Professional Experience
- **Research Associate**, Department of Pharmacology, University of Strathclyde, Glasgow, Scotland (8/80 - 5/84)
- **Research Fellow**, Department of Pathology & Microbiology, University of Nebraska Medical Center, Omaha, Nebraska (10/84 - 5/86)
- **Instructor**, Department of Pathology & Microbiology, University of Nebraska Medical Center, Omaha, Nebraska (6/86 - 6/87)
- **Assistant Professor**, Department of Pathology & Microbiology, University of Nebraska Medical Center, Omaha, Nebraska (7/87 - Present) [<span title="Mario Stevenson has been an Assistant Professor at the University of Nebraska Medical Center since July 1987." class="citation">[Cite-4, Page-8]</span>]

#### Honors
- British Pharmaceutical Association Travel Award
- Glasgow College of Technology, Bachelor of Science with Honors [<span title="Mario Stevenson received the British Pharmaceutical Association Travel Award and graduated with Honors from Glasgow College of Technology." class="citation">[Cite-5, Page-8]</span>]

#### Publications
1. Stevenson, M., Baillie, A.J., and Richards, R.M.E. Enhanced activity of Streptomycin and Chloramphenicol against intracellular E. coli in the J774 macrophage cell line mediated by liposome delivery. *Antimicrob. Agents. Chemother.* 24:742-749, 1985.
2. Stevenson, M., Baillie, A.J., and Richards, R.M.E. An in-vitro model of intracellular bacterial infection using the murine macrophage cell line J774.2. *J. Pharm. Pharmacol.* 36:90-94, 1984.
3. Stevenson, M., Baillie, A.J., and Richards, R.M.E. Quantification of liposomal uptake in J774 macrophages -- a flow cytometric study. *J. Pharm. Pharmacol.* 38:120-126, 1985.
4. Volsky, D.J., Wu. Y.T., Stevenson, M., Sinangil, F., Merino, F., Rodriguez, L., and Godoy, G. Antibodies to HTLV-III/LAV in Venezuelan patients with acute malaria infection (P. falciparum and P. vivax). *New England J. Med.*, 314, #10:647-648, 1986.
5. Shapiro. I., Stevenson, M., Sinangil, F., and Volsky, D.J. Transfection of lymphoblastoid cells: expression of co-transfected DNA and selection of transfected cell lines. *Somatic cell & Mol. Genetics*, 12:351-356, 1986. [<span title="List of publications by Mario Stevenson." class="citation">[Cite-6, Page-8]</span>]

### Moshe Kalina
- **Birthdate**: January 21, 1938 [<span title="Moshe Kalina was born on January 21, 1938." class="citation">[Cite-7, Page-9]</span>]

#### Education Background
- **Hebrew University of Jerusalem** (1958-1961): Agriculture, B.So
- **Hebrew University of Jerusalem** (1961-1964): Biochemistry, M.So
- **Univ. of London, King's College** (1964-1967): Cytochemistry, Ph.D. [<span title="Moshe Kalina's educational background includes degrees in Agriculture, Biochemistry, and Cytochemistry." class="citation">[Cite-8, Page-9]</span>]

#### Employment
- **Dept. of Histology, Tel Aviv University** (1978-present): The surfactant system, Assoc Prof.
- **Dept. of Histology & Cell Biology** (1972-1978): Cytotoxic lymphocyte, Sen Lectur.
- **Dept. of Histology & Cell Biology** (1967-1978): Histochemistry, Lecturer
- **National Jewish Hospital, Denver** (1990-1991): The surfactant system, Visit.Ass
- **Postgraduate Medical School London** (1984-1985): Immunory tochem-endocris system, Vist Ass.
- **Dept. of Anatomy, UCLA** (1976-1977): The surfactant system, Vist Ass. Prof
- **Johns Hopkins University** (1972-1973): E.M. cytochenistry, Vist.Ass. Prof [<span title="Moshe Kalina's employment history includes various positions in Histology, Cell Biology, and Cytochemistry." class="citation">[Cite-9, Page-9]</span>]

#### Major Research Interest
- The surfactant system: cell biological approach [<span title="Moshe Kalina's major research interest is the surfactant system from a cell biological approach." class="citation">[Cite-10, Page-9]</span>]


## memo

## MEMORANDUM

- **To:** Howard Goldfrach, Kathy Leiber
- **Date:** April 3, 1987
- **From:** Mel Fallis
- **Re:** Black Consumer Market Promotion Development

### Summary

- **Project Development Timetable:** Attached is the proposed project development timetable for the Situation Analysis and Campaign Development phases for Benson & Hedges and Virginia Slims. [Cite-1, Page-10](<span title="Attached you will find the proposed project development timetable for the Situation Analysis and Campaign Development phases for Benson & Hedges and Virginia Slims." class="citation">[Cite-1, Page-10]</span>)
- **Scheduled Meetings:** 
  - April 9th with Howard at 1:30 PM
  - April 9th with Kathy at 3 PM
- **Focus:** Concentration will be on subjects for review and evaluation identified for April. [Cite-2, Page-10](<span title="At the scheduled meetings on April 9th with Howard at 1:30PM and Kathy at 3PM, concentration will be placed on the subjects for review and evaluation identified for April." class="citation">[Cite-2, Page-10]</span>)
- **Request for Information:** Interested in obtaining copies of available information about the industry, category, consumer dynamics, and promotion activities. [Cite-3, Page-10](<span title="I am very interested in obtaining copies of as much information as available about the industry, category, consumer dynamics and promotion activities." class="citation">[Cite-3, Page-10]</span>)
- **Contact Information:** If there are questions prior to the meeting, please call. [Cite-4, Page-10](<span title="If there are questions prior to our meeting, please don't hesitate to call." class="citation">[Cite-4, Page-10]</span>)
- **CC:** Ellen Merlo, Terry Fraser, Emmie LaBauve
- **Address:** 475 10th AVENUE . SUITE 800 . NEW YORK, N.Y. 10018 . (212) 564-8588 [Cite-5, Page-10](<span title="475 10th AVENUE . SUITE 800 . NEW YORK, N.Y. 10018 . (212) 564-8588" class="citation">[Cite-5, Page-10]</span>)


## letter

## WESTERN DARK FIRED TOBACCO GROWERS' ASSOCIATION

**Address:**
- 206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056

**Contact:**
- Phone: (502) 753-3341
- FAX: (502) 753-0069/3342

**Date:** October 31, 1995

**Recipient:**
- The Honorable Wendell H. Ford
- United States Senate
- Washington, D.C. 20510

**Summary:**

The Western Dark Fired Tobacco Growers' Association, representing 9,000 tobacco producers, expresses strong opposition to the "Commitment to our Children" petition circulated by several Members of Congress. The Association argues against additional government bureaucracy and FDA regulation of tobacco, emphasizing the existing age restriction laws and the need for better enforcement rather than more regulations. They warn that FDA regulation would create inefficient federal government regulations, impacting family farms and adult freedom of choice. The Association urges Senator Ford to prevent FDA from becoming a federal tobacco regulator. [Cite-1, Page-1](<span title="On behalf of the Western Dark Fired Tobacco Growers' Association and the 9,000 tobacco producers it represents, I an obligated to convey our strong opposition to the \"Commitment to our Children' petition being circulated by several Members of Congress." class="citation">)

---

## LAB SERVICES CONSISTENCY REPORT

**Date:** 2/28/93
**Technician:** CC
**Shift:** A
**Trial:** 8
**Line:** 2
**Area:** 52
**Product Unit Code:** 0728
**Sample ID:** stuff box 2
**Reason for Request:** test

**Summary:**

The report details lab services consistency for a specific trial, including sample weights, dilution factors, and consistency percentages. The average consistency is noted as 3.68, with individual sample consistencies ranging from 3.366 to 3.77. [Cite-2, Page-2](<span title="AVERAGE 3.68 CONSISTENCY SAMPLE #1 10 SAMPLE #7 // 2645 2853 84 568 79.235 3.366 3123 3.62 3.64 15 SAMPLE $ 12 2847 89.776 3.411 3.77" class="citation">)



Execution time: 112.44 seconds


Section Summaries: ['advertisement_4', 'invoice_3', 'memo_2', 'questionnaire_5', 'resume_6', 'memo_7', 'letter_1']
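
The retrieval cell splits the report URI into bucket and key by string replacement, then derives the JSON key from the markdown key. The same two steps can be written as small helpers that validate their input. This is a sketch under stated assumptions: `parse_s3_uri` and `sibling_json_key` are hypothetical names, not part of `idp_common` or boto3, and the bucket name in the example is a placeholder.

```python
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split 's3://bucket/key/with/slashes' into (bucket, key).

    Caveat: urlparse treats '?' and '#' as query/fragment separators,
    so this sketch is only suitable for keys without those characters.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def sibling_json_key(markdown_key):
    """Swap a trailing '.md' for '.json', leaving the rest of the key alone."""
    if not markdown_key.endswith(".md"):
        raise ValueError(f"Expected a .md key: {markdown_key!r}")
    return markdown_key[: -len(".md")] + ".json"


# Placeholder URI shaped like the one printed by the notebook.
bucket, key = parse_s3_uri("s3://example-bucket/sample.pdf/summary/summary.md")
json_key = sibling_json_key(key)
```

Unlike `key.replace("summary.md", "summary.json")`, the suffix swap only touches the end of the key, so a document whose name happens to contain `summary.md` would not be rewritten in the middle of the path.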


## PART 3: Processing Document without Sections

In [11]:
# Create a copy of the document without sections to demonstrate the fallback approach
document_without_sections = copy.deepcopy(document)
document_without_sections.sections = []  # Remove all sections

# Process the document without sections (should use the fallback approach)
start_time = time.time()
document_without_sections = summarization_service.process_document(
    document=document_without_sections,
    store_results=True
)
summarization_time = time.time() - start_time
print(f"Document summarization without sections completed in {summarization_time:.2f} seconds")

# Print the summary report URI
if document_without_sections.summary_report_uri:
    print(f"\nWhole Document Summary Report URI: {document_without_sections.summary_report_uri}")
    
    # Try to get and display the markdown summary
    try:
        # Extract bucket and key from the s3 URI
        uri_parts = document_without_sections.summary_report_uri.replace("s3://", "").split("/", 1)
        bucket = uri_parts[0]
        key = uri_parts[1]
        
        # Use boto3 to get the object directly
        s3_client = boto3.client('s3')
        response = s3_client.get_object(Bucket=bucket, Key=key)
        summary_md = response['Body'].read().decode('utf-8')
        
        print("\nWhole Document Summary (first 500 chars):")
        print((summary_md[:500] + "...") if len(summary_md) > 500 else summary_md)
    except Exception as e:
        print(f"Error retrieving whole document summary: {e}")
else:
    print("No whole document summary available")

DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: Analyze the provided document and create a compreh...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 223 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 2474 words
INFO:idp_common.bedrock.client:Processed content with 1 cachepoint insertions
INFO:idp_common.bedrock.client:Bedrock request attempt 1/8:
DEBUG:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
DEBUG:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1, 'maxTokens': 4096}
DEBUG:idp_common.bedrock.client:  - system: [{'text': "You are a document summarization expert who can analyze and summarize documents from various domains including medical, financial, legal, and general business documents. Your task is to create a summary that captures the key information, main po

Document summarization without sections completed in 19.97 seconds

Whole Document Summary Report URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-05-22_17-42-33.pdf/summary/summary.md

Whole Document Summary (first 500 chars):
## WESTERN DARK FIRED TOBACCO GROWERS' ASSOCIATION

**Address:** 206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056
**Contact:** (502) 753-3341 FAX (502) 753-0069/3342

**Date:** October 31, 1995

**Recipient:** The Honorable Wendell H. Ford, United States Senate, Washington, D.C. 20510

**Summary:**
- The Western Dark Fired Tobacco Growers' Association, representing 9,000 tobacco producers, opposes the "Commitment to our Children" petition.
- They argue against additional government bure...
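
The report URI printed in this part is identical to the one printed in Part 2. That suggests, though the notebook does not state it, that the whole-document run writes to the same S3 key as the section-based run. If both versions are needed for a side-by-side comparison, one option is to save each fetched summary under a distinct local name right after retrieving it. The sketch below assumes that approach; `snapshot_summary` and its file naming are hypothetical and not part of `idp_common`.

```python
from pathlib import Path
import tempfile


def snapshot_summary(markdown_text, label, out_dir):
    """Write one run's summary to '<out_dir>/summary_<label>.md' and return the path."""
    out_path = Path(out_dir) / f"summary_{label}.md"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(markdown_text, encoding="utf-8")
    return out_path


# Stand-in strings; in the notebook these would be the `markdown_content`
# and `summary_md` values read back from S3.
out_dir = tempfile.mkdtemp()
with_sections_path = snapshot_summary("# Document Summary\n", "with_sections", out_dir)
without_sections_path = snapshot_summary("## Whole document\n", "without_sections", out_dir)
```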
