# Holistic Packet Classification with IDP Common Package

This notebook demonstrates how to use the holistic packet classification capability of the IDP Common Package to classify multi-document packets, where each document may span multiple pages. The holistic approach examines the packet as a whole to identify the boundaries between the different document types it contains.

**Key Benefits of Holistic Packet Classification:**
1. Properly handles multi-page documents within a packet
2. Detects logical document boundaries
3. Identifies document types in the context of the whole packet
4. Handles documents where individual pages may not be clearly classifiable on their own
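
To make the idea of document boundaries concrete, here is a minimal sketch of how a list of boundary segments maps onto page-level labels. The `Segment` class and `page_labels` helper are hypothetical names used only for illustration; they are not part of `idp_common`, and the real service may represent boundaries differently.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical boundary record: one logical document inside a packet."""
    doc_type: str
    first_page: int
    last_page: int  # inclusive


def page_labels(segments):
    """Expand segments into a {page_number: doc_type} mapping."""
    labels = {}
    for seg in segments:
        for page in range(seg.first_page, seg.last_page + 1):
            if page in labels:
                raise ValueError(f"page {page} is assigned to two segments")
            labels[page] = seg.doc_type
    return labels


# Two adjacent segments of the same type stay separate documents.
segments = [
    Segment("letter", 1, 2),
    Segment("invoice", 3, 3),
    Segment("invoice", 4, 5),
]
labels = page_labels(segments)
print(labels)
```

Keeping the segments rather than only the page labels is what lets two consecutive documents of the same type remain distinct sections.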

The notebook demonstrates how to process a document with:

1. OCR Service - Convert a PDF document to text using Amazon Textract
2. Classification Service - Classify document pages into sections with Bedrock, using the text-based holistic method (`textbasedHolisticClassification`)
3. Summarization Service - Summarize individual sections and the whole document

Each step uses the unified Document object model for data flow and consistency.

> **Note**: This notebook uses AWS services including S3, Textract, and Bedrock. You need valid AWS credentials with appropriate permissions to run this notebook.

## 1. Install Dependencies

The IDP common package supports granular installation through extras. You can install:
- `[core]` - Just core functionality 
- `[ocr]` - OCR service with Textract dependencies
- `[classification]` - Classification service dependencies
- `[extraction]` - Extraction service dependencies
- `[evaluation]` - Evaluation service dependencies
- `[all]` - All of the above
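
After installing a subset of extras, it can be useful to check which service modules actually import in the current kernel. The sketch below is one way to do that with the standard library; `can_import` is a hypothetical helper, and the module names are taken from the imports used later in this notebook.

```python
import importlib


def can_import(module_name):
    """Return True if the module (and the dependencies it imports) loads."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


# Module names assumed from the imports used later in this notebook.
for name in ["idp_common.ocr", "idp_common.classification", "idp_common.summarization"]:
    print(f"{name}: {'available' if can_import(name) else 'missing'}")
```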

In [1]:
# First uninstall existing package (to ensure we get the latest version)
%pip uninstall -y idp_common

# Install the IDP common package with all components in development mode
%pip install -q -e "../lib/idp_common_pkg[all]"

# Note: We can also install specific components like:
# %pip install -q -e "../lib/idp_common_pkg[ocr,classification,extraction,evaluation]"

# Check installed version
%pip show idp_common | grep -E "Version|Location"

Found existing installation: idp_common 0.2.19
Uninstalling idp_common-0.2.19:
  Successfully uninstalled idp_common-0.2.19
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Version: 0.2.19
Location: /Users/miislamg/miniconda3/envs/genaiic-idp-accelerator/lib/python3.13/site-packages
Note: you may need to restart the kernel to use updated packages.


## 2. Import Libraries and Set Up Environment

In [2]:
import os
import json
import time
import boto3
import logging
import datetime
import copy

# Import base libraries
from idp_common.models import Document, Status, Section, Page
from idp_common import ocr, classification, extraction, evaluation, summarization
from idp_common import s3
from dotenv import load_dotenv
load_dotenv()

# Configure logging 
logging.basicConfig(level=logging.WARNING)  # Set root logger to WARNING (less verbose)
logging.getLogger('idp_common.ocr.service').setLevel(logging.INFO)  # Focus on service logs
logging.getLogger('textractor').setLevel(logging.WARNING)  # Suppress textractor logs
logging.getLogger('idp_common.evaluation.service').setLevel(logging.DEBUG)  # Enable evaluation logs

# Set environment variables
os.environ['METRIC_NAMESPACE'] = 'IDP-Notebook-Example'
os.environ['AWS_REGION'] = boto3.session.Session().region_name or 'us-east-1'

# Get AWS account ID for unique bucket names
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
region = os.environ['AWS_REGION']

# Define sample PDF path 
SAMPLE_PDF_PATH = "../samples/rvl_cdip_package.pdf"

# Create unique bucket names based on account ID and region
input_bucket_name = f"idp-notebook-input-{account_id}-{region}"
output_bucket_name = f"idp-notebook-output-{account_id}-{region}"

# Helper function to parse S3 URIs
def parse_s3_uri(uri):
    parts = uri.replace("s3://", "").split("/")
    bucket = parts[0]
    key = "/".join(parts[1:])
    return bucket, key

# Helper function to load JSON from S3 (relies on the s3_client created in step 3)
def load_json_from_s3(uri):
    bucket, key = parse_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')
    return json.loads(content)

print("Environment setup:")
print(f"METRIC_NAMESPACE: {os.environ.get('METRIC_NAMESPACE')}")
print(f"AWS_REGION: {os.environ.get('AWS_REGION')}")
print(f"Input bucket: {input_bucket_name}")
print(f"Output bucket: {output_bucket_name}")
print(f"SAMPLE_PDF_PATH: {SAMPLE_PDF_PATH}")

Environment setup:
METRIC_NAMESPACE: IDP-Notebook-Example
AWS_REGION: us-east-1
Input bucket: idp-notebook-input-195275636621-us-east-1
Output bucket: idp-notebook-output-195275636621-us-east-1
SAMPLE_PDF_PATH: ../samples/rvl_cdip_package.pdf
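
The `parse_s3_uri` helper above accepts any string. As an alternative sketch, the version below uses `urllib.parse` and rejects anything that is not an `s3://` URI. The name `parse_s3_uri_strict` is hypothetical, chosen so it does not replace the notebook's helper. One caveat: keys containing `?` or `#` would be split off by `urlparse`, so this is only suitable for simple keys like the ones used here.

```python
from urllib.parse import urlparse


def parse_s3_uri_strict(uri):
    """Split an s3:// URI into (bucket, key), raising on malformed input."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = parse_s3_uri_strict("s3://example-bucket/pages/1/result.json")
print(bucket, key)
```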


## 3. Set Up S3 Buckets and Upload Sample File

In [3]:
# Create S3 client
s3_client = boto3.client('s3')

# Function to create a bucket if it doesn't exist
def ensure_bucket_exists(bucket_name):
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket {bucket_name} already exists")
    except Exception:
        try:
            if region == 'us-east-1':
                s3_client.create_bucket(Bucket=bucket_name)
            else:
                s3_client.create_bucket(
                    Bucket=bucket_name,
                    CreateBucketConfiguration={'LocationConstraint': region}
                )
            print(f"Created bucket: {bucket_name}")
            
            # Wait for bucket to be accessible
            waiter = s3_client.get_waiter('bucket_exists')
            waiter.wait(Bucket=bucket_name)
        except Exception as e:
            print(f"Error creating bucket {bucket_name}: {str(e)}")
            raise

# Ensure both buckets exist
ensure_bucket_exists(input_bucket_name)
ensure_bucket_exists(output_bucket_name)

# Upload the sample file to S3
sample_file_key = "sample-" + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + ".pdf"
with open(SAMPLE_PDF_PATH, 'rb') as file_data:
    s3_client.upload_fileobj(file_data, input_bucket_name, sample_file_key)

print(f"Uploaded sample file to: s3://{input_bucket_name}/{sample_file_key}")

Bucket idp-notebook-input-195275636621-us-east-1 already exists
Bucket idp-notebook-output-195275636621-us-east-1 already exists
Uploaded sample file to: s3://idp-notebook-input-195275636621-us-east-1/sample-2025-04-25_04-13-33.pdf
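
`ensure_bucket_exists` treats every `head_bucket` failure as "bucket missing", including permission errors. A possible refinement is to inspect the error code first. The sketch below assumes the exception carries a boto3-style `response["Error"]["Code"]` (as `botocore.exceptions.ClientError` does) and that a missing bucket is reported with code `"404"`; `bucket_is_missing` and `FakeClientError` are hypothetical names used only for this illustration.

```python
def bucket_is_missing(error):
    """Return True only when the error looks like a 'bucket not found' response."""
    response = getattr(error, "response", None) or {}
    code = str(response.get("Error", {}).get("Code", ""))
    return code in {"404", "NoSuchBucket", "NotFound"}


class FakeClientError(Exception):
    """Stand-in with the same .response shape as a boto3 ClientError."""

    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


print(bucket_is_missing(FakeClientError("404")))  # missing bucket: safe to create
print(bucket_is_missing(FakeClientError("403")))  # access denied: do not try to create
```

In the notebook, the `except` branch could call this check and re-raise when it returns `False`, so a permissions problem is not mistaken for a missing bucket.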


## 4. Set Up Configuration

In [4]:
with open("./config.json", 'r') as file:
    CONFIG = json.load(file)
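
Because the classification method is read from the configuration, it can help to fail early if the key is missing. The following is a small sketch, assuming only the `classification.classificationMethod` key that this notebook reads later; `get_classification_method` and `example_config` are hypothetical and stand in for the real `CONFIG`.

```python
def get_classification_method(config):
    """Return the configured classification method or raise a clear error."""
    method = config.get("classification", {}).get("classificationMethod")
    if not method:
        raise KeyError("config is missing classification.classificationMethod")
    return method


# Illustrative config fragment; the real config.json contains more settings.
example_config = {
    "classification": {"classificationMethod": "textbasedHolisticClassification"}
}
method = get_classification_method(example_config)
print(method)
```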

## 5. Process Document with OCR

In [5]:
# Initialize a new Document
document = Document(
    id="rvl-cdip-package",
    input_bucket=input_bucket_name,
    input_key=sample_file_key,
    output_bucket=output_bucket_name,
    status=Status.QUEUED
)

print(f"Created document with ID: {document.id}")
print(f"Status: {document.status.value}")

# Create OCR service with Textract
# Valid features are 'LAYOUT', 'FORMS', 'SIGNATURES', 'TABLES' (uses analyze_document API)
# or leave it empty (to use basic detect_document_text API)
ocr_service = ocr.OcrService(
    region=region,
    enhanced_features=['LAYOUT']
)

# Process document with OCR
print("\nProcessing document with OCR...")
start_time = time.time()
document = ocr_service.process_document(document)
ocr_time = time.time() - start_time

print(f"OCR processing completed in {ocr_time:.2f} seconds")
print(f"Document status: {document.status.value}")
print(f"Number of pages processed: {document.num_pages}")

# Show pages information
print("\nProcessed pages:")
for page_id, page in document.pages.items():
    print(f"Page {page_id}:")
    print(f"  Image URI: {page.image_uri}")
    print(f"  Raw Text URI: {page.raw_text_uri}")
    print(f"  Parsed Text URI: {page.parsed_text_uri}")
print("\nMetering:")
print(json.dumps(document.metering))

INFO:idp_common.ocr.service:OCR Service initialized with features: ['LAYOUT']


Created document with ID: rvl-cdip-package
Status: QUEUED

Processing document with OCR...


INFO:idp_common.ocr.service:Successfully extracted markdown text for page 10
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 9
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 3
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 7
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 2
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 1
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 5
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 6
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 8
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 4
INFO:idp_common.ocr.service:Sorting 10 pages by page number
INFO:idp_common.ocr.service:OCR processing completed in 7.60 seconds
INFO:idp_common.ocr.service:Processed 10 pages, with 0 errors


OCR processing completed in 7.60 seconds
Document status: OCR_COMPLETED
Number of pages processed: 10

Processed pages:
Page 1:
  Image URI: s3://idp-notebook-output-195275636621-us-east-1/sample-2025-04-25_04-13-33.pdf/pages/1/image.jpg
  Raw Text URI: s3://idp-notebook-output-195275636621-us-east-1/sample-2025-04-25_04-13-33.pdf/pages/1/rawText.json
  Parsed Text URI: s3://idp-notebook-output-195275636621-us-east-1/sample-2025-04-25_04-13-33.pdf/pages/1/result.json
Page 2:
  Image URI: s3://idp-notebook-output-195275636621-us-east-1/sample-2025-04-25_04-13-33.pdf/pages/2/image.jpg
  Raw Text URI: s3://idp-notebook-output-195275636621-us-east-1/sample-2025-04-25_04-13-33.pdf/pages/2/rawText.json
  Parsed Text URI: s3://idp-notebook-output-195275636621-us-east-1/sample-2025-04-25_04-13-33.pdf/pages/2/result.json
Page 3:
  Image URI: s3://idp-notebook-output-195275636621-us-east-1/sample-2025-04-25_04-13-33.pdf/pages/3/image.jpg
  Raw Text URI: s3://idp-notebook-output-195275636621-us-e

## 6. Classify the Document

In [6]:
# Verify that Config specifies => "classificationMethod": "textbasedHolisticClassification"
print("*****************************************************************")
print(f'CONFIG classificationMethod: {CONFIG["classification"].get("classificationMethod")}')
print("*****************************************************************")

# Create classification service with Bedrock backend
# The classification method is set in the config
classification_service = classification.ClassificationService(
    config=CONFIG, 
    backend="bedrock" 
)

# Classify the document
print("\nClassifying document...")
start_time = time.time()
document = classification_service.classify_document(document)
classification_time = time.time() - start_time
print(f"Classification completed in {classification_time:.2f} seconds")
print(f"Document status: {document.status.value}")

*****************************************************************
CONFIG classificationMethod: textbasedHolisticClassification
*****************************************************************

Classifying document...
Classification completed in 6.55 seconds
Document status: CLASSIFIED
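
Each step in this notebook repeats the same `start_time = time.time()` pattern. As an optional convenience, a small context manager can collect the timings in one place. This is a sketch using only the standard library; `timed` and `timings` are hypothetical names.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("example_step", timings):
    time.sleep(0.01)  # placeholder for a service call

print(f"example_step completed in {timings['example_step']:.2f} seconds")
```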


### Show classification results

In [7]:
if document.sections:
    print("\nDetected sections:")
    for section in document.sections:
        print(f"Section {section.section_id}: {section.classification}")
        print(f"  Pages: {section.page_ids}")
else:
    print("\nNo sections detected")

# Show page classification
print("\nPage-level classifications:")
for page_id, page in sorted(document.pages.items()):
    print(f"Page {page_id}: {page.classification}")


Detected sections:
Section 1: letter
  Pages: ['1']
Section 2: form
  Pages: ['2']
Section 3: email
  Pages: ['3']
Section 4: scientific_publication
  Pages: ['4']
Section 5: invoice
  Pages: ['5']
Section 6: advertisement
  Pages: ['6']
Section 7: questionnaire
  Pages: ['7']
Section 8: resume
  Pages: ['8']
Section 9: resume
  Pages: ['9']
Section 10: memo
  Pages: ['10']

Page-level classifications:
Page 1: letter
Page 10: memo
Page 2: form
Page 3: email
Page 4: scientific_publication
Page 5: invoice
Page 6: advertisement
Page 7: questionnaire
Page 8: resume
Page 9: resume
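
Note that page 10 is listed directly after page 1 above: page IDs are strings, so `sorted()` orders them lexicographically. If numeric order is preferred, a sort key can convert the IDs to integers. The sketch below assumes page IDs are usually numeric strings and sorts any non-numeric IDs last; `page_sort_key` is a hypothetical helper.

```python
def page_sort_key(item):
    """Sort numeric page IDs by value; place non-numeric IDs after them."""
    page_id, _ = item
    if page_id.isdigit():
        return (0, int(page_id), "")
    return (1, 0, page_id)


# Small stand-in for document.pages, keyed by string page IDs.
pages = {"1": "letter", "10": "memo", "2": "form"}
ordered = [page_id for page_id, _ in sorted(pages.items(), key=page_sort_key)]
print(ordered)
```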


## 7. Summarize the Document

### Part 1: Processing Individual Sections

In [8]:
summarization_service = summarization.SummarizationService(config=CONFIG)

print("=== PART 1: Processing Individual Sections ===")
n = 3  # Only process first 3 sections to save time
# Process each section directly using the section_id
for section in document.sections[:n]:  
    print(f"\nProcessing section {section.section_id} (class: {section.classification})")
    
    # Process section directly with the original document
    start_time = time.time()
    document = summarization_service.process_document_section(
        document=document,
        section_id=section.section_id
    )
    summarization_time = time.time() - start_time
    print(f"Summarization for section {section.section_id} completed in {summarization_time:.2f} seconds")
    
    # Print the summary content if available
    if section.attributes and 'summary_uri' in section.attributes:
        summary_uri = section.attributes['summary_uri']
        try:
            # Get the summary content from S3
            summary_content = s3.get_json_content(summary_uri)
            print("\nSummary Content:")
            
            # Check if there's a specific summary field in the content
            if isinstance(summary_content, dict):
                if 'summary' in summary_content:
                    print(summary_content['summary'])
                elif 'content' in summary_content:
                    print(summary_content['content'])
                else:
                    # Print the whole content if no specific summary field
                    print(json.dumps(summary_content, indent=2))
            else:
                print(summary_content)
        except Exception as e:
            print(f"Error retrieving summary: {e}")
    else:
        print("No summary available for this section")
    
print(f"\nSummarization for first {n} sections complete.")

=== PART 1: Processing Individual Sections ===

Processing section 1 (class: letter)
Summarization for section 1 completed in 8.58 seconds

Summary Content:
### Western Dark Fired Tobacco Growers' Association Letter to Senator Ford

**Date:** October 31, 1995 [Cite-1, Page-1]

**From:** Will E. Clark, General Manager, Western Dark Fired Tobacco Growers' Association

**To:** The Honorable Wendell H. Ford, United States Senate

**Address:** 206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056

**Contact:** (502) 753-3341 FAX (502) 753-0069/3342

---

**Summary:**

- The Western Dark Fired Tobacco Growers' Association, representing 9,000 tobacco producers, expresses strong opposition to the "Commitment to our Children" petition circulated by several Members of Congress [Cite-2, Page-1].
- The association argues that age restriction laws are already in place across all states and emphasizes the need for better enforcement rather than additional bureaucracy [Cite-3, Page-1].
- They war
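
The summary text contains citation markers such as `[Cite-2, Page-1]`. If those need to be collected, for example to check which pages a summary draws on, a regular expression is enough. This sketch assumes the marker format seen in the output above and nothing more; `extract_citations` is a hypothetical helper.

```python
import re

# Pattern inferred from the markers visible in the summary output.
CITATION_PATTERN = re.compile(r"\[Cite-(\d+),\s*Page-(\d+)\]")


def extract_citations(text):
    """Return (citation_number, page_number) pairs in order of appearance."""
    return [(int(cite), int(page)) for cite, page in CITATION_PATTERN.findall(text)]


sample = (
    "The association opposes the petition [Cite-2, Page-1]. "
    "It asks for better enforcement [Cite-3, Page-1]."
)
citations = extract_citations(sample)
print(citations)
```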

### Part 2: Processing Document with Sections

In [10]:
document_with_sections = copy.deepcopy(document)

# Process the entire document using the section-based approach
start_time = time.time()
document_with_sections = summarization_service.process_document(
    document=document_with_sections,
    store_results=True
)
summarization_time = time.time() - start_time
print(f"Document summarization with sections completed in {summarization_time:.2f} seconds")

Document summarization with sections completed in 103.50 seconds


In [12]:
# Print the combined summary report URI
if document_with_sections.summary_report_uri:
    print(f"\nCombined Summary Report URI: {document_with_sections.summary_report_uri}")
    
    # Try to get and display the markdown summary from the JSON file
    try:
        # Extract bucket and key from the s3 URI
        uri_parts = document_with_sections.summary_report_uri.replace("s3://", "").split("/", 1)
        bucket = uri_parts[0]
        key = uri_parts[1]
        
        # Use boto3 to get the object directly
        s3_client = boto3.client('s3')
        response = s3_client.get_object(Bucket=bucket, Key=key)
        summary_json = json.loads(response['Body'].read().decode('utf-8'))
        print(summary_json)
        
        # Extract the markdown summary from the JSON
        summary_md = summary_json.get('summary', '')
        
        # Display a preview of the summary
        print("\nSummary Preview (first 500 chars):")
        print(summary_md[:500] + "..." if len(summary_md) > 500 else summary_md)
        
        # Display the full markdown summary in a rendered cell
        from IPython.display import Markdown, display
        print("\nFull Rendered Summary:")
        display(Markdown(summary_md))
    except Exception as e:
        print(f"Error retrieving summary: {e}")
else:
    print("No summary available")


Combined Summary Report URI: s3://idp-notebook-output-195275636621-us-east-1/sample-2025-04-25_04-13-33.pdf/summary/summary.json
{'letter': {'summary': '## Summary\n\n- **Organization**: Western Dark Fired Tobacco Growers\' Association\n- **Address**: 206 Maple Street, P.O. Box 1056, Murray, Kentucky 42071-1056\n- **Contact**: (502) 753-3341, FAX (502) 753-0069/3342\n- **Date**: October 31, 1995\n- **Recipient**: The Honorable Wendell H. Ford, United States Senate, Washington, D.C. 20510\n\n### Main Points\n\n- The Association represents 9,000 tobacco producers and strongly opposes the "Commitment to our Children" petition circulated by several Members of Congress [Cite-1, Page-1].\n- They advocate for better enforcement of existing age restriction laws rather than creating more bureaucracy [Cite-2, Page-1].\n- The Association warns against FDA regulation of tobacco, fearing it would lead to inefficient government bureaucracy and negatively impact family farms [Cite-3, Page-1].\n- The
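
The printed report is keyed by section type (`'letter'` in the output above), with each entry holding its own `summary` field, so a top-level `summary_json.get('summary', '')` may return an empty string for this shape. The sketch below handles both a flat and a per-section layout. The exact schema is an assumption based only on the truncated output shown here, and `collect_summaries` is a hypothetical helper.

```python
def collect_summaries(report):
    """Return {name: markdown} from either a flat or a per-section report."""
    if isinstance(report.get("summary"), str):
        # Flat shape: {"summary": "..."}
        return {"document": report["summary"]}
    summaries = {}
    for name, value in report.items():
        # Per-section shape: {"letter": {"summary": "..."}, ...}
        if isinstance(value, dict) and isinstance(value.get("summary"), str):
            summaries[name] = value["summary"]
    return summaries


nested = {"letter": {"summary": "## Summary\n- point"}, "memo": {"summary": "## Memo"}}
flat = {"summary": "### Whole document"}
print(collect_summaries(nested))
print(collect_summaries(flat))
```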



### Part 3: Processing Document without Sections

In [None]:
# Create a copy of the document without sections to demonstrate the fallback approach
document_without_sections = copy.deepcopy(document)
document_without_sections.sections = []  # Remove all sections

# Process the document without sections (should use the fallback approach)
start_time = time.time()
document_without_sections = summarization_service.process_document(
    document=document_without_sections,
    store_results=True
)
summarization_time = time.time() - start_time
print(f"Document summarization without sections completed in {summarization_time:.2f} seconds")

# Print the summary report URI
if document_without_sections.summary_report_uri:
    print(f"\nWhole Document Summary Report URI: {document_without_sections.summary_report_uri}")
    
    # Try to get and display the markdown summary
    try:
        # Extract bucket and key from the s3 URI
        uri_parts = document_without_sections.summary_report_uri.replace("s3://", "").split("/", 1)
        bucket = uri_parts[0]
        key = uri_parts[1]
        
        # Use boto3 to get the object directly
        s3_client = boto3.client('s3')
        response = s3_client.get_object(Bucket=bucket, Key=key)
        summary_md = response['Body'].read().decode('utf-8')
        
        print("\nWhole Document Summary (first 500 chars):")
        print(summary_md[:500] + "..." if len(summary_md) > 500 else summary_md)
    except Exception as e:
        print(f"Error retrieving whole document summary: {e}")
else:
    print("No whole document summary available")

Document summarization without sections completed in 19.65 seconds

Whole Document Summary Report URI: s3://idp-notebook-output-195275636621-us-east-1/sample-2025-04-25_04-01-21.pdf/summary/summary.json

Whole Document Summary (first 500 chars):
{"summary": "### Western Dark Fired Tobacco Growers' Association\n\n- **Address**: 206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056\n- **Contact**: (502) 753-3341 FAX (502) 753-0069/3342\n- **Date**: October 31, 1995\n- **Recipient**: The Honorable Wendell H. Ford, United States Senate, Washington, D.C. 20510\n\n**Summary**:\nThe Western Dark Fired Tobacco Growers' Association, representing 9,000 tobacco producers, expresses strong opposition to the \"Commitment to our Children\" petiti...
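
The preview above shows raw JSON because the object body is printed without being parsed. To display the markdown itself, the body can be decoded first. This sketch assumes the `{"summary": "..."}` shape visible in the output and falls back to the raw text otherwise; `summary_from_body` is a hypothetical helper.

```python
import json


def summary_from_body(body):
    """Return the 'summary' field if the body is JSON with one, else the body."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return body
    if isinstance(data, dict) and isinstance(data.get("summary"), str):
        return data["summary"]
    return body


raw = '{"summary": "### Heading\\n\\n- **Date**: October 31, 1995"}'
print(summary_from_body(raw))
```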
