# Holistic Packet Classification with IDP Common Package

This notebook demonstrates how to use the holistic packet classification capability of the IDP Common Package to classify multi-document packets in which each document may span multiple pages. The holistic approach examines the packet as a whole to identify the boundaries between the documents it contains and the type of each one.

**Key Benefits of Holistic Packet Classification:**
1. Properly handles multi-page documents within a packet
2. Detects logical document boundaries
3. Identifies document types in the context of the whole packet
4. Handles documents where individual pages may not be clearly classifiable on their own

> **Note**: This notebook uses real AWS services including S3, Textract, and Bedrock. You need valid AWS credentials with appropriate permissions to run this notebook.

## 1. Install Dependencies

Let's install the IDP common package in development mode.

In [1]:
# First uninstall existing package (to ensure we get the latest version)
%pip uninstall -y idp_common

# Install the IDP common package with all components in development mode
%pip install -q -e "../lib/idp_common_pkg[all]"

# Check installed version
%pip show idp_common | grep -E "Version|Location"

Found existing installation: idp_common 0.3.0
Uninstalling idp_common-0.3.0:
  Successfully uninstalled idp_common-0.3.0
Note: you may need to restart the kernel to use updated packages.
Version: 0.3.0
Location: /home/ec2-user/miniconda/lib/python3.12/site-packages
Note: you may need to restart the kernel to use updated packages.
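
As an alternative to piping `%pip show` through `grep`, the installed version can be read from Python itself. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of the IDP Common Package.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints None if the package is not installed in the current environment
print(f"idp_common version: {installed_version('idp_common')}")
```

Returning `None` instead of raising keeps the check usable at the top of a notebook, where a missing package should produce a readable message rather than a traceback.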


## 2. Import Libraries and Set Up Environment

In [2]:
import os
import json
import time
import boto3
import logging
import sys
from io import BytesIO
from pathlib import Path
import datetime
import yaml
from IPython.display import Markdown, display

# Configure logging - target only the OCR and classification service modules
logging.basicConfig(level=logging.WARNING)  # Set root logger to WARNING (less verbose)
logging.getLogger('idp_common.ocr.service').setLevel(logging.INFO)
logging.getLogger('idp_common.classification.service').setLevel(logging.INFO)
logging.getLogger('textractor').setLevel(logging.WARNING)  # Suppress textractor logs

# Set environment variables
os.environ['METRIC_NAMESPACE'] = 'IDP-Holistic-Classification-Example'
os.environ['AWS_REGION'] = boto3.session.Session().region_name or 'us-east-1'
os.environ['CONFIGURATION_TABLE_NAME'] = 'mock-config-table'

# Get AWS account ID for unique bucket names
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
region = os.environ['AWS_REGION']

# Define sample PDF path 
SAMPLE_PDF_PATH = "../samples/rvl_cdip_package.pdf"

# Create unique bucket names based on account ID and region
input_bucket_name = f"idp-holistic-input-{account_id}-{region}"
output_bucket_name = f"idp-holistic-output-{account_id}-{region}"

# Import base libraries
from idp_common.models import Document, Status
from idp_common import ocr, classification, extraction, get_config, utils

# Helper function to parse S3 URIs
def parse_s3_uri(uri):
    parts = uri.replace("s3://", "").split("/")
    bucket = parts[0]
    key = "/".join(parts[1:])
    return bucket, key

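# Helper function to load JSON from S3
# Note: relies on the global s3_client created in section 3, so call it only after that cell has run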
def load_json_from_s3(uri):
    bucket, key = parse_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')
    return json.loads(content)

print("Environment setup:")
print(f"METRIC_NAMESPACE: {os.environ.get('METRIC_NAMESPACE')}")
print(f"AWS_REGION: {os.environ.get('AWS_REGION')}")
print(f"Input bucket: {input_bucket_name}")
print(f"Output bucket: {output_bucket_name}")
print(f"SAMPLE_PDF_PATH: {SAMPLE_PDF_PATH}")

Environment setup:
METRIC_NAMESPACE: IDP-Holistic-Classification-Example
AWS_REGION: us-west-2
Input bucket: idp-holistic-input-912625584728-us-west-2
Output bucket: idp-holistic-output-912625584728-us-west-2
SAMPLE_PDF_PATH: ../samples/rvl_cdip_package.pdf
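
The `parse_s3_uri` helper above accepts any string and silently strips `s3://` wherever it appears. A stricter variant could validate the scheme first. This is a sketch only, built on the standard library's `urllib.parse`; the name `parse_s3_uri_strict` is hypothetical.

```python
from urllib.parse import urlparse


def parse_s3_uri_strict(uri):
    """Split an s3://bucket/key URI into (bucket, key), rejecting other schemes."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    # The path starts with "/", which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = parse_s3_uri_strict("s3://example-bucket/pages/1/result.json")
print(bucket, key)
```

One caveat of this approach: `urlparse` treats `?` and `#` as query and fragment delimiters, so keys containing those characters would need different handling.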


## 3. Set Up S3 Buckets and Upload Sample File

In [3]:
# Create S3 client
s3_client = boto3.client('s3')

# Function to create a bucket if it doesn't exist
def ensure_bucket_exists(bucket_name):
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket {bucket_name} already exists")
    except Exception:
        try:
            if region == 'us-east-1':
                s3_client.create_bucket(Bucket=bucket_name)
            else:
                s3_client.create_bucket(
                    Bucket=bucket_name,
                    CreateBucketConfiguration={'LocationConstraint': region}
                )
            print(f"Created bucket: {bucket_name}")
            
            # Wait for bucket to be accessible
            waiter = s3_client.get_waiter('bucket_exists')
            waiter.wait(Bucket=bucket_name)
        except Exception as e:
            print(f"Error creating bucket {bucket_name}: {str(e)}")
            raise

# Ensure both buckets exist
ensure_bucket_exists(input_bucket_name)
ensure_bucket_exists(output_bucket_name)

# Upload the sample file to S3
sample_file_key = "sample-" + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + ".pdf"
try:
    with open(SAMPLE_PDF_PATH, 'rb') as file_data:
        s3_client.upload_fileobj(file_data, input_bucket_name, sample_file_key)
    print(f"Uploaded sample file to: s3://{input_bucket_name}/{sample_file_key}")
except FileNotFoundError:
    print(f"Sample file not found at {SAMPLE_PDF_PATH}. Please ensure the sample file exists.")
    # Use a default sample if available
    alt_sample_path = "../samples/sample.pdf"
    try:
        with open(alt_sample_path, 'rb') as file_data:
            s3_client.upload_fileobj(file_data, input_bucket_name, sample_file_key)
        print(f"Used alternative sample file: s3://{input_bucket_name}/{sample_file_key}")
    except FileNotFoundError:
        print("No sample files found. Please create a samples directory with PDF files.")

Bucket idp-holistic-input-912625584728-us-west-2 already exists
Bucket idp-holistic-output-912625584728-us-west-2 already exists
Uploaded sample file to: s3://idp-holistic-input-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf
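
The region branch in `ensure_bucket_exists` omits `CreateBucketConfiguration` for `us-east-1` and supplies a `LocationConstraint` for every other region. That decision can be isolated in a small pure function, which makes it easy to check without calling AWS. The helper name `create_bucket_kwargs` is hypothetical; this is a sketch of the same logic, not a change to the cell above.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for an S3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # Mirror the cell above: only non-us-east-1 regions get a LocationConstraint
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("example-bucket", "us-west-2"))
print(create_bucket_kwargs("example-bucket", "us-east-1"))
# Possible usage: s3_client.create_bucket(**create_bucket_kwargs(name, region))
```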


## 4. Set Up Configuration

In [4]:
# Sample configuration that mimics what would be in DynamoDB
CONFIG = {
    "classes": [
        {
        "name": "letter",
        "description": "A formal written message that is typically sent from one person to another",
        "attributes": [
            {
            "name": "sender_name",
            "description": "The name of the person or entity who wrote or sent the letter. Look for text following or near terms like 'from', 'sender', 'authored by', 'written by', or at the end of the letter before a signature."
            },
            {
            "name": "sender_address",
            "description": "The physical address of the sender, typically appearing at the top of the letter. May be labeled as 'address', 'location', or 'from address'."
            }
        ]
        },
        {
        "name": "form",
        "description": "A document with blank spaces for filling in information",
        "attributes": [
            {
            "name": "form_type",
            "description": "The category or purpose of the form, such as 'application', 'registration', 'request', etc. May be identified by 'form name', 'document type', or 'form category'."
            },
            {
            "name": "form_id",
            "description": "The unique identifier for the form, typically a number or alphanumeric code. Often labeled as 'form number', 'id', or 'reference number'."
            }
        ]
        },
        {
        "name": "invoice",
        "description": "A commercial document issued by a seller to a buyer relating to a sale",
        "attributes": [
            {
            "name": "invoice_number",
            "description": "The unique identifier for the invoice. Look for 'invoice no', 'invoice #', or 'bill number', typically near the top of the document."
            },
            {
            "name": "invoice_date",
            "description": "The date when the invoice was issued. May be labeled as 'date', 'invoice date', or 'billing date'."
            }
        ]
        },
        {
        "name": "resume",
        "description": "A document summarizing a person's background, skills, and qualifications",
        "attributes": [
            {
            "name": "full_name",
            "description": "The complete name of the job applicant, typically appearing prominently at the top of the resume. May be simply labeled as 'name' or 'applicant name'."
            },
            {
            "name": "contact_info",
            "description": "The phone number, email, and address of the applicant. Look for a section with 'contact', 'phone', 'email', or 'address', usually near the top of the resume."
            }
        ]
        },
        {
        "name": "scientific_publication",
        "description": "A formally published document presenting scientific research findings",
        "attributes": [
            {
            "name": "title",
            "description": "The name of the scientific paper, typically appearing prominently at the beginning. May be labeled as 'title', 'paper title', or 'article title'."
            },
            {
            "name": "authors",
            "description": "The researchers who conducted the study and wrote the paper. Look for names after 'authors', 'contributors', or 'researchers', usually following the title."
            }
        ]
        },
        {
        "name": "memo",
        "description": "A brief written message used for internal communication within an organization",
        "attributes": [
            {
            "name": "memo_date",
            "description": "The date when the memo was written. Look for 'date' or 'memo date', typically near the top of the document."
            },
            {
            "name": "from",
            "description": "The person or department that wrote the memo. May be labeled as 'from', 'sender', or 'author'."
            }
        ]
        },
        {
        "name": "advertisement",
        "description": "A public notice promoting a product, service, or event",
        "attributes": [
            {
            "name": "product_name",
            "description": "The name of the item or service being advertised. Look for prominently displayed text that could be a 'product', 'item', or 'service' name."
            },
            {
            "name": "brand",
            "description": "The company or manufacturer of the product. May be indicated by a logo or text labeled as 'brand', 'company', or 'manufacturer'."
            }
        ]
        },
        {
        "name": "email",
        "description": "An electronic message sent from one person to another over a computer network",
        "attributes": [
            {
            "name": "from_address",
            "description": "The email address of the sender. Look for text following 'from', 'sender', or 'sent by', typically at the beginning of the email header."
            },
            {
            "name": "to_address",
            "description": "The email address of the primary recipient. May be labeled as 'to', 'recipient', or 'sent to'."
            }
        ]
        },
        {
        "name": "questionnaire",
        "description": "A set of written questions designed to collect information from respondents",
        "attributes": [
            {
            "name": "form_title",
            "description": "The name or title of the questionnaire. Look for prominently displayed text at the beginning that could be a 'title', 'survey name', or 'questionnaire name'."
            },
            {
            "name": "respondent_info",
            "description": "Information about the person completing the questionnaire. May include fields labeled 'respondent', 'participant', or 'name'."
            }
        ]
        },
        {
        "name": "specification",
        "description": "A detailed description of technical requirements or characteristics",
        "attributes": [
            {
            "name": "product_name",
            "description": "The name of the item being specified. Look for text labeled as 'product', 'item', or 'model', typically appearing prominently at the beginning."
            },
            {
            "name": "version",
            "description": "The iteration or release number. May be indicated by 'version', 'revision', or 'release', often followed by a number or code."
            }
        ]
        },
        {
        "name": "generic",
        "description": "A general document type that doesn't fit into other specific categories",
        "attributes": [
            {
            "name": "document_type",
            "description": "The classification or category of the document. Look for terms like 'type', 'category', or 'class' that indicate what kind of document this is."
            },
            {
            "name": "document_date",
            "description": "The date when the document was created. May be labeled as 'date', 'created on', or 'issued on'."
            }
        ]
        }
    ],
  "classification": {
    "temperature": "0",
    "model": "us.anthropic.claude-3-haiku-20240307-v1:0",
    "classificationMethod": "textbasedHolisticClassification",  # Use holistic packet classification
    "system_prompt": "You are a document classification expert who can analyze and classify multiple documents and their page boundaries within a document package from various domains. Your task is to determine the document type based on its content and structure, using the provided document type definitions. Your output must be valid JSON according to the requested format.",
    "top_k": "200",
    "task_prompt": """The <document-text> XML tags contains the text separated into pages from the document package. Each page will begin with a <page-number> XML tag indicating the one based page ordinal of the page text to follow.
<document-text>
{DOCUMENT_TEXT}
</document-text>

The <document-types> XML tags contain a markdown table of known doc types for detection.
<document-types>
{CLASS_NAMES_AND_DESCRIPTIONS}
</document-types>

<guidance>
Guidance for terminology found in the instructions.
    * ordinal_start_page: The one based beginning page of a document segment within the document package.
    * ordinal_end_page: The one based ending page of a document segment within the document package.
    * document_type: The document type code detected for a document segment.
    * Distinct documents of the same type may be adjacent to each other in the packet. Be sure to separate them into different document segments and don't combine them.
</guidance>

Follow these steps when classifying documents within the document package:
1. Examine the document package as a whole, and identify page ranges that are likely to belong to one of the <document-types>.
2. Match each page range with an identified document type.
3. Identify documents of the same type, that are not the same document but are adjacent to each other in the packet.
4. Separate unique documents of the same type adjacent to each other in the packet into distinct document segments. Important: Do not combine distinct documents of the same type into a single document segment.
5. For each identified document type, note the ordinal_start_page and ordinal_end_page.
6. Compile the classified documents into a list with their respective ordinal_start_page and ordinal_end_page.

Return your response as valid JSON according to this format:
```json
{
    "segments": [
                      {
                        "ordinal_start_page": 1,
                        "ordinal_end_page": 2,
                        "type": "the first type of document detected"
                      },
                      {
                        "ordinal_start_page": 3,
                        "ordinal_end_page": 4,
                        "type": "the second type of document detected"
                      }
                    ]
}
```"""
  },
  "extraction": {
    "temperature": "0",
    "model": "us.anthropic.claude-3-haiku-20240307-v1:0",
    "system_prompt": "You are a document assistant. Respond only with JSON. Never make up data, only provide data found in the document being provided.\n",
    "top_k": "200",
    "task_prompt": "<background>\nYou are an expert in business document analysis and information extraction. \nYou can understand and extract key information from business documents classified as type \n{DOCUMENT_CLASS}.\n</background>\n<document_ocr_data>\n{DOCUMENT_TEXT}\n</document_ocr_data>\n<task>\nYour task is to take the unstructured text provided and convert it into a well-organized table format using JSON. Identify the main entities, attributes, or categories mentioned in the attributes list below and use them as keys in the JSON object. \nThen, extract the relevant information from the text and populate the corresponding values in the JSON object. \nGuidelines:\nEnsure that the data is accurately represented and properly formatted within the JSON structure\nInclude double quotes around all keys and values\nDo not make up data - only extract information explicitly found in the document\nDo not use /n for new lines, use a space instead\nIf a field is not found or if unsure, return null\nAll dates should be in MM/DD/YYYY format\nDo not perform calculations or summations unless totals are explicitly given\nIf an alias is not found in the document, return null\nHere are the attributes you should extract:\n<attributes>\n{ATTRIBUTE_NAMES_AND_DESCRIPTIONS}\n</attributes>\n</task>\n"
  }
}

print("Test configuration created for IDP services with textbased holistic classification method enabled")

Test configuration created for IDP services with textbased holistic classification method enabled
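
The classification `task_prompt` contains the placeholders `{DOCUMENT_TEXT}` and `{CLASS_NAMES_AND_DESCRIPTIONS}` alongside literal JSON braces, so `str.format` would not work on it directly. The sketch below shows one way such a template could be filled, using plain string replacement and a markdown table of class names. How the IDP Common Package performs this substitution internally is not shown in this notebook; the helpers `class_table` and `fill_prompt`, the table column names, and the shortened template are all assumptions made for illustration.

```python
def class_table(classes):
    """Render class names and descriptions as a markdown table."""
    rows = ["| type | description |", "| --- | --- |"]
    rows += [f"| {c['name']} | {c['description']} |" for c in classes]
    return "\n".join(rows)


def fill_prompt(template, values):
    """Substitute {NAME} placeholders with str.replace.

    str.format is avoided because the template also contains literal JSON braces.
    """
    for name, value in values.items():
        template = template.replace("{" + name + "}", value)
    return template


# Two classes copied from CONFIG, and a shortened stand-in for the real template
classes = [
    {"name": "letter", "description": "A formal written message that is typically sent from one person to another"},
    {"name": "memo", "description": "A brief written message used for internal communication within an organization"},
]
template = (
    "<document-types>\n{CLASS_NAMES_AND_DESCRIPTIONS}\n</document-types>\n"
    "<document-text>\n{DOCUMENT_TEXT}\n</document-text>\n"
    'Format: {"segments": []}'
)
prompt = fill_prompt(
    template,
    {
        "CLASS_NAMES_AND_DESCRIPTIONS": class_table(classes),
        "DOCUMENT_TEXT": "text of page 1",
    },
)
print(prompt)
```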


## 5. Process Document with OCR

In [5]:
# Initialize a new Document
document = Document(
    id="doc-insurance-package",
    input_bucket=input_bucket_name,
    input_key=sample_file_key,
    output_bucket=output_bucket_name,
    status=Status.QUEUED
)

print(f"Created document with ID: {document.id}")
print(f"Status: {document.status.value}")

# Create OCR service with Textract
# Valid features are 'LAYOUT', 'FORMS', 'SIGNATURES', 'TABLES' (uses analyze_document API)
# or leave it empty (to use basic detect_document_text API)
ocr_service = ocr.OcrService(
    region=region,
    enhanced_features=['LAYOUT']
)

# Process document with OCR
print("\nProcessing document with OCR...")
start_time = time.time()
document = ocr_service.process_document(document)
ocr_time = time.time() - start_time

print(f"OCR processing completed in {ocr_time:.2f} seconds")
print(f"Document status: {document.status.value}")
print(f"Number of pages processed: {document.num_pages}")

# Show pages information
print("\nProcessed pages:")
for page_id, page in document.pages.items():
    print(f"Page {page_id}:")
    print(f"  Image URI: {page.image_uri}")
    print(f"  Raw Text URI: {page.raw_text_uri}")
    print(f"  Parsed Text URI: {page.parsed_text_uri}")
print("\nMetering:")
print(json.dumps(document.metering))

INFO:idp_common.ocr.service:OCR Service initialized with features: ['LAYOUT']


Created document with ID: doc-insurance-package
Status: QUEUED

Processing document with OCR...


INFO:idp_common.ocr.service:Successfully extracted markdown text for page 3
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 10
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 2
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 9
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 1
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 4
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 5
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 7
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 8
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 6
INFO:idp_common.ocr.service:Sorting 10 pages by page number
INFO:idp_common.ocr.service:OCR processing completed in 6.00 seconds
INFO:idp_common.ocr.service:Processed 10 pages, with 0 errors


OCR processing completed in 6.00 seconds
Document status: OCR_COMPLETED
Number of pages processed: 10

Processed pages:
Page 1:
  Image URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/pages/1/image.jpg
  Raw Text URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/pages/1/rawText.json
  Parsed Text URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/pages/1/result.json
Page 2:
  Image URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/pages/2/image.jpg
  Raw Text URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/pages/2/rawText.json
  Parsed Text URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/pages/2/result.json
Page 3:
  Image URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/pages/3/image.jpg
  Raw Text URI: s3://idp-holistic-output-912625584728-us-w

## 6. Classify the Document

In [None]:
# Verify that Config specifies => "classificationMethod": "textbasedHolisticClassification"
print("*****************************************************************")
print(f'CONFIG classificationMethod: {CONFIG["classification"].get("classificationMethod")}')
print("*****************************************************************")

# Create classification service with Bedrock backend
# The classification method is set in the config
classification_service = classification.ClassificationService(
    config=CONFIG, 
    backend="bedrock" 
)

# Classify the document
print("\nClassifying document...")
start_time = time.time()
document = classification_service.classify_document(document)
classification_time = time.time() - start_time
print(f"Classification completed in {classification_time:.2f} seconds")
print(f"Document status: {document.status.value}")

# Show classification results
if document.sections:
    print("\nDetected sections:")
    for section in document.sections:
        print(f"Section {section.section_id}: {section.classification}")
        print(f"  Pages: {section.page_ids}")
else:
    print("\nNo sections detected")

# Show page classification
# Page IDs are numeric strings here, so sort numerically to keep page 10 after page 9
print("\nPage-level classifications:")
for page_id, page in sorted(document.pages.items(), key=lambda item: int(item[0])):
    print(f"Page {page_id}: {page.classification}")

INFO:idp_common.classification.service:Initialized classification service with Bedrock backend using model us.anthropic.claude-3-haiku-20240307-v1:0
INFO:idp_common.classification.service:Using textbased holistic packet classification method
INFO:idp_common.classification.service:Classifying document with 10 pages using holistic packet method
INFO:idp_common.classification.service:Classifying document with 10 pages using holistic packet method


CONFIG classificationMethod: textbasedHolisticClassification

Classifying document...


INFO:idp_common.classification.service:Invoking Bedrock for holistic packet classification
INFO:idp_common.classification.service:Time taken for holistic classification: 5.33 seconds
INFO:idp_common.classification.service:Document classified with 8 sections using holistic method


Classification completed in 5.34 seconds
Document status: CLASSIFIED

Detected sections:
Section 1: letter
  Pages: ['1']
Section 2: lab_report
  Pages: ['2']
Section 3: email
  Pages: ['3']
Section 4: invoice
  Pages: ['4']
Section 5: invoice
  Pages: ['5']
Section 6: scientific_publication
  Pages: ['6']
Section 7: questionnaire
  Pages: ['7']
Section 8: memo
  Pages: ['8', '9', '10']

Page-level classifications:
Page 1: letter
Page 2: lab_report
Page 3: email
Page 4: invoice
Page 5: invoice
Page 6: scientific_publication
Page 7: questionnaire
Page 8: memo
Page 9: memo
Page 10: memo
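
The output above includes the type `lab_report`, which is not among the classes defined in `CONFIG`, so it can be worth checking detected types against the configured ones. The sketch below expands a response in the `segments` format requested by the task prompt into per-page types and reports any type outside a known set. It assumes the JSON has already been extracted from the model's reply; the function name `segments_to_page_classes` and the sample response are hypothetical and do not reflect the package's internal parsing.

```python
import json


def segments_to_page_classes(response_text, known_types):
    """Map a holistic response to {page_number: type} and collect unknown types."""
    segments = json.loads(response_text)["segments"]
    page_classes = {}
    unknown = set()
    for seg in segments:
        doc_type = seg["type"]
        if doc_type not in known_types:
            unknown.add(doc_type)
        # Page ordinals are one-based and the end page is inclusive
        for page in range(seg["ordinal_start_page"], seg["ordinal_end_page"] + 1):
            page_classes[page] = doc_type
    return page_classes, sorted(unknown)


# Illustrative response shaped like the format in the task prompt
response = json.dumps({"segments": [
    {"ordinal_start_page": 1, "ordinal_end_page": 1, "type": "letter"},
    {"ordinal_start_page": 2, "ordinal_end_page": 2, "type": "lab_report"},
    {"ordinal_start_page": 8, "ordinal_end_page": 10, "type": "memo"},
]})
known = {"letter", "memo", "invoice"}
pages, unknown = segments_to_page_classes(response, known)
print(pages)
print("Types not in the configured classes:", unknown)
```

In a real run, `known` could be built from the notebook's configuration with `{c["name"] for c in CONFIG["classes"]}`.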


## 7. Extract Information from Document Sections

In [7]:
# Create extraction service with Bedrock
extraction_service = extraction.ExtractionService(config=CONFIG)

print("\nExtracting information from document sections...")
extracted_results = {}

# Create an individual document for each section
n = 0  # Number of sections processed so far
for section in document.sections:
    print(f"\nProcessing section {section.section_id} (class: {section.classification})")
    
    # Create a section-specific document
    section_document = Document(
        id=document.id,
        input_bucket=document.input_bucket,
        input_key=document.input_key,
        output_bucket=document.output_bucket,
        status=document.status,
        sections=[section]
    )
    
    # Add only pages needed for this section
    needed_pages = {}
    for page_id in section.page_ids:
        if page_id in document.pages:
            needed_pages[page_id] = document.pages[page_id]
    section_document.pages = needed_pages
    
    # Process section
    start_time = time.time()
    section_document = extraction_service.process_document_section(
        document=section_document,
        section_id=section.section_id
    )
    extraction_time = time.time() - start_time
    print(f"Extraction completed in {extraction_time:.2f} seconds")
    
    # Get the updated section
    updated_section = section_document.sections[0]
    print(f"Extraction result URI: {updated_section.extraction_result_uri}")
    
    # Store results for later use
    extracted_results[section.section_id] = {
        "section": updated_section,
        "result_uri": updated_section.extraction_result_uri
    }
    n += 1
    print(f"Processed {n} sections")
    if n >= 3:  # Limit to first 3 sections to save time/costs
        print("\nExtraction for first 3 sections complete.")
        break

print("\nExtraction complete.")


Extracting information from document sections...

Processing section 1 (class: letter)
Extraction completed in 2.61 seconds
Extraction result URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/sections/1/result.json
Processed 1 sections

Processing section 2 (class: lab_report)
Extraction completed in 6.03 seconds
Extraction result URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/sections/2/result.json
Processed 2 sections

Processing section 3 (class: email)
Extraction completed in 2.45 seconds
Extraction result URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/sections/3/result.json
Processed 3 sections

Extraction for first 3 sections complete.

Extraction complete.
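
The loop above combines two small patterns: capping the number of sections processed, and selecting only the pages that belong to each section. The sketch below shows both with `itertools.islice` and a dict comprehension. It uses plain dictionaries as stand-ins for the `Document` and `Section` objects, so the data shapes here are assumptions for illustration only.

```python
from itertools import islice

MAX_SECTIONS = 3  # Same cap as the loop above

# Stand-in data: page IDs are strings, as in the classification output
pages = {"1": "page-1-data", "2": "page-2-data", "3": "page-3-data", "4": "page-4-data"}
sections = [
    {"section_id": "1", "page_ids": ["1"]},
    {"section_id": "2", "page_ids": ["2"]},
    {"section_id": "3", "page_ids": ["3", "99"]},  # "99" is absent and gets skipped
    {"section_id": "4", "page_ids": ["4"]},
]

section_pages = {}
# islice stops after MAX_SECTIONS items without a manual counter or break
for section in islice(sections, MAX_SECTIONS):
    section_pages[section["section_id"]] = {
        pid: pages[pid] for pid in section["page_ids"] if pid in pages
    }
print(section_pages)
```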


## 8. Inspect Extraction Results

In [8]:
print("Loading extraction results from S3...\n")

for section_id, data in extracted_results.items():
    # Load the extraction results from S3
    uri = data['result_uri']
    try:
        result_data = load_json_from_s3(uri)
        
        # Extract the inference results
        if "inference_result" in result_data:
            extraction_results = result_data["inference_result"]
        else:
            extraction_results = result_data
            
        # Print out section and extraction results
        print(f"Results for section {section_id} (class: {data['section'].classification})")
        print(f"S3 URI: {uri}")
        print("Extracted attributes:")
        for key, value in extraction_results.items():
            print(f"  {key}: {value}")
        print()
    except Exception as e:
        print(f"Error loading results from {uri}: {str(e)}")
        print()

Loading extraction results from S3...

Results for section 1 (class: letter)
S3 URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/sections/1/result.json
Extracted attributes:
  sender_name: Will E. Clark
  sender_address: 206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056

Results for section 2 (class: lab_report)
S3 URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/sections/2/result.json
Extracted attributes:
  DATE: 2/28/93
  TECHNICIAN: CC
  SHIFT: A
  LINE: 2
  AREA: 52
  PRODUCT UNIT CODE: 0728
  SAMPLE ID: stuff box 2
  REASON FOR REQUEST: test
  REQUESTED DELIVERY TIME: None
  TIME SAMPLE RECEIVED: None
  TIME ANALYSIS COMPLETED: None
  DATA COMMUNICATED TO: Gone
  DATA COMMUNICATED AT: 11:05
  DRYING TIME: {'IN': None, 'OUT': None}
  SAMPLE & CONTAINER WEIGHT IN GRAMS: 1159.3
  CONTAINER WEIGHT IN GRAMS: 6
  SAMPLE WEIGHT IN GRAMS: 2
  DILUTION FACTOR: 6955.8
  DILUTED SAMPLE WEIGHT IN GRAMS: None
  FILT

## 9. Final Document Status Summary

In [9]:
# Update document status to PROCESSED
document.status = Status.PROCESSED

# Update document sections with extraction results
for section_id, data in extracted_results.items():
    # Find section in document
    for i, section in enumerate(document.sections):
        if section.section_id == section_id:
            document.sections[i] = data['section']

# Display final document state
print("Final Document State:")
print(f"Document ID: {document.id}")
print(f"Status: {document.status.value}")
print(f"Number of pages: {document.num_pages}")
print(f"Number of sections: {len(document.sections)}")

print("\nSection summary:")
for section in document.sections:
    print(f"  Section {section.section_id}: {section.classification}")
    print(f"    Pages: {section.page_ids}")
    print(f"    Extraction result URI: {section.extraction_result_uri}")

# Demonstrate that a document can be serialized to JSON
print("\nDocument can be serialized to JSON:")
document_dict = document.to_dict()
document_json = json.dumps(document_dict, indent=2)[:500]  # Truncate for display
print(f"{document_json}...")
print("(truncated for display)")

Final Document State:
Document ID: doc-insurance-package
Status: PROCESSED
Number of pages: 10
Number of sections: 9

Section summary:
  Section 1: letter
    Pages: ['1']
    Extraction result URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/sections/1/result.json
  Section 2: lab_report
    Pages: ['2']
    Extraction result URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/sections/2/result.json
  Section 3: email
    Pages: ['3']
    Extraction result URI: s3://idp-holistic-output-912625584728-us-west-2/sample-2025-04-14_15-27-10.pdf/sections/3/result.json
  Section 4: memo
    Pages: ['4']
    Extraction result URI: None
  Section 5: invoice
    Pages: ['5']
    Extraction result URI: None
  Section 6: scientific_publication
    Pages: ['6']
    Extraction result URI: None
  Section 7: news_release
    Pages: ['7', '8']
    Extraction result URI: None
  Section 8: questionnaire
    Pages: ['9']
    Extraction res

## 10. Clean Up (Optional)

In [10]:
# Function to empty a bucket and then delete it
def delete_bucket_objects(bucket_name):
    try:
        # list_objects_v2 returns at most 1,000 keys per call, so paginate
        paginator = s3_client.get_paginator('list_objects_v2')
        deleted = 0
        for page in paginator.paginate(Bucket=bucket_name):
            contents = page.get('Contents', [])
            if contents:
                delete_keys = {'Objects': [{'Key': obj['Key']} for obj in contents]}
                s3_client.delete_objects(Bucket=bucket_name, Delete=delete_keys)
                deleted += len(contents)
        if deleted:
            print(f"Deleted {deleted} objects in bucket {bucket_name}")
        else:
            print(f"Bucket {bucket_name} is already empty")

        # Delete the now-empty bucket
        s3_client.delete_bucket(Bucket=bucket_name)
        print(f"Deleted bucket {bucket_name}")
    except Exception as e:
        print(f"Error cleaning up bucket {bucket_name}: {str(e)}")

# Uncomment the following lines to delete the buckets
# print("Cleaning up resources...")
# delete_bucket_objects(input_bucket_name)
# delete_bucket_objects(output_bucket_name)
# print("Cleanup complete")

## Conclusion

This notebook demonstrates how to use the holistic packet classification capability in the IDP Common Package, which offers several advantages over page-by-page classification:

### Benefits of Holistic Packet Classification:

1. **Contextual Understanding**: Analyzes the entire packet as a whole, rather than page-by-page, providing better context for classification decisions
2. **Document Boundary Detection**: Detects where one document ends and another begins within a multi-document packet
3. **Improved Accuracy**: More accurately classifies pages that might be ambiguous when viewed individually
4. **Type Consistency**: Maintains document type consistency for multi-page documents
5. **Flexible Configuration**: Supports customizable prompts and document type definitions

Compared to page-by-page classification (multimodalPageLevelClassification), the text-based holistic approach (textbasedHolisticClassification) can give more accurate results for complex document packets, especially when individual pages do not provide enough context to be classified on their own.

The implementation leverages the existing Document model structure, where:
- Document = DocumentPacket
- Section = DocumentSegment

This approach ensures consistency with the rest of the IDP Common Package while adding the powerful capability of holistic document packet classification. The classification method can be configured through the "classificationMethod" parameter in the configuration.
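
To compare the two methods on the same packet, the configuration can be copied with a different "classificationMethod" value. The following is a minimal sketch; the helper `with_classification_method` is hypothetical, and the set of allowed values is limited to the two method names mentioned in this notebook, which may not be the full list the package supports.

```python
import copy

# Only the two method names that appear in this notebook
KNOWN_METHODS = {
    "textbasedHolisticClassification",
    "multimodalPageLevelClassification",
}


def with_classification_method(config, method):
    """Return a deep copy of config with a different classificationMethod."""
    if method not in KNOWN_METHODS:
        raise ValueError(f"Unknown classification method: {method!r}")
    new_config = copy.deepcopy(config)  # Leave the original config untouched
    new_config["classification"]["classificationMethod"] = method
    return new_config


base = {"classification": {"classificationMethod": "textbasedHolisticClassification"}}
page_level = with_classification_method(base, "multimodalPageLevelClassification")
print(page_level["classification"]["classificationMethod"])
```

Copying rather than mutating keeps the original `CONFIG` available for a side-by-side run with the holistic method.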