# End-to-End Document Processing with Assessment

This notebook demonstrates how to process a document end to end using the modular, Document-based approach. The pipeline has five stages:

1. OCR Service - Convert a PDF document to text using Amazon Textract
2. Classification Service - Classify document pages into sections using Amazon Bedrock
3. Extraction Service - Extract structured information from sections using Amazon Bedrock
4. **Assessment Service - Assess the confidence and accuracy of extraction results**
5. Evaluation Service - Evaluate the accuracy of the extracted information

Each step uses the unified Document object model for data flow and consistency.

> **Note**: This notebook uses AWS services including S3, Textract, and Bedrock. You need valid AWS credentials with appropriate permissions to run this notebook.

## 1. Install Dependencies

The IDP common package supports granular installation through extras. You can install:
- `[core]` - Just core functionality 
- `[ocr]` - OCR service with Textract dependencies
- `[classification]` - Classification service dependencies
- `[extraction]` - Extraction service dependencies
- `[evaluation]` - Evaluation service dependencies
- `[all]` - All of the above

In [1]:
# Let's make sure that modules are autoreloaded
%load_ext autoreload
%autoreload 2

# First uninstall existing package (to ensure we get the latest version)
%pip uninstall -y idp_common

# Install the IDP common package with all components in development mode
%pip install -q -e "../lib/idp_common_pkg[dev, all]"

# Check installed version
%pip show idp_common | grep -E "Version|Location"

# Optionally load environment variables from a .env file
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass  # python-dotenv is not installed; rely on the existing environment

Found existing installation: idp_common 0.3.3
Uninstalling idp_common-0.3.3:
  Successfully uninstalled idp_common-0.3.3
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Version: 0.3.3
Location: /home/ec2-user/.local/lib/python3.11/site-packages
Note: you may need to restart the kernel to use updated packages.
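
The `%pip show ... | grep` line above depends on a shell `grep` being available. As a hedged alternative, the sketch below reads the installed version with the standard-library `importlib.metadata`. The helper name `installed_version` is hypothetical and is not part of `idp_common`.

```python
from importlib import metadata
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for this notebook; not part of idp_common.
    """
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution name taken from the pip output above.
print(f"idp_common version: {installed_version('idp_common')}")
```

If this prints `None`, the editable install did not succeed or the kernel needs to be restarted.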


## 2. Import Libraries and Set Up Environment

In [2]:
import os
import json
import time
import boto3
import logging
import datetime

# Import base libraries
from idp_common.models import Document, Status, Section, Page
from idp_common import ocr, classification, extraction, assessment, evaluation

# Configure logging 
logging.basicConfig(level=logging.WARNING)  # Set root logger to WARNING (less verbose)
logging.getLogger('idp_common.ocr.service').setLevel(logging.INFO)  # Focus on service logs
logging.getLogger('textractor').setLevel(logging.WARNING)  # Suppress textractor logs
logging.getLogger('idp_common.evaluation.service').setLevel(logging.DEBUG)  # Enable evaluation logs
logging.getLogger('idp_common.assessment.service').setLevel(logging.DEBUG)  # Enable assessment logs
logging.getLogger('idp_common.bedrock.client').setLevel(logging.DEBUG)  # show prompts

# Set environment variables
os.environ['METRIC_NAMESPACE'] = 'IDP-Notebook-Assessment-Example'
os.environ['AWS_REGION'] = boto3.session.Session().region_name or 'us-east-1'

# Get AWS account ID for unique bucket names
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
region = os.environ['AWS_REGION']

# Define sample PDF path 
SAMPLE_PDF_PATH = "../samples/rvl_cdip_package.pdf"

# Create unique bucket names based on account ID and region
input_bucket_name = os.getenv("IDP_INPUT_BUCKET_NAME", f"idp-notebook-assess-input-{account_id}-{region}")
output_bucket_name = os.getenv("IDP_OUTPUT_BUCKET_NAME", f"idp-notebook-assess-output-{account_id}-{region}")

print("Environment setup:")
print(f"METRIC_NAMESPACE: {os.environ.get('METRIC_NAMESPACE')}")
print(f"AWS_REGION: {os.environ.get('AWS_REGION')}")
print(f"Input bucket: {input_bucket_name}")
print(f"Output bucket: {output_bucket_name}")
print(f"SAMPLE_PDF_PATH: {SAMPLE_PDF_PATH}")

Environment setup:
METRIC_NAMESPACE: IDP-Notebook-Assessment-Example
AWS_REGION: us-west-2
Input bucket: idp-notebook-assess-input-912625584728-us-west-2
Output bucket: idp-notebook-assess-output-912625584728-us-west-2
SAMPLE_PDF_PATH: ../samples/rvl_cdip_package.pdf
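
The bucket names above are built from a prefix, the account ID, and the region. S3 rejects names that are too long or contain disallowed characters, so it can help to check the name before calling S3. The following is a minimal sketch; `build_bucket_name` is a hypothetical helper, the account ID is a placeholder, and the regular expression covers only a subset of the S3 naming rules (length, allowed characters, first and last character).

```python
import re

# Subset of the general-purpose S3 bucket naming rules: 3-63 characters,
# lowercase letters, digits, dots and hyphens, starting and ending with a
# letter or digit. Other rules (e.g. no IP-address-style names) are not checked.
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def build_bucket_name(prefix: str, account_id: str, region: str) -> str:
    """Build '<prefix>-<account_id>-<region>' and fail fast if it looks invalid."""
    name = f"{prefix}-{account_id}-{region}".lower()
    if not _BUCKET_NAME_RE.fullmatch(name):
        raise ValueError(f"Invalid S3 bucket name: {name!r}")
    return name


# Placeholder account ID, for illustration only.
print(build_bucket_name("idp-notebook-assess-input", "123456789012", "us-west-2"))
```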


## 3. Set Up S3 Buckets and Upload Sample File

In [3]:
# Create S3 client
s3_client = boto3.client('s3')

# Helper function to parse S3 URIs
def parse_s3_uri(uri):
    parts = uri.replace("s3://", "").split("/")
    bucket = parts[0]
    key = "/".join(parts[1:])
    return bucket, key

# Helper function to load JSON from S3
def load_json_from_s3(uri):
    bucket, key = parse_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')
    return json.loads(content)

# Function to create a bucket if it doesn't exist
def ensure_bucket_exists(bucket_name):
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket {bucket_name} already exists")
    except Exception:
        try:
            if region == 'us-east-1':
                s3_client.create_bucket(Bucket=bucket_name)
            else:
                s3_client.create_bucket(
                    Bucket=bucket_name,
                    CreateBucketConfiguration={'LocationConstraint': region}
                )
            print(f"Created bucket: {bucket_name}")
            
            # Wait for bucket to be accessible
            waiter = s3_client.get_waiter('bucket_exists')
            waiter.wait(Bucket=bucket_name)
        except Exception as e:
            print(f"Error creating bucket {bucket_name}: {str(e)}")
            raise

# Ensure both buckets exist
ensure_bucket_exists(input_bucket_name)
ensure_bucket_exists(output_bucket_name)

# Upload the sample file to S3
sample_file_key = "sample-assessment-" + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + ".pdf"
with open(SAMPLE_PDF_PATH, 'rb') as file_data:
    s3_client.upload_fileobj(file_data, input_bucket_name, sample_file_key)

print(f"Uploaded sample file to: s3://{input_bucket_name}/{sample_file_key}")

Bucket idp-notebook-assess-input-912625584728-us-west-2 already exists
Bucket idp-notebook-assess-output-912625584728-us-west-2 already exists
Uploaded sample file to: s3://idp-notebook-assess-input-912625584728-us-west-2/sample-assessment-2025-06-09_19-49-46.pdf
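
The `parse_s3_uri` helper above accepts any string and removes `s3://` wherever it occurs. A stricter variant can reject malformed input early. The sketch below is one possible approach; the name `parse_s3_uri_strict` is hypothetical, and the example bucket and key are placeholders.

```python
from typing import Tuple


def parse_s3_uri_strict(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key/parts' into (bucket, key), validating the scheme."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # partition() splits on the first '/' only, so slashes in the key are preserved.
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"S3 URI has no bucket: {uri!r}")
    return bucket, key


bucket, key = parse_s3_uri_strict("s3://example-bucket/pages/1/image.jpg")
print(bucket, key)
```

A URI with no key, such as `s3://example-bucket`, yields an empty key rather than an error, which matches the behavior of the original helper.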


## 4. Set Up Configuration with Assessment

In [4]:
# Sample configuration that includes assessment section
CONFIG = {
    "ocr": {"features": [{"name": "LAYOUT"},{"name": "TABLES"},{"name": "SIGNATURES"}]},
    "classification": {
        "top_p": "0.1",
        "max_tokens": "4096",
        "top_k": "5",
        "task_prompt": "<task-description>\nYou are a document classification system. Your task is to analyze a document package containing multiple pages and identify distinct document segments, classifying each segment according to the predefined document types provided below.\n</task-description>\n\n<document-types>\n{CLASS_NAMES_AND_DESCRIPTIONS}\n</document-types>\n\n<terminology-definitions>\nKey terms used in this task:\n- ordinal_start_page: The one-based beginning page number of a document segment within the document package\n- ordinal_end_page: The one-based ending page number of a document segment within the document package\n- document_type: The document type code detected for a document segment\n- document segment: A continuous range of pages that form a single, complete document\n</terminology-definitions>\n\n<classification-instructions>\nFollow these steps to classify documents:\n1. Read through the entire document package to understand its contents\n2. Identify page ranges that form complete, distinct documents\n3. Match each document segment to ONE of the document types listed in <document-types>\n4. CRITICAL: Only use document types explicitly listed in the <document-types> section\n5. If a document doesn't clearly match any listed type, assign it to the most similar listed type\n6. Pay special attention to adjacent documents of the same type - they must be separated into distinct segments\n7. Record the ordinal_start_page and ordinal_end_page for each identified segment\n8. Provide appropriate reasons and facts for the predicted document type\n</classification-instructions>\n\n<document-boundary-rules>\nRules for determining document boundaries:\n- Content continuity: Pages with continuing paragraphs, numbered sections, or ongoing narratives belong to the same document\n- Visual consistency: Similar layouts, headers, footers, and styling indicate pages belong together\n- Logical structure: Documents typically have clear beginning, middle, and end sections\n- New document indicators: Title pages, cover sheets, or significantly different subject matter signal a new document\n- Topic coherence: Pages discussing the same subject should be grouped together\n- IMPORTANT: Distinct documents of the same type that are adjacent must be separated into different segments\n</document-boundary-rules>\n\n<output-format>\nReturn your classification as valid JSON following this exact structure:\n```json\n{\n    \"segments\": [\n        {\n            \"ordinal_start_page\": 1,\n            \"ordinal_end_page\": 3,\n            \"type\": \"document_type_from_list\",\n            \"reason\": \"facts and reasons to classify as the predicted type\"\n        },\n        {\n            \"ordinal_start_page\": 4,\n            \"ordinal_end_page\": 7,\n            \"type\": \"document_type_from_list\",\n            \"reason\": \"facts and reasons to classify as the predicted type\"\n        }\n    ]\n}\n```\n</output-format>\n\n<<CACHEPOINT>>\n\n<document-text>\n{DOCUMENT_TEXT}\n</document-text>\n\n<final-instructions>\nAnalyze the <document-text> provided above and:\n1. Apply the <classification-instructions> to identify distinct document segments\n2. Use the <document-boundary-rules> to determine where one document ends and another begins\n3. Classify each segment using ONLY the document types from the <document-types> list\n4. Ensure adjacent documents of the same type are separated into distinct segments\n5. Output your classification in the exact JSON format specified in <output-format>\n6. You can get this information from the previous message. Analyze the previous messages to get these instructions.\n\nRemember: You must ONLY use document types that appear in the <document-types> reference data. Do not invent or create new document types.\n</final-instructions>",
        "temperature": "0.0",
        "model": "us.amazon.nova-pro-v1:0",
        "system_prompt": "You are a document classification expert who can analyze and classify multiple documents and their page boundaries within a document package from various domains. Your task is to determine the document type based on its content and structure, using the provided document type definitions. Your output must be valid JSON according to the requested format.",
        "classificationMethod": "textbasedHolisticClassification"
    },
    "extraction": {
        "top_p": "0.1",
        "max_tokens": "4096",
        "top_k": "5",
        "task_prompt": "<background>\nYou are an expert in document analysis and information extraction.  You can understand and extract key information from documents classified as type \n{DOCUMENT_CLASS}.\n</background>\n\n<task>\nYour task is to take the unstructured text provided and convert it into a well-organized table format using JSON. Identify the main entities, attributes, or categories mentioned in the attributes list below and use them as keys in the JSON object.  Then, extract the relevant information from the text and populate the corresponding values in the JSON object.\n</task>\n\n<extraction-guidelines>\nGuidelines:\n    1. Ensure that the data is accurately represented and properly formatted within\n    the JSON structure\n    2. Include double quotes around all keys and values\n    3. Do not make up data - only extract information explicitly found in the\n    document\n    4. Do not use /n for new lines, use a space instead\n    5. If a field is not found or if unsure, return null\n    6. All dates should be in MM/DD/YYYY format\n    7. Do not perform calculations or summations unless totals are explicitly given\n    8. If an alias is not found in the document, return null\n    9. Guidelines for checkboxes:\n     9.A. CAREFULLY examine each checkbox, radio button, and selection field:\n        - Look for marks like ✓, ✗, x, filled circles (●), darkened areas, or handwritten checks indicating selection\n        - For checkboxes and multi-select fields, ONLY INCLUDE options that show clear visual evidence of selection\n        - DO NOT list options that have no visible selection mark\n     9.B. For ambiguous or overlapping tick marks:\n        - If a mark overlaps between two or more checkboxes, determine which option contains the majority of the mark\n        - Consider a checkbox selected if the mark is primarily inside the check box or over the option text\n        - When a mark touches multiple options, analyze which option was most likely intended based on position and density. For handwritten checks, the mark typically flows from the selected checkbox outward.\n        - Carefully analyze visual cues and contextual hints. Think from a human perspective, anticipate natural tendencies, and apply thoughtful reasoning to make the best possible judgment.\n    10. Think step by step first and then answer.\n\n</extraction-guidelines>\n\n<attributes>\n{ATTRIBUTE_NAMES_AND_DESCRIPTIONS}\n</attributes>\n\n<<CACHEPOINT>>\n\n<document-text>\n{DOCUMENT_TEXT}\n</document-text>\n\n<document_image>\n{DOCUMENT_IMAGE}\n</document_image>\n\n<final-instructions>\nExtract key information from the document and return a JSON object with the following key steps: 1. Carefully analyze the document text to identify the requested attributes 2. Extract only information explicitly found in the document - never make up data 3. Format all dates as MM/DD/YYYY and replace newlines with spaces 4. For checkboxes, only include options with clear visual selection marks 5. Use null for any fields not found in the document 6. Ensure the output is properly formatted JSON with quoted keys and values 7. Think step by step before finalizing your answer\n</final-instructions>",
        "temperature": "0.0",
        "model": "us.amazon.nova-pro-v1:0",
        "system_prompt": "You are a document assistant. Respond only with JSON. Never make up data, only provide data found in the document being provided."
    },
    "assessment": {
        "default_confidence_threshold": "0.9",
        "top_p": "0.1",
        "max_tokens": "4096",
        "top_k": "5",
        "task_prompt": "<background>\nYou are an expert document analysis assessment system. Your task is to evaluate the confidence and accuracy of extraction results for a document of class {DOCUMENT_CLASS}.\n</background>\n\n<task>\nAnalyze the extraction results against the source document and provide confidence assessments for each extracted attribute. Consider factors such as:\n1. Text clarity and OCR quality in the source regions 2. Alignment between extracted values and document content 3. Presence of clear evidence supporting the extraction 4. Potential ambiguity or uncertainty in the source material 5. Completeness and accuracy of the extracted information\n</task>\n\n<assessment-guidelines>\nFor each attribute, provide: 1. A confidence score between 0.0 and 1.0 where:\n   - 1.0 = Very high confidence, clear and unambiguous evidence\n   - 0.8-0.9 = High confidence, strong evidence with minor uncertainty\n   - 0.6-0.7 = Medium confidence, reasonable evidence but some ambiguity\n   - 0.4-0.5 = Low confidence, weak or unclear evidence\n   - 0.0-0.3 = Very low confidence, little to no supporting evidence\n\n2. A clear reason explaining the confidence score, including:\n   - What evidence supports or contradicts the extraction\n   - Any OCR quality issues that affect confidence\n   - Clarity of the source document in relevant areas\n   - Any ambiguity or uncertainty factors\n\nGuidelines: - Base assessments on actual document content and OCR quality - Consider both text-based evidence and visual/layout clues - Account for OCR confidence scores when provided - Be objective and specific in reasoning - If an extraction appears incorrect, score accordingly with explanation\n</assessment-guidelines>\n<attributes-definitions>\n{ATTRIBUTE_NAMES_AND_DESCRIPTIONS}\n</attributes-definitions>\n\n<<CACHEPOINT>>\n\n<extraction-results>\n{EXTRACTION_RESULTS}\n</extraction-results>\n\n<document-image>\n{DOCUMENT_IMAGE}\n</document-image>\n\n<ocr-text-confidence-results>\n{OCR_TEXT_CONFIDENCE}\n</ocr-text-confidence-results>\n\n<final-instructions>\nAnalyze the extraction results against the source document and provide confidence assessments. Return a JSON object with the following structure:\n\n  {\n    \"attribute_name_1\": {\n      \"confidence_score\": 0.85,\n      \"confidence_reason\": \"Clear text evidence found in document header with high OCR confidence (0.98). Value matches exactly.\"\n    },\n    \"attribute_name_2\": {\n      \"confidence_score\": 0.65,\n      \"confidence_reason\": \"Text is partially unclear due to poor scan quality. OCR confidence low (0.72) in this region.\"\n    }\n  }\n\nInclude assessments for ALL attributes present in the extraction results.\n</final-instructions>",
        "temperature": "0.0",
        "model": "us.amazon.nova-pro-v1:0",
        "system_prompt": "You are a document analysis assessment expert. Your task is to evaluate the confidence and accuracy of extraction results by analyzing the source document evidence. Respond only with JSON containing confidence scores and reasoning for each extracted attribute."
    },
    "classes": [
        {
            "name": "letter",
            "description": "A formal written correspondence with sender/recipient addresses, date, salutation, body, and closing signature",
            "attributes": [
                {
                    "name": "sender_name",
                    "description": "The name of the person or entity who wrote or sent the letter. Look for text following or near terms like 'from', 'sender', 'authored by', 'written by', or at the end of the letter before a signature.",
                    "confidence_threshold": "0.85"
                },
                {
                    "name": "sender_address",
                    "description": "The physical address of the sender, typically appearing at the top of the letter. May be labeled as 'address', 'location', or 'from address'.",
                    "confidence_threshold": "0.8"
                },
                {
                    "name": "recipient_name",
                    "description": "The name of the person or entity receiving the letter. Look for this after 'to', 'recipient', 'addressee', or at the beginning of the letter.",
                    "confidence_threshold": "0.9"
                },
                {
                    "name": "recipient_address",
                    "description": "The physical address where the letter is to be delivered. Often labeled as 'to address' or 'delivery address', typically appearing below the recipient name."
                },
                {
                    "name": "date",
                    "description": "The date when the letter was written. Look for a standalone date or text following phrases like 'written on' or 'dated'."
                },
                {
                    "name": "subject",
                    "description": "The topic or main point of the letter. Often preceded by 'subject', 'RE:', or 'regarding'."
                },
                {
                    "name": "letter_type",
                    "description": "The category or classification of the letter, such as 'complaint', 'inquiry', 'invitation', etc. May be indicated by 'type' or 'category'."
                },
                {
                    "name": "signature",
                    "description": "The handwritten name or mark of the sender at the end of the letter. May follow terms like 'signed by' or simply appear at the bottom of the document."
                },
                {
                    "name": "cc",
                    "description": "Names of people who receive a copy of the letter in addition to the main recipient. Often preceded by 'cc', 'carbon copy', or 'copy to'."
                },
                {
                    "name": "reference_number",
                    "description": "An identifying number or code associated with the letter. Look for labels like 'ref', 'reference', or 'our ref'."
                }
            ]
        },
        {
            "name": "form",
            "description": "A structured document with labeled fields, checkboxes, or blanks requiring user input and completion",
            "attributes": [
                {
                    "name": "form_type",
                    "description": "The category or purpose of the form, such as 'application', 'registration', 'request', etc. May be identified by 'form name', 'document type', or 'form category'."
                },
                {
                    "name": "form_id",
                    "description": "The unique identifier for the form, typically a number or alphanumeric code. Often labeled as 'form number', 'id', or 'reference number'."
                },
                {
                    "name": "submission_date",
                    "description": "The date when the form was submitted or filed. Look for text near 'date', 'submitted on', or 'filed on'."
                },
                {
                    "name": "submitter_name",
                    "description": "The name of the person who submitted the form. May be labeled as 'name', 'submitted by', or 'filed by'."
                },
                {
                    "name": "submitter_id",
                    "description": "An identification number for the person submitting the form, such as social security number, employee ID, etc. Often labeled as 'id number', 'identification', or 'reference'."
                },
                {
                    "name": "approval_status",
                    "description": "The current state of approval for the form, such as 'approved', 'pending', 'rejected', etc. Look for terms like 'status', 'approved', or 'pending'."
                },
                {
                    "name": "processed_by",
                    "description": "The name of the person or department that processed the form. May be indicated by 'processor', 'handled by', or 'approved by'."
                },
                {
                    "name": "processing_date",
                    "description": "The date when the form was processed or completed. Look for labels like 'processed on' or 'completion date'."
                },
                {
                    "name": "department",
                    "description": "The organizational unit responsible for the form. Often abbreviated as 'dept' or may appear as 'department' or 'division'."
                },
                {
                    "name": "comments",
                    "description": "Additional notes or remarks about the form. Look for sections labeled 'notes', 'remarks', or 'comments'."
                }
            ]
        },
        {
            "name": "email",
            "description": "A digital message with email headers (To/From/Subject), timestamps, and conversational threading",
            "attributes": [
                {
                    "name": "from_address",
                    "description": "The email address of the sender. Look for text following 'from', 'sender', or 'sent by', typically at the beginning of the email header."
                },
                {
                    "name": "to_address",
                    "description": "The email address of the primary recipient. May be labeled as 'to', 'recipient', or 'sent to'."
                },
                {
                    "name": "cc_address",
                    "description": "Email addresses of additional recipients who receive copies. Look for 'cc' or 'carbon copy' followed by one or more email addresses."
                },
                {
                    "name": "bcc_address",
                    "description": "Email addresses of hidden recipients. May be labeled as 'bcc' or 'blind copy'."
                },
                {
                    "name": "subject",
                    "description": "The topic of the email. Often preceded by 'subject', 'RE:', or 'regarding'."
                },
                {
                    "name": "date_sent",
                    "description": "The date and time when the email was sent. Look for 'date', 'sent on', or 'received', typically in the email header."
                },
                {
                    "name": "attachments",
                    "description": "Files included with the email. May be indicated by 'attached', 'attachment', or 'enclosed', often with icons or file names."
                },
                {
                    "name": "priority",
                    "description": "The urgency level of the email, such as 'high', 'normal', etc. Look for 'priority' or 'importance'."
                },
                {
                    "name": "thread_id",
                    "description": "An identifier for the email conversation. May be labeled as 'thread' or 'conversation', typically not visible to regular users."
                },
                {
                    "name": "message_id",
                    "description": "A unique identifier for the specific email. Look for 'message id' or 'email id', usually hidden in the email metadata."
                }
            ]
        }
    ]
}

print("Configuration created with assessment capabilities")

Configuration created with assessment capabilities
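
The configuration stores a `default_confidence_threshold` under `assessment` and an optional `confidence_threshold` on individual attributes, all as strings. This notebook does not show how the assessment service resolves them, so the sketch below rests on an assumption: an attribute-level value, when present, overrides the default. The helper `resolve_confidence_thresholds` and the hard-coded fallback of `0.9` are hypothetical and not part of `idp_common`.

```python
from typing import Any, Dict


def resolve_confidence_thresholds(config: Dict[str, Any], class_name: str) -> Dict[str, float]:
    """Map each attribute of a class to its effective threshold as a float.

    Assumption: attribute-level 'confidence_threshold' overrides the
    assessment-level 'default_confidence_threshold'.
    """
    default = float(config.get("assessment", {}).get("default_confidence_threshold", 0.9))
    for doc_class in config.get("classes", []):
        if doc_class.get("name") == class_name:
            return {
                attr["name"]: float(attr.get("confidence_threshold", default))
                for attr in doc_class.get("attributes", [])
            }
    raise KeyError(f"Unknown document class: {class_name}")


# Trimmed-down stand-in for CONFIG with the same shape.
example_config = {
    "assessment": {"default_confidence_threshold": "0.9"},
    "classes": [
        {
            "name": "letter",
            "attributes": [
                {"name": "sender_name", "confidence_threshold": "0.85"},
                {"name": "date"},  # no override: falls back to the default
            ],
        }
    ],
}
print(resolve_confidence_thresholds(example_config, "letter"))
```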


## 5. Process Document with OCR

In [5]:
# Initialize a new Document
document = Document(
    id="rvl-cdip-package-assessment",
    input_bucket=input_bucket_name,
    input_key=sample_file_key,
    output_bucket=output_bucket_name,
    status=Status.QUEUED
)

print(f"Created document with ID: {document.id}")
print(f"Status: {document.status.value}")

# Create OCR service with Textract
ocr_service = ocr.OcrService(
    region=region,
    enhanced_features=['LAYOUT']
)

# Process document with OCR
print("\nProcessing document with OCR...")
start_time = time.time()
document = ocr_service.process_document(document)
ocr_time = time.time() - start_time

print(f"OCR processing completed in {ocr_time:.2f} seconds")
print(f"Document status: {document.status.value}")
print(f"Number of pages processed: {document.num_pages}")

# Show pages information
print("\nProcessed pages:")
for page_id, page in document.pages.items():
    print(f"Page {page_id}: Image URI: {page.image_uri}")
print("\nMetering:")
print(json.dumps(document.metering))

INFO:idp_common.ocr.service:OCR Service initialized with features: ['LAYOUT']


Created document with ID: rvl-cdip-package-assessment
Status: QUEUED

Processing document with OCR...


INFO:idp_common.ocr.service:Successfully extracted markdown text for page 10
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 5
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 2
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 7
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 1
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 9
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 6
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 3
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 4
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 8
INFO:idp_common.ocr.service:Sorting 10 pages by page number
INFO:idp_common.ocr.service:OCR processing completed in 6.55 seconds
INFO:idp_common.ocr.service:Processed 10 pages, with 0 errors


OCR processing completed in 6.55 seconds
Document status: QUEUED
Number of pages processed: 10

Processed pages:
Page 1: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-assessment-2025-06-09_19-49-46.pdf/pages/1/image.jpg
Page 2: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-assessment-2025-06-09_19-49-46.pdf/pages/2/image.jpg
Page 3: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-assessment-2025-06-09_19-49-46.pdf/pages/3/image.jpg
Page 4: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-assessment-2025-06-09_19-49-46.pdf/pages/4/image.jpg
Page 5: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-assessment-2025-06-09_19-49-46.pdf/pages/5/image.jpg
Page 6: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-assessment-2025-06-09_19-49-46.pdf/pages/6/image.jpg
Page 7: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-as
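
The OCR cell above, and the classification and extraction cells that follow, time each stage with the same `start_time = time.time()` pattern. A small context manager can remove that repetition. This is only a sketch: `timed` is a hypothetical helper, and the `time.sleep` call stands in for a real service call such as `ocr_service.process_document(document)`.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(label: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed stages are still timed.
        timings[label] = time.perf_counter() - start
        print(f"{label} completed in {timings[label]:.2f} seconds")


timings: Dict[str, float] = {}
with timed("OCR processing", timings):
    time.sleep(0.01)  # Stand-in for the real service call
```

Collecting the durations in one dictionary also makes it easy to print a per-stage summary at the end of the notebook.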

## 6. Classify the Document

In [6]:
# Create classification service with Bedrock backend
classification_service = classification.ClassificationService(
    config=CONFIG, 
    backend="bedrock" 
)

# Classify the document
print("\nClassifying document...")
start_time = time.time()
document = classification_service.classify_document(document)
classification_time = time.time() - start_time
print(f"Classification completed in {classification_time:.2f} seconds")
print(f"Document status: {document.status.value}")

# Show classification results
if document.sections:
    print("\nDetected sections:")
    for section in document.sections:
        print(f"Section {section.section_id}: {section.classification}")
        print(f"  Pages: {section.page_ids}")
else:
    print("\nNo sections detected")

# Show page classification
print("\nPage-level classifications:")
for page_id, page in sorted(document.pages.items()):
    print(f"Page {page_id}: {page.classification}")


Classifying document...


DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: <task-description>
You are a document classificati...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 422 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 2582 words
INFO:idp_common.bedrock.client:Processed content with 1 cachepoint insertions
INFO:idp_common.bedrock.client:Applied cachePoint processing for supported model: us.amazon.nova-pro-v1:0
INFO:idp_common.bedrock.client:Bedrock request attempt 1/7:
INFO:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
INFO:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1, 'maxTokens': 4096}
INFO:idp_common.bedrock.client:  - system: [{'text': 'You are a document classification expert who can analyze and classify multiple documents and their page boundaries within a document package f

Classification completed in 5.73 seconds
Document status: QUEUED

Detected sections:
Section 1: letter
  Pages: ['1']
Section 2: form
  Pages: ['2']
Section 3: email
  Pages: ['3']
Section 4: form
  Pages: ['4', '5']
Section 5: letter
  Pages: ['6']
Section 6: form
  Pages: ['7']
Section 7: form
  Pages: ['8', '9']
Section 8: letter
  Pages: ['10']

Page-level classifications:
Page 1: letter
Page 10: letter
Page 2: form
Page 3: email
Page 4: form
Page 5: form
Page 6: letter
Page 7: form
Page 8: form
Page 9: form
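
In the page-level output above, page 10 is listed before page 2 because the page IDs are strings and `sorted()` compares them lexicographically. The sketch below sorts numerically instead. It assumes every page ID is a numeric string, as in this run; `pages_in_numeric_order` is a hypothetical helper, and the dictionary is a stand-in for `document.pages`.

```python
from typing import Dict, List, Tuple, TypeVar

T = TypeVar("T")


def pages_in_numeric_order(pages: Dict[str, T]) -> List[Tuple[str, T]]:
    """Return (page_id, page) pairs ordered by the integer value of page_id."""
    return sorted(pages.items(), key=lambda item: int(item[0]))


# Stand-in for document.pages, mapping page IDs to classifications.
example_pages = {"1": "letter", "10": "letter", "2": "form", "3": "email"}
for page_id, label in pages_in_numeric_order(example_pages):
    print(f"Page {page_id}: {label}")
```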


## 7. Extract Information from Document Sections

In [7]:
# Create extraction service with Bedrock
extraction_service = extraction.ExtractionService(config=CONFIG)

print("\nExtracting information from document sections...")

n = 3  # Only process the first 3 sections to save time
# Process each section directly using its section_id
for section in document.sections[:n]:  
    print(f"\nProcessing section {section.section_id} (class: {section.classification})")
    
    # Process section directly with the original document
    start_time = time.time()
    document = extraction_service.process_document_section(
        document=document,
        section_id=section.section_id
    )
    extraction_time = time.time() - start_time
    print(f"Extraction for section {section.section_id} completed in {extraction_time:.2f} seconds")
    
print(f"\nExtraction for first {n} sections complete.")


Extracting information from document sections...

Processing section 1 (class: letter)


DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: <background>
You are an expert in document analysi...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 621 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 325 words
DEBUG:idp_common.bedrock.client:No cachepoint tags in image content, passing through unchanged
DEBUG:idp_common.bedrock.client:No cachepoint tags in text content, passing through unchanged
INFO:idp_common.bedrock.client:Processed content with 1 cachepoint insertions
INFO:idp_common.bedrock.client:Applied cachePoint processing for supported model: us.amazon.nova-pro-v1:0
INFO:idp_common.bedrock.client:Bedrock request attempt 1/7:
INFO:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
INFO:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1, 'maxTokens': 4096}

Extraction for section 1 completed in 4.83 seconds

Processing section 2 (class: form)


DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: <background>
You are an expert in document analysi...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 587 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 148 words
DEBUG:idp_common.bedrock.client:No cachepoint tags in image content, passing through unchanged
DEBUG:idp_common.bedrock.client:No cachepoint tags in text content, passing through unchanged
INFO:idp_common.bedrock.client:Processed content with 1 cachepoint insertions
INFO:idp_common.bedrock.client:Applied cachePoint processing for supported model: us.amazon.nova-pro-v1:0
INFO:idp_common.bedrock.client:Bedrock request attempt 1/7:
INFO:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
INFO:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1, 'maxTokens': 4096}

Extraction for section 2 completed in 4.17 seconds

Processing section 3 (class: email)


DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: <background>
You are an expert in document analysi...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 565 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 228 words
DEBUG:idp_common.bedrock.client:No cachepoint tags in image content, passing through unchanged
DEBUG:idp_common.bedrock.client:No cachepoint tags in text content, passing through unchanged
INFO:idp_common.bedrock.client:Processed content with 1 cachepoint insertions
INFO:idp_common.bedrock.client:Applied cachePoint processing for supported model: us.amazon.nova-pro-v1:0
INFO:idp_common.bedrock.client:Bedrock request attempt 1/7:
INFO:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
INFO:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1, 'maxTokens': 4096}

Extraction for section 3 completed in 3.77 seconds

Extraction for first 3 sections complete.
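
The debug lines in the output above suggest that the Bedrock client splits prompt text at `<<CACHEPOINT>>` tags and inserts a cache point between the parts. The following is a rough sketch of that idea, not the `idp_common` implementation. The function name `split_at_cachepoints` and the sample prompt are hypothetical, and the `{"cachePoint": {"type": "default"}}` block follows the content-block shape that the Bedrock Converse API uses for prompt caching.

```python
CACHEPOINT_TAG = "<<CACHEPOINT>>"


def split_at_cachepoints(text):
    """Split prompt text at cachepoint tags into Converse-style content blocks.

    Hypothetical helper for illustration; idp_common may differ in detail.
    """
    parts = text.split(CACHEPOINT_TAG)
    content = []
    for index, part in enumerate(parts):
        # Skip parts that contain only whitespace.
        if part.strip():
            content.append({"text": part})
        # Insert a cache point after every part except the last one.
        if index < len(parts) - 1:
            content.append({"cachePoint": {"type": "default"}})
    return content


prompt = (
    "Static instructions and examples."
    "<<CACHEPOINT>>"
    "Text of the document being processed."
)
blocks = split_at_cachepoints(prompt)
for block in blocks:
    print(block)
```

Placing the static instructions before the tag means the part that repeats across requests is the part eligible for caching, while the per-document text after the tag changes on every call.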


## 8. Assess Extraction Confidence

This step evaluates the confidence and accuracy of the extraction results by checking each extracted attribute against the source document.

In [8]:
# Create assessment service with Bedrock
assessment_service = assessment.AssessmentService(config=CONFIG)

print("\nAssessing extraction confidence for document sections...")

# Process each section that has extraction results
for section in document.sections[:n]:  
    if section.extraction_result_uri:
        print(f"\nAssessing section {section.section_id} (class: {section.classification})")
        
        # Assess the section
        start_time = time.time()
        document = assessment_service.process_document_section(
            document=document,
            section_id=section.section_id
        )
        assessment_time = time.time() - start_time
        print(f"Assessment for section {section.section_id} completed in {assessment_time:.2f} seconds")
    else:
        print(f"\nSkipping section {section.section_id} - no extraction results to assess")
        
print(f"\nAssessment for first {n} sections complete.")

INFO:idp_common.assessment.service:Initialized assessment service with model us.amazon.nova-pro-v1:0
INFO:idp_common.assessment.service:Assessing 1 pages, class letter: 1-1



Assessing extraction confidence for document sections...

Assessing section 1 (class: letter)


INFO:idp_common.assessment.service:Time taken to read extraction results: 0.09 seconds
INFO:idp_common.assessment.service:Time taken to read text content: 0.08 seconds
INFO:idp_common.assessment.service:Time taken to read images: 0.09 seconds
INFO:idp_common.assessment.service:Time taken to read raw OCR results: 0.09 seconds
INFO:idp_common.assessment.service:Assessing extraction confidence for letter document, section 1
DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: <background>
You are an expert document analysis a...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 510 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 55 words
DEBUG:idp_common.bedrock.client:No cachepoint tags in image content, passing through unchanged
DEBUG:idp_common.bedrock.client:No cachepoint tags in text content, passing through unchanged

Assessment for section 1 completed in 16.01 seconds

Assessing section 2 (class: form)


INFO:idp_common.assessment.service:Time taken to read extraction results: 0.09 seconds
INFO:idp_common.assessment.service:Time taken to read text content: 0.08 seconds
INFO:idp_common.assessment.service:Time taken to read images: 0.24 seconds
INFO:idp_common.assessment.service:Time taken to read raw OCR results: 0.09 seconds
INFO:idp_common.assessment.service:Assessing extraction confidence for form document, section 2
DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: <background>
You are an expert document analysis a...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 476 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 28 words
DEBUG:idp_common.bedrock.client:No cachepoint tags in image content, passing through unchanged
DEBUG:idp_common.bedrock.client:No cachepoint tags in text content, passing through unchanged

Assessment for section 2 completed in 11.54 seconds

Assessing section 3 (class: email)


INFO:idp_common.assessment.service:Time taken to read extraction results: 0.09 seconds
INFO:idp_common.assessment.service:Time taken to read text content: 0.09 seconds
INFO:idp_common.assessment.service:Time taken to read images: 0.16 seconds
INFO:idp_common.assessment.service:Time taken to read raw OCR results: 0.09 seconds
INFO:idp_common.assessment.service:Assessing extraction confidence for email document, section 3
DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: <background>
You are an expert document analysis a...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 454 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 36 words
DEBUG:idp_common.bedrock.client:No cachepoint tags in image content, passing through unchanged
DEBUG:idp_common.bedrock.client:No cachepoint tags in text content, passing through unchanged


Assessment for section 3 completed in 18.94 seconds

Assessment for first 3 sections complete.


## 9. Display Assessment Results

Let's examine the assessment results that have been added to the extraction results.

In [9]:
print("\nAssessment Results:")
print("===================\n")

for section in document.sections[:n]:
    if section.extraction_result_uri:
        print(f"Section {section.section_id} ({section.classification}):")
        
        # Load the updated extraction results with assessment
        extraction_data = load_json_from_s3(section.extraction_result_uri)
        
        # Display the inference results
        print("  Extraction Results:")
        inference_result = extraction_data.get('inference_result', {})
        for attr_name, attr_value in inference_result.items():
            print(f"    {attr_name}: {attr_value}")

        # Display the assessment results (first entry of explainability_info).
        # The loop variable is named attr_assessment so that it does not
        # shadow the `assessment` module used to create the service in step 8.
        explainability_info = extraction_data.get('explainability_info', [])
        if explainability_info:
            print("  Assessment Results:")
            for attr_name, attr_assessment in explainability_info[0].items():
                confidence_score = attr_assessment.get('confidence_score', 'N/A')
                confidence_reason = attr_assessment.get('confidence_reason', 'N/A')
                confidence_threshold = attr_assessment.get('confidence_threshold', 'N/A')
                print(f"    {attr_name}:")
                print(f"      Confidence Score: {confidence_score}")
                print(f"      Reason: {confidence_reason}")
                print(f"      Threshold: {confidence_threshold}")
        else:
            print("  No assessment results found")
        
        print()


Assessment Results:

Section 1 (letter):
  Extraction Results:
    sender_name: Will E. Clark
    sender_address: 206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056
    recipient_name: The Honorable Wendell H. Ford
    recipient_address: United States Senate Washington, D. C. 20510
    date: 10/31/1995
    subject: Opposition to the 'Commitment to our Children' petition
    letter_type: Opposition
    signature: Will E. Clark
    cc: None
    reference_number: TNJB 0008497
  Assessment Results:
    sender_name:
      Confidence Score: 0.95
      Reason: The sender name 'Will E. Clark' is clearly stated in the document with high OCR confidence (91.54730987548828). The signature at the end of the letter also matches this name, providing strong evidence.
      Threshold: 0.85
    sender_address:
      Confidence Score: 0.9
      Reason: The sender address '206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056' is clearly stated at the top of the document. The OCR confidence fo

## 10. Final Document Status Summary

In [11]:
# Update document status to COMPLETED
document.status = Status.COMPLETED

# Display final document state
print("Final Document State:")
print(f"Document ID: {document.id}")
print(f"Status: {document.status.value}")
print(f"Number of pages: {document.num_pages}")
print(f"Number of sections: {len(document.sections)}")

print("\n=== Assessment Feature Summary ===")
print("✅ OCR Processing - Convert PDF to text and images")
print("✅ Document Classification - Identify document types")
print("✅ Information Extraction - Extract structured data")
print("✅ Assessment - Evaluate extraction confidence")
print("\nThe assessment feature provides:")
print("- Confidence scores (0.0 to 1.0) for each extracted attribute")
print("- Detailed reasoning explaining the confidence level")
print("- Analysis of OCR quality and document clarity")
print("- Identification of ambiguous or uncertain extractions")
print("- Integration with existing extraction results")

Final Document State:
Document ID: rvl-cdip-package-assessment
Status: COMPLETED
Number of pages: 10
Number of sections: 8

=== Assessment Feature Summary ===
✅ OCR Processing - Convert PDF to text and images
✅ Document Classification - Identify document types
✅ Information Extraction - Extract structured data
✅ Assessment - Evaluate extraction confidence

The assessment feature provides:
- Confidence scores (0.0 to 1.0) for each extracted attribute
- Detailed reasoning explaining the confidence level
- Analysis of OCR quality and document clarity
- Identification of ambiguous or uncertain extractions
- Integration with existing extraction results


## Conclusion

This notebook demonstrates the enhanced end-to-end processing flow with the new **Assessment Service**:

1. **Document Creation** - Initialize a Document object with input/output locations
2. **OCR Processing** - Convert PDF to text using AWS Textract via OcrService
3. **Classification** - Identify document types and sections using Bedrock via ClassificationService
4. **Extraction** - Extract structured information using Bedrock via ExtractionService
5. **Assessment** - Evaluate extraction confidence using Bedrock via AssessmentService ✨ **NEW**
6. **Document Model** - Document object is consistently used between all services
7. **Result Storage** - Assessment results are stored alongside extraction results in S3

### Key Benefits of the Assessment Service:

1. **Explainability** - Provides confidence scores and reasoning for each extracted attribute
2. **Quality Control** - Identifies extractions that may need human review
3. **OCR Analysis** - Considers OCR quality and document clarity in confidence scoring
4. **Integration** - Seamlessly integrates with existing extraction workflows
5. **Flexibility** - Configurable prompts and models for different assessment strategies
6. **Multimodal** - Uses both text and image content for comprehensive assessment
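
The quality-control benefit listed above can be illustrated with a short sketch that flags attributes whose confidence score falls below their threshold. It assumes the flat `confidence_score` and `confidence_threshold` fields read by the display cell in step 9. The helper name `find_attributes_for_review`, the fallback `default_threshold=0.8`, and the sample data are hypothetical and not part of `idp_common`.

```python
def find_attributes_for_review(extraction_data, default_threshold=0.8):
    """Return (name, score, threshold) for attributes scoring below threshold.

    default_threshold is an assumed fallback, not a value from idp_common.
    """
    info = extraction_data.get("explainability_info", [])
    # The display cell reads the first list element; accept a bare dict too.
    if isinstance(info, list):
        assessments = info[0] if info else {}
    else:
        assessments = info
    flagged = []
    for attr_name, attr in assessments.items():
        score = attr.get("confidence_score")
        threshold = attr.get("confidence_threshold", default_threshold)
        # A missing score is treated as needing review.
        if score is None or score < threshold:
            flagged.append((attr_name, score, threshold))
    return flagged


# Hypothetical sample data in the shape used by the display cell.
sample = {
    "inference_result": {"sender_name": "John Doe", "sender_address": "123 Main St"},
    "explainability_info": [
        {
            "sender_name": {"confidence_score": 0.95, "confidence_threshold": 0.85},
            "sender_address": {"confidence_score": 0.75, "confidence_threshold": 0.85},
        }
    ],
}

for name, score, threshold in find_attributes_for_review(sample):
    print(f"Review {name}: score {score}, threshold {threshold}")
```

In a real workflow the flagged list could feed a human review queue; how that queue is implemented is outside the scope of this notebook.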

### Assessment Output Structure:

The assessment service appends `explainability_info` to existing extraction results:

```json
{
  "document_class": {"type": "letter"},
  "inference_result": {
    "sender_name": "John Doe",
    "sender_address": "123 Main St"
  },
  "explainability_info": [
    {
      "sender_name": {
        "confidence_score": 0.95,
        "confidence_reason": "Clear text found in document header with high OCR confidence",
        "confidence_threshold": 0.85
      },
      "sender_address": {
        "confidence_score": 0.75,
        "confidence_reason": "Address partially visible but some characters unclear",
        "confidence_threshold": 0.85
      }
    }
  ],
  "metadata": {
    "extraction_time_seconds": 2.3,
    "assessment_time_seconds": 1.8
  }
}
```
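
Given an output of this shape, one simple sanity check is to confirm that every extracted attribute also received an assessment. The sketch below is illustrative only: `missing_assessments` and the sample data are hypothetical, and the code accepts `explainability_info` either as a list (as the display cell in step 9 reads it) or as a plain dictionary.

```python
def missing_assessments(extraction_data):
    """Return the sorted names of extracted attributes that have no assessment."""
    extracted = extraction_data.get("inference_result", {})
    info = extraction_data.get("explainability_info", [])
    # Accept either a list whose first element holds the assessments or a dict.
    if isinstance(info, list):
        assessed = info[0] if info else {}
    else:
        assessed = info
    return sorted(name for name in extracted if name not in assessed)


# Hypothetical sample data: "date" was extracted but never assessed.
result = {
    "inference_result": {
        "sender_name": "John Doe",
        "sender_address": "123 Main St",
        "date": "10/31/1995",
    },
    "explainability_info": [
        {
            "sender_name": {"confidence_score": 0.95},
            "sender_address": {"confidence_score": 0.75},
        }
    ],
}

print(missing_assessments(result))
```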

This assessment capability enables more robust document processing workflows with built-in quality control and explainability features.