# IDP Service Validation Suite

This notebook performs an in-depth validation of the intelligent document processing (IDP) pipeline using the in-repo LLM proxy for Azure Document Intelligence. It routes a diverse set of documents, enforces overrides, runs the parsing workflow with summarisation and enrichment, and verifies idempotent storage behaviour.


## Notebook goals
- Build deterministic fixtures covering PDFs, spreadsheets, CSV extracts, and HTML emails.
- Execute the router with static configuration, filename pattern overrides, and request-level overrides.
- Run the document workflow to generate canonical outputs, summaries, and optional enrichments.
- Assert that confidence-bearing canonical records are produced and idempotency is preserved across replays.


## 1. Repository context
Ensure the repository root is on the import path so the notebook can import the local `idp_service` package and helper modules without installation.
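
The cell in this section checks only the current directory and its parent. If the notebook might be launched from a deeper folder, the search could walk up every ancestor instead. The following is a minimal sketch under that assumption; `find_repo_root` and its `marker` parameter are hypothetical names introduced here and are not part of `idp_service`.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, marker: str = "idp_service") -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Hypothetical helper: returns None when no ancestor contains the marker.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None
```

Returning `None` rather than raising lets the caller decide whether a missing package directory is fatal.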


In [None]:
import sys
from pathlib import Path

REPO_ROOT = Path.cwd().resolve()
if not (REPO_ROOT / "idp_service").exists():
    REPO_ROOT = REPO_ROOT.parent
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))
print(f"Using repository root: {REPO_ROOT}")


## 2. Prepare rich sample documents
Reconstruct the embedded PDF/XLSX fixtures, synthesise a CSV extract, and craft an HTML-heavy email message so the router sees multiple content patterns.
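
One way to confirm that a synthesised CSV fixture is well-formed is to parse it back with the standard `csv` module. The sketch below does this on an in-memory copy of the product catalogue rows; the `CATALOG_TEXT` constant and the `parse_catalog` helper are introduced for illustration only.

```python
import csv
import io
from typing import Dict, List

# In-memory copy of the catalogue fixture, one record per line.
CATALOG_TEXT = (
    "item,description,price\n"
    "SKU-001,Cloud storage subscription,49.99\n"
    "SKU-002,AI analytics toolkit,129.00\n"
    "SKU-003,Premium support retainer,299.00\n"
)


def parse_catalog(text: str) -> List[Dict[str, str]]:
    """Parse CSV text into a list of row dictionaries keyed by header."""
    return list(csv.DictReader(io.StringIO(text)))


rows = parse_catalog(CATALOG_TEXT)
total = sum(float(row["price"]) for row in rows)
print(f"{len(rows)} rows, total price {total:.2f}")
```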


In [None]:
import csv
from email.message import EmailMessage
from pathlib import Path
import textwrap

from idp_service.sample_documents_embedded import write_embedded

SAMPLES_DIR = REPO_ROOT / "docs" / "sample_documents"
SAMPLES_DIR.mkdir(parents=True, exist_ok=True)

EMBEDDED_FILES = [
    "financial_report.pdf",
    "operating_budget.xlsx",
]
for name in EMBEDDED_FILES:
    write_embedded(name, SAMPLES_DIR / name)

catalog_path = SAMPLES_DIR / "product_catalog.csv"
catalog_path.write_text(
    "item,description,price\n"
    "SKU-001,Cloud storage subscription,49.99\n"
    "SKU-002,AI analytics toolkit,129.00\n"
    "SKU-003,Premium support retainer,299.00\n",
    encoding="utf-8",
)

invoice_path = SAMPLES_DIR / "invoice_snapshot.csv"
invoice_path.write_text(
    "invoice_id,amount,status\n"
    "INV-2024-001,1500.00,Paid\n"
    "INV-2024-002,890.50,Pending\n",
    encoding="utf-8",
)

msg = EmailMessage()
msg["Subject"] = "Client intake checklist"
msg["From"] = "operations@example.com"
msg["To"] = "idp-demo@example.com"
msg.set_content(
    "Please review the onboarding checklist below and confirm selections."
)
msg.add_alternative(
    textwrap.dedent(
        """        <html>
            <body>
                <h2>Client onboarding checklist</h2>
                <p>Select the required compliance artefacts:</p>
                <ul>
                    <li><input type="checkbox" checked /> Identity verification</li>
                    <li><input type="checkbox" /> Financial statements</li>
                    <li><input type="radio" name="tier" /> Standard support</li>
                    <li><input type="radio" name="tier" checked /> Premium support</li>
                </ul>
                <table>
                    <tr><th>Artefact</th><th>Status</th></tr>
                    <tr><td>AML screening</td><td>Complete</td></tr>
                    <tr><td>KYC package</td><td>Pending</td></tr>
                </table>
            </body>
        </html>
        """
    ),
    subtype="html",
)
email_path = SAMPLES_DIR / "client_intake.eml"
email_path.write_bytes(msg.as_bytes())

print("Created sample documents:")
for path in [
    SAMPLES_DIR / name for name in EMBEDDED_FILES
] + [catalog_path, invoice_path, email_path]:
    print(f" - {path.relative_to(REPO_ROOT)}")


## 3. Configure the router and overrides
Build a hybrid router with explicit strategy mappings per category and demonstrate a filename-driven override for invoices.
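
The scenarios later in this notebook expect three routing reasons: `category_default`, `config_pattern_override`, and `request_override`. The sketch below illustrates one plausible precedence order (request override first, then filename pattern, then category default). It is a standalone illustration, not the `DocumentRouter` implementation: `SketchOverride`, `resolve_strategy`, the precedence order, and the `"fallback"` reason string are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Pattern, Tuple


@dataclass(frozen=True)
class SketchOverride:
    """Hypothetical stand-in for a filename pattern override."""
    pattern: Pattern[str]
    strategy: str


def resolve_strategy(
    object_key: str,
    category: str,
    defaults: Dict[str, str],
    overrides: List[SketchOverride],
    request_override: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (strategy_name, reason) using an assumed precedence order."""
    if request_override:
        return request_override, "request_override"
    for override in overrides:
        if override.pattern.search(object_key):
            return override.strategy, "config_pattern_override"
    if category in defaults:
        return defaults[category], "category_default"
    return "fallback_proxy", "fallback"


defaults = {"short_form": "azure_general"}
overrides = [SketchOverride(re.compile(r"invoice", re.IGNORECASE), "invoice_override")]
print(resolve_strategy("incoming/invoice_snapshot.csv", "short_form", defaults, overrides))
print(resolve_strategy("exports/product_catalog.csv", "short_form", defaults, overrides))
```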


In [None]:
import re

from idp_service.routing import (
    DocumentCategory,
    DocumentRouter,
    HeuristicLayoutAnalyser,
    OverrideSet,
    PatternOverride,
    RouterConfig,
    StrategyConfig,
)

router_config = RouterConfig(
    category_thresholds={
        "short_form_threshold": 5,
        "long_form_threshold": 15,
        "short_form_max_pages": 4,
        "long_form_max_pages": 120,
        "table_heavy_max_pages": 10,
        "form_max_pages": 6,
    },
    default_strategy_map={
        DocumentCategory.SHORT_FORM.value: {"name": "azure_general", "model": "prebuilt-document"},
        DocumentCategory.LONG_FORM.value: {"name": "azure_longform", "model": "prebuilt-layout"},
        DocumentCategory.TABLE_HEAVY.value: {"name": "azure_tables", "model": "prebuilt-layout"},
        DocumentCategory.FORM_HEAVY.value: {"name": "azure_forms", "model": "prebuilt-layout"},
        DocumentCategory.SCANNED.value: {"name": "ocr_enhanced", "model": "ocr-2024"},
    },
    fallback_strategy={"name": "fallback_proxy"},
)

pattern_override = PatternOverride(
    pattern=re.compile(r"invoice", re.IGNORECASE),
    strategy=StrategyConfig(name="invoice_override", model="custom-invoice-v1"),
)
overrides = OverrideSet(pattern_overrides=[pattern_override])

router = DocumentRouter(
    config=router_config,
    layout_analyser=HeuristicLayoutAnalyser(),
)
print("Router ready with thresholds:", router_config.category_thresholds)


## 4. Deterministic enrichment provider
Use a lightweight keyword extractor that satisfies the enrichment interface so we can validate downstream integration without external services.
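
The extraction rule used by the provider can be tried in isolation: strip surrounding punctuation, lower-case each token, keep tokens of at least six characters, then de-duplicate while preserving first-seen order. The standalone function below mirrors that rule; `extract_keywords` and its parameters are names introduced for this sketch.

```python
import string
from typing import List


def extract_keywords(text: str, *, min_length: int = 6, limit: int = 5) -> List[str]:
    """Return up to `limit` distinct lower-cased tokens of at least `min_length` characters."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation).lower()
        if len(token) >= min_length:
            tokens.append(token)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(tokens))[:limit]


sample = "Premium support retainer: premium SUPPORT, renewed quarterly."
print(extract_keywords(sample))
# → ['premium', 'support', 'retainer', 'renewed', 'quarterly']
```

Because the output depends only on the input text, repeated runs produce identical enrichments, which keeps the downstream assertions deterministic.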


In [None]:
import string
from dataclasses import dataclass
from typing import Optional

from idp_service.enrichment import EnrichmentResponse

@dataclass
class KeywordExtractionProvider:
    name: str = "keyword_extraction"
    max_batch_size: int = 8
    # Optional[...] keeps the annotation valid on Python versions before 3.10.
    timeout_seconds: Optional[int] = None

    def enrich(self, requests):
        responses: list[EnrichmentResponse] = []
        for request in requests:
            text = " ".join(span.content for span in request.document.text_spans)
            tokens = [
                token.strip(string.punctuation).lower()
                for token in text.split()
                if len(token.strip(string.punctuation)) >= 6
            ]
            keywords = list(dict.fromkeys(tokens))[:5]
            responses.append(
                EnrichmentResponse(
                    document_id=request.document_id,
                    enrichments=[
                        {
                            "enrichment_type": "keyword_extraction",
                            "content": {"keywords": keywords},
                            "confidence": 0.6,
                        }
                    ],
                    metadata={"method": "heuristic_ngram"},
                )
            )
        return responses

keyword_provider = KeywordExtractionProvider()
print("Configured enrichment provider:", keyword_provider)


## 5. Workflow helpers and scenario definitions
Create reusable helpers that assemble routing payloads, execute the workflow, and assert expectations for each scenario. Each scenario covers a unique routing path or override behaviour.


In [None]:
import base64
from dataclasses import dataclass
from typing import Dict, List, Optional

from idp_service.document_intelligence_storage import InMemoryDocumentResultStore
from idp_service.document_intelligence_workflow import (
    DocumentIntelligenceWorkflow,
    WorkflowConfig,
)
from idp_service.llm_document_intelligence_proxy import (
    LLMAzureDocumentIntelligenceClient,
)

llm_client = LLMAzureDocumentIntelligenceClient()
result_store = InMemoryDocumentResultStore()

def _page_layout(repeats: int, *, text: float, image: float, table: float, **extras) -> List[Dict[str, float]]:
    pages: List[Dict[str, float]] = []
    for index in range(repeats):
        payload = {
            "index": index,
            "textDensity": text,
            "imageDensity": image,
            "tableDensity": table,
        }
        payload.update(extras)
        pages.append(payload)
    return pages

@dataclass(frozen=True)
class Scenario:
    name: str
    object_key: str
    path: Path
    content_type: str
    metadata: Dict[str, object]
    expected_category: str
    expected_strategy: str
    expected_reason: Optional[str] = None
    expected_override_fragment: Optional[str] = None
    enrich_with: Optional[List[str]] = None
    request_override: Optional[str] = None


def build_routing_body(scenario: Scenario) -> Dict[str, object]:
    body = {
        "documentMetadata": scenario.metadata,
        "documentBytes": base64.b64encode(scenario.path.read_bytes()).decode("ascii"),
    }
    if scenario.request_override:
        body[router_config.request_override_flag] = scenario.request_override
    return body


def run_document_flow(scenario: Scenario, *, force: bool = False) -> Dict[str, object]:
    body = build_routing_body(scenario)
    analysis = router.route(body=body, object_key=scenario.object_key, overrides=overrides)

    assert analysis.category.value == scenario.expected_category, (
        f"Expected category {scenario.expected_category} for {scenario.name}, "
        f"got {analysis.category.value}"
    )
    assert analysis.strategy.name == scenario.expected_strategy, (
        f"Expected strategy {scenario.expected_strategy}, got {analysis.strategy.name}"
    )
    if scenario.expected_reason:
        assert analysis.strategy.reason == scenario.expected_reason, (
            f"Expected reason {scenario.expected_reason}, got {analysis.strategy.reason}"
        )
    if scenario.expected_override_fragment:
        assert any(
            scenario.expected_override_fragment in entry
            for entry in analysis.overrides_applied
        ), f"Override fragment {scenario.expected_override_fragment} not applied"

    workflow = DocumentIntelligenceWorkflow(
        client=llm_client,
        store=result_store,
        config=WorkflowConfig(
            model_id=analysis.strategy.model or analysis.strategy.name or "prebuilt-document",
            enrichment_providers=[keyword_provider],
        ),
    )

    metadata_record = analysis.to_metadata_record(
        {
            "ingestion_batch_id": "validation-suite",
            "object_key": scenario.object_key,
        }
    )
    payload_bytes = scenario.path.read_bytes()
    result = workflow.process(
        document_id=scenario.name,
        document_bytes=payload_bytes,
        source_uri=f"file://{scenario.path}",
        metadata=metadata_record,
        content_type=scenario.content_type,
        force=force,
        enrich_with=scenario.enrich_with,
    )

    if result.skipped:
        return {
            "name": scenario.name,
            "skipped": True,
            "reason": "idempotent_skip",
        }

    document = result.document
    assert document is not None, "Workflow should return a canonical document"
    assert document.text_spans, "Canonical document should contain text spans"
    assert all(
        span.confidence is not None and span.confidence > 0
        for span in document.text_spans
    ), "Confidence scores must be present and positive"

    if scenario.enrich_with:
        assert document.enrichments, "Expected enrichment entries when enrich_with is supplied"

    summary_title = document.summaries[0].title if document.summaries else None
    summary_text = document.summaries[0].summary if document.summaries else None

    return {
        "name": scenario.name,
        "category": analysis.category.value,
        "strategy": analysis.strategy.name,
        "reason": analysis.strategy.reason,
        "text_spans": len(document.text_spans),
        "tables": len(document.tables),
        "summary_title": summary_title,
        "summary": summary_text[:200] if summary_text else None,
        "enrichments": [enrichment.enrichment_type for enrichment in document.enrichments],
        "overrides": analysis.overrides_applied,
    }


## 6. Scenario matrix
Define scenarios that exercise long-form routing, table-heavy spreadsheets, standard short-form text, filename pattern overrides, and explicit request overrides.


In [None]:
SCENARIOS: List[Scenario] = [
    Scenario(
        name="long_form_report",
        object_key="reports/financial_report.pdf",
        path=SAMPLES_DIR / "financial_report.pdf",
        content_type="application/pdf",
        metadata={
            "contentType": "application/pdf",
            "pageCount": 18,
            "layout": {"pages": _page_layout(18, text=0.72, image=0.12, table=0.08)},
        },
        expected_category=DocumentCategory.LONG_FORM.value,
        expected_strategy="azure_longform",
        expected_reason="category_default",
        enrich_with=[keyword_provider.name],
    ),
    Scenario(
        name="table_heavy_budget",
        object_key="finance/operating_budget.xlsx",
        path=SAMPLES_DIR / "operating_budget.xlsx",
        content_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        metadata={
            "contentType": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "pageCount": 1,
            "layout": {
                "pages": _page_layout(1, text=0.25, image=0.1, table=0.95, tableCount=4)
            },
        },
        expected_category=DocumentCategory.TABLE_HEAVY.value,
        expected_strategy="azure_tables",
        expected_reason="category_default",
    ),
    Scenario(
        name="short_form_catalog",
        object_key="exports/product_catalog.csv",
        path=SAMPLES_DIR / "product_catalog.csv",
        content_type="text/csv",
        metadata={
            "contentType": "text/csv",
            "pageCount": 1,
            "layout": {"pages": _page_layout(1, text=0.9, image=0.05, table=0.1)},
        },
        expected_category=DocumentCategory.SHORT_FORM.value,
        expected_strategy="azure_general",
        expected_reason="category_default",
    ),
    Scenario(
        name="invoice_pattern_override",
        object_key="incoming/invoice_snapshot.csv",
        path=SAMPLES_DIR / "invoice_snapshot.csv",
        content_type="text/csv",
        metadata={
            "contentType": "text/csv",
            "pageCount": 1,
            "layout": {"pages": _page_layout(1, text=0.85, image=0.05, table=0.15)},
        },
        expected_category=DocumentCategory.SHORT_FORM.value,
        expected_strategy="invoice_override",
        expected_reason="config_pattern_override",
        expected_override_fragment="pattern",
    ),
    Scenario(
        name="form_heavy_email",
        object_key="mailbox/client_intake.eml",
        path=SAMPLES_DIR / "client_intake.eml",
        content_type="message/rfc822",
        metadata={
            "contentType": "message/rfc822",
            "pageCount": 2,
            "layout": {
                "pages": _page_layout(
                    2,
                    text=0.55,
                    image=0.15,
                    table=0.35,
                    checkboxCount=1,
                    radioButtonCount=1,
                )
            },
        },
        expected_category=DocumentCategory.FORM_HEAVY.value,
        expected_strategy="azure_forms",
        expected_reason="category_default",
        enrich_with=[keyword_provider.name],
    ),
    Scenario(
        name="forced_route_email",
        object_key="mailbox/manual_override.eml",
        path=SAMPLES_DIR / "client_intake.eml",
        content_type="message/rfc822",
        metadata={
            "contentType": "message/rfc822",
            "pageCount": 2,
            "layout": {
                "pages": _page_layout(
                    2,
                    text=0.45,
                    image=0.1,
                    table=0.05,
                    checkboxCount=0,
                    radioButtonCount=0,
                )
            },
        },
        expected_category=DocumentCategory.UNKNOWN.value,
        expected_strategy="force_manual_route",
        expected_reason="request_override",
        expected_override_fragment="request",
        request_override="force_manual_route",
    ),
]

print(f"Prepared {len(SCENARIOS)} scenarios")
for scenario in SCENARIOS:
    print(f" - {scenario.name}: {scenario.object_key}")


## 7. Execute scenarios
Run each scenario through the router and workflow, capturing the canonical output statistics. The assertions in `run_document_flow` fail fast if the routed category, strategy, override, or canonical output deviates from a scenario's expectations.


In [None]:
from pprint import pprint

results = [run_document_flow(scenario) for scenario in SCENARIOS]
print("Captured workflow results:")
pprint(results)


## 8. Idempotency verification
Reprocess a document without the `force` flag to confirm that the workflow skips work when the checksum is unchanged, then force a replay to regenerate the artefacts.
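
The skip-unless-changed behaviour can be illustrated with a small standalone model. The sketch below assumes a SHA-256 digest of the payload as the checksum and an in-memory dictionary as the store; `ChecksumStore` is a hypothetical class and does not reflect how `InMemoryDocumentResultStore` is implemented.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ChecksumStore:
    """Hypothetical in-memory store that remembers one checksum per document id."""
    checksums: Dict[str, str] = field(default_factory=dict)
    process_count: int = 0

    def process(self, document_id: str, payload: bytes, *, force: bool = False) -> bool:
        """Return True when work was performed, False when the replay was skipped."""
        digest = hashlib.sha256(payload).hexdigest()
        if not force and self.checksums.get(document_id) == digest:
            return False  # unchanged payload: idempotent skip
        self.checksums[document_id] = digest
        self.process_count += 1
        return True


store = ChecksumStore()
print(store.process("long_form_report", b"report-bytes"))              # first run
print(store.process("long_form_report", b"report-bytes"))              # replay
print(store.process("long_form_report", b"report-bytes", force=True))  # forced replay
```

In this model the first call performs work, the unchanged replay is skipped, and the forced replay performs work again.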


In [None]:
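first_scenario = SCENARIOS[0]
second_run = run_document_flow(first_scenario)
# The replay must be skipped because the checksum is unchanged.
assert second_run.get("skipped"), "Replay without force should be skipped"
print("Second run skipped:", second_run)

forced_run = run_document_flow(first_scenario, force=True)
# A forced replay must regenerate the canonical document.
assert not forced_run.get("skipped"), "Forced replay should reprocess the document"
print("Forced reprocess summary:")
pprint(forced_run)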


## 9. Summary
The validation suite confirms that routing, overrides, canonical parsing, summarisation, enrichment, and idempotent storage behave as designed across heterogeneous document types.
