How PDF to ESX Works
This is the short version of the pipeline.
estimate PDF -> text/OCR extraction -> structured parsing -> canonical estimate -> ESX/XML export
The app accepts one or more insurance estimate PDFs. Some are text-based and easy to read. Others are scan-heavy, image-based, or mixed packets with guide pages and summaries.
The pipeline first tries to use normal PDF text extraction.
If a page looks text-poor or scan-heavy, the app can apply local OCR. This improves results on messy real-world documents, but OCR quality still depends on scan quality and layout clarity.
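The per-page fallback described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the `MIN_TEXT_CHARS` threshold, the `PageText` type, and the `ocr_page` callable are all hypothetical names, and the real heuristic for "text-poor" may consider more than character counts.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical threshold: pages with fewer alphanumeric characters than this
# in their text layer are treated as text-poor and sent to OCR.
MIN_TEXT_CHARS = 40


@dataclass
class PageText:
    page_number: int
    text: str
    source: str  # "text" for the embedded text layer, "ocr" for the fallback


def looks_text_poor(text: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Return True when the text layer has too few alphanumeric characters."""
    usable = sum(1 for ch in text if ch.isalnum())
    return usable < min_chars


def extract_pages(text_layers: List[str], ocr_page: Callable[[int], str]) -> List[PageText]:
    """Prefer the embedded text layer; call OCR only for text-poor pages.

    `ocr_page` is a stand-in for whatever local OCR engine is used. It takes
    a zero-based page index and returns the recognized text.
    """
    pages = []
    for index, layer in enumerate(text_layers):
        if looks_text_poor(layer):
            pages.append(PageText(index, ocr_page(index), "ocr"))
        else:
            pages.append(PageText(index, layer, "text"))
    return pages
```

Recording the `source` of each page makes it easier to flag OCR-derived values for review later, since their reliability depends on scan quality.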
The parser tries to identify:
- carrier and claim metadata
- insured/property details
- dates
- estimator information
- totals and subtotals
- line items
- quantities, units, prices, taxes, depreciation, and related fields
The parser is heuristic. Different carriers and estimate layouts can produce different reliability levels.
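To make "heuristic" concrete, here is a small sketch of pattern-based line-item parsing. The row layout it assumes (`<number>. <description> <quantity> <unit> <unit price> <total>`) is hypothetical and chosen only for illustration; real carrier layouts vary, which is exactly why reliability differs between them.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumed layout for one line-item row (illustrative only):
#   "12. R&R Drywall - hung, taped 128.00 SF 2.45 313.60"
LINE_ITEM_RE = re.compile(
    r"^\s*(?P<number>\d+)\.\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<quantity>\d[\d,]*(?:\.\d+)?)\s+"
    r"(?P<unit>[A-Z]{2})\s+"
    r"(?P<unit_price>\d[\d,]*\.\d{2})\s+"
    r"(?P<total>\d[\d,]*\.\d{2})\s*$"
)


def to_decimal(raw: str) -> Decimal:
    """Parse a money or quantity string such as '1,234.56' without float error."""
    return Decimal(raw.replace(",", ""))


def parse_line_item(line: str) -> Optional[dict]:
    """Return the parsed fields, or None when the line does not fit the pattern."""
    match = LINE_ITEM_RE.match(line)
    if match is None:
        return None
    return {
        "number": int(match["number"]),
        "description": match["description"],
        "quantity": to_decimal(match["quantity"]),
        "unit": match["unit"],
        "unit_price": to_decimal(match["unit_price"]),
        "total": to_decimal(match["total"]),
    }
```

Returning `None` for rows that do not match, rather than guessing, keeps unparsed lines visible to the caller instead of silently producing wrong numbers.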
The extracted information is normalized into a canonical estimate model before export.
That matters because it keeps the app modular:
- PDF-specific quirks stay in ingestion/parsing
- export logic stays independent from raw PDF layout logic
- future parser improvements do not require rewriting the exporter
- multi-PDF merge behavior has a single internal contract to work against
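As an illustration of what such an internal contract might look like, the sketch below defines a tiny canonical model and a merge function. The field names, the `merge_estimates` helper, and the merge policy (first non-empty metadata value wins, line items are concatenated in input order) are assumptions made for this example, not the app's documented schema or behavior.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional


@dataclass
class LineItem:
    description: str
    quantity: Decimal
    unit: str
    unit_price: Decimal
    total: Decimal


@dataclass
class CanonicalEstimate:
    # Hypothetical subset of the metadata the parser tries to identify.
    carrier: Optional[str] = None
    claim_number: Optional[str] = None
    insured_name: Optional[str] = None
    line_items: List[LineItem] = field(default_factory=list)

    def line_item_total(self) -> Decimal:
        """Sum of line-item totals, starting from an exact Decimal zero."""
        return sum((item.total for item in self.line_items), Decimal("0"))


def merge_estimates(estimates: List[CanonicalEstimate]) -> CanonicalEstimate:
    """Merge several parsed PDFs into one estimate (assumed policy, see above)."""
    merged = CanonicalEstimate()
    for estimate in estimates:
        merged.carrier = merged.carrier or estimate.carrier
        merged.claim_number = merged.claim_number or estimate.claim_number
        merged.insured_name = merged.insured_name or estimate.insured_name
        merged.line_items.extend(estimate.line_items)
    return merged
```

Because both the parser and the exporter only touch `CanonicalEstimate`, either side can change without the other needing to know about PDF layout or package structure.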
The export layer turns the canonical estimate into:
- a zip-based `.esx` package
- a readable `.esx.xml` payload
- a `.canonical.json` snapshot for inspection and debugging
The package is built deterministically, so the same canonical estimate produces the same output, and it is validated before the app reports success.
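One common way to get a deterministic zip archive is to sort the members and pin every timestamp and attribute that would otherwise vary between runs. The sketch below shows that technique with Python's standard `zipfile` module. The member names in the usage example, the fixed timestamp, and the checks in `validate_package` are assumptions for illustration; the app's real package layout and validation rules are not described here.

```python
import hashlib
import io
import zipfile
from typing import Dict, Iterable

# Earliest timestamp the zip format can represent; any fixed value would do.
FIXED_TIMESTAMP = (1980, 1, 1, 0, 0, 0)


def build_package(members: Dict[str, bytes]) -> bytes:
    """Write members in sorted order with pinned metadata."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(members):
            info = zipfile.ZipInfo(filename=name, date_time=FIXED_TIMESTAMP)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.create_system = 3  # pin the "created on" field across platforms
            info.external_attr = 0o644 << 16  # fixed file permissions
            archive.writestr(info, members[name])
    return buffer.getvalue()


def validate_package(package: bytes, expected_names: Iterable[str]) -> None:
    """Raise ValueError if a member is corrupt or an expected member is missing."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        corrupt = archive.testzip()  # returns the first bad member name, or None
        if corrupt is not None:
            raise ValueError(f"corrupt member: {corrupt}")
        missing = set(expected_names) - set(archive.namelist())
        if missing:
            raise ValueError(f"missing members: {sorted(missing)}")


def package_digest(package: bytes) -> str:
    """SHA-256 of the package bytes, useful for comparing two builds."""
    return hashlib.sha256(package).hexdigest()
```

With this approach, two builds from the same inputs can be compared by digest, for example `package_digest(build_package({"payload.xml": b"<estimate/>"}))`, where `payload.xml` is a made-up member name. Byte-for-byte equality also assumes the same compression library version on both runs.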
What the app does not claim:
- It does not claim universal support for every estimate layout.
- It does not claim native proprietary `XACTDOC.ZIPXML` authoring.
- It does not claim perfect OCR recovery on poor scans.
docs_repo/ in the main repository is the authoritative deep technical source of truth.