License

The curated annotations and the public portion of the StructBill-CN dataset are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). The data may be used for academic research only; commercial use is prohibited.
Data Availability & Split (Strict Compliance Policy)

To comply with the original distribution agreements of the third-party datasets, we provide the dataset in a decoupled manner:
- What We Provide (Available Now): Our curated, unified Hierarchical JSON annotations for all subsets, plus the images for the fully de-identified out-of-distribution (OOD) test set of the Internal-Wild data.
  - Download Annotations & Internal-Wild Test Images: [https://huggingface.co/datasets/VANVAN6992/StructBill-CN]
- Third-Party Source Images: For the CHIP-2022 and SIBR-med subsets, researchers must download the original raw images directly from the official sources and pair them with our provided annotations.
  - CHIP-2022 Images: Available via the Aliyun Tianchi Platform - CHIP 2022 Shared Task
  - SIBR-med Images: Available via the Official SIBR-med Repository (Please insert the actual github link for SIBR-med here)
- Internal-Wild Training Set (Pending Release): We have completed de-identification of the training-set images and obtained the permissions required for public release. To preserve the double-blind review policy, this subset is temporarily withheld and will be uploaded to our official, non-anonymous repository upon publication of the paper.
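
Pairing separately downloaded third-party images with the provided annotations could look roughly like the sketch below. It assumes, hypothetically, that each annotation is a `.json` file whose stem matches the image file name; the actual file layout of the release may differ, and `pair_images_with_annotations` and the `doc_001`-style names are illustrative only.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pair_images_with_annotations(image_dir, annotation_dir):
    """Match images to annotation files by shared file stem.

    The stem-matching convention is an assumption, not a documented
    property of the dataset.
    """
    annotations = {p.stem: p for p in Path(annotation_dir).glob("*.json")}
    pairs, unmatched = [], []
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files such as checksums or READMEs
        annotation_path = annotations.get(image_path.stem)
        if annotation_path is None:
            unmatched.append(image_path)
        else:
            pairs.append((image_path, annotation_path))
    return pairs, unmatched


# Small self-contained demo using placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images").mkdir()
    (root / "annotations").mkdir()
    (root / "images" / "doc_001.jpg").touch()
    (root / "images" / "doc_002.jpg").touch()
    (root / "annotations" / "doc_001.json").write_text("{}")

    pairs, unmatched = pair_images_with_annotations(
        root / "images", root / "annotations"
    )
    paired_names = [(img.name, ann.name) for img, ann in pairs]
    unmatched_names = [img.name for img in unmatched]
```

Reporting the unmatched images separately makes it easier to notice an incomplete download or a naming mismatch before training or evaluation.
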
Ethical & Privacy Statement

All real-world data (Internal-Wild) included in the public test set has undergone rigorous de-identification and anonymization. All Protected Health Information (PHI), including patient names, personal IDs, specific medical institutions, and exact dates, has been redacted, masked, or replaced with synthetic placeholders. No sensitive personal data is exposed.
StructBill-CN comprises 3,596 high-resolution images covering 8 distinct business schemas, ranging from standardized fixed-amount invoices to unstructured medical notifications and complex itemized billing lists. The dataset integrates public academic data with real-world private data, providing a unified evaluation platform that spans varying layout styles, print qualities, and levels of business-logic complexity. This diversity is intended to test how well models generalize across distributions, requiring semantic understanding rather than simple pattern matching.

The dataset is constructed from three primary sources: CHIP-2022, SIBR-med, and Internal-Wild. The composition is detailed in Table 1 below.
Table 1: Statistics and Characteristics of the StructBill-CN Dataset
The dataset is constructed from two public datasets and an internal business dataset.
| Subset | Document Type | Count | Table Format |
|---|---|---|---|
| CHIP-2022 | Inpatient Invoice | 680 | Wired (Grid) |
| CHIP-2022 | Outpatient Invoice | 340 | Wired (Grid) |
| CHIP-2022 | Pharmacy Invoice | 340 | Wired (Grid) |
| CHIP-2022 | Discharge Record | 340 | Text-heavy |
| SIBR-med | Fee List | 400 | Wireless |
| SIBR-med | Notification Note | 200 | None |
| Internal-Wild | Settlement Sheet | 1,269 | Dense KV |
| Internal-Wild | Ultra-Long Fee List | 27 | Wireless |
| Total | 8 Distinct Schemas | 3,596 | Mixed |

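
As a quick cross-check of Table 1, the per-source totals can be tallied from the listed counts. The snippet below is only an illustrative sketch: the dictionary layout and variable names are our own choices, and the counts are copied directly from the table.

```python
# Counts copied from Table 1; the nesting by source is an illustrative choice.
subset_counts = {
    "CHIP-2022": {
        "Inpatient Invoice": 680,
        "Outpatient Invoice": 340,
        "Pharmacy Invoice": 340,
        "Discharge Record": 340,
    },
    "SIBR-med": {
        "Fee List": 400,
        "Notification Note": 200,
    },
    "Internal-Wild": {
        "Settlement Sheet": 1269,
        "Ultra-Long Fee List": 27,
    },
}

# Images contributed by each source, then the overall totals.
per_source = {name: sum(docs.values()) for name, docs in subset_counts.items()}
total_images = sum(per_source.values())
num_schemas = sum(len(docs) for docs in subset_counts.values())

for name, count in per_source.items():
    print(f"{name}: {count} images ({count / total_images:.1%})")
print(f"Total: {total_images} images across {num_schemas} schemas")
```
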
StructBill-CN exposes the limitations of existing multimodal large language models (MLLMs) and traditional table structure recognition (TSR) approaches in complex document understanding through three distinct challenges:
- Absence of explicit visual cues: The prevalence of borderless tables—characterized by the lack of vertical separators—results in the visual merging of densely packed numerical columns, frequently causing misalignment and failure in traditional segmentation-based methods.
- Structural ambiguity and hallucination risks: Unstructured text blocks often induce models to generate spurious table rows, while alignment difficulties inherent to sparse columns containing null values frequently result in shift errors.
- Extreme density and visual noise: In real-world business scenarios, ultra-long sequences challenge long-range attention mechanisms, while physical degradations and semantically similar fields rigorously test the model's fine-grained discrimination and robustness.
StructBill-CN departs from conventional transcript-and-bounding-box annotations by adopting a Hierarchical JSON standard designed for immediate downstream integration. This Ingestion-Ready format structures ground truth into global key-value attributes and nested line-item lists, mimicking real-world database schemas to minimize post-processing.
Furthermore, the annotation protocol strictly prioritizes semantic attribution over physical location; in the presence of printing offsets or wireless table layouts, labels are assigned based on logical business context rather than geometric coordinates. This approach compels models to develop deep semantic alignment strategies, inferring structure from content logic rather than relying solely on superficial visual positioning.
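
To make the idea of global key-value attributes plus nested line-item lists concrete, here is a minimal sketch of what such a record might look like and how it could be flattened for ingestion. Every field name (`global_fields`, `line_items`, `item_name`, `amount`, and so on) is a hypothetical placeholder rather than the dataset's actual schema, and the total check only illustrates the kind of logical consistency such a structure allows.

```python
from decimal import Decimal

# Hypothetical record: all keys and values are illustrative assumptions,
# not the real StructBill-CN schema.
example_record = {
    "global_fields": {
        "document_type": "Fee List",
        "total_amount": "35.50",
    },
    "line_items": [
        {"item_name": "Item A", "quantity": "2", "unit_price": "10.00", "amount": "20.00"},
        {"item_name": "Item B", "quantity": "1", "unit_price": "15.50", "amount": "15.50"},
    ],
}


def to_rows(record):
    """Flatten nested line items into flat rows, each tagged with its position."""
    return [{"row_index": i, **item} for i, item in enumerate(record["line_items"])]


def line_item_total(record):
    """Sum per-row amounts with Decimal to avoid floating-point rounding."""
    return sum(
        (Decimal(item["amount"]) for item in record["line_items"]),
        Decimal("0"),
    )


rows = to_rows(example_record)
items_total = line_item_total(example_record)
declared_total = Decimal(example_record["global_fields"]["total_amount"])
totals_agree = items_total == declared_total
```

Because rows are keyed by field name rather than by bounding-box coordinates, a check like `totals_agree` depends only on the logical content of the document, which matches the annotation protocol's emphasis on semantic attribution over physical location.
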
BibTeX:
@article{structbillcn2026,
title={StructBill-CN: Benchmarking and Improving Logical Consistency in Visual Document Understanding with Schema-Reinforced Policy Optimization},
author={Anonymous Authors},
journal={Under Review at IJCAI},
year={2026}
}
(Note: The citation will be updated with author names and official publication details upon acceptance.)