vanvan6992/StructBill-CN

Dataset License and Access

License: The curated annotations and the public portion of the StructBill-CN dataset are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Use is restricted to academic research; commercial use is prohibited.

Data Availability & Split (Strict Compliance Policy): To comply with the original distribution agreements of the third-party datasets, we provide the dataset in decoupled form:

  • What We Provide (Available Now): Our curated, unified Hierarchical JSON annotations for all subsets, plus the images for the fully de-identified out-of-distribution (OOD) test set of the Internal-Wild data.
  • Third-Party Source Images: For the CHIP-2022 and SIBR-med subsets, researchers must download the original raw images directly from the official sources and pair them with our annotations.
  • Internal-Wild Training Set (Pending Release): We have completed de-identification of the training set images and obtained the permissions required for public release. To maintain the double-blind review policy, this subset is temporarily withheld and will be uploaded to our official, non-anonymous repository immediately upon publication of the paper.
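
Because the CHIP-2022 and SIBR-med images are downloaded separately, each annotation file has to be matched to its image locally. The following is a minimal sketch of that pairing step, not an official loader. It assumes that an annotation and its image share the same file stem (for example `0001.json` and `0001.jpg`); that naming convention, the function name, and the accepted image extensions are all hypothetical and should be adjusted to the actual file layout.

```python
from pathlib import Path

# Assumed image extensions; adjust to match the downloaded source data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pair_annotations_with_images(annotation_dir, image_dir):
    """Match each annotation JSON to an image with the same file stem.

    Returns (pairs, missing): pairs is a list of (annotation, image) paths,
    missing lists the annotations for which no image was found.
    The shared-stem convention is an assumption, not a documented guarantee.
    """
    # Index every image under image_dir by its stem (file name without suffix).
    images = {
        path.stem: path
        for path in Path(image_dir).rglob("*")
        if path.suffix.lower() in IMAGE_SUFFIXES
    }

    pairs, missing = [], []
    for annotation in sorted(Path(annotation_dir).rglob("*.json")):
        image = images.get(annotation.stem)
        if image is None:
            missing.append(annotation)
        else:
            pairs.append((annotation, image))
    return pairs, missing
```

Calling the function with the local annotation folder and the folder holding the downloaded images returns the matched pairs, and a non-empty `missing` list points to images that still need to be fetched from the official source.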

Ethical & Privacy Statement: All real-world data (Internal-Wild) included in the public test set has undergone rigorous de-identification and anonymization. All Protected Health Information (PHI), including patient names, personal IDs, specific medical institutions, and exact dates, has been redacted, masked, or replaced with synthetic placeholders. No sensitive personal data is exposed.


The StructBill-CN Dataset

Description

StructBill-CN comprises 3,596 high-resolution images covering 8 distinct business schemas, ranging from standardized fixed-amount invoices to unstructured medical notifications and complex itemized billing lists. The dataset integrates public academic data with real-world private data, establishing a unified evaluation platform that spans varying layout styles, print qualities, and business logic complexities. This diversity tests how well models generalize across distributions, requiring semantic understanding rather than simple pattern matching. The dataset is constructed from three primary sources: CHIP-2022, SIBR-med, and Internal-Wild. The composition is detailed in Table 1 below.

Table 1: Statistics and Characteristics of the StructBill-CN Dataset

The dataset is constructed from two public datasets and an internal business dataset.

Subset          Document Type          Count    Table Format
CHIP-2022       Inpatient Invoice        680    Wired (Grid)
CHIP-2022       Outpatient Invoice       340    Wired (Grid)
CHIP-2022       Pharmacy Invoice         340    Wired (Grid)
CHIP-2022       Discharge Record         340    Text-heavy
SIBR-med        Fee List                 400    Wireless
SIBR-med        Notification Note        200    None
Internal-Wild   Settlement Sheet       1,269    Dense KV
Internal-Wild   Ultra-Long Fee List       27    Wireless
Total           8 Distinct Schemas     3,596    Mixed
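
As a quick sanity check, the per-schema counts in Table 1 can be summed to confirm the stated totals. The sketch below only restates the numbers from the table; the dictionary layout and variable names are illustrative.

```python
# Per-schema image counts copied from Table 1.
SUBSET_COUNTS = {
    "CHIP-2022": {
        "Inpatient Invoice": 680,
        "Outpatient Invoice": 340,
        "Pharmacy Invoice": 340,
        "Discharge Record": 340,
    },
    "SIBR-med": {
        "Fee List": 400,
        "Notification Note": 200,
    },
    "Internal-Wild": {
        "Settlement Sheet": 1269,
        "Ultra-Long Fee List": 27,
    },
}

# Images per source subset.
subset_totals = {name: sum(docs.values()) for name, docs in SUBSET_COUNTS.items()}
# Overall image count and number of distinct schemas.
total_images = sum(subset_totals.values())
num_schemas = sum(len(docs) for docs in SUBSET_COUNTS.values())

print(subset_totals)  # {'CHIP-2022': 1700, 'SIBR-med': 600, 'Internal-Wild': 1296}
print(total_images)   # 3596
print(num_schemas)    # 8
```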

Dataset Challenges

StructBill-CN underscores the limitations of existing MLLMs and traditional TSR approaches in complex document understanding through three distinct challenges:

  1. Absence of explicit visual cues: The prevalence of borderless tables—characterized by the lack of vertical separators—results in the visual merging of densely packed numerical columns, frequently causing misalignment and failure in traditional segmentation-based methods.
  2. Structural ambiguity and hallucination risks: Unstructured text blocks often induce models to generate spurious table rows, while alignment difficulties inherent to sparse columns containing null values frequently result in shift errors.
  3. Extreme density and visual noise: In real-world business scenarios, ultra-long sequences challenge long-range attention mechanisms, while physical degradations and semantically similar fields rigorously test the model's fine-grained discrimination and robustness.
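
The shift error mentioned in the second challenge can be illustrated with a toy example. The sketch below is not taken from the dataset or from any evaluated model: the column names and the sample row are hypothetical. It shows how dropping an empty cell before assigning values to columns moves every later value one column to the left, whereas keeping an explicit null preserves the alignment.

```python
# Hypothetical column headers for a fee list row.
HEADERS = ["item", "spec", "quantity", "unit_price", "amount"]


def naive_align(cells):
    """Drop empty cells, then fill columns left to right (causes a shift)."""
    non_empty = [cell for cell in cells if cell is not None]
    return dict(zip(HEADERS, non_empty))


def null_aware_align(cells):
    """Keep empty cells as explicit nulls so each value stays in its column."""
    return dict(zip(HEADERS, cells))


# A row whose "spec" column is empty on the printed document.
row = ["Aspirin", None, "2", "5.00", "10.00"]

shifted = naive_align(row)
aligned = null_aware_align(row)

print(shifted["spec"])     # 2  (the quantity has slid into the spec column)
print("amount" in shifted) # False (the last column is lost entirely)
print(aligned["quantity"]) # 2
print(aligned["amount"])   # 10.00
```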

Annotation Details

StructBill-CN departs from conventional transcript-and-bounding-box annotations by adopting a Hierarchical JSON standard designed for immediate downstream integration. This Ingestion-Ready format structures ground truth into global key-value attributes and nested line-item lists, mimicking real-world database schemas to minimize post-processing.
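
To make the shape of such a record concrete, here is a minimal sketch of a hierarchical annotation with global key-value attributes and a nested line-item list. All field names (`document_type`, `total_amount`, `line_items`, `item_name`, `amount`) and values are hypothetical placeholders, not the dataset's actual schema. The consistency check at the end is likewise an assumption about how such a record might be validated, not a rule stated by the dataset.

```python
import json
from decimal import Decimal

# Hypothetical record: global attributes plus a nested list of line items.
RECORD_JSON = """
{
  "document_type": "Outpatient Invoice",
  "total_amount": "35.50",
  "line_items": [
    {"item_name": "Registration fee", "quantity": "1", "amount": "10.00"},
    {"item_name": "Blood test", "quantity": "1", "amount": "25.50"}
  ]
}
"""


def split_record(record):
    """Separate global key-value attributes from nested line-item lists."""
    global_fields = {k: v for k, v in record.items() if not isinstance(v, list)}
    item_lists = {k: v for k, v in record.items() if isinstance(v, list)}
    return global_fields, item_lists


def line_items_match_total(record):
    """Check that line-item amounts add up to the document total.

    Decimal is used instead of float to avoid rounding errors on currency.
    """
    total = Decimal(record["total_amount"])
    items_sum = sum(
        (Decimal(item["amount"]) for item in record["line_items"]),
        Decimal("0"),
    )
    return items_sum == total


record = json.loads(RECORD_JSON)
global_fields, item_lists = split_record(record)

print(sorted(global_fields))           # ['document_type', 'total_amount']
print(len(item_lists["line_items"]))   # 2
print(line_items_match_total(record))  # True
```

Keeping amounts as strings and converting them with `Decimal` is a design choice for this sketch; it preserves the printed value exactly, which matters when comparing extracted fields against ground truth.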

Furthermore, the annotation protocol strictly prioritizes semantic attribution over physical location; in the presence of printing offsets or wireless table layouts, labels are assigned based on logical business context rather than geometric coordinates. This approach compels models to develop deep semantic alignment strategies, inferring structure from content logic rather than relying solely on superficial visual positioning.

Citation

BibTeX:

@article{structbillcn2026,
  title={StructBill-CN: Benchmarking and Improving Logical Consistency in Visual Document Understanding with Schema-Reinforced Policy Optimization},
  author={Anonymous Authors},
  journal={Under Review at IJCAI},
  year={2026}
}

(Note: The citation will be updated with author names and official publication details upon acceptance.)
