The StructBill-CN Dataset

Dataset License and Access

License: The curated annotations and the public portion of the StructBill-CN dataset are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Use is restricted to academic research; commercial use is prohibited.

Data Availability & Split (Strict Compliance Policy): To comply with the original data distribution agreements of the third-party datasets, we release annotations and images separately, as follows:

What We Provide (Available Now): Our curated, unified Hierarchical JSON annotations for all subsets, plus the images of the fully de-identified out-of-distribution (OOD) test set drawn from the Internal-Wild data.
- Download Annotations & Internal-Wild Test Images: https://huggingface.co/datasets/VANVAN6992/StructBill-CN
Third-Party Source Images: For the CHIP-2022 and SIBR-med subsets, researchers must download the original raw images directly from their official sources to pair with our provided annotations.
- CHIP-2022 Images: Available via the Aliyun Tianchi Platform - CHIP 2022 Shared Task
- SIBR-med Images: Available via the official SIBR-med repository (link to be added).
Internal-Wild Training Set (Pending Release): We have completed the rigorous de-identification process for the training set images and obtained the necessary permissions for public release. To strictly maintain the double-blind review policy, this subset is temporarily withheld and will be uploaded to our official, non-anonymous repository immediately upon the publication of the paper.
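
One way to pair the provided annotations with separately downloaded third-party images is to match files by name. The sketch below is only an illustration: the directory arguments, the one-JSON-file-per-image convention, the image suffixes, and matching on file stems are all assumptions, not the documented structure of the release. Check the actual layout of the annotation files and adapt the matching rule accordingly.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image formats


def pair_annotations(ann_dir, img_dir):
    """Match annotation files to images that share the same file stem.

    Returns (pairs, missing): pairs maps stem -> (annotation path, image path);
    missing lists the stems whose image has not been found under img_dir.
    """
    # Index every image under img_dir by its file name without extension.
    images = {
        p.stem: p
        for p in Path(img_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    }
    pairs, missing = {}, []
    for ann in sorted(Path(ann_dir).rglob("*.json")):
        if ann.stem in images:
            pairs[ann.stem] = (ann, images[ann.stem])
        else:
            missing.append(ann.stem)
    return pairs, missing
```

A non-empty `missing` list would indicate annotations whose source images still need to be downloaded from the official third-party sources.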

Ethical & Privacy Statement: All real-world data (Internal-Wild) included in the public test set has been de-identified and anonymized. All Protected Health Information (PHI), including patient names, personal IDs, specific medical institutions, and exact dates, has been redacted, masked, or replaced with synthetic placeholders. No sensitive personal data is exposed.

The StructBill-CN Dataset

Description

StructBill-CN comprises 3,596 high-resolution images covering 8 distinct business schemas, ranging from standardized fixed-amount invoices to unstructured medical notifications and complex itemized billing lists. The dataset integrates both public academic data and real-world private data, establishing a unified evaluation platform that spans varying layout styles, print qualities, and business logic complexities. This diversity ensures that the dataset rigorously tests the generalization of models across different distributions, moving beyond simple pattern matching to deep semantic understanding. The dataset is constructed from three primary sources: CHIP-2022, SIBR-med, and Internal-Wild datasets. The composition is detailed in Table 1 below.

Table 1: Statistics and Characteristics of the StructBill-CN Dataset

The dataset is constructed from two public datasets and an internal business dataset.

Subset	Document Type	Count	Table Format
CHIP-2022	Inpatient Invoice	680	Wired (Grid)
	Outpatient Invoice	340	Wired (Grid)
	Pharmacy Invoice	340	Wired (Grid)
	Discharge Record	340	Text-heavy
SIBR-med	Fee List	400	Wireless
	Notification Note	200	None
Internal-Wild	Settlement Sheet	1,269	Dense KV
	Ultra-Long Fee List	27	Wireless
Total	8 Distinct Schemas	3,596	Mixed
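
As a quick consistency check on Table 1, the per-type counts can be tallied by source. The snippet below only restates the numbers from the table; the dictionary layout is an illustrative choice, not part of the release.

```python
# Per-type image counts copied from Table 1.
COUNTS = {
    "CHIP-2022": {
        "Inpatient Invoice": 680,
        "Outpatient Invoice": 340,
        "Pharmacy Invoice": 340,
        "Discharge Record": 340,
    },
    "SIBR-med": {
        "Fee List": 400,
        "Notification Note": 200,
    },
    "Internal-Wild": {
        "Settlement Sheet": 1269,
        "Ultra-Long Fee List": 27,
    },
}

# Sum the document types within each source, then across sources.
subset_totals = {name: sum(types.values()) for name, types in COUNTS.items()}
total = sum(subset_totals.values())
num_schemas = sum(len(types) for types in COUNTS.values())

for name, n in subset_totals.items():
    print(f"{name}: {n} images ({n / total:.1%})")
print(f"Total: {total} images across {num_schemas} schemas")
```

The three sources contribute 1,700, 600, and 1,296 images respectively, which sum to the 3,596 images and 8 schemas reported in the table.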

Dataset Challenges

StructBill-CN exposes the limitations of existing multimodal large language models (MLLMs) and traditional table structure recognition (TSR) approaches in complex document understanding through three distinct challenges:

- Absence of explicit visual cues: The prevalence of borderless tables, which lack vertical separators, causes densely packed numerical columns to merge visually, frequently leading to misalignment and failure in traditional segmentation-based methods.
- Structural ambiguity and hallucination risks: Unstructured text blocks often induce models to generate spurious table rows, while sparse columns containing null values are difficult to align and frequently produce shift errors.
- Extreme density and visual noise: In real-world business scenarios, ultra-long sequences challenge long-range attention mechanisms, while physical degradations and semantically similar fields test the model's fine-grained discrimination and robustness.
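
The shift error caused by null values in sparse columns can be illustrated with a toy example. The column names and cell values below are invented for illustration and do not come from the dataset. When an empty cell is dropped from a row, mapping the remaining cells to headers by position moves every later value one column to the left.

```python
HEADERS = ["item", "spec", "quantity", "amount"]  # hypothetical columns

# Ground-truth row in which the "spec" cell is empty (null).
true_row = ["Blood test", None, "1", "35.00"]

# A positional reading that skips the empty cell sees only three values.
observed_cells = [cell for cell in true_row if cell is not None]

# Zipping by position shifts "1" into "spec" and "35.00" into "quantity",
# and leaves "amount" without any value.
shifted = dict(zip(HEADERS, observed_cells))

# Keeping an explicit null preserves the alignment.
aligned = dict(zip(HEADERS, true_row))

print(shifted)
print(aligned)
```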

Annotation Details

StructBill-CN departs from conventional transcript-and-bounding-box annotations by adopting a Hierarchical JSON standard designed for immediate downstream integration. This Ingestion-Ready format structures ground truth into global key-value attributes and nested line-item lists, mimicking real-world database schemas to minimize post-processing.

Furthermore, the annotation protocol strictly prioritizes semantic attribution over physical location; in the presence of printing offsets or wireless table layouts, labels are assigned based on logical business context rather than geometric coordinates. This approach compels models to develop deep semantic alignment strategies, inferring structure from content logic rather than relying solely on superficial visual positioning.
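
To make the two-level format concrete, the sketch below builds a record with global key-value attributes and a nested line-item list, then checks that it has that shape. All key names and values (`global`, `items`, `document_type`, and so on) are hypothetical placeholders, not the dataset's actual schema, and the assumption that every attribute value is a scalar may not hold for the real annotations. Consult the released JSON files for the actual keys.

```python
import json

# Hypothetical record: global attributes plus a list of line items.
record = {
    "global": {
        "document_type": "Fee List",
        "total_amount": "85.00",
    },
    "items": [
        {"name": "Blood test", "quantity": "1", "amount": "35.00"},
        {"name": "X-ray", "quantity": "1", "amount": "50.00"},
    ],
}


def is_flat_mapping(obj):
    """True if obj is a dict whose values are all scalars (assumed rule)."""
    return isinstance(obj, dict) and all(
        not isinstance(value, (dict, list)) for value in obj.values()
    )


def has_hierarchical_shape(rec):
    """True if rec has a flat 'global' dict and an 'items' list of flat dicts."""
    if not isinstance(rec, dict):
        return False
    items = rec.get("items")
    if not isinstance(items, list):
        return False
    return is_flat_mapping(rec.get("global")) and all(
        is_flat_mapping(item) for item in items
    )


print(json.dumps(record, ensure_ascii=False, indent=2))
print(has_hierarchical_shape(record))
```

Because labels are attached to logical fields rather than coordinates, a record of this kind carries no bounding boxes; a value printed slightly off its column would still be stored under the key it belongs to by business logic.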

Citation

BibTeX:

@article{structbillcn2026,
  title={StructBill-CN: Benchmarking and Improving Logical Consistency in Visual Document Understanding with Schema-Reinforced Policy Optimization},
  author={Anonymous Authors},
  journal={Under Review at IJCAI},
  year={2026}
}

(Note: The citation will be updated with author names and official publication details upon acceptance.)
