English | 简体中文

ERNIE-Layout


1. Model Introduction

Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets.

The work was accepted by EMNLP 2022 (Findings). To expand the scope of commercial applications for document intelligence, we release the multilingual model of ERNIE-Layout through PaddleNLP.

2. Out-of-the-Box

HuggingFace web demo

🧾 HuggingFace web demo is available here

Demos

  • Invoice VQA
  • Poster VQA
  • WebPage VQA
  • Table VQA
  • Exam Paper VQA
  • English invoice VQA with multilingual (CH, EN, JP, TH, ES, RUS) prompts
  • Chinese invoice VQA with multilingual (CHS, CHT, EN, JP, FR) prompts
  • Demo images are available here

Taskflow

  • Input Format
[
  {"doc": "./book.png", "prompt": ["What is the name of the author of 'The Adventure Zone: The Crystal Kingdom’?", "What type of book cover does The Adventure Zone: The Crystal Kingdom have?", "For Rage, who is the author listed as?"]},
  {"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}
]

PaddleOCR is used by default, but you can also provide your own OCR results via word_boxes; the expected data format is List[str, List[float, float, float, float]].

[
  {"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes}
]
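
For illustration, a minimal sketch of passing your own OCR results; the file path, texts, and coordinates below are made up, and each box is [x1, y1, x2, y2] in image pixel coordinates:

from paddlenlp import Taskflow

# Hypothetical OCR output: one [text, box] pair per recognized segment.
word_boxes = [
    ["Invoice", [84.0, 40.0, 210.0, 72.0]],
    ["Total:", [84.0, 120.0, 150.0, 148.0]],
    ["$128.00", [160.0, 120.0, 260.0, 148.0]],
]

docprompt = Taskflow("document_intelligence")
docprompt([{"doc": "./invoice.png", "prompt": ["What is the total amount?"], "word_boxes": word_boxes}])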
  • Supports single and batch inputs

    • Image from http link
    >>> from pprint import pprint
    >>> from paddlenlp import Taskflow
    
    >>> docprompt = Taskflow("document_intelligence", lang="en")
    >>> docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/book.png", "prompt": ["What is the name of the author of 'The Adventure Zone: The Crystal Kingdom’?", "What type of book cover does The Adventure Zone: The Crystal Kingdom have?", "For Rage, who is the author listed as?"]}])
    [{'prompt': "What is the name of the author of 'The Adventure Zone: The "
                'Crystal Kingdom’?',
      'result': [{'end': 39,
                  'prob': 0.99,
                  'start': 22,
                  'value': 'Clint McElroy. Carey Pietsch, Griffn McElroy, Travis '
                          'McElroy'}]},
    {'prompt': 'What type of book cover does The Adventure Zone: The Crystal '
                'Kingdom have?',
      'result': [{'end': 51, 'prob': 1.0, 'start': 51, 'value': 'Paperback'}]},
    {'prompt': 'For Rage, who is the author listed as?',
      'result': [{'end': 93, 'prob': 1.0, 'start': 91, 'value': 'Bob Woodward'}]}]
    • Image from local path
    >>> from pprint import pprint
    >>> from paddlenlp import Taskflow
    
    >>> docprompt = Taskflow("document_intelligence")
    >>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}]))
    [{'prompt': '五百丁本次想要担任的是什么职位?',
      'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]},
    {'prompt': '五百丁是在哪里上的大学?',
      'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]},
    {'prompt': '大学学的是什么专业?',
      'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}]
  • Parameter Description

    • batch_size: the number of inputs per batch; defaults to 1.
    • lang: the PaddleOCR language; en works better for English images; defaults to ch.
    • topn: return the top-n results with the highest probability; defaults to 1.
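
    For example, these parameters can be combined when constructing the Taskflow; a minimal sketch reusing the book image from above:

    >>> from paddlenlp import Taskflow

    >>> # English OCR, two documents per batch, and the top-2 answers per prompt.
    >>> docprompt = Taskflow("document_intelligence", lang="en", batch_size=2, topn=2)
    >>> docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/book.png", "prompt": ["For Rage, who is the author listed as?"]}])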

3. Model Performance

  • Dataset

    | Dataset | Task | Language | Note |
    |---|---|---|---|
    | FUNSD | Key Information Extraction | English | - |
    | XFUND-ZH | Key Information Extraction | Chinese | - |
    | DocVQA-ZH | Document Question Answering | Chinese | Submission to the DocVQA-ZH competition is now closed, so we split the original dataset into three parts for model evaluation: 4,187 training images, 500 validation images, and 500 test images. |
    | RVL-CDIP (sampled) | Document Image Classification | English | The RVL-CDIP dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. Because the original dataset is large and slow to train on, we downsampled it; the sampled dataset consists of 6,400 training images, 800 validation images, and 800 test images. |
  • Results

    | Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH |
    |---|---|---|---|---|
    | LayoutXLM-Base | 86.72 | 90.88 | 86.24 | 66.01 |
    | ERNIE-LayoutX-Base | 89.31 | 90.29 | 88.58 | 69.57 |
  • Evaluation Methods

    • All of the above tasks use grid search for hyperparameter tuning. The evaluation step interval for FUNSD and XFUND-ZH is 100, with F1-Score as the metric; the evaluation step interval for RVL-CDIP is 2000, with Accuracy as the metric; and the evaluation step interval for DocVQA-ZH is 10000, with ANLS as the metric (a sketch of ANLS follows this list).

    • Hyper Parameters search ranges

      | Hyperparameter | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH |
      |---|---|---|---|---|
      | learning_rate | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 |
      | batch_size | 1, 2, 4 | 8, 16, 24 | 1, 2, 4 | 8, 16, 24 |
      | warmup_ratio | - | 0, 0.05, 0.1 | - | 0, 0.05, 0.1 |

      The lr_scheduler_type for FUNSD and XFUND-ZH is constant, so warmup_ratio is not searched for those datasets.

    • max_steps is used for fine-tuning on FUNSD and XFUND-ZH (10000 and 20000 steps respectively), while num_train_epochs is set to 6 for DocVQA-ZH and 20 for RVL-CDIP. A scripting sketch of the grid search appears at the end of this section.
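
    For reference, ANLS (Average Normalized Levenshtein Similarity) scores each answer by one minus its normalized edit distance to the closest accepted ground truth, zeroing any score past a threshold (0.5 by default). A minimal Python sketch of the metric, as our own illustration rather than the official DocVQA-ZH scorer:

    def levenshtein(a: str, b: str) -> int:
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def anls_score(prediction: str, ground_truths: list, threshold: float = 0.5) -> float:
        # Per-question score: best match over all accepted answers; answers whose
        # normalized edit distance exceeds the threshold contribute 0.
        best = 0.0
        for gt in ground_truths:
            nl = levenshtein(prediction.lower(), gt.lower()) / max(len(prediction), len(gt), 1)
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        return best

    The dataset-level ANLS is the mean of these per-question scores.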

  • Best Hyperparameters (listed as learning_rate, batch_size, warmup_ratio)

    | Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH |
    |---|---|---|---|---|
    | LayoutXLM-Base | 1e-5, 2, - | 1e-5, 8, 0.1 | 1e-5, 2, - | 2e-5, 8, 0.1 |
    | ERNIE-LayoutX-Base | 2e-5, 4, - | 1e-5, 8, 0 | 1e-5, 4, - | 2e-5, 8, 0.05 |
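
The grid search described above can be scripted directly. A minimal sketch for FUNSD, assuming the run_ner.py interface shown in the fine-tuning examples below; the output directories and the reduced flag set are illustrative:

import itertools
import subprocess

# Search ranges for FUNSD (no warmup_ratio: constant scheduler).
learning_rates = ["5e-6", "1e-5", "2e-5", "5e-5"]
batch_sizes = ["1", "2", "4"]

for lr, bs in itertools.product(learning_rates, batch_sizes):
    subprocess.run([
        "python", "-u", "run_ner.py",
        "--model_name_or_path", "ernie-layoutx-base-uncased",
        "--dataset_name", "funsd",
        "--output_dir", f"./grid/funsd_lr{lr}_bs{bs}/",
        "--do_train", "--do_eval",
        "--max_steps", "10000",
        "--eval_steps", "100",
        "--learning_rate", lr,
        "--lr_scheduler_type", "constant",
        "--per_device_train_batch_size", bs,
        "--metric_for_best_model", "eval_f1",
    ], check=True)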

4. Fine-tuning Examples

  • Installation
pip install -r requirements.txt

4.1 Key Information Extraction

  • FUNSD Train
python -u run_ner.py \
  --model_name_or_path ernie-layoutx-base-uncased \
  --output_dir ./ernie-layoutx-base-uncased/models/funsd/ \
  --dataset_name funsd \
  --do_train \
  --do_eval \
  --max_steps 10000 \
  --eval_steps 100 \
  --save_steps 100 \
  --save_total_limit 1 \
  --load_best_model_at_end \
  --pattern ner-bio \
  --preprocessing_num_workers 4 \
  --overwrite_cache false \
  --use_segment_box \
  --doc_stride 128 \
  --target_size 1000 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --learning_rate 2e-5 \
  --lr_scheduler_type constant \
  --gradient_accumulation_steps 1 \
  --seed 1000 \
  --metric_for_best_model eval_f1 \
  --greater_is_better true \
  --overwrite_output_dir
  • XFUND-ZH Train
python -u run_ner.py \
  --model_name_or_path ernie-layoutx-base-uncased \
  --output_dir ./ernie-layoutx-base-uncased/models/xfund_zh/ \
  --dataset_name xfund_zh \
  --do_train \
  --do_eval \
  --lang "ch" \
  --max_steps 20000 \
  --eval_steps 100 \
  --save_steps 100 \
  --save_total_limit 1 \
  --load_best_model_at_end \
  --pattern ner-bio \
  --preprocessing_num_workers 4 \
  --overwrite_cache false \
  --use_segment_box \
  --doc_stride 128 \
  --target_size 1000 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --learning_rate 1e-5 \
  --lr_scheduler_type constant \
  --gradient_accumulation_steps 1 \
  --seed 1000 \
  --metric_for_best_model eval_f1 \
  --greater_is_better true \
  --overwrite_output_dir

4.2 Document Question Answering

  • DocVQA-ZH Train
python3 -u run_mrc.py \
  --model_name_or_path ernie-layoutx-base-uncased \
  --output_dir ./ernie-layoutx-base-uncased/models/docvqa_zh/ \
  --dataset_name docvqa_zh \
  --do_train \
  --do_eval \
  --lang "ch" \
  --num_train_epochs 6 \
  --lr_scheduler_type linear \
  --warmup_ratio 0.05 \
  --weight_decay 0 \
  --eval_steps 10000 \
  --save_steps 10000 \
  --save_total_limit 1 \
  --load_best_model_at_end \
  --pattern "mrc" \
  --use_segment_box false \
  --return_entity_level_metrics false \
  --overwrite_cache false \
  --doc_stride 128 \
  --target_size 1000 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --learning_rate 2e-5 \
  --preprocessing_num_workers 32 \
  --train_nshard 16 \
  --seed 1000 \
  --metric_for_best_model anls \
  --greater_is_better true \
  --overwrite_output_dir

4.3 Document Image Classification

  • RVL-CDIP Train
python3 -u run_cls.py \
    --model_name_or_path ernie-layoutx-base-uncased \
    --output_dir ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ \
    --dataset_name rvl_cdip_sampled \
    --do_train \
    --do_eval \
    --num_train_epochs 20 \
    --lr_scheduler_type linear \
    --max_seq_length 512 \
    --warmup_ratio 0.05 \
    --weight_decay 0 \
    --eval_steps 2000 \
    --save_steps 2000 \
    --save_total_limit 1 \
    --load_best_model_at_end \
    --pattern "cls" \
    --use_segment_box \
    --return_entity_level_metrics false \
    --overwrite_cache false \
    --doc_stride 128 \
    --target_size 1000 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --learning_rate 1e-5 \
    --preprocessing_num_workers 32 \
    --train_nshard 16 \
    --seed 1000 \
    --metric_for_best_model acc \
    --greater_is_better true \
    --overwrite_output_dir

5. Deploy

5.1 Inference Model Export

After fine-tuning, you can export the inference model via the Model Export Script; the inference model will be saved in the output_path you specify.

  • Export the model fine-tuned on FUNSD
python export_model.py --task_type ner --model_path ./ernie-layoutx-base-uncased/models/funsd/ --output_path ./ner_export
  • Export the model fine-tuned on DocVQA-ZH
python export_model.py --task_type mrc --model_path ./ernie-layoutx-base-uncased/models/docvqa_zh/ --output_path ./mrc_export
  • Export the model fine-tuned on RVL-CDIP(sampled)
python export_model.py --task_type cls --model_path ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ --output_path ./cls_export
  • Parameter Description

    • model_path: the directory where the dynamic-graph model parameters are saved; defaults to "./checkpoint/".
    • output_path: the directory where the static-graph model parameters are saved; defaults to "./export".
  • Directory

    export/
    ├── inference.pdiparams
    ├── inference.pdiparams.info
    └── inference.pdmodel
    

5.2 Python Deploy

We provide deployment examples for Key Information Extraction, Document Question Answering, and Document Image Classification; please follow the ERNIE-Layout Python Deploy Guide.
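
Once exported, the static-graph model can be loaded with Paddle Inference. A minimal loading sketch; the ner_export path is the example from Section 5.1, and the full pre/post-processing is covered by the deploy guide:

import paddle.inference as paddle_infer

# Build the config from the exported static-graph files.
config = paddle_infer.Config("./ner_export/inference.pdmodel", "./ner_export/inference.pdiparams")
config.disable_gpu()  # or config.enable_use_gpu(100, 0) to run on GPU 0

predictor = paddle_infer.create_predictor(config)
# Inspect the input tensors the exported model expects before feeding data.
print(predictor.get_input_names())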

References