DocILE - Document Information Localization and Extraction

This is the implementation of our team (UIT@AICLUB_TAB) for the KILE subtask. We won the 3rd Prize in this competition. The paper can be found here.

Introduction

DocILE is a large-scale research benchmark for cross-evaluation of machine learning methods for Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) from semi-structured business documents such as invoices and orders. Such a large-scale benchmark was previously missing (Skalický et al., 2022), hindering comparative evaluation.

Folder Structure

├── experiments             # checkpoints saved during training
├── predictions             # predictions written during inference
├── run_inference.sh
├── run_training.sh
├── train.py
├── inference.py
├── config.py               # edit the configuration here
└── utils                   # source code to create pseudo data or visualize data

Dependencies

Make sure you are using Python 3.8+ to run the scripts in this repository. Run these commands to install the requirements:

apt install poppler-utils
pip install -r requirements.txt
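
Before training, it can help to confirm that both prerequisites are in place. The snippet below is a minimal sketch, not part of this repository: the helper names `python_ok` and `poppler_available` are hypothetical, and the check assumes that `poppler-utils` exposes the `pdftoppm` binary on your `PATH`.

```python
import shutil
import sys


def python_ok(version_info=sys.version_info, minimum=(3, 8)):
    """Return True if the interpreter version is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= minimum


def poppler_available():
    """Return True if the `pdftoppm` binary from poppler-utils is found on PATH."""
    return shutil.which("pdftoppm") is not None


if __name__ == "__main__":
    print("Python 3.8+:", python_ok())
    print("poppler-utils (pdftoppm):", poppler_available())
```

If either line prints `False`, install the missing piece with the commands above before running the scripts.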

Config

You can view and edit the configuration here. In this file, you can configure whether the model uses Post-Processing and the Fast Gradient Method, and how the outputs are ensembled. You can also choose the Optimizer and Scheduler used to train the model.
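
For readers unfamiliar with the Fast Gradient Method, the idea is to add a small perturbation to the embeddings in the direction of the loss gradient, rescaled to a fixed L2 norm `epsilon`, and to train on the perturbed input as well. The snippet below is a minimal pure-Python sketch of that perturbation step only. It is an illustration, not the code in `fgm.py`; the function names and the default `epsilon` are assumptions.

```python
import math


def fgm_perturbation(grad, epsilon=1.0):
    """Return epsilon * grad / ||grad||_2, or a zero vector if the gradient is zero."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # No gradient direction to follow, so apply no perturbation.
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


def apply_perturbation(embedding, grad, epsilon=1.0):
    """Return a perturbed copy of `embedding`; the original list is left untouched."""
    delta = fgm_perturbation(grad, epsilon)
    return [e + d for e, d in zip(embedding, delta)]


if __name__ == "__main__":
    embedding = [0.2, -0.1]
    grad = [3.0, 4.0]  # hypothetical gradient of the loss w.r.t. the embedding
    print(apply_perturbation(embedding, grad, epsilon=0.5))
```

In a typical training loop, the loss is computed once on the clean embeddings and once on the perturbed ones, and the original embeddings are restored before the optimizer step.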

Trained Models

You can download the weights of our three models here. Don't forget to set MODEL_PATHS in config.py, then run inference with the methods we used (Ensemble, Post-processing, ...).
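
One common way to ensemble several token classifiers is to average their per-class probabilities and take the highest-scoring class for each token. The sketch below illustrates that approach under the assumption that every model outputs probabilities over the same label set for the same tokens. It is not necessarily the ensembling strategy implemented in this repository, and `ensemble_average` is a hypothetical name.

```python
def ensemble_average(per_model_probs):
    """Average class probabilities across models and pick one class per token.

    per_model_probs[m][t][c] is the probability that model m assigns
    to class c for token t. Returns a list with one class index per token.
    """
    n_models = len(per_model_probs)
    n_tokens = len(per_model_probs[0])
    labels = []
    for t in range(n_tokens):
        n_classes = len(per_model_probs[0][t])
        mean = [
            sum(model[t][c] for model in per_model_probs) / n_models
            for c in range(n_classes)
        ]
        # Index of the class with the highest averaged probability.
        labels.append(max(range(n_classes), key=mean.__getitem__))
    return labels


if __name__ == "__main__":
    # Three hypothetical models, two tokens, two classes.
    probs = [
        [[0.9, 0.1], [0.10, 0.90]],
        [[0.6, 0.4], [0.60, 0.40]],
        [[0.2, 0.8], [0.55, 0.45]],
    ]
    print(ensemble_average(probs))
```

Averaging probabilities lets one confident model outweigh two hesitant ones, which is the main difference from simple majority voting.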

Train

Before training, set the hyperparameters in run_training.sh. Don't forget to change the output directory, data path, checkpoint path, and GPU devices as well.

./run_training.sh

To resume from a specific checkpoint, replace TIMESTAMP=$(date +"%Y%m%d_%H%M_%S") with the name of the folder containing the checkpoint, and add the parameter --resume to train_params.
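
The `date` format string above produces folder names such as `20230501_1430_09`. Because every field is zero-padded, sorting the names alphabetically also sorts them chronologically, which makes it easy to find the most recent run. The sketch below reproduces the format in Python and picks the latest name from a list. It assumes run folders are named only by this timestamp; `run_folder_name` and `latest_run` are hypothetical helpers, not part of the repository.

```python
import re
from datetime import datetime

# Same layout as `date +"%Y%m%d_%H%M_%S"` in run_training.sh.
TIMESTAMP_FORMAT = "%Y%m%d_%H%M_%S"
TIMESTAMP_PATTERN = re.compile(r"\d{8}_\d{4}_\d{2}")


def run_folder_name(now=None):
    """Format a datetime (default: current time) as a run folder name."""
    now = now or datetime.now()
    return now.strftime(TIMESTAMP_FORMAT)


def latest_run(folder_names):
    """Return the most recent timestamp-named folder, or None if there is none."""
    runs = [name for name in folder_names if TIMESTAMP_PATTERN.fullmatch(name)]
    return max(runs) if runs else None


if __name__ == "__main__":
    print(run_folder_name())
    print(latest_run(["20230430_2359_59", "20230501_1430_09", "notes"]))
```

The value returned by `latest_run` is what you would assign to TIMESTAMP when resuming the most recent run.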

Inference

Before inference, you also need to change the checkpoint path, data path, ... in run_inference.sh. After that, run this command to run inference on the validation data (or any other data):

./run_inference.sh val

Create Pseudo data for pretraining

You can view this tutorial.

Links

https://github.com/rossumai/docile

https://docile.rossum.ai/
