Field Extraction from Forms

Introduction

This repository contains the code for the paper "Field Extraction from Forms with Unlabeled Data".

Environment

CUDA="11.0"
CUDNN="8"
UBUNTU="18.04"
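
As a quick sanity check before installing, the sketch below compares the locally installed CUDA toolkit against the version listed above. It is only an illustration and not part of this repository: the helper names `parse_nvcc_release` and `installed_cuda_release` are hypothetical, and it assumes `nvcc` is on your PATH.

```python
import re
import shutil
import subprocess

# Version listed in the Environment section above.
EXPECTED_CUDA = "11.0"


def parse_nvcc_release(text):
    """Return the CUDA release (e.g. '11.0') found in `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


def installed_cuda_release():
    """Return the release reported by nvcc, or None if nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    found = installed_cuda_release()
    if found is None:
        print("nvcc not found; cannot determine the CUDA version")
    elif found != EXPECTED_CUDA:
        print(f"CUDA {found} found, but this README lists {EXPECTED_CUDA}")
    else:
        print(f"CUDA {found} matches the listed environment")
```

A different CUDA version may still work; the check only reports whether your setup matches the one listed here.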

Install

bash install.sh
# run from the project root folder
python setup.py develop

Data Preparation

*The pre-processed INV-CDIP test set is provided under datasets/.

Reproduce Our Results

*Download our model, which was pre-trained on the unlabeled INV-CDIP training set.

python main.py \
--model_name_or_path pretrained_model_acl2022 \
--output_dir $OUTPUT_PATH

Visualization

*Download the images of the INV-CDIP test set and place them under datasets/imgs.

python vis_results.py --pred_path $OUTPUT_PATH/prediction_pairs.pkl
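
To inspect the prediction file before visualizing it, a minimal sketch like the one below can help. This README does not document the layout of prediction_pairs.pkl, so the code only reports the top-level type and size of whatever object the file holds. The helper name `summarize_pickle` is hypothetical and not part of this repository.

```python
import pickle
from pathlib import Path


def summarize_pickle(path):
    """Load a pickle file and describe its top-level object.

    Only load pickle files you trust: unpickling can execute arbitrary code.
    """
    with Path(path).open("rb") as f:
        obj = pickle.load(f)

    summary = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        summary["length"] = len(obj)
    if isinstance(obj, dict):
        # Show a few keys so the structure can be explored further.
        summary["sample_keys"] = list(obj)[:5]
    return summary
```

For example, `summarize_pickle("prediction_pairs.pkl")` with the path adjusted to your output directory. If the file stores project-specific classes, run the snippet from the project root so those classes can be imported during unpickling.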

Citation

If you find this codebase useful, please cite our paper:

@article{gao2021field,
  title={Field Extraction from Forms with Unlabeled Data},
  author={Gao, Mingfei and Chen, Zeyuan and Naik, Nikhil and Hashimoto, Kazuma and Xiong, Caiming and Xu, Ran},
  journal={ACL Spa-NLP Workshop},
  year={2022}
}

License

Our code is released under the BSD 3-Clause license.

Our pre-trained model is released under the CC BY-NC 4.0 license.

The INV-CDIP dataset is released under CC BY-NC 4.0. The underlying documents to which the dataset refers are from the Tobacco Collections of Industry Documents Library. Please see Copyright and Fair Use for more information.

Contact

Please send an email to mingfei.gao@salesforce.com if you have questions.
