CNMT

Introduction

Code for our AAAI 2021 paper Confidence-aware Non-repetitive Multimodal Transformers for TextCaps [PDF].

Installation

Our implementation is based on the Pythia framework (now called mmf) and built upon M4C-Captioner. Please refer to Pythia's documentation for more details on installation requirements.

# install dependencies listed in requirements.txt, then build Pythia in develop mode
pip install -r requirements.txt
python setup.py build develop
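
After installation, a quick sanity check may help (a minimal sketch; it assumes the package installs under the name pythia and that a CUDA GPU is available for training):

# verify the build is importable and a GPU is visible
import torch
import pythia  # installed in develop mode by the command above

print("CUDA available:", torch.cuda.is_available())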

Data Preparation

The following open-source data for the TextCaps dataset comes from M4C-Captioner's GitHub repository. Please download the files from the links below and extract them under the data directory.

Our imdb files include new OCR tokens and recognition confidences extracted with pretrained OCR systems (CRAFT, ABCNet, and four-stage STR). Download the three imdb files from the links below and put them under data/imdb/; a loading sketch follows the table.

file name                            download link
imdb_train.npy                       Google Drive / Baidu Netdisk (password: sxbk)
imdb_val_filtered_by_image_id.npy    Google Drive / Baidu Netdisk (password: i6pf)
imdb_test_filtered_by_image_id.npy   Google Drive / Baidu Netdisk (password: uxew)
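
To peek at what an imdb file contains, load it with NumPy. This is a minimal sketch: the file is a pickled object array, and the field names ocr_tokens and ocr_confidence are assumptions based on the description above, so check the printed keys first.

import numpy as np

# M4C-style imdb files are pickled numpy object arrays;
# entry 0 is typically metadata and entries 1.. are samples
imdb = np.load("data/imdb/imdb_train.npy", allow_pickle=True)
sample = imdb[1]

print(sorted(sample.keys()))
print(sample.get("ocr_tokens"))      # assumed field name
print(sample.get("ocr_confidence"))  # assumed field name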

Finally, your data directory structure should look like this:

data
|-detectron
|---...
|-m4c_textvqa_ocr_en_frcn_features
|---...
|-open_images
|---...
|-vocab_textcap_threshold_10.txt    # already provided
|-imdb
|---imdb_train.npy
|---imdb_val_filtered_by_image_id.npy
|---imdb_test_filtered_by_image_id.npy
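
Before training, you can verify the layout with a small sketch (paths taken from the tree above):

from pathlib import Path

# every path listed in the directory tree above
required = [
    "data/detectron",
    "data/m4c_textvqa_ocr_en_frcn_features",
    "data/open_images",
    "data/vocab_textcap_threshold_10.txt",
    "data/imdb/imdb_train.npy",
    "data/imdb/imdb_val_filtered_by_image_id.npy",
    "data/imdb/imdb_test_filtered_by_image_id.npy",
]

missing = [p for p in required if not Path(p).exists()]
print("all data in place" if not missing else f"missing: {missing}")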

Pretrained Model

download link                                   description   val set CIDEr   test set CIDEr
Google Drive / Baidu Netdisk (password: c4be)   CNMT best     101.6           93.0
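
After downloading, you can confirm the checkpoint loads cleanly. A minimal sketch; the path matches the Evaluation section below, and the checkpoint's top-level keys depend on how Pythia saved it, so treat them as unknown until printed:

import torch

# load on CPU just to inspect the checkpoint contents
ckpt = torch.load("save/cnmt/m4c_textcaps_cnmt/best.ckpt", map_location="cpu")
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))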

Training

We provide an example script that trains on the TextCaps dataset for 12,000 iterations and evaluates on the validation set every 500 iterations.

./train.sh

This may take approximately 13 hours, depending on your GPU devices. Please refer to our paper for implementation details.

First-time training will automatically download the fastText model. You may also download it manually and put it under pythia/.vector_cache/.

During training, log files are written under save/cnmt/m4c_textcaps_cnmt/logs/. You may also run training in the background and check the log file for training status, as in the sketch below.
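
A simple way to follow the training log from another process (a sketch; the naming scheme inside the logs directory is an assumption, so it just picks the most recently modified file):

import time
from pathlib import Path

log_dir = Path("save/cnmt/m4c_textcaps_cnmt/logs")
files = [p for p in log_dir.iterdir() if p.is_file()]
latest = max(files, key=lambda p: p.stat().st_mtime)

# tail -f equivalent: print new lines as they are written
with latest.open() as f:
    f.seek(0, 2)  # jump to the end of the file
    while True:
        line = f.readline()
        if line:
            print(line, end="")
        else:
            time.sleep(1.0)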

Evaluation

Assume the trained model checkpoint is saved at save/cnmt/m4c_textcaps_cnmt/best.ckpt (otherwise, modify the resume_file parameter in the shell scripts).

Run the following scripts to generate a prediction JSON file:

# evaluate on the validation set
./eval_val.sh
# evaluate on the test set
./eval_test.sh

The prediction JSON file will be saved under save/eval/m4c_textcaps_cnmt/reports/. You can submit it to the TextCaps EvalAI server for results; a quick format check is sketched below.
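
Before submitting, a quick sanity check of the report file helps. This sketch assumes the M4C-Captioner report format, a JSON list of records with image_id and caption keys; the exact file name is generated at run time, so it picks the newest .json in the reports directory:

import json
from pathlib import Path

reports_dir = Path("save/eval/m4c_textcaps_cnmt/reports")
report_path = max(reports_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)

with report_path.open() as f:
    predictions = json.load(f)

print(len(predictions), "predictions")
print(predictions[0])  # expected keys (assumed): image_id, caption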

Citation

@article{wang2020confidenceaware,
  title={Confidence-aware Non-repetitive Multimodal Transformers for TextCaps}, 
  author={Wang, Zhaokai and Bao, Renda and Wu, Qi and Liu, Si},
  year={2020},
  journal={arXiv preprint arXiv:2012.03662},
}
