Code for the CVPR 2019 paper "Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing".
- Python 2.7
- PyTorch 0.3.0
- CUDA 8.0
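A quick way to check that your environment matches these versions is a short script like the sketch below (it assumes a working PyTorch install; the check itself is not part of this repo):

```python
# Sanity-check the environment (Python 2.7 / PyTorch 0.3.0 / CUDA 8.0 expected).
from __future__ import print_function
import sys
import torch

print('Python :', sys.version.split()[0])
print('PyTorch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
```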
- Clone the CM-Erase repository
git clone --recursive https://github.com/xh-liu/CM-Erase
- Prepare the submodules and associated data:
  - Mask R-CNN: Follow the instructions of my mask-faster-rcnn repo to prepare everything needed for pyutils/mask-faster-rcnn.
  - REFER API and data: Use the download links of REFER, go to the folder, and run make. Follow data/README.md to prepare images and the refcoco/refcoco+/refcocog annotations.
  - refer-parser2: Follow the instructions of refer-parser2 to extract the parsed expressions using Vicente's R1-R7 attributes. Note that this sub-module is only needed if you want to train the models yourself.
- Prepare the training and evaluation data by running tools/prepro.py:
python tools/prepro.py --dataset refcoco --splitBy unc
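As a quick sanity check after preprocessing, you can peek at the generated data file. The path below follows a MAttNet-style cache layout and is an assumption; adjust it to wherever prepro.py writes on your machine:

```python
# Inspect the preprocessed referring-expression data.
# NOTE: the output path is assumed (MAttNet-style cache layout); adjust if needed.
from __future__ import print_function
import json

with open('cache/prepro/refcoco_unc/data.json') as f:
    data = json.load(f)

print('top-level keys:', list(data.keys()))
for key, value in data.items():
    if isinstance(value, list):
        print(key, ':', len(value), 'entries')
```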
- Download the GloVe pretrained word embeddings from Google Drive.
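If you want to inspect the downloaded embeddings, the standard GloVe text format (one token per line followed by its vector) can be loaded with a few lines of Python. The filename below is only an example, not necessarily the file behind the Drive link:

```python
# Load GloVe vectors from the plain-text format: "<word> v1 v2 ... vD" per line.
# The filename is an example; use the file provided by the Google Drive link.
from __future__ import print_function
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove('glove.6B.300d.txt')  # example filename
dim = next(iter(glove.values())).shape[0]
print(len(glove), 'words,', dim, 'dimensions')
```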
- Extract features using Mask R-CNN, where head_feats are used for subject module training and ann_feats are used for relationship module training.
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_head_feats.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_ann_feats.py --dataset refcoco --splitBy unc
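The extracted features are stored on disk for later training; a minimal way to peek at an HDF5 feature file is sketched below. The exact path and dataset names depend on the extraction scripts, so the ones here are placeholders:

```python
# Peek at an extracted feature file (path and dataset names are placeholders;
# check what extract_mrcn_head_feats.py / extract_mrcn_ann_feats.py actually write).
from __future__ import print_function
import h5py

with h5py.File('cache/feats/refcoco_unc/head_feats.h5', 'r') as f:  # placeholder path
    for name, obj in f.items():
        print(name, getattr(obj, 'shape', '(group)'))
```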
- Detect objects/masks and extract their features (only needed if you want to evaluate fully automatic comprehension). We empirically set the Mask R-CNN confidence threshold to 0.65.
CUDA_VISIBLE_DEVICES=gpu_id python tools/run_detect.py --dataset refcoco --splitBy unc --conf_thresh 0.65
CUDA_VISIBLE_DEVICES=gpu_id python tools/run_detect_to_mask.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_det_feats.py --dataset refcoco --splitBy unc
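The --conf_thresh 0.65 flag simply discards low-confidence Mask R-CNN detections before masks and features are extracted. Conceptually the filtering looks like the sketch below; the detection dicts are illustrative and not the repo's on-disk format:

```python
# Illustration of confidence-threshold filtering on detector output.
# The detection format here is illustrative; run_detect.py uses its own structure.
from __future__ import print_function

CONF_THRESH = 0.65

def filter_detections(detections, thresh=CONF_THRESH):
    """Keep only detections whose score reaches the threshold."""
    return [d for d in detections if d['score'] >= thresh]

dets = [{'box': [10, 20, 50, 80], 'score': 0.91, 'category': 'person'},
        {'box': [30, 40, 60, 90], 'score': 0.42, 'category': 'chair'}]
print(filter_detections(dets))  # only the 0.91 detection survives
```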
- Pretrain the network (CM-Att) with ground-truth annotation:
./experiments/scripts/train_mattnet.sh GPU_ID
- Train the network with cross-modal erasing (CM-Att-Erase):
./experiments/scripts/train_erase.sh GPU_ID
- Evaluate the network with ground-truth annotations:
./experiments/scripts/eval_easy.sh GPU_ID
- Evaluate the network with Mask R-CNN detection results:
./experiments/scripts/eval_dets.sh GPU_ID
We provide pre-trained models for RefCOCO, RefCOCO+, and RefCOCOg. Download them from Google Drive and put them under the ./output folder.
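To check what a downloaded checkpoint contains before evaluation, torch.load is enough; the filename below is a placeholder for whichever model you downloaded:

```python
# Inspect a downloaded checkpoint (the path is a placeholder).
from __future__ import print_function
import torch

checkpoint = torch.load('output/refcoco_unc_model.pth',            # placeholder path
                        map_location=lambda storage, loc: storage)  # force CPU loading
print(type(checkpoint))
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys()))
```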
If you find our code useful for your research, please consider citing:
@inproceedings{liu2019improving,
title={Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing},
author={Liu, Xihui and Wang, Zihao and Shao, Jing and Wang, Xiaogang and Li, Hongsheng},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={1950--1959},
year={2019}
}
@inproceedings{yu2018mattnet,
title={Mattnet: Modular attention network for referring expression comprehension},
author={Yu, Licheng and Lin, Zhe and Shen, Xiaohui and Yang, Jimei and Lu, Xin and Bansal, Mohit and Berg, Tamara L},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={1307--1315},
year={2018}
}
This project is built on the PyTorch implementation of MAttNet: Modular Attention Network for Referring Expression Comprehension (CVPR 2018).