Code for the CVPR 2019 paper "Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing".
- Python 2.7
- PyTorch 0.3.0
- CUDA 8.0
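A quick way to check that your environment matches these versions is a short script like the sketch below (it assumes a working PyTorch install; the check itself is not part of this repo):

```python
# Sanity-check the environment (Python 2.7 / PyTorch 0.3.0 / CUDA 8.0 expected).
from __future__ import print_function
import sys
import torch

print('Python :', sys.version.split()[0])
print('PyTorch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
```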
- Clone the CM-Erase repository
git clone --recursive https://github.com/xh-liu/CM-Erase
- Prepare the submodules and associated data:
  - Mask R-CNN: Follow the instructions of my mask-faster-rcnn repo to prepare everything needed for pyutils/mask-faster-rcnn.
  - REFER API and data: Use the download links of REFER, go to the folder, and run make. Follow data/README.md to prepare images and the refcoco/refcoco+/refcocog annotations.
  - refer-parser2: Follow the instructions of refer-parser2 to extract the parsed expressions using Vicente's R1-R7 attributes. Note that this sub-module is only needed if you want to train the models yourself.
- Prepare the training and evaluation data by running tools/prepro.py:
python tools/prepro.py --dataset refcoco --splitBy unc
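As a quick sanity check after preprocessing, you can peek at the generated data file. The path below follows a MAttNet-style cache layout and is an assumption; adjust it to wherever prepro.py writes on your machine:

```python
# Inspect the preprocessed referring-expression data.
# NOTE: the output path is assumed (MAttNet-style cache layout); adjust if needed.
from __future__ import print_function
import json

with open('cache/prepro/refcoco_unc/data.json') as f:
    data = json.load(f)

print('top-level keys:', list(data.keys()))
for key, value in data.items():
    if isinstance(value, list):
        print(key, ':', len(value), 'entries')
```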
- Download the GloVe pretrained word embeddings from Google Drive.
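If you want to inspect the downloaded embeddings, the standard GloVe text format (one token per line followed by its vector) can be loaded with a few lines of Python. The filename below is only an example, not necessarily the file behind the Drive link:

```python
# Load GloVe vectors from the plain-text format: "<word> v1 v2 ... vD" per line.
# The filename is an example; use the file provided by the Google Drive link.
from __future__ import print_function
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove('glove.6B.300d.txt')  # example filename
dim = next(iter(glove.values())).shape[0]
print(len(glove), 'words,', dim, 'dimensions')
```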
- Extract features using Mask R-CNN, where head_feats are used for subject module training and ann_feats are used for relationship module training.
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_head_feats.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_ann_feats.py --dataset refcoco --splitBy unc
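The extracted features are stored on disk for later training; a minimal way to peek at an HDF5 feature file is sketched below. The exact path and dataset names depend on the extraction scripts, so the ones here are placeholders:

```python
# Peek at an extracted feature file (path and dataset names are placeholders;
# check what extract_mrcn_head_feats.py / extract_mrcn_ann_feats.py actually write).
from __future__ import print_function
import h5py

with h5py.File('cache/feats/refcoco_unc/head_feats.h5', 'r') as f:  # placeholder path
    for name, obj in f.items():
        print(name, getattr(obj, 'shape', '(group)'))
```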
- Detect objects/masks and extract their features (only needed if you want to evaluate fully automatic comprehension). We empirically set the Mask R-CNN confidence threshold to 0.65.
CUDA_VISIBLE_DEVICES=gpu_id python tools/run_detect.py --dataset refcoco --splitBy unc --conf_thresh 0.65
CUDA_VISIBLE_DEVICES=gpu_id python tools/run_detect_to_mask.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_det_feats.py --dataset refcoco --splitBy unc
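The --conf_thresh 0.65 flag simply discards low-confidence Mask R-CNN detections before masks and features are extracted. Conceptually the filtering looks like the sketch below; the detection dicts are illustrative and not the repo's on-disk format:

```python
# Illustration of confidence-threshold filtering on detector output.
# The detection format here is illustrative; run_detect.py uses its own structure.
from __future__ import print_function

CONF_THRESH = 0.65

def filter_detections(detections, thresh=CONF_THRESH):
    """Keep only detections whose score reaches the threshold."""
    return [d for d in detections if d['score'] >= thresh]

dets = [{'box': [10, 20, 50, 80], 'score': 0.91, 'category': 'person'},
        {'box': [30, 40, 60, 90], 'score': 0.42, 'category': 'chair'}]
print(filter_detections(dets))  # only the 0.91 detection survives
```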
- Pretrain the network (CM-Att) with ground-truth annotation:
./experiments/scripts/train_mattnet.sh GPU_ID
- Train the network with cross-modal erasing (CM-Att-Erase):
./experiments/scripts/train_erase.sh GPU_ID
- Evaluate the network with ground-truth annotations:
./experiments/scripts/eval_easy.sh GPU_ID
- Evaluate the network with Mask R-CNN detection results:
./experiments/scripts/eval_dets.sh GPU_ID
We provide pre-trained models for RefCOCO, RefCOCO+, and RefCOCOg. Download them from Google Drive and put them under the ./output folder.
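To check what a downloaded checkpoint contains before evaluation, torch.load is enough; the filename below is a placeholder for whichever model you downloaded:

```python
# Inspect a downloaded checkpoint (the path is a placeholder).
from __future__ import print_function
import torch

checkpoint = torch.load('output/refcoco_unc_model.pth',            # placeholder path
                        map_location=lambda storage, loc: storage)  # force CPU loading
print(type(checkpoint))
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys()))
```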
If you find our code useful for your research, please consider citing:
@inproceedings{liu2019improving,
title={Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing},
author={Liu, Xihui and Wang, Zihao and Shao, Jing and Wang, Xiaogang and Li, Hongsheng},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={1950--1959},
year={2019}
}
@inproceedings{yu2018mattnet,
title={Mattnet: Modular attention network for referring expression comprehension},
author={Yu, Licheng and Lin, Zhe and Shen, Xiaohui and Yang, Jimei and Lu, Xin and Bansal, Mohit and Berg, Tamara L},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={1307--1315},
year={2018}
}
This project is built on the PyTorch implementation of MAttNet: Modular Attention Network for Referring Expression Comprehension (CVPR 2018).