Code for "Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner" in ICCV 2017


This is the official code for the paper

Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner
Tseng-Hung Chen, Yuan-Hong Liao, Ching-Yao Chuang, Wan-Ting Hsu, Jianlong Fu, Min Sun
To appear in ICCV 2017

In this repository we provide:

If you find this code useful for your research, please cite

  title={Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner},
  author={Chen, Tseng-Hung and Liao, Yuan-Hong and Chuang, Ching-Yao and Hsu, Wan-Ting and Fu, Jianlong and Sun, Min},
  journal={arXiv preprint arXiv:1705.00930},


P.S. Please clone the repository with the --recursive flag:

# Make sure to clone with --recursive
git clone --recursive

Data Preprocessing

MSCOCO Captioning dataset

Feature Extraction

  1. Download the pretrained ResNet-101 model and place it under data-prepro/MSCOCO_preprocess/resnet_model/.
  2. Please modify the caffe path in data-prepro/MSCOCO_preprocess/
  3. Go to data-prepro/MSCOCO_preprocess and run the following script: ./ for downloading images and extracting features.

Captions Tokenization

  1. Clone the NeuralTalk2 repository, go to its coco/ folder, and run the IPython notebook there to generate the JSON file for the Karpathy split, coco_raw.json.
  2. Run the following script: ./ for downloading and tokenizing captions.
  3. Run python to generate annotation json file for testing.
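
For readers unfamiliar with the tokenization step, the following is a minimal sketch of what caption tokenization and vocabulary building typically involve. It is not the repository's script: the function names, the special tokens `<PAD>` and `<UNK>`, and the `min_count` threshold are assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(caption):
    """Lowercase a caption and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def build_vocab(captions, min_count=1):
    """Assign an integer id to every word seen at least min_count times."""
    counts = Counter(tok for cap in captions for tok in tokenize(cap))
    # Hypothetical special tokens: padding and out-of-vocabulary words.
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for word in sorted(w for w, c in counts.items() if c >= min_count):
        vocab[word] = len(vocab)
    return vocab


def encode(caption, vocab):
    """Convert a caption to a list of ids, mapping rare words to <UNK>."""
    return [vocab.get(tok, vocab["<UNK>"]) for tok in tokenize(caption)]
```

With `min_count=2`, words that appear only once in the training captions are encoded as `<UNK>`, which keeps the vocabulary small.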

CUB-200-2011 with Descriptions

Feature Extraction

  1. Run the script ./ to download the images in CUB-200-2011.
  2. Please modify the input/output path in data-prepro/MSCOCO_preprocess/ to extract and pack features in CUB-200-2011.

Captions Tokenization

  1. Download the description data.
  2. Run python to generate dataset split following the ECCV16 paper "Generating Visual Explanations".
  3. Run python to generate annotation json file for testing.
  4. Run python for tokenization.
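
The annotation JSON file used for testing is not described here. As a rough sketch, assuming the evaluation code expects the COCO caption annotation layout (an `images` list and an `annotations` list linked by `image_id`), such a file could be assembled as follows. The function name and the input format are hypothetical.

```python
import json


def to_coco_annotations(captions_by_image):
    """Build a COCO-style caption annotation dict.

    captions_by_image maps an image id to its list of reference captions.
    """
    images, annotations = [], []
    ann_id = 0
    for image_id in sorted(captions_by_image):
        images.append({"id": image_id})
        for caption in captions_by_image[image_id]:
            # Each reference caption gets its own unique annotation id.
            annotations.append(
                {"image_id": image_id, "id": ann_id, "caption": caption}
            )
            ann_id += 1
    return {"type": "captions", "images": images, "annotations": annotations}


def save_annotations(captions_by_image, path):
    """Write the annotation dict to a JSON file at the given path."""
    with open(path, "w") as f:
        json.dump(to_coco_annotations(captions_by_image), f)
```

Each image may carry several reference captions. The evaluation then compares a generated caption against all references that share its `image_id`.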

Models from the paper

Pretrained Models

Download all pretrained and adaptation models:

Example Results

Here are some example results where the captions are generated from these models:

MSCOCO: A large air plane on a run way.
CUB-200-2011: A large white and black airplane with a large beak.
TGIF: A plane is flying over a field.
Flickr30k: A large airplane is sitting on a runway.

MSCOCO: A traffic light is seen in front of a large building.
CUB-200-2011: A yellow traffic light with a yellow light.
TGIF: A traffic light is hanging on a pole.
Flickr30k: A street sign is lit up in the dark

MSCOCO: A black dog sitting on the ground next to a window.
CUB-200-2011: A black and white dog with a black head.
TGIF: A dog is looking at something in the mirror.
Flickr30k: A black dog is looking out of the window.

MSCOCO: A man riding a skateboard up the side of a ramp.
CUB-200-2011: A man riding a skateboard on a white ramp.
TGIF: A man is doing a trick on a skateboard.
Flickr30k: A man in a blue shirt is doing a trick on a skateboard.


The training code is under the show-adapt-tell/ folder.

Simply run python for two steps of training:

Training the source model with paired image-caption data

Please set the Boolean value of "G_is_pretrain" to True in to start pretraining the generator.

Training the cross-domain captioner with unpaired data

After pretraining, set "G_is_pretrain" to False to start training the cross-domain model.
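
The two steps can be summarized by how the single flag selects a stage. The snippet below is only an illustrative sketch: apart from `G_is_pretrain`, every name in it is hypothetical, and it does not reproduce the repository's training script.

```python
def select_stage(G_is_pretrain):
    """Describe the training stage selected by the G_is_pretrain flag."""
    if G_is_pretrain:
        # Step 1: train the source model on paired image-caption data.
        return {"stage": "pretrain", "data": "paired"}
    # Step 2: train the cross-domain captioner on unpaired data.
    return {"stage": "cross-domain", "data": "unpaired"}


# The two steps are run one after the other, flipping the flag in between.
for flag in (True, False):
    print(flag, select_stage(flag))
```

Because the second step starts from the pretrained generator, run it only after the first step has finished.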


Free for personal or research use; for commercial use please contact me.