Learning to Evaluate Image Captioning

TensorFlow implementation for the paper:

Learning to Evaluate Image Captioning
Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, Serge Belongie
CVPR 2018

This repository contains a discriminator that can be trained to evaluate image captioning systems. The discriminator is trained to distinguish between machine-generated captions and human-written ones. At test time, the trained discriminator takes the candidate caption, the reference caption, and optionally the image to be captioned as input. Its output, the probability that the candidate caption is human-written, can be used as a score for the candidate caption. Please refer to our paper [link] for more details.


  • Python (2.7)
  • TensorFlow (>1.4)
  • PyTorch (for extracting ResNet image features.)
  • ProgressBar
  • NLTK


  1. Clone the repository recursively, which also pulls in the bilinear pooling code.
git clone --recursive
  2. Install dependencies. Please refer to the official TensorFlow, PyTorch, and NLTK websites for installation guides. For the other dependencies, run:
pip install -r requirements.txt
  3. Download data. This script will download the needed data. A detailed description of the data can be found in "./".
  4. Generate the vocabulary.
python scripts/preparation/
  5. Extract image features. The following script downloads the COCO dataset and a ResNet checkpoint, then extracts image features from the COCO images using ResNet. This might take a few hours.
cd scripts/features/
python --data-dir ../../data/ --coco-img-dir ../../data

Alternatively, we provide a [link] to download features extracted from ResNet152. Please put all *.npy files under "./data/resnet152/".
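
Before moving on, it can help to confirm that the feature files landed where the text above says they should. The following is a minimal sketch, not part of the repository: the helper name `list_feature_files` is hypothetical, and it only checks that *.npy files exist under "./data/resnet152/", not that their contents are valid.

```python
import glob
import os


def list_feature_files(feature_dir):
    """Return the sorted names of all *.npy files directly under feature_dir."""
    pattern = os.path.join(feature_dir, "*.npy")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))


# Directory named in the instructions above; run this from the repository root.
feature_dir = os.path.join("data", "resnet152")
found = list_feature_files(feature_dir)
if not found:
    print("No .npy files found under {}".format(feature_dir))
else:
    print("Found {} feature file(s) under {}".format(len(found), feature_dir))
```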


To evaluate the results of an image captioning method, first put the output captions of the model on COCO dataset into the following JSON format:

    {
        "<file-name-1>" : "<caption-1>",
        "<file-name-2>" : "<caption-2>",
        ...
        "<file-name-n>" : "<caption-n>"
    }

Note that each <caption-i> is a caption in plain text, and each file name is the name of the corresponding image file. Captions should be all lower-case and have no \n at the end. Example files, produced by running the open-source NeuralTalk, Show and Tell, and Show, Attend and Tell models, can be found in the examples folder: examples/neuraltalk_all_captions.json, examples/showandtell_all_captions.json, examples/showattendandtell_all_captions.json, and examples/human_all_captions.json.
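
As an illustration of the format rules above, here is a minimal sketch of how a submission file could be produced from a model's raw output. It is not part of the repository: the helper names (`normalize_caption`, `build_submission`, `write_submission`), the image file names, and the captions are all made up for the example.

```python
import json


def normalize_caption(caption):
    """Lower-case a caption and strip surrounding whitespace, including a trailing newline."""
    return caption.strip().lower()


def build_submission(captions):
    """Map each image file name to its normalized caption."""
    return {name: normalize_caption(text) for name, text in captions.items()}


def write_submission(captions, path):
    """Write the submission as one JSON object of the form {"<file-name>": "<caption>", ...}."""
    with open(path, "w") as f:
        json.dump(build_submission(captions), f, indent=4)


# Hypothetical file names and captions, for illustration only.
raw = {
    "image_0001.jpg": "A man riding a wave on a surfboard.\n",
    "image_0002.jpg": "Two dogs playing in the snow",
}
submission = build_submission(raw)
```

Calling `write_submission(raw, "my_model_all_captions.json")` would then write a file in the same shape as the ones in the examples folder; the output file name here is only a placeholder.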

Make sure you have NLTK Punkt sentence tokenizer installed in Python:

import nltk
nltk.download('punkt')

The following command prepares the data so that it can be used for training:

python scripts/preparation/ --submission examples/neuraltalk_all_captions.json  --name neuraltalk

Note that we assume you've followed through the steps in the Preparation section before running this command. This script will create a folder data/neuraltalk and three .npy files that contain data needed for training the metric. Please use the following command to train the metric:

python --name neuraltalk

The results will be logged in the model/neuraltalk_scoring directory. If you use the default model architecture, the results will be in model/neuraltalk_scoring/mlp_1_img_1_512_0.txt.

The following are the scores for three submissions, each calculated as the average score over the last 10 epochs. Note that scores might differ slightly between runs due to randomness in training.

| Architecture | Epochs | NeuralTalk | Show and Tell | Show, Attend and Tell |
| --- | --- | --- | --- | --- |
| mlp_1_img_1_512_0 | 30 | 0.038 | 0.056 | 0.077 |
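
The reported number for each submission is an average over the final 10 epochs. A minimal sketch of that averaging step is shown below. It assumes the per-epoch scores have already been read into a list of floats; the layout of the log file is not described here, so no parsing is attempted, and the function name and the example scores are made up for illustration.

```python
def average_last_epochs(scores, last_n=10):
    """Return the mean of the final `last_n` per-epoch scores.

    If fewer than `last_n` scores are available, all of them are averaged.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    tail = scores[-last_n:]
    return sum(tail) / float(len(tail))


# Made-up per-epoch scores for a 30-epoch run, for illustration only.
epoch_scores = [0.10] * 20 + [0.04] * 10
final_score = average_last_epochs(epoch_scores)
```

Averaging over several late epochs rather than reading off the last one smooths out some of the run-to-run variation mentioned above.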


If you find our work helpful in your research, please cite it as:

  title = {Learning to Evaluate Image Captioning},
  author = {Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie},

