Implementation of the image-sentence embedding method described in "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models"


Code for the image-sentence ranking methods from "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models" (Kiros, Salakhutdinov, Zemel. 2014).

Images and sentences are mapped into a common vector space, where the sentence representation is computed using an LSTM. This project contains training code and pre-trained models for Flickr8K, Flickr30K and MS COCO.

If you're interested in generating image captions instead, see our follow-up project arctic-captions.


Below are the results obtained using the code from this repository, compared with the numbers reported in our paper. aR@K is the Recall@K for image annotation (higher is better), while sR@K is the Recall@K for image search (higher is better). Medr is the median rank of the closest ground truth (lower is better); aMedr and sMedr are its values for annotation and search respectively.
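To make the metrics concrete, here is a minimal NumPy sketch of Recall@K and median rank. It is an illustration only, not the code in the evaluation module: it assumes a single ground-truth candidate per query (candidate i for query i), and the function name recall_and_medr and the toy score matrix are hypothetical.

```python
import numpy as np

def recall_and_medr(scores, ks=(1, 5, 10)):
    """scores[i, j] is the similarity of query i to candidate j.

    Assumes, for illustration, that candidate i is the only
    ground truth for query i.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    true_scores = scores[np.arange(n), np.arange(n)]
    # Rank of the ground truth: candidates scoring strictly higher, plus 1.
    ranks = (scores > true_scores[:, None]).sum(axis=1) + 1
    # Recall@K: percentage of queries whose ground truth is in the top K.
    recalls = {k: 100.0 * np.mean(ranks <= k) for k in ks}
    medr = float(np.median(ranks))
    return recalls, medr

# Toy example with 3 queries and 3 candidates.
scores = np.array([[0.9, 0.1, 0.2],
                   [0.8, 0.3, 0.1],
                   [0.2, 0.6, 0.7]])
recalls, medr = recall_and_medr(scores, ks=(1, 2))
```

Here the ground truth is ranked 1st, 2nd and 1st for the three queries, so two of the three queries count towards Recall@1 and all three towards Recall@2.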


Flickr8K

Method         aR@1   aR@5   aR@10   aMedr   sR@1   sR@5   sR@10   sMedr
reported       18.0   40.9   55.0    8       12.5   37.0   51.5    10
this project   22.3   48.7   59.8    6       14.9   38.3   51.6    10

Flickr30K

Method         aR@1   aR@5   aR@10   aMedr   sR@1   sR@5   sR@10   sMedr
reported       23.0   50.7   62.9    5       16.8   42.0   56.5    8
this project   29.8   58.4   70.5    4       22.0   47.9   59.3    6

MS COCO

Method         aR@1   aR@5   aR@10   aMedr   sR@1   sR@5   sR@10   sMedr
this project   43.4   75.7   85.8    2       31.0   66.7   79.9    3

For a complete list of results on these tasks, see this paper by Lin Ma et al. (ICCV 2015), which contains the most up-to-date tables (as of September 2015).


This code is written in Python. To use it you will need:

  • Python 2.7
  • Theano 0.7
  • A recent version of NumPy and SciPy

Getting started

You will first need to download the dataset files and pre-trained models. These can be obtained by running


Each of the dataset files contains the captions as well as VGG features from the 19-layer model. Flickr8K comes with a pre-defined train/dev/test split, while for Flickr30K and MS COCO we use the splits produced by Andrej Karpathy. Note that the original images are not included with the dataset. The full contents of each of the datasets can be obtained here, here and here.

Once the datasets are downloaded, open and set the directory to where the datasets are.

NOTE to Toronto users: the unzipped files are available in my gobi3 directory under uvsdata and uvsmodels. Just link there instead of downloading.

Evaluating pre-trained models

Let's use Flickr8K as an example. To reproduce the numbers in the table above, open and specify the path to the downloaded Flickr8K model. Then in IPython run the following:

import tools, evaluation
model = tools.load_model()
evaluation.evalrank(model, data='f8k', split='test')

This will evaluate the loaded model on the Flickr8K test set. You can also replace 'test' with 'dev' to evaluate on the development set. Alternatively, evaluate the Flickr30K and MS COCO models instead.

Computing image and sentence vectors

Suppose you have a list of strings that you would like to embed into the learned vector space. To embed them, run the following:

sentence_vectors = tools.encode_sentences(model, X, verbose=True)

Where 'X' is the list of strings. Note that the strings should already be pre-tokenized, so that split() returns the tokens.
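If your strings are raw text, a small helper along these lines can put them into a form where split() returns the tokens. This is only a sketch: the function name pretokenize is hypothetical, and the tokenization used to prepare the training captions is not described here, so match it as closely as you can.

```python
import re

def pretokenize(text):
    """Separate punctuation from words so that str.split() yields tokens."""
    # Put spaces around every character that is not a word character,
    # whitespace, or an apostrophe.
    spaced = re.sub(r"([^\w\s'])", r" \1 ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(spaced.split())

X = [pretokenize(s) for s in ["A dog runs, fast!", "Two kids play."]]
```

The resulting list X can then be passed to tools.encode_sentences as shown above.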

As the vectors are being computed, it will print some numbers. The code works by extracting vectors in batches of sentences that have the same length, so each number printed is the length currently being processed. If you want to turn this off, set verbose=False when calling encode_sentences.
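The idea of batching by length can be sketched as follows. This is not the code in tools, just an illustration of the grouping step; the function name group_by_length is hypothetical.

```python
from collections import defaultdict

def group_by_length(sentences):
    """Map token count -> indices of the sentences with that many tokens."""
    groups = defaultdict(list)
    for i, sentence in enumerate(sentences):
        groups[len(sentence.split())].append(i)
    return dict(groups)

X = ["a dog runs", "two kids play", "a man", "hello"]
groups = group_by_length(X)
```

Each group can then be stacked into a single batch without padding, since all of its sentences have the same number of tokens.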

To encode images, run the following instead:

image_vectors = tools.encode_images(model, IM)

Where 'IM' is a NumPy array of VGG features. Note that the VGG features were scaled to unit norm prior to training the models.
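If you extract your own features, you will probably want to apply the same scaling before encoding. A minimal sketch, assuming one feature vector per row; the helper name l2_normalize and the eps guard are not part of this project.

```python
import numpy as np

def l2_normalize(features, eps=1e-8):
    """Scale each row of a 2-D array to unit Euclidean norm."""
    features = np.asarray(features, dtype=np.float32)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # eps guards against division by zero for all-zero rows.
    return features / np.maximum(norms, eps)

# Toy 2-dimensional "features"; real VGG features would have 4096 columns.
IM = l2_normalize(np.array([[3.0, 4.0], [0.0, 2.0]]))
```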

Training new models

Open and specify the hyperparameters that you would like. Below we describe each of them in detail:

  • data: The dataset to train on (f8k, f30k or coco).
  • margin: The margin used for computing the pairwise ranking loss. Should be between 0 and 1.
  • dim: The dimensionality of the learned embedding space (also the size of the RNN state).
  • dim_image: The dimensionality of the image features. This will be 4096 for VGG.
  • dim_word: The dimensionality of the learned word embeddings.
  • ncon: The number of contrastive (negative) examples for computing the loss.
  • encoder: The type of RNN to use. Only supports gru at the moment.
  • max_epochs: The number of epochs used for training.
  • dispFreq: How often to display training progress.
  • decay_c: The weight decay hyperparameter.
  • grad_clip: When to clip the gradient.
  • maxlen_w: Sentences longer than this value will be ignored.
  • optimizer: The optimization method to use. Only supports 'adam' at the moment.
  • batch_size: The size of a minibatch.
  • saveto: The location to save the model.
  • validFreq: How often to evaluate on the development set.
  • reload_: Whether to reload a previously trained model.
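To show what margin controls, here is a NumPy sketch of a bidirectional pairwise ranking (hinge) loss. It is a simplification rather than the Theano code used for training: it treats every non-matching item in the minibatch as a contrastive example, whereas ncon sets their number in this project, and the function name and the default margin of 0.2 are assumptions made for the example.

```python
import numpy as np

def pairwise_ranking_loss(im, s, margin=0.2):
    """Bidirectional hinge loss over a minibatch.

    im, s: arrays of shape (batch, dim) with unit-norm rows, where row i
    of `im` matches row i of `s`. All non-matching rows in the batch act
    as contrastive examples (an assumption of this sketch).
    """
    scores = im.dot(s.T)          # scores[i, j] = similarity(image i, sentence j)
    pos = np.diag(scores)         # scores of the matching pairs
    # Image as query: contrastive sentences lie along each row.
    cost_s = np.maximum(0.0, margin - pos[:, None] + scores)
    # Sentence as query: contrastive images lie along each column.
    cost_im = np.maximum(0.0, margin - pos[None, :] + scores)
    # A matching pair is not its own contrastive example.
    np.fill_diagonal(cost_s, 0.0)
    np.fill_diagonal(cost_im, 0.0)
    return cost_s.sum() + cost_im.sum()

im = np.eye(2)
loss_good = pairwise_ranking_loss(im, np.eye(2))            # pairs aligned
loss_bad = pairwise_ranking_loss(im, np.eye(2)[::-1])       # pairs swapped
```

When every matching pair beats every contrastive pair by at least the margin, the loss is zero; the swapped case is penalised in both directions.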

Once you are happy, just run the following:

import train

As the model trains, it will periodically evaluate on the development set (validFreq) and re-save the model each time performance on the development set increases. Generally you shouldn't need more than 15-20 epochs of training on any of the datasets. Once the models are saved, you can load and evaluate them in the same way as the pre-trained models.
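The save-on-improvement behaviour amounts to the following pattern. This is a schematic sketch only: evaluate_dev and save_model are hypothetical callables standing in for the development-set evaluation and the saving step, and the score being compared is an assumption.

```python
def train_with_checkpointing(n_checks, evaluate_dev, save_model):
    """Call save_model() whenever the development score improves."""
    best = float("-inf")
    saved_at = []
    for step in range(n_checks):
        score = evaluate_dev(step)
        if score > best:
            # New best development score: keep this model.
            best = score
            save_model(step)
            saved_at.append(step)
    return best, saved_at

# Toy run with made-up development scores at four validation points.
dev_scores = [10.0, 12.5, 12.0, 13.1]
saved = []
best, saved_at = train_with_checkpointing(
    len(dev_scores), lambda t: dev_scores[t], saved.append)
```

In this toy run the model is saved at checks 0, 1 and 3, and skipped at check 2 where the score drops.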

Using different datasets and features

If you want to use a different dataset, or use different image features, you will have to edit the dataset paths. Each of the training, dev and test sets consists of 2 files: a .txt file of captions (one per line) and a .npy file containing a NumPy array of image features, where each row is the image features for the corresponding caption. If you put your dataset in the same format, then it can be used for training new models.
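As a rough guide, files in this format could be written as below. The helper write_split and the file names it produces are hypothetical; use whatever names match the paths you configure.

```python
import os
import tempfile
import numpy as np

def write_split(directory, name, captions, features):
    """Write one split: a captions .txt file and a features .npy file."""
    features = np.asarray(features, dtype=np.float32)
    if len(captions) != features.shape[0]:
        raise ValueError("need exactly one feature row per caption")
    caps_path = os.path.join(directory, name + "_caps.txt")  # hypothetical name
    ims_path = os.path.join(directory, name + "_ims.npy")    # hypothetical name
    with open(caps_path, "w") as f:
        # One caption per line, in the same order as the feature rows.
        f.write("\n".join(captions) + "\n")
    np.save(ims_path, features)
    return caps_path, ims_path

out_dir = tempfile.mkdtemp()
caps_path, ims_path = write_split(
    out_dir, "train", ["a dog runs", "two kids play"], np.zeros((2, 4096)))
```

Note that an image with several captions needs its feature row repeated once per caption so that the rows stay aligned with the lines of the .txt file.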


If you found this code useful, please cite the following paper:

Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel. "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models." arXiv preprint arXiv:1411.2539 (2014).

@article{kiros2014unifying,
  title={Unifying visual-semantic embeddings with multimodal neural language models},
  author={Kiros, Ryan and Salakhutdinov, Ruslan and Zemel, Richard S},
  journal={arXiv preprint arXiv:1411.2539},
  year={2014}
}


Apache License 2.0