Implementation for "Large-scale Pretraining for Visual Dialog"


PyTorch implementation for the paper:

Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
Vishvak Murahari, Dhruv Batra, Devi Parikh, Abhishek Das

Prior work in visual dialog has focused on training deep neural models on the VisDial dataset in isolation, which has led to great progress, but is limiting and wasteful. In this work, following recent trends in representation learning for language, we introduce an approach to leverage pretraining on related large-scale vision-language datasets before transferring to visual dialog. Specifically, we adapt the recently proposed ViLBERT model for multi-turn visually-grounded conversation sequences. Our model is pretrained on the Conceptual Captions and Visual Question Answering datasets, and finetuned on VisDial with a VisDial-specific input representation and the masked language modeling and next sentence prediction objectives (as in BERT). Our best single model achieves state-of-the-art on Visual Dialog, outperforming prior published work (including model ensembles) by more than 1% absolute on NDCG and MRR.


This repository contains code for reproducing results with and without finetuning on dense annotations. All results are on v1.0 of the Visual Dialog dataset. We provide pretrained model weights and associated configs to run inference or train these models from scratch.

If you find this work useful in your research, please cite:

  title={Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline},
  author={Vishvak Murahari and Dhruv Batra and Devi Parikh and Abhishek Das},
  journal={arXiv preprint arXiv:1912.02379},


Setup and Dependencies

Our code is implemented in PyTorch (v1.0). To set up, do the following:

  1. Install Python 3.6
  2. Get the source:
git clone visdial-bert
  3. Install requirements into the visdial-bert virtual environment, using Anaconda:
conda env create -f env.yml


Make both scripts in scripts/ executable:

chmod +x scripts/
chmod +x scripts/

Download preprocessed data

Download preprocessed dataset and extracted features:

sh scripts/

To get these files from scratch:

python preprocessing/ 

However, we recommend downloading these files directly.

Pre-trained checkpoints

Download pre-trained checkpoints:

sh scripts/


After running the above scripts, all the preprocessed data is downloaded to data/visdial, and the major pretrained model checkpoints used in the paper are downloaded to checkpoints-release.

The training arguments for the main model variants in the paper are listed below.

To train the base model (no finetuning on dense annotations):

python -batch_size 80  -batch_multiply 1 -lr 2e-5 -image_lr 2e-5 -mask_prob 0.1 -sequences_per_image 2 -start_path checkpoints-release/vqa_pretrained_weights
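
The -mask_prob 0.1 argument above presumably sets the probability with which each dialog token is selected for the masked language modeling objective. The sketch below illustrates the standard BERT-style masking recipe (80% [MASK], 10% random token, 10% unchanged). It is an illustration only: the function name `mask_tokens`, the toy vocabulary, and the 80/10/10 split are assumptions and are not taken from this repository.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}  # never masked in this sketch


def mask_tokens(tokens, mask_prob, vocab, rng):
    """Return (masked_tokens, labels).

    labels[i] holds the original token at positions selected for
    prediction and None everywhere else.
    """
    masked, labels = [], []
    for tok in tokens:
        # Skip special tokens and tokens not selected for masking.
        if tok in SPECIAL or rng.random() >= mask_prob:
            masked.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK)               # replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # replace with a random token
        else:
            masked.append(tok)                # keep the original token
    return masked, labels


tokens = ["[CLS]", "is", "the", "dog", "brown", "[SEP]", "yes", "it", "is", "[SEP]"]
vocab = ["cat", "dog", "red", "no", "two"]  # hypothetical toy vocabulary
masked, labels = mask_tokens(tokens, 0.1, vocab, random.Random(0))
```

With a probability of 0.1, roughly one in ten non-special tokens contributes to the masked language modeling loss for each sequence.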

To finetune the base model with dense annotations:

python -batch_size 80 -batch_multiply 10  -lr 1e-4 -image_lr 1e-4 -nsp_loss_coeff 0 -mask_prob 0.1 -sequences_per_image 2 -start_path checkpoints-release/basemodel

To finetune the base model with dense annotations and the next sentence prediction (NSP) loss:

python -batch_size 80 -batch_multiply 10  -lr 1e-4 -image_lr 1e-4 -nsp_loss_coeff 1 -mask_prob 0.1 -sequences_per_image 2 -start_path checkpoints-release/basemodel
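
The dense finetuning commands raise -batch_multiply from 1 to 10. Assuming this argument controls gradient accumulation (an interpretation, not something stated in this README), each optimizer step would aggregate 10 micro-batches of 80 sequences. The sketch below uses a toy scalar model to show why accumulating scaled micro-batch gradients matches the gradient of one large batch; the helper names `grad_mse` and `accumulated_grad` are hypothetical.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, batch_size, batch_multiply):
    """Accumulate gradients over `batch_multiply` micro-batches of `batch_size`."""
    total = 0.0
    for step in range(batch_multiply):
        lo = step * batch_size
        hi = lo + batch_size
        # Dividing by batch_multiply makes the sum equal the large-batch mean.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / batch_multiply
    return total


batch_size, batch_multiply = 80, 10
effective_batch = batch_size * batch_multiply  # sequences per optimizer step

# Toy regression data, just to have something to differentiate.
xs = [i / 100 for i in range(effective_batch)]
ys = [3 * x + 1 for x in xs]

full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, batch_size, batch_multiply)
```

Under this assumption, the larger effective batch is obtained without increasing the per-step memory footprint, which matters given the memory limits described in the note that follows.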

NOTE: Dense annotation finetuning is currently supported only for 8-GPU training, primarily because of memory constraints. Computing the cross-entropy loss over the 100 answer options at a dialog round requires all 100 dialog sequences to be in memory. However, only 80 sequences fit on 8 GPUs with ~12 GB RAM, so we select only 80 of the 100 options. Performance degrades with fewer GPUs because the number of answer options must be cut down further.
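
The option subsampling described in this note can be sketched as follows. This is a minimal illustration, not the repository's implementation: how the 80 options are chosen is not specified here, so always keeping the ground-truth option is an assumption, as are the helper names `subsample_options` and `soft_cross_entropy` and the toy relevance scores.

```python
import math
import random


def subsample_options(num_options, keep, gt_index, rng):
    """Pick `keep` of `num_options` option indices, always keeping gt_index."""
    others = [i for i in range(num_options) if i != gt_index]
    return sorted(rng.sample(others, keep - 1) + [gt_index])


def soft_cross_entropy(scores, relevance):
    """Cross entropy between softmax(scores) and relevance normalised to sum to 1."""
    total_rel = sum(relevance)
    if total_rel == 0:
        return 0.0  # no relevant option survived subsampling
    max_s = max(scores)
    # Numerically stable log-sum-exp for the softmax normaliser.
    log_z = max_s + math.log(sum(math.exp(s - max_s) for s in scores))
    return -sum((r / total_rel) * (s - log_z) for s, r in zip(scores, relevance))


rng = random.Random(0)
relevance = [0.0] * 100                     # toy dense annotations for one round
relevance[7], relevance[42] = 1.0, 0.5
scores = [rng.gauss(0, 1) for _ in range(100)]  # stand-in for model scores

kept = subsample_options(100, 80, gt_index=7, rng=rng)
loss = soft_cross_entropy([scores[i] for i in kept], [relevance[i] for i in kept])
```

Only the 80 kept sequences need a forward pass, which is what bounds memory; the cost is that relevant options outside the kept set do not contribute to the loss.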


The command below generates a prediction file that can be submitted to the test server to get results on the test split.

python -n_gpus 8 -start_path <path to model> -save_name <name of model>
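
For orientation, the sketch below builds a prediction file in the layout used by the Visual Dialog challenge, assuming the test server expects a JSON list of entries with `image_id`, `round_id`, and `ranks` fields. The exact file this repository writes is not shown in this README, so treat the field names, the helper names, and the example values as assumptions.

```python
import json


def scores_to_ranks(scores):
    """Return the rank of each option (1 = highest score), in option order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, option in enumerate(order, start=1):
        ranks[option] = position
    return ranks


def make_entry(image_id, round_id, scores):
    """Build one submission entry for a single dialog round."""
    return {
        "image_id": image_id,
        "round_id": round_id,
        "ranks": scores_to_ranks(scores),
    }


# Hypothetical image id and scores; a real entry ranks all 100 options.
entries = [make_entry(image_id=1, round_id=10, scores=[0.1, 0.7, 0.2])]
payload = json.dumps(entries)
```

Note that `ranks` lists the rank assigned to each option in its original order, rather than the option indices sorted by score.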

The metrics for the pretrained checkpoints should match the numbers reported in the paper. They are also listed below for convenience. These results are on v1.0 test-std.

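| Checkpoint | Mean Rank | MRR | R@1 | R@5 | R@10 | NDCG |
| --- | --- | --- | --- | --- | --- | --- |
| basemodel | 3.32 | 67.50 | 53.85 | 84.68 | 93.25 | 63.87 |
| basemodel + dense | 6.28 | 50.74 | 37.95 | 64.13 | 80.00 | 74.47 |
| basemodel + dense + nsp | 4.28 | 63.92 | 50.78 | 79.53 | 89.60 | 68.08 |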
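
The metrics reported above can be computed from model outputs roughly as follows. This is a sketch under stated assumptions rather than the official evaluation code: the retrieval metrics take the 1-indexed rank of the ground-truth answer at each round, and NDCG follows the common Visual Dialog convention of truncating at the number of options with non-zero relevance. The function names are hypothetical.

```python
import math


def retrieval_metrics(gt_ranks):
    """Mean rank, MRR and R@k (as percentages) from 1-indexed ground-truth ranks."""
    n = len(gt_ranks)
    metrics = {
        "mean_rank": sum(gt_ranks) / n,
        "mrr": 100 * sum(1 / r for r in gt_ranks) / n,
    }
    for k in (1, 5, 10):
        metrics[f"r@{k}"] = 100 * sum(r <= k for r in gt_ranks) / n
    return metrics


def ndcg(scores, relevance):
    """NDCG for one round, truncated at the number of relevant options."""
    k = sum(1 for r in relevance if r > 0)
    if k == 0:
        return 0.0
    # Options the model ranks in its top k, best first.
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    dcg = sum(relevance[i] / math.log2(pos + 2) for pos, i in enumerate(top))
    ideal = sorted(relevance, reverse=True)[:k]
    idcg = sum(r / math.log2(pos + 2) for pos, r in enumerate(ideal))
    return dcg / idcg


example = retrieval_metrics([1, 2, 4])
perfect = ndcg([0.9, 0.1, 0.5], [1.0, 0.0, 0.5])
imperfect = ndcg([0.1, 0.9, 0.5], [1.0, 0.0, 0.5])
```

The two metric families reward different behavior: MRR and R@k depend only on the single ground-truth answer, while NDCG credits every option with non-zero relevance. This is consistent with the table, where dense finetuning raises NDCG while lowering MRR.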


We use Visdom for all logging. To use this feature, specify the visdom_server, visdom_port and enable_visdom arguments.
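
As a rough illustration of how these three arguments might be wired up, the sketch below parses them with argparse and creates a Visdom client only when logging is enabled. The argument types, the default values, and the helper names `build_parser` and `make_logger` are assumptions for illustration; only the argument names come from this README.

```python
import argparse


def build_parser():
    """Parser for the three logging arguments (types and defaults are assumed)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-enable_visdom", type=int, default=0)
    parser.add_argument("-visdom_server", type=str, default="http://127.0.0.1")
    parser.add_argument("-visdom_port", type=int, default=8097)
    return parser


def make_logger(args):
    """Return a Visdom client if logging is enabled, otherwise None."""
    if not args.enable_visdom:
        return None
    import visdom  # imported lazily so the dependency is optional

    return visdom.Visdom(server=args.visdom_server, port=args.visdom_port)


# Example invocation with logging disabled and a custom port.
args = build_parser().parse_args(["-enable_visdom", "0", "-visdom_port", "8893"])
logger = make_logger(args)
```

When logging is enabled, a Visdom server must already be running at the given address and port for the client to connect.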

Visualizing Results

Coming soon


Builds on Jiasen Lu's ViLBERT implementation.


