Dense Video Captioning with Bi-modal Transformer

Project Page • ArXiv • BMVC Page • Presentation (Can't watch YouTube? I gotchu! 🤗)

This is a PyTorch implementation for our paper: A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (BMVC 2020).


Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting the visual information alone, while completely neglecting the audio track.

To this end, we present the Bi-modal Transformer with Proposal Generator (BMT), which efficiently utilizes audio and visual input sequences to select events in a video and then uses these clips to generate a textual description.

Bi-Modal Transformer with Proposal Generator

Audio and visual features are encoded with VGGish and I3D, while caption tokens are embedded with GloVe. First, VGGish and I3D features are passed through the stack of N bi-modal encoder layers, where audio and visual sequences are encoded to what we call audio-attended visual and video-attended audio features. These features are passed to the bi-modal multi-headed proposal generator, which generates a set of proposals using information from both modalities.

Then, the input features are trimmed according to the proposed segments and encoded in the bi-modal encoder again. The stack of N bi-modal decoder layers takes two inputs: a) GloVe embeddings of the previously generated caption sequence, and b) the internal representation from the last layer of the encoder for both modalities. The decoder produces its own internal representation, which the generator then uses to model the distribution over the vocabulary for the next caption word.
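To make the "audio-attended visual" and "video-attended audio" terms more concrete, here is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is not the code of this repository: it has no learned projections, multiple heads, residual connections, or normalization, and the function names and toy inputs are hypothetical. It also assumes that "audio-attended visual" means visual queries attending to audio keys and values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends to all keys.

    queries: list of T_q vectors of dimension d
    keys:    list of T_k vectors of dimension d
    values:  list of T_k vectors (any dimension)
    Returns T_q vectors, each a weighted average of `values`.
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


# Toy inputs (hypothetical): 3 visual steps and 2 audio steps, both of size d = 2.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0]]

# Visual queries attend to the audio sequence: one output per visual step.
audio_attended_visual = cross_attention(visual, audio, audio)
# Audio queries attend to the visual sequence: one output per audio step.
video_attended_audio = cross_attention(audio, visual, visual)
```

Note that each output sequence keeps the length of its query modality, so the two streams can have different temporal resolutions.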

Getting Started

The code is tested on Ubuntu 16.04/18.04 with one NVIDIA GPU 1080Ti/2080Ti. If you are planning to use it with other software/hardware, you might need to adapt conda environment files or even the code.

Clone the repository. Mind the --recursive flag to make sure submodules are also cloned (evaluation scripts for Python 3 and scripts for feature extraction).

git clone --recursive

Download features (I3D and VGGish) and word embeddings (GloVe). The script will download them (~10 GB) and unpack them into the ./data and ./.vector_cache folders. Make sure to run it from the BMT folder.

bash ./

Set up a conda environment

conda env create -f ./conda_env.yml
conda activate bmt
# install spacy language model. Make sure you activated the conda environment
python -m spacy download en


We train our model in two stages: training of the captioning module on ground truth proposals, followed by training of the proposal generator using the pre-trained encoder from the captioning module.

  • Train the captioning module. You may also download the pre-trained model (md5 hash 7b4d48cd77ec49a027a4a1abc6867ee7).
python \
    --procedure train_cap \
    --B 32
  • Train the proposal generation module. You may also download the pre-trained model (md5 hash 5f8b20826b09eadd41b7a5be662c198b).
python \
    --procedure train_prop \
    --pretrained_cap_model_path /your_exp_path/ \
    --B 16
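If you download the pre-trained models instead of training them, you may want to check the files against the md5 hashes listed above. The following is a small sketch using only the Python standard library. The helper names are hypothetical and the checkpoint path is a placeholder that you need to replace with your own.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """True if the file's md5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Example (placeholder path, hash taken from the captioning model bullet above):
# matches_md5("/path/to/downloaded_cap_model", "7b4d48cd77ec49a027a4a1abc6867ee7")
```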


Since some of the videos in ActivityNet Captions have become unavailable over time, we could only obtain ~91 % of the videos in the dataset (see ./data/available_mp4.txt for ids). Therefore, we evaluate the performance of our model on ~91 % of the validation videos. We provide the validation sets without the missing videos in ./data/val_*_no_missings.json. Please see the Experiments and Supplementary Material sections of the paper for details and for the performance of other models on the same validation sets.
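The filtered validation files are already provided, so you do not need to build them yourself. For illustration only, removing unavailable videos from a reference set could look like the sketch below. The helper name, the toy entries, and the assumed layout (a dict keyed by video id, with the same id format in the list of available videos) are assumptions, not a description of the repository's files.

```python
def filter_references(references, available_ids):
    """Keep only the reference entries whose video id is in `available_ids`."""
    return {vid: ann for vid, ann in references.items() if vid in available_ids}


# Toy data in an assumed layout; ids and captions are made up.
references = {
    "v_aaa": {"duration": 10.0, "timestamps": [[0.0, 5.0]], "sentences": ["A man runs."]},
    "v_bbb": {"duration": 20.0, "timestamps": [[2.0, 9.0]], "sentences": ["A dog jumps."]},
}
available_ids = {"v_aaa"}

filtered = filter_references(references, available_ids)
coverage = len(filtered) / len(references)  # fraction of videos that survived
```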

  • Ground truth proposals. The performance of the captioning module on ground truth segments can be read from the checkpoint file of the pre-trained captioning module. You may also want to use the official evaluation script with ./data/val_*_no_missings.json as references (-r argument).
import torch
cap_model_cpt = torch.load('./path_to_pre_trained_model/', map_location='cpu')
# To obtain the final results, average values in both dicts
  • Learned proposals. Create a file with captions for every proposal provided in --prop_pred_path using the captioning model specified in --pretrained_cap_model_path. The script will automatically evaluate it against both ground truth validation sets. Alternatively, use the predictions prop_results_val_1_e17_maxprop100.json in ./results and the official evaluation script with ./data/val_*_no_missings.json as references (-r argument).
python \
    --procedure evaluate \
    --pretrained_cap_model_path / \
    --prop_pred_path /path_to_generated_json_file \
    --device_ids 0
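For the ground truth proposals case, the comment "average values in both dicts" refers to combining the metrics of the two validation sets. A minimal sketch of that averaging step is shown below. It assumes each set of metrics is a flat dict that maps a metric name to a number. The actual layout inside the checkpoint may differ, and the numbers here are placeholders rather than results.

```python
def average_metrics(metrics_a, metrics_b):
    """Average the values of two metric dicts over the keys they share."""
    shared = metrics_a.keys() & metrics_b.keys()
    return {name: (metrics_a[name] + metrics_b[name]) / 2 for name in sorted(shared)}


# Placeholder numbers for two validation sets (not real results).
val_1 = {"METEOR": 10.0, "Bleu_4": 2.0}
val_2 = {"METEOR": 12.0, "Bleu_4": 1.0}

final = average_metrics(val_1, val_2)
print(final)  # {'Bleu_4': 1.5, 'METEOR': 11.0}
```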

Details on Feature Extraction

Check out our script for extraction of I3D and VGGish features from a set of videos: video_features on GitHub (make sure to check out commit 4fa02bd5c5b8c34081dcfb609e2bcd5a973eaab2). Also see #7 for more details on configuration.

Reproducibility Note

We would like to note that, despite a fixed random seed, some randomness occurs in our experimentation. Therefore, during the training of the captioning module, one might achieve slightly different results. Specifically, the numbers in your case might differ (higher or lower) from ours or the model will saturate in a different number of epochs. At the same time, we observed quite consistent results when training the proposal generation module with the pre-trained captioning module.

We relate this problem to padding and how it is implemented in PyTorch (see PyTorch Reproducibility for details). Any suggestions on how to address this issue are greatly appreciated.
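For reference, fixing the random seeds in a PyTorch project typically looks like the sketch below. The helper name `seed_everything` is hypothetical and this is not necessarily how seeding is done in this repository. As noted above, fixed seeds alone did not remove all run-to-run variation in our experiments, so treat this as a baseline rather than a fix.

```python
import random

try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def seed_everything(seed):
    """Seed Python and, if available, PyTorch random number generators."""
    random.seed(seed)
    if torch is not None:
        torch.manual_seed(seed)
        # Safe to call without CUDA; it is silently ignored in that case.
        torch.cuda.manual_seed_all(seed)
        # Ask cuDNN for deterministic kernels and disable autotuning,
        # which may otherwise pick different algorithms between runs.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False


seed_everything(0)
```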

Comparison with MDVC

Comparison between MDVC and Bi-modal Transformer (BMT) on the ActivityNet Captions validation set when captioning ground truth proposals. BMT performs on par while having roughly three times fewer parameters and using only two modalities.

Model   Params (millions)   BLEU@3   BLEU@4   METEOR
MDVC    149                 4.52     1.98     11.07
BMT     51                  4.63     1.99     10.90

Single Video Prediction

Open In Colab

The experience with Colab is not particularly smooth. Thus, we recommend setting up the environment locally.

Start by extracting audio and visual features from your video using the video_features repository. This repo is also included in ./submodules/video_features (commit 4fa02bd5c5b8c34081dcfb609e2bcd5a973eaab2).

Extract I3D features

# run this from the video_features folder:
cd ./submodules/video_features
conda deactivate
conda activate i3d
python \
    --feature_type i3d \
    --on_extraction save_numpy \
    --device_ids 0 \
    --extraction_fps 25 \
    --video_paths ../../sample/women_long_jump.mp4 \
    --output_path ../../sample/

Extract VGGish features (if you get a ValueError, download the VGGish model first; see the instructions in ./submodules/video_features)

conda deactivate
conda activate vggish
python \
    --feature_type vggish \
    --on_extraction save_numpy \
    --device_ids 0 \
    --video_paths ../../sample/women_long_jump.mp4 \
    --output_path ../../sample/

Run the inference

# run this from the BMT main folder:
cd ../../
conda deactivate
conda activate bmt
python ./sample/ \
    --prop_generator_model_path ./sample/ \
    --pretrained_cap_model_path ./sample/ \
    --vggish_features_path ./sample/women_long_jump_vggish.npy \
    --rgb_features_path ./sample/women_long_jump_rgb.npy \
    --flow_features_path ./sample/women_long_jump_flow.npy \
    --duration_in_secs 35.155 \
    --device_id 0 \
    --max_prop_per_vid 100 \
    --nms_tiou_thresh 0.4

Expected output

  {'start': 0.1, 'end': 4.9, 'sentence': 'We see a title screen'},
  {'start': 5.0, 'end': 7.9, 'sentence': 'A large group of people are seen standing around a building'},
  {'start': 0.7, 'end': 11.9, 'sentence': 'A man is seen standing in front of a large crowd'},
  {'start': 19.6, 'end': 33.3, 'sentence': 'The woman runs down a track and jumps into a sand pit'},
  {'start': 7.5, 'end': 10.0, 'sentence': 'A large group of people are seen standing around a building'},
  {'start': 0.6, 'end': 35.1, 'sentence': 'A large group of people are seen running down a track while others watch on the sides'},
  {'start': 8.2, 'end': 13.7, 'sentence': 'A man runs down a track'},
  {'start': 0.1, 'end': 2.0, 'sentence': 'We see a title screen'}

Note that in our research we avoided non-maximum suppression for computational efficiency and to allow the event prediction to be dense. Feel free to play with the --nms_tiou_thresh parameter; for example, set it to 0.4 as in the command above.
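To give an intuition for what --nms_tiou_thresh controls, below is a minimal sketch of greedy temporal non-maximum suppression. It is not the implementation used in this repository. The function names and the confidence scores are hypothetical, and dropping a proposal only when its temporal IoU (tIoU) with a kept proposal is strictly greater than the threshold is an assumption.

```python
def tiou(a, b):
    """Temporal IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(proposals, tiou_thresh):
    """Greedy NMS over (start, end, score) tuples; returns the kept proposals."""
    kept = []
    # Visit proposals from the most to the least confident.
    for prop in sorted(proposals, key=lambda p: p[2], reverse=True):
        # Keep a proposal only if it does not overlap too much with any kept one.
        if all(tiou(prop[:2], k[:2]) <= tiou_thresh for k in kept):
            kept.append(prop)
    return kept


# Segments loosely based on the expected output above; scores are made up.
proposals = [
    (0.1, 4.9, 0.9),
    (0.1, 2.0, 0.8),    # tIoU with the first segment is about 0.396
    (0.2, 4.8, 0.7),    # tIoU with the first segment is about 0.958
    (19.6, 33.3, 0.6),  # no overlap with the others
]
kept = temporal_nms(proposals, tiou_thresh=0.4)
```

Under these assumptions, the third segment is suppressed while the second one survives, because its overlap with the first stays just below 0.4. A lower threshold removes more overlapping proposals, and a higher one keeps the predictions denser.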

The sample video credits: Women's long jump historical World record in 1978

If you encounter the following error:

RuntimeError: Vector for token b'<something>' has <some-number> dimensions, but previously read vectors
have 300 dimensions.

try removing *.txt and * from the hidden folder ./.vector_cache/ and check that you are not running out of disk space (unpacking requires an extra ~8.5G). Then run again.
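A quick way to check the free disk space before re-running is sketched below with the Python standard library. The helper names are hypothetical, and the 8.5 value simply mirrors the approximate figure mentioned above.

```python
import shutil


def free_gib(path="."):
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room_for(path, required_gib):
    """True if at least `required_gib` GiB are free at `path`."""
    return free_gib(path) >= required_gib


# Run from the BMT folder, where ./.vector_cache is created.
if not has_room_for(".", 8.5):
    print(f"Only {free_gib('.'):.1f} GiB free; unpacking may fail.")
```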


Our paper was accepted at BMVC 2020. Please use this BibTeX if you would like to cite our work:

  title={A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer},
  author={Iashin, Vladimir and Rahtu, Esa},
  booktitle={British Machine Vision Conference (BMVC)},

  title = {Multi-Modal Dense Video Captioning},
  author = {Iashin, Vladimir and Rahtu, Esa},
  booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  year = {2020}


Funding for this research was provided by the Academy of Finland projects 327910 & 324346. The authors acknowledge CSC — IT Center for Science, Finland, for computational resources for our experimentation.
