
<img src="https://github.com/v-iashin/v-iashin.github.io/raw/master/images/bmt/bi_modal_transformer.svg" alt="Bi-Modal Transformer with Proposal Generator" width="900">


# Dense Video Captioning with Bi-modal Transformer
[Project Page](https://v-iashin.github.io/bmt) • [ArXiv](https://arxiv.org/abs/2005.08271) • [BMVC Page](https://www.bmvc2020-conference.com/conference/papers/paper_0111.html) • [Presentation](https://www.youtube.com/watch?v=C4zYVIqGDVQ)

This notebook accompanies the [source code](https://github.com/v-iashin/BMT) of the paper: 
_A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer_ (BMVC 2020).

It is designed to run the custom video prediction demo on [Google Colab](https://colab.research.google.com) with a GPU. 
Running the notebook from scratch on the basic Google Colab tier takes around 30 minutes, including downloading the model checkpoints and installing the environments.


## Cloning the Repository

In [1]:
!git clone --recursive https://github.com/v-iashin/BMT.git
%cd BMT/

Cloning into 'BMT'...
remote: Enumerating objects: 149, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 149 (delta 9), reused 10 (delta 3), pack-reused 123
Receiving objects: 100% (149/149), 13.34 MiB | 12.84 MiB/s, done.
Resolving deltas: 100% (56/56), done.
Submodule 'submodules/pycocoevalcap' (https://github.com/salaniz/pycocoevalcap.git) registered for path 'submodules/pycocoevalcap'
Submodule 'submodules/video_features' (https://github.com/v-iashin/video_features.git) registered for path 'submodules/video_features'
Cloning into '/content/BMT/submodules/pycocoevalcap'...
remote: Enumerating objects: 808, done.        
remote: Counting objects: 100% (11/11), done.        
remote: Compressing objects: 100% (10/10), done.        
remote: Total 808 (delta 1), reused 6 (delta 1), pack-reused 797        
Receiving objects: 100% (808/808), 130.05 MiB | 30.33 MiB/s, done.
Resolving deltas: 100% (420/420), done.


## Installing Environments
Since feature extraction and the captioning model rely on different Python environments, we consider `conda` the easiest solution here. 
Unfortunately, Google Colab does not support `conda` natively, so a few workarounds are needed.

First, let's install Miniconda and tweak the Python import context.

In [2]:
!wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh -q --show-progress
!bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): - \ | / done
Solving environment: \ | done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - asn1crypto==1.3.0=py37_0
    - ca-certificates==2020.1.1=0
    - certifi==2019.11.28=py37_0
    - cffi==1.14.0=py37h2e261b9_0
    - chardet==3.0.4=py37_1003
    - conda-package-handling==1.6.0=py37h7b6447c_0
    - conda==4.8.2=py37_0
    - cryptography==2.8=py37h1ba5d50_0
    - idna==2.8=py37_0
    - ld_impl_linux-64==2.33.1=h53a641e_7
    - libedit==3.1.20181209=hc058e9b_0
    - libffi==3.2.1=hd88cf55_4
    - libgcc-ng==9.1.0=hdf63c60_0
    - libstdcxx-ng==9.1.0=hdf63c60_0
    - ncurses==6.2=he6710b0_0
    - openssl==1.1.1d=h7b6447c_4
    - pip==20.0.2=py37_1
    - pycosat==0.6.3=py37h7b6447c_0
    - pycparser==2.19=py37_0
    - pyopenssl==19.1.0=py37_0
    - pysocks==1.7.1=py37_0
    - python==3.7.6=h0371630_2
    - readli

In [3]:
import os
from pathlib import Path
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
from sample.single_video_prediction import get_video_duration
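
The `sys.path.append` call in the cell above is what extends the notebook kernel's import search path with the Miniconda `site-packages` directory. As a minimal sketch (the helper name `add_site_packages` is hypothetical and not part of the BMT repository), the same tweak can be made idempotent so that re-running the cell does not add duplicate entries:

```python
import sys

def add_site_packages(path, search_path=None):
    """Append `path` to the import search path once; return True if it was added."""
    # `search_path` defaults to the live `sys.path`; passing a list is handy for testing.
    search_path = sys.path if search_path is None else search_path
    if path in search_path:
        return False
    search_path.append(path)
    return True

# Path taken from the cell above; adjust it if Miniconda is installed elsewhere.
CONDA_SITE_PACKAGES = '/usr/local/lib/python3.7/site-packages/'
added = add_site_packages(CONDA_SITE_PACKAGES)
print('added' if added else 'already present')
```

Appending (rather than inserting at the front) keeps Colab's own packages ahead of the Miniconda ones in the lookup order, which matches what the original cell does.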

In [4]:
# feature extraction
!conda env create -f ./submodules/video_features/conda_env_i3d.yml
!conda env create -f ./submodules/video_features/conda_env_vggish.yml
# captioning model
!conda env create -f ./conda_env.yml
# spacy language model
!/usr/local/envs/bmt/bin/python -m spacy download en

Collecting package metadata (repodata.json): - \ | / ...

## Downloading Checkpoints

Download the GloVe embeddings and the pre-trained model checkpoints. `torchtext` could download GloVe for us, but it uses a slow server. 
Therefore, we host a copy on our own storage.

In [5]:
!wget https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/bmt/glove.840B.300d.zip -q --show-progress
!wget https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/bmt/best_cap_model.pt -q --show-progress
!wget https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/bmt/best_prop_model.pt -q --show-progress
!wget https://storage.googleapis.com/audioset/vggish_model.ckpt -q --show-progress

!mkdir -p .vector_cache
!mv glove.840B.300d.zip ./.vector_cache/
!mv best_cap_model.pt ./sample/
!mv best_prop_model.pt ./sample/
!mv vggish_model.ckpt ./submodules/video_features/models/vggish/checkpoints/
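
Because the downloads above run quietly (`-q`), a failed or partial download is easy to miss. The following is a minimal sketch of a sanity check, assuming the current directory is still `BMT/`; the helper `find_missing` is a hypothetical name, and the file list simply mirrors the `mv` destinations above:

```python
from pathlib import Path

# Destinations taken from the `mv` commands above, relative to the BMT/ directory.
EXPECTED_FILES = [
    '.vector_cache/glove.840B.300d.zip',
    'sample/best_cap_model.pt',
    'sample/best_prop_model.pt',
    'submodules/video_features/models/vggish/checkpoints/vggish_model.ckpt',
]

def find_missing(paths, root='.'):
    """Return the entries of `paths` that are missing or empty under `root`."""
    missing = []
    for rel in paths:
        candidate = Path(root) / rel
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(rel)
    return missing

missing = find_missing(EXPECTED_FILES)
print('All files in place' if not missing else f'Missing or empty: {missing}')
```

The check only looks at presence and size; it does not verify checksums, so a truncated but non-empty file would still pass.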



## Running Inference on a Custom Video

Upload your video to Google Colab and set its path in the `MY_VIDEO_PATH` variable. 
The cell below prepares all the necessary file paths and calculates the video duration for you.

In [6]:
# upload a video
MY_VIDEO_PATH = '/content/BMT/sample/women_long_jump.mp4'

# Preparing the paths
VIDEO_DURATION = get_video_duration(MY_VIDEO_PATH)

FEATURES_CACHE_PATH = '/content/BMT/tmp/'
FEATURES_PATH_STUB = os.path.join(FEATURES_CACHE_PATH, Path(MY_VIDEO_PATH).stem)
FEATURE_PATH_VGGISH = f'{FEATURES_PATH_STUB}_vggish.npy'
FEATURE_PATH_RGB = f'{FEATURES_PATH_STUB}_rgb.npy'
FEATURE_PATH_FLOW = f'{FEATURES_PATH_STUB}_flow.npy'

PROPOSAL_CKPT = '/content/BMT/sample/best_prop_model.pt'
CAPTIONING_CKPT = '/content/BMT/sample/best_cap_model.pt'

Video Duration: 35.155
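
The feature paths in the cell above all follow the pattern `<cache dir>/<video stem>_<stream>.npy`. As a minimal sketch of that naming scheme (the helper `feature_paths` is a hypothetical name introduced here for illustration), the three paths can be derived for any video in one place:

```python
import os
from pathlib import Path

def feature_paths(video_path, cache_dir):
    """Build the expected feature file path for each stream of one video."""
    # The stem is the file name without its directory and extension.
    stub = os.path.join(cache_dir, Path(video_path).stem)
    return {stream: f'{stub}_{stream}.npy' for stream in ('vggish', 'rgb', 'flow')}

paths = feature_paths('/content/BMT/sample/women_long_jump.mp4', '/content/BMT/tmp/')
for stream, path in paths.items():
    print(f'{stream}: {path}')
```

Keeping the naming in a single function makes it harder for the captioning step to look for a file under a slightly different name than the one the notebook defined.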


Next, extract the visual (I3D) and audio (VGGish) features from the video.

In [7]:
# Extract I3D features (visual)
!cd ./submodules/video_features && /usr/local/envs/i3d/bin/python main.py \
    --feature_type i3d \
    --on_extraction save_numpy \
    --device_ids 0 \
    --extraction_fps 25 \
    --video_paths $MY_VIDEO_PATH \
    --output_path $FEATURES_CACHE_PATH

# Extract VGGish features (audio)
!cd ./submodules/video_features && /usr/local/envs/vggish/bin/python main.py \
    --feature_type vggish \
    --on_extraction save_numpy \
    --device_ids 0 \
    --video_paths $MY_VIDEO_PATH \
    --output_path $FEATURES_CACHE_PATH

Saving features to /content/BMT/tmp/
100% 1/1 [01:24<00:00, 84.85s/it]
Saving features to /content/BMT/tmp/

For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

100% 1/1 [00:13<00:00, 13.70s/it]


Finally, run dense video captioning on the extracted features.

In [8]:
# captioning parameters
MAX_PROP_PER_VIDEO = 100
NMS_TIOU_THRESHOLD = 0.4

# Running single video prediction
!/usr/local/envs/bmt/bin/python ./sample/single_video_prediction.py \
    --prop_generator_model_path $PROPOSAL_CKPT \
    --pretrained_cap_model_path $CAPTIONING_CKPT \
    --vggish_features_path $FEATURE_PATH_VGGISH \
    --rgb_features_path $FEATURE_PATH_RGB \
    --flow_features_path $FEATURE_PATH_FLOW \
    --duration_in_secs $VIDEO_DURATION \
    --device_id 0 \
    --max_prop_per_vid $MAX_PROP_PER_VIDEO \
    --nms_tiou_thresh $NMS_TIOU_THRESHOLD
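
Judging by their names, `MAX_PROP_PER_VIDEO` caps how many event proposals are kept per video and `NMS_TIOU_THRESHOLD` is the temporal IoU threshold used for non-maximum suppression. The sketch below is a generic greedy temporal NMS written for illustration; it is not the repository's implementation, and the `(start, end, score)` tuple layout, the function names, and the example proposals are all assumptions:

```python
def tiou(a, b):
    """Temporal IoU of two segments given as (start, end, ...) tuples."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, tiou_thresh, max_keep):
    """Greedy NMS over (start, end, score) proposals, highest score first."""
    kept = []
    for prop in sorted(proposals, key=lambda p: p[2], reverse=True):
        if len(kept) >= max_keep:
            break
        # Keep a proposal only if it does not overlap too much with any kept one.
        if all(tiou(prop, k) <= tiou_thresh for k in kept):
            kept.append(prop)
    return kept

# Made-up proposals for illustration only.
proposals = [(0.0, 5.0, 0.9), (0.5, 5.0, 0.8), (5.0, 8.0, 0.7)]
print(temporal_nms(proposals, tiou_thresh=0.4, max_keep=100))
```

In this sketch, lowering the threshold suppresses more overlapping proposals (fewer, more distinct events), while raising it lets near-duplicates through.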

Contructing caption_iterator for "train" phase
tcmalloc: large alloc 2635227136 bytes == 0x5636aecea000 @  0x7fc3b863eb6b 0x7fc3b865e379 0x7fc3787826d5 0x7fc37878469a 0x7fc37bc7b163 0x7fc37be30e78 0x7fc37bc7b9f3 0x7fc37bc81e09 0x7fc37bfd48c3 0x7fc3a98908a4 0x56366e5e9ab4 0x56366e5e9bd1 0x56366e65057b 0x56366e595389 0x56366e5e92b7 0x56366e64cb84 0x56366e595389 0x56366e5966ef 0x56366e5b5a73 0x56366e5a7fde 0x56366e64d54d 0x56366e59566a 0x56366e5966ef 0x56366e5b5a73 0x56366e5fd1ba 0x56366e5fdd17 0x56366e5a7fde 0x56366e6a5297 0x56366e5a7fde 0x56366e64d54d 0x56366e595389
100% 2196016/2196017 [04:30<00:00, 8104.37it/s]
Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Pretrained caption path: 
 /content/BMT/sample/best_cap_model.pt
[{'start': 0.1, 'end': 4.9, 'sentence': 'We see a title screen'}, {'start': 5.0, 'end': 7.9, 'sentence': 'A large group of people are seen standing around a building'}, {'start': 0.7, 'end': 11.9, 'sentence': 'A man is seen s
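
The prediction is printed as a list of dictionaries with `start`, `end`, and `sentence` keys, and the segments are not necessarily in chronological order. As a minimal sketch, assuming the printed list has been copied into a Python variable (the helper `format_captions` is a hypothetical name, and only the first two entries of the output above are reused here), the captions can be sorted and rendered one per line:

```python
def format_captions(predictions):
    """Return one 'start - end  sentence' line per prediction, ordered by start time."""
    ordered = sorted(predictions, key=lambda p: (p['start'], p['end']))
    return [f"{p['start']:6.1f}s - {p['end']:6.1f}s  {p['sentence']}" for p in ordered]

# First two entries copied from the output above, deliberately out of order.
predictions = [
    {'start': 5.0, 'end': 7.9,
     'sentence': 'A large group of people are seen standing around a building'},
    {'start': 0.1, 'end': 4.9, 'sentence': 'We see a title screen'},
]

for line in format_captions(predictions):
    print(line)
```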