
<img src="https://github.com/v-iashin/v-iashin.github.io/raw/master/images/bmt/bi_modal_transformer.svg" alt="Bi-Modal Transformer with Proposal Generator" width="900">


# Dense Video Captioning with Bi-modal Transformer
[Project Page](https://v-iashin.github.io/bmt) • [ArXiv](https://arxiv.org/abs/2005.08271) • [BMVC Page](https://www.bmvc2020-conference.com/conference/papers/paper_0111.html) • [Presentation](https://www.youtube.com/watch?v=C4zYVIqGDVQ)

This notebook accompanies the [source code](https://github.com/v-iashin/BMT) of the paper:
_A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer_ (BMVC 2020).

It is designed to run the custom video prediction demo on [Google Colab](https://colab.research.google.com) with a GPU.
On the basic Google Colab version, running the notebook from scratch takes around 30 minutes, including downloading the model checkpoints and installing the environments.


## Cloning the Repository

In [1]:
!git clone --recursive https://github.com/v-iashin/BMT.git
%cd BMT/

Cloning into 'BMT'...
remote: Enumerating objects: 182, done.
remote: Counting objects: 100% (59/59), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 182 (delta 42), reused 40 (delta 37), pack-reused 123
Receiving objects: 100% (182/182), 12.86 MiB | 14.33 MiB/s, done.
Resolving deltas: 100% (77/77), done.
Submodule 'submodules/pycocoevalcap' (https://github.com/salaniz/pycocoevalcap.git) registered for path 'submodules/pycocoevalcap'
Submodule 'submodules/video_features' (https://github.com/v-iashin/video_features.git) registered for path 'submodules/video_features'
Cloning into '/content/BMT/submodules/pycocoevalcap'...
remote: Enumerating objects: 821, done.        
remote: Counting objects: 100% (24/24), done.        
remote: Compressing objects: 100% (20/20), done.        
remote: Total 821 (delta 5), reused 19 (delta 4), pack-reused 797        
Receiving objects: 100% (821/821), 130.06 MiB | 24.53 MiB/s, done.
Resolving deltas: 100% (424/424), don

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Installing Environments
Feature extraction and the captioning model rely on different Python environments, so we use `conda` as the easiest way to manage them.
Unfortunately, Google Colab does not support `conda` natively,
so we need a few workarounds.

First, let's install Miniconda and adjust the Python import path:

In [3]:
!wget https://repo.anaconda.com/miniconda/Miniconda3-py37_23.1.0-1-Linux-x86_64.sh -q --show-progress
!bash ./Miniconda3-py37_23.1.0-1-Linux-x86_64.sh -b -f -p /usr/local

PREFIX=/usr/local
Unpacking payload ...
Installing base environment...

Downloading and Extracting Packages

Preparing transaction: done
Executing transaction: done
installation finished.
    You currently have a PYTHONPATH environment variable set. This may cause
    unexpected behavior when running the Python interpreter in Miniconda3.
    For best results, please verify that your PYTHONPATH only points to
    directories of packages that are compatible with the Python interpreter
    in Miniconda3: /usr/local


In [4]:
#@title
# Work around Colab environment issues: force UTF-8 as the preferred encoding,
# pin cryptography, and upgrade pyOpenSSL
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install cryptography==38.0.4
!pip install pyopenssl --upgrade

Collecting pyopenssl
  Downloading pyOpenSSL-23.2.0-py3-none-any.whl (59 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.0/59.0 kB 5.1 MB/s eta 0:00:00
Installing collected packages: pyopenssl
  Attempting uninstall: pyopenssl
    Found existing installation: pyOpenSSL 22.0.0
    Uninstalling pyOpenSSL-22.0.0:
      Successfully uninstalled pyOpenSSL-22.0.0
Successfully installed pyopenssl-23.2.0

In [5]:
import os
from pathlib import Path
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
from sample.single_video_prediction import get_video_duration

In [6]:
# `!export` only affects a throwaway subshell; `%env` keeps the setting for the `!` commands below
%env PIP_DEFAULT_TIMEOUT=100
# feature extraction
!conda env create -f ./submodules/video_features/conda_env_i3d.yml
!conda env create -f ./submodules/video_features/conda_env_vggish.yml
# captioning model
!conda env create -f ./conda_env.yml
# spacy language model
!/usr/local/envs/bmt/bin/python -m spacy download en

Streaming output truncated to the last 5000 lines.





cudnn-7.6.5          | 226.4 MB  | :  81% 0.8051068383834226/1 [00:14<00:03, 17.82s/it]
cudatoolkit-10.0.130 | 380.0 MB  | :  48% 0.4843322661814385/1 [00:14<00:15, 31.00s/it]
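
Environment creation takes a while and its output is truncated, so a failed installation is easy to miss. As a minimal sketch (the helper name `missing_env_interpreters` is hypothetical and not part of the repository), the following checks that each environment has an interpreter under `/usr/local/envs`, which is where the other cells of this notebook look for them:

```python
from pathlib import Path


def missing_env_interpreters(env_names, envs_root="/usr/local/envs"):
    """Return the names of conda environments that lack a bin/python file."""
    root = Path(envs_root)
    return [
        name for name in env_names
        if not (root / name / "bin" / "python").is_file()
    ]


# Environment names taken from the interpreter paths used in this notebook.
missing = missing_env_interpreters(["i3d", "vggish", "bmt"])
print("Missing environments:", missing or "none")
```

If any name is reported as missing, re-running the corresponding `conda env create` command is probably the first thing to try.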

## Downloading Checkpoints

Download the GloVe embeddings, the pre-trained captioning and proposal generator checkpoints, and the VGGish model checkpoint.
`torchtext` could download GloVe for us, but it uses a slow server, so we host a copy ourselves.

In [7]:
!wget https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/bmt/glove.840B.300d.zip -q --show-progress
!wget https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/bmt/best_cap_model.pt -q --show-progress
!wget https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/bmt/best_prop_model.pt -q --show-progress
!wget https://storage.googleapis.com/audioset/vggish_model.ckpt -q --show-progress

!mkdir -p .vector_cache
!mv glove.840B.300d.zip ./.vector_cache/
!mv best_cap_model.pt ./sample/
!mv best_prop_model.pt ./sample/
!mv vggish_model.ckpt ./submodules/video_features/models/vggish/checkpoints/
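
Because `wget` runs with `-q`, a failed download produces little feedback. The sketch below, which assumes it is run from the `BMT/` directory, checks that the four files moved above exist and are non-empty. The helper `find_missing_or_empty` is a hypothetical name introduced for illustration only:

```python
from pathlib import Path

# Target locations of the files moved in the cell above (relative to BMT/).
EXPECTED_FILES = [
    ".vector_cache/glove.840B.300d.zip",
    "sample/best_cap_model.pt",
    "sample/best_prop_model.pt",
    "submodules/video_features/models/vggish/checkpoints/vggish_model.ckpt",
]


def find_missing_or_empty(paths, root="."):
    """Return the paths (relative to root) that are absent or have zero size."""
    root = Path(root)
    problems = []
    for rel in paths:
        candidate = root / rel
        if not candidate.is_file() or candidate.stat().st_size == 0:
            problems.append(rel)
    return problems


print("Missing or empty files:", find_missing_or_empty(EXPECTED_FILES) or "none")
```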



## Running Inference on a Custom Video

Upload your video to Google Colab and set the `MY_VIDEO_PATH` variable to its path.
The cell below prepares all the necessary file paths and calculates the video duration for you.

In [8]:
# Path to your uploaded video (by default, a sample from the repository's sample/ folder)
MY_VIDEO_PATH = '/content/BMT/sample/women_long_jump.mp4'

# Preparing the paths
VIDEO_DURATION = get_video_duration(MY_VIDEO_PATH)

FEATURES_CACHE_PATH = '/content/BMT/tmp/'
FEATURES_PATH_STUB = os.path.join(FEATURES_CACHE_PATH, Path(MY_VIDEO_PATH).stem)
FEATURE_PATH_VGGISH = f'{FEATURES_PATH_STUB}_vggish.npy'
FEATURE_PATH_RGB = f'{FEATURES_PATH_STUB}_rgb.npy'
FEATURE_PATH_FLOW = f'{FEATURES_PATH_STUB}_flow.npy'

PROPOSAL_CKPT = '/content/BMT/sample/best_prop_model.pt'
CAPTIONING_CKPT = '/content/BMT/sample/best_cap_model.pt'

Video Duration: 35.155
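
To make the naming scheme explicit, here is a small sketch of how the three feature paths are derived from the video path. It mirrors the string operations in the cell above; the function name `feature_paths` is hypothetical and used only for illustration:

```python
import os
from pathlib import Path


def feature_paths(video_path, cache_dir):
    """Build the per-modality .npy paths from the video file name (without extension)."""
    stub = os.path.join(cache_dir, Path(video_path).stem)
    return {
        "vggish": f"{stub}_vggish.npy",
        "rgb": f"{stub}_rgb.npy",
        "flow": f"{stub}_flow.npy",
    }


paths = feature_paths("/content/BMT/sample/women_long_jump.mp4", "/content/BMT/tmp/")
print(paths["vggish"])  # /content/BMT/tmp/women_long_jump_vggish.npy
```

Only the file stem is used, so two videos with the same name in different folders would map to the same feature files.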


The next cell extracts the visual (I3D) and audio (VGGish) features from the video.

In [9]:
# Extract I3D features (visual)
!cd ./submodules/video_features && /usr/local/envs/i3d/bin/python main.py \
    --feature_type i3d \
    --on_extraction save_numpy \
    --device_ids 0 \
    --extraction_fps 25 \
    --video_paths $MY_VIDEO_PATH \
    --output_path $FEATURES_CACHE_PATH

# Extract VGGish features (audio)
!cd ./submodules/video_features && /usr/local/envs/vggish/bin/python main.py \
    --feature_type vggish \
    --on_extraction save_numpy \
    --device_ids 0 \
    --video_paths $MY_VIDEO_PATH \
    --output_path $FEATURES_CACHE_PATH

Saving features to /content/BMT/tmp/
100% 1/1 [00:46<00:00, 46.98s/it]
Saving features to /content/BMT/tmp/

For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

100% 1/1 [00:13<00:00, 13.66s/it]
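
Before running the captioning step, it can be useful to confirm that the feature files were written and to look at their shapes. The following is a minimal sketch, assuming the default sample video and cache folder used above; `describe_features` is a hypothetical helper, and no particular shape is implied here:

```python
import os

import numpy as np

# Paths as constructed earlier in this notebook for the sample video.
FEATURE_PATHS = {
    "vggish": "/content/BMT/tmp/women_long_jump_vggish.npy",
    "rgb": "/content/BMT/tmp/women_long_jump_rgb.npy",
    "flow": "/content/BMT/tmp/women_long_jump_flow.npy",
}


def describe_features(paths):
    """Map each feature name to the (shape, dtype) of the array stored at its path."""
    summary = {}
    for name, path in paths.items():
        array = np.load(path)
        summary[name] = (array.shape, str(array.dtype))
    return summary


# Only inspect files that actually exist, so the cell does not fail on a missing one.
existing = {name: path for name, path in FEATURE_PATHS.items() if os.path.exists(path)}
for name, (shape, dtype) in describe_features(existing).items():
    print(f"{name}: shape={shape}, dtype={dtype}")
print("Not found:", sorted(set(FEATURE_PATHS) - set(existing)) or "none")
```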


Finally, run dense video captioning on the extracted features.

In [None]:
# captioning parameters
MAX_PROP_PER_VIDEO = 100
NMS_TIOU_THRESHOLD = 0.4

# Running single video prediction
!/usr/local/envs/bmt/bin/python ./sample/single_video_prediction.py \
    --prop_generator_model_path $PROPOSAL_CKPT \
    --pretrained_cap_model_path $CAPTIONING_CKPT \
    --vggish_features_path $FEATURE_PATH_VGGISH \
    --rgb_features_path $FEATURE_PATH_RGB \
    --flow_features_path $FEATURE_PATH_FLOW \
    --duration_in_secs $VIDEO_DURATION \
    --device_id 0 \
    --max_prop_per_vid $MAX_PROP_PER_VIDEO \
    --nms_tiou_thresh $NMS_TIOU_THRESHOLD

Contructing caption_iterator for "train" phase
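
For intuition about `NMS_TIOU_THRESHOLD` and `MAX_PROP_PER_VIDEO`, here is an illustrative sketch of greedy non-maximum suppression over temporal proposals using temporal IoU (tIoU). It is not taken from the repository, and the order in which the actual script applies the proposal limit and the suppression is an assumption here; the function names and the example proposals are hypothetical:

```python
def tiou(a, b):
    """Temporal IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(proposals, tiou_threshold, max_proposals):
    """Greedily keep the highest-scoring (start, end, score) proposals.

    A proposal is dropped if its tIoU with an already kept one exceeds the threshold.
    """
    kept = []
    for start, end, score in sorted(proposals, key=lambda p: p[2], reverse=True):
        if len(kept) >= max_proposals:
            break
        if all(tiou((start, end), (k[0], k[1])) <= tiou_threshold for k in kept):
            kept.append((start, end, score))
    return kept


# Made-up proposals: the second one overlaps the first almost entirely.
proposals = [(0.0, 10.0, 0.9), (1.0, 10.0, 0.8), (20.0, 30.0, 0.7)]
kept = temporal_nms(proposals, tiou_threshold=0.4, max_proposals=100)
print(kept)  # [(0.0, 10.0, 0.9), (20.0, 30.0, 0.7)]
```

In this sketch, a lower threshold suppresses more overlapping segments and therefore yields fewer, more distinct events; a higher one keeps more near-duplicates.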
