COST - Collaborative Three-Stream Transformers for Video Captioning

2023-08-04 (v0.1) This repository is the official implementation of COST (Collaborative Three-Stream Transformers for Video Captioning), which was recently accepted by CVIU.

Collaborative Three-Stream Transformers for Video Captioning
Hao Wang, Libo Zhang, Heng Fan, Tiejian Luo
CVIU


Get Started

Requirements

Clone this repository and install the dependencies. We have tested our code with python=3.8.5, torch=1.12.1 and cuda=11.3.1. A suitable conda environment named cost can be created and activated with the following commands.

git clone https://github.com/wanghao14/COST.git
cd COST
conda create -n cost python=3.8.5
conda activate cost
pip install -r requirements.txt
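
To quickly check that the installed environment matches the tested versions, an optional one-liner (not part of the original instructions):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # expect 1.12.1 and True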

Note: The METEOR metric requires Java. You can install it with conda via conda install openjdk. Make sure your locale is set correctly, i.e., echo $LANG outputs en_US.UTF-8.
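
If the locale is not already set, exporting it in the current shell is usually enough; this is a minimal sketch and the exact steps depend on your system:

conda install openjdk        # Java runtime needed by the METEOR metric
export LANG=en_US.UTF-8      # echo $LANG should now print en_US.UTF-8
export LC_ALL=en_US.UTF-8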

Prepare video features

Download the appearance and motion features of YouCookII from Google Drive: rt_yc2_feat.tar.gz (12 GB), which are repacked from the features provided by densecap, and the detection features extracted by us: yc2_detect_feat.tar.gz (34.7 GB). Extract the former so that the files can be found at data/mart_video_feature/youcook2/*.npy under this repository, and the latter under data/yc2_detect_feature/, e.g. data/yc2_detect_feature/training_aggre/*.npz for the training split. Alternatively, you can specify the paths for reading video features in dataset.py.
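
For reference, extraction could look like the following; the top-level directory inside each archive may differ, so adjust the target paths if the files do not land where described above:

mkdir -p data/mart_video_feature/youcook2 data/yc2_detect_feature
tar -xzf rt_yc2_feat.tar.gz -C data/mart_video_feature/youcook2   # appearance/motion features (*.npy)
tar -xzf yc2_detect_feat.tar.gz -C data/yc2_detect_feature        # detection features (*.npz)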

Train and validate

All hyper-parameters for our experiments can be modified in the config file; configs/yc2_non_recurrent.yaml is used by default in the current version.

# Train COST on YouCookII 
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py
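
If only one GPU is available, the same launcher should work with a single process; this variant is a sketch and has not been verified:

# Train COST on a single GPU (untested variant of the command above)
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes=1 --nproc_per_node=1 train.py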

To validate with the provided checkpoints, just modify two values in configs/yc2_non_recurrent.yaml:

validate: true
exp:
  load_model: "${PATH_TO_CHECKPOINT}"

and run the same command as for training. You can download our pretrained model from Google Drive.
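
That is, with validate set to true and load_model pointing at the downloaded checkpoint, validation is launched with the same torchrun command:

# Validate COST on YouCookII (runs evaluation because validate: true)
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py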

Model performance

  • Quantitative results:
# Output:
# B@4, M, C and R@4 denote BLEU@4, METEOR, CIDEr-D and Repetition@4, respectively. The results in the first five rows are evaluated at the paragraph level, while the last row is evaluated at the micro level.
# experiment       |          B@4|            M|            C|          R@4|  
# -----------------|-------------|-------------|-------------|-------------|
# yc2(val)_TSN     |         9.47|        17.67|        45.54|         4.04|
# yc2(val)_COOT    |        11.56|        19.67|        60.78|         6.63|
# anet(val)_TSN    |        11.22|        16.58|        25.70|         7.09|
# anet(test)_TSN   |        11.14|        15.91|        24.77|         5.86|
# anet(test)_COOT  |        11.88|        15.70|        29.64|         6.11|
# msvd(test)       |         56.8|         37.2|         99.2|         74.3|
  • Qualitative results:
[Qualitative results figure]

TODO

I have been a little busy recently; the following items will be pushed forward in my free time.

  • Release the initial version, which supports multi-GPU training and inference on YouCookII
  • Release pre-trained models and support training with COOT features as input
  • Release detection features and pre-trained models, and support training for ActivityNet-Captions
  • Provide instructions for evaluation on Internet videos

Acknowledgment

We would like to thank the authors of MART and COOT for sharing their code.
