COST - Collaborative Three-Stream Transformers for Video Captioning

2023-08-04 (v0.1) This repository is the official implementation of COST (Collaborative Three-Stream Transformers for Video Captioning), which was recently accepted by CVIU.

Collaborative Three-Stream Transformers for Video Captioning
Hao Wang, Libo Zhang, Heng Fan, Tiejian Luo
CVIU


Get Started

Requirements

Clone this repository and install the dependencies. We have tested our code with python=3.8.5, torch=1.12.1 and cuda=11.3.1. A suitable conda environment named cost can be created and activated with the following commands.

git clone https://github.com/wanghao14/COST.git
cd COST
conda create -n cost python=3.8.5
conda activate cost
pip install -r requirements.txt
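
To quickly check that the installed environment matches the tested versions, an optional one-liner (not part of the original instructions):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # expect 1.12.1 and True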

Note: The METEOR metric requires Java. You can install it with conda via conda install openjdk. Make sure your locale is set correctly, i.e., echo $LANG outputs en_US.UTF-8.
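
If the locale is not already set, exporting it in the current shell is usually enough; this is a minimal sketch and the exact steps depend on your system:

conda install openjdk        # Java runtime needed by the METEOR metric
export LANG=en_US.UTF-8      # echo $LANG should now print en_US.UTF-8
export LC_ALL=en_US.UTF-8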

Prepare video features

Download the appearance and motion features of YouCookII from Google Drive: rt_yc2_feat.tar.gz (12 GB), which are repacked from the features provided by densecap, and the detection features extracted by us: yc2_detect_feat.tar.gz (34.7 GB). Extract the former so that the files can be found at data/mart_video_feature/youcook2/*.npy under this repository, and the latter under data/yc2_detect_feature/, e.g. data/yc2_detect_feature/training_aggre/*.npz for the training split. Alternatively, you can specify the paths for reading video features in dataset.py.
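
For reference, extraction could look like the following; the top-level directory inside each archive may differ, so adjust the target paths if the files do not land where described above:

mkdir -p data/mart_video_feature/youcook2 data/yc2_detect_feature
tar -xzf rt_yc2_feat.tar.gz -C data/mart_video_feature/youcook2   # appearance/motion features (*.npy)
tar -xzf yc2_detect_feat.tar.gz -C data/yc2_detect_feature        # detection features (*.npz)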

Train and validate

All hyper-parameters for our experiments can be modified in the config file; configs/yc2_non_recurrent.yaml is used by default in the current version.

# Train COST on YouCookII 
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py
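
If only one GPU is available, the same launcher should work with a single process; this variant is a sketch and has not been verified:

# Train COST on a single GPU (untested variant of the command above)
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes=1 --nproc_per_node=1 train.py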

To validate with the provided checkpoints, just modify two values in configs/yc2_non_recurrent.yaml:

validate: true
exp:
  load_model: "${PATH_TO_CHECKPOINT}"

and run the same command as for training. You can download our pretrained model from Google Drive.
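
That is, with validate set to true and load_model pointing at the downloaded checkpoint, validation is launched with the same torchrun command:

# Validate COST on YouCookII (runs evaluation because validate: true)
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py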

Model performance

  • Quantitative results:
# Output:
# B@4, M, C and R@4 denote BLEU@4, METEOR, CIDEr-D and Repetition@4, respectively. The results in the first five rows are evaluated at the paragraph level, while the last row is evaluated at the micro level.
# experiment       |          B@4|            M|            C|          R@4|  
# -----------------|-------------|-------------|-------------|-------------|
# yc2(val)_TSN     |         9.47|        17.67|        45.54|         4.04|
# yc2(val)_COOT    |        11.56|        19.67|        60.78|         6.63|
# anet(val)_TSN    |        11.22|        16.58|        25.70|         7.09|
# anet(test)_TSN   |        11.14|        15.91|        24.77|         5.86|
# anet(test)_COOT  |        11.88|        15.70|        29.64|         6.11|
# msvd(test)       |         56.8|         37.2|         99.2|         74.3|
  • Qualitative results:
[Qualitative results figure]

TODO

I have been a little busy recently; the following items will be pushed forward in my free time.

  • Release the initial version, which supports multi-GPU training and inference on YouCookII
  • Release pre-trained models and support training with COOT features as input
  • Release detection features and pre-trained models, and support training for ActivityNet-Captions
  • Provide instructions for evaluation on Internet videos

Acknowledgment

We would like to thank the authors of MART and COOT for sharing their code.
