PyTorch implementation of our ACL'23 paper:
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
Bang Yang, Fenglin Liu, Xian Wu, Yaowei Wang, Xu Sun, and Yuexian Zou
[2023-10-22] Update links for downloading raw images and videos in README_DATA.md.
[2023-08-28] We release the code and data.
We run the code with Python 3.8.8, torch 1.13.1, and CUDA 11.7. Please change the versions of torch and CUDA according to your hardware.
git clone https://github.com/yangbang18/MultiCapCLIP.git
cd MultiCapCLIP
conda create -n zerovc python==3.8.8
conda activate zerovc
# Install a proper version of torch, e.g.:
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1+cu117 -f https://download.pytorch.org/whl/cu117/torch_stable.html
pip install -r requirement.txt
Please prepare the data following README_DATA.md.
All experiments can be conducted with scripts/pipe.sh.
train_dataset="msrvtt"
method="baseline"
tasks="adapt adapt_zeroshot"
val_datasets="coco msrvtt"
bash scripts/pipe.sh $train_dataset $method $tasks $val_datasets
In the above command, we make the baseline model auto-encode MSR-VTT's training captions first (`adapt`), and then evaluate it on MS-COCO (out-of-domain) and MSR-VTT (in-domain) (`adapt_zeroshot`).
Key Arguments:
- `train_dataset`: one of [`coco`, `msrvtt`, `vatex`, `flickr30k`].
- `method`: defined in `configs/methods.yaml`.
- `tasks`: can be combinations of:
  - `finetune`: fully-supervised training, where a model is trained on 100% of training captions.
  - `finetune_fewshot`: fully-supervised training, where a model is trained on 0.01% (if applicable), 0.1%, 1%, and 10% of training captions.
  - `adapt`: text-only auto-encoding. This task is only responsible for training.
  - `adapt_zeroshot`: evaluate the model that has done `adapt` on a specific dataset.
  - `adapt_fewshot`: use the model that has done `adapt` as a starting point and train it on 0.01% (if applicable), 0.1%, 1%, and 10% of training captions. This task is equivalent to semi-supervised learning in the paper.
- `val_datasets`: can be combinations of [`coco`, `msrvtt`, `vatex`, `flickr30k`].
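For illustration, the argument space above can be mirrored by a small validation helper. This is a hypothetical sketch: the function and variable names are ours, and the sets simply restate the documented values; none of this is code from the repo.

```python
# Hypothetical sanity check for pipe.sh-style arguments; the sets below
# mirror the README's documented values, not actual repo code.
DATASETS = {"coco", "msrvtt", "vatex", "flickr30k"}
TASKS = {"finetune", "finetune_fewshot", "adapt", "adapt_zeroshot", "adapt_fewshot"}

def validate_args(train_dataset: str, tasks: str, val_datasets: str) -> bool:
    """Return True if every argument uses a documented value."""
    if train_dataset not in DATASETS:
        return False
    task_list = tasks.split()
    if not task_list or any(t not in TASKS for t in task_list):
        return False
    return all(d in DATASETS for d in val_datasets.split())

print(validate_args("msrvtt", "adapt adapt_zeroshot", "coco msrvtt"))  # True
print(validate_args("imagenet", "adapt", "coco"))                      # False
```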
For the `msrvtt` and `vatex` datasets, `X` denotes Chinese (`zh`). For the `flickr30k` dataset, `X` can be German (`de`) or French (`fr`).
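The dataset-to-language mapping above can be summarized as follows (illustrative only; the dict name is our own, not an identifier from the repo):

```python
# Non-English target languages (X) supported by each dataset,
# per the mapping described above.
TARGET_LANGS = {
    "msrvtt": ["zh"],
    "vatex": ["zh"],
    "flickr30k": ["de", "fr"],
}
print(TARGET_LANGS["flickr30k"])  # ['de', 'fr']
```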
train_dataset="flickr30k"
method="baseline#de-de"
tasks="adapt adapt_zeroshot"
val_datasets="flickr30k"
bash scripts/pipe.sh $train_dataset $method $tasks $val_datasets
Different from the monolingual command, we append a postfix `#A-B` to the method:
- `#` activates the multilingual mode, where we use `bert-base-multilingual-cased`'s vocab rather than that of `bert-base-uncased` to embed tokens. See `configs` for more details.
- `A` denotes which language(s) to generate during training. For example, when we train the model on English-German pairs, we can set `A` to `de` (the model only uses German texts as targets) or `en,de` (the model uses both English and German texts as targets).
- `B` denotes which language to generate during evaluation. For example, `#A-de` means we require generating German texts during evaluation.
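To make the convention concrete, here is a minimal parser for a `method#A-B` string. It is our own sketch of the naming scheme described above, not code from the repo.

```python
def parse_method(method: str):
    """Split a method string like 'baseline#en,de-de' into
    (base_method, train_langs, eval_lang). Without '#', the run
    is monolingual and both language fields are None."""
    if "#" not in method:
        return method, None, None
    base, postfix = method.split("#", 1)
    a, b = postfix.split("-", 1)  # A-B: training language(s) - evaluation language
    return base, a.split(","), b

print(parse_method("baseline#en,de-de"))  # ('baseline', ['en', 'de'], 'de')
print(parse_method("baseline"))           # ('baseline', None, None)
```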
You can run the following commands to gather results; mean metric scores and their standard deviations across a number of runs are reported.
python show.py --root output/finetune --csv_path results/ --csv_fn finetune.csv
python show.py --root output/adapt --csv_path results/ --csv_fn adapt.csv
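For reference, the mean-and-standard-deviation aggregation that show.py performs can be sketched as below. This is a minimal illustration with made-up scores, not the script itself.

```python
import statistics

def summarize(scores):
    """Aggregate one metric across runs into 'mean (std)' form."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return f"{mean:.1f} ({std:.1f})"

# e.g. a metric from three runs with different seeds (made-up numbers)
print(summarize([54.2, 55.0, 53.8]))  # 54.3 (0.6)
```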
Main
bash scripts/pipe.sh coco baseline "finetune"
bash scripts/pipe.sh msrvtt baseline "finetune"
bash scripts/pipe.sh vatex baseline#zh-zh "finetune"
bash scripts/pipe.sh coco baseline "adapt adapt_zeroshot" "coco msrvtt"
bash scripts/pipe.sh msrvtt baseline "adapt adapt_zeroshot" "coco msrvtt"
bash scripts/pipe.sh msrvtt baseline#zh-zh "adapt adapt_zeroshot" "vatex"
bash scripts/pipe.sh vatex baseline#zh-zh "adapt adapt_zeroshot" "vatex"
bash scripts/pipe.sh coco MultiCapCLIP_001 "adapt adapt_zeroshot" "msrvtt"
bash scripts/pipe.sh coco MultiCapCLIP_01 "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh msrvtt MultiCapCLIP_001 "adapt adapt_zeroshot" "coco msrvtt"
bash scripts/pipe.sh msrvtt MultiCapCLIP_001#zh-zh "adapt adapt_zeroshot" "vatex"
bash scripts/pipe.sh vatex MultiCapCLIP_001#zh-zh "adapt adapt_zeroshot" "vatex"
Semi-Supervised Training
bash scripts/pipe.sh coco baseline "finetune_fewshot"
bash scripts/pipe.sh msrvtt baseline "finetune_fewshot"
bash scripts/pipe.sh msrvtt MultiCapCLIP_001 "adapt adapt_zeroshot adapt_fewshot" "coco msrvtt"
bash scripts/pipe.sh coco MultiCapCLIP_01 "adapt adapt_zeroshot adapt_fewshot" "coco"
bash scripts/pipe.sh coco MultiCapCLIP_001 "adapt adapt_zeroshot adapt_fewshot" "msrvtt"
Ablation Study
bash scripts/pipe.sh msrvtt baseline "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh msrvtt base_CP "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh msrvtt base_IA "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh msrvtt base_FA_001 "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh msrvtt base_IA_FA_001 "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh msrvtt MultiCapCLIP_001_K4 "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh msrvtt MultiCapCLIP_001_K8 "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh msrvtt MultiCapCLIP_001 "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh msrvtt MultiCapCLIP_001_K32 "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh msrvtt MultiCapCLIP_001_V "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh msrvtt MultiCapCLIP_001_NV "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh coco baseline "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh coco base_CP "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh coco base_IA "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh coco base_FA_01 "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh coco base_IA_FA_01 "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh coco MultiCapCLIP_01_K4 "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh coco MultiCapCLIP_01_K8 "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh coco MultiCapCLIP_01 "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh coco MultiCapCLIP_01_K32 "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh coco MultiCapCLIP_01_V "adapt adapt_zeroshot" "coco"
bash scripts/pipe.sh coco MultiCapCLIP_01_NV "adapt adapt_zeroshot" "coco"
Extensions to German and French
bash scripts/pipe.sh flickr30k baseline#de-de "finetune"
bash scripts/pipe.sh flickr30k baseline#de-de "adapt adapt_zeroshot"
bash scripts/pipe.sh flickr30k MultiCapCLIP_001#de-de "adapt adapt_zeroshot"
bash scripts/pipe.sh flickr30k baseline#fr-fr "finetune"
bash scripts/pipe.sh flickr30k baseline#fr-fr "adapt adapt_zeroshot"
bash scripts/pipe.sh flickr30k MultiCapCLIP_001#fr-fr "adapt adapt_zeroshot"
t-SNE Visualization: See notebooks/zero_shot_tsne.ipynb for an example.
Please [★star] this repo and [cite] the following papers if you find our code or data useful for your research:
@inproceedings{yang-etal-2023-multicapclip,
title = "{M}ulti{C}ap{CLIP}: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning",
author = "Yang, Bang and Liu, Fenglin and Wu, Xian and Wang, Yaowei and Sun, Xu and Zou, Yuexian",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2023",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.664",
doi = "10.18653/v1/2023.acl-long.664",
pages = "11908--11922",
}
@article{Yang2023ZeroNLG,
title={{Z}ero{NLG}: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation},
author={Yang, Bang and Liu, Fenglin and Zou, Yuexian and Wu, Xian and Wang, Yaowei and Clifton, David A.},
journal={arXiv preprint arXiv:2303.06458},
year={2023}
}