PyTorch implementation of our ACL'23 paper:

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

Bang Yang, Fenglin Liu, Xian Wu, Yaowei Wang, Xu Sun, and Yuexian Zou

ACL Anthology, arXiv


Update Notes

[2023-10-22] Update links for downloading raw images and videos in

[2023-08-28] We release the code and data.


We run the code with Python 3.8.8, torch 1.13.1, and CUDA 11.7. Please change the torch and CUDA versions to match your hardware.

git clone
cd MultiCapCLIP

conda create -n zerovc python==3.8.8
conda activate zerovc

# Install a proper version of torch, e.g.:
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1+cu117  -f

pip install -r requirement.txt
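
After installation, it can help to confirm which torch and CUDA versions the environment actually picked up. The snippet below is a minimal sketch, not part of this repository; `describe_environment` is a hypothetical helper name, and it relies only on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
import importlib.util
import sys


def describe_environment():
    """Collect the Python, torch, and CUDA versions visible to this interpreter."""
    info = {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "torch": None,
        "cuda": None,
        "cuda_available": False,
    }
    # Import torch only if it is installed, so the check itself never crashes.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda  # None for CPU-only builds
        info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for name, value in describe_environment().items():
        print(f"{name}: {value}")
```

If `cuda_available` is False on a GPU machine, the installed torch build most likely does not match the local CUDA driver.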


Please prepare the data following

Quick Start

All experiments can be conducted with scripts/

Monolingual Scenario (i.e., English -> English)

train_dataset="msrvtt"
method="baseline"
tasks="adapt adapt_zeroshot"
val_datasets="coco msrvtt"

bash scripts/ $train_dataset $method "$tasks" "$val_datasets"

In the above command, the baseline model first auto-encodes MSR-VTT's training captions (adapt) and is then evaluated on MS-COCO (out-of-domain) and MSR-VTT (in-domain) (adapt_zeroshot).

Key Arguments:

  • train_dataset supports one of [coco, msrvtt, vatex, flickr30k].
  • method is defined in configs/methods.yaml.
  • tasks can be combinations of:
    • finetune: fully-supervised training, where a model will be trained on 100% training captions.
    • finetune_fewshot: fully-supervised training, where a model will be trained on 0.01% (if applicable), 0.1%, 1%, 10% training captions.
    • adapt: text-only autoencoding. This task is only responsible for training.
    • adapt_zeroshot: evaluate, on a specific dataset, the model that has completed adapt.
    • adapt_fewshot: start from the model that has completed adapt and train it on 0.01% (if applicable), 0.1%, 1%, 10% training captions. This task is equivalent to semi-supervised learning in the paper.
  • val_datasets can be combinations of [coco, msrvtt, vatex, flickr30k].
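
To make the few-shot fractions above concrete, here is a minimal sketch of how caption subsets of 0.01%, 0.1%, 1%, and 10% could be drawn. It is not the repository's sampling code: `fewshot_subsets`, `min_samples`, and `seed` are hypothetical names, and reading "if applicable" as "the subset would contain at least one caption" is an assumption.

```python
import random

# Fractions listed for the *_fewshot tasks: 0.01%, 0.1%, 1%, 10%.
FEWSHOT_FRACTIONS = (0.0001, 0.001, 0.01, 0.1)


def fewshot_subsets(captions, min_samples=1, seed=0):
    """Return {fraction: sampled captions}, skipping fractions that are too small.

    `min_samples` is a hypothetical threshold standing in for "if applicable".
    """
    rng = random.Random(seed)  # fixed seed so subsets are reproducible
    subsets = {}
    for fraction in FEWSHOT_FRACTIONS:
        size = round(len(captions) * fraction)
        if size < min_samples:
            continue  # fraction not applicable for this dataset size
        subsets[fraction] = rng.sample(captions, size)
    return subsets


if __name__ == "__main__":
    toy_captions = [f"caption {i}" for i in range(5000)]
    for fraction, subset in fewshot_subsets(toy_captions).items():
        print(f"{fraction:.2%}: {len(subset)} captions")
```

With a toy set of 5,000 captions, the 0.01% split rounds to zero samples and is skipped, which is one plausible reading of why that split is only run on larger datasets.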

Multilingual Scenario (i.e., English -> X)

For the msrvtt and vatex datasets, X denotes Chinese (zh). For the flickr30k dataset, X can be German (de) or French (fr).

tasks="adapt adapt_zeroshot"

bash scripts/ $train_dataset $method "$tasks" "$val_datasets"

Unlike the monolingual command, this one appends a postfix #A-B to the method:

  • # activates the multilingual mode, where we use bert-base-multilingual-cased's vocab rather than that of bert-base-uncased to embed tokens. See configs for more details.
  • A denotes which language(s) to generate during training. For example, when we train the model on English-German pairs, we can set A to de (the model only uses German texts as targets) or en,de (the model uses both English and German texts as targets).
  • B denotes which language to be generated during evaluation. For example, #A-de means we require generating German texts during evaluation.
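
The `#A-B` convention above can be illustrated with a small parser. This is a sketch for illustration only and is not how the repository parses the argument; `MethodSpec` and `parse_method` are hypothetical names, and returning an empty language list for monolingual methods is an assumption.

```python
from typing import List, NamedTuple, Optional


class MethodSpec(NamedTuple):
    name: str                 # method name, e.g. "baseline"
    multilingual: bool        # True when a "#" postfix is present
    train_langs: List[str]    # A: target language(s) during training
    eval_lang: Optional[str]  # B: target language during evaluation


def parse_method(method: str) -> MethodSpec:
    """Split '<method>#A-B' into its parts; plain '<method>' is monolingual."""
    name, sep, postfix = method.partition("#")
    if not sep:
        return MethodSpec(name, False, [], None)
    # A may hold several comma-separated languages; B is a single language.
    train_part, dash, eval_lang = postfix.rpartition("-")
    if not dash or not train_part or not eval_lang:
        raise ValueError(f"expected '<method>#A-B', got {method!r}")
    return MethodSpec(name, True, train_part.split(","), eval_lang)


if __name__ == "__main__":
    print(parse_method("baseline"))
    print(parse_method("baseline#zh-zh"))
    print(parse_method("MultiCapCLIP_001#en,de-de"))
```

For instance, `MultiCapCLIP_001#en,de-de` reads as: train with both English and German targets, then generate German captions at evaluation time.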

Show Results

Run the following commands to gather results; they report the mean and standard deviation of each metric across runs.

python --root output/finetune --csv_path results/ --csv_fn finetune.csv
python --root output/adapt --csv_path results/ --csv_fn adapt.csv
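
As a rough illustration of what "mean with standard deviation across runs" means, the sketch below aggregates per-run metric dictionaries. It is not the repository's results script: `summarize_runs` is a hypothetical helper, the metric names are placeholders, and using the sample standard deviation (rather than the population one) is an assumption.

```python
from statistics import mean, stdev


def summarize_runs(runs):
    """Aggregate per-run metric dicts into {metric: (mean, std)}."""
    summary = {}
    for metric in runs[0]:
        values = [run[metric] for run in runs]
        # Sample standard deviation needs at least two runs; report 0.0 otherwise.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[metric] = (mean(values), spread)
    return summary


if __name__ == "__main__":
    # Placeholder scores for two runs, not results from the paper.
    toy_runs = [
        {"metric_a": 10.0, "metric_b": 2.0},
        {"metric_a": 14.0, "metric_b": 4.0},
    ]
    for metric, (avg, std) in summarize_runs(toy_runs).items():
        print(f"{metric}: {avg:.2f} ± {std:.2f}")
```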


bash scripts/ coco baseline "finetune"
bash scripts/ msrvtt baseline "finetune"
bash scripts/ vatex baseline#zh-zh "finetune"

bash scripts/ coco baseline "adapt adapt_zeroshot" "coco msrvtt"
bash scripts/ msrvtt baseline "adapt adapt_zeroshot" "coco msrvtt"
bash scripts/ msrvtt baseline#zh-zh "adapt adapt_zeroshot" "vatex"
bash scripts/ vatex baseline#zh-zh "adapt adapt_zeroshot" "vatex"

bash scripts/ coco MultiCapCLIP_001 "adapt adapt_zeroshot" "msrvtt"
bash scripts/ coco MultiCapCLIP_01 "adapt adapt_zeroshot" "coco"
bash scripts/ msrvtt MultiCapCLIP_001 "adapt adapt_zeroshot" "coco msrvtt"
bash scripts/ msrvtt MultiCapCLIP_001#zh-zh "adapt adapt_zeroshot" "vatex"
bash scripts/ vatex MultiCapCLIP_001#zh-zh "adapt adapt_zeroshot" "vatex"

Semi-Supervised Training

bash scripts/ coco baseline "finetune_fewshot"
bash scripts/ msrvtt baseline "finetune_fewshot"
bash scripts/ msrvtt MultiCapCLIP_001 "adapt adapt_zeroshot adapt_fewshot" "coco msrvtt"
bash scripts/ coco MultiCapCLIP_01 "adapt adapt_zeroshot adapt_fewshot" "coco"
bash scripts/ coco MultiCapCLIP_001 "adapt adapt_zeroshot adapt_fewshot" "msrvtt"

Ablation Study

bash scripts/ msrvtt baseline "adapt adapt_zeroshot" "coco"
bash scripts/ msrvtt base_CP "adapt adapt_zeroshot" "coco"
bash scripts/ msrvtt base_IA "adapt adapt_zeroshot" "coco"
bash scripts/ msrvtt base_FA_001 "adapt adapt_zeroshot" "coco"
bash scripts/ msrvtt base_IA_FA_001 "adapt adapt_zeroshot" "coco"
bash scripts/ msrvtt MultiCapCLIP_001_K4 "adapt adapt_zeroshot" "coco"
bash scripts/ msrvtt MultiCapCLIP_001_K8 "adapt adapt_zeroshot" "coco"
bash scripts/ msrvtt MultiCapCLIP_001 "adapt adapt_zeroshot" "coco"
bash scripts/ msrvtt MultiCapCLIP_001_K32 "adapt adapt_zeroshot" "coco"
bash scripts/ msrvtt MultiCapCLIP_001_V "adapt adapt_zeroshot" "coco"
bash scripts/ msrvtt MultiCapCLIP_001_NV "adapt adapt_zeroshot" "coco"

bash scripts/ coco baseline "adapt adapt_zeroshot" "coco"
bash scripts/ coco base_CP "adapt adapt_zeroshot" "coco"
bash scripts/ coco base_IA "adapt adapt_zeroshot" "coco"
bash scripts/ coco base_FA_01 "adapt adapt_zeroshot" "coco"
bash scripts/ coco base_IA_FA_01 "adapt adapt_zeroshot" "coco"
bash scripts/ coco MultiCapCLIP_01_K4 "adapt adapt_zeroshot" "coco"
bash scripts/ coco MultiCapCLIP_01_K8 "adapt adapt_zeroshot" "coco"
bash scripts/ coco MultiCapCLIP_01 "adapt adapt_zeroshot" "coco"
bash scripts/ coco MultiCapCLIP_01_K32 "adapt adapt_zeroshot" "coco"
bash scripts/ coco MultiCapCLIP_01_V "adapt adapt_zeroshot" "coco"
bash scripts/ coco MultiCapCLIP_01_NV "adapt adapt_zeroshot" "coco"

Extensions to German and French Languages

bash scripts/ flickr30k baseline#de-de "finetune"
bash scripts/ flickr30k baseline#de-de "adapt adapt_zeroshot"
bash scripts/ flickr30k MultiCapCLIP_001#de-de "adapt adapt_zeroshot"

bash scripts/ flickr30k baseline#fr-fr "finetune"
bash scripts/ flickr30k baseline#fr-fr "adapt adapt_zeroshot"
bash scripts/ flickr30k MultiCapCLIP_001#fr-fr "adapt adapt_zeroshot"

T-SNE Visualization: See notebooks/zero_shot_tsne.ipynb for an example.


Please [★star] this repo and [cite] the following papers if you find our code or data useful to your research:

    title = "{M}ulti{C}ap{CLIP}: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning",
    author = "Yang, Bang and Liu, Fenglin and Wu, Xian and Wang, Yaowei and Sun, Xu and Zou, Yuexian",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2023",
    publisher = "Association for Computational Linguistics",
    url = "",
    doi = "10.18653/v1/2023.acl-long.664",
    pages = "11908--11922",

   title={{Z}ero{NLG}: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation},
   author={Yang, Bang and Liu, Fenglin and Zou, Yuexian and Wu, Xian and Wang, Yaowei and Clifton, David A.},
   journal={arXiv preprint arXiv:2303.06458}


