Multimodal Mixture of Prompt

This repository implements Multimodal Adversarial Prompt Tuning, a technique for improving the adversarial robustness of pre-trained Vision-Language models.

Environment Setup

To set up the required environment, please follow the installation instructions provided in the CoOp repository.

Data Preparation

Before training or evaluating the models, you'll need to prepare the necessary datasets. Detailed instructions on downloading, preprocessing, and organizing the data can be found in DATASETS.md.

Training and Evaluation

This project provides scripts for training and evaluating various prompt designs. You can find all scripts in the ./scripts directory.

Example Usage

The following examples train and evaluate the different Multimodal Adversarial Prompt Tuning designs with a ViT-B/16 backbone in a zero-shot setting:

AdvIVLP (Adversarial V-L Independent Prompt):
```
./scripts/AdvIVLP/zs_vit16_train_AdvIVLP.sh
```

AdvMaPLe (Adversarial V-L Joint Prompt):
```
./scripts/AdvMaple/zs_vit16_train_AdvMaple.sh
```

AdvVPT (Adversarial Visual Prompt):
```
./scripts/AdvVPT/zs_vit16_train_AdvVPT.sh
```

AdvCoOp (Adversarial Textual Prompt):
```
./scripts/AdvCoOp/zs_vit16_train_AdvCoOp.sh
```
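
To run several of these designs back to back, a small launcher can help. The sketch below is not part of the repository: `TRAIN_SCRIPTS`, `build_command`, and `run_all` are hypothetical names, the dictionary keys follow the script directory names, and invoking each script through `bash` with no arguments is an assumption based on the commands shown above. If the scripts in your checkout expect arguments, append them to the command list.

```python
import subprocess

# Script paths copied from the examples above; keys follow the directory names.
TRAIN_SCRIPTS = {
    "AdvIVLP": "./scripts/AdvIVLP/zs_vit16_train_AdvIVLP.sh",
    "AdvMaple": "./scripts/AdvMaple/zs_vit16_train_AdvMaple.sh",
    "AdvVPT": "./scripts/AdvVPT/zs_vit16_train_AdvVPT.sh",
    "AdvCoOp": "./scripts/AdvCoOp/zs_vit16_train_AdvCoOp.sh",
}


def build_command(design):
    """Return the argv list for one design; raises KeyError for unknown names."""
    return ["bash", TRAIN_SCRIPTS[design]]


def run_all(designs, dry_run=True):
    """Run the scripts in order; with dry_run=True only collect the commands."""
    commands = [build_command(d) for d in designs]
    if not dry_run:
        for cmd in commands:
            # check=True stops the loop at the first failing script.
            subprocess.run(cmd, check=True)
    return commands


# Dry run: print what would be executed without launching any training.
planned = run_all(TRAIN_SCRIPTS, dry_run=True)
for cmd in planned:
    print(" ".join(cmd))
```

Calling `run_all(TRAIN_SCRIPTS, dry_run=False)` from the repository root would launch the four scripts sequentially.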

MoE Variants

The MoE variants extend each prompt design with a Mixture-of-Experts router. MoEAdvIVLP is the canonical scheme and uses alignment-aware soft routing on top of the V-L independent prompts.

MoEAdvIVLP (MoE V-L Independent Prompt):
```
./scripts/MoEAdvIVLP/zs_vit16_train_AdvIVLP.sh
```

MoEAdvMaPLe (MoE V-L Joint Prompt):
```
./scripts/MoEAdvMaple/zs_vit16_train_AdvMaple.sh
```

MoEAdvVPT (MoE Visual Prompt):
```
./scripts/MoEAdvVPT/zs_vit16_train_AdvVPT.sh
```

MoEAdvTP (MoE Textual Prompt):
```
./scripts/MoEAdvTP/zs_vit16_train_AdvIVLP.sh
```
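
For readers unfamiliar with soft routing, the toy sketch below illustrates the general idea of mixing several prompt experts with softmax weights. It is not the repository's implementation. This README does not specify how the alignment score is computed, so the sketch assumes it is the cosine similarity between an image feature and a per-expert text feature; the `temperature` parameter, the function names, and the toy vectors are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_route(scores, temperature=1.0):
    """Turn per-expert scores into softmax weights that sum to 1."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mix_prompts(expert_prompts, weights):
    """Weighted sum of expert prompt vectors; every expert contributes."""
    dim = len(expert_prompts[0])
    return [sum(w * p[i] for w, p in zip(weights, expert_prompts))
            for i in range(dim)]


# Toy data (assumed): one image feature and three experts.
image_feature = [1.0, 0.0]
expert_text_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
expert_prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Assumed alignment score: image-text cosine similarity per expert.
scores = [cosine(image_feature, t) for t in expert_text_features]
weights = soft_route(scores, temperature=0.1)
mixed_prompt = mix_prompts(expert_prompts, weights)
```

Unlike hard (top-1) routing, every expert receives a nonzero weight, so the mixed prompt stays differentiable with respect to all experts. In this toy setup the first expert, whose text feature is best aligned with the image feature, receives the largest weight.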

Acknowledgement

This repository is built upon MaPLe and CoOp. We thank the authors for their well-organized codebases.

Citation

```
@inproceedings{wang2025tapt,
  title={TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models},
  author={Wang, Xin and Chen, Kai and Zhang, Jiaming and Chen, Jingjing and Ma, Xingjun},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={19910--19920},
  year={2025}
}

@article{wang2026tame,
  title={TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models},
  author={Wang, Xin and Wang, Yixu and Zhang, Jiaming and Wang, Ruofan and Yu, Jiaqi and Chen, Kai and Chen, Jingjing and Ma, Xingjun and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2605.17577},
  year={2026}
}
```
