This repository implements Multimodal Adversarial Prompt Tuning, a technique for improving the adversarial robustness of pre-trained Vision-Language models.
To set up the required environment, please follow the installation instructions provided in the CoOp repository.
Before training or evaluating the models, you'll need to prepare the necessary datasets. Detailed instructions on downloading, preprocessing, and organizing the data can be found in DATASETS.md.
This project provides scripts for training and evaluating various prompt designs. You can find all scripts in the ./scripts directory.
Here are examples of how to train and evaluate the different Multimodal Adversarial Prompt Tuning designs using a ViT-B/16 backbone in a zero-shot setting:
- AdvIVLP (Adversarial V-L Independent Prompt):
  ./scripts/AdvIVLP/zs_vit16_train_AdvIVLP.sh
- AdvMaple (Adversarial V-L Joint Prompt):
  ./scripts/AdvMaple/zs_vit16_train_AdvMaple.sh
- AdvVP (Adversarial Visual Prompt):
  ./scripts/AdvVPT/zs_vit16_train_AdvVPT.sh
- AdvCoOp (Adversarial Textual Prompt):
  ./scripts/AdvCoOp/zs_vit16_train_AdvCoOp.sh
The MoE variants extend each prompt design with a Mixture-of-Experts router. MoEAdvIVLP is the canonical scheme and uses alignment-aware soft routing on top of the V-L independent prompts.
- MoEAdvIVLP (MoE V-L Independent Prompt):
  ./scripts/MoEAdvIVLP/zs_vit16_train_AdvIVLP.sh
- MoEAdvMaPLe (MoE V-L Joint Prompt):
  ./scripts/MoEAdvMaple/zs_vit16_train_AdvMaple.sh
- MoEAdvVPT (MoE Visual Prompt):
  ./scripts/MoEAdvVPT/zs_vit16_train_AdvVPT.sh
- MoEAdvTP (MoE Textual Prompt):
  ./scripts/MoEAdvTP/zs_vit16_train_AdvIVLP.sh
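To give a feel for what "soft routing" over prompt experts means, here is a minimal, self-contained sketch. It is not the implementation in this repository: the routing signal (cosine similarity between an image feature and a per-expert key vector), the temperature value, and every name in it (`soft_route`, `expert_keys`, `expert_prompts`) are assumptions made for illustration. The only point it illustrates is that, under soft routing, every expert contributes to the final prompt with a softmax weight instead of a single expert being selected.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def soft_route(image_feature, expert_keys, expert_prompts, temperature=0.1):
    """Blend expert prompts using softmax weights (hypothetical router).

    Assumption: each expert owns a key vector, and its score is the cosine
    similarity between that key and the image feature. This stands in for
    an "alignment" signal; the real router may score experts differently.
    """
    scores = [cosine(image_feature, key) for key in expert_keys]
    weights = softmax(scores, temperature)
    dim = len(expert_prompts[0])
    # Weighted sum of the expert prompt vectors, one coordinate at a time.
    mixed = [
        sum(w * prompt[i] for w, prompt in zip(weights, expert_prompts))
        for i in range(dim)
    ]
    return weights, mixed


# Toy data: three experts with 2-d keys and 3-d prompt vectors.
image_feature = [1.0, 0.0]
expert_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
expert_prompts = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]

weights, mixed = soft_route(image_feature, expert_keys, expert_prompts)
print("routing weights:", weights)
print("mixed prompt:", mixed)
```

In this toy setup the first expert's key matches the image feature exactly, so it receives almost all of the weight, while the other two experts still contribute small non-zero amounts. A lower temperature sharpens the distribution toward a single expert; a higher one spreads the weight more evenly.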
This repository is built upon MaPLe and CoOp. We thank the authors of both projects for their well-organized codebases.
@inproceedings{wang2025tapt,
title={TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models},
author={Wang, Xin and Chen, Kai and Zhang, Jiaming and Chen, Jingjing and Ma, Xingjun},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={19910--19920},
year={2025}
}
@article{wang2026tame,
title={TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models},
author={Wang, Xin and Wang, Yixu and Zhang, Jiaming and Wang, Ruofan and Yu, Jiaqi and Chen, Kai and Chen, Jingjing and Ma, Xingjun and Jiang, Yu-Gang},
journal={arXiv preprint arXiv:2605.17577},
year={2026}
}
