Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, Yiming Xiao
Abstract: Recent advancements in vision-language models (VLMs), such as CLIP, have demonstrated substantial success in self-supervised representation learning for vision tasks. However, effectively adapting VLMs to downstream applications remains challenging, as their accuracy often depends on time-intensive and expertise-demanding prompt engineering, while full model fine-tuning is costly. This is particularly true for biomedical images, which, unlike natural images, typically suffer from limited annotated datasets, unintuitive image contrasts, and nuanced visual features. Recent prompt learning techniques, such as Context Optimization (CoOp), intend to tackle these issues but still fall short in generalizability. Meanwhile, explorations of prompt learning for biomedical image analysis remain highly limited. In this work, we propose BiomedCoOp, a novel prompt learning framework that enables efficient adaptation of BiomedCLIP for accurate and highly generalizable few-shot biomedical image classification. Our approach achieves effective prompt context learning by leveraging semantic consistency with average prompt ensembles from Large Language Models (LLMs) and knowledge distillation with a statistics-based prompt selection strategy. We conducted comprehensive validation of our proposed framework on 11 medical datasets across 9 modalities and 10 organs against existing state-of-the-art methods, demonstrating significant improvements in both accuracy and generalizability.
- Semantic Consistency with LLM-Enhanced Prompt Ensembles: Enhance context vector learning using prompt ensembles derived from GPT-4, combined with a knowledge distillation strategy to enforce semantic consistency.
- Outlier Pruning for Robust Generalization: Employ a statistics-based pruning strategy to filter outlier prompts from LLMs, mitigating over-specialization while preserving essential biomedical patterns (a minimal sketch of this pruning and the consistency objective above follows this list).
- First Adoption of BiomedCLIP for Prompt Learning: Adopt BiomedCLIP as the backbone for prompt learning for the first time, demonstrating superior performance over general-domain CLIP on clinical tasks.
- Extensive Multi-Modal Evaluation: Evaluate across 11 biomedical image classification datasets, 9 modalities, and 10 organs, showcasing BiomedCoOp's superior generalizability and robustness in few-shot and base-to-novel benchmarks.
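Below is a minimal, unofficial PyTorch sketch of the two ideas referenced above (LLM prompt-ensemble consistency and statistics-based outlier pruning), not the repository's implementation. It assumes `ensemble_features` are L2-normalized BiomedCLIP text embeddings of GPT-4 prompt templates and `learned_text_features` come from trainable CoOp-style context vectors; all function names, the z-score threshold, and the KL form of the distillation term are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def prune_and_average(ensemble_features: torch.Tensor, z_thresh: float = 2.0) -> torch.Tensor:
    """Statistics-based pruning (illustrative): drop LLM prompt embeddings whose
    distance to the ensemble mean is a z-score outlier, then average the rest.

    ensemble_features: (P, D) L2-normalized text features, one row per LLM prompt.
    """
    mean = ensemble_features.mean(dim=0, keepdim=True)      # (1, D) ensemble centroid
    dists = (ensemble_features - mean).norm(dim=-1)         # (P,) distance of each prompt to the centroid
    z = (dists - dists.mean()) / (dists.std() + 1e-6)       # z-score of each distance
    kept = ensemble_features[z.abs() < z_thresh]            # discard outlier prompts
    return F.normalize(kept.mean(dim=0), dim=-1)            # (D,) averaged, re-normalized ensemble feature


def semantic_consistency_loss(image_features: torch.Tensor,
                              learned_text_features: torch.Tensor,
                              ensemble_text_features: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """KL-style distillation between logits from the learned prompts (student)
    and logits from the pruned LLM prompt ensemble (teacher).

    image_features: (B, D); *_text_features: (C, D); all L2-normalized.
    """
    logits_student = image_features @ learned_text_features.t() / temperature    # (B, C)
    logits_teacher = image_features @ ensemble_text_features.t() / temperature   # (B, C)
    return F.kl_div(F.log_softmax(logits_student, dim=-1),
                    F.softmax(logits_teacher, dim=-1),
                    reduction="batchmean")
```

In training, such a consistency term would typically be added to the standard cross-entropy objective with a weighting hyperparameter, e.g. `loss = ce_loss + lam * semantic_consistency_loss(...)`, while the image and text encoders stay frozen and only the context vectors are updated.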
Method | Paper | Configs | Training Scripts | Trainers |
---|---|---|---|---|
BiomedCoOp | CVPR 2025 | link | link | link |
CLIP | ICML 2021 | link | link | link |
CoOp | IJCV 2022 | link | link | link |
CoCoOp | CVPR 2022 | link | link | link |
KgCoOp | CVPR 2023 | link | link | link |
ProGrad | ICCV 2023 | link | link | link |
CLIP-Adapter | IJCV 2024 | link | link | link |
Tip-Adapter | ECCV 2022 | link | link | link |
LP | ICML 2021 | link | link | link |
LP++ | CVPR 2024 | link | link | link |
Results reported below show classification accuracy for the few-shot setting (1, 2, 4, 8, and 16 shots) as well as for base and novel classes across 11 biomedical image classification datasets, averaged over 3 seeds.
Method | 1-shot | 2-shot | 4-shot | 8-shot | 16-shot |
---|---|---|---|---|---|
CLIP-Adapter | 44.66 | 43.91 | 44.36 | 45.42 | 46.69 |
Tip-Adapter | 49.19 | 52.36 | 57.33 | 61.98 | 67.15 |
Tip-Adapter-F | 51.17 | 52.74 | 61.23 | 65.91 | 70.91 |
Standard LP | 47.25 | 54.21 | 61.00 | 65.85 | 69.40 |
LP++ | 47.24 | 53.18 | 59.02 | 63.69 | 68.35 |
CoOp | 50.16 | 54.18 | 59.75 | 65.84 | 69.62 |
CoCoOp | 48.49 | 51.28 | 54.69 | 61.08 | 65.09 |
KgCoOp | 50.85 | 53.18 | 57.82 | 62.08 | 62.84 |
ProGrad | 51.88 | 54.71 | 60.42 | 65.61 | 67.13 |
BiomedCoOp | 57.03 | 59.13 | 63.95 | 68.32 | 72.42 |
Method | Base Acc. | Novel Acc. | Harmonic Mean (HM) |
---|---|---|---|
BiomedCLIP | 47.84 | 65.42 | 53.81 |
CoOp | 73.85 | 64.75 | 67.23 |
CoCoOp | 72.26 | 67.03 | 67.22 |
KgCoOp | 68.36 | 64.08 | 64.61 |
ProGrad | 71.67 | 66.93 | 67.43 |
BiomedCoOp (ours) | 76.26 | 73.92 | 75.07 |
Name | Few-Shot | Base-to-Novel |
---|---|---|
BiomedCoOp | link | link |
For installation and other package requirements, please follow the instructions detailed in INSTALL.md.
Please follow the instructions at DATASETS.md to prepare all datasets.
Please refer to the RUN.md for detailed instructions on training, evaluating and reproducing the results using our pre-trained models.
If you use our work, please consider citing:
@inproceedings{koleilat2025biomedcoop,
title={{BiomedCoOp}: Learning to Prompt for Biomedical Vision-Language Models},
author={Koleilat, Taha and Asgariandehkordi, Hojat and Rivaz, Hassan and Xiao, Yiming},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={14766--14776},
year={2025}
}
Our code builds upon the CoOp, MaPLe, and LP++ repositories. We are grateful to the authors for making their code publicly available. If you use our model or code, we kindly request that you also consider citing these foundational works.