Rabiul Awal, Le Zhang and Aishwarya Agrawal
We explore fine-tuning-free prompting techniques applied to vision-language models, specifically the state-of-the-art BLIP2, Kosmos2, and OpenFlamingo, as well as the multimodal instruction-tuned LLaVA. We mainly focus on the following prompting approaches:
- Zero- and few-shot prompt templates (commonly explored in NLP).
- Chain-of-thought reasoning for VQA.
- A two-step VQA method that first generates a caption and then uses it as additional visual context for the VLM.
Existing vision-language models (VLMs) already show good zero-shot VQA performance. Our prompting techniques (especially caption-based few-shot VQA) lead to a substantial performance increase across benchmarks. However, although instruction-tuned models are claimed to show strong reasoning abilities, our tests found these abilities, particularly chain-of-thought reasoning, to be deficient across diverse benchmarks. We hope our work will inspire future research in this direction.
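For illustration, the two-step Caption VQA idea can be prototyped with the lavis library (which this codebase builds on): first generate a caption, then prepend it as context to the question. This is a minimal sketch; the model name, image path, prompt strings, and decoding settings below are simplified placeholders and may differ from what the repo actually uses.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
# BLIP2 Flan-T5 XXL via lavis; other BLIP2 variants with text prompting work similarly.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xxl", is_eval=True, device=device
)
image = vis_processors["eval"](Image.open("example.jpg").convert("RGB")).unsqueeze(0).to(device)

# Step 1: generate a caption to serve as additional visual context.
caption = model.generate({"image": image, "prompt": "A photo of"})[0]

# Step 2: answer the question, conditioning on the generated caption.
question = "What is the primary activity of the people in the scene?"
prompt = f"Context: {caption}. Question: {question} Short answer:"
print(model.generate({"image": image, "prompt": prompt})[0])
```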
We support the following VQA formats:
| Format | Description | Example |
|---|---|---|
| Standard VQA | Standard VQA task format. | Question: "What is the primary activity of the people in the scene?" Answer: "Dancing" |
| Caption VQA | Begins with a model-generated caption, then the standard VQA format. | Context: A group of people in traditional attire are dancing around a bonfire. Question: "What is the primary activity of the people in the scene?" Answer: "Dancing" |
| Chain-of-thought VQA | Implements the chain-of-thought format. | Question: "What is the primary activity of the people in the scene? Let's think step-by-step." Answer: "First, considering there's a bonfire, this often signifies a gathering or festivity. Next, seeing people in traditional attire implies a cultural event. Merging these observations, the primary activity is dancing around the bonfire." |
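As a rough illustration of how these formats translate into prompt strings (the real templates live under `prompts/templates/`; the `build_prompt` helper and the exact strings below are simplified stand-ins):

```python
from typing import Optional

def build_prompt(question: str, vqa_format: str, caption: Optional[str] = None) -> str:
    """Simplified prompt construction for the three supported VQA formats."""
    if vqa_format == "standard_vqa":
        return f"Question: {question} Short answer:"
    if vqa_format == "caption_vqa":
        # The caption generated in the first step is prepended as context.
        return f"Context: {caption} Question: {question} Short answer:"
    if vqa_format == "cot_vqa":
        return f"Question: {question} Let's think step-by-step."
    raise ValueError(f"Unknown vqa_format: {vqa_format}")

print(build_prompt("What is the primary activity of the people in the scene?", "cot_vqa"))
```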
We have a list of prompt templates that can be used with the different VQA formats. Please check the `prompts/templates/{dataset_name}` directory.
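If you want to inspect or reuse a template programmatically, a loader along the following lines works; the file layout assumed here (one JSON file per dataset) is hypothetical, so adapt it to whatever you find under `prompts/templates/{dataset_name}`.

```python
import json
from pathlib import Path

def load_template(dataset_name: str, prompt_name: str, root: str = "prompts/templates") -> str:
    # Hypothetical layout: prompts/templates/{dataset_name}/templates.json mapping
    # prompt names to template strings. Adjust to the repo's actual file format.
    path = Path(root) / dataset_name / "templates.json"
    templates = json.loads(path.read_text())
    return templates[prompt_name]

# Example (prompt name taken from the commands below):
# template = load_template("okvqa", "prefix_your_task_knowledge_qa_short_answer")
```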
Download and unzip the files into the `dataset/` folder for the VQA datasets. For Winoground, use the Hugging Face `datasets` library.
| | OK-VQA | AOK-VQA | GQA | Winoground | VQAv2 |
|---|---|---|---|---|---|
| Source | allenai | allenai | Stanford | Hugging Face | VQA |
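For example, Winoground can be loaded directly from the Hugging Face Hub (the dataset is gated, so accept its terms and authenticate with `huggingface-cli login` first):

```python
from datasets import load_dataset

# Requires prior authentication and accepting the Winoground terms on the Hub.
winoground = load_dataset("facebook/winoground")["test"]
example = winoground[0]
print(example["caption_0"])
print(example["caption_1"])
```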
To run the Standard VQA, use the following command:
python3 main.py --dataset_name okvqa \
--model_name blip2_t5_flant5xxl \
--vqa_format standard_vqa \
--prompt_name prefix_your_task_knowledge_qa_short_answer
To run the Caption VQA, use the following command:
python3 main.py --dataset_name okvqa \
--model_name blip2_t5_flant5xxl \
--vqa_format caption_vqa \
--prompt_name prefix_your_task_knowledge_qa_short_answer,prefix_promptcap
To run the Chain-of-thought VQA, use the following command:
python3 main.py --dataset_name okvqa \
--model_name blip2_t5_flant5xxl \
--vqa_format cot_vqa \
--prompt_name prefix_think_step_by_step_rationale
Please prepare the few-shot exemplar dataset with `dataset_zoo/nearest_neighbor.py` and run the following command:
python3 main.py \
--dataset_name okvqa \
--model_name blip2_t5_flant5xxl \
--vqa_format standard_vqa \
--prompt_name prefix_your_task_knowledge_qa_short_answer \
--vicuna_ans_parser --few_shot
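For reference, nearest-neighbor exemplar selection can be sketched as below, using question embeddings from sentence-transformers. This is only an illustration of the idea; the model name and retrieval settings are assumptions, and the actual `dataset_zoo/nearest_neighbor.py` may use different features.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_exemplars(query_question: str, train_questions: list, k: int = 4) -> list:
    """Return indices of the k training questions most similar to the query."""
    query_emb = encoder.encode([query_question], normalize_embeddings=True)
    train_embs = encoder.encode(train_questions, normalize_embeddings=True)
    # Cosine similarity (embeddings are L2-normalized above).
    scores = (train_embs @ query_emb.T).squeeze(-1)
    return np.argsort(-scores)[:k].tolist()
```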
Because the standard VQA accuracy metric is ill-suited to open-ended answer generation, we offer utility scripts in `evals/vicuna_llm_evals.py`. Using the Vicuna LLM, these scripts post-process the generated answers to align them with the reference responses and then evaluate them with the conventional VQA metric (sketched after the command below).
python3 main.py \
--dataset_name okvqa \
--model_name blip2_t5_flant5xxl \
--vqa_format standard_vqa \
--prompt_name prefix_your_task_knowledge_qa_short_answer \
--vicuna_ans_parser
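For reference, the conventional VQA accuracy metric scores a prediction against the set of human answers for each question. A simplified version (omitting the official leave-one-out averaging and answer normalization) looks like this:

```python
def vqa_accuracy(prediction: str, human_answers: list) -> float:
    """Simplified VQA accuracy: min(#matching human answers / 3, 1)."""
    prediction = prediction.strip().lower()
    matches = sum(ans.strip().lower() == prediction for ans in human_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("dancing", ["dancing"] * 7 + ["celebrating"] * 3))  # 1.0
```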
We report the baseline and best-setting results below; please check the paper for the full set of results.
| | BLIP2 Flan-T5 | BLIP2 OPT | Kosmos2 | OpenFlamingo | LLaVA |
|---|---|---|---|---|---|
| Baseline | 50.13 | 42.70 | 40.33 | 18.29 | 44.84 |
| Best | 50.55 | 46.29 | 43.09 | 42.48 | 46.86 |
| | BLIP2 Flan-T5 | BLIP2 OPT | Kosmos2 | OpenFlamingo | LLaVA |
|---|---|---|---|---|---|
| Baseline | 51.20 | 45.57 | 40.85 | 17.27 | 52.69 |
| Best | 54.98 | 49.39 | 43.60 | 44.13 | 52.32 |
| | BLIP2 Flan-T5 | BLIP2 OPT | Kosmos2 | OpenFlamingo | LLaVA |
|---|---|---|---|---|---|
| Baseline | 44.46 | 38.46 | 37.33 | 26.37 | 38.40 |
| Best | 47.01 | 41.99 | 40.13 | 41.00 | 42.65 |
| | BLIP2 Flan-T5 | BLIP2 OPT | Kosmos2 | OpenFlamingo | LLaVA |
|---|---|---|---|---|---|
| Baseline | 66.66 | 54.53 | 53.52 | 35.41 | 56.20 |
| Best | 71.37 | 62.81 | 57.33 | 58.00 | 65.32 |
Please email rabiul.awal [at] mila [dot] quebec for any questions. You can also open an issue or a pull request to add more prompting techniques or new multimodal vision-language models.
If you find this code useful, please cite our paper:
@article{awal2023investigating,
  title={Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering},
author={Awal, Rabiul and Zhang, Le and Agrawal, Aishwarya},
journal={arXiv preprint arXiv:2306.09996},
year={2023}
}
The codebase is built on top of the transformers, lavis, llava, and fastchat repositories. We thank the authors for their amazing work.