
Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering

Rabiul Awal, Le Zhang and Aishwarya Agrawal

[Figure: VQA prompt teaser]

Table of Contents

  • Approach
  • VQA Formats
  • Prompt Templates
  • Datasets
  • Usage
  • Results
  • Citing
  • Acknowledgments

Approach

We explore fine-tuning-free prompting techniques applied to vision-language models, specifically the state-of-the-art BLIP2, Kosmos2, and OpenFlamingo, as well as the multimodal instruction-tuned LLaVA. We mainly focus on the following prompting approaches:

  • Zero- and few-shot prompt templates (commonly explored in NLP).
  • Chain-of-thought reasoning for VQA.
  • A two-step VQA method that uses model-generated captions as additional visual context for VLMs.

Existing vision-language models (VLMs) already show good zero-shot VQA performance. Our prompting techniques (especially captioning in few-shot VQA) lead to a substantial performance increase across benchmarks. However, although instruction-tuned models are claimed to have strong reasoning abilities, our tests find their reasoning, particularly chain-of-thought, to be deficient across diverse benchmarks. We hope our work will inspire future research in this direction.

VQA Formats

We support the following VQA formats:

Standard VQA: the standard VQA task format.
  Question: "What is the primary activity of the people in the scene?"
  Answer: "Dancing"

Caption VQA: begins with a model-generated caption, followed by the standard VQA format.
  Context: A group of people in traditional attire are dancing around a bonfire.
  Question: "What is the primary activity of the people in the scene?"
  Answer: "Dancing"

Chain-of-thought VQA: implements the chain-of-thought format.
  Question: "What is the primary activity of the people in the scene? Let's think step-by-step."
  Answer: "First, considering there's a bonfire, this often signifies a gathering or festivity. Next, seeing people in traditional attire implies a cultural event. Merging these observations, the primary activity is dancing around the bonfire."

Prompt Templates

We provide a list of prompt templates that can be used with the different VQA formats. Please check prompts/templates/{dataset_name}.

[Figure: VQA prompt templates]

Datasets

Download and unzip the VQA dataset files into the dataset/ folder. For Winoground, use the Hugging Face datasets library (a loading sketch follows the table below).

Dataset     Source
OK-VQA      allenai
AOK-VQA     allenai
GQA         Stanford
Winoground  Hugging Face
VQAv2       VQA
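
Winoground is gated on the Hugging Face Hub, so you need to accept its terms and authenticate before downloading. A minimal loading sketch (standard Hugging Face datasets usage, not code from this repository):

from datasets import load_dataset

# Winoground is a gated dataset: accept its terms on the Hugging Face Hub and
# authenticate first (e.g. via `huggingface-cli login`).
winoground = load_dataset("facebook/winoground", split="test")

example = winoground[0]
print(example["caption_0"])  # first caption of the pair
print(example["caption_1"])  # second caption of the pair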

Usage

Running the inference code

To run the Standard VQA, use the following command:

python3 main.py --dataset_name okvqa \
  --model_name blip2_t5_flant5xxl \
  --vqa_format standard_vqa \
  --prompt_name prefix_your_task_knowledge_qa_short_answer

To run the Caption VQA, use the following command:

python3 main.py --dataset_name okvqa \
  --model_name blip2_t5_flant5xxl \
  --vqa_format caption_vqa \
  --prompt_name prefix_your_task_knowledge_qa_short_answer,prefix_promptcap

To run the Chain-of-thought VQA, use the following command:

python3 main.py --dataset_name okvqa \
  --model_name blip2_t5_flant5xxl \
  --vqa_format cot_vqa \
  --prompt_name prefix_think_step_by_step_rationale

Running few-shot inference

Please prepare the exemplar dataset with dataset_zoo/nearest_neighbor.py and run the following command:

python3 main.py \
  --dataset_name okvqa \
  --model_name blip2_t5_flant5xxl \
  --vqa_format standard_vqa \
  --prompt_name prefix_your_task_knowledge_qa_short_answer \
  --vicuna_ans_parser --few_shot
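
The few-shot exemplars themselves are produced by dataset_zoo/nearest_neighbor.py. As a rough illustration of the underlying idea (and not that script), nearest-neighbour exemplar selection over question text can be sketched with off-the-shelf sentence embeddings:

# Illustrative nearest-neighbour exemplar selection over question text.
# This is NOT dataset_zoo/nearest_neighbor.py; it only sketches the idea
# using sentence-transformers embeddings and cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

train_questions = [
    "What sport is being played?",
    "What color is the bus?",
    "What is the man holding?",
]
test_question = "What game are the children playing?"

model = SentenceTransformer("all-MiniLM-L6-v2")
train_emb = model.encode(train_questions, normalize_embeddings=True)
test_emb = model.encode([test_question], normalize_embeddings=True)

# On normalized embeddings, cosine similarity is a plain dot product.
scores = (train_emb @ test_emb.T).squeeze()
top_k = np.argsort(-scores)[:2]
print([train_questions[i] for i in top_k])  # the two nearest exemplar questions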

Running Vicuna answer extraction

Because the conventional VQA accuracy metric is restrictive for open-ended answer generation, we provide utility scripts in evals/vicuna_llm_evals.py. Using the Vicuna LLM, these scripts post-process the generated answers to align them with the reference responses and then evaluate them with the conventional VQA metric.

python3 main.py \
  --dataset_name okvqa \
  --model_name blip2_t5_flant5xxl \
  --vqa_format standard_vqa \
  --prompt_name prefix_your_task_knowledge_qa_short_answer \
  --vicuna_ans_parser
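
For reference, the conventional VQA metric scores a prediction against the human reference answers, giving full credit once at least three annotators agree with it. A simplified, self-contained sketch (the official evaluation also averages over leave-one-annotator-out subsets and applies answer normalization, both omitted here):

# Simplified VQA soft accuracy: full credit if at least three annotators gave
# the predicted answer. Official scoring additionally averages over
# leave-one-annotator-out subsets and normalizes answers (articles,
# punctuation, number words), which is omitted in this sketch.
def vqa_soft_accuracy(predicted: str, reference_answers: list[str]) -> float:
    predicted = predicted.strip().lower()
    matches = sum(ans.strip().lower() == predicted for ans in reference_answers)
    return min(matches / 3.0, 1.0)

references = ["dancing"] * 6 + ["celebrating"] * 3 + ["partying"]
print(vqa_soft_accuracy("dancing", references))      # 1.0
print(vqa_soft_accuracy("celebrating", references))  # 1.0
print(vqa_soft_accuracy("partying", references))     # ~0.33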

Results

We report the baseline and best-setting results. Please check the paper for the full set of results.

OKVQA

          BLIP2 Flan-T5   BLIP2 OPT   Kosmos2   OpenFlamingo   LLaVA
Baseline  50.13           42.7        40.33     18.29          44.84
Best      50.55           46.29       43.09     42.48          46.86

AOKVQA

          BLIP2 Flan-T5   BLIP2 OPT   Kosmos2   OpenFlamingo   LLaVA
Baseline  51.20           45.57       40.85     17.27          52.69
Best      54.98           49.39       43.60     44.13          52.32

GQA

          BLIP2 Flan-T5   BLIP2 OPT   Kosmos2   OpenFlamingo   LLaVA
Baseline  44.46           38.46       37.33     26.37          38.40
Best      47.01           41.99       40.13     41.00          42.65

VQAv2

          BLIP2 Flan-T5   BLIP2 OPT   Kosmos2   OpenFlamingo   LLaVA
Baseline  66.66           54.53       53.52     35.41          56.2
Best      71.37           62.81       57.33     58.0           65.32

Citing

Please email rabiul.awal [at] mila [dot] quebec for any questions. You can also open an issue or pull request to add more prompting techniques or new multi-modal vision-language models.

If you find this code useful, please cite our paper:

@article{awal2023investigating,
  title={Investigating Prompting Techniques for Zero-and Few-Shot Visual Question Answering},
  author={Awal, Rabiul and Zhang, Le and Agrawal, Aishwarya},
  journal={arXiv preprint arXiv:2306.09996},
  year={2023}
}

Acknowledgments

The codebase is built on top of the transformers, lavis, llava, and fastchat repositories. We thank the authors for their amazing work.
