Official repository for the EMNLP 2024 paper "From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis"
git clone https://github.com/steven-ccq/VisualReasoner.git
cd VisualReasoner

# Python 3.8
pip install -r requirements.txt

cd tools
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO/
pip install -e .
mkdir weights
cd weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
cd ..
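As a quick sanity check that Grounding DINO and the downloaded weights load correctly, here is a minimal sketch (run from the tools/GroundingDINO/ directory; the config path is the one shipped with the Grounding DINO repository):

```python
# Minimal load test for Grounding DINO; loads on CPU just to verify the
# install and the checkpoint downloaded above.
from groundingdino.util.inference import load_model

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",  # config shipped with the repo
    "weights/groundingdino_swint_ogc.pth",              # checkpoint downloaded above
    device="cpu",
)
print(type(model).__name__)
```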
Download the adapter and merge it with llava-1.5-7b-hf to obtain the Planner model. Rename the merged model to planner and move it into models/.
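If the downloaded adapter is a PEFT (LoRA) adapter, the merge can be done in a few lines of Python. A minimal sketch, assuming the transformers and peft packages; the adapter path is a placeholder:

```python
# Merge a LoRA adapter into llava-1.5-7b-hf and save the result as the
# Planner model. "path/to/adapter" is a placeholder for the downloaded adapter.
from peft import PeftModel
from transformers import AutoProcessor, LlavaForConditionalGeneration

base = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = PeftModel.from_pretrained(base, "path/to/adapter")
merged = model.merge_and_unload()  # fold the adapter weights into the base model

merged.save_pretrained("models/planner")
# Save the processor alongside the weights so the model directory is self-contained.
AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf").save_pretrained("models/planner")
```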
First, download the corresponding test sets following the instructions in the data/ directory.
For convenience, we provide a script for each test task:
# TextVQA
bash textvqa.sh
# TallyQA
bash tallyqa.sh
# ST-VQA
bash stvqa.sh
# GQA
bash gqa.sh

The parameters used in the scripts are described in the table below:
| Argument | Description |
|---|---|
| `input` | Path to the input file |
| `output` | Path to the output file |
| `vlm_module` | Path to the Answer model |
| `src` | Path to the image folder |
| `model` | Path to the Planner model |
| `grounding_basedir` | Path to the Grounding DINO directory |
Evaluate the resulting output files with the corresponding scripts:

# TextVQA
python eval/eval_textvqa.py --input=textvqa.json
# TallyQA
python eval/eval_tallyqa.py --input=tallyqa.json
# ST-VQA: submit predictions to the official evaluation server
# https://rrc.cvc.uab.es/?ch=11
# GQA
python eval/eval_gqa.py --input=gqa.json

We also provide a 1M dataset synthesized using the least-to-most method, available at 🤗VisualReasoner-1M, along with a 30k variant containing end-to-end reasoning processes, available at 🤗VisualReasoner-30k.
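For reference, a minimal sketch of loading the released data with the Hugging Face datasets library; the hub ids below are placeholders, use the actual ids behind the links above:

```python
# Loading sketch; replace the placeholder ids with the hub ids linked above
# (🤗VisualReasoner-1M / 🤗VisualReasoner-30k).
from datasets import load_dataset

ds_1m = load_dataset("<org>/VisualReasoner-1M")    # placeholder hub id
ds_30k = load_dataset("<org>/VisualReasoner-30k")  # placeholder hub id
print(ds_1m)
```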
