ViCrop: Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal Large Language Models
Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
In this work, we investigate whether multimodal LLMs can perceive small visual details as well as large ones in images. We show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question. In particular, BLIP2's zero-shot accuracy on the subset of TextVQA with smaller text is substantially lower than on the subset with larger text.
Three qualitative observations with BLIP2-FlanT5-XL are shown below (the pictures are clear; you may need to zoom in, like what we do in ViCrop 😆). The model gradually corrects itself on object existence, category detection, and text reading when zooming in.
The proposed ViCrop framework is illustrated below. We propose five variants of ViCrop, leveraging either external localization models or native cropping, where we utilize the MLLM's inference-time dynamics, i.e., gradients and attention.
Illustration of attention-based cropping:
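To make the idea concrete, here is a minimal sketch of attention-based cropping, assuming the MLLM exposes (averaged) attention weights over the vision encoder's image patches. The function name, grid size, and crop fraction are illustrative assumptions, not the exact implementation in the notebooks.

```python
# Minimal sketch of attention-based cropping (names and defaults are hypothetical).
# Idea: given the model's attention over image patches for a question, find the
# most-attended patch and crop a window around it.
import torch
from PIL import Image

def attention_crop(image: Image.Image, patch_attn: torch.Tensor,
                   grid: int = 16, crop_frac: float = 0.5) -> Image.Image:
    """patch_attn: (grid*grid,) attention weights over image patches,
    e.g. averaged over heads/layers from the MLLM's forward pass."""
    idx = int(patch_attn.argmax())          # most-attended patch
    row, col = divmod(idx, grid)
    W, H = image.size
    # Center of that patch in pixel coordinates.
    cx, cy = (col + 0.5) * W / grid, (row + 0.5) * H / grid
    cw, ch = crop_frac * W, crop_frac * H
    # Clamp the crop window to stay inside the image.
    left = min(max(cx - cw / 2, 0), W - cw)
    top = min(max(cy - ch / 2, 0), H - ch)
    return image.crop((int(left), int(top), int(left + cw), int(top + ch)))
```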
Next, we provide notebook examples for ViCrop. We encourage you to test the pipeline on different datasets and models!
(optional) Create a conda environment and activate it.
conda create -n vicrop python=3.8
conda activate vicrop
Clone the repository
git clone https://github.com/saccharomycetes/vicrop.git
cd vicrop
unzip LAVIS.zip
Since we have made modifications to the original LAVIS library, please use the following commands to install the modified version.
cd LAVIS
pip install -e .
Then install the rest of the dependencies.
cd ..
pip install -r requirements.txt
Download the model checkpoints:
SAM model checkpoint here
YOLO model checkpoint here
Alternatively, you can download them with the following commands
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
wget https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8x.pt
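As a rough sketch of the external-cropping variants, the snippet below uses the YOLOv8 checkpoint downloaded above to propose a crop region. For simplicity it crops to the highest-confidence detection; the image path is illustrative, and the actual notebooks select the box relevant to the question.

```python
# Sketch: use YOLOv8 detections as external crop proposals (simplified: we take
# the highest-confidence box rather than matching the question subject).
from PIL import Image
from ultralytics import YOLO

model = YOLO("yolov8x.pt")        # the checkpoint downloaded above
image = Image.open("example.jpg")  # hypothetical input image
result = model(image)[0]

if len(result.boxes) > 0:
    # Pick the detection with the highest confidence as the crop region.
    i = int(result.boxes.conf.argmax())
    x1, y1, x2, y2 = result.boxes[i].xyxy[0].tolist()
    cropped = image.crop((int(x1), int(y1), int(x2), int(y2)))
    cropped.save("example_crop.jpg")
```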
Now you are ready to run external_crop.ipynb and native_crop.ipynb to see how cropping helps BLIP2 answer questions better.
You can also adapt the code to your own dataset by following the notebooks.
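As a minimal starting point for your own data, the sketch below queries BLIP2 through the standard LAVIS API on a full image and its crop. The file names and prompt format are illustrative assumptions; the notebooks contain the full pipeline, including how the answers are combined.

```python
# Sketch: ask BLIP2 (via LAVIS) the same question on the full image and on a
# crop. Prompt format and answer combination may differ from the notebooks.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl",
    is_eval=True, device=device)

question = "What does the small sign say?"  # hypothetical question
for path in ["example.jpg", "example_crop.jpg"]:  # hypothetical files
    img = vis_processors["eval"](Image.open(path).convert("RGB"))
    img = img.unsqueeze(0).to(device)
    answer = model.generate({
        "image": img,
        "prompt": f"Question: {question} Short answer:"})
    print(path, "->", answer[0])
```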
If you find our research to be useful or insightful, please consider citing the following paper:
@article{zhang2023visual,
title={Visual Cropping Improves Zero-Shot Question Answering of Multimodal Large Language Models},
author={Zhang, Jiarui and Khayatkhoei, Mahyar and Chhikara, Prateek and Ilievski, Filip},
journal={arXiv preprint arXiv:2310.16033},
year={2023}
}
For questions, please contact jzhang37@usc.edu.