ViCrop: Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal Large Language Models
Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
In this work, we investigate whether multimodal LLMs can perceive small visual details as well as large ones in images. We show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question. In particular, BLIP2's zero-shot accuracy on the subset of TextVQA with smaller text is substantially lower than on the subset with larger text.
Three qualitative observations with BLIP2-FlanT5-XL are shown below (the pictures are clear; you may need to zoom in, like what we do in ViCrop 😆). The model gradually corrects itself on object existence, category detection, and text reading when zooming in.
The proposed ViCrop framework is illustrated below. We propose five variants of ViCrop, leveraging either external localization models or native cropping, where we utilize the MLLM's inference-time dynamics, i.e., gradients and attention.
Illustration of attention-based cropping:
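To make the idea concrete, here is a minimal sketch of attention-based cropping, assuming the MLLM exposes (averaged) attention weights over the vision encoder's image patches. The function name, grid size, and crop fraction are illustrative assumptions, not the exact implementation in the notebooks.

```python
# Minimal sketch of attention-based cropping (names and defaults are hypothetical).
# Idea: given the model's attention over image patches for a question, find the
# most-attended patch and crop a window around it.
import torch
from PIL import Image

def attention_crop(image: Image.Image, patch_attn: torch.Tensor,
                   grid: int = 16, crop_frac: float = 0.5) -> Image.Image:
    """patch_attn: (grid*grid,) attention weights over image patches,
    e.g. averaged over heads/layers from the MLLM's forward pass."""
    idx = int(patch_attn.argmax())          # most-attended patch
    row, col = divmod(idx, grid)
    W, H = image.size
    # Center of that patch in pixel coordinates.
    cx, cy = (col + 0.5) * W / grid, (row + 0.5) * H / grid
    cw, ch = crop_frac * W, crop_frac * H
    # Clamp the crop window to stay inside the image.
    left = min(max(cx - cw / 2, 0), W - cw)
    top = min(max(cy - ch / 2, 0), H - ch)
    return image.crop((int(left), int(top), int(left + cw), int(top + ch)))
```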
Next, we provide notebook examples for ViCrop. We encourage you to test the pipeline on different datasets and models!
(optional) Create a conda environment and activate it.
conda create -n vicrop python=3.8
conda activate vicrop
Clone the repository
git clone https://github.com/saccharomycetes/vicrop.git
cd vicrop
unzip LAVIS.zip
Since we have made modifications to the original LAVIS library, please use the following commands to install the modified version.
cd LAVIS
pip install -e .
Then install the rest of the dependencies.
cd ..
pip install -r requirements.txt
Download the model checkpoints:
SAM model checkpoint here
YOLO model checkpoint here
Alternatively, you can download them with the following commands
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
wget https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8x.pt
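As a rough sketch of the external-cropping variants, the snippet below uses the YOLOv8 checkpoint downloaded above to propose a crop region. For simplicity it crops to the highest-confidence detection; the image path is illustrative, and the actual notebooks select the box relevant to the question.

```python
# Sketch: use YOLOv8 detections as external crop proposals (simplified: we take
# the highest-confidence box rather than matching the question subject).
from PIL import Image
from ultralytics import YOLO

model = YOLO("yolov8x.pt")        # the checkpoint downloaded above
image = Image.open("example.jpg")  # hypothetical input image
result = model(image)[0]

if len(result.boxes) > 0:
    # Pick the detection with the highest confidence as the crop region.
    i = int(result.boxes.conf.argmax())
    x1, y1, x2, y2 = result.boxes[i].xyxy[0].tolist()
    cropped = image.crop((int(x1), int(y1), int(x2), int(y2)))
    cropped.save("example_crop.jpg")
```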
Now you are ready to run external_crop.ipynb and native_crop.ipynb to see how cropping helps BLIP2 answer questions better.
You can also adapt the code to your own dataset by following the notebooks.
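As a minimal starting point for your own data, the sketch below queries BLIP2 through the standard LAVIS API on a full image and its crop. The file names and prompt format are illustrative assumptions; the notebooks contain the full pipeline, including how the answers are combined.

```python
# Sketch: ask BLIP2 (via LAVIS) the same question on the full image and on a
# crop. Prompt format and answer combination may differ from the notebooks.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl",
    is_eval=True, device=device)

question = "What does the small sign say?"  # hypothetical question
for path in ["example.jpg", "example_crop.jpg"]:  # hypothetical files
    img = vis_processors["eval"](Image.open(path).convert("RGB"))
    img = img.unsqueeze(0).to(device)
    answer = model.generate({
        "image": img,
        "prompt": f"Question: {question} Short answer:"})
    print(path, "->", answer[0])
```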
If you find our research to be useful or insightful, please consider citing the following paper:
@article{zhang2023visual,
title={Visual Cropping Improves Zero-Shot Question Answering of Multimodal Large Language Models},
author={Zhang, Jiarui and Khayatkhoei, Mahyar and Chhikara, Prateek and Ilievski, Filip},
journal={arXiv preprint arXiv:2310.16033},
year={2023}
}
For questions, please contact jzhang37@usc.edu.