Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
In this work, we investigate whether multimodal LLMs can perceive small details as well as large details in images. In particular, we show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question, declining up to 46% with size. We further show that this effect is causal, and that human visual cropping can significantly mitigate this sensitivity.
(Optional) Create a conda environment and activate it:
conda create -n visual_crop_zsvqa python=3.8
conda activate visual_crop_zsvqa
Clone the repository:
git clone https://github.com/saccharomycetes/visual_crop_zsvqa.git
cd visual_crop_zsvqa
Since we have made a modification to the original LAVIS library, please install our modified version with the following commands:
cd LAVIS
pip install -e .
Then install the remaining dependencies:
cd ..
pip install -r requirements.txt
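To verify that the modified LAVIS library installed correctly, you can try loading a BLIP-2 model in Python. This is a minimal sketch using LAVIS's standard `load_model_and_preprocess` entry point; the model name and type shown (`blip2_t5`, `pretrain_flant5xl`) are common LAVIS options and are not necessarily the exact configuration used in our experiments.

```python
# Sanity check that the modified LAVIS install works.
# Note: the model name/type below are standard LAVIS options and
# may differ from the configuration used in the paper.
import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5",
    model_type="pretrain_flant5xl",
    is_eval=True,
    device=device,
)
print("LAVIS BLIP-2 loaded successfully")
```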
Download the model checkpoints (SAM ViT-H and YOLOv8x) using the following commands:
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
wget https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8x.pt
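After downloading, you can sanity-check both checkpoints by loading them. This is a minimal sketch assuming the `segment_anything` and `ultralytics` packages from requirements.txt, with the checkpoint files in the current directory:

```python
# Minimal check that both checkpoints load (assumes the
# segment_anything and ultralytics packages are installed).
from segment_anything import sam_model_registry
from ultralytics import YOLO

# SAM ViT-H checkpoint downloaded above.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# YOLOv8x detector checkpoint downloaded above.
yolo = YOLO("yolov8x.pt")

print("SAM and YOLO checkpoints loaded")
```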
Now you are ready to run crop.ipynb to see how cropping helps BLIP-2 answer questions better.
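For a rough idea of what the notebook demonstrates, the sketch below answers the same question on the full image and on a crop around the question's visual subject, querying BLIP-2 through LAVIS. The image path, question, and crop box are illustrative placeholders; the notebook's actual pipeline, which uses SAM/YOLO to find the crop region automatically, may differ.

```python
# Sketch of the visual-cropping idea: answer a question on the full
# image vs. on a crop around the question's subject. The image path,
# question, and crop box are placeholders; the notebook derives the
# crop automatically (e.g., via SAM/YOLO).
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

question = "What color is the small sign?"      # placeholder question
raw = Image.open("example.jpg").convert("RGB")  # placeholder image

def answer(img):
    image = vis_processors["eval"](img).unsqueeze(0).to(device)
    return model.generate({"image": image, "prompt": f"Question: {question} Answer:"})

print("Full image:", answer(raw))
# Crop tightly around the visual subject (placeholder box: left, top, right, bottom).
print("Cropped:   ", answer(raw.crop((100, 150, 300, 350))))
```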
If you find our research useful or insightful, please consider citing our paper:
@article{zhang2023visual,
  title={Visual Cropping Improves Zero-Shot Question Answering of Multimodal Large Language Models},
  author={Zhang, Jiarui and Khayatkhoei, Mahyar and Chhikara, Prateek and Ilievski, Filip},
  journal={arXiv preprint arXiv:2310.16033},
  year={2023}
}
For any questions, please contact jrzhang [AT] isi.edu.