- [10/26]🔥SliME-8B achieves better high-resolution understanding performance on MME-RealWorld compared to Mini-Gemini and LLaVA-Next.
- [09/26]🔥SliME is supported by VLMEvalKit. Feel free to use it without hesitation!
- [07/16]🔥The SliME strategy demonstrates exceptional versatility, extending seamlessly to video analysis (See Slime_video.md). Remarkably, even though the model has never been specifically trained on video data, it is capable of processing up to 8 frames. In the Video-MME benchmark, the model surpasses numerous 7B/8B baselines that have undergone training on video datasets.
- [06/11]🔥SliME is coming! We release the paper, code, models, and data for SliME!
- [06/11]🔥SliME-70B will be released soon.
Please follow the instructions below to install the required packages.
- Clone this repository
git clone https://github.com/yfzhang114/SliME.git
- Install Package
conda create -n slime python=3.10 -y
conda activate slime
cd SliME
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
pip install ninja
pip install datasets
pip install flash-attn --no-build-isolation
We provide all our fully finetuned models on Stage 1/2 and 3 data for SliME:
Model | Base LLM | Vision Encoder | Finetuning Data | Finetuning schedule | Download |
---|---|---|---|---|---|
SliME-7B | Vicuna-7B-v1.5 | CLIP-L | SharedGPT+SMR | full_ft | ckpt |
SliME-8B | Llama-3-8B-Instruct | CLIP-L | SharedGPT+SMR | full_ft | ckpt |
SliME-13B | Vicuna-13B-v1.5 | CLIP-L | SharedGPT+SMR | full_ft | ckpt |
SliME-70B | Llama-3-70B-Instruct | CLIP-L | SharedGPT+SMR | Lora | ckpt |
Here are the pretrained weights on Stage 1/2 data only:
Model | Base LLM | Vision Encoder | Pretrain Data | Finetuning schedule | Download |
---|---|---|---|---|---|
SliME-7B | Vicuna-7B-v1.5 | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
SliME-8B | Llama-3-8B-Instruct | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
SliME-13B | Vicuna-13B-v1.5 | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
SliME-70B | Llama-3-70B-Instruct | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
Please follow LLaVA and SharedGPT4V to prepare the corresponding images and data.
data
├── arxivqa
│ └── images
├── DVQA
│ └── images
├── Geometry3K
│ └── 0-2400 dirs
├── ChartQA
│ └── train_images
└── GeoQA3
│ ├── image
│ └── json
├── mathvision
├── scienceqa
├── tabmwp
└── GeoQA3
│ ├── train
│ └── test
│ └── val
└── ai2d
│ ├── abc_images
│ └── images
└── geoqa+
│ └── images
You can find the pre-processing code at this URL. If you have any questions about file names or image paths, please refer to the pre-processing code.
- Arxiv QA Download images using this download url
python playground/data/process_arxivqa.py
- DVQA
Download images using this url.
- ChartQA
Clone this repo
extract all the training images in ChartQA_Dataset/train/png
into ChartQA
- Geometry3K
Download images using this url.
The image path in our json file will be os.path.join(f'Geometry3K/i', 'img_diagram.png')
- GeoQA3
Download images using this url
extract all the training images in GeoQA3/image
- MathVision
Download images using this url
Our data will not include the images from test-mini split automatically
- ScienceQA
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/train.zip
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/val.zip
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/test.zip
unzip -q train.zip
unzip -q val.zip
unzip -q test.zip
rm train.zip
rm val.zip
rm test.zip
- Tabmwp
Download images using this url
- TextbookQA
Download images using this url
- AI2D:
Download images using this url
- GeoQA+
Download images using this url
SliME training consists of three stages: (1) training the global projector and attention adapter specifically; (2) training the local compression layer; and (3) training the full model.
SliME is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size
and increase the gradient_accumulation_steps
accordingly. Always keep the global batch size the same: per_device_train_batch_size
x gradient_accumulation_steps
x num_gpus
.
Please make sure you download and organize the data following Preparation before training.
If you want to train and finetune SliME, please run the following command for SliME-7B with image size 336:
bash scripts/vicuna/vicuna_7b_pt.sh
bash scripts/vicuna/vicuna_7b_sft.sh
or for SliME-8B with image size 336:
bash scripts/llama/llama3_8b_pt.sh
bash scripts/llama/llama3_8b_sft.sh
Because we reuse the pre-trained projecter weights from the SliME-7B, you can directly use the sft commands stage-3 instruction tuning by changing the PROJECTOR_DIR
:
bash scripts/llama/llama3_8b_sft.sh
Please find more training scripts of in scripts/
.
We perform evaluation on several image-based benchmarks. Please see Evaluation for the detailes.
If you want to evaluate the model on image-based benchmarks, please use the scripts in scripts/MODEL_PATH/eval
.
For example, run the following command for TextVQA evaluation with SliME-7B:
bash scripts/llama/eval/textvqa.sh
Please find more evaluation scripts in scripts/MODEL_PATH
.
The evaluation code and needed files can be found here.
We provide some examples in this section. More examples can be found in our project page.
If you find this repo useful for your research, please consider citing the paper
@article{zhang2024beyond,
title={Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models},
author={Zhang, Yi-Fan and Wen, Qingsong and Fu, Chaoyou and Wang, Xue and Zhang, Zhang and Wang, Liang and Jin, Rong},
journal={arXiv preprint arXiv:2406.08487},
year={2024}
}
We would like to thank the following repos for their great work:
The data and checkpoint is intended and licensed for research use only. They are also restricted to uses that follow the license agreement of LLaVA, LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.