
🌋 LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

[Project Page] [arXiv] [Demo] [Model Zoo]

🔥 News

[2024/1/14] Our training code is released.

[2023/12/6] Our paper is available on arXiv.

Contents

  - Install
  - LLaVA-Grounding Weights
  - Demo
  - Training data
  - Training
  - Citation

Install

  1. Clone this repository and navigate to the LLaVA-Grounding folder:
git clone https://github.com/UX-Decoder/LLaVA-Grounding.git
cd LLaVA-Grounding
  2. Install required packages:
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training:
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
  4. Install the packages required by OpenSeeD and Semantic-SAM (see the sketch below).
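
Those two repositories each document their own setup, so the following is only a rough, non-authoritative sketch of the typical extra dependencies: a couple of pip installs plus compiling the custom deformable-attention CUDA ops. The ops path below is a placeholder; follow the OpenSeeD and Semantic-SAM READMEs for the actual steps.

# Sketch only; the exact packages and paths are defined by the OpenSeeD and
# Semantic-SAM READMEs, so treat everything here as an assumption.
pip install git+https://github.com/cocodataset/panopticapi.git
pip install git+https://github.com/openai/CLIP.git
# Both detectors ship custom deformable-attention CUDA ops that need to be
# compiled once before training or inference.
cd <path_to_deformable_attention_ops> && sh make.sh && cd -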

LLaVA-Grounding Weights

Please check out our Model Zoo for all public LLaVA-Grounding checkpoints and for instructions on how to use the weights.
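
If the checkpoint you choose is hosted on Hugging Face, one way to place it where the demo example below expects it is the Hugging Face CLI. The repository id here is a placeholder; substitute the id listed in the Model Zoo.

pip install -U "huggingface_hub[cli]"
# <model_zoo_repo_id> is a placeholder for the repository id from the Model Zoo.
huggingface-cli download <model_zoo_repo_id> --local-dir checkpoints/llava_grounding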

Demo

After downloading the model weights, run the following commands to launch the demo on your own machine.

CUDA_VISIBLE_DEVICES=0 python gradio_demo/LLaVA_G_Demo.py --path_vision_cfg path_to_vision_cfg --path_inter_cfg path_to_inter_cfg --model_path path_to_ckpt_dir

# for example, after downloading weights into checkpoints/llava_grounding
CUDA_VISIBLE_DEVICES=0 python gradio_demo/LLaVA_G_Demo.py --path_vision_cfg configs/openseed/openseed_swint_lang_joint_2st_v2_data_end_with_interaction.yaml --path_inter_cfg configs/semsam/idino_swint_1_part_data_llm_ref_feat_all_16_det_pretrainv1.yaml --model_path checkpoints/llava_grounding

Please refer to our Online Demo for more detailed usage guidance.

Training data

data
├── flickr30k_entities
│   ├── train/
│   ├── val/
│   └── annotations
│       ├── final_flickr_separateGT_train.json
│       └── final_flickr_separateGT_val.json
├── coco
│   ├── train2014/
│   ├── train2017/
│   ├── panoptic_train2017/
│   ├── panoptic_semseg_train2017/
│   └── annotations
│       ├── instances_train2017.json
│       ├── instances_train2017_gvc.json
│       ├── grounded_visual_chat_data.json
│       ├── instances_train2014_filter.json
│       ├── panoptic_train2017_filter.json
│       └── grounding_train2017.json
└── llava
    └── annotations
        ├── cap600k_brackets_all.json
        ├── llava_instruct_150k.json
        └── llava_instruct_150k_visual_prompt.json

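If you are building this layout from scratch, the directory skeleton can be created up front; the paths simply mirror the tree above, and the image folders are produced by the downloads described next.

# Create the annotation directories expected by the training configs;
# image folders (train2014/, train2017/, ...) come from the dataset downloads.
mkdir -p data/flickr30k_entities/annotations
mkdir -p data/coco/annotations
mkdir -p data/llava/annotations
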
Flickr30k

Please refer to MDETR's pre-processed Flickr30k data.

COCO

Please download the COCO train2014 and train2017 images, together with the panoptic segmentation and semantic segmentation data. The other annotations can be downloaded here.
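
For the standard COCO pieces, the commands below are one way to fetch them from the official cocodataset.org mirrors. The project-specific files (e.g. instances_train2017_gvc.json, grounded_visual_chat_data.json) still come from the link above, and the panoptic_semseg_train2017/ folder is typically generated from the panoptic annotations with the usual Detectron2-style preparation scripts rather than anything shown here.

# Standard COCO downloads; project-specific annotation JSONs are not included.
mkdir -p data/coco && cd data/coco
wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip
unzip -q '*.zip' && rm *.zip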

LLaVA

The processed annotations can be downloaded here.

Training

Stage 1

bash scripts/pretrain_joint.sh

Stage 2

bash scripts/finetune.sh

Stage 3

bash scripts/finetune_visual_prompt.sh

Citation

If you find LLaVA-Grounding useful for your research and applications, please cite using this BibTeX:

@misc{zhang2023llavagrounding,
      title={LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models},
      author={Hao Zhang and Hongyang Li and Feng Li and Tianhe Ren and Xueyan Zou and Shilong Liu and Shijia Huang and Jianfeng Gao and Lei Zhang and Chunyuan Li and Jianwei Yang},
      year={2023},
      publisher={arXiv}
}

@misc{liu2023llava,
      title={Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
      publisher={arXiv:2304.08485},
      year={2023}
}