
Inst-It: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

Website · arXiv · HF Dataset: Inst-It-Bench · HF Dataset: Inst-It-Dataset · HF Model: Inst-It · Leaderboard
Wujian Peng1,2*, Lingchen Meng1*, Yitong Chen1,2, Yiweng Xie1, Yang Liu1, Tao Gui1, Hang Xu3, Xipeng Qiu1,2, Zuxuan Wu1,2†, Yu-Gang Jiang1
1School of Computer Science, Fudan University  2Shanghai Innovation Institute  3Huawei Noah’s Ark Lab 
* Equal contributions  † Corresponding author 

🔥 News

  • Feb. 19, 2025 The Inst-It-Bench evaluation toolkit is released; you can evaluate your model now!
  • Dec. 11, 2024 Inst-It Dataset is now available here. Feel free to use it!
  • Dec. 5, 2024 Our checkpoints are available on Hugging Face.

🏆 Inst-It Bench

Inst-It Bench is a fine-grained multimodal benchmark for evaluating LMMs at the instance level.

  • Size: ~1,000 image QAs and ~1,000 video QAs
  • Splits: Image split and Video split
  • Evaluation Formats: Open-Ended and Multiple-Choice

See Evaluate.md to learn how to run the evaluation on Inst-It-Bench.
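If you just want to inspect the benchmark before running the full toolkit, it can be pulled from the Hugging Face Hub. The sketch below is a minimal example, not part of the official toolkit: the repository id is taken from the badge above, while the config name and field layout are placeholders, so check Evaluate.md for the exact arguments.

```python
from datasets import load_dataset

# Hypothetical config name "image_open_ended"; the image/video splits and
# open-ended/multiple-choice formats may be organized differently on the Hub.
bench = load_dataset("Inst-IT/Inst-It-Bench", "image_open_ended", split="test")

# Peek at a few examples to see the question/answer fields before wiring up a model.
for example in bench.select(range(3)):
    print(example)
```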

🏆 Inst-It Dataset

Inst-It Dataset can be downloaded here. To our knowledge, this is the first dataset that provides fine-grained annotations centered on specific instances. In total, the Inst-It Dataset includes:

  • 21k videos
  • 51k images
  • 21k video-level descriptions
  • 207k frame-level descriptions (51k images, 156k video frames); each frame-level description includes captions for 1) individual instances, 2) the entire image, and 3) the temporal changes
  • 335k open-ended QA pairs

We visualize the data structure in the figure below, and you can view a more detailed data sample here.


Click here to see the annotation format of the Inst-It Dataset (the first block is the video annotation format, the second is the image annotation format)
[
    {
        "video_id": int,
        "frame_level_caption": (annotation for each frame within this video)
          [
              {
                  "timestamp": int, (indicate the timestamp of this frame in the video, e.g. <1>)
                  "frame_name": string, (the image filename of this frame)
                  "instance_level": (caption for each instance within this frame)
                    {
                        "1": "caption for instance 1",
                        (more instance level captions ...)
                    },
                  "image_level": string, (caption for the entire frame)
                  "temporal_change": string (caption for the temporal changes relative to the previous frame)
              },
              (more frame level captions ...)
          ],
        "question_answer_pairs": (open ended question answer pairs)
          [
             {
                "question": "the question",
                "answer": "the corresponding answer"
              },
             (more question answer pairs ...)
          ],
        "video_level_caption": string, (a dense caption for the entire video, encompassing all frames)
        "video_path": string (the path to where this video is stored)
    },
    (more annotations for other videos ...)
]
[
    {
        "image_id": int,
        "instance_level_caption": (caption for each instance within this image)
          {
              "1": "caption for instance 1",
              (more instance level captions ...)
          },
        "image_level_caption": string, (caption for the entire image)
        "image_path": string (the path to where this image is stored)
    },
    (more annotations for other images ...)
]
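As a quick illustration of the video annotation format above, the following sketch walks one entry. The filename is a placeholder for the actual annotation file shipped with the dataset, and only the keys listed in the format above are assumed.

```python
import json

# Placeholder path; substitute the annotation file from the dataset release.
with open("inst_it_dataset_video.json") as f:
    videos = json.load(f)

video = videos[0]
print(video["video_level_caption"])              # dense caption for the whole video

for frame in video["frame_level_caption"]:       # one entry per annotated frame
    print(frame["timestamp"], frame["image_level"])
    for instance_id, caption in frame["instance_level"].items():
        print(f"  instance {instance_id}: {caption}")
    print("  temporal change:", frame["temporal_change"])

for qa in video["question_answer_pairs"]:        # open-ended QA pairs for this video
    print("Q:", qa["question"], "A:", qa["answer"])
```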

🌐 Model weights

We trained two models based on LLaVA-Next using our Inst-It-Dataset, which not only achieve outstanding performance on Inst-It-Bench but also demonstrate significant improvements on other generic image and video understanding benchmarks. We provide the checkpoints here:

| Model | Checkpoints |
| --- | --- |
| LLaVA-Next-Inst-It-Vicuna-7B | weights and docs |
| LLaVA-Next-Inst-It-Qwen2-7B | weights and docs |
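These checkpoints follow the LLaVA-NeXT codebase, so inference goes through that repository rather than a plain transformers pipeline. As a minimal sketch, the weights themselves can be fetched with huggingface_hub; the repository id below is inferred from the model name and should be checked against the linked model card.

```python
from huggingface_hub import snapshot_download

# Assumed repository id; see the model card linked above for the exact id
# and for inference instructions built on the LLaVA-NeXT codebase.
local_dir = snapshot_download("Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B")
print("Checkpoint downloaded to:", local_dir)
```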

📝 Todo

  • Release the Inst-It Bench data and evaluation code.
  • Release the Inst-It Dataset.
  • Release the checkpoint of our fine-tuned models.
  • Release the meta-annotation of Inst-It Dataset, such as instance segmentation masks, bounding boxes, and more ...
  • Release the annotation file of Inst-It Dataset, which follows the format in the LLaVA codebase.
  • Release the training code.

📧 Contact Us

Feel free to contact us if you have any questions or suggestions.

📎 Citation

If you find our work helpful, please consider citing our paper 📎 and starring our repo 🌟:

@article{peng2024inst,
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Xu, Hang and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2412.03565},
  year={2024}
}
