InstructVideo: A reasoning-centric video object segmentation dataset with QA annotations for Multi-modal Large Language Models.
**Important Notice:** This repository is currently under preparation. The dataset and all associated code will be publicly released no later than October 2026.
InstructVideo is a reasoning-centric video object segmentation dataset designed to evaluate and facilitate research on multi-modal large language models (MLLMs) for complex video understanding tasks.
| Statistic | Value |
|---|---|
| Videos | 1,788 |
| QA Pairs | 6,112 |
| Objects | 3,603 |
| Average instances per multiple-object sample | 3.77 |
| Max instances in a single sample | 16 |
- Reasoning-centric queries requiring world knowledge and temporal understanding
- Both single-object and multiple-object segmentation tasks
- Logical textual responses beyond simple mask prediction
- High-quality mask annotations for referred targets
InstructVideo/
├── train/
│   ├── videos/          # Training video clips
│   ├── masks/           # Segmentation mask annotations
│   └── annotations/     # QA pairs and textual responses
├── test/
│   ├── videos/          # Test video clips
│   ├── masks/           # Segmentation mask annotations
│   └── annotations/     # QA pairs and textual responses
└── README.md
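
Once the dataset is released, a quick check that a local copy matches the layout above might look like the following minimal sketch. The split and subdirectory names come from the tree above; the helper name `missing_dirs` is hypothetical and not part of any released code, and nothing is assumed about the file formats inside each folder.

```python
from pathlib import Path

# Split and subdirectory names are taken from the directory tree above.
SPLITS = ("train", "test")
SUBDIRS = ("videos", "masks", "annotations")


def missing_dirs(root):
    """Return the expected split subdirectories that are absent under root.

    Hypothetical helper for illustration only; it checks directory
    presence and does not inspect the files inside.
    """
    root = Path(root)
    return [
        root / split / sub
        for split in SPLITS
        for sub in SUBDIRS
        if not (root / split / sub).is_dir()
    ]
```

Calling `missing_dirs("InstructVideo")` returns an empty list when all six subdirectories are present, and otherwise lists the paths that still need to be downloaded or extracted.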