InstructVideo

InstructVideo: A reasoning-centric video object segmentation dataset with QA annotations for Multi-modal Large Language Models.

**📌 Important Notice:** This repository is currently under preparation. The dataset and all associated code will be publicly released no later than October 2026.

Dataset Overview

InstructVideo is a reasoning-centric video object segmentation dataset designed to evaluate and facilitate research on multi-modal large language models (MLLMs) for complex video understanding tasks.

Key Statistics

| Statistic | Value |
| --- | --- |
| Videos | 1,788 |
| QA pairs | 6,112 |
| Objects | 3,603 |
| Average instances per multiple-object sample | 3.77 |
| Max instances in a single sample | 16 |

Key Features

- Reasoning-centric queries requiring world knowledge and temporal understanding
- Both single-object and multiple-object segmentation tasks
- Logical textual responses beyond simple mask prediction
- High-quality mask annotations for referred targets

Dataset Structure

InstructVideo/
├── train/
│   ├── videos/        # Training video clips
│   ├── masks/         # Segmentation mask annotations
│   └── annotations/   # QA pairs and textual responses
├── test/
│   ├── videos/        # Test video clips
│   ├── masks/         # Segmentation mask annotations
│   └── annotations/   # QA pairs and textual responses
└── README.md
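
Once the dataset is released, a local copy could be checked against the layout above with a short script. The following is a minimal sketch, not part of the official tooling: the helper names `expected_dirs` and `missing_dirs` are hypothetical, and only the directory names shown in the tree are assumed. File names and annotation formats inside each directory are not specified here, so the sketch does not touch them.

```python
from pathlib import Path

# Directory names taken from the tree above; nothing else is assumed.
SPLITS = ("train", "test")
SUBDIRS = ("videos", "masks", "annotations")


def expected_dirs(root):
    """Return {split: {subdir: Path}} for the documented layout (hypothetical helper)."""
    root = Path(root)
    return {
        split: {sub: root / split / sub for sub in SUBDIRS}
        for split in SPLITS
    }


def missing_dirs(root):
    """List the documented directories that do not exist under root."""
    return [
        path
        for subdirs in expected_dirs(root).values()
        for path in subdirs.values()
        if not path.is_dir()
    ]


if __name__ == "__main__":
    import sys

    # Default root is an assumption: the dataset folder in the current directory.
    root = sys.argv[1] if len(sys.argv) > 1 else "InstructVideo"
    for path in missing_dirs(root):
        print(f"missing: {path}")
```

Running the script with the dataset root as its only argument prints one line per missing directory and prints nothing when the layout is complete.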
