Artemis

Artemis: Towards Referential Understanding in Complex Videos

Abstract

Videos carry rich visual information including object descriptions, actions, and interactions, but existing multimodal large language models (MLLMs) fall short in referential understanding scenarios such as video-based referring. In this paper, we present Artemis, an MLLM that pushes video-based referential understanding to a finer level. Given a video, Artemis receives a natural-language question with a bounding box in any video frame and describes the referred target in the entire video. The key to achieving this goal lies in extracting compact, target-specific video features, where we set a solid baseline by tracking and selecting spatiotemporal features from the video. We train Artemis on the newly established VideoRef45K dataset with 45K video-QA pairs and design a computationally efficient three-stage training procedure. Results are promising both quantitatively and qualitatively. Additionally, we show that Artemis can be integrated with video grounding and text summarization tools to understand more complex scenarios.
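To make the core idea concrete, below is a minimal sketch (not the official Artemis implementation; the function and variable names are illustrative) of compressing the tracked target's per-frame RoI features into a few representatives with k-means, using the torch-kmeans package that the install step below pulls in:

# Minimal sketch (not the official code): compress per-frame RoI features
# of the tracked target into k representative vectors via k-means.
import torch
from torch_kmeans import KMeans

def select_target_features(roi_feats, k=8):
    # roi_feats: (T, D), one RoI-pooled feature per frame for the tracked box.
    result = KMeans(n_clusters=k)(roi_feats.unsqueeze(0))  # torch-kmeans expects (B, N, D)
    return result.centers.squeeze(0)  # (k, D) compact target-specific features

# Example: 64 frames of 1024-dim RoI features -> 8 representative vectors.
compact = select_target_features(torch.randn(64, 1024))
print(compact.shape)  # torch.Size([8, 1024])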

Overview


The architecture of the Artemis model.

Install

  1. Clone this repository and navigate to the Artemis folder
git clone https://github.com/NeurIPS24Artemis/Artemis.git
cd Artemis
  2. Install packages
conda create -n artemis python=3.11 
conda activate artemis
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install decord opencv-python
pip install torch-kmeans
pip install pycocoevalcap
# CUDA 11.7
cd mmcv-1.4.7/
MMCV_WITH_OPS=1 pip install -e .
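
Once installation finishes, a quick import check (a hypothetical snippet, not part of the repository) confirms the main dependencies are usable:

# Hypothetical sanity check: the key dependencies import and CUDA is visible.
import torch, decord, cv2, mmcv, torch_kmeans, flash_attn
print(torch.__version__, mmcv.__version__, torch.cuda.is_available())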

Model

The base model and LoRA weights can be downloaded from Baidu Disk.
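
If the downloaded LoRA weights follow the Hugging Face PEFT layout (an assumption; see TRAIN_AND_VALIDATE.md for the official loading procedure), they can typically be merged into the base model along these lines, with placeholder paths and a loader class that may differ for the actual Artemis/Video-LLaVA model:

# Hypothetical sketch: merge LoRA weights into the base checkpoint with peft.
# Paths are placeholders; the real model class may not be AutoModelForCausalLM.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/artemis-base")
model = PeftModel.from_pretrained(base, "path/to/artemis-lora")
model = model.merge_and_unload()  # fold the LoRA deltas into the base weights
model.save_pretrained("path/to/artemis-merged")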

Training & Validating

Training and validation instructions are in TRAIN_AND_VALIDATE.md.

Acknowledgment

This project is based on Video-LLaVA (paper, code), LLaVA (paper, code), GPT4RoI (paper, code), and Video-ChatGPT (paper, code); thanks for their excellent work.

Citation

If you find Artemis useful for your research and applications, please cite using this BibTeX:

@misc{qiu2024artemis,
      title={Artemis: Towards Referential Understanding in Complex Videos}, 
      author={Jihao Qiu and Yuan Zhang and Xi Tang and Lingxi Xie and Tianren Ma and Pengyu Yan and David Doermann and Qixiang Ye and Yunjie Tian},
      year={2024},
      eprint={2406.00258},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
