
Contextual Object Detection with Multimodal Large Language Models

S-Lab, Nanyang Technological University

Currently, we only offer the Hugging Face demo code. The CODE dataset and training scripts will be made available once this paper is accepted.

🌟 Contextual Object Detection

Recent Multimodal Large Language Models (MLLMs) are remarkable at vision-language tasks such as image captioning and question answering, but they lack an essential perception ability: object detection. In this work, we address this limitation by introducing a novel research problem, contextual object detection: understanding visible objects within different human-AI interactive contexts. We investigate three representative scenarios: the language cloze test, visual captioning, and question answering.

Comparison with Related Work

| Task | Language Input | Output(s) | Remark |
| --- | --- | --- | --- |
| Object Detection | / | box, class label | pre-defined class labels |
| Open-Vocabulary Object Detection | (optional) class names for CLIP | box, class label | pre-defined class labels |
| Referring Expression Comprehension | complete referring expression | box the expression refers to | / |
| Contextual Cloze Test (ours) | incomplete expression with object names masked | {box, name} to complete each mask | name can be any valid English word |
| Image Captioning | / | language caption | / |
| Contextual Captioning (ours) | / | language caption, box | / |
| Visual Question Answering | language question | language answer | / |
| Contextual QA (ours) | language question | language answer, box | / |
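
To make the three contextual task formats concrete, here is an illustrative sketch of possible inputs and outputs. The field names, box format, and values are our own inventions for illustration; they are not the official CODE dataset schema.

```python
# Illustrative only: hypothetical input/output structures for the
# three contextual tasks described in the table above.

# Contextual Cloze Test: object names in the expression are masked;
# the model must fill each mask with a name and localize it.
cloze_example = {
    "input": "a [MASK] is chasing a [MASK] across the yard",
    "output": [
        {"name": "dog", "box": [52, 110, 310, 420]},   # [x1, y1, x2, y2]
        {"name": "cat", "box": [330, 95, 560, 400]},
    ],
}

# Contextual Captioning: no language input; the model generates a
# caption plus boxes for the object words it mentions.
captioning_example = {
    "input": None,
    "output": {
        "caption": "a dog chasing a cat across the yard",
        "boxes": {"dog": [52, 110, 310, 420], "cat": [330, 95, 560, 400]},
    },
}

# Contextual QA: the model answers a question and grounds the answer.
qa_example = {
    "input": "what is the dog chasing?",
    "output": {"answer": "a cat", "box": [330, 95, 560, 400]},
}
```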

😎 Method

We present ContextDET, a novel generate-then-detect framework specialized for contextual object detection. ContextDET is end-to-end and consists of three key architectural components:

  1. a visual encoder that extracts high-level image representations and computes visual tokens,
  2. a pre-trained LLM that decodes multimodal contextual tokens with a task-related multimodal prefix, and
  3. a visual decoder that predicts matching scores and bounding boxes for conditional queries linked to contextual object words.

The generate-then-detect framework enables us to detect objects named by words from the full human vocabulary, rather than a pre-defined set of class labels.
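
A minimal, runnable sketch of the generate-then-detect flow is below. All module internals, dimensions, and the number of object-word queries are placeholder assumptions for illustration, not the actual ContextDET architecture; see the paper for details.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Stand-in image backbone that produces visual tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=32, stride=32)

    def forward(self, images):                   # (B, 3, 224, 224)
        feats = self.patchify(images)            # (B, dim, 7, 7)
        return feats.flatten(2).transpose(1, 2)  # (B, 49, dim) visual tokens

class FrozenLLM(nn.Module):
    """Stand-in for the pre-trained LLM that decodes multimodal tokens."""
    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, prefix_tokens, visual_tokens):
        # "Generate" step: contextualize the task prefix with visual tokens.
        return self.blocks(torch.cat([visual_tokens, prefix_tokens], dim=1))

class VisualDecoder(nn.Module):
    """Predicts a matching score and a box for each conditional query."""
    def __init__(self, dim=256):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)
        self.box_head = nn.Linear(dim, 4)        # (cx, cy, w, h), normalized

    def forward(self, queries):
        return self.score_head(queries), self.box_head(queries).sigmoid()

encoder, llm, decoder = VisualEncoder(), FrozenLLM(), VisualDecoder()
images = torch.randn(2, 3, 224, 224)
prefix = torch.randn(2, 8, 256)    # task-related multimodal prefix tokens
contextual = llm(prefix, encoder(images))
# "Detect" step: latent tokens of generated object words become
# conditional queries for the visual decoder.
object_queries = contextual[:, -8:, :]
scores, boxes = decoder(object_queries)
print(scores.shape, boxes.shape)   # torch.Size([2, 8, 1]) torch.Size([2, 8, 4])
```

The key design choice this sketch mirrors is that the queries handed to the visual decoder are derived from generated language tokens, so detection is conditioned on the LLM's output rather than on a fixed label set.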

🥰 Qualitative Examples

💻 Try Demo

🤗 You can try our demo on Hugging Face Spaces. To avoid waiting in the queue and to speed up inference, consider duplicating the Space and using GPU resources.

🤗 If you want to run the demo on your own machine with a GPU, follow these steps:

  1. Install the required Python packages:

     pip install -r requirements.txt

  2. Download the checkpoint file from the following URL and save it in your local directory.

  3. Run the demo:

     python app.py

The demo web page should then open in your browser.
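
For orientation, the demo is a Gradio app; the sketch below shows the generic wiring pattern such an app.py typically follows. This is our illustration with a placeholder predict function, not the repository's actual code.

```python
import gradio as gr

def predict(image, text):
    # Placeholder for the actual ContextDET inference call in app.py.
    return f"received prompt: {text}"

demo = gr.Interface(
    fn=predict,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Prompt")],
    outputs=gr.Textbox(label="Result"),
)
demo.launch()  # serves locally at http://127.0.0.1:7860 by default
```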

📝 Citation

If you find our work useful, please consider citing it:

@article{zang2023contextual,
  author = {Zang, Yuhang and Li, Wei and Han, Jun and Zhou, Kaiyang and Loy, Chen Change},
  title = {Contextual Object Detection with Multimodal Large Language Models},
  journal = {arXiv preprint arXiv:2305.18279},
  year = {2023}
}

📋 License

This project is licensed under the S-Lab License 1.0. Redistribution and use for non-commercial purposes should follow this license.

😃 Acknowledgement

We acknowledge the use of the following public code in this project: DETR, Deformable DETR, DETA, OV-DETR, and BLIP-2.

📧 Contact

If you have any questions, please feel free to contact Yuhang Zang (zang0012 AT ntu.edu.sg).
