🐬 Dolphins: Multimodal Language Model for Driving

📖 Overview

The quest for fully autonomous vehicles (AVs) demands systems that can navigate complex real-world scenarios with human-like understanding and responsiveness. In this paper, we introduce Dolphins, a novel vision-language model designed to acquire human-like driving abilities. Dolphins processes multimodal inputs comprising video (or image) data, text instructions, and historical control signals, and generates outputs that follow the provided instructions.

Building upon the open-source pretrained vision-language model OpenFlamingo, we tailored Dolphins to the driving domain by constructing driving-specific instruction data and conducting instruction tuning. Using the BDD-X dataset, we designed four distinct AV tasks and consolidated them into Dolphins to foster a holistic understanding of intricate driving scenarios.

As a result, the distinctive features of Dolphins fall into two dimensions: (1) the ability to provide a comprehensive understanding of complex and long-tailed open-world driving scenarios and solve a spectrum of AV tasks, and (2) the emergence of human-like capabilities including gradient-free rapid learning and adaptation via in-context learning, reflection and error recovery, and interoperability.

This repo provides the Dolphins code, released under the MIT License.

🔥 News

[TBD] We will release the training code and our benchmark within the next month!
[2024.7.11] We release the inference code and checkpoint.
[2024.7.1] Our paper was accepted to ECCV 2024.
[2023.12.3] We release the paper and the webpage of our project.

🧰: Requirements and Installation

Environment

git clone https://github.com/vlm-driver/Dolphins.git
cd Dolphins
conda create -n dolphin python==3.8
# Activate the new environment so the dependencies install into it
conda activate dolphin
pip install -r requirements.txt

Inference

python inference.py
# To run on your own video and instruction, edit inference.py.

📹 Start Demo

Launch a controller

python -m serve.controller --host 0.0.0.0 --port 10000

Launch a model worker

CUDA_VISIBLE_DEVICES=0 python -m serve.model_worker --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model_name dolphins --use_lora --num_gpus 1 --limit_model_concurrency 200

Launch a gradio web server

python -m serve.gradio_web_server_video --controller http://localhost:10000 --port 7862 --share
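
The three commands above share the controller and worker addresses, so changing a port means editing several places. As a minimal sketch (not part of this repo), the helper below assembles the same commands from one set of port values. The names `build_demo_commands` and `format_command` are hypothetical; the modules and flags are copied from the commands above, and nothing is launched, only printed.

```python
import shlex
from typing import Dict, List


def build_demo_commands(
    controller_port: int = 10000,
    worker_port: int = 40000,
    web_port: int = 7862,
) -> Dict[str, List[str]]:
    """Build the three demo commands as argument lists (hypothetical helper)."""
    controller_url = f"http://localhost:{controller_port}"
    worker_url = f"http://localhost:{worker_port}"
    return {
        "controller": [
            "python", "-m", "serve.controller",
            "--host", "0.0.0.0",
            "--port", str(controller_port),
        ],
        "model_worker": [
            "python", "-m", "serve.model_worker",
            "--controller", controller_url,
            "--port", str(worker_port),
            "--worker", worker_url,
            "--model_name", "dolphins",
            "--use_lora",
            "--num_gpus", "1",
            "--limit_model_concurrency", "200",
        ],
        "gradio_web_server": [
            "python", "-m", "serve.gradio_web_server_video",
            "--controller", controller_url,
            "--port", str(web_port),
            "--share",
        ],
    }


def format_command(name: str, argv: List[str], gpu: str = "0") -> str:
    """Render one command as a shell line; only the worker gets a GPU prefix."""
    prefix = f"CUDA_VISIBLE_DEVICES={gpu} " if name == "model_worker" else ""
    return prefix + shlex.join(argv)


if __name__ == "__main__":
    # Print the commands in launch order: controller, worker, web server.
    for name, argv in build_demo_commands().items():
        print(f"# {name}")
        print(format_command(name, argv))
```

Run each printed line in its own terminal, in the order shown. With the default arguments, the printed lines match the commands listed above.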

📑 Paper and Citation

If you find our work useful, please consider citing us!

@misc{ma2023dolphins,
      title={Dolphins: Multimodal Language Model for Driving}, 
      author={Yingzi Ma and Yulong Cao and Jiachen Sun and Marco Pavone and Chaowei Xiao},
      year={2023},
      eprint={2312.00438},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

💝 Acknowledgements

We thank the OpenFlamingo team and Otter team for their great contribution to the open-source community.
