Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives

Abstract: Foundation models have made a profound impact on various fields, emerging as pivotal components that significantly shape the capabilities of intelligent systems. In the context of intelligent vehicles, leveraging the power of foundation models has proven to be transformative, offering notable advancements in visual understanding. Equipped with multi-modal and multi-task learning capabilities, multi-modal multi-task visual understanding foundation models (MM-VUFMs) effectively process and fuse data from diverse modalities and simultaneously handle various driving-related tasks with powerful adaptability, contributing to a more holistic understanding of the surrounding scene. In this survey, we present a systematic analysis of MM-VUFMs specifically designed for road scenes. Our objective is not only to provide a comprehensive overview of common practices, covering task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques, but also to highlight their advanced capabilities in diverse learning paradigms. These paradigms include open-world understanding, efficient transfer for road scenes, continual learning, and interactive and generative capabilities. Moreover, we provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.

Authors: Sheng Luo, Wei Chen, Wanxin Tian, Rui Liu, Luanxuan Hou, Xiubao Zhang, Haifeng Shen, Ruiqi Wu, Shuyi Geng, Yi Zhou, Ling Shao, Yi Yang, Bojun Gao, Qun Li and Guobin Wu

😋Overview

Our survey examines MM-VUFMs from five angles: required prerequisites, current common practices, advanced foundation models under diverse learning paradigms, key challenges, and future trends.

We also systematically review current common practices for visual understanding of road scenes, covering task-specific models, unified multi-task models, unified multi-modal models, and foundation model prompting techniques.

Moreover, we highlight advanced capabilities across diverse learning paradigms: open-world understanding, efficient transfer for road scenes, continual learning, interactive capability, and generative foundation models.

💥News

[2024.02.05] Our survey is available on arXiv (arXiv:2402.02968).

🗺️Roadmap

📚Paper Collection

💗Acknowledgement & Citation

This work was supported by the DiDi GAIA Research Cooperation Initiative. If you find this work useful, please consider citing it:

@article{luo2024delving,
  title={Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives},
  author={Luo, Sheng and Chen, Wei and Tian, Wanxin and Liu, Rui and Hou, Luanxuan and Zhang, Xiubao and Shen, Haifeng and Wu, Ruiqi and Geng, Shuyi and Zhou, Yi and others},
  journal={arXiv preprint arXiv:2402.02968},
  year={2024}
}
