Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives
Abstract: Foundation models have indeed made a profound impact on various fields, emerging as pivotal components that significantly shape the capabilities of intelligent systems. In the context of intelligent vehicles, leveraging the power of foundation models has proven to be transformative, offering notable advancements in visual understanding. Equipped with multi-modal and multi-task learning capabilities, multi-modal multi-task visual understanding foundation models (MM-VUFMs) effectively process and fuse data from diverse modalities and simultaneously handle various driving-related tasks with powerful adaptability, contributing to a more holistic understanding of the surrounding scene. In this survey, we present a systematic analysis of MMVUFMs specifically designed for road scenes. Our objective is not only to provide a comprehensive overview of common practices, referring to task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques, but also to highlight their advanced capabilities in diverse learning paradigms. These paradigms include openworld understanding, efficient transfer for road scenes, continual learning, interactive and generative capability. Moreover, we provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
Authors: Sheng Luo, Wei Chen, Wanxin Tian, Rui Liu, Luanxuan Hou, Xiubao Zhang, Haifeng Shen, Ruiqi Wu, Shuyi Geng, Yi Zhou, Ling Shao, Yi Yang, Bojun Gao, Qun Li and Guobin Wu
This is an overview of our survey as below where we delve into MM-VUFMs from required prerequisites, currently common practices, advanced foundation models from diverse learning paradigms, key challenges and future trends.
We also systematicaly review currently common practices of visual understanding on road scenes from task-specific models, unified multi-task models, unified multi-modal models and prompting foundation models, respectively.
Moreover, advanced capabilities on diverse learning paradigms are highlighted as below, involving open-world understanding, efficient transfer for road scenes, continual learning, learn to interact and generative foundation models, respectively.
- [2024.02.05] Our survey is available at hear.
- Related Surveys
- Task-specific Models
- Unified Multi-task Models
- Unified Multi-modal Models
- Prompting Foundation Models
- Related Datasets
- Towards Open-world Understanding
- Efficient Transfer for Road Scenes
- Continual Learning
- Learn to Interact
- Generative Foundation Models
- Closed-loop Driving Systems
- Interpretability
- Low-resource Condition
- Embodied Driving Agent
- World Model
This work was supported by DiDi GAIA Research Cooperation Initiative. If you find this work useful, please consider cite:
@article{luo2024delving,
title={Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives},
author={Luo, Sheng and Chen, Wei and Tian, Wanxin and Liu, Rui and Hou, Luanxuan and Zhang, Xiubao and Shen, Haifeng and Wu, Ruiqi and Geng, Shuyi and Zhou, Yi and others},
journal={arXiv preprint arXiv:2402.02968},
year={2024}
}