
[ICCV 2023] SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection

[demo video]

Abstract

We propose SparseFusion, a novel multi-sensor 3D detection method that exclusively uses sparse candidates and sparse representations. Specifically, SparseFusion utilizes the outputs of parallel detectors in the LiDAR and camera modalities as sparse candidates for fusion. We transform the camera candidates into the LiDAR coordinate space by disentangling the object representations. Then, we fuse the multi-modality candidates in a unified 3D space with a lightweight self-attention module. To mitigate negative transfer between modalities, we propose novel semantic and geometric cross-modality transfer modules that are applied prior to the modality-specific detectors. SparseFusion achieves state-of-the-art performance on the nuScenes benchmark while also running at the fastest speed.
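As a rough illustration of the candidate-level fusion described above, the sketch below runs self-attention over the concatenated LiDAR and camera candidates. It is an illustrative stand-in rather than the repository implementation; the module name, feature dimension, and box parameterization are assumptions.

# Illustrative sketch only; NOT the repository code. Shapes and heads are assumed.
import torch
import torch.nn as nn

class CandidateFusion(nn.Module):
    """Toy stand-in for the lightweight self-attention fusion module."""
    def __init__(self, embed_dim=256, num_heads=8, num_classes=10):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.norm = nn.LayerNorm(embed_dim)
        self.cls_head = nn.Linear(embed_dim, num_classes)
        self.box_head = nn.Linear(embed_dim, 9)  # e.g. center, size, yaw, velocity

    def forward(self, lidar_feats, camera_feats_in_lidar):
        # lidar_feats:           (B, N_lidar,  C) features of LiDAR candidates
        # camera_feats_in_lidar: (B, N_camera, C) camera candidates already
        #                        transformed into the LiDAR coordinate space
        cands = torch.cat([lidar_feats, camera_feats_in_lidar], dim=1).transpose(0, 1)
        fused, _ = self.attn(cands, cands, cands)          # (N, B, C)
        fused = self.norm(cands + fused).transpose(0, 1)   # back to (B, N, C)
        return self.cls_head(fused), self.box_head(fused)

# toy usage with random candidate features
fusion = CandidateFusion()
cls_logits, boxes = fusion(torch.randn(2, 200, 256), torch.randn(2, 200, 256))
print(cls_logits.shape, boxes.shape)  # (2, 400, 10) and (2, 400, 9)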

[paper link] [Chinese summary (自动驾驶之心)]

Updates

[2023-8-21] Greatly improved training GPU memory efficiency (45 GB -> 29 GB) with no loss in performance or speed!

[2023-7-13] 🔥SparseFusion has been accepted to ICCV 2023!🔥

[2023-3-21] We release the first version of the SparseFusion code.

Overview

[teaser figure]

Compared to existing fusion algorithms, SparseFusion achieves state-of-the-art performance as well as the fastest inference speed on the nuScenes test set. †: The official repository of AutoAlignV2 uses flipping as test-time augmentation. ‡: We use the BEVFusion-base results from the official BEVFusion repository to match the input resolutions of the other methods. §: Swin-T is adopted as the image backbone.

nuScenes Performance

We do not use any test-time augmentation or model ensembles to obtain these results. We have released the configuration files and pretrained checkpoints to reproduce our results.

Validation Set

| Image Backbone | Point Cloud Backbone | mAP  | NDS  | Link        |
|----------------|----------------------|------|------|-------------|
| ResNet50       | VoxelNet             | 70.5 | 72.8 | config/ckpt |
| Swin-T         | VoxelNet             | 71.0 | 73.1 | config/ckpt |

Test Set

| Image Backbone | Point Cloud Backbone | mAP  | NDS  |
|----------------|----------------------|------|------|
| ResNet50       | VoxelNet             | 72.0 | 73.8 |

Usage

Installation

  • We test our code in an environment with CUDA 11.5, Python 3.7, PyTorch 1.7.1, TorchVision 0.8.2, NumPy 1.20.0, and Numba 0.48.0.

  • We use mmdet==2.10.0 and mmcv==1.2.7 for our code. Please refer to their official installation instructions.

  • You can install mmdet3d==0.11.0 directly from our repo by running:

    cd SparseFusion
    pip install -e .
    
  • We use spconv==2.3.3. Please follow the official instructions to install it based on your CUDA version.

    pip install spconv-cuxxx 
    # e.g. pip install spconv-cu114	
    
  • You also need to install the deformable attention module with the following command; a quick environment sanity check is sketched after this list.

    pip install ./mmdet3d/models/utils/ops
    
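Once the steps above are complete, a quick sanity check such as the following (an optional sketch, not part of the repository) can confirm that the core packages import correctly and report their versions:

# quick environment sanity check; the versions listed above are the tested ones
import torch, torchvision, numpy, numba
import mmcv, mmdet, mmdet3d
import spconv.pytorch  # noqa: F401  (spconv 2.x import path)

print("CUDA available:", torch.cuda.is_available())
print("torch / torchvision:", torch.__version__, torchvision.__version__)
print("numpy / numba:", numpy.__version__, numba.__version__)
print("mmcv / mmdet / mmdet3d:", mmcv.__version__, mmdet.__version__, mmdet3d.__version__)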

Data Preparation

Download the full nuScenes dataset from the official website. You should have a folder structure like this:

SparseFusion
├── mmdet3d
├── tools
├── configs
├── data
│   ├── nuscenes
│   │   ├── maps
│   │   ├── samples
│   │   ├── sweeps
│   │   ├── v1.0-test
│   │   ├── v1.0-trainval

Then, you can preprocess the data in either of the following two ways; a quick check of the resulting files is sketched after the list.

  1. Run the following two commands sequentially.

    python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes
    python tools/combine_view_info.py
    
  2. Alternatively, you may directly download our preprocessed data from Google Drive and put the files in data/nuscenes.
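Either way, the preprocessing should leave nuScenes info files under data/nuscenes. The sketch below assumes the usual mmdet3d file names produced with --extra-tag nuscenes; adjust the names if your preprocessing produced different ones:

# hypothetical check that the preprocessed info files are in place
import os

data_root = "data/nuscenes"
expected = ["nuscenes_infos_train.pkl", "nuscenes_infos_val.pkl"]  # assumed names
for name in expected:
    path = os.path.join(data_root, name)
    print(path, "->", "found" if os.path.exists(path) else "MISSING")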

Initial Weights

Please download the initial weights for model training and put them in checkpoints/.

Train & Test

In our default setting, we train the model with 4 GPUs.

# training
bash tools/dist_train.sh configs/sparsefusion_nusc_voxel_LC_r50.py 4 --work-dir work_dirs/sparsefusion_nusc_voxel_LC_r50

# test
bash tools/dist_test.sh configs/sparsefusion_nusc_voxel_LC_r50.py ${CHECKPOINT_FILE} 4 --eval=bbox

Note: We use A6000 GPUs (48 GB memory per GPU) for model training. Training the SparseFusion model (ResNet50 backbone) requires ~29 GB of memory per GPU.

Contact

If you have any questions, feel free to open an issue or contact us at yichen_xie@berkeley.edu.

Acknowledgments

We sincerely thank the authors of mmdetection3d, TransFusion, BEVFusion, MSMDFusion, and DeepInteraction for providing their codes or pretrained weights.

Reference

If you find our work useful, please consider citing the following paper:

@article{xie2023sparsefusion,
  title={SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection},
  author={Xie, Yichen and Xu, Chenfeng and Rakotosaona, Marie-Julie and Rim, Patrick and Tombari, Federico and Keutzer, Kurt and Tomizuka, Masayoshi and Zhan, Wei},
  journal={arXiv preprint arXiv:2304.14340},
  year={2023}
}
