- (2024/07/10) Paper accepted to ITSC 2024!
The image processing pipeline first transforms the camera image into Bird’s-Eye View (BEV). The resulting BEV is then converted into a polar representation that maps directly to the Range-Azimuth (RA) image. Object detection is performed on RA image features fused with radar features from the radar decoder. The predictions obtained in the RA view are shown on the camera images, with ground-truth bounding boxes in green and predictions in blue.
Cameras can be used to perceive the environment around the vehicle, while affordable radar sensors are popular in autonomous driving systems because, unlike cameras, they withstand adverse weather conditions. However, radar point clouds are sparse, have low azimuth and elevation resolution, and lack the semantic and structural information of scenes, which generally results in lower radar detection performance. In this work, we directly use the raw range-Doppler (RD) spectrum of the radar data, thus avoiding conventional radar signal processing. Camera images are processed independently within the proposed image processing pipeline: we first transform them into the Bird's-Eye View (BEV) Polar domain and extract the corresponding features with our camera encoder-decoder architecture. The resulting feature maps are fused with the Range-Azimuth (RA) features recovered from the RD spectrum by the radar decoder to perform object detection. We evaluate our fusion strategy against existing methods not only in terms of accuracy but also in terms of computational complexity.
The camera-only and radar-only encoders each contain four ResNet-50-like blocks preceded by a pre-encoder block. The features from these blocks are named x0, x1, x2, x3, and x4. The thick blue curved arrow carries the encoder’s output to the decoder’s input, where the feature maps are expanded back to higher resolutions. The dotted lines represent the skip connections used to preserve spatial information. The features from the camera-only decoder and the radar-only decoder are then fused before being passed to the detection head, which finally predicts the objects in the Bird’s-Eye RA Polar view.
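As a rough illustration of the fusion step, the following is a minimal PyTorch sketch, assuming both decoders produce feature maps on the same RA grid; the channel sizes, the concatenation-based fusion, and the head outputs are placeholders rather than the exact released configuration.

```python
# Minimal sketch of the feature-level fusion described above.
# Channel sizes and the concatenation-based fusion are illustrative assumptions;
# see the released model code for the exact layer configuration.
import torch
import torch.nn as nn

class FusionDetectionHead(nn.Module):
    def __init__(self, cam_ch=256, rad_ch=256, fused_ch=256, num_outputs=3):
        super().__init__()
        # Fuse the camera-decoder and radar-decoder RA feature maps.
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_ch + rad_ch, fused_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(fused_ch),
            nn.ReLU(inplace=True),
        )
        # Per-cell predictions in the Bird's-Eye RA Polar view
        # (e.g. objectness plus range/azimuth offsets).
        self.head = nn.Conv2d(fused_ch, num_outputs, kernel_size=1)

    def forward(self, cam_feat, rad_feat):
        # cam_feat, rad_feat: (B, C, range_bins, azimuth_bins) on the same RA grid.
        fused = self.fuse(torch.cat([cam_feat, rad_feat], dim=1))
        return self.head(fused)
```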
The models are trained and tested on the RADIal dataset. The dataset can be downloaded here. Under resources/gen_polarimage, we provide functions that help in converting front-facing camera images to BEV Cartesian and subsequently to BEV Polar.
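For orientation, here is a minimal sketch of such a conversion using OpenCV; the homography H, image sizes, and ego-position geometry are placeholder assumptions, and the actual utilities under resources/gen_polarimage should be used for the RADIal calibration.

```python
# Minimal sketch: front camera image -> BEV Cartesian -> BEV Polar.
# Illustrative only; H and the geometry constants below are placeholders
# that must come from the camera calibration.
import cv2
import numpy as np

def camera_to_bev_polar(image, H, bev_size=(512, 512), ra_size=(512, 256)):
    """image: HxWx3 front camera frame, H: 3x3 homography to the ground plane."""
    # 1) Inverse perspective mapping to a Cartesian Bird's-Eye View.
    bev = cv2.warpPerspective(image, H, bev_size)

    # 2) Cartesian BEV -> polar (range-azimuth) grid. The ego vehicle is assumed
    #    to sit at the bottom-centre of the BEV image; max_range covers the grid.
    center = (bev_size[0] / 2.0, bev_size[1] - 1.0)
    max_range = float(bev_size[1])
    polar = cv2.warpPolar(bev, ra_size, center, max_range,
                          cv2.WARP_POLAR_LINEAR + cv2.INTER_LINEAR)
    # warpPolar sweeps the full 360 degrees; the azimuth sector covered by the
    # camera field of view can be cropped from the result afterwards.
    return bev, polar
```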
Vehicle detection performance on the RADIal dataset test split. RD, ADC, RPC, RT, and C denote Range-Doppler, Analog-to-Digital Converter signal, Radar Point Cloud, Range-Time signal, and Camera data, respectively. Best values are in bold and second-best values are underlined. †: re-implemented with only the detection head, as these are multi-tasking models. Missing values are indicated by "-", either because the code is unavailable or the value was not reported in the respective work.
Model | AP | AR | F1 | RE | AE | # | FPS | σ | Size | GPU cost |
---|---|---|---|---|---|---|---|---|---|---|
FFTRadNet† | 93.45 | 83.35 | 88.11 | 0.12 | 0.15 | 3.23 | 68.46 | 2.19 | 39.2 | 2.01 |
TFFTRadNet† | 90.80 | 88.31 | 89.54 | 0.15 | 0.13 | 9.08 | 54.37 | 4.28 | 109.5 | 2.04 |
ADCNet | 95 | 89 | 91.9 | 0.13 | 0.11 | - | - | - | - | - |
CMS | 96.9 | 83.49 | 89.69 | 0.45 | - | 7.7 | - | - | - | - |
ROFusion | 91.13 | 95.29 | 93.16 | 0.13 | 0.21 | 3.33 | 56.11 | 1.55 | 87.2 | 2.87 |
EchoFusion | 96.95 | 93.43 | 95.16 | 0.12 | 0.18 | 25.61 | - | - | 102.5 | - |
Ours | 95.75 | 91.35 | 93.49 | 0.11 | 0.09 | 6.58 | 58.91 | 1.28 | 79.8 | 2.06 |
- AP: Average Precision (%)
- AR: Average Recall (%)
- F1: F1 Score (%)
- RE: Range Error (meters)
- AE: Angle Error (degrees)
- #: Number of Parameters (in Millions)
- FPS: Frames Per Second; for a given model, an FPS value is computed for each frame in the test set and averaged (see the sketch after this list).
- σ: Standard Deviation computed from FPS values.
- Size: Size of the model in MB.
- GPU cost: GPU memory consumption during inference (GB).
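The following minimal sketch illustrates this per-frame FPS protocol, assuming a PyTorch model on a CUDA device; the exact measurement code lives in 4-FPS.py, and the model/loader names here are placeholders.

```python
# Minimal sketch of the per-frame FPS measurement (illustrative only).
import time
import numpy as np
import torch

@torch.no_grad()
def measure_fps(model, loader, device="cuda"):
    model.eval().to(device)
    per_frame_fps = []
    for batch in loader:                      # assumed: one frame per batch
        inputs = batch[0].to(device)          # placeholder for the actual loader layout
        torch.cuda.synchronize()
        start = time.time()
        model(inputs)
        torch.cuda.synchronize()
        per_frame_fps.append(1.0 / (time.time() - start))
    # Averaged FPS and its standard deviation (the sigma column above).
    return np.mean(per_frame_fps), np.std(per_frame_fps)
```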
All code is tested in the following environment:
- Linux (tested on Ubuntu 22.04)
- Python 3.9
- Clone the repo and set up the conda environment:
$ git clone "this repo"
$ conda create --prefix "your_path" python=3.9 -y
$ conda update -n base -c defaults conda
$ source activate "your_path"
- The following are the packages used:
$ conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
$ pip install -U pip
$ pip3 install pkbar
$ pip3 install tensorboard
$ pip3 install pandas
$ pip3 install shapely
$ pip3 install opencv-python
$ pip3 install einops
$ pip3 install timm
$ pip3 install scipy
$ pip3 install scikit-learn
$ pip3 install polarTransform
$ pip3 install matplotlib
$ pip3 install numpy==1.23
Please run the following to train the model. Camera-Radar fusion in BEV for object detection will be chosen by default.
$ python 1-Train.py
To evaluate the model performance, please load the trained model and run:
$ python 2-Evaluation.py
To obtain qualitative results, please load the trained model and run:
$ python 3-Test.py
To compute Frames Per Second (FPS), please load the trained model and run:
$ python 4-FPS.py
- Even though we focus only on object detection, there is scope to add a segmentation head to our model, which we leave open to the community or may pursue ourselves.
- We have also proposed an early fusion architecture that takes as input camera images and point cloud data from imaging radar and lidar sensors in perspective view. Our code can be extended for further analysis. All parameters can be found in our configuration file, available at config/config_allmodality.json.
- We plan to further accelerate this research using the K-Radar dataset.
- Thanks to Elektrobit Automotive GmbH and the Mobile Perception Systems Lab at Eindhoven University of Technology for their continuous support.
- This project is inspired by the RADIal codebase.
The repo is released under the BSD 3-Clause License.