# COMP9444 23T2 Group Project
### Leveraging Sensor Fusion for Enhanced Nighttime Object Detection in Autonomous Driving: A Comparative Study of Datasets and Deep Learning Models


# 1. Introduction, Motivation, and/or Problem Statement

### Introduction and Motivation
Since 2010, Advanced Driver-Assistance Systems (ADAS) and automated driving have rapidly emerged as significant trends in the automotive industry. These developments underline the importance of achieving high-performance on-road object detection under diverse operating conditions, including nighttime and extreme weather scenarios (Waldschmidt, C., Hasch, J. and Menzel, W., 2021). Such environments present unique challenges for traditional vision-based systems, necessitating novel approaches to ensure reliable and robust object detection.
![trends](images/motivation/trend.png)

Modern computer vision networks often perform poorly at nighttime object detection, that is, they detect objects inaccurately in low-luminosity environments. Nighttime factors such as shadows, limited luminosity, and reduced visibility make it challenging for a network to classify objects. This hinders the effectiveness and safety of existing computer vision applications such as surveillance, which require all-day monitoring.

Solving day/night object detection would bring significant real-world improvements, particularly in autonomous driving, surveillance, and security systems. It is not only an exciting technical challenge for researchers but also has the potential to open up new possibilities for neural network-based computer vision.

In recent years, sensor fusion technology has gained traction in addressing these challenges. The integration of various sensors such as camera, radar, and lidar, provides richer and more comprehensive data, which has been shown to enhance object detection capabilities (Nobis, F., Geisslinger, M., Weber, M., Betz, J. and Lienkamp, M., 2019). With the increasing prevalence of commercial vehicles equipped with these multiple sensor systems, sensor fusion methodologies have become an area of active research, particularly in the context of deep learning-based object detection.



### Our task
  This work aims to conduct a comparative study of state-of-the-art models based on both camera-only and sensor fusion methodologies, to understand the potential benefits of sensor fusion in improving nighttime object detection performance. Moreover, we seek to identify and evaluate the public datasets that can be utilized for future research in this domain. By elucidating the performance differences between various model architectures and data inputs, we hope to provide valuable insights for ongoing advancements in ADAS and automated driving technologies. Furthermore, by assessing the suitability of available datasets for such research, we aim to facilitate further exploration and development in this critical field.

### Related works
Researchers have made advancements in enhancing accuracy for low-light detection. An example is the REDI low-light enhancement algorithm, which effectively filters noise in low-light conditions and performs detection on the resulting image.
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/lowlight.png" />
</div>

Here, (a) through (d) show the stages of REDI filtering. However, the algorithm has several downsides, including loss of detail, over-correction, and high computational cost, which would add complexity and computational load to existing models.

From our literature review, object detection approaches fall into two groups according to their input sensors: single sensor-based and multiple sensor-based.

#### Single Sensor-Based Approaches
Single sensor-based approaches have gained significant traction in object detection tasks due to their simplicity and cost-effectiveness. One of the representative models in this category is YOLOv3 proposed by Redmon and Farhadi (2018). This model is an incremental improvement over previous YOLO models, showcasing remarkable performance in real-time object detection tasks.

Moreover, the performance of YOLOv8 in pedestrian detection was explored by Sjöberg (2023). This study provided crucial insights into the application of single sensor-based approaches in pedestrian detection tasks, one of the most challenging scenarios in autonomous driving.

Also noteworthy is the work of Yan et al. (2022), which introduced 2DPASS, a system that utilizes 2D priors for semantic segmentation on lidar point clouds. This method effectively leverages lidar data for object detection, further advancing the capabilities of single sensor-based models.

#### Multiple Sensor-Based Approaches
The fusion of data from multiple sensors has emerged as a promising avenue to improve object detection performance. Broedermann et al. (2023) proposed HRFuser, a multi-resolution sensor fusion architecture for 2D object detection. By leveraging different sensor data at varying resolutions, HRFuser demonstrated a remarkable ability to detect objects under diverse conditions.

These studies collectively indicate the substantial potential of both single sensor and multiple sensor-based approaches in tackling the complexities of object detection tasks. As the evolution of sensor technologies continues, we expect further improvements in object detection performance, particularly under challenging conditions such as nighttime and extreme weather scenarios. This underscores the relevance of our current study, which aims to evaluate and compare these approaches across various conditions.


### Problem Statements
Key challenges that our models need to address are:
1. Handling varying levels of brightness within an image.
2. Removing noise from nighttime images, as images taken at night tend to contain more noise.

# 2. Exploration Analysis or Data or RL Tasks

## I. Dataset discovery
| Dataset | mmWave Radar Amount | Radar Features | Radar Sensor Amount | Radar Position | Time | Weather | Labelling Box Type | Sensor-LiDAR | Sensor-Camera |
| ------- | ------------------- | -------------- | ------------------- | -------------- | ---- | ------- | ------------------ | ------------ | ------------- |
| Astyx | 500 | Point cloud | not specified | not specified | day | 1 | 3D | Yes | Yes |
| CARRADA | 12600 | Point cloud, Doppler, Range, Azimuth | 1 | front | day | 1 | 2D | No | Yes |
| CRUW | 396000 | Point cloud, Doppler, Range, Azimuth | not specified | not specified | day | 1 | Point | No | Yes |
| K-Radar | 35000 | Point cloud, Doppler, Range, Azimuth, Elevation | 1 | front | day/night | 5 | 3D | Yes | Yes |
| NuScenes | 40000 | Point cloud | 5 | 3 Front + 2 Back radar | day/night | 3 | 3D | Yes | Yes |
| RADDet | 10000 | Point cloud, Doppler, Range, Azimuth | 1 | front | day | 1 | 2D | No | Yes |
| RADIATE | 44000 | Point cloud, Doppler, Range, Azimuth | 1 | top | day/night | 4 | 2D | Yes | Yes |
| VoD | 87000 | Point cloud | 1 | front | day | 1 | 3D | Yes | Yes |
| Zendar | 4800 | Point cloud, Doppler, Range, Azimuth | 1 | front | day | 1 | 2D | Yes | Yes |



Note: links to all datasets are in the reference section.

The nuScenes dataset is chosen for the experiment for the following reasons:

- **Popular in Research Area**: The nuScenes dataset is highly recognized with 3104 citations, reflecting its extensive usage in road object detection research.

- **Day & Night Samples**: Including both day and night conditions ensures a wide variety of scenarios, enhancing the robustness of the experiment.

- **Sensor-Rich**: Featuring data from LiDAR, radar, and cameras provides a comprehensive view, allowing for sophisticated analysis and improved accuracy.

- **Diverse and Realistic Scenarios**: Comprising diverse driving scenarios from real-world environments offers a realistic understanding of real-world driving situations.

- **Well-Documented and Structured**: The well-organized and detailed annotations facilitate efficient preprocessing and more focused model development.

- **Inclusion of Adverse Weather Conditions**: Scenarios with rain and fog add additional complexity vital for developing weather-robust algorithms.

Overall, the nuScenes dataset is well suited to our project.







## II. Data preprocessing
nuScenes v1.0 (https://www.nuscenes.org/nuscenes): 23 categories, 6975 instances, 12 sensors, and 3977 samples.

![data sample](images/preprocessing/nuscenes_dataset_samples_1.png)
![data sample](images/preprocessing/nuscenes_dataset_samples_2.png)






 - Data preprocessing tasks:
   - Simplifying the classes (23 classes -> 3 classes)
   - Converting 3D bounding boxes to 2D bounding boxes
   - Labelling each sample as day or night

In [9]:
# Simplify the classes (23 -> 3)
from nuscenes.nuscenes import NuScenes

# Local dataset root; adjust to your own path
nusc = NuScenes(dataroot=r'D:\Projects\onRoadDatasets\StreamPETR\data\sets\nuscenes', verbose=False)

# Collect all category names
cat = [c['name'] for c in nusc.category]
print(f"There are {len(cat)} classes")

# Group the categories by substring match on the category name
class_human = [i for i in cat if "human" in i]
# Note: "cycle" also matches 'static_object.bicycle_rack'
class_bicycle = [i for i in cat if "cycle" in i]
# Motorcycles and bicycles are excluded from the vehicle group
class_vehicle = [i for i in cat if "vehicle" in i and i not in class_bicycle]
print(f"human class: {class_human}")
print(f"vehicle class: {class_vehicle}")
print(f"bicycle class: {class_bicycle}")


There are 23 classes
human class: ['human.pedestrian.adult', 'human.pedestrian.child', 'human.pedestrian.wheelchair', 'human.pedestrian.stroller', 'human.pedestrian.personal_mobility', 'human.pedestrian.police_officer', 'human.pedestrian.construction_worker']
vehicle class: ['vehicle.car', 'vehicle.bus.bendy', 'vehicle.bus.rigid', 'vehicle.truck', 'vehicle.construction', 'vehicle.emergency.ambulance', 'vehicle.emergency.police', 'vehicle.trailer']
bicycle class: ['vehicle.motorcycle', 'vehicle.bicycle', 'static_object.bicycle_rack']


Classes that we are detecting:

- human
- car
- bicycle
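
The grouping above can also be expressed as a per-annotation lookup. The following is a minimal sketch, assuming the same substring rules as the notebook cell; the function name `simplify_category` and the label strings are illustrative choices, not part of the nuScenes devkit. Like the original cell, it assigns `static_object.bicycle_rack` to the bicycle group.

```python
from typing import Optional


def simplify_category(name: str) -> Optional[str]:
    """Map a nuScenes category name to one of the three detection classes.

    Returns None for categories outside the three groups.
    """
    # Check "cycle" first: motorcycle and bicycle names also contain "vehicle".
    if "cycle" in name:
        return "bicycle"
    if "human" in name:
        return "human"
    if "vehicle" in name:
        return "car"
    return None


for category in ("vehicle.motorcycle", "human.pedestrian.adult",
                 "vehicle.bus.rigid", "movable_object.barrier"):
    print(category, "->", simplify_category(category))
```

Checking `"cycle"` before `"vehicle"` mirrors the `i not in class_bicycle` filter in the cell above.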

#### Day/Night Labelling and 3D bounding box to 2D bounding box


In [16]:
import json

out_sample_path = set()      # unique CAM_FRONT image paths
out_sample_anno = dict()     # image path -> list of 3D boxes
out_sample_scene = dict()    # image path -> scene index
out_sample_is_day = dict()   # image path -> True for day, False for night

for i in range(10):
    my_scene = nusc.scene[i]
    # A scene counts as night if its description mentions "night"
    is_day = "night" not in my_scene['description'].lower()

    first_sample_token = my_scene['first_sample_token']
    my_sample = nusc.get('sample', first_sample_token)

    # Walk the chain of CAM_FRONT frames until 'next' is empty
    token = my_sample['data']['CAM_FRONT']
    while token:
        front_cam_sample = nusc.get('sample_data', token)
        filename = front_cam_sample['filename']
        out_sample_path.add(filename)
        # get annotation
        out_sample_anno[filename] = nusc.get_boxes(front_cam_sample['token'])
        out_sample_is_day[filename] = is_day
        out_sample_scene[filename] = i
        token = front_cam_sample['next']

# Box objects are not JSON-serializable, so store their string form
out_sample_anno_json = {k: str(v) for k, v in out_sample_anno.items()}

with open('preprocessing/front_cam_sample_path.json', 'w') as file:
    json.dump(list(out_sample_path), file)

with open('preprocessing/front_cam_sample_anno.json', 'w') as file:
    json.dump(out_sample_anno_json, file)

with open('preprocessing/front_cam_sample_scene.json', 'w') as file:
    json.dump(out_sample_scene, file)

with open('preprocessing/front_cam_sample_is_day.json', 'w') as file:
    json.dump(out_sample_is_day, file)

num_day = sum(out_sample_is_day.values())
print(f"Sample annotations: {len(out_sample_anno)}")
print(f"Day/Night samples: {num_day} / {len(out_sample_is_day) - num_day}")


Sample annotations: 2342
Day/Night samples: 1638 / 704


#### 2D bounding box generation
The 3D bounding boxes are converted to 2D via [the code](preprocessing/class_simplified_and_2D_BB_generation.ipynb).
![bb_3d_to_2d](images/preprocessing/3d_to_2d_bb.png)
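
To illustrate the idea behind the conversion, here is a minimal sketch, not the linked notebook's code. It assumes the eight box corners are already expressed in the camera frame (z pointing forward) and that the camera follows a simple pinhole model. The intrinsics `fx`, `fy`, `cx`, `cy` and the example cube are hypothetical values; 1600 x 900 is the nuScenes camera image size.

```python
from typing import Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]
Box2D = Tuple[float, float, float, float]


def project_box_to_2d(corners_cam: Sequence[Point3D],
                      fx: float, fy: float, cx: float, cy: float,
                      img_w: int, img_h: int,
                      min_depth: float = 0.1) -> Optional[Box2D]:
    """Project 3D box corners (camera frame, z forward) to a clipped 2D box."""
    # Keep only corners in front of the camera.
    visible = [(x, y, z) for x, y, z in corners_cam if z > min_depth]
    if not visible:
        return None
    # Pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy.
    us = [fx * x / z + cx for x, y, z in visible]
    vs = [fy * y / z + cy for x, y, z in visible]
    # Axis-aligned extent of the projected corners, clipped to the image.
    x_min, x_max = max(min(us), 0.0), min(max(us), float(img_w))
    y_min, y_max = max(min(vs), 0.0), min(max(vs), float(img_h))
    if x_min >= x_max or y_min >= y_max:
        return None  # box falls entirely outside the image
    return (x_min, y_min, x_max, y_max)


# Hypothetical example: a 2 m cube centred 10 m in front of the camera.
cube = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (9.0, 11.0)]
print(project_box_to_2d(cube, fx=1000.0, fy=1000.0, cx=800.0, cy=450.0,
                        img_w=1600, img_h=900))
```

Dropping corners behind the camera is a simplification; a more careful version would clip the box edges against the near plane instead.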

## Applying radar point cloud to 2D image
Using [the code](preprocessing/nuscenes_viz.py), we can overlay the 3D point cloud data from radar/lidar on the 2D images. Here are some examples:

- First column: camera
- Second column: camera + lidar point cloud
- Third column: camera + radar point cloud

![point_cloud_to_img_1](images/preprocessing/apply_point_cloud_to_img_1.png)
![point_cloud_to_img_2](images/preprocessing/apply_point_cloud_to_img_2.png)
![point_cloud_to_img_3](images/preprocessing/apply_point_cloud_to_img_3.png)
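
The overlay boils down to a rigid transform followed by a camera projection. The sketch below is an illustration under simplifying assumptions, not the code in `nuscenes_viz.py`: it collapses the sensor-to-camera chain into a single hypothetical rotation matrix and translation vector, and uses hypothetical pinhole intrinsics.

```python
from typing import List, Sequence, Tuple


def project_points(points: Sequence[Sequence[float]],
                   rotation: Sequence[Sequence[float]],
                   translation: Sequence[float],
                   fx: float, fy: float, cx: float, cy: float,
                   img_w: int, img_h: int,
                   min_depth: float = 1.0) -> List[Tuple[float, float, float]]:
    """Return (u, v, depth) for every point that lands inside the image."""
    pixels = []
    for p in points:
        # Rigid transform into the camera frame: p_cam = R @ p + t.
        xc, yc, zc = (
            sum(rotation[r][k] * p[k] for k in range(3)) + translation[r]
            for r in range(3)
        )
        if zc < min_depth:
            continue  # behind or too close to the camera
        u = fx * xc / zc + cx
        v = fy * yc / zc + cy
        if 0 <= u < img_w and 0 <= v < img_h:
            pixels.append((u, v, zc))
    return pixels


# Hypothetical setup: sensor frame aligned with the camera frame.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cloud = [(0.0, 0.0, 10.0), (0.0, 0.0, -5.0), (100.0, 0.0, 10.0)]
print(project_points(cloud, identity, [0.0, 0.0, 0.0],
                     fx=1000.0, fy=1000.0, cx=800.0, cy=450.0,
                     img_w=1600, img_h=900))
```

The returned depth can be used to colour each drawn point, which is how such overlays usually convey distance.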




# 3. Models
We have thoroughly examined four distinct models - one baseline model and three alternative designs - in order to compare their performance based on variations in network architecture and input features. Our primary objective is to discern the most critical factors that can enhance the accuracy of nighttime object detection for deep learning models. All the models we explored are sourced from research papers published in recent years. Additionally, we have endeavored to fine-tune the parameters of each model to optimize them specifically for nighttime detection.

| Networks | Year | Sensor | Nighttime Object Detection Capabilities |
| -------- | ---- | ------ | --------------------------------------- |
| YOLOv3   | 2018 | Camera | Well-known and widely used model; its real-time detection capability can be advantageous for dynamic driving scenarios. |
| YOLOv8s  | 2023 | Camera | Improved detection accuracy over YOLOv3. Its suitability for embedded devices enables on-board processing, which is beneficial for real-time applications. |
| HRFuser  | 2022 | Camera, LiDAR, Radar | Excellent performance in low-light and foggy situations due to the fusion of different sensor data, and thus highly reliable for nighttime on-road object detection. |
| 2DPASS   | 2022 | Camera and LiDAR | Top model on the leaderboard; effective for nighttime detection due to the fusion of LiDAR and camera data, which offers depth perception that is beneficial in low-light conditions. |



## YOLOv3
Redmon, J. and Farhadi, A., 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
 - Baseline model: well-known architecture
 - Real-time detection
 - Based on camera images only
 - 106 layers (53-layer Darknet-53 backbone + 53 detection layers)



## YOLOv8s
Link to document: https://docs.ultralytics.com/

#### Model Introduction

YOLO (You Only Look Once) is a state-of-the-art single-stage object detection system. Since the initial release of YOLOv1 in 2016, the YOLO family has been refined and upgraded up to YOLOv8. We chose YOLOv8, the most recent iteration of the YOLO series at the time of this project, to conduct 2D object detection. Rather than a single model, YOLOv8 is released in multiple sizes, each with its own trade-off between speed and accuracy. Our project uses YOLOv8s. The versions include:
1. YOLOv8n ---- The nano version
2. YOLOv8s ---- The small version
3. YOLOv8m ---- The medium version
4. YOLOv8l ---- The large version

The performance of recent versions of YOLO is shown below:
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/yolov8 performance.png" />
</div>
Source: https://github.com/ultralytics/ultralytics


Architecture of YOLOv8 is shown below:
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/yolov8 network.jpeg" />
</div>
Source: https://arxiv.org/pdf/2304.00501.pdf



Compared with previous YOLO algorithms, YOLOv8 has a new backbone network, a new anchor-free detection head, and a new loss function. Its backbone is similar to YOLOv5's but replaces the CSPLayer with the C2f module, which combines high-level features with contextual information to improve detection accuracy. The detection head is anchor-free and decoupled, which allows each branch to focus on its own task and improves overall accuracy. The loss function uses CIoU and DFL for the bounding box loss and binary cross-entropy for the classification loss, which helps detection performance, especially for small objects.
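
To make the CIoU term concrete, the sketch below computes it for two axis-aligned boxes given as `(x1, y1, x2, y2)`, following the standard CIoU definition: IoU minus a normalised centre-distance penalty minus an aspect-ratio penalty. It is an illustration only, not the Ultralytics implementation, and it leaves out the DFL component. The function name `ciou` and the `eps` guard are our own choices.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]


def ciou(box_a: Box, box_b: Box, eps: float = 1e-9) -> float:
    """Complete IoU between two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Plain IoU from the intersection and union areas.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    iou = inter / (wa * ha + wb * hb - inter + eps)

    # Squared distance between the two box centres.
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = ((max(ax2, bx2) - min(ax1, bx1)) ** 2
          + (max(ay2, by2) - min(ay1, by1)) ** 2 + eps)

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(wb / (hb + eps))
                                - math.atan(wa / (ha + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v


print(ciou((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)))
print(ciou((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)))
```

The corresponding box loss is `1 - ciou(pred, target)`. Unlike plain IoU, it still varies with the distance between boxes when they do not overlap.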

## HRFuser

Link to paper: https://arxiv.org/pdf/2206.15157.pdf



#### Model Introduction

HRFuser is a multi-resolution sensor fusion architecture that easily scales to any number of input modalities. HRFuser is built on cutting-edge high-resolution networks for image-only dense prediction and includes a new multi-window cross-attention block to conduct fusion of many modalities at multiple resolutions.

While much recent research focuses on fusing specific pairs of sensors, such as camera with lidar or radar, by leveraging architectural components tailored to that setting, the literature lacks a general and modular sensor fusion architecture. HRFuser fills this gap as a modular architecture for multi-modal 2D object detection: it fuses multiple sensors at multiple resolutions and scales to an arbitrary number of input modalities.

HRFuser's architecture is shown below:


<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HR-Fuser-architecture.png" />
</div>


Because of its additional input branches, HRFuser can be trained on combined data from cameras and multiple other types of sensors.
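
The core operation behind cross-attention fusion can be sketched in a few lines. The example below is a heavily simplified illustration and not HRFuser's multi-window cross-attention block. It handles a single window and a single head, has no learned projections, and assumes that camera tokens act as queries while tokens from another sensor act as keys and values. All names are hypothetical.

```python
import math
from typing import List

Token = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(camera_tokens: List[Token],
                         sensor_tokens: List[Token]) -> List[Token]:
    """Fuse sensor features into camera features within one window."""
    dim = len(camera_tokens[0])
    fused = []
    for query in camera_tokens:
        # Scaled dot-product scores between one camera token and all sensor tokens.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in sensor_tokens]
        weights = softmax(scores)
        # Attention output: weighted sum of the sensor tokens (used as values).
        attended = [sum(w * tok[d] for w, tok in zip(weights, sensor_tokens))
                    for d in range(dim)]
        # Residual connection keeps the original camera feature.
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


print(cross_attention_fuse([[1.0, 0.0], [0.0, 1.0]],
                           [[1.0, 0.0], [0.0, 2.0]]))
```

Using one set of sensor tokens per extra modality and repeating the fusion is one way such a design can scale to more inputs.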

## 2DPASS
Link to paper: https://arxiv.org/pdf/2207.04397.pdf 

#### Model Introduction
2DPASS (2D Priors Assisted Semantic Segmentation) leverages an auxiliary modal fusion and multi-scale fusion-to-single knowledge distillation (MSFSKD) to acquire richer semantic and structural information from multi-modal data consisting of 2D camera images and 3D LiDAR point clouds.

<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/2DPASS_camera_lidar.png" />
</div>

LiDAR offers accurate depth information regardless of lighting conditions, so it is not inhibited by darkness and can effectively assist in day/night object detection. However, the data that LiDAR captures is sparse and textureless, and so may benefit from additional sources of data. Camera images provide dense color information and fine-grained textures, but they give ambiguous depth perception and can be unclear in low-light conditions such as at night. As cameras and LiDAR complement each other, 2DPASS was chosen because it uses both input modalities, which lets us check whether this provides better nighttime detection than models that use only 2D images.

<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/2DPASS_model_structure.png" />
</div>

The model uses two independent networks to extract multi-scale features from the 2D images and 3D point clouds in parallel. It has a 2D network which takes in a small patch (480 x 320) from the full camera image as the input and applies a ResNet34 encoder with 2D convolutions. The 3D network takes in 3D point clouds and uses sparse convolutions to take advantage of the sparse nature of the point cloud data.
 
Multi-scale fusion-to-single knowledge distillation (MSFSKD) then enhances the 3D network by providing textural information and structural regularisation from the 2D network. To transfer information between the two modalities, point-to-pixel correspondence is used to create paired features. The 3D features are transformed by a multilayer perceptron before being fused with their 2D counterparts, and the fused features are used to improve the 3D network. The features at each scale are then used to generate the semantic segmentation predictions.
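
The distillation step can be pictured as pulling the 3D network's per-point class distribution towards the distribution predicted from the fused features. The sketch below assumes this is measured with a KL divergence averaged over paired points. The direction of the divergence, the function names, and the toy logits are all assumptions made for illustration, not the 2DPASS implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(fused_logits: Sequence[Sequence[float]],
                      lidar_logits: Sequence[Sequence[float]]) -> float:
    """Mean KL(fused || lidar) over points paired by point-to-pixel lookup."""
    total = 0.0
    for fused, lidar in zip(fused_logits, lidar_logits):
        total += kl_divergence(softmax(fused), softmax(lidar))
    return total / len(fused_logits)


# Toy example: two points, three classes.
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
student = [[1.0, 1.0, 0.1], [0.2, 1.4, 0.3]]
print(distillation_loss(teacher, student))
```

In a training loop the fused branch would typically be treated as a fixed target for this term, so that gradients only update the 3D network.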

# 4. Model Results
## Overall comparison
| Networks | Year | Sensor | mAP (day/night) | IoU | Pros | Cons |
| -------- | ---- | ------ | -------------- | --- | ---- | ---- |
| YOLOv3   | 2018 | Camera | 48/24.1        | 23  | Lightweight | Low performance |
| YOLOv8s  | 2023 | Camera | 65/18.4        | 30  | Suitable for embedded devices | Low performance |
| HRFuser  | 2022 | Camera, LiDAR, Radar | 51/47 | 79 | High-quality detection in low-light and foggy situations | Requires multiple inputs; high computation cost |
| 2DPASS   | 2022 | Camera and LiDAR | 66/63 | 81 | Good for nighttime detection | High computation cost |


### YOLOv8s Result
#### Model Result
According to the table below, YOLOv8s performs better in daytime detection than nighttime detection. For nighttime detection, the mAP is only 18%.

| Dataset     | mAP (%)     |
|-------------|-------------|
|overall      | 40%         |
|daytime      | 65%         |
|nighttime    | 18%         |
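
For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision (AP) for a single class from detections ranked by confidence; mAP is the mean of AP over classes. It uses all-point interpolation and assumes each detection has already been matched to ground truth at a fixed IoU threshold. Evaluation toolkits differ in interpolation scheme and IoU thresholds, so this is illustrative rather than the exact procedure behind the numbers above.

```python
from typing import Sequence


def average_precision(is_true_positive: Sequence[bool],
                      num_ground_truth: int) -> float:
    """AP for one class; flags must be sorted by descending confidence."""
    if num_ground_truth == 0:
        return 0.0
    precisions, recalls = [], []
    true_positives = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        true_positives += int(hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the precision-recall curve.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Toy example: three detections, two ground-truth objects.
print(average_precision([True, False, True], num_ground_truth=2))
```

Evaluating the day and night subsets separately, as in the table, amounts to running the same computation on the detections and labels of each subset.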

Based on the confusion matrix of the model on the overall dataset, shown below, car has the highest precision, while the other categories are easily misdetected as background.
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/yolov8_confusion_matrix.png" width="700px" />
</div>

#### Prediction Output

The comparison between prediction and label shows that although YOLOv8s performs better than YOLOv3, the result is still not ideal due to misdetection.
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/yolov8_prediction.jpeg" width="700px" />
</div>

<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/yolov8_label.jpeg" width="700px" />
</div>


#### Model Tuning
To improve the precision of our model, we tuned it following the Ultralytics guides on best training results and hyperparameter evolution:
- https://docs.ultralytics.com/yolov5/tutorials/tips_for_best_training_results/
- https://docs.ultralytics.com/yolov5/tutorials/hyperparameter_evolution/

Among the parameters we adjusted, image size had one of the greatest effects. We tried 640 and 1280.
The results of the model are shown below:

| imagesize   | mAP (%)     |
|-------------|-------------|
|640          | 38%         |
|1280         | 40%         |

Thus, we chose imagesize = 1280, as it gives the better result.



### HRFuser Results

#### General sample images results:
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HR-Fuser_result.png" width="700px" />
</div>


<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HR-Fuser-Result-2.png" width="450px" />
</div>


#### Images output from the nuScenes mini dataset after training (note: these are only a few of more than 500 results):

<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HRFuser_Result.png" width="750px" />
</div>

##### Front Camera:

<div style="margin-top: 20px; margin-bottom: 20px; display: flex;">
<img src="./images/HRFuseroutput/Front-Camera/FC1.jpg" width="375px" />
<img src="./images/HRFuseroutput/Front-Camera/FC2.jpg" width="375px" />
</div>

##### Back Camera:

<div style="margin-top: 20px; margin-bottom: 20px; display: flex;">
<img src="./images/HRFuseroutput/Back-Camera/BC1.jpg" width="375px" />
<img src="./images/HRFuseroutput/Back-Camera/BC2.jpg" width="375px" />
</div>

##### Back Left Camera:

<div style="margin-top: 20px; margin-bottom: 20px; display: flex;">
<img src="./images/HRFuseroutput/Back-Left-Camera/BLC1.jpg" width="375px" />
    <img src="./images/HRFuseroutput/Back-Left-Camera/BLC2.jpg" width="375px" />
</div>

##### Back Right Camera:

<div style="margin-top: 20px; margin-bottom: 20px; display: flex;">
<img src="./images/HRFuseroutput/Back-Right-Camera/BR_Camera2.jpg" width="375px" />
<img src="./images/HRFuseroutput/Back-Right-Camera/BR-Camera1.jpg" width="375px" />
</div>


### 2DPASS Results

#### 2DPASS Trained on Mini-Dataset
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/mini.png" width="850px" />
</div>

#### 2DPASS Pretrained Model
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/pretrained.png" width="850px" />
</div>

#### Model Results (organised into a table)
**Per Class IoU**

| Class                         | IoU Our Training (%) | IoU Pretrained (%) |
|-------------------------------|----------|----------|
| Movable Object Barrier        |  0.00    |  84.60   |
| Bicycle               |  0.00    |  64.95   |
| Bus             |  23.21   |  92.30   |
| Car                   |  84.27   |  90.76   |
| Construction          |   NaN    |   NaN    |
| Motorcycle            |  0.28    |  71.84   |
| Human Pedestrian Adult        |  44.91   |  86.12   |
| Movable Object Traffic Cone   |  0.00    |  47.46   |
| Trailer               |  0.00    |  86.06   |
| Truck                 |  66.60   |  71.99   |
| Driveable Surface        |  89.26   |  96.10   |
| Other                    |  1.19    |  85.00   |
| Sidewalk                 |  33.79   |  72.88   |
| Terrain                  |  59.13   |  86.08   |
| Manmade                |  72.33   |  91.41   |
| Vegetation             |  69.03   |  91.09   |

**Global Metrics**

| Metric       | Our Training           | Pretrained              |
|--------------|------------------|---------------------|
| Accuracy      | 0.56             | 0.63                |
| mIoU     | 0.36             | 0.81                |

Both accuracy and mIoU are significantly higher for the pretrained model, which was trained on the full dataset. Note that this result is still worse than the one reported in the paper, as their model was trained with the additional validation set and instance-level augmentation.
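
As a reminder of how these numbers relate, the sketch below computes per-class IoU and mIoU from a confusion matrix whose rows are ground-truth classes and whose columns are predicted classes. A class with no ground-truth and no predicted points gets NaN and is skipped in the mean, which is one plausible reading of the NaN entry for Construction in the table. The function names and the toy matrix are illustrative only.

```python
import math
from typing import List, Sequence


def per_class_iou(confusion: Sequence[Sequence[int]]) -> List[float]:
    """IoU per class: TP / (TP + FP + FN), NaN if the class never occurs."""
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                                   # missed points
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp   # false alarms
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else float("nan"))
    return ious


def mean_iou(ious: Sequence[float]) -> float:
    """Mean over classes, ignoring NaN entries."""
    valid = [v for v in ious if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else float("nan")


# Toy example with three classes; the third never appears.
toy = [[3, 1, 0],
       [1, 2, 0],
       [0, 0, 0]]
ious = per_class_iou(toy)
print(ious, mean_iou(ious))
```

Because mIoU weights every class equally, the many near-zero classes in our mini-dataset run pull the mean down far more than they affect overall accuracy.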

#### Epoch Training Steps
NOTE: The x-axis is the number of epochs.
<div style="display: flex; justify-content: space-around; align-items: center; margin-bottom: 30px;">
  <div style="text-align: center; margin-right: 5px;">
    <h4>mIoU vs Epoch</h4>
    <img src="./images/miou_r.png" alt="mIoU vs Epoch" width="500px" style="margin-bottom: 20px;" />
  </div>
  <div style="text-align: center; margin-left: 5px;">
    <h4>Best mIoU vs Epoch</h4>
    <img src="./images/miou.png" alt="Best mIoU vs Epoch" width="500px" style="margin-bottom: 20px;" />
  </div>
</div>

From the mIoU curve and the smoothed best-mIoU curve, we see no significant improvement in mIoU after around 8000 epochs. This indicates that further training beyond that point does not improve the model and could lead to overfitting.
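
A plateau like this can be detected automatically with a patience rule: stop once the monitored metric has not improved for a fixed number of evaluations. The sketch below is a generic illustration with hypothetical names and toy values, not the early-stopping callback used in training.

```python
from typing import Optional, Sequence


def first_plateau_index(values: Sequence[float], patience: int,
                        min_delta: float = 0.0) -> Optional[int]:
    """Index of the best value once `patience` later evaluations fail to beat it.

    Returns None if the metric keeps improving often enough.
    """
    best, best_index = float("-inf"), -1
    for i, value in enumerate(values):
        if value > best + min_delta:
            best, best_index = value, i      # new best: reset the wait
        elif i - best_index >= patience:
            return best_index                # waited long enough without gain
    return None


# Toy mIoU history: improvement stops after the third evaluation.
history = [0.10, 0.20, 0.30, 0.30, 0.29, 0.30]
print(first_plateau_index(history, patience=3))
```

A non-zero `min_delta` makes the rule ignore tiny fluctuations of the kind visible in the unsmoothed curve.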

##### Accuracy vs Epoch 
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/accuracy.png" width="500px" />
</div>
The accuracy during training behaves similarly to the mIoU curve, with optimum accuracy reached at around 8000 epochs.

#### 2DPASS Model Tuning
To improve the existing model, we tried hyperparameter tuning by increasing and decreasing the following parameters:
- Learning Rate (original value 0.24)
- Optimizer (tried Adam instead of SGD)
- Momentum (original value 0.9)

We retrained the model on the mini-dataset with small increments and decrements. However, we only ran this twice. The first run used the following parameters:
- Learning Rate (0.20)
- Optimizer (Adam)
- Momentum (0.85)

Results:

| Metric       | Our Training           | Original Training                |
|--------------|------------------|---------------------|
| Accuracy      | 0.41             | 0.56                |
| mIoU     | 0.24             | 0.36                |

The second run used:
- Learning Rate (0.3)
- Optimizer (Adam)
- Momentum (0.95)

Results:

| Metric       | Our Training           | Original Training              |
|--------------|------------------|---------------------|
| Accuracy      | 0.46             | 0.56               |
| mIoU     | 0.29             | 0.36                |

However, the results became worse for both runs (learning rates of 0.20 and 0.3, Adam optimizer, and momentums of 0.85 and 0.95 respectively). This suggests that the original hyperparameters were already near optimal for this model. We considered whether the smaller training dataset could have influenced this outcome, but we think it is unlikely.
Further hyperparameter tuning could not be tested due to time limitations, as each training run took 5 hours.

# Model Code
Our experimental setup involved parameter tuning, selection of evaluation metrics, and data splitting for training, validation, and testing. YOLOv8s was trained using the SGD optimizer with a learning rate of 0.01, a batch size of 16, and an IoU threshold of 0.7. For HRFuser, due to hardware constraints, we used a batch size of 1 and made use of pre-trained weights. We also used the mini training dataset of nuScenes due to computational and time constraints. Lastly, 2DPASS was trained with a learning rate of 0.24, the SGD optimizer, momentum of 0.9, and weight decay of 1.0e-4. A batch size of 1 was adopted due to memory constraints.

Evaluation metrics for all models included mAP (mean average precision) for object detection, evaluated separately for day and night conditions, and IoU (Intersection over Union) for bounding box overlap.

For training and validation, we used a subset of the nuScenes dataset, ensuring a balanced representation of different classes and illumination conditions (day and night). The test set was kept separate and was not involved in any part of the model training or tuning process. The code is as follows:

## Yolov8 Code
### Experiment Implementation

In [5]:
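!mkdir -p /content/dataset
%cd /content/dataset
!pip install roboflow
import os
from roboflow import Roboflow
# Read the API key from the environment instead of hardcoding it in the notebook
# (the variable name ROBOFLOW_API_KEY is an arbitrary choice)
rf = Roboflow(api_key=os.environ['ROBOFLOW_API_KEY'])
project = rf.workspace('unsw-kgzbp').project('comp9444-v2')
# Download version 1 of the dataset in YOLOv8 format
dataset = project.version(1).download('yolov8')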



In [6]:
!pip install ultralytics==8.0.20
from IPython.display import clear_output, display, Image
clear_output()  # hide the pip install log
import ultralytics
import time
ultralytics.checks()  # print a summary of the environment and hardware
from ultralytics import YOLO



In [7]:
!yolo task=detect mode=train model=yolov8s.pt data={dataset.location}/data.yaml epochs=150 imgsz=1280 plots=True



The result is shown below
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/yolov8 result.png" width="700px" />
</div>



## 2DPASS Code with Pretrained Weights

In [3]:
import os
os.chdir("C:/Users/samyu/Code/comp9444project/2DPASS")

import yaml
import torch
import datetime
import importlib
import numpy as np
import pytorch_lightning as pl

from easydict import EasyDict
from argparse import ArgumentParser
from pytorch_lightning import loggers as pl_loggers
from pytorch_lightning.profiler import SimpleProfiler
from pytorch_lightning.callbacks import ModelCheckpoint, StochasticWeightAveraging
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
from dataloader.dataset import get_model_class, get_collate_class
from dataloader.pc_dataset import get_pc_model_class
from pytorch_lightning.callbacks import LearningRateMonitor
from torch import distributed as dist


import warnings
warnings.filterwarnings("ignore")

# Use the gloo backend for distributed training (NCCL is not available on Windows)
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

print(os.getcwd())


C:\Users\samyu\Code\comp9444project\2DPASS


In [4]:
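def load_yaml(file_name):
    """Load a YAML config file into a dictionary."""
    with open(file_name, 'r') as f:
        try:
            config = yaml.load(f, Loader=yaml.FullLoader)
        except AttributeError:
            # Older PyYAML versions do not provide FullLoader
            config = yaml.load(f)
    return config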

def parse_config():
    config = load_yaml('config/2DPASS-nuscenese.yaml')  # Load config from yaml file

    # manually set the values that were previously command-line arguments
    config['gpu'] = [0]
    config['seed'] = 0
    config['config_path'] = 'config/2DPASS-nuscenese.yaml'
    config['log_dir'] = 'default'
    config['monitor'] = 'val/mIoU'
    config['stop_patience'] = 50
    config['save_top_k'] = 1
    config['check_val_every_n_epoch'] = 1
    config['SWA'] = False
    config['baseline_only'] = False
    config['test'] = True
    config['fine_tune'] = False
    config['pretrain2d'] = False
    config['num_vote'] = 1
    config['submit_to_server'] = False
    config['checkpoint'] = None
    config['debug'] = False

    return EasyDict(config)


def build_loader(config):
    pc_dataset = get_pc_model_class(config['dataset_params']['pc_dataset_type'])
    dataset_type = get_model_class(config['dataset_params']['dataset_type'])
    train_config = config['dataset_params']['train_data_loader']
    val_config = config['dataset_params']['val_data_loader']
    train_dataset_loader, val_dataset_loader, test_dataset_loader = None, None, None

    if not config['test']:
        train_pt_dataset = pc_dataset(config, data_path=train_config['data_path'], imageset='train')
        val_pt_dataset = pc_dataset(config, data_path=val_config['data_path'], imageset='val')
        train_dataset_loader = torch.utils.data.DataLoader(
            dataset=dataset_type(train_pt_dataset, config, train_config),
            batch_size=train_config["batch_size"],
            collate_fn=get_collate_class(config['dataset_params']['collate_type']),
            shuffle=train_config["shuffle"],
            num_workers=train_config["num_workers"],
            pin_memory=True,
            drop_last=True
        )
        # config['dataset_params']['training_size'] = len(train_dataset_loader) * len(configs.gpu)
        val_dataset_loader = torch.utils.data.DataLoader(
            dataset=dataset_type(val_pt_dataset, config, val_config, num_vote=1),
            batch_size=val_config["batch_size"],
            collate_fn=get_collate_class(config['dataset_params']['collate_type']),
            shuffle=val_config["shuffle"],
            pin_memory=True,
            num_workers=val_config["num_workers"]
        )
    else:
        if config['submit_to_server']:
            test_pt_dataset = pc_dataset(config, data_path=val_config['data_path'], imageset='test', num_vote=val_config["batch_size"])
            test_dataset_loader = torch.utils.data.DataLoader(
                dataset=dataset_type(test_pt_dataset, config, val_config, num_vote=val_config["batch_size"]),
                batch_size=val_config["batch_size"],
                collate_fn=get_collate_class(config['dataset_params']['collate_type']),
                shuffle=val_config["shuffle"],
                num_workers=val_config["num_workers"]
            )
        else:
            val_pt_dataset = pc_dataset(config, data_path=val_config['data_path'], imageset='val', num_vote=val_config["batch_size"])
            val_dataset_loader = torch.utils.data.DataLoader(
                dataset=dataset_type(val_pt_dataset, config, val_config, num_vote=val_config["batch_size"]),
                batch_size=val_config["batch_size"],
                collate_fn=get_collate_class(config['dataset_params']['collate_type']),
                shuffle=val_config["shuffle"],
                num_workers=val_config["num_workers"]
            )

    return train_dataset_loader, val_dataset_loader, test_dataset_loader


if __name__ == '__main__':
    # parameters
    configs = parse_config()
    print(configs)

    # setting
    os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(map(str, configs.gpu))
    num_gpu = len(configs.gpu)

    # output path
    log_folder = 'logs/' + configs['dataset_params']['pc_dataset_type']
    tb_logger = pl_loggers.TensorBoardLogger(log_folder, name=configs.log_dir, default_hp_metric=False)
    os.makedirs(f'{log_folder}/{configs.log_dir}', exist_ok=True)
    profiler = SimpleProfiler(output_filename=f'{log_folder}/{configs.log_dir}/profiler.txt')
    np.set_printoptions(precision=4, suppress=True)

    # save the backup files
    backup_dir = os.path.join(log_folder, configs.log_dir, 'backup_files_%s' % str(datetime.datetime.now().strftime('%Y-%m-%d_%H-%M')))
    if not configs['test']:
        os.makedirs(backup_dir, exist_ok=True)
        os.system('copy main.py {}'.format(backup_dir))
        os.system('copy dataloader/dataset.py {}'.format(backup_dir))
        os.system('copy dataloader/pc_dataset.py {}'.format(backup_dir))
        os.system('copy {} {}'.format(configs.config_path, backup_dir))
        os.system('copy network/base_model.py {}'.format(backup_dir))
        os.system('copy network/baseline.py {}'.format(backup_dir))
        os.system('copy {}.py {}'.format('network/' + configs['model_params']['model_architecture'], backup_dir))

    # reproducibility
    torch.manual_seed(configs.seed)
    torch.backends.cudnn.deterministic = True
    # note: cuDNN autotuning may select different kernels between runs, which can weaken determinism
    torch.backends.cudnn.benchmark = True
    np.random.seed(configs.seed)
    config_path = configs.config_path

    train_dataset_loader, val_dataset_loader, test_dataset_loader = build_loader(configs)
    model_file = importlib.import_module('network.' + configs['model_params']['model_architecture'])
    my_model = model_file.get_model(configs)

    pl.seed_everything(configs.seed)
    checkpoint_callback = ModelCheckpoint(
        monitor=configs.monitor,
        mode='max',
        save_last=True,
        save_top_k=configs.save_top_k)

    if configs.checkpoint is not None:
        print('load pre-trained model...')
        if configs.fine_tune or configs.test or configs.pretrain2d:
            my_model = my_model.load_from_checkpoint(configs.checkpoint, config=configs, strict=(not configs.pretrain2d))
        else:
            # continue last training
            my_model = my_model.load_from_checkpoint(configs.checkpoint)

    if configs.SWA:
        swa = [StochasticWeightAveraging(swa_epoch_start=configs.train_params.swa_epoch_start, annealing_epochs=1)]
    else:
        swa = []

    print('Start testing...')
    assert num_gpu == 1, 'only support single GPU testing!'
    trainer = pl.Trainer(gpus=[i for i in range(num_gpu)],
                         accelerator='ddp_spawn',
                         resume_from_checkpoint=configs.checkpoint,
                         logger=tb_logger,
                         profiler=profiler)

    trainer.test(my_model, test_dataset_loader if configs.submit_to_server else val_dataset_loader)

{'format_version': 1, 'model_params': {'model_architecture': 'arch_2dpass', 'input_dims': 4, 'spatial_shape': [1000, 1000, 70], 'scale_list': [2, 4, 8, 16, 16, 16], 'hiden_size': 256, 'num_classes': 17, 'backbone_2d': 'resnet34', 'pretrained2d': False}, 'dataset_params': {'training_size': 28130, 'dataset_type': 'point_image_dataset_nus', 'pc_dataset_type': 'nuScenes', 'collate_type': 'collate_fn_default', 'ignore_label': 0, 'label_mapping': './config/label_mapping/nuscenes.yaml', 'resize': [400, 240], 'color_jitter': [0.4, 0.4, 0.4], 'flip2d': 0.5, 'image_normalizer': [[0.485, 0.456, 0.406], [0.229, 0.224, 0.225]], 'max_volume_space': [50, 50, 3], 'min_volume_space': [-50, -50, -4], 'train_data_loader': {'data_path': './dataset/nuscenes/', 'batch_size': 1, 'shuffle': True, 'num_workers': 8, 'rotate_aug': True, 'flip_aug': True, 'scale_aug': True, 'transform_aug': True, 'dropout_aug': True}, 'val_data_loader': {'data_path': './dataset/nuscenes', 'shuffle': False, 'num_workers': 8, 'batc

Global seed set to 0


Start testing...


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


When the above code is run, the training is actually executed in the terminal; the corresponding screenshots are attached below:
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/2DPASS Jup_training.png" />
</div>

<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/pretrained jup.png" />
</div>

This result was analysed in the results section above.

## HRFuser Code Demo with Pre-trained Weights and Generated Output File

In [59]:
import argparse
import copy
import os
import os.path as osp
import time
import warnings

import mmcv
import torch
from mmcv import Config, DictAction
from mmcv.runner import get_dist_info, init_dist
from mmcv.utils import get_git_hash

from mmdet import __version__
from mmdet.apis import init_random_seed, set_random_seed, train_detector
from mmdet.datasets import build_dataset
from mmdet.models import build_detector
from mmdet.utils import collect_env, get_root_logger


In [60]:
# Arguments
arguments = ['./HRFuser_config/hrfuser/cascade_rcnn_hrfuser_t_1x_nus_r640_l_r_fusion_bn.py', # batch-norm
             './checkpoints/cascade_rcnn_hrfuser_t_1x_nus_r640_l_r_fusion_latest.pth',
             '--cfg-options', 'data.test.samples_per_gpu=1',
             '--show-dir', 'demo/output']


In [61]:
def parse_args():
    parser = argparse.ArgumentParser(description='Train a detector')
    parser.add_argument('config', help='train config file path')
    parser.add_argument('--work-dir', help='the dir to save logs and models')
    parser.add_argument(
        '--resume-from', help='the checkpoint file to resume from')
    parser.add_argument(
        '--no-validate',
        action='store_true',
        help='whether not to evaluate the checkpoint during training')
    group_gpus = parser.add_mutually_exclusive_group()
    group_gpus.add_argument(
        '--gpus',
        type=int,
        help='number of gpus to use '
        '(only applicable to non-distributed training)')
    group_gpus.add_argument(
        '--gpu-ids',
        type=int,
        nargs='+',
        help='ids of gpus to use '
        '(only applicable to non-distributed training)')
    parser.add_argument('--seed', type=int, default=None, help='random seed')
    parser.add_argument(
        '--deterministic',
        action='store_true',
        help='whether to set deterministic options for CUDNN backend.')
    parser.add_argument(
        '--options',
        nargs='+',
        action=DictAction,
        help='override some settings in the used config, the key-value pair '
        'in xxx=yyy format will be merged into config file (deprecate), '
        'change to --cfg-options instead.')
    parser.add_argument(
        '--cfg-options',
        nargs='+',
        action=DictAction,
        help='override some settings in the used config, the key-value pair '
        'in xxx=yyy format will be merged into config file. If the value to '
        'be overwritten is a list, it should be like key="[a,b]" or key=a,b '
        'It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" '
        'Note that the quotation marks are necessary and that no white space '
        'is allowed.')
    parser.add_argument(
        '--launcher',
        choices=['none', 'pytorch', 'slurm', 'mpi'],
        default='none',
        help='job launcher')
    parser.add_argument('--local_rank', type=int, default=0)
    args, _ = parser.parse_known_args(arguments)
    if 'LOCAL_RANK' not in os.environ:
        os.environ['LOCAL_RANK'] = str(args.local_rank)

    if args.options and args.cfg_options:
        raise ValueError(
            '--options and --cfg-options cannot be both '
            'specified, --options is deprecated in favor of --cfg-options')
    if args.options:
        warnings.warn('--options is deprecated in favor of --cfg-options')
        args.cfg_options = args.options

    return args


In [62]:
def main():
    args = parse_args()

    cfg = Config.fromfile(args.config)
    if args.cfg_options is not None:
        cfg.merge_from_dict(args.cfg_options)
    # set cudnn_benchmark
    if cfg.get('cudnn_benchmark', False):
        torch.backends.cudnn.benchmark = True

    # work_dir is determined in this priority: CLI > segment in file > filename
    if args.work_dir is not None:
        # update configs according to CLI args if args.work_dir is not None
        cfg.work_dir = args.work_dir
    elif cfg.get('work_dir', None) is None:
        # use config filename as default work_dir if cfg.work_dir is None
        cfg.work_dir = osp.join('./work_dirs',
                                osp.splitext(osp.basename(args.config))[0])
    if args.resume_from is not None:
        cfg.resume_from = args.resume_from
    if args.gpu_ids is not None:
        cfg.gpu_ids = args.gpu_ids
    else:
        cfg.gpu_ids = range(1) if args.gpus is None else range(args.gpus)

    # init distributed env first, since logger depends on the dist info.
    if args.launcher == 'none':
        distributed = False
    else:
        distributed = True
        init_dist(args.launcher, **cfg.dist_params)
        # re-set gpu_ids with distributed training mode
        _, world_size = get_dist_info()
        cfg.gpu_ids = range(world_size)

    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    # create work_dir
    mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
    # dump config
    cfg.dump(osp.join(cfg.work_dir, osp.basename(args.config)))
    # init the logger before other steps
    timestamp = time.strftime('%Y%m%d_%H%M%S', time.localtime())
    log_file = osp.join(cfg.work_dir, f'{timestamp}.log')
    logger = get_root_logger(log_file=log_file, log_level=cfg.log_level)

    # init the meta dict to record some important information such as
    # environment info and seed, which will be logged
    meta = dict()
    # log env info
    env_info_dict = collect_env()
    env_info = '\n'.join([(f'{k}: {v}') for k, v in env_info_dict.items()])
    dash_line = '-' * 60 + '\n'
    logger.info('Environment info:\n' + dash_line + env_info + '\n' +
                dash_line)
    meta['env_info'] = env_info
    meta['config'] = cfg.pretty_text
    # log some basic info
    logger.info(f'Distributed training: {distributed}')
    logger.info(f'Config:\n{cfg.pretty_text}')

    # set random seeds
    if 'seed' in cfg.keys() and cfg.seed is not None:
        seed = cfg.seed
    else:
        seed = init_random_seed(args.seed)
    logger.info(f'Set random seed to {seed}, '
                f'deterministic: {args.deterministic}')
    #set_random_seed(seed, deterministic=args.deterministic)
    cfg.seed = seed
    meta['seed'] = seed
    meta['exp_name'] = osp.basename(args.config)

    model = build_detector(
        cfg.model,
        train_cfg=cfg.get('train_cfg'),
        test_cfg=cfg.get('test_cfg'))
    model.init_weights()

    datasets = [build_dataset(cfg.data.train)]
    if len(cfg.workflow) == 2:
        val_dataset = copy.deepcopy(cfg.data.val)
        val_dataset.pipeline = cfg.data.train.pipeline
        datasets.append(build_dataset(val_dataset))
    if cfg.checkpoint_config is not None:
        # save mmdet version, config file content and class names in
        # checkpoints as meta data
        cfg.checkpoint_config.meta = dict(
            mmdet_version=__version__ + get_git_hash()[:7],
            config=cfg.pretty_text,
            CLASSES=datasets[0].CLASSES)

    # add an attribute for visualization convenience
    model.CLASSES = datasets[0].CLASSES
    train_detector(
        model,
        datasets,
        cfg,
        distributed=distributed,
        validate=(not args.no_validate),
        timestamp=timestamp,
        meta=meta)

# 5. Discussion

### YOLOv8s Discussion
#### System Performance
System Specifications: We trained the YOLOv8 model on Google Colab with an NVIDIA A100-SXM4-40GB GPU.

#### Training Specifications
The training parameters were pre-tuned by the developer:
- Optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.001), 63 bias
- batch: 16
- IoU: 0.7
We also set the image size to 1280, which gave better results.

#### Training Time
Training YOLOv8 on the mini dataset took 0.230 hours. When we increased the number of epochs to 150, training stopped at epoch 75 because there had been no improvement in the last 50 epochs.
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/yolov8_training _time.png" />
</div>

The best result was observed at epoch 25, so we set the number of epochs to 25. As a result, the training time decreased to 0.08 hours.
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/yolov8_training_time_25_epoches.png" />
</div>
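
The stopping behaviour described above is consistent with patience-based early stopping, where training ends once a set number of epochs pass without a new best score. The sketch below is a minimal illustration and not the Ultralytics implementation; the function name `early_stop_epoch`, the patience of 50, and the synthetic score curve are assumptions chosen to mirror a best epoch of 25 and a stop at epoch 75.

```python
def early_stop_epoch(scores, patience=50):
    """Return (best_epoch, stop_epoch) using 1-based epoch numbers.

    Training stops at the first epoch where `patience` epochs have passed
    since the best validation score; otherwise it runs to the last epoch.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(scores)


# Synthetic validation curve (assumed, not our real metrics): the score
# rises until epoch 25 and then plateaus slightly below the peak.
scores = [0.01 * e for e in range(1, 26)] + [0.24] * 125

best_epoch, stop_epoch = early_stop_epoch(scores, patience=50)
print(best_epoch, stop_epoch)
```

Under these assumptions, the 50 epochs after the best one add training time without improving the result, which is why shortening the run to 25 epochs is a reasonable choice.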

#### Challenges and Solutions
The nuImages sample dataset contains approximately 40,000 images in total, but after downloading it we obtained around 20,000 images as the training set. Using all of these images to train the model would result in excessively long training times. Apart from the provided all_sample dataset, the official source also offers a mini dataset with only 50 images. However, if we trained the model on the mini dataset, the small sample size and insufficient representation of certain categories could lead to inaccurate learning for those categories, as well as overfitting.

To mitigate the challenges associated with dataset size, we constructed our own mini dataset from nuImages by randomly sampling 350 images from each camera and manually adjusting the distribution of daytime and nighttime images. However, this approach may still lead to imbalanced class proportions compared with the full dataset, potentially influencing our prediction results.
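
The per-camera sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the script we actually ran: the function name `sample_per_camera`, the `(camera_name, image_path)` record format, the fixed seed, and the placeholder file names are all hypothetical.

```python
import random
from collections import defaultdict


def sample_per_camera(records, per_camera=350, seed=0):
    """Randomly pick up to `per_camera` images for every camera.

    `records` is an iterable of (camera_name, image_path) pairs.
    """
    by_camera = defaultdict(list)
    for camera, path in records:
        by_camera[camera].append(path)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = {}
    for camera, paths in sorted(by_camera.items()):
        k = min(per_camera, len(paths))
        subset[camera] = rng.sample(paths, k)  # sampling without replacement
    return subset


# Placeholder records standing in for the real image index.
cameras = ["CAM_FRONT", "CAM_BACK", "CAM_FRONT_LEFT"]
records = [(cam, f"{cam}/img_{i:05d}.jpg") for cam in cameras for i in range(1000)]

subset = sample_per_camera(records, per_camera=350)
print({cam: len(paths) for cam, paths in subset.items()})
```

The day/night balance is not handled in this sketch; in our case it was adjusted manually after sampling.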

### HRFuser Model Discussion
#### System Performance:
System Specifications:
We ran this model on another machine using the Windows Subsystem for Linux (WSL). No GPU was available in that environment, so training had to run on the CPU, and we had to reduce both the batch size and the dataset. Training from scratch would have taken far too long, so we made use of pre-trained weights.

#### Dataset:
In the interest of time, we used the mini training dataset of nuScenes, which is around 6 gigabytes compared with the 80-gigabyte full dataset.

#### Training Specifications
The training batch size had to be limited to 1, as any larger batch size caused insufficient-memory errors.
Because this device lacked the resources to train the model, we used the pre-trained weights provided with the research paper.

#### Model Architecture
HRFuser is a multi-resolution sensor fusion architecture that easily scales to any number of input modalities. HRFuser is built on cutting-edge high-resolution networks for image-only dense prediction and includes a new multi-window cross-attention block to conduct fusion of many modalities at multiple resolutions.
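
To make the idea of cross-attention fusion more concrete, the sketch below implements plain scaled dot-product cross-attention in pure Python, where camera tokens act as queries and tokens from a second sensor act as keys and values. It is a simplified illustration only: it uses a single head and a single window, has no learned projections, and is not the HRFuser multi-window block itself. The function names and toy feature values are assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    Each query (e.g. a camera feature) is replaced by a weighted average of
    `values` (e.g. features from a second sensor), weighted by query-key
    similarity.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(logits)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Toy features inside one window: 2 camera tokens attend to 3 lidar tokens.
camera_tokens = [[1.0, 0.0], [0.0, 1.0]]
lidar_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused = cross_attention(camera_tokens, lidar_tokens, lidar_tokens)
# Residual-style fusion: add the attended lidar information to the camera tokens.
output = [[c + f for c, f in zip(cam, fus)] for cam, fus in zip(camera_tokens, fused)]
print(output)
```

Because the attention weights sum to one, each fused token is a convex combination of the second sensor's tokens, so the camera branch receives information from whichever sensor tokens are most similar to it.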

#### Training Time
Training from scratch would have taken more than 8 hours on this device. With the pre-trained weights, a run took approximately an hour.

<div style="margin-top: 10px; margin-bottom: 10px;">
<img src="./images/train_time_HR_Fuser.png" />
</div>

#### Challenges and Solutions
The main challenge with this model is that it relies on the mmdet and mmcv libraries, and the environment provided with the paper only runs on Linux. We therefore had to set up WSL (Windows Subsystem for Linux) and run the project inside it. In addition, in our setup WSL could not access the Windows CUDA installation and GPU, so we modified the code to train on the CPU in PyTorch. As CPU training is very limited, we reduced the batch size to 1 and trained on the small mini dataset instead of the full one. Using the pre-trained weights also saved us considerable time compared with running the whole training process.

Overall, HRFuser improved low-light detection performance because it fuses the camera input with the other sensor inputs during training. This significantly improves detection in various foggy and low-light environments.

<div style="margin-top: 10px; margin-bottom: 10px;">
<img src="./images/HR_Fuser_discussion.png" />
</div>

### 2DPASS Discussion
#### System Performance:
System Specifications:
We trained the 2DPASS model on an NVIDIA 4060 laptop graphics card with 16 gigabytes of RAM.

#### Dataset:
In the interest of time, we used the mini training dataset of nuScenes, which is around 6 gigabytes compared with the 80-gigabyte full dataset.

#### Training Specifications
The training batch size had to be limited to 1, as any larger batch size caused insufficient-memory errors.
The training parameters were pre-tuned by the developers as:
- Learning Rate: 0.24
- Optimizer: SGD
- Momentum: 0.9
- Weight Decay: 1.0e-4
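
To show how these hyperparameters interact, the sketch below applies one common formulation of SGD with momentum and L2 weight decay to a single scalar parameter. It is an illustration under assumptions rather than the 2DPASS training loop: the function name `sgd_step` and the toy objective are hypothetical, and the real optimiser operates on tensors through PyTorch.

```python
def sgd_step(param, grad, velocity, lr=0.24, momentum=0.9, weight_decay=1.0e-4):
    """One SGD update with momentum and L2 weight decay for a scalar parameter.

    Follows the common formulation:
        g = grad + weight_decay * param
        v = momentum * v + g
        param = param - lr * v
    """
    grad = grad + weight_decay * param
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


# Toy objective f(w) = 0.5 * w**2, whose gradient is simply w.
w, v = 1.0, 0.0
for step in range(3):
    w, v = sgd_step(w, grad=w, velocity=v)
    print(step, round(w, 4))
```

With a learning rate as large as 0.24, the momentum term accumulates quickly, so the parameter can overshoot the minimum within a few steps; this is one reason such settings are usually paired with a learning-rate schedule.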

#### Model Architecture
This model improves significantly on image-only computer vision networks, as 2DPASS combines lidar data with camera images. This allows it to detect and classify objects more accurately, even in low-luminosity environments.

#### Training Time
Training our model on the mini dataset took approximately 5 hours because our computer's limited memory could only manage a training batch size of one. In addition, due to the limited variety in the mini dataset, val/mIoU failed to improve over the last 50 records, which shows that much of the computation towards the end of training did not yield any notable performance improvement.
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/train_time.png" />
</div>

#### Challenges and Solutions
Running the model on the full 80-gigabyte dataset required too much computational power and time, so we resorted to the mini training set instead, which was much faster to train on.

Training on a much smaller dataset can introduce overfitting and lead to inaccurate results, so we also used the authors' pre-trained model to compare results before drawing conclusions.
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/overfit.png" />
</div>
The above is the result of testing the model trained on the mini dataset. Here we can clearly see a case of overfitting: all vehicle-like objects are recognised as cars, which explains the high accuracy for car predictions and essentially 0% accuracy for all other vehicle classes.

Our main challenge was our limited ability to modify the model, as training took up to five hours even on a much smaller dataset. To tackle this, we introduced manual early stopping: if we saw no noticeable improvement in the mIoU (mean Intersection over Union) value over five epochs of training, we exited the training manually. However, finding the right threshold for improvement was difficult to optimise. Moreover, as training also depends on the distribution of the dataset, it is uncertain how much the model will learn from different data.
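
For reference, mIoU averages the per-class ratio of correctly labelled points to the union of predicted and true points for that class. The sketch below is a minimal pure-Python illustration, not the evaluation code used by 2DPASS; the function names, class ids, and toy labels are assumptions. It also shows why a model that labels every vehicle as a car scores poorly on mIoU even when the car class looks strong.

```python
def per_class_iou(y_true, y_pred, num_classes):
    """Return a list with the IoU of each class (None if the class never appears)."""
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == cls or p == cls)
        ious.append(intersection / union if union else None)
    return ious


def mean_iou(y_true, y_pred, num_classes):
    """Average the IoU over the classes that occur in labels or predictions."""
    valid = [iou for iou in per_class_iou(y_true, y_pred, num_classes) if iou is not None]
    return sum(valid) / len(valid)


# Toy labels: 0 = car, 1 = truck, 2 = bus (class ids are illustrative).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
# A model that calls every vehicle a car, as in the overfitting case above.
y_pred = [0, 0, 0, 0, 0, 0, 0, 0]

print(per_class_iou(y_true, y_pred, num_classes=3))
print(mean_iou(y_true, y_pred, num_classes=3))
```

In this toy case every car is found, but the truck and bus classes contribute an IoU of zero, which pulls the mean down sharply.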




### References

#### General Research:

Darya Paspelava. Computer Vision Object Detection: Challenges Faced. Available at: https://www.exposit.com/blog/computer-vision-object-detection-challenges-faced/ (Accessed: 28 July 2023)

Liu, Z., Cai, Y., Wang, H. et al. Surrounding Objects Detection and Tracking for Autonomous Driving Using LiDAR and Radar Fusion. Chin. J. Mech. Eng. 34, 117 (2021). https://doi.org/10.1186/s10033-021-00630-y

Gaudenz Boesch. Object Detection in 2023: The Definitive Guide. Available at: https://viso.ai/deep-learning/object-detection/ (Accessed: 29 July 2023)

Waldschmidt, C., Hasch, J. and Menzel, W., 2021. Automotive radar—From first efforts to future systems. IEEE Journal of Microwaves, 1(1), pp.135-148.

Nobis, F., Geisslinger, M., Weber, M., Betz, J. and Lienkamp, M., 2019, October. A deep learning-based radar and camera sensor fusion architecture for object detection. In 2019 Sensor Data Fusion: Trends, Solutions, Applications (SDF) (pp. 1-7). IEEE.


#### Datasets:
Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G. and Beijbom, O., 2019. 'nuScenes: A multimodal dataset for autonomous driving'. arXiv preprint arXiv:1903.11027. Available at: https://doi.org/10.48550/arXiv.1903.11027

Ouaknine, A., Newson, A., Rebut, J., Tupin, F. and Pérez, P., 2021. 'Carrada dataset: Camera and automotive radar with range-angle-doppler annotations'. In 2020 25th International Conference on Pattern Recognition (ICPR), pp. 5068-5075. IEEE.

Paek, D.H., Kong, S.H. and Wijaya, K.T., 2022. 'K-Radar: 4D Radar Object Detection Dataset and Benchmark for Autonomous Driving in Various Weather Conditions'. arXiv preprint arXiv:2206.08171.

Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G. and Beijbom, O., 2020. 'nuScenes: A multimodal dataset for autonomous driving'. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621-11631.

Zhang, A., Nowruzi, F.E. and Laganiere, R., 2021, May. 'RADDet: Range-Azimuth-Doppler based radar object detection for dynamic road users'. In 2021 18th Conference on Robots and Vision (CRV), pp. 95-102. IEEE.

Sheeny, M., De Pellegrin, E., Mukherjee, S., Ahrabian, A., Wang, S. and Wallace, A., 2021, May. 'RADIATE: A radar dataset for automotive perception in bad weather'. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 1-7. IEEE.

Mostajabi, M., Wang, C.M., Ranjan, D. and Hsyu, G., 2020. 'High-resolution radar dataset for semi-supervised learning of dynamic objects'. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 100-101.

Patrick LLGC, 2023. 'Astyx dataset'. Available at: https://patrick-llgc.github.io/Learning-Deep-Learning/paper_notes/astyx_dataset.html (Accessed: 5 August 2023).

CRUW Dataset, 2023. 'CRUW: Radar Dataset for 2D Object Detection'. Available at: https://www.cruwdataset.org/ (Accessed: 5 August 2023).

Intelligent Vehicles, 2023. 'View of Delft: A Radar and Vision Dataset for Autonomous Driving in the Rain'. Available at: https://intelligent-vehicles.org/datasets/view-of-delft/ (Accessed: 5 August 2023).


#### Models:
Redmon, J. and Farhadi, A., 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.

Broedermann, T., Sakaridis, C., Dai, D. and Van Gool, L., 2023. HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection. ETH Zurich/MPI for Informatics/KU Leuven. Available at: https://doi.org/10.48550/arXiv.2206.15157

Yan, X., Gao, J., Zheng, C., Zheng, C., Zhang, R., Cui, S. and Li, Z., 2022, October. 2DPASS: 2D priors assisted semantic segmentation on LiDAR point clouds. Cham: Springer Nature Switzerland. Available at: https://arxiv.org/abs/2207.04397

Sjöberg, A., 2023. Investigation regarding the performance of YOLOv8 in pedestrian detection. Available at: https://kth.diva-portal.org/smash/get/diva2:1778368/FULLTEXT01.pdf (Accessed: 5 August 2023).
