# COMP9444 Group Project


## 1. Introduction, Motivation, and/or Problem Statement

### Introduction
Our project aims to leverage computer vision neural networks to improve object detection in images captured in both daytime and nighttime environments. The ability to accurately detect and recognize objects under varying lighting conditions has become crucial to many modern applications, such as autonomous vehicles and surveillance and security systems.

Consider the two images below. Everything in the left image is easy to identify, and contrasting it with the image on the right highlights how much harder it is to identify objects under low luminosity.
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="https://www.exposit.com/wp-content/webp-express/webp-images/doc-root/wp-content/uploads/2021/04/Illumination_conditions_as_a_challenge_of_comp.width-800.jpg.webp" width="500"/>
</div>

### Motivation
Modern computer vision neural networks often perform poorly in nighttime object detection, producing inaccurate detections in low-luminosity environments. Nighttime factors such as shadows, limited luminosity, and reduced visibility make it challenging for a network to classify objects. This hinders the effectiveness and safety of existing computer vision applications such as surveillance, which requires all-day monitoring.

Researchers have made advancements in enhancing accuracy for low-light detection. An example is the REDI low-light enhancement algorithm, which effectively filters noise in low-light conditions and performs detection on the resulting image.
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/lowlight.png" />
</div>

Here, (a) through (d) are the stages of REDI algorithm filtering. However, this algorithm has several downsides, including loss of detail, over-correction, and high computational cost. These add extra complexity and computational load to existing models.

Solving day/night object detection would bring significant improvements to real-world applications, particularly autonomous driving and surveillance and security systems. It is not only an exciting technical challenge for researchers, but also has the potential to open up new possibilities for neural network computer vision.

### Problem Statements
The key challenges that our models need to address are:
1. Handling varying levels of brightness within an image.
2. Removing noise from nighttime images, as images taken at night tend to contain more noise.

## 2. Exploration Analysis or Data or RL Tasks


## 3. Models and/or Methods

### 2DPASS 
Link to paper: https://arxiv.org/pdf/2210.04208.pdf

#### Model Introduction
2DPASS (2D Priors Assisted Semantic Segmentation) is a method that boosts representation learning on point clouds. A notable advantage of this model is that it does not require strictly paired data alignment between the camera and LiDAR data.

The 2DPASS method leverages auxiliary modal fusion and multi-scale fusion-to-single knowledge distillation (MSFSKD) to acquire richer semantic and structural information from the multi-modal data. This is a significant improvement over baseline models that use only point clouds.
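
To make the idea of knowledge distillation more concrete, the sketch below computes a KL-divergence loss that pulls the class scores of a 3D-only branch towards those of a fused 2D+3D branch for a single point. This is a generic distillation loss written in plain Python, not the exact MSFSKD formulation from the paper; the function names, the temperature parameter, and the example scores are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw class scores into probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over the class distribution of one point."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical scores for three classes at one LiDAR point.
fused_logits = [2.0, 0.5, -1.0]   # fused 2D+3D branch (acts as the teacher)
point_logits = [1.0, 1.0, -0.5]   # 3D-only branch (acts as the student)

loss = distillation_kl(fused_logits, point_logits)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero when both branches agree and grows as the 3D-only predictions drift away from the fused ones, which is what lets the point cloud branch absorb information from the image branch during training.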


### HRFuser

Link to paper: https://arxiv.org/pdf/2206.15157.pdf



#### Model Introduction

HRFuser is a multi-resolution sensor fusion architecture that easily scales to any number of input modalities. HRFuser is built on cutting-edge high-resolution networks for image-only dense prediction and includes a new multi-window cross-attention block to conduct fusion of many modalities at multiple resolutions.

While much recent research focuses on fusing specific pairs of sensors, such as camera with lidar or radar, by leveraging architectural components specific to the examined setting, the literature lacks a general and modular sensor fusion architecture. HRFuser fills this gap as a modular architecture for multi-modal 2D object detection. It fuses multiple sensors at multiple resolutions and scales to an arbitrary number of input modalities.
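
As a rough illustration of cross-attention fusion, the sketch below lets each camera feature vector attend over the feature vectors of a second modality and adds the attended result back to the camera feature. It is a single-head, plain-Python simplification: it omits the learned projections and the multi-window partitioning that HRFuser uses, and it assumes that camera features act as queries while the other sensor supplies keys and values. All names and numbers are hypothetical.

```python
import math


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention with a residual connection.

    queries: feature vectors of the primary modality (assumed: camera)
    keys, values: feature vectors of a second modality (e.g. lidar)
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        # Similarity between this query and every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the second modality's values.
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        # Residual: keep the camera feature and add the attended information.
        fused.append([a + b for a, b in zip(q, attended)])
    return fused


camera_feats = [[1.0, 0.0], [0.0, 1.0]]   # two hypothetical camera tokens
lidar_feats = [[0.5, 0.5], [1.0, -1.0]]   # two hypothetical lidar tokens

fused_feats = cross_attention(camera_feats, lidar_feats, lidar_feats)
print(fused_feats)
```

Because the fusion step only needs a set of keys and values, the same function could be applied once per additional sensor, which is the intuition behind scaling to more input modalities.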

The HRFuser architecture is shown below:


<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HR-Fuser-architecture.png" />
</div>


Because of its additional input branches, HRFuser can be trained on combined data from not only cameras but also multiple other types of sensors.



## 4. Results

### 2DPASS Results

#### 2DPASS Trained on Mini-Dataset
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/mini.png" />
</div>

#### 2DPASS Pretrained Model
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/pretrained.png" />
</div>

#### Model Results
| Model                | mIoU | Accuracy |
|----------------------|------|----------|
| 2DPASS (Mini-dataset)| 36%  | 56%      |
| 2DPASS (Pretrained)  | 81%  | 63%      |

The pretrained model, which was trained on the full dataset, shows significant improvements in both mIoU and accuracy. Note that this result is worse than the one reported in the paper, as the authors' model was trained with the additional validation set and instance-level augmentation.
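
For reference, the sketch below shows how mIoU and accuracy can be computed from a confusion matrix. It assumes that "Accuracy" in the table means overall point accuracy (correct predictions divided by all predictions), which the 2DPASS logs may define differently; the function name and the toy matrix are illustrative only.

```python
def segmentation_metrics(confusion):
    """Return (mIoU, overall accuracy) from a square confusion matrix.

    confusion[i][j] is the number of points whose true class is i
    and whose predicted class is j.
    """
    num_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(num_classes))

    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        denom = true_pos + false_pos + false_neg
        if denom > 0:  # skip classes that never appear
            ious.append(true_pos / denom)

    return sum(ious) / len(ious), correct / total


# Toy two-class example (made-up counts).
toy_confusion = [
    [8, 2],
    [1, 9],
]
miou, accuracy = segmentation_metrics(toy_confusion)
print(f"mIoU = {miou:.3f}, accuracy = {accuracy:.3f}")
```

Because IoU penalises both false positives and false negatives per class and then averages over classes, mIoU and overall accuracy can diverge noticeably when class frequencies are unbalanced.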

#### Epoch Training Steps
NOTE: The x-axis is the number of epochs.
##### mIoU vs Epoch 
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/miou_r.png" width="700px" />
</div>

##### Best mIoU vs Epoch
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/miou.png" width="700px" />
</div>

From the mIoU curve and the (smoothed) best mIoU curve, we see no significant improvement in the mIoU value after around 8000 epochs. This suggests that further training beyond that point does not improve the model and could lead to overfitting on the existing data.

##### Accuracy vs Epoch 
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/accuracy.png" width="700px" />
</div>

The accuracy during training behaves similarly to the mIoU curve, with optimum accuracy reached at around 8000 epochs.




### HRFuser Results

#### General sample image results:
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HR-Fuser_result.png" width="700px" />
</div>


<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HR-Fuser-Result-2.png" width="700px" />
</div>


#### Output images from the nuScenes mini-dataset after training (only a few of more than 500 results are shown):


<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HRFuser_Result.png" width="1000px" />
</div>


##### Front Camera:


<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HRFuseroutput/Front-Camera/FC1.jpg" width="700px" />
</div>



<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HRFuseroutput/Front-Camera/FC2.jpg" width="700px" />
</div>

##### Back Camera:


<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HRFuseroutput/Back-Camera/BC1.jpg" width="700px" />
</div>



<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HRFuseroutput/Back-Camera/BC2.jpg" width="700px" />
</div>



##### Back Left Camera:



<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HRFuseroutput/Back-Left-Camera/BLC1.jpg" width="700px" />
</div>



<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HRFuseroutput/Back-Left-Camera/BLC2.jpg" width="700px" />
</div>



##### Back Right Camera:

<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HRFuseroutput/Back-Right-Camera/BR_Camera2.jpg" width="700px" />
</div>



<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HRFuseroutput/Back-Right-Camera/BR-Camera1.jpg" width="700px" />
</div>





## 5. Discussion
### 2DPASS Discussion
#### System Performance:
System Specifications:
We trained the 2DPASS model on an NVIDIA 4060 laptop graphics card with 16 gigabytes of RAM.

#### Dataset:
In the interest of time, we used the nuScenes mini training dataset, which is around 6 gigabytes compared to the 80-gigabyte full dataset.

#### Training Specifications
The training batch size had to be limited to 1, as any larger batch size caused insufficient memory errors.
The training parameters were pre-tuned by the developers as:
- Learning Rate: 0.24
- Optimizer: SGD
- Momentum: 0.9
- Weight Decay: 1.0e-4
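
To show how these hyperparameters interact, the sketch below performs one SGD update with momentum and weight decay in plain Python. It follows the convention used by PyTorch's `torch.optim.SGD` (weight decay is added to the gradient, then accumulated into a velocity buffer). Any learning-rate schedule used by the 2DPASS code is not modelled, and the parameter and gradient values are made up.

```python
def sgd_momentum_step(params, grads, velocity,
                      lr=0.24, momentum=0.9, weight_decay=1.0e-4):
    """Apply one SGD-with-momentum update and return (params, velocity)."""
    new_params, new_velocity = [], []
    for p, g, v in zip(params, grads, velocity):
        g = g + weight_decay * p   # L2 regularisation folded into the gradient
        v = momentum * v + g       # running velocity
        new_params.append(p - lr * v)
        new_velocity.append(v)
    return new_params, new_velocity


# One hypothetical weight with a hypothetical gradient.
weights, velocity = [1.0], [0.0]
weights, velocity = sgd_momentum_step(weights, grads=[0.5], velocity=velocity)
print(weights, velocity)
```

With momentum at 0.9, repeated gradients in the same direction build up in the velocity buffer, so the effective step size can become several times larger than the learning rate alone suggests.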

#### Model Architecture
This model significantly improves upon image-only computer vision neural networks, as 2DPASS combines lidar data with camera images. This allows it to detect and classify objects more accurately, even in low-luminosity environments.

#### Training Time
Training our model on the mini-dataset took approximately 5 hours, because our computer's limited memory could only manage a training batch size of one. Also, due to the limited variety in the mini-dataset, the val/mIoU showed no improvement over the last 50 records, which indicates that much of the computation towards the end of training did not achieve any notable performance gain.
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/train_time.png" />
</div>

#### Challenges and Solutions
Running the model on the whole 80-gigabyte dataset required too much computational power and time, so we resorted to the mini training set instead, which was much faster to train on.

Training on a much smaller dataset can introduce overfitting and lead to inaccurate results, so we used the authors' pre-trained model to compare results before drawing conclusions.
<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/overfit.png" />
</div>

The above is the result of testing the model trained on the mini-dataset. We can clearly see a case of overfitting, where all vehicle-like objects are recognised as cars, which explains the high accuracy for car predictions and the near 0% accuracy for all other vehicle classes.

Our main challenge was our limited ability to modify the model, as training took up to five hours even on a much smaller dataset. To tackle this, we introduced early stopping: if we saw no noticeable improvement in the mIoU (mean Intersection over Union) value over five epochs of training, we manually exited the training. However, finding a suitable improvement threshold was difficult and is hard to optimise. Moreover, as training also depends on the distribution of the dataset, it is uncertain how much the model will learn from processing different data.
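
The early-stopping rule described above was applied by hand, but it can be sketched as a small check over the recorded validation mIoU values. The function below is an illustrative automation rather than code from our training run; the `min_delta` threshold and the example histories are assumptions.

```python
def should_stop(miou_history, patience=5, min_delta=0.0):
    """Return True if mIoU has not improved over the last `patience` epochs.

    miou_history: validation mIoU per epoch, oldest first.
    min_delta: smallest gain that counts as an improvement (hypothetical knob).
    """
    if len(miou_history) <= patience:
        return False  # not enough history to judge yet
    best_before = max(miou_history[:-patience])
    best_recent = max(miou_history[-patience:])
    return best_recent <= best_before + min_delta


plateaued = [0.10, 0.20, 0.30, 0.30, 0.30, 0.29, 0.30, 0.30]
improving = [0.10, 0.20, 0.30, 0.31, 0.32, 0.33, 0.34, 0.35]

print(should_stop(plateaued))  # the last five epochs never beat the earlier best
print(should_stop(improving))  # still improving, so keep training
```

Raising `min_delta` makes the rule stop sooner on slow improvements, which is one way to express the "sweet spot" trade-off mentioned above.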



### HRFuser Discussion


#### System Performance:
System Specifications:
We trained this model on a separate machine running the Windows Subsystem for Linux (WSL) without a usable GPU, so training had to run on the CPU. We therefore had to reduce both the batch size and the dataset. Training from scratch would have taken too long, so we used the pre-trained weights.

#### Dataset:
In the interest of time, we used the nuScenes mini training dataset, which is around 6 gigabytes compared to the 80-gigabyte full dataset.

#### Training Specifications
The training batch size had to be limited to 1, as any larger batch size caused insufficient memory errors.
As the device used for this model had insufficient training resources, we used the pre-trained weights provided with the research paper.

#### Model Architecture
HRFuser is a multi-resolution sensor fusion architecture that easily scales to any number of input modalities. HRFuser is built on cutting-edge high-resolution networks for image-only dense prediction and includes a new multi-window cross-attention block to conduct fusion of many modalities at multiple resolutions.

#### Training Time
Training from scratch would take more than 8 hours on the device used for this model. With the pre-trained weights, the run took approximately one hour.

<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/train_time_HR_Fuser.png" />
</div>

#### Challenges and Solutions
The challenge with this model is that it depends on the mmdet and mmcv libraries, and the environment provided with the paper only supports Linux. We therefore had to set up WSL (Windows Subsystem for Linux) and run the project in it. In addition, WSL could not access the Windows CUDA installation and GPU in our setup, so we had to modify the code to allow CPU training in PyTorch. As CPU training is very limited, we had to reduce the batch size to 1 and train on the small mini-dataset instead of the full one. Using the pre-trained weights also saved us considerable time compared with running the whole training process.
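
A common way to avoid hard-coding a GPU is to check for CUDA at start-up and fall back to the CPU. The sketch below shows that pattern using `torch.cuda.is_available()`. It is a minimal illustration and not the actual change we made to the HRFuser code; the `pick_device` helper is a hypothetical name.

```python
try:
    import torch  # optional here: without PyTorch we simply assume no GPU
    cuda_ok = torch.cuda.is_available()
except ImportError:
    cuda_ok = False


def pick_device(cuda_available):
    """Return the device string to use for training or inference."""
    return "cuda" if cuda_available else "cpu"


device = pick_device(cuda_ok)
print(f"running on: {device}")
```

The resulting string can then be passed wherever a device is expected, for example `model.to(device)` in PyTorch, so the same script runs on machines with and without a GPU.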

Overall, HRFuser improves low-light detection performance because it fuses all input data, such as camera and other sensor data, during training. This significantly improves detection in various foggy and low-light environments.



<div style="margin-top: 20px; margin-bottom: 20px;">
<img src="./images/HR_Fuser_discussion.png" />
</div>



## HRFuser Code Demo on Pre-trained Weights with Generated Output Files

NOTE: The generated output consists of images, so sample results are shown in the Results section above.

```python
import argparse
import copy
import os
import os.path as osp
import time
import warnings

import mmcv
import torch
from mmcv import Config, DictAction
from mmcv.runner import get_dist_info, init_dist
from mmcv.utils import get_git_hash

from mmdet import __version__
from mmdet.apis import init_random_seed, set_random_seed, train_detector
from mmdet.datasets import build_dataset
from mmdet.models import build_detector
from mmdet.utils import collect_env, get_root_logger
```

```python
# Arguments
arguments = ['./HRFuser_config/hrfuser/cascade_rcnn_hrfuser_t_1x_nus_r640_l_r_fusion_bn.py', # batch-norm
             './checkpoints/cascade_rcnn_hrfuser_t_1x_nus_r640_l_r_fusion_latest.pth',
             '--cfg-options', 'data.test.samples_per_gpu=1',
             '--show-dir', 'demo/output']
```


```python
def parse_args():
    parser = argparse.ArgumentParser(description='Train a detector')
    parser.add_argument('config', help='train config file path')
    parser.add_argument('--work-dir', help='the dir to save logs and models')
    parser.add_argument(
        '--resume-from', help='the checkpoint file to resume from')
    parser.add_argument(
        '--no-validate',
        action='store_true',
        help='whether not to evaluate the checkpoint during training')
    group_gpus = parser.add_mutually_exclusive_group()
    group_gpus.add_argument(
        '--gpus',
        type=int,
        help='number of gpus to use '
        '(only applicable to non-distributed training)')
    group_gpus.add_argument(
        '--gpu-ids',
        type=int,
        nargs='+',
        help='ids of gpus to use '
        '(only applicable to non-distributed training)')
    parser.add_argument('--seed', type=int, default=None, help='random seed')
    parser.add_argument(
        '--deterministic',
        action='store_true',
        help='whether to set deterministic options for CUDNN backend.')
    parser.add_argument(
        '--options',
        nargs='+',
        action=DictAction,
        help='override some settings in the used config, the key-value pair '
        'in xxx=yyy format will be merged into config file (deprecate), '
        'change to --cfg-options instead.')
    parser.add_argument(
        '--cfg-options',
        nargs='+',
        action=DictAction,
        help='override some settings in the used config, the key-value pair '
        'in xxx=yyy format will be merged into config file. If the value to '
        'be overwritten is a list, it should be like key="[a,b]" or key=a,b '
        'It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" '
        'Note that the quotation marks are necessary and that no white space '
        'is allowed.')
    parser.add_argument(
        '--launcher',
        choices=['none', 'pytorch', 'slurm', 'mpi'],
        default='none',
        help='job launcher')
    parser.add_argument('--local_rank', type=int, default=0)
    args, _ = parser.parse_known_args(arguments)
    if 'LOCAL_RANK' not in os.environ:
        os.environ['LOCAL_RANK'] = str(args.local_rank)

    if args.options and args.cfg_options:
        raise ValueError(
            '--options and --cfg-options cannot be both '
            'specified, --options is deprecated in favor of --cfg-options')
    if args.options:
        warnings.warn('--options is deprecated in favor of --cfg-options')
        args.cfg_options = args.options

    return args
```


```python
def main():
    args = parse_args()

    cfg = Config.fromfile(args.config)
    if args.cfg_options is not None:
        cfg.merge_from_dict(args.cfg_options)
    # set cudnn_benchmark
    if cfg.get('cudnn_benchmark', False):
        torch.backends.cudnn.benchmark = True

    # work_dir is determined in this priority: CLI > segment in file > filename
    if args.work_dir is not None:
        # update configs according to CLI args if args.work_dir is not None
        cfg.work_dir = args.work_dir
    elif cfg.get('work_dir', None) is None:
        # use config filename as default work_dir if cfg.work_dir is None
        cfg.work_dir = osp.join('./work_dirs',
                                osp.splitext(osp.basename(args.config))[0])
    if args.resume_from is not None:
        cfg.resume_from = args.resume_from
    if args.gpu_ids is not None:
        cfg.gpu_ids = args.gpu_ids
    else:
        cfg.gpu_ids = range(1) if args.gpus is None else range(args.gpus)

    # init distributed env first, since logger depends on the dist info.
    if args.launcher == 'none':
        distributed = False
    else:
        distributed = True
        init_dist(args.launcher, **cfg.dist_params)
        # re-set gpu_ids with distributed training mode
        _, world_size = get_dist_info()
        cfg.gpu_ids = range(world_size)

    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    # create work_dir
    mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
    # dump config
    cfg.dump(osp.join(cfg.work_dir, osp.basename(args.config)))
    # init the logger before other steps
    timestamp = time.strftime('%Y%m%d_%H%M%S', time.localtime())
    log_file = osp.join(cfg.work_dir, f'{timestamp}.log')
    logger = get_root_logger(log_file=log_file, log_level=cfg.log_level)

    # init the meta dict to record some important information such as
    # environment info and seed, which will be logged
    meta = dict()
    # log env info
    env_info_dict = collect_env()
    env_info = '\n'.join([(f'{k}: {v}') for k, v in env_info_dict.items()])
    dash_line = '-' * 60 + '\n'
    logger.info('Environment info:\n' + dash_line + env_info + '\n' +
                dash_line)
    meta['env_info'] = env_info
    meta['config'] = cfg.pretty_text
    # log some basic info
    logger.info(f'Distributed training: {distributed}')
    logger.info(f'Config:\n{cfg.pretty_text}')

    # set random seeds
    if 'seed' in cfg.keys() and cfg.seed is not None:
        seed = cfg.seed
    else:
        seed = init_random_seed(args.seed)
    logger.info(f'Set random seed to {seed}, '
                f'deterministic: {args.deterministic}')
    #set_random_seed(seed, deterministic=args.deterministic)
    cfg.seed = seed
    meta['seed'] = seed
    meta['exp_name'] = osp.basename(args.config)

    model = build_detector(
        cfg.model,
        train_cfg=cfg.get('train_cfg'),
        test_cfg=cfg.get('test_cfg'))
    model.init_weights()

    datasets = [build_dataset(cfg.data.train)]
    if len(cfg.workflow) == 2:
        val_dataset = copy.deepcopy(cfg.data.val)
        val_dataset.pipeline = cfg.data.train.pipeline
        datasets.append(build_dataset(val_dataset))
    if cfg.checkpoint_config is not None:
        # save mmdet version, config file content and class names in
        # checkpoints as meta data
        cfg.checkpoint_config.meta = dict(
            mmdet_version=__version__ + get_git_hash()[:7],
            config=cfg.pretty_text,
            CLASSES=datasets[0].CLASSES)

    # add an attribute for visualization convenience
    model.CLASSES = datasets[0].CLASSES
    train_detector(
        model,
        datasets,
        cfg,
        distributed=distributed,
        validate=(not args.no_validate),
        timestamp=timestamp,
        meta=meta)
```

```python
main()
```

```text
2023-08-01 04:43:09,802 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
CUDA available: False
GCC: gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
PyTorch: 1.11.0+cu115
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.5, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_

2023-08-01 04:43:11,781 - mmdet - INFO - Set random seed to 0, deterministic: False
2023-08-01 04:43:12,445 - mmdet - INFO - initialize HRFuserHRFormerBased with init_cfg [{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}]
2023-08-01 04:43:12,800 - mmdet - INFO - initialize HRFPN with init_cfg {'type': 'Caffe2Xavier', 'layer': 'Conv2d'}
2023-08-01 04:43:12,823 - mmdet - INFO - initialize RPNHead with init_cfg {'type': 'Normal', 'layer': 'Conv2d', 'std': 0.01}
2023-08-01 04:43:12,834 - mmdet - INFO - initialize Shared2FCBBoxHead with init_cfg [{'type': 'Normal', 'std': 0.01, 'override': {'name': 'fc_cls'}}, {'type': 'Normal', 'std': 0.001, 'override': {'name': 'fc_reg'}}, {'type': 'Xavier', 'distribution': 'uniform', 'override': [{'name': 'shared_fcs'}, {'name': 'cls_fcs'}, {'name': 'reg_fcs'}]}]
2023-08-01 04:43:12,952 - mmdet - INFO - initialize Shared2FCBBoxHead with init_cfg [{'type': 'Normal', 'std': 0.01, 'override': {'nam

loading annotations into memory...
Done (t=0.37s)
creating index...
index created!


RuntimeError: No CUDA GPUs are available
```

### References
https://www.exposit.com/blog/computer-vision-object-detection-challenges-faced/
