# Object Detection System Report

## System Design
Revisiting the original requirements below confirms that the system meets each of them.
1. The system should be able to detect 20 significant objects from a video stream (30 frames per second)
    * This was achieved by implementing an inference service containing a video processing module, an object detection module, and a non-maximal suppression module, which are used together to detect objects in a video stream.
2. The system should allow human reviewers to fix errors and automatically update the model (human-in-the-loop)
    * This was achieved by implementing a rectification service containing a hard negative mining module that identifies the hardest samples to use for retraining. Additional components for this service that were out of scope for this case study include a retraining module, an error storage database, an annotation module, and an augmentation module. These additional components would allow human auditors to identify incorrectly labeled images, which would then be stored, augmented, and used to retrain the model.
3. The system should be capable of using CPU-only computing resources for inference.
    * This was achieved by running all parts of the system using only the CPU resources on my laptop. In the docker container that runs main.py, two separate threads are used: one for streaming video and one for handling API requests. However, these threads still only use CPU resources.
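
The two-thread arrangement described in requirement 3 can be illustrated with a minimal sketch using only Python's standard library. The names `DetectionStore`, `stream_worker`, `api_worker`, and `run_demo` are hypothetical and do not come from `main.py`; the placeholder detections stand in for real model output, and the `Event` is used only so the demo gives a deterministic result.

```python
import threading


class DetectionStore:
    """Thread-safe map from frame index to that frame's detections."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_frame = {}

    def add(self, frame_index, detections):
        with self._lock:
            self._by_frame[frame_index] = detections

    def get_range(self, start, end):
        """Return detections for frames in [start, end], inclusive."""
        with self._lock:
            return {i: d for i, d in sorted(self._by_frame.items())
                    if start <= i <= end}


def stream_worker(store, n_frames, done):
    """Hypothetical streaming loop: stores one placeholder detection per frame."""
    for i in range(n_frames):
        store.add(i, [f"placeholder-{i}"])
    done.set()  # signal that all frames have been written


def api_worker(store, done, start, end, response):
    """Hypothetical API handler: waits for the stream, then reads a frame range."""
    done.wait()
    response.update(store.get_range(start, end))


def run_demo(n_frames=5, start=1, end=3):
    store, done, response = DetectionStore(), threading.Event(), {}
    threads = [
        threading.Thread(target=stream_worker, args=(store, n_frames, done)),
        threading.Thread(target=api_worker, args=(store, done, start, end, response)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return response
```

Both threads run on the CPU and share state only through the lock-protected store, which is one simple way to keep the streaming and request-handling paths from interfering with each other.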

A diagram of the entire system, from the Module 6 lecture slides, can be seen below. There are three main services: the inference service, the rectification service, and the interface service. The inference service processes frames of a video stream and performs object detection on each processed frame, saving the detections and predictions to files. The rectification service improves the object detection model by retraining it, and allows for retraining on error data identified by human annotators, as well as on augmented and difficult data identified by hard negative mining. Finally, the interface service is how the outside world interacts with the system, either by providing input via streaming video or by making API calls to retrieve information about the predictions made by the system.

![diagram](../assets/images/diagram.png)

## Data, Data Pipelines, and Model

### Data
There are three main types of data: representative training images and annotations, videos, and output prediction data. The data obtained from the `logistics.zip` file is stored in the `storages/training` directory and contains 9525 `.jpg` images with objects of 20 different classes. Each image has a corresponding `.txt` file containing annotations of the objects in that image in YOLO format. The two video files used to test the streaming capabilities of the system are stored in the `yolo_resources/test_videos` folder. Additionally, there is one test image in the `yolo_resources/test_images` folder. Finally, the system outputs detections and predictions into the `storages/prediction` folder. For each analyzed frame, this folder contains a `.jpg` image file labeled with the bounding boxes of detected objects and a corresponding `.txt` file containing the predicted annotations in YOLO format.
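
As a rough illustration of the annotation format, the sketch below parses one YOLO-format line (`class_id x_center y_center width height`, with coordinates normalized to the image size) and converts it to pixel corner coordinates. The names `YoloBox`, `parse_yolo_line`, and `to_pixel_corners` are hypothetical helpers written for this example, not part of the project code.

```python
from dataclasses import dataclass


@dataclass
class YoloBox:
    class_id: int
    x_center: float  # normalized to [0, 1] by image width
    y_center: float  # normalized to [0, 1] by image height
    width: float     # normalized by image width
    height: float    # normalized by image height


def parse_yolo_line(line):
    """Parse one annotation line: 'class_id x_center y_center width height'."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    return YoloBox(int(parts[0]), *(float(p) for p in parts[1:]))


def to_pixel_corners(box, img_w, img_h):
    """Convert a normalized center/size box to (x1, y1, x2, y2) in pixels."""
    x1 = (box.x_center - box.width / 2) * img_w
    y1 = (box.y_center - box.height / 2) * img_h
    x2 = (box.x_center + box.width / 2) * img_w
    y2 = (box.y_center + box.height / 2) * img_h
    return (x1, y1, x2, y2)
```

For example, `to_pixel_corners(parse_yolo_line("3 0.5 0.5 0.2 0.4"), 100, 200)` describes a class-3 box centered in a 100×200 image.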

### Data Pipelines
There are two main data pipelines: one involving the inference service and the other involving the rectification service. The inference pipeline begins by capturing frames from a video stream using the `InferenceService` class, analyzing one out of every 30 frames in the stream by using the `VideoProcessing` class. Because far fewer frames are processed than if every frame were analyzed, the computational load decreases, allowing the system to detect objects in real time and respond without delay. Each selected frame is then scaled and resized to a uniform size that is large enough for objects to be detected, but small enough to avoid the latency that processing large images would introduce. Next, the `YOLOObjectDetector` class with the trained YOLO model is used to detect objects in each frame, which includes localization and classification of the objects. Non-maximal suppression is performed using the `NMS` class to eliminate detections that overlap significantly with others, reducing the risk of false positives and of counting the same object multiple times. A list of detections can be obtained for a range of frames by using the `GET /detections_list` endpoint. Finally, the images with their detected bounding boxes and corresponding prediction annotations are saved to the `storages/prediction` folder for future human review. A `.zip` file containing the output images and the corresponding `.txt` annotation files for a range of frames can be obtained using the `GET /predictions` endpoint.
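
The frame-sampling and suppression steps can be sketched in plain Python as follows. This is an illustrative, class-aware greedy variant of non-maximal suppression and not the project's `NMS` or `VideoProcessing` implementation; the function names, the `(box, score, class_id)` tuple layout, and the 0.5 IoU threshold are assumptions made for the example. Only the stride of 30 comes from the description above.

```python
def select_frame_indices(total_frames, stride=30):
    """Indices of the frames to analyze: one out of every `stride` frames."""
    return list(range(0, total_frames, stride))


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over (box, score, class_id) tuples.

    Detections are visited from highest to lowest score. A detection is kept
    unless it overlaps an already-kept detection of the same class by more
    than `iou_threshold` (an assumed value).
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        box, _, class_id = det
        if all(class_id != k[2] or iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept
```

With this sketch, a 90-frame clip yields three analyzed frames (indices 0, 30, and 60), and two heavily overlapping boxes of the same class collapse to the higher-scoring one, while an overlapping box of a different class is retained.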

The rectification pipeline was not fully implemented for this case study, as some parts were out of scope for this project. The rectification service uses data to retrain the object detection model in order to achieve better performance. This data can come from an error storage source that is created by humans identifying incorrectly labeled outputs while reviewing and annotating the prediction outputs stored by the inference pipeline, as described above. Additionally, the retraining data can contain augmented samples of objects that were difficult for the model to detect. These difficult samples are identified by hard negative mining in the `HardNegativeMiner` class, which defines difficult samples as those with a high cross-entropy loss, calculated using the `Loss` class. These retraining data sources are then used to train a new version of the object detection model, which will replace the version in the inference pipeline if performance improves. A list of the top N hardest sample files can be obtained using the `GET /hard_negatives` endpoint.
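
The idea of ranking samples by cross-entropy loss can be sketched as below. This is a simplified illustration, not the `HardNegativeMiner` or `Loss` implementation: it assumes each sample is reduced to a single predicted class-probability vector and a true class index, and the function names and input layout are hypothetical.

```python
import heapq
import math


def cross_entropy(probs, true_class, eps=1e-12):
    """Cross-entropy loss for one sample: -log(p[true_class])."""
    # Clamp to eps so a predicted probability of 0 does not raise an error.
    return -math.log(max(probs[true_class], eps))


def top_n_hardest(samples, n):
    """Return the n (filename, loss) pairs with the highest loss.

    `samples` maps a filename to (predicted_probs, true_class_index).
    This input layout is an assumption made for the example.
    """
    losses = {name: cross_entropy(p, c) for name, (p, c) in samples.items()}
    return heapq.nlargest(n, losses.items(), key=lambda kv: kv[1])
```

A sample whose true class received a low predicted probability gets a high loss and therefore ranks near the top of the list, which matches the intent of selecting the samples the model currently handles worst.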

### Model
The object detection model used for this case study is the YOLOv4 (["You Only Look Once version 4"](https://arxiv.org/abs/2004.10934)) object detection model. This is a one-stage network, meaning it localizes and classifies objects simultaneously in a single pass through the image, unlike two-stage networks, which first propose candidate regions and then classify them in a separate step. The one-stage design suits our robotic platform, where real-time performance is needed, even with the tradeoff of potentially lower accuracy compared to a slower two-stage network. There are two trained versions of the YOLOv4 model available in the `yolo_resources` folder.

## Metrics Definition

### Offline Metrics
Precision, recall, mAP, loss
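
For reference, precision and recall can be computed from counts of true positives, false positives, and false negatives, as in the minimal sketch below. The function name is hypothetical, and how a detection is matched to a ground-truth box (for example, by an IoU threshold) is left out and would need to be defined separately.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).

    Returns 0.0 for a metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```

mAP builds on these quantities by averaging, over all classes, the area under each class's precision-recall curve.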

### Online Metrics
Frame usage, detection time, class performance, overall performance

## Analysis of System Parameters and Configurations

### Inference Service

### Rectification Service

### Overall System?

## Post-Deployment Policies
### Monitoring and Maintenance Plan
The stored predictions and online metrics are a large part of the monitoring and maintenance plan. Every analyzed frame and its corresponding detections are saved in their own files. This gives us insight into everything the robot sees and allows us to identify potential issues or reasons why the robot and detection model may not be performing as expected. The frame usage metric that we track online will allow us to determine potential detections of interest, so we can focus our improvement efforts and tune our hard negative mining accordingly. Additional online metrics, such as time to detection, performance by class, and overall model performance over time, would all help ensure that the system continues to meet its performance requirements. Any ongoing maintenance and code changes would be performed locally, and the Docker image could then be built and deployed in a container to ensure stability and reproducibility.

### Fault Mitigation Strategies
Fault mitigation strategies may include backing up the Docker image so it can be redeployed if the system goes down or the robot is damaged. Containerization makes it easier to rebuild the system in exactly the same way each time. Additionally, storing logs and data outside of the system (e.g., in a database) would be ideal, so we could still access the data even if the robot is damaged, the Docker container goes down, or the Docker image is redeployed. Carefully monitoring the logged detections and online metrics, in addition to implementing more online metrics as described above, can help us catch potential issues early or as soon as they arise, allowing us to stop the robot if needed and invoke the rectification service to improve the robot's object detection performance. This can be supported by setting up alerts that notify humans of potential robot malfunction, perhaps using third-party monitoring tools like AWS CloudWatch or Datadog.