# Object Detection System Report

## System Design
Revisiting the original requirements below confirms that the system meets each of them.
1. The system should be able to detect 20 significant objects from a video stream (30 frames per second)
    * This was achieved by implementing an inference service containing a video processing module, an object detection module, and a non-maximal suppression module that are used together to detect objects in a video stream. The system was tested using 2 videos streaming at 30 frames per second. Additionally, analysis showed that the inference service can process roughly 50 images per second. Because the system processes only one out of every 2 frames, only 15 of the 30 frames per second need to be processed, which leaves the system well within its empirically determined maximum of 50 images per second, even after accounting for latency.
2. The system should allow human reviewers to fix errors and automatically update the model (human-in-the-loop)
    * This was achieved by implementing a rectification service containing a hard negative mining module to identify the hardest samples that should be used for retraining. Additional components for this service, which were out of scope for this case study, include a retraining module, error storage database, annotation module, and augmentation module. These additional components would allow human auditors to identify incorrectly labeled images, which would then be stored, augmented, and used to retrain the model. The current system does contain some image augmentation functions, including scaling and resizing, which could be used in the augmentation module. Additional augmentations to implement would include image transformations such as brightness adjustments, color adjustments, image rotation, and image flipping.
3. The system should be capable of using CPU-only computing resources for inference.
    * This was achieved by running all parts of the system using only the CPU resources on my laptop. In the docker container that runs main.py, two separate threads are used: one for streaming video and one for handling API requests. These two threads still only use CPU resources.
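
The throughput argument under requirement 1 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical and not part of the project code, and the inputs (30 fps stream, one of every 2 frames processed, roughly 50 images per second measured) are the figures quoted above.

```python
def required_rate(stream_fps: float, process_every_n: int) -> float:
    """Frames per second that actually need to be processed."""
    if process_every_n < 1:
        raise ValueError("process_every_n must be >= 1")
    return stream_fps / process_every_n


def headroom(measured_rate: float, stream_fps: float, process_every_n: int) -> float:
    """Ratio of measured capacity to required rate (> 1 means spare capacity)."""
    return measured_rate / required_rate(stream_fps, process_every_n)


# Figures quoted in this report: 30 fps stream, 1 of every 2 frames, ~50 images/s.
needed = required_rate(30, 2)    # 15.0 frames per second
spare = headroom(50, 30, 2)      # a little over 3x the required rate
```

A headroom factor above 1 leaves room for latency and for other processes competing for the same CPU.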

A diagram of the entire system, from the Module 6 lecture slides, can be seen below. There are three main services: the inference service, the rectification service, and the interface service. The inference service processes frames of a video stream and performs object detection on each processed frame, saving the detections and predictions to files. The rectification service is used to improve the object detection model by retraining, and allows for retraining on error data identified by human annotators, as well as augmented and difficult data identified by hard negative mining. Finally, the interface service is how the outside world interacts with the system, either by providing input via streaming video or by making API calls to retrieve information about the predictions made by the system.

![diagram](../assets/images/diagram.png)

## Data, Data Pipelines, and Model

### Data
There are three main types of data: representative training images and annotations, videos, and output prediction data. The data obtained from the `logistics.zip` file is stored in the `storages/training` directory and contains 9525 `.jpg` images with objects of 20 different classes. Each image has a corresponding `.txt` file containing annotations of the objects in that image in YOLO format. The two video files used to test the streaming capabilities of the system are stored in the `yolo_resources/test_videos` folder. Additionally, there is one test image in the `yolo_resources/test_images` folder. Finally, the system outputs detections and predictions into the `storages/prediction` folder. For each processed frame, this folder contains a `.jpg` image file labeled with bounding boxes of detected objects and a corresponding `.txt` file containing the predicted annotations in YOLO format.

### Data Pipelines
There are two main data pipelines: one involving the inference service and the other involving the rectification service. The inference pipeline begins by capturing frames from a video stream using the `InferenceService` class, analyzing one out of every 2 frames in the stream by using the `VideoProcessing` class. This decreases the computational load compared to processing all frames, allowing the system to detect objects in real time so that it can respond without delay. Each selected frame is then scaled and resized to a uniform size, determined during analysis to be 624x624 pixels, which is large enough for objects to be detected but small enough to limit the latency caused by processing large images. Next, the `YOLOObjectDetector` class with the trained YOLO model is used to detect objects in each frame, which includes localization and classification of the objects. Of the 2 models provided, Model 2 was chosen as it produced higher mAP than Model 1 during experimentation. Non-maximal suppression is performed using the `NMS` class to eliminate detections that overlap significantly with others, reducing the risk of false positives and of counting the same object multiple times. A list of detections can be obtained for a range of frames by using the `GET /detections_list` endpoint. Finally, the images with their detected bounding boxes and corresponding prediction annotations are saved to the `storages/prediction` folder for future human review. A `.zip` file containing the output images and the corresponding `.txt` files of annotations for a range of frames can be obtained using the `GET /predictions` endpoint.
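
The non-maximal suppression step described above can be sketched as a greedy filter. This is a minimal illustration rather than the project's `NMS` class: the function names are hypothetical, boxes are assumed to be in `(x1, y1, x2, y2)` corner format (YOLO annotations would need converting first), and class labels are ignored for brevity.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.3, score_threshold: float = 0.3) -> List[int]:
    """Return indices of the boxes kept, highest score first."""
    # Drop low-confidence detections, then visit the rest by descending score.
    order = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```

The default thresholds of 0.3 mirror the values chosen for the deployed system later in this report.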

The rectification pipeline was not entirely implemented for this case study as some parts were out of scope for this project. The rectification service uses data to retrain the object detection model in order to achieve better performance. This data can come from an error storage source that is created by humans identifying incorrectly labeled outputs by reviewing and annotating the prediction outputs that were stored from the inference pipeline in the `storages/prediction` folder as described above. Additionally, the retraining data can contain augmented samples derived from images containing objects that were difficult for the model to detect. These difficult samples are identified by hard negative mining in the `HardNegativeMiner` class, which determines difficult samples to be those with high cross entropy loss, calculated using the `Loss` class. These retraining data sources are then used to train a new version of the object detection model, which will replace the version in the inference pipeline if performance is improved. A list of the top N hardest sample files can be obtained using the `GET /hard_negatives` endpoint.
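
The ranking step of hard negative mining can be sketched as a top-N selection over per-sample losses. This is an assumption-laden illustration, not the `HardNegativeMiner` or `Loss` implementation: the function names, the file names, and the exact form of the combined loss (classification loss plus a lambda-weighted localization loss) are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def combined_loss(classification_loss: float, localization_loss: float,
                  lam: float = 1.0) -> float:
    """Assumed loss form: classification term plus lambda-weighted localization term."""
    return classification_loss + lam * localization_loss


def top_hard_negatives(losses: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n samples with the highest loss, hardest first."""
    return heapq.nlargest(n, losses.items(), key=lambda item: item[1])


# Hypothetical per-file losses, purely for illustration.
losses = {"frame_a.jpg": 0.2, "frame_b.jpg": 1.5, "frame_c.jpg": 0.9}
hardest = top_hard_negatives(losses, 2)
```

`heapq.nlargest` avoids sorting the full list when N is small relative to the number of samples.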

### Model
The object detection model used for this case study is the YOLOv4 (["You Only Look Once version 4"](https://arxiv.org/abs/2004.10934)) object detection model. This is a one-stage network, meaning it simultaneously localizes objects and classifies them, so that only one pass through the image is required, unlike two-stage networks where multiple passes are required since localization and classification are performed separately. The one-stage format is good for our robotic platform system where real-time performance is needed, even with the tradeoff of potentially decreased accuracy compared to a slower two-stage network. There are two trained versions of the YOLOv4 model available in the `yolo_resources` folder. Experimentation showed that Model 2 had slightly better mAP performance than Model 1, so that is the model that was chosen to be included in the deployed system configured in `deployment.py`.

## Metrics Definition
Most of the metrics tracked by the system are offline metrics due to time and feasibility constraints of this Case Study project. In the real world, cloud resources such as AWS CloudWatch, Datadog, and Splunk could be used to obtain more online metrics. These could be configured to include things like API errors and traffic, latency, and duration. Additionally, upon implementing the entire system (including the whole rectification service that was out of scope for this project), where humans would correctly annotate the robot's predicted outputs, then logging could be added to help detect things like online performance and any model drift.

### Offline Metrics
These metrics have been implemented in this case study, and are used by various modules (e.g. analysis and hard negative mining); however, the Flask application doesn't include an endpoint to directly get these metrics since that was not a specified requirement of the case study. In a real system, an endpoint to get these metrics would be useful, especially after the rectification service is built to include a module to collect human annotations of the system's predicted outputs.
1. **Precision**: This metric is tracked so we can identify how useful a model is for reducing the risk of false positives. This can give us an idea about how well our non-maximal suppression is working. Although this metric is sensitive to IOU threshold, it is necessary in order to calculate mAP (discussed below).
2. **Recall**: This is an important metric for ensuring that we're correctly identifying objects that are truly objects, allowing us to maximize true positives. This metric is also sensitive to IOU threshold, but it is necessary in order to calculate mAP (discussed below).
3. **Mean Average Precision**: Since precision and recall are both sensitive to things like variations in IOU threshold, a more robust measure of performance may be Average Precision. This metric would be more robust to changes in IOU threshold, but does not accurately represent performance on a dataset with imbalanced classes, since the metric would be dominated by the performance of the majority class. To combat this, the mean average precision (mAP) is used, which calculates the mean of the average precision by object class, allowing the performance of each class to be weighted equally. The mAP value can give us a robust measure of model performance, which can be useful when comparing different model versions and tuning model parameters.
4. **Cross Entropy Loss**: This metric is used for the hard negative mining criteria in the rectification service when determining which inputs had objects that were the most difficult for the model to detect. This metric is useful because it allows the system to comprehensively address errors since it accounts for both classification and localization errors, and ensures that both high and low confidence errors are considered. Although not explored in this case study, the loss metric can be tuned by adjusting the lambda parameter that determines how much weight to put on localization errors vs classification errors.
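
The relationship between precision, recall, Average Precision, and mAP described above can be sketched as follows. This is a generic illustration, not the project's Metrics module: the function names are hypothetical, detections are assumed to be already matched to ground truth at some IOU threshold, and AP is computed with all-point interpolation of the precision-recall curve, which is one common convention among several.

```python
from typing import Dict, Sequence


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def average_precision(scores: Sequence[float], is_tp: Sequence[bool],
                      num_gt: int) -> float:
    """Area under the precision-recall curve for one class."""
    if num_gt == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:  # walk detections from most to least confident
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(precision(tp, fp))
        recalls.append(tp / num_gt)
    # Precision envelope: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_by_class: Dict[int, float]) -> float:
    """Unweighted mean over classes, so each class counts equally."""
    return sum(ap_by_class.values()) / len(ap_by_class) if ap_by_class else 0.0
```

Because the mean is taken over classes rather than over detections, a rare class influences the result as much as a common one.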

### Online Metrics
These metrics were not implemented for this case study due to time and feasibility constraints, but given more time to develop a real-world system, these metrics could be useful.
1. **Inference time**: Keeping track of the time it takes to process frames would allow us to identify any changes in processing speed, which could cause latency and delays in robot response. Identifying these slowdowns before visible robotic delays occur would be important for safety in the warehouse.
2. **Frames requested**: Keeping track of how frequently a frame number is requested using the `GET /detections_list` and `GET /predictions` endpoints may be useful in identifying the most requested frames. Those frames could have special cases or objects of interest that may be good candidates for further analysis or inclusion in a retraining set.
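
One way the inference time metric could be tracked is with a rolling window of per-frame latencies. The sketch below is hypothetical and not part of the implemented system: the class name, the window size, and the budget value are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean


class LatencyMonitor:
    """Rolling average of per-frame processing time, compared against a budget."""

    def __init__(self, budget_seconds: float, window: int = 100) -> None:
        self.budget_seconds = budget_seconds
        self._samples = deque(maxlen=window)  # oldest samples drop off automatically

    def record(self, seconds: float) -> None:
        self._samples.append(seconds)

    def average(self) -> float:
        return mean(self._samples) if self._samples else 0.0

    def over_budget(self) -> bool:
        return self.average() > self.budget_seconds


# Processing 15 frames per second leaves at most 1/15 s per frame on average.
monitor = LatencyMonitor(budget_seconds=1 / 15, window=3)
for seconds in (0.01, 0.02, 0.03):
    monitor.record(seconds)
```

In practice each sample would come from timing a single frame with something like `time.perf_counter()` around the detection call, and `over_budget()` could feed an alert.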

## Analysis of System Parameters and Configurations
Exploratory data analysis was performed in [data_analysis.ipynb](data_analysis.ipynb), which led to the discovery of imbalanced classes, varying numbers of objects per image, and varying image qualities and orientations. More detail is in that file, but the general conclusions were that mAP would be a good metric for model performance evaluation due to the imbalanced classes, different IOU threshold values should be explored (see analysis below) to determine what works best for the varying numbers and sizes of objects in images, and the diversity in the representative training set may lead to reduced overfitting and better generalizability to production or unseen data.

The following design considerations and analyses were performed and described in [design_considerations.ipynb](design_considerations.ipynb), and are summarized again here.

### Design Considerations
1. **IOU Threshold**: The Intersection-Over-Union (IOU) threshold is used to determine whether bounding boxes are overlapping during the Non-Maximal Suppression step as well as the object matching step during performance metric calculations. A higher IOU threshold would require a larger degree of overlap to consider two bounding boxes to represent the same object. This would be useful for cases when objects are close together, but may lead to false positives if objects are further apart. Lower IOU thresholds may help eliminate redundant detections, but may lead to missing some detections, especially if objects are close together. Analysis of the IOU threshold was performed (see below) by testing various thresholds [0.1, 0.3, 0.5, 0.7, 0.9] and plotting the IOU threshold against Mean Average Precision. Future analysis could be done the same way, but applied to each object class separately to determine whether different IOU thresholds are better for different object classes.
2. **Image Size**: The input images into the system must be resized so that they are all the same size and can be ingested by the model. Larger images contain more pixel information and therefore ideally lead to more accurate detections; however, they're more computationally expensive to process. Smaller images can be processed quicker, but have less information and may lead to less accurate predictions. An analysis of image size was performed (see below) by testing various image sizes [208, 312, 416, 624, 832] and plotting the image size against Mean Average Precision as well as inference time. This allows us to find an image size that balances both speed and performance.
3. **Frame Skipping**: Skipping frames can reduce the computational time required to process a video, which can be especially useful for real-time applications like a robotic platform in a warehouse. If we only process one out of every five frames, for example, we could process the frames in a video in 1/5 of the time it would take to process every frame. The speed and computational efficiency come at the cost of decreased information. Since we're not processing some frames, we're not seeing some data, which may lead to worse overall system performance. If we skip too many frames, we may miss an object entirely if it's only in frame for a moment. To find a balance between speed and performance, we can analyze a video with known detections and run our prediction process on the video, skipping various numbers of frames each time we run the prediction. We can then plot number of frames skipped against both mAP and inference time. By plotting these 2 relationships, we can identify the number of frames to skip to achieve a balance of good performance and speed.
4. **Detection Ranking Criteria for Hard Negative Mining**: Hard negative mining is used to find samples that the model performed the worst on, so they can be used to retrain the model on these difficult input examples. The criteria for ranking the worst performing inputs can be based on confidence score or loss. The confidence score directly reflects a model's certainty about its prediction, so it's easy to interpret. For example, high-confidence false positives mean that the sample will likely significantly impact model performance due to the confidence level. This allows the model to be retrained with the most problematic errors directly. Alternatively, the loss score comprehensively addresses the errors since it accounts for both classification and localization errors, and ensures that both low and high confidence errors are addressed. The loss score can be further tuned by adjusting the lambda parameter that determines how much weight to put on the localization errors vs classification errors. To analyze the ranking criteria, different measures could be used, such as cross entropy loss with different lambda parameters, and confidence score. The mAP computed on the highest ranked hard negatives should be lower than the mAP computed on the lowest ranked hard negatives. Additionally, retraining a model with the identified hard negatives should lead to increased mAP of the resulting model. This mAP metric will help determine the best detection ranking criteria to use for hard negative mining.
5. **Dataset Size**: Increased training dataset size potentially could result in a better performing model if the dataset includes good, diverse data that accurately represents test/real world data. However, lack of diversity or imbalances in the training set could be more likely to lead to overfitting with a larger dataset size. Additionally, larger datasets take longer to train and process than smaller datasets. To decide on an appropriate dataset size, various training sets of different sizes can be used to train the model and calculate the model performance (again using mAP), keeping track of training time. We can then plot dataset size vs mAP and dataset size vs training time, and find a dataset size that balances both speed and performance. Speed will be important in cases where the robotic platform system is making errors, so that rectification and retraining are required. Ideally this rectification and retraining process should be efficient so that the robot is not out of use for too long.
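
The frame skipping consideration above amounts to selecting every Nth frame from the stream. A minimal sketch, assuming a hypothetical helper name and treating the stream as any iterable of frames (the actual `VideoProcessing` module may work differently):

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def select_frames(frames: Iterable[T], process_every_n: int = 2) -> Iterator[Tuple[int, T]]:
    """Yield (frame_index, frame) for one out of every `process_every_n` frames."""
    if process_every_n < 1:
        raise ValueError("process_every_n must be >= 1")
    for index, frame in enumerate(frames):
        if index % process_every_n == 0:
            yield index, frame


# One second of a 30 fps stream, represented here by frame numbers only.
one_second = range(30)
processed = list(select_frames(one_second, process_every_n=2))  # 15 frames
```

Keeping the original frame index alongside each frame makes it possible to map saved predictions back to positions in the stream.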

### Inference Service
In the analyses below (except dataset size analysis), a random sample of 1000 of the 9525 representative training images was used due to computational and time constraints. Ideally, with more processing power, the analyses would be repeated with all of the representative training data. The implementations of all analyses are in [design_considerations.ipynb](design_considerations.ipynb).
1. **Inference Time and Mean Average Precision vs. Image Size**: By running the inference service and calculating the mAP for 1000 images, repeating on different image sizes of [208, 312, 416, 624, 832], I was able to plot the Average Inference Time per Image vs Image Size and mAP vs Image Size. The results indicate that an image size of 416 may be efficient since the mean processing time decreases at 416, before increasing again for an image size of 624. However, the mAP results indicate that an image size of 416 produces the lowest mAP while 624 produces the highest mAP. For this reason, I chose to use an image size of 624 in the deployed system and the remaining analyses.
2. **Inference Time and Mean Average Precision vs. Dataset Size**: By running the inference service and calculating the mAP, repeating on different dataset sizes of [500, 1000, 1500, 2000, 2500], I was able to plot the Inference Time vs Dataset Size and mAP vs Dataset Size. The results indicate that there's a positive linear relationship between dataset size and processing time, corresponding to a rate of roughly 50 images per second. This suggests that the inference service can process up to 50 frames per second. Theoretically, this means that we can meet the system requirement of processing a video stream at 30fps. To account for latency and any decrease in speed due to the system running other processes, we can set the skip frames parameter to 2 in the `VideoProcessing` module, meaning we process one out of every two frames, so that only 15 frames per second of the 30fps stream need to be processed. Additionally, the results showed a higher mAP value for lower dataset size. This led to my choice to use a `skip__every_frame` parameter value of 2 in the `VideoProcessing` component of my deployed system.
3. **Mean Average Precision vs. IOU Threshold**: I performed this analysis using both trained versions of the YOLO Object Detection Models. By running the inference service and calculating the mAP for 1000 images, repeating with different IOU thresholds of [0.1, 0.3, 0.5, 0.7, 0.9], I was able to plot the mAP vs IOU Threshold. The results indicate that mAP decreases almost linearly (but not quite) as IOU threshold increases. It makes sense that mAP may decrease with a higher (stricter) IOU threshold since there would likely be fewer matches between predictions and ground truth labels. This led to my choice to use an IOU threshold of 0.3 in my deployed system and for the remaining analyses. Although 0.1 produced the highest mAP, this seemed too low to adequately distinguish between objects that may be close together in real world data. Additionally, the results showed that Model 2 had a higher mAP than Model 1 for all IOU thresholds, except 0.3 where the mAP was roughly the same, so I chose to use Model 2 in my deployed system and remaining analyses.
4. **Mean Average Precision vs. Score Threshold**: I performed this analysis using both trained versions of the YOLO Object Detection Models. By running the inference service and calculating the mAP for 1000 images, repeating with different score thresholds of [0.1, 0.3, 0.5, 0.7, 0.9], I was able to plot the mAP vs Score Threshold. The results indicate that mAP decreases almost linearly (but not quite) as score threshold increases. It makes sense that mAP may decrease with a higher (stricter) confidence score threshold since there would likely be fewer matches when determining overlapping boxes in NMS, resulting in more redundant detections. This led to my choice to use a score threshold of 0.3 in my deployed system and for the remaining analyses. Although 0.1 produced the highest mAP, this seemed too low to adequately distinguish between objects that may be close together in real world data. Additionally, the results showed that Model 2 had a higher mAP than Model 1 for all score thresholds, except 0.3 where the mAP was roughly the same, so this confirmed my choice to use a 0.3 threshold and Model 2 in my deployed system and remaining analyses.
5. **Average Precision by Object Class**: I performed this analysis using both trained versions of the YOLO Object Detection Models. By running the inference service for 1000 images and calculating the Average Precision for each object class, I was able to plot the AP vs Object Class. The results indicate that AP is the same for all classes and for both models. This was surprising, especially given the differences in number of objects per class among the 1000 sampled images. The results may be odd due to randomly sampling 1000 out of the 9525 representative training images. With more computational resources, repeating this analysis on the whole training set would be ideal. Additionally, I calculated the AP using the mAP function, passing only the ids and scores for a single class at a time, with `num_classes=1`. I expected this to work adequately, but if the function does not behave as I assumed with `num_classes=1`, then this may have caused unexpected results. Regardless, these results led to my choice to use a consistent IOU threshold of 0.3 across all classes in my deployed system, instead of investigating class-specific IOU thresholds. I know that this analysis didn't directly test the effect of IOU thresholds on performance for specific classes, and given more time, I would perform the IOU threshold test mentioned above on a single class of data at a time. However, the consistent results across all classes here led me to believe that the system performs equally well (or not well) on all classes, so the class-specific IOU threshold test was not needed for now. It would be needed for more class-specific tuning.
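
The analyses above share one pattern: vary a single parameter, evaluate, and record both the score and the elapsed time. A minimal harness for that pattern is sketched below. It is not the notebook code: the names are hypothetical, and `evaluate` is a stand-in for whatever function runs inference and returns mAP for a given parameter value.

```python
import time
from typing import Callable, Iterable, List, NamedTuple


class SweepResult(NamedTuple):
    value: float     # parameter value tested (image size, threshold, ...)
    score: float     # metric returned by evaluate, e.g. mAP
    seconds: float   # wall-clock time spent evaluating this value


def sweep(values: Iterable[float], evaluate: Callable[[float], float]) -> List[SweepResult]:
    """Evaluate each parameter value once, timing each evaluation."""
    results = []
    for value in values:
        start = time.perf_counter()
        score = evaluate(value)
        results.append(SweepResult(value, score, time.perf_counter() - start))
    return results


def best_by_score(results: List[SweepResult]) -> SweepResult:
    return max(results, key=lambda r: r.score)


# Stub evaluation for illustration only; a real run would compute mAP here.
def fake_evaluate(threshold: float) -> float:
    return 1.0 - threshold


results = sweep([0.1, 0.3, 0.5, 0.7, 0.9], fake_evaluate)
```

The highest-scoring value is only a starting point: as noted above, 0.3 was chosen over 0.1 for reasons the mAP number alone does not capture.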

### Rectification Service
1. **Mean Average Precision by Hard Negative Rank**: I performed this analysis using only Model 2 since it was determined to perform better than Model 1 in the above analyses. By running the hard negative miner for 1000 images, and then running the inference service on the results and calculating the Mean Average Precision for subsets of 100 images at a time, I was able to plot the mAP vs hard negative ranking. The results indicate that mAP increases as the hard negative ranking number gets higher (i.e. ranked lower). This confirms that the hard negative mining module seems to be working as expected, since we'd expect the highest ranked hard negatives to have the worst performance when provided to the model for inference. This is because the highest ranked hard negatives are supposed to be the hardest for the model to detect, i.e. the most erroneous, resulting in the worst mAP value. The lowest ranked hard negatives should have a higher mAP value because they should be easier for the model to detect. This is good because we know that we'll be retraining the model on the right data in the rectification service, resulting ideally in the largest gains in model performance improvement after retraining on the highest ranked hard negatives.
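
The rank-bucket check described above can be sketched as follows. The helper names are hypothetical, and `evaluate` stands in for running inference on a bucket of samples and returning its mAP.

```python
from typing import Callable, List, Sequence, Tuple


def score_by_rank_bucket(ranked: Sequence[str],
                         evaluate: Callable[[Sequence[str]], float],
                         bucket_size: int = 100) -> List[Tuple[int, float]]:
    """Split a hardest-first ranking into buckets and score each bucket.

    Returns (starting_rank, score) pairs in rank order.
    """
    scored = []
    for start in range(0, len(ranked), bucket_size):
        bucket = ranked[start:start + bucket_size]
        scored.append((start, evaluate(bucket)))
    return scored


def is_non_decreasing(values: Sequence[float]) -> bool:
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))
```

If the miner ranks samples sensibly, the per-bucket scores should trend upward with rank, which `is_non_decreasing` checks in its strictest form. Real results may be noisier, so a looser trend test could be more appropriate.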

### Overall System
Overall, each component of the system allows for flexibility and tuning, so that system performance can be adjusted, possibly as part of future rectification enhancements. For example, the NMS and Metrics modules allow the IOU threshold to be passed as a parameter, the YOLOObjectDetector allows frame size to be specified, the hard negative mining module allows a different ranking criterion to be passed, and the VideoProcessing module allows the frame skip rate and output frame size to be specified via parameters. This flexibility and parameterization will allow our system to be tuned and refined as needed based on the results and potential issues identified during rectification.
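
One way to keep these tunable parameters in a single place is a small validated configuration object. The sketch below is hypothetical (the class and field names are not from the project code); the default values are the ones chosen in the analyses above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DetectionConfig:
    """Hypothetical bundle of the tunable parameters discussed in this report."""
    image_size: int = 624          # square input size in pixels
    iou_threshold: float = 0.3     # used by NMS and metric matching
    score_threshold: float = 0.3   # minimum confidence to keep a detection
    process_every_n: int = 2       # process one out of every N frames

    def __post_init__(self) -> None:
        if self.image_size <= 0:
            raise ValueError("image_size must be positive")
        if not 0.0 <= self.iou_threshold <= 1.0:
            raise ValueError("iou_threshold must be in [0, 1]")
        if not 0.0 <= self.score_threshold <= 1.0:
            raise ValueError("score_threshold must be in [0, 1]")
        if self.process_every_n < 1:
            raise ValueError("process_every_n must be >= 1")


default_config = DetectionConfig()
# Derive a variant for an experiment without mutating the original.
stricter = replace(default_config, iou_threshold=0.5)
```

Making the object frozen means a tuning experiment cannot silently change the deployed settings. `dataclasses.replace` returns a new, re-validated copy instead.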

## Post-Deployment Policies
### Monitoring and Maintenance Plan
The stored predictions and online metrics are a large part of the monitoring and mitigation plan. Every analyzed frame and its corresponding detections are saved in their own files. This gives us insight into everything the robot sees, and will allow us to identify any potential issues or reasons why the robot and detection model may not be performing as expected. The online metric of frame usage that we're tracking will allow us to determine potential detections of interest, so we can focus our improvement efforts and tune our hard negative mining efforts accordingly. Additional online metrics such as time to detection, performance by class, and overall model performance over time would all help to ensure our system is continuing to meet performance requirements. Any ongoing maintenance and code changes would be performed locally, and the Docker image could be built and deployed in a container to ensure stability and reproducibility.

### Fault Mitigation Strategies
Some fault mitigation strategies may include backing up the Docker image so it can be redeployed if for some reason the system goes down or the robot is damaged. Containerization makes it easier to rebuild the system the exact same way repeatedly. Additionally, storing any logs and our data outside of the system (i.e. in a database) would be ideal, so we could still access the data even if the robot is damaged, the Docker container goes down, or the Docker image is redeployed. Carefully monitoring the logged detections and online metrics, in addition to implementing more online metrics as described above, can help us catch potential issues before or immediately when they arise, allowing us to stop the robot if needed and invoke the rectification service to improve the robot's object detection performance. This can be done by setting up alerts to notify humans of potential robot malfunction, perhaps using third party monitoring tools like AWS CloudWatch or Datadog.