# Computer Vision - Object Detection

In image classification, each sample typically contains a single object, for example a cat or a dog, and the model assigns the whole image to one class.

Classification with localization goes one step further: given an image, we want to learn both the class of the object and where it is located in the image. The model outputs a class label and a bounding rectangle around the object. Usually only one object is present in the image.

![Image of Cat](cat.png)

## What is object detection?
Object detection refers to the capability of computer and software systems to locate objects in an image/scene and identify each object. Object detection has been widely used for face detection, vehicle detection, pedestrian counting, web images, security systems and driverless cars. 

The breakthrough and rapid adoption of deep learning in 2012 brought into existence modern and highly accurate object detection algorithms and methods such as R-CNN, Fast-RCNN, Faster-RCNN, RetinaNet and fast yet highly accurate ones like SSD and YOLO.

![Object Detection](odexample.png)

The above image is a popular example illustrating how an object detection algorithm works. Each object in the image, from a person to a kite, has been located and identified with a certain level of precision.

## Techniques

### CNN
A straightforward approach is to split the image into regions and then classify each region with a CNN.

![Object Detection Original Image](odimg1.png)
![Object Detection Split Image](odimgsplit.png)
![Object Detection Selected Region](odimgsel.png)

The problem with this approach is that the objects in the image can have different aspect ratios and spatial locations. For instance, in some cases the object might cover most of the image, while in others it might cover only a small percentage of the image. The shapes of the objects might also differ, which happens a lot in real-life use cases. As a result of these factors, we would require a very large number of regions, resulting in a huge amount of computational time.
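
To get a feel for how quickly the number of regions grows, here is a minimal sketch that counts the crops a plain sliding window would produce. The image size, window sizes and stride are made-up values for illustration, and `count_windows` is a hypothetical helper, not part of any library.

```python
def count_windows(image_w, image_h, window_sizes, stride):
    """Count sliding-window crops for a list of (width, height) window sizes."""
    total = 0
    for win_w, win_h in window_sizes:
        if win_w > image_w or win_h > image_h:
            continue  # this window does not fit inside the image
        steps_x = (image_w - win_w) // stride + 1
        steps_y = (image_h - win_h) // stride + 1
        total += steps_x * steps_y
    return total


# Three scales times three aspect ratios (all assumed values).
window_sizes = [(s * rw, s * rh)
                for s in (32, 64, 128)
                for rw, rh in ((1, 1), (2, 1), (1, 2))]
num_regions = count_windows(640, 480, window_sizes, stride=8)
print(num_regions)
```

Every one of these crops would need its own CNN forward pass, which is why the brute-force approach does not scale.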

So to solve this problem and reduce the number of regions, we can use region-based CNN, which selects the regions using a proposal method. Let’s understand what this region-based CNN can do for us.

### Region-based CNN (RCNN)
Instead of working on a massive number of regions, the RCNN algorithm proposes a bunch of boxes in the image and checks if any of these boxes contain any object. RCNN uses selective search to extract these boxes from an image (these boxes are called regions).

Let’s first understand what selective search is and how it identifies the different regions. Four kinds of cues help to delineate an object: varying scales, colors, textures, and enclosure. Selective search identifies these patterns in the image and, based on them, proposes various regions. The following is a simplified sequence of steps for RCNN:
- First, take a pre-trained convolutional neural network.
- Then retrain (fine-tune) this model: train the last layer of the network based on the number of classes that need to be detected.
- The third step is to get the Regions of Interest for each image. Then reshape all these regions so that they match the CNN input size.
- After getting the regions, train an SVM to classify objects and background. For each class, train one binary SVM.
- Finally, train a linear regression model to generate tighter bounding boxes for each identified object in the image.
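
The reshaping in the third step can be sketched as a crop followed by a resize. The snippet below is only an illustration: it assumes a single-channel image stored as a list of rows, boxes given as `(x1, y1, x2, y2)` with exclusive `x2`/`y2`, and nearest-neighbour sampling. `crop_and_warp` is a hypothetical helper name; real pipelines typically use an image library for this.

```python
def crop_and_warp(image, box, out_size):
    """Crop box = (x1, y1, x2, y2) from a 2-D image (list of rows) and
    resize the crop to out_size x out_size with nearest-neighbour sampling."""
    x1, y1, x2, y2 = box
    crop_h, crop_w = y2 - y1, x2 - x1
    warped = []
    for row in range(out_size):
        src_y = y1 + (row * crop_h) // out_size  # source row for this output row
        warped.append([image[src_y][x1 + (col * crop_w) // out_size]
                       for col in range(out_size)])
    return warped


# Toy 4 x 6 "image" whose pixel value encodes its position.
image = [[r * 10 + c for c in range(6)] for r in range(4)]
region = crop_and_warp(image, (0, 0, 2, 2), out_size=4)
```

Whatever the shape of the proposed region, the output always has the same size, so every region can be fed to the same CNN.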

![Object Detection Step1](odimg2.png)
![Object Detection Step2](odimgregions.png)
![Object Detection Step3](odimgcnn.png)
![Object Detection Step4](odimgsvm.png)
![Object Detection Step5](odimgpred.png)

**Limitations**
- Training an RCNN model is expensive and slow. For example:
    - Extracting 2,000 regions for each image based on selective search
    - Extracting features using CNN for every image region. Suppose we have N images, then the number of CNN features will be N*2,000
    - The entire process of object detection using RCNN has three models:
        - CNN for feature extraction
        - Linear SVM classifier for identifying objects
        - Regression model for tightening the bounding boxes.
- It takes around 40-50 seconds to make predictions for each new image, which makes the model cumbersome and practically impossible to use when faced with a gigantic dataset.

### Fast RCNN
In Fast RCNN, the input image is fed into the CNN once, which generates the convolutional feature maps. The region proposals are then mapped onto these feature maps, and a RoI pooling layer reshapes every proposed region into a fixed size so that it can be fed into a fully connected network.
1. As with the earlier two techniques, take an image as an input.
2. The image is passed to a ConvNet, which produces the feature maps from which the Regions of Interest are extracted.
3. A RoI pooling layer is applied on all of these regions to reshape them to the fixed size expected by the fully connected layers. Then, each region is passed on to the fully connected network.
4. A softmax layer is used on top of the fully connected network to output classes. Along with the softmax layer, a linear regression layer is used in parallel to output bounding box coordinates for the predicted classes.
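
As a rough illustration of step 3, RoI pooling can be thought of as splitting each region of the feature map into a fixed grid of bins and keeping the maximum of each bin. The sketch below assumes a single-channel feature map stored as a list of rows and a RoI given as `(x1, y1, x2, y2)` in feature-map coordinates with exclusive `x2`/`y2`. `roi_max_pool` is a hypothetical name, and real implementations work on multi-channel tensors.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (x1, y1, x2, y2) of a 2-D feature map
    (list of rows) into a fixed out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range of bin i; max(...) guarantees at least one row per bin.
        r0 = y1 + (i * roi_h) // out_h
        r1 = max(y1 + ((i + 1) * roi_h) // out_h, r0 + 1)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * roi_w) // out_w
            c1 = max(x1 + ((j + 1) * roi_w) // out_w, c0 + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4 x 4 feature map with values 0..15.
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
print(roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2))  # 4 x 4 region -> 2 x 2
print(roi_max_pool(feature_map, (1, 1, 4, 4), 2, 2))  # 3 x 3 region -> 2 x 2
```

Regions of different sizes end up with the same output shape, which is what lets a single fully connected network handle all of them.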

**Note: Instead of using three different models (like in RCNN), Fast RCNN uses a single model which extracts features from the regions, divides them into different classes, and returns the bounding boxes for the identified classes simultaneously.**

![Object Detection Step2](odimgroi.png)
![Object Detection Step3](odimgroipool.png)
![Object Detection Step4](odimg2pred.png)

**Limitations**
- Fast RCNN also uses selective search to find the Regions of Interest, which is a slow and time-consuming process.
- It takes around 2 seconds per image to detect objects, which is much better than RCNN. But when we consider large real-life datasets, even a Fast RCNN doesn’t look so fast anymore.

### Faster RCNN
Faster RCNN is a modified version of Fast RCNN. The major difference between them is that Fast RCNN uses selective search for generating Regions of Interest, while Faster RCNN uses a Region Proposal Network (RPN). The RPN takes image feature maps as input and generates a set of object proposals, each with an objectness score, as output.
1. Take an image as input and pass it to the ConvNet, which returns the feature map for that image.
2. A Region Proposal Network is applied on these feature maps. This returns the object proposals along with their objectness scores.
3. A RoI pooling layer is applied on these proposals to bring all of them down to the same size.
4. Finally, the proposals are passed to a fully connected layer which has a softmax layer and a linear regression layer on top, to classify the objects and output their bounding boxes.

**How does the Region Proposal Network work?**
Faster RCNN takes the feature maps from the CNN and passes them on to the Region Proposal Network. RPN uses a sliding window over these feature maps, and at each window, it generates k anchor boxes of different shapes and sizes.

![RPN](RPN.png)

Anchor boxes are fixed-size bounding boxes that are placed throughout the image and have different shapes and sizes. For each anchor, RPN predicts two things:
- The first is the probability that an anchor contains an object (it does not consider which class the object belongs to)
- The second is the bounding box regressor for adjusting the anchor to better fit the object

These bounding boxes of different shapes and sizes are passed on to the RoI pooling layer. After the RPN step, the proposals have no classes assigned to them yet. Each proposal is cropped from the feature maps so that it covers one candidate object; this is what the RoI pooling layer does. It extracts fixed-size feature maps for each proposal. These feature maps are then passed to a fully connected layer which has a softmax and a linear regression layer on top. It finally classifies the object and predicts the bounding boxes for the identified objects.
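
The k anchor boxes at one sliding-window position can be sketched as every combination of a few scales and aspect ratios. The snippet below is a minimal illustration: `make_anchors` is a hypothetical helper, and the scales and ratios are assumed values (3 scales and 3 ratios give k = 9). Each anchor keeps an area of `scale**2` while its width-to-height ratio changes.

```python
import math


def make_anchors(center_x, center_y, scales, aspect_ratios):
    """Return k = len(scales) * len(aspect_ratios) anchors centred on one
    sliding-window position, as (x1, y1, x2, y2) tuples."""
    anchors = []
    for scale in scales:
        for ratio in aspect_ratios:
            # Keep the area at scale**2 while width / height equals ratio.
            width = scale * math.sqrt(ratio)
            height = scale / math.sqrt(ratio)
            anchors.append((center_x - width / 2, center_y - height / 2,
                            center_x + width / 2, center_y + height / 2))
    return anchors


anchors = make_anchors(100, 100, scales=(128, 256, 512),
                       aspect_ratios=(0.5, 1.0, 2.0))
print(len(anchors))  # k anchors for this position
```

The RPN then scores each of these anchors for objectness and predicts offsets that adjust it towards the nearest object.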

**Limitations**
RCNN, Fast RCNN and Faster RCNN all use regions to identify the objects. The network does not look at the complete image in one go, but focuses on parts of the image sequentially. This creates two complications:
- The algorithm requires many passes through a single image to extract all the objects
- As there are different systems working one after the other, the performance of the systems further ahead depends on how the previous systems performed

![Algorithm Comparisons](algocomp.jpg)

## References

[https://www.analyticsvidhya.com/blog/2018/10/a-step-by-step-introduction-to-the-basic-object-detection-algorithms-part-1/](https://www.analyticsvidhya.com/blog/2018/10/a-step-by-step-introduction-to-the-basic-object-detection-algorithms-part-1/)

**RCNN**

![RCNN](rcnn1.png)

![RCNN](RCNN.png)

**Fast RCNN**

![Fast RCNN](fast-rcnn.png)

**Faster RCNN**

![Faster RCNN](faster-rcnn.png)

[RCNN implementation from scratch - https://github.com/1297rohit/RCNN](https://github.com/1297rohit/RCNN)

Wouldn't it be easier to take some ready-made modules and use them for object detection instead of building everything from scratch?

## Introducing YOLO

The YOLO (You Only Look Once) framework, on the other hand, deals with object detection in a different way. It takes the entire image in a single pass and predicts the bounding box coordinates and class probabilities for these boxes. The biggest advantage of YOLO is its speed: it can process 45 frames per second. YOLO also learns generalized object representations.

![YOLO](yolo.png)

**Steps**
1. Take the input image
2. The framework then divides the input image into a grid of cells (say a 3 X 3 grid)
3. Image classification and localization are applied to each grid cell. YOLO then predicts the bounding boxes and their corresponding class probabilities for objects

Labelled data needs to be passed to the model in order to train it. Suppose the image is divided into a grid of size 3 X 3 and there are a total of 3 classes. Let’s say the classes are Dog, Cat, and Tree. Then, for each grid cell, the label y will be an eight-dimensional vector:
1. Probability of whether an object is present in the grid cell or not (1 value)
2. x, y, width and height of the bounding box if there is an object (4 values)
3. Object classes, here Dog, Cat and Tree (3 values). For example, if the object is a Cat, then Cat will be 1, and Dog and Tree will be 0
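
The label layout described above can be sketched in a few lines. This is only an illustration under assumed conventions: object centres and sizes are normalised to [0, 1] relative to the whole image, the stored x and y are offsets inside the responsible grid cell, and width and height stay relative to the image. `encode_labels` is a hypothetical helper, not a YOLO API.

```python
CLASSES = ["Dog", "Cat", "Tree"]
GRID = 3


def encode_labels(objects, grid=GRID, classes=CLASSES):
    """Build a grid x grid x (5 + number of classes) label as nested lists.
    Each object is (class_name, cx, cy, w, h), normalised to the image size."""
    depth = 5 + len(classes)
    labels = [[[0.0] * depth for _ in range(grid)] for _ in range(grid)]
    for name, cx, cy, w, h in objects:
        # The cell that contains the object's centre is responsible for it.
        col = min(int(cx * grid), grid - 1)
        row = min(int(cy * grid), grid - 1)
        cell = labels[row][col]
        cell[0] = 1.0                                         # object present
        cell[1:5] = [cx * grid - col, cy * grid - row, w, h]  # box
        cell[5 + classes.index(name)] = 1.0                   # one-hot class
    return labels


# One cat centred in the middle of the image.
labels = encode_labels([("Cat", 0.5, 0.5, 0.2, 0.4)])
print(labels[1][1])
```

Cells without an object keep an all-zero vector, so the full label has shape 3 X 3 X 8.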

Using the above example (input image of, say, 100 X 100 X 3 and output of 3 X 3 X 8), the model will be trained as follows:
![YOLO Model](yolomodel.png)

**Note: Generally in real-world scenarios larger grid sizes (perhaps 19 X 19) are used**

### Cost Functions
1. Intersection over Union:
    - Measures how well the predicted box overlaps the ground truth box
    - IoU = Area of the intersection / Area of the union
    ![Intersection Over Union](iou.png)
2. Non Max Suppression
    - One of the most common problems with object detection algorithms is that rather than detecting an object just once, they might detect it multiple times

    ![Multi Detect](multidetect.png)
    
    - Algorithm:
        1. Discard all the boxes having probabilities less than or equal to a pre-defined threshold (say, 0.5)
        2. For the remaining boxes:
            - Pick the box with the highest probability and take that as the output prediction
            - Discard any other box whose IoU with the output box from the above step is greater than an IoU threshold (say, 0.5)
            - Repeat step 2 until all the boxes are either taken as the output prediction or discarded
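
Both ideas above fit in a short sketch. It assumes boxes are `(x1, y1, x2, y2)` tuples and uses separate, assumed defaults of 0.5 for the score threshold and the IoU threshold; `iou` and `non_max_suppression` are illustrative helper names rather than functions from a particular library.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Return the indices of the boxes kept as predictions, best first."""
    # Step 1: drop boxes whose score is at or below the threshold.
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    # Step 2: keep the best box, discard its heavy overlaps, and repeat.
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
print(non_max_suppression(boxes, scores))
```

In this toy input the second box overlaps the first heavily and is suppressed, the third box is far away and survives, and the fourth is removed by the score threshold.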


## Sample Code
[https://github.com/enggen/Deep-Learning-Coursera/blob/master/Convolutional%20Neural%20Networks/Week3/Car%20detection%20for%20Autonomous%20Driving/Autonomous%20driving%20application%20-%20Car%20detection%20-%20v1.ipynb](https://github.com/enggen/Deep-Learning-Coursera/blob/master/Convolutional%20Neural%20Networks/Week3/Car%20detection%20for%20Autonomous%20Driving/Autonomous%20driving%20application%20-%20Car%20detection%20-%20v1.ipynb)

## Single Shot Detection (SSD)

Single Shot MultiBox Detector (SSD) is a deep learning model used to detect objects in an image or a video source. SSD has two components: the backbone model and the SSD head. The backbone model is a pre-trained image classification network used as a feature extractor; usually, its fully connected classification layer is removed. The SSD head is another set of convolutional layers added to this backbone, and its outputs are interpreted as the bounding boxes and classes of objects at the spatial locations of the final layers' activations.

**Steps**
- Start with a feature layer of size m×n (number of locations) with p channels
- For each location, we get k bounding boxes
- For each bounding box, we compute c class scores and 4 offsets relative to the original default bounding box shape
- Thus, we get (c+4)×k×m×n outputs
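
The output count is easy to check with a little arithmetic. The helper below is a hypothetical illustration, and the layer size, number of boxes and number of classes are assumed values chosen only to show the order of magnitude.

```python
def ssd_output_count(m, n, k, c):
    """Number of values predicted for one m x n feature layer with
    k default boxes per location and c class scores per box."""
    return (c + 4) * k * m * n


# Assumed example: a 38 x 38 feature layer, 4 boxes per location, 21 classes.
print(ssd_output_count(m=38, n=38, k=4, c=21))
```

Even a single feature layer produces a large number of predictions, which is why the raw outputs are filtered afterwards, for example with non-max suppression.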

![SSD vs YOLO](ssd-yolo.png)

## Mean Average Precision (mAP)

The mean average precision (mAP), sometimes simply referred to as AP, is a popular metric used to measure the performance of models doing document/information retrieval and object detection tasks.

A refresher on the precision and recall metrics:

![Precision Recall](precission-recall.png)

**Note: The higher the precision, the more confident the model is when it classifies a sample as Positive. The higher the recall, the more positive samples the model correctly classified as Positive.**

When a model has high recall but low precision, it classifies most of the positive samples correctly but also produces many false positives (i.e. it classifies many Negative samples as Positive). When a model has high precision but low recall, it is accurate when it classifies a sample as Positive, but it may find only some of the positive samples.
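
The two situations can be made concrete with a small sketch. The counts of true positives, false positives and false negatives below are made up for illustration, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive and
    false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# High recall, low precision: most positives are found, with many false alarms.
print(precision_recall(tp=90, fp=210, fn=10))

# High precision, low recall: few false alarms, but many positives are missed.
print(precision_recall(tp=30, fp=2, fn=70))
```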

Due to the importance of both precision and recall, there is a precision-recall curve that shows the tradeoff between the precision and recall values for different thresholds. This curve helps to select the threshold that best balances both metrics.

The average precision (AP) is a way to summarize the precision-recall curve into a single value representing the average of all precisions.

Using a loop that goes through all precisions/recalls, the difference between the current and next recalls is calculated and then multiplied by the current precision. In other words, the AP is the weighted sum of precisions at each threshold, where the weight is the increase in recall.

![AP Formula](ap-formula.jpg)
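
The weighted sum described above can be sketched as a short loop. This is a minimal illustration under stated assumptions: the precision and recall lists are ordered from the lowest threshold to the highest, so recall does not increase along the list, and a final point with precision 1 and recall 0 is appended to close the curve. The input values are made up, and `average_precision` is a hypothetical helper.

```python
def average_precision(precisions, recalls):
    """Sum of (current recall - next recall) * current precision.
    Lists are ordered from the lowest threshold to the highest."""
    precisions = list(precisions) + [1.0]  # assumed closing point of the curve
    recalls = list(recalls) + [0.0]
    return sum((recalls[k] - recalls[k + 1]) * precisions[k]
               for k in range(len(recalls) - 1))


# Made-up precision/recall pairs for three thresholds.
ap = average_precision(precisions=[0.5, 0.75, 1.0], recalls=[1.0, 0.5, 0.25])
print(ap)
```

A model that reaches precision 1 at recall 1 gets an AP of 1, and lower precision at any recall level pulls the value down.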

Object detection models are usually evaluated at different IoU thresholds, and each threshold may give different predictions. In object detection, a prediction counts as a true positive only if its IoU with a ground truth box reaches the chosen threshold. Precision, recall and AP are then computed per class, and mAP is the mean of the per-class AP values.