# Tasks Taxonomy
![](../Images/det_tasks.jpeg)
![](../Images/inst_seg.png)
##### Real time semantic segmentation example: https://www.youtube.com/watch?v=ATlcEDSPWXY
##### Real time instance segmentation example: https://www.youtube.com/watch?v=0pMfmo8qfpQ

# Datasets 

##### PASCAL Visual Object Classes (PASCAL VOC)

The PASCAL VOC dataset (2012) is well known and commonly used for object detection and segmentation. The training and validation sets contain more than 11k images, and the test set contains 10k images.

##### Common Objects in COntext ([COCO](http://cocodataset.org/#explore))

There are two COCO challenges (2017 and 2018) relevant to image segmentation: “object detection” and “stuff segmentation”. The “object detection” task consists of segmenting and categorizing objects into 80 categories. The “stuff segmentation” task uses images annotated with large segmented regions (sky, wall, grass) that cover almost all of the visual information. In this blog post, only the results of the “object detection” task are compared, because too few of the quoted research papers have published results on the “stuff segmentation” task.

##### Cityscapes

The Cityscapes dataset was released in 2016 and consists of complex segmented urban scenes from 50 cities. It is composed of 23.5k images for training and validation (fine and coarse annotations) and 1.5k images for testing (fine annotations only).

# Evaluation metrics

#### IoU (Intersection over Union)

![](../Images/IoU.png)
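
As a minimal sketch of the metric pictured above, IoU for two axis-aligned boxes can be computed as below. The `(x1, y1, x2, y2)` corner format and the function name `box_iou` are assumptions made for illustration; for segmentation masks the same ratio is taken over pixel sets instead of box areas.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.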

***Average Precision*** is defined as the area under the precision-recall curve (PR curve). In object detection or instance segmentation, the different precision-recall pairs are obtained by varying the score cutoff.

In COCO, the IoU threshold is varied from 50% to 95% in steps of 5%, which gives 10 AP values, one per threshold. Averaging those 10 values gives ***AP[0.5:0.95]***.
 
**mAP is simply all the AP values averaged over different classes/categories**
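
The definitions above can be sketched as follows. This is an illustrative all-point area computation, not the exact COCO evaluation procedure (which uses its own interpolation and matching rules). It assumes each detection has already been matched to a ground-truth object and that its IoU with that object is known; the names `average_precision` and `coco_style_ap` are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the PR curve obtained by sweeping the score cutoff."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:  # lowering the cutoff admits one more detection each step
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Make precision non-increasing from left to right (upper envelope).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def coco_style_ap(scores, matched_ious, num_ground_truth):
    """Average AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    aps = [
        average_precision(scores, [iou >= t for iou in matched_ious], num_ground_truth)
        for t in thresholds
    ]
    return sum(aps) / len(aps)


def mean_average_precision(ap_per_class):
    """mAP: the plain average of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

Raising the IoU threshold turns loosely localized detections into false positives, which is why AP[0.5:0.95] rewards tight boxes more than AP at a single 0.5 threshold.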

[more](https://medium.com/@yanfengliux/the-confusing-metrics-of-ap-and-map-for-object-detection-3113ba0386ef)

[losses](https://lars76.github.io/neural-networks/object-detection/losses-for-segmentation/)

# Segmentation Models

### Fully Convolutional Networks (FCN)
[H. Noh et al. (2015)](https://arxiv.org/pdf/1505.04366.pdf)

![](../Images/segnet.png)



##### Previous papers: FCN [J. Long et al. (2015)](https://arxiv.org/abs/1605.06211), ParseNet [W. Liu et al. (2015)](https://arxiv.org/pdf/1506.04579.pdf)

### Unpooling 

![](../Images/upscale.png)

"It records the locations of maximum activations selected during pooling operation in switch variables, which are employed to place each activation back to its original pooled
location" H. Noh et al. (2015)
![](../Images/upsample.png)
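
The switch-variable idea in the quote can be sketched in plain Python. The sketch assumes non-overlapping square windows and a feature map whose height and width are divisible by the window size; the function names are hypothetical.

```python
def max_pool_with_switches(fmap, size=2):
    """Non-overlapping max pooling that also records where each max came from."""
    pooled, switches = [], []
    for r in range(0, len(fmap), size):
        pooled_row, switch_row = [], []
        for c in range(0, len(fmap[0]), size):
            window = [
                (fmap[i][j], (i, j))
                for i in range(r, r + size)
                for j in range(c, c + size)
            ]
            value, position = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            switch_row.append(position)  # the "switch variable" for this window
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def max_unpool(pooled, switches, height, width):
    """Place each pooled activation back at its recorded location, zeros elsewhere."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (i, j) in zip(pooled_row, switch_row):
            out[i][j] = value
    return out
```

The unpooled map is sparse: only the positions that won during pooling are non-zero, and later layers are expected to densify it.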


### [Transposed convolution](https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215)
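
A rough one-dimensional sketch of a transposed convolution, assuming no padding and a single channel: every input value scatters a scaled copy of the kernel into the output, and overlapping contributions are summed. The function name is hypothetical, and real layers learn the kernel weights.

```python
def conv_transpose_1d(x, kernel, stride=1):
    """1-D transposed convolution without padding."""
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, xi in enumerate(x):
        for k, wk in enumerate(kernel):
            # Input element i writes a scaled kernel starting at position i * stride.
            out[i * stride + k] += xi * wk
    return out
```

With `stride=2` and a kernel of length 2, each input value is simply repeated twice, which shows how the operation upsamples; longer kernels blend neighbouring inputs where their copies overlap.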




### [U-NET](https://arxiv.org/pdf/1505.04597.pdf)
![](../Images/U-net.png)
Note that it doesn’t use any fully-connected layer. As consequencies, the number of parameters of the model is reduced and it can be trained with a small labelled dataset (using appropriate data augmentation). For example, the authors have used a public dataset with 30 images for training during their experiments.
![](../Images/unet_res.png)

[more models](https://medium.com/@arthur_ouaknine/review-of-deep-learning-algorithms-for-image-semantic-segmentation-509a600f7b57)

# Classification with localization

![](../Images/class+loc.png)


# Object Detection


## Interesting topics
* Sliding Window
* Image Pyramids
* Non-Max Suppression

All these ideas are behind the [OverFeat](https://arxiv.org/abs/1312.6229) model.
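
Two of the topics listed above can be sketched in a few lines: enumerating sliding-window positions and greedy non-max suppression. This is a generic illustration, not OverFeat's implementation; the box format `(x1, y1, x2, y2)`, the function names, and the 0.5 threshold are assumptions.

```python
def sliding_windows(image_w, image_h, window, stride):
    """Yield square window boxes (x1, y1, x2, y2) that fit inside the image."""
    for y in range(0, image_h - window + 1, stride):
        for x in range(0, image_w - window + 1, stride):
            yield (x, y, x + window, y + window)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep  # indices into `boxes`, highest score first
```

An image pyramid would simply repeat the sliding-window pass on rescaled copies of the image so that a fixed-size window covers objects of different sizes.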

## Region-based Convolutional Network ([R-CNN](http://islab.ulsan.ac.kr/files/announcement/513/rcnn_pami.pdf))
The first models intuitively begin with the region search and then perform the classification.


In R-CNN, the selective search method developed by [J.R.R. Uijlings et al. (2012)](http://www.huppelen.nl/publications/selectiveSearchDraft.pdf) is an alternative to exhaustive search in an image to capture object locations. It initializes small regions in an image and merges them with a hierarchical grouping, so that the final group is a box containing the entire image. Regions are merged according to a variety of color spaces and similarity metrics. The output is a small number of region proposals, formed by merging small regions, that could each contain an object.

[Selective search ideas](https://www.youtube.com/watch?v=uX4LLf-33p0&list=PL1GQaVhO4f_jLxOokW7CS5kY_J1t1T17S&index=60)

![](../Images/select_search.png)
Each region proposal is resized to match the input of a CNN, from which we extract a 4096-dimensional feature vector.
Each object class has an SVM classifier trained to score whether a given feature vector contains that object. The feature vector also feeds a linear regressor that adjusts the shape of the bounding box of a region proposal and thus reduces localization errors.
![](../Images/rcnn.png)
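
To illustrate how a regressor's output can adjust a region proposal, the sketch below applies four offsets `(dx, dy, dw, dh)` to a box, using the center/size parametrization commonly used in the R-CNN family. The regressor itself is not shown; the offsets are assumed to be given, and the function name is hypothetical.

```python
import math


def apply_box_deltas(proposal, deltas):
    """Shift and rescale a proposal (x1, y1, x2, y2) with offsets (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = proposal
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas

    # Center moves by a fraction of the box size; size is scaled in log space.
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)

    return (
        new_cx - 0.5 * new_w,
        new_cy - 0.5 * new_h,
        new_cx + 0.5 * new_w,
        new_cy + 0.5 * new_h,
    )
```

Expressing the shift relative to the box size and the scale in log space keeps the targets comparable across small and large proposals.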



## Fast Region-based Convolutional Network ([Fast R-CNN](https://arxiv.org/pdf/1504.08083.pdf))
A main CNN with multiple convolutional layers takes the entire image as input, instead of running a CNN on each region proposal (R-CNN). Regions of Interest (RoIs) are detected with the selective search method applied on the produced feature maps. Formally, the feature map size is reduced using a [RoI Pooling](https://towardsdatascience.com/region-of-interest-pooling-f7c637f409af) layer to get valid Regions of Interest with fixed height and width as hyperparameters. Each RoI pooling output feeds fully-connected layers that create a feature vector. The vector is used to predict the observed object with a softmax classifier and to adjust bounding box localizations with a linear regressor.
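
A minimal sketch of RoI max pooling on a single-channel feature map, assuming the RoI is already expressed in feature-map cells as `(row_start, col_start, row_end, col_end)` with exclusive ends. The bin boundaries use floor/ceil rounding, which is one possible choice (bins may overlap by a cell when the RoI size is not divisible by the output size). The function name is hypothetical.

```python
def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool the RoI of a 2-D feature map into a fixed out_h x out_w grid."""
    r1, c1, r2, c2 = roi
    h, w = r2 - r1, c2 - c1
    out = []
    for i in range(out_h):
        # Row range of bin i: floor for the start, ceil for the end.
        rs = r1 + (i * h) // out_h
        re = r1 + -(-((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            cs = c1 + (j * w) // out_w
            ce = c1 + -(-((j + 1) * w) // out_w)
            row.append(max(fmap[r][c] for r in range(rs, re) for c in range(cs, ce)))
        out.append(row)
    return out
```

Whatever the size of the RoI, the output always has `out_h * out_w` values, which is what lets RoIs of different shapes feed the same fully-connected layers.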

![](../Images/fast_rcnn.png)

## Faster R-CNN
Region proposals detected with the selective search method were still necessary in the previous model, which is computationally expensive. S. Ren et al. (2016) introduced the Region Proposal Network (RPN) to directly generate region proposals, predict bounding boxes and detect objects. The Faster Region-based Convolutional Network (Faster R-CNN) is a combination of the RPN and the Fast R-CNN model.

A CNN model takes the entire image as input and produces feature maps. A window of size 3x3 slides over all the feature maps and outputs a feature vector linked to two fully-connected layers, one for box regression and one for box classification. Multiple region proposals are predicted by the fully-connected layers. A maximum of k regions is fixed, so the output of the box-regression layer has a size of 4k (coordinates of the boxes, their height and width) and the output of the box-classification layer has a size of 2k (“objectness” scores indicating whether or not there is an object in the box). The k region proposals detected by the sliding window are called anchors.
![](../Images/anchor_nums.png)
[Absolute vs Relative BBOX Regression | Anchor Boxes](https://www.youtube.com/watch?v=AVTs_N8YhBw)
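
The 4k and 2k output sizes can be made concrete with a small sketch that builds k anchors around one sliding-window position. The three scales and three aspect ratios used here (k = 9) are illustrative defaults and should be treated as an assumption; the function names are hypothetical.

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Build len(scales) * len(ratios) anchors (x1, y1, x2, y2) centered at (cx, cy)."""
    anchors = []
    for scale in scales:
        for ratio in ratios:  # ratio = height / width, area stays scale * scale
            w = scale / ratio ** 0.5
            h = scale * ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def rpn_output_sizes(k):
    """Per-position output sizes of the two sibling RPN layers."""
    return {
        "box_regression": 4 * k,      # 4 box parameters per anchor
        "box_classification": 2 * k,  # object / not-object score per anchor
    }
```

With k = 9, each sliding-window position therefore produces 36 regression outputs and 18 objectness scores.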


![](../Images/fast-rcnn.png)

![](../Images/fast_rccn2.png)

## [YOLO](https://pjreddie.com/darknet/yolo/)

# Instance Segmentation
## [Mask R-CNN](https://arxiv.org/pdf/1703.06870.pdf)
![](../Images/Mask-RCNN.jpg)
#### [RoI Align](https://towardsdatascience.com/understanding-region-of-interest-part-2-roi-align-and-roi-warp-f795196fc193)

# References

https://medium.com/@arthur_ouaknine/review-of-deep-learning-algorithms-for-image-semantic-segmentation-509a600f7b57

https://www.youtube.com/watch?v=9I6nzfx_kpE&list=PL1GQaVhO4f_jLxOokW7CS5kY_J1t1T17S&index=1