# Object Detection
## 1. Object Localisation:
   * Model is responsible not just for object classification but also putting a bounding box around the object.
   * Localisation refers to where the object is in the image.
   * Understanding localisation (assuming there is a single object in the image):
   <p align='center'><img src='./imgs/ol1.png' height="50%" width="50%" /></p>
    
   * It differs from the classic classification task in that the model has 4 more outputs, i.e. $b_{x}, b_{y}, b_{h}, b_{w}$, along with the class labels.
   * The training data must therefore also include these 4 values as part of each label.
   * Understanding the output and loss:
   <p align='center'><img src='./imgs/ol2.png' height="50%" width="50%" /></p>
    
   * where $p_c$ indicates whether there is an object in the image (1) or not (0).
   * For simplicity we use `MSE` loss here, but one can use a different loss for each of the 3 tasks, i.e. for $p_c$, the bounding box and the classes.
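
The loss idea above can be sketched in a few lines. This is a minimal sketch, assuming a target laid out as `[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]` with 3 classes and plain squared error throughout. The function name `localisation_loss`, the layout, and the sample numbers are illustrative assumptions, not taken from the figure. The sketch also assumes that when $p_c = 0$ the remaining entries are "don't care", so only the $p_c$ term is penalised.

```python
def localisation_loss(y_true, y_pred):
    """Squared-error loss for one example laid out as
    [p_c, b_x, b_y, b_h, b_w, c_1, ..., c_k] (assumed layout)."""
    if y_true[0] == 1:
        # Object present: every component contributes to the loss.
        return sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    # No object: only the p_c prediction is penalised.
    return (y_pred[0] - y_true[0]) ** 2


# Object present, belonging to class 2 of 3 (made-up numbers).
y = [1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0]
y_hat = [0.9, 0.5, 0.6, 0.3, 0.4, 0.1, 0.8, 0.1]
loss_obj = localisation_loss(y, y_hat)

# No object: box and class entries of the prediction are ignored.
y_bg = [0, 0, 0, 0, 0, 0, 0, 0]
y_bg_hat = [0.2, 0.9, 0.9, 0.9, 0.9, 0.5, 0.5, 0.5]
loss_bg = localisation_loss(y_bg, y_bg_hat)

print(loss_obj, loss_bg)
```

Swapping in a different loss per task (for example a log loss for $p_c$) would only change the individual terms, not the overall structure.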

## 2. Landmark Detection:
   * Consider the following example:
   <p align='center'><img src='./imgs/local1.png' height="50%" width="50%" /></p>

   * Suppose we want to detect whether a person is smiling, or the position of the eyes; the model can output, let's say, 64 different x and y values (landmarks).
   * Each point, say $l_{1x}, l_{1y}$, describes the eyes, $l_{2x}, l_{2y}$ describes the nose, and so on (the landmark order needs to be consistent across the dataset).
   * We can also use the first output value to tell us whether there is a face or not, like $p_c$ in the object localisation case above.
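
As a rough illustration of the label layout described above, here is a minimal sketch that flattens landmark points into one target vector. The helper name `encode_landmarks`, the use of zeros as placeholders when no face is present, and the sample points are all assumptions made for this example.

```python
def encode_landmarks(face_present, landmarks, num_landmarks=64):
    """Flatten (x, y) pairs into [face, l_1x, l_1y, ..., l_nx, l_ny]."""
    if not face_present:
        # No face: landmark entries are placeholders (assumed zeros).
        return [0.0] * (1 + 2 * num_landmarks)
    if len(landmarks) != num_landmarks:
        # Every example must list the same landmarks in the same order.
        raise ValueError("expected %d landmarks" % num_landmarks)
    target = [1.0]
    for x, y in landmarks:
        target.extend([float(x), float(y)])
    return target


# Hypothetical example: 64 landmarks in image-relative coordinates.
points = [(i / 64, 0.5) for i in range(64)]
target = encode_landmarks(True, points)
print(len(target))  # 129 = 1 face flag + 64 * 2 coordinates
```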

## 3. Object Detection:
   * **Car Detection Example**:
      * First, train a car/no-car model that predicts whether there is a car in the image or not.
      <p align='center'><img src='./imgs/od1.png' height="50%" width="50%" /></p>

      * Then we apply the `sliding window detection` technique: take a window of a particular size, slide it over different regions of the image, and pass each region through the model.
      
      <p align='center'><img src='./imgs/od2.png' height="50%" width="50%" /></p>

      * The size of the sliding window and the stride determine the computational cost of the operation.
      * If we use a very coarse stride (a very big step size), that reduces the number of windows we need to pass through the ConvNet, but the coarser granularity may hurt performance.
      * If we use a very fine granularity (a very small stride), the huge number of small regions passed through the ConvNet means a very high computational cost.
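
The stride trade-off above can be made concrete by counting windows. This is a minimal sketch, assuming a square window, no padding, and one ConvNet forward pass per window; the function name `sliding_windows` and the image and window sizes are illustrative choices.

```python
def sliding_windows(image_h, image_w, window, stride):
    """Yield the (top, left) corner of every square window
    that fits fully inside the image."""
    for top in range(0, image_h - window + 1, stride):
        for left in range(0, image_w - window + 1, stride):
            yield top, left


# Same 256x256 image and 64x64 window, two different strides.
coarse = list(sliding_windows(256, 256, window=64, stride=32))
fine = list(sliding_windows(256, 256, window=64, stride=4))
print(len(coarse), len(fine))  # 49 2401
```

Going from a stride of 32 to a stride of 4 multiplies the number of crops (and so the number of forward passes) by 49 in this example.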

   
### 3.1 Convolutional Implementation of Sliding Window
   * Need to fill it later

### 3.2 Bounding Box Prediction
   * Need to fill it later

### 3.3 Intersection Over Union(IoU):
   * Helps to evaluate an object detection model.
   * Used to map **localisation to accuracy**: a predicted box counts as correct when its IoU with the ground truth meets a chosen threshold.
   <p align='center'><img src='./imgs/iou.png' height="50%" width="50%" /></p>

   * The red box is the ground truth and the blue box is the prediction.
   * Ideally we want the value to be as close to 1 as possible, but most people use 0.5 as the threshold for a correct detection.
   * If you want to be more stringent you can use a higher threshold, such as 0.6.
   * More generally, IoU is a measure of the overlap between 2 bounding boxes.
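
IoU is the area of the intersection of the two boxes divided by the area of their union. A minimal sketch follows, assuming axis-aligned boxes given as corner coordinates `(x1, y1, x2, y2)`; that box format, the function name `iou`, and the sample boxes are assumptions for illustration.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 4, 4)
prediction = (2, 0, 6, 4)
score = iou(ground_truth, prediction)
print(round(score, 3))  # 0.333
print(score >= 0.5)     # False
```

Here the prediction covers half of the ground truth box, yet the IoU is only 1/3, so it would not pass a 0.5 threshold.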

### 3.4 Non-max Suppression
   * Need to fill it later

### 3.5 Anchor Boxes
   * Need to fill it later

### 3.6 YOLO Algorithm
   * Need to fill it later

## 4. Image Segmentation
### 4.1 Semantic Segmentation with U-Net
   * The objective is to draw a careful outline around the detected object so that you know exactly which pixels belong to the object and which pixels don't. It tries to predict a label for every single pixel.
   * Let's say you want to identify the car, the building and the road:
   <p align='center'><img src='./imgs/is1.png' height="50%" width="50%" /></p>

   * The algorithm tries to **assign each pixel in the image to some class.**
   * Basic architecture understanding:
   <p><img src='./imgs/is2.png' height="40%" width="40%" />
   <img src='./imgs/is3.png' height="55%" width="55%" /></p>

   * In this case we have to restore the dimensions of the image, with each pixel referring to a specific value.
   
### 4.2 Transpose Convolutions
   * The goal is to take an input and blow it up to a larger spatial size (height and width) than the input.
   <p align='center'><img src='./imgs/tc.png' height="50%" width="50%" /></p>

   * The following example illustrates how we calculate the values at different positions:
   <p align='center'><img src='./imgs/tc1.png' height="50%" width="50%" /></p>

   * The red box values are calculated using the upper 2, the green using the upper 1, the blue using the lower 3, and so on.
   * The values at the padding region are ignored.
   * For values in the intersection (overlapping) area, the final value is the sum of all the contributions.
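
The three rules above (scale the filter by each input value, ignore the padding region, sum where copies overlap) can be sketched in plain Python. This is a minimal sketch, not a library implementation. The 2x2 input uses the values 2, 1 and 3 mentioned above, while the fourth input value, the 3x3 filter, stride 2, padding 1 and the 4x4 output size are assumptions. The function name `transpose_conv2d` and its explicit `out_size` argument are also illustrative.

```python
def transpose_conv2d(x, kernel, stride, padding, out_size):
    """Transpose convolution of a square input with a square kernel.

    Each input value scales the whole kernel, and the scaled copy is
    added onto a padded canvas, shifted by `stride` per input step.
    """
    f = len(kernel)
    size = out_size + 2 * padding  # canvas including the padding border
    canvas = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            for a in range(f):
                for b in range(f):
                    r, c = i * stride + a, j * stride + b
                    if r < size and c < size:
                        # Overlapping contributions are summed.
                        canvas[r][c] += value * kernel[a][b]
    # Values written into the padding border are dropped.
    return [row[padding:size - padding]
            for row in canvas[padding:size - padding]]


x = [[2, 1],
     [3, 2]]
kernel = [[1, 2, 1],
          [2, 0, 1],
          [0, 2, 1]]
result = transpose_conv2d(x, kernel, stride=2, padding=1, out_size=4)
for row in result:
    print(row)
```

With these inputs the centre of the output receives contributions from all four input values, which is the "sum of all the values" case.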

### 4.3 U-Net Architecture Intuition:
   * With conv operations, we lose a lot of `spatial information` in the later layers.
   * The middle layer may represent something like: there's a cat roughly in the lower right-hand portion of the image.
   * The second half of this neural network uses the transpose convolution to blow the representation size back up to the size of the original input image.
   * `Skip Connections` are implemented to copy an earlier block of activations to a later block.
   * Why skip connections?
      * In the final layer, to decide which region is a cat, two types of information are useful:
         * The high-level spatial and contextual information, which it gets from the previous layer.
         * The fine-grained spatial information.
      * A skip connection allows the neural network to take the very high-resolution, low-level feature information, which can capture for every pixel position how much furry stuff there is in that pixel, and pass it directly to the later layer.
      * This way the final layer has both the lower-resolution but high-level contextual information and the high-resolution, low-level feature information.
   <p align='center'><img src='./imgs/is4.png' height="50%" width="50%" /></p>

### 4.4 U-Net Architecture
   * The following figure describes the U-Net architecture:
   <p align='center'><img src='./imgs/is5.png' height="50%" width="50%" /></p>
      
   * It starts by increasing the number of channels (feature maps), then reduces the height and width (after max-pooling).
   * Then, after the middle part, it starts rebuilding the dimensions of the image using skip connections and transpose convolution operations.
   * [Ronneberger et al., 2015, U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597)
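
To see how the dimensions evolve, the sketch below traces `(height, width, channels)` through a simplified U-Net. It is only a shape calculation under stated assumptions: "same" convolutions, 2x2 max-pooling, stride-2 transpose convolutions that halve the channel count, a depth of 4 with 64 base channels, and 3 output classes (car, building, road). The original paper uses unpadded convolutions, so its feature maps shrink slightly and the skip features are cropped; that detail is left out here. The function name `unet_shapes` is illustrative.

```python
def unet_shapes(height, width, base_channels=64, depth=4, num_classes=3):
    """Trace (height, width, channels) through a simplified U-Net.

    Height and width should be divisible by 2 ** depth.
    """
    trace, skips = [], []
    h, w, c = height, width, base_channels
    # Contracting path: more channels, then smaller height and width.
    for _ in range(depth):
        trace.append(("conv", (h, w, c)))
        skips.append((h, w, c))  # kept for the skip connection
        h, w = h // 2, w // 2
        trace.append(("max-pool", (h, w, c)))
        c *= 2
    trace.append(("bottleneck conv", (h, w, c)))
    # Expanding path: upsample, concatenate the skip, convolve.
    for _ in range(depth):
        _, _, skip_c = skips.pop()
        h, w, c = h * 2, w * 2, c // 2
        trace.append(("transpose conv", (h, w, c)))
        trace.append(("concat skip", (h, w, c + skip_c)))
        trace.append(("conv", (h, w, c)))
    # A 1x1 convolution maps features to one score per class per pixel.
    trace.append(("1x1 conv", (h, w, num_classes)))
    return trace


trace = unet_shapes(256, 256)
for name, shape in trace:
    print(name, shape)
```

The output has the same height and width as the input, with one channel per class, which matches the goal of predicting a class for every pixel.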