# Object Detection

---

## Classification with Localization
- Use a Convolution Network (ConvNet) to classify an image
- To **Localize** the object in the image, extend the Output with the *box* location: (x,y,h,w)
    - x,y is the center of the *box*, and h,w are the height and width of the box
    - Coordinates are relative to the whole image: (0,0) is the top left corner and (1,1) is the bottom right corner
- The Target Label Output: $\begin{bmatrix} p_c \\ b_x \\ b_y \\ b_h \\ b_w \\ c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}$
- where:
    - $p_c$ indicates whether there is an object of interest (1 or 0). When $p_c = 0$ there is no object, and the rest of the outputs are ignored
    - $b_x, ..., b_w$ give the object's *box* location
    - $c_1, ..., c_n$ give the probability of each object class (one-hot in the target label)
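
A minimal sketch of how such a target vector might be assembled, assuming three classes and plain Python lists. The helper name `make_target` and the choice to store the ignored entries as zeros are illustrative assumptions, not something the notes prescribe.

```python
def make_target(box=None, class_index=None, num_classes=3):
    """Build [p_c, b_x, b_y, b_h, b_w, c_1, ..., c_n] as a flat list.

    box is (x, y, h, w) in image-relative coordinates, or None when the
    image contains no object of interest.
    """
    if box is None:
        # p_c = 0: the remaining entries are ignored, so zeros are placeholders.
        return [0.0] * (5 + num_classes)
    b_x, b_y, b_h, b_w = box
    classes = [0.0] * num_classes
    classes[class_index] = 1.0  # one-hot class indicator
    return [1.0, b_x, b_y, b_h, b_w] + classes


# Hypothetical example: an object of class index 1 centered at (0.5, 0.7).
y = make_target(box=(0.5, 0.7, 0.3, 0.4), class_index=1)
```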

## Loss Function Example:
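- When the true label has $p_c = 0$ (so $y_1 = 0$), the **Loss Function** is $(\hat{y}_1 - y_1)^2$, because only the $p_c$ prediction matters
- When the true label has $p_c = 1$, the **Loss Function** is $(\hat{y}_1 - y_1)^2 + (\hat{y}_2 - y_2)^2 + ... + (\hat{y}_n - y_n)^2$, summed over every component of the output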
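
The two cases can be sketched as one function, assuming `y` and `y_hat` are flat lists whose first entry is $p_c$. The name `localization_loss` is hypothetical, and plain squared error on every component is the simplification used in this example rather than a recommendation.

```python
def localization_loss(y_hat, y):
    """Squared-error loss for one example; y[0] is the true p_c."""
    if y[0] == 0:
        # No object: only the p_c prediction is penalized.
        return (y_hat[0] - y[0]) ** 2
    # Object present: every component contributes.
    return sum((pred - true) ** 2 for pred, true in zip(y_hat, y))


# With no object, wildly wrong box and class outputs do not affect the loss.
no_object_loss = localization_loss([0.2, 9, 9, 9, 9, 9, 9, 9], [0] * 8)
```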

---

## Sliding Windows Algorithm:
- Train a ConvNet on a Training Set of closely cropped images to classify an object
- Slide a **sliding window** across a new image and feed each cropped region into the trained Network to classify whether that region contains an object, until all regions of the image have been processed
- Repeat with a larger **sliding window**, in the hope that one of the window sizes captures the object of interest
- The high computational cost can be reduced with a larger **stride**, but a coarser stride can reduce detection performance
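
The window positions can be sketched with a small generator, assuming square windows and pixel coordinates. The function name `sliding_windows` and the 16x16 image size are illustrative assumptions. Each yielded region would be cropped and passed to the trained classifier.

```python
def sliding_windows(image_h, image_w, window, stride):
    """Yield (top, left, bottom, right) for every window position."""
    for top in range(0, image_h - window + 1, stride):
        for left in range(0, image_w - window + 1, stride):
            yield (top, left, top + window, left + window)


# A larger stride means fewer crops to classify, at the cost of coarser coverage.
fine = list(sliding_windows(16, 16, window=14, stride=1))
coarse = list(sliding_windows(16, 16, window=14, stride=2))
```

Here the stride of 1 produces 9 crops and the stride of 2 produces 4, which is where the savings come from.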

## Improving Sliding Windows Algorithm:
- Turn Fully Connected Layers into Convolution Layers:
    - Instead of *flattening* a convolution layer's output into an FC layer, apply Convolution filters of the same size as that output
        - FC version: 5x5x16 -> flatten -> 400-unit FC Layer
        - Conv version: 5x5x16 convolved with 400 filters of size 5x5x16 -> 1x1x400 Conv Layer
    - Use a Convolution Layer as the Output Layer
- Run the full *bigger* image through the same Network in a single pass, instead of one pass per **sliding window**; each position in the Convolution Output corresponds to one window (*part*) of the image
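
A small numerical check of this equivalence, assuming a single 5x5x16 filter (one of the 400) and random values. The helper names `conv_valid`, `fc_unit`, and `crop` are hypothetical. On a 5x5x16 input the filter yields one number, identical to an FC unit with the same weights; on a larger 6x6x16 input it yields a 2x2 map, one value per window position.

```python
import random


def conv_valid(volume, filt):
    """Valid cross-correlation of an HxWxC volume with one kxkxC filter."""
    height, width, channels = len(volume), len(volume[0]), len(volume[0][0])
    k = len(filt)
    out = []
    for top in range(height - k + 1):
        row = []
        for left in range(width - k + 1):
            row.append(sum(
                filt[i][j][c] * volume[top + i][left + j][c]
                for i in range(k) for j in range(k) for c in range(channels)
            ))
        out.append(row)
    return out


def fc_unit(volume, filt):
    """One fully connected unit: flatten input and weights, then dot product."""
    flat_x = [x for row in volume for cell in row for x in cell]
    flat_w = [w for row in filt for cell in row for w in cell]
    return sum(w * x for w, x in zip(flat_w, flat_x))


def crop(volume, top, left, k=5):
    """Cut a kxk spatial window out of the volume (all channels kept)."""
    return [row[left:left + k] for row in volume[top:top + k]]


random.seed(0)
C = 16
filt = [[[random.uniform(-1, 1) for _ in range(C)] for _ in range(5)] for _ in range(5)]
big = [[[random.random() for _ in range(C)] for _ in range(6)] for _ in range(6)]

conv_map = conv_valid(big, filt)  # 2x2 map computed in one pass
per_window = [[fc_unit(crop(big, t, l), filt) for l in range(2)] for t in range(2)]
```

The entries of `conv_map` and `per_window` agree, which is why one pass over the larger image can replace four separate window evaluations.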

---

## YOLO Algorithm - Bounding Box Prediction:
- Convolutional implementation of the **sliding windows** algorithm
- Divide the image into a *Grid* and run it through a Convolution Network in a single pass; the output holds one Localization and Classification vector for each grid cell
- Output for each grid cell: $y = \begin{bmatrix} p_c \\ b_x \\ b_y \\ b_h \\ b_w \\ c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}$
- where:
    - $p_c$ indicates whether there is an object of interest (1 or 0) in that grid cell. When $p_c = 0$ there is no object, and the rest of the outputs are ignored
    - $b_x, ..., b_w$ give the object's *box* location relative to the grid cell
    - $c_1, ..., c_n$ give the probability of each object class
- The total target output depends on the number of grid cells (nxn) and the number of labels per cell (m): $n \times n \times m$
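
A sketch of how the $n \times n \times m$ target could be filled in, assuming each object is assigned to the cell containing its center and that the box is rescaled to cell units. The function name `encode_targets`, the input tuple layout, and the rescaling convention are assumptions made for illustration.

```python
def encode_targets(objects, n=3, num_classes=3):
    """Build an n x n x m target; objects are (x, y, h, w, class_index)
    with coordinates relative to the whole image (0 to 1)."""
    m = 5 + num_classes
    target = [[[0.0] * m for _ in range(n)] for _ in range(n)]
    for x, y, h, w, class_index in objects:
        # The cell that contains the object's center owns the object.
        col = min(int(x * n), n - 1)
        row = min(int(y * n), n - 1)
        # Center offset inside the cell, in cell units (0 to 1).
        b_x = x * n - col
        b_y = y * n - row
        classes = [0.0] * num_classes
        classes[class_index] = 1.0
        # Height and width in cell units; these may exceed 1.
        # A second object in the same cell would overwrite the first one.
        target[row][col] = [1.0, b_x, b_y, h * n, w * n] + classes
    return target


# Hypothetical object of class index 1 centered in the middle of the image.
grid = encode_targets([(0.5, 0.5, 0.2, 0.4, 1)])
```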

## Intersection Over Union:
- Measures how well the detection algorithm localizes an object
- Compare the Predicted bounding box with the Actual bounding box by how much they overlap:
    - IoU = Size of Intersection / Size of the Total Union
- Bigger IoU means better detection (1 is a perfect match, 0 is no overlap)
    - A detection is commonly counted as correct when IoU is at least 0.5 (the threshold can be adjusted)
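
The ratio can be sketched for axis-aligned boxes, assuming each box is given by its corners `(x1, y1, x2, y2)` rather than by center and size. The function name `iou` is an illustrative choice.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1).
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```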

## Non-max Suppression:
- Deals with multiple detections of the same object
- Associate a probability $p_c$ with each detection and keep only the most confident one per object:
    - Discard all predictions with $p_c \leq 0.6$
    - Pick the remaining prediction with the largest $p_c$ and output it
    - Discard all remaining predictions with IoU $\geq 0.5$ with the chosen (largest) prediction
    - Repeat the last two steps until no predictions remain
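
The procedure can be sketched for a single class, assuming detections are `(p_c, box)` pairs with corner-format boxes. The names `nms` and `_iou` are hypothetical; the thresholds are the ones quoted in the list above.

```python
def _iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, score_threshold=0.6, iou_threshold=0.5):
    """Return the detections kept by non-max suppression."""
    # Step 1: drop low-confidence detections, then sort by confidence.
    remaining = sorted(
        (d for d in detections if d[0] > score_threshold),
        key=lambda d: d[0],
        reverse=True,
    )
    kept = []
    while remaining:
        # Step 2: keep the most confident remaining detection.
        best = remaining.pop(0)
        kept.append(best)
        # Step 3: drop everything that overlaps it too much.
        remaining = [d for d in remaining if _iou(best[1], d[1]) < iou_threshold]
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # heavy overlap with the first box
    (0.7, (20, 20, 30, 30)),  # a separate object
    (0.5, (0, 0, 10, 10)),    # below the confidence threshold
]
kept = nms(detections)
```

With several classes, a common approach is to run this once per class.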

## Anchor Boxes for Multiple Detection:
- Use when the midpoints of several objects fall in the same grid cell
- Use pre-defined anchor box shapes and assign each object to the anchor box whose shape matches it best
- Output: $y = \begin{bmatrix} p_c \\ b_x \\ b_y \\ b_h \\ b_w \\ c_1 \\ c_2 \\ c_n \\ --- \\ p_c \\ b_x \\ b_y \\ b_h \\ b_w \\ c_1 \\ c_2 \\ c_n  \end{bmatrix}$
- where each $p_c$ *group* corresponds to one **anchor box**
- Output shape with **Grid** (nxn), **Labels** (l), and **Anchors** (a) is: $n \times n \times l \times a$
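
One way to associate an object with an anchor is to pick the anchor whose shape has the highest IoU with the object's box when both are centered at the same point. This matching rule, the name `best_anchor`, and the two example anchor shapes are assumptions for illustration; the notes do not fix a specific rule.

```python
def best_anchor(box_hw, anchors):
    """Return the index of the anchor (h, w) that best matches box_hw."""
    def shape_iou(a, b):
        # With a shared center, the overlap is the smaller height times
        # the smaller width.
        inter = min(a[0], b[0]) * min(a[1], b[1])
        return inter / (a[0] * a[1] + b[0] * b[1] - inter)

    return max(range(len(anchors)), key=lambda i: shape_iou(box_hw, anchors[i]))


# Hypothetical anchors as (height, width): one tall, one wide.
anchors = [(2.0, 1.0), (1.0, 2.0)]
tall_object = best_anchor((1.8, 0.9), anchors)
wide_object = best_anchor((0.8, 1.9), anchors)
```

A tall box is matched to the tall anchor and a wide box to the wide anchor, so two objects sharing a grid cell can occupy different $p_c$ groups.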

## YOLO Algorithm Example:
- Training to detect pedestrian, car, and motorcycle using a 3x3 **Grid** with 2 **Anchors** and 8 **Labels** ($p_c$, 4 box values, and 3 classes)
- Output $y$ is: 3x3x8x2 (equivalently 3x3x16)
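
A short sketch of this target as nested lists, to make the sizes concrete. The axis order used here (grid row, grid column, anchor, label) and the example values are assumptions; only the total of 3 x 3 x 2 x 8 = 144 numbers follows from the notes.

```python
GRID, ANCHORS, CLASSES = 3, 2, 3
LABELS = 5 + CLASSES  # p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3

# One 8-value label vector per anchor per grid cell, initialized to "no object".
y = [[[[0.0] * LABELS for _ in range(ANCHORS)] for _ in range(GRID)]
     for _ in range(GRID)]

# Hypothetical car (class index 1) in the center cell, assigned to anchor 1.
y[1][1][1] = [1.0, 0.5, 0.5, 0.4, 0.9, 0.0, 1.0, 0.0]

shape = (len(y), len(y[0]), len(y[0][0]), len(y[0][0][0]))
# Stacking both anchors of one cell gives the 16-value vector per cell.
cell_vector = [value for anchor in y[1][1] for value in anchor]
total_values = sum(len(anchor) for row in y for cell in row for anchor in cell)
```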

---