# Object Localization

### Object localization

- Classification with localization problem
- Detection problem (for multiple objects)

In **Image Classification**, the image is passed through a *ConvNet* followed by a *softmax* layer to predict the class.  

For localization, the NN must also output a bounding box, parameterized by $b_x$, $b_y$, $b_h$, $b_w$ (center coordinates, height, and width). 

If the training set contains bounding-box annotations, the network can learn to predict the bounding box for new images. 

How do we define the target label $y$?  
Assume that $y$ is a vector $[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]^T$, where $p_c$ is the probability that an object is present and the $c_i$ indicate the class of the object (each is 1 or 0 in the training labels).  
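As an illustration, the label vector can be assembled as follows. This is a minimal sketch: the class names, the helper `make_label`, and the box values are assumptions made up for the example and do not come from the notes.

```python
import numpy as np

# Hypothetical class order for c_1, c_2, c_3 (an assumption for illustration).
CLASSES = ["pedestrian", "car", "motorcycle"]

def make_label(class_name=None, box=None):
    """Build y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3].

    class_name: one of CLASSES, or None if the image contains no object.
    box: (b_x, b_y, b_h, b_w) of the object; ignored when class_name is None.
    """
    y = np.zeros(5 + len(CLASSES))
    if class_name is None:
        return y  # p_c = 0; the remaining entries are "don't care" values
    y[0] = 1.0                                # p_c: an object is present
    y[1:5] = box                              # b_x, b_y, b_h, b_w
    y[5 + CLASSES.index(class_name)] = 1.0    # one-hot class indicator
    return y

y_car = make_label("car", box=(0.5, 0.7, 0.3, 0.4))
y_background = make_label()
```

Here `y_car` has $p_c = 1$ and $c_2 = 1$, while `y_background` is all zeros and only its first entry is meaningful.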

The loss function $\mathcal{L}(\hat{y},y)$ is the sum of squared residuals over all components if $y_1 = 1$ (i.e., an object is present). Otherwise we do not care about the other components and only the first one contributes, so we get 

$$
\mathcal{L}(\hat{y},y) = 
\begin{cases}
\sum_i(\hat{y}_i - y_i)^2 & \text{if } y_1 = 1 \\
(\hat{y}_1-y_1)^2 & \text{if } y_1 = 0
\end{cases}
$$
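The piecewise loss above can be written directly in NumPy. This is a sketch for a single example; the function name `localization_loss` is an assumption, and a real training setup would typically use a batched, differentiable version.

```python
import numpy as np

def localization_loss(y_hat, y):
    """Squared-error loss for one example with y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    if y[0] == 1:
        # Object present: every component contributes.
        return float(np.sum((y_hat - y) ** 2))
    # No object: only the p_c component contributes.
    return float((y_hat[0] - y[0]) ** 2)

# No object in the label: the box and class predictions are ignored.
loss_empty = localization_loss([0.5, 9, 9, 9, 9, 9, 9, 9], [0, 0, 0, 0, 0, 0, 0, 0])

# Object present: only p_c is off by 0.5 here, the rest matches.
y_true = [1, 0.5, 0.5, 0.2, 0.2, 0, 1, 0]
loss_object = localization_loss([0.5, 0.5, 0.5, 0.2, 0.2, 0, 1, 0], y_true)
```

Both calls evaluate to $0.25$: in the first case the large errors in the ignored components have no effect.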

### Landmark detection 

Output the coordinates of important points (landmarks) in the image.

An image can have many landmarks of interest (e.g., facial features). These can be extracted with a *ConvNet*. Applications include emotion recognition  
and computer graphics.  

The labels in the training set are usually prepared _manually_. 

Labels have to be consistent between different images, i.e., a given landmark must always occupy the same position in the label vector.  
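One way to keep landmark labels consistent is to fix the landmark order once and flatten every annotation in that order. The sketch below assumes hypothetical landmark names and a dictionary-based annotation format; neither comes from the notes.

```python
import numpy as np

# Hypothetical landmark names; the fixed order is what keeps labels consistent.
LANDMARK_ORDER = ["left_eye", "right_eye", "nose_tip", "mouth_left", "mouth_right"]

def landmarks_to_label(landmarks):
    """Flatten {name: (x, y)} into [x_1, y_1, ..., x_n, y_n] following LANDMARK_ORDER."""
    missing = [name for name in LANDMARK_ORDER if name not in landmarks]
    if missing:
        raise ValueError(f"missing landmarks: {missing}")
    return np.array(
        [coord for name in LANDMARK_ORDER for coord in landmarks[name]],
        dtype=float,
    )

annotation = {
    "nose_tip": (0.50, 0.55),
    "left_eye": (0.35, 0.40),
    "right_eye": (0.65, 0.40),
    "mouth_right": (0.60, 0.75),
    "mouth_left": (0.40, 0.75),
}
label = landmarks_to_label(annotation)
```

The annotation order does not matter: the label always starts with the coordinates of `left_eye`, so the same output unit of the network always corresponds to the same landmark.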

### Object detection

Using the _sliding window_ detection algorithm with a _ConvNet_.  
- Create a training set with __closely cropped__ images of an object (e.g., a car)
- and images without a car.  
- Train a ConvNet to classify whether each crop contains a car or not.  
- Run sliding-window detection. For a test image, pick a window size and slide the "window" across the entire image (with a small stride). Classify each crop as car or not car. 
    - Resize the window and repeat the process
    - Resize again and repeat 
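The steps above can be sketched as a pair of loops. This is a minimal illustration, not an efficient detector: `classify` stands in for the trained ConvNet (here a hypothetical brightness threshold), and a real pipeline would also resize each crop to the ConvNet's input size.

```python
import numpy as np

def sliding_windows(image, window, stride):
    """Yield (top, left, crop) for every square window that fits inside the image."""
    height, width = image.shape[:2]
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left, image[top:top + window, left:left + window]

def detect(image, classify, window_sizes, stride):
    """Run the classifier on every crop, for every window size."""
    detections = []
    for window in window_sizes:
        for top, left, crop in sliding_windows(image, window, stride):
            if classify(crop):
                detections.append((top, left, window))
    return detections

# Toy example: an 8x8 image with a bright 4x4 "object".
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0

def is_object(crop):
    # Placeholder for the ConvNet: fires when the crop is almost entirely bright.
    return crop.mean() > 0.9

detections = detect(image, is_object, window_sizes=[4, 6], stride=2)
```

Even in this toy case the classifier runs on 13 crops (9 of size 4 and 4 of size 6), which hints at why the approach becomes expensive for real images, small strides, and many window sizes.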

> So, if there is a car in the image, the ConvNet should give a positive result at least once in this search. 

**Disadvantage** Very computationally expensive

**Solution** Convolutional implementation of a sliding-window

### Convolutional Implementation of Sliding Windows

Turning FC layers into ConvLayers.  

Consider a $(14\times14\times3)$ image, then $16$ filters of size $5\times5$ to get $(10\times10\times16)$, and then a $2\times2$ MAX POOL to get $(5\times 5\times 16)$ ... then FC, FC, and a softmax with, say, $4$ outputs for $4$ different object classes.  

To convert the FC layers into convolutional layers, we replace the first FC layer of $400$ neurons with $400$ filters of size $5\times 5$ (each spanning all $16$ channels), so that the output is $1\times1\times400$. The second FC layer then becomes $400$ filters of size $1\times 1$, and the softmax layer becomes $4$ filters of size $1\times 1$. 

The implementation follows the OverFeat paper by Sermanet et al. (2014). 

Consider an input image of size $(14\times14\times3)$. 
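The equivalence between an FC layer and a $5\times5$ convolution on a $5\times5\times16$ volume can be checked numerically. The sketch below uses random weights; the variable names (`W_fc`, `filters`) are assumptions for illustration, and the nonlinearity is omitted since it is identical in both cases.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((5, 5, 16))             # activation volume after MAX POOL
W_fc = rng.standard_normal((400, 5 * 5 * 16))   # FC weights: 400 neurons, 400 inputs each
b = rng.standard_normal(400)                    # one bias per neuron / filter

# Fully connected layer: flatten the volume, then apply the weight matrix.
fc_out = W_fc @ x.reshape(-1) + b               # shape (400,)

# Same weights viewed as 400 filters of size 5x5x16.
filters = W_fc.reshape(400, 5, 5, 16)

# A 5x5 "valid" convolution on a 5x5 input has exactly one spatial position,
# so each filter produces one number: the sum of its elementwise product with x.
conv_out = (np.einsum("hwc,fhwc->f", x, filters) + b).reshape(1, 1, 400)

same = np.allclose(fc_out, conv_out.reshape(-1))
```

`same` is `True`: the FC layer and the convolutional layer compute the same $400$ numbers, only arranged as a $1\times1\times400$ volume in the second case.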

$$
\Big[ 14\times14\times3 \Big]
\underbrace{\rightarrow}_{5\times 5}
\Big[ 10\times 10\times 16 \Big]
\overbrace{\underbrace{\rightarrow}_{2\times 2}}^{\text{MAX POOL}}
\Big[ 5\times 5\times 16 \Big]
\overbrace{\underbrace{\rightarrow}_{5\times 5}}^{\text{FC}}
\Big[ 1\times 1\times 400 \Big]
\overbrace{\underbrace{\rightarrow}_{1\times 1}}^{\text{FC}}
\Big[ 1\times 1\times 400 \Big]
\overbrace{\underbrace{\rightarrow}_{1\times 1}}^{\text{FC}}
\Big[ 1\times 1\times 4 \Big]
$$

Consider a larger input image: 

$$
\Big[ 16\times 16\times 3 \Big]
\underbrace{\rightarrow}_{5\times 5}
\Big[ 12\times 12\times 16 \Big]
\overbrace{\underbrace{\rightarrow}_{2\times 2}}^{\text{MAX POOL}}
\Big[ 6\times 6\times 16 \Big]
\overbrace{\underbrace{\rightarrow}_{5\times 5}}^{\text{FC}}
\Big[ 2\times 2\times 400 \Big]
\overbrace{\underbrace{\rightarrow}_{1\times 1}}^{\text{FC}}
\Big[ 2\times 2\times 400 \Big]
\overbrace{\underbrace{\rightarrow}_{1\times 1}}^{\text{FC}}
\Big[ 2\times 2\times 4 \Big]
$$

Each of the $2\times 2 = 4$ positions in the output volume corresponds to the result of running the original network on one $14\times14$ window of the bigger $16\times16$ image. Rerunning the net separately for each window would duplicate a lot of calculations!
In the convolutional implementation above, many calculations are shared between overlapping windows. 
The `MAX POOL` layer governs the stride with which we "slide the window" and thus the output density.  

> The convolutional implementation processes the entire image in one pass, without manually sliding the window and _rerunning_ the net every time!

**Shortcoming** The position of the bounding boxes is not accurate. 
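The output grid size for a given input size follows from the usual shape formula $\lfloor (n - f)/s \rfloor + 1$. The sketch below assumes "valid" convolutions with stride 1 and a $2\times2$ max pool with stride 2, matching the shapes in the diagrams above; the $28\times28$ input is an extra example that is not in the notes.

```python
def conv_output_size(n, f, stride=1):
    """Spatial size after an f x f 'valid' convolution or pooling with the given stride."""
    return (n - f) // stride + 1

def output_grid_size(n):
    """Spatial size of the final output for an n x n input image."""
    n = conv_output_size(n, 5)             # 5x5 convolution
    n = conv_output_size(n, 2, stride=2)   # 2x2 MAX POOL, stride 2
    n = conv_output_size(n, 5)             # first FC layer as a 5x5 convolution
    return n                               # the 1x1 convolutions keep the size

sizes = {n: output_grid_size(n) for n in (14, 16, 28)}
```

This gives a $1\times1$ output for a $14\times14$ input, $2\times2$ for $16\times16$, and $8\times8$ for $28\times28$: each extra output position corresponds to shifting the $14\times14$ window by 2 pixels, the stride of the pooling layer.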


### Bounding Box Predictions

It is possible that none of the windows matches the position of the car.  

This is especially likely if the true bounding box is not perfectly square. 

Solution: the `YOLO` algorithm (You Only Look Once). See the Redmon et al. (2015) paper. 

Idea: Divide a given image into a grid of _cells_ and apply the classification-with-localization scheme discussed above to each cell. For each grid cell there exists a label $y=[p_c,b_x, b_y, b_h, b_w, c_1, c_2, c_3]^T$.  
An object is assigned to a grid cell if the __object center__ lies within the boundaries of that cell.  

The __target output volume__ is $3\times 3\times 8$ for the $8$ components of $y$ and a $3\times 3$ grid of cells. 

On the network side, use a __ConvNet__ to process the image and choose the layers such that the output shape is equal to the shape of the _target volume_. 

Use backprop to train the NN to output the target volume.  

Advantage: faster.  
**Note** There should be at most one object of interest per grid cell.  
**Single Conv. implementation**  
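A target volume of this shape can be built by assigning each object to the cell that contains its center. The sketch below makes several assumptions that are not in the notes: objects are given as `(x_c, y_c, h, w, class_index)` with coordinates normalized to the whole image, $x$ runs left to right and $y$ top to bottom, and the box is stored relative to the cell, with height and width measured in units of the cell size.

```python
import numpy as np

GRID = 3          # 3x3 grid of cells, as in the text
NUM_CLASSES = 3   # c_1, c_2, c_3

def build_target(objects, grid=GRID, num_classes=NUM_CLASSES):
    """Build the grid x grid x (5 + num_classes) target volume.

    objects: list of (x_c, y_c, h, w, class_index), image-normalized to [0, 1].
    """
    target = np.zeros((grid, grid, 5 + num_classes))
    for x_c, y_c, h, w, class_index in objects:
        # Cell that contains the object center (clamped for centers on the border).
        col = min(int(x_c * grid), grid - 1)
        row = min(int(y_c * grid), grid - 1)
        if target[row, col, 0] == 1:
            raise ValueError("more than one object center in a grid cell")
        # Center relative to the cell; size in units of the cell size.
        b_x, b_y = x_c * grid - col, y_c * grid - row
        b_h, b_w = h * grid, w * grid
        target[row, col, :5] = [1.0, b_x, b_y, b_h, b_w]
        target[row, col, 5 + class_index] = 1.0
    return target

# One object of class c_2 centered in the bottom-middle cell.
target = build_target([(0.5, 0.8, 0.2, 0.5, 1)])
```

Only the cell at `row=2, col=1` has $p_c = 1$; all other cells stay zero. Note that $b_w = 1.5$ here because the object is wider than one cell.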

#### Specifying the bounding boxes

In the YOLO algorithm, each cell has its own coordinate system: the top-left corner is $(0,0)$ and the bottom-right corner is $(1,1)$. The box center is specified within these bounds, $b_x, b_y \in (0,1)$; however, the width and height are measured relative to the cell size and can be larger than $1$.  

### Intersection Over Union
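To make the convention concrete, a cell-relative box can be converted back to coordinates normalized to the whole image. This is a sketch under assumptions: the function name `cell_to_image` is hypothetical, $x$ is taken as horizontal and $y$ as vertical, and height and width are expressed in units of the cell size.

```python
def cell_to_image(row, col, b_x, b_y, b_h, b_w, grid=3):
    """Convert a cell-relative box to image-normalized (x_c, y_c, h, w)."""
    x_c = (col + b_x) / grid   # horizontal center: cell offset plus position in cell
    y_c = (row + b_y) / grid   # vertical center
    return x_c, y_c, b_h / grid, b_w / grid

# A box centered in the bottom-middle cell that is 1.5 cells wide.
x_c, y_c, h, w = cell_to_image(row=2, col=1, b_x=0.5, b_y=0.4, b_h=0.6, b_w=1.5)
```

The center stays inside its cell, while the width of $1.5$ cells corresponds to half of the image width, so the box extends into the neighboring cells.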







