# Amphi 5. More About Images

# 1. Notions

## 1.1 Object Localization

In image processing, **object localization** means determining where an object of interest is located in a given image, for example by drawing a bounding box around it.

<img src="F2.png"></img>

## 1.2 Image Classification vs Image Classification with localization

- For image classification without localization, the output (target) is a class in $\{1, \ldots, K\}$. It can be encoded as a one-hot vector $(0, 0, \ldots, 0, 1, 0, \ldots, 0)$.

- For image classification with localization, the output is a class together with the object's position. The position is typically a rectangular box, characterized by the coordinates of its center $(b_x, b_y)$, its width $b_w$ and its height $b_h$. These values are usually expressed as fractions of the image size: $b_x$ and $b_w$ relative to the image width, $b_y$ and $b_h$ relative to the image height. So $0 \leq b_x, b_y, b_w, b_h \leq 1$.

<img src="F3.png"></img>
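As an illustration of this normalization, here is a minimal Python sketch. It assumes the box is given by its pixel corners and that the image size is known; the function name `to_normalized_box` and its arguments are hypothetical, chosen only for this example.

```python
def to_normalized_box(x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert pixel corner coordinates to (b_x, b_y, b_w, b_h), each in [0, 1]."""
    b_x = (x_min + x_max) / 2 / img_w  # center x as a fraction of the image width
    b_y = (y_min + y_max) / 2 / img_h  # center y as a fraction of the image height
    b_w = (x_max - x_min) / img_w      # box width as a fraction of the image width
    b_h = (y_max - y_min) / img_h      # box height as a fraction of the image height
    return b_x, b_y, b_w, b_h


# A 200 x 200 pixel box inside a 400 x 500 image (made-up numbers).
box = to_normalized_box(100, 50, 300, 250, img_w=400, img_h=500)
print(box)
```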

So the output is a one-hot vector concatenated with the quadruple $(b_x, b_y, b_w, b_h)$ and preceded by a confidence coordinate $C$. It has the form $(C, C_0, C_1, \ldots, C_{K-1}, b_x, b_y, b_w, b_h) = (C, 0, 0, \ldots, 0, 1, 0, \ldots, 0, b_x, b_y, b_w, b_h)$. Here the coordinate $C$ indicates whether any of the $K$ objects appears in the image.

We usually construct a model that predicts probabilities for $C, C_0, C_1, \ldots, C_{K-1}$ and real numbers for $b_x, b_y, b_w, b_h$, for example with an output activation composed of a sigmoid for $C$, a softmax over $C_0, C_1, \ldots, C_{K-1}$, and a linear (identity) activation for $b_x, b_y, b_w, b_h$.

- If $C$ is greater than some threshold (like 0.5), we decide that one of our $K$ objects of interest is present in the image. In that case, we assign the image to the class with the greatest value (probability) among $C_0, \ldots, C_{K-1}$, and we draw a rectangle centered at $(b_x, b_y)$ with height $b_h$ and width $b_w$ as the detection of the object.

- Otherwise, we decide that no object is present in the image. In some articles, the target label is then written $(0, ?, ?, \ldots, ?)$, where $?$ stands for "don't care".
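The decision rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw (pre-activation) outputs are laid out as $(C, C_0, \ldots, C_{K-1}, b_x, b_y, b_w, b_h)$, and the names `decode`, `raw` and `threshold` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def decode(raw, num_classes, threshold=0.5):
    """Return (class_index, box) if an object is detected, otherwise None.

    Assumed layout of raw: [C, C_0, ..., C_{K-1}, b_x, b_y, b_w, b_h].
    """
    confidence = sigmoid(raw[0])
    if confidence <= threshold:
        return None  # no object: the remaining values are "don't care"
    probs = softmax(raw[1:1 + num_classes])
    best_class = max(range(num_classes), key=lambda k: probs[k])
    # Linear activation: the box values are used as they are.
    b_x, b_y, b_w, b_h = raw[1 + num_classes:1 + num_classes + 4]
    return best_class, (b_x, b_y, b_w, b_h)


# Made-up raw outputs for K = 3 classes.
detected = decode([2.0, 0.1, 3.0, -1.0, 0.5, 0.4, 0.3, 0.2], num_classes=3)
nothing = decode([-2.0, 0.1, 3.0, -1.0, 0.5, 0.4, 0.3, 0.2], num_classes=3)
print(detected, nothing)
```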

## 1.3 Landmark Detection

In **landmark detection** we add to the target label (output) a set of numbers corresponding to key features of the object. For example:

- For car classification, we draw a rectangle surrounding the car and add  $(b_x, b_y, b_w, b_h)$ to the target label.
- For face classification, we add the positions of some important features to the target label, like $(x_{\text{left-eye-center}}, y_{\text{left-eye-center}}, \text{width}_{\text{left-eye}}, \text{height}_{\text{left-eye}}, \ldots, x_{\text{chin}}, y_{\text{chin}})$.

The classification problem becomes classification with localization of the key features of the objects of interest.


## 1.4 Object Detection

**Object detection** refers to localizing all objects in the image. The target label is therefore more complex than the $(C, C_0, C_1, \ldots, C_{K-1}, b_x, b_y, b_w, b_h)$ form in section 1.2.

For example, the YOLO algorithm introduced in section 2 splits the image into an $S \times S$ grid ($S = 7$ in the original paper) under the hypothesis that each of the $S^2$ small squares contains the center of at most one object. The output is then reformatted as a vector of dimension $(1 + 4 + K) \cdot S^2$.

# 2. YOLO (V1)

https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088

## 2.1 The target label

The target label can be reformulated as

$$
\begin{pmatrix}
C^{zone 1}\\
C_0^{zone 1}\\
\ldots\\
C_{K-1}^{zone 1}\\
b_x^{zone 1}\\
b_y^{zone 1}\\
b_w^{zone 1}\\
b_h^{zone 1}\\
\ldots\\
C^{zone S\times S}\\
C_0^{zone S\times S}\\
\ldots\\
C_{K-1}^{zone S\times S}\\
b_x^{zone S\times S}\\
b_y^{zone S\times S}\\
b_w^{zone S\times S}\\
b_h^{zone S\times S}\\
\end{pmatrix}
$$

The coordinates are divided into $S\times S$ groups, one per zone. Each group contains 1 confidence value (whether an object is centered in that zone), $K$ values for the classes, and 4 values for localization.
$$
\begin{aligned}
\text{zone } 1 &: \begin{cases}
C^{zone 1}\\
C_0^{zone 1}\\
\ldots\\
C_{K-1}^{zone 1}\\
b_x^{zone 1}\\
b_y^{zone 1}\\
b_w^{zone 1}\\
b_h^{zone 1}
\end{cases}\\
&\;\;\vdots\\
\text{zone } S\times S &: \begin{cases}
C^{zone S\times S}\\
C_0^{zone S\times S}\\
\ldots\\
C_{K-1}^{zone S\times S}\\
b_x^{zone S\times S}\\
b_y^{zone S\times S}\\
b_w^{zone S\times S}\\
b_h^{zone S\times S}
\end{cases}
\end{aligned}
$$

<img src="F4.png"></img>

Here, $C^{zone\,x} = 1$ if there is an object centered inside zone $x$, otherwise $C^{zone\,x} = 0$.

The vector is usually sparse, i.e., only the values associated with zones containing an object's center are non-zero; the others are zeros or "don't care".

<img src="F5.png"></img>

**Convention**

- $b_x$ and $b_y$ are the $x$- and $y$-coordinates of the object's center inside its small square, and $b_w$ and $b_h$ are the width and height of the object's bounding box, all expressed as **ratios** to the size of the small square. Therefore, $0 \leq b_x, b_y < 1$ and $0 \leq b_w, b_h \leq S$.
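To make the target label and this convention concrete, here is a minimal Python sketch that encodes a list of objects into the flat target vector. Several things are assumptions made for the example and are not fixed by the text: the objects are given with boxes as fractions of the whole image, zones are numbered row by row, and the function name `encode_target` is hypothetical.

```python
def encode_target(objects, S, K):
    """Build the flat target vector of length (1 + K + 4) * S * S.

    objects: list of (class_id, cx, cy, w, h), where the box values are
    fractions of the whole image (assumed input format).
    Each zone is laid out as [C, C_0, ..., C_{K-1}, b_x, b_y, b_w, b_h].
    """
    zone_len = 1 + K + 4
    target = [0.0] * (zone_len * S * S)
    for class_id, cx, cy, w, h in objects:
        col = min(int(cx * S), S - 1)  # zone column containing the center
        row = min(int(cy * S), S - 1)  # zone row containing the center
        base = (row * S + col) * zone_len  # row-by-row numbering (assumption)
        target[base] = 1.0                 # confidence C
        target[base + 1 + class_id] = 1.0  # one-hot class
        # Center offsets inside the zone; width and height in units of zones.
        target[base + 1 + K:base + 1 + K + 4] = [
            cx * S - col, cy * S - row, w * S, h * S,
        ]
    return target


# One object of class 1 centered in the middle of a 3 x 3 grid, K = 2.
target = encode_target([(1, 0.5, 0.5, 0.4, 0.2)], S=3, K=2)
print(len(target), target[28:35])
```

With these numbers the object's center falls in the middle zone, so only that zone's 7 values are non-zero, which illustrates the sparsity mentioned above.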

## 2.2 Loss function

Unlike classical machine learning approaches, the original YOLO paper uses a sum of squared errors over every coordinate of the target vector.

There are three kinds of coordinates of the target vector:

- **Confidence coordinates**: $C^{zone\,x}$ where $x = 1, \ldots, S^2$. They are predicted as probabilities.
- **Classification coordinates**: $C_k^{zone\,x}$ where $k = 0, \ldots, K-1; x = 1, \ldots, S^2$. They are predicted as probabilities, and for a fixed $x$ they sum to 1.
- **Localization coordinates**: $b_x^{zone\,x}, b_y^{zone\,x}, b_w^{zone\,x}, b_h^{zone\,x}$ where $x = 1, \ldots, S^2$. They are predicted as real numbers.

We can define the loss function as the sum of 3 losses:

- Confidence loss:

For every training example,

$$
\sum_{i=1}^{S^2} \left[ \mathbf 1_i (C^{(i)} - C'^{(i)})^2 + \lambda_{noobj} (1 - \mathbf 1_i) (C^{(i)} - C'^{(i)})^2 \right]
$$

where $C^{(i)}$ is the target confidence for cell $i$, $C'^{(i)}$ is the predicted confidence that there is an object in cell $i$, $\mathbf 1_i$ indicates whether there is an object in cell $i$, and $\lambda_{noobj}$ is a weighting coefficient for cells with no object.

- Localization loss:

$$
\lambda_{coord} \sum_{i=1}^{S^2} \mathbf 1_i \left[ \left(b_x^{(i)} - b_x'^{(i)}\right)^2 + \left( b_y^{(i)} - b_y'^{(i)}\right)^2 + \left(\sqrt{b_w^{(i)}} - \sqrt{b_w'^{(i)}} \right)^2 + \left(\sqrt{b_h^{(i)}} - \sqrt{b_h'^{(i)}} \right)^2 \right]
$$

where $\lambda_{coord} > 0$ is the coefficient of localization, $\mathbf 1_i$: whether there is an object in cell $i$.

- Classification loss:
$$
\sum_{i=1}^{S^2}\mathbf 1_i \sum_{k=0}^{K-1} \left(C_k^{(i)}-C_k'^{(i)}\right)^2
$$

The total loss is the sum of the above 3 kinds of loss:
$$
\begin{aligned}
L ={} & \sum_{i=1}^{S^2} \left[ \mathbf 1_i (C^{(i)} - C'^{(i)})^2 + \lambda_{noobj} (1 - \mathbf 1_i) (C^{(i)} - C'^{(i)})^2 \right] \\
& + \lambda_{coord} \sum_{i=1}^{S^2} \mathbf 1_i \left[ \left(b_x^{(i)} - b_x'^{(i)}\right)^2 + \left( b_y^{(i)} - b_y'^{(i)}\right)^2 + \left(\sqrt{b_w^{(i)}} - \sqrt{b_w'^{(i)}} \right)^2 + \left(\sqrt{b_h^{(i)}} - \sqrt{b_h'^{(i)}} \right)^2 \right] \\
& + \sum_{i=1}^{S^2}\mathbf 1_i \sum_{k=0}^{K-1} \left(C_k^{(i)}-C_k'^{(i)}\right)^2
\end{aligned}
$$
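The total loss can be sketched in plain Python for a single training example. This is an illustrative version under stated assumptions, not the paper's implementation: both vectors use the flat layout $[C, C_0, \ldots, C_{K-1}, b_x, b_y, b_w, b_h]$ per cell, predicted widths and heights are clamped at 0 before the square root, and the default values $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$ are the ones commonly quoted for YOLO V1. The function name `yolo_loss` is hypothetical.

```python
import math


def yolo_loss(target, pred, S, K, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum of confidence, localization and classification losses (one example)."""
    zone_len = 1 + K + 4
    conf_loss = loc_loss = cls_loss = 0.0
    for i in range(S * S):
        t = target[i * zone_len:(i + 1) * zone_len]
        p = pred[i * zone_len:(i + 1) * zone_len]
        has_obj = t[0] == 1.0  # the indicator 1_i
        conf_err = (t[0] - p[0]) ** 2
        conf_loss += conf_err if has_obj else lambda_noobj * conf_err
        if not has_obj:
            continue  # localization and class terms only count for object cells
        cls_loss += sum((t[1 + k] - p[1 + k]) ** 2 for k in range(K))
        tx, ty, tw, th = t[1 + K:]
        px, py, pw, ph = p[1 + K:]
        # Clamp predictions at 0 so the square root is defined (assumption).
        pw, ph = max(pw, 0.0), max(ph, 0.0)
        loc_loss += (tx - px) ** 2 + (ty - py) ** 2
        loc_loss += (math.sqrt(tw) - math.sqrt(pw)) ** 2
        loc_loss += (math.sqrt(th) - math.sqrt(ph)) ** 2
    return conf_loss + lambda_coord * loc_loss + cls_loss


# Single-cell examples (S = 1, K = 2) with made-up numbers.
loss_obj = yolo_loss([1, 0, 1, 0.5, 0.5, 1.0, 4.0],
                     [0.5, 0, 1, 0.5, 0.5, 1.0, 1.0], S=1, K=2)
loss_noobj = yolo_loss([0.0] * 7, [0.5] + [0.3] * 6, S=1, K=2)
print(loss_obj, loss_noobj)
```

In the first example only the confidence and the height are wrong, giving $0.25 + 5 \cdot (\sqrt 4 - \sqrt 1)^2 = 5.25$. In the second, the cell has no object, so only the down-weighted confidence term $0.5 \cdot 0.25 = 0.125$ contributes.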



## 2.3 The Model

The network used in YOLO is a deep CNN that takes a fixed-size image as input (like 448 x 448 x 3) and outputs a $(5 + K)S^2$-dimensional vector. Example:

<img src="F6.png"></img>

## 2.4 Non-maximal Suppression

After training, the model is able to predict the output. Usually, it predicts multiple boxes for the same object, like this:

<img src="F7.png"></img>

So a post-processing step is introduced at prediction time: for each object, keep only the box $B_{opt}$ with the highest confidence value and delete all boxes that overlap heavily with $B_{opt}$.

To do this, we introduce the IoU (Intersection over Union) of 2 boxes, defined as the ratio of the area of their intersection to the area of their union.

**IoU**
$$
\mathrm{IoU}(B_1, B_2) = \frac{\mathrm{area}(B_1 \cap B_2)}{\mathrm{area}(B_1 \cup B_2)}
$$
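For axis-aligned boxes the IoU reduces to a few min/max operations. Below is a minimal Python sketch, assuming each box is a tuple `(cx, cy, w, h)` giving its center, width and height in the same units; the function name `iou` is chosen for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (cx, cy, w, h)."""
    # Convert center format to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Width and height of the overlap, 0 if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


# Two 2 x 2 boxes shifted by 1 horizontally: intersection 2, union 6.
overlap = iou((1, 1, 2, 2), (2, 1, 2, 2))
print(overlap)
```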


**Non-maximal Suppression**
1. Sort the predictions by confidence score, in decreasing order.
2. Starting from the top score, discard the current prediction if a previously kept prediction has the same class and an IoU greater than some threshold (typically 0.5) with it.
3. Repeat step 2 until all predictions are checked.
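The three steps above can be sketched as follows. This is a simple quadratic-time illustration, assuming each prediction is a tuple `(confidence, class_id, box)` with `box = (cx, cy, w, h)`, and that the current prediction is compared only with the predictions kept so far. The names `nms` and `iou` are chosen for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (cx, cy, w, h)."""
    inter_w = max(0.0, min(box_a[0] + box_a[2] / 2, box_b[0] + box_b[2] / 2)
                  - max(box_a[0] - box_a[2] / 2, box_b[0] - box_b[2] / 2))
    inter_h = max(0.0, min(box_a[1] + box_a[3] / 2, box_b[1] + box_b[3] / 2)
                  - max(box_a[1] - box_a[3] / 2, box_b[1] - box_b[3] / 2))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(predictions, iou_threshold=0.5):
    """Keep the highest-confidence box among heavily overlapping same-class boxes."""
    kept = []
    # Step 1: sort by confidence, highest first.
    for pred in sorted(predictions, key=lambda p: p[0], reverse=True):
        _, cls, box = pred
        # Step 2: discard if a kept prediction of the same class overlaps too much.
        if all(k[1] != cls or iou(k[2], box) <= iou_threshold for k in kept):
            kept.append(pred)
    return kept  # Step 3 is the loop itself: every prediction is checked once.


# Made-up predictions: the second box nearly duplicates the first.
predictions = [
    (0.8, 0, (1.1, 1.0, 2.0, 2.0)),
    (0.9, 0, (1.0, 1.0, 2.0, 2.0)),
    (0.7, 1, (1.0, 1.0, 2.0, 2.0)),  # same place, different class: kept
    (0.6, 0, (5.0, 5.0, 2.0, 2.0)),  # same class, far away: kept
]
kept = nms(predictions)
print([p[0] for p in kept])
```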

## 2.5 YOLO V2 and V3

# 3. Face Recognition