# Detection and Localization

In the previous topics in this chapter, we focused on:
- The construction of CNNs
- CNN architectures
- Classification using CNNs
- Feature visualisation

What if we want to classify multiple objects in an input image and also identify the location of each classified object (via a tight bounding box around it)?

**Objective**

1. Single object Detection
2. Multiple Object Detection:
   - R-CNN
   - Fast R-CNN
   - Faster R-CNN
3. Single Stage Object Detectors

## Single Object Detection


### Steps

1. Choose a backbone network such as AlexNet, ResNet, or DenseNet.
2. From the final fully connected (FC) feature vector, create two outputs:
    - A classification head (vector of 4096 $\to$ 1000 class scores)
    - A box-coordinate head (vector of 4096 $\to$ 4 values)
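
The two-output idea can be sketched in plain Python. This is only an illustration under assumptions: the feature vector is random rather than produced by a real backbone, the dimensions (16 and 5) are small stand-ins for 4096 and 1000, and the helper names `linear` and `make_layer` are hypothetical.

```python
import random

def linear(weights, bias, x):
    """Affine map: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def make_layer(n_out, n_in, rng):
    """Random weights and zero bias (illustrative initialisation only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
FEATURE_DIM, NUM_CLASSES = 16, 5  # stand-ins for 4096 and 1000

cls_head = make_layer(NUM_CLASSES, FEATURE_DIM, rng)  # feature -> class scores
box_head = make_layer(4, FEATURE_DIM, rng)            # feature -> 4 box values

# Pretend this vector came out of the backbone's last FC layer.
feature = [rng.random() for _ in range(FEATURE_DIM)]

class_scores = linear(*cls_head, feature)  # one score per class
box = linear(*box_head, feature)           # (x, y, w, h)
```

Both heads read the same shared feature vector; only the final linear maps differ.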

<br>
<div align="center">
<img src="../images/chap8/SingleObjDet.png" width="710"/>
</div>

#### Loss

| **Input** | **Task** | **Output** | **Loss Function** |
|-----------|----------|------------|-------------------|
| (H, W, 3) | **Classification** | Class label $c \in \{1, \ldots, C\}$ | Cross-Entropy Loss: $-\log p_c$ |
| (H, W, 3) | **Localization** | Bounding box $(x, y, w, h)$ | L2 Loss: $\sum (b_i - \hat{b}_i)^2$ or Smooth L1 |


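The total loss is a weighted sum of the two task losses:

$$L(y, P, b, b') = -\log P(y) + \lambda L_{regression}(b, b')$$

Here $P(y)$ is the predicted probability of the true class $y$, $b$ and $b'$ are the ground-truth and predicted boxes, and $\lambda$ balances the two terms.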
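
As a minimal sketch of this combined loss, the snippet below uses softmax cross-entropy for the classification term and Smooth L1 for the regression term. The function names and the default `lam=1.0` are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, label):
    """Classification term: -log P(y) for the true class index."""
    return -math.log(softmax(scores)[label])

def smooth_l1(pred_box, true_box):
    """Regression term: quadratic for small errors, linear for large ones."""
    loss = 0.0
    for p, t in zip(pred_box, true_box):
        d = abs(p - t)
        loss += 0.5 * d * d if d < 1.0 else d - 0.5
    return loss

def total_loss(scores, label, pred_box, true_box, lam=1.0):
    """Weighted sum of the two task losses (lam plays the role of lambda)."""
    return cross_entropy(scores, label) + lam * smooth_l1(pred_box, true_box)

example = total_loss([2.0, 0.5, -1.0], 0, [0.4, 0.4, 0.9, 1.1], [0.5, 0.5, 1.0, 1.0])
```

Raising `lam` makes the model care more about box accuracy relative to the class label.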

### Evaluation

- For **Classification**, we evaluate using the **accuracy** metric 
- For **Localization**, we compute the **Intersection over Union (IoU)**

---

#### **Intersection over Union (IoU)**

$$\text{IoU}(A, B) = \frac{\text{Area of Intersection}}{\text{Area of Union}} = \frac{|A \cap B|}{|A \cup B|}$$

**Where:**
- $A$ = predicted bounding box
- $B$ = ground truth bounding box
- $|A \cap B|$ = area of the overlapping region (intersection)
- $|A \cup B|$ = total covered area (union)

**Properties:**
- **Range:** $\text{IoU} \in [0, 1]$
  - $\text{IoU} = 0$ → no overlap (completely wrong prediction)
  - $\text{IoU} = 1$ → perfect overlap (exact match)
- **Common threshold:** $\text{IoU} \geq 0.5$ is considered a "good" detection
  - $\text{IoU} \geq 0.5$ → True Positive (TP), provided the predicted class is correct
  - $\text{IoU} < 0.5$ → False Positive (FP)
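
A small sketch of the IoU computation follows. It assumes boxes are given as corner coordinates `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`; this format is an assumption of the example, not something fixed by the text.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; clamp at 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter  # do not count the overlap twice

    return inter / union if union > 0 else 0.0

predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
is_good_detection = iou(predicted, ground_truth) >= 0.5
```

In this example the overlap is 1 unit of area and the union is 7, so the IoU is 1/7 and the prediction falls below the 0.5 threshold.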

**Visual Example:**

<div align="center">
<img src="../images/chap8/IOUCalc.png" width="500"/>
</div>

---

## Multi-Object Detection

$\text{Input} \ (H, W, 3)$

$\text{Output} \ \{(C_1, (x_1, y_1, w_1, h_1)), (C_2, (x_2, y_2, w_2, h_2)) \dots (C_k, (x_k, y_k, w_k, h_k))\}$

Where $k$ is the number of objects to identify in the image.


<div align="center">
<img src="../images/chap8/obj1.png" width="295"/>
<img src="../images/chap8/obj2.png" width="300"/>
<img src="../images/chap8/obj4.png" width="335"/>
</div>


### Challenges

1. **Multiple Outputs** - The number of objects changes per image
2. **Multiple types of Outputs** - We need to answer "what" and "where" for every object
3. **Multiple Lengths of outputs** - The output set has a variable length $k$, so a single fixed-size output layer is not enough
4. **Multiple Size** - Objects vary in size
5. **Multiple Detections** - The same object can be detected multiple times
6. **Occlusions** - Objects can partially or completely hide other objects

### Naïve Approach - Sliding Window

1. Apply CNN classification to many different crops of the image
2. Classify each crop as **background** or a specific **object class**
3. Vary the crop size and position to detect objects at different scales and locations
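
The crop enumeration in these steps can be sketched as a generator. The window sizes and stride are hypothetical parameters chosen for illustration; each yielded box would be cropped and passed to the classifier.

```python
def sliding_windows(height, width, window_sizes, stride):
    """Yield (x1, y1, x2, y2) crops for every window size and position."""
    for win_h, win_w in window_sizes:
        # Only positions where the window fits entirely inside the image.
        for y in range(0, height - win_h + 1, stride):
            for x in range(0, width - win_w + 1, stride):
                yield (x, y, x + win_w, y + win_h)

# Toy 8x8 image with two window scales.
crops = list(sliding_windows(8, 8, window_sizes=[(4, 4), (8, 8)], stride=2))
```

Even this coarse grid (stride 2, two scales) already yields 10 crops on an 8x8 image; dense strides and many scales grow the count quickly.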

**Complexity Issue**

To create a bounding box, we need 2 corner points (4 coordinates total): $(x_1, y_1, x_2, y_2)$

**Question:** Given an image of size $H \times W$, how many different possible bounding boxes can be created?

**Answer:**

Each bounding box is defined by:
- Top-left corner: $(x_1, y_1)$ where $x_1 \in \{0, \ldots, W-1\}$, $y_1 \in \{0, \ldots, H-1\}$
- Bottom-right corner: $(x_2, y_2)$ where $x_2 \in \{x_1+1, \ldots, W\}$, $y_2 \in \{y_1+1, \ldots, H\}$

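**Number of possible boxes:**

Choosing $x_1 < x_2$ from the $W+1$ grid positions $\{0, \ldots, W\}$ gives $\binom{W+1}{2}$ horizontal extents, and likewise $\binom{H+1}{2}$ vertical extents:

$$\binom{H+1}{2} \times \binom{W+1}{2} = \frac{H(H+1)}{2} \times \frac{W(W+1)}{2} = O(H^2 W^2)$$

**Why this is a problem:**
- For a $224 \times 224$ image: $25200^2 \approx 6.4 \times 10^8$ possible boxes!
- Running a CNN on each crop is **computationally infeasible**
- It would need ~635 million forward passes per image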
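
As a sanity check on the count, the sketch below enumerates every box allowed by the corner ranges listed above for a tiny image and compares the result with the closed form $\binom{H+1}{2}\binom{W+1}{2}$ that those ranges imply. The function names are hypothetical.

```python
from math import comb

def count_boxes_formula(h, w):
    """Closed form: choose 2 of the h+1 row lines and 2 of the w+1 column lines."""
    return comb(h + 1, 2) * comb(w + 1, 2)

def count_boxes_brute_force(h, w):
    """Enumerate every (x1, y1, x2, y2) with x1 < x2 <= w and y1 < y2 <= h."""
    count = 0
    for x1 in range(w):
        for x2 in range(x1 + 1, w + 1):
            for y1 in range(h):
                for y2 in range(y1 + 1, h + 1):
                    count += 1
    return count

small = count_boxes_brute_force(3, 4)       # feasible only for tiny images
large = count_boxes_formula(224, 224)       # 635,040,000 boxes
```

The brute-force loop is only practical for toy sizes, which is exactly the point: the quartic growth makes exhaustive search hopeless at real resolutions.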

**This motivates region proposal methods (R-CNN, Fast R-CNN, Faster R-CNN)**

### Background: Region Proposal 

**Region proposal algorithms** provide candidate bounding boxes where objects are likely located.

**Advantages:**
- Fast: generates ~2000 regions in seconds on CPU
- Reduces search space from hundreds of millions of boxes to thousands of candidates
- No neural network required (traditional computer vision)

**Example: Selective Search**

1. **Over-segment** the image into many small regions (via graph-based segmentation)
2. **Iteratively merge** similar regions based on color, texture, size, and shape
3. Creates a **hierarchical structure** capturing objects at various scales
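
The merge loop can be illustrated with a heavily simplified toy. This is not the real Selective Search: it assumes the over-segmentation is already given as regions with a box, a pixel count, and a mean colour, it uses colour distance as the only similarity, and it ignores adjacency, texture, and shape. All names are hypothetical.

```python
def color_distance(r1, r2):
    """Squared distance between mean colours (smaller = more similar)."""
    return sum((a - b) ** 2 for a, b in zip(r1["color"], r2["color"]))

def merge(r1, r2):
    """Union of two regions: enclosing box, summed size, size-weighted colour."""
    size = r1["size"] + r2["size"]
    color = tuple((c1 * r1["size"] + c2 * r2["size"]) / size
                  for c1, c2 in zip(r1["color"], r2["color"]))
    box = (min(r1["box"][0], r2["box"][0]), min(r1["box"][1], r2["box"][1]),
           max(r1["box"][2], r2["box"][2]), max(r1["box"][3], r2["box"][3]))
    return {"box": box, "size": size, "color": color}

def toy_selective_search(regions):
    """Greedily merge the most similar pair; every region ever seen is a proposal."""
    regions = list(regions)
    proposals = [r["box"] for r in regions]
    while len(regions) > 1:
        pairs = ((i, j) for i in range(len(regions))
                 for j in range(i + 1, len(regions)))
        i, j = min(pairs, key=lambda p: color_distance(regions[p[0]], regions[p[1]]))
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged["box"])
    return proposals

segments = [
    {"box": (0, 0, 2, 2), "size": 4, "color": (1.0, 0.0, 0.0)},
    {"box": (2, 0, 4, 2), "size": 4, "color": (0.9, 0.0, 0.0)},
    {"box": (0, 2, 2, 4), "size": 4, "color": (0.0, 0.0, 1.0)},
    {"box": (2, 2, 4, 4), "size": 4, "color": (0.0, 0.0, 0.8)},
]
proposals = toy_selective_search(segments)
```

Because boxes are recorded at every merge, the proposals cover small segments, mid-level groups, and finally the whole image, which is the hierarchical structure described in step 3.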

<div align="center">
<img src="../images/chap8/SelectiveS.png" width="695"/>
<p><i>Selective Search: from fine-grained segments to object-level proposals</i></p>
</div>

**Output:** ~2000 region proposals per image (vs. hundreds of millions of possible sliding windows!)

## R-CNN: Region-Based CNN

**Key Idea:** Use region proposals, CNN features, and per-class classifiers to detect multiple objects.



### **Step-by-Step Process**

<div align="center">
<img src="../images/chap8/step1RCNN.png" width="500"/>
</div>

#### **Step 1: Generate Region Proposals**

- Use **Selective Search** on input image
- Generates ~2000 region proposals (candidate bounding boxes)
- These are locations where objects *might* be

**Input:** Image $(H, W, 3)$  
**Output:** ~2000 regions $\{R_1, R_2, \ldots, R_{2000}\}$

<div align="center">
<img src="../images/chap8/step2RCNN.png" width="500"/>
</div>

---

#### **Step 2: Warp Regions to Fixed Size**

- Each region has a different size and aspect ratio
- The CNN requires a fixed input size (e.g., $224 \times 224$ for AlexNet)
- **Warp** (resize) each region to $224 \times 224$, ignoring its original aspect ratio

**Why?** The fully connected layers of the CNN expect a fixed-size input

**Case 1: Region too large:** Use (bilinear or bicubic) interpolation to reduce the pixel count<br>
**Case 2: Region too small:** Use interpolation to create new (estimated) pixels
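
Both cases use the same interpolation routine. Below is a minimal bilinear sketch for a single-channel image stored as a list of rows. The corner-aligned coordinate mapping and the helper names `bilinear_resize` and `warp_region` are assumptions of this example, and real pipelines would use an image library instead.

```python
def bilinear_resize(image, out_h, out_w):
    """Resize a 2D grid of numbers with bilinear interpolation (corners aligned)."""
    in_h, in_w = len(image), len(image[0])
    out = []
    for i in range(out_h):
        # Map the output row to a fractional source row.
        src_y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(src_y)
        y1 = min(y0 + 1, in_h - 1)
        dy = src_y - y0
        row = []
        for j in range(out_w):
            src_x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(src_x)
            x1 = min(x0 + 1, in_w - 1)
            dx = src_x - x0
            # Blend the four neighbouring pixels.
            top = image[y0][x0] * (1 - dx) + image[y0][x1] * dx
            bottom = image[y1][x0] * (1 - dx) + image[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out

def warp_region(image, box, out_h, out_w):
    """Crop box = (x1, y1, x2, y2) from the image, then resize to a fixed size."""
    x1, y1, x2, y2 = box
    crop = [row[x1:x2] for row in image[y1:y2]]
    return bilinear_resize(crop, out_h, out_w)

image = [[r * 4 + c for c in range(4)] for r in range(4)]
warped = warp_region(image, (1, 1, 3, 3), 3, 3)  # 2x2 crop upsampled to 3x3
```

Upsampling the 2x2 crop creates new in-between values (Case 2); calling the same function with a smaller output size covers Case 1.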


<div align="center">
<img src="../images/chap8/step3RCNN.png" width="500"/>
</div>

---

#### **Step 3: Extract CNN Features**

- Pass each warped region through a **pre-trained CNN** (e.g., AlexNet)
- Extract features from the last fully connected layer before the classification layer (fc7 in AlexNet)
- Get a **4096-dimensional feature vector** per region

**For each region $R_i$:**

$$\mathbf{f}_i = \text{CNN}(R_i) \in \mathbb{R}^{4096}$$

**Result:** 2000 feature vectors, one per region

<div align="center">
<img src="../images/chap8/step4RCNN.png" width="500"/>
</div>

---

#### **Step 4: Classify Each Region**

- Train **class-specific linear SVMs** (one per class)
- Each SVM takes the 4096-dim feature vector as input
- Outputs: **class scores** for each region

**For region $i$ and class $c$:**

$$\text{score}_{i,c} = \mathbf{w}_c^T \mathbf{f}_i + b_c$$

**Output:** Class scores for each of the ~2000 regions (SVMs output margins, not probabilities)
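
The per-class scoring formula can be sketched as follows. The class names, weights, and the 2-dimensional feature are made-up toy values standing in for trained SVM parameters and a 4096-dimensional feature.

```python
def svm_scores(feature, weights, biases):
    """Compute w_c . f + b_c for every class c."""
    return {
        cls: sum(w * f for w, f in zip(weights[cls], feature)) + biases[cls]
        for cls in weights
    }

# Toy parameters: one weight vector and one bias per class.
weights = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
biases = {"cat": -0.5, "dog": 0.0}

feature = [1.0, 2.0]                      # stand-in for a CNN feature vector
scores = svm_scores(feature, weights, biases)
best_class = max(scores, key=scores.get)  # highest-scoring class for this region
```

Each class is scored independently, so a region can receive a high score from several SVMs or from none.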

<div align="center">
<img src="../images/chap8/step5RCNN.png" width="500"/>
</div>

---

#### **Step 5: Bounding Box Regression**

- Initial proposals are rough
- Train a **linear regressor** to refine bounding box coordinates
- Learns offsets that adjust the box centre $(x, y)$ and size $(w, h)$ for better localization

**For each region $i$:**

$$(\Delta x, \Delta y, \Delta w, \Delta h) = \text{Regressor}(\mathbf{f}_i)$$

**Refined box** (centre offsets are scaled by the proposal size, so the targets are scale-invariant):

$$\hat{x} = x + w \cdot \Delta x, \quad \hat{y} = y + h \cdot \Delta y, \quad \hat{w} = w \cdot e^{\Delta w}, \quad \hat{h} = h \cdot e^{\Delta h}$$
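
A short sketch of applying the predicted offsets to a proposal. It uses the scale-invariant parameterization from the R-CNN paper, in which the centre offsets are multiplied by the proposal's width and height and the size offsets are applied in log-space; the function name is hypothetical.

```python
import math

def apply_deltas(box, deltas):
    """Refine a (centre_x, centre_y, w, h) box with (dx, dy, dw, dh) offsets."""
    x, y, w, h = box
    dx, dy, dw, dh = deltas
    return (
        x + w * dx,        # shift centre by a fraction of the width
        y + h * dy,        # shift centre by a fraction of the height
        w * math.exp(dw),  # exp keeps the new width positive
        h * math.exp(dh),  # exp keeps the new height positive
    )

proposal = (10.0, 20.0, 4.0, 8.0)
refined = apply_deltas(proposal, (0.5, -0.25, math.log(2), 0.0))
```

Here the centre moves half a box-width to the right and a quarter box-height up, and the width doubles. All-zero offsets leave the proposal unchanged.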

<div align="center">
<img src="../images/chap8/step6RCNN.png" width="500"/>
</div>

---

#### **Step 6: Non-Maximum Suppression (NMS)**

**Problem:** Multiple overlapping boxes detect the same object

**Solution:** For each class, keep the highest-scoring box and remove the boxes that overlap it heavily

**Algorithm:**
1. Sort boxes by confidence score (descending)
2. Select the box with the highest score and add it to the output
3. Remove all remaining boxes with $\text{IoU} > 0.5$ with the selected box
4. Repeat steps 2-3 until no boxes remain
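
The algorithm above can be sketched for the boxes of a single class. The corner format `(x1, y1, x2, y2)` and the function names are assumptions of this example; library implementations are vectorized but follow the same greedy logic.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Step 1: sort indices by score, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Step 2: take the best remaining box.
        best = order.pop(0)
        keep.append(best)
        # Step 3: drop every box that overlaps it too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first with IoU of about 0.68 and is suppressed, while the third box is far away and survives.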

<div align="center">
<img src="../images/chap8/B4MNS.png" width="400"/>
<img src="../images/chap8/AfterMNS.png" width="400"/>
<p><i>Before NMS (left) vs After NMS (right)</i></p>
</div>

---

### **R-CNN Training**

**Four-stage training process:**

| **Stage** | **What's Trained** | **Loss Function** |
|-----------|-------------------|-------------------|
| **1. Pre-train CNN** | CNN backbone on ImageNet | Classification loss |
| **2. Fine-tune CNN** | CNN on detection dataset | Classification loss |
| **3. Train SVMs** | Class-specific SVMs | Hinge loss |
| **4. Train bbox regressor** | Bounding box refinement | L2 loss |

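**Training data:**
- **CNN fine-tuning:** proposals with $\text{IoU} \geq 0.5$ with a ground-truth box are positives for that box's class; all other proposals are background
- **SVM training:** ground-truth boxes are the positives; proposals with $\text{IoU} < 0.3$ are negatives; proposals in between are ignored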
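
Labelling a proposal from its best IoU with any ground-truth box can be sketched as below. Both thresholds are parameters because how the band between them is handled depends on the training stage; the defaults are the two values quoted above, and returning `"ignored"` for that band is an assumption of this sketch.

```python
def label_proposal(best_iou, pos_threshold=0.5, neg_threshold=0.3):
    """Assign a training label from the proposal's best IoU with ground truth."""
    if best_iou >= pos_threshold:
        return "positive"
    if best_iou < neg_threshold:
        return "negative"
    return "ignored"  # ambiguous overlap: left out of training in this sketch

# Best IoU of five hypothetical proposals with the ground-truth boxes.
best_ious = [0.82, 0.50, 0.41, 0.29, 0.0]
labels = [label_proposal(v) for v in best_ious]
```

Setting both thresholds to the same value removes the ignored band, so every proposal becomes either positive or negative.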

---

### **R-CNN Mathematical Summary**

**Given image $I$:**

1. **Region proposals:** $\{R_1, \ldots, R_N\} = \text{SelectiveSearch}(I)$ where $N \approx 2000$

2. **CNN features:** $\mathbf{f}_i = \text{CNN}(\text{Warp}(R_i)) \in \mathbb{R}^{4096}$

3. **Classification scores:** $s_{i,c} = \text{SVM}_c(\mathbf{f}_i)$ for class $c$

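4. **Bbox refinement:** $\hat{b}_i$ is obtained by applying the offsets $\text{Regressor}(\mathbf{f}_i)$ to $b_i$ (shift the centre, rescale the width and height)

5. **NMS:** Per class, discard any box whose IoU with a higher-scoring kept box exceeds $\tau$ (typically $\tau = 0.5$)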

---

### **R-CNN Performance**

**Results on PASCAL VOC 2010:**
- **mAP (mean Average Precision):** 53.7%
- More than 30% relative improvement in mAP over the previous best result (reported by the authors on VOC 2012)
- One of the first methods to successfully apply CNNs to object detection

---

### **Advantages & Limitations**

| **✅ Advantages** | **❌ Limitations** |
|-------------------|-------------------|
| Significant accuracy improvement | **Very slow:** ~47 seconds per image (with VGG-16 on a GPU) |
| Leverages pre-trained CNNs | **Redundant computation:** CNN runs 2000 times |
| Simple pipeline | **Multi-stage training:** CNN, SVM, regressor trained separately |
| Works with any CNN backbone | **High memory usage:** Must cache features for all regions |
| Pioneered region-based detection | **Fixed region proposals:** Cannot learn better proposals |

**Key bottleneck:** Running CNN 2000 times per image!

**This motivates Fast R-CNN** → Share CNN computation across regions

---