# Detection and Localization

In the previous topics in this chapter we focused on:
- The construction of CNNs
- CNN architectures
- Classification using CNNs
- Feature visualisation

What if we wanted to classify multiple objects in an input image, and also identify the location of each classified object (via a tight bounding box around it)?

**Objective**

1. Single object Detection
2. Multiple Object Detection:
   - R-CNN
   - Fast R-CNN
   - Faster R-CNN
3. Single Stage Object Detectors

## Single Object Detection


### Steps

1. Choose a backbone network such as AlexNet, ResNet, or DenseNet.
2. From a fully connected (FC) layer, create two outputs:
    - One for classification (vector of 4096 $\to$ 1000)
    - One for the box coordinates (vector of 4096 $\to$ 4)

<br>
<div align="center">
<img src="../images/chap8/SingleObjDet.png" width="710"/>
</div>

#### Loss

| **Input** | **Task** | **Output** | **Loss Function** |
|-----------|----------|------------|-------------------|
| (H, W, 3) | **Classification** | Class label $c \in \{1, \ldots, C\}$ | Cross-Entropy Loss: $-\log p_c$ |
| (H, W, 3) | **Localization** | Bounding box $(x, y, w, h)$ | L2 Loss: $\sum (b_i - \hat{b}_i)^2$ or Smooth L1 |


Our total loss is a weighted sum of the two terms, where $\lambda$ balances the regression loss against the classification loss:

$$L(y, P, b, b') = -\log P(y) + \lambda L_{regression}(b, b')$$
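As a minimal sketch of this formula, assuming an L2 regression term and made-up values for the class probabilities and boxes (the function name `total_loss` and all numbers are illustrative, not part of the original pipeline):

```python
import numpy as np

def total_loss(class_probs, y, box_pred, box_true, lam=1.0):
    """L = -log P(y) + lam * L_regression(b, b'), with an L2 regression term."""
    cls_loss = -np.log(class_probs[y])        # cross-entropy for the true class y
    diff = np.asarray(box_pred, dtype=float) - np.asarray(box_true, dtype=float)
    reg_loss = np.sum(diff ** 2)              # sum of squared coordinate errors
    return float(cls_loss + lam * reg_loss)

# Hypothetical values: 3 classes, boxes as (x, y, w, h) in normalised units.
probs = np.array([0.1, 0.7, 0.2])
loss = total_loss(probs, y=1,
                  box_pred=[0.5, 0.5, 0.2, 0.3],
                  box_true=[0.5, 0.4, 0.2, 0.3],
                  lam=1.0)
print(loss)
```

Setting `lam=0` reduces this to plain classification, while a large `lam` makes the box error dominate.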

### Evaluate 

- For **Classification**, we evaluate using the **accuracy** metric 
- For **Localization**, we compute the **Intersection over Union (IoU)**

---

#### **Intersection over Union (IoU)**

$$\text{IoU}(A, B) = \frac{\text{Area of Intersection}}{\text{Area of Union}} = \frac{|A \cap B|}{|A \cup B|}$$

**Where:**
- $A$ = predicted bounding box
- $B$ = ground truth bounding box
- $|A \cap B|$ = area of the overlapping region (intersection)
- $|A \cup B|$ = total covered area (union)

**Properties:**
- **Range:** $\text{IoU} \in [0, 1]$
  - $\text{IoU} = 0$ → no overlap (completely wrong prediction)
  - $\text{IoU} = 1$ → perfect overlap (exact match)
- **Common threshold:** $\text{IoU} \geq 0.5$ is considered a "good" detection
  - $\text{IoU} \geq 0.5$ → True Positive (TP)
  - $\text{IoU} < 0.5$ → False Positive (FP)
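The IoU definition can be sketched in a few lines. This assumes boxes are given as corner coordinates $(x_1, y_1, x_2, y_2)$; the function name `iou` and the example boxes are illustrative.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at 0 when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

pred = (0, 0, 10, 10)
gt = (5, 5, 15, 15)
print(iou(pred, gt))  # intersection 25, union 175
```

Here the two boxes share a $5 \times 5$ corner, so the IoU is $25 / 175 \approx 0.14$, which falls below the 0.5 threshold.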

**Visual Example:**

<div align="center">
<img src="../images/chap8/IOUCalc.png" width="500"/>
</div>

---
---

## Multi-Object Detection

$\text{Input} \ (H, W, 3)$

$\text{Output} \ \{(C_1, (x_1, y_1, w_1, h_1)), (C_2, (x_2, y_2, w_2, h_2)) \dots (C_k, (x_k, y_k, w_k, h_k))\}$

Where $k$ is the number of objects to identify in the image


<div align="center">
<img src="../images/chap8/obj1.png" width="295"/>
<img src="../images/chap8/obj2.png" width="300"/>
<img src="../images/chap8/obj4.png" width="335"/>
</div>


### Challenges

1. **Multiple Outputs** - The number of objects changes per image
2. **Multiple types of Outputs** - We need to answer "what" and "where" for every object
3. **Multiple Lengths of outputs** - The output list has a different length for each image, so a single fixed-size output layer does not fit directly
4. **Multiple Size** - Objects vary in size
5. **Multiple Detections** - The same object can be detected multiple times
6. **Occlusions** - Objects can partially or fully hide other objects

### Naïve Approach - Sliding Window

1. Apply CNN classification to many different crops of the image
2. Classify each crop as **background** or a specific **object class**
3. Vary the crop size and position to detect objects at different scales and locations

**Complexity Issue**

To create a bounding box, we need 2 corner points (4 coordinates total): $(x_1, y_1, x_2, y_2)$

**Question:** Given an image of size $H \times W$, how many different possible bounding boxes can be created?

**Answer:**

Each bounding box is defined by:
- Top-left corner: $(x_1, y_1)$ where $x_1 \in \{0, \ldots, W-1\}$, $y_1 \in \{0, \ldots, H-1\}$
- Bottom-right corner: $(x_2, y_2)$ where $x_2 \in \{x_1+1, \ldots, W\}$, $y_2 \in \{y_1+1, \ldots, H\}$

**Number of possible boxes:**

Each axis offers $W+1$ (respectively $H+1$) candidate grid lines, and a box picks two distinct lines per axis:

$$\binom{H+1}{2} \times \binom{W+1}{2} = \frac{H(H+1)}{2} \times \frac{W(W+1)}{2} = O(H^2 W^2)$$

**Why this is a problem:**
- For a $224 \times 224$ image: $25200^2 \approx 6.4 \times 10^8$ possible boxes!
- Running a CNN on each crop is **computationally infeasible**
- We would need roughly 635 million forward passes per image
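The count can be checked directly from the corner ranges above: each axis has $W+1$ (or $H+1$) grid lines, and a box picks two distinct lines per axis. The sketch below compares that closed form with a brute-force enumeration on a tiny image; both function names are illustrative.

```python
from math import comb

def count_boxes(h, w):
    """Closed form: choose 2 of the (h + 1) horizontal and 2 of the (w + 1) vertical grid lines."""
    return comb(h + 1, 2) * comb(w + 1, 2)

def count_boxes_brute_force(h, w):
    """Enumerate every (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (only feasible for tiny images)."""
    return sum(1
               for x1 in range(w) for x2 in range(x1 + 1, w + 1)
               for y1 in range(h) for y2 in range(y1 + 1, h + 1))

print(count_boxes(3, 4), count_boxes_brute_force(3, 4))
print(count_boxes(224, 224))
```

The count grows as $O(H^2 W^2)$, so doubling the image side multiplies the number of candidate boxes by roughly 16.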

**This motivates region proposal methods (R-CNN, Fast R-CNN, Faster R-CNN)**

### Background: Region Proposal 

**Region proposal algorithms** provide candidate bounding boxes where objects are likely located.

**Advantages:**
- Fast: generates ~2000 regions in seconds on CPU
- Reduces the search space from every possible box to a few thousand candidates
- No neural network required (traditional computer vision)

**Example: Selective Search**

1. **Over-segment** the image into many small regions (via graph-based segmentation)
2. **Iteratively merge** similar regions based on color, texture, size, and shape
3. Creates a **hierarchical structure** capturing objects at various scales

<div align="center">
<img src="../images/chap8/SelectiveS.png" width="695"/>
<p><i>Selective Search: from fine-grained segments to object-level proposals</i></p>
</div>

**Output:** ~2000 region proposals per image (vs. the $O(H^2 W^2)$ boxes of an exhaustive sliding window!)

## R-CNN: Region-Based CNN

**Key Idea** Use region proposals + CNN features + Classifiers to detect multiple objects



### **Step-by-Step Process**

<div align="center">
<img src="../images/chap8/step1RCNN.png" width="500"/>
</div>

#### **Step 1: Generate Region Proposals**

- Use **Selective Search** on input image
- Generates ~2000 region proposals (candidate bounding boxes)
- These are locations where objects *might* be

**Input:** Image $(H, W, 3)$  
**Output:** ~2000 regions $\{R_1, R_2, \ldots, R_{2000}\}$

<div align="center">
<img src="../images/chap8/step2RCNN.png" width="500"/>
</div>

---

#### **Step 2: Warp Regions to Fixed Size**

- Each region has different size/aspect ratio
- CNN requires fixed input size (e.g., $224 \times 224$ for AlexNet)
- **Warp** (resize) each region to $224 \times 224$

**Why?** CNNs need fixed-size inputs

**Case 1: Too Large:** Use (bilinear, bicubic) interpolation to reduce pixel count<br>
**Case 2: Too Small:** Use interpolation to create new (estimated) pixels


<div align="center">
<img src="../images/chap8/step3RCNN.png" width="500"/>
</div>

---

#### **Step 3: Extract CNN Features**

- Pass each warped region through a **pre-trained CNN** (e.g., AlexNet)
- Extract features from the last fully connected layer before the classifier (fc7 in AlexNet)
- Get a **4096-dimensional feature vector** per region

**For each region $R_i$:**

$$\mathbf{f}_i = \text{CNN}(R_i) \in \mathbb{R}^{4096}$$

**Result:** 2000 feature vectors, one per region

<div align="center">
<img src="../images/chap8/step4RCNN.png" width="500"/>
</div>

---

#### **Step 4: Classify Each Region**

- Train **class-specific linear SVMs** (one per class)
- Each SVM takes the 4096-dim feature vector as input
- Outputs: **class scores** for each region

**For region $i$ and class $c$:**

$$\text{score}_{i,c} = \mathbf{w}_c^T \mathbf{f}_i + b_c$$

**Output:** Class scores (one per class) for each of the 2000 regions

<div align="center">
<img src="../images/chap8/step5RCNN.png" width="500"/>
</div>

---

#### **Step 5: Bounding Box Regression**

- Initial proposals are rough
- Train a **linear regressor** to refine bounding box coordinates
- Learns to adjust $(x, y, w, h)$ for better localization

**For each region $i$:**

$$(\Delta x, \Delta y, \Delta w, \Delta h) = \text{Regressor}(\mathbf{f}_i)$$

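**Refined box** (the translation is scaled by the proposal's width and height, so the predicted offsets are scale-invariant):

$$\hat{x} = x + w \cdot \Delta x, \quad \hat{y} = y + h \cdot \Delta y, \quad \hat{w} = w \cdot e^{\Delta w}, \quad \hat{h} = h \cdot e^{\Delta h}$$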

<div align="center">
<img src="../images/chap8/step6RCNN.png" width="500"/>
</div>

---

#### **Step 6: Non-Maximum Suppression (NMS)**

**Problem:** Multiple overlapping boxes detect the same object

**Solution:** Keep only the highest-scoring box, remove overlaps

**Algorithm:**
1. Sort boxes by confidence score (descending)
2. Select the box with the highest score and add it to the output
3. Remove it, along with all remaining boxes whose $\text{IoU}$ with it is $> 0.5$
4. Repeat from step 2 until no boxes remain
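The algorithm above can be sketched with NumPy as follows. This is a minimal greedy version, assuming boxes in $(x_1, y_1, x_2, y_2)$ format with positive area; the helper names and the example boxes are illustrative. In practice NMS is usually applied separately per class.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (M, 4) array of boxes, all as (x1, y1, x2, y2)."""
    ix1 = np.maximum(box[0], boxes[:, 0])
    iy1 = np.maximum(box[1], boxes[:, 1])
    ix2 = np.minimum(box[2], boxes[:, 2])
    iy2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))                # highest-scoring remaining box
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]   # drop boxes that overlap too much
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The first two boxes overlap heavily (IoU of about 0.82), so only the higher-scoring one survives, while the third box is far away and is kept.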

<div align="center">
<img src="../images/chap8/B4MNS.png" width="400"/>
<img src="../images/chap8/AMNS.png" width="400"/>
<p><i>Before NMS (left) vs After NMS (right)</i></p>
</div>

---

### **R-CNN Training**

**Four-stage training process:**

| **Stage** | **What's Trained** | **Loss Function** |
|-----------|-------------------|-------------------|
| **1. Pre-train CNN** | CNN backbone on ImageNet | Classification loss |
| **2. Fine-tune CNN** | CNN on detection dataset | Classification loss |
| **3. Train SVMs** | Class-specific SVMs | Hinge loss |
| **4. Train bbox regressor** | Bounding box refinement | L2 loss |

**Training data:**
- **Positive examples:** $\text{IoU} \geq 0.5$ with ground truth
- **Negative examples:** $\text{IoU} < 0.3$ with ground truth

---

### **R-CNN Mathematical Summary**

**Given image $I$:**

1. **Region proposals:** $\{R_1, \ldots, R_N\} = \text{SelectiveSearch}(I)$ where $N \approx 2000$

2. **CNN features:** $\mathbf{f}_i = \text{CNN}(\text{Warp}(R_i)) \in \mathbb{R}^{4096}$

3. **Classification scores:** $s_{i,c} = \text{SVM}_c(\mathbf{f}_i)$ for class $c$

4. **Bbox refinement:** $\hat{b}_i = \text{Refine}(b_i, \text{Regressor}(\mathbf{f}_i))$, i.e. apply the predicted offsets from Step 5 to the proposal $b_i$

5. **NMS:** Keep the highest-scoring box and discard every other box whose $\text{IoU}$ with it exceeds $\tau$ (typically $\tau = 0.5$); repeat on the remaining boxes

---

### **R-CNN Performance**

**Results on PASCAL VOC 2010:**
- **mAP (mean Average Precision):** 53.7%
- Improvement over previous best: +30% relative gain
- One of the first methods to successfully apply CNNs to object detection

---

### **Advantages & Limitations**

| **✅ Advantages** | **❌ Limitations** |
|-------------------|-------------------|
| Significant accuracy improvement | **Very slow:** ~47 seconds per image |
| Leverages pre-trained CNNs | **Redundant computation:** CNN runs 2000 times |
| Simple pipeline | **Multi-stage training:** CNN, SVM, regressor trained separately |
| Works with any CNN backbone | **High memory usage:** Must cache features for all regions |
| Pioneered region-based detection | **Fixed region proposals:** Cannot learn better proposals |

**Key bottleneck:** Running CNN 2000 times per image!

**This motivates Fast R-CNN** → Share CNN computation across regions

---

## Fast R-CNN


**Key Idea** Run the whole image through the CNN once, and reuse the resulting feature map for every region proposal

### **Step-by-Step Process**

<div align="center">
<img src="../images/chap8/step1RCNN.png" width="500"/>
</div>

---

#### **Step 1: Extract CNN Features**

- Pass the **whole image** through a **pre-trained CNN** (e.g., AlexNet)
- Extract features from the **last convolutional layer**
- Get a feature map, in the running example **512 channels of spatial size $20 \times 15$**

$$\mathbf{f} = \text{CNN}(I) \in \mathbb{R}^{512 \times 20 \times 15}$$

<div align="center">
<img src="../images/chap8/imcovNet.png" width="500"/>
</div>

---

#### **Step 2: Generate Region Proposals**

- Use **Selective Search** on input image
- Generates ~2000 region proposals (candidate bounding boxes)
- These are locations where objects *might* be

**Input:** Image $(H, W, 3)$  
**Output:** ~2000 regions $\{R_1, R_2, \ldots, R_{2000}\}$

<div align="center">
<img src="../images/chap8/step2RCNN.png" width="500"/>
</div>

---

#### **Step 3: Matching Region Dimensions in Feature Space: RoI Pooling and RoI Align**

We assume:
- Input image: $(3, H, W)$
- Feature map from backbone: $(C, H_f, W_f)$

**Running Example**

$$
\text{Image size} = (3, 640, 480)
$$

$$
\text{Feature map} = (512, 20, 15)
$$

- Each $R_i$ is defined by $(x_{i1}, y_{i1}, x_{i2}, y_{i2})$ in the **original image**.
- **Compute the scaling factors:**
  - $x_{\text{scale}} = \frac{\text{feature map width}}{\text{image width}}$
  - $y_{\text{scale}} = \frac{\text{feature map height}}{\text{image height}}$
    - In our example:
    - $x_{\text{scale}} = \frac{15}{480} = \frac{1}{32}$
    - $y_{\text{scale}} = \frac{20}{640} = \frac{1}{32}$
- **Map RoI to Feature Map**
  - $x_{i1}' = x_{i1} \cdot x_{\text{scale}}, \quad y_{i1}' = y_{i1} \cdot y_{\text{scale}}$
  - $x_{i2}' = x_{i2} \cdot x_{\text{scale}}, \quad y_{i2}' = y_{i2} \cdot y_{\text{scale}}$
  - This gives a region in feature map coordinates
- **Crop the region**
  - Use the scaled coordinates $(x_{i1}', y_{i1}', x_{i2}', y_{i2}')$ to select the corresponding rectangle from the feature map across all channels.
  - This produces $R_i' \in \mathbb{R}^{C \times h_i' \times w_i'}$
  - Each spatial location contains a **C-dimensional feature vector**
- **Divide into Fixed Grid**
  - Divide $R_i'$ into a fixed grid of bins (e.g. $7 \times 7$).
  - Each bin size:
    - $\text{bin width} = \frac{x_{i2}' - x_{i1}'}{7}$
    - $\text{bin height} = \frac{y_{i2}' - y_{i1}'}{7}$
- **RoI Pooling**
  - For each bin:
    1. Round bin boundaries to integer indices
    2. Take all feature values inside the bin
    3. Apply max pooling
  - **Output:** for $N$ RoIs, a tensor of shape $(N, C, 7, 7)$
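The steps above can be sketched for a single RoI as follows. This is a simplified illustration, not a reference implementation: it assumes the bin edges are snapped outwards (floor for the start, ceil for the end), which is one possible way to "round" them, and the function name `roi_pool` and the example RoI are hypothetical.

```python
import numpy as np

def roi_pool(feature_map, roi, scale, out_size=7):
    """Max-pool one RoI into a fixed (C, out_size, out_size) grid.

    feature_map: (C, H_f, W_f) array; roi: (x1, y1, x2, y2) in image pixels;
    scale: feature-map size / image size (1/32 in the running example).
    """
    C, H_f, W_f = feature_map.shape
    x1, y1, x2, y2 = (c * scale for c in roi)    # map the RoI to feature-map coordinates
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    out = np.zeros((C, out_size, out_size), dtype=feature_map.dtype)
    for by in range(out_size):
        for bx in range(out_size):
            # Quantization step: snap the bin edges to integer indices.
            xs = int(np.floor(x1 + bx * bin_w))
            xe = int(np.ceil(x1 + (bx + 1) * bin_w))
            ys = int(np.floor(y1 + by * bin_h))
            ye = int(np.ceil(y1 + (by + 1) * bin_h))
            # Clamp to the feature map and keep every bin at least one cell wide.
            xs = min(max(xs, 0), W_f - 1)
            xe = min(max(xe, xs + 1), W_f)
            ys = min(max(ys, 0), H_f - 1)
            ye = min(max(ye, ys + 1), H_f)
            out[:, by, bx] = feature_map[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
features = rng.standard_normal((512, 20, 15))            # running example: (C, H_f, W_f)
pooled = roi_pool(features, roi=(64, 96, 320, 416), scale=1 / 32)
print(pooled.shape)
```

Stacking the result for $N$ RoIs gives the $(N, C, 7, 7)$ tensor that is passed to the fully connected layers.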

#### RoI Pooling

The issue with RoI Pooling is the **quantization error** caused by rounding (step 1 of RoI Pooling above).

**Example**

Let the feature map size be $20 \times 15$.

Our mapped RoI: $(x_1', y_1', x_2', y_2') = (2.3, 3.7, 10.8, 9.2)$

The width: $10.8 - 2.3 = 8.5$

The height: $9.2 - 3.7 = 5.5$

Suppose we divide into 2x2 bins, then:

- Bin width: $\frac{8.5}{2} = 4.25$
- Bin height: $\frac{5.5}{2} = 2.75$

Our boundaries are:

- Top Left: $x \in [2.3, 6.55] \quad y \in [6.45, 9.2]$
- Top Right: $x \in [6.55, 10.8] \quad y \in [6.45, 9.2]$
- Bottom Left: $x \in [2.3, 6.55] \quad y \in [3.7, 6.45]$
- Bottom Right: $x \in [6.55, 10.8] \quad y \in [3.7, 6.45]$
  
**What RoI Pooling does:**

- Top Left: $x \in [2.3 \rightarrow 2, 6.55 \rightarrow 7] \quad y \in [6.45 \rightarrow 6, 9.2 \rightarrow 9]$
- Top Right: $x \in [6.55 \rightarrow 7 , 10.8 \rightarrow 11] \quad y \in [6.45 \rightarrow 6, 9.2 \rightarrow 9]$
- Bottom Left: $x \in [2.3 \rightarrow 2, 6.55 \rightarrow 7] \quad y \in [3.7 \rightarrow 4, 6.45 \rightarrow 6]$
- Bottom Right: $x \in [6.55 \rightarrow 7 , 10.8 \rightarrow 11] \quad y \in [3.7 \rightarrow 4, 6.45 \rightarrow 6]$

After rounding, the left bins are 5 wide and the right bins 4 wide (instead of 4.25), while the top bins are 3 high and the bottom bins 2 high (instead of 2.75). The pooled features therefore no longer line up with the true RoI, which causes spatial misalignment.

#### RoI Align

- **Do NOT round coordinates**
  - Our mapped RoI: $(x_1', y_1', x_2', y_2') = (2.3, 3.7, 10.8, 9.2)$
- **Divide into exact floating-point bins**
  - Bin width: $\frac{8.5}{2} = 4.25$
  - Bin height: $\frac{5.5}{2} = 2.75$
- **Sampling:** Choose sampling point(s) inside each bin (often the center of the bin)
  - Top Left: $(4.425, 7.825)$
  - Top Right: $(8.675, 7.825)$
  - Bottom Left: $(4.425, 5.075)$
  - Bottom Right: $(8.675, 5.075)$
- **Bilinear Interpolation:** since the feature map only exists at integer grid points, interpolate each sampling point from its four nearest neighbors. For the Top Left point:
  - $i = \lfloor 4.425 \rfloor = 4, \quad j = \lfloor 7.825 \rfloor = 7$
  - $\alpha = 4.425 - 4 = 0.425, \quad \beta = 7.825 - 7 = 0.825$
  - The nearest integer neighbors are $(4, 7), (5, 7), (4, 8), (5, 8)$
  - Interpolate: $F(x,y) = (1-\alpha)(1-\beta)F(4,7) + \alpha(1-\beta)F(5,7) + (1-\alpha)\beta F(4,8) + \alpha\beta F(5,8)$
  - Here $F(x,y)$ is the value of the feature map at spatial location $(x, y)$
- **Aggregate:** If multiple sampling points are used per bin, either:
  - Average them
  - Take the maximum

**Final Output:** $(C, 7, 7)$ in general, or $(C, 2, 2)$ in this example

<div align="center">
<img src="../images/chap8/Biinterpolation.png" width="450"/>
<img src="../images/chap8/ROIPool.png" width="510"/>
</div>

Note:

Generally, $\alpha, \beta \in [0,1]$ are calculated as follows:

$x_1 = \lfloor x \rfloor, \quad x_2 = x_1 + 1, \quad y_1 = \lfloor y \rfloor, \quad y_2 = y_1 + 1$

$\alpha = \frac{x - x_1}{x_2 - x_1} \quad \beta = \frac{y - y_1}{y_2 - y_1}$

Since $x_2 - x_1 = 1$ and $y_2 - y_1 = 1$, this simplifies to $\alpha = x - x_1$ and $\beta = y - y_1$.

For more information on Bilinear interpolation: https://github.com/yossefPartouche/Computer_Graphics/blob/main/Unit%201/L1.6_Anti-Aliasing.ipynb
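A minimal sketch of bilinear sampling and of RoI Align with one sampling point per bin (the bin center). It assumes the feature map is indexed as `feature_map[channel, y, x]`; the function names and the linear test map are illustrative.

```python
import numpy as np

def bilinear_sample(feature_map, x, y):
    """Sample a (C, H_f, W_f) feature map at a real-valued point (x, y)."""
    C, H_f, W_f = feature_map.shape
    x = min(max(x, 0.0), W_f - 1.0)                 # stay inside the grid
    y = min(max(y, 0.0), H_f - 1.0)
    x1, y1 = int(np.floor(x)), int(np.floor(y))
    x2, y2 = min(x1 + 1, W_f - 1), min(y1 + 1, H_f - 1)
    alpha, beta = x - x1, y - y1                    # fractional offsets in [0, 1)
    return ((1 - alpha) * (1 - beta) * feature_map[:, y1, x1]
            + alpha * (1 - beta) * feature_map[:, y1, x2]
            + (1 - alpha) * beta * feature_map[:, y2, x1]
            + alpha * beta * feature_map[:, y2, x2])

def roi_align(feature_map, roi_fm, out_size=2):
    """roi_fm = (x1, y1, x2, y2), already in feature-map coordinates (not rounded)."""
    x1, y1, x2, y2 = roi_fm
    bin_w, bin_h = (x2 - x1) / out_size, (y2 - y1) / out_size
    out = np.zeros((feature_map.shape[0], out_size, out_size))
    for by in range(out_size):                      # by = 0 is the bin with the smallest y
        for bx in range(out_size):
            cx = x1 + (bx + 0.5) * bin_w            # bin center, kept as a float
            cy = y1 + (by + 0.5) * bin_h
            out[:, by, bx] = bilinear_sample(feature_map, cx, cy)
    return out

# A linear test map F[y, x] = x + 10 * y, which bilinear interpolation reproduces exactly.
fm = (np.arange(15)[None, :] + 10.0 * np.arange(20)[:, None])[None]   # shape (1, 20, 15)
aligned = roi_align(fm, (2.3, 3.7, 10.8, 9.2), out_size=2)
print(aligned.shape)
print(bilinear_sample(fm, 4.425, 7.825))            # close to 4.425 + 10 * 7.825
```

Because no coordinate is rounded, a small shift of the RoI produces a small, smooth change in the output, which is the property RoI Pooling lacks.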

---

#### **Step 4: Classify and Regress**

- **Flatten** each pooled feature map.
- **Pass through fully connected layers (2 Layers)** (shared for all RoIs).
- **Output two heads**:
  - **Classification head:** predicting the class for each RoI.
  - **Regression head:** predicting the box transform (offsets) for each RoI.

<div align="center">
<img src="../images/chap8/FRCNNpred.png" width="500"/>
</div>

---

<div align="center">

### R-CNN vs. Fast R-CNN



<img src="../images/chap8/RCNNvsFRCNN.png" width="800"/>
</div>


**Key bottleneck:** Runtime is now dominated by computing the region proposals

**This motivates Faster R-CNN**, which uses a Region Proposal Network (RPN) to predict region proposals from the CNN features.

## Faster R-CNN

**Key Idea** 

As in the previous networks, we first compute a feature map and then branch out. The difference is that the bounding box branch becomes another small network, the Region Proposal Network (RPN), which produces the region proposals directly from the feature map.

### **Step-by-Step Process**

<div align="center">
<img src="../images/chap8/step1RCNN.png" width="500"/>
</div>

---

#### **Step 1: Extract CNN Features**

- Pass the **whole image** through a **pre-trained CNN** (e.g., AlexNet)
- Extract features from the **last convolutional layer**
- Get a feature map with **$C_{in}$ channels of spatial size $H_f \times W_f$**

$$CNN(\text{Image}) = F \in \mathbb{R}^{C_{in} \times H_f \times W_f}$$

<div align="center">
<img src="../images/chap8/imcovNet.png" width="500"/>
</div>

---

#### **Step 2: Define K Anchor Boxes**

Anchor boxes are predefined reference bounding boxes placed at each spatial location in the feature map.

- They serve as reference boxes for detecting objects of different scales and aspect ratios.
- At each spatial location in the feature map we'll try to fit these predefined anchor boxes.
- $K$ is a hyperparameter: the number of anchors per location (written $k$ in the formulas below).

Spatial location: a single position $(i, j)$ of the feature map, taken across all channels.
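To make the anchor grid concrete, here is a minimal sketch that places $k$ anchors at every spatial location. The scales, aspect ratios, and stride are assumptions chosen for illustration (3 scales times 3 ratios, so $k = 9$, with the stride of 32 from the running example); the function name `make_anchors` is hypothetical.

```python
import numpy as np

def make_anchors(h_f, w_f, stride, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (h_f * w_f * k, 4) anchors as (x_center, y_center, w, h) in image pixels."""
    shapes = []
    for s in scales:
        for r in ratios:                       # r = height / width, area stays s * s
            shapes.append((s / np.sqrt(r), s * np.sqrt(r)))
    shapes = np.array(shapes)                  # (k, 2) widths and heights
    k = len(shapes)

    # Center of each feature-map cell, projected back to image coordinates.
    ys, xs = np.meshgrid(np.arange(h_f), np.arange(w_f), indexing="ij")
    centers = np.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], axis=-1)  # (h_f, w_f, 2)

    centers = np.repeat(centers.reshape(-1, 1, 2), k, axis=1)   # (h_f * w_f, k, 2)
    sizes = np.broadcast_to(shapes, centers.shape)              # same k shapes everywhere
    return np.concatenate([centers, sizes], axis=-1).reshape(-1, 4)

anchors = make_anchors(h_f=20, w_f=15, stride=32)
print(anchors.shape)     # 20 * 15 locations, 9 anchors each
print(anchors[:3])       # the first three anchors share one center
```

Every location gets the same $k$ shapes; only the centers change, which is why the network only needs $k$ groups of output channels.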

<div align="center">
<img src="../images/chap8/spatialLoc.png" width="500"/>
</div>

---

#### **Step 3: 3x3 Convolution**

- **Apply 3x3 convolution**
   - This increases the receptive field, so that each location does not rely on a single feature-map pixel for context.
   - $F' = \text{Conv}_{3 \times 3}(F)$

---

#### **Step 4: Branch into Two Heads**
From $F'$ we branch into two 1x1 convolutions:
1. **Classification/Objectness Head:** 
   - $\text{Conv}_{1 \times 1}$ with weights $W_{cls} \in \mathbb{R}^{2k \times C_{in} \times 1 \times 1}$ (two scores per anchor: background and object)
   - **Output:** $F_{1}'' \in \mathbb{R}^{2k \times H_f \times W_f}$
2. **Regression Head:** 
   - $\text{Conv}_{1 \times 1}$ with weights $W_{reg} \in \mathbb{R}^{4k \times C_{in} \times 1 \times 1}$ (four box offsets per anchor)
   - **Output:** $F_{2}'' \in \mathbb{R}^{4k \times H_f \times W_f}$

---
  
#### **Step 5: Compute Objectness Scores + BB Proposals**

The outputs of the RPN's two heads can be thought of as scores (as in a classic network):

- Classification scores
- Regression scores (box offsets)

**Step 5a: Map Channel to Anchors**

- Each spatial location $(i, j)$ in the feature map corresponds to a receptive field in the input image.
- At each location we have $k$ anchors, and the channels of the heads are assigned to anchors in a fixed mapping.
- We extract $F_{1}''[:, i, j] \in \mathbb{R}^{2k}$ and $F_{2}''[:, i, j] \in \mathbb{R}^{4k}$, which are grouped per anchor.
  
  <div align="center">

  | **Head** | **Anchor** | **Channel Indices** | 
  |----------|------------|---------------------|
  | Classification | 1 | 0-1|
  | Classification | 2 | 2-3|
  | $\dots$ | $\dots$ | $\dots$|
  | Classification | k | 2k-2, 2k-1 | 
  | Regression | 1 | 0-3|
  | Regression | 2 | 4-7| 
  | $\dots$ | $\dots$ | $\dots$|
  | Regression | k | 4k - 4 $\dots$ 4k-1|
</div>
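The two heads and the channel-to-anchor mapping can be sketched with NumPy. A 1x1 convolution is a per-location matrix product, so it is written here with `np.einsum`. The weights are random placeholders, biases are omitted, and the sizes ($C_{in} = 512$, $k = 9$) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, H_f, W_f, k = 512, 20, 15, 9

F_prime = rng.standard_normal((C_in, H_f, W_f))      # output of the 3x3 convolution
W_cls = rng.standard_normal((2 * k, C_in)) * 0.01    # 1x1 kernels, spatial dims dropped
W_reg = rng.standard_normal((4 * k, C_in)) * 0.01

def conv1x1(weights, features):
    """Apply a 1x1 convolution: mix channels independently at every (i, j)."""
    return np.einsum("oc,chw->ohw", weights, features)

F1 = conv1x1(W_cls, F_prime)                         # (2k, H_f, W_f)
F2 = conv1x1(W_reg, F_prime)                         # (4k, H_f, W_f)

# Fixed mapping: anchor a owns channels [2a, 2a + 1] and [4a, ..., 4a + 3] (0-based a).
cls_per_anchor = F1.reshape(k, 2, H_f, W_f)
reg_per_anchor = F2.reshape(k, 4, H_f, W_f)

i, j = 3, 4
print(cls_per_anchor[:, :, i, j].shape)              # (k, 2): [s_bg, s_obj] per anchor
print(reg_per_anchor[:, :, i, j].shape)              # (k, 4): [t_x, t_y, t_w, t_h] per anchor
```

The reshape only reinterprets the channel axis, so the second anchor's background score `cls_per_anchor[1, 0]` is the same data as channel 2 of `F1`.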

**Step 5b: Compute Objectness Scores**

1. **Extract anchor scores**: For anchor $a \in \{1, \dots, k\}$ extract $[s_{bg}^{(a)}, s_{obj}^{(a)}]$.
2. **Convert to probabilities**: Apply softmax: $p_{obj}^{(a)} = \frac{\exp(s_{obj}^{(a)})}{\exp(s_{obj}^{(a)}) + \exp(s_{bg}^{(a)})}$, with $p_{bg}^{(a)} = 1 - p_{obj}^{(a)}$
   - We compute this probability for every anchor at the current location, then move on to the next location.
3. **Assign training labels using IoU**: 
   - Compute IoU between each anchor and all ground-truth boxes

<div align="center">
  
   |IoU condition | Label | Contributes to loss? | 
   |--------------|-------|----------------------|
   |Highest IoU with a GT box| Positive (1) | Yes (classification + regression) | 
   | IoU ≥ 0.7 with any GT | Positive (1) | Yes |
   | IoU ≤ 0.3 with all GT | Negative (0) | Yes (classification only) | 
   | 0.3 < IoU < 0.7 | Ignore | No | 
</div>

4. **Compute Loss (CE)**:
   - For each anchor used in training: $L_\text{cls}^{(a)} = - \Big[ y^{(a)} \log p_{obj}^{(a)} + (1 - y^{(a)}) \log p_{bg}^{(a)} \Big]$ 
5. **Full image Classification Loss**:
   -  $L_\text{cls} = \frac{1}{N_\text{anchors}} \sum_{a \in \text{used anchors}} L_\text{cls}^{(a)}$

<div align="center">

```spatial location → anchor → IoU → label → softmax → CE loss```

</div>
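The pipeline summarised above (IoU, then label, then softmax, then cross-entropy) can be sketched end to end. This is a minimal illustration under stated assumptions: boxes are in $(x_1, y_1, x_2, y_2)$ format, ignored anchors get the label $-1$, and the anchors, ground-truth boxes, and scores are made-up values.

```python
import numpy as np

def pairwise_iou(anchors, gts):
    """IoU matrix of shape (num_anchors, num_gt); boxes are (x1, y1, x2, y2)."""
    a, g = anchors[:, None, :], gts[None, :, :]
    iw = np.clip(np.minimum(a[..., 2], g[..., 2]) - np.maximum(a[..., 0], g[..., 0]), 0, None)
    ih = np.clip(np.minimum(a[..., 3], g[..., 3]) - np.maximum(a[..., 1], g[..., 1]), 0, None)
    inter = iw * ih
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_a + area_g - inter)

def assign_labels(anchors, gts, hi=0.7, lo=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for every anchor."""
    ious = pairwise_iou(anchors, gts)
    max_iou = ious.max(axis=1)
    labels = np.full(len(anchors), -1)
    labels[max_iou <= lo] = 0
    labels[max_iou >= hi] = 1
    labels[ious.argmax(axis=0)] = 1          # the best anchor of each GT box is always positive
    return labels

def objectness_loss(scores, labels):
    """scores: (N, 2) as [s_bg, s_obj]; mean cross-entropy over non-ignored anchors."""
    used = labels >= 0
    s, y = scores[used], labels[used]
    s = s - s.max(axis=1, keepdims=True)     # shift for numerical stability
    log_p = s - np.log(np.exp(s).sum(axis=1, keepdims=True))   # log-softmax
    per_anchor = -(y * log_p[:, 1] + (1 - y) * log_p[:, 0])
    return float(per_anchor.mean())

anchors = np.array([[0, 0, 10, 9], [0, 0, 10, 5], [0, 0, 10, 2], [40, 40, 50, 50]], dtype=float)
gts = np.array([[0, 0, 10, 10], [40, 40, 60, 60]], dtype=float)
scores = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, -1.0], [0.0, 3.0]])

labels = assign_labels(anchors, gts)
print(labels)
print(objectness_loss(scores, labels))
```

The fourth anchor has an IoU of only 0.25 with its ground-truth box, but it is the best match for that box, so the "highest IoU" rule still marks it positive.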


**Step 5c: Regression Per Anchor**

<div align="center">
<img src="../images/chap8/step5c.png" width="800"/>
</div>

1. **Extract predicted offsets** 
   - For anchor $a \in \{1, \dots, k\}$ extract $[t_x^{(a)}, t_y^{(a)}, t_w^{(a)}, t_h^{(a)}]$.
2. **Apply predicted offsets to produce the region proposal (RP)**
   - $x_{pred} = x_{anchor} + t_x^{(a)} \cdot w_{anchor}$
   - $y_{pred} = y_{anchor} + t_y^{(a)} \cdot h_{anchor}$
   - $w_{pred} = w_{anchor} \cdot e^{t_w^{(a)}}$
   - $h_{pred} = h_{anchor} \cdot e^{t_h^{(a)}}$
3. **Compute training targets for positive anchors:**
   - Only anchors with **positive** labels from Step 5b are used in regression.
   - Target offsets are computed relative to the matched **ground truth** box $(x_\text{gt}, y_\text{gt}, w_\text{gt}, h_\text{gt})$
   - $t_x^* = \frac{x_\text{gt} - x_\text{anchor}}{w_\text{anchor}}$
   - $t_y^* = \frac{y_\text{gt} - y_\text{anchor}}{h_\text{anchor}}$
   - $t_w^* = \log \frac{w_\text{gt}}{w_\text{anchor}}$
   - $t_h^* = \log \frac{h_\text{gt}}{h_\text{anchor}}$
4. **Compute Regression Loss:**
   - Per anchor: $L_\text{reg}^{(a)} = \text{SmoothL1}\big([t_x^{(a)}, t_y^{(a)}, t_w^{(a)}, t_h^{(a)}] - [t_x^*, t_y^*, t_w^*, t_h^*]\big)$
   - Full regression loss: $L_\text{reg} = \frac{1}{N_\text{pos}} \sum_{a \in \text{positive anchors}} L_\text{reg}^{(a)}$
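The offset formulas above can be checked with a short sketch: encoding a ground-truth box relative to an anchor gives the targets $t^*$, and decoding those targets recovers the box. It assumes boxes are given as $(x, y, w, h)$ with $(x, y)$ the center; the function names and the example boxes are illustrative.

```python
import numpy as np

def encode(gt, anchor):
    """Target offsets t* of a ground-truth box relative to an anchor."""
    xa, ya, wa, ha = anchor
    xg, yg, wg, hg = gt
    return np.array([(xg - xa) / wa, (yg - ya) / ha, np.log(wg / wa), np.log(hg / ha)])

def decode(t, anchor):
    """Apply predicted offsets t to an anchor to get a proposal."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return np.array([xa + tx * wa, ya + ty * ha, wa * np.exp(tw), ha * np.exp(th)])

def smooth_l1(pred, target):
    """Smooth L1: quadratic for small errors, linear for large ones, summed over coordinates."""
    d = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    return float(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum())

anchor = (100.0, 100.0, 50.0, 50.0)
gt = (110.0, 95.0, 100.0, 25.0)

t_star = encode(gt, anchor)
print(t_star)                                  # [0.2, -0.1, log 2, log 0.5]
print(decode(t_star, anchor))                  # recovers the ground-truth box
print(smooth_l1(np.zeros(4), t_star))          # loss of an anchor that predicts "no change"
```

Dividing the translation by the anchor size and taking the log of the size ratio keeps the targets in a similar numeric range for small and large anchors.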


---
---


## Faster R-CNN Optimization: Single Stage Object Detection (SSD)

In Step 4 of the RPN we created $2k$ classification channels to determine whether an object exists at each anchor or not. Instead, we can directly determine **which** object it is, with background treated as one more class!

**Reminder**

In chapter 4 we discussed **Multi-Class Classification**, so we can simply extend that idea to this network.

---

#### **Step 1: Extract CNN Features (same)** 
#### **Step 2: Define K Anchor Boxes (same)**
#### **Step 3: 3x3 Convolution (same)**

---

#### **Step 4: Branch into Two Heads (Optimized)**
From $F'$ we branch into two 1x1 convolutions<br> $\text{Given } C = \text{ number of object classes}$

1. **Classification Head:** 
   - $\text{Conv}_{1 \times 1}$ with weights $W_{cls} \in \mathbb{R}^{(C+1)k \times C_{in} \times 1 \times 1}$ ($C$ object classes plus background, per anchor)
   - **Output:** $F_{1}'' \in \mathbb{R}^{(C+1)k \times H_f \times W_f}$
2. **Regression Head:** 
   - $\text{Conv}_{1 \times 1}$ with weights $W_{reg} \in \mathbb{R}^{4k \times C_{in} \times 1 \times 1}$
   - **Output:** $F_{2}'' \in \mathbb{R}^{4k \times H_f \times W_f}$

---

#### **Step 5: Compute Objectness Scores + BB Proposals (Optimized)**

**Step 5a: Map Channel to Anchors**

- Each spatial location $(i, j)$ in the feature map corresponds to a receptive field in the input image.
- At each location we have $k$ anchors, and the channels of the heads are assigned to anchors in a fixed mapping.
- We extract $F_{1}''[:, i, j] \in \mathbb{R}^{(C+1)k}$ and $F_{2}''[:, i, j] \in \mathbb{R}^{4k}$, which are grouped per anchor.
  
<div align="center">

| **Head**         | **Anchor** | **Channel Indices**      |
|------------------|------------|--------------------------|
| Classification   | 1          | 0 to C                   |
| Classification   | 2          | (C+1) to 2C+1            |
| ...              | ...        | ...                      |
| Classification   | k          | (k-1)(C+1) to k(C+1)-1   |
| Regression       | 1          | 0-3                      |
| Regression       | 2          | 4-7                      |
| ...              | ...        | ...                      |
| Regression       | k          | 4k-4 to 4k-1             |
</div>

**Step 5b: Compute Objectness Scores**

1. **Extract anchor scores**: For anchor $a \in \{1, \dots, k\}$ extract $[s_{bg}^{(a)}, s_{1}^{(a)}, s_{2}^{(a)}, \dots, s_{C}^{(a)}]$.
2. **Convert to probabilities**: Apply softmax over all $C+1$ classes. For anchor $a$:

   $$p_{c}^{(a)} = \frac{\exp(s_{c}^{(a)})}{\sum_{j=0}^{C} \exp(s_{j}^{(a)})}$$

   where $c = 0$ is background (so $s_{0}^{(a)} = s_{bg}^{(a)}$) and $c = 1, \dots, C$ are object classes.
   - We compute these probabilities for every anchor at the current location, then move on to the next location.
3. **Assign training labels using IoU (same)**: 
   - Compute IoU between each anchor and all ground-truth boxes

<div align="center">
  
   |IoU condition | Label | Contributes to loss? | 
   |--------------|-------|----------------------|
   |Highest IoU with a GT box| Positive (1) | Yes (classification + regression) | 
   | IoU ≥ 0.7 with any GT | Positive (1) | Yes |
   | IoU ≤ 0.3 with all GT | Negative (0) | Yes (classification only) | 
   | 0.3 < IoU < 0.7 | Ignore | No | 
</div>

4. **Compute Loss (CE)**:
   - For each anchor used in training:

     $$L_\text{cls}^{(a)} = -\sum_{c=0}^{C} y_c^{(a)} \log p_c^{(a)}$$

     where $y_c^{(a)}$ is 1 if anchor $a$'s true class is $c$, else 0, and $p_c^{(a)}$ is the predicted probability for class $c$.
5. **Full image Classification Loss**:
   -  $L_\text{cls} = \frac{1}{N_\text{anchors}} \sum_{a \in \text{used anchors}} L_\text{cls}^{(a)}$

<div align="center">

```spatial location → anchor → IoU → label → softmax → CE loss```

</div>
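For the multi-class head, the same idea can be sketched by reshaping the $(C+1)k$ channels into one score vector per anchor and applying a softmax over the class axis. The sizes ($C = 20$, $k = 9$) and the random head output are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
C, k, H_f, W_f = 20, 9, 20, 15
head_out = rng.standard_normal(((C + 1) * k, H_f, W_f))    # stand-in for the classification head

# Anchor a owns channels a*(C+1) ... a*(C+1) + C (0-based a); channel 0 of each group is background.
scores = head_out.reshape(k, C + 1, H_f, W_f)
scores = scores - scores.max(axis=1, keepdims=True)         # shift for numerical stability
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

def cross_entropy(probs_a, true_class):
    """With a one-hot label, -sum_c y_c log p_c reduces to -log p_(true class)."""
    return float(-np.log(probs_a[true_class]))

i, j, a = 3, 4, 0
loss_bg = cross_entropy(probs[a, :, i, j], true_class=0)    # anchor labelled as background
pred_class = probs.argmax(axis=1)                           # (k, H_f, W_f) predicted class per anchor

print(probs.shape, pred_class.shape)
print(loss_bg)
```

Compared with the two-way objectness head, only the size of the class axis changes; the label assignment and the regression branch stay the same.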

---

**Step 5c: Regression Per Anchor (same)**

---
---