# **Sliding Window and R-CNN Family**

---

## 1. Sliding Window Object Detection

### Overview

Before deep learning took over, **object detection** commonly relied on a **sliding window** approach: slide a window across the image, repeat the pass for several window sizes and aspect ratios, and run a **classifier** (such as an SVM on hand-crafted features, or later a CNN) on each patch to decide whether it contains an object.

### Process

1. Take the input image.
2. Slide a fixed-size window across the image with a stride of a few pixels, then repeat for each window size or image scale.
3. For each window:

   * Extract features (HOG, SIFT, or CNN features).
   * Classify using a binary classifier (object vs background).
4. Combine detections with **Non-Maximum Suppression (NMS)** to remove duplicates.
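
Step 4 can be sketched in a few lines. The following is a minimal greedy NMS in plain Python, assuming boxes are `(x1, y1, x2, y2)` tuples and an IoU threshold of 0.5; the helper names `iou` and `nms` and the sample boxes are illustrative, not taken from any particular library.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Illustrative detections: the first two boxes cover the same object
boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]: the lower-scoring duplicate (index 1) is suppressed
```

This version is quadratic in the number of boxes, which is fine for illustration; production code would normally use a vectorized implementation.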

### Drawbacks

* **Computationally expensive**: Thousands of overlapping windows.
* **Inefficient**: Most windows are background.
* **Poor scalability** for multiple object categories.
* **Fixed window shapes** make it hard to detect objects of varying size and aspect ratio unless many window shapes are scanned, which multiplies the cost.
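
To make the cost concrete, here is a rough count of how many patches a classifier would have to score. The image size (640×480), the 8-pixel stride, and the four window shapes are assumptions chosen for illustration, not values from a specific detector.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x1, y1, x2, y2) for every window that fits inside the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)


# Hypothetical setup: one 640x480 image, 8-pixel stride, four window shapes
img_w, img_h, stride = 640, 480, 8
window_sizes = [(64, 64), (128, 128), (128, 64), (64, 128)]  # (width, height)

counts = {
    size: sum(1 for _ in sliding_windows(img_w, img_h, size[0], size[1], stride))
    for size in window_sizes
}
total = sum(counts.values())

print(counts[(64, 64)])  # 3869 windows for the 64x64 shape alone
print(total)             # 13524 classifier evaluations for one image
```

Even this small configuration needs over thirteen thousand classifier calls per image, compared with the roughly 2000 proposals R-CNN evaluates.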

### Key takeaway

Sliding window established the **idea of localized detection**, but not efficiency. The R-CNN family addressed the cost by classifying a small set of candidate regions with learned deep features.

---

## 2. R-CNN (Regions with CNN features)

### Paper

**Rich feature hierarchies for accurate object detection and semantic segmentation** (Girshick, Donahue, Darrell, and Malik, CVPR 2014), the paper that introduced R-CNN
[Paper link](https://arxiv.org/abs/1311.2524)

### Main Idea

Instead of checking *every window*, R-CNN uses **region proposals**, a small set (~2000 per image) of likely object regions, and classifies each one using CNN features.

### Pipeline

1. **Region Proposal**: Generate ~2000 candidate regions using *Selective Search*.
2. **Feature Extraction**: Warp each region to a fixed size (227×227 for the AlexNet used in the paper) and feed it into a CNN that was pre-trained on ImageNet and fine-tuned for detection.
3. **Classification**: Score each region's feature vector with class-specific linear SVMs (one per object class).
4. **Bounding Box Regression**: Refine box coordinates with a class-specific linear regressor.
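
Step 4 is easier to follow with the target encoding written out. The sketch below uses the center/size parameterization described in the R-CNN paper (offsets scaled by the proposal size, log-ratios for width and height); the function names `to_center`, `encode`, and `decode` and the sample boxes are illustrative.

```python
import math


def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h


def encode(proposal, gt):
    """Regression targets (tx, ty, tw, th) that map a proposal onto a ground-truth box."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(gt)
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def decode(proposal, deltas):
    """Apply predicted deltas to a proposal to get the refined box."""
    px, py, pw, ph = to_center(proposal)
    tx, ty, tw, th = deltas
    cx, cy = px + tx * pw, py + ty * ph
    w, h = pw * math.exp(tw), ph * math.exp(th)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


# Illustrative boxes: the proposal is slightly off and too small
proposal = (50.0, 50.0, 150.0, 150.0)
gt = (60.0, 40.0, 180.0, 160.0)

deltas = encode(proposal, gt)       # what the regressor is trained to predict
refined = decode(proposal, deltas)  # recovers the ground-truth box
print(deltas)
print(refined)
```

Predicting scale-normalized offsets rather than raw pixel coordinates keeps the targets in a similar range for small and large proposals.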

### Architecture Diagram

```
Input Image → Selective Search → CNN Feature Extractor → SVM Classifier + Box Regressor
```

### Pros

* Major accuracy boost over sliding window.
* Introduced deep features to object detection.

### Cons

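* **Extremely slow** at test time (≈47 seconds per image with VGG-16 on a GPU, as reported in the Fast R-CNN paper), because the CNN runs once per proposal.
* Requires caching CNN features to disk in order to train the SVMs and box regressors.
* Training is multi-stage (fine-tune CNN → train SVMs → train bounding box regressors).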

---

## 3. Fast R-CNN (2015)

### Paper

**Fast R-CNN: Ross Girshick (2015)**
[Paper link](https://arxiv.org/abs/1504.08083)

### Motivation

R-CNN was accurate but **too slow** because it ran the CNN separately on each of the ~2000 regions.
Fast R-CNN improves efficiency by **sharing the convolutional computation** across all regions of an image.

### Pipeline

1. Input entire image to the **CNN once** → get a **feature map**.
2. Use **Region of Interest (RoI) pooling** to extract region-specific features from this shared feature map.
3. Each region’s feature vector goes into:

   * A softmax classifier (for object category)
   * A bounding box regressor (to refine coordinates)
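
The two heads are trained together with a multi-task loss. The sketch below follows the form given in the Fast R-CNN paper (log loss for the class plus a smooth L1 loss on the box deltas, skipped for background); the function names, the per-RoI calling convention, and the sample numbers are assumptions for illustration.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1 (less sensitive to outliers than L2)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def fast_rcnn_loss(class_probs, true_class, pred_deltas, target_deltas, lam=1.0):
    """Multi-task loss for one RoI; class 0 is assumed to be background."""
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background RoIs have no ground-truth box, so only the class term applies
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_deltas, target_deltas))
    return cls_loss + lam * loc_loss


# Illustrative values: softmax output over {background, class 1, class 2}
probs = [0.1, 0.7, 0.2]
pred = (0.1, 0.0, 0.2, -0.1)    # predicted (tx, ty, tw, th) for the true class
target = (0.2, 0.0, 0.0, 1.5)   # regression targets for this RoI

loss = fast_rcnn_loss(probs, 1, pred, target)
print(loss)
```

Here `lam` balances the two terms; a value of 1.0 is used as the default in this sketch.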

### Architecture

```
Image → CNN → Feature Map → RoI Pooling → Fully Connected Layers
                          → (Softmax Classifier + Box Regressor)
```

### Key Component: RoI Pooling

* Converts regions of arbitrary size into fixed-size feature maps (e.g., 7×7).
* Enables shared computation across proposals.
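
A small single-channel sketch shows how regions of different sizes end up with the same output shape. It assumes the RoI is already expressed in feature-map cells as `(x1, y1, x2, y2)` with exclusive upper bounds, and uses a 2×2 output instead of 7×7 to keep the example readable; `roi_max_pool` is an illustrative name, and real implementations operate on multi-channel tensors.

```python
import math


def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the RoI region of a 2D feature map into an out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range of bin i; floor/ceil keeps every bin non-empty
        ys = y1 + math.floor(i * roi_h / out_h)
        ye = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            xs = x1 + math.floor(j * roi_w / out_w)
            xe = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy 6x6 feature map whose value at (row, col) is row * 6 + col
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]

large = roi_max_pool(feature_map, (1, 1, 6, 5))  # 5 wide, 4 tall
small = roi_max_pool(feature_map, (0, 0, 3, 3))  # 3 wide, 3 tall
print(large)  # [[15, 17], [27, 29]]
print(small)  # [[7, 8], [13, 14]]
```

Both RoIs produce a 2×2 output, which is what lets the fully connected layers accept proposals of any size.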

### Pros

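* **Much faster** than R-CNN: the network itself takes ≈0.3 seconds per image with VGG-16, or roughly 2 seconds once Selective Search proposal time is included.
* **Single-stage training** with a multi-task loss (the classifier and box regressor are trained together, though proposals still come from an external step).
* **Higher mAP** than R-CNN on PASCAL VOC 2012, as reported in the paper.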

### Cons

* Still uses **Selective Search** for region proposals, which runs on the CPU and becomes the main runtime bottleneck.

---

## 4. Faster R-CNN (2016)

### Paper

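**Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks**, Ren, He, Girshick, and Sun (arXiv preprint 2015; journal version in IEEE TPAMI)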
[Paper link](https://arxiv.org/abs/1506.01497)

### Motivation

Fast R-CNN was still limited by the **slow region proposal step**.
Faster R-CNN replaces Selective Search with a **Region Proposal Network (RPN)**, so proposals are computed from the same feature map as the detector and the whole pipeline becomes a single learned network.

### Key Innovation: Region Proposal Network (RPN)

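* A small fully convolutional network that slides over the shared feature map (a 3×3 convolution followed by two sibling 1×1 convolutions in the paper).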
* Predicts:

  * **Objectness score** (object vs background)
  * **Bounding box coordinates** for each anchor
* Anchors are predefined reference boxes centered at each feature-map location (3 scales × 3 aspect ratios = 9 anchors per location in the paper).
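
Anchor generation can be sketched as follows. The defaults (stride 16, scales 128/256/512, ratios 1:2, 1:1, 2:1) match the settings described in the Faster R-CNN paper for a VGG-16 backbone; the function name `make_anchors` and the tiny 4×3 feature map are assumptions for illustration.

```python
import math


def make_anchors(feat_w, feat_h, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) in image coordinates."""
    anchors = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            # Center of this feature-map cell, projected back onto the image
            cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = height / width
                    # Keep the area at scale**2 while changing the shape
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(feat_w=4, feat_h=3)
print(len(anchors))  # 108 = 4 * 3 locations * 9 anchors each
print(anchors[0])
```

The RPN then predicts one objectness score and four box deltas per anchor, so the number of outputs grows with the feature-map size rather than with the number of objects.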

### Pipeline

1. Image → CNN → Feature Map
2. **RPN** → generates region proposals
3. **RoI Pooling** → extract features for each proposed region
4. **Classifier + Box Regressor** → final object detection

### Architecture

```
Image → CNN Backbone → Feature Map
          Feature Map → Region Proposal Network → RoIs
          Feature Map + RoIs → RoI Pooling → Classifier + Box Regressor
```

### Training

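* RPN and detection network share convolutional layers.
* The original paper trains them with a 4-step alternating scheme; approximate joint (end-to-end) training is also possible and is common in later implementations.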

### Performance

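* **Speed**: about 5 FPS with VGG-16 and 17 FPS with the smaller ZF network on a GPU, as reported in the paper
* **Accuracy**: State of the art on PASCAL VOC and MS COCO at publication, and a standard two-stage baseline for several years afterward
* **Extended by**: Mask R-CNN, Feature Pyramid Networks, etc.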

### Pros

* Fully deep-learning-based (no Selective Search).
* High accuracy, with proposals that are nearly free to compute because the RPN shares features with the detector.
* Works on complex datasets like COCO and Open Images.

### Cons

* Relatively complex to train.
* Still not real-time (compared to YOLO or SSD).

---

## 5. Comparison Summary

| Method             | Year | Region Proposal Method  | Speed     | Accuracy  | End-to-End |
| ------------------ | ---- | ----------------------- | --------- | --------- | ---------- |
| **Sliding Window** | 2005 | Dense windows           | Very slow | Low       | No         |
| **R-CNN**          | 2014 | Selective Search        | Very slow | High      | No         |
| **Fast R-CNN**     | 2015 | Selective Search        | Moderate  | High      | Partly     |
| **Faster R-CNN**   | 2016 | Region Proposal Network | Fast      | Very High | Yes        |

---

## 6. Implementation Example (PyTorch)

A minimal Faster R-CNN example using `torchvision`:

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_Weights,
    fasterrcnn_resnet50_fpn,
)

# Load Faster R-CNN (ResNet-50 + FPN backbone) with COCO-pretrained weights
model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)

# Set to evaluation mode so the model returns detections instead of losses
model.eval()

# Load the image and convert it to a float tensor in [0, 1] with shape (C, H, W)
image = Image.open("example.jpg").convert("RGB")
transform = T.Compose([T.ToTensor()])
img = transform(image)

# Run detection without tracking gradients; the model takes a list of images
with torch.no_grad():
    predictions = model([img])

# One dict per input image; boxes are (x1, y1, x2, y2) in pixel coordinates
print(predictions[0]["boxes"], predictions[0]["labels"], predictions[0]["scores"])
```
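
The raw output usually contains many low-confidence boxes, so a common next step is to keep only detections above a score threshold. The helper below is a minimal sketch that works on plain Python lists; the name `filter_detections`, the 0.5 threshold, and the sample values are assumptions for illustration.

```python
def filter_detections(boxes, labels, scores, score_threshold=0.5):
    """Keep detections whose confidence is at least score_threshold."""
    return [
        {"box": box, "label": label, "score": score}
        for box, label, score in zip(boxes, labels, scores)
        if score >= score_threshold
    ]


# Made-up values standing in for one image's predictions
boxes = [[10.0, 20.0, 110.0, 220.0], [30.0, 40.0, 90.0, 100.0]]
labels = [1, 18]
scores = [0.98, 0.12]

confident = filter_detections(boxes, labels, scores)
print(confident)  # only the first detection passes the 0.5 threshold
```

With the torchvision output above, the tensors can be converted first, for example `filter_detections(pred["boxes"].tolist(), pred["labels"].tolist(), pred["scores"].tolist())` where `pred = predictions[0]`. Label ids can be mapped to class names through `FasterRCNN_ResNet50_FPN_Weights.DEFAULT.meta["categories"]`.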

---

## 7. Evolution Summary

```
Sliding Window → R-CNN → Fast R-CNN → Faster R-CNN → Mask R-CNN
```

Each step improved:

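* **Speed**: by sharing convolutional computation across regions and replacing hand-crafted proposal methods with learned ones
* **Accuracy**: by learning more of the pipeline jointly on deep features
* **Scalability**: by handling many object classes in one network and, with Mask R-CNN, extending detection to instance segmentation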

---

## 8. References

* [R-CNN Paper (2014)](https://arxiv.org/abs/1311.2524)
* [Fast R-CNN Paper (2015)](https://arxiv.org/abs/1504.08083)
* [Faster R-CNN Paper (2016)](https://arxiv.org/abs/1506.01497)
* [TorchVision Object Detection Docs](https://pytorch.org/vision/stable/models.html#object-detection)
