## Object Detection

Assuming we are familiar with classification by this point, object detection can be explained as classification with localization: the model predicts not only what is in the image but also where it is.

### Classification

![mater](assets/mcqueen_real.png)

### with Localization (and over multiple objects)

![mcqueen](assets/mcqueen.jpeg)

## Where to use?

- Any task where knowing the location of the object(s) is useful
- Anything related to traffic: pedestrians, types of vehicles, drivable roads, landing zones, etc.
- Anything related to locating a disease in some type of medical imaging (MRI, ultrasound, CT, ...)
- Designing automated stores, factories, etc. (like Amazon Go cashierless stores)

The list could go on.

## Yea yea yea, it's all good, but how does it come to be and how can I learn / use it?

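Okay, we are already somewhat familiar with CNNs: the CNN acts as a feature extractor and connects to a fully connected (FC) layer with one output neuron per class, and ta-dah, we have a multi-class classifier. For detection, the output layer is extended: next to the class scores (c1, c2, c3) it also predicts an objectness score P and the bounding box values x, y, w, h.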

![out_neurons](assets/detection_output_neurons.png)

![annotated_out_neurons](assets/annotated_output_neurons.png)

In [1]:
# This can be penalized with any loss, but the main logic here is that the box
# and class outputs only count when the label says an object is present
loss_fn = lambda x, y: (x - y) ** 2
prediction = [1] * 8  # P, x, y, w, h, c1, c2, c3
label = [1] * 8

if label[0] == 0:
    # no object in the label: calculate loss for only the first neuron (P), we want it to be 0
    loss = loss_fn(prediction[0], label[0])
else:
    # object present: calculate loss over all the other predictions as well
    loss = sum(loss_fn(p, l) for p, l in zip(prediction, label))
    # you do not have to use one type of loss function here;
    # you can use different losses for the bounding box and for the classes
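To see both branches of that logic with numbers, here is a minimal sketch. The helper name `detection_loss` and the example vectors are made up for illustration, and plain squared error is only one possible choice of loss.

```python
def detection_loss(prediction, label):
    """Squared-error loss over a vector laid out as [P, x, y, w, h, c1, c2, c3]."""
    squared_error = lambda p, l: (p - l) ** 2
    if label[0] == 0:
        # no object in the label: only the objectness neuron P is penalized
        return squared_error(prediction[0], label[0])
    # object present: every output contributes to the loss
    return sum(squared_error(p, l) for p, l in zip(prediction, label))

prediction = [0.5, 0.5, 0.5, 0.25, 0.25, 0.0, 1.0, 0.0]
background = [0, 0, 0, 0, 0, 0, 0, 0]      # label without an object
car = [1, 0.5, 0.5, 0.5, 0.5, 0, 1, 0]     # label with an object of class c2

print(detection_loss(prediction, background))  # 0.25
print(detection_loss(prediction, car))         # 0.375
```

For the background label, the box and class outputs are ignored entirely, so the network is free to output anything there.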

## But how do we classify an unknown number of objects?

### Let me explain while expanding on some utility functions that make object detection the way it is

#### Sliding window detection

Sliding window detection is like searching for your car in a crowded parking lot, 
except instead of cars, it can be any kind of object in the image. This method repeatedly 
applies the same detector, or "window", to an image at multiple locations 
and scales. As the window slides around, it checks each spot to see if there is a good match.

**Example:** Imagine you're looking for your snail hiding somewhere in the house. A sliding window detector would move its window over the 
image, checking possible locations where your snail might be hiding, at different 
scales.


![](assets/sliding_snail.gif)

In [5]:
from typing import Tuple, Iterable
import numpy as np

def sliding_window(image: np.ndarray, step_size: int, window_size: Tuple[int, int]) -> Iterable[Tuple[int, int, np.ndarray]]:
    # Check that the image is 2D (H, W) before unpacking its shape
    if image.ndim != 2:
        raise ValueError("Input image should be a 2D array.")
    H, W = image.shape

    # window_size is (width, height); windows at the right / bottom border may come out smaller
    for y in range(0, H, step_size):
        for x in range(0, W, step_size):
            yield (x, y, image[y: y + window_size[1], x: x + window_size[0]])
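A small usage sketch, assuming we only want windows that fit entirely inside the image. The helper name `full_windows` and the toy 6x6 array are made up for illustration; this is a variation on the idea above, not a replacement for it.

```python
import numpy as np

def full_windows(image, step_size, window_size):
    """Yield (x, y, patch) for every window that fits fully inside a 2D image."""
    win_w, win_h = window_size
    H, W = image.shape
    # stop early enough that the window never runs past the border
    for y in range(0, H - win_h + 1, step_size):
        for x in range(0, W - win_w + 1, step_size):
            yield x, y, image[y:y + win_h, x:x + win_w]

image = np.arange(36).reshape(6, 6)  # toy 6x6 "image"
windows = list(full_windows(image, step_size=2, window_size=(4, 4)))

print(len(windows))                     # 4
print([(x, y) for x, y, _ in windows])  # [(0, 0), (2, 0), (0, 2), (2, 2)]
```

In a real detector, each patch would be resized and passed to a classifier, and the whole loop would be repeated for several window sizes.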


#### Sliding Windows over Convolution 

![swin_conv](assets/sliding_window_conv.png)

[image reference](https://www.coursera.org/learn/convolutional-neural-networks/lecture/6UnU4/convolutional-implementation-of-sliding-windows)

#### Intersection Over Union (IoU)

IoU is a measure of how well two regions overlap, in this case the predicted bounding box and the ground truth. 

As the name states, IoU is simply the ratio of the intersection area to the union area of the two 
bounding boxes: 1 means a perfect match and 0 means no overlap at all.

![](assets/iou-formula.webp)

![](assets/iou-example.png)

[image_1 reference](https://idiotdeveloper.com/what-is-intersection-over-union-iou/)

[image_2 reference](https://www.superannotate.com/blog/intersection-over-union-for-object-detection)

In [6]:
def iou(pred: Tuple[int, int, int, int], 
        gt: Tuple[int, int, int, int]) -> float:
    """ in xyxy format, you can write it as xywh format if you'd like """
    # corners of the intersection rectangle
    x1 = max(pred[0], gt[0])
    y1 = max(pred[1], gt[1])
    x2 = min(pred[2], gt[2])
    y2 = min(pred[3], gt[3])

    # intersection area (0 if the boxes do not overlap); +1 because pixel coordinates are inclusive
    intersection = max(0, x2 - x1 + 1) * max(0, y2 - y1 + 1)

    # area of boxes
    area_pred = (pred[2] - pred[0] + 1) * (pred[3] - pred[1] + 1)
    area_gt = (gt[2] - gt[0] + 1) * (gt[3] - gt[1] + 1)

    return intersection / float(area_pred + area_gt - intersection)
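A quick worked example helps to sanity-check the numbers. The sketch below is self-contained; the name `box_iou` and the boxes are made up for illustration, and it assumes the xyxy format with inclusive pixel coordinates (hence the `+ 1`).

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format with inclusive pixel coordinates."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    intersection = max(0, x2 - x1 + 1) * max(0, y2 - y1 + 1)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return intersection / (area_a + area_b - intersection)

pred = (0, 0, 9, 9)      # 10x10 box, area 100
gt = (5, 5, 14, 14)      # 10x10 box shifted by 5 pixels, area 100

# intersection is 5x5 = 25, union is 100 + 100 - 25 = 175
print(box_iou(pred, gt))                            # 25 / 175, about 0.143
print(box_iou(pred, pred))                          # 1.0, identical boxes
print(box_iou((0, 0, 4, 4), (10, 10, 14, 14)))      # 0.0, no overlap
```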

#### Anchor Boxes

Anchor boxes are like the buffet of object detection - they offer multiple choices or "anchors" for bounding box predictions. Instead of predicting a single box, an anchor box-based detector proposes a range of possible boxes that might contain an object.

**Example:** Imagine you are trying to detect all the animals in an image. An anchor box-based detector would propose multiple bounding boxes with different sizes and aspect ratios, covering the possible locations and shapes of the animals. The algorithm then adjusts these anchors based on the detected objects' characteristics, like size and shape, to get a more accurate detection result.


![anchor](assets/anchor_box.png)

In [7]:
def anchor_boxes(scales: list, aspect_ratios: list, image_size: Tuple[int, int]):
    # note: image_size is not used here, only the (width, height) shapes are generated
    boxes = []
    for scale in scales:  # different sizes for anchor boxes
        for ratio in aspect_ratios:  # ratio = width / height
            # the area stays at scale ** 2 while the shape changes
            width = scale * np.sqrt(ratio)
            height = scale / np.sqrt(ratio)
            # create anchor box
            boxes.append([width, height])
    return boxes
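To make the shapes concrete, here is a small self-contained sketch. The names `make_anchor_shapes` and `place_anchor`, as well as the scales and ratios, are made up for illustration; real detectors usually pick these values from the training data.

```python
import math

def make_anchor_shapes(scales, aspect_ratios):
    """Return (width, height) pairs; each pair keeps an area of scale ** 2."""
    shapes = []
    for scale in scales:
        for ratio in aspect_ratios:  # ratio = width / height
            shapes.append((scale * math.sqrt(ratio), scale / math.sqrt(ratio)))
    return shapes

def place_anchor(cx, cy, width, height):
    """Center an anchor shape on (cx, cy) and return it as (x1, y1, x2, y2)."""
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)

shapes = make_anchor_shapes(scales=[32, 64], aspect_ratios=[0.25, 1.0, 4.0])
print(shapes)
# [(16.0, 64.0), (32.0, 32.0), (64.0, 16.0), (32.0, 128.0), (64.0, 64.0), (128.0, 32.0)]

# the tall 16x64 anchor centered on pixel (100, 100)
print(place_anchor(100, 100, *shapes[0]))  # (92.0, 68.0, 108.0, 132.0)
```

Each scale gives a tall, a square, and a wide box of the same area, which is what lets one grid cell cover, say, a standing person and a car at the same time.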

#### Non-Max Suppression (NMS)

Each grid cell uses several anchor boxes, which learn to cover different shapes. But when running the algorithm, you will see that there are many redundant detections of the same object (can be observed in the image below). 

Non-max suppression is, as the name suggests, an algorithm that suppresses them: it keeps the box with the highest confidence, suppresses the other boxes whose IoU with it exceeds a certain threshold, and then repeats the process with the boxes that remain.


![](assets/nms.png)

[image reference](https://learnopencv.com/weighted-boxes-fusion/)

In [8]:
def non_max_suppression(boxes: list, scores: list, threshold: float = 0.5):
    if len(boxes) == 0:
        return []  # no prediction to suppress
    
    # it is easier to work with np arrays, so convert if it is not already that way
    boxes = np.array(boxes)
    scores = np.array(scores)

    # sorting bbox confidence scores in descending order
    indices = np.argsort(scores)[::-1]
    picked = []

    while len(indices) > 0:
        current = indices[0]
        picked.append(current)

        # compute iou for all of the rest
        remaining = indices[1:]
        ious = np.array([iou(boxes[current], boxes[i]) for i in remaining])

        # keep only the boxes whose iou with the current box is below the threshold;
        # the others overlap it too much and are suppressed
        indices = remaining[ious < threshold]

    return boxes[picked]
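A tiny end-to-end example shows the effect. This sketch is self-contained and returns indices instead of boxes; the names `box_iou` and `nms_indices` and the three boxes are made up for illustration.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in xyxy format with inclusive pixel coordinates."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1 + 1) * max(0, y2 - y1 + 1)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)

def nms_indices(boxes, scores, threshold=0.5):
    """Return the indices of the boxes kept by greedy NMS, best score first."""
    order = [int(i) for i in np.argsort(scores)[::-1]]
    kept = []
    while order:
        current = order.pop(0)  # highest remaining score
        kept.append(current)
        # drop every box that overlaps the kept one too much
        order = [i for i in order if box_iou(boxes[current], boxes[i]) < threshold]
    return kept

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]

print(nms_indices(boxes, scores))                 # [0, 2]
print(nms_indices(boxes, scores, threshold=0.9))  # [0, 1, 2]
```

The first two boxes have an IoU of about 0.83, so with the default threshold the lower-scoring one is suppressed; raising the threshold to 0.9 keeps both.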

#### Side note on how YOLO calculates loss

(this may not hold for the current ones like YOLOv7, v8, v9, v10..., can't keep track of them)

In [9]:
# placeholders: plug in your own box (IoU-based), binary cross-entropy and categorical cross-entropy losses
iou_loss_fn, bce, categorical_ce = None, None, None

def yolo_loss(predictions, ground_truth, anchors):
    # Split predictions into components (anchors is not used in this simplified version)
    obj_preds = predictions[..., 0]   # objectness
    box_preds = predictions[..., 1:5]  # x, y, w, h
    class_preds = predictions[..., 5:] # class predictions
    
    # "is there" an object?
    obj_loss = bce(obj_preds, ground_truth[..., 0])
    
    # "how much" of the object we have correctly guessed
    iou_loss = iou_loss_fn(box_preds, ground_truth[..., 1:5])
    
    # did we guess "which" object it is
    class_loss = categorical_ce(class_preds, ground_truth[..., 5:])
    
    return iou_loss + obj_loss + class_loss
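The cell above leaves the three loss functions as placeholders. As a rough sketch of how the pieces fit together for a single output vector, the NumPy version below fills them in with simple stand-ins. All of the function names and numbers here are assumptions for illustration (boxes are taken to be center x, center y, width, height), and this is not how any specific YOLO release computes its loss.

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for a probability p and a 0/1 target y."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def categorical_ce(probs, one_hot, eps=1e-7):
    """Cross-entropy between class probabilities and a one-hot target."""
    return float(-np.sum(one_hot * np.log(np.clip(probs, eps, 1.0))))

def iou_loss(a, b):
    """1 - IoU for two boxes given as (center x, center y, width, height)."""
    inter_w = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    inter_h = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return float(1.0 - inter / union)

def single_cell_loss(pred, target):
    """Loss for one vector laid out as [P, x, y, w, h, c1, c2, c3]."""
    obj_loss = bce(pred[0], target[0])
    if target[0] == 0:
        return obj_loss  # no object: only objectness is penalized
    box_loss = iou_loss(pred[1:5], target[1:5])
    class_loss = categorical_ce(pred[5:], target[5:])
    return box_loss + obj_loss + class_loss

target = np.array([1, 0.5, 0.5, 0.4, 0.4, 0, 1, 0])
perfect = target.copy()
shifted = np.array([0.8, 0.6, 0.5, 0.4, 0.4, 0.1, 0.8, 0.1])

print(single_cell_loss(perfect, target))  # very close to 0
print(single_cell_loss(shifted, target))  # clearly larger
```

The three terms answer the same three questions as the comments in the cell above: is there an object, how well does the box fit, and which class is it.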

#### What else does YOLO do?

![image reference](assets/yolo_dls.png)

[image reference](https://www.coursera.org/learn/convolutional-neural-networks/lecture/fF3O0/yolo-algorithm)

#### Some limitations of the original YOLO

- Struggles to generalize to objects that do not fit the anchor boxes, such as objects with unusual aspect ratios
- Struggles to differentiate between small errors on large boxes and the same errors on small boxes, where they are huge
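The second limitation can be illustrated with a little arithmetic. In the sketch below (the helper `iou_xyxy` and the boxes are made up, and a plain squared error on the x coordinate is assumed), the same 5 pixel shift gives the same coordinate error for both boxes, but a very different IoU.

```python
def iou_xyxy(a, b):
    """IoU of two boxes in continuous (x1, y1, x2, y2) coordinates."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def shift_x(box, dx):
    """Move a box horizontally by dx pixels."""
    return (box[0] + dx, box[1], box[2] + dx, box[3])

shift = 5
large = (0, 0, 100, 100)
small = (0, 0, 10, 10)

squared_error = shift ** 2  # identical for both boxes: 25
iou_large = iou_xyxy(large, shift_x(large, shift))
iou_small = iou_xyxy(small, shift_x(small, shift))

print(squared_error)  # 25
print(iou_large)      # 9500 / 10500, about 0.905
print(iou_small)      # 50 / 150, about 0.333
```

A loss that only looks at coordinate differences treats both mistakes as equal, even though the small box is now mostly off target.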

## Let's just infer stuff with yolo for fun, yolov5 is in torch.hub

In [10]:
import torch 
# pred stuff on yolo

# Model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Images
imgs = [
    "https://ultralytics.com/images/zidane.jpg",
    
    "https://lumiere-a.akamaihd.net/v1/images/open-uri20150608-27674-iuiafs_2fd2629d.jpeg",

    "https://wallpapercave.com/wp/s1o8rpn.jpg",

    "https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fi.pinimg.com%2Foriginals%2F36%2Fcd%2Feb%2F36cdebcd4fdd7eef3c9d0723cb0a886e.jpg",

    "https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.cctvcamerapros.com%2Fv%2Fimages%2FHD-Security-Cameras%2FHD-TVI-BL2%2Finfrared-HD-TVI-camera-1080p-surveillance.jpg",
    ]

# Inference
results = model(imgs)

# Results
results.print()
results.save()  # or .show()
# results.show()

Downloading: "https://github.com/ultralytics/yolov5/zipball/master" to /Users/dogukanince/.cache/torch/hub/master.zip



YOLOv5 🚀 2024-11-3 Python-3.10.13 torch-2.4.1 CPU

Downloading https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5s.pt to yolov5s.pt...
100%|██████████| 14.1M/14.1M [00:01<00:00, 10.5MB/s]

Fusing layers... 
YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients, 16.4 GFLOPs
Adding AutoShape... 
image 1/5: 720x1280 2 persons, 2 ties
image 2/5: 880x1580 2 cars, 1 boat
image 3/5: 1080x1920 1 person
image 4/5: 3116x4816 1 sports ball, 1 fork, 1 knife, 1 apple, 1 scissors
image 5/5: 1080x1920 1 person, 1 chair, 2 tvs, 1 refrigerator
Speed: 1391.5ms pre-process, 61.9ms inference, 2.2ms NMS per image at shape (5, 3, 416, 640)
Saved 5 images to runs/detect/exp
