# Object Detection with RetinaNet

**Author:** [Srihari Humbarwadi](https://twitter.com/srihari_rh)<br>
**Date created:** 2020/05/17<br>
**Last modified:** 2023/07/10<br>
**Description:** Implementing RetinaNet: Focal Loss for Dense Object Detection.


## Introduction

Object detection is a very important problem in computer
vision. The model is tasked with localizing the objects present in an
image and, at the same time, classifying them into different categories.
Object detection models can be broadly classified into "single-stage" and
"two-stage" detectors. Two-stage detectors are often more accurate, but at the
cost of being slower. In this example, we will implement RetinaNet,
a popular single-stage detector that is both accurate and fast.
RetinaNet uses a feature pyramid network to efficiently detect objects at
multiple scales and introduces a new loss, the focal loss function, to alleviate
the problem of extreme foreground-background class imbalance.

**References:**

- [RetinaNet Paper](https://arxiv.org/abs/1708.02002)
- [Feature Pyramid Network Paper](https://arxiv.org/abs/1612.03144)


In [2]:
import os
import re
import zipfile

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import numpy as np
import tensorflow as tf
from tensorflow import keras

import matplotlib.pyplot as plt
import tensorflow_datasets as tfds


In [None]:
physical_gpus = tf.config.list_physical_devices("GPU")
# Cap GPU memory at ~30 GB (memory_limit is in MB); skip when no GPU is visible
if physical_gpus:
    tf.config.set_logical_device_configuration(
        physical_gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=30000)]
    )

## Downloading the COCO2017 dataset


Training on the entire COCO2017 dataset, which has around 118k images, takes a lot of time, so we will use a smaller subset of ~500 images for training in this example.


In [3]:
url = "https://github.com/srihari-humbarwadi/datasets/releases/download/v0.1.0/data.zip"
filename = os.path.join(os.getcwd(), "data.zip")
keras.utils.get_file(filename, url)


with zipfile.ZipFile("data.zip", "r") as z_fp:
    z_fp.extractall("./")

Downloading data from https://github.com/srihari-humbarwadi/datasets/releases/download/v0.1.0/data.zip


## Implementing utility functions

Bounding boxes can be represented in multiple ways. The most common formats are:

- Storing the coordinates of the corners `[xmin, ymin, xmax, ymax]`
- Storing the coordinates of the center and the box dimensions
  `[x, y, width, height]`

Since we require both formats, we will be implementing functions for converting
between the formats.
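
As a quick numeric check of the two formats, the sketch below converts a single box from corner format to center format and back. It uses NumPy instead of TensorFlow to stay lightweight, and the box coordinates are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical example box in corner format: [xmin, ymin, xmax, ymax]
corners = np.array([[10.0, 20.0, 50.0, 80.0]])

# Corner -> center format: the center is the midpoint of the two corners,
# and the size is the difference between them.
xywh = np.concatenate(
    [(corners[..., :2] + corners[..., 2:]) / 2, corners[..., 2:] - corners[..., :2]],
    axis=-1,
)
print(xywh)  # [[30. 50. 40. 60.]]

# Center -> corner format: move half the size away from the center in each direction.
back = np.concatenate(
    [xywh[..., :2] - xywh[..., 2:] / 2, xywh[..., :2] + xywh[..., 2:] / 2],
    axis=-1,
)
print(np.array_equal(back, corners))  # True
```

The round trip recovers the original corners, which is a useful sanity check when wiring the two conversions together.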


In [4]:
def swap_xy(boxes):
    """Swaps the order of x and y coordinates of the boxes.

    Arguments:
        boxes: A tensor with shape `(num_boxes, 4)` representing bounding boxes.

    Returns:
        swapped boxes with shape same as that of boxes
    """
    return tf.stack([boxes[:, 1], boxes[:, 0], boxes[:, 3], boxes[:, 2]], axis=-1)


def convert_to_xywh(boxes):
    """Changes the box format to center, width and height

    Arguments:
        boxes: A tensor of rank 2 or higher with a shape of `(..., num_boxes, 4)`
        representing bounding boxes where each box is of the format
        `[xmin, ymin, xmax, ymax]`

    Returns:
        converted boxes with shape same as that of boxes
    """
    return tf.concat(
        [(boxes[..., :2] + boxes[..., 2:]) / 2, boxes[..., 2:] - boxes[..., :2]],
        axis=-1,
    )


def convert_to_corners(boxes):
    """Changes the box format to corner coordinates

    Arguments:
        boxes: A tensor of rank 2 or higher with a shape of `(..., num_boxes, 4)`
        representing bounding boxes where each box is of the format
        `[x, y, width, height]`

    Returns:
        converted boxes with shape same as that of boxes
    """
    return tf.concat(
        [boxes[..., :2] - boxes[..., 2:] / 2, boxes[..., :2] + boxes[..., 2:] / 2],
        axis=-1,
    )

## Computing pairwise Intersection Over Union (IOU)

As we will see later in the example, we will assign ground truth boxes
to anchor boxes based on the extent of their overlap. This requires us to
calculate the Intersection Over Union (IOU) between all pairs of anchor
boxes and ground truth boxes.
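
To make the quantity concrete, here is a minimal pure-Python sketch that computes the IOU of a single pair of boxes given in `[x, y, width, height]` (center) format. The helper name `iou_xywh` and the box values are hypothetical and used only for illustration; the batched TensorFlow version used in this example follows.

```python
def iou_xywh(box_a, box_b):
    """IOU of two boxes in [x_center, y_center, width, height] format."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Corners of the intersection rectangle
    left = max(ax - aw / 2, bx - bw / 2)
    top = max(ay - ah / 2, by - bh / 2)
    right = min(ax + aw / 2, bx + bw / 2)
    bottom = min(ay + ah / 2, by + bh / 2)

    # Clamp at zero so that disjoint boxes get an empty intersection
    intersection = max(0.0, right - left) * max(0.0, bottom - top)
    union = aw * ah + bw * bh - intersection
    return intersection / union


# Two 4x4 boxes overlapping in a 2x2 region: IOU = 4 / (16 + 16 - 4) = 1/7
overlapping = iou_xywh((2.0, 2.0, 4.0, 4.0), (4.0, 4.0, 4.0, 4.0))
print(round(overlapping, 4))  # 0.1429

# Boxes that do not touch have an IOU of zero
disjoint = iou_xywh((2.0, 2.0, 4.0, 4.0), (10.0, 10.0, 2.0, 2.0))
print(disjoint)  # 0.0
```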


In [45]:
def compute_iou(boxes1, boxes2):
    """Computes pairwise IOU matrix for given two sets of boxes

    Arguments:
        boxes1: A tensor with shape `(N, 4)` representing bounding boxes where
        each box is of the format `[x, y, width, height]`.
        boxes2: A tensor with shape `(M, 4)` representing bounding boxes where
        each box is of the format `[x, y, width, height]`.

    Returns:
        pairwise IOU matrix with shape `(N, M)`, where the value at i-th row,
        j-th column holds the IOU between i-th box and j-th box from boxes1 and
        boxes2 respectively.
    """
    boxes1_corners = convert_to_corners(boxes1)
    boxes2_corners = convert_to_corners(boxes2)

    max_upper_left = tf.maximum(  # shape (N, M, 2)
        boxes1_corners[:, None, :2],
        boxes2_corners[:, :2],
    )
    min_lower_right = tf.minimum(  # shape (N, M, 2)
        boxes1_corners[:, None, 2:],
        boxes2_corners[:, 2:],
    )

    intersection = tf.maximum(0.0, min_lower_right - max_upper_left)  # shape (N, M, 2)
    intersection_area = tf.reduce_prod(intersection, axis=-1)  # shape (N, M)

    boxes1_area = tf.reduce_prod(boxes1[:, 2:], axis=-1)  # shape (N, )
    boxes2_area = tf.reduce_prod(boxes2[:, 2:], axis=-1)  # shape (M, )

    # Guard against division by zero for degenerate (zero-area) boxes
    union_area = tf.maximum(
        boxes1_area[:, None] + boxes2_area - intersection_area, 1e-8
    )
    return intersection_area / union_area


def visualize_detections(
    image, boxes, classes, scores, figsize=(7, 7), linewidth=7, color=(0, 0, 1)
):
    """Visualize Detections"""
    image = np.array(image, dtype=np.uint8)
    plt.figure(figsize=figsize)
    plt.axis("off")
    plt.imshow(image)
    ax = plt.gca()

    for box, _cls, score in zip(boxes, classes, scores):
        text = f"{_cls}: {score:.2f}"
        x1, y1, x2, y2 = box
        w, h = x2 - x1, y2 - y1
        patch = plt.Rectangle(
            [x1, y1], w, h, fill=False, edgecolor=color, linewidth=linewidth
        )
        ax.add_patch(patch)
        ax.text(
            x1,
            y1,
            text,
            bbox={"facecolor": color, "alpha": 0.4},
            clip_box=ax.clipbox,
            clip_on=True,
        )
    plt.show()
    return ax

## Implementing Anchor generator

Anchor boxes are fixed-size boxes that the model uses to predict the bounding box for an object. It does this by regressing the offset between the object's center and the center of an anchor box, and by using the anchor box's width and height to predict the relative scale of the object. In RetinaNet, each location on a given feature map has nine anchor boxes (three scales times three aspect ratios).
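
The nine shapes at one location can be enumerated directly. The following pure-Python sketch does so for a single pyramid level, using the aspect ratios, scales, and smallest base area (`32**2`) that the anchor generator in this example is configured with. It is only an illustration of the arithmetic, not the generator itself.

```python
import math

aspect_ratios = [0.5, 1.0, 2.0]  # aspect ratio = width / height
scales = [2 ** x for x in [0, 1 / 3, 2 / 3]]
area = 32 ** 2  # base anchor area for one pyramid level

anchor_dims = []
for ratio in aspect_ratios:
    # width = ratio * height and width * height = area
    # => height = sqrt(area / ratio)
    height = math.sqrt(area / ratio)
    width = area / height
    for scale in scales:
        # The scale multiplies both sides, so the area grows by scale ** 2
        anchor_dims.append((scale * width, scale * height))

for width, height in anchor_dims:
    print(f"{width:7.2f} x {height:7.2f}")
```

Each aspect ratio contributes three anchors that share the same shape and differ only in size, which gives nine `(width, height)` pairs in total.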


In [83]:
class AnchorBox:
    """Generates anchor boxes

    This class has operations to generate anchor boxes for feature maps at
    strides `[8, 16, 32, 64, 128]`. Each anchor box is of the format
    `[x, y, width, height]`

    Attributes:
        `aspect_ratios`: A list of float values representing the aspect ratios of
        the anchor boxes at each location on the feature map.
        `scales`: A list of float values representing the scale of the anchor boxes
        at each location on the feature map.
        `num_anchors`: The number of anchor boxes at each location on the feature
        map.
        `areas`: A list of float values representing the areas of the anchor boxes
        for each feature map in the feature pyramid.
        `strides`: A list of float values representing the strides for each
        feature map in the feature pyramid.
    """

    def __init__(self):
        self.aspect_ratios = [0.5, 1.0, 2.0]  # aspect ratio = width / height
        self.scales = [2**x for x in [0, 1 / 3, 2 / 3]]

        self.num_anchors = len(self.aspect_ratios) * len(self.scales)
        self.strides = [2**i for i in range(3, 8)]
        self.areas = [x**2 for x in [32, 64, 128, 256, 512]]
        self.anchor_dims = self.compute_dims()

    def compute_dims(self):
        """Computes anchor box dimensions for all ratios and scales at all levels
        of the feature pyramid."""
        # This list will have 5 tensors of shape (1, 1, 9, 2), one per pyramid level
        anchor_dims_all = []
        for area in self.areas:
            anchor_dims = []

            for ratio in self.aspect_ratios:
                anchor_height = tf.math.sqrt(area / ratio)  # h.w = S <=> h.h.r = S
                anchor_width = area / anchor_height

                dims = tf.reshape(
                    tf.stack([anchor_width, anchor_height], 0),
                    shape=[1, 1, 2],
                )

                # Each pass of the loop below appends 3 tensors of shape (1, 1, 2),
                # one per scale
                for scale in self.scales:
                    anchor_dims.append(scale * dims)
            # At this point, anchor_dims holds 9 tensors of shape (1, 1, 2)

            # Stack them into a single tensor of shape (1, 1, 9, 2)
            anchor_dims_all.append(tf.stack(anchor_dims, axis=-2))

        return anchor_dims_all

    def _get_anchors(self, feature_height, feature_width, level):
        """Generates anchor boxes for a given feature map size and level

        Arguments:
            feature_height: An integer representing the height of the feature map
            feature_width: An integer representing the width of the feature map
            level: An integer representing the level of the feature map in the
            feature pyramid

        Returns:
            anchor boxes with the shape of
            `(feature_height * feature_width * num_anchors, 4)`
        """
        rx = tf.range(feature_width, dtype=tf.float32) + 0.5
        ry = tf.range(feature_height, dtype=tf.float32) + 0.5

        centers = tf.stack(tf.meshgrid(rx, ry), axis=-1)  # shape (height, width, 2)
        centers *= self.strides[level - 3]
        centers = tf.expand_dims(centers, axis=-2)  # shape (height, width, 1, 2)
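        centers = tf.tile(centers, [1, 1, self.num_anchors, 1])  # shape (height, width, 9, 2)

        # TODO: tile self.anchor_dims[level - 3] over the grid and concatenate it with
        # `centers` to build the (height * width * num_anchors, 4) anchors described
        # in the docstring. For now, only the tiled centers are returned.
        return centers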

x = AnchorBox()
x._get_anchors(3, 3, 3).shape

TensorShape([3, 3, 9, 2])

In [75]:
rx = tf.range(3, dtype=tf.float32) + 0.5
ry = tf.range(4, dtype=tf.float32) + 0.5

centers = tf.stack(tf.meshgrid(rx, ry), axis=-1)
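
The remaining step in `_get_anchors` is to combine the tiled centers with the per-level anchor dimensions. The sketch below shows one possible way to do that, written in NumPy rather than TensorFlow so the shapes are easy to inspect. It assumes the `(1, 1, 9, 2)` layout produced by `compute_dims`, and the `dims` values are placeholders rather than real anchor sizes.

```python
import numpy as np

feature_height, feature_width, stride = 4, 3, 8
num_anchors = 9

rx = np.arange(feature_width, dtype=np.float32) + 0.5
ry = np.arange(feature_height, dtype=np.float32) + 0.5

# With the default "xy" indexing, both meshgrid outputs have shape
# (len(ry), len(rx)), so the stacked grid is (height, width, 2) with the
# last axis ordered as (x, y).
centers = np.stack(np.meshgrid(rx, ry), axis=-1) * stride  # (4, 3, 2)
centers = np.tile(centers[:, :, None, :], [1, 1, num_anchors, 1])  # (4, 3, 9, 2)

# Placeholder anchor dimensions with the (1, 1, 9, 2) layout assumed above;
# the values are dummies used only to demonstrate the shapes.
dims = np.ones((1, 1, num_anchors, 2), dtype=np.float32)
dims = np.tile(dims, [feature_height, feature_width, 1, 1])  # (4, 3, 9, 2)

# Concatenate to [x, y, width, height] and flatten to one anchor per row
anchors = np.concatenate([centers, dims], axis=-1).reshape(-1, 4)
print(anchors.shape)  # (108, 4)
```

Moving one column to the right changes only the x coordinate of the center, and moving one row down changes only y, which confirms that the grid is laid out as `(height, width)`.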