# SOTA model for object detection - RetinaNet

![RetinaNet results](https://www.researchgate.net/profile/Xuxi-Yang/publication/342017040/figure/fig4/AS:901688093310976@1591990602471/sualized-detection-results-of-RetinaNet-trained-model.ppm)
<center>Image taken from <a href="https://www.researchgate.net/figure/sualized-detection-results-of-RetinaNet-trained-model_fig4_342017040">here</a></center>
<br><br>

This project aims to walk through the main parts of the **RetinaNet** - one of the best architectures for the object detection task. 

The full implementation of this model is long, and covering everything that goes into it could take several weeks. RetinaNet builds on earlier single-stage detectors such as SSD (Single-Shot Detector), so in this project we will cover and implement the main parts that made RetinaNet better than the other models in this category. 

#### We recommend going over the original implementation in Keras on your own time if you are interested in applying this model to your project. 


### Example of RetinaNet for self-driving car

RetinaNet is fast, and because of that, many self-driving car projects use it to describe the car's surroundings. Here is a video about it:

In [1]:
from IPython.display import IFrame

In [2]:
IFrame("https://www.youtube.com/embed/KYueHEMGRos", 1000, 500)

**In some cases, IPython widgets do not work!**

If this is the case, here is the link to the YouTube video from the cell above: https://www.youtube.com/embed/KYueHEMGRos


<br>

![RetinaNet](https://www.researchgate.net/publication/327737749/figure/fig1/AS:672393336987655@1537322472864/The-network-architecture-of-RetinaNet-RetinaNet-uses-the-Feature-Pyramid-Network-FPN.png)

<center>Image taken from <a href="https://www.researchgate.net/figure/The-network-architecture-of-RetinaNet-RetinaNet-uses-the-Feature-Pyramid-Network-FPN_fig1_327737749">here</a></center>
<br><br>

### Paper overview of RetinaNet 

In [3]:
IFrame("https://www.youtube.com/embed/infFuZ0BwFQ", 1000, 500)

**In some cases, IPython widgets do not work!**

If this is the case, here is the link to the YouTube video from the cell above: https://www.youtube.com/embed/infFuZ0BwFQ


RetinaNet has 4 main parts, and we will cover all of them in this project.


### Steps:
1. ResNet Backbone
2. Feature Pyramid Network (FPN)
3. Classification and Regression heads
4. Focal loss

### Topics covered and learning objectives
- ResNet
- Pyramid Networks
- Multi-head networks
- RetinaNet
- Focal Loss

### Time estimates:
- Reading/Watching materials: 2h 20min
- Exercises: 1h
<br><br>
- **Total**: ~3.5h



RetinaNet intuition: https://medium.com/@14prakash/the-intuition-behind-retinanet-eb636755607d

### Imports for the project

In [4]:
import numpy as np
import tensorflow as tf

from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Conv2D, Layer, UpSampling2D, ReLU, Input

from tests import TEST_BACKBONE_MODEL, TEST_RESNET, TEST_HEAD_MODEL

### Exercise 1: Create ResNet50 backbone

![](images/resnet.png)

The original version of RetinaNet uses a pre-trained backbone - ResNet50. This network is used to make features for the Feature Pyramid Network (FPN). 

The goal of the FPN is to produce feature maps at several scales, so that anchors of different sizes can be matched to objects of different sizes. The final output of the backbone alone (here ResNet50, although almost any feature extractor works as long as its outputs match what the FPN expects) does not provide enough context about the image. To overcome this, we will output features (layer outputs) from several stages of the ResNet50 model and use them within the FPN to build better image representations.

Your task is to implement the **get_backbone** function that returns a custom Model with multiple outputs. 

Here are the points to follow:

1. Define ResNet50 object with input_shape of 128x128x3 and do not include the *top*
2. We need to extract specific layers from the model. You can do that with model.get_layer("name_of_the_layer"). Since we want to return these as the output, do not forget to add **.output** at the end. 
    
    Following the paper, here are layer names to extract:

    - conv3_block4_out
    - conv4_block6_out
    - conv5_block3_out
    
    
3. Return a Keras Model whose inputs are the inputs of the backbone network and whose outputs are the three layer outputs extracted in step 2
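To check your result, it helps to know what shapes to expect from the three extracted layers. The sketch below does not build the model; it only does the shape arithmetic, assuming the usual ResNet50 layout in which these stages have total strides of 8, 16 and 32 and output 512, 1024 and 2048 channels. The names `BACKBONE_LEVELS` and `backbone_output_shapes` are made up for this illustration, and the integer division assumes the input size is a multiple of 32.

```python
# (layer name, total stride relative to the input, output channels)
# Assumed values for the standard ResNet50 stages used in this exercise.
BACKBONE_LEVELS = [
    ("conv3_block4_out", 8, 512),
    ("conv4_block6_out", 16, 1024),
    ("conv5_block3_out", 32, 2048),
]


def backbone_output_shapes(input_size=128):
    """Return {layer_name: (height, width, channels)} for a square input."""
    shapes = {}
    for name, stride, channels in BACKBONE_LEVELS:
        side = input_size // stride  # each stage shrinks the image by its stride
        shapes[name] = (side, side, channels)
    return shapes


for name, shape in backbone_output_shapes(128).items():
    print(name, shape)
```

If the model returned by your **get_backbone** function reports different output shapes for a 128x128x3 input, the wrong layers were probably selected.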

In [None]:
def get_backbone():
    """
    Create ResNet50 backbone with several outputs used in the Feature Pyramid Network
    """
    raise NotImplementedError

In [None]:
# RUN THIS CELL TO TEST YOUR CODE
TEST_BACKBONE_MODEL(get_backbone)

### Exercise 2: Feature Pyramid Network 

![](https://miro.medium.com/max/500/1*XmNDHT8WWZbXACyBjg3ZeQ.jpeg)

In [5]:
IFrame("https://www.youtube.com/embed/mwMopcSRx1U", 1000, 500)

**In some cases, IPython widgets do not work!**

If this is the case, here is the link to the YouTube video from the cell above: https://www.youtube.com/watch?v=mwMopcSRx1U

Further reading about FPNs:
- First read this: https://towardsdatascience.com/review-fpn-feature-pyramid-network-object-detection-262fc7482610
- Read this for a detailed intro: https://jonathan-hui.medium.com/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c


![](https://miro.medium.com/max/700/1*edviRcl3vwlyx9TS_gRbmg.png)


In the cell below, you'll find the FPN already started, with its layers defined. Your task is to connect them inside the **call** method. The goal of this exercise is to learn how the FPN layers are connected.

Here is how:
- 1st, pass all 3 outputs from the backbone through the 1x1 convolutions
- 2nd, merge p4_1x1 with the upsampled p5_1x1
- 3rd, merge p3_1x1 with the upsampled merged_p4
- 4th, create the p3-p7 outputs by applying the corresponding 3x3 Conv layers to their inputs 
    - for p3 and p4, use the merged versions
    - for p7, use p6_output
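The merge steps above can be pictured on a tiny single-channel example. This is a plain-Python sketch rather than the Keras code you need to write: it assumes that "merge" means element-wise addition (as in the FPN paper) and that upsampling is nearest-neighbour, which is the default interpolation of `UpSampling2D`. The helper names `upsample_2x` and `add` are invented for the illustration.

```python
def upsample_2x(grid):
    """Nearest-neighbour 2x upsampling of a 2D list (one channel)."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                             # repeat each row
    return out


def add(a, b):
    """Element-wise sum of two 2D lists with the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


coarse = [[1, 2],
          [3, 4]]                       # stands in for a 2x2 coarse map (like p5_1x1)
fine = [[10] * 4 for _ in range(4)]     # stands in for a 4x4 finer map (like p4_1x1)

merged = add(fine, upsample_2x(coarse))
for row in merged:
    print(row)
```

Each coarse value is copied into a 2x2 block before the addition, so the finer map receives the coarse map's information at matching positions. The same pattern repeats one level down when merged_p4 is upsampled and combined with p3_1x1.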

In [None]:
class FeaturePyramidNetwork(Layer):
    """
    Builds the Feature Pyramid with the feature maps from the backbone.
    """

    def __init__(self, **kwargs):
        super(FeaturePyramidNetwork, self).__init__(name="FeaturePyramidNetwork", **kwargs)
        
        self.backbone = get_backbone()
        
        # 1x1 convolution layers
        self.conv_c3_1x1 = Conv2D(filters=256, kernel_size=1, strides=1, padding="same")
        self.conv_c4_1x1 = Conv2D(filters=256, kernel_size=1, strides=1, padding="same")
        self.conv_c5_1x1 = Conv2D(filters=256, kernel_size=1, strides=1, padding="same")
        
        # 3x3 convo layers
        self.conv_c3_3x3 = Conv2D(filters=256, kernel_size=3, strides=1, padding="same")
        self.conv_c4_3x3 = Conv2D(filters=256, kernel_size=3, strides=1, padding="same")
        self.conv_c5_3x3 = Conv2D(filters=256, kernel_size=3, strides=1, padding="same")
        self.conv_c6_3x3 = Conv2D(filters=256, kernel_size=3, strides=2, padding="same")
        self.conv_c7_3x3 = Conv2D(filters=256, kernel_size=3, strides=2, padding="same", activation='relu')
        
        self.upsample_2x = UpSampling2D(2)

    def call(self, images, training=False):
        
        c3_output, c4_output, c5_output = self.backbone(images, training=training)
         
        # YOUR CODE HERE
        # Reduce channel depth with 1x1 convolutions
        p3_1x1 = None
        p4_1x1 = None
        p5_1x1 = None
            
        # Merge p4 1x1 reduced with upsampled p5 1x1 
        merged_p4 = None
        
        # Merge p3 1x1 with upsampled merged P4
        merged_p3 = None
        
        # Make p3-p7 outputs with 3x3 convs
        p3_output = None
        p4_output = None
        p5_output = None
        p6_output = None
        p7_output = None
        
        return p3_output, p4_output, p5_output, p6_output, p7_output

### Exercise 3: Network heads

![](images/heads.png)

The last part of the RetinaNet architecture consists of two prediction heads: one for classification and the other for regression (bbox prediction).

![](https://miro.medium.com/max/2400/1*rYvoP6VcmMGVQdIsaSawdQ.png)

Each head receives the same input but has a different task. The classification head predicts class scores for every anchor at every spatial location of the feature map, whereas the regression head predicts four bounding-box values (offsets relative to the anchor) for every anchor. 

To understand how to calculate the number of outputs of these heads, go back to this link: https://medium.com/@14prakash/the-intuition-behind-retinanet-eb636755607d

Your task in this exercise is to build a generic head of the RetinaNet. 

The architecture is straightforward. Start by adding the Input layer to the **head** (Sequential model defined for you). Input shape should be [None, None, 256].

When you are done with that add **four (4)** convolutional layers with the same set of parameters:
- filters=256
- kernel_size=3
- padding="same"
- activation="relu"

To complete the head network, add one more convolutional layer, which will be the output of this model. For this layer, set *filters* to *output_filters*, kernel_size=3, strides=1 and padding='same'. We don't want any activation function here.

Return the model head from this function.
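One way to sanity-check the finished head is to compare its parameter count with a hand calculation. The sketch below only does that arithmetic for the layer stack described above (four 3x3 convolutions with 256 filters plus one output convolution, all with biases). The function names are invented for this illustration, and the example values 36 and 90 assume 9 anchors per location and 10 classes.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a single 2D convolution layer."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def head_params(output_filters, channels=256, kernel_size=3, hidden_layers=4):
    """Parameter count of the head: hidden conv layers plus the output conv."""
    hidden = hidden_layers * conv2d_params(kernel_size, channels, channels)
    final = conv2d_params(kernel_size, channels, output_filters)
    return hidden + final


print("regression head (36 filters):", head_params(36))
print("classification head (90 filters):", head_params(90))
```

Because the input shape is [None, None, 256], the head is fully convolutional: these counts do not depend on the spatial size of the feature map, which is why the same head can be applied to every pyramid level.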

In [None]:
def build_head(output_filters):
    """Builds the class/box predictions head.

    Arguments:
      output_filters: Number of convolution filters in the final layer.

    Returns:
      A keras sequential model representing either the classification 
      or the box regression head depending on `output_filters`.
    """
    head = Sequential()
    
    # YOUR CODE HERE
    
    return head

In [None]:
# RUN THIS CELL TO TEST YOUR CODE
TEST_HEAD_MODEL(build_head)

### Exercise 4: Building the RetinaNet model

You have created all the building blocks for the RetinaNet model and gone through materials about it. It's time to put everything together!

*This exercise has 2 parts*

### Part 1: `__init__` method

You'll find the starting point for the **RetinaNet** class with two methods (`__init__` and *call*) in the cell below.

In this part of the exercise, your task is to complete the *init* method following the next steps:

- Define a variable that holds the number of classes (provided as an argument of the init method)
- Create an object of the *FeaturePyramidNetwork*
- Create the classification head using the function **build_head** with the argument num_anchors * num_classes
- Create the regression head using the function **build_head** with the argument num_anchors * 4 (four box values per anchor)

Here num_anchors is the number of anchors per feature-map location; the RetinaNet paper uses 9.
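To see where the head sizes come from, the sketch below enumerates the anchor shapes per location, assuming the configuration from the RetinaNet paper (three aspect ratios times three scales). The constants and the helper `head_output_filters` are illustrative names, not part of this notebook's code or tests.

```python
from itertools import product

# Assumed anchor configuration, following the RetinaNet paper.
ASPECT_RATIOS = [0.5, 1.0, 2.0]
SCALES = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]

# One anchor per (ratio, scale) pair at every feature-map location.
anchor_shapes = list(product(ASPECT_RATIOS, SCALES))
num_anchors = len(anchor_shapes)


def head_output_filters(num_classes, num_anchors=num_anchors):
    """Number of filters in the last layer of each head."""
    return {
        "classification": num_anchors * num_classes,  # one score per class per anchor
        "regression": num_anchors * 4,                # four box values per anchor
    }


print(num_anchors, head_output_filters(num_classes=10))
```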

### Part 2: Full model in the **call** method

For part two, your task is to complete the **call** function. Here is what you need to do:

NOTE: Two empty lists to store outputs from the class head and regression head are created for you.

1. Generate features using the FPN model defined in the `__init__` method
2. Since the FPN generates features at several scales, create a for-loop that iterates over them
3. Inside the for-loop, call the regression head from the *init* method on the feature
4. Reshape the output to be **batch_size, -1, 4** (this size is used for the loss function)
5. Append the output from step 4 to the list dedicated for regression head outputs
6. Do the same process for the classification head, only this time reshape the output to be **batch_size, -1, self.num_classes**
7. Using the TensorFlow method, concatenate all tensors from **cls_outputs** along axis=1, and assign that to cls_outputs at the end of the call method
8. Using the TensorFlow method, concatenate all tensors from **box_outputs** along axis=1, and assign that to box_outputs at the end of the call method
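The reshape and concatenate steps determine the final output shape, which can be worked out by hand. The sketch below is shape arithmetic only, under these assumptions: a 128x128 input, pyramid levels p3-p7 with strides 8, 16, 32, 64 and 128, and 9 anchors per location. `pyramid_sizes` and `flattened_output_shape` are hypothetical helper names.

```python
import math


def pyramid_sizes(input_size=128, strides=(8, 16, 32, 64, 128)):
    """Side length of each square pyramid level (p3 to p7)."""
    return [math.ceil(input_size / stride) for stride in strides]


def flattened_output_shape(batch_size, num_classes, input_size=128, num_anchors=9):
    """Shape after reshaping every level, concatenating along axis=1,
    and joining box and class outputs along the last axis."""
    total_anchors = sum(num_anchors * side * side for side in pyramid_sizes(input_size))
    return (batch_size, total_anchors, 4 + num_classes)


print(pyramid_sizes(128))
print(flattened_output_shape(batch_size=2, num_classes=10))
```

Reshaping a level of size H x W with `num_anchors * 4` channels to **batch_size, -1, 4** yields `H * W * num_anchors` rows, one per anchor, so concatenating the levels along axis=1 simply stacks all anchors of all levels.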

In [None]:
class RetinaNet(Model):
    """A subclassed Keras model implementing the RetinaNet architecture.

    Attributes:
      num_classes: Number of classes in the dataset.
      backbone: The backbone to build the feature pyramid from.
        Currently supports ResNet50 only.
    """

    def __init__(self, num_classes, backbone=None, **kwargs):
        super(RetinaNet, self).__init__(name="RetinaNet", **kwargs)
        

    def call(self, image, training=False):
        batch_size = tf.shape(image)[0]
        # Empty lists used to store head outputs
        cls_outputs = []
        box_outputs = []
        
        # FPN features
        # YOUR CODE HERE
            

        cls_outputs = None
        box_outputs = None
        
        return tf.concat([box_outputs, cls_outputs], axis=-1)

### Execute these two lines to see if everything compiles for you

In [None]:
model = RetinaNet(num_classes=10)

In [None]:
# RUN THIS CELL TO TEST YOUR CODE
TEST_RESNET(model)

## Focal loss

Before we compile our model, we need to look at focal loss, the last piece that holds everything together.

![](https://lh4.googleusercontent.com/_Zb8VyevBHbPdlPS1Bcph18b0GnRdY__yrSWaxEobHAOSq5izCVXdRS0Eo-26pU5Q8JE2daQAmFlwwUKnRiaf7JJrv7VJOLXbTOF-B6G8yshVWdBwhRXFBuMB5L6eH7KCTjzen-t7e39pxku5A)


In [6]:
IFrame("https://www.youtube.com/embed/Y8_OVwK4ECk", 1000, 500)

**In some cases, IPython widgets do not work!**

If this is the case, here is the link to the YouTube video from the cell above: https://www.youtube.com/embed/Y8_OVwK4ECk

Focal-loss tutorial: https://www.analyticsvidhya.com/blog/2020/08/a-beginners-guide-to-focal-loss-in-object-detection/

Focal-loss implementations: https://www.programmersought.com/article/60001511310/
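The core idea fits in a few lines. The sketch below covers only the per-example formula from the focal loss paper, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the probability the model assigns to the true class. It is not the full detection loss (no anchor matching or box loss), and the function names are chosen for this illustration.

```python
import math


def cross_entropy(p_t):
    """Standard cross-entropy for the probability assigned to the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0, alpha=1.0):
    """Cross-entropy scaled by (1 - p_t)^gamma, which shrinks the loss
    of well-classified (easy) examples."""
    return alpha * (1.0 - p_t) ** gamma * cross_entropy(p_t)


for p_t in (0.9, 0.5, 0.1):
    print(f"p_t={p_t}: CE={cross_entropy(p_t):.4f}  FL={focal_loss(p_t):.4f}")
```

With gamma=2, an easy example with p_t = 0.9 has its loss scaled by 0.01, while a hard example with p_t = 0.1 keeps 81% of its cross-entropy. Setting gamma=0 recovers plain cross-entropy. This matters for detection because the vast majority of anchors are easy background examples.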

A full focal-loss implementation is too long to include here, and it would need many more utility functions. Luckily, a Python library called **focal-loss** provides a Keras-compatible focal loss that we can import and attach to our model.

If you don't have it installed, uncomment the code in the cell below and execute it.

In [None]:
# !pip install focal-loss

Using the *focal-loss* library, we can import a focal loss for binary or categorical classification in only one line of code!

In [None]:
from focal_loss import SparseCategoricalFocalLoss 

In [None]:
model.compile(loss=SparseCategoricalFocalLoss(gamma=2), optimizer="SGD")

## Where to go from here?

- Full keras implementation of RetinaNet: https://keras.io/examples/vision/retinanet/
- Keras RetinaNet library: https://github.com/cvisionai/keras_retinanet
- Pedestrian detection with RetinaNet: https://towardsdatascience.com/pedestrian-detection-in-aerial-images-using-retinanet-9053e8a72c6