<h3> Introduction </h3>

In this post we will understand and implement the components of the modern object detection model Faster-RCNN. Object detectors involve many parts, and it can be difficult to follow the code in the open-source implementations available. Here we give a clear layout of such a model. The post is divided into two parts. In this first part, we construct the Faster-RCNN network and several of its components. This allows us to perform inference using the pretrained weights available from <a href="https://github.com/facebookresearch/Detectron/blob/master/MODEL_ZOO.md">Facebook's Detectron</a>, which helps us understand how that object-detection framework works. In the second part, we will train the network ourselves and walk through the steps involved in that process.

Here we will take certain components for granted - NMS, bounding box transformations, ROI-Align, and the proposal layer. For an explanation of how these work please consult the papers and the full code repo. The focus here is more on how the pieces fit together.

<h3> The Two Stages </h3>

The Faster-RCNN family of detectors works in two stages. The first stage, the Region Proposal Network (RPN), outputs box regions and their associated 'objectness' score (i.e., object vs no object). These proposals are filtered, then used to crop features from the top-level of the backbone feature extractor (e.g., Resnet-50). This process of feature cropping was done by ROI-pooling in the original Faster-RCNN, or more recently, using ROI-Align in Mask-RCNN. The second stage, the Faster-RCNN network, takes these cropped features and refines the initial proposals of RPN, along with predicting the probability of the object class. The figure below from the Faster-RCNN paper illustrates this two-stage process:

![image.png](attachment:image.png)

In the Faster-RCNN version with the Resnet feature extractor (or backbone), RPN operates on the final convolutional layer of the C4 block [3]. ROI-align is performed on the C4 features and those pooled features are first fed through the C5 block and the average pool of Resnet. These last two operations replace the fully-connected layers of VGG in the original application of Resnet to object-detection [3]. After that, the Faster-RCNN network predicts bounding-boxes and classes.

The code below shows the structure of this computation:

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class FasterRCNN(nn.Module):
    def __init__(self, rpn, backbone, roi_align, n_classes):
        super(FasterRCNN, self).__init__()
        self.backbone = backbone  # eg. Resnet-50
        self.rpn = rpn  # Region Proposal Network
        self.roi_align = roi_align  # ROI-Align layer
        self.bbox_pred = nn.Linear(2048, 4 * n_classes)  # bounding-box head
        self.cls_score = nn.Linear(2048, n_classes)  # class logits head

    def forward(self, images, h, w, im_scale):
        features = self.backbone(images)  # compute block C4 features for image
        proposals, scores = self.rpn(features, h, w, im_scale)  # apply RPN to get proposals
        pooled_feat = self.roi_align(features, proposals)  # apply ROI align on the C4 features using the proposals
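        pooled_feat = self.backbone.top(pooled_feat)  # apply the C5 block and Average pool of Resnet
        pooled_feat = pooled_feat.view(pooled_feat.size(0), -1)  # flatten to (n_rois, 2048); safe even with a single ROI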
        bbox_pred = self.bbox_pred(pooled_feat)  # apply bounding-box head
        cls_score = self.cls_score(pooled_feat)  # apply class-score head
        cls_prob = F.softmax(cls_score, dim=1)  # softmax to get object-class probabilities

        return {'bbox_pred': bbox_pred, 'cls_prob': cls_prob, 'rois': proposals}

Note that the C5 block and the Average Pool of Resnet produce a 2048-dimensional vector per ROI. The bounding-box and class-score heads are fully-connected layers that operate on those ROI vectors. The output dimension of the class-score head is the number of object classes. The output dimension of the bounding-box head is 4 times the number of classes, because a bounding box is predicted separately for each object class. Later we take the box corresponding to the class with the highest score.
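
To make the per-class box layout concrete, here is a minimal sketch in plain Python of picking the box deltas that belong to the highest-scoring class of a single ROI. It assumes the `4 * n_classes` outputs are laid out class by class (four deltas for class 0, then four for class 1, and so on); `select_top_class_box` is a hypothetical helper for illustration, not a function from the repo.

```python
def select_top_class_box(cls_prob, bbox_pred):
    """Return (class index, 4 box deltas) for one ROI.

    cls_prob:  list of n_classes probabilities
    bbox_pred: flat list of 4 * n_classes deltas, assumed to be laid out
               class by class
    """
    n_classes = len(cls_prob)
    assert len(bbox_pred) == 4 * n_classes
    # Index of the class with the highest probability.
    best = max(range(n_classes), key=lambda c: cls_prob[c])
    # Slice out the four deltas predicted for that class.
    return best, bbox_pred[4 * best:4 * best + 4]


# Toy ROI with 3 classes: the deltas of class c are filled with the value c.
cls_prob = [0.1, 0.7, 0.2]
bbox_pred = [0.0] * 4 + [1.0] * 4 + [2.0] * 4
best_class, deltas = select_top_class_box(cls_prob, bbox_pred)
```

In the real network the same selection would be done on tensors for all ROIs at once; the list version only shows the indexing.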

Now that we see the overall structure of the networks, let's see how each part works in more detail.

<h3> The Region Proposal Network </h3>

RPN works on the top convolutional layer of the C4 block. First it applies a convolution with a 3x3 kernel to get another feature map with 1024 output channels. Then two convolutional layers with 1x1 kernels are applied to get proposals and scores <b>per cell</b> in the feature map, as shown in the figure below from the paper [1]:

![image.png](attachment:image.png)

The output channels for the scores convolutional head give the 'objectness' score for each anchor at that cell. The output channels for the proposal convolutional head give the bounding boxes for each anchor at that cell.

In [None]:
class RPN(nn.Module):
    def __init__(self, in_channels, out_channels, n_anchors, proposal_layer):
        super(RPN, self).__init__()

        self.n_anchors = n_anchors

        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.bbox_pred = nn.Conv2d(out_channels, self.n_anchors * 4, 1)
        self.cls_score = nn.Conv2d(out_channels, self.n_anchors, 1)
        self.proposal_layer = proposal_layer

    def forward(self, feature_map, h, w, im_scale):
        x = self.conv(feature_map)
        x = F.relu(x, inplace=True)
        cls_score = self.cls_score(x)
        cls_prob = torch.sigmoid(cls_score)  # F.sigmoid is deprecated in recent PyTorch
        bbox_pred = self.bbox_pred(x)
        proposals, scores = self.proposal_layer(cls_prob, bbox_pred,
                                                h, w, im_scale)
        return proposals, scores
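
As a rough sanity check on the sizes involved, the sketch below computes the output shapes of the two 1x1 heads for a given image size. It assumes the C4 feature map has stride 16 and that its spatial size is the image size divided by the stride, rounded up (the exact size depends on the backbone's padding); `rpn_output_shapes` is a hypothetical helper, and the default of 15 anchors corresponds to 5 anchor sizes times 3 aspect ratios.

```python
import math


def rpn_output_shapes(im_h, im_w, feat_stride=16, n_anchors=15):
    """Shapes (channels, height, width) of the RPN heads for one image."""
    # Assumed feature-map size: image size divided by the stride, rounded up.
    feat_h = math.ceil(im_h / feat_stride)
    feat_w = math.ceil(im_w / feat_stride)
    return {
        'cls_score': (n_anchors, feat_h, feat_w),      # one objectness score per anchor
        'bbox_pred': (4 * n_anchors, feat_h, feat_w),  # four box deltas per anchor
        'n_proposals': n_anchors * feat_h * feat_w,    # raw proposals before filtering
    }


shapes = rpn_output_shapes(800, 1216)
```

For an 800 x 1216 input this gives a 50 x 76 grid and 57,000 raw proposals, which is why the filtering described next is needed.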

Note that at the end of the forward method we call a proposal layer. This layer performs additional computations and filtering on the raw RPN outputs before they are passed on. This is because the outputs of the RPN must be applied to the anchors at every cell to get the actual proposals, and the proposals must then be filtered so that they are valid boxes and their huge number is reduced. It performs the following steps:

<ol> <li> for each location i on the feature map grid: 
    <ul> <li> generate the anchor boxes centered on cell i </li>
        <li> apply predicted bbox deltas to each of the A anchors at cell i </li>
    </ul> </li>
    <li> clip predicted boxes to image </li>
    <li> remove predicted boxes that are smaller than a threshold </li>
    <li> sort all proposals by score from highest to lowest </li> 
    <li> take the top proposals before NMS </li>
    <li> apply NMS with a loose threshold (0.7) to the remaining proposals </li>
    <li> take top proposals after NMS </li>
    <li> return the top proposals </li>
</ol>
    
The code for this is straightforward but a bit involved, so we leave it out of this post.
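
Still, a stripped-down sketch can help fix the ideas. The plain-Python version below follows the steps listed above on a handful of boxes. It is an illustration under stated assumptions, not the proposal layer from the repo: all function names are hypothetical, an aspect ratio is taken to mean height divided by width, boxes are `(x1, y1, x2, y2)` tuples, and the NMS is the simple greedy variant.

```python
import math


def make_anchors(cx, cy, sizes, ratios):
    """Anchors centered at (cx, cy); ratio is assumed to be height / width."""
    anchors = []
    for s in sizes:
        for r in ratios:
            w, h = s / math.sqrt(r), s * math.sqrt(r)  # keeps area at s * s
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def apply_deltas(box, delta):
    """Shift the box center by (dx, dy) and scale its size by exp(dw), exp(dh)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    dx, dy, dw, dh = delta
    ncx, ncy = cx + dx * w, cy + dy * h
    nw, nh = w * math.exp(dw), h * math.exp(dh)
    return (ncx - nw / 2, ncy - nh / 2, ncx + nw / 2, ncy + nh / 2)


def clip(box, im_h, im_w):
    """Clip box coordinates to the image boundaries."""
    x1, y1, x2, y2 = box
    return (min(max(x1, 0), im_w), min(max(y1, 0), im_h),
            min(max(x2, 0), im_w), min(max(y2, 0), im_h))


def iou(a, b):
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, thresh):
    """Greedy NMS over boxes sorted by descending score; returns kept indices."""
    keep = []
    for i, box in enumerate(boxes):
        if all(iou(box, boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep


def propose(scores, deltas, anchors, im_h, im_w,
            pre_nms_top_n, post_nms_top_n, nms_thresh=0.7, min_size=0):
    """Turn raw RPN outputs into a filtered list of (score, box) proposals."""
    # Apply the predicted deltas to the anchors, then clip to the image.
    boxes = [clip(apply_deltas(a, d), im_h, im_w) for a, d in zip(anchors, deltas)]
    # Drop boxes whose width or height is below min_size.
    cand = [(s, b) for s, b in zip(scores, boxes)
            if b[2] - b[0] >= min_size and b[3] - b[1] >= min_size]
    # Sort by score and keep the top candidates before NMS.
    cand.sort(key=lambda sb: sb[0], reverse=True)
    cand = cand[:pre_nms_top_n]
    # Suppress overlapping boxes and keep the top survivors.
    keep = nms([b for _, b in cand], nms_thresh)[:post_nms_top_n]
    return [cand[i] for i in keep]


# Toy example: one 64x64 anchor at three cell centers, with zero deltas.
anchors = (make_anchors(100, 100, [64], [1]) + make_anchors(108, 100, [64], [1])
           + make_anchors(300, 300, [64], [1]))
proposals = propose([0.9, 0.8, 0.6], [(0, 0, 0, 0)] * 3, anchors,
                    im_h=400, im_w=400, pre_nms_top_n=6000, post_nms_top_n=1000)
```

In the toy example the second box overlaps the first with an IoU of about 0.78, above the 0.7 threshold, so NMS drops it and two proposals remain. A practical implementation would vectorize all of this over the full anchor grid.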

<h3> A Complete Example </h3>

In [None]:
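import cv2
import numpy as np
import torch
from torch.autograd import Variable

# BackBone, ProposalLayer, RoIAlign, prep_im_for_blob, im_list_to_blob,
# postprocess_output and vis come from the full code repo.

############# define some configurations: ################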

# Number of top scoring boxes to keep before applying NMS to RPN proposals
RPN_PRE_NMS_TOP_N = 6000
# Number of top scoring boxes to keep after applying NMS to RPN proposals
RPN_POST_NMS_TOP_N = 1000
# NMS threshold used on RPN proposals
RPN_NMS_THRESH = 0.7
# Proposal height and width both need to be greater than RPN_MIN_SIZE (at orig image scale)
RPN_MIN_SIZE = 0
# Size of the pooled region after RoI pooling
POOLING_SIZE = 14

TEST_NMS = 0.5
TEST_MAX_SIZE = 1333
PIXEL_MEANS = np.array([122.7717, 115.9465, 102.9801])

anchor_sizes, anchor_ratios = [32, 64, 128, 256, 512], [0.5, 1, 2]
feat_stride = 16

############ prepare an image #################

x = cv2.imread('samples/15673749081_767a7fa63a_k.jpg')[:, :, ::-1]

blobs, im_scales = prep_im_for_blob(x, PIXEL_MEANS, target_sizes=(800,), max_size=TEST_MAX_SIZE)
blobs = im_list_to_blob(blobs)
img = Variable(torch.from_numpy(blobs))

im_info = torch.from_numpy(np.array([[blobs.shape[2], blobs.shape[3], im_scales[0]]], dtype=np.float32))
im_size, im_scale = [blobs.shape[2], blobs.shape[3]], im_scales[0]

########## make the modules ####################

backbone = BackBone()
proposal_layer = ProposalLayer(feat_stride, anchor_sizes, anchor_ratios,
                               RPN_PRE_NMS_TOP_N, RPN_POST_NMS_TOP_N, RPN_NMS_THRESH, RPN_MIN_SIZE)
roi_align = RoIAlign(POOLING_SIZE, POOLING_SIZE, spatial_scale=1./16.)

rpn = RPN(1024, 1024, 15, proposal_layer)
frcnn = FasterRCNN(rpn, backbone, roi_align, 81)
frcnn.load_pretrained_weights('model_final.pkl', 'resnet50_mapping.npy')

############ feedforward pass with postprocessing ################

frcnn = frcnn.cuda()
frcnn.eval()
img = img.cuda()

output = frcnn(img, im_size[0], im_size[1], im_scale)
class_scores, bbox_deltas, rois = output['cls_prob'], output['bbox_pred'], output['rois']

scores_final, boxes_final, boxes_per_class = postprocess_output(rois, im_scale, im_size, class_scores, bbox_deltas,
                                                                bbox_reg_weights=(10.0, 10.0, 5.0, 5.0))

########## visualize ###############

vis.vis_one_image(
    x,  # image in RGB order (already converted from BGR when loaded above)
    'output',
    'samples/',
    boxes_per_class,
    dataset=None,
    box_alpha=0.3,
    show_class=True,
    thresh=0.7,
    ext='jpg'
)
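
The example passes `target_sizes=(800,)` and `max_size=TEST_MAX_SIZE` to `prep_im_for_blob`, which returns the scale factor `im_scale` used throughout. Assuming the helper follows the usual Detectron convention (resize the shorter side to the target size unless that would push the longer side past the maximum), the scale can be sketched as follows. `compute_im_scale` is a hypothetical name for illustration, not a function from the repo.

```python
def compute_im_scale(im_h, im_w, target_size=800, max_size=1333):
    """Scale factor that resizes the short side to target_size,
    capped so that the long side does not exceed max_size."""
    short_side, long_side = min(im_h, im_w), max(im_h, im_w)
    scale = target_size / short_side
    # If the long side would become too large, scale by the long side instead.
    if round(scale * long_side) > max_size:
        scale = max_size / long_side
    return scale


scale_a = compute_im_scale(480, 640)   # short side reaches 800, long side stays below 1333
scale_b = compute_im_scale(500, 2000)  # long side is capped at 1333
```

Proposals are computed on the resized image, so a factor like this is what allows the post-processing step to map boxes back to the original image coordinates.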

<h3> References </h3>

<ol> 
     <li> Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). </li>
     <li> He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (pp. 2980-2988). </li>
     <li> He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778). </li>
</ol>