# Solution 1: Fast R-CNN

Fast R-CNN is a deep learning method for object detection, proposed by Ross Girshick of Microsoft Research in 2015. He presented the paper, titled "Fast R-CNN"\[1\], at the ICCV 2015 conference. The paper already has more than 3700 citations.

The author published the source code of Fast R-CNN on GitHub, but it is written in Python and C and uses the Caffe framework. Moreover, because the author also co-proposed a stronger method, "Faster R-CNN", in the same year, later contributors usually went directly to Faster R-CNN. As a result, there are many implementations of Faster R-CNN in different languages and frameworks, but very few of Fast R-CNN.

That is why we chose Fast R-CNN from the R-CNN series. We plan to implement Fast R-CNN in PyTorch, relying on the details given in the paper, to solve our object detection task on drone-captured images.

## Step 1: prepare the input data

For every training image, we need to prepare a second input: the bounding box proposals (also called Regions of Interest, RoIs) for that image. We use a classical algorithm called Selective Search\[2\] to generate enough bounding box proposals.

Our setting is: generate 32 bounding box proposals for each image, of which 8 have an intersection over union (IoU) overlap of at least 0.5 with a ground-truth bounding box. The remaining 24 proposals have an IoU with the ground truth in the interval [0.1, 0.5).
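
As a rough illustration of this setting, the sketch below computes the IoU of two boxes in (x, y, w, h) form and splits proposals into object and background candidates. The helper names `iou_xywh` and `split_proposals` are hypothetical and the boxes are made up; the notebook itself relies on `utils.get_IOU`, whose implementation is not shown here.

```python
def iou_xywh(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def split_proposals(proposals, gt_box, low=0.1, high=0.5):
    """Object candidates have IoU >= high, background candidates low <= IoU < high."""
    objects, background = [], []
    for rp in proposals:
        iou = iou_xywh(gt_box, rp)
        if iou >= high:
            objects.append(rp)
        elif iou >= low:
            background.append(rp)
    return objects, background


# Made-up ground truth and proposals, only to exercise the two helpers.
gt = (100, 100, 50, 50)
proposals = [(100, 100, 50, 50), (110, 110, 50, 50), (130, 130, 50, 50), (300, 300, 20, 20)]
objects, background = split_proposals(proposals, gt)
print(len(objects), len(background))  # 1 1
```

Proposals with IoU below 0.1 are dropped entirely, which is why images with small objects can run out of usable proposals.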

This follows the suggestion in the paper. The only difference from the paper in this step is the number of bounding box proposals: the paper uses 64 per image. We made many attempts to tune the parameters of the Selective Search algorithm, but still could not obtain that many proposals, because many of our images contain only a small object, which leaves limited space for proposals.

To guarantee that we have enough qualified images for training, we cut the number of bounding box proposals in half.

In [None]:
import csv
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import selectivesearch
import torch
import numpy as np
import readimg_new
import utils
import os
import random as rm
from datetime import datetime

segment = "segment-random"
train_rp_num = 32   # bounding box proposals for images in training set
test_rp_num = 1000  # bounding box proposals for images in testing set
first_threshold = 0.1
second_threshold = 0.5

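rm.seed(datetime.now().timestamp())  # random.seed() expects an int, float, str or bytes, not a datetime

def get_relaxed_candidates (rps, ious, label):
    # fallback when an image lacks enough proposals: keep the train_rp_num proposals with the highest IoU
    sorted_rps = [list(z[1]) + [label.item()] if z[0] > second_threshold else list(z[1]) + [0] for z in sorted(zip(ious, rps), reverse=True)]
    return sorted_rps[0: train_rp_num]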

def get_region_proposal (img):
    # perform selective search at several scales
    img_lbl, regions1 = selectivesearch.selective_search(img, scale=50, sigma=0.8, min_size=5)
    img_lbl, regions2 = selectivesearch.selective_search(img, scale=200, sigma=0.8, min_size=5)
    img_lbl, regions3 = selectivesearch.selective_search(img, scale=300, sigma=0.8, min_size=5)
    img_lbl, regions4 = selectivesearch.selective_search(img, scale=500, sigma=0.8, min_size=5)
    pool = set()
    for r in regions1 + regions2 + regions3 + regions4:
        # excluding same rectangle (with different segments)
        if r['rect'] in pool:
            continue
        if r['size'] < 100:
            continue
        if r['rect'][2] > 640 * 3 / 4 or r['rect'][3] > 360 * 3 / 4:
            continue
        if r['rect'][2] < 5 or r['rect'][3] < 5:
            continue
        pool.add(r['rect'])
    return pool

flag = "test"

train_data, train_label = readimg_new.read_data([segment + "/" + flag + "_data", segment + "/" + flag + "_label"])
root_path = os.getcwd() + '/'
file_names = [z.split('.')[0] for z in readimg_new.file_name(root_path + segment + "/" + flag + "_data")]
completed = [z.split('.')[0] for z in readimg_new.file_name(root_path + segment + "/region_proposals_train")] + [z.split('.')[0] for z in readimg_new.file_name(root_path + segment + "/region_proposals_test")]

for i in range(len(train_data)):
    if file_names[i] in completed:
        continue
    img = torch.from_numpy(np.transpose(train_data[i].numpy(), (1, 2, 0)))
    pool = list(get_region_proposal(img))
    bbox_gt = train_label[i][1:]
    label = train_label[i][0]
    print(i, ": ", str(len(pool)))
    '''
    # draw rectangles on the original image
    fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(6, 6))
    ax.imshow(img)
    for x, y, w, h in pool:
        ax.add_patch(mpatches.Rectangle(
            (x, y), w, h, fill=False, edgecolor='red', linewidth=1))
    [x, y, w, h] = bbox_gt
    ax.add_patch(mpatches.Rectangle(
        (x, y), w, h, fill=False, edgecolor='green', linewidth=1))
    plt.show()
    '''

    candidates_obj = []
    candidates_backgd = []
    candidates_all = []
    ious_obj = []
    ious_backgd = []
    ious_all = []
    for rp in pool:
        iou = utils.get_IOU(bbox_gt, rp)
        candidates_all.append(rp)
        ious_all.append(iou)
        if iou >= first_threshold:
            if iou > second_threshold:
                candidates_obj.append(rp)
                ious_obj.append(iou)
            else:
                candidates_backgd.append(rp)
                ious_backgd.append(iou)

    if flag == "train":
        if len(ious_obj) < train_rp_num * 0.25 or len(ious_backgd) < train_rp_num * 0.75:
            print(file_names[i], ", obj_num not enough: ", len(ious_obj), ", backgd_num not enough: ", len(ious_backgd))
            candidates = get_relaxed_candidates (candidates_all, ious_all, label)
        else:
            candidates = [list(candidates_obj[l]) + [label.item()] for l in range(int(train_rp_num * 0.25))] + [list(candidates_backgd[l]) + [0] for l in range(int(train_rp_num * 0.75))]

        rm.shuffle(candidates)

        with open(root_path + segment + "/region_proposals_train/" + file_names[i] + ".csv", 'w', newline='') as outfile:
            csvwriter = csv.writer(outfile)
            csvwriter.writerow(["x_lefttop", "y_lefttop", "width", "height", "label"])
            for j in range(train_rp_num):
                csvwriter.writerow(candidates[j])

    elif flag == "test":
        with open(root_path + segment + "/region_proposals_test/" + file_names[i] + ".csv", 'w', newline='') as outfile:
            csvwriter = csv.writer(outfile)
            csvwriter.writerow(["x_lefttop", "y_lefttop", "width", "height", "label", "iou"])
            for j in range(min(test_rp_num, len(pool))):
                if ious_all[j] > 0.5:
                    csvwriter.writerow(list(pool[j]) + [label.item(), ious_all[j].item()])
                else:
                    csvwriter.writerow(list(pool[j]) + [0, ious_all[j].item()])


The output is a .csv file containing the bounding box proposals, one per row, in the form (x, y, w, h, label). (x, y) are the coordinates of the top-left corner, and w and h are the width and height of the box. The label is the object's class number for proposals with IoU above 0.5, and 0 (background) for those with IoU in [0.1, 0.5). For test images, each row also stores the IoU in a sixth column.
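
A minimal sketch of reading such a file back, assuming the five-column training header written above. The function name `read_proposals` and the sample rows are made up for illustration; the notebook's own loader, `readimg_new.read_data_with_rps`, is not shown.

```python
import csv
import io

# Stand-in for one region_proposals_train/<name>.csv file (values are made up).
sample = io.StringIO(
    "x_lefttop,y_lefttop,width,height,label\n"
    "304,148,35,60,5\n"
    "10,20,64,48,0\n"
)


def read_proposals(handle):
    """Return (boxes, labels) from a proposal CSV with the header shown above."""
    boxes, labels = [], []
    for row in csv.DictReader(handle):
        boxes.append(tuple(int(float(row[k])) for k in ("x_lefttop", "y_lefttop", "width", "height")))
        labels.append(int(float(row["label"])))
    return boxes, labels


boxes, labels = read_proposals(sample)
print(boxes)   # [(304, 148, 35, 60), (10, 20, 64, 48)]
print(labels)  # [5, 0]
```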

## Step 2: build the network

![title](img/fast_rcnn_network.png)

A Fast R-CNN network takes as input an entire image and a set of bounding box proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each pre-generated bounding box proposal for this image, the corresponding projection is extracted from the feature map, and an RoI pooling layer transforms the projection into a fixed-length feature vector. Each feature vector is fed into a sequence of fully connected layers that finally branch into two sibling output layers: one that produces softmax probability estimates over 80 classes plus a background class, and another that outputs four real-valued numbers (x, y, w, h) for each of the 80 classes.
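
The RoI pooling step described above can be sketched as follows. This is only an illustration under stated assumptions: the feature map is random, the helper `pool_roi` is hypothetical, and the boxes are given directly in feature-map cells. To our understanding, `nn.AdaptiveMaxPool2d` also accepts crops smaller than the requested 7 * 7 output, in which case neighbouring output cells reuse the same input cells.

```python
import torch
from torch import nn

roi_pool = nn.AdaptiveMaxPool2d((7, 7))

# Hypothetical feature map for one 360*640 image: 512 channels, 22 * 40 cells.
feature_map = torch.rand(512, 22, 40)


def pool_roi(feature_map, x, y, w, h):
    """Crop the (x, y, w, h) projection (in feature-map cells) and pool it to 7 * 7."""
    crop = feature_map[:, y:y + h, x:x + w]
    return roi_pool(crop)


large = pool_roi(feature_map, x=5, y=3, w=20, h=12)
small = pool_roi(feature_map, x=5, y=3, w=3, h=2)  # crop smaller than 7 * 7
print(tuple(large.shape), tuple(small.shape))
print(large.flatten().shape[0])  # 512 * 7 * 7 = 25088, the input size of the first fully connected layer
```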

In our project, the Fast R-CNN network is built on a pre-trained VGG16 network.

It undergoes three transformations from VGG16:

(1) The last max pooling layer of VGG16 is replaced by an RoI pooling layer, which transforms feature map projections of different sizes into a fixed-size 7*7 map per channel. This layer is implemented using torch.nn.AdaptiveMaxPool2d().

(2) The network's last fully connected layer and softmax are replaced with the two sibling layers described earlier (a fully connected layer and softmax over 81 categories, and category-specific bounding-box regressors). The bounding-box regressor uses the parameterization for regression targets given in paper \[3\].

(3) The network is modified to take two data inputs: a list of images and a list of bounding box proposals for those images.
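
For reference, the regression-target parameterization of \[3\] mentioned in (2) can be sketched as below. This is a sketch under the assumption that boxes are (x, y, w, h) with a top-left corner and that the offsets are computed on box centers, as in \[3\]; the function names are hypothetical and the boxes are made up. The notebook code applies the same formulas to the top-left corner instead.

```python
import math


def to_center(box):
    """Convert (x, y, w, h) with a top-left corner to (cx, cy, w, h)."""
    x, y, w, h = box
    return x + w / 2.0, y + h / 2.0, w, h


def encode(proposal, gt):
    """Regression targets (tx, ty, tw, th) that move `proposal` onto `gt`."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(gt)
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def decode(proposal, targets):
    """Apply targets to `proposal` and return the predicted (x, y, w, h) box."""
    px, py, pw, ph = to_center(proposal)
    tx, ty, tw, th = targets
    cx, cy = pw * tx + px, ph * ty + py
    w, h = pw * math.exp(tw), ph * math.exp(th)
    return (cx - w / 2.0, cy - h / 2.0, w, h)


proposal = (100.0, 80.0, 40.0, 60.0)
gt = (110.0, 90.0, 50.0, 30.0)
targets = encode(proposal, gt)
print(targets)
print(decode(proposal, targets))  # recovers gt up to floating-point error
```

The targets are scale-normalized: position offsets are divided by the proposal size and sizes are compared in log space, so they stay small regardless of how large the box is in pixels.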

One thing to note is that the paper does not state the RoI pooling implementation clearly. For an original image of size 360*640, after 4 max pooling layers the feature map has size 512*22*40, so each spatial dimension is about 16 times smaller than in the original image. We then need to extract the projection corresponding to each bounding box proposal, which is usually small. We found that the projection was often smaller than 7*7. How, then, should RoI pooling be done? Our approach is to first pad the projection with (3, 3, 3, 3) so that it is large enough for RoI pooling. This padding undoubtedly slows down computation.
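
The size arithmetic above can be checked with a few lines. This is a simplified sketch: `feature_map_size` and `projected_size` are hypothetical helpers, and `projected_size` only approximates the mapping done by `map_region_proposals_to_feature_map` in the code below. The 35 * 60 box is the ground-truth size that appears in the first line of the training log.

```python
def feature_map_size(height, width, num_pools=4):
    """Spatial size after `num_pools` 2x2 max pooling layers with stride 2 (floor mode)."""
    for _ in range(num_pools):
        height, width = height // 2, width // 2
    return height, width


def projected_size(w, h, stride=16):
    """Approximate size of a w * h pixel box on the feature map, at least 1 cell per side."""
    return max(1, w // stride), max(1, h // stride)


print(feature_map_size(360, 640))  # (22, 40)
print(projected_size(35, 60))      # (2, 3)
```

A 35 * 60 pixel object therefore covers only about 2 * 3 cells of the feature map, far below the 7 * 7 output of the RoI pooling layer.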

In [None]:
from torchvision import models
import torch
from torch import nn
import math
import torch.nn.functional as F
import numpy as np

stride_prod = 16

device = torch.device("cuda")

class fast_rcnn_net(nn.Module):

    def __init__(self, output_size):
        super(fast_rcnn_net, self).__init__()

        # TODO: use vgg16 as ConvNet
        self.features = nn.Sequential(
            # 0-0 conv layer: 3 * 360 * 640 -> 64 * 360 * 640
            nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
            nn.ReLU(),

            # 0-1 conv layer: 64 * 360 * 640 -> 64 * 360 * 640
            nn.Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
            nn.ReLU(),

            # 0 max pooling: 64 * 360 * 640 -> 64 * 180 * 320
            nn.MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False),

            # 1-0 conv layer: 64 * 180 * 320 -> 128 * 180 * 320
            nn.Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
            nn.ReLU(),

            # 1-1 conv layer: 128 * 180 * 320 -> 128 * 180 * 320
            nn.Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
            nn.ReLU(),

            # 1 max pooling: 128 * 180 * 320 -> 128 * 90 * 160
            nn.MaxPool2d(kernel_size=2, stride =2, padding=0, dilation=1, ceil_mode=False),

            # 2-0 conv layer: 128 * 90 * 160 -> 256 * 90 * 160
            nn.Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
            nn.ReLU(),

            # 2-1 conv layer: 256 * 90 * 160 -> 256 * 90 * 160
            nn.Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
            nn.ReLU(),

            # 2-2 conv layer: 256 * 90 * 160 -> 256 * 90 * 160
            nn.Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
            nn.ReLU(),

            # 2 max pooling: 256 * 90 * 160 -> 256 * 45 * 80
            nn.MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False),

            # 3-0 conv layer: 256 * 45 * 80 -> 512 * 45 * 80
            nn.Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
            nn.ReLU()
            
        )
        
        # according to the fast-rcnn paper, it is unnecessary to change the parameters of the first 8 conv layers, thus we freeze these conv layers so the parameters won't be updated during training
        for param in self.parameters():
            param.requires_grad = False    # freeze these parameters

        # from here on, the parameters will be updated by back-propagation
        self.features_unfreeze = nn.Sequential(
            # 3-1 conv layer: 512 * 45 * 80 -> 512 * 45 * 80
            nn.Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
            nn.ReLU(),

            # 3-2 conv layer: 512 * 45 * 80 -> 512 * 45 * 80
            nn.Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
            nn.ReLU(),

            # 3 max pooling: 512 * 45 * 80 -> 512 * 22 * 40
            nn.MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False),

            # 4-0 conv layer: 512 * 22 * 40 -> 512 * 22 * 40
            nn.Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
            nn.ReLU(),

            # 4-1 conv layer: 512 * 22 * 40 -> 512 * 22 * 40
            nn.Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
            nn.ReLU(),

            # 4-2 conv layer: 512 * 22 * 40 -> 512 * 22 * 40
            nn.Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
            nn.ReLU()
        )

        # TODO: the ROI Pooling layer
        # the last max pooling layer is replaced by a ROI pooling layer that is configured by setting H=W=7: 7 * 7 * 512
        self.roi_pooling = nn.AdaptiveMaxPool2d((7, 7), return_indices=False)

        # TODO: continue the vgg fully connected layer
        self.classifier = nn.Sequential(
            # 0 fully connected
            nn.Linear(in_features=25088, out_features=4096, bias=True),
            nn.ReLU(),
            nn.Dropout(p=0.5),

            # 1 fully connected
            nn.Linear(in_features=4096, out_features=4096, bias=True),
            nn.ReLU(),
            nn.Dropout(p=0.5)
        )

        # TODO: two sibling output layer: one that produces softmax probability estimates, another outputs four real-valued numbers for each of the object class
        self.class_score_layer = nn.Linear(in_features=4096, out_features=output_size, bias=False)

        self.bbox_target_layer = nn.Linear(in_features=4096, out_features=(output_size-1)*4, bias=False)

    ## bbox regressor uses the parameterization for regression targets given in paper "Rich feature hierarchies for accurate object detection and semantic segmentation"
    def bbox_target_to_pred_bbox(self, region_proj, bbox_target):
        box = torch.Tensor(region_proj).to(device)

        r, c, w, h = box[0], box[1], box[2], box[3]

        dr = bbox_target[0::4]
        dc = bbox_target[1::4]
        dw = bbox_target[2::4]
        dh = bbox_target[3::4]

        pred_bbox = torch.zeros(bbox_target.size(), dtype=bbox_target.dtype).to(device)

        pred_bbox[0::4] = w * dr + r
        pred_bbox[1::4] = h * dc + c
        # torch.exp works on GPU tensors directly and keeps dw/dh in the autograd graph
        pred_bbox[2::4] = w * torch.exp(dw)
        pred_bbox[3::4] = h * torch.exp(dh)

        for i in range(len(pred_bbox.detach())):
            if i % 4 == 0 or i % 4 == 1:
                pred_bbox[i] = math.ceil(pred_bbox[i] * stride_prod) - 1
            if i % 4 == 2 or i % 4 == 3:
                pred_bbox[i] = math.floor(pred_bbox[i] * stride_prod) + 1

        return pred_bbox

    # two forward functions: one forward_feature call is followed by 32 forward_output calls
    # forward_feature extracts the feature map for an image
    # forward_output does the projection, RoI pooling and fully connected layers
    def forward_feature (self, x):
        feature_maps = self.features(x)
        feature_maps = self.features_unfreeze(feature_maps)
        return feature_maps

    def forward_output (self, x, region_projs):
        size = x.detach().size()
        output = torch.Tensor(size[0], size[1], 7, 7).to(device)
        for idx in range(size[0]):
            (r, c, w, h) = (int(z) for z in region_projs[idx])
            output[idx] = self.roi_pooling(F.pad(x[idx, :, c: c+h, r: r+w], (3, 3, 3, 3)))
        output = self.classifier(output.view(size[0], -1))
        clf_scores = self.class_score_layer(output)
        clf_scores = F.softmax(clf_scores, dim=1)
        bbox_targets = self.bbox_target_layer(output)
        bbox_pred = torch.Tensor(bbox_targets.detach().size()).to(device)
        for idx in range(len(region_projs)):
            bbox_pred[idx] = self.bbox_target_to_pred_bbox(region_projs[idx], bbox_targets[idx])
        return clf_scores, bbox_pred

# the bounding box proposals need to first transform to projection size.
def map_region_proposals_to_feature_map (rps):
    rp_projs = []
    for rp in rps:
        (r1, c1, w, h) = rp  # (r1, c1) is the top-left corner of the region proposal, w is width, h is height
        r2, c2 = r1 + w - 1, c1 + h - 1  # (r2, c2) is the bottom-right corner

        r1_ = min(math.floor(r1 / stride_prod) + 1, 38)  # max index is 39, but we have to guarantee that the projection has at least width 1
        c1_ = min(math.floor(c1 / stride_prod) + 1, 20)  # max index is 21, ...
        r2_ = math.ceil(r2 / stride_prod) - 1
        c2_ = math.ceil(c2 / stride_prod) - 1
        w = max(1.0, r2_-r1_+1)
        h = max(1.0, c2_-c1_+1)
        rp_projs.append((r1_, c1_, w, h))
    return rp_projs

# calculate the sibling losses and combine them into one multi-task loss, averaged over the batch
def smooth_multi_task_loss (clf_scores, clf_gtruth, bbox_pred, bbox_gtruth, bbox_label, lambda_):
    loss = torch.zeros(len(clf_gtruth), device=clf_scores.device)
    for idx in range(len(clf_gtruth)):
        # classification loss: negative log-probability of the true class,
        # computed with torch ops so that gradients flow back to the network
        true_class = int(clf_gtruth[idx].item())
        loss_cls = -torch.log(clf_scores[idx][true_class].clamp(min=1e-12))  # small floor avoids log(0)
        loss_bbox = 0 # loss for bounding box

        u = int(bbox_label[idx].item())
        if u > 0:
            loss_bbox = F.smooth_l1_loss(bbox_pred[idx][(u - 1)*4: u*4], bbox_gtruth[idx].type(torch.float), reduction="sum")
        else:
            loss_bbox = 0

        loss[idx] = loss_cls + lambda_ * loss_bbox

    return loss.mean(dim=0)

## Step 3: train the network

We load the pre-trained VGG16 parameters for conv layers 1-8 and for the fully connected layers, then freeze conv layers 1-8. Training therefore only changes the parameters from the 9th conv layer onward.
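
The loading step works by matching parameter names between the two state dicts. The sketch below mimics this with plain dictionaries; the key lists are shortened and the values are placeholder strings rather than tensors, so it only illustrates which names match, assuming torchvision's usual `features.N` / `classifier.N` naming for VGG16.

```python
# Shortened key lists; real state_dicts map these names to weight tensors.
vgg16_keys = ["features.0.weight", "features.17.weight", "features.19.weight",
              "classifier.0.weight", "classifier.3.weight", "classifier.6.weight"]
our_keys = ["features.0.weight", "features.17.weight", "features_unfreeze.0.weight",
            "classifier.0.weight", "classifier.3.weight", "class_score_layer.weight"]

pretrained = {k: "vgg16:" + k for k in vgg16_keys}
ours = {k: "random init" for k in our_keys}

# Same filter-and-update pattern as in the training script below.
ours.update({k: v for k, v in pretrained.items() if k in ours})

for name, value in ours.items():
    print(name, "->", value)
```

Names under `features_unfreeze` have no counterpart in VGG16, so those layers keep their initial values, which matches the statement that only conv layers 1-8 and the fully connected layers are loaded.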

There are two forward functions: *forward_feature* and *forward_output*. forward_feature extracts the feature map for an image, and forward_output does the projection, RoI pooling and fully connected layers.

Each forward_feature call is followed by 32 forward_output calls and one backward pass.
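
The pattern of one shared feature pass, several per-RoI passes and a single backward pass can be sketched on a toy model. Everything here is a stand-in chosen for illustration: the one-layer `backbone` and `head`, the three made-up RoIs and the plain cross-entropy loss are assumptions, not the network or loss used in this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)
backbone = nn.Conv2d(3, 4, kernel_size=3, padding=1)  # stand-in for forward_feature
head = nn.Linear(4 * 7 * 7, 3)                         # stand-in for forward_output
roi_pool = nn.AdaptiveMaxPool2d((7, 7))
optimizer = torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()), lr=0.01)

image = torch.rand(1, 3, 32, 32)
rois = [(0, 0, 16, 16), (8, 8, 20, 12), (4, 10, 10, 18)]  # (x, y, w, h), made up
labels = torch.tensor([1, 0, 2])

optimizer.zero_grad()
feature_map = backbone(image)  # computed once per image
loss = torch.zeros(())
for (x, y, w, h), label in zip(rois, labels):
    crop = feature_map[:, :, y:y + h, x:x + w]
    scores = head(roi_pool(crop).flatten(1))
    loss = loss + nn.functional.cross_entropy(scores, label.view(1))

loss.backward()   # a single backward pass covers all RoIs
optimizer.step()
print(loss.item(), backbone.weight.grad is not None)
```

Because every per-RoI loss is built from torch operations on the shared feature map, the single backward pass sends gradients through both the head and the backbone.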

In [None]:
import torch
import numpy as np
import utils
import time
import fast_rcnn_network
import readimg_new
import os
from torchvision import models
from fast_rcnn_network import fast_rcnn_net
from datetime import datetime

root_path = os.getcwd()
segment = "segment2"

model_name = ''.join(str(datetime.now())[11:19].split(':'))
print(model_name)

device = torch.device("cuda")

# get data
# region_proposal: r, c, w, h, label; (r, c) are the coordinates of the top-left corner, label is 0 (background) if IoU < 50%, otherwise the ground-truth label
train_data, train_label, train_rps_4_imgs, train_rp_labels_4_imgs = readimg_new.read_data_with_rps(
    [segment + "/train_data", segment + "/train_label", segment + "/region_proposals_train"])

train_rp_labels_4_imgs = torch.Tensor(train_rp_labels_4_imgs)

train_data = train_data.type(torch.float).to(device)
train_label = train_label.to(device)

# map region proposals to feature maps
train_rg_projs_4_imgs = []
for rps_4_img in train_rps_4_imgs:
    train_rg_projs_4_imgs.append(fast_rcnn_network.map_region_proposals_to_feature_map(rps_4_img))
train_rg_projs_4_imgs = torch.Tensor(train_rg_projs_4_imgs)

# fast rcnn network
output_size = 16
our_net = fast_rcnn_net(output_size).to(device)

# load the vgg16 pre-trained parameter values
pretrained_vgg16 = models.vgg16(pretrained=True)
pretrained_dict = pretrained_vgg16.state_dict()

# update our network(the vgg16 part) with pre-trained vgg16 parameter values
our_net_dict = our_net.state_dict()
pretrained_dict = dict({k: v for k, v in pretrained_dict.items() if k in our_net_dict})
our_net_dict.update(pretrained_dict)
our_net.load_state_dict(our_net_dict)


# TODO: train net

train_data_num = train_data.size()[0]
#test_data_num = test_data.size()[0]
pixel_num_of_each_pict = np.prod(train_data[0].size()[-2:])
start_time = time.time()
epoch_num = 20
bs = 2  # number of images in each batch, as mentioned in the paper
lr = 0.05
roi_size = fast_rcnn_network.roi_size
roi_padding = fast_rcnn_network.roi_pad
train_region_num = 32  # number of region_projections for each train image

print("epoch_num=", epoch_num, ", roi_size=", roi_size, ", roi_pad=", roi_padding, ", bs=", bs, ", lr=", lr)

for epoch in range(epoch_num):
    print("epoch: ", epoch)
    # learning rate strategy: divide the learning rate by 1.1 every 5 epochs
    if epoch % 5 == 0 and epoch > 0:
        lr /= 1.1

    # create a new optimizer at the beginning of each epoch: give the current learning rate
    optimizer = torch.optim.SGD(our_net.parameters(), lr=lr)

    shuffled_indices = torch.randperm(train_data_num)

    running_loss_train = 0
    num_train = 0

    for count in range(0, train_data_num, bs):
        print("count:", count)
        # forward and backward pass
        optimizer.zero_grad()

        indices = shuffled_indices[count: count + bs]
        minibatch_train_data = train_data[indices]
        minibatch_train_label = train_label[indices]
        minibatch_train_rp_label = train_rp_labels_4_imgs[indices]
        minibatch_train_rg_projs = train_rg_projs_4_imgs[indices]
        train_inputs = minibatch_train_data

        train_inputs.requires_grad_()

        feature_maps = our_net.forward_feature(train_inputs)

        loss = 0
        count_train_loss = 0
        for i in range(train_region_num):
            region_projs = [minibatch_train_rg_projs[j][i] for j in range(min(bs, len(minibatch_train_rg_projs)))] # the ith region for the jth image
            rp_labels = [minibatch_train_rp_label[j][i] for j in range(min(bs, len(minibatch_train_rp_label)))]

            clf_scores, bbox_pred = our_net.forward_output(feature_maps, region_projs)
            clf_gtruth = [z[0] for z in minibatch_train_label]
            bbox_gtruth = [z[1:] for z in minibatch_train_label]
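            region_loss = fast_rcnn_network.smooth_multi_task_loss(clf_scores, clf_gtruth, bbox_pred, bbox_gtruth, rp_labels, 1)
            loss += region_loss

            # log the loss of this region only; `loss` is the running sum used for backward
            running_loss_train += region_loss.detach().item()
            num_train += 1
            count_train_loss += region_loss.detach().item()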

        print("count=", count, ", total train loss=", count_train_loss, ", lr=", lr, ", ", minibatch_train_label)
        loss.backward()
        optimizer.step()

    total_loss_train = round(running_loss_train / num_train, 4)
    elapsed_time = round(time.time() - start_time, 4)
    print("epoch=", epoch, ", time=", elapsed_time, ", train loss=", total_loss_train, ", lr=", lr)

# save the model parameters
torch.save(our_net.state_dict(), segment + "/model_params_" + model_name + ".pkl")

Output (running on our server with cuda9):

091317

torch.Size([400, 3, 360, 640]) #training with 400 images

torch.LongTensor

torch.Size([400, 5]) #label: (class_label, x, y, w, h)

epoch_num= 20 , roi_size= 7 , roi_pad= 0 , bs= 2 , lr= 0.05

epoch:  0

count= 0 , total train loss= 192488.05795288086 , lr= 0.05 ,  tensor([[  5, 304, 148,  35,  60],
        [  5, 300, 146,  36,  62]])
        
count= 2 , total train loss= 95298.66703796387 , lr= 0.05 ,  tensor([[  4,  77, 107, 178,  81],
        [  8, 307, 224, 114,  95]])
        
count= 4 , total train loss= 80875.30766296387 , lr= 0.05 ,  tensor([[  3, 295, 253, 156,  55],
        [ 11, 304, 161,  30,  56]])
        
count= 6 , total train loss= 81827.7059249878 , lr= 0.05 ,  tensor([[  3, 300, 249, 142,  70],
        [  4, 280, 103, 171, 117]])
        
count= 8 , total train loss= 75715.70306396484 , lr= 0.05 ,  tensor([[ 14, 124, 129,  87,  96],
        [  3, 329, 248, 119,  75]])
        
count= 10 , total train loss= 77361.64921450615 , lr= 0.05 ,  tensor([[  3, 339, 249, 100,  75],
        [  8, 277, 151,  98, 103]])
        
...

## Result

The model does not converge. The loss oscillates up and down around 10000.

We think there are three possible reasons:

1. Not enough fine-tuning due to the slow training speed. At first, we built the network with a very inefficient RoI pooling layer, in which we did the projection channel by channel (over the 512 feature map channels). That implementation created many intermediate variables, so a backward pass took a long time. We lost a lot of time struggling with this network before we realized the cause. After we changed the layer to the current version, training became at least 10x faster. It is still slow compared with an RoI pooling layer written in C, but it is acceptable.

2. Not enough training images due to the poor performance of Selective Search. As mentioned before, because the objects are small, there is limited space for bounding box proposals, and we could not get enough qualified images with 32 proposals each that meet our settings. In the end, we trained the network with 500 images from 15 classes.

3. The performance of the network relies heavily on the bounding box proposals it receives as input; this is a conclusion reported by earlier practitioners. The Selective Search algorithm may not be suitable for our case, so we did not do well in this part, which in turn affected the entire network.

## References

\[1\] Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE international conference on computer vision. 2015.

\[2\] Uijlings, Jasper RR, et al. "Selective search for object recognition." International journal of computer vision 104.2 (2013): 154-171.

\[3\] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.