# Region Proposal Network

## Anchor Box
In `configuration.py` we have
```python
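# (height, width) of an anchor for scale s and ratio r = width / height
aspect = lambda s, r: (s * 1 / r ** 0.5, s * r ** 0.5)
# one row per FPN feature map, three anchor shapes per row
self.rpn_base_aspect_ratios = [
    [(1, 1), aspect(2**0.5, 2), aspect(2**0.5, 0.5)],
    [(1, 1), aspect(2**0.5, 2), aspect(2**0.5, 0.5)],
    [(1, 1), aspect(2**0.5, 2), aspect(2**0.5, 0.5)],
    [(1, 1), aspect(2**0.5, 2), aspect(2**0.5, 0.5)],
]

# base anchor size for each feature map
self.rpn_base_sizes = [8, 16, 32, 64]
# scale ratio from the image to each feature map
self.rpn_scales = [2, 4, 8, 16]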
```
Here `s` and `r` are the scale and ratio, and each element of `rpn_base_aspect_ratios` is the (height, width) of one anchor box. The 4 inner lists correspond to the 4 feature maps generated by FPN, so each feature map gets 3 different anchor boxes. Each shape is computed as:
$$
\begin{aligned}
\text{height} &= \text{scale} \times \frac{1}{\sqrt{\text{ratio}}} \\
\text{width}  &= \text{scale} \times \sqrt{\text{ratio}}
\end{aligned}
$$

So
$$\frac{\text{width}}{\text{height}} = \text{ratio}$$

We use 2 scales $(1, \sqrt{2})$ and 3 aspects $(1:1, 1:2, 2:1)$: scale $1$ for the square anchor and scale $\sqrt{2}$ for the two elongated ones. Each feature map therefore corresponds to 3 different anchor boxes.
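As a quick check of the formulas above, the following sketch re-implements `aspect` as a plain function and prints the three anchor shapes. Up to floating-point rounding, the two elongated anchors come out as (1, 2) and (2, 1), so their area is twice that of the square anchor.

```python
import math

def aspect(s, r):
    """Return (height, width) for scale s and ratio r = width / height."""
    return (s / math.sqrt(r), s * math.sqrt(r))

# The three anchor shapes used for every feature map in the example config.
anchors = [(1, 1), aspect(2 ** 0.5, 2), aspect(2 ** 0.5, 0.5)]

for h, w in anchors:
    print(f"height={h:.3f} width={w:.3f} width/height={w / h:.3f} area={h * w:.3f}")
```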

![](images/anchor.svg)

1. The actual proposal size is controlled by `rpn_base_sizes`. Different feature maps have different receptive fields: a smaller feature map has a larger receptive field, so we assign a larger anchor base size to smaller feature maps.

2. The scale ratio from the image to each feature map is recorded in `rpn_scales`. When using RoiAlign, we decide which feature map to pool from by measuring the size of each proposal. Once the feature map is chosen, we need its scale to get the actual region to crop from that feature map.

The figure above explains the parameters used in `configuration.py`.
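To make the role of these parameters concrete, here is a minimal sketch of how anchors could be laid out on one feature map. It is not the repository's code: `make_anchors` is a hypothetical helper, and it assumes that a feature-map pixel `(x, y)` maps to the image point `(x * scale, y * scale)` and that each anchor measures `base_size * (height, width)` around that point.

```python
def make_anchors(feat_h, feat_w, scale, base_size, aspects):
    """Return anchors as (x0, y0, x1, y1) boxes in image coordinates.

    Hypothetical helper: the centre mapping (pixel index * scale) is an assumption.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = x * scale, y * scale  # anchor centre in the image
            for h, w in aspects:
                bh, bw = base_size * h, base_size * w  # anchor size in the image
                anchors.append((cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2))
    return anchors

aspects = [(1, 1), (1, 2), (2, 1)]  # (height, width) in units of base_size
# Smallest feature map of the example: 16x16, rpn_scales entry 16, base size 64.
p5_anchors = make_anchors(16, 16, scale=16, base_size=64, aspects=aspects)
print(len(p5_anchors))   # 16 * 16 * 3 anchors
print(p5_anchors[:3])    # the three anchors centred on pixel (0, 0)
```

Anchors near the border extend outside the image in this sketch; how they are clipped or filtered is left out.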

## RPN's output

![](images/rpn_head.svg)

1. FPN generates 4 feature maps from an image.
2. For one feature map, say p3 (64\*64), take each pixel of the feature map as the centre of the anchor boxes.
3. This gives a total of $(64*64)*3*2 = 24576$ scores arranged in shape $(3, 2, 64, 64)$, in which 2 is the pair of scores for foreground/background. Flattening over the pixel and anchor dimensions gives a tensor of shape $(B, N, 2)$, in which `B` is the batch size and `N` is the total number of anchor boxes, calculated in the next step.
4. All 4 feature maps have a total of $(16*16*3 + 32*32*3 + 64*64*3 + 128*128*3) = 65280$ anchor boxes, each with a foreground and a background score, so the final output size of the RPN layer is $(B, 65280, 2)$. Let's check it out:

```python
import sys
sys.path.append("..")
import torch
from configuration import Configuration
from net.layer.rpn.rpn_head import RpnMultiHead
```

```python
cfg = Configuration()
# set anchor boxes to our example above
aspect = lambda s, r: (s * 1 / r ** 0.5, s * r ** 0.5)
cfg.rpn_base_aspect_ratios = [
    [(1, 1), aspect(2**0.5, 2), aspect(2**0.5, 0.5)],
    [(1, 1), aspect(2**0.5, 2), aspect(2**0.5, 0.5)],
    [(1, 1), aspect(2**0.5, 2), aspect(2**0.5, 0.5)],
    [(1, 1), aspect(2**0.5, 2), aspect(2**0.5, 0.5)],
]

# input features, the batch size is set to 5, channels of each feature map is 256
p2 = torch.randn(5, 256, 128, 128)
p3 = torch.randn(5, 256, 64, 64)
p4 = torch.randn(5, 256, 32, 32)
p5 = torch.randn(5, 256, 16, 16)
fs = [p2, p3, p4, p5]
# init RPN head
net = RpnMultiHead(cfg, 256)
```

```python
logits_flat, deltas_flat = net(fs)
print(logits_flat.size(), deltas_flat.size())
```

```
torch.Size([5, 65280, 2]) torch.Size([5, 65280, 2, 4])
```
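
The anchor count of 65280 can also be cross-checked by hand, without building the network. The sketch below only uses the feature map sizes from the example and the 3 anchors per pixel.

```python
feature_sizes = [128, 64, 32, 16]  # spatial sizes of p2, p3, p4, p5 in the example
anchors_per_pixel = 3              # three anchor shapes per feature-map pixel

per_level = [s * s * anchors_per_pixel for s in feature_sizes]
total = sum(per_level)
print(per_level)  # [49152, 12288, 3072, 768]
print(total)      # 65280
```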


The RPN layer has 2 branches. One gives the foreground/background scores for each anchor box; the other gives the box regression parameters, which are used to reshape each anchor box so that it fits the instance better. There are 4 parameters to refine an anchor box, which is the last dimension of the output tensor of size $(B, 65280, 2, 4)$.
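
To illustrate how 4 regression parameters can reshape an anchor, here is a sketch that assumes the common R-CNN parameterization `(dx, dy, dw, dh)`: the centre is shifted by a fraction of the anchor size and the width and height are scaled exponentially. Whether this repository uses exactly this encoding is an assumption, and `apply_deltas` is a hypothetical name.

```python
import math

def apply_deltas(box, deltas):
    """Refine an (x0, y0, x1, y1) box with (dx, dy, dw, dh) deltas.

    Assumes the usual R-CNN encoding; the repository's encoding may differ.
    """
    x0, y0, x1, y1 = box
    dx, dy, dw, dh = deltas
    w, h = x1 - x0, y1 - y0
    cx, cy = x0 + 0.5 * w, y0 + 0.5 * h
    # shift the centre relative to the anchor size, rescale width and height
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * math.exp(dw), h * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

anchor = (0.0, 0.0, 64.0, 64.0)
unchanged = apply_deltas(anchor, (0.0, 0.0, 0.0, 0.0))
refined = apply_deltas(anchor, (0.25, 0.0, math.log(2), 0.0))
print(unchanged)  # zero deltas leave the anchor as it is
print(refined)    # shifted right by a quarter of its width and twice as wide
```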

This is the basic idea behind the RPN layer, which is probably the most complicated layer in Mask R-CNN.

## RPN Train/Test Process

![](images/rpn.svg)

1. Get the features from FPN.
2. Get the RPN output: logits and deltas.
3. Make the targets for training, which is done in `rpn_target.py`.
4. Apply NMS to get the RPN proposals, which is done in `nms.py`.
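
The NMS step can be sketched in plain Python as follows. This is a generic greedy NMS for illustration, not the code in `nms.py`; the function names and the threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold):
    """Greedy NMS: keep the best box, drop boxes overlapping a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, threshold=0.5)
print(kept)  # [0, 2]: box 1 overlaps box 0 with IoU 81/119 and is suppressed
```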

## RCNN Train/Test Process

![](images/rcnn.svg)

The train/test process for RCNN is not much different from RPN, except that:

1. RCNN uses features cropped by the roi-align layer. The crop size is decided by the RPN proposals.
2. RCNN proposals for training are re-sampled, with balanced positive/negative samples.
3. RCNN predicts a class label for each proposal.
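
The re-sampling in point 2 could look roughly like the sketch below. It is an illustration only: `sample_balanced`, the `fg_fraction` of 0.5 and the convention that label 0 means background are assumptions, not the repository's settings.

```python
import random

def sample_balanced(labels, num_samples, fg_fraction=0.5, seed=0):
    """Pick proposal indices with a capped share of positives (label > 0).

    Hypothetical helper; label 0 is assumed to mean background.
    """
    rng = random.Random(seed)
    fg = [i for i, label in enumerate(labels) if label > 0]
    bg = [i for i, label in enumerate(labels) if label == 0]
    num_fg = min(len(fg), int(num_samples * fg_fraction))
    num_bg = min(len(bg), num_samples - num_fg)
    return rng.sample(fg, num_fg) + rng.sample(bg, num_bg)

labels = [1] * 5 + [0] * 95  # 5 positive and 95 negative proposals
picked = sample_balanced(labels, num_samples=8)
num_pos = sum(labels[i] > 0 for i in picked)
print(num_pos, len(picked))  # 4 8
```

Without this step the negatives would dominate the batch, since most proposals do not match any instance.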

## MASK Train/Test Process

![](images/mask.svg)

The train/test process for the mask head is not much different from RPN and RCNN, except that:

1. The mask head predicts a per-pixel probability mask for each class, so its output is (B, mask_size, mask_size).
2. The mask target is cropped from the truth instance with the truth box. The crop may not be the same size as mask_size, so we need to resize the cropped instance to mask_size.
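
A minimal sketch of the crop-and-resize in point 2, using nearest-neighbour sampling on nested lists. `crop_and_resize` is a hypothetical helper, and the interpolation used by the repository may differ.

```python
def crop_and_resize(instance, box, mask_size):
    """Crop a binary mask with box (x0, y0, x1, y1) and resize it to a square.

    x1 and y1 are exclusive. Nearest-neighbour sampling is an assumption.
    """
    x0, y0, x1, y1 = box
    crop_h, crop_w = y1 - y0, x1 - x0
    out = []
    for i in range(mask_size):
        # source row for output row i, sampled at the centre of the output pixel
        src_y = y0 + min(crop_h - 1, int((i + 0.5) * crop_h / mask_size))
        row = []
        for j in range(mask_size):
            src_x = x0 + min(crop_w - 1, int((j + 0.5) * crop_w / mask_size))
            row.append(instance[src_y][src_x])
        out.append(row)
    return out

# An 8x8 image containing one instance that is 4 pixels wide and 2 pixels high.
instance = [[1 if 2 <= x < 6 and 2 <= y < 4 else 0 for x in range(8)] for y in range(8)]
target = crop_and_resize(instance, box=(2, 2, 6, 4), mask_size=4)
for row in target:
    print(row)
```

The 4x2 crop becomes a 4x4 target, so the target always has the same size as the mask head's prediction regardless of the instance's shape.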