# The Anatomy
## Training data generator (`class DataGenerator`)
* `backbone_shapes`: width and height of each stage in backbone (encoding) network. 
* `generate_pyramid_anchors`: generates anchors for every level of the feature pyramid.
  * `generate_anchors` returns all the anchors for one feature layer of the backbone network as an array of corner coordinates `(y1, x1, y2, x2)` (top left corner, then bottom right corner). The generated anchors are in image pixel coordinates even though they are generated for feature layers.
    * `anchor_stride` is 1, so anchors are generated for each point of the feature map. 
    * `feature_stride` is the stride of a feature layer of the backbone network w.r.t. the original image (e.g., resnet101 has `[4,8,16,32,64]` as the respective feature strides for its relevant layers). If `anchor_stride` is `1` and `feature_stride` is, say, 4, then anchors will be generated over a grid with cells of `4 * 4` pixels in image coordinates.
    * `shape`: shape of the feature map. E.g., for layer 1 of resnet101, this will be image.shape/4.0. 
    * `scales` gives edge lengths of square anchors in image pixel coordinates.
    * `ratios` gives the aspect ratios (width/height) of the anchors at a pixel. The height of a non-square anchor is given by `scale/sqrt(ratio)` and the width is given by `scale*sqrt(ratio)`, so the anchor area stays at `scale * scale`.
* `load_image_gt`: (Load image ground truth) loads data for training.
  * This looks complicated because it handles varied image sizes and class
     ids, but that's all irrelevant for us. We can replace this with very simple
     image and mask loading functionality.
  * Returns a cropped and resized image, class ids, bounding boxes and segmentation masks.
  * Open question: what area of the image does it return, and why is it cropped and resized?

  * `load_image`: the first method called by `load_image_gt`.
    * Defined in utils.py.
    * Simple function. Reads the image and converts grayscale to RGB if required.
  * `load_mask`: masks are binary and returned in the format `(masks, class_ids)` of shape `([H, W, num_instances], [num_instances])`, where `masks[:,:,i]` and `class_ids[i]` define the mask and class id for instance `i`.
  * `resize_image`: Resizes while maintaining the aspect ratio and (if required) pads the image. Why do we need to resize the image? For our case (all square images of the same size) we can make this a no-op and not worry about it.
  * What `resize_image` does is explained in `config.py`. Look for "Input image resizing". Input images are all squared up and padded with 0's if required. We don't have to worry about this as all our images are already square. Here they like to resize so that the short side of the image is 800, with the constraint that the long side <= 1024. I think all this is irrelevant for us and we can just pass 512x512 images.
  * `compose_image_meta`: This is also a no-op for us. This keeps track of effects of resize_image so we can map back to the original image when needed. Since we won't resize, we don't need this information.
* `build_rpn_targets`: Computes ground truth for the Region Proposal Network. Takes
`anchors` and ground truth boxes as inputs and outputs positive anchors and
anchor deltas.
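
To make the anchor matching in `build_rpn_targets` concrete, here is a minimal NumPy sketch of IoU-based labelling. It is not the implementation used by the library: the function names `compute_iou_matrix` and `match_anchors` are hypothetical, and the thresholds (0.7 for positive, 0.3 for negative) are assumptions taken from the Faster R-CNN paper.

```python
import numpy as np

def compute_iou_matrix(anchors, gt_boxes):
    """IoU between every anchor and every ground truth box.

    Boxes are (y1, x1, y2, x2). Returns an array of shape [num_anchors, num_gt].
    """
    a = anchors[:, None, :]   # [A, 1, 4]
    g = gt_boxes[None, :, :]  # [1, G, 4]
    inter_h = np.maximum(np.minimum(a[..., 2], g[..., 2]) - np.maximum(a[..., 0], g[..., 0]), 0)
    inter_w = np.maximum(np.minimum(a[..., 3], g[..., 3]) - np.maximum(a[..., 1], g[..., 1]), 0)
    inter = inter_h * inter_w
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_a + area_g - inter)

def match_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label anchors: 1 = positive, -1 = negative, 0 = neutral (ignored).

    Thresholds are assumptions, not values read from the library config.
    """
    iou = compute_iou_matrix(anchors, gt_boxes)
    best_gt = iou.argmax(axis=1)   # best ground truth box per anchor
    best_iou = iou.max(axis=1)
    match = np.zeros(len(anchors), dtype=np.int32)
    match[best_iou < neg_thresh] = -1
    match[best_iou >= pos_thresh] = 1
    # Every ground truth box gets at least one positive anchor.
    match[iou.argmax(axis=0)] = 1
    return match, best_gt

anchors = np.array([[0, 0, 10, 10], [0, 0, 10, 5], [20, 20, 30, 30]], dtype=float)
gt_boxes = np.array([[0, 0, 10, 10]], dtype=float)
match, best_gt = match_anchors(anchors, gt_boxes)
print(match)  # [ 1  0 -1]
```

The anchor deltas are then computed only for the anchors labelled `1`, against the ground truth box in `best_gt`.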

### Training Datapoint
1. `input_rpn_match` of size [1]
2. `input_rpn_bbox` of size [4]
3. `input_gt_class_ids` of size ?
4. 

## Backbone
Residues are extracted from the backbone at 4 layers. For resnet, for an image of size (512, 512) these residues will be of sizes `(128, 128, 256), (64, 64, 512), (32, 32, 1024), (16, 16, 2048)`.
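
The sizes above follow directly from the feature strides. A minimal sketch, assuming strides `[4, 8, 16, 32]` for the four stages and the channel counts quoted above (the helper name `backbone_stage_shapes` is hypothetical):

```python
import math

def backbone_stage_shapes(image_shape, strides, channels):
    """Spatial size of each stage is the image size divided by the stage stride."""
    height, width = image_shape
    return [(math.ceil(height / s), math.ceil(width / s), c)
            for s, c in zip(strides, channels)]

stage_shapes = backbone_stage_shapes(
    (512, 512), strides=[4, 8, 16, 32], channels=[256, 512, 1024, 2048])
print(stage_shapes)
# [(128, 128, 256), (64, 64, 512), (32, 32, 1024), (16, 16, 2048)]
```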
## Feature Pyramid Network
Each layer of the FPN takes input from the upsampled inner layer of the FPN (if available) and the corresponding residues from the backbone. Outputs from these layers will be of sizes `(16, 16, 256), (32, 32, 256), (64, 64, 256), (128, 128, 256)`. An additional layer of size `(8, 8, 256)` is created by sub-sampling the innermost layer of the FPN. This subsampled layer is used in the RPN but not in the classifier. The anchor scales (mentioned above) are related to the layers of the FPN: each scale is associated with one layer.
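
The top-down merge can be sketched in plain NumPy. This is a shape-level illustration only, under several assumptions: the names `c2`..`c5` and `p2`..`p6` follow the FPN paper's naming, the 1x1 convolution is replaced by a random per-pixel linear map, upsampling is nearest-neighbour, and the smoothing convolution applied after the merge is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def lateral_1x1(feature_map, out_channels=256):
    """Stand-in for a 1x1 convolution: a per-pixel linear map over channels.

    Weights are random here; in a real network they are learned.
    """
    in_channels = feature_map.shape[-1]
    weights = rng.standard_normal((in_channels, out_channels)).astype(np.float32) * 0.01
    return feature_map @ weights

def upsample_2x(feature_map):
    """Nearest-neighbour upsampling by a factor of 2 along height and width."""
    return feature_map.repeat(2, axis=0).repeat(2, axis=1)

# Backbone residues for a 512x512 image, with the shapes listed in the Backbone section.
c2 = np.ones((128, 128, 256), dtype=np.float32)
c3 = np.ones((64, 64, 512), dtype=np.float32)
c4 = np.ones((32, 32, 1024), dtype=np.float32)
c5 = np.ones((16, 16, 2048), dtype=np.float32)

p5 = lateral_1x1(c5)                    # innermost layer, no upsampled input
p4 = lateral_1x1(c4) + upsample_2x(p5)  # residue + upsampled inner layer
p3 = lateral_1x1(c3) + upsample_2x(p4)
p2 = lateral_1x1(c2) + upsample_2x(p3)
p6 = p5[::2, ::2, :]                    # extra layer: sub-sampled p5, used by the RPN only

for name, p in [("p5", p5), ("p4", p4), ("p3", p3), ("p2", p2), ("p6", p6)]:
    print(name, p.shape)
```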
## Region Proposal Network
For each layer in the FPN of size `(k, k, 256)`, the RPN model outputs `(rpn_class_logits, rpn_class_probs, rpn_bbox_refinement)` where `rpn_class_logits` is of size
`k * k * 2 * anchors_per_pixel`, `rpn_class_probs` is simply the softmax of `rpn_class_logits` and `rpn_bbox_refinement` is of size `k * k * 4 * anchors_per_pixel`. The RPN model is applied independently to each layer of the FPN
and the outputs are simply concatenated. The RPN does not consider object classes when
proposing regions. It only classifies whether an anchor represents an ROI or not.
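
As a quick check of these sizes, the sketch below counts anchors per FPN layer for a 512x512 image, assuming 3 anchors per pixel (one per aspect ratio) and the five layer sizes mentioned above. The helper names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def rpn_output_shapes(k, anchors_per_pixel=3):
    """Output shapes of the RPN for one FPN layer of spatial size k x k."""
    num_anchors = k * k * anchors_per_pixel
    return {
        "rpn_class_logits": (num_anchors, 2),     # background / foreground
        "rpn_class_probs": (num_anchors, 2),      # softmax of the logits
        "rpn_bbox_refinement": (num_anchors, 4),  # one delta 4-tuple per anchor
    }

fpn_sizes = [128, 64, 32, 16, 8]
anchors_per_level = [rpn_output_shapes(k)["rpn_class_logits"][0] for k in fpn_sizes]
total_anchors = sum(anchors_per_level)
print(anchors_per_level)  # [49152, 12288, 3072, 768, 192]
print(total_anchors)      # 65472

# After concatenation, untrained (all-zero) logits give 0.5 / 0.5 probabilities.
example_probs = softmax(np.zeros((total_anchors, 2)))
```

The per-level counts match the shapes printed by the anchor generation cells at the end of these notes.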
## Proposal Layer
Receives inputs from the RPN and passes a subset of proposals to the classifier and mask layers. The subset is selected based on scores and non-max suppression. Most of the logic is provided by `tf.image.non_max_suppression`. It also applies deltas to anchors to arrive at the final bounding boxes computed by the RPN. Its output is of shape `proposal_counts * 4`. Deltas are multiplicative, i.e. they are applied relative to the size of the anchor.
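
A minimal sketch of what "multiplicative" means here, assuming the usual R-CNN parameterisation of deltas as `(dy, dx, log(dh), log(dw))`: the centre shift is scaled by the anchor size, and the size is multiplied by the exponential of the delta. Any extra normalisation of the deltas is left out.

```python
import numpy as np

def apply_box_deltas(boxes, deltas):
    """Apply deltas to boxes.

    boxes:  [N, (y1, x1, y2, x2)]
    deltas: [N, (dy, dx, log(dh), log(dw))]
    """
    height = boxes[:, 2] - boxes[:, 0]
    width = boxes[:, 3] - boxes[:, 1]
    center_y = boxes[:, 0] + 0.5 * height
    center_x = boxes[:, 1] + 0.5 * width
    # Shifts are expressed as a fraction of the box size.
    center_y = center_y + deltas[:, 0] * height
    center_x = center_x + deltas[:, 1] * width
    # Sizes are scaled multiplicatively.
    height = height * np.exp(deltas[:, 2])
    width = width * np.exp(deltas[:, 3])
    y1 = center_y - 0.5 * height
    x1 = center_x - 0.5 * width
    return np.stack([y1, x1, y1 + height, x1 + width], axis=1)

anchor = np.array([[0.0, 0.0, 10.0, 20.0]])
delta = np.array([[0.1, 0.0, np.log(2.0), 0.0]])
refined = apply_box_deltas(anchor, delta)
print(refined)  # approximately [[-4.  0. 16. 20.]]
```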
## Detection Target Layer
Receives input from the Proposal Layer and generates ground truth targets for
classification and mask generation. Finds positive ROIs, assigns each positive ROI to a ground truth box and calculates deltas from the ground truth box. Assigns class labels to positive ROIs. Negative ROIs are generated from background proposals (where IoU < 0.5 with every ground truth box). Returns proposals corresponding to positive ROIs and negative ROIs. For positive ROIs, also returns the corresponding deltas, class labels and masks. For negative ROIs, these values are 0-padded.
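
The delta calculation for positive ROIs can be sketched as the inverse of applying deltas to a box. This is an illustration under the assumption that deltas are parameterised as `(dy, dx, log(dh), log(dw))`; the function name `box_refinement` is used here for illustration only and any normalisation of the deltas is left out.

```python
import numpy as np

def box_refinement(boxes, gt_boxes):
    """Deltas that turn each box into its assigned ground truth box.

    boxes, gt_boxes: [N, (y1, x1, y2, x2)], row i of gt_boxes is the
    ground truth box assigned to row i of boxes.
    """
    height = boxes[:, 2] - boxes[:, 0]
    width = boxes[:, 3] - boxes[:, 1]
    center_y = boxes[:, 0] + 0.5 * height
    center_x = boxes[:, 1] + 0.5 * width

    gt_height = gt_boxes[:, 2] - gt_boxes[:, 0]
    gt_width = gt_boxes[:, 3] - gt_boxes[:, 1]
    gt_center_y = gt_boxes[:, 0] + 0.5 * gt_height
    gt_center_x = gt_boxes[:, 1] + 0.5 * gt_width

    dy = (gt_center_y - center_y) / height   # shift as a fraction of box size
    dx = (gt_center_x - center_x) / width
    dh = np.log(gt_height / height)          # log of the size ratio
    dw = np.log(gt_width / width)
    return np.stack([dy, dx, dh, dw], axis=1)

roi = np.array([[0.0, 0.0, 10.0, 20.0]])
gt = np.array([[-4.0, 0.0, 16.0, 20.0]])
roi_deltas = box_refinement(roi, gt)
print(roi_deltas)  # approximately [[0.1, 0.0, 0.693, 0.0]]
```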
## FPN Classifier Graph
Takes as input `roi`s from the detection target layer and 4 feature maps from the FPN, and outputs (for each bounding box proposal) class probabilities and bounding box deltas.
  * `PyramidROIAlign`: maps each ROI to a layer of the FPN depending on the ROI size (**What is the meaning/significance of this?**). Bigger ROIs are mapped to deeper layers (lower resolution) whereas smaller ROIs are mapped to shallower layers (higher resolution). This layer does not back propagate gradients to the RPN, by using explicit `tf.stop_gradient` on the relevant variables (**This may be incorrect**). See the documentation for this, it is useful. In effect, we are pretending that ROIs are constants inside this layer.
    * Uses `tf.image.crop_and_resize` to crop and resize the input bounding boxes for a corresponding feature map to `POOL_SIZE * POOL_SIZE` (which is in effect `7*7`). This leads to an output shape of `(num_boxes, 7, 7, number of channels in corresponding feature map)`.
    * Each `channel` from `PyramidROIAlign` is convolved with a number of kernels equal to the size of the fully connected layers. The same kernel is applied to each of the channels using the `TimeDistributed` layer. Code for this is instructive.
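
On the question of how ROI size maps to an FPN layer: the FPN paper assigns an ROI of width `w` and height `h` to level `k = floor(k0 + log2(sqrt(w*h) / 224))` with `k0 = 4`, clipped to the available levels. The sketch below implements that rule in pixel coordinates. It is an assumption that the code discussed here follows the same rule exactly; the level numbering 2 to 5 (strides 4 to 32) is the paper's convention.

```python
import numpy as np

def roi_to_fpn_level(rois, k0=4, canonical_size=224.0, min_level=2, max_level=5):
    """Assign each ROI (y1, x1, y2, x2, in pixels) to an FPN level.

    An ROI of the canonical size (224 x 224) lands on level k0. Halving the
    ROI's side length moves it one level towards higher resolution.
    """
    height = rois[:, 2] - rois[:, 0]
    width = rois[:, 3] - rois[:, 1]
    level = np.floor(k0 + np.log2(np.sqrt(height * width) / canonical_size))
    return np.clip(level, min_level, max_level).astype(int)

rois = np.array([
    [0, 0, 32, 32],     # small ROI  -> shallowest level (highest resolution)
    [0, 0, 112, 112],
    [0, 0, 224, 224],   # canonical size -> level k0
    [0, 0, 512, 512],   # large ROI  -> deepest level (lowest resolution)
], dtype=float)
levels = roi_to_fpn_level(rois)
print(levels)  # [2 3 4 5]
```

The intuition is that a large ROI cropped from a low resolution layer and a small ROI cropped from a high resolution layer cover a similar number of feature map cells before being resized to `7*7`.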
### FPN Mask Graph
  * Similar to FPN classifier graph but outputs masks instead of class probs and bounding boxes. 

### Open questions
Why are masks 28 x 28?
* Because masks are `tf.image.crop_and_resize`ed to 28x28 by the detection target graph (described above).
For a nice overview, see https://www.youtube.com/watch?v=aAmFlJsulHY&list=PLoEMreTa9CNm18TPHIYm3t2CLIqxLxzYD&index=9. This video explains Faster R-CNN. However, Mask R-CNN is basically Faster R-CNN with a mask prediction head (and some changes to ROI alignment -- interpolation instead of rounding).

# The Physiology
## RPN Training
  Inputs to the Backbone -> FPN (shared layers) -> RPN part of the network are:
  1. Images (whole)
  2. Ground truth box assignment for anchors
  3. Bounding box adjustment for anchors that have been assigned a positive label. Code for steps 2-3 is [here](https://colab.research.google.com/drive/1HuFSIknOdX2Af5PtylsUtew11SBcQir5#scrollTo=ZQbM7cLh-Ze4&line=4&uniqifier=1).
  
  Output from the RPN is the *predicted anchor assignment* (for all anchors) and the *bbox regression* for positive anchors (a fixed set, given the ground truth bounding boxes).
  The loss for the RPN is the sum of the misclassification loss for anchors and the regression loss due to anchors with positive ground truth bounding box assignments. See the [Faster R-CNN paper, Section 3.1.2](https://arxiv.org/pdf/1506.01497.pdf).
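
A minimal NumPy sketch of this loss, under stated assumptions: anchors are labelled `1` (positive), `-1` (negative) or `0` (neutral, ignored); the classification term is softmax cross-entropy over non-neutral anchors; the regression term is smooth L1 over positive anchors; the two terms are weighted equally. The layout of the inputs (one row per anchor) and the function names are hypothetical, and the balancing weight and normalisation used in the paper are left out.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)

def rpn_loss(rpn_match, class_logits, target_deltas, pred_deltas):
    """rpn_match: [A] in {1, -1, 0}; class_logits: [A, 2] (background, foreground);
    target_deltas, pred_deltas: [A, 4]."""
    # Classification loss over positive and negative anchors only.
    used = rpn_match != 0
    labels = (rpn_match[used] == 1).astype(int)
    logits = class_logits[used]
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    class_loss = -log_probs[np.arange(len(labels)), labels].mean()

    # Regression loss over positive anchors only.
    positive = rpn_match == 1
    if positive.any():
        bbox_loss = smooth_l1(pred_deltas[positive] - target_deltas[positive]).mean()
    else:
        bbox_loss = 0.0
    return class_loss + bbox_loss

rpn_match = np.array([1, -1, 0])
class_logits = np.zeros((3, 2))
target_deltas = np.zeros((3, 4))
pred_deltas = np.array([[0.5, 0.0, 0.0, 2.0], [9.0, 9.0, 9.0, 9.0], [9.0, 9.0, 9.0, 9.0]])
loss = rpn_loss(rpn_match, class_logits, target_deltas, pred_deltas)
print(loss)  # log(2) + 0.40625; rows 2 and 3 of pred_deltas do not contribute
```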
  
  Given these inputs (creating them is pretty involved, but the network itself is not so hard to understand), we are just training a vanilla deep NN and the RPN can be trained end-to-end (including the backbone). Generally, the backbone is pre-initialized, the RPN is trained and then the whole thing (RPN + backbone + FPN) is trained end-to-end. The FPN serves as the wide part of the network, as each level is fed separately to the RPN.

## Alright, RPN is trained, now what?
Now the rest of the training proceeds per positive bbox (region of interest, ROI) output by the RPN (and post-processed by non-max suppression to remove overlapping proposals). We freeze the RPN in this phase. The FPN is initially fixed but may be tuned in later passes of this step.

The inputs for the rest of the network (this is my current understanding, need to confirm) are features from FPN layers. Depending on the size of the proposal, the features are extracted from one of the layers of the FPN. For a larger ROI, features are extracted from the innermost (low resolution) layer of the FPN. Smaller regions are extracted from the downstream upsampled layers. The crux is that we are trying to simulate an image pyramid with a feature pyramid by pretending that the different layers of the feature pyramid *actually* correspond to different resolutions of spatial features (true to an extent), in the hope that the ultimately trained network actually corresponds to our naive beliefs. So this is now the input to what is called Mask R-CNN (the rest of it is not actually Mask R-CNN; the paper uses this distinction for the other pluggable parts, which are the backbone, the FPN and the RPN).

Now there are some possible lossy transformations here. The first is the mapping of the high resolution image to lower resolution FPN layers and the other is the mapping of the FPN features to a fixed resolution region alignment layer (RAL, a personal acronym, you may not find it in the literature). The FPN to RAL mapping is just a crop and resize interpolation (as implemented by tf) of the relevant FPN region, so there is no need to go into the details of this. It is just a linear mapping and a standard method of cropping and resizing images. Of course, when the rubber hits the road, the equations may still look a bit hairy, but the idea is simple enough.
### Can the FPN and the backbone be fine tuned with the Mask R-CNN?
Yes, we are just selecting a subset of outputs from the FPN and passing them on to the vanilla DNN. The subset selection is conditioned on the inputs and nothing else. Think of how max and max pooling work: only the selected element is passed forward, and the gradient flows back only through that element. This is similar in spirit. We pass a subset to subsequent layers, conditioned on the inputs, and when backpropagating, we do so along the selected subset.
### A note on implementation
In effect, the network is trained per ROI, but we avoid looping through each ROI by using the Keras layer `TimeDistributed`. This applies the same transformation to each slice of the input (a "frame", borrowed from video processing) and outputs per frame too. Thus, all ROIs are packed into a neat stack after alignment and then "time distributed" through the rest of the network.
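
Conceptually, this amounts to folding the ROI axis into the batch axis, applying the layer once, and unfolding again. The NumPy sketch below illustrates that idea only. It is not the Keras implementation, and `time_distributed` and `dense_head` are hypothetical names.

```python
import numpy as np

def time_distributed(fn, x):
    """Apply fn to every slice along axis 1 by folding axis 1 into the batch axis."""
    batch, steps = x.shape[:2]
    folded = x.reshape((batch * steps,) + x.shape[2:])
    out = fn(folded)
    return out.reshape((batch, steps) + out.shape[1:])

rng = np.random.default_rng(0)
# [batch, num_rois, pool_height, pool_width, channels], as produced by ROI alignment
rois = rng.standard_normal((2, 5, 7, 7, 256))
weights = rng.standard_normal((7 * 7 * 256, 16))

def dense_head(x):
    """The same weights are shared by every ROI."""
    return x.reshape(len(x), -1) @ weights

out = time_distributed(dense_head, rois)

# Reference: the explicit loop over ROIs that the folding avoids.
looped = np.array([[dense_head(rois[b, r][None])[0] for r in range(5)]
                   for b in range(2)])
print(out.shape)                 # (2, 5, 16)
print(np.allclose(out, looped))  # True
```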
## Misc
Non-max suppression is differentiable (did not expect it to be, may be interesting to look at the implementation). See: https://www.tensorflow.org/api_docs/python/tf/raw_ops for a list of differentiable and non-differentiable ops.

Models can be used just like any other layer.

## Motivation for region pooling (in general)
It is explained pretty well [here](https://youtu.be/v3jryjHk820?t=205).

In [None]:
## Should anchor scales be inferred from backbone strides?
RPN_ANCHOR_SCALES = (32, 64, 128, 256, 512)
RPN_ANCHOR_RATIOS = [0.5, 1, 2]
BACKBONE_STRIDES = [4, 8, 16, 32, 64]

import numpy as np
def generate_anchors(scales, ratios, shape, feature_stride, anchor_stride):
    """
    scales: 1D array of anchor sizes in pixels. Example: [32, 64, 128]
    ratios: 1D array of anchor ratios of width/height. Example: [0.5, 1, 2]
    shape: [height, width] spatial shape of the feature map over which
            to generate anchors.
    feature_stride: Stride of the feature map relative to the image in pixels.
    anchor_stride: Stride of anchors on the feature map. For example, if the
            value is 2 then anchors are generated for every other feature map pixel.
    """
    # Get all combinations of scales and ratios
    scales, ratios = np.meshgrid(np.array(scales), np.array(ratios))
    scales = scales.flatten()  # len(scales) * len(ratios)
    ratios = ratios.flatten()

    # Enumerate heights and widths from scales and ratios
    heights = scales / np.sqrt(ratios)
    widths = scales * np.sqrt(ratios)

    # Enumerate shifts. Multiplying by feature_stride means the shifts are
    # actually in image pixel space, not in feature space.
    shifts_y = np.arange(0, shape[0], anchor_stride) * feature_stride
    shifts_x = np.arange(0, shape[1], anchor_stride) * feature_stride
    shifts_x, shifts_y = np.meshgrid(shifts_x, shifts_y)

    # Enumerate combinations of shifts, widths, and heights
    box_widths, box_centers_x = np.meshgrid(widths, shifts_x)
    box_heights, box_centers_y = np.meshgrid(heights, shifts_y)

    # Reshape to get a list of (y, x) and a list of (h, w)
    box_centers = np.stack(
        [box_centers_y, box_centers_x], axis=2).reshape([-1, 2])
    box_sizes = np.stack([box_heights, box_widths], axis=2).reshape([-1, 2])

    # Convert to corner coordinates (y1, x1, y2, x2)
    boxes = np.concatenate([box_centers - 0.5 * box_sizes,
                            box_centers + 0.5 * box_sizes], axis=1)
    return boxes


In [None]:
RPN_ANCHOR_SCALES = (32, 64, 128, 256, 512)
RPN_ANCHOR_RATIOS = [0.5, 1, 2]
BACKBONE_STRIDES = [4, 8, 16, 32, 64]
RPN_ANCHOR_STRIDE = 1

def generate_pyramid_anchors(scales, ratios, feature_shapes, feature_strides,
                             anchor_stride):
    """Generate anchors at different levels of a feature pyramid. Each scale
    is associated with a level of the pyramid, but each ratio is used in
    all levels of the pyramid.

    Returns:
    anchors: [N, (y1, x1, y2, x2)]. All generated anchors in one array. Sorted
        with the same order of the given scales. So, anchors of scale[0] come
        first, then anchors of scale[1], and so on.
    """
    # Anchors
    # [anchor_count, (y1, x1, y2, x2)]
    anchors = []
    for i in range(len(scales)):
        anchors.append(generate_anchors(scales[i], ratios, feature_shapes[i],
                                        feature_strides[i], anchor_stride))
    return np.concatenate(anchors, axis=0)


In [None]:
x = generate_anchors(RPN_ANCHOR_SCALES[0], [0.5, 1.0, 2.0], [128, 128], BACKBONE_STRIDES[0], 1)
y = generate_anchors(RPN_ANCHOR_SCALES[1], [0.5, 1.0, 2.0], [64, 64], BACKBONE_STRIDES[1], 1)
z = generate_anchors(RPN_ANCHOR_SCALES[2], [0.5, 1.0, 2.0], [32, 32], BACKBONE_STRIDES[2], 1)
u = generate_anchors(RPN_ANCHOR_SCALES[3], [0.5, 1.0, 2.0], [16, 16], BACKBONE_STRIDES[3], 1)
v = generate_anchors(RPN_ANCHOR_SCALES[4], [0.5, 1.0, 2.0], [8, 8], BACKBONE_STRIDES[4], 1)



print(x.shape)
print(y.shape)
print(z.shape)
print(u.shape)
print(v.shape)
print(x[52:64,:])
print(y[52,:])
print(z[52,:])
print(u[52,:])
print(v[52,:])

(49152, 4)
(12288, 4)
(3072, 4)
(768, 4)
(192, 4)
[[-16.         52.         16.         84.       ]
 [-11.3137085  45.372583   11.3137085  90.627417 ]
 [-22.627417   60.6862915  22.627417   83.3137085]
 [-16.         56.         16.         88.       ]
 [-11.3137085  49.372583   11.3137085  94.627417 ]
 [-22.627417   64.6862915  22.627417   87.3137085]
 [-16.         60.         16.         92.       ]
 [-11.3137085  53.372583   11.3137085  98.627417 ]
 [-22.627417   68.6862915  22.627417   91.3137085]
 [-16.         64.         16.         96.       ]
 [-11.3137085  57.372583   11.3137085 102.627417 ]
 [-22.627417   72.6862915  22.627417   95.3137085]]
[-32. 104.  32. 168.]
[-64. 208.  64. 336.]
[-96. -96. 160. 160.]
[-128. -192.  384.  320.]
