# RCNN

Progress:  Need to finish training pipeline for FasterRCNN

## Fast RCNN

### Input

Input for Fast RCNN is an image and a list of region proposals.  

We will treat each image as $x_i \in \mathbb{R}^{H \times W \times C}$ and the region proposals as a list $[\cdots (r,c,h,w)_i \cdots ]$

### Model

![fastrcnn](fastrcnn.png)

### New Additions

### RoI Pooling Layer

**RoI pooling** takes windows of non-uniform size and outputs a fixed-size feature map for each one.  

- Inputs:
    - Feature maps (result of the convolution/max pooling layers)
    - Region proposals of shape (N, 5), where N is the number of proposals and the columns are:
        - image id
        - top-left corner coordinates (r, c)
        - height and width of the region proposal (h, w)

- Hyperparameter:
    - (H, W): size of the fixed output feature map (i.e. the output dimension will be (H, W, F), where F is the number of feature channels)
- Process:
    - Pull out the region proposal, whose corners are ((r, c), (r + h, c), (r, c + w), (r + h, c + w))
    - Divide the region into an (H, W) grid of sub-windows, each of size approximately (h/H, w/W)
    - Perform max pooling within each sub-window
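
The process above can be sketched in NumPy for a single region. This is a minimal illustration, not the library's implementation: the function name `roi_pool`, the integer bin edges, and the rule that every bin covers at least one cell are assumptions made here to handle region sizes that do not divide evenly.

```python
import numpy as np


def roi_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of `feature_map` to a fixed (out_h, out_w, F) output.

    feature_map: array of shape (height, width, F)
    roi: (r, c, h, w), top-left corner plus height and width, in feature-map cells
    """
    r, c, h, w = roi
    region = feature_map[r:r + h, c:c + w, :]

    # Integer bin edges: bin i covers rows row_edges[i] .. row_edges[i + 1] - 1.
    row_edges = (np.arange(out_h + 1) * h) // out_h
    col_edges = (np.arange(out_w + 1) * w) // out_w

    out = np.empty((out_h, out_w, feature_map.shape[2]), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Assumption: force every bin to contain at least one cell.
            r0, r1 = row_edges[i], max(row_edges[i + 1], row_edges[i] + 1)
            c0, c1 = col_edges[j], max(col_edges[j + 1], col_edges[j] + 1)
            out[i, j] = region[r0:r1, c0:c1].max(axis=(0, 1))
    return out


feature_map = np.arange(16).reshape(4, 4, 1)
pooled = roi_pool(feature_map, roi=(0, 0, 4, 4), out_h=2, out_w=2)
print(pooled[:, :, 0])
```

Whatever the size of the input window, the output always has shape (H, W, F), which is what lets proposals of different sizes feed the same fully connected layers.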

### Improvement with Respect to RCNN

1. Training is incorporated into a single algorithm (open question: is this good or bad?)
2. RoI (Region of Interest) pooling
3. The same feature map is shared by all proposed regions

## Faster RCNN

### Input

Input for Faster RCNN is an image.  No predefined region proposals are needed as input.  

We will treat each image as $x_i \in \mathbb{R}^{H \times W \times C}$ 

### Model

![fasterrcnn](fasterrcnn.png)

### New Additions:

### Region Proposal Network

The RPN takes an image as input and outputs a set of rectangular object proposals.

#### General Idea:

Slide a small network over the convolutional feature map (the output of the last shared layer).  For each n x n spatial window:
1. Generate k anchor boxes of predefined sizes.
2. Map the n x n window to a lower-dimensional feature vector.
3. Feed that feature vector into two sibling 1 x 1 convolutional (or fully connected) layers, which produce two output vectors:
    - $\mathbb{R}^{2k}$: for each of the k anchors, two scores estimating object versus not object
    - $\mathbb{R}^{4k}$: for each of the k anchors, a box (r, c, h, w) that is passed to the RoI network
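
To make the shapes in the steps above concrete, here is a minimal NumPy sketch of the sliding-window head. The function name `rpn_head`, the hidden size, and the random weights are all assumptions for illustration; only the data flow and the output shapes are meant to match the description.

```python
import numpy as np


def rpn_head(feature_map, n=3, k=9, hidden=32, seed=0):
    """Slide an n x n window over `feature_map` (H, W, D) and emit RPN outputs.

    Returns (scores, boxes) with shapes (H, W, 2k) and (H, W, 4k).
    Weights are random, so the values themselves are meaningless.
    """
    rng = np.random.default_rng(seed)
    height, width, depth = feature_map.shape
    pad = n // 2
    # Zero-pad so that every location has a full n x n window (n assumed odd).
    padded = np.pad(feature_map, ((pad, pad), (pad, pad), (0, 0)))

    w_hidden = rng.normal(size=(n * n * depth, hidden))  # step 2
    w_score = rng.normal(size=(hidden, 2 * k))           # step 3, objectness
    w_box = rng.normal(size=(hidden, 4 * k))             # step 3, boxes

    scores = np.empty((height, width, 2 * k))
    boxes = np.empty((height, width, 4 * k))
    for i in range(height):
        for j in range(width):
            window = padded[i:i + n, j:j + n, :].reshape(-1)
            feat = np.maximum(window @ w_hidden, 0.0)  # lower-dimensional vector
            scores[i, j] = feat @ w_score
            boxes[i, j] = feat @ w_box
    return scores, boxes


scores, boxes = rpn_head(np.ones((5, 7, 4)))
print(scores.shape, boxes.shape)
```

Because the same weights are applied at every location, the two output layers are equivalent to 1 x 1 convolutions over the intermediate feature map.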
    
*Note*: Each anchor has a fixed size, but the (r, c, h, w) predicted for an anchor may deviate from the anchor itself.  
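
One common way to express this deviation is the parameterization from the Faster R-CNN paper: the network predicts offsets relative to the anchor's center and log-scale changes to its size. The sketch below assumes that parameterization and the (r, c, h, w) top-left convention used in these notes. The name `decode_boxes` is hypothetical, and the scale factors that some implementations apply to the offsets are left out.

```python
import numpy as np


def decode_boxes(anchors, deltas):
    """Turn anchors plus predicted offsets into boxes.

    anchors: (N, 4) array of (r, c, h, w), top-left corner plus size
    deltas:  (N, 4) array of (t_r, t_c, t_h, t_w)
    """
    r, c, h, w = anchors.T
    t_r, t_c, t_h, t_w = deltas.T

    # Shift the anchor center by a fraction of the anchor size.
    center_r = r + 0.5 * h + t_r * h
    center_c = c + 0.5 * w + t_c * w
    # Rescale height and width in log space.
    new_h = h * np.exp(t_h)
    new_w = w * np.exp(t_w)

    return np.stack(
        [center_r - 0.5 * new_h, center_c - 0.5 * new_w, new_h, new_w], axis=1)


anchors = np.array([[0.0, 0.0, 10.0, 20.0]])
print(decode_boxes(anchors, np.zeros((1, 4))))  # zero offsets give back the anchor
```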

### Implementing Faster RCNN with Provided Libraries

#### FasterRCNN Maps

1. Preprocess
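2. Convolutional network that produces feature maps
    - can use a standard ImageNet-pretrained backbone (e.g. Inception, ResNet)
    - Input: Image batch $\in \mathbb{R}^{N \times H \times W \times C}$
    - Output: Feature map $\in \mathbb{R}^{N \times H' \times W' \times D}$, where $H'$ and $W'$ are the downsampled spatial dimensions and $D$ is the feature depth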
3. RPN Network

In [11]:
import tensorflow as tf
import object_detection
from object_detection.core import model, box_list, box_list_ops
from nets import resnet_utils, resnet_v1
from object_detection.utils import ops

# TF-Slim ships with TensorFlow 1.x as tf.contrib.slim
slim = tf.contrib.slim

# from object_detection.model_lib import create_estimator_and_inputs, create_train_and_eval_specs


In [3]:
from object_detection.predictors import convolutional_box_predictor
from object_detection.predictors.heads import box_head, class_head

### Notes

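This library is the TensorFlow Object Detection API from Google's TensorFlow research models repository.  Each architecture is defined by a MetaArch class.  For detection models, the MetaArch (a subclass of `DetectionModel`) implements the following methods:

1. preprocess
2. predict
3. loss

such that the training pipeline moves from 1 -> 2 -> 3.  `DetectionModel` also declares `postprocess` and `restore_map` as abstract methods; they are not covered in this note.  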
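
The call order can be illustrated with a toy stand-in. Everything in this sketch is hypothetical: `ToyDetectionModel` and `train_step` are not part of the Object Detection API, and the "model" just averages pixel values. It only shows how the output of each stage feeds the next.

```python
class ToyDetectionModel:
    """Hypothetical stand-in that mimics only the call order, not the library API."""

    def preprocess(self, images):
        # Scale pixel values to [0, 1]; a real model resizes and normalizes here.
        return [[pixel / 255.0 for pixel in image] for image in images]

    def predict(self, preprocessed):
        # One "score" per image: the mean pixel value.
        return {"scores": [sum(image) / len(image) for image in preprocessed]}

    def loss(self, prediction_dict, targets):
        # Mean squared error between the scores and the targets.
        scores = prediction_dict["scores"]
        return sum((s - t) ** 2 for s, t in zip(scores, targets)) / len(scores)


def train_step(model, images, targets):
    preprocessed = model.preprocess(images)         # 1. preprocess
    prediction_dict = model.predict(preprocessed)   # 2. predict
    return model.loss(prediction_dict, targets)     # 3. loss


print(train_step(ToyDetectionModel(), [[255, 255]], [0.0]))
```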

### Parameters, Hyperparameters, and Inputs

This note focuses on model building rather than input building.  For more information, see this [tutorial for data input]().

In [4]:
# First we will define a .config file in the project folder
pipeline_config_path = "faster_rcnn.config"

# The following library is from object detection of tensorflow research

from object_detection.utils import config_util
from object_detection import inputs

get_configs_from_pipeline = config_util.get_configs_from_pipeline_file

merge_external_params_with_configs = config_util.merge_external_params_with_configs
create_train_input_fn = inputs.create_train_input_fn
create_eval_input_fn = inputs.create_eval_input_fn
create_predict_input_fn = inputs.create_predict_input_fn

In [5]:
configs = get_configs_from_pipeline(pipeline_config_path)

model_config = configs['model']
train_config = configs['train_config']
train_input_config = configs['train_input_config']
eval_config = configs['eval_config']
eval_input_config = configs['eval_input_config']

In [6]:
train_input_fn = create_train_input_fn(
    train_config=train_config,
    train_input_config=train_input_config,
    model_config=model_config)

### Preprocessing

In [7]:
# Build image_resizer function
from object_detection.builders import image_resizer_builder

image_resizer = image_resizer_builder.build(model_config.faster_rcnn.image_resizer)

### Testing Pipeline

In [10]:
# # run_config 
# config = tf.estimator.RunConfig(model_dir="model/")

# # Hparams
# hparams = model_hparams.create_hparams()
# # pipeline_config_path
# pipeline_config = "faster_rcnn.config"
# # train_steps
# train = 100
# # eval_steps
# eval_steps = 100

# train_and_eval_dict = create_estimator_and_inputs(
#     run_config =config,
#     hparams=hparams,
#     pipeline_config_path = pipeline_config,
#     train_steps = train,
#     eval_steps = eval_steps)

### Feature Map Generator

In [None]:
def conv_feature_maps(inputs, scope, args):
    """
    Given an input (image batch) of shape [Batch, Height, Width, Channels],
    return a feature map of shape [Batch, Height', Width', Depth], where
    Height' and Width' are the spatial dimensions after downsampling.
    """
    IMAGE_MIN_HEIGHT = 33
    IMAGE_MIN_WIDTH = 33
    
    shape = inputs.get_shape()
    if len(shape.as_list()) != 4:
        raise ValueError('`inputs` must be 4 dimensional, got a '
                         'tensor of shape %s' % shape)

    # Check at run time that the image is large enough for the network.
    height, width = tf.shape(inputs)[1], tf.shape(inputs)[2]
    shape_assert = tf.Assert(
        tf.logical_and(tf.greater_equal(height, IMAGE_MIN_HEIGHT),
                       tf.greater_equal(width, IMAGE_MIN_WIDTH)),
        ['image size must be at least %d in both height and width.' % IMAGE_MIN_HEIGHT])

    resnet_scope = resnet_utils.resnet_arg_scope(
                batch_norm_epsilon=1e-5,
                batch_norm_scale=True,
                weight_decay=args._weight_decay)
    
    with tf.control_dependencies([shape_assert]):
        with slim.arg_scope(resnet_scope):
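            with tf.variable_scope(args.architecture, reuse=args.reuse_weights) as var_scope:
                # `resnet_v1` is a module, so look up the network function by name.
                # Assumes args.architecture is a function name such as 'resnet_v1_101'.
                resnet_model = getattr(resnet_v1, args.architecture)
                _, activations = resnet_model(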
                  inputs,
                  num_classes=None,
                  is_training=args.train_batch_norm,
                  global_pool=False,
                  output_stride=args.first_stage_features_stride,
                  spatial_squeeze=False,
                  scope=var_scope)

    handle = scope + '/%s/block3' % args.architecture
    return activations[handle]


def anchor_generator(map_height, map_width):
    scales = (0.5, 1.0, 2.0)
    aspect_ratios = (0.5, 1.0, 2.0)
    base_anchor_size = (256, 256)
    anchor_stride = (16, 16)
    anchor_offset = (0, 0)
    
    scales_grid, aspect_ratios_grid = ops.meshgrid(scales, aspect_ratios)
    # scales_grid:  [...[0.5, 1.0, 2.0]...] (think for each aspect_ratio,
    # we have a scale list)
    scales = tf.reshape(scales_grid, [-1])
    aspect_ratios = tf.reshape(aspect_ratios_grid, [-1])
    
    ratio_sqrts = tf.sqrt(aspect_ratios)
    heights = scales / ratio_sqrts * base_anchor_size[0]
    widths = scales * ratio_sqrts * base_anchor_size[1]

    # Get a grid of box centers
    y_centers = tf.to_float(tf.range(map_height))
    y_centers = y_centers * anchor_stride[0] + anchor_offset[0]
    x_centers = tf.to_float(tf.range(map_width))
    x_centers = x_centers * anchor_stride[1] + anchor_offset[1]
    x_centers, y_centers = ops.meshgrid(x_centers, y_centers)
    
    widths_grid, x_centers_grid = ops.meshgrid(widths, x_centers)
    heights_grid, y_centers_grid = ops.meshgrid(heights, y_centers)
    
    bbox_centers = tf.stack([y_centers_grid, x_centers_grid], axis=3)
    bbox_sizes = tf.stack([heights_grid, widths_grid], axis=3)
    centers = tf.reshape(bbox_centers, [-1, 2])
    sizes = tf.reshape(bbox_sizes, [-1, 2])
    
    
    bbox_corners = tf.concat([centers - .5 * sizes, centers + .5 * sizes], 1)
    anchors = box_list.BoxList(bbox_corners)
    
    anchor_indices = tf.zeros([anchors.num_boxes()])
    anchors.add_field("feature_map_index", anchor_indices)
    return anchors
    
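def extract_rpn_features(feature_map_input, feature_map_shape,
                         first_stage_box_depth=512, kernel_size=3, atrous_rate=1):
    # Convolve the shared feature map to get the features used for box
    # prediction.  The default depth, kernel size and atrous rate are assumed
    # values; `feature_map_shape` is currently unused.
    predict_feature = slim.conv2d(feature_map_input,
               first_stage_box_depth,
               kernel_size=[kernel_size, kernel_size],
               rate=atrous_rate,
               activation_fn=tf.nn.relu6)
    return predict_feature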
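
The anchor layout that `anchor_generator` builds can be checked outside TensorFlow. The following is a NumPy sketch under the same constants (3 scales, 3 aspect ratios, base size 256, stride 16, zero offset). The name `make_anchors` and the use of broadcasting instead of `ops.meshgrid` are choices made for this example.

```python
import numpy as np


def make_anchors(map_height, map_width, scales=(0.5, 1.0, 2.0),
                 aspect_ratios=(0.5, 1.0, 2.0), base_size=256, stride=16):
    """Return anchors as a (map_height * map_width * k, 4) array of
    (ymin, xmin, ymax, xmax), with k = len(scales) * len(aspect_ratios)."""
    # Every (scale, aspect ratio) combination gives one anchor shape.
    scale_grid, ratio_grid = np.meshgrid(scales, aspect_ratios)
    ratio_sqrts = np.sqrt(ratio_grid.ravel())
    heights = scale_grid.ravel() / ratio_sqrts * base_size
    widths = scale_grid.ravel() * ratio_sqrts * base_size

    # One center per feature-map cell, spaced `stride` pixels apart.
    y_centers = np.arange(map_height, dtype=float) * stride
    x_centers = np.arange(map_width, dtype=float) * stride

    shape = (map_height, map_width, heights.size)
    cy = np.broadcast_to(y_centers[:, None, None], shape)
    cx = np.broadcast_to(x_centers[None, :, None], shape)
    hh = np.broadcast_to(heights, shape)
    ww = np.broadcast_to(widths, shape)

    corners = np.stack(
        [cy - hh / 2, cx - ww / 2, cy + hh / 2, cx + ww / 2], axis=-1)
    return corners.reshape(-1, 4)


anchors = make_anchors(2, 3)
print(anchors.shape)  # 2 * 3 locations, 9 anchors each
```

Anchors near the border extend outside the image, which is why they are usually clipped to the image window before use.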

### Box Prediction

In [9]:
def build_box_predictor(is_training,
                        num_classes,
                        predictor_config,
                        conv_hyperparams_fn
                       ):
    
    box_prediction_head = box_head.ConvolutionalBoxHead(
        is_training=is_training,
        box_code_size=predictor_config.box_code_size,
        kernel_size=predictor_config.kernel_size,
        use_depthwise=True)
    class_prediction_head = class_head.ConvolutionalClassHead(
        is_training=is_training,
        num_classes=num_classes,
        use_dropout=False,
        dropout_keep_prob=predictor_config.dropout_keep_prob,
        kernel_size=predictor_config.kernel_size,
        apply_sigmoid_to_scores=True,
        class_prediction_bias_init=predictor_config.class_prediction_bias_init,
        use_depthwise=True)
    return convolutional_box_predictor.ConvolutionalBoxPredictor(
      is_training=is_training,
      num_classes=num_classes,
      box_prediction_head=box_prediction_head,
      class_prediction_head=class_prediction_head,
      other_heads={},
      conv_hyperparams_fn=conv_hyperparams_fn,
      num_layers_before_predictor=predictor_config.num_layers_before_predictor,
      min_depth=predictor_config.min_depth,
      max_depth=predictor_config.max_depth)

In [None]:
from object_detection.utils import shape_utils

class FasterRCNNMetaArch(model.DetectionModel):
    """
    This is a simpler implementation of Faster RCNN using DetectionModel
    Architecture 
       
    (From DetectionModel documentation):
    
    Training process - 
        input -> preprocess (need to implement) 
        -> predict (need to implement) 
        -> loss (need to implement)
        -> output
    """
    
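    def __init__(self, num_classes, resizer, model_config,
                 is_training=True, predictor_config=None,
                 conv_hyperparams_fn=None, parallel_iterations=16):
        super(FasterRCNNMetaArch, self).__init__(num_classes=num_classes)
        # Stored under the names that `preprocess` and `predict` read.
        # The keyword arguments and their defaults are assumptions.
        self._image_resizer_fn = resizer
        self.model_config = model_config
        self._is_training = is_training
        self._predictor_config = predictor_config
        self._conv_hyperparams_fn = conv_hyperparams_fn
        self._parallel_iterations = parallel_iterations


    def preprocess(self, inputs):
        if inputs.dtype is not tf.float32:
            raise ValueError('`preprocess` expects a tf.float32 tensor')

        with tf.name_scope('Preprocessor'):
            # Resize every image; returns the resized images and their true shapes.
            outputs = shape_utils.static_or_dynamic_map_fn(
                self._image_resizer_fn, elems=inputs, dtype=[tf.float32, tf.int32],
                parallel_iterations=self._parallel_iterations)

        resized_inputs = outputs[0]
        true_image_shapes = outputs[1]
        # This class has no feature extractor object, so apply the VGG-style
        # channel mean subtraction that ResNet v1 expects directly.
        channel_means = [123.68, 116.779, 103.939]
        return (resized_inputs - [[channel_means]],
              true_image_shapes)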
    
    def predict(self, preprocessed_inputs, true_image_shapes):
        """    
        Args:
          preprocessed_inputs: a [batch, height, width, channels] float tensor
            representing a batch of images.
          true_image_shapes: int32 tensor of shape [batch, 3] where each row is
            of the form [height, width, channels] indicating the shapes
            of true images in the resized images, as resized images can be padded
            with zeros.

        Returns:
          prediction_dict: a dictionary holding "raw" prediction tensors:
            1) rpn_box_predictor_features: A 4-D float32 tensor with shape
              [batch_size, height, width, depth] to be used for predicting proposal
              boxes and corresponding objectness scores.
            2) rpn_features_to_crop: A 4-D float32 tensor with shape
              [batch_size, height, width, depth] representing image features to crop
              using the proposal boxes predicted by the RPN.
            3) image_shape: a 1-D tensor of shape [4] representing the input
              image shape.
            4) rpn_box_encodings:  3-D float tensor of shape
              [batch_size, num_anchors, self._box_coder.code_size] containing
              predicted boxes.
            5) rpn_objectness_predictions_with_background: 3-D float tensor of shape
              [batch_size, num_anchors, 2] containing class
              predictions (logits) for each of the anchors.  Note that this
              tensor *includes* background class predictions (at class index 0).
            6) anchors: A 2-D tensor of shape [num_anchors, 4] representing anchors
              for the first stage RPN (in absolute coordinates).  Note that
              `num_anchors` can differ depending on whether the model is created in
              training or inference mode.

        Raises:
          ValueError: If `predict` is called before `preprocess`.
        """ 
    
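        # ResNet block3 features, shared by the RPN and the RoI stage.  The scope
        # name is an assumption; conv_feature_maps uses it to build the end-point key.
        scope = 'FirstStageFeatureExtractor'
        with tf.variable_scope(scope):
            rpn_features_to_crop = conv_feature_maps(
                preprocessed_inputs, scope, self.model_config)

        image_shape = tf.shape(preprocessed_inputs)
        feature_map_shape = tf.shape(rpn_features_to_crop)

        # Generate anchors; in NHWC layout, height and width are axes 1 and 2.
        anchors = anchor_generator(feature_map_shape[1], feature_map_shape[2])
        clip_window = tf.to_float(tf.stack([0, 0, image_shape[1], image_shape[2]]))
        self._anchors = box_list_ops.clip_to_window(
            anchors, clip_window, filter_nonoverlapping=False)

        # At this point, we have two feature maps to look at.  One feeds the
        # prediction network, where for each location we predict, for every
        # anchor, an objectness score and a box encoding.
        rpn_box_predictor_features = extract_rpn_features(
            rpn_features_to_crop, feature_map_shape)

        # The RPN only separates object from background, hence num_classes=1.
        # self._is_training, self._predictor_config and self._conv_hyperparams_fn
        # are assumed to be set in __init__.
        box_predictor = build_box_predictor(self._is_training,
                        1,
                        self._predictor_config,
                        self._conv_hyperparams_fn)

        # 3 scales x 3 aspect ratios in anchor_generator
        num_anchors_per_location = [9]
        box_predict = box_predictor.predict(
            [rpn_box_predictor_features], num_anchors_per_location)
        # The predictor returns one tensor per input feature map; the keys are
        # the constants defined in object_detection.core.box_predictor.
        rpn_box_encodings = tf.squeeze(
            tf.concat(box_predict['box_encodings'], axis=1), axis=2)
        rpn_objectness_predictions_with_background = tf.concat(
            box_predict['class_predictions_with_background'], axis=1)

        prediction_dict = {
            'rpn_box_predictor_features': rpn_box_predictor_features,
            'rpn_features_to_crop': rpn_features_to_crop,
            'image_shape': image_shape,
            'rpn_box_encodings': rpn_box_encodings,
            'rpn_objectness_predictions_with_background':
            rpn_objectness_predictions_with_background,
            'anchors': self._anchors.get()
        }

        return prediction_dict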
    
    def loss(self, prediction_dict, true_image_shapes, scope=None):
        """Compute scalar loss tensors given prediction tensors.

        If number_of_stages=1, only RPN related losses are computed (i.e.,
        `rpn_localization_loss` and `rpn_objectness_loss`).  Otherwise all
        losses are computed.

        Args:
          prediction_dict: a dictionary holding prediction tensors (see the
            documentation for the predict method.  If number_of_stages=1, we
            expect prediction_dict to contain `rpn_box_encodings`,
            `rpn_objectness_predictions_with_background`, `rpn_features_to_crop`,
            `image_shape`, and `anchors` fields.  Otherwise we expect
            prediction_dict to additionally contain `refined_box_encodings`,
            `class_predictions_with_background`, `num_proposals`, and
            `proposal_boxes` fields.
          true_image_shapes: int32 tensor of shape [batch, 3] where each row is
            of the form [height, width, channels] indicating the shapes
            of true images in the resized images, as resized images can be padded
            with zeros.
          scope: Optional scope name.

        Returns:
          a dictionary mapping loss keys (`first_stage_localization_loss`,
            `first_stage_objectness_loss`, 'second_stage_localization_loss',
            'second_stage_classification_loss') to scalar tensors representing
            corresponding loss values.
        """

        with tf.name_scope(scope, "Loss", prediction_dict.values()):

          (groundtruth_boxlists, groundtruth_classes_with_background_list,
           groundtruth_masks_list, groundtruth_weights_list
          ) = self._format_groundtruth_data(true_image_shapes)
          loss_dict = self._loss_rpn(
              prediction_dict['rpn_box_encodings'],
              prediction_dict['rpn_objectness_predictions_with_background'],
              prediction_dict['anchors'], groundtruth_boxlists,
              groundtruth_classes_with_background_list, groundtruth_weights_list)

        return loss_dict        
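
The `loss` method above delegates to helpers that are not defined in this notebook. As a rough guide to what an RPN loss computes, here is a NumPy sketch of the loss described in the Faster R-CNN paper: a two-class log loss on objectness plus a smooth L1 loss on the box encodings of positive anchors. The function names, the dictionary keys, and the normalization by the number of positive anchors are assumptions for this example, not the library's `_loss_rpn`.

```python
import numpy as np


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)


def rpn_loss(objectness_logits, box_encodings, labels, target_encodings, lam=1.0):
    """objectness_logits: (N, 2) as [background, object]
    box_encodings, target_encodings: (N, 4)
    labels: (N,) integer array, 1 for object anchors and 0 for background
    """
    # Softmax cross-entropy over the two objectness classes.
    shifted = objectness_logits - objectness_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    objectness_loss = -log_probs[np.arange(len(labels)), labels].mean()

    # Box regression only counts for positive (object) anchors.
    positive = labels == 1
    per_anchor = smooth_l1(box_encodings - target_encodings).sum(axis=1)
    localization_loss = (per_anchor * positive).sum() / max(positive.sum(), 1)

    return {"objectness_loss": objectness_loss,
            "localization_loss": lam * localization_loss}


losses = rpn_loss(
    objectness_logits=np.zeros((2, 2)),
    box_encodings=np.array([[0.0, 0.0, 0.0, 0.0], [0.5, 2.0, 0.0, 0.0]]),
    labels=np.array([0, 1]),
    target_encodings=np.zeros((2, 4)))
print(losses)
```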

In [None]:
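# DetectionModel is abstract: instantiation also requires `postprocess` and
# `restore_map` to be implemented on the class above.
faster_rcnn_model = FasterRCNNMetaArch(
    num_classes=model_config.faster_rcnn.num_classes,
    resizer=image_resizer,
    model_config=model_config)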

## Mask RCNN

### Input

Input for Mask RCNN is an image.  As in Faster RCNN, no predefined region proposals are needed as input.  

We will treat each image as $x_i \in \mathbb{R}^{H \times W \times C}$