From b886f7431ea01343c46845f7b4d8ae098fff6e73 Mon Sep 17 00:00:00 2001
From: Francisco Massa
Date: Tue, 21 May 2019 13:48:18 +0200
Subject: [PATCH] Add detection and segmentation models to doc folder

---
 docs/source/models.rst                        | 157 +++++++++++++++++-
 torchvision/models/detection/faster_rcnn.py   |  44 ++++-
 torchvision/models/detection/keypoint_rcnn.py |  53 ++++--
 torchvision/models/detection/mask_rcnn.py     |  56 +++++--
 4 files changed, 279 insertions(+), 31 deletions(-)

diff --git a/docs/source/models.rst b/docs/source/models.rst
index 7d4f568cb99..d7d3a359f50 100644
--- a/docs/source/models.rst
+++ b/docs/source/models.rst
@@ -1,8 +1,18 @@
 torchvision.models
-==================
+##################
+
+
+The models subpackage contains definitions of models for addressing
+different tasks, including image classification, pixelwise semantic
+segmentation, object detection, instance segmentation and person
+keypoint detection.
+
+
+Classification
+==============
 
 The models subpackage contains definitions for the following model
-architectures:
+architectures for image classification:
 
 - `AlexNet`_
 - `VGG`_
@@ -182,8 +192,149 @@ MobileNet v2
 .. autofunction:: mobilenet_v2
 
 ResNext
--------------
+-------
 
 .. autofunction:: resnext50_32x4d
 .. autofunction:: resnext101_32x8d
+
+Semantic Segmentation
+=====================
+
+As with image classification models, all pre-trained models expect input images normalized in the same way.
+The images have to be loaded into a range of ``[0, 1]`` and then normalized using
+``mean = [0.485, 0.456, 0.406]`` and ``std = [0.229, 0.224, 0.225]``.
+They have been trained on images resized such that their minimum size is 520.
+
+The pre-trained models have been trained on a subset of COCO train2017, on the 20 categories that are
+present in the Pascal VOC dataset. You can see more information on how the subset has been selected in
+``references/segmentation/coco_utils.py``. The classes that the pre-trained model outputs are the following,
+in order:
+
+  .. code-block:: python
+
+      ['__background__', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
+       'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
+       'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
+
+The accuracies of the pre-trained models evaluated on COCO val2017 are as follows:
+
+================================ ============= ====================
+Network                          mean IoU      global pixelwise acc
+================================ ============= ====================
+FCN ResNet101                    63.7          91.9
+DeepLabV3 ResNet101              67.4          92.4
+================================ ============= ====================
+
+
+Fully Convolutional Networks
+----------------------------
+
+.. autofunction:: torchvision.models.segmentation.fcn_resnet50
+.. autofunction:: torchvision.models.segmentation.fcn_resnet101
+
+
+DeepLabV3
+---------
+
+.. autofunction:: torchvision.models.segmentation.deeplabv3_resnet50
+.. autofunction:: torchvision.models.segmentation.deeplabv3_resnet101
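+
+As an illustration, here is a minimal inference sketch for one of the
+segmentation models. It is a hypothetical usage example, assuming a
+pre-trained ``fcn_resnet101`` and an already loaded PIL image ``img``;
+the normalization values are the ones listed above.
+
+  .. code-block:: python
+
+      import torch
+      import torchvision
+      from torchvision import transforms
+
+      model = torchvision.models.segmentation.fcn_resnet101(pretrained=True)
+      model.eval()
+
+      # Load the image into [0, 1] and normalize as described above
+      preprocess = transforms.Compose([
+          transforms.ToTensor(),
+          transforms.Normalize(mean=[0.485, 0.456, 0.406],
+                               std=[0.229, 0.224, 0.225]),
+      ])
+
+      batch = preprocess(img).unsqueeze(0)  # img: a PIL image (assumed)
+      with torch.no_grad():
+          out = model(batch)['out']         # [1, 21, H, W] per-class scores
+      pred = out.argmax(1)                  # [1, H, W] per-pixel class indices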
+
+
+Object Detection, Instance Segmentation and Person Keypoint Detection
+=====================================================================
+
+The pre-trained models for detection, instance segmentation and
+keypoint detection are initialized with the classification models
+in torchvision.
+
+The models expect a list of ``Tensor[C, H, W]``, in the range ``0-1``.
+The models internally resize the images so that they have a minimum size
+of ``800``. This can be changed by passing the ``min_size`` option
+to the constructor of the models.
+
+
+For object detection and instance segmentation, the pre-trained
+models return predictions for the following classes:
+
+  .. code-block:: python
+
+      COCO_INSTANCE_CATEGORY_NAMES = [
+          '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
+          'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
+          'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
+          'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
+          'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
+          'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
+          'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
+          'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
+          'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
+          'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard',
+          'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
+          'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
+      ]
+
+
+Here is a summary of the accuracies for the models trained on
+the instances set of COCO train2017 and evaluated on COCO val2017:
+
+================================ ======= ======== ===========
+Network                          box AP  mask AP  keypoint AP
+================================ ======= ======== ===========
+Faster R-CNN ResNet-50 FPN       37.0    -        -
+Mask R-CNN ResNet-50 FPN         37.9    34.6     -
+================================ ======= ======== ===========
+
+For person keypoint detection, the accuracies for the pre-trained
+models are as follows:
+
+================================ ======= ======== ===========
+Network                          box AP  mask AP  keypoint AP
+================================ ======= ======== ===========
+Keypoint R-CNN ResNet-50 FPN     54.6    -        65.0
+================================ ======= ======== ===========
+
+For person keypoint detection, the pre-trained models return the
+keypoints in the following order:
+
+  .. code-block:: python
+
+      COCO_PERSON_KEYPOINT_NAMES = [
+          'nose',
+          'left_eye',
+          'right_eye',
+          'left_ear',
+          'right_ear',
+          'left_shoulder',
+          'right_shoulder',
+          'left_elbow',
+          'right_elbow',
+          'left_wrist',
+          'right_wrist',
+          'left_hip',
+          'right_hip',
+          'left_knee',
+          'right_knee',
+          'left_ankle',
+          'right_ankle'
+      ]
+
+
+Faster R-CNN
+------------
+
+.. autofunction:: torchvision.models.detection.fasterrcnn_resnet50_fpn
+
+
+Mask R-CNN
+----------
+
+.. autofunction:: torchvision.models.detection.maskrcnn_resnet50_fpn
+
+
+Keypoint R-CNN
+--------------
+
+.. autofunction:: torchvision.models.detection.keypointrcnn_resnet50_fpn
+
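+To make the label mapping concrete, here is a minimal detection sketch.
+It is illustrative only: it assumes the ``COCO_INSTANCE_CATEGORY_NAMES``
+list defined above and uses random tensors as stand-ins for real images.
+
+  .. code-block:: python
+
+      import torch
+      import torchvision
+
+      model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
+      model.eval()
+
+      images = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
+      with torch.no_grad():
+          predictions = model(images)
+
+      # Each prediction is a Dict with 'boxes', 'labels' and 'scores';
+      # the label indices map into COCO_INSTANCE_CATEGORY_NAMES.
+      for pred in predictions:
+          for label, score in zip(pred['labels'], pred['scores']):
+              if score > 0.5:
+                  print(COCO_INSTANCE_CATEGORY_NAMES[int(label)], float(score))
+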
diff --git a/torchvision/models/detection/faster_rcnn.py b/torchvision/models/detection/faster_rcnn.py
index 2d519b332a9..9df5428ecab 100644
--- a/torchvision/models/detection/faster_rcnn.py
+++ b/torchvision/models/detection/faster_rcnn.py
@@ -32,19 +32,20 @@ class FasterRCNN(GeneralizedRCNN):
 
     During training, the model expects both the input tensors, as well as a targets dictionary,
     containing:
-        boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
-            between 0 and H and 0 and W
-        labels (Tensor[N]): the class label for each ground-truth box
+        - boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
+          between 0 and H and 0 and W
+        - labels (Tensor[N]): the class label for each ground-truth box
+
     The model returns a Dict[Tensor] during training, containing the classification and regression
     losses for both the RPN and the R-CNN.
 
     During inference, the model requires only the input tensors, and returns the post-processed
     predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
     follows:
-        boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
-            0 and H and 0 and W
-        labels (Tensor[N]): the predicted labels for each image
-        scores (Tensor[N]): the scores or each prediction
+        - boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
+          0 and H and 0 and W
+        - labels (Tensor[N]): the predicted labels for each image
+        - scores (Tensor[N]): the scores of each prediction
 
     Arguments:
         backbone (nn.Module): the network used to compute the features for the model.
@@ -257,6 +258,35 @@ def fasterrcnn_resnet50_fpn(pretrained=False, progress=True,
     """
     Constructs a Faster R-CNN model with a ResNet-50-FPN backbone.
 
+    The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
+    image, and should be in ``0-1`` range. Different images can have different sizes.
+
+    The behavior of the model changes depending on whether it is in training or evaluation mode.
+
+    During training, the model expects both the input tensors and a targets dictionary,
+    containing:
+        - boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
+          between ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the class label for each ground-truth box
+
+    The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
+    losses for both the RPN and the R-CNN.
+
+    During inference, the model requires only the input tensors, and returns the post-processed
+    predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
+    follows:
+        - boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
+          ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the predicted labels for each image
+        - scores (``Tensor[N]``): the scores of each prediction
+
+    Example::
+
+        >>> import torch
+        >>> import torchvision
+        >>> model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
+        >>> model.eval()
+        >>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
+        >>> predictions = model(x)
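+
+    For completeness, here is a minimal training-mode sketch. It is
+    illustrative only: the boxes and labels below are made-up ground
+    truth for the two random images.
+
+        >>> model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
+        >>> model.train()
+        >>> images = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
+        >>> targets = [{'boxes': torch.tensor([[10., 20., 100., 200.]]),
+        ...             'labels': torch.tensor([1])},
+        ...            {'boxes': torch.tensor([[50., 60., 150., 160.]]),
+        ...             'labels': torch.tensor([18])}]
+        >>> loss_dict = model(images, targets)
+        >>> loss = sum(loss_dict.values())  # combine RPN and R-CNN losses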
+
     Arguments:
         pretrained (bool): If True, returns a model pre-trained on COCO train2017
         progress (bool): If True, displays a progress bar of the download to stderr

diff --git a/torchvision/models/detection/keypoint_rcnn.py b/torchvision/models/detection/keypoint_rcnn.py
index a3f23944bfd..9a950e3cc34 100644
--- a/torchvision/models/detection/keypoint_rcnn.py
+++ b/torchvision/models/detection/keypoint_rcnn.py
@@ -26,22 +26,23 @@ class KeypointRCNN(FasterRCNN):
 
     During training, the model expects both the input tensors, as well as a targets dictionary,
     containing:
-        boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
-            between 0 and H and 0 and W
-        labels (Tensor[N]): the class label for each ground-truth box
-        keypoints (Tensor[N, K, 3]): the K keypoints location for each of the N instances, in the
-            format [x, y, visibility], where visibility=0 means that the keypoint is not visible.
+        - boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
+          between 0 and H and 0 and W
+        - labels (Tensor[N]): the class label for each ground-truth box
+        - keypoints (Tensor[N, K, 3]): the K keypoint locations for each of the N instances, in the
+          format [x, y, visibility], where visibility=0 means that the keypoint is not visible.
+
     The model returns a Dict[Tensor] during training, containing the classification and regression
     losses for both the RPN and the R-CNN, and the keypoint loss.
 
     During inference, the model requires only the input tensors, and returns the post-processed
     predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
     follows:
-        boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
-            0 and H and 0 and W
-        labels (Tensor[N]): the predicted labels for each image
-        scores (Tensor[N]): the scores or each prediction
-        keypoints (Tensor[N, K, 3]): the locations of the predicted keypoints, in [x, y, v] format.
+        - boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
+          0 and H and 0 and W
+        - labels (Tensor[N]): the predicted labels for each image
+        - scores (Tensor[N]): the scores of each prediction
+        - keypoints (Tensor[N, K, 3]): the locations of the predicted keypoints, in [x, y, v] format.
 
     Arguments:
         backbone (nn.Module): the network used to compute the features for the model.
@@ -228,6 +229,38 @@ def keypointrcnn_resnet50_fpn(pretrained=False, progress=True,
     """
     Constructs a Keypoint R-CNN model with a ResNet-50-FPN backbone.
 
+    The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
+    image, and should be in ``0-1`` range. Different images can have different sizes.
+
+    The behavior of the model changes depending on whether it is in training or evaluation mode.
+
+    During training, the model expects both the input tensors and a targets dictionary,
+    containing:
+        - boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
+          between ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the class label for each ground-truth box
+        - keypoints (``Tensor[N, K, 3]``): the ``K`` keypoint locations for each of the ``N`` instances, in the
+          format ``[x, y, visibility]``, where ``visibility=0`` means that the keypoint is not visible.
+
+    The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
+    losses for both the RPN and the R-CNN, and the keypoint loss.
+
+    During inference, the model requires only the input tensors, and returns the post-processed
+    predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
+    follows:
+        - boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
+          ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the predicted labels for each image
+        - scores (``Tensor[N]``): the scores of each prediction
+        - keypoints (``Tensor[N, K, 3]``): the locations of the predicted keypoints, in ``[x, y, v]`` format.
+
+    Example::
+
+        >>> import torch
+        >>> import torchvision
+        >>> model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
+        >>> model.eval()
+        >>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
+        >>> predictions = model(x)
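+
+    As an illustrative follow-up (a sketch, not part of the formal API docs),
+    the predicted keypoints can be matched to the ``COCO_PERSON_KEYPOINT_NAMES``
+    list shown in the documentation above, assuming at least one person was
+    detected in the first image:
+
+        >>> kpts = predictions[0]['keypoints'][0]  # [17, 3] for the first detection
+        >>> for name, (x, y, v) in zip(COCO_PERSON_KEYPOINT_NAMES, kpts.tolist()):
+        ...     print(name, x, y, v)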
+
     Arguments:
         pretrained (bool): If True, returns a model pre-trained on COCO train2017
         progress (bool): If True, displays a progress bar of the download to stderr

diff --git a/torchvision/models/detection/mask_rcnn.py b/torchvision/models/detection/mask_rcnn.py
index e3c08a10226..7fb4b0445c7 100644
--- a/torchvision/models/detection/mask_rcnn.py
+++ b/torchvision/models/detection/mask_rcnn.py
@@ -28,23 +28,24 @@ class MaskRCNN(FasterRCNN):
 
     During training, the model expects both the input tensors, as well as a targets dictionary,
     containing:
-        boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
-            between 0 and H and 0 and W
-        labels (Tensor[N]): the class label for each ground-truth box
-        masks (Tensor[N, H, W]): the segmentation binary masks for each instance
+        - boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
+          between 0 and H and 0 and W
+        - labels (Tensor[N]): the class label for each ground-truth box
+        - masks (Tensor[N, H, W]): the segmentation binary masks for each instance
+
     The model returns a Dict[Tensor] during training, containing the classification and regression
     losses for both the RPN and the R-CNN, and the mask loss.
 
     During inference, the model requires only the input tensors, and returns the post-processed
     predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
     follows:
-        boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
-            0 and H and 0 and W
-        labels (Tensor[N]): the predicted labels for each image
-        scores (Tensor[N]): the scores or each prediction
-        mask (Tensor[N, H, W]): the predicted masks for each instance, in 0-1 range. In order to
-            obtain the final segmentation masks, the soft masks can be thresholded, generally
-            with a value of 0.5 (mask >= 0.5)
+        - boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
+          0 and H and 0 and W
+        - labels (Tensor[N]): the predicted labels for each image
+        - scores (Tensor[N]): the scores of each prediction
+        - masks (Tensor[N, H, W]): the predicted masks for each instance, in 0-1 range. In order to
+          obtain the final segmentation masks, the soft masks can be thresholded, generally
+          with a value of 0.5 (masks >= 0.5)
 
     Arguments:
         backbone (nn.Module): the network used to compute the features for the model.
@@ -226,6 +227,39 @@ def maskrcnn_resnet50_fpn(pretrained=False, progress=True,
     """
     Constructs a Mask R-CNN model with a ResNet-50-FPN backbone.
 
+    The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
+    image, and should be in ``0-1`` range. Different images can have different sizes.
+
+    The behavior of the model changes depending on whether it is in training or evaluation mode.
+
+    During training, the model expects both the input tensors and a targets dictionary,
+    containing:
+        - boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
+          between ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the class label for each ground-truth box
+        - masks (``Tensor[N, H, W]``): the segmentation binary masks for each instance
+
+    The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
+    losses for both the RPN and the R-CNN, and the mask loss.
+
+    During inference, the model requires only the input tensors, and returns the post-processed
+    predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
+    follows:
+        - boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
+          ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the predicted labels for each image
+        - scores (``Tensor[N]``): the scores of each prediction
+        - masks (``Tensor[N, H, W]``): the predicted masks for each instance, in ``0-1`` range. In order to
+          obtain the final segmentation masks, the soft masks can be thresholded, generally
+          with a value of 0.5 (``masks >= 0.5``)
+
+    Example::
+
+        >>> import torch
+        >>> import torchvision
+        >>> model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
+        >>> model.eval()
+        >>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
+        >>> predictions = model(x)
+
     Arguments:
         pretrained (bool): If True, returns a model pre-trained on COCO train2017
         progress (bool): If True, displays a progress bar of the download to stderr
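+
+    Building on the thresholding note above, here is a minimal
+    post-processing sketch (illustrative only; it reuses ``predictions``
+    from the Example above):
+
+        >>> soft_masks = predictions[0]['masks']  # soft masks with values in 0-1
+        >>> binary_masks = soft_masks >= 0.5      # final masks, thresholded at 0.5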