From b886f7431ea01343c46845f7b4d8ae098fff6e73 Mon Sep 17 00:00:00 2001
From: Francisco Massa
Date: Tue, 21 May 2019 13:48:18 +0200
Subject: [PATCH] Add detection and segmentation models to doc folder

---
 docs/source/models.rst                        | 157 +++++++++++++++++-
 torchvision/models/detection/faster_rcnn.py   |  44 ++++-
 torchvision/models/detection/keypoint_rcnn.py |  53 ++++--
 torchvision/models/detection/mask_rcnn.py     |  56 +++++--
 4 files changed, 279 insertions(+), 31 deletions(-)

diff --git a/docs/source/models.rst b/docs/source/models.rst
index 7d4f568cb99..d7d3a359f50 100644
--- a/docs/source/models.rst
+++ b/docs/source/models.rst
@@ -1,8 +1,18 @@
 torchvision.models
-==================
+##################
+
+
+The models subpackage contains definitions of models for addressing
+different tasks, including image classification, pixelwise semantic
+segmentation, object detection, instance segmentation and person
+keypoint detection.
+
+
+Classification
+==============
 
 The models subpackage contains definitions for the following model
-architectures:
+architectures for image classification:
 
 - `AlexNet`_
 - `VGG`_
@@ -182,8 +192,149 @@ MobileNet v2
 .. autofunction:: mobilenet_v2
 
 ResNext
--------------
+-------
 
 .. autofunction:: resnext50_32x4d
 .. autofunction:: resnext101_32x8d
+
+Semantic Segmentation
+=====================
+
+As with image classification models, all pre-trained models expect input images normalized in the same way.
+The images have to be loaded into a range of ``[0, 1]`` and then normalized using
+``mean = [0.485, 0.456, 0.406]`` and ``std = [0.229, 0.224, 0.225]``.
+They have been trained on images resized such that their minimum size is 520.
+
+The pre-trained models have been trained on a subset of COCO train2017, on the 20 categories that are
+present in the Pascal VOC dataset. You can see more information on how the subset has been selected in
+``references/segmentation/coco_utils.py``. The classes that the pre-trained model outputs are the following,
+in order:
+
+  .. code-block:: python
+
+      ['__background__', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
+       'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
+       'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
+
+The accuracies of the pre-trained models evaluated on COCO val2017 are as follows:
+
+================================ ============= ====================
+Network                          mean IoU      global pixelwise acc
+================================ ============= ====================
+FCN ResNet101                    63.7          91.9
+DeepLabV3 ResNet101              67.4          92.4
+================================ ============= ====================
+
+
+Fully Convolutional Networks
+----------------------------
+
+.. autofunction:: torchvision.models.segmentation.fcn_resnet50
+.. autofunction:: torchvision.models.segmentation.fcn_resnet101
+
+
+DeepLabV3
+---------
+
+.. autofunction:: torchvision.models.segmentation.deeplabv3_resnet50
+.. autofunction:: torchvision.models.segmentation.deeplabv3_resnet101
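+
+As an illustration, here is a minimal inference sketch for one of the
+segmentation models. It is a hypothetical usage example, assuming a
+pre-trained ``fcn_resnet101`` and an already loaded PIL image ``img``;
+the normalization values are the ones listed above.
+
+  .. code-block:: python
+
+      import torch
+      import torchvision
+      from torchvision import transforms
+
+      model = torchvision.models.segmentation.fcn_resnet101(pretrained=True)
+      model.eval()
+
+      # Load the image into [0, 1] and normalize as described above
+      preprocess = transforms.Compose([
+          transforms.ToTensor(),
+          transforms.Normalize(mean=[0.485, 0.456, 0.406],
+                               std=[0.229, 0.224, 0.225]),
+      ])
+
+      batch = preprocess(img).unsqueeze(0)  # img: a PIL image (assumed)
+      with torch.no_grad():
+          out = model(batch)['out']         # [1, 21, H, W] per-class scores
+      pred = out.argmax(1)                  # [1, H, W] per-pixel class indices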
+
+
+Object Detection, Instance Segmentation and Person Keypoint Detection
+=====================================================================
+
+The pre-trained models for detection, instance segmentation and
+keypoint detection are initialized with the classification models
+in torchvision.
+
+The models expect a list of ``Tensor[C, H, W]``, in the range ``0-1``.
+The models internally resize the images so that they have a minimum size
+of ``800``. This can be changed by passing the ``min_size`` option
+to the constructor of the models.
+
+
+For object detection and instance segmentation, the pre-trained
+models return predictions for the following classes:
+
+  .. code-block:: python
+
+      COCO_INSTANCE_CATEGORY_NAMES = [
+          '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
+          'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
+          'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
+          'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
+          'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
+          'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
+          'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
+          'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
+          'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
+          'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard',
+          'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
+          'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
+      ]
+
+
+Here is a summary of the accuracies for the models trained on
+the instances set of COCO train2017 and evaluated on COCO val2017:
+
+================================ ======= ======== ===========
+Network                          box AP  mask AP  keypoint AP
+================================ ======= ======== ===========
+Faster R-CNN ResNet-50 FPN       37.0    -        -
+Mask R-CNN ResNet-50 FPN         37.9    34.6     -
+================================ ======= ======== ===========
+
+For person keypoint detection, the accuracies for the pre-trained
+models are as follows:
+
+================================ ======= ======== ===========
+Network                          box AP  mask AP  keypoint AP
+================================ ======= ======== ===========
+Keypoint R-CNN ResNet-50 FPN     54.6    -        65.0
+================================ ======= ======== ===========
+
+For person keypoint detection, the pre-trained models return the
+keypoints in the following order:
+
+  .. code-block:: python
+
+      COCO_PERSON_KEYPOINT_NAMES = [
+          'nose',
+          'left_eye',
+          'right_eye',
+          'left_ear',
+          'right_ear',
+          'left_shoulder',
+          'right_shoulder',
+          'left_elbow',
+          'right_elbow',
+          'left_wrist',
+          'right_wrist',
+          'left_hip',
+          'right_hip',
+          'left_knee',
+          'right_knee',
+          'left_ankle',
+          'right_ankle'
+      ]
+
+
+Faster R-CNN
+------------
+
+.. autofunction:: torchvision.models.detection.fasterrcnn_resnet50_fpn
+
+
+Mask R-CNN
+----------
+
+.. autofunction:: torchvision.models.detection.maskrcnn_resnet50_fpn
+
+
+Keypoint R-CNN
+--------------
+
+.. autofunction:: torchvision.models.detection.keypointrcnn_resnet50_fpn
+
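+To make the label mapping concrete, here is a minimal detection sketch.
+It is illustrative only: it assumes the ``COCO_INSTANCE_CATEGORY_NAMES``
+list defined above and uses random tensors as stand-ins for real images.
+
+  .. code-block:: python
+
+      import torch
+      import torchvision
+
+      model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
+      model.eval()
+
+      images = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
+      with torch.no_grad():
+          predictions = model(images)
+
+      # Each prediction is a Dict with 'boxes', 'labels' and 'scores';
+      # the label indices map into COCO_INSTANCE_CATEGORY_NAMES.
+      for pred in predictions:
+          for label, score in zip(pred['labels'], pred['scores']):
+              if score > 0.5:
+                  print(COCO_INSTANCE_CATEGORY_NAMES[int(label)], float(score))
+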
diff --git a/torchvision/models/detection/faster_rcnn.py b/torchvision/models/detection/faster_rcnn.py
index 2d519b332a9..9df5428ecab 100644
--- a/torchvision/models/detection/faster_rcnn.py
+++ b/torchvision/models/detection/faster_rcnn.py
@@ -32,19 +32,20 @@ class FasterRCNN(GeneralizedRCNN):
 
     During training, the model expects both the input tensors, as well as a targets dictionary,
     containing:
-        boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
-            between 0 and H and 0 and W
-        labels (Tensor[N]): the class label for each ground-truth box
+        - boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
+          between 0 and H and 0 and W
+        - labels (Tensor[N]): the class label for each ground-truth box
+
     The model returns a Dict[Tensor] during training, containing the classification and regression
     losses for both the RPN and the R-CNN.
 
     During inference, the model requires only the input tensors, and returns the post-processed
     predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
     follows:
-        boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
-            0 and H and 0 and W
-        labels (Tensor[N]): the predicted labels for each image
-        scores (Tensor[N]): the scores or each prediction
+        - boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
+          0 and H and 0 and W
+        - labels (Tensor[N]): the predicted labels for each image
+        - scores (Tensor[N]): the scores of each prediction
 
     Arguments:
         backbone (nn.Module): the network used to compute the features for the model.
@@ -257,6 +258,35 @@ def fasterrcnn_resnet50_fpn(pretrained=False, progress=True,
     """
     Constructs a Faster R-CNN model with a ResNet-50-FPN backbone.
 
+    The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
+    image, and should be in ``0-1`` range. Different images can have different sizes.
+
+    The behavior of the model changes depending on whether it is in training or evaluation mode.
+
+    During training, the model expects both the input tensors and a targets dictionary,
+    containing:
+        - boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
+          between ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the class label for each ground-truth box
+
+    The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
+    losses for both the RPN and the R-CNN.
+
+    During inference, the model requires only the input tensors, and returns the post-processed
+    predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
+    follows:
+        - boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
+          ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the predicted labels for each image
+        - scores (``Tensor[N]``): the scores of each prediction
+
+    Example::
+
+        >>> import torch
+        >>> import torchvision
+        >>> model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
+        >>> model.eval()
+        >>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
+        >>> predictions = model(x)
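+
+    For completeness, here is a minimal training-mode sketch. It is
+    illustrative only: the boxes and labels below are made-up ground
+    truth for the two random images.
+
+        >>> model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
+        >>> model.train()
+        >>> images = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
+        >>> targets = [{'boxes': torch.tensor([[10., 20., 100., 200.]]),
+        ...             'labels': torch.tensor([1])},
+        ...            {'boxes': torch.tensor([[50., 60., 150., 160.]]),
+        ...             'labels': torch.tensor([18])}]
+        >>> loss_dict = model(images, targets)
+        >>> loss = sum(loss_dict.values())  # combine RPN and R-CNN losses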
+
     Arguments:
         pretrained (bool): If True, returns a model pre-trained on COCO train2017
         progress (bool): If True, displays a progress bar of the download to stderr

diff --git a/torchvision/models/detection/keypoint_rcnn.py b/torchvision/models/detection/keypoint_rcnn.py
index a3f23944bfd..9a950e3cc34 100644
--- a/torchvision/models/detection/keypoint_rcnn.py
+++ b/torchvision/models/detection/keypoint_rcnn.py
@@ -26,22 +26,23 @@ class KeypointRCNN(FasterRCNN):
 
     During training, the model expects both the input tensors, as well as a targets dictionary,
     containing:
-        boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
-            between 0 and H and 0 and W
-        labels (Tensor[N]): the class label for each ground-truth box
-        keypoints (Tensor[N, K, 3]): the K keypoints location for each of the N instances, in the
-            format [x, y, visibility], where visibility=0 means that the keypoint is not visible.
+        - boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
+          between 0 and H and 0 and W
+        - labels (Tensor[N]): the class label for each ground-truth box
+        - keypoints (Tensor[N, K, 3]): the K keypoint locations for each of the N instances, in the
+          format [x, y, visibility], where visibility=0 means that the keypoint is not visible.
+
     The model returns a Dict[Tensor] during training, containing the classification and regression
     losses for both the RPN and the R-CNN, and the keypoint loss.
 
     During inference, the model requires only the input tensors, and returns the post-processed
     predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
     follows:
-        boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
-            0 and H and 0 and W
-        labels (Tensor[N]): the predicted labels for each image
-        scores (Tensor[N]): the scores or each prediction
-        keypoints (Tensor[N, K, 3]): the locations of the predicted keypoints, in [x, y, v] format.
+        - boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
+          0 and H and 0 and W
+        - labels (Tensor[N]): the predicted labels for each image
+        - scores (Tensor[N]): the scores of each prediction
+        - keypoints (Tensor[N, K, 3]): the locations of the predicted keypoints, in [x, y, v] format.
 
     Arguments:
         backbone (nn.Module): the network used to compute the features for the model.
@@ -228,6 +229,38 @@ def keypointrcnn_resnet50_fpn(pretrained=False, progress=True,
     """
     Constructs a Keypoint R-CNN model with a ResNet-50-FPN backbone.
 
+    The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
+    image, and should be in ``0-1`` range. Different images can have different sizes.
+
+    The behavior of the model changes depending on whether it is in training or evaluation mode.
+
+    During training, the model expects both the input tensors and a targets dictionary,
+    containing:
+        - boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
+          between ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the class label for each ground-truth box
+        - keypoints (``Tensor[N, K, 3]``): the ``K`` keypoint locations for each of the ``N`` instances, in the
+          format ``[x, y, visibility]``, where ``visibility=0`` means that the keypoint is not visible.
+
+    The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
+    losses for both the RPN and the R-CNN, and the keypoint loss.
+
+    During inference, the model requires only the input tensors, and returns the post-processed
+    predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
+    follows:
+        - boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
+          ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the predicted labels for each image
+        - scores (``Tensor[N]``): the scores of each prediction
+        - keypoints (``Tensor[N, K, 3]``): the locations of the predicted keypoints, in ``[x, y, v]`` format.
+
+    Example::
+
+        >>> import torch
+        >>> import torchvision
+        >>> model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
+        >>> model.eval()
+        >>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
+        >>> predictions = model(x)
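+
+    As an illustrative follow-up (a sketch, not part of the formal API docs),
+    the predicted keypoints can be matched to the ``COCO_PERSON_KEYPOINT_NAMES``
+    list shown in the documentation above, assuming at least one person was
+    detected in the first image:
+
+        >>> kpts = predictions[0]['keypoints'][0]  # [17, 3] for the first detection
+        >>> for name, (x, y, v) in zip(COCO_PERSON_KEYPOINT_NAMES, kpts.tolist()):
+        ...     print(name, x, y, v)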
+
     Arguments:
         pretrained (bool): If True, returns a model pre-trained on COCO train2017
         progress (bool): If True, displays a progress bar of the download to stderr

diff --git a/torchvision/models/detection/mask_rcnn.py b/torchvision/models/detection/mask_rcnn.py
index e3c08a10226..7fb4b0445c7 100644
--- a/torchvision/models/detection/mask_rcnn.py
+++ b/torchvision/models/detection/mask_rcnn.py
@@ -28,23 +28,24 @@ class MaskRCNN(FasterRCNN):
 
     During training, the model expects both the input tensors, as well as a targets dictionary,
     containing:
-        boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
-            between 0 and H and 0 and W
-        labels (Tensor[N]): the class label for each ground-truth box
-        masks (Tensor[N, H, W]): the segmentation binary masks for each instance
+        - boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
+          between 0 and H and 0 and W
+        - labels (Tensor[N]): the class label for each ground-truth box
+        - masks (Tensor[N, H, W]): the segmentation binary masks for each instance
+
     The model returns a Dict[Tensor] during training, containing the classification and regression
     losses for both the RPN and the R-CNN, and the mask loss.
 
     During inference, the model requires only the input tensors, and returns the post-processed
     predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
     follows:
-        boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
-            0 and H and 0 and W
-        labels (Tensor[N]): the predicted labels for each image
-        scores (Tensor[N]): the scores or each prediction
-        mask (Tensor[N, H, W]): the predicted masks for each instance, in 0-1 range. In order to
-            obtain the final segmentation masks, the soft masks can be thresholded, generally
-            with a value of 0.5 (mask >= 0.5)
+        - boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
+          0 and H and 0 and W
+        - labels (Tensor[N]): the predicted labels for each image
+        - scores (Tensor[N]): the scores of each prediction
+        - masks (Tensor[N, H, W]): the predicted masks for each instance, in 0-1 range. In order to
+          obtain the final segmentation masks, the soft masks can be thresholded, generally
+          with a value of 0.5 (masks >= 0.5)
 
     Arguments:
         backbone (nn.Module): the network used to compute the features for the model.
@@ -226,6 +227,39 @@ def maskrcnn_resnet50_fpn(pretrained=False, progress=True,
     """
     Constructs a Mask R-CNN model with a ResNet-50-FPN backbone.
 
+    The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
+    image, and should be in ``0-1`` range. Different images can have different sizes.
+
+    The behavior of the model changes depending on whether it is in training or evaluation mode.
+
+    During training, the model expects both the input tensors and a targets dictionary,
+    containing:
+        - boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
+          between ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the class label for each ground-truth box
+        - masks (``Tensor[N, H, W]``): the segmentation binary masks for each instance
+
+    The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
+    losses for both the RPN and the R-CNN, and the mask loss.
+
+    During inference, the model requires only the input tensors, and returns the post-processed
+    predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
+    follows:
+        - boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
+          ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the predicted labels for each image
+        - scores (``Tensor[N]``): the scores of each prediction
+        - masks (``Tensor[N, H, W]``): the predicted masks for each instance, in ``0-1`` range. In order to
+          obtain the final segmentation masks, the soft masks can be thresholded, generally
+          with a value of 0.5 (``masks >= 0.5``)
+
+    Example::
+
+        >>> import torch
+        >>> import torchvision
+        >>> model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
+        >>> model.eval()
+        >>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
+        >>> predictions = model(x)
+
     Arguments:
         pretrained (bool): If True, returns a model pre-trained on COCO train2017
         progress (bool): If True, displays a progress bar of the download to stderr
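+
+    Building on the thresholding note above, here is a minimal
+    post-processing sketch (illustrative only; it reuses ``predictions``
+    from the Example above):
+
+        >>> soft_masks = predictions[0]['masks']  # soft masks with values in 0-1
+        >>> binary_masks = soft_masks >= 0.5      # final masks, thresholded at 0.5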