# TorchVision Object Detection Finetuning Tutorial
For this tutorial, we will finetune a pre-trained [Mask R-CNN](https://arxiv.org/abs/1703.06870) model on the [Penn-Fudan Database for Pedestrian Detection and Segmentation](https://www.cis.upenn.edu/~jshi/ped_html/). It contains 170 images with 345 instances of pedestrians, and we will use it to illustrate how to use the new features in torchvision to train an instance segmentation model on a custom dataset.

## Defining the Dataset
The reference scripts for training object detection, instance segmentation and person keypoint detection make it easy to add new custom datasets. The dataset should inherit from the standard ``torch.utils.data.Dataset`` class and implement ``__len__`` and ``__getitem__``.

The only specific requirement is that the dataset's ``__getitem__`` should return:

* image: a PIL Image of size ``(H, W)``
* target: a dict containing the following fields
    * ``boxes (FloatTensor[N, 4])``: the coordinates of the ``N`` bounding boxes in ``[x0, y0, x1, y1]`` format, ranging from 0 to W and 0 to H
    * ``labels (Int64Tensor[N])``: the label for each bounding box. 0 always represents the background class.
    * ``image_id (Int64Tensor[1])``: an image identifier. It should be unique across all the images in the dataset, and is used during evaluation
    * ``area (Tensor[N])``: The area of the bounding box. This is used during evaluation with the COCO metric, to separate the metric scores between small, medium and large boxes.
    * ``iscrowd (UInt8Tensor[N])``: instances with iscrowd=True will be ignored during evaluation.
    * (optionally) ``masks (UInt8Tensor[N, H, W])``: The segmentation masks for each one of the objects
    * (optionally) ``keypoints (FloatTensor[N, K, 3])``: For each one of the N objects, it contains the K keypoints in ``[x, y, visibility]`` format, defining the object. visibility=0 means that the keypoint is not visible. Note that for data augmentation, the notion of flipping a keypoint is dependent on the data representation, and you should probably adapt ``references/detection/transforms.py`` for your new keypoint representation

If your dataset returns the above fields, it will work for both training and evaluation, and evaluation will use the scripts from pycocotools.
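As an illustration of the fields listed above, here is a minimal sketch of a ``target`` dict for a single image. The box coordinates and the single foreground class are made-up values, not taken from the Penn-Fudan annotations, and the optional ``masks`` and ``keypoints`` fields are left out.

```python
import torch

# Hypothetical annotations for one image containing two objects.
# Each row is [x0, y0, x1, y1].
boxes = torch.tensor(
    [
        [10.0, 20.0, 110.0, 220.0],
        [300.0, 50.0, 400.0, 250.0],
    ],
    dtype=torch.float32,
)
num_objs = boxes.shape[0]

target = {
    "boxes": boxes,
    # A single foreground class, labelled 1 (0 is reserved for background).
    "labels": torch.ones((num_objs,), dtype=torch.int64),
    # Assumed identifier; it only needs to be unique within the dataset.
    "image_id": torch.tensor([0], dtype=torch.int64),
    # Box area: width * height.
    "area": (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]),
    # No instance is marked as a crowd.
    "iscrowd": torch.zeros((num_objs,), dtype=torch.uint8),
}
```

For instance segmentation, a ``masks`` entry of shape ``[N, H, W]`` would be added to the same dict.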

One note on the labels: the model considers class 0 as background. If your dataset does not contain the background class, you should not have 0 in your labels. For example, assuming you have just two classes, cat and dog, you can define 1 (not 0) to represent cats and 2 to represent dogs. So, for instance, if one of the images has both classes, your labels tensor should look like ``[1, 2]``.
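The cat and dog example can be sketched as a small mapping that starts at 1. The names ``CLASS_NAMES``, ``CLASS_TO_LABEL`` and ``encode_labels`` are hypothetical helpers introduced for this illustration; they are not part of torchvision.

```python
# Hypothetical class list; index 0 is deliberately left for background.
CLASS_NAMES = ["cat", "dog"]
CLASS_TO_LABEL = {name: index + 1 for index, name in enumerate(CLASS_NAMES)}


def encode_labels(names):
    """Map the class names found in one image to integer labels."""
    return [CLASS_TO_LABEL[name] for name in names]


# An image that contains one cat and one dog.
labels = encode_labels(["cat", "dog"])
print(labels)  # [1, 2]
```

The resulting list can then be converted with ``torch.as_tensor(labels, dtype=torch.int64)`` before it is stored in the target dict.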

Additionally, if you want to use aspect ratio grouping during training (so that each batch only contains images with similar aspect ratios), then it is recommended to also implement a ``get_height_and_width`` method, which returns the height and the width of the image. If this method is not provided, we query all elements of the dataset via ``__getitem__``, which loads each image into memory and is slower than a custom method.
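One way to provide this method is to keep the image sizes next to the other metadata, so they can be returned without decoding any image. The sketch below assumes the sizes are known up front; ``SizedDataset`` is a hypothetical class, and a blank in-memory image stands in for loading a file from disk.

```python
import torch
from PIL import Image


class SizedDataset(torch.utils.data.Dataset):
    """Hypothetical dataset that stores one (height, width) pair per image."""

    def __init__(self, sizes):
        self.sizes = sizes

    def __len__(self):
        return len(self.sizes)

    def __getitem__(self, idx):
        height, width = self.sizes[idx]
        # Placeholder for reading the real image; PIL expects (width, height).
        img = Image.new("RGB", (width, height))
        target = {}  # annotation fields omitted in this sketch
        return img, target

    def get_height_and_width(self, idx):
        # Cheap lookup: no image is loaded into memory.
        return self.sizes[idx]


dataset = SizedDataset([(480, 640), (300, 200)])
height, width = dataset.get_height_and_width(1)
```

Note that PIL reports ``img.size`` as ``(width, height)``, while this method returns the height first.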

### Writing a custom dataset for PennFudan
Let’s write a dataset for the PennFudan dataset. After downloading and extracting the zip file, we have the following folder structure: