Detectron2 Model Zoo and Baselines

Introduction

This file documents a large collection of baselines trained with detectron2 in Sep-Oct, 2019. The corresponding configurations for all models can be found under the configs/ directory. Unless otherwise noted, the following settings are used for all runs:

Common Settings

All models were trained on Big Basin servers with 8 NVIDIA V100 GPUs, with data-parallel sync SGD and a total minibatch size of 16 images.
All models were trained with CUDA 9.2, cuDNN 7.4.2 or 7.6.3 (the difference in speed is found to be negligible).
Training curves and other statistics can be found in the metrics file for each model.
The default settings are not directly comparable with Detectron. For example, our default training data augmentation uses scale jittering in addition to horizontal flipping.

For configs that are comparable to Detectron's settings, see Detectron1-Comparisons for accuracy comparison, and benchmarks for speed comparison.
Inference speed is measured by tools/train_net.py --eval-only, with batch size 1 in detectron2 directly. Actual deployment should generally be faster than the reported inference speed because of additional optimizations.
Training speed is averaged across the entire training run. We keep updating the speed numbers with the latest versions of detectron2, PyTorch, etc., so they may differ from the metrics file.
All COCO models were trained on train2017 and evaluated on val2017.
For Faster/Mask R-CNN, we provide baselines based on 3 different backbone combinations:
- FPN: Use a ResNet+FPN backbone with standard conv and FC heads for mask and box prediction, respectively. It obtains the best speed/accuracy tradeoff, but the other two are still useful for research.
- C4: Use a ResNet conv4 backbone with a conv5 head. This is the original baseline in the Faster R-CNN paper.
- DC5 (Dilated-C5): Use a ResNet conv5 backbone with dilations in conv5, and standard conv and FC heads for mask and box prediction, respectively. This is used by the Deformable ConvNet paper.
Most models are trained with the 3x schedule (~37 COCO epochs). Although 1x models are heavily under-trained, we provide some ResNet-50 models with the 1x (~12 COCO epochs) training schedule for comparison when doing quick research iteration.
The model id column is provided for ease of reference. So that the integrity of a downloaded file can be checked, every model on this page has its md5 prefix in its file name. Each model also comes with a metrics file with all the training statistics and evaluation curves.
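
The integrity check mentioned in the settings above can be sketched as follows. This is a minimal sketch, assuming the file name ends with an underscore followed by a hexadecimal prefix of the file's md5 digest (for example a hypothetical `model_final_a1b2c3.pkl`). The helper names `md5_prefix_from_name` and `verify_md5_prefix` are illustrative and are not part of detectron2.

```python
import hashlib
import re
from pathlib import Path


def md5_prefix_from_name(path):
    """Return the trailing hex token of the file stem (assumed naming scheme)."""
    match = re.search(r"_([0-9a-f]+)$", Path(path).stem)
    if match is None:
        raise ValueError(f"no md5 prefix found in file name: {path}")
    return match.group(1)


def verify_md5_prefix(path, chunk_size=1 << 20):
    """Check that the file's md5 digest starts with the prefix in its name."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large checkpoints do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().startswith(md5_prefix_from_name(path))
```

Calling `verify_md5_prefix(path)` on a downloaded checkpoint returns `True` when the digest matches the prefix in the name; a `False` result suggests a truncated or corrupted download, provided the naming assumption holds.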

ImageNet Pretrained Models

We provide backbone models pretrained on the ImageNet-1k dataset. These models are different from those provided in Detectron: we do not fuse BatchNorm into an affine layer.

R-50.pkl: converted copy of MSRA's original ResNet-50 model
R-101.pkl: converted copy of MSRA's original ResNet-101 model
X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB

Pretrained models in Detectron's format can still be used. For example:

X-152-32x8d-IN5k.pkl: ResNeXt-152-32x8d model trained on ImageNet-5k with Caffe2 at FB (see ResNeXt paper for details on ImageNet-5k).
R-50-GN.pkl: ResNet-50 with Group Normalization.
R-101-GN.pkl: ResNet-101 with Group Normalization.

License

All models available for download through this document are licensed under the Creative Commons Attribution-ShareAlike 3.0 license.

COCO Object Detection Baselines

Faster R-CNN:

Name	lr sched	train time (s/iter)	inference time (s/im)	train mem (GB)	box AP	model id	download
R50-C4	1x	0.551	0.110	4.8	35.7	137257644	model \| metrics
R50-DC5	1x	0.380	0.068	5.0	37.3	137847829	model \| metrics
R50-FPN	1x	0.210	0.055	3.0	37.9	137257794	model \| metrics
R50-C4	3x	0.543	0.110	4.8	38.4	137849393	model \| metrics
R50-DC5	3x	0.378	0.073	5.0	39.0	137849425	model \| metrics
R50-FPN	3x	0.209	0.047	3.0	40.2	137849458	model \| metrics
R101-C4	3x	0.619	0.149	5.9	41.1	138204752	model \| metrics
R101-DC5	3x	0.452	0.082	6.1	40.6	138204841	model \| metrics
R101-FPN	3x	0.286	0.063	4.1	42.0	137851257	model \| metrics
X101-FPN	3x	0.638	0.120	6.7	43.0	139173657	model \| metrics
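
As a worked example of how to read the timing columns, the sketch below converts three rows of the Faster R-CNN table into throughput figures. It assumes the total minibatch size of 16 images stated under Common Settings, so training throughput is 16 divided by the seconds per iteration, and inference throughput at batch size 1 is the reciprocal of the seconds per image. The function and variable names are illustrative only.

```python
IMAGES_PER_ITER = 16  # total minibatch size from Common Settings

# (name, schedule, train s/iter, inference s/im, box AP), copied from the table above
ROWS = [
    ("R50-C4", "3x", 0.543, 0.110, 38.4),
    ("R50-DC5", "3x", 0.378, 0.073, 39.0),
    ("R50-FPN", "3x", 0.209, 0.047, 40.2),
]


def throughput(train_s_per_iter, infer_s_per_im, images_per_iter=IMAGES_PER_ITER):
    """Return (training images/s, inference images/s)."""
    return images_per_iter / train_s_per_iter, 1.0 / infer_s_per_im


for name, sched, train_t, infer_t, ap in ROWS:
    train_ips, infer_ips = throughput(train_t, infer_t)
    print(f"{name} {sched}: {train_ips:.1f} img/s train, "
          f"{infer_ips:.1f} img/s inference, box AP {ap}")
```

Among these three rows, R50-FPN has both the highest throughput and the highest box AP, which is consistent with the speed/accuracy remark about FPN under Common Settings.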

RetinaNet:

Name	lr sched	train time (s/iter)	inference time (s/im)	train mem (GB)	box AP	model id	download
R50	1x	0.200	0.062	3.9	36.5	137593951	model \| metrics
R50	3x	0.201	0.063	3.9	37.9	137849486	model \| metrics
R101	3x	0.280	0.080	5.1	39.9	138363263	model \| metrics

RPN & Fast R-CNN:

Name	lr sched	train time (s/iter)	inference time (s/im)	train mem (GB)	box AP	prop. AR	model id	download
RPN R50-C4	1x	0.130	0.051	1.5		51.6	137258005	model \| metrics
RPN R50-FPN	1x	0.186	0.045	2.7		58.0	137258492	model \| metrics
Fast R-CNN R50-FPN	1x	0.140	0.035	2.6	37.8		137635226	model \| metrics

COCO Instance Segmentation Baselines with Mask R-CNN

Name	lr sched	train time (s/iter)	inference time (s/im)	train mem (GB)	box AP	mask AP	model id	download
R50-C4	1x	0.584	0.117	5.2	36.8	32.2	137259246	model \| metrics
R50-DC5	1x	0.471	0.074	6.5	38.3	34.2	137260150	model \| metrics
R50-FPN	1x	0.261	0.053	3.4	38.6	35.2	137260431	model \| metrics
R50-C4	3x	0.575	0.118	5.2	39.8	34.4	137849525	model \| metrics
R50-DC5	3x	0.470	0.075	6.5	40.0	35.9	137849551	model \| metrics
R50-FPN	3x	0.261	0.055	3.4	41.0	37.2	137849600	model \| metrics
R101-C4	3x	0.652	0.155	6.3	42.6	36.7	138363239	model \| metrics
R101-DC5	3x	0.545	0.155	7.6	41.9	37.3	138363294	model \| metrics
R101-FPN	3x	0.340	0.070	4.6	42.9	38.6	138205316	model \| metrics
X101-FPN	3x	0.690	0.129	7.2	44.3	39.5	139653917	model \| metrics

COCO Person Keypoint Detection Baselines with Keypoint R-CNN

Name	lr sched	train time (s/iter)	inference time (s/im)	train mem (GB)	box AP	kp. AP	model id	download
R50-FPN	1x	0.315	0.083	5.0	53.6	64.0	137261548	model \| metrics
R50-FPN	3x	0.316	0.076	5.0	55.4	65.5	137849621	model \| metrics
R101-FPN	3x	0.390	0.090	6.1	56.4	66.1	138363331	model \| metrics
X101-FPN	3x	0.738	0.142	8.7	57.3	66.0	139686956	model \| metrics

COCO Panoptic Segmentation Baselines with Panoptic FPN

Name	lr sched	train time (s/iter)	inference time (s/im)	train mem (GB)	box AP	mask AP	PQ	model id	download
R50-FPN	1x	0.304	0.063	4.8	37.6	34.7	39.4	139514544	model \| metrics
R50-FPN	3x	0.302	0.063	4.8	40.0	36.5	41.5	139514569	model \| metrics
R101-FPN	3x	0.392	0.078	6.0	42.4	38.5	43.0	139514519	model \| metrics

LVIS Instance Segmentation Baselines with Mask R-CNN

Mask R-CNN baselines on the LVIS dataset, v0.5. These baselines are described in Table 3(c) of the LVIS paper.

NOTE: the 1x schedule here has the same number of iterations as the COCO 1x baselines, which is roughly 24 epochs of LVIS v0.5 data. The final results of these configs have large variance across different runs.
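
The epoch counts quoted for each schedule follow from iterations times minibatch size divided by dataset size. The sketch below illustrates this under assumptions that are not stated in this document: that the 1x and 3x schedules run for 90,000 and 270,000 iterations, and that COCO train2017 contains 118,287 images. Only the minibatch size of 16 comes from Common Settings. The LVIS v0.5 image count is left as a parameter rather than filled in here.

```python
def epochs(iterations, dataset_size, images_per_iter=16):
    """Approximate number of passes over the dataset for a given schedule."""
    return iterations * images_per_iter / dataset_size


# Assumed values (not stated in this document).
ITERS_1X = 90_000
ITERS_3X = 270_000
COCO_TRAIN2017_IMAGES = 118_287

print(f"COCO 1x: {epochs(ITERS_1X, COCO_TRAIN2017_IMAGES):.1f} epochs")  # 12.2
print(f"COCO 3x: {epochs(ITERS_3X, COCO_TRAIN2017_IMAGES):.1f} epochs")  # 36.5
```

Because the LVIS 1x schedule uses the same number of iterations on a smaller training set, the same formula yields a larger epoch count; calling `epochs(ITERS_1X, n)` with the LVIS v0.5 training-set size `n` makes that comparison.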

Name	lr sched	train time (s/iter)	inference time (s/im)	train mem (GB)	box AP	mask AP	model id	download
R50-FPN	1x	0.292	0.127	7.1	23.6	24.4	144219072	model \| metrics
R101-FPN	1x	0.371	0.124	7.8	25.6	25.9	144219035	model \| metrics
X101-FPN	1x	0.712	0.166	10.2	26.7	27.1	144219108	model \| metrics

Cityscapes & Pascal VOC Baselines

Simple baselines for:

Mask R-CNN on Cityscapes instance segmentation (initialized from COCO pre-training, then trained on Cityscapes fine annotations only)
Faster R-CNN on PASCAL VOC object detection (trained on VOC 2007 train+val + VOC 2012 train+val, tested on VOC 2007 using 11-point interpolated AP)

Name	train time (s/iter)	inference time (s/im)	train mem (GB)	box AP	box AP50	mask AP	model id	download
R50-FPN, Cityscapes	0.240	0.092	4.4			36.5	142423278	model \| metrics
R50-C4, VOC	0.537	0.086	4.8	51.9	80.3		142202221	model \| metrics
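
The 11-point interpolated AP used for the VOC 2007 evaluation averages, over the recall thresholds 0, 0.1, ..., 1.0, the maximum precision achieved at any recall at or above each threshold. The following is a minimal sketch of that metric in plain Python, not the evaluation code used to produce the numbers above. The function name and the input format (parallel lists of recall and precision values along the ranked detections) are assumptions made for illustration.

```python
def voc07_11_point_ap(recalls, precisions):
    """11-point interpolated AP from parallel recall/precision lists."""
    if len(recalls) != len(precisions):
        raise ValueError("recalls and precisions must have the same length")
    total = 0.0
    for k in range(11):
        threshold = k / 10  # 0.0, 0.1, ..., 1.0
        # Interpolated precision: best precision at any recall >= threshold.
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(candidates, default=0.0)
    return total / 11
```

For instance, `voc07_11_point_ap([0.5, 1.0], [1.0, 0.5])` gives 8.5 / 11, about 0.773: the six thresholds up to 0.5 see a best precision of 1.0 and the remaining five see 0.5.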

Other Settings

Ablations for Deformable Conv and Cascade R-CNN:

Name	lr sched	train time (s/iter)	inference time (s/im)	train mem (GB)	box AP	mask AP	model id	download
Baseline R50-FPN	1x	0.261	0.053	3.4	38.6	35.2	137260431	model \| metrics
Deformable Conv	1x	0.342	0.061	3.5	41.5	37.5	138602867	model \| metrics
Cascade R-CNN	1x	0.317	0.066	4.0	42.1	36.4	138602847	model \| metrics
Baseline R50-FPN	3x	0.261	0.055	3.4	41.0	37.2	137849600	model \| metrics
Deformable Conv	3x	0.349	0.066	3.5	42.7	38.5	144998336	model \| metrics
Cascade R-CNN	3x	0.328	0.075	4.0	44.3	38.5	144998488	model \| metrics

Ablations for normalization methods (note: the baseline uses a 2fc head while the others use a 4conv1fc head; according to the GroupNorm paper, the change in head does not improve the baseline by much):

Name	lr sched	train time (s/iter)	inference time (s/im)	train mem (GB)	box AP	mask AP	model id	download
Baseline R50-FPN	3x	0.261	0.055	3.4	41.0	37.2	137849600	model \| metrics
SyncBN	3x	0.464	0.063	5.6	42.0	37.8	143915318	model \| metrics
GN	3x	0.356	0.077	7.3	42.6	38.6	138602888	model \| metrics
GN (scratch)	3x	0.400	0.077	9.8	39.9	36.6	138602908	model \| metrics

A few very large models trained for a long time, for demo purposes:

Name	inference time (s/im)	train mem (GB)	box AP	mask AP	PQ	model id	download
Panoptic FPN R101	0.123	11.4	47.4	41.3	46.1	139797668	model \| metrics
Mask R-CNN X152	0.281	15.1	49.3	43.2		18131413	model \| metrics
above + test-time aug.			51.4	45.5
