How to set the "lr" when using "ap.SyncBatchNorm"? #2

Closed
zimenglan-sysu-512 opened this issue Apr 29, 2019 · 7 comments

zimenglan-sysu-512 commented Apr 29, 2019

Hi @xvjiarui,
I moved the code into maskrcnn-benchmark and ran the mask_rcnn_r16_ct_c3-c5_r50_sbn_fpn_1x config with 16 images / 8 GPUs, lr=0.02, and ap.SyncBatchNorm. It encounters NaN in the first few iterations, and it also seems to use more GPU memory than mask_rcnn_r50_fpn_1x.
When I set the lr to 0.0025, training runs successfully. Can you give me some tips on how to set the lr when using ap.SyncBatchNorm?
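
(For reference, a minimal sketch of the linear LR scaling heuristic that maskrcnn-benchmark-style configs usually follow; the 0.02 / 16-image base values match the settings above, but the helper name and the rule itself are illustrative, not something prescribed in this thread.)

def linear_scaled_lr(total_batch_size, base_lr=0.02, base_batch=16):
    # Scale the learning rate linearly with the total batch size across all GPUs.
    return base_lr * total_batch_size / base_batch

print(linear_scaled_lr(16))  # 0.02, the value used for 16 images / 8 GPUs above
print(linear_scaled_lr(8))   # 0.01, what the rule would give for 8 images total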

zimenglan-sysu-512 (Author) commented:

I use torch.nn.utils.clip_grad_norm_ to clip the gradients, which solves the NaN problem.

xvjiarui (Owner) commented:

The setting looks good to me. I suggest you first run Mask R-CNN without GC with 16 images on 8 GPUs.
I don't think GC would cause gradient explosion.

zimenglan-sysu-512 (Author) commented:

Yes, I ran Mask R-CNN both with and without GC, using sync BN. If I don't clip gradients, both encounter NaN. For now I've solved it by clipping gradients.
Thanks.

zimenglan-sysu-512 (Author) commented:

Hi @xvjiarui,
Sorry to bother you again.
When I finish training and test the final model, the performance is close to zero. How should sync BN be handled at test time?

xvjiarui (Owner) commented Apr 30, 2019

> When I finish training and test the final model, the performance is close to zero. How should sync BN be handled at test time?

Sync BN is fixed during testing, just like BN. What does your loss look like? I suspect the issue may be due to clip_grad_norm.
I also suggest you check your code. I don't think the maskrcnn-benchmark baseline would encounter gradient explosion, even with Sync BN.
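
(A minimal sketch of the point above, using a plain nn.BatchNorm2d as a stand-in because ap.SyncBatchNorm needs a distributed setup; once model.eval() is called, both normalize with their stored running statistics, so no cross-GPU sync happens at test time.)

import torch
import torch.nn as nn

# Stand-in module; an ap.SyncBatchNorm layer behaves the same way in eval mode:
# it uses running_mean / running_var from the checkpoint instead of batch statistics.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())

model.eval()  # switch BN layers to running statistics before testing
with torch.no_grad():
    out = model(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 8, 32, 32])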

zimenglan-sysu-512 (Author) commented Apr 30, 2019

Hi,
I use sync BN from apex, like this:

# 'ap' is assumed to be the apex parallel module (e.g. `from apex import parallel as ap`).
class BottleneckWithAPSyncBN(Bottleneck):
    def __init__(
        self,
        in_channels,
        bottleneck_channels,
        out_channels,
        num_groups=1,
        stride_in_1x1=True,
        stride=1,
        dilation=1,
        configs={},
    ):
        super(BottleneckWithAPSyncBN, self).__init__(
            in_channels=in_channels,
            bottleneck_channels=bottleneck_channels,
            out_channels=out_channels,
            num_groups=num_groups,
            stride_in_1x1=stride_in_1x1,
            stride=stride,
            dilation=dilation,
            norm_func=ap.SyncBatchNorm,
            configs=configs
        )

and use clip_grad after losses.backward() as below, with max_norm set to 35 and norm_type set to 2:

from torch.nn.utils import clip_grad_norm_ as clip_grad_norm
# max_norm=35, norm_type=2 (L2 norm), as described above
clip_grad_norm(model.parameters(), max_norm, norm_type=norm_type)
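
(For context, a runnable toy sketch of the ordering this implies, backward() -> clip -> step(); the model, optimizer, and data below are placeholders, not the actual maskrcnn-benchmark trainer.)

import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

# Toy model and optimizer, only to illustrate where the clipping call sits.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)
x, y = torch.randn(4, 10), torch.randn(4, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Same call as above: cap the total L2 gradient norm at max_norm=35 before stepping.
clip_grad_norm_(model.parameters(), max_norm=35, norm_type=2)
optimizer.step()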

After training with these configs, the mAP on COCO val2017 is 0.0. The log is below:

2019-04-29 16:15:49,151 maskrcnn_benchmark.trainer INFO: eta: 1 day, 3:45:33  iter: 20  loss: 3.1540 (5.0863)  loss_box_reg: 0.0448 (0.0604)  loss_classifier: 0.9114 (1.8578)  loss_mask: 0.8628 (2.3964)  loss_objectness: 0.3248 (0.4229)  loss_rpn_box_reg: 0.2593 (0.3488)  ftime: 0.2230 (0.5223)  backbone_ftime: 0.0715 (0.3736)  roi_heads_ftime: 0.0337 (0.0381)  rpn_ftime: 0.1085 (0.1106)  time: 0.6449 (1.1106)  data: 0.0086 (0.1168)  lr: 0.007173  max mem: 4027
2019-04-29 16:16:02,300 maskrcnn_benchmark.trainer INFO: eta: 22:05:28  iter: 40  loss: 1.7453 (3.4724)  loss_box_reg: 0.1120 (0.0850)  loss_classifier: 0.5803 (1.2815)  loss_mask: 0.6930 (1.5453)  loss_objectness: 0.2294 (0.3306)  loss_rpn_box_reg: 0.1077 (0.2299)  ftime: 0.2254 (0.3735)  backbone_ftime: 0.0676 (0.2215)  roi_heads_ftime: 0.0475 (0.0427)  rpn_ftime: 0.1078 (0.1093)  time: 0.6584 (0.8840)  data: 0.0102 (0.0640)  lr: 0.007707  max mem: 4027
2019-04-29 16:16:15,600 maskrcnn_benchmark.trainer INFO: eta: 20:15:43  iter: 60  loss: 1.8415 (2.9447)  loss_box_reg: 0.1149 (0.0985)  loss_classifier: 0.5927 (1.0666)  loss_mask: 0.6890 (1.2598)  loss_objectness: 0.2689 (0.3151)  loss_rpn_box_reg: 0.1651 (0.2048)  ftime: 0.2295 (0.3256)  backbone_ftime: 0.0709 (0.1714)  roi_heads_ftime: 0.0507 (0.0455)  rpn_ftime: 0.1077 (0.1087)  time: 0.6604 (0.8110)  data: 0.0131 (0.0475)  lr: 0.008240  max mem: 4027
2019-04-29 16:16:28,584 maskrcnn_benchmark.trainer INFO: eta: 19:14:49  iter: 80  loss: 1.6740 (2.6838)  loss_box_reg: 0.1086 (0.1015)  loss_classifier: 0.4860 (0.9273)  loss_mask: 0.6849 (1.1166)  loss_objectness: 0.2614 (0.3094)  loss_rpn_box_reg: 0.1261 (0.2290)  ftime: 0.2221 (0.3000)  backbone_ftime: 0.0684 (0.1458)  roi_heads_ftime: 0.0497 (0.0461)  rpn_ftime: 0.1057 (0.1081)  time: 0.6513 (0.7706)  data: 0.0099 (0.0383)  lr: 0.008773  max mem: 4027
2019-04-29 16:16:41,528 maskrcnn_benchmark.trainer INFO: eta: 18:37:35  iter: 100  loss: 1.8355 (2.5213)  loss_box_reg: 0.1045 (0.1035)  loss_classifier: 0.5610 (0.8732)  loss_mask: 0.6875 (1.0307)  loss_objectness: 0.2637 (0.3028)  loss_rpn_box_reg: 0.1414 (0.2110)  ftime: 0.2239 (0.2848)  backbone_ftime: 0.0703 (0.1307)  roi_heads_ftime: 0.0474 (0.0465)  rpn_ftime: 0.1052 (0.1076)  time: 0.6486 (0.7459)  data: 0.0103 (0.0331)  lr: 0.009307  max mem: 4027
2019-04-29 16:16:54,769 maskrcnn_benchmark.trainer INFO: eta: 18:16:24  iter: 120  loss: 2.8539 (2.8987)  loss_box_reg: 0.1312 (0.1103)  loss_classifier: 1.3366 (1.2354)  loss_mask: 0.6863 (0.9734)  loss_objectness: 0.3254 (0.3663)  loss_rpn_box_reg: 0.1463 (0.2133)  ftime: 0.2244 (0.2749)  backbone_ftime: 0.0674 (0.1202)  roi_heads_ftime: 0.0494 (0.0469)  rpn_ftime: 0.1083 (0.1077)  time: 0.6607 (0.7319)  data: 0.0112 (0.0296)  lr: 0.009840  max mem: 4092
2019-04-29 16:17:07,720 maskrcnn_benchmark.trainer INFO: eta: 17:58:07  iter: 140  loss: 2.9404 (3.3908)  loss_box_reg: 0.1603 (0.1218)  loss_classifier: 1.6348 (1.7406)  loss_mask: 0.6878 (0.9326)  loss_objectness: 0.5641 (0.3878)  loss_rpn_box_reg: 0.1102 (0.2080)  ftime: 0.2173 (0.2669)  backbone_ftime: 0.0646 (0.1123)  roi_heads_ftime: 0.0448 (0.0468)  rpn_ftime: 0.1075 (0.1078)  time: 0.6432 (0.7199)  data: 0.0113 (0.0271)  lr: 0.010373  max mem: 4092

......

2019-04-30 07:18:47,434 maskrcnn_benchmark.trainer INFO: eta: 0:01:24  iter: 89860  loss: 0.9836 (1.3513)  loss_box_reg: 0.0075 (0.0120)  loss_classifier: 0.2176 (0.4399)  loss_mask: 0.3144 (0.3686)  loss_objectness: 0.3680 (0.4110)  loss_rpn_box_reg: 0.1062 (0.1197)  ftime: 0.1991 (0.2021)  backbone_ftime: 0.0668 (0.0694)  roi_heads_ftime: 0.0225 (0.0238)  rpn_ftime: 0.1078 (0.1089)  time: 0.5898 (0.6032)  data: 0.0129 (0.0131)  lr: 0.000200  max mem: 4250
2019-04-30 07:18:59,354 maskrcnn_benchmark.trainer INFO: eta: 0:01:12  iter: 89880  loss: 1.1297 (1.3512)  loss_box_reg: 0.0080 (0.0120)  loss_classifier: 0.2681 (0.4399)  loss_mask: 0.3165 (0.3686)  loss_objectness: 0.4145 (0.4110)  loss_rpn_box_reg: 0.1112 (0.1197)  ftime: 0.1974 (0.2021)  backbone_ftime: 0.0662 (0.0694)  roi_heads_ftime: 0.0242 (0.0238)  rpn_ftime: 0.1069 (0.1089)  time: 0.5999 (0.6032)  data: 0.0144 (0.0131)  lr: 0.000200  max mem: 4250
2019-04-30 07:19:11,233 maskrcnn_benchmark.trainer INFO: eta: 0:01:00  iter: 89900  loss: 1.0263 (1.3512)  loss_box_reg: 0.0039 (0.0120)  loss_classifier: 0.2237 (0.4398)  loss_mask: 0.3114 (0.3686)  loss_objectness: 0.3918 (0.4110)  loss_rpn_box_reg: 0.1096 (0.1197)  ftime: 0.1980 (0.2021)  backbone_ftime: 0.0665 (0.0694)  roi_heads_ftime: 0.0231 (0.0238)  rpn_ftime: 0.1074 (0.1089)  time: 0.5881 (0.6032)  data: 0.0113 (0.0131)  lr: 0.000200  max mem: 4250
2019-04-30 07:19:22,989 maskrcnn_benchmark.trainer INFO: eta: 0:00:48  iter: 89920  loss: 1.0952 (1.3511)  loss_box_reg: 0.0034 (0.0120)  loss_classifier: 0.2197 (0.4398)  loss_mask: 0.3247 (0.3686)  loss_objectness: 0.4153 (0.4110)  loss_rpn_box_reg: 0.1111 (0.1197)  ftime: 0.1954 (0.2021)  backbone_ftime: 0.0658 (0.0694)  roi_heads_ftime: 0.0231 (0.0238)  rpn_ftime: 0.1062 (0.1089)  time: 0.5863 (0.6032)  data: 0.0108 (0.0131)  lr: 0.000200  max mem: 4250
2019-04-30 07:19:34,826 maskrcnn_benchmark.trainer INFO: eta: 0:00:36  iter: 89940  loss: 1.0778 (1.3511)  loss_box_reg: 0.0088 (0.0120)  loss_classifier: 0.2394 (0.4397)  loss_mask: 0.3099 (0.3686)  loss_objectness: 0.4080 (0.4110)  loss_rpn_box_reg: 0.1153 (0.1197)  ftime: 0.1986 (0.2021)  backbone_ftime: 0.0659 (0.0694)  roi_heads_ftime: 0.0233 (0.0238)  rpn_ftime: 0.1074 (0.1089)  time: 0.5890 (0.6032)  data: 0.0108 (0.0131)  lr: 0.000200  max mem: 4250  
2019-04-30 07:19:46,781 maskrcnn_benchmark.trainer INFO: eta: 0:00:24  iter: 89960  loss: 0.9723 (1.3510)  loss_box_reg: 0.0067 (0.0120)  loss_classifier: 0.2280 (0.4397)  loss_mask: 0.3218 (0.3686)  loss_objectness: 0.3552 (0.4110)  loss_rpn_box_reg: 0.0949 (0.1197)  ftime: 0.1975 (0.2021)  backbone_ftime: 0.0671 (0.0694)  roi_heads_ftime: 0.0231 (0.0238)  rpn_ftime: 0.1086 (0.1089)  time: 0.5903 (0.6032)  data: 0.0137 (0.0131)  lr: 0.000200  max mem: 4250  
2019-04-30 07:19:58,708 maskrcnn_benchmark.trainer INFO: eta: 0:00:12  iter: 89980  loss: 1.1414 (1.3509)  loss_box_reg: 0.0117 (0.0120)  loss_classifier: 0.2371 (0.4396)  loss_mask: 0.3259 (0.3686)  loss_objectness: 0.4193 (0.4110)  loss_rpn_box_reg: 0.1314 (0.1197)  ftime: 0.1980 (0.2021)  backbone_ftime: 0.0681 (0.0694)  roi_heads_ftime: 0.0236 (0.0238)  rpn_ftime: 0.1060 (0.1089)  time: 0.5894 (0.6032)  data: 0.0100 (0.0131)  lr: 0.000200  max mem: 4250 

If I don't use clip_grad, it encounters the NaN problem.

Iamal1 commented May 13, 2019

Hi, my experiment could not run with PyTorch SyncBN or plain BatchNorm. Did you remove broadcast_buffers in the DDP? I want to check where the problem is, thank you!
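
(For reference, the flag in question is the broadcast_buffers argument of torch.nn.parallel.DistributedDataParallel; a minimal sketch of turning it off, assuming the process group is already initialized and local_rank is this process's GPU index. Names here are placeholders, not taken from this thread.)

from torch.nn.parallel import DistributedDataParallel

# Assumes torch.distributed.init_process_group(...) has been called and the model
# is already on the right device; `model` and `local_rank` are placeholders.
model = DistributedDataParallel(
    model,
    device_ids=[local_rank],
    broadcast_buffers=False,  # don't re-broadcast BN buffers from rank 0 every forward pass
)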
