How to set the "lr" when using "ap.SyncBatchNorm"? #2

Closed
zimenglan-sysu-512 opened this issue Apr 29, 2019 · 7 comments

zimenglan-sysu-512 commented Apr 29, 2019

Hi @xvjiarui,
I moved the code into maskrcnn-benchmark and ran the mask_rcnn_r16_ct_c3-c5_r50_sbn_fpn_1x config with 16 images / 8 GPUs, lr=0.02, and ap.SyncBatchNorm. It encounters NaN in the first few iterations, and it also seems to use more GPU memory than mask_rcnn_r50_fpn_1x.
When I set the lr to 0.0025, training runs successfully. Can you give me some tips on how to set the lr when using ap.SyncBatchNorm?
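
(For reference, a minimal sketch of the linear LR scaling heuristic that maskrcnn-benchmark-style configs usually follow; the 0.02 / 16-image base values match the settings above, but the helper name and the rule itself are illustrative, not something prescribed in this thread.)

def linear_scaled_lr(total_batch_size, base_lr=0.02, base_batch=16):
    # Scale the learning rate linearly with the total batch size across all GPUs.
    return base_lr * total_batch_size / base_batch

print(linear_scaled_lr(16))  # 0.02, the value used for 16 images / 8 GPUs above
print(linear_scaled_lr(8))   # 0.01, what the rule would give for 8 images total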

zimenglan-sysu-512 (Author) commented:

I use torch.nn.utils.clip_grad_norm_ to clip the gradients, which solves the NaN problem.

xvjiarui (Owner) commented:

The setting looks good to me. I suggest you first run Mask R-CNN without GC with 16 images on 8 GPUs.
I don't think GC would cause gradient explosion.

zimenglan-sysu-512 (Author) commented:

Yes, I ran Mask R-CNN both with and without GC, using sync BN. If I don't clip gradients, both encounter NaN. For now I've solved it by clipping gradients.
Thanks.

zimenglan-sysu-512 (Author) commented:

Hi @xvjiarui,
Sorry to bother you again.
When I finish training and test the final model, the performance is close to zero. How should sync BN be handled at test time?

xvjiarui (Owner) commented Apr 30, 2019

> When I finish training and test the final model, the performance is close to zero. How should sync BN be handled at test time?

Sync BN is fixed during testing, just like BN. What does your loss look like? I suspect the issue may be due to clip_grad_norm.
I also suggest you check your code. I don't think the maskrcnn-benchmark baseline would encounter gradient explosion, even with Sync BN.
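
(A minimal sketch of the point above, using a plain nn.BatchNorm2d as a stand-in because ap.SyncBatchNorm needs a distributed setup; once model.eval() is called, both normalize with their stored running statistics, so no cross-GPU sync happens at test time.)

import torch
import torch.nn as nn

# Stand-in module; an ap.SyncBatchNorm layer behaves the same way in eval mode:
# it uses running_mean / running_var from the checkpoint instead of batch statistics.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())

model.eval()  # switch BN layers to running statistics before testing
with torch.no_grad():
    out = model(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 8, 32, 32])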

zimenglan-sysu-512 (Author) commented Apr 30, 2019

Hi,
I use sync BN from apex, like this:

# 'ap' is assumed to be the apex parallel module (e.g. `from apex import parallel as ap`).
class BottleneckWithAPSyncBN(Bottleneck):
    def __init__(
        self,
        in_channels,
        bottleneck_channels,
        out_channels,
        num_groups=1,
        stride_in_1x1=True,
        stride=1,
        dilation=1,
        configs={},
    ):
        super(BottleneckWithAPSyncBN, self).__init__(
            in_channels=in_channels,
            bottleneck_channels=bottleneck_channels,
            out_channels=out_channels,
            num_groups=num_groups,
            stride_in_1x1=stride_in_1x1,
            stride=stride,
            dilation=dilation,
            norm_func=ap.SyncBatchNorm,
            configs=configs
        )

and use clip_grad after losses.backward() as below, with max_norm set to 35 and norm_type set to 2:

from torch.nn.utils import clip_grad_norm_ as clip_grad_norm
# max_norm=35, norm_type=2 (L2 norm), as described above
clip_grad_norm(model.parameters(), max_norm, norm_type=norm_type)
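
(For context, a runnable toy sketch of the ordering this implies, backward() -> clip -> step(); the model, optimizer, and data below are placeholders, not the actual maskrcnn-benchmark trainer.)

import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

# Toy model and optimizer, only to illustrate where the clipping call sits.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)
x, y = torch.randn(4, 10), torch.randn(4, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Same call as above: cap the total L2 gradient norm at max_norm=35 before stepping.
clip_grad_norm_(model.parameters(), max_norm=35, norm_type=2)
optimizer.step()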

After training with these configs, the mAP on COCO val2017 is 0.0. The log is below:

2019-04-29 16:15:49,151 maskrcnn_benchmark.trainer INFO: eta: 1 day, 3:45:33  iter: 20  loss: 3.1540 (5.0863)  loss_box_reg: 0.0448 (0.0604)  loss_classifier: 0.9114 (1.8578)  loss_mask: 0.8628 (2.3964)  loss_objectness: 0.3248 (0.4229)  loss_rpn_box_reg: 0.2593 (0.3488)  ftime: 0.2230 (0.5223)  backbone_ftime: 0.0715 (0.3736)  roi_heads_ftime: 0.0337 (0.0381)  rpn_ftime: 0.1085 (0.1106)  time: 0.6449 (1.1106)  data: 0.0086 (0.1168)  lr: 0.007173  max mem: 4027
2019-04-29 16:16:02,300 maskrcnn_benchmark.trainer INFO: eta: 22:05:28  iter: 40  loss: 1.7453 (3.4724)  loss_box_reg: 0.1120 (0.0850)  loss_classifier: 0.5803 (1.2815)  loss_mask: 0.6930 (1.5453)  loss_objectness: 0.2294 (0.3306)  loss_rpn_box_reg: 0.1077 (0.2299)  ftime: 0.2254 (0.3735)  backbone_ftime: 0.0676 (0.2215)  roi_heads_ftime: 0.0475 (0.0427)  rpn_ftime: 0.1078 (0.1093)  time: 0.6584 (0.8840)  data: 0.0102 (0.0640)  lr: 0.007707  max mem: 4027
2019-04-29 16:16:15,600 maskrcnn_benchmark.trainer INFO: eta: 20:15:43  iter: 60  loss: 1.8415 (2.9447)  loss_box_reg: 0.1149 (0.0985)  loss_classifier: 0.5927 (1.0666)  loss_mask: 0.6890 (1.2598)  loss_objectness: 0.2689 (0.3151)  loss_rpn_box_reg: 0.1651 (0.2048)  ftime: 0.2295 (0.3256)  backbone_ftime: 0.0709 (0.1714)  roi_heads_ftime: 0.0507 (0.0455)  rpn_ftime: 0.1077 (0.1087)  time: 0.6604 (0.8110)  data: 0.0131 (0.0475)  lr: 0.008240  max mem: 4027
2019-04-29 16:16:28,584 maskrcnn_benchmark.trainer INFO: eta: 19:14:49  iter: 80  loss: 1.6740 (2.6838)  loss_box_reg: 0.1086 (0.1015)  loss_classifier: 0.4860 (0.9273)  loss_mask: 0.6849 (1.1166)  loss_objectness: 0.2614 (0.3094)  loss_rpn_box_reg: 0.1261 (0.2290)  ftime: 0.2221 (0.3000)  backbone_ftime: 0.0684 (0.1458)  roi_heads_ftime: 0.0497 (0.0461)  rpn_ftime: 0.1057 (0.1081)  time: 0.6513 (0.7706)  data: 0.0099 (0.0383)  lr: 0.008773  max mem: 4027
2019-04-29 16:16:41,528 maskrcnn_benchmark.trainer INFO: eta: 18:37:35  iter: 100  loss: 1.8355 (2.5213)  loss_box_reg: 0.1045 (0.1035)  loss_classifier: 0.5610 (0.8732)  loss_mask: 0.6875 (1.0307)  loss_objectness: 0.2637 (0.3028)  loss_rpn_box_reg: 0.1414 (0.2110)  ftime: 0.2239 (0.2848)  backbone_ftime: 0.0703 (0.1307)  roi_heads_ftime: 0.0474 (0.0465)  rpn_ftime: 0.1052 (0.1076)  time: 0.6486 (0.7459)  data: 0.0103 (0.0331)  lr: 0.009307  max mem: 4027
2019-04-29 16:16:54,769 maskrcnn_benchmark.trainer INFO: eta: 18:16:24  iter: 120  loss: 2.8539 (2.8987)  loss_box_reg: 0.1312 (0.1103)  loss_classifier: 1.3366 (1.2354)  loss_mask: 0.6863 (0.9734)  loss_objectness: 0.3254 (0.3663)  loss_rpn_box_reg: 0.1463 (0.2133)  ftime: 0.2244 (0.2749)  backbone_ftime: 0.0674 (0.1202)  roi_heads_ftime: 0.0494 (0.0469)  rpn_ftime: 0.1083 (0.1077)  time: 0.6607 (0.7319)  data: 0.0112 (0.0296)  lr: 0.009840  max mem: 4092
2019-04-29 16:17:07,720 maskrcnn_benchmark.trainer INFO: eta: 17:58:07  iter: 140  loss: 2.9404 (3.3908)  loss_box_reg: 0.1603 (0.1218)  loss_classifier: 1.6348 (1.7406)  loss_mask: 0.6878 (0.9326)  loss_objectness: 0.5641 (0.3878)  loss_rpn_box_reg: 0.1102 (0.2080)  ftime: 0.2173 (0.2669)  backbone_ftime: 0.0646 (0.1123)  roi_heads_ftime: 0.0448 (0.0468)  rpn_ftime: 0.1075 (0.1078)  time: 0.6432 (0.7199)  data: 0.0113 (0.0271)  lr: 0.010373  max mem: 4092

......

2019-04-30 07:18:47,434 maskrcnn_benchmark.trainer INFO: eta: 0:01:24  iter: 89860  loss: 0.9836 (1.3513)  loss_box_reg: 0.0075 (0.0120)  loss_classifier: 0.2176 (0.4399)  loss_mask: 0.3144 (0.3686)  loss_objectness: 0.3680 (0.4110)  loss_rpn_box_reg: 0.1062 (0.1197)  ftime: 0.1991 (0.2021)  backbone_ftime: 0.0668 (0.0694)  roi_heads_ftime: 0.0225 (0.0238)  rpn_ftime: 0.1078 (0.1089)  time: 0.5898 (0.6032)  data: 0.0129 (0.0131)  lr: 0.000200  max mem: 4250
2019-04-30 07:18:59,354 maskrcnn_benchmark.trainer INFO: eta: 0:01:12  iter: 89880  loss: 1.1297 (1.3512)  loss_box_reg: 0.0080 (0.0120)  loss_classifier: 0.2681 (0.4399)  loss_mask: 0.3165 (0.3686)  loss_objectness: 0.4145 (0.4110)  loss_rpn_box_reg: 0.1112 (0.1197)  ftime: 0.1974 (0.2021)  backbone_ftime: 0.0662 (0.0694)  roi_heads_ftime: 0.0242 (0.0238)  rpn_ftime: 0.1069 (0.1089)  time: 0.5999 (0.6032)  data: 0.0144 (0.0131)  lr: 0.000200  max mem: 4250
2019-04-30 07:19:11,233 maskrcnn_benchmark.trainer INFO: eta: 0:01:00  iter: 89900  loss: 1.0263 (1.3512)  loss_box_reg: 0.0039 (0.0120)  loss_classifier: 0.2237 (0.4398)  loss_mask: 0.3114 (0.3686)  loss_objectness: 0.3918 (0.4110)  loss_rpn_box_reg: 0.1096 (0.1197)  ftime: 0.1980 (0.2021)  backbone_ftime: 0.0665 (0.0694)  roi_heads_ftime: 0.0231 (0.0238)  rpn_ftime: 0.1074 (0.1089)  time: 0.5881 (0.6032)  data: 0.0113 (0.0131)  lr: 0.000200  max mem: 4250
2019-04-30 07:19:22,989 maskrcnn_benchmark.trainer INFO: eta: 0:00:48  iter: 89920  loss: 1.0952 (1.3511)  loss_box_reg: 0.0034 (0.0120)  loss_classifier: 0.2197 (0.4398)  loss_mask: 0.3247 (0.3686)  loss_objectness: 0.4153 (0.4110)  loss_rpn_box_reg: 0.1111 (0.1197)  ftime: 0.1954 (0.2021)  backbone_ftime: 0.0658 (0.0694)  roi_heads_ftime: 0.0231 (0.0238)  rpn_ftime: 0.1062 (0.1089)  time: 0.5863 (0.6032)  data: 0.0108 (0.0131)  lr: 0.000200  max mem: 4250
2019-04-30 07:19:34,826 maskrcnn_benchmark.trainer INFO: eta: 0:00:36  iter: 89940  loss: 1.0778 (1.3511)  loss_box_reg: 0.0088 (0.0120)  loss_classifier: 0.2394 (0.4397)  loss_mask: 0.3099 (0.3686)  loss_objectness: 0.4080 (0.4110)  loss_rpn_box_reg: 0.1153 (0.1197)  ftime: 0.1986 (0.2021)  backbone_ftime: 0.0659 (0.0694)  roi_heads_ftime: 0.0233 (0.0238)  rpn_ftime: 0.1074 (0.1089)  time: 0.5890 (0.6032)  data: 0.0108 (0.0131)  lr: 0.000200  max mem: 4250  
2019-04-30 07:19:46,781 maskrcnn_benchmark.trainer INFO: eta: 0:00:24  iter: 89960  loss: 0.9723 (1.3510)  loss_box_reg: 0.0067 (0.0120)  loss_classifier: 0.2280 (0.4397)  loss_mask: 0.3218 (0.3686)  loss_objectness: 0.3552 (0.4110)  loss_rpn_box_reg: 0.0949 (0.1197)  ftime: 0.1975 (0.2021)  backbone_ftime: 0.0671 (0.0694)  roi_heads_ftime: 0.0231 (0.0238)  rpn_ftime: 0.1086 (0.1089)  time: 0.5903 (0.6032)  data: 0.0137 (0.0131)  lr: 0.000200  max mem: 4250  
2019-04-30 07:19:58,708 maskrcnn_benchmark.trainer INFO: eta: 0:00:12  iter: 89980  loss: 1.1414 (1.3509)  loss_box_reg: 0.0117 (0.0120)  loss_classifier: 0.2371 (0.4396)  loss_mask: 0.3259 (0.3686)  loss_objectness: 0.4193 (0.4110)  loss_rpn_box_reg: 0.1314 (0.1197)  ftime: 0.1980 (0.2021)  backbone_ftime: 0.0681 (0.0694)  roi_heads_ftime: 0.0236 (0.0238)  rpn_ftime: 0.1060 (0.1089)  time: 0.5894 (0.6032)  data: 0.0100 (0.0131)  lr: 0.000200  max mem: 4250 

If I don't use clip_grad, it encounters the NaN problem.

Iamal1 commented May 13, 2019

Hi, my experiment could not run with PyTorch SyncBN or plain BatchNorm. Did you remove broadcast_buffers in the DDP? I want to check where the problem is, thank you!
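
(For reference, the flag in question is the broadcast_buffers argument of torch.nn.parallel.DistributedDataParallel; a minimal sketch of turning it off, assuming the process group is already initialized and local_rank is this process's GPU index. Names here are placeholders, not taken from this thread.)

from torch.nn.parallel import DistributedDataParallel

# Assumes torch.distributed.init_process_group(...) has been called and the model
# is already on the right device; `model` and `local_rank` are placeholders.
model = DistributedDataParallel(
    model,
    device_ids=[local_rank],
    broadcast_buffers=False,  # don't re-broadcast BN buffers from rank 0 every forward pass
)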
