How to set the "lr" when using "ap.SyncBatchNorm"? #2

hi @xvjiarui,
I moved the code into maskrcnn-benchmark and ran the config mask_rcnn_r16_ct_c3-c5_r50_sbn_fpn_1x with the settings 16 images / 8 GPUs, lr=0.02, using ap.SyncBatchNorm. It encounters NaN in the first few iterations, and it seems to use more GPU memory than mask_rcnn_r50_fpn_1x. When I set the lr to 0.0025, the training runs successfully. So can you give me some tips on how to set the lr when using ap.SyncBatchNorm?
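For background, maskrcnn-benchmark configs generally follow the linear scaling rule: the base lr is kept proportional to the total batch size, and switching to SyncBatchNorm does not by itself change that rule. A minimal sketch of the rule (my illustration, not code from this thread; the 0.02-for-16-images pairing comes from the config quoted above):

```python
def scaled_lr(images_per_batch, base_lr=0.02, base_batch=16):
    """Linear scaling rule: lr grows in proportion to the total batch size."""
    return base_lr * images_per_batch / base_batch

print(scaled_lr(16))  # 0.02   -- the standard lr for 16 images, which hit NaN here
print(scaled_lr(2))   # 0.0025 -- matches the much smaller lr that trained stably
```

Note that 0.0025 is the lr the rule prescribes for a 2-image batch, so running it with 16 images trades stability for a much weaker effective step size.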
Comments

i use …
The setting looks good to me. I suggest you first run Mask R-CNN without GC with 16 images on 8 GPUs.
Yes, I ran Mask R-CNN both with and without GC, but using sync BN. If I don't use clip_grads, both of them encounter NaN. For now I've solved it by clipping gradients.
hi @xvjiarui …

The Sync BN is completely fixed during testing, just like BN. What does your loss look like? I suspect that may be due to the …
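To illustrate the point about BN being fixed at test time, here is a minimal sketch of standard PyTorch behavior (my illustration, not code from this thread): in eval mode, BatchNorm and SyncBatchNorm normalize with their stored running statistics instead of the current batch's statistics.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(8)           # SyncBatchNorm behaves the same way at test time
x = torch.randn(4, 8, 16, 16)

bn.train()
_ = bn(x)   # training mode: normalizes with batch stats and updates running stats

bn.eval()
_ = bn(x)   # eval mode: normalizes with the fixed running_mean / running_var
```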
hi,
… and use clip_grad after losses.backward() as below, where max_norm is set to 35 and norm_type is set to 2:
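The actual snippet was not captured in this thread; below is a minimal self-contained sketch of the call being described, using PyTorch's torch.nn.utils.clip_grad_norm_ with the stated max_norm=35 and norm_type=2 (the model and optimizer are hypothetical stand-ins, not the detector from the config):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the detector and its optimizer (not from the thread).
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)

losses = model(torch.randn(4, 10)).sum()
losses.backward()

# The step described above: clip the global L2 norm of all gradients to 35.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=35, norm_type=2)

optimizer.step()
optimizer.zero_grad()
```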
After training with these configs, the mAP on COCO val2017 is 0.0; the log is as below: …
If I don't use clip_grad, it encounters the NaN problem.
Hi, my experiment could not run with PyTorch SyncBN or even plain BatchNorm. Did you remove the broadcast_buffers in DDP? I want to check where the problem is, thank you!
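For reference, broadcast_buffers is a constructor argument of torch.nn.parallel.DistributedDataParallel: when True (the default), rank 0's buffers, including BN running statistics, are broadcast to all ranks on each forward pass. A hedged sketch of turning it off, which is what "remove the broadcast_buffer" refers to (assumes a launcher such as torchrun that sets LOCAL_RANK; the Conv2d is a hypothetical stand-in for the detector):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes launch via torchrun so LOCAL_RANK is set and NCCL is available.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Conv2d(3, 8, 3).cuda()  # hypothetical stand-in for the detector
model = DDP(
    model,
    device_ids=[local_rank],
    broadcast_buffers=False,  # do not re-broadcast buffers (BN stats) each forward
)
```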