Loss becoming nan while training on large batch size on COCO #60
Comments
@subhankar-ghosh that is odd, it looks like you scaled the LR appropriately... are you using the exact training script here on COCO with the standard model head (not changing it), and still using the other params in the command above, especially warmup and sync-bn? The first versions of this were very touchy, but that's mostly because there were issues in the model and head init; the head currently does not get re-inited properly without doing it manually if you replace it after model creation.
@rwightman, I took this commit and changed nothing; my training script is the same as yours:
Very strange.
@subhankar-ghosh nope, I have no idea. I don't have access to those sorts of resources, so no ability to test; I'm actually surprised it works at all with 4 nodes, 8 GPUs per node. There are limits to the scaling w.r.t. batch size and LR, and I'm not sure if that's being run into or if there are other issues. Does single-node 8-GPU training definitely have stability issues? Someone else was doing something like that and didn't mention any issues.
I have the same problem while training with and I found the output becomes nan inside the model:

```python
class EfficientDet(nn.Module):
    def __init__(self, config, norm_kwargs=None, pretrained_backbone=True, alternate_init=False):
        super(EfficientDet, self).__init__()
        ...

    def forward(self, x):
        x = self.backbone(x)  # x becomes all nan here
        x = self.fpn(x)
        x_class = self.class_net(x)
        x_box = self.box_net(x)
        return x_class, x_box
```
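A quick way to narrow down where NaNs first appear is to check each intermediate tensor; in PyTorch one would register forward hooks and test `torch.isnan(out).any()` per submodule (backbone, fpn, class_net, box_net). The helper below is a framework-free sketch of the same idea; all names are illustrative, not from the repo:

```python
import math

def first_nan_index(values):
    """Return the index of the first NaN in a flat list of floats, or -1.
    Illustrative stand-in for checking torch.isnan(out).any() in a
    forward hook on each submodule to locate where NaNs first appear."""
    for i, v in enumerate(values):
        if math.isnan(v):
            return i
    return -1
```

Running this per layer's flattened output would show whether the NaNs really originate in the backbone or earlier, e.g. in the input pipeline.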
@wangraying From my experiments I found that as I scaled the LR linearly with batch size, I also had to increase the warmup epochs, either linearly or at least as the square root of the batch-size factor. This prevents the loss from becoming nan. Maybe this will work for you too.
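The rule of thumb above is plain arithmetic; the function name and the baseline numbers (taken from the 4-GPU command in this thread) are assumptions for illustration only:

```python
def scale_hparams(base_lr, base_warmup_epochs, base_batch, new_batch):
    # Linear LR scaling with global batch size, square-root scaling
    # of warmup epochs, per the recipe described above.
    factor = new_batch / base_batch
    return base_lr * factor, base_warmup_epochs * factor ** 0.5

# Going from 4 GPUs x 22 images to 32 GPUs x 22 images:
lr, warmup_epochs = scale_hparams(0.12, 5, 4 * 22, 32 * 22)
# lr -> 0.96, warmup_epochs -> ~14.1
```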
@wangraying don't use 'sgd' as the opt string; it is not stable with the default hparams. For legacy reasons, 'sgd' with my optimizer factory is SGD + nesterov, while 'momentum' is SGD without nesterov, and that's what the hparams from the official paper/impl were based on... also, the official version does the warmup ramp per step, and I only ramp per epoch; one could try a different scheduler. I'm going to close this issue now. I don't think there is any major bug or defect here (please let me know if one is found), just the usual hparam tuning, and the default hparams for these models are on the edge of stability.
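The per-step vs. per-epoch distinction matters at large batch sizes, where an epoch is only a few hundred steps and a per-epoch ramp takes large LR jumps. A minimal per-step linear warmup in the spirit of the official impl (the function and its shape are a sketch, not this repo's scheduler):

```python
def warmup_lr(step, base_lr, warmup_epochs, steps_per_epoch):
    # Ramp the LR linearly from ~0 to base_lr over the warmup steps,
    # then hold it (a real schedule would decay afterwards).
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```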
@subhankar-ghosh I also noticed that for large-scale, multi-node training, increasing the number of warmup epochs seems to avoid the NaN (divergence) issue; however, sometimes the training will still diverge in later epochs. Have you found a set of hyper-params that can reach 33.6 mAP for d0 under the 32-GPU setting? So far the best I can do is 33.5 mAP with 8 GPUs, and this just linearly scales the LR.
Hey @pichuang1984, the best result I have got with the d0 32-GPU setting is 33.35 mAP. I simply scaled the LR and warmup linearly and used ema-decay 0.999 instead of 0.9999. Lowering ema-decay a little bit actually might lead to faster convergence.
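Why the ema-decay change helps can be seen from the update rule: with decay d, the EMA roughly averages the last 1 / (1 - d) updates, so 0.999 tracks the raw weights about ten times faster than 0.9999. A minimal sketch of the arithmetic (illustrative, not the repo's ModelEma code):

```python
def ema_update(ema_value, new_value, decay):
    # One exponential-moving-average step, as used for model weight EMA.
    return decay * ema_value + (1.0 - decay) * new_value

def effective_window(decay):
    # Rough number of recent updates the EMA averages over.
    return 1.0 / (1.0 - decay)

# effective_window(0.999) ~ 1,000 updates; effective_window(0.9999) ~ 10,000
```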
@rwightman what is the role of
Hey @rwightman ,
I am trying to train the EfficientDet D0 model on COCO from scratch; it works perfectly and converges when I use your settings:
```
./distributed_train.sh 4 /mscoco --model efficientdet_d0 -b 22 --amp --lr .12 --sync-bn --opt fusedmomentum --warmup-epochs 5 --lr-noise 0.4 0.9 --model-ema --model-ema-decay 0.9999
```
But when I use a larger batch-size setting like the following, the loss becomes nan:
With AMP there is a cascade of loss-scale reductions and then the loss becomes nan, and this does not necessarily happen in the first few epochs.
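The "cascade" refers to dynamic loss scaling backing off repeatedly: each time an inf/nan gradient is found, the scale is multiplied by a backoff factor, and a run of consecutive overflows drives it toward zero. A toy model of that behavior (the factors mirror common GradScaler defaults, but this is a sketch, not the AMP implementation):

```python
def step_loss_scale(scale, found_inf, backoff=0.5, growth=2.0):
    # Back off on overflow; otherwise grow (real AMP grows only every
    # growth_interval successful steps, omitted here for brevity).
    return scale * (backoff if found_inf else growth)

scale = 65536.0
for _ in range(10):          # ten consecutive overflows
    scale = step_loss_scale(scale, found_inf=True)
# scale is now 64.0: a backoff cascade like this precedes the nan loss
```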
Strangely enough, the TF1 Google automl code base has no problem with linear scaling. Do you know what the problem might be here? Do I need to change the recipe when using a large batch size?