
Optimizer Choice: SGD vs Adam #4

Closed
glenn-jocher opened this issue Sep 4, 2018 · 21 comments
Labels: help wanted, Stale

@glenn-jocher
Member

When developing the training code I found that SGD caused divergence very quickly at the default LR of 1e-4. Loss terms began to grow exponentially, becoming Inf within about 10 batches of starting training.

In contrast, Adam always seems to converge, which is why I use it as the default optimizer in train.py. I don't understand why Adam works and SGD does not, since darknet trains successfully with SGD. This is one of the key differences between darknet and this repo, so any insight into how we can get SGD to converge would be appreciated.

It might be that I simply don't have the proper learning rate (and scheduler) in place.

Line 82 of train.py:

# optimizer = torch.optim.SGD(model.parameters(), lr=.001, momentum=.9, weight_decay=5e-4)
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4, weight_decay=5e-4)
@glenn-jocher glenn-jocher added the help wanted Extra attention is needed label Sep 9, 2018
@glenn-jocher glenn-jocher self-assigned this Sep 9, 2018
@glenn-jocher
Member Author

See #2 (comment) for possible SGD warm-up requirements.

@xyutao

xyutao commented Sep 12, 2018

@glenn-jocher In the official darknet code, the burn_in config is defined here:
https://github.com/pjreddie/darknet/blob/680d3bde1924c8ee2d1c1dea54d3e56a05ca9a26/cfg/yolov3.cfg#L19

If current batch_num < burn_in, the learning rate would be scaled based on the value of burn_in:
https://github.com/pjreddie/darknet/blob/680d3bde1924c8ee2d1c1dea54d3e56a05ca9a26/src/network.c#L95

@glenn-jocher
Member Author

@xyutao thanks for the links. This looks like a fairly easy change to implement. I can go ahead and submit a commit with this. Have you tried this out successfully on your side?

@glenn-jocher glenn-jocher changed the title Optimizer Choice: SGD vs Adam Optimizer Choice: SGD w/ burn-in vs Adam Sep 13, 2018
@glenn-jocher
Member Author

glenn-jocher commented Sep 19, 2018

@xyutao From your darknet link I think the correct burn-in formula is the following, which slowly ramps the LR up to 1e-3 over the first 1000 iterations and then leaves it there:

# SGD burn-in
if epoch == 0 and i <= 1000:
    power = ??  # correct exponent unknown, see below
    lr = 1e-3 * (i / 1000) ** power  # ramp LR from 0 to 1e-3 over the first 1000 batches
    for g in optimizer.param_groups:
        g['lr'] = lr

I can't find the correct value of power though. I tried power=2 and training diverged around 200 iterations. Increasing to power=5, training diverged after 400 iterations; power=10 also diverged.

I see that the divergence is in the width and height losses; the other terms appear fine. I think one problem may be that the width and height terms are bounded at zero below but unbounded above, so it's possible the network is predicting impossibly large widths and heights, causing those losses to diverge. I may need to bound these or redefine the width and height terms and try again. I used a variant of the width and height terms in a different project that had no divergence problems with SGD.

@xyutao

xyutao commented Sep 20, 2018

@glenn-jocher The default value of power is 4. See:
https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/src/parser.c#L698
When I tried other yolov3 implementations, training successfully converged with power = 1 to 4. Maybe, just as you thought, the problem lies in the width and height losses.

@glenn-jocher
Member Author

Closing this as SGD burn-in has been successfully implemented.

@kieranstrobel

Although I know this is closed: we exclusively use Adam for training with our fork of this repo. It instantly took us from 20% precision on our dataset to 85% (with slight mAP increases as well).

@glenn-jocher
Member Author

@kieranstrobel that's interesting. Have you trained COCO as well with Adam?

We tried Adam as well as AdaBound recently, but observed performance drops with both on COCO. What LR did you use for Adam vs SGD?

@glenn-jocher
Member Author

@kieranstrobel I ran a quick comparison using our small COCO dataset coco_16img.data, with the default hyperparameters for both, i.e.:

    # Optimizer (Adam vs SGD toggled between runs)
    optimizer = optim.Adam(model.parameters(), lr=hyp['lr0'], weight_decay=hyp['weight_decay'])
    # optimizer = AdaBound(model.parameters(), lr=hyp['lr0'], final_lr=0.1)
    optimizer = optim.SGD(model.parameters(), lr=hyp['lr0'], momentum=hyp['momentum'], weight_decay=hyp['weight_decay'], nesterov=True)

The training command was:

python3 train.py --data data/coco_16img.data --batch-size 16 --accumulate 1 --img-size 320 --nosave --cache

[results plot: Adam vs SGD training comparison on coco_16img.data]

BTW, the burn-in period (the original issue topic) has been removed, because the wh-divergence issue is now resolved: GIoU loss replaced the four individual regression losses (x, y, w, h). The scenario above should actually favor Adam, since Adam is known for reducing training losses more than validation losses (and then failing to generalize well), and this dataset trains and validates on the same images. Yet SGD still clearly outperforms it.

Can you plot a comparison using your custom dataset?

@glenn-jocher
Member Author

glenn-jocher commented Sep 2, 2019

I see xNets https://arxiv.org/pdf/1908.04646v2.pdf uses Adam at 5E-5 LR in their results, so I ran another study of Adam on the first epoch of COCO at 320. The results show the lowest validation loss and best mAP (0.202) at 9E-5 Adam LR. This exceeds the 0.161 SGD mAP after the same 1 epoch. The validation losses were also lower with Adam:

  • [1.79, 3.96, 2.44] Adam val losses lr=9E-5 (giou, obj, cls)
  • [1.80, 4.15, 2.68] SGD val losses lr=0.0023, momentum=0.97 (giou, obj, cls)

I will try to train to 27 epochs with Adam at this LR next.

for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15  # Adam LR = i * 1E-5
do
  python3 train.py --epochs 1 --weights weights/darknet53.conv.74 --batch-size 64 --accumulate 1 --img-size 320 --var ${i}
done
sudo shutdown

[results plot: validation losses and mAP vs Adam LR after 1 COCO epoch]

@glenn-jocher glenn-jocher changed the title Optimizer Choice: SGD w/ burn-in vs Adam Optimizer Choice: SGD vs Adam Sep 4, 2019
@glenn-jocher
Member Author

44.9 SGD vs 45.2 Adam 9E-5 LR:
[results plot: final mAP comparison, SGD vs Adam]

@xuefeicao

44.9 SGD vs 45.2 Adam 9E-5 LR:

Hi, thanks for the experiment. Are you using weight decay here for this Adam experiment?

@glenn-jocher
Member Author

@xuefeicao yes, for both. Search train.py for weight_decay.

@xuefeicao

xuefeicao commented Dec 9, 2019 via email

@github-actions

github-actions bot commented Mar 8, 2020

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

@nanhui69

@glenn-jocher Do these experimental results apply to this repo now? I see the optimizer is also SGD.

@glenn-jocher
Member Author

@nanhui69 yes, but I would recommend yolov5 for new projects.
https://github.com/ultralytics/yolov5

@danielcrane

@glenn-jocher Do you happen to know if YOLOv5 has the same issue with Adam performing better than the default SGD?

@glenn-jocher
Member Author

@danielcrane I don't know, but you can test Adam out on your own training workflows by passing the --adam flag (make sure you reduce your LR accordingly in your hyp file):

lr0: 0.01 # initial learning rate (SGD=1E-2, Adam=1E-3)

@danielcrane

@glenn-jocher Understood, thanks!

@glenn-jocher
Member Author

@danielcrane you're welcome! If you have any other questions, feel free to ask.
