Warmup schedulers in References #4411
Conversation
LGTM, thanks!
Just double-checking, did you run the new schedulers in a loop to compare before / after results?
```diff
     optimizer,
-    lambda x: (1 - x / (len(data_loader) * args.epochs)) ** 0.9)
+    lambda x: (1 - x / (iters_per_epoch * (args.epochs - args.lr_warmup_epochs))) ** 0.9)
```
Created an issue to track if we can now use a stock PyTorch scheduler for this #4438
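For context, the lambda in the diff above is a polynomial decay with power 0.9. A hedged sketch of the equivalence being tracked, assuming a PyTorch release that ships `PolynomialLR` (it was added after this PR, so availability depends on your version) and with placeholder values standing in for the `args.*` settings:

```python
import torch

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Placeholder values standing in for the reference script's args.* settings.
iters_per_epoch, epochs, lr_warmup_epochs = 100, 30, 1
main_iters = iters_per_epoch * (epochs - lr_warmup_epochs)

# The custom LambdaLR from the diff above: polynomial decay with power 0.9.
poly_lambda = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda x: (1 - x / main_iters) ** 0.9)

# A stock alternative in newer PyTorch releases (assuming it is available):
# poly_stock = torch.optim.lr_scheduler.PolynomialLR(
#     optimizer, total_iters=main_iters, power=0.9)
```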
@fmassa Thanks! Yes I did. They match very closely (to the 5th decimal place) most of the time. For reference, here are the scripts used to test this:
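The scripts themselves are not included in this excerpt. As a hypothetical illustration of this kind of before/after check (not the author's actual scripts; names and values are made up), one could record the learning-rate sequence produced by a hand-rolled linear warmup, similar to what the detection reference previously used, and by the stock `LinearLR`, then compare them:

```python
import torch

def lr_sequence(make_scheduler, total_iters, lr=0.02):
    """Record the learning rate at every iteration for a given scheduler factory."""
    model = torch.nn.Linear(2, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    scheduler = make_scheduler(optimizer)
    lrs = []
    for _ in range(total_iters):
        lrs.append(optimizer.param_groups[0]["lr"])
        optimizer.step()
        scheduler.step()
    return lrs

warmup_iters, warmup_factor = 1000, 0.001

# "Before": a hand-rolled linear warmup expressed as a LambdaLR.
def old_warmup(optimizer):
    def f(x):
        if x >= warmup_iters:
            return 1.0
        alpha = x / warmup_iters
        return warmup_factor * (1 - alpha) + alpha
    return torch.optim.lr_scheduler.LambdaLR(optimizer, f)

# "After": the stock LinearLR scheduler.
def new_warmup(optimizer):
    return torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=warmup_factor, total_iters=warmup_iters)

before = lr_sequence(old_warmup, warmup_iters + 10)
after = lr_sequence(new_warmup, warmup_iters + 10)
print(max(abs(a - b) for a, b in zip(before, after)))  # expect agreement to ~5 decimals
```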
Summary:

* Warmup on Classification references.
* Adjust epochs for cosine.
* Warmup on Segmentation references.
* Warmup on Video classification references.
* Adding support of both types of warmup in segmentation.
* Use LinearLR in detection.
* Fix deprecation warning.

Reviewed By: datumbox

Differential Revision: D31268039

fbshipit-source-id: d0fe7e334c01201c2413bac8b911d740b9a69bba
Resolves #4281
Adds warmup on the following recipes:

* Classification
* Segmentation
* Video classification
* Detection (now using LinearLR)
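As a rough sketch of the warmup pattern being added (not the exact reference code; names and values are illustrative), a linear warmup can be chained in front of the main schedule using the schedulers introduced in PyTorch 1.10:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR, LinearLR, SequentialLR

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)

# Placeholder values standing in for the reference script's args.* settings.
iters_per_epoch, epochs, lr_warmup_epochs = 100, 30, 1
warmup_iters = iters_per_epoch * lr_warmup_epochs
main_iters = iters_per_epoch * (epochs - lr_warmup_epochs)

# Linear warmup for the first warmup_iters steps...
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_iters)
# ...followed by the polynomial decay from the diff, defined over the remaining steps.
main = LambdaLR(optimizer, lambda x: (1 - x / main_iters) ** 0.9)

# Chain them so the warmup runs first and then hands over to the main schedule.
scheduler = SequentialLR(
    optimizer, schedulers=[warmup, main], milestones=[warmup_iters])
```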
This PR maintains the location where we call `scheduler.step()` in each recipe: Segmentation and Video classification step it on the iteration level, Classification steps it on the epoch level, and Detection uses a hybrid of the two. Though stepping on the iteration level provides more flexibility, making that switch would have slight effects on the reproducibility of existing models. These effects should be minor and largely overshadowed by other differences across runs (such as the randomness of the initialization scheme). The only reason I'm not making the switch here is that it requires extra work, which I'm deferring until we start retraining the models using the new utils of Batteries Included.
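A minimal sketch of the two stepping conventions described above, with a dummy model and `StepLR` standing in for a real training loop and scheduler:

```python
import torch

def make_training_state(lr=0.01):
    """Build a toy model, optimizer, and scheduler for illustration."""
    model = torch.nn.Linear(2, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
    return model, optimizer, scheduler

epochs, iters_per_epoch = 3, 5

# Epoch-level stepping (Classification recipe): the scheduler advances once per epoch.
model, optimizer, scheduler = make_training_state()
for epoch in range(epochs):
    for _ in range(iters_per_epoch):  # stands in for iterating over the data loader
        optimizer.step()              # stands in for a full forward/backward/update
    scheduler.step()

# Iteration-level stepping (Segmentation and Video recipes): the scheduler advances
# after every optimizer update, so schedule lengths must be counted in iterations.
model, optimizer, scheduler = make_training_state()
for epoch in range(epochs):
    for _ in range(iters_per_epoch):
        optimizer.step()
        scheduler.step()
```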