fix a little bug about resume #1628
Conversation
When resuming, we need to start from the last epoch, not 0.
Thanks for the catch!
I have a question, let me know your thoughts.
Alternatively, one could also do something similar to the classification scripts, where we do

vision/references/classification/train.py, line 209 in 5b1716a:
    for epoch in range(args.start_epoch, args.epochs):

vision/references/classification/train.py, line 201 in 5b1716a:
    args.start_epoch = checkpoint['epoch'] + 1
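The classification-style resume quoted above can be sketched as follows. This is a minimal sketch, not the script itself: `resume_range` is a hypothetical helper, and the checkpoint dict is assumed to store the last *completed* epoch under the key 'epoch', as in the quoted line.

```python
def resume_range(checkpoint: dict, total_epochs: int) -> range:
    """Hypothetical helper mirroring the quoted classification-script logic.

    The checkpoint stores the last completed epoch, so training restarts
    at the following one (checkpoint['epoch'] + 1).
    """
    start_epoch = checkpoint['epoch'] + 1
    return range(start_epoch, total_epochs)


# A run interrupted after finishing epoch 2 of 10 resumes at epoch 3:
epochs_left = resume_range({'epoch': 2}, 10)
```

Note the cost this style carries, which the discussion below returns to: the epoch must be saved into the checkpoint explicitly, and a wrong `start_epoch` silently restarts training from the wrong place.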
references/detection/train.py (outdated):
    optimizer.load_state_dict(checkpoint['optimizer'])
    lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])

    last_epoch = lr_scheduler.last_epoch
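A minimal sketch of how this scheduler-based resume round-trips. `TinyScheduler` is a hypothetical stand-in for the real torch scheduler, modeling only the `last_epoch` bookkeeping and the `state_dict`/`load_state_dict` interface the diff relies on:

```python
class TinyScheduler:
    """Stand-in for the part of a PyTorch LR scheduler used in the diff:
    it counts completed epochs in `last_epoch` and round-trips that value
    through a state dict."""

    def __init__(self):
        self.last_epoch = 0

    def step(self):
        # Called once at the end of each training epoch.
        self.last_epoch += 1

    def state_dict(self):
        return {'last_epoch': self.last_epoch}

    def load_state_dict(self, state):
        self.last_epoch = state['last_epoch']


# Simulate a run that completed epochs 0, 1, 2 before being interrupted:
sched = TinyScheduler()
for _ in range(3):
    sched.step()
checkpoint = {'lr_scheduler': sched.state_dict()}

# On resume: restore the scheduler and read the start epoch from it, so
# no separate 'epoch' key needs to be stored or validated.
resumed = TinyScheduler()
resumed.load_state_dict(checkpoint['lr_scheduler'])
last_epoch = resumed.last_epoch  # training continues at this epoch
```

The design point of the diff is that the scheduler already carries the epoch count in its state dict, so resuming needs no extra bookkeeping beyond restoring it.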
Just double-checking: should this be lr_scheduler.last_epoch or lr_scheduler.last_epoch + 1?
I had the same idea as you at first, but I think this style is simpler and leaves fewer parameters to set. Most importantly, others won't make a mistake about which epoch to resume from, because the last epoch is kept by the lr_scheduler (and we don't need to worry about a wrongly set start epoch overwriting the checkpoint). Otherwise you would have to save the epoch in the checkpoint and check its value before resuming; this simple way means you only need to set the checkpoint model. For your convenience when merging, I will submit the second approach as well so you can choose.

As for lr_scheduler.last_epoch vs. lr_scheduler.last_epoch + 1, I have checked and it's lr_scheduler.last_epoch. The first time scheduler.step() runs, last_epoch is set to 1. For example, if the program breaks at epoch 3 [200/2000], we should restart from last_epoch=3 (epochs 0, 1, and 2 have already been run).
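The off-by-one reasoning above can be checked with a plain counter in place of the real scheduler (no torch needed), assuming as the comment describes that step() runs once per completed epoch and that a fresh scheduler reports last_epoch == 0:

```python
last_epoch = 0       # what a fresh scheduler reports before any step()
completed = []
for epoch in range(0, 10):
    if epoch == 3:   # simulate the crash at epoch 3 [200/2000]
        break
    completed.append(epoch)
    last_epoch += 1  # scheduler.step() at the end of each epoch

# Epochs 0, 1, 2 finished; last_epoch is now 3, which is exactly the
# epoch to resume from -- no "+ 1" needed.
```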
Thanks!
the second way for resuming
Thanks!
* fix a little bug about resume. When resuming, we need to start from the last epoch, not 0.
* the second way for resuming
Summary:
* fix a little bug about resume. When resuming, we need to start from the last epoch, not 0.
* the second way for resuming

Pull Request resolved: #1806
Reviewed By: javier-m
Differential Revision: D19599039
Pulled By: fmassa
fbshipit-source-id: 22ebe14bd1ba7728cbdc5149ee181429b834a307