Skip to content

Conversation

kangkang59812
Copy link
Contributor

When resuming, we need to start from the last epoch not 0.

When resuming, we need to start from the last epoch not 0.
Copy link
Member

@fmassa fmassa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch!

I have a question, let me know your thoughts.

Alternatively, one could also do something similar to the classification scripts, where we do

for epoch in range(args.start_epoch, args.epochs):
and
args.start_epoch = checkpoint['epoch'] + 1
, what are your thoughts on this?

optimizer.load_state_dict(checkpoint['optimizer'])
lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])

last_epoch = lr_scheduler.last_epoch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double-checking: should this be lr_scheduler.last_epoch or lr_scheduler.last_epoch + 1?

Copy link
Contributor Author

@kangkang59812 kangkang59812 Dec 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same idea with you at first, but I think this style is more simple and make less parameter to set. Most importantly, others won't make a mistake about which epoch to resume because the last epoch is kept by lr_scheduler(And we don't need to consider if we set a wrong start epoch that will overwrite the checkpoint). Maybe you have to save epoch in checkpoint and need to check the value before resuming. This simple way make you only need to set the checkpoint model. For your convenience in merge, I will submit the second one for chosing.
As for lr_scheduler.last_epoch or lr_scheduler.last_epoch + 1, I have checked and it's the lr_scheduler.last_epoch. The first time that scheduler.step() runs, last_epoch is set to 1. For example, if the program breaks at epoch 3 [200/2000], we should start from last_epoch=3(0,1,2 epoch have been run).
Thanks!

the second way for resuming
Copy link
Member

@fmassa fmassa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@fmassa fmassa merged commit f4a8224 into pytorch:master Dec 19, 2019
fmassa pushed a commit to fmassa/vision-1 that referenced this pull request Jan 28, 2020
* fix a little bug about resume

When resuming, we need to start from the last epoch not 0.

* the second way for resuming

the second way for resuming
facebook-github-bot pushed a commit that referenced this pull request Jan 28, 2020
Summary:
* fix a little bug about resume

When resuming, we need to start from the last epoch not 0.

* the second way for resuming

the second way for resuming
Pull Request resolved: #1806

Reviewed By: javier-m

Differential Revision: D19599039

Pulled By: fmassa

fbshipit-source-id: 22ebe14bd1ba7728cbdc5149ee181429b834a307
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants