fix a little bug about resume #1628
Conversation
When resuming, we need to start from the last epoch, not 0.
Thanks for the catch!
I have a question, let me know your thoughts.
Alternatively, one could also do something similar to the classification scripts, where we do

vision/references/classification/train.py, line 209 in 5b1716a:
    for epoch in range(args.start_epoch, args.epochs):

vision/references/classification/train.py, line 201 in 5b1716a:
    args.start_epoch = checkpoint['epoch'] + 1
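The classification-style resume quoted above can be sketched as follows. This is a minimal sketch, not the script itself: `resume_range` is a hypothetical helper, and the checkpoint dict is assumed to store the last *completed* epoch under the key 'epoch', as in the quoted line.

```python
def resume_range(checkpoint: dict, total_epochs: int) -> range:
    """Hypothetical helper mirroring the quoted classification-script logic.

    The checkpoint stores the last completed epoch, so training restarts
    at the following one (checkpoint['epoch'] + 1).
    """
    start_epoch = checkpoint['epoch'] + 1
    return range(start_epoch, total_epochs)


# A run interrupted after finishing epoch 2 of 10 resumes at epoch 3:
epochs_left = resume_range({'epoch': 2}, 10)
```

Note the cost this style carries, which the discussion below returns to: the epoch must be saved into the checkpoint explicitly, and a wrong `start_epoch` silently restarts training from the wrong place.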
references/detection/train.py (outdated):
    optimizer.load_state_dict(checkpoint['optimizer'])
    lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])

    last_epoch = lr_scheduler.last_epoch
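A minimal sketch of how this scheduler-based resume round-trips. `TinyScheduler` is a hypothetical stand-in for the real torch scheduler, modeling only the `last_epoch` bookkeeping and the `state_dict`/`load_state_dict` interface the diff relies on:

```python
class TinyScheduler:
    """Stand-in for the part of a PyTorch LR scheduler used in the diff:
    it counts completed epochs in `last_epoch` and round-trips that value
    through a state dict."""

    def __init__(self):
        self.last_epoch = 0

    def step(self):
        # Called once at the end of each training epoch.
        self.last_epoch += 1

    def state_dict(self):
        return {'last_epoch': self.last_epoch}

    def load_state_dict(self, state):
        self.last_epoch = state['last_epoch']


# Simulate a run that completed epochs 0, 1, 2 before being interrupted:
sched = TinyScheduler()
for _ in range(3):
    sched.step()
checkpoint = {'lr_scheduler': sched.state_dict()}

# On resume: restore the scheduler and read the start epoch from it, so
# no separate 'epoch' key needs to be stored or validated.
resumed = TinyScheduler()
resumed.load_state_dict(checkpoint['lr_scheduler'])
last_epoch = resumed.last_epoch  # training continues at this epoch
```

The design point of the diff is that the scheduler already carries the epoch count in its state dict, so resuming needs no extra bookkeeping beyond restoring it.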
Just double-checking: should this be lr_scheduler.last_epoch or lr_scheduler.last_epoch + 1?
I had the same idea as you at first, but I think this style is simpler and leaves fewer parameters to set. Most importantly, others won't make a mistake about which epoch to resume from, because the last epoch is kept by the lr_scheduler (and we don't need to worry about a wrongly set start epoch overwriting the checkpoint). Otherwise you would have to save the epoch in the checkpoint and check its value before resuming; this simple way means you only need to set the checkpoint model. For your convenience when merging, I will submit the second approach as well so you can choose.

As for lr_scheduler.last_epoch vs. lr_scheduler.last_epoch + 1, I have checked and it's lr_scheduler.last_epoch. The first time scheduler.step() runs, last_epoch is set to 1. For example, if the program breaks at epoch 3 [200/2000], we should restart from last_epoch=3 (epochs 0, 1, and 2 have already been run).
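The off-by-one reasoning above can be checked with a plain counter in place of the real scheduler (no torch needed), assuming as the comment describes that step() runs once per completed epoch and that a fresh scheduler reports last_epoch == 0:

```python
last_epoch = 0       # what a fresh scheduler reports before any step()
completed = []
for epoch in range(0, 10):
    if epoch == 3:   # simulate the crash at epoch 3 [200/2000]
        break
    completed.append(epoch)
    last_epoch += 1  # scheduler.step() at the end of each epoch

# Epochs 0, 1, 2 finished; last_epoch is now 3, which is exactly the
# epoch to resume from -- no "+ 1" needed.
```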
Thanks!
the second way for resuming
Thanks!
* fix a little bug about resume. When resuming, we need to start from the last epoch, not 0.
* the second way for resuming
Summary:
* fix a little bug about resume. When resuming, we need to start from the last epoch, not 0.
* the second way for resuming

Pull Request resolved: #1806
Reviewed By: javier-m
Differential Revision: D19599039
Pulled By: fmassa
fbshipit-source-id: 22ebe14bd1ba7728cbdc5149ee181429b834a307