[Tune] use epoch for ptl checkpoint dir name #14392
Conversation
step = trainer.current_epoch if is_epoch else trainer.global_step
with tune.checkpoint_dir(step=step) as checkpoint_dir:
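The naming logic under review can be sketched in isolation. This is a minimal, self-contained model of the decision, not the actual Tune callback internals: `checkpoint_step` and the `on_hooks` parameter are hypothetical names standing in for the callback's configured `on` values.

```python
# Hypothetical sketch of the directory-naming logic in the diff above.
# `on_hooks` stands in for the callback's configured `on` values; the
# real Tune/PTL callback internals may differ.

def checkpoint_step(on_hooks, current_epoch, global_step):
    """Pick the value used for the checkpoint directory name."""
    # Epoch-level hooks use the epoch counter; batch-level hooks fall
    # back to the global step, so directory names stay monotonic
    # within each triggering frequency.
    is_epoch = any("epoch" in hook for hook in on_hooks)
    return current_epoch if is_epoch else global_step

# Epoch-level checkpointing names directories by epoch:
print(checkpoint_step(["epoch_end"], current_epoch=3, global_step=1200))   # 3
# Batch-level checkpointing names directories by global step:
print(checkpoint_step(["batch_end"], current_epoch=3, global_step=1200))   # 1200
```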
Is it possible for `global_step` and `current_epoch` to override each other's checkpoints in the same training loop?
Hmm, I don't think that's possible as long as you're using one callback, since it will checkpoint at either the epoch frequency or the batch frequency.
But maybe I'm misunderstanding what you are asking.
ok cool
Hm, actually we do allow passing multiple `on` values to a single callback, so in that case it might interfere.
I'm not sure this is a common use case (mixing on-step and on-epoch checkpointing) but we might want to ensure we don't run into conflicts.
Things we could do: First, pass a formatted string instead of an int to `tune.checkpoint_dir(step)` (which is actually supported). Second, set `self._epoch = True` if any of the `on` values indicate per-epoch checkpointing, and always use epochs for checkpointing (potentially overwriting existing checkpoints at the end of an epoch). Third, disallow multiple `on` values for checkpoints.
Fourth, like the third but throw a warning instead. I'd be fine with that.
Right, passing multiple values to `on` would be pretty uncommon but is still possible. I'll change it to option 1; the default PTL callbacks do the same thing.
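Option 1 can be sketched as follows. This is a hypothetical illustration of the naming scheme (similar in spirit to PTL's default `ModelCheckpoint` filenames, which include both counters), not the exact string the merged code produces:

```python
# Hypothetical sketch of option 1: build a formatted string instead of
# a bare int, so epoch-triggered and step-triggered checkpoints from
# the same callback can never collide on a directory name.

def checkpoint_dir_name(current_epoch, global_step):
    # Including both counters keeps the name unique even when a single
    # callback fires on both epoch-level and batch-level hooks.
    return f"epoch={current_epoch}-step={global_step}"

print(checkpoint_dir_name(3, 1200))  # epoch=3-step=1200
```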
I made the change @krfricke, please take another look.
thanks!
Fine with me! Thanks!
@richardliaw can you merge this please
Why are these changes needed?

If using epoch-level callbacks (on_epoch_start or on_epoch_end), then use trainer.current_epoch as the checkpoint directory name instead of trainer.global_step.

Related issue number

Closes #13971
Checks

I've run scripts/format.sh to lint the changes in this PR.