Add checkpoint saving and loading functionality to training loop #123
Conversation
I think pipeline integration is a good idea. If you provide a checkpoint file path and it doesn't exist, just start checkpointing like normal. If you provide a checkpoint path and it does exist, load from there and continue training. Still to do:
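A minimal sketch of that branching logic, assuming a hypothetical load_state helper and the checkpoint_file parameter discussed in this PR (names are illustrative, not a confirmed API):

    import pathlib

    def train_with_checkpoint(training_loop, checkpoint_file: str, **train_kwargs):
        """Sketch of the proposed behavior; all names are illustrative."""
        path = pathlib.Path(checkpoint_file)
        if path.is_file():
            # The checkpoint exists: restore model/optimizer/epoch state
            # and continue training from where it left off.
            training_loop.load_state(path)  # hypothetical helper, not a confirmed API
        # Either way, keep writing checkpoints to the same path while training.
        return training_loop.train(checkpoint_file=str(path), **train_kwargs)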
Codecov Report

    @@            Coverage Diff             @@
    ##           master     #123      +/-   ##
    ==========================================
    - Coverage   68.10%   67.17%   -0.93%
    ==========================================
      Files          94       91       -3
      Lines        5950     5783     -167
      Branches      751      741      -10
    ==========================================
    - Hits         4052     3885     -167
    - Misses       1678     1691      +13
    + Partials      220      207      -13

Continue to review full report at Codecov.
@cthoyt Here's my update
The problem with a timeout is that it isn't deterministic, since it depends on the test server's speed. There's an easier way to do that. I'll provide tests once we have agreed on what the functions should look like 😄
I have just implemented full early stopping support.
Generally looks good to me. Most of the comments are not merge blockers 🙂
12f5d9a now adds support for the pipeline.
tests/test_pipeline.py (Outdated)

    """Test whether the resumed pipeline creates the same results as the one shot pipeline."""
    checkpoint_directory = self.temporary_directory.name

    # TODO check that there aren't any checkpoint files already existing in the right place
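A hedged sketch of what such an equivalence test might look like; the checkpoint-related training kwargs and pipeline arguments are assumptions based on this PR, not the actual test code:

    import pathlib

    from pykeen.pipeline import pipeline

    checkpoint_file = pathlib.Path(self.temporary_directory.name) / 'checkpoint.pt'
    # Addresses the TODO above: fail early if a stale checkpoint already exists.
    assert not checkpoint_file.is_file()

    # First run: train only part of the epochs while writing checkpoints.
    pipeline(
        dataset='nations',
        model='TransE',
        training_kwargs=dict(num_epochs=5, checkpoint_file=str(checkpoint_file)),  # assumed kwarg
    )

    # Second run: the same configuration finds the checkpoint, resumes, and
    # finishes the remaining epochs; its results should match an
    # uninterrupted 10-epoch run.
    resumed = pipeline(
        dataset='nations',
        model='TransE',
        training_kwargs=dict(num_epochs=10, checkpoint_file=str(checkpoint_file)),
    )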
Can we add some more tests here that illustrate what's going on between each step? It's pretty high level and won't be helpful to someone who hasn't fully grokked the code.
Wouldn't that be the tests in test_training.py?
For the same reason as given in #123 (comment), these tests only exist to check whether the pipeline handles those training loop checkpoints properly. The main functionality resides entirely in the training loop.
Okay, we don't have to overcomplicate this part. Maybe add a code comment saying to check the other test, then?
I wish the greatest bot of all time were here. Where are you @PyKEEN-bot??
To show how checkpoints are used with PyKEEN, let's look at a simple example of how a model is set up.
For fixing possible errors and safety fallbacks, please also look at :ref:`word_of_caution`.

.. code-block:: python
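    # The body of the quoted code block is elided above; this is a minimal
    # sketch of such a setup, using the checkpoint_file and checkpoint_frequency
    # names from this PR. Exact signatures are assumptions, not the final API.
    from torch.optim import Adam

    from pykeen.datasets import Nations
    from pykeen.models import TransE
    from pykeen.training import SLCWATrainingLoop

    # Set up a small model and optimizer on a toy dataset.
    dataset = Nations()
    model = TransE(triples_factory=dataset.training)
    optimizer = Adam(params=model.get_grad_params())

    # Train with checkpointing enabled.
    training_loop = SLCWATrainingLoop(model=model, optimizer=optimizer)
    training_loop.train(
        num_epochs=1000,
        checkpoint_file='my_model_checkpoint.pt',  # where to save to / resume from (assumed)
        checkpoint_frequency=30,                   # time between saves (assumed parameter)
    )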
Wouldn't these tutorials be more useful for users if they started by centering on the pipeline and then, at the end, gave some insight into the underlying implementation?
It is kept this way because checkpointing is, strictly speaking, a training loop feature: even though the pipeline supports using training loop checkpoints, they are not true pipeline checkpoints.
That's true, but I don't think the beginning of a tutorial section of the documentation benefits from being pedagogical. The technical part can be in the reference, or at the end of the tutorial to help users who understand how to use the simple parts and want to understand how it works. I think one place this worked really well was the First Steps tutorial, which ended with the Beyond the Pipeline section.
Could you elaborate on the difference between a training loop checkpoint and a pipeline checkpoint?
…n/pykeen into allow_training_checkpoints
@PyKEEN-bot What are your latest findings about our code quality?
@PyKEEN-bot test this, please!
Trigger CI
Ready to merge once CI passes. Thanks @lvermue for writing an excellent tutorial!
@PyKEEN-bot Run unit tests one more time, please!
Currently, training loops cannot save checkpoints.
This means that, especially for very long training sessions, training cannot be paused and resumed later, and users lose all training progress whenever a crash occurs, e.g. a power outage, a server reboot, or another user trying to use the same GPU.
The solution presented here makes it possible to:

- save checkpoints to a given `checkpoint_file`, based on the time elapsed since the last save, which can be adjusted via the `checkpoint_frequency` parameter
- load the state from a given `checkpoint_file` and resume training where it left off

To ensure that a given `checkpoint_file` matches the training loop configuration, a checksum is calculated from the hashed model and optimizer configuration (a sketch of this idea follows below).

To Do:
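A minimal sketch of how such a configuration checksum might be computed; the exact fields and hash function are assumptions, not necessarily this PR's implementation:

    import hashlib
    import json

    def config_checksum(model, optimizer) -> str:
        """Hash a canonical summary of the model and optimizer configuration.

        Sketch only: the PR states the checksum is derived from the hashed
        model and optimizer configuration; the fields used here are assumptions.
        """
        config = {
            'model': type(model).__name__,
            'optimizer': type(optimizer).__name__,
            # torch optimizers expose their hyperparameters via `defaults`
            'optimizer_defaults': {k: str(v) for k, v in optimizer.defaults.items()},
        }
        payload = json.dumps(config, sort_keys=True).encode('utf-8')
        return hashlib.md5(payload).hexdigest()

On resume, the checksum stored in the checkpoint can be recomputed and compared; a mismatch indicates the checkpoint was written by a different configuration.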