
Add checkpoint saving and loading functionality to training loop #123

Merged
lvermue merged 74 commits into master from allow_training_checkpoints on Dec 7, 2020

Conversation

@lvermue (Member) commented Nov 1, 2020

Currently, training loops cannot save checkpoints.
This means that long training sessions cannot be paused and resumed at a later point, and that users lose their entire training progress whenever a crash occurs, e.g. a power outage, a server reboot, or another user trying to use the same GPU.

The solution presented here makes it possible to:

  • Automatically store checkpoints during training by setting the parameter checkpoint_file; checkpoints are saved based on the time elapsed since the last save, which can be adjusted via the parameter checkpoint_frequency
  • Automatically restore checkpoints from the checkpoint_file and resume training where it left off
  • Ensure that a given checkpoint_file matches the training loop configuration via a checksum calculated from the hashed model and optimizer configuration (see the sketches after this list)
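As a rough illustration of the intended usage, here is a sketch of a training loop set up with these parameters. The general setup (Nations, TransE, SLCWATrainingLoop) follows the usual PyKEEN pattern; the checkpoint_file and checkpoint_frequency arguments are taken from the description above, and their exact names and semantics may differ in the final implementation.

```python
from torch.optim import Adam

from pykeen.datasets import Nations
from pykeen.models import TransE
from pykeen.training import SLCWATrainingLoop

# Standard PyKEEN setup: dataset, model, optimizer, and training loop.
dataset = Nations()
model = TransE(triples_factory=dataset.training)
optimizer = Adam(params=model.parameters())
training_loop = SLCWATrainingLoop(model=model, optimizer=optimizer)

# Checkpointing as described in this PR (argument names assumed from the text above):
# checkpoints are written to checkpoint_file based on the time elapsed since the last
# save, controlled by checkpoint_frequency, and restored from the file if it exists.
training_loop.train(
    num_epochs=1000,
    checkpoint_file='my_checkpoint.pt',
    checkpoint_frequency=5,
)
```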

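The configuration checksum mentioned in the last bullet could conceptually be computed along the following lines; this is a minimal sketch with invented names (config_checksum and the config dicts), not the actual implementation in this PR.

```python
import hashlib
import json

def config_checksum(model_config: dict, optimizer_config: dict) -> str:
    """Hypothetical helper: hash the model and optimizer configuration so that a
    checkpoint can later be verified against the training loop that loads it."""
    payload = json.dumps(
        {'model': model_config, 'optimizer': optimizer_config},
        sort_keys=True,  # deterministic serialization, so equal configs hash equally
    )
    return hashlib.md5(payload.encode('utf-8')).hexdigest()

# The checksum would be stored inside the checkpoint file on save and recomputed on
# load; a mismatch means the checkpoint belongs to a different configuration.
```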
To Do:

  • Decide how the pipeline should interact with the training loop checkpoint functionality
  • Write tests

@cthoyt (Member) commented Nov 1, 2020

I think pipeline integration is a good idea. If you provide a checkpoint file path and it doesn't exist, just start checkpointing like normal. If you provide a checkpoint path and it does exist, load from there and continue training.
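A minimal sketch of that load-or-start behavior, independent of the actual PyKEEN internals (all names below are illustrative only):

```python
from pathlib import Path

import torch

def train_with_checkpoint(model, optimizer, num_epochs: int, checkpoint_file: str):
    """Illustrative only: resume from an existing checkpoint, otherwise start fresh."""
    path = Path(checkpoint_file)
    start_epoch = 0
    if path.is_file():
        # The checkpoint exists: restore the state and continue where training stopped.
        state = torch.load(path)
        model.load_state_dict(state['model_state_dict'])
        optimizer.load_state_dict(state['optimizer_state_dict'])
        start_epoch = state['epoch'] + 1
    for epoch in range(start_epoch, num_epochs):
        ...  # run one training epoch
        # Write (or overwrite) the checkpoint so the run can be resumed later.
        torch.save(
            {
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
            },
            path,
        )
```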

Still to do:

@codecov (bot) commented Nov 1, 2020

Codecov Report

Merging #123 (dc247da) into master (7405a61) will decrease coverage by 0.92%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #123      +/-   ##
==========================================
- Coverage   68.10%   67.17%   -0.93%     
==========================================
  Files          94       91       -3     
  Lines        5950     5783     -167     
  Branches      751      741      -10     
==========================================
- Hits         4052     3885     -167     
- Misses       1678     1691      +13     
+ Partials      220      207      -13     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/pykeen/datasets/base.py | 47.08% <0.00%> (-12.22%) ⬇️ |
| src/pykeen/models/base.py | 80.43% <0.00%> (-5.07%) ⬇️ |
| src/pykeen/models/multimodal/complex_literal.py | 28.33% <0.00%> (-1.83%) ⬇️ |
| src/pykeen/models/unimodal/rgcn.py | 78.64% <0.00%> (-1.82%) ⬇️ |
| src/pykeen/utils.py | 71.60% <0.00%> (-1.65%) ⬇️ |
| src/pykeen/models/multimodal/distmult_literal.py | 40.00% <0.00%> (-1.51%) ⬇️ |
| src/pykeen/triples/triples_factory.py | 78.68% <0.00%> (-1.45%) ⬇️ |
| src/pykeen/datasets/__init__.py | 46.29% <0.00%> (-0.98%) ⬇️ |
| src/pykeen/models/unimodal/structured_embedding.py | 80.55% <0.00%> (-0.27%) ⬇️ |
| src/pykeen/models/unimodal/simple.py | 89.36% <0.00%> (-0.23%) ⬇️ |
| … and 33 more | |


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@lvermue (Member, Author) commented Nov 2, 2020

@cthoyt Here's my update:

> I think pipeline integration is a good idea. If you provide a checkpoint file path and it doesn't exist, just start checkpointing like normal. If you provide a checkpoint path and it does exist, load from there and continue training.
>
> Still to do:

The problem with a timeout is that it isn't deterministic, since it depends on the test server's speed. There's an easier way to do that. I'll provide tests once we have agreed on how the functions should look 😄

> • What about early stopping?

I have just implemented full early stopping support.

@lvermue mentioned this pull request on Nov 2, 2020
@mberr (Member) left a comment:

Generally looks good to me. Most of the comments are not merge blockers 🙂

Resolved review comments (outdated):
  • src/pykeen/stoppers/early_stopping.py
  • src/pykeen/training/training_loop.py (3 threads)
@lvermue (Member, Author) commented Nov 2, 2020

12f5d9a now adds support for the pipeline.
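With that commit, checkpointing should also be reachable from the pipeline, presumably along these lines; the routing via training_kwargs and the argument names are assumptions based on this PR's description, not a confirmed final API.

```python
from pykeen.pipeline import pipeline

result = pipeline(
    dataset='Nations',
    model='TransE',
    # Checkpoint settings forwarded to the underlying training loop; the argument
    # names checkpoint_file and checkpoint_frequency follow this PR's description.
    training_kwargs=dict(
        num_epochs=1000,
        checkpoint_file='my_checkpoint.pt',
        checkpoint_frequency=5,
    ),
)
```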

Resolved review comments (outdated):
  • src/pykeen/pipeline.py (2 threads)

Review thread on the following test code:
"""Test whether the resumed pipeline creates the same results as the one shot pipeline."""
checkpoint_directory = self.temporary_directory.name

# TODO check that there aren't any checkpoint files already existing in the right place
Member:

Can we add some more tests here that illustrate what's going on between each step? It's pretty high level and won't be helpful to someone who hasn't fully grokked the code.

Member (Author):

Wouldn't that be the tests in test_training.py?

Member (Author):

For the same reason as given in #123 (comment), these tests only exist to check whether the pipeline handles the training loop checkpoints properly. The main functionality fully resides in the training loop.

Member:

Okay, we don't have to overcomplicate this part. Maybe add a code comment pointing to the other tests, then?

@cthoyt (Member) commented Dec 2, 2020

I wish the greatest bot of all time were here. Where are you @PyKEEN-bot??

To show how checkpoints are used with PyKEEN, let's look at a simple example of how a model is set up.
For handling possible errors and for safety fallbacks, please also see :ref:`word_of_caution`.

.. code-block:: python
Member:

wouldn't these tutorials be more useful for users if they started by being centered on the pipeline and then at the end gave some insight into the underlying implementation?

Member (Author):

It is kept this way for now because the checkpoint functionality is really a training loop feature: even though the pipeline supports using training loop checkpoints, they are not true pipeline checkpoints.

Member:

That's true, but I don't think the beginning of a tutorial section of the documentation benefits from being pedagogical. The technical part can be in the reference, or at the end of the tutorial to help users who understand how to use the simple parts and want to understand how it works. I think one place this worked really well was the First Steps tutorial, which ended with the Beyond the Pipeline section.

Could you elaborate on the difference you mean between a training loop checkpoint vs a pipeline checkpoint?

Resolved review comment (outdated) on tests/test_training.py
@lvermue (Member, Author) commented Dec 5, 2020

@PyKEEN-bot What are your latest findings about our code quality?

@lvermue (Member, Author) commented Dec 5, 2020

@PyKEEN-bot test this, please!

@cthoyt (Member) left a comment:

Ready to merge once CI passes. Thanks @lvermue for writing an excellent tutorial!

@lvermue (Member, Author) commented Dec 7, 2020

@PyKEEN-bot Run unit tests one more time, please!

@lvermue merged commit 4954438 into master on Dec 7, 2020
@lvermue deleted the allow_training_checkpoints branch on December 7, 2020 at 18:52
@cthoyt added the "🛑 Checkpoints" label (issues related to checkpoints and resuming training) on Sep 24, 2023
Labels: 🛑 Checkpoints (issues related to checkpoints and resuming training), enhancement (New feature or request)
Projects: None yet
Linked issues that merging this pull request may close: None yet

4 participants