
Add checkpoint saving and loading functionality to training loop #123

Merged
lvermue merged 74 commits into master from allow_training_checkpoints on Dec 7, 2020

Conversation

@lvermue (Member) commented Nov 1, 2020

Currently, training loops cannot save checkpoints.
This means that long training sessions cannot be paused and resumed at a later point, and that users lose their entire training progress whenever a crash occurs, e.g. a power outage, a server reboot, or another user trying to use the same GPU.

The solution presented here makes it possible to:

  • Automatically store checkpoints during training by setting the parameter checkpoint_file; checkpoints are saved based on the time elapsed since the last save, which can be adjusted via the parameter checkpoint_frequency
  • Automatically restore checkpoints from the checkpoint_file and resume training where it left off
  • Ensure that a given checkpoint_file matches the training loop configuration via a checksum calculated from the hashed model and optimizer configuration (see the sketches after this list)
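As a rough illustration of the intended usage, here is a sketch of a training loop set up with these parameters. The general setup (Nations, TransE, SLCWATrainingLoop) follows the usual PyKEEN pattern; the checkpoint_file and checkpoint_frequency arguments are taken from the description above, and their exact names and semantics may differ in the final implementation.

```python
from torch.optim import Adam

from pykeen.datasets import Nations
from pykeen.models import TransE
from pykeen.training import SLCWATrainingLoop

# Standard PyKEEN setup: dataset, model, optimizer, and training loop.
dataset = Nations()
model = TransE(triples_factory=dataset.training)
optimizer = Adam(params=model.parameters())
training_loop = SLCWATrainingLoop(model=model, optimizer=optimizer)

# Checkpointing as described in this PR (argument names assumed from the text above):
# checkpoints are written to checkpoint_file based on the time elapsed since the last
# save, controlled by checkpoint_frequency, and restored from the file if it exists.
training_loop.train(
    num_epochs=1000,
    checkpoint_file='my_checkpoint.pt',
    checkpoint_frequency=5,
)
```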

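The configuration checksum mentioned in the last bullet could conceptually be computed along the following lines; this is a minimal sketch with invented names (config_checksum and the config dicts), not the actual implementation in this PR.

```python
import hashlib
import json

def config_checksum(model_config: dict, optimizer_config: dict) -> str:
    """Hypothetical helper: hash the model and optimizer configuration so that a
    checkpoint can later be verified against the training loop that loads it."""
    payload = json.dumps(
        {'model': model_config, 'optimizer': optimizer_config},
        sort_keys=True,  # deterministic serialization, so equal configs hash equally
    )
    return hashlib.md5(payload.encode('utf-8')).hexdigest()

# The checksum would be stored inside the checkpoint file on save and recomputed on
# load; a mismatch means the checkpoint belongs to a different configuration.
```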
To Do:

  • Decide how the pipeline should interact with the training loop checkpoint functionality
  • Write tests

@cthoyt (Member) commented Nov 1, 2020

I think pipeline integration is a good idea. If you provide a checkpoint file path and it doesn't exist, just start checkpointing like normal. If you provide a checkpoint path and it does exist, load from there and continue training.
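A minimal sketch of that load-or-start behavior, independent of the actual PyKEEN internals (all names below are illustrative only):

```python
from pathlib import Path

import torch

def train_with_checkpoint(model, optimizer, num_epochs: int, checkpoint_file: str):
    """Illustrative only: resume from an existing checkpoint, otherwise start fresh."""
    path = Path(checkpoint_file)
    start_epoch = 0
    if path.is_file():
        # The checkpoint exists: restore the state and continue where training stopped.
        state = torch.load(path)
        model.load_state_dict(state['model_state_dict'])
        optimizer.load_state_dict(state['optimizer_state_dict'])
        start_epoch = state['epoch'] + 1
    for epoch in range(start_epoch, num_epochs):
        ...  # run one training epoch
        # Write (or overwrite) the checkpoint so the run can be resumed later.
        torch.save(
            {
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
            },
            path,
        )
```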

Still to do:

@codecov (bot) commented Nov 1, 2020

Codecov Report

Merging #123 (dc247da) into master (7405a61) will decrease coverage by 0.92%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #123      +/-   ##
==========================================
- Coverage   68.10%   67.17%   -0.93%     
==========================================
  Files          94       91       -3     
  Lines        5950     5783     -167     
  Branches      751      741      -10     
==========================================
- Hits         4052     3885     -167     
- Misses       1678     1691      +13     
+ Partials      220      207      -13     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/pykeen/datasets/base.py | 47.08% <0.00%> (-12.22%) ⬇️ |
| src/pykeen/models/base.py | 80.43% <0.00%> (-5.07%) ⬇️ |
| src/pykeen/models/multimodal/complex_literal.py | 28.33% <0.00%> (-1.83%) ⬇️ |
| src/pykeen/models/unimodal/rgcn.py | 78.64% <0.00%> (-1.82%) ⬇️ |
| src/pykeen/utils.py | 71.60% <0.00%> (-1.65%) ⬇️ |
| src/pykeen/models/multimodal/distmult_literal.py | 40.00% <0.00%> (-1.51%) ⬇️ |
| src/pykeen/triples/triples_factory.py | 78.68% <0.00%> (-1.45%) ⬇️ |
| src/pykeen/datasets/__init__.py | 46.29% <0.00%> (-0.98%) ⬇️ |
| src/pykeen/models/unimodal/structured_embedding.py | 80.55% <0.00%> (-0.27%) ⬇️ |
| src/pykeen/models/unimodal/simple.py | 89.36% <0.00%> (-0.23%) ⬇️ |
| … and 33 more | |


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@lvermue (Member, Author) commented Nov 2, 2020

@cthoyt Here's my update:

> I think pipeline integration is a good idea. If you provide a checkpoint file path and it doesn't exist, just start checkpointing like normal. If you provide a checkpoint path and it does exist, load from there and continue training.
>
> Still to do:

The problem with a timeout is that it isn't deterministic, since it depends on the test server's speed. There's an easier way to do that. I'll provide tests once we have agreed on how the functions should look 😄

> • What about early stopping?

I have just implemented full early stopping support.

@lvermue mentioned this pull request on Nov 2, 2020
@mberr (Member) left a comment:

Generally looks good to me. Most of the comments are not merge blockers 🙂

Resolved review comments (outdated):
  • src/pykeen/stoppers/early_stopping.py
  • src/pykeen/training/training_loop.py (3 threads)
@lvermue (Member, Author) commented Nov 2, 2020

12f5d9a now adds support for the pipeline.
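With that commit, checkpointing should also be reachable from the pipeline, presumably along these lines; the routing via training_kwargs and the argument names are assumptions based on this PR's description, not a confirmed final API.

```python
from pykeen.pipeline import pipeline

result = pipeline(
    dataset='Nations',
    model='TransE',
    # Checkpoint settings forwarded to the underlying training loop; the argument
    # names checkpoint_file and checkpoint_frequency follow this PR's description.
    training_kwargs=dict(
        num_epochs=1000,
        checkpoint_file='my_checkpoint.pt',
        checkpoint_frequency=5,
    ),
)
```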

Resolved review comments (outdated):
  • src/pykeen/pipeline.py (2 threads)

Review thread on the following test code:
"""Test whether the resumed pipeline creates the same results as the one shot pipeline."""
checkpoint_directory = self.temporary_directory.name

# TODO check that there aren't any checkpoint files already existing in the right place
Member:

Can we add some more tests here that illustrate what's going on between each step? It's pretty high level and won't be helpful to someone who hasn't fully grokked the code.

Member (Author):

Wouldn't that be the tests in test_training.py?

Member (Author):

For the same reason as given in #123 (comment), these tests only exist to check whether the pipeline handles the training loop checkpoints properly. The main functionality fully resides in the training loop.

Member:

Okay, we don't have to overcomplicate this part. Maybe add a code comment pointing to the other tests, then?

@cthoyt (Member) commented Dec 2, 2020

I wish the greatest bot of all time were here. Where are you @PyKEEN-bot??

To show how checkpoints are used with PyKEEN, let's look at a simple example of how a model is set up.
For handling possible errors and for safety fallbacks, please also see :ref:`word_of_caution`.

.. code-block:: python
Member:

wouldn't these tutorials be more useful for users if they started by being centered on the pipeline and then at the end gave some insight into the underlying implementation?

Member (Author):

It is kept this way for now because the checkpoint functionality is really a training loop feature: even though the pipeline supports using training loop checkpoints, they are not true pipeline checkpoints.

Member:

That's true, but I don't think the beginning of a tutorial section of the documentation benefits from being pedagogical. The technical part can be in the reference, or at the end of the tutorial to help users who understand how to use the simple parts and want to understand how it works. I think one place this worked really well was the First Steps tutorial, which ended with the Beyond the Pipeline section.

Could you elaborate on the difference you mean between a training loop checkpoint vs a pipeline checkpoint?

Resolved review comment (outdated) on tests/test_training.py
@lvermue (Member, Author) commented Dec 5, 2020

@PyKEEN-bot What are your latest findings about our code quality?

@lvermue (Member, Author) commented Dec 5, 2020

@PyKEEN-bot test this, please!

@cthoyt (Member) left a comment:

Ready to merge once CI passes. Thanks @lvermue for writing an excellent tutorial!

@lvermue (Member, Author) commented Dec 7, 2020

@PyKEEN-bot Run unit tests one more time, please!

@lvermue merged commit 4954438 into master on Dec 7, 2020
@lvermue deleted the allow_training_checkpoints branch on December 7, 2020 at 18:52
@cthoyt added the "🛑 Checkpoints" label (issues related to checkpoints and resuming training) on Sep 24, 2023
Labels: 🛑 Checkpoints (issues related to checkpoints and resuming training), enhancement (New feature or request)
Projects: None yet
Linked issues that merging this pull request may close: None yet

4 participants