Add checkpoint saving and loading functionality to training loop #123
Merged
Changes from 44 commits
74 commits
dc55230 Add checkpoint saving and loading functionality to training loop (lvermue)
9452a01 Flake8 and parameters defaults (lvermue)
21ab40e Flake8 (lvermue)
c05b38b Refactor internal completed epoch tracking (lvermue)
3a0f420 Correct epoch number handling in training loop (lvermue)
326cf57 Add automatic loading of early stoppers from training loop checkpoints (lvermue)
12f5d9a Add training loop checkpoint support for pipelines (lvermue)
0af8dbe Fix indentation (lvermue)
aaa7eb2 Change training loop checksum creation from adler32 to md5 (lvermue)
7ec0d44 Fix missing random_seed failure, if no training config file is provided (lvermue)
7b84f70 Flake8, refactor code and add CheckpointMismatchError (lvermue)
21a982c Fix flake8 (lvermue)
11e7c4e Update exception (cthoyt)
5f11c15 Add outline for checkpoint tutorial (cthoyt)
fc9a5a9 Correct random state recovery (lvermue)
2bf5b98 Merge branch 'allow_training_checkpoints' of https://github.com/pykee… (lvermue)
256beaa Remove pipeline checkpoint helper file help function (lvermue)
1e170f3 Remove torch save helper file (lvermue)
3ba5288 Fix function argument for torch save (lvermue)
cfd0598 Add units tests for training loop checkpoint (lvermue)
8d2740f Merge branch 'master' into allow_training_checkpoints (cthoyt)
07c3d42 Add unit tests for checkpoints (lvermue)
37b777b Fix flake8 (lvermue)
8af806f Add failure fallback checkpoints to training loop (lvermue)
bc20d35 Correct checkpoint root dir handling (lvermue)
17cdfcc Add implicit random seed handling from checkpoints in pipeline (lvermue)
b036b9d Code cleanup (cthoyt)
5d2e47b More refactoring (cthoyt)
583d494 Add CPU/GPU random state differentiation (lvermue)
604bc2f Merge branch 'allow_training_checkpoints' of https://github.com/pykee… (lvermue)
26e0b39 Workaround for CUDA rng state (cthoyt)
a564fef Refactor tests (cthoyt)
67776c8 Remove unnecessary stuff (cthoyt)
578fd1e Unnest logic (cthoyt)
09d03e3 Improve typing and safety (cthoyt)
9ca40dc Fix testing (lvermue)
5d64d6c Fix pipeline checkpoint unit tests (lvermue)
bf82e79 Merge branch 'master' into allow_training_checkpoints (lvermue)
661eed3 Fix usage of forbidden characters for Windows in filepaths (lvermue)
675bd59 Refactor loading of states for the training loop and stoppers (lvermue)
f618a90 Fix flake8 (lvermue)
14e14b7 Fix handling of stopper state dictionaries (lvermue)
cd64e7e Merge branch 'master' into allow_training_checkpoints (lvermue)
d9abffd Fix flake8 (lvermue)
f226b0a Merge branch 'master' into allow_training_checkpoints (lvermue)
240140b Fix unit tests (lvermue)
3cf1f69 Refactor pipeline unit tests (lvermue)
29780d0 Refactor training loop unit tests (lvermue)
bf6190b Fix flake8 (lvermue)
04b31c9 Merge branch 'master' into allow_training_checkpoints (lvermue)
1cabf5f Add saving of checkpoints after successful training (lvermue)
1fa0f18 Add usage of temporary directories for unit tests (lvermue)
46bb380 Add checkpoint documentation and correct failure checkpoint handling (lvermue)
2b32fbb Trigger CI (PyKEEN-bot)
566e9ab Add missing variable default value (cthoyt)
778ce48 Get rid of tqdms (cthoyt)
317f639 Pass flake8 (cthoyt)
563e5a4 Use class teardown for handling temporary directory (cthoyt)
7fba750 Update docs (cthoyt)
65cefa7 Update argument names and type hints (cthoyt)
52656ec Trigger CI (PyKEEN-bot)
a745853 Add datetime formatting (lvermue)
12e3fed Add docs for checkpoint_on_failure_file_path (lvermue)
86749bf Merge branch 'master' into allow_training_checkpoints (cthoyt)
37069be Update constants (cthoyt)
ffd8e55 Change temp dir creation and teardown during unit tests (lvermue)
d285704 Merge branch 'allow_training_checkpoints' of https://github.com/pykee… (lvermue)
b5fb84f Update the checkpoint tutorial (lvermue)
92ed682 Trigger CI (PyKEEN-bot)
33f0776 Fix temp dir name handling (lvermue)
cc36bde Trigger CI (PyKEEN-bot)
ed6b98e Small fixes in docs (cthoyt)
6231d23 Merge branch 'master' into allow_training_checkpoints (lvermue)
605b76e Trigger CI (PyKEEN-bot)
@@ -0,0 +1,9 @@
Using Checkpoints
=================
Why does someone want to use checkpoints?

Give an example of a run that will obviously crash

How to recover when you were smart enough to keep checkpoints?

Where is this applicable? pipeline / hpo pipeline?
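
The tutorial outline asks for a run that crashes and a recovery from checkpoints. The following is a minimal sketch of that pattern, written against the standard library only and not against PyKEEN's API: the function `train`, its `crash_at` parameter, and the JSON checkpoint layout are hypothetical names chosen for illustration. In the spirit of the random state recovery commits in this PR, the sketch stores the random number generator state alongside the epoch counter, so a resumed run reproduces an uninterrupted one.

```python
import json
import random
from pathlib import Path


def train(num_epochs, checkpoint_path, crash_at=None):
    """Toy training loop that writes a checkpoint after every epoch.

    Hypothetical helper for illustration; this is not PyKEEN's training loop.
    """
    checkpoint_path = Path(checkpoint_path)
    if checkpoint_path.exists():
        # Resume: restore the epoch counter, the recorded losses, and the RNG state.
        state = json.loads(checkpoint_path.read_text())
        epoch, losses = state["epoch"], state["losses"]
        version, internal, gauss_next = state["rng_state"]
        # JSON turns tuples into lists, so rebuild the tuple that setstate expects.
        random.setstate((version, tuple(internal), gauss_next))
    else:
        # Fresh start with a fixed seed so that runs are comparable.
        epoch, losses = 0, []
        random.seed(0)

    while epoch < num_epochs:
        if crash_at is not None and epoch == crash_at:
            raise RuntimeError(f"simulated crash in epoch {epoch + 1}")
        losses.append(random.random())  # stand-in for one epoch of real training
        epoch += 1
        state = {"epoch": epoch, "losses": losses, "rng_state": random.getstate()}
        checkpoint_path.write_text(json.dumps(state))
    return losses
```

Calling `train(5, path, crash_at=3)` raises after three completed epochs; calling `train(5, path)` afterwards picks up from the saved checkpoint and runs the remaining two epochs. A real implementation would probably write the checkpoint to a temporary file and rename it, so that a crash during saving cannot corrupt the last good checkpoint.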
@@ -3,9 +3,15 @@
"""Constants for PyKEEN."""

import os
import pathlib

__all__ = [
    'PYKEEN_HOME',
    'PYKEEN_DEFAULT_CHECKPOINT_DIR',
]

PYKEEN_HOME = os.environ.get('PYKEEN_HOME') or os.path.join(os.path.expanduser('~'), '.pykeen')
PYKEEN_DEFAULT_CHECKPOINT = "PyKEEN_just_saved_my_day.pt"

PYKEEN_DEFAULT_CHECKPOINT_DIR = pathlib.Path(PYKEEN_HOME).joinpath("checkpoints")
PYKEEN_DEFAULT_CHECKPOINT_DIR.mkdir(exist_ok=True, parents=True)

Review comment on the PYKEEN_DEFAULT_CHECKPOINT line: perfect name
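
To make the lookup order in the constants diff easier to check, here is a small sketch that repeats the same expression inside a function taking the environment as an argument. The name `resolve_checkpoint_dir` and its `environ` parameter are assumptions for illustration and do not appear in the PR. Unlike the module in the diff, the sketch does not create the directory.

```python
import os
import pathlib


def resolve_checkpoint_dir(environ=None):
    """Return the checkpoint directory the diff's expression would produce.

    Hypothetical helper: uses $PYKEEN_HOME if it is set and non-empty,
    otherwise falls back to ~/.pykeen, then appends "checkpoints".
    """
    environ = os.environ if environ is None else environ
    # `or` also triggers the fallback when PYKEEN_HOME is set to an empty string.
    home = environ.get("PYKEEN_HOME") or os.path.join(os.path.expanduser("~"), ".pykeen")
    return pathlib.Path(home).joinpath("checkpoints")
```

Because the diff uses `os.environ.get(...) or ...` rather than a default argument to `get`, an empty `PYKEEN_HOME` behaves the same as an unset one. Note also that the module in the diff calls `mkdir` at import time, so importing it has a filesystem side effect.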
@lvermue write up this tutorial please