Add checkpoint saving and loading functionality to training loop #123
@@ -0,0 +1,147 @@
Using Checkpoints
=================
Training a model with many parameters or on a big dataset can take days or even weeks in extreme cases. This exposes
the training process to a large array of possible failures, e.g. session timeouts, server restarts etc., which would
lead to a complete loss of all progress made so far. To avoid this, the :class:`pykeen.training.TrainingLoop` supports
built-in checkpoints that allow straightforward saving of the current training loop state and resumption from a saved
checkpoint.

How to do it
------------
To show how checkpoints are used with PyKEEN, let's look at a simple example of how a model is set up.
For fixing possible errors and safety fallbacks, please also have a look at :ref:`word_of_caution`.

.. code-block:: python

    from pykeen.datasets import Nations
    from pykeen.models import TransE
    from pykeen.training import SLCWATrainingLoop
    from torch.optim import Adam

    triples_factory = Nations().training
    model = TransE(
        triples_factory=triples_factory,
        random_seed=123,
    )

    optimizer = Adam(params=model.get_grad_params())
    training_loop = SLCWATrainingLoop(model=model, optimizer=optimizer)

At this point we have a model, dataset, and optimizer all set up in a training loop and are ready to train the model
with the ``training_loop``'s method :func:`pykeen.training.TrainingLoop.train`. To enable checkpoints, all you have to
do is set the argument ``checkpoint_name`` to the name you would like the checkpoint file to have.
Optionally, you can set the path where the checkpoints should be saved by setting the ``checkpoint_directory``
argument with a string or a :class:`pathlib.Path` object containing your desired root path. If you don't set the
``checkpoint_directory`` argument, your checkpoints will be saved in the ``PYKEEN_HOME`` directory that is defined in
:mod:`pykeen.constants`, which is a subdirectory of your home directory, e.g. ``~/.pykeen/checkpoints``.
Furthermore, you can set the checkpoint frequency, i.e. how often checkpoints should be saved (given in minutes), by
setting the argument ``checkpoint_frequency`` to an integer. The default frequency is 30 minutes, and setting it to
``0`` will cause the training loop to save a checkpoint after each epoch.

Here is an example:

.. code-block:: python

    losses = training_loop.train(
        num_epochs=1000,
        checkpoint_name='my_checkpoint.pt',
        checkpoint_frequency=5,
    )

With this code we have started the training loop with the above defined KGEM. The training loop will save a checkpoint
in the ``my_checkpoint.pt`` file, which will be placed in the ``~/.pykeen/checkpoints/`` directory, since we haven't
set the argument ``checkpoint_directory``.
The checkpoint is written at the first epoch boundary after 5 minutes have passed since the start of the training loop
or since the last checkpoint was saved, i.e. if one epoch takes 10 minutes, the checkpoint will be saved after
10 minutes. In addition, checkpoints are always saved when the early stopper stops the training loop or when the last
epoch has finished.

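The ``checkpoint_directory`` argument mentioned above is not used in the example; the following is only a minimal
sketch of how it could be passed, with a purely hypothetical directory path:

.. code-block:: python

    from pathlib import Path

    losses = training_loop.train(
        num_epochs=1000,
        checkpoint_name='my_checkpoint.pt',
        checkpoint_frequency=5,
        # Hypothetical root path; checkpoints would then be written here
        # instead of the default ``~/.pykeen/checkpoints``.
        checkpoint_directory=Path('/data/my_project/checkpoints'),
    )
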
Let's assume you were proactive, saved checkpoints, and your training loop crashed after 200 epochs.
Now you would like to resume from the last checkpoint. All you have to do is rerun the **exact same code** as above
and PyKEEN will smoothly resume from the given checkpoint. Since PyKEEN stores all random states as well as the
states of the model, optimizer, and early stopper, the results will be exactly the same as if the training loop had
run uninterrupted. Of course, PyKEEN will also continue saving new checkpoints when resuming from a previous
checkpoint.

On top of resuming interrupted training loops, you can also resume training loops that finished successfully.
E.g. the above training loop finished successfully after 1000 epochs, but you would like to continue training the
same model from that state up to 2000 epochs. All you have to do is change the argument ``num_epochs`` in the above
code to:

.. code-block:: python

    losses = training_loop.train(
        num_epochs=2000,
        checkpoint_name='my_checkpoint.pt',
        checkpoint_frequency=5,
    )

and now the training loop will resume from the state at 1000 epochs and continue to train until 2000 epochs.

Another nice feature is that the checkpoint functionality integrates with the pipeline. This means that you can simply
define a pipeline like this:

.. code-block:: python

    from pykeen.pipeline import pipeline

    pipeline_result = pipeline(
        dataset='Nations',
        model='TransE',
        optimizer='Adam',
        training_kwargs=dict(num_epochs=1000, checkpoint_name='my_checkpoint.pt', checkpoint_frequency=5),
    )

Again, assuming that, e.g., this pipeline crashes after 200 epochs, you can simply execute **the same code** and the
pipeline will load the last state from the checkpoint file and continue training as if nothing happened.

.. todo:: Tutorial on recovery from hpo_pipeline.

Review comment: Maybe for a later PR?

Reply: Yes. Basically the hpo_pipeline supports saving checkpoints through the pipeline, but that is specific to the
training loop itself. Supporting resumption of a cancelled hpo_pipeline would be an entirely different story.

Checkpoints on Failure
----------------------
In cases where you would only like to save checkpoints when the training loop fails, you can use the argument
``checkpoint_on_failure=True``, like:

.. code-block:: python

    losses = training_loop.train(
        num_epochs=2000,
        checkpoint_on_failure=True,
    )

This option differs from ordinary checkpoints, since ordinary checkpoints are only saved after a successful epoch.
When saving checkpoints due to a failure of the training loop, there is no guarantee that all random states can be
recovered correctly, which might cause problems with regard to the reproducibility of that specific training loop.
Therefore, these checkpoints are saved with a distinct checkpoint name, which will be
``PyKEEN_just_saved_my_day_{datetime}.pt`` in the given ``checkpoint_directory``, even when you also opted to use
ordinary checkpoints as defined above, e.g. with this code:

.. code-block:: python

    losses = training_loop.train(
        num_epochs=2000,
        checkpoint_name='my_checkpoint.pt',
        checkpoint_frequency=5,
        checkpoint_on_failure=True,
    )

Note: Use this argument with caution, since every failed training loop will create a distinct checkpoint file.

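Since such failure checkpoints can accumulate, a small housekeeping sketch can help spot them. This snippet is only an
illustration; it assumes the default checkpoint directory described above and the naming scheme
``PyKEEN_just_saved_my_day_{datetime}.pt``:

.. code-block:: python

    from pathlib import Path

    # Default checkpoint directory, assuming ``checkpoint_directory`` was not changed.
    checkpoint_dir = Path.home() / '.pykeen' / 'checkpoints'

    # List the checkpoints written by failed training loops so they can be reviewed or deleted.
    for path in sorted(checkpoint_dir.glob('PyKEEN_just_saved_my_day_*.pt')):
        print(path)
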
.. _word_of_caution:

Word of Caution and Possible Errors
-----------------------------------
When using checkpoints and trying out several configurations, which in turn result in multiple different checkpoints,
there is an inherent risk of overwriting checkpoints. This would naturally happen when you change the configuration of
the KGEM but don't change the ``checkpoint_name`` argument.
To prevent this from happening, PyKEEN compares a hash-sum of the checkpoint's configuration with that of the current
configuration at hand. When these don't match, PyKEEN won't accept the checkpoint and will raise an error.

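For illustration, the following sketch shows a situation in which the hash-sum comparison would fail; the changed
hyper-parameter is only an example:

.. code-block:: python

    # Same checkpoint name as before, but a different model configuration
    # (``embedding_dim`` changed for illustration).
    model = TransE(
        triples_factory=triples_factory,
        embedding_dim=200,
        random_seed=123,
    )
    optimizer = Adam(params=model.get_grad_params())
    training_loop = SLCWATrainingLoop(model=model, optimizer=optimizer)

    # The configuration no longer matches the one stored in ``my_checkpoint.pt``,
    # so PyKEEN will not accept the checkpoint and will raise an error.
    losses = training_loop.train(
        num_epochs=1000,
        checkpoint_name='my_checkpoint.pt',
    )
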
In case you want to overwrite the previous checkpoint file with a new configuration, you have to delete it explicitly
(a small sketch follows the list below). The reason for this behavior is three-fold:

1. It allows a very easy and user-friendly way of resuming an interrupted training loop by simply re-running
   the exact same code.
2. By explicitly requiring the checkpoint files to be named, the user controls the naming of the files, which makes
   it easier to keep an overview.
3. Creating new checkpoint files for each run would lead most users to inadvertently spam their file systems with
   unused checkpoints that can easily add up to hundreds of GBs when running many experiments.

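A minimal sketch of how the old checkpoint file could be deleted, assuming the default checkpoint directory described
above:

.. code-block:: python

    from pathlib import Path

    # Assumes the default checkpoint directory; adjust if ``checkpoint_directory`` was set.
    checkpoint_path = Path.home() / '.pykeen' / 'checkpoints' / 'my_checkpoint.pt'
    if checkpoint_path.exists():
        checkpoint_path.unlink()  # explicitly remove the outdated checkpoint
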
@@ -29,6 +29,8 @@
#: A subdirectory of the PyKEEN data folder for checkpoints, defaults to ``~/.data/pykeen/checkpoints``
PYKEEN_CHECKPOINTS: Path = PYKEEN_MODULE.get('checkpoints')

PYKEEN_DEFAULT_CHECKPOINT = "PyKEEN_just_saved_my_day.pt"

Review comment: perfect name

DEFAULT_DROPOUT_HPO_RANGE = dict(type=float, low=0.0, high=0.5, q=0.1)
#: We define the embedding dimensions as a multiple of 16 because it is computationally beneficial (on a GPU),
#: see: https://docs.nvidia.com/deeplearning/performance/index.html#optimizing-performance

Review comment: Wouldn't these tutorials be more useful for users if they started by being centered on the `pipeline`
and then at the end gave some insight into the underlying implementation?

Reply: The reason it is kept right now is that the checkpoint functionality is really a training loop functionality:
even though the pipeline supports using training loop checkpoints, it is not a true pipeline checkpoint.

Reply: That's true, but I don't think the beginning of a tutorial section of the documentation benefits from being
pedagogical. The technical part can be in the reference, or at the end of the tutorial to help users who understand
how to use the simple parts and want to understand how it works. I think one place this worked really well was the
First Steps tutorial, which ended with the Beyond the Pipeline section.
Could you elaborate on the difference you mean between a training loop checkpoint vs. a pipeline checkpoint?