[docs/train] Add Result guide #38359
@@ -0,0 +1,114 @@
Inspecting Training Results
===========================

The return value of your :meth:`Trainer.fit() <ray.train.trainer.BaseTrainer.fit>`
call is a :class:`~ray.air.result.Result` object.

The :class:`~ray.air.result.Result` object contains, among other things:

- The last reported metrics (e.g. the loss)
- The last reported checkpoint (to load the model)
- Error messages, if any errors occurred

Viewing metrics
---------------

You can retrieve metrics reported to Ray Train from the :class:`~ray.air.result.Result`
object.

Common metrics include the training or validation loss and the prediction accuracy.

The metrics retrieved from the :class:`~ray.air.result.Result` object
correspond to those you passed to :func:`train.report <ray.train.report>`
as an argument :ref:`in your training function <train-monitoring-and-logging>`.

Last reported metrics
~~~~~~~~~~~~~~~~~~~~~

Use :attr:`Result.metrics <ray.air.Result.metrics>` to retrieve the
latest reported metrics.

.. literalinclude:: ../doc_code/key_concepts.py
    :language: python
    :start-after: __result_metrics_start__
    :end-before: __result_metrics_end__

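As a minimal sketch of working with the last reported metrics: ``Result.metrics`` is a dictionary, so individual values can be read with ordinary dictionary access. The ``SimpleNamespace`` below is only a stand-in for a real result object so that the snippet runs without Ray, and the metric names ``loss``, ``accuracy``, and ``f1`` are hypothetical. Your keys are whatever you passed to ``train.report``.

```python
from types import SimpleNamespace

# Stand-in for the object returned by Trainer.fit(); this is NOT a real Ray Result.
result = SimpleNamespace(metrics={"loss": 0.42, "accuracy": 0.91})

# Hypothetical metric name: use the keys you reported in your training function.
final_loss = result.metrics["loss"]

# .get() returns None instead of raising KeyError for a metric that was never reported.
final_f1 = result.metrics.get("f1")

print(f"loss={final_loss}, f1={final_f1}")
```

Indexing with ``[]`` fails loudly when a metric is missing, which is usually what you want for values you rely on. ``.get()`` suits optional metrics.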
Dataframe of all reported metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use :attr:`Result.metrics_dataframe <ray.air.Result.metrics_dataframe>` to retrieve
a pandas DataFrame of all reported metrics.

.. literalinclude:: ../doc_code/key_concepts.py
    :language: python
    :start-after: __result_dataframe_start__
    :end-before: __result_dataframe_end__

Retrieving checkpoints
----------------------

You can retrieve checkpoints reported to Ray Train from the :class:`~ray.air.result.Result`
object.

:ref:`Checkpoints <train-checkpointing>` contain all the information needed
to restore the training state. This usually includes the trained model.

You can use checkpoints for common downstream tasks such as
:ref:`offline batch inference with Ray Data <batch_inference_ray_train>`
or :doc:`online model serving with Ray Serve </serve/index>`.

The checkpoints retrieved from the :class:`~ray.air.result.Result` object
correspond to those you passed to :func:`train.report <ray.train.report>`
as an argument :ref:`in your training function <train-monitoring-and-logging>`.

Last saved checkpoint
~~~~~~~~~~~~~~~~~~~~~

Use :attr:`Result.checkpoint <ray.air.Result.checkpoint>` to retrieve the
last checkpoint.

.. literalinclude:: ../doc_code/key_concepts.py
    :language: python
    :start-after: __result_checkpoint_start__
    :end-before: __result_checkpoint_end__

Other checkpoints
~~~~~~~~~~~~~~~~~

Sometimes you want to access an earlier checkpoint. For instance, if your loss increased
after more training due to overfitting, you may want to retrieve the checkpoint with
the lowest loss.

You can retrieve a list of all available checkpoints and their metrics with
:attr:`Result.best_checkpoints <ray.air.Result.best_checkpoints>`.

.. literalinclude:: ../doc_code/key_concepts.py
    :language: python
    :start-after: __result_best_checkpoint_start__
    :end-before: __result_best_checkpoint_end__

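The selection of the lowest-loss checkpoint can be sketched as follows, assuming the list holds ``(checkpoint, metrics)`` pairs. The data below is a stand-in so that the snippet runs without Ray: plain strings take the place of checkpoint objects, and the ``loss`` key is a hypothetical metric name.

```python
# Stand-in for Result.best_checkpoints: a list of (checkpoint, metrics) pairs.
# Strings replace real checkpoint objects so this runs without Ray.
best_checkpoints = [
    ("checkpoint_000000", {"loss": 0.90}),
    ("checkpoint_000001", {"loss": 0.35}),
    ("checkpoint_000002", {"loss": 0.48}),  # loss rose again, e.g. from overfitting
]

# Pick the pair whose metrics contain the lowest (hypothetical) "loss" value.
best_checkpoint, best_metrics = min(
    best_checkpoints, key=lambda pair: pair[1]["loss"]
)

print(best_checkpoint, best_metrics["loss"])
```

For a metric where higher is better, such as accuracy, use ``max`` with the same ``key`` pattern.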
Storage location
----------------

If you need to retrieve the results later, you can get the storage location
with :attr:`Result.path <ray.air.Result.path>`.

This path corresponds to the :ref:`storage_path <train-log-dir>` you configured
in the :class:`~ray.air.RunConfig`.

.. literalinclude:: ../doc_code/key_concepts.py
    :language: python
    :start-after: __result_path_start__
    :end-before: __result_path_end__

Errors
------

If an error occurred during training,
:attr:`Result.error <ray.air.Result.error>` is set and contains the exception
that was raised.

.. literalinclude:: ../doc_code/key_concepts.py
    :language: python
    :start-after: __result_error_start__
    :end-before: __result_error_end__

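One way to branch on the error attribute is sketched below. The ``SimpleNamespace`` objects are stand-ins for real results so that the snippet runs without Ray, and ``summarize`` is a hypothetical helper, not part of the Ray API.

```python
from types import SimpleNamespace

# Stand-ins for results; real ones come from Trainer.fit().
ok = SimpleNamespace(error=None, metrics={"loss": 0.42})
failed = SimpleNamespace(error=RuntimeError("worker crashed"), metrics=None)


def summarize(result):
    """Return a short status string for a result-like object (hypothetical helper)."""
    if result.error is not None:
        # The stored exception can be logged, inspected, or re-raised.
        return f"failed: {type(result.error).__name__}: {result.error}"
    return f"succeeded: {result.metrics}"


print(summarize(ok))
print(summarize(failed))
```

Checking ``error is not None`` before reading metrics or checkpoints avoids acting on the output of a run that did not finish.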
This is awesome. One high-level suggestion for extending this is to add some more color on what the user should do with these attributes.
As a user, what can or should I do with the metrics, checkpoints, etc.? We can guide users towards common steps, such as visualizing metrics with TensorBoard or using the checkpoint for prediction.
Added a bit - keeping it concise for now to not clutter the page, but happy to add more references.