
[docs/train] Add Result guide #38359

Merged: 15 commits, Aug 17, 2023
3 changes: 3 additions & 0 deletions doc/source/data/batch_inference.rst
@@ -420,6 +420,9 @@ Suppose your cluster has 4 nodes, each with 16 CPUs. To limit to at most
predictions.show(limit=1)


.. _batch_inference_ray_train:


Using models from Ray Train
---------------------------

49 changes: 43 additions & 6 deletions doc/source/train/doc_code/key_concepts.py
@@ -70,7 +70,8 @@ def train_fn(config):
     for i in range(state["step"], 10):
         state["step"] += 1
         train.report(
-            metrics={"step": state["step"]}, checkpoint=Checkpoint.from_dict(state)
+            metrics={"step": state["step"], "loss": (100 - i) / 100},
+            checkpoint=Checkpoint.from_dict(state),
         )


@@ -160,12 +161,48 @@ def train_fn(config):
# __checkpoint_config_ckpt_freq_end__


# __results_start__
# __result_metrics_start__
result = trainer.fit()

# Print metrics
print("Observed metrics:", result.metrics)
# __result_metrics_end__

checkpoint_data = result.checkpoint.to_dict()
print("Checkpoint data:", checkpoint_data["step"])
# __results_end__

# __result_dataframe_start__
df = result.metrics_dataframe
print("Minimum loss", min(df["loss"]))
# __result_dataframe_end__


# __result_checkpoint_start__
print("Last checkpoint:", result.checkpoint)

with result.checkpoint.as_directory() as tmpdir:
    # Load model from directory
    ...
# __result_checkpoint_end__

# __result_best_checkpoint_start__
# Print available checkpoints
for checkpoint, metrics in result.best_checkpoints:
    print("Loss", metrics["loss"], "checkpoint", checkpoint)

# Get checkpoint with minimal loss
best_checkpoint = min(result.best_checkpoints, key=lambda bc: bc[1]["loss"])[0]

with best_checkpoint.as_directory() as tmpdir:
    # Load model from directory
    ...
# __result_best_checkpoint_end__

# __result_path_start__
print("Results location", result.path)
# __result_path_end__


# __result_error_start__
if result.error:
    assert isinstance(result.error, Exception)

    print("Got exception:", result.error)
# __result_error_end__
1 change: 1 addition & 0 deletions doc/source/train/user-guides.rst
@@ -12,5 +12,6 @@ Ray Train User Guides
user-guides/monitoring-logging
user-guides/checkpoints
user-guides/experiment-tracking
user-guides/results
user-guides/fault-tolerance
user-guides/advanced
114 changes: 114 additions & 0 deletions doc/source/train/user-guides/results.rst
Contributor:

This is awesome! One high-level suggestion for extending this is to add some more color on what the user should do with these attributes.

As a user, what can/should I do with the metrics, checkpoints, etc.? We can guide them towards common steps, such as visualizing metrics with TensorBoard, or using the checkpoint for prediction.

Contributor Author:

Added a bit; keeping it concise for now to not clutter the page, but happy to add more references.
@@ -0,0 +1,114 @@
Inspecting Training Results
===========================

The return value of your :meth:`Trainer.fit() <ray.train.trainer.BaseTrainer.fit>`
call is a :class:`~ray.air.result.Result` object.

The :class:`~ray.air.result.Result` object contains, among other things (illustrated in the sketch below):

- The last reported metrics (e.g. the loss)
- The last reported checkpoint (to load the model)
- Error messages, if any errors occurred
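
For orientation, here is a minimal sketch of how these attributes are accessed
(``trainer`` stands in for any configured Ray Train trainer; the metric names
are illustrative):

.. code-block:: python

    result = trainer.fit()

    # Last reported metrics, e.g. {"loss": 0.9, "step": 10, ...}
    print(result.metrics)

    # Last reported checkpoint (a ray.air.Checkpoint), or None
    print(result.checkpoint)

    # The raised exception if training failed, otherwise None
    print(result.error)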

Viewing metrics
---------------
You can retrieve metrics reported to Ray Train from the :class:`~ray.air.result.Result`
object.

Common metrics include the training or validation loss and prediction accuracy.

The metrics retrieved from the :class:`~ray.air.result.Result` object
correspond to those you passed to :func:`train.report <ray.train.report>`
as an argument :ref:`in your training function <train-monitoring-and-logging>`.
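
As a minimal sketch (metric names are illustrative), metrics reported inside the
training function come back on the result object:

.. code-block:: python

    from ray import train

    def train_fn(config):
        # ... training logic ...
        train.report(metrics={"loss": 0.9, "accuracy": 0.5})

    # After trainer.fit(), the last reported values are available as
    # result.metrics["loss"] and result.metrics["accuracy"].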


Last reported metrics
~~~~~~~~~~~~~~~~~~~~~

Use :attr:`Result.metrics <ray.air.Result.metrics>` to retrieve the
latest reported metrics.

.. literalinclude:: ../doc_code/key_concepts.py
    :language: python
    :start-after: __result_metrics_start__
    :end-before: __result_metrics_end__

Dataframe of all reported metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use :attr:`Result.metrics_dataframe <ray.air.Result.metrics_dataframe>` to retrieve
a pandas DataFrame of all reported metrics.

.. literalinclude:: ../doc_code/key_concepts.py
    :language: python
    :start-after: __result_dataframe_start__
    :end-before: __result_dataframe_end__
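
A common next step is to visualize these metrics. As a sketch, assuming
matplotlib is installed and ``step`` and ``loss`` were reported as in the
examples above:

.. code-block:: python

    import matplotlib.pyplot as plt

    df = result.metrics_dataframe
    # One row per train.report() call
    df.plot(x="step", y="loss")
    plt.savefig("loss_curve.png")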


Retrieving checkpoints
----------------------
You can retrieve checkpoints reported to Ray Train from the :class:`~ray.air.result.Result`
object.

:ref:`Checkpoints <train-checkpointing>` contain all the information that is needed
to restore the training state. This usually includes the trained model.

You can use checkpoints for common downstream tasks such as
:ref:`offline batch inference with Ray Data <batch_inference_ray_train>`,
or :doc:`online model serving with Ray Serve </serve/index>`.

The checkpoints retrieved from the :class:`~ray.air.result.Result` object
correspond to those you passed to :func:`train.report <ray.train.report>`
as an argument :ref:`in your training function <train-monitoring-and-logging>`.

Last saved checkpoint
~~~~~~~~~~~~~~~~~~~~~
Use :attr:`Result.checkpoint <ray.air.Result.checkpoint>` to retrieve the
last checkpoint.

.. literalinclude:: ../doc_code/key_concepts.py
    :language: python
    :start-after: __result_checkpoint_start__
    :end-before: __result_checkpoint_end__
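
How you load a model from the checkpoint depends on how your training function
saved it. As a sketch, assuming the training function saved a Torch state dict
under the (hypothetical) file name ``model.pt`` inside the checkpoint directory:

.. code-block:: python

    import os

    import torch

    with result.checkpoint.as_directory() as checkpoint_dir:
        # "model.pt" is a placeholder; use the file name your training
        # function actually passed to torch.save().
        state_dict = torch.load(os.path.join(checkpoint_dir, "model.pt"))
        model.load_state_dict(state_dict)  # ``model`` is your model instance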


Other checkpoints
~~~~~~~~~~~~~~~~~
Sometimes you want to access an earlier checkpoint. For instance, if your loss increased
after more training due to overfitting, you may want to retrieve the checkpoint with
the lowest loss.

You can retrieve a list of all available checkpoints and their metrics with
:attr:`Result.best_checkpoints <ray.air.Result.best_checkpoints>`.

.. literalinclude:: ../doc_code/key_concepts.py
    :language: python
    :start-after: __result_best_checkpoint_start__
    :end-before: __result_best_checkpoint_end__
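
Which checkpoints are retained, and how they are ranked, is configured through
:class:`~ray.air.CheckpointConfig`. As a sketch, this keeps the two checkpoints
with the lowest reported loss:

.. code-block:: python

    from ray.air import CheckpointConfig, RunConfig

    run_config = RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="loss",
            checkpoint_score_order="min",
        )
    )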

Storage location
----------------
If you need to retrieve the results later, you can get the storage location
with :attr:`Result.path <ray.air.Result.path>`.

This path will correspond to the :ref:`storage_path <train-log-dir>` you configured
in the :class:`~ray.air.RunConfig`.


.. literalinclude:: ../doc_code/key_concepts.py
    :language: python
    :start-after: __result_path_start__
    :end-before: __result_path_end__
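
The path points to a directory on the configured storage, so you can inspect or
archive it with standard tools. As a sketch, assuming local storage:

.. code-block:: python

    import os

    # Lists checkpoint directories and experiment metadata files
    print(os.listdir(result.path))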


Errors
------
If an error occurred during training,
:attr:`Result.error <ray.air.Result.error>` is set and contains the exception
that was raised.

.. literalinclude:: ../doc_code/key_concepts.py
    :language: python
    :start-after: __result_error_start__
    :end-before: __result_error_end__
