
[train/docs] Restructure Ray Train docs with framework-specific guides #37892

Merged
13 changes: 13 additions & 0 deletions doc/source/_static/js/custom.js
@@ -37,20 +37,33 @@ document.addEventListener("DOMContentLoaded", function() {
let navItem = navItems[i];
const stringList = [
"User Guides", "Examples",
// Ray Core
"Ray Core", "Ray Core API",
"Ray Clusters", "Deploying on Kubernetes", "Deploying on VMs",
"Applications Guide", "Ray Cluster Management API",
// Ray AIR
"Ray AIR API",
// Ray Data
"Ray Data", "Ray Data API", "Integrations",
krfricke marked this conversation as resolved.
// Ray Train
"Ray Train", "Ray Train API",
"Distributed PyTorch", "Advanced Topics", "More Frameworks",
"Ray Train Internals",
// Ray Tune
"Ray Tune", "Ray Tune Examples", "Ray Tune API",
// Ray Serve
"Ray Serve", "Ray Serve API",
"Production Guide", "Advanced Guides",
"Deploy Many Models",
// Ray RLlib
"Ray RLlib", "Ray RLlib API",
// More libraries
"More Libraries", "Ray Workflows (Alpha)",
// Monitoring/debugging
"Monitoring and Debugging",
// References
"References", "Use Cases",
// Developer guides
"Developer Guides", "Getting Involved / Contributing",
];

46 changes: 26 additions & 20 deletions doc/source/_toc.yml
@@ -59,29 +59,35 @@ parts:
- file: train/train
title: Ray Train
sections:
- file: train/getting-started
title: "Getting Started"
- file: train/key-concepts
Contributor: Same for this file. Can this info be moved to the Distributed PyTorch landing page or the PyTorch Quickstart? Or can we prune a lot of it, because things like Predictors and Checkpoints are being deemphasized?

Contributor Author: I would advocate keeping this at the top level - this is the only conceptual glue that ties the Ray Train library together. We can remove Predictors when they are deprecated; Checkpoints will continue to exist in some form.
title: "Key Concepts"
- file: train/user-guides
title: "User Guides"
- file: train/distributed-pytorch
Contributor: nit: I'm not sure where this is configured, but it would be good to have the drop-down caret show in the ToC.

Contributor Author: I'd like that, too, and have no idea why it doesn't show up. I'll try a few more things...

Contributor: I took a look and think it's defined here:
I took a look and think it's defined here:

// Reintroduce dropdown icons on the sidebar. This is a hack, as we can't
// programmatically figure out which nav items have children anymore.
document.addEventListener("DOMContentLoaded", function() {
let navItems = document.querySelectorAll(".bd-sidenav li");
for (let i = 0; i < navItems.length; i++) {
let navItem = navItems[i];
const stringList = [
"User Guides", "Examples",
"Ray Core", "Ray Core API",
"Ray Clusters", "Deploying on Kubernetes", "Deploying on VMs",
"Applications Guide", "Ray Cluster Management API",
"Ray AIR API",
"Ray Data", "Ray Data API", "Integrations",
"Ray Train", "Ray Train API",
"Ray Tune", "Ray Tune Examples", "Ray Tune API",
"Ray Serve", "Ray Serve API",
"Production Guide", "Advanced Guides",
"Deploy Many Models",
"Ray RLlib", "Ray RLlib API",
"More Libraries", "Ray Workflows (Alpha)",
"Monitoring and Debugging",
"References", "Use Cases",
"Developer Guides", "Getting Involved / Contributing",
];

Contributor Author: Thank you! I'll update the list.

sections:
- file: train/distributed-pytorch/converting-existing-training-loop
- file: train/distributed-pytorch/data-loading-preprocessing
- file: train/distributed-pytorch/using-gpus
Contributor: Should we group the pages needed for production together?

Contributor Author: I'm not sure what you mean - what do you mean by "needed for production"? I don't think we should introduce more groups here.

Contributor: In my mental model, there are two distinct jobs to be done that first-time users can have:

  1. Run Ray locally (proof of concept)
  2. Run Ray on a cluster (production)

It may be helpful to delineate what guides/steps are needed to achieve 1; the next level of complexity would be to achieve 2.

- file: train/distributed-pytorch/persistent-storage
title: Configuring Persistent Storage
- file: train/distributed-pytorch/monitoring-logging
- file: train/distributed-pytorch/checkpoints
- file: train/distributed-pytorch/experiment-tracking
- file: train/distributed-pytorch/fault-tolerance
- file: train/distributed-pytorch/advanced
sections:
- file: train/distributed-pytorch/reproducibility
- file: train/distributed-pytorch/automatic-mixed-precision
- file: train/distributed-pytorch/hyperparameter-optimization
title: Hyperparameter optimization
- file: train/more-frameworks
sections:
- file: train/distributed-tensorflow-keras
- file: train/distributed-xgboost-lightgbm
- file: train/horovod
- file: train/internals/index
sections:
- file: train/config_guide
title: "Configuring Ray Train"
- file: train/dl_guide
title: "Deep Learning Guide"
- file: train/hf_trainers
title: "Hugging Face Trainers"
- file: train/gbdt
title: "XGBoost/LightGBM guide"
- file: train/architecture
title: "Ray Train Architecture"
- file: train/train-with-tune
title: "Using Ray Train with Ray Tune"
- file: train/check-ingest
title: "Configuring Training Datasets"
- file: train/predictors
- file: train/benchmarks
- file: train/internals/architecture
- file: train/internals/benchmarks
- file: train/internals/environment-variables
- file: train/examples
title: "Examples"
sections:
2 changes: 1 addition & 1 deletion doc/source/data/batch_inference.rst
@@ -462,7 +462,7 @@ Models that have been trained with :ref:`Ray Train <train-docs>` can then be use

checkpoint = result.checkpoint

**Step 3:** Use Ray Data for batch inference. To load in the model from the :class:`Checkpoint <ray.air.checkpoint.Checkpoint>` inside the Python class, use one of the :ref:`framework-specific Checkpoint classes <train-framework-catalog>`.
**Step 3:** Use Ray Data for batch inference. To load in the model from the :class:`Checkpoint <ray.air.checkpoint.Checkpoint>` inside the Python class, use one of the framework-specific Checkpoint classes.
Contributor: Is there no link anymore to framework-specific Checkpoints? If these are deprecated, what's the recommended way to do prediction with models trained with Ray Train?

Member:

> If these are deprecated, what's the recommended way to do prediction with models trained with Ray Train?

Also wondering this.

Member: Based on @ericl's Slack comment, it sounds like references to framework-specific checkpoints should be removed from the docs. Could we replace "use one of the framework-specific Checkpoint classes." with the current recommendation?

Contributor Author: Let's decouple the removal from this PR. We'll update the content once we've fully quarantined the framework-specific checkpoint classes.

Contributor:

> Let's decouple the removal from this PR.

I agree, but this PR makes the removal, right? It's no longer documented once this PR is merged.

In this case, we use the :class:`XGBoostCheckpoint <ray.train.xgboost.XGBoostCheckpoint>` to load the model.
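For concreteness, here is a minimal sketch of that pattern (an editor's illustration assuming Ray ~2.6-era APIs such as `XGBoostCheckpoint.get_model()` and `Dataset.map_batches()`; it is not part of this PR's diff, and argument names like `compute=` have changed across Ray versions):

```python
import numpy as np
import xgboost
import ray
from ray.train.xgboost import XGBoostCheckpoint

class XGBoostPredictor:
    def __init__(self, checkpoint: XGBoostCheckpoint):
        # Re-create the booster once per actor rather than once per batch.
        self.model = checkpoint.get_model()

    def __call__(self, batch: dict) -> dict:
        # `batch` is a dict of column name -> numpy array; assumes all-numeric features.
        features = np.stack(list(batch.values()), axis=1)
        return {"predictions": self.model.predict(xgboost.DMatrix(features))}

# `result.checkpoint` and `ds` come from the training and data-loading steps above.
checkpoint = XGBoostCheckpoint.from_checkpoint(result.checkpoint)
predictions = ds.map_batches(
    XGBoostPredictor,
    fn_constructor_args=(checkpoint,),
    compute=ray.data.ActorPoolStrategy(size=2),
)
```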

2 changes: 1 addition & 1 deletion doc/source/data/iterating-over-data.rst
@@ -273,7 +273,7 @@ into disjoint shards.

If you're using :ref:`Ray Train <train-docs>`, you don't need to split the dataset.
Ray Train automatically splits your dataset for you. To learn more, see
:ref:`Configuring training datasets <air-ingest>`.
:ref:`Configuring training datasets <data-ingest-torch>`.

.. testcode::

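For illustration, a short sketch of the automatic splitting described above (assuming Ray 2.7-style `ray.train` APIs; not taken from this PR's diff):

```python
import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Each of the 4 workers receives a disjoint shard of the "train" dataset;
    # no manual call to Dataset.split() is needed.
    shard = train.get_dataset_shard("train")
    for batch in shard.iter_batches(batch_size=32):
        ...  # forward/backward pass goes here

trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": ray.data.range(1000)},
    scaling_config=ScalingConfig(num_workers=4),
)
trainer.fit()
```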
2 changes: 1 addition & 1 deletion doc/source/data/preprocessors.rst
@@ -15,7 +15,7 @@ Ray AIR provides several common preprocessors out of the box and interfaces to d
Overview
--------

The most common way of using a preprocessor is by passing it as an argument to the constructor of a Ray Train :ref:`Trainer <train-getting-started>` in conjunction with a :ref:`Ray Data dataset <data>`.
The most common way of using a preprocessor is by passing it as an argument to the constructor of a Ray Train :ref:`Trainer <train-docs>` in conjunction with a :ref:`Ray Data dataset <data>`.
For example, the following code trains a model with a preprocessor that normalizes the data.

.. literalinclude:: doc_code/preprocessors.py
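As a concrete illustration of the pattern this page describes (an AIR-era sketch adapted from the Ray docs, assuming the Trainer constructor still accepts a `preprocessor=` argument; not part of this PR's diff):

```python
import ray
from ray.data.preprocessors import StandardScaler
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

train_dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
# Normalize two feature columns before they reach the training workers.
preprocessor = StandardScaler(columns=["mean radius", "mean texture"])

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2),
    label_column="target",
    params={"objective": "binary:logistic"},
    datasets={"train": train_dataset},
    preprocessor=preprocessor,
)
result = trainer.fit()
```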
2 changes: 1 addition & 1 deletion doc/source/data/working-with-pytorch.rst
@@ -82,7 +82,7 @@ Ray Data integrates with :ref:`Ray Train <train-docs>` for easy data ingest for

...

For more details, see the :ref:`Ray Train user guide <train-datasets>`.
For more details, see the :ref:`Ray Train user guide <data-ingest-torch>`.
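A minimal sketch of the Ray Data to PyTorch handoff this page covers (an editor's illustration assuming current `ray.data` APIs; not from this PR's diff):

```python
import ray

ds = ray.data.range(8)  # a dataset with a single int64 "id" column
for batch in ds.iter_torch_batches(batch_size=4):
    # Each batch is a dict of column name -> torch.Tensor.
    print(batch["id"].shape)  # torch.Size([4])
```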

.. _transform_pytorch:

4 changes: 0 additions & 4 deletions doc/source/ray-air/api/configs.rst
@@ -4,10 +4,6 @@ Ray AIR Configurations

.. TODO(ml-team): Add a general AIR configuration guide that covers all of these configs.

.. seealso::

See :ref:`this Ray Train configuration user guide <train-config>` for more details.

.. currentmodule:: ray

.. autosummary::
2 changes: 1 addition & 1 deletion doc/source/ray-air/api/dataset-ingest.rst
@@ -3,7 +3,7 @@ Ray Data Ingest into AIR Trainers

.. seealso::

See this :ref:`AIR Data ingest guide <air-ingest>` for usage examples.
See this :ref:`AIR Data ingest guide <data-ingest-torch>` for usage examples.

.. currentmodule:: ray

Expand Down
5 changes: 0 additions & 5 deletions doc/source/ray-air/api/predictor.rst
@@ -1,11 +1,6 @@
Predictor
=========

.. seealso::

See this :ref:`user guide on performing model inference <air-predictors>` in
AIR for usage examples.

.. currentmodule:: ray.train

Predictor Interface
8 changes: 2 additions & 6 deletions doc/source/ray-air/computer-vision.rst
@@ -183,7 +183,7 @@ Training vision models
:end-before: __torch_trainer_stop__
:dedent:

For more in-depth examples, see :ref:`Using Trainers <train-getting-started>`.
For more in-depth examples, see :ref:`the Ray Train documentation <train-docs>`.

.. tab-item:: TensorFlow

@@ -202,7 +202,7 @@ Training vision models
:end-before: __tensorflow_trainer_stop__
:dedent:

For more information, check out :ref:`the Ray Train documentation <train-getting-started>`.
For more information, check out :ref:`the Ray Train documentation <train-docs>`.

Creating checkpoints
--------------------
@@ -259,8 +259,6 @@ image datasets.
:end-before: __torch_batch_predictor_stop__
:dedent:

For more in-depth examples, read :ref:`Using Predictors for Inference <air-predictors>`.

.. tab-item:: TensorFlow

To create a :class:`~ray.train.batch_predictor.BatchPredictor`, call
@@ -272,8 +270,6 @@
:end-before: __tensorflow_batch_predictor_stop__
:dedent:

For more information, read :ref:`Using Predictors for Inference <air-predictors>`.
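For reference, the BatchPredictor pattern these tabs describe looks roughly like this (an AIR-era sketch, since deprecated; `checkpoint` and `dataset` are assumed to come from the preceding steps and are not defined here):

```python
from ray.train.batch_predictor import BatchPredictor
from ray.train.torch import TorchPredictor

# `checkpoint` comes from training; `dataset` from the ingest step above.
batch_predictor = BatchPredictor.from_checkpoint(checkpoint, TorchPredictor)
predictions = batch_predictor.predict(dataset, batch_size=32)
```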

Serving vision models
---------------------

2 changes: 1 addition & 1 deletion doc/source/ray-air/examples/batch_forecasting.ipynb
@@ -1167,7 +1167,7 @@
"- We will restore a Prophet or ARIMA model directly from checkpoint, and demonstrate it can be used for prediction.\n",
"\n",
"```{tip}\n",
"[Ray AIR Predictors](air-predictors) make batch inference easy since they have internal logic to parallelize the inference.\n",
"Ray AIR Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",
Contributor:

Suggested change:
- "Ray AIR Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",
+ "Ray Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",

Contributor Author: Let's defer the "AIR" removals - "Ray Predictors" makes it sound like a first-class concept, which it's not. The AIR removal hasn't been fully done yet - IMO this should be part of that. Here we just remove the references.
"```\n"
]
},
Expand Down
2 changes: 1 addition & 1 deletion doc/source/ray-air/examples/batch_tuning.ipynb
@@ -984,7 +984,7 @@
"metadata": {},
"source": [
"```{tip}\n",
"[Ray AIR Predictors](air-predictors) make batch inference easy since they have internal logic to parallelize the inference.\n",
"Ray AIR Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",
krfricke marked this conversation as resolved.
"```\n",
"\n",
"Finally, we will restore the best and worst models from checkpoint and make predictions. \n",
@@ -74,7 +74,7 @@
"source": [
"First, we load and preprocess the MNIST dataset.\n",
"\n",
"Assumption for this tutorial: your existing code is using the `tf.data.Dataset` native to Tensorflow. This tutorial continues to use `tf.data.Dataset` to allow you to make as few code changes as possible. **Everything in this tutorial is also possible if you choose to use Ray Data, and you will also get the benefits of efficient preprocessing and multi-worker batch prediction.** See [here](train-datasets) for resources to get started with Ray Data."
"Assumption for this tutorial: your existing code is using the `tf.data.Dataset` native to Tensorflow. This tutorial continues to use `tf.data.Dataset` to allow you to make as few code changes as possible. **Everything in this tutorial is also possible if you choose to use Ray Data, and you will also get the benefits of efficient preprocessing and multi-worker batch prediction.** See [here](data-ingest-torch) for resources to get started with Ray Data."
]
},
{
@@ -519,9 +519,7 @@
"\n",
"A few notes on the configs set below:\n",
"- `train_loop_config` sets the hyperparameters passed into the training loop as the `config` parameter\n",
"- `scaling_config` configures **how many parallel workers to use**, the **resources required per worker**, and whether we want to **enable GPU training** or not.\n",
"\n",
"See this [configuration guide](train-config) for more details on how to configure the trainer."
"- `scaling_config` configures **how many parallel workers to use**, the **resources required per worker**, and whether we want to **enable GPU training** or not."
]
},
{
@@ -617,8 +615,6 @@
"\n",
"In our [other examples](ref-ray-examples) you can learn how to do more things with Ray, such as **serving your model with Ray Serve** or **tune your hyperparameters with Ray Tune**. You can also learn how to perform {ref}`offline batch inference <batch_inference_home>` with Ray Data.\n",
"\n",
"See [this table](train-framework-catalog) for a full catalog of frameworks that AIR supports out of the box.\n",
"\n",
"We hope this tutorial gave you a good starting point to leverage Ray AIR. If you have any questions, suggestions, or run into any problems pelase reach out on [Discuss](https://discuss.ray.io/), [GitHub](https://github.com/ray-project/ray) or the [Ray Slack](https://forms.gle/9TSdDYUgxYs8SA9e8)!"
krfricke marked this conversation as resolved.
]
}
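To make the `scaling_config` notes in the tutorial above concrete, a small sketch (assuming the `ray.train` TensorflowTrainer API current at the time of this PR; `train_func` is the tutorial's training loop and is not defined here):

```python
from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,  # the training loop from the tutorial
    # Hyperparameters passed into the loop as its `config` parameter.
    train_loop_config={"lr": 1e-3, "epochs": 4, "batch_size": 64},
    # 2 parallel workers, CPU-only training.
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```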
2 changes: 1 addition & 1 deletion doc/source/ray-air/examples/gptj_batch_prediction.ipynb
@@ -224,7 +224,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You may notice that we are not using an AIR {class}`Predictor <ray.train.predictor.Predictor>` here. This is because Predictors are mainly intended to be used with AIR {class}`Checkpoints <ray.train.Checkpoint>`, which we don't for this example. See {ref}`air-predictors` for more information and usage examples."
"You may notice that we are not using an AIR {class}`Predictor <ray.train.predictor.Predictor>` here. This is because Predictors are mainly intended to be used with AIR {class}`Checkpoints <ray.train.Checkpoint>`, which we don't for this example. See {class}`ray.train.predictor.Predictor` for more information and usage examples."
]
}
],
@@ -224,7 +224,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You may notice that we are not using an AIR {class}`Predictor <ray.train.predictor.Predictor>` here. This is because AIR does not implement an out of the box Predictor for Diffusers. We could implement it ourselves, but Predictors are mainly intended to be used with AIR {class}`Checkpoints <ray.air.checkpoint.Checkpoint>`, and those are not necessary for this example. See {ref}`air-predictors` for more information and usage examples."
"You may notice that we are not using an AIR {class}`Predictor <ray.train.predictor.Predictor>` here. This is because AIR does not implement an out of the box Predictor for Diffusers. We could implement it ourselves, but Predictors are mainly intended to be used with AIR {class}`Checkpoints <ray.air.checkpoint.Checkpoint>`, and those are not necessary for this example. See {class}`ray.train.predictor.Predictor` for more information and usage examples."
]
}
],
2 changes: 1 addition & 1 deletion doc/source/ray-overview/use-cases.rst
@@ -130,7 +130,7 @@ Learn more about the Tune library with the following talks and user guides.
Distributed Training
--------------------

The :ref:`Ray Train <train-userguides>` library integrates many distributed training frameworks under a simple Trainer API,
The :ref:`Ray Train <train-docs>` library integrates many distributed training frameworks under a simple Trainer API,
providing distributed orchestration and management capabilities out of the box.

In contrast to training many models, model parallelism partitions a large model across many machines for training. Ray Train has built-in abstractions for distributing shards of models and running training in parallel.
6 changes: 3 additions & 3 deletions doc/source/ray-references/glossary.rst
@@ -99,7 +99,7 @@ documentation, sorted alphabetically.
to compute and apply one gradient update to the model weights.

Batch predictor
A :ref:`Ray AIR Batch Predictor<air-predictors>` builds on the Predictor class
A :class:`Ray AIR Batch Predictor<ray.train.predictor.Predictor>` builds on the Predictor class
krfricke marked this conversation as resolved.
to parallelize inference on a large dataset. A Batch predictor shards the
dataset to allow multiple workers to do inference on a smaller number of data
points and then aggregating all the worker predictions at the end.
@@ -413,7 +413,7 @@ documentation, sorted alphabetically.
.. TODO: Policy evaluation

Predictor
:ref:`An interface for performing inference<air-predictors>` (prediction)
:class:`An interface for performing inference<ray.train.predictor.Predictor>` (prediction)
on input data with a trained model.

Preprocessor
@@ -603,7 +603,7 @@ documentation, sorted alphabetically.
(e.g., for sharing computed gradients).

Trainer configuration
:ref:`A Trainer can be configured in various ways<train-config>`. Some
A Trainer can be configured in various ways. Some
configurations are shared across all trainers, like the RunConfig, which
configures things like the experiment storage, and ScalingConfig, which
configures the number of training workers as well as resources needed per
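For illustration, the shared trainer configuration this glossary entry describes might look like the following sketch (assuming `ray.train`'s RunConfig and ScalingConfig; names and import paths have shifted across Ray versions):

```python
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker=lambda config: None,  # placeholder training loop
    # RunConfig: experiment name and where results/checkpoints are stored.
    run_config=RunConfig(name="demo", storage_path="/tmp/ray_results"),
    # ScalingConfig: number of training workers and resources per worker.
    scaling_config=ScalingConfig(num_workers=2, resources_per_worker={"CPU": 1}),
)
```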