[train/docs] Restructure Ray Train docs with framework-specific guides #37892

Merged

Contributor

@krfricke krfricke commented Jul 28, 2023

Why are these changes needed?

This PR restructures the Ray Train docs to better mimic typical user journeys.

Primarily, we restructure the guides to be grouped by frameworks. Previously, we grouped by tasks (e.g. training, data loading, checkpointing) and had (tabbed) examples for some of the frameworks. Now, we group by framework on the first level and by task on the second level. The idea here is that users of e.g. PyTorch don't actually care about how things are done for XGBoost - they just want to be successful with training their PyTorch models.

This PR emphasizes support for PyTorch, which is guided by user feedback showing that PyTorch and related libraries are most commonly used.

Lastly, this PR also declutters the Ray Train documentation by removing duplicates (e.g. we had 4 different "quick start" examples for PyTorch before).

https://anyscale-ray--37892.com.readthedocs.build/en/37892/

Related issue number

Replaces #37808

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 5 commits July 27, 2023 10:44
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Comment on lines +118 to +119
Data loading and preprocessing
------------------------------
Contributor Author

There's stuff missing here. I'll fill this out once we've converged on the structure.

@angelinalg
Contributor

Thanks for this work, @krfricke! It's very exciting. My one request is to put Horovod, XGBoost, and TensorFlow under a level of hierarchy called "More frameworks", to reflect that most Ray users are using Torch-based frameworks. @matthewdeng @richardliaw, thoughts?

@richardliaw
Contributor

I would put Horovod into "More frameworks". However, many organizations and enterprises still use XGBoost and TensorFlow for production ML, so I would keep those at the top level.

@matthewdeng
Contributor

matthewdeng commented Jul 28, 2023

Okay. Would it make sense to combine the Horovod guide directly as part of the TensorFlow guide (and have tabs for Horovod specific code)?

@krfricke
Contributor Author

> Okay. Would it make sense to combine the Horovod guide directly as part of the TensorFlow guide (and have tabs for Horovod specific code)?

Horovod can also be used with PyTorch though.

Kai Fricke added 3 commits July 31, 2023 08:07
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
@@ -0,0 +1,244 @@
.. _train-checkpointing:

Saving and loading checkpoints
Contributor

nit: Can we use Title Case for now?

I talked to @angelinalg and we do want to move to sentence case, but it makes more sense to be consistent across the rest of the Ray documentation for now.

Member

@woshiyyya woshiyyya Aug 2, 2023

When I tried to add content for Lightning, I felt this title was a bit ambiguous.

At first, I thought there would be content discussing Ray Train's specific tools for checkpoint saving and loading. But actually, this section discusses:

  • how to report a checkpoint to Ray Train
  • how to resume training with the latest reported checkpoint

The actual saving and loading logic is handled by torch/lightning itself. Does "report and retrieve your checkpoint" or something similar make sense?
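
As a rough illustration of the two steps listed in this comment (report a checkpoint, then resume from the latest reported one), here is a minimal standard-library sketch. The helpers `report_checkpoint` and `latest_checkpoint` are hypothetical stand-ins invented for this example and are not Ray Train's API; the JSON file plays the role of whatever torch or lightning would actually write.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional

# Hypothetical registry of reported checkpoints (an assumption, not Ray Train's API).
_reported: List[Path] = []


def report_checkpoint(checkpoint_dir: Path) -> None:
    """Register a checkpoint directory that the training loop already wrote."""
    _reported.append(checkpoint_dir)


def latest_checkpoint() -> Optional[Path]:
    """Return the most recently reported checkpoint, or None on a fresh run."""
    return _reported[-1] if _reported else None


def train(root: Path, num_epochs: int) -> int:
    """Run a toy loop and return the epoch it started (or resumed) from."""
    start_epoch = 0
    ckpt = latest_checkpoint()
    if ckpt is not None:
        # Resuming: the framework-level "load" is plain JSON in this sketch.
        state = json.loads((ckpt / "state.json").read_text())
        start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, num_epochs):
        ckpt_dir = root / f"epoch_{epoch}"
        ckpt_dir.mkdir(parents=True, exist_ok=True)
        # The framework-level "save" happens here, before reporting.
        (ckpt_dir / "state.json").write_text(json.dumps({"epoch": epoch}))
        report_checkpoint(ckpt_dir)
    return start_epoch


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    train(root, num_epochs=2)          # fresh run
    print(train(root, num_epochs=4))   # second call resumes from the last report
```

The split mirrors the point above: the training loop owns saving and loading, while the reporting layer only tracks which checkpoint is the latest.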

Member

@woshiyyya woshiyyya Aug 2, 2023

OK, after an offline discussion with @matt, we agreed to keep "Saving and Loading Checkpoints" as the title, since we want to hide the implementation details of checkpoint syncing from users and let them just use the ray.train.* API to "save" and "load" checkpoints.

Kai Fricke added 8 commits August 2, 2023 09:40
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke changed the title from "[wip] [train/docs] Ray Train docs restructure" to "[train/docs] Restructure Ray Train docs" Aug 2, 2023
@krfricke krfricke changed the title from "[train/docs] Restructure Ray Train docs" to "[train/docs] Restructure Ray Train docs with framework-specific guides" Aug 2, 2023
doc/source/train/distributed-pytorch.rst (outdated, resolved)
doc/source/train/distributed-tensorflow-keras.rst (outdated, resolved)
doc/source/train/distributed-xgboost-lightgbm.rst (outdated, resolved)
doc/source/train/horovod.rst (outdated, resolved)
@angelinalg
Contributor

@krfricke Do you know what happened to this guide? https://docs.ray.io/en/latest/ray-air/examples/convert_existing_pytorch_code_to_ray_air.html
@GokuMohandas has indicated that this is a very high quality doc.

@matthewdeng
Contributor

> @krfricke Do you know what happened to this guide? https://docs.ray.io/en/latest/ray-air/examples/convert_existing_pytorch_code_to_ray_air.html @GokuMohandas has indicated that this is a very high quality doc.

I think it got moved to the Example Gallery? (Though it does seem like a User Guide) cc @richardliaw

@@ -1168,7 +1168,7 @@
"- We will restore a Prophet or ARIMA model directly from checkpoint, and demonstrate it can be used for prediction.\n",
"\n",
"```{tip}\n",
-"[Ray AIR Predictors](air-predictors) make batch inference easy since they have internal logic to parallelize the inference.\n",
+"Ray AIR Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",
Contributor

Suggested change
-"Ray AIR Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",
+"Ray Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",

Contributor Author

Let's defer the "AIR" removals. "Ray Predictors" makes it sound like a first-class concept, which it's not. The AIR removal hasn't been fully done yet, and IMO this change should be part of that. Here we just remove the references.

doc/source/ray-references/glossary.rst (resolved)
Contributor

@matthewdeng matthewdeng left a comment

Overall structure looks good to me; we can follow up on content in separate PRs. Thanks for all the iterations!

doc/source/_static/js/custom.js (resolved)
doc/source/_toc.yml (outdated, resolved)
krfricke and others added 3 commits August 3, 2023 08:59
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
…ture-alternative

# Conflicts:
#	doc/source/ray-air/examples/gptj_batch_prediction.ipynb
#	doc/source/train/dl_guide.rst
Signed-off-by: Kai Fricke <kai@anyscale.com>
@@ -462,7 +462,7 @@ Models that have been trained with :ref:`Ray Train <train-docs>` can then be use

checkpoint = result.checkpoint

-**Step 3:** Use Ray Data for batch inference. To load in the model from the :class:`Checkpoint <ray.air.checkpoint.Checkpoint>` inside the Python class, use one of the :ref:`framework-specific Checkpoint classes <train-framework-catalog>`.
+**Step 3:** Use Ray Data for batch inference. To load in the model from the :class:`Checkpoint <ray.air.checkpoint.Checkpoint>` inside the Python class, use one of the framework-specific Checkpoint classes.
Contributor

Is there no link anymore to framework specific Checkpoints? If these are deprecated, what's the recommended way to do prediction with models trained with Ray Train?

Member

> If these are deprecated, what's the recommended way to do prediction with models trained with Ray Train?

Also wondering this.

Member

Based on @ericl's Slack comment, it sounds like references to framework-specific checkpoints should be removed from the docs. Could we replace "use one of the framework-specific Checkpoint classes." with the current recommendation?

Contributor Author

Let's decouple the removal from this PR. We'll update the content once we have fully quarantined the framework-specific checkpoint classes.

Contributor

> Let's decouple the removal from this PR.

I agree, but this PR makes the removal, right? The classes are no longer documented once this PR is merged.

Member

@bveeramani bveeramani left a comment

LGTM other than the framework-specific checkpoint thing.

@krfricke krfricke merged commit 0e4842b into ray-project:master Aug 3, 2023
81 of 84 checks passed
@krfricke krfricke deleted the doc/train-restructure-alternative branch August 3, 2023 17:54
@pcmoritz
Contributor

pcmoritz commented Aug 3, 2023

Don't forget to set redirects in Read the Docs for the pages that have been moved in this PR. cc @angelinalg @matthewdeng

NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: NripeshN <nn2012@hw.ac.uk>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
8 participants