[train/docs] Restructure Ray Train docs with framework-specific guides #37892

krfricke · 2023-07-28T10:23:14Z

Why are these changes needed?

This PR restructures the Ray Train docs to better mimic typical user journeys.

Primarily, we restructure the guides to be grouped by frameworks. Previously, we grouped by tasks (e.g. training, data loading, checkpointing) and had (tabbed) examples for some of the frameworks. Now, we group by framework on the first level and by task on the second level. The idea here is that users of e.g. PyTorch don't actually care about how things are done for XGBoost - they just want to be successful with training their PyTorch models.
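
For illustration, the regrouping described above amounts to inverting a two-level mapping. The following is a minimal sketch with hypothetical page names (they are assumptions, not the actual files touched in this PR):

```python
from collections import defaultdict

# Hypothetical page inventory in the old layout: task first, framework second.
# The page names are illustrative only and do not come from this PR.
by_task = {
    "training": {"pytorch": "train-pytorch", "xgboost": "train-xgboost"},
    "data loading": {"pytorch": "data-pytorch", "tensorflow": "data-tensorflow"},
    "checkpointing": {"pytorch": "ckpt-pytorch", "xgboost": "ckpt-xgboost"},
}


def group_by_framework(task_first):
    """Invert a task -> framework -> page mapping into framework -> task -> page."""
    framework_first = defaultdict(dict)
    for task, frameworks in task_first.items():
        for framework, page in frameworks.items():
            framework_first[framework][task] = page
    return dict(framework_first)


by_framework = group_by_framework(by_task)
print(sorted(by_framework["pytorch"]))
```

With this layout, a PyTorch user only has to look under `by_framework["pytorch"]` and never sees the XGBoost pages.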

This PR emphasizes support for PyTorch, which is guided by user feedback showing that PyTorch and related libraries are most commonly used.

Lastly, this PR also declutters the Ray Train documentation by removing duplicates (e.g. we had 4 different "quick start" examples for PyTorch before).

https://anyscale-ray--37892.com.readthedocs.build/en/37892/

Related issue number

Replaces #37808

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke · 2023-07-28T14:32:45Z

doc/source/train/distributed-tensorflow-keras.rst

+Data loading and preprocessing
+------------------------------


There's content missing here. I will fill this out once we have converged on the structure.

angelinalg · 2023-07-28T15:06:37Z

Thanks for this work, @krfricke! It's very exciting. My one request is to put Horovod, XGBoost, and TensorFlow under a level of hierarchy called "More frameworks", to reflect that most Ray users are using Torch-based frameworks. @matthewdeng @richardliaw, thoughts?

richardliaw · 2023-07-28T18:28:33Z

I would put Horovod into "More frameworks". However, many organizations and enterprises are still using XGBoost and TensorFlow for production ML, so I would keep those at the top level.

matthewdeng · 2023-07-28T19:06:06Z

Okay. Would it make sense to combine the Horovod guide directly as part of the TensorFlow guide (and have tabs for Horovod specific code)?

krfricke · 2023-07-31T06:02:37Z

Okay. Would it make sense to combine the Horovod guide directly as part of the TensorFlow guide (and have tabs for Horovod specific code)?

Horovod can also be used with PyTorch though.

Signed-off-by: Kai Fricke <kai@anyscale.com>

matthewdeng · 2023-08-01T17:00:03Z

doc/source/train/distributed-pytorch/checkpoints.rst

@@ -0,0 +1,244 @@
+.. _train-checkpointing:
+
+Saving and loading checkpoints


nit: Can we use Title Case for now?

I talked to @angelinalg and we do want to move to sentence case, but it makes more sense to be consistent across the rest of the Ray documentation for now.

When I tried to add content for Lightning, I felt this title was a bit ambiguous.

At first, I thought this section would discuss Ray Train's specific tools for checkpoint saving and loading. But actually, it discusses:

- how to report a checkpoint to Ray Train

- how to resume training with the latest reported checkpoint

The actual saving and loading logic is handled by torch/lightning itself. Does "report and retrieve your checkpoint" or something similar make sense?

OK, after an offline discussion with @matt, we agreed to keep "Saving and Loading Checkpoints" as the title, since we want to hide the implementation details of checkpoint syncing from the user and let them just use the ray.train.* API to "save" and "load" checkpoints.
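
To make the distinction in this thread concrete, here is a toy sketch of the split between framework-level saving/loading and orchestrator-level reporting/retrieving. `ToyCheckpointStore` and its methods are hypothetical stand-ins invented for this example; they are not Ray Train's API, and JSON files stand in for whatever torch/lightning would actually write.

```python
import json
import shutil
import tempfile
from pathlib import Path


class ToyCheckpointStore:
    """Hypothetical stand-in for the orchestration layer (not Ray Train's API)."""

    def __init__(self, root):
        self.root = Path(root)
        self.reported = []  # list of (metrics, persisted checkpoint path)

    def report(self, metrics, checkpoint_dir):
        # Persist a copy so the caller can delete its temporary directory.
        target = self.root / f"checkpoint_{len(self.reported):06d}"
        shutil.copytree(checkpoint_dir, target)
        self.reported.append((dict(metrics), target))

    def latest(self):
        # Return the most recently reported checkpoint, or None on a fresh run.
        return self.reported[-1][1] if self.reported else None


def train(store, num_epochs):
    start_epoch = 0
    latest = store.latest()
    if latest is not None:
        # Loading is the framework's job; JSON stands in for it here.
        state = json.loads((latest / "state.json").read_text())
        start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, num_epochs):
        with tempfile.TemporaryDirectory() as tmp:
            # Saving is the framework's job; JSON stands in for it here.
            (Path(tmp) / "state.json").write_text(json.dumps({"epoch": epoch}))
            store.report({"epoch": epoch}, tmp)
    return start_epoch


storage = tempfile.mkdtemp()
store = ToyCheckpointStore(storage)
first_start = train(store, num_epochs=2)   # fresh run, nothing to resume from
second_start = train(store, num_epochs=4)  # resumes after the last reported epoch
```

From the user's point of view the training function only "saves" and "loads"; where the reported copies end up is a detail of the store, which is the argument for keeping the title as is.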

…ture-alternative

Signed-off-by: Kai Fricke <kai@anyscale.com>

doc/source/train/distributed-pytorch.rst

doc/source/train/distributed-tensorflow-keras.rst

doc/source/train/distributed-xgboost-lightgbm.rst

doc/source/train/horovod.rst

angelinalg · 2023-08-02T23:25:11Z

@krfricke Do you know what happened to this guide? https://docs.ray.io/en/latest/ray-air/examples/convert_existing_pytorch_code_to_ray_air.html
@GokuMohandas has indicated that this is a very high quality doc.

matthewdeng · 2023-08-02T23:27:39Z

@krfricke Do you know what happened to this guide? https://docs.ray.io/en/latest/ray-air/examples/convert_existing_pytorch_code_to_ray_air.html @GokuMohandas has indicated that this is a very high quality doc.

I think it got moved to the Example Gallery? (Though it does seem like a User Guide) cc @richardliaw

angelinalg · 2023-08-02T23:30:58Z

doc/source/ray-air/examples/batch_forecasting.ipynb

@@ -1168,7 +1168,7 @@
    "- We will restore a Prophet or ARIMA model directly from checkpoint, and demonstrate it can be used for prediction.\n",
    "\n",
    "```{tip}\n",
-    "[Ray AIR Predictors](air-predictors) make batch inference easy since they have internal logic to parallelize the inference.\n",
+    "Ray AIR Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",


Suggested change

"Ray AIR Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",

"Ray Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",

Let's defer the "AIR" removals. "Ray Predictors" makes it sound like a first-class concept, which it's not. The AIR removal hasn't been fully done yet, and IMO this should be part of that. Here we just remove the references.

doc/source/ray-air/examples/batch_tuning.ipynb

doc/source/ray-air/examples/convert_existing_tf_code_to_ray_air.ipynb

doc/source/ray-references/glossary.rst

doc/source/train/distributed-pytorch/monitoring-logging.rst

doc/source/tune/examples/tune_analyze_results.ipynb

matthewdeng

Overall structure looks good to me, we can follow up on content in separate PRs. Thanks for all the iterations!

doc/source/_static/js/custom.js

doc/source/_toc.yml

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>

…ture-alternative # Conflicts: # doc/source/ray-air/examples/gptj_batch_prediction.ipynb # doc/source/train/dl_guide.rst

Signed-off-by: Kai Fricke <kai@anyscale.com>

amogkam · 2023-08-03T17:19:39Z

doc/source/data/batch_inference.rst

@@ -462,7 +462,7 @@ Models that have been trained with :ref:`Ray Train <train-docs>` can then be use

    checkpoint = result.checkpoint

-**Step 3:** Use Ray Data for batch inference. To load in the model from the :class:`Checkpoint <ray.air.checkpoint.Checkpoint>` inside the Python class, use one of the :ref:`framework-specific Checkpoint classes <train-framework-catalog>`.
+**Step 3:** Use Ray Data for batch inference. To load in the model from the :class:`Checkpoint <ray.air.checkpoint.Checkpoint>` inside the Python class, use one of the framework-specific Checkpoint classes.


Is there no link anymore to framework specific Checkpoints? If these are deprecated, what's the recommended way to do prediction with models trained with Ray Train?

If these are deprecated, what's the recommended way to do prediction with models trained with Ray Train?

Also wondering this.

Based on @ericl's Slack comment, it sounds like references to framework-specific checkpoints should be removed from the docs. Could we replace "use one of the framework-specific Checkpoint classes." with the current recommendation?

Let's decouple the removal from this PR. We'll update the content once we have fully quarantined the framework-specific checkpoint classes.

Let's decouple the removal from this PR.

I agree, but this PR makes the removal, right? It's no longer documented once this PR is merged.

bveeramani

LGTM other than the framework-specific checkpoint thing.

bveeramani · 2023-08-03T17:32:58Z

doc/source/data/batch_inference.rst


If these are deprecated, what's the recommended way to do prediction with models trained with Ray Train?

Also wondering this.

bveeramani · 2023-08-03T17:34:58Z

doc/source/data/batch_inference.rst


Based on @ericl's Slack comment, it sounds like references to framework-specific checkpoints should be removed from the docs. Could we replace "use one of the framework-specific Checkpoint classes." with the current recommendation?

pcmoritz · 2023-08-03T22:52:20Z

Don't forget to set redirects in readthedocs for the things that have been moved in this PR cc @angelinalg @matthewdeng

ray-project#37892) This PR restructures the Ray Train docs to better mimic typical user journeys. Primarily, we restructure the guides to be grouped by frameworks. Previously, we grouped by tasks (e.g. training, data loading, checkpointing) and had (tabbed) examples for some of the frameworks. Now, we group by framework on the first level and by task on the second level. The idea here is that users of e.g. PyTorch don't actually care about how things are done for XGBoost - they just want to be successful with training their PyTorch models. This PR emphasizes support for PyTorch, which is guided by user feedback showing that PyTorch and related libraries are most commonly used. Lastly, this PR also declutters the Ray Train documentation by removing duplicates (e.g. we had 4 different "quick start" examples for PyTorch before). Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: NripeshN <nn2012@hw.ac.uk>


Kai Fricke added 5 commits July 27, 2023 10:44

start restructuring

7a79003

Signed-off-by: Kai Fricke <kai@anyscale.com>

restructure

de5582a

Signed-off-by: Kai Fricke <kai@anyscale.com>

cont

7d922c9

Signed-off-by: Kai Fricke <kai@anyscale.com>

structure

b34fe4e

Signed-off-by: Kai Fricke <kai@anyscale.com>

Better intro to PyTorch

13b4293

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke commented Jul 28, 2023


krfricke assigned matthewdeng, richardliaw and angelinalg Jul 28, 2023

Kai Fricke added 3 commits July 31, 2023 08:07

Merge branch 'master' into doc/train-restructure-alternative

5926a7c

Tensorflow guide

e8b78cd

Signed-off-by: Kai Fricke <kai@anyscale.com>

gpus + resources

6f1b4a8

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke marked this pull request as ready for review July 31, 2023 15:26

krfricke requested review from richardliaw, gjoliver, xwjiang2010, amogkam, matthewdeng, Yard1, maxpumperla, a team, ericl, scv119, c21, scottjlee and bveeramani as code owners July 31, 2023 15:26

matthewdeng reviewed Aug 1, 2023


Kai Fricke added 8 commits August 2, 2023 09:40

Merge remote-tracking branch 'upstream/master' into doc/train-restruc…

5207568

…ture-alternative

renames, dropdowns

803eb09

Signed-off-by: Kai Fricke <kai@anyscale.com>

experiment tracking

177afaf

Signed-off-by: Kai Fricke <kai@anyscale.com>

update configuration page

ad26fff

Signed-off-by: Kai Fricke <kai@anyscale.com>

fix example

7dfa8fe

Signed-off-by: Kai Fricke <kai@anyscale.com>

fix label

01f5e8f

Signed-off-by: Kai Fricke <kai@anyscale.com>

discussion

09237a8

Signed-off-by: Kai Fricke <kai@anyscale.com>

remove reference

6327fb2

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke changed the title ~~[wip] [train/docs] Ray Train docs restructure~~ [train/docs] Restructure Ray Train docs Aug 2, 2023

krfricke changed the title ~~[train/docs] Restructure Ray Train docs~~ [train/docs] Restructure Ray Train docs with framework-specific guides Aug 2, 2023

angelinalg approved these changes Aug 2, 2023


matthewdeng approved these changes Aug 3, 2023


doc/source/_static/js/custom.js

doc/source/_toc.yml (outdated)

krfricke and others added 3 commits August 3, 2023 08:59

Apply suggestions from code review

74638bf

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>

Merge remote-tracking branch 'upstream/master' into doc/train-restruc…

030da48

…ture-alternative # Conflicts: # doc/source/ray-air/examples/gptj_batch_prediction.ipynb # doc/source/train/dl_guide.rst

session -> train

1219b6f

Signed-off-by: Kai Fricke <kai@anyscale.com>

amogkam reviewed Aug 3, 2023


bveeramani approved these changes Aug 3, 2023


krfricke merged commit 0e4842b into ray-project:master Aug 3, 2023
81 of 84 checks passed

krfricke deleted the doc/train-restructure-alternative branch August 3, 2023 17:54

angelinalg mentioned this pull request Aug 16, 2023

update names of guides in side nav #38519

Merged

8 tasks
