[train/docs] Restructure Ray Train docs with framework-specific guides #37892
Conversation
Signed-off-by: Kai Fricke <kai@anyscale.com>
Data loading and preprocessing
------------------------------
There's stuff missing here - I'll fill this out once we've converged on the structure.
Thanks for this work, @krfricke! It's very exciting. My one request is to put Horovod, XGBoost, and TensorFlow under a level of hierarchy called
I would put Horovod into "More frameworks". However, since many organizations and enterprises are still using XGBoost and TensorFlow for production ML, I would keep those top level.
Okay. Would it make sense to combine the Horovod guide directly into the TensorFlow guide (with tabs for Horovod-specific code)?
Horovod can also be used with PyTorch, though.
Signed-off-by: Kai Fricke <kai@anyscale.com>
@@ -0,0 +1,244 @@
.. _train-checkpointing:

Saving and loading checkpoints
nit: Can we use Title Case for now?
I talked to @angelinalg and we do want to move to sentence case, but it makes more sense to be consistent across the rest of the Ray documentation for now.
When I tried to add content for Lightning, I felt this title was a bit ambiguous.
At first, I thought there would be content discussing Ray Train's specific tools for checkpoint saving and loading. But actually, this section discusses:
- how to report a checkpoint to Ray Train
- how to resume training with the latest reported checkpoint
The actual saving and loading logic is handled by torch/lightning itself. Does "Report and retrieve your checkpoint" or something similar make sense?
Ok, after an offline discussion with @matt, we agreed to keep "Saving and Loading Checkpoints" as the title, since we want to hide the implementation details of checkpoint syncing from the user and let them just use the ray.train.* API to "save" and "load" checkpoints.
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke Do you know what happened to this guide? https://docs.ray.io/en/latest/ray-air/examples/convert_existing_pytorch_code_to_ray_air.html

I think it got moved to the Example Gallery? (Though it does seem like a User Guide.) cc @richardliaw
@@ -1168,7 +1168,7 @@
 "- We will restore a Prophet or ARIMA model directly from checkpoint, and demonstrate it can be used for prediction.\n",
 "\n",
 "```{tip}\n",
-"[Ray AIR Predictors](air-predictors) make batch inference easy since they have internal logic to parallelize the inference.\n",
+"Ray AIR Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",
Suggested change:
-"Ray AIR Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",
+"Ray Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",
Let's defer the "AIR" removals - "Ray Predictors" makes it sound like a first-class concept, which it's not. The AIR removal hasn't been fully done yet; IMO this should be part of that. Here we just remove the references.
Overall structure looks good to me, we can follow up on content in separate PRs. Thanks for all the iterations!
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
…ture-alternative
# Conflicts:
# doc/source/ray-air/examples/gptj_batch_prediction.ipynb
# doc/source/train/dl_guide.rst
Signed-off-by: Kai Fricke <kai@anyscale.com>
@@ -462,7 +462,7 @@ Models that have been trained with :ref:`Ray Train <train-docs>` can then be use

 checkpoint = result.checkpoint

-**Step 3:** Use Ray Data for batch inference. To load in the model from the :class:`Checkpoint <ray.air.checkpoint.Checkpoint>` inside the Python class, use one of the :ref:`framework-specific Checkpoint classes <train-framework-catalog>`.
+**Step 3:** Use Ray Data for batch inference. To load in the model from the :class:`Checkpoint <ray.air.checkpoint.Checkpoint>` inside the Python class, use one of the framework-specific Checkpoint classes.
Is there no longer a link to the framework-specific Checkpoints? If these are deprecated, what's the recommended way to do prediction with models trained with Ray Train?
> If these are deprecated, what's the recommended way to do prediction with models trained with Ray Train?

Also wondering this.
Based on @ericl's Slack comment, it sounds like references to framework-specific checkpoints should be removed from the docs. Could we replace "use one of the framework-specific Checkpoint classes." with the current recommendation?
Let's decouple the removal from this PR. We'll update the content once we've fully quarantined the framework-specific checkpoint classes.
> Let's decouple the removal from this PR.

I agree, but this PR makes the removal, right? It's no longer documented once this PR is merged.
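The "Step 3" pattern in the diff (a Python class that restores the model from a checkpoint once, then runs inference per batch, e.g. via Ray Data's `map_batches`) can be sketched with a framework-free stand-in. The `Predictor` class and the dict used as a "checkpoint" below are hypothetical illustrations of the pattern, not Ray APIs:

```python
# Sketch of the batch-inference pattern: a callable class that loads
# model state from a checkpoint once in __init__, then runs inference
# on each batch. With Ray Data, the class would be passed to
# ds.map_batches(...), which constructs it once per worker so the
# (potentially expensive) model load isn't repeated per batch.
class Predictor:
    def __init__(self, checkpoint):
        # Real code would restore the framework model here, e.g. from
        # the checkpoint directory reported during training.
        self.weight = checkpoint["weight"]

    def __call__(self, batch):
        # Per-batch inference; a toy linear model stands in.
        return [x * self.weight for x in batch]

predictor = Predictor({"weight": 2})
print(predictor([1, 2, 3]))  # → [2, 4, 6]
```

The point of the shape, which is what the diff's wording change preserves, is that model loading lives in `__init__` and only the per-batch call does inference.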
LGTM other than the framework-specific checkpoint thing.
Don't forget to set redirects in readthedocs for the things that have been moved in this PR. cc @angelinalg @matthewdeng
ray-project#37892) This PR restructures the Ray Train docs to better mimic typical user journeys. Primarily, we restructure the guides to be grouped by frameworks. Previously, we grouped by tasks (e.g. training, data loading, checkpointing) and had (tabbed) examples for some of the frameworks. Now, we group by framework on the first level and by task on the second level. The idea here is that users of e.g. PyTorch don't actually care about how things are done for XGBoost - they just want to be successful with training their PyTorch models. This PR emphasizes support for PyTorch, which is guided by user feedback showing that PyTorch and related libraries are most commonly used. Lastly, this PR also declutters the Ray Train documentation by removing duplicates (e.g. we had 4 different "quick start" examples for PyTorch before).
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: NripeshN <nn2012@hw.ac.uk>
Why are these changes needed?
This PR restructures the Ray Train docs to better mimic typical user journeys.
Primarily, we restructure the guides to be grouped by frameworks. Previously, we grouped by tasks (e.g. training, data loading, checkpointing) and had (tabbed) examples for some of the frameworks. Now, we group by framework on the first level and by task on the second level. The idea here is that users of e.g. PyTorch don't actually care about how things are done for XGBoost - they just want to be successful with training their PyTorch models.
This PR emphasizes support for PyTorch, which is guided by user feedback showing that PyTorch and related libraries are most commonly used.
Lastly, this PR also declutters the Ray Train documentation by removing duplicates (e.g. we had 4 different "quick start" examples for PyTorch before).
https://anyscale-ray--37892.com.readthedocs.build/en/37892/
Related issue number
Replaces #37808
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.