[Data] Update Ray Data documentation for {landing,overview,key concepts} #44008
Conversation
LGTM, pending docs build failure (I can't view the built docs, might be worth doing a final check once build works)
doc/source/data/overview.rst
Outdated
.. dropdown:: Out of the box scaling

    Ray Data is built on Ray, so it easily scales on heterogeneous cluster, which has different types of CPU and GPU machines. Code that works on one machine also runs on a large cluster without any changes.
Suggested change:
- Ray Data is built on Ray, so it easily scales on heterogeneous cluster, which has different types of CPU and GPU machines. Code that works on one machine also runs on a large cluster without any changes.
+ Ray Data is built on Ray, so it easily scales on a heterogeneous cluster, which has different types of CPU and GPU machines. Code that works on one machine also runs on a large cluster without any changes.
Updated.
doc/source/data/overview.rst
Outdated
- Ray Data is built on Ray, so it easily scales to many machines. Code that works on one machine also runs on a large cluster without any changes.
+ Ray Data can easily scale to hundreds of nodes to process hundreds of TB data.
Suggested change:
- Ray Data can easily scale to hundreds of nodes to process hundreds of TB data.
+ Ray Data can easily scale to hundreds of nodes to process hundreds of TB of data.
Updated.
doc/source/data/overview.rst
Outdated
- With Ray Data, you can express your inference job directly in Python instead of
- YAML or other formats, allowing for faster iterations, easier debugging, and a native developer experience.
+ With Ray Data, you can express batch inference and ML training job directly under same Dataset API.
Suggested change:
- With Ray Data, you can express batch inference and ML training job directly under same Dataset API.
+ With Ray Data, you can express batch inference and ML training job directly under the same Ray Dataset API.
Updated.
@@ -16,7 +16,7 @@ Use individual libraries for ML workloads. Click on the dropdowns for your workl
`````{dropdown} <img src="images/ray_svg_logo.svg" alt="ray" width="50px"> Data: Scalable Datasets for ML
:animate: fade-in-slide-down

- Scale offline inference and training ingest with [Ray Data](data_key_concepts) --
+ Scale offline inference and training ingest with [Ray Data Quickstart](data_quickstart) --
I think the actual title text should remain the same, but update the reference:
Suggested change:
- Scale offline inference and training ingest with [Ray Data Quickstart](data_quickstart) --
+ Scale offline inference and training ingest with [Ray Data](data_quickstart) --
Updated.
doc/source/data/overview.rst
Outdated
Why choose Ray Data?
--------------------

.. dropdown:: Faster and cheaper for modern deep learning applications

- Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Through its powerful streaming :ref:`Dataset <dataset_concept>` primitive, Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.
+ Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Through its powerful streaming execution, Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.
Feel like "Through its powerful streaming execution" is redundant since we say "Ray Data streams" in the next clause
Suggested change:
- Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Through its powerful streaming execution, Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.
+ Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.
Updated.
doc/source/data/overview.rst
Outdated
- .. dropdown:: Python first
+ .. dropdown:: Unified API and backend for batch inference and ML training

- With Ray Data, you can express your inference job directly in Python instead of
- YAML or other formats, allowing for faster iterations, easier debugging, and a native developer experience.
+ With Ray Data, you can express batch inference and ML training job directly under the same Ray Dataset API.
I know this is also in the draft for the blog post, but IMO "Python first" is more compelling than "Unified API and backend for batch inference and ML training."
Don't need to block on this since we can always revise later.
Yes let's revise this later.
doc/source/data/overview.rst
Outdated
- Through the :ref:`Ray cluster launcher <cluster-index>`, you can start a Ray cluster on AWS, GCP, or Azure clouds. You can use any ML framework of your choice, including PyTorch, HuggingFace, or Tensorflow. Ray Data also does not require a particular file format, and supports a :ref:`wide variety of formats <loading_data>` including CSV, Parquet, and raw images.
+ You can start a Ray cluster on AWS, GCP, or Azure clouds. You can use any ML framework of your choice, including PyTorch, HuggingFace, or Tensorflow. Ray Data also does not require a particular file format, and supports a :ref:`wide variety of formats <loading_data>` including Parquet, images, JSON, text, CSV, etc.

+ .. dropdown:: Out of the box scaling on heterogeneous cluster
Suggested change:
- .. dropdown:: Out of the box scaling on heterogeneous cluster
+ .. dropdown:: Out-of-the-box scaling on heterogeneous clusters
Updated.
Looks good, I think I spotted one tiny typo.
@@ -71,7 +71,7 @@ For example, passing in a large pandas DataFrame or an unserializable model obje
Instead, use strings or other identifiers as your values, and initialize/load the objects inside your Trainable directly depending on those.

```{note}
- [Datasets](data_key_concepts) can be used as values in the search space directly.
+ [Dataset](data_quickstart) can be used as values in the search space directly.
I think it should be [Datasets](data_quickstart).
Updated.
Signed-off-by: Cheng Su <scnju13@gmail.com>
…ts} (ray-project#44008)
This PR is to update Ray Data documentation for landing, overview and key concepts page, as discussed offline.
Why are these changes needed?
This PR is to update Ray Data documentation for the landing, overview, and key concepts pages, as discussed offline.
Related issue number
Checks
- I've signed off every commit with `git commit -s` in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.