[Data] Update Ray Data documentation for {landing,overview,key concepts} #44008

Merged
merged 5 commits into from Mar 18, 2024

Contributor

@c21 c21 commented Mar 14, 2024

Why are these changes needed?

This PR updates the Ray Data documentation for the landing, overview, and key concepts pages, as discussed offline.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@scottjlee scottjlee left a comment

LGTM, pending the docs build failure (I can't view the built docs; it might be worth doing a final check once the build works).


.. dropdown:: Out of the box scaling
Ray Data is built on Ray, so it easily scales on heterogeneous cluster, which has different types of CPU and GPU machines. Code that works on one machine also runs on a large cluster without any changes.
Contributor

Suggested change
- Ray Data is built on Ray, so it easily scales on heterogeneous cluster, which has different types of CPU and GPU machines. Code that works on one machine also runs on a large cluster without any changes.
+ Ray Data is built on Ray, so it easily scales on a heterogeneous cluster, which has different types of CPU and GPU machines. Code that works on one machine also runs on a large cluster without any changes.

Contributor Author

Updated.


Ray Data is built on Ray, so it easily scales to many machines. Code that works on one machine also runs on a large cluster without any changes.
Ray Data can easily scale to hundreds of nodes to process hundreds of TB data.
Contributor

Suggested change
- Ray Data can easily scale to hundreds of nodes to process hundreds of TB data.
+ Ray Data can easily scale to hundreds of nodes to process hundreds of TB of data.

Contributor Author

Updated.


With Ray Data, you can express your inference job directly in Python instead of
YAML or other formats, allowing for faster iterations, easier debugging, and a native developer experience.
With Ray Data, you can express batch inference and ML training job directly under same Dataset API.
Contributor

Suggested change
- With Ray Data, you can express batch inference and ML training job directly under same Dataset API.
+ With Ray Data, you can express batch inference and ML training job directly under the same Ray Dataset API.

Contributor Author

Updated.

@@ -16,7 +16,7 @@ Use individual libraries for ML workloads. Click on the dropdowns for your workl
`````{dropdown} <img src="images/ray_svg_logo.svg" alt="ray" width="50px"> Data: Scalable Datasets for ML
:animate: fade-in-slide-down

Scale offline inference and training ingest with [Ray Data](data_key_concepts) --
Scale offline inference and training ingest with [Ray Data Quickstart](data_quickstart) --
Contributor

I think the actual title text should remain the same, but the reference should be updated:

Suggested change
- Scale offline inference and training ingest with [Ray Data Quickstart](data_quickstart) --
+ Scale offline inference and training ingest with [Ray Data](data_quickstart) --

Contributor Author

Updated.

Why choose Ray Data?
--------------------

.. dropdown:: Faster and cheaper for modern deep learning applications

Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Through its powerful streaming :ref:`Dataset <dataset_concept>` primitive, Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.
Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Through its powerful streaming execution, Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.
Member

I feel like "Through its powerful streaming execution" is redundant, since we say "Ray Data streams" in the next clause.

Suggested change
- Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Through its powerful streaming execution, Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.
+ Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.

Contributor Author

Updated.
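
As an aside for readers of this thread: the streaming behavior described in the paragraph under review (CPU preprocessing feeding GPU inference so that both run concurrently) can be illustrated with a toy producer/consumer sketch. This is plain Python using only the standard library, not Ray Data's API or implementation; `preprocess`, `infer`, and `run_streaming_pipeline` are hypothetical names, and the thread plus bounded queue merely stand in for Ray's distributed scheduling.

```python
import queue
import threading


def preprocess(item):
    # Hypothetical stand-in for CPU-bound preprocessing.
    return item * 2


def infer(batch):
    # Hypothetical stand-in for GPU inference: reduces a batch to one value.
    return sum(batch)


def run_streaming_pipeline(items, batch_size=4, max_buffered_batches=2):
    """Overlap preprocessing and inference through a bounded queue."""
    buffer = queue.Queue(maxsize=max_buffered_batches)
    done = object()  # sentinel marking the end of the stream

    def producer():
        batch = []
        for item in items:
            batch.append(preprocess(item))
            if len(batch) == batch_size:
                buffer.put(batch)  # blocks if inference falls behind
                batch = []
        if batch:
            buffer.put(batch)  # flush the final partial batch
        buffer.put(done)

    worker = threading.Thread(target=producer)
    worker.start()

    results = []
    while True:
        batch = buffer.get()
        if batch is done:
            break
        results.append(infer(batch))  # runs while the producer keeps working
    worker.join()
    return results


results = run_streaming_pipeline(range(10))
print(results)  # [12, 44, 34]
```

The bounded queue is the point of the sketch: preprocessing runs ahead by at most `max_buffered_batches` batches, so neither stage has to wait for the other to finish the whole dataset.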

Comment on lines 50 to 45
.. dropdown:: Python first
.. dropdown:: Unified API and backend for batch inference and ML training

With Ray Data, you can express your inference job directly in Python instead of
YAML or other formats, allowing for faster iterations, easier debugging, and a native developer experience.
With Ray Data, you can express batch inference and ML training job directly under the same Ray Dataset API.
Member

I know this is also in the draft for the blog post, but IMO "Python first" is more compelling than "Unified API and backend for batch inference and ML training."

Don't need to block on this since we can always revise later.

Contributor Author

Yes, let's revise this later.

Through the :ref:`Ray cluster launcher <cluster-index>`, you can start a Ray cluster on AWS, GCP, or Azure clouds. You can use any ML framework of your choice, including PyTorch, HuggingFace, or Tensorflow. Ray Data also does not require a particular file format, and supports a :ref:`wide variety of formats <loading_data>` including CSV, Parquet, and raw images.
You can start a Ray cluster on AWS, GCP, or Azure clouds. You can use any ML framework of your choice, including PyTorch, HuggingFace, or Tensorflow. Ray Data also does not require a particular file format, and supports a :ref:`wide variety of formats <loading_data>` including Parquet, images, JSON, text, CSV, etc.

.. dropdown:: Out of the box scaling on heterogeneous cluster
Member

Suggested change
- .. dropdown:: Out of the box scaling on heterogeneous cluster
+ .. dropdown:: Out-of-the-box scaling on heterogeneous clusters

Contributor Author

Updated.

Contributor

@omatthew98 omatthew98 left a comment

Looks good; I think I spotted one tiny typo.

@@ -71,7 +71,7 @@ For example, passing in a large pandas DataFrame or an unserializable model obje
Instead, use strings or other identifiers as your values, and initialize/load the objects inside your Trainable directly depending on those.

```{note}
[Datasets](data_key_concepts) can be used as values in the search space directly.
[Dataset](data_quickstart) can be used as values in the search space directly.
Contributor

I think it should be [Datasets](data_quickstart).

Contributor Author

Updated.

c21 added 4 commits March 18, 2024 10:24
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
@c21 c21 merged commit 187a5c0 into ray-project:master Mar 18, 2024
5 checks passed
@c21 c21 deleted the doc-group-1 branch March 18, 2024 19:01
c21 added a commit to c21/ray that referenced this pull request Mar 18, 2024
…ts} (ray-project#44008)

This PR updates the Ray Data documentation for the landing, overview, and key concepts pages, as discussed offline.

Signed-off-by: Cheng Su <scnju13@gmail.com>
c21 added a commit to c21/ray that referenced this pull request Mar 18, 2024
Signed-off-by: Cheng Su <scnju13@gmail.com>
khluu pushed a commit that referenced this pull request Mar 18, 2024
Signed-off-by: Cheng Su <scnju13@gmail.com>