[Data] Update Ray Data documentation for {landing,overview,key concepts} #44008

c21 · 2024-03-14T18:52:48Z

Why are these changes needed?

This PR updates the Ray Data documentation for the landing, overview, and key concepts pages, as discussed offline.

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

scottjlee

LGTM, pending docs build failure (I can't view the built docs, might be worth doing a final check once build works)

scottjlee · 2024-03-14T20:39:21Z

doc/source/data/overview.rst


-.. dropdown:: Out of the box scaling
+    Ray Data is built on Ray, so it easily scales on heterogeneous cluster, which has different types of CPU and GPU machines. Code that works on one machine also runs on a large cluster without any changes.


Suggested change

-    Ray Data is built on Ray, so it easily scales on heterogeneous cluster, which has different types of CPU and GPU machines. Code that works on one machine also runs on a large cluster without any changes.
+    Ray Data is built on Ray, so it easily scales on a heterogeneous cluster, which has different types of CPU and GPU machines. Code that works on one machine also runs on a large cluster without any changes.

scottjlee · 2024-03-14T20:39:56Z

doc/source/data/overview.rst


-    Ray Data is built on Ray, so it easily scales to many machines. Code that works on one machine also runs on a large cluster without any changes.
+    Ray Data can easily scale to hundreds of nodes to process hundreds of TB data.


Suggested change

-    Ray Data can easily scale to hundreds of nodes to process hundreds of TB data.
+    Ray Data can easily scale to hundreds of nodes to process hundreds of TB of data.

scottjlee · 2024-03-14T20:40:42Z

doc/source/data/overview.rst


-    With Ray Data, you can express your inference job directly in Python instead of
-    YAML or other formats, allowing for faster iterations, easier debugging, and a native developer experience.
+    With Ray Data, you can express batch inference and ML training job directly under same Dataset API.


Suggested change

-    With Ray Data, you can express batch inference and ML training job directly under same Dataset API.
+    With Ray Data, you can express batch inference and ML training job directly under the same Ray Dataset API.

scottjlee · 2024-03-14T20:47:01Z

doc/source/ray-overview/getting-started.md

@@ -16,7 +16,7 @@ Use individual libraries for ML workloads. Click on the dropdowns for your workl
 `````{dropdown} <img src="images/ray_svg_logo.svg" alt="ray" width="50px"> Data: Scalable Datasets for ML
 :animate: fade-in-slide-down

-Scale offline inference and training ingest with [Ray Data](data_key_concepts) --
+Scale offline inference and training ingest with [Ray Data Quickstart](data_quickstart) --


I think the actual title text should remain the same, but update the reference:

Suggested change

-Scale offline inference and training ingest with [Ray Data Quickstart](data_quickstart) --
+Scale offline inference and training ingest with [Ray Data](data_quickstart) --

bveeramani · 2024-03-14T23:16:42Z

doc/source/data/overview.rst

 Why choose Ray Data?
 --------------------

 .. dropdown:: Faster and cheaper for modern deep learning applications

-    Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Through its powerful streaming :ref:`Dataset <dataset_concept>` primitive, Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.
+    Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Through its powerful streaming execution, Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.


Feel like "Through its powerful streaming execution" is redundant since we say "Ray Data streams" in the next clause

Suggested change

-    Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Through its powerful streaming execution, Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.
+    Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.
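
For anyone reading this thread without context on what "streams working data" means in the paragraph above, here is a rough plain-Python analogy. It is not Ray Data's API or implementation: `preprocess`, `infer`, and `run_streaming` are hypothetical stand-ins, and a thread plus a bounded queue merely imitates the idea that the preprocessing stage and the inference stage run at the same time instead of one after the other.

```python
import queue
import threading


def preprocess(x):
    # Hypothetical stand-in for a CPU preprocessing step.
    return x * 2


def infer(batch):
    # Hypothetical stand-in for a GPU inference step.
    return [v + 1 for v in batch]


def run_streaming(items, batch_size=4, max_buffered_batches=2):
    """Overlap preprocessing and inference through a bounded queue."""
    buf = queue.Queue(maxsize=max_buffered_batches)
    done = object()  # sentinel marking the end of the stream

    def producer():
        batch = []
        for item in items:
            batch.append(preprocess(item))
            if len(batch) == batch_size:
                buf.put(batch)  # blocks while the buffer is full
                batch = []
        if batch:
            buf.put(batch)  # flush the final partial batch
        buf.put(done)

    worker = threading.Thread(target=producer)
    worker.start()

    results = []
    while True:
        batch = buf.get()
        if batch is done:
            break
        results.extend(infer(batch))  # consumes while the producer keeps working
    worker.join()
    return results


results = run_streaming(range(10))
```

The bounded queue is the point of the sketch: only a few batches are ever buffered, so the full preprocessed dataset is never materialized before inference starts.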

bveeramani · 2024-03-14T23:25:09Z

doc/source/data/overview.rst

-.. dropdown:: Python first
+.. dropdown:: Unified API and backend for batch inference and ML training

-    With Ray Data, you can express your inference job directly in Python instead of
-    YAML or other formats, allowing for faster iterations, easier debugging, and a native developer experience.
+    With Ray Data, you can express batch inference and ML training job directly under the same Ray Dataset API.


I know this is also in the draft for the blog post, but IMO "Python first" is more compelling than "Unified API and backend for batch inference and ML training."

Don't need to block on this since we can always revise later.

Yes let's revise this later.

bveeramani · 2024-03-14T23:25:28Z

doc/source/data/overview.rst

-    Through the :ref:`Ray cluster launcher <cluster-index>`, you can start a Ray cluster on AWS, GCP, or Azure clouds. You can use any ML framework of your choice, including PyTorch, HuggingFace, or Tensorflow. Ray Data also does not require a particular file format, and supports a :ref:`wide variety of formats <loading_data>` including CSV, Parquet, and raw images.
+    You can start a Ray cluster on AWS, GCP, or Azure clouds. You can use any ML framework of your choice, including PyTorch, HuggingFace, or Tensorflow. Ray Data also does not require a particular file format, and supports a :ref:`wide variety of formats <loading_data>` including Parquet, images, JSON, text, CSV, etc.
+
+.. dropdown:: Out of the box scaling on heterogeneous cluster


Suggested change

-.. dropdown:: Out of the box scaling on heterogeneous cluster
+.. dropdown:: Out-of-the-box scaling on heterogeneous clusters

omatthew98

Looks good, I think I spotted one tiny typo.

omatthew98 · 2024-03-18T17:12:05Z

doc/source/tune/tutorials/tune_get_data_in_and_out.md

@@ -71,7 +71,7 @@ For example, passing in a large pandas DataFrame or an unserializable model obje
 Instead, use strings or other identifiers as your values, and initialize/load the objects inside your Trainable directly depending on those.

 ```{note}
-[Datasets](data_key_concepts) can be used as values in the search space directly.
+[Dataset](data_quickstart) can be used as values in the search space directly.


Think it should be [Datasets](data_quickstart).

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 requested review from matthewdeng, justinvyu, woshiyyya, a team, ericl, scv119, amogkam, scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners March 14, 2024 18:52

scottjlee approved these changes Mar 14, 2024

bveeramani approved these changes Mar 14, 2024

omatthew98 approved these changes Mar 18, 2024

omatthew98 reviewed Mar 18, 2024

c21 added 4 commits March 18, 2024 10:24

Update Ray Data documentation for {landing,overview,key concepts}

dd1936e

Signed-off-by: Cheng Su <scnju13@gmail.com>

Fix lint

8ab4ed5

Signed-off-by: Cheng Su <scnju13@gmail.com>

Address comment

5a28970

Signed-off-by: Cheng Su <scnju13@gmail.com>

Address comments

2655a53

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 force-pushed the doc-group-1 branch from 3a23fcb to 2655a53 Compare March 18, 2024 17:25

Remove image

bcebcb2

Signed-off-by: Cheng Su <scnju13@gmail.com>

matthewdeng approved these changes Mar 18, 2024

c21 merged commit 187a5c0 into ray-project:master Mar 18, 2024
5 checks passed

c21 deleted the doc-group-1 branch March 18, 2024 19:01

c21 mentioned this pull request Mar 18, 2024

[2.10][Data] Update Ray Data documentation for {landing,overview,key concepts} #44095

Merged

8 tasks

c21 added a commit to c21/ray that referenced this pull request Mar 18, 2024

Cherry pick ray-project#44008 and ray-project#44022

b16d626

Signed-off-by: Cheng Su <scnju13@gmail.com>

khluu pushed a commit that referenced this pull request Mar 18, 2024

Cherry pick #44008 and #44022 (#44095)

27b68d1

Signed-off-by: Cheng Su <scnju13@gmail.com>
