[Data] [Docs] Update Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages #44093

omatthew98 · 2024-03-18T18:42:33Z

Why are these changes needed?

This PR updates the Ray Data documentation for the Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages, as discussed offline.

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

omatthew98 · 2024-03-18T21:18:49Z

Docs Pages from PR:

scottjlee

LGTM, small nits

scottjlee · 2024-03-18T21:21:27Z

doc/source/data/loading-data.rst

+            :class:`~ray.data.from_huggingface` only supports parallel reads in certain
+            instances, namely for untransformed public 🤗 Datasets. For those datasets,
+            `hosted parquet files <https://huggingface.co/docs/datasets-server/parquet#list-parquet-files>`_
+            will be used to perform a distributed read, otherwise a single node read will be used.


Suggested change

will be used to perform a distributed read, otherwise a single node read will be used.

will be used to perform a distributed read; otherwise, a single node read will be used.

scottjlee · 2024-03-18T21:22:20Z

doc/source/data/loading-data.rst

+            `hosted parquet files <https://huggingface.co/docs/datasets-server/parquet#list-parquet-files>`_
+            will be used to perform a distributed read, otherwise a single node read will be used.
+            This shouldn't be an issue with in-memory 🤗 Datasets, but may fail with
+            large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` or ``IteraableDatasetDict``


Can we include links to the HF docs for ``DatasetDict`` and ``IterableDatasetDict``? (Note that the diff spells the latter ``IteraableDatasetDict``.)

scottjlee · 2024-03-18T21:22:37Z

doc/source/data/loading-data.rst

+            `hosted parquet files <https://huggingface.co/docs/datasets-server/parquet#list-parquet-files>`_
+            will be used to perform a distributed read, otherwise a single node read will be used.
+            This shouldn't be an issue with in-memory 🤗 Datasets, but may fail with
+            large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` or ``IteraableDatasetDict``


Suggested change

large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` or ``IteraableDatasetDict``

large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` and ``IteraableDatasetDict``

scottjlee · 2024-03-18T21:23:59Z

doc/source/data/loading-data.rst

@@ -603,6 +611,31 @@ Ray Data interoperates with HuggingFace and TensorFlow datasets.

            [{'text': ''}, {'text': ' = Valkyria Chronicles III = \n'}]

+    .. tab-item:: PyTorch Dataset
+
+        To convert a PyTorch dataset to a Ray Dataset, call ::func:`~ray.data.from_torch`.


Suggested change

To convert a PyTorch dataset to a Ray Dataset, call ::func:`~ray.data.from_torch`.

To convert a PyTorch dataset to a Ray Dataset, call :func:`~ray.data.from_torch`.

scottjlee · 2024-03-18T21:31:02Z

doc/source/data/loading-data.rst

-datasource and pass it to :func:`~ray.data.read_datasource`.
+datasource and pass it to :func:`~ray.data.read_datasource`. To write results, you might
+also need to subclass :class:`ray.data.Datasink`. Then, create an instance of your custom
+datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details see the guide


Suggested change

datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details see the guide

datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details, see

scottjlee · 2024-03-18T21:44:41Z

doc/source/data/transforming-data.rst

@@ -145,6 +176,34 @@ To configure the batch type, specify ``batch_format`` in
                .map_batches(drop_nas, batch_format="pandas")
            )

+The user defined function passed to :meth:`~ray.data.Dataset.map_batches` is more flexible. As atches


Suggested change

The user defined function passed to :meth:`~ray.data.Dataset.map_batches` is more flexible. As atches

The user defined function passed to :meth:`~ray.data.Dataset.map_batches` is more flexible. As batches

scottjlee · 2024-03-18T21:45:37Z

doc/source/data/transforming-data.rst

+can be represented in multiple ways (more on this in :ref:`Configuring batch format <configure_batch_format>`), so the function should be of type
+``Callable[DataBatch, DataBatch]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. In


Suggested change

can be represented in multiple ways (more on this in :ref:`Configuring batch format <configure_batch_format>`), so the function should be of type

``Callable[DataBatch, DataBatch]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. In

can be represented in multiple ways (more on this in :ref:`Configuring batch format <configure_batch_format>`), the function should be of type

``Callable[DataBatch, DataBatch]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. In

scottjlee · 2024-03-18T21:46:27Z

doc/source/data/transforming-data.rst

+other words your function should input and output a batch of data which can be represented as a
+pandas dataframe or a dictionary with string keys and NumPy ndarrays values. Your function does not need


Suggested change

other words your function should input and output a batch of data which can be represented as a

pandas dataframe or a dictionary with string keys and NumPy ndarrays values. Your function does not need

other words, your function should take as input and output a batch of data which can be represented as a

pandas DataFrame or a dictionary with string keys and NumPy ndarrays values. Your function does not need

scottjlee · 2024-03-18T21:47:04Z

doc/source/data/transforming-data.rst

+to return a batch in the same format as it is input, so you could input a pandas dataframe and output a
+dictionary of NumPy ndarrays. For example your function might look like:


Suggested change

to return a batch in the same format as it is input, so you could input a pandas dataframe and output a

dictionary of NumPy ndarrays. For example your function might look like:

to return a batch in the same format as its input, so you could input a pandas DataFrame and output a

dictionary of NumPy ndarrays. For example, your function might look like:
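
For reference, a batch function of the kind this passage describes could be sketched as follows. This is a minimal illustration rather than the example from the PR: the function name `drop_nas_to_numpy` and the sample columns are hypothetical. It takes a pandas DataFrame batch and returns a dictionary of NumPy ndarrays, so the output format differs from the input format.

```python
from typing import Dict

import numpy as np
import pandas as pd


def drop_nas_to_numpy(batch: pd.DataFrame) -> Dict[str, np.ndarray]:
    """Take a pandas batch, drop rows with missing values, return a NumPy-dict batch."""
    cleaned = batch.dropna()
    # One ndarray per column, keyed by the column name.
    return {name: cleaned[name].to_numpy() for name in cleaned.columns}


# Hypothetical usage with Ray Data (requires Ray; `ds` is any existing Dataset):
#   ds.map_batches(drop_nas_to_numpy, batch_format="pandas")
```

Passing `batch_format="pandas"` controls only the format of the input batch; the function remains free to return either supported representation.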

scottjlee · 2024-03-18T21:48:25Z

doc/source/data/transforming-data.rst

+The user defined function can also return an iterator that yields batches, so the function can also
+be of type ``Callable[DataBatch, Iterator[[DataBatch]]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``.
+In this case your function would look like:


Suggested change

The user defined function can also return an iterator that yields batches, so the function can also

be of type ``Callable[DataBatch, Iterator[[DataBatch]]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``.

In this case your function would look like:

The user defined function can also be a Python generator that yields batches, so the function can also

be of type ``Callable[DataBatch, Iterator[[DataBatch]]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``.

In this case, your function would look like:
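
As a sketch of the generator form discussed here: the function below yields several smaller batches per input batch. The name `split_batch` and the `chunk_size` parameter are hypothetical and chosen only for illustration. Note also that the annotation quoted in the diff has unbalanced brackets; a well-formed spelling would be `Callable[[DataBatch], Iterator[DataBatch]]`.

```python
from typing import Dict, Iterator

import numpy as np

# Simplified alias for the dict-of-ndarrays batch representation.
Batch = Dict[str, np.ndarray]


def split_batch(batch: Batch, chunk_size: int = 2) -> Iterator[Batch]:
    """Yield the input batch back as smaller batches of at most chunk_size rows."""
    if not batch:
        return  # Nothing to yield for a batch with no columns.
    num_rows = len(next(iter(batch.values())))
    for start in range(0, num_rows, chunk_size):
        # Slice every column with the same row range so the columns stay aligned.
        yield {name: col[start:start + chunk_size] for name, col in batch.items()}


# Hypothetical usage with Ray Data (requires Ray; `ds` is any existing Dataset):
#   ds.map_batches(split_batch, fn_kwargs={"chunk_size": 2})
```

Yielding batches one at a time can keep peak memory lower than building one large output batch, which is one likely reason to prefer the generator form.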

Signed-off-by: Matthew Owen <mowen@anyscale.com>

angelinalg

Just some nits. Excuse any mangling in the suggestions when I tried to change passive voice to active voice. Please correct as needed. Very nice job overall. Consider using Vale to catch some of these copy edits I made. (go/vale)

doc/source/data/loading-data.rst

doc/source/data/transforming-data.rst

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>

doc/source/data/loading-data.rst

raulchen · 2024-03-20T18:48:31Z

doc/source/data/saving-data.rst

@@ -83,7 +83,7 @@ the appropriate scheme. URI can point to buckets or folders.
            filesystem = gcsfs.GCSFileSystem(project="my-google-project")
            ds.write_parquet("gcs://my-bucket/my-folder", filesystem=filesystem)

-    .. tab-item:: ABL
+    .. tab-item:: ABS

        To save data to Azure Blob Storage, install the
        `Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage <https://pypi.org/project/adlfs/>`_


Same as for read: also add a tip on how to tune the configs for write failure retries.

Discussed offline, will add the tip on configs later.

can-anyscale · 2024-03-21T00:31:00Z

This breaks a data doc test. I'm putting up a revert to double-check (https://buildkite.com/ray-project/postmerge/builds/3645).

#44093 broke one of the data doc tests that was not run on premerge (example here: https://buildkite.com/ray-project/postmerge/builds/3645). This adds missing imports to fix that. Signed-off-by: Matthew Owen <mowen@anyscale.com>

…r, and Saving Data pages (ray-project#44093) --------- Signed-off-by: Matthew Owen <mowen@anyscale.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>

ray-project#44093 broke one of the data doc tests that was not run on premerge (example here: https://buildkite.com/ray-project/postmerge/builds/3645). This adds missing imports to fix that. Signed-off-by: Matthew Owen <mowen@anyscale.com>

…r, and Saving Data pages (#44093) (#44221)

Docs only cherry pick for release.

Note: this cherry-pick includes four commits which are all related to changing the Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages. They are rolled together to reduce cherry-picking overhead and are all part of the logical update to these pages. The PRs included in this cherry-pick:

- Main overhaul of listed pages
- Two fixes to doc tests that were broken by the above (fix 1, fix 2)
- Additional small change to explain how to use credentials that was added after initial merge of main overhaul

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>

omatthew98 requested review from ericl, scv119, c21, amogkam, scottjlee, bveeramani, raulchen and stephanie-wang as code owners March 18, 2024 18:42

omatthew98 changed the title ~~[Data] Update Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages~~ [Data] [Docs] Update Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages Mar 18, 2024

c21 assigned c21 and scottjlee Mar 18, 2024

scottjlee approved these changes Mar 18, 2024

omatthew98 added 7 commits March 19, 2024 10:04

updates to loading data, reorder inspect page

9abc3bc

Signed-off-by: Matthew Owen <mowen@anyscale.com>

updates to transforming data

4c43898

Signed-off-by: Matthew Owen <mowen@anyscale.com>

updating inspecting and saving data pages

af2b0b6

Signed-off-by: Matthew Owen <mowen@anyscale.com>

adding hf read info

50676f3

Signed-off-by: Matthew Owen <mowen@anyscale.com>

adding space to fix doc build

db692d8

Signed-off-by: Matthew Owen <mowen@anyscale.com>

respond to pr feedback

5c3cc1b

Signed-off-by: Matthew Owen <mowen@anyscale.com>

fix doc-style issues

e55ca12

Signed-off-by: Matthew Owen <mowen@anyscale.com>

omatthew98 force-pushed the mowen-update-existing-docs branch from bb0802a to e55ca12 Compare March 19, 2024 17:04

Merge branch 'master' into mowen-update-existing-docs

a1622e9

Signed-off-by: Matthew Owen <mowen@anyscale.com>

omatthew98 assigned angelinalg Mar 20, 2024

angelinalg reviewed Mar 20, 2024

angelinalg approved these changes Mar 20, 2024

omatthew98 and others added 5 commits March 20, 2024 11:57

Update doc/source/data/loading-data.rst

0d64e88

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>

Update doc/source/data/loading-data.rst

83f944b

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>

Update doc/source/data/loading-data.rst

d6d35d9

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>

Update doc/source/data/loading-data.rst

baf015c

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>

Update doc/source/data/loading-data.rst

fb9e1bd

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>

omatthew98 and others added 3 commits March 20, 2024 11:59

Update doc/source/data/loading-data.rst

4491cd3

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>

Update doc/source/data/transforming-data.rst

cad6636

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>

Update doc/source/data/transforming-data.rst

d395c5d

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>

raulchen reviewed Mar 20, 2024

omatthew98 and others added 5 commits March 20, 2024 12:00

Update doc/source/data/transforming-data.rst

5a332de

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>

Update doc/source/data/loading-data.rst

c0d02e2

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>

Update doc/source/data/transforming-data.rst

cfbb80c

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>

replace emojis for better indexing

cff7381

Signed-off-by: Matthew Owen <mowen@anyscale.com>

address more pr comments

94ac4b4

Signed-off-by: Matthew Owen <mowen@anyscale.com>

omatthew98 force-pushed the mowen-update-existing-docs branch from 3fb69d0 to 94ac4b4 Compare March 20, 2024 20:21

raulchen approved these changes Mar 20, 2024

raulchen merged commit a4a9e97 into ray-project:master Mar 20, 2024
5 checks passed

can-anyscale mentioned this pull request Mar 21, 2024

Revert "[Data] [Docs] Update Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages" #44199

Closed

omatthew98 mentioned this pull request Mar 21, 2024

[Data] [Docs] Adding in missing imports to code in doc #44203

Merged

8 tasks

omatthew98 mentioned this pull request Mar 21, 2024

[Data] [Docs] Update Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages (#44093) #44221

Merged

8 tasks
