[Data] [Docs] Update Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages #44093

Merged
merged 21 commits into from Mar 20, 2024

omatthew98
Contributor

Why are these changes needed?

This PR updates the Ray Data documentation for the Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages, as discussed offline.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@omatthew98 omatthew98 changed the title [Data] Update Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages [Data] [Docs] Update Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages Mar 18, 2024
@c21 c21 assigned c21 and scottjlee Mar 18, 2024
@scottjlee scottjlee left a comment

LGTM, small nits

:class:`~ray.data.from_huggingface` only supports parallel reads in certain
instances, namely for untransformed public 🤗 Datasets. For those datasets,
`hosted parquet files <https://huggingface.co/docs/datasets-server/parquet#list-parquet-files>`_
will be used to perform a distributed read, otherwise a single node read will be used.

Suggested change
will be used to perform a distributed read, otherwise a single node read will be used.
will be used to perform a distributed read; otherwise, a single node read will be used.

`hosted parquet files <https://huggingface.co/docs/datasets-server/parquet#list-parquet-files>`_
will be used to perform a distributed read, otherwise a single node read will be used.
This shouldn't be an issue with in-memory 🤗 Datasets, but may fail with
large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` or ``IteraableDatasetDict``

Can we include links to the HF docs for DatasetDict and IterableDatasetDict?

`hosted parquet files <https://huggingface.co/docs/datasets-server/parquet#list-parquet-files>`_
will be used to perform a distributed read, otherwise a single node read will be used.
This shouldn't be an issue with in-memory 🤗 Datasets, but may fail with
large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` or ``IteraableDatasetDict``

Suggested change
large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` or ``IteraableDatasetDict``
large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` and ``IteraableDatasetDict``

@@ -603,6 +611,31 @@ Ray Data interoperates with HuggingFace and TensorFlow datasets.

[{'text': ''}, {'text': ' = Valkyria Chronicles III = \n'}]

.. tab-item:: PyTorch Dataset

To convert a PyTorch dataset to a Ray Dataset, call ::func:`~ray.data.from_torch`.

Suggested change
To convert a PyTorch dataset to a Ray Dataset, call ::func:`~ray.data.from_torch`.
To convert a PyTorch dataset to a Ray Dataset, call :func:`~ray.data.from_torch`.

datasource and pass it to :func:`~ray.data.read_datasource`.
datasource and pass it to :func:`~ray.data.read_datasource`. To write results, you might
also need to subclass :class:`ray.data.Datasink`. Then, create an instance of your custom
datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details see the guide

Suggested change
datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details see the guide
datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details, see

@@ -145,6 +176,34 @@ To configure the batch type, specify ``batch_format`` in
.map_batches(drop_nas, batch_format="pandas")
)

The user defined function passed to :meth:`~ray.data.Dataset.map_batches` is more flexible. As atches

Suggested change
The user defined function passed to :meth:`~ray.data.Dataset.map_batches` is more flexible. As atches
The user defined function passed to :meth:`~ray.data.Dataset.map_batches` is more flexible. As batches

Comment on lines 180 to 181
can be represented in multiple ways (more on this in :ref:`Configuring batch format <configure_batch_format>`), so the function should be of type
``Callable[DataBatch, DataBatch]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. In

Suggested change
can be represented in multiple ways (more on this in :ref:`Configuring batch format <configure_batch_format>`), so the function should be of type
``Callable[DataBatch, DataBatch]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. In
can be represented in multiple ways (more on this in :ref:`Configuring batch format <configure_batch_format>`), the function should be of type
``Callable[DataBatch, DataBatch]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. In

Comment on lines 182 to 183
other words your function should input and output a batch of data which can be represented as a
pandas dataframe or a dictionary with string keys and NumPy ndarrays values. Your function does not need

Suggested change
other words your function should input and output a batch of data which can be represented as a
pandas dataframe or a dictionary with string keys and NumPy ndarrays values. Your function does not need
other words, your function should take as input and output a batch of data which can be represented as a
pandas DataFrame or a dictionary with string keys and NumPy ndarrays values. Your function does not need

Comment on lines 184 to 185
to return a batch in the same format as it is input, so you could input a pandas dataframe and output a
dictionary of NumPy ndarrays. For example your function might look like:

Suggested change
to return a batch in the same format as it is input, so you could input a pandas dataframe and output a
dictionary of NumPy ndarrays. For example your function might look like:
to return a batch in the same format as its input, so you could input a pandas DataFrame and output a
dictionary of NumPy ndarrays. For example, your function might look like:
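
For illustration, here is a minimal sketch of a batch function that takes a pandas DataFrame and returns a dictionary of NumPy ndarrays. It is not the example from the docs page. The function name ``drop_nas_to_numpy`` and the column names are hypothetical, and the function is called directly on a hand-built batch so the snippet runs without a Ray cluster.

```python
from typing import Dict

import numpy as np
import pandas as pd


def drop_nas_to_numpy(batch: pd.DataFrame) -> Dict[str, np.ndarray]:
    """Take a pandas DataFrame batch and return a dict of NumPy ndarrays."""
    # Drop every row that contains at least one missing value.
    cleaned = batch.dropna()
    # Convert each remaining column to a NumPy ndarray keyed by column name.
    return {name: cleaned[name].to_numpy() for name in cleaned.columns}


# Hypothetical batch with one missing value in the "score" column.
batch = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, np.nan, 1.5]})
result = drop_nas_to_numpy(batch)

# With Ray Data, the same function would be passed as:
#   ds.map_batches(drop_nas_to_numpy, batch_format="pandas")
```

The input format is selected with ``batch_format``; the output format is simply whatever the function returns.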

Comment on lines 196 to 198
The user defined function can also return an iterator that yields batches, so the function can also
be of type ``Callable[DataBatch, Iterator[[DataBatch]]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``.
In this case your function would look like:

Suggested change
The user defined function can also return an iterator that yields batches, so the function can also
be of type ``Callable[DataBatch, Iterator[[DataBatch]]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``.
In this case your function would look like:
The user defined function can also be a Python generator that yields batches, so the function can also
be of type ``Callable[DataBatch, Iterator[[DataBatch]]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``.
In this case, your function would look like:
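
As a rough sketch of the generator form (again, not the example from the docs page), the function below yields several smaller batches from one input batch. The name ``split_batch`` and the chunk size of two rows are assumptions made for illustration. Note that in Python's ``typing`` syntax, a one-argument callable of this kind is spelled ``Callable[[DataBatch], Iterator[DataBatch]]``.

```python
from typing import Dict, Iterator

import numpy as np


def split_batch(batch: Dict[str, np.ndarray]) -> Iterator[Dict[str, np.ndarray]]:
    """Yield the input batch in chunks of at most two rows."""
    # All columns in a batch have the same length, so inspect any one of them.
    num_rows = len(next(iter(batch.values())))
    for start in range(0, num_rows, 2):
        # Slice every column with the same row range to keep rows aligned.
        yield {name: column[start:start + 2] for name, column in batch.items()}


# Hypothetical five-row batch; called directly so no Ray cluster is needed.
batch = {"id": np.arange(5)}
chunks = list(split_batch(batch))

# With Ray Data, the generator function would be passed as:
#   ds.map_batches(split_batch)
```

Yielding smaller batches instead of returning one large batch can help keep memory usage bounded when a transformation expands its input.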

Signed-off-by: Matthew Owen <mowen@anyscale.com>

@angelinalg angelinalg left a comment

Just some nits. Excuse any mangling in the suggestions when I tried to change passive voice to active voice. Please correct as needed. Very nice job overall. Consider using Vale to catch some of these copy edits I made. (go/vale)

doc/source/data/loading-data.rst (6 review threads, outdated and resolved)
doc/source/data/transforming-data.rst (4 review threads, outdated and resolved)
omatthew98 and others added 5 commits March 20, 2024 11:57
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>
omatthew98 and others added 3 commits March 20, 2024 11:59
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>
doc/source/data/loading-data.rst (review thread resolved)
@@ -83,7 +83,7 @@ the appropriate scheme. URI can point to buckets or folders.
filesystem = gcsfs.GCSFileSystem(project="my-google-project")
ds.write_parquet("gcs://my-bucket/my-folder", filesystem=filesystem)

.. tab-item:: ABL
.. tab-item:: ABS

To save data to Azure Blob Storage, install the
`Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage <https://pypi.org/project/adlfs/>`_
Contributor

Same as for read. Also, add a tip on how to tune configs for write failure retries.

Contributor Author

Discussed offline; will add the tip on configs later.

doc/source/data/transforming-data.rst (review thread outdated and resolved)
omatthew98 and others added 5 commits March 20, 2024 12:00
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
@raulchen raulchen merged commit a4a9e97 into ray-project:master Mar 20, 2024
5 checks passed
@can-anyscale
Collaborator

This breaks a data doc test. I'm putting up a revert to double-check (https://buildkite.com/ray-project/postmerge/builds/3645).

can-anyscale pushed a commit that referenced this pull request Mar 21, 2024
#44093 broke one of the data doc tests that was not run on premerge (example here: https://buildkite.com/ray-project/postmerge/builds/3645). This adds missing imports to fix that.

Signed-off-by: Matthew Owen <mowen@anyscale.com>
omatthew98 added a commit to omatthew98/ray that referenced this pull request Mar 21, 2024
…r, and Saving Data pages (ray-project#44093)

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
omatthew98 added a commit to omatthew98/ray that referenced this pull request Mar 21, 2024
ray-project#44093 broke one of the data doc tests that was not run on premerge (example here: https://buildkite.com/ray-project/postmerge/builds/3645). This adds missing imports to fix that.

Signed-off-by: Matthew Owen <mowen@anyscale.com>
khluu pushed a commit that referenced this pull request Mar 21, 2024
…r, and Saving Data pages (#44093) (#44221)

Docs only cherry pick for release.

Note: this cherry-pick includes four commits which are all related to changing the Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages. They are rolled together to reduce cherry-picking overhead and are all part of the logical update to these pages. The PRs included in this cherry-pick:

  • Main overhaul of the listed pages.
  • Two fixes to doc tests that were broken by the above (fix 1, fix 2).
  • An additional small change, added after the initial merge of the main overhaul, that explains how to use credentials.
---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
…r, and Saving Data pages (ray-project#44093)

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
ray-project#44093 broke one of the data doc tests that was not run on premerge (example here: https://buildkite.com/ray-project/postmerge/builds/3645). This adds missing imports to fix that.

Signed-off-by: Matthew Owen <mowen@anyscale.com>