
[Data] [Docs] Consolidate shuffling-related information into Shuffling Data page #44098

Merged
merged 6 commits into from Mar 20, 2024

Conversation

@scottjlee (Contributor) commented Mar 18, 2024

Why are these changes needed?

Consolidate shuffling-related information spread out across Ray Data docs into a new Shuffling Data page.

New docs page: https://anyscale-ray--44098.com.readthedocs.build/en/44098/data/shuffling-data.html

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>
@@ -162,7 +161,7 @@ program might run out of memory. If you encounter an out-of-memory error, decrea
.. _stateful_transforms:

Stateful Transforms
==============================
===================
Contributor Author:

Unrelated to the rest of the PR, but this fixes the title underline.

@omatthew98 (Contributor) left a comment:

A few comments, I think mostly addressing existing docs you copied over. LGTM otherwise.

Shuffle the ordering of files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To randomly shuffle the ordering of input files before reading, call a function like
Contributor:

Maybe change to something like "call a read function that supports shuffling, e.g. call ``read_images``...". It's a little unclear what "a function like ``read_images``" actually means.
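The file-ordering shuffle under discussion can be sketched in plain Python (hypothetical file names; `random.shuffle` stands in for the shuffling a reader such as `read_images` would do internally):

```python
import random

# Hypothetical list of input files; a real reader would discover
# these from a directory or object-store bucket.
files = [f"image_{i:04d}.png" for i in range(8)]

# Shuffling the file list *before* reading randomizes read order
# without buffering any row data in memory.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
shuffled = files[:]
rng.shuffle(shuffled)
```

Because only file names are permuted, this gives coarse-grained randomness at essentially no memory cost: rows within a file keep their original order.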

Local shuffle when iterating over batches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To locally shuffle a subset of rows, call a function like :meth:`~ray.data.Dataset.iter_batches`
Contributor:

Again, maybe be more descriptive than "a function like". For this case, if there are few enough options (three or so), maybe just list them.

To locally shuffle a subset of rows, call a function like :meth:`~ray.data.Dataset.iter_batches`
and specify `local_shuffle_buffer_size`. This shuffles the rows up to a provided buffer
size during iteration. See more details in
:ref:`Iterating over batches with shuffling <iterating-over-batches-with-shuffling>`.
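A minimal sketch of how a bounded local shuffle buffer of this kind behaves (plain Python, not the Ray implementation): rows fill a buffer up to the configured size, and each yielded row is sampled at random from the buffer, so randomness is limited to roughly the buffer size.

```python
import random

def local_shuffle(rows, buffer_size, seed=0):
    """Yield rows in a locally shuffled order using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= buffer_size:
            # Swap a random buffered row to the end and yield it.
            i = rng.randrange(len(buffer))
            buffer[i], buffer[-1] = buffer[-1], buffer[i]
            yield buffer.pop()
    # Drain whatever is left in the buffer at the end.
    rng.shuffle(buffer)
    yield from buffer

out = list(local_shuffle(range(10), buffer_size=4))
```

A larger buffer gives better randomness but costs more memory and more time spent forming batches; that trade-off is what the diagnostics discussed later in this thread measure.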
Contributor:

Should we move that information to this page and then have a small reference on that page to this broader shuffle page?

Contributor Author:

I felt it was important to keep this information on the iteration page as well, since it can be a core part of ``iter_batches``-like methods for ML training. There's also greater detail about each ``iter_batches`` method for Torch/TF there, which seems out of place on this shuffle page. But if others feel the same, we can move it here.


.. _optimizing_shuffles:

Advanced: Optimizing shuffles
Contributor:

Might just be a formatting thing, but should this be a subheading or a top-level heading? It actually seems like all of the subsections fall under "Types of shuffling"; is that intentional?

Contributor Author:

Fixed: made this a new section outside of "Types of shuffling", with the titles below as subsections under the new "Advanced: Optimizing shuffles" section.

Advanced: Optimizing shuffles
-----------------------------

Shuffle operations are *all-to-all* operations where the entire Dataset must be materialized in memory before execution can proceed.
Contributor:

This section doesn't seem super relevant to shuffling. Is the idea that these optimizations might also apply to other all-to-all operations? The "these" in the line below is also unclear: I assumed it referred to shuffle operations, but I think it refers to all-to-all operations.

Contributor:

Maybe move some of this to the "Enabling push-based shuffle" section below, which seems related.

Member:

+1, seemed slightly out of place to me. Wonder if we should just remove this section?

Contributor Author:

Removed this section (but kept the note), since the content is also discussed under the "Enabling push-based shuffle" subsection.

Comment on lines 118 to 119
randomness of the training data. Based on a
`theoretical foundation <https://arxiv.org/abs/1709.10432>`__ all
Contributor:
Suggested change
randomness of the training data. Based on a
`theoretical foundation <https://arxiv.org/abs/1709.10432>`__ all
randomness of the training data. Based on a
`theoretical foundation <https://arxiv.org/abs/1709.10432>`__, all


Some Dataset operations require a *shuffle* operation, meaning that data is shuffled from all of the input partitions to all of the output partitions.
These operations include :meth:`Dataset.random_shuffle <ray.data.Dataset.random_shuffle>`,
:meth:`Dataset.sort <ray.data.Dataset.sort>` and :meth:`Dataset.groupby <ray.data.Dataset.groupby>`.
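Why operations like ``sort`` and ``groupby`` are all-to-all can be seen in a toy sketch (plain Python, hypothetical block layout): every output partition may need rows from every input partition, e.g. when grouping by key.

```python
from collections import defaultdict

# Two input blocks, each containing a mix of keys.
input_blocks = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("a", 5), ("c", 6)],
]

# Shuffle phase: route every row to the output partition that owns
# its key, so each output partition can receive rows from *every*
# input block -- this is the all-to-all data movement.
output_partitions = defaultdict(list)
for block in input_blocks:
    for key, value in block:
        output_partitions[key].append(value)

# Reduce phase: each group is now wholly contained in one partition.
groups = {k: sorted(v) for k, v in output_partitions.items()}
```

Because key "a" appears in both input blocks, no output partition can be finalized until every input partition has been read, which is why the whole Dataset must be materialized before execution proceeds.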
Contributor:

It's not super intuitive why these operations require a shuffle; if possible, maybe add a quick sentence explaining.

Some Dataset operations require a *shuffle* operation, meaning that data is shuffled from all of the input partitions to all of the output partitions.
These operations include :meth:`Dataset.random_shuffle <ray.data.Dataset.random_shuffle>`,
:meth:`Dataset.sort <ray.data.Dataset.sort>` and :meth:`Dataset.groupby <ray.data.Dataset.groupby>`.
Shuffle can be challenging to scale to large data sizes and clusters, especially when the total dataset size can't fit into memory.
Contributor:
Suggested change
Shuffle can be challenging to scale to large data sizes and clusters, especially when the total dataset size can't fit into memory.
Shuffling can be challenging to scale to large data sizes and clusters, especially when the total dataset size can't fit into memory.

:meth:`Dataset.sort <ray.data.Dataset.sort>` and :meth:`Dataset.groupby <ray.data.Dataset.groupby>`.
Shuffle can be challenging to scale to large data sizes and clusters, especially when the total dataset size can't fit into memory.

Datasets provides an alternative shuffle implementation known as push-based shuffle for improving large-scale performance.
Contributor:

Think this is outdated verbiage?

Suggested change
Datasets provides an alternative shuffle implementation known as push-based shuffle for improving large-scale performance.
Ray Data provides an alternative shuffle implementation known as push-based shuffle for improving large-scale performance.
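The push-based idea can be illustrated with a toy sketch (plain Python, not Ray's implementation): as each map task finishes, its per-reducer outputs are pushed into per-reducer merge buffers instead of waiting for all map tasks to complete, so reducers later consume a few large merged blocks rather than many small map outputs.

```python
# Toy sketch of a push-based shuffle with 2 reducers.
NUM_REDUCERS = 2

def map_task(rows):
    """Partition one input block by reducer (here: value modulo NUM_REDUCERS)."""
    outs = [[] for _ in range(NUM_REDUCERS)]
    for row in rows:
        outs[row % NUM_REDUCERS].append(row)
    return outs

merge_buffers = [[] for _ in range(NUM_REDUCERS)]
for block in ([3, 1, 4], [1, 5, 9], [2, 6, 5]):
    # Push each map output to its merger as soon as the map task returns,
    # pipelining the merge with the map phase.
    for reducer, part in enumerate(map_task(block)):
        merge_buffers[reducer].extend(part)

# Each reducer now consumes one pre-merged block.
reduced = [sorted(buf) for buf in merge_buffers]
```

The merge step here is trivial list concatenation; the point of the sketch is only the scheduling shape: merging overlaps with mapping rather than running as a separate barrier stage.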

Comment on lines 71 to 74
If you observe reduced throughput when using ``local_shuffle_buffer_size``;
one way to diagnose this is to check the total time spent in batch creation by
examining the ``ds.stats()`` output (``In batch formatting``, under
``Batch iteration time breakdown``).
Member:
Suggested change
If you observe reduced throughput when using ``local_shuffle_buffer_size``;
one way to diagnose this is to check the total time spent in batch creation by
examining the ``ds.stats()`` output (``In batch formatting``, under
``Batch iteration time breakdown``).
If you observe reduced throughput when using ``local_shuffle_buffer_size``,
check the total time spent in batch creation by
examining the ``ds.stats()`` output (``In batch formatting``, under
``Batch iteration time breakdown``).
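The diagnostic described above boils down to comparing time buckets. A generic sketch of that kind of breakdown (plain Python with hypothetical bucket names, not ``ds.stats()`` itself):

```python
import time

timings = {"batch_formatting": 0.0, "other": 0.0}

def timed(bucket, fn, *args):
    """Run fn and accumulate its wall time into the named bucket."""
    start = time.perf_counter()
    result = fn(*args)
    timings[bucket] += time.perf_counter() - start
    return result

for _ in range(3):
    data = timed("other", lambda: list(range(1000)))
    formatted = timed("batch_formatting", lambda d: [x * 2 for x in d], data)

# If batch formatting dominates, shrink or disable the shuffle buffer.
dominant = max(timings, key=timings.get)
```

In the real workflow you would read these buckets from the ``ds.stats()`` report rather than instrumenting the loop yourself; the comparison logic is the same.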

Comment on lines 77 to 78
time spent in other steps, one way to improve performance is to decrease
``local_shuffle_buffer_size`` or turn off the local shuffle buffer altogether and only :ref:`shuffle the ordering of files <shuffling_file_order>`.
Member:
Suggested change
time spent in other steps, one way to improve performance is to decrease
``local_shuffle_buffer_size`` or turn off the local shuffle buffer altogether and only :ref:`shuffle the ordering of files <shuffling_file_order>`.
time spent in other steps, decrease
``local_shuffle_buffer_size`` or turn off the local shuffle buffer altogether and only :ref:`shuffle the ordering of files <shuffling_file_order>`.


shuffle="files",
)

.. tip::
Member:

This didn't really feel like a tip to me. IMO, it might be better to make this regular text, here and in the other sections. In general, I think we should try to use admonitions sparingly.

Contributor Author:

Moved out of the tip and into regular text.

Comment on lines 74 to 76
``Batch iteration time breakdown``).

If this time is significantly larger than the
Member:
Seems like these sentences should be part of the same paragraph?

Suggested change
``Batch iteration time breakdown``).
If this time is significantly larger than the
``Batch iteration time breakdown``). If this time is significantly larger than the

Signed-off-by: Scott Lee <sjl@anyscale.com>
@bveeramani bveeramani merged commit 02a235d into ray-project:master Mar 20, 2024
5 checks passed
scottjlee added a commit to scottjlee/ray that referenced this pull request Mar 20, 2024
…ng Data` page (ray-project#44098)

Consolidate shuffling-related information spread out across Ray Data docs into a new Shuffling Data page.

Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee mentioned this pull request Mar 20, 2024
8 tasks
khluu pushed a commit that referenced this pull request Mar 20, 2024
…ng Data` page (#44098) (#44171)

Cherry-pick #44098. Docs-only change.

Consolidate shuffling-related information spread out across Ray Data docs into a new Shuffling Data page.

Signed-off-by: Scott Lee <sjl@anyscale.com>
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
…ng Data` page (ray-project#44098)

Consolidate shuffling-related information spread out across Ray Data docs into a new Shuffling Data page.

Signed-off-by: Scott Lee <sjl@anyscale.com>