[Datasets] Implement push-based shuffle #23758
Conversation
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
An initial run of
merge_tasks_assigned += 1
leftover_cpu_map[node_id] -= 1
merge_task_args = [
    {"scheduling_strategy": NodeAffinitySchedulingStrategy(node_id, soft=True)}
Should we also use NodeAffinitySchedulingStrategy to colocate merge_factor number of map tasks with the merge task?
I think it is actually better to let Ray schedule the map tasks in case there are other factors like data locality being used to schedule those. I'll make a note of this in the code!
# reducer.
merge_task_args = self._compute_merge_task_options(
    num_merge_tasks_per_round, merge_factor, cpu_map
)
Could we unit test the above topology configuration logic? Concretely, we could split it out into a helper method that returns a struct/dict, and then unit test the topology configured for various cluster scenarios. This would also help split the code up for readability.
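One hypothetical shape for such a helper (names and the proportional-assignment policy are illustrative, not the actual PR code): take the per-node CPU map and return a plain dict of merge task counts, which is trivial to unit test against different cluster scenarios.

```python
from typing import Dict


def compute_merge_tasks_per_node(
    cpu_map: Dict[str, int], num_merge_tasks: int
) -> Dict[str, int]:
    """Assign merge tasks to nodes proportionally to their CPU counts.

    Hypothetical sketch: returning a plain dict (rather than scheduling
    options directly) keeps the topology logic unit-testable.
    """
    total_cpus = sum(cpu_map.values())
    assignments = {
        node: num_merge_tasks * cpus // total_cpus
        for node, cpus in cpu_map.items()
    }
    # Hand out any remainder to the largest nodes first so that
    # every merge task gets assigned somewhere.
    leftover = num_merge_tasks - sum(assignments.values())
    for node in sorted(cpu_map, key=cpu_map.get, reverse=True):
        if leftover == 0:
            break
        assignments[node] += 1
        leftover -= 1
    return assignments
```

A test can then just assert on the returned dict for a given cluster shape, e.g. a skewed two-node cluster or a homogeneous cluster with a non-divisible task count.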
+1
"map": shuffle_map_metadata,
"merge": shuffle_merge_metadata,
"reduce": new_metadata,
}
Add some tests to test_stats.py?
This looks great! Some high level code structure suggestions.
@@ -28,6 +29,11 @@
# Whether to furthermore fuse prior map tasks with shuffle stages.
DEFAULT_OPTIMIZE_FUSE_SHUFFLE_STAGES = True

# Whether to use push-based shuffle by default.
DEFAULT_USE_PUSH_BASED_SHUFFLE = bool(
Do we need to set this env var on all nodes?
Ah right, good point... It only needs to be set on whichever process makes the Dataset shuffle call; after that, I think the context will get propagated through the usual mechanism. Do you think that's an issue?
cc @clarkzinzow for best practice/convention here
@stephanie-wang @scv119 I think that this is fine, since we already rely on the user being able to mutate the global context in the driver before doing Dataset operations and expect those default overrides to propagate to all tasks, so this is no different IMO.
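For reference, one defensive way to derive a boolean default from an env var in the driver (a sketch only; the PR's exact parsing may differ). The subtlety is that `bool()` on any non-empty string is `True`, so `RAY_DATASET_PUSH_BASED_SHUFFLE=0` would still enable the feature unless the value is parsed through `int()` first:

```python
import os

# Sketch: parse via int() so that "0" disables and "1" enables.
# A bare bool() on the raw string would treat "0" as truthy.
DEFAULT_USE_PUSH_BASED_SHUFFLE = bool(
    int(os.environ.get("RAY_DATASET_PUSH_BASED_SHUFFLE", "0"))
)
```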
# reducer.
merge_task_args = self._compute_merge_task_options(
    num_merge_tasks_per_round, merge_factor, cpu_map
)
+1
I think some tests are failing here. I reviewed this at a high level and am ok with merging this to unblock further testing, though we might want to revisit the code structure at a later point (could be after we're happy with the performance).
@@ -28,6 +29,11 @@
# Whether to furthermore fuse prior map tasks with shuffle stages.
DEFAULT_OPTIMIZE_FUSE_SHUFFLE_STAGES = True

# Whether to use push-based shuffle by default.
DEFAULT_USE_PUSH_BASED_SHUFFLE = bool(
cc @clarkzinzow for best practice/convention here
LGTM, same thoughts as @ericl: I have some ideas of how to improve the code structure here, but this looks good enough to merge, so let's unblock next steps and revisit after perf testing.
@@ -28,6 +29,11 @@
# Whether to furthermore fuse prior map tasks with shuffle stages.
DEFAULT_OPTIMIZE_FUSE_SHUFFLE_STAGES = True

# Whether to use push-based shuffle by default.
DEFAULT_USE_PUSH_BASED_SHUFFLE = bool(
@stephanie-wang @scv119 I think that this is fine, since we already rely on the user being able to mutate the global context in the driver before doing Dataset operations and expect those default overrides to propagate to all tasks, so this is no different IMO.
Awesome seeing this work going into production! I left a few questions inline.
One thing we should keep in mind is data skew. The current pipelining schedule will not work very well if the input partitions are too imbalanced.
) -> Tuple[BlockList, Dict[str, List[BlockMetadata]]]:
    logger.info("Using experimental push-based shuffle.")
    # TODO(swang): For jobs whose reduce work is heavier than the map work,
    # we should support fractional merge factors.
I don't understand what fractional merge factors mean. Like, how would a merge task merge 1.5x map outputs? And why is this related to the reduce/map workload ratio?
It just means you have more merge tasks than map tasks. So if you have map:merge = 1/2 and 10 map tasks per round, each merge task will output half a partition's worth of data. Presumably you'd also want many more reduce tasks than map tasks in this scenario.
But yeah, there's no evidence yet that we'd want to support this case; I haven't seen one yet where the reduce-side computation is heavier than the map.
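To make the arithmetic concrete (hypothetical helper and numbers, just to illustrate the ratio): if merge_factor is the number of map outputs consumed per merge task, a fractional factor below 1 yields more merge tasks than map tasks.

```python
def num_merge_tasks(num_map_tasks_per_round: int, merge_factor: float) -> int:
    # merge_factor = map outputs consumed per merge task.
    # A fractional factor (< 1) yields more merge tasks than map tasks,
    # each holding less than one map output's worth of data.
    return max(1, round(num_map_tasks_per_round / merge_factor))


# 10 map tasks, merge_factor 2   -> 5 merge tasks (usual case)
# 10 map tasks, merge_factor 0.5 -> 20 merge tasks (fractional case)
```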
This reverts commit c1054a0.
Redo for #23758 to fix CI.
Why are these changes needed?
The simple shuffle currently implemented in Datasets does not reliably scale beyond ~1000 partitions due to metadata and I/O overhead.
This PR adds an experimental shuffle implementation for a "push-based shuffle", as described in this paper draft. This algorithm should see better performance at larger data scales. The algorithm works by merging intermediate map outputs at the reducer side while other map tasks are executing. Then, a final reduce task merges these merged outputs.
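The three stages described above can be sketched as a toy, single-process pipeline (pure Python with illustrative names; the real implementation runs map and merge as concurrent Ray tasks over blocks, not lists):

```python
from collections import defaultdict
from typing import Dict, List


def map_task(block: List[int], num_reducers: int) -> Dict[int, List[int]]:
    # Partition one input block by destination reducer.
    out = defaultdict(list)
    for item in block:
        out[item % num_reducers].append(item)
    return dict(out)


def merge_task(map_outputs: List[Dict[int, List[int]]]) -> Dict[int, List[int]]:
    # Combine several map outputs into one merged output per reducer.
    # In the real algorithm this runs while other map tasks are still executing.
    merged = defaultdict(list)
    for out in map_outputs:
        for reducer_id, items in out.items():
            merged[reducer_id].extend(items)
    return dict(merged)


def reduce_task(
    reducer_id: int, merged_outputs: List[Dict[int, List[int]]]
) -> List[int]:
    # Final reduce: concatenate this reducer's merged partitions from all rounds.
    result = []
    for merged in merged_outputs:
        result.extend(merged.get(reducer_id, []))
    return sorted(result)
```

For example, shuffling two blocks `[1, 2, 3]` and `[4, 5, 6]` into two partitions runs both maps, one merge over their outputs, then one reduce per partition.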
Currently, the PR exposes this option through the DatasetContext. It can also be set through a hidden OS environment variable (RAY_DATASET_PUSH_BASED_SHUFFLE). Once we have more comprehensive benchmarks, we can better document this option and allow the algorithm to be chosen at run time.

Related issue number
Closes #23758.