[Datasets] Implement push-based shuffle #24281

stephanie-wang · 2022-04-28T00:41:56Z

The simple shuffle currently implemented in Datasets does not scale reliably beyond about 1000 partitions because of metadata and I/O overhead.

This PR adds an experimental implementation of "push-based shuffle", as described in this paper draft. The algorithm is expected to perform better at larger data scales. It works by merging intermediate map outputs on the reducer side while other map tasks are still executing. A final reduce task then combines these merged outputs.
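To illustrate the map, merge, and reduce stages described above, here is a minimal single-process sketch. It is not the code in this PR: the function names, the modulo partitioning, and the `maps_per_round` parameter are all hypothetical, and the rounds run sequentially here, whereas the real algorithm overlaps the merge of one round with the map tasks of the next.

```python
from typing import List


def map_task(block: List[int], num_reducers: int) -> List[List[int]]:
    """Split one input block into one piece per reducer (modulo partitioning, an assumption)."""
    parts: List[List[int]] = [[] for _ in range(num_reducers)]
    for x in block:
        parts[x % num_reducers].append(x)
    return parts


def merge_task(map_outputs: List[List[List[int]]], num_reducers: int) -> List[List[int]]:
    """Combine the outputs of one round of map tasks, keeping reducer partitions separate."""
    merged: List[List[int]] = [[] for _ in range(num_reducers)]
    for parts in map_outputs:
        for r, piece in enumerate(parts):
            merged[r].extend(piece)
    return merged


def reduce_task(merged_pieces: List[List[int]]) -> List[int]:
    """Final reduce: combine the merged pieces destined for one reducer."""
    out: List[int] = []
    for piece in merged_pieces:
        out.extend(piece)
    return sorted(out)


def push_based_shuffle(
    blocks: List[List[int]], num_reducers: int, maps_per_round: int
) -> List[List[int]]:
    """Run map tasks in rounds, merge each round's outputs, then reduce once per partition."""
    merged_rounds = []
    for start in range(0, len(blocks), maps_per_round):
        round_blocks = blocks[start:start + maps_per_round]
        map_outputs = [map_task(b, num_reducers) for b in round_blocks]
        # In a distributed setting this merge would overlap with the next round of maps.
        merged_rounds.append(merge_task(map_outputs, num_reducers))
    return [
        reduce_task([merged[r] for merged in merged_rounds])
        for r in range(num_reducers)
    ]
```

With this structure, each reduce task reads one merged piece per round instead of one piece per map task, which is where the reduction in metadata and small I/O operations would come from.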

Currently, the PR exposes this option through the DatasetContext. It can also be set through a hidden OS environment variable (RAY_DATASET_PUSH_BASED_SHUFFLE). Once we have more comprehensive benchmarks, we can better document this option and allow the algorithm to be chosen at run time.
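A hedged sketch of how the two switches mentioned above might be used. The environment variable name comes from this description; the value `"1"`, the assumption that it must be set before Ray Datasets is imported, and the attribute name `use_push_based_shuffle` are not stated here and should be checked against the Ray version in use. The helper name `enable_push_based_shuffle` is hypothetical.

```python
import os

# Variable name taken from the PR description; the value "1" is an assumption.
# Assumed to be read when Ray Datasets is imported, so set it beforehand.
os.environ["RAY_DATASET_PUSH_BASED_SHUFFLE"] = "1"


def enable_push_based_shuffle() -> bool:
    """Try to set the flag on the current DatasetContext.

    Returns True if the flag was set, False if Ray (or this class) is unavailable.
    """
    try:
        from ray.data.context import DatasetContext
    except ImportError:
        return False
    ctx = DatasetContext.get_current()
    # Attribute name is an assumption; it is not given in the PR description.
    ctx.use_push_based_shuffle = True
    return True
```

Because the option is experimental, setting it per process like this keeps the default simple shuffle in place everywhere else.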

Redo of #23758 to fix CI.

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

stephanie-wang and others added 30 commits March 30, 2022 12:58

simple shuffle op

d3f31f6

Factor out shuffle into ShuffleOp

e85ecbe

SortOp

0d1f0b7

Groupby Op

c9666a2

GroupbyOp

7e8dd1b

x

d2d3316

doc, fix random_shuffle

bc64e5b

Update python/ray/data/impl/shuffle.py

0ec45a2

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

Update python/ray/data/impl/shuffle.py

47351d4

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

Update python/ray/data/impl/shuffle.py

970cbaa

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

lint

645531c

random

131e84f

lint

b11dcd2

fix

51a0045

Push-based shuffle

5514d33

Fix bug for total # CPUs much larger than number of input blocks

74d749a

Merge remote-tracking branch 'upstream/master' into shuffle-op

60bddab

x

988c496

TODO

2539f1a

Dummy merge task args

c7754dc

Merge remote-tracking branch 'upstream/master' into shuffle-op

f87a10c

NodeAffinityStrategy

9b4f289

Merge remote-tracking branch 'upstream/master' into shuffle-op

5876cff

update

2dc85bb

lint

10d87b6

env flag and tests

ffffd3f

ShufflePartitionOp

9eeefca

Merge remote-tracking branch 'upstream/master' into shuffle-op

fccf9fd

Merge remote-tracking branch 'upstream/master' into shuffle-op

291bb2b

Update python/ray/data/impl/shuffle.py

2af2642

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

stephanie-wang and others added 18 commits April 19, 2022 10:12

Update python/ray/data/impl/shuffle.py

4769d6e

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

Update python/ray/data/impl/shuffle.py

5bf6596

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

Update python/ray/data/impl/shuffle.py

0dba544

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

Update python/ray/data/impl/shuffle.py

0510bb3

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

Update python/ray/data/impl/shuffle.py

7a84e8b

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

file cleanup, set option in dataset context

d47cc5f

parametrize tests

f66abdc

update

cda0761

Merge remote-tracking branch 'upstream/master' into shuffle-op

722ac43

todo

7f25484

Clear sample objectrefs

59b350a

generic pipelined exec

f68db39

refactor

53b67f2

refactor

5127c7d

fix

6e5d7e6

reset context

81513e2

lint

ececdcd

Merge remote-tracking branch 'upstream/master' into shuffle-op

4576978

stephanie-wang requested review from ericl, scv119, clarkzinzow and jjyao as code owners April 28, 2022 00:41

fix

65813d6

clarkzinzow approved these changes Apr 28, 2022

ericl approved these changes Apr 28, 2022

scv119 approved these changes Apr 28, 2022

stephanie-wang merged commit a5a11f6 into ray-project:master Apr 28, 2022

stephanie-wang deleted the shuffle-op branch April 28, 2022 21:58
