
[Train] Split all Ray Datasets by default #38694

Merged 11 commits into ray-project:master on Aug 23, 2023

Conversation

@woshiyyya (Member) commented Aug 21, 2023

Why are these changes needed?

Previously, only the "train" Ray Dataset was sharded by default. Users who wanted to shard other datasets had to specify them explicitly with a DataConfig, e.g. DataConfig(datasets_to_split=["train", "eval"]).

This PR changes the default behavior to shard all datasets, for the following reasons:

  • Efficiency: We want users to get the most out of Ray Data. The best way to reduce training time is to leverage the fact that Ray Data can efficiently shard every dataset across workers. Training frameworks (e.g. Lightning) already provide ways to aggregate results across workers, so we should recommend that users shard their validation datasets as well (see the aggregation sketch after this list).
  • Consistency: It is conceptually easier for users to understand a single default behavior that applies to all datasets, with options to configure it.
  • Explicitness: The special treatment of the magic "train" key is not explicit, and users won't discover it unless they read through the documentation closely. Relying on untyped keywords is not ideal.
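
As a hedged illustration of the aggregation point above (not something introduced by this PR), here is a minimal sketch of how a Lightning module can average a validation metric across workers when each worker only sees its own shard of the validation dataset. The module, layer sizes, and metric name are hypothetical; the relevant Lightning mechanism is `self.log(..., sync_dist=True)`.

```python
import torch
import pytorch_lightning as pl


class EvalShardedModule(pl.LightningModule):
    """Toy module; the model and metric here are purely illustrative."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        features, labels = batch["x"], batch["y"]
        loss = torch.nn.functional.mse_loss(self(features), labels)
        # sync_dist=True reduces (averages) the metric across all workers,
        # so although each worker validates on only its own shard, the
        # logged "val_loss" reflects the full validation dataset.
        self.log("val_loss", loss, sync_dist=True)
        return loss
```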

API

  • Shard all datasets (default):

```python
TorchTrainer(
    datasets={"a": ds_1, "b": ds_2, "c": ds_3},
    # data_config=DataConfig(datasets_to_split="all")
)
```

  • Shard a subset of datasets:

```python
TorchTrainer(
    datasets={"a": ds_1, "b": ds_2, "c": ds_3},
    data_config=DataConfig(datasets_to_split=["a", "b"])
)
```
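
For context (this is not part of the PR diff), inside the training loop each worker would then pull its own shard of every named dataset. A minimal sketch, assuming Ray 2.7-style `ray.train` APIs; the dataset names, batch size, and worker count are illustrative:

```python
import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Toy datasets standing in for ds_1 / ds_2 in the example above.
ds_1 = ray.data.range(1000)
ds_2 = ray.data.range(100)


def train_loop_per_worker(config):
    # With the new default, every named dataset is sharded, so each worker
    # receives a disjoint shard of both "a" and "b".
    train_shard = train.get_dataset_shard("a")
    eval_shard = train.get_dataset_shard("b")
    for batch in train_shard.iter_torch_batches(batch_size=32):
        pass  # training step goes here
    for batch in eval_shard.iter_torch_batches(batch_size=32):
        pass  # per-worker evaluation; aggregate metrics across workers


trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"a": ds_1, "b": ds_2},
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()
```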

Related issue number

Closes #37668

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@matthewdeng (Contributor) left a comment:

Thanks for the quick implementation! Logic looks good to me. @c21 could you shepherd this through?

woshiyyya and others added 3 commits August 22, 2023 00:55
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@c21 (Contributor) left a comment:

Thanks @woshiyyya!


For all other datasets, Ray Train passes the entire dataset to each worker.
If want to customize which datasets are split, pass in a :class:`DataConfig <ray.train.DataConfig>` to the Trainer constructor.
A Contributor commented on this doc change:

If want to customize which datasets are split, pass in a :class:`DataConfig <ray.train.DataConfig>` to the Trainer constructor

->

If you want to customize which dataset to split, pass in a :class:`DataConfig(datasets_to_split=...) <ray.train.DataConfig>` to the Trainer constructor. ?

@c21 (Contributor) commented Aug 22, 2023

btw do we need a migration guide for 2.7? @ericl

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@c21 (Contributor) left a comment:

LGTM

@c21 (Contributor) commented Aug 22, 2023

@woshiyyya - could you check whether the CI test failures are related to this change? Let's retrigger the CI, since many tests are failing.

@woshiyyya (Member, Author) commented Aug 22, 2023

@c21 yeah, I'm debugging these CI tests. I'll ping you after I fix them!

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@woshiyyya added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Aug 22, 2023
@c21 c21 merged commit 23ed18f into ray-project:master Aug 23, 2023
71 of 73 checks passed
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Successfully merging this pull request may close these issues.

[Train] Shard all Input Datasets in Ray Train by default
4 participants