
[Train] Split all Ray Datasets by default #38694

Merged 11 commits into ray-project:master on Aug 23, 2023

Conversation

@woshiyyya (Member) commented Aug 21, 2023

Why are these changes needed?

Previously, only the "train" Ray Dataset was sharded by default. Users who wanted to shard other datasets had to specify them explicitly with a DataConfig, e.g. DataConfig(datasets_to_split=["train", "eval"]).

This PR changes the default behavior to shard all datasets, for the following reasons:

  • Efficiency: We want users to get the most out of Ray Data. The best way to reduce training time is to leverage the fact that Ray Data can efficiently shard every dataset across workers. Training frameworks (e.g. Lightning) already provide ways to aggregate results across workers, so we should recommend that users shard their validation datasets as well (see the aggregation sketch after this list).
  • Consistency: It is conceptually easier for users to understand a single default behavior that applies to all datasets, with options to configure it.
  • Explicitness: The special treatment of the magic "train" key is not explicit, and users won't discover it unless they read through the documentation closely. Relying on untyped keywords is not ideal.
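
As a hedged illustration of the aggregation point above (not something introduced by this PR), here is a minimal sketch of how a Lightning module can average a validation metric across workers when each worker only sees its own shard of the validation dataset. The module, layer sizes, and metric name are hypothetical; the relevant Lightning mechanism is `self.log(..., sync_dist=True)`.

```python
import torch
import pytorch_lightning as pl


class EvalShardedModule(pl.LightningModule):
    """Toy module; the model and metric here are purely illustrative."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        features, labels = batch["x"], batch["y"]
        loss = torch.nn.functional.mse_loss(self(features), labels)
        # sync_dist=True reduces (averages) the metric across all workers,
        # so although each worker validates on only its own shard, the
        # logged "val_loss" reflects the full validation dataset.
        self.log("val_loss", loss, sync_dist=True)
        return loss
```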

API

  • Shard all datasets (default):

```python
TorchTrainer(
    datasets={"a": ds_1, "b": ds_2, "c": ds_3},
    # data_config=DataConfig(datasets_to_split="all")
)
```

  • Shard a subset of datasets:

```python
TorchTrainer(
    datasets={"a": ds_1, "b": ds_2, "c": ds_3},
    data_config=DataConfig(datasets_to_split=["a", "b"])
)
```
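
For context (this is not part of the PR diff), inside the training loop each worker would then pull its own shard of every named dataset. A minimal sketch, assuming Ray 2.7-style `ray.train` APIs; the dataset names, batch size, and worker count are illustrative:

```python
import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Toy datasets standing in for ds_1 / ds_2 in the example above.
ds_1 = ray.data.range(1000)
ds_2 = ray.data.range(100)


def train_loop_per_worker(config):
    # With the new default, every named dataset is sharded, so each worker
    # receives a disjoint shard of both "a" and "b".
    train_shard = train.get_dataset_shard("a")
    eval_shard = train.get_dataset_shard("b")
    for batch in train_shard.iter_torch_batches(batch_size=32):
        pass  # training step goes here
    for batch in eval_shard.iter_torch_batches(batch_size=32):
        pass  # per-worker evaluation; aggregate metrics across workers


trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"a": ds_1, "b": ds_2},
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()
```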

Related issue number

Closes #37668

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@matthewdeng (Contributor) left a comment:

Thanks for the quick implementation! Logic looks good to me. @c21 could you shepherd this through?

woshiyyya and others added 3 commits August 22, 2023 00:55
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@c21 (Contributor) left a comment:

Thanks @woshiyyya!


For all other datasets, Ray Train passes the entire dataset to each worker.
If want to customize which datasets are split, pass in a :class:`DataConfig <ray.train.DataConfig>` to the Trainer constructor.
A Contributor commented on this doc change:

If want to customize which datasets are split, pass in a :class:`DataConfig <ray.train.DataConfig>` to the Trainer constructor

->

If you want to customize which dataset to split, pass in a :class:`DataConfig(datasets_to_split=...) <ray.train.DataConfig>` to the Trainer constructor. ?

@c21 (Contributor) commented Aug 22, 2023

btw do we need a migration guide for 2.7? @ericl

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@c21 (Contributor) left a comment:

LGTM

@c21 (Contributor) commented Aug 22, 2023

@woshiyyya - could you check whether the CI test failures are related to this change? Let's retrigger the CI, since many tests are failing.

@woshiyyya (Member, Author) commented Aug 22, 2023

@c21 yeah, I'm debugging these CI tests. I'll ping you after I fix them!

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@woshiyyya added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Aug 22, 2023
@c21 c21 merged commit 23ed18f into ray-project:master Aug 23, 2023
71 of 73 checks passed
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Successfully merging this pull request may close these issues.

[Train] Shard all Input Datasets in Ray Train by default
4 participants