[Train] Split all Ray Datasets by default #38694
Conversation
Thanks for the quick implementation! Logic looks good to me. @c21 could you shepherd this through?
Thanks @woshiyyya!
For all other datasets, Ray Train passes the entire dataset to each worker.
If want to customize which datasets are split, pass in a :class:`DataConfig <ray.train.DataConfig>` to the Trainer constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If want to customize which datasets are split, pass in a :class:`DataConfig <ray.train.DataConfig>` to the Trainer constructor

->

If you want to customize which datasets to split, pass in a :class:`DataConfig(datasets_to_split=...) <ray.train.DataConfig>` to the Trainer constructor. ?
btw do we need a migration guide for 2.7? @ericl
LGTM
@woshiyyya - could you help check if all CI test failures are related? Let's probably retrigger the CI again, as many CI tests are failing.
@c21 yeah I am debugging these CI tests. I'll ping you after I've fixed them!
Previously, Ray Train only sharded the "train" Ray Dataset by default. Users who wanted to shard other datasets had to specify them explicitly with a `DataConfig`, e.g. `DataConfig(datasets_to_split=["train", "eval"])`. We now change the default behavior to shard all datasets, for the following considerations:

- **Efficiency:** We want people to leverage Ray Data as best as possible. The best way to optimize training time is to leverage the fact that Ray Data can effectively shard all the datasets across workers. Training frameworks (e.g. Lightning) provide ways to aggregate results across workers, and we should recommend that users shard their validation datasets.
- **Consistency:** It is conceptually easier for users to understand a single default behavior applied to all Datasets, with options provided to configure it.
- **Explicitness:** The behavior of the magic "train" key is not very explicit, and users will not understand it until they really read through the documentation. Relying on untyped keywords is non-ideal.

### API

- Shard all datasets (default):

```python
TorchTrainer(
    datasets={"a": ds_1, "b": ds_2, "c": ds_3},
    # data_config=DataConfig(datasets_to_split="all")
)
```

- Shard a subset of datasets:

```python
TorchTrainer(
    datasets={"a": ds_1, "b": ds_2, "c": ds_3},
    data_config=DataConfig(datasets_to_split=["a", "b"]),
)
```

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
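The selection logic behind `datasets_to_split` can be illustrated with a plain-Python sketch. This is a simplified stand-in for what Ray's `DataConfig` decides, not its actual implementation; the function name `plan_shards` and the round-robin slicing are hypothetical, used only to show which datasets each worker ends up seeing:

```python
from typing import Dict, List, Union

def plan_shards(
    datasets: Dict[str, List[int]],
    datasets_to_split: Union[str, List[str]] = "all",
    num_workers: int = 2,
) -> List[Dict[str, List[int]]]:
    """Return each worker's view of every dataset.

    Datasets selected for splitting are sharded round-robin across
    workers; all others are passed whole to every worker (the old
    behavior for every key except "train").
    """
    if datasets_to_split == "all":
        to_split = set(datasets)           # new default: shard everything
    else:
        to_split = set(datasets_to_split)  # user-specified subset

    plans = []
    for rank in range(num_workers):
        view = {}
        for name, rows in datasets.items():
            if name in to_split:
                view[name] = rows[rank::num_workers]  # this worker's shard
            else:
                view[name] = rows                     # full copy
        plans.append(view)
    return plans

# With the new default, both "train" and "eval" are sharded:
plans = plan_shards({"train": [0, 1, 2, 3], "eval": [10, 11]}, "all", 2)
print(plans[0])  # worker 0: {'train': [0, 2], 'eval': [10]}
```

Passing `datasets_to_split=["a"]` instead would shard only `"a"` and hand every worker a full copy of the rest, matching the opt-in behavior described above.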
Why are these changes needed?
Previously, Ray Train only sharded the "train" Ray Dataset by default. Users who wanted to shard other datasets had to specify them explicitly with a `DataConfig`, e.g. `DataConfig(datasets_to_split=["train", "eval"])`. We now change the default behavior to shard all datasets, for the efficiency, consistency, and explicitness considerations described above.
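One consequence of sharding validation datasets by default is that each worker's eval metrics describe only its own shard, so they must be combined across workers (the cross-worker aggregation that frameworks like Lightning provide, as noted above). A minimal stdlib sketch of the weighted aggregation; the name `aggregate_eval_loss` and the sample numbers are hypothetical:

```python
def aggregate_eval_loss(worker_results):
    """Combine per-worker (sum_of_loss, num_rows) pairs into a global
    mean loss, weighting each worker by how many eval rows it saw."""
    total_loss = sum(loss_sum for loss_sum, _ in worker_results)
    total_rows = sum(n for _, n in worker_results)
    return total_loss / total_rows

# Two workers, each evaluated on its shard of the eval dataset:
results = [(6.0, 3), (2.0, 1)]  # (sum of per-row loss, rows in shard)
print(aggregate_eval_loss(results))  # (6.0 + 2.0) / (3 + 1) = 2.0
```

Weighting by row count matters because round-robin sharding can leave workers with unequal shard sizes; an unweighted mean of per-worker means would bias the global metric.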
API
Related issue number
Closes #37668
Checks

- I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I have added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.