[Train] Update docstring and user guides for `train_loop_config` #43691

Merged

matthewdeng merged 9 commits into ray-project:master from woshiyyya:update_train_loop_config

Mar 8, 2024

Member

woshiyyya commented Mar 4, 2024

Why are these changes needed?

A common issue is that Ray Train users try to pass large data or model objects through `train_loop_config`, which introduces significant serialization overhead and can cause deserialization failures (e.g., deserializing a CUDA tensor on a CPU actor such as TrainTrainable).

This PR adds comments in the user guide and docstring to warn users against similar attempts.
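
As a rough illustration of the overhead described above, the sketch below compares the serialized size of a small hyperparameter dictionary with one that also carries a large payload. It uses the standard-library `pickle` as a stand-in for Ray's own serialization, so the numbers are only indicative; the config keys and the 10 MB payload are hypothetical.

```python
import pickle

# Hypothetical stand-ins: a small hyperparameter dict versus a config that
# also carries a large in-memory payload (here, 10 MB of raw bytes).
small_config = {"lr": 1e-3, "batch_size": 32, "num_epochs": 2}
large_config = {**small_config, "dataset": bytes(10 * 1024 * 1024)}

# Serialized size approximates what must be shipped to every worker.
small_size = len(pickle.dumps(small_config))
large_size = len(pickle.dumps(large_config))

print(f"small config: {small_size} bytes")
print(f"large config: {large_size} bytes")
```

The large payload dominates the serialized size, which is the cost the warning in this PR asks users to avoid.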

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

woshiyyya added 2 commits

March 4, 2024 15:30


          update docstring for train_loop_config

df46a30

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>


          updating

1f30c72

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

woshiyyya marked this pull request as ready for review

March 4, 2024 23:33

woshiyyya requested review from richardliaw, krfricke, xwjiang2010, amogkam, matthewdeng, Yard1, maxpumperla, justinvyu and a team as code owners

March 4, 2024 23:33

woshiyyya assigned matthewdeng

woshiyyya commented

doc/source/train/getting-started-pytorch-lightning.rst

@@ -23,7 +23,7 @@ For reference, the final code is as follows:
                   from ray.train.torch import TorchTrainer
                   from ray.train import ScalingConfig
-                  def train_func(config):

Member Author

woshiyyya Mar 6, 2024

I removed the config argument here because this code snippet doesn't specify train_loop_config in TorchTrainer. Otherwise, users would be confused about where to put the train_func arguments.

justinvyu approved these changes

Contributor

justinvyu left a comment

Thanks. I agree, the config should not be promoted since it's mostly unnecessary for Train.

doc/source/train/getting-started-pytorch-lightning.rst Outdated (resolved)


          replace note with warning

f8644ed

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

matthewdeng reviewed

doc/source/train/getting-started-pytorch-lightning.rst (resolved)

doc/source/train/getting-started-pytorch-lightning.rst Outdated

Comment on lines 193 to 199

+              You can specify the input argument for `train_func` via the Trainer's `train_loop_config` parameter.
+              .. warning::
+                  Avoid passing large data objects through `train_loop_config` to reduce the
+                  serialization and deserialization overhead. Instead, it's preferred to
+                  initialize large objects (e.g. datasets, models) directly in `train_func`.

Contributor

matthewdeng Mar 6, 2024

Add a code snippet to show how to populate these? I think we want to show that it's a dictionary.

def train_func(config):
    config[...]

config = {...}
trainer = TorchTrainer(train_func, train_loop_config=config, ...)

We can also show an example in the warning.

Member Author

woshiyyya Mar 7, 2024

Added two examples to:

- highlight the config format
- show the good and bad practices of setting train_loop_config.
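
For readers without the merged docs at hand, here is a minimal sketch of what such examples might look like. It is not the text added in this PR. `load_dataset`, the config keys, and the data path are hypothetical, and the Ray imports are deferred into `build_trainer` so the rest of the sketch runs without Ray installed.

```python
def load_dataset(path):
    # Hypothetical loader; stands in for reading real data inside the worker.
    return [float(i) for i in range(1000)]


def train_func(config):
    # `config` is the dictionary passed as `train_loop_config`.
    lr = config["lr"]
    num_epochs = config["num_epochs"]
    # Good practice: build large objects here, inside each worker.
    dataset = load_dataset(config["data_path"])
    # Returned only so the sketch can be checked locally without Ray.
    return {"lr": lr, "num_epochs": num_epochs, "num_samples": len(dataset)}


# Good: only small, cheaply serialized values such as hyperparameters and paths.
good_config = {"lr": 1e-3, "num_epochs": 2, "data_path": "/tmp/hypothetical-data"}

# Bad (avoid): shipping the dataset or model object itself through the config,
# e.g. {"lr": 1e-3, "dataset": load_dataset("/tmp/hypothetical-data")}.


def build_trainer():
    # Imported lazily; requires Ray Train to be installed.
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    return TorchTrainer(
        train_func,
        train_loop_config=good_config,
        scaling_config=ScalingConfig(num_workers=2),
    )
```

Calling `build_trainer().fit()` would launch the training run on a Ray cluster; the worker count of 2 is an arbitrary choice for illustration.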

doc/source/train/getting-started-pytorch-lightning.rst Outdated

		@@ -190,6 +190,13 @@ Begin by wrapping your code in a :ref:`training function <train-overview-trainin

		Each distributed training worker executes this function.

		You can specify the input argument for `train_func` via the Trainer's `train_loop_config` parameter.

Contributor

matthewdeng Mar 6, 2024

Optionally, we can extract this section out to a separate file and include it, similar to what's being done here.

In the future we may just have a full separate user guide for this.

Member Author

woshiyyya Mar 7, 2024

Good idea. I've extracted the common paragraph into a separate doc.


          remove all config arguments in train_func()

6aa01bd

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

woshiyyya requested review from ericl, scv119, c21, scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners

March 7, 2024 00:49

woshiyyya and others added 3 commits

March 6, 2024 17:15


          take out the train_func configuration into a separate doc

5c6e520

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>


          Update torch-configure-train_func.rst

6e55db7

Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>


          Merge branch 'master' into update_train_loop_config

6ce05f3

matthewdeng approved these changes

doc/source/train/common/torch-configure-train_func.rst Outdated (resolved)

woshiyyya added 2 commits

March 7, 2024 12:32


          update

0a5cfb5

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>


          Merge remote-tracking branch 'origin/update_train_loop_config' into update_train_loop_config

22d1177

c21 approved these changes

Contributor

c21 left a comment

LGTM from data side.

matthewdeng merged commit 0edd366 into ray-project:master

9 checks passed

Reviewers

- justinvyu: approved these changes
- matthewdeng: approved these changes
- c21: approved these changes
- richardliaw: awaiting requested review
- krfricke: awaiting requested review
- xwjiang2010: awaiting requested review
- amogkam: awaiting requested review (code owner)
- Yard1: awaiting requested review
- maxpumperla: awaiting requested review
- ericl: awaiting requested review (code owner)
- scv119: awaiting requested review (code owner)
- scottjlee: awaiting requested review (code owner)
- bveeramani: awaiting requested review (code owner)
- raulchen: awaiting requested review (code owner)
- stephanie-wang: awaiting requested review (code owner)
- omatthew98: awaiting requested review (code owner)