
[Train] Add backend-specific context manager for train_func. #43209

Conversation

@woshiyyya (Member) commented on Feb 15, 2024:

Why are these changes needed?

This PR provides a way to inject a backend-specific context manager for train_func. It's a developer API (not for end users), which enables us to inject backend-specific setup and teardown logic around the training function.

Use case 1: PyTorch sets the default CUDA device

Set the default torch CUDA device to the device allocated to this worker.

Previously, Ray Train did not automatically set torch.cuda.current_device; it was only set when the user called train.torch.prepare_model in the training function. If the user does not call prepare_model, the default CUDA device on every worker is "cuda:0", which is not ideal and may cause problems (all tensors get moved to device 0).

def train_func():
    model.to("cuda")  # -> moves the model on every rank to device 0
    ...

We add this behavior internally rather than asking users to call it themselves, because we want training to behave the same when scaling from 1 GPU to multiple GPUs without changing the user code.
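For reference, the injected Torch setup looks roughly like the sketch below (based on the TorchConfigContextManager discussed in the review comments; the exact implementation in the PR may differ slightly). The backend exposes it through the train_func_context hook on its BackendConfig.

import torch

import ray.train.torch


class TorchConfigContextManager:
    def __enter__(self):
        # Point the default CUDA device at the GPU assigned to this worker,
        # so that `model.to("cuda")` lands on the right device on every rank.
        device = ray.train.torch.get_device()
        if device.type == "cuda":
            torch.cuda.set_device(device)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Return False so that any exception raised in train_func propagates.
        return False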

Use case 2: XGBoost CommunicatorContext

Previous discussions: #42767 (comment)

To make XGBoost training distributed, users have to run the training function under XGBoost's CommunicatorContext context manager.
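For reference, that pattern looks roughly like the sketch below (assuming xgboost's collective.CommunicatorContext, available in xgboost >= 1.7; the config keys and communicator arguments are placeholders that the backend would normally assemble per worker):

import xgboost as xgb
from xgboost.collective import CommunicatorContext

def train_func(config):
    # Placeholder: tracker address, world size, rank, etc. would be
    # provided per worker by the backend, not hard-coded by the user.
    communicator_args = config["communicator_args"]
    with CommunicatorContext(**communicator_args):
        dtrain = xgb.DMatrix(config["train_dataset_path"])
        booster = xgb.train(config["xgboost_params"], dtrain)

With this PR, the backend can enter CommunicatorContext around train_func instead of asking users to write this wrapper themselves.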

Use case 3: LightGBM sets environment variables
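The description stops at the heading here; as a sketch of the same pattern, a LightGBM backend context manager could set worker-specific environment variables on entry and restore them on exit (the variable name below is purely hypothetical):

import contextlib
import os

@contextlib.contextmanager
def lightgbm_context_manager():
    # Hypothetical variable; the real backend would set whatever
    # LightGBM's distributed setup needs for this worker.
    new_vars = {"LIGHTGBM_EXAMPLE_VAR": "1"}
    old_vars = {k: os.environ.get(k) for k in new_vars}
    os.environ.update(new_vars)
    try:
        yield
    finally:
        # Restore the previous environment on teardown.
        for k, v in old_vars.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v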

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@woshiyyya changed the title from "[Train] Enable calling backend0specific prologue before executing train_func." to "[Train] Enable calling backend-specific prologue before executing train_func." on Feb 15, 2024
@woshiyyya changed the title from "[Train] Enable calling backend-specific prologue before executing train_func." to "[Train] Enable calling backend-specific setup function before executing train_func." on Feb 16, 2024
@woshiyyya changed the title from "[Train] Enable calling backend-specific setup function before executing train_func." to "[Train] Add backend-specific context manager for train_func." on Feb 16, 2024
@woshiyyya marked this pull request as ready for review on February 16, 2024 19:37
@justinvyu (Contributor) left a comment:

Thanks!

I do like this and am in favor of adding it to the BackendConfig developer API, but I am also worried that we are doing some magical stuff behind the scenes that is not explicit to the user. We should be strict in terms of what we put in this default context manager -- otherwise we'll end up with a bunch of implicit behavior that users aren't aware of.

Some alternatives to consider and discuss pros/cons before merging this PR:

  1. Have these decorators as utilities that users should call explicitly.
  2. Don't have any default setup/teardown and show users how to achieve certain things like setting default cuda device in documentation.

(Outdated review threads on python/ray/train/_internal/utils.py and python/ray/train/data_parallel_trainer.py were resolved.)
@@ -16,6 +16,19 @@
logger = logging.getLogger(__name__)


class TorchConfigContextManager:
A Contributor commented:

Can we actually swap to the function-style context manager so that it's easier to reuse existing contexts?

@contextlib.contextmanager
def torch_context_manager():
    # some other setup
    with torch.device(ray.train.torch.get_device()):
        yield
    # some other teardown

@contextlib.contextmanager
def xgboost_context_manager():
    # some other setup
    with CommunicatorContext():
        yield
    # some other teardown

@woshiyyya (Member, Author) replied on Feb 16, 2024:

I think it's fine, since you can return either a function-based or a class-based context manager.

def train_func_context(self):
    # Decorate the inner generator so it can be used in a `with` statement.
    @contextlib.contextmanager
    def func_based_ctx_mgr():
        ...
        yield
        ...
    return func_based_ctx_mgr

Alternatively, to reuse an existing context manager, we can subclass it as below:

class InnerContextManager:
    def __enter__(self):
        print("Entering InnerContextManager")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        print("Exiting InnerContextManager")
        return False

class OuterContextManager(InnerContextManager):
    def __enter__(self):
        print("Entering OuterContextManager")
        super().__enter__()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        super().__exit__(exc_type, exc_val, exc_tb)
        print("Exiting OuterContextManager")
        return False 
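
For illustration, nesting then works as expected with this subclassing approach:

with OuterContextManager():
    print("running train_func")

# Output:
# Entering OuterContextManager
# Entering InnerContextManager
# running train_func
# Exiting InnerContextManager
# Exiting OuterContextManager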

(See the review thread on python/ray/train/backend.py.)
woshiyyya and others added 4 commits February 16, 2024 14:36
@justinvyu (Contributor) left a comment:

Possible to add a quick sentence to the PR description about the decision to call this for the user rather than expose it as a utility that the user can call themselves?

        torch.cuda.set_device(device)

    def __exit__(self, type, value, traceback):
        # Propagate exceptions if any
A Contributor commented:

nit: I think returning True is only needed if you want to suppress exceptions: https://docs.python.org/3/reference/datamodel.html#object.__exit__
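
For reference, a minimal illustration of the __exit__ return-value contract (standard Python data model, not code from this PR):

class SuppressErrors:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Returning True marks the exception as handled; returning False
        # (or None) lets it propagate out of the with block.
        return True


with SuppressErrors():
    raise ValueError("this is swallowed")
print("execution continues")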

@woshiyyya (Member, Author) replied:

PR description updated!

Oh, actually we are not suppressing the exceptions, since they're caught in the outer layer here: https://github.com/ray-project/ray/pull/43209/files#diff-8b259b33153d078b025da24134ff3b897aa1227d287d2ad38a1d2f11afb7d213R154

@woshiyyya (Member, Author) commented:

I updated the PR description.

BackendConfig is a developer API, so it's safe and won't be exposed to users.

@woshiyyya (Member, Author) commented on Feb 21, 2024:

Currently we have multiple ways to do initialization before calling the user's train_func:

  • Using this backend-specific context manager
  • Using a predefined training loop (e.g. LightGBM, XGBoost)
  • Using Backend.on_start + Backend.on_training_start

In the future design, we need a more unified way to do this initialization. One possible approach is to store all state in a global context and wrap the initialization logic in a context manager around train_func. This ensures that the initialization logic executes in the same thread as train_func.
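
Concretely, the worker-side wiring could look roughly like the sketch below (run_train_func and backend_config are illustrative names, following the train_func_context shape discussed above, not the exact implementation in this PR):

def run_train_func(backend_config, train_func):
    # The backend-specific context manager wraps the user's train_func,
    # so its setup and teardown run in the same thread as the training code.
    train_func_context = backend_config.train_func_context()
    with train_func_context():
        train_func()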

@matthewdeng merged commit 852e9f0 into ray-project:master on Feb 21, 2024 (9 checks passed).