
[train] Simplify ray.train.xgboost/lightgbm (2/n): Re-implement XGBoostTrainer as a lightweight DataParallelTrainer #42767

Merged
45 commits merged into ray-project:master on Feb 23, 2024

Conversation

justinvyu (Contributor) commented Jan 27, 2024

TL;DR

This PR re-implements XGBoostTrainer as a DataParallelTrainer that does not use xgboost_ray under the hood, in an effort to unify the trainer implementations and remove that external dependency.

Motivation

  1. High maintenance burden that requires a release process every time incompatibilities come up between Ray and the latest xgboost_ray/lightgbm_ray version.
    • For example, see the last few months of commit history.
    • Here’s what happens: a change in Ray (e.g. a Ray Tune deprecation/moved package) causes the latest release of xgboost_ray to no longer work → ray.train.xgboost.XGBoostTrainer breaks → Ray Train team needs to patch a fix in this separate package, make a release, then update the pinned package version in CI.
    • This is all manual, so the entire process, including getting CI to pass, takes at least 2 hours.
  2. Reducing code complexity: Directly using xgboost_ray introduces significant code complexity.
    XGBoostTrainer and LightGBMTrainer are data parallel trainers, but they go through a completely different code path instead of sharing the DataParallelTrainer implementation.
    • After this migration, all Ray Train entrypoints will be using the DataParallelTrainer execution logic.
  3. Usability: The current xgboost and lightgbm trainers are hard to use because they are a pass-through API shell on top of the xgboost.train API that people are familiar with.
    • Let’s use this opportunity to cut down these bulky external packages to a simple, lightweight integration in Ray Train, where users need to change minimal code to distribute their workload.
    • This is the same motivation as the TorchTrainer unification effort.
  4. Minor point: Removing duplicate logic
    • xgboost_ray and lightgbm_ray are designed to be run independently, so they implement their own execution loop with resource scheduling logic and error handling. There is a huge overlap between these external libraries and Tune, and it's very difficult to navigate between the two codebases as a maintainer.
  5. See more reasons here.

PR Summary

  1. Introduce simplified ray.train.xgboost.v2.XGBoostTrainer and ray.train.lightgbm.v2.LightGBMTrainer that do not depend on xgboost_ray and lightgbm_ray.
    • These will have a different API compared to the existing trainers, since they are subclasses of DataParallelTrainer: users pass in their own training function (see the sketch after this list).
  2. Re-implement the existing ray.train.xgboost.XGBoostTrainer and ray.train.lightgbm.LightGBMTrainer on top of the v2 counterparts.
    • We are not going to force an API migration immediately -- let's wait for this new implementation to stabilize for 1 release before making that decision.
  3. Remove the xgboost_ray and lightgbm_ray dependencies from Ray Train starting from Ray 2.10. (Will do in a follow-up PR.)
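
To make the new usage concrete, here is a rough sketch of what training with the v2 trainer might look like. The dataset, column names, and params are illustrative; the shard-loading pattern mirrors the snippet shown later in this thread, and details such as the collective communicator setup discussed below are omitted.

import xgboost

import ray.data
import ray.train
from ray.train import ScalingConfig
from ray.train.xgboost.v2 import XGBoostTrainer


def train_func():
    # Each worker reads its shard of the Ray Dataset and builds a native DMatrix.
    train_ds = ray.train.get_dataset_shard("train")
    train_df = train_ds.to_pandas()
    X, y = train_df.drop("y", axis=1), train_df["y"]
    dtrain = xgboost.DMatrix(X, label=y)

    # Users call the familiar xgboost.train API directly in their own training loop.
    xgboost.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=10)


trainer = XGBoostTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": ray.data.from_items([{"x": float(i), "y": 2.0 * i} for i in range(64)])},
)
result = trainer.fit()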

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@woshiyyya (Member) commented Jan 29, 2024

Very neat solution! I have some general questions:

  • Would we support or remove support for elastic training? (Not sure how many users are using it, but my intuition is that we are removing this feature.)
  • Shall we provide a simple API for users to get an xgboost.DMatrix? I noticed that in xgboost_ray, there is no further processing logic in the train_func after passing the RayDMatrix into train().
  • What would the checkpointing code look like? Will that be an XGBoost native callback that calls ray.train.report on some hooks?

@justinvyu (author) commented Jan 29, 2024

@woshiyyya

Would we support or remove support for elastic training? (not sure how many users are using it, but my intuition is we are removing this feature)

  • This PR would remove the elastic training part. I'm not actually sure if the old XGBoostTrainer even allows usage of the elastic training feature. xgboost_ray re-implements the execution loop, which waits on distributed worker futures and launches new workers when one fails.
  • However, a worker failure would get caught by Ray Train, and the entire placement group would get removed and restarted. Or, this would hang if the user includes a ray.train.report call in our provided callback.
  • We can revisit xgb elastic training in the future along with other data parallel trainers.

Shall we provide a simple API for users to get an xgboost.DMatrix? I noticed that in xgboost_ray, there is no further processing logic in the train_func after passing the RayDMatrix into train().

  • I don't think we need to add another API on top of DMatrix. RayDMatrix is an abstraction used to shard the dataset across multiple Ray workers, but we don't need that since Ray Data already does it, so I don't want to keep RayDMatrix around. Plus, Ray Data is the only data ingestion method that we recommend for XGBoostTrainer in the first place.
  • Here's what regular xgboost.DMatrix usage looks like.
  • Here's what it looks like with Ray Train + Ray Data. It matches up with the native way of using xgboost.
train_ds = ray.train.get_dataset_shard("train")
train_df = train_ds.to_pandas()
X, y = train_df.drop("y", axis=1), train_df["y"]
dtrain = xgboost.DMatrix(X, label=y)

What would the checkpointing code look like? Will that be an XGBoost native callback that calls ray.train.report on some hooks?

Two options:

  1. TuneReportCheckpointCallback, which is already our recommendation; users can implement their own if needed, similar to the lightning/transformers integrations.
  2. Call ray.train.report manually by iteratively training more models:
bst_model = None
num_boost_rounds_per_iter = ...  # e.g., 10 boosting rounds per reported iteration
for i in range(num_iters):
    bst_model = xgboost.train(
        ...,
        xgb_model=bst_model,  # continue boosting from bst_model
        num_boost_round=num_boost_rounds_per_iter,
    )
    ray.train.report(..., checkpoint=...)
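
For concreteness, a hedged sketch of what option 2 might look like with the checkpointing details filled in. The objective, round counts, metric name, and file name are illustrative choices, not a prescribed API.

import os
import tempfile

import numpy as np
import xgboost

import ray.train
from ray.train import Checkpoint


def train_func():
    # Placeholder training data; in practice this comes from the worker's dataset shard.
    dtrain = xgboost.DMatrix(np.random.rand(64, 4), label=np.random.rand(64))

    bst_model = None
    num_boost_rounds_per_iter = 10  # illustrative reporting granularity
    for i in range(5):
        bst_model = xgboost.train(
            {"objective": "reg:squarederror"},
            dtrain,
            xgb_model=bst_model,  # continue boosting from the previous model
            num_boost_round=num_boost_rounds_per_iter,
        )
        # Save the booster to a temporary directory and report it as a Ray Train checkpoint.
        with tempfile.TemporaryDirectory() as tmpdir:
            bst_model.save_model(os.path.join(tmpdir, "model.json"))
            ray.train.report(
                {"boost_rounds": (i + 1) * num_boost_rounds_per_iter},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )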

@justinvyu (author) commented Jan 29, 2024

Here's a summary of the enhancements achieved by this proposal, once we fully migrate to the v2.XGBoostTrainer.

| Feature | Status | Notes |
| --- | --- | --- |
| Elastic training | ↔️ | The elastic training implementation is no longer attached to XGBoostTrainer, but this is not technically a regression, since it didn't work before anyways. |
| Checkpointing | ↗️ | Still the same as before, except checkpoint loading / post-processing logic is even more flexible now. Before, everything had to go through the xgboost callback. |
| Ray Data integration | ↗️ | Previously, the integration was mostly done at the RayDMatrix level instead of through the existing DataParallelTrainer logic. Now, the integration is unified across more trainers. This enables the streaming data implementation to be used with xgboost's experimental iterator-based data loading feature (see the sketch below this comment). |
| Future xgboost features | ↗️ | Many features become easily accessible since the user has control of the training loop, including a federated distributed learning backend, iterator-based DMatrix loading, multi-output classification, and whatever else xgboost adds in the future. |
| Usability | ↗️ | The current XGBoostTrainer is 2 unnecessary layers on top of the native xgboost.train API that people are familiar with. You have to pass configs through XGBoostTrainer(params, label_column, **train_kwargs), which are then passed to xgboost_ray.train, which in turn passes a bunch of arguments through **kwargs to xgboost.train. This is really hard to use and can't utilize editor auto-complete. It's easier to let the user call xgboost.train directly. |

Let me know if there are any "regressions" that I'm missing with this change.
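
For context on the iterator-based data loading mentioned in the table, here is a rough standalone sketch of xgboost's DataIter interface; how it would be wired up to a Ray Data iterator is not shown here and is only an assumption about future work.

import numpy as np
import xgboost


class BatchIterator(xgboost.DataIter):
    # Minimal iterator that feeds pre-split (X, y) batches into xgboost one at a time.
    def __init__(self, batches):
        self._batches = batches
        self._i = 0
        super().__init__()

    def next(self, input_data):
        if self._i >= len(self._batches):
            return 0  # signal that there are no more batches
        X, y = self._batches[self._i]
        input_data(data=X, label=y)
        self._i += 1
        return 1

    def reset(self):
        self._i = 0


batches = [(np.random.rand(8, 4), np.random.rand(8)) for _ in range(4)]
dtrain = xgboost.QuantileDMatrix(BatchIterator(batches))
booster = xgboost.train({"tree_method": "hist"}, dtrain, num_boost_round=5)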

python/ray/train/xgboost/config.py (resolved review thread)
# Set up the rabit tracker on the Train driver.
num_workers = len(worker_group)
self.rabit_args = {"DMLC_NUM_WORKER": num_workers}
train_driver_ip = ray.util.get_node_ip_address()
Contributor:
Should this be this IP or the rank 0 worker?

justinvyu (author):

The rabit process should be on the "driver", which in this case is the Trainer. All ranks connect to the driver rabit process.

Take a look at how this dask distributed xgboost test sets it up: https://github.com/dmlc/xgboost/blob/662854c7d75ef1ec543ee0db73098227de5be59c/tests/test_distributed/test_with_dask/test_with_dask.py#L1619-L1654
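
To illustrate the setup being described (the values below are placeholders, not the PR's actual code): the tracker runs on the Train driver node, and each worker only needs a few standard DMLC_* environment variables to connect back to it.

import os

# Placeholder values for illustration only.
num_workers = 4
train_driver_ip = "10.0.0.1"  # node where the Train driver (and the rabit tracker) runs
tracker_port = 9091           # port chosen when the tracker was started

os.environ.update(
    {
        "DMLC_NUM_WORKER": str(num_workers),
        "DMLC_TRACKER_URI": train_driver_ip,
        "DMLC_TRACKER_PORT": str(tracker_port),
    }
)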

Comment on lines 182 to 184
from xgboost.collective import CommunicatorContext

with CommunicatorContext():
Contributor:

Is this needed?

justinvyu (author) commented Feb 2, 2024:

Yup, this is the thing that actually connects the worker to the collective group (kind of like torch.init_process_group).

Usually, you need to add a bunch of args in here, but the environment variables that I set above take care of that. We could consider making this a ray train utility, but I feel like keeping the native usage is pretty simple.

I was trying to do it for the user here, but that didn't end up working, since the context needs to be directly wrapping the user code it seems.
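
A minimal sketch of the usage being described, assuming the DMLC_* environment variables have already been set by the backend config so that CommunicatorContext can be entered without arguments; the data and params here are placeholders.

import numpy as np
import xgboost
from xgboost.collective import CommunicatorContext


def train_func():
    X, y = np.random.rand(32, 4), np.random.rand(32)
    dtrain = xgboost.DMatrix(X, label=y)

    # Entering the context joins this worker to the collective group, similar in spirit
    # to torch.distributed.init_process_group; the user's training code runs inside it.
    with CommunicatorContext():
        xgboost.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=5)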

Contributor:

Oh I see. @woshiyyya brought up a similar thing for Torch where we might want to set the torch device, which also needs to modify the user code since it has to be run in the same thread.

Maybe we can have some sort of decorator abstraction that surrounds the user's train function?

woshiyyya (Member) commented Feb 2, 2024:

Yes. I was trying to set the torch CUDA device by default, but it only works when we call it inside the training function. I am thinking we can have a function decorator, so that we can inject some environment setup in an elegant way.

e.g.

@ray.train.context(framework="xgboost")
def train_func():
    ...

(Need a better name for this decorator...)
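
A hypothetical sketch of that idea (the decorator name and behavior are made up for illustration, not an existing Ray API): the wrapper runs framework-specific setup in the same thread as the user's code.

import functools

from xgboost.collective import CommunicatorContext


def train_context(framework: str):
    # Hypothetical decorator: wraps the user's train_func so that environment setup
    # (here, joining the xgboost collective group) happens around the user code.
    def decorator(train_func):
        @functools.wraps(train_func)
        def wrapper(*args, **kwargs):
            if framework == "xgboost":
                with CommunicatorContext():
                    return train_func(*args, **kwargs)
            return train_func(*args, **kwargs)

        return wrapper

    return decorator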

Member:

Seems that it's more crucial for the new XGBoostTrainer API. We could consider having this decorator so users don't have to think about it.

justinvyu (author):

Interesting, we may be able to get rid of the Trainers and mirror the Ray Core API more. 😆

matthewdeng (Contributor) left a review comment:

So clean!

python/ray/train/xgboost/config.py (3 resolved review threads)

justinvyu changed the title from "[WIP][train] Simplify XGBoostTrainer as a lightweight DataParallelTrainer" to "[train] Simplify ray.train.xgboost/lightgbm (2/n): Re-implement XGBoostTrainer as a lightweight DataParallelTrainer" on Feb 15, 2024
@@ -20,6 +20,7 @@ def __init__(
self,
datasets_to_split: Union[Literal["all"], List[str]] = "all",
execution_options: Optional[ExecutionOptions] = None,
convert_to_data_iterator: bool = True,
Member:

Nit: This seems unclear for people who don't know what data_iterator refers to. Consider renaming it to streaming_execution=True or materialize=False?

justinvyu (author):

Update: we're gonna have the user call DataIterator.materialize instead.

python/ray/train/xgboost/xgboost_trainer.py (resolved review thread)
Comment on lines 82 to 87
# Ranks are assigned in increasing order of the worker's task id.
# This task id will be sorted by increasing world rank.
os.environ["DMLC_TASK_ID"] = (
f"[xgboost.ray-rank={ray.train.get_context().get_world_rank()}]:"
f"{ray.get_runtime_context().get_actor_id()}"
)
Contributor:

Is there a strict interface that this needs to follow?


Contributor:

Hmm so both here and in Dask it'll end up sorting by the rank strings rather than integers. I guess that's fine if Dask is doing it, but maybe we can update the documentation to match? (Or we can prepend some zeros to the world rank).

justinvyu (author):

Oh yeah, string comparison will mess up if you have more than 10 workers. I do actually want the ranks to all match up, so prepending a few 0s makes sense.
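
For illustration, one way the zero-padding could look inside the worker setup (the padding width of 8 is an arbitrary choice for this sketch, not the final implementation):

import os

import ray
import ray.train

world_rank = ray.train.get_context().get_world_rank()
# Zero-pad the rank so lexicographic ordering of DMLC_TASK_ID matches numeric rank order
# even with 10+ workers.
os.environ["DMLC_TASK_ID"] = (
    f"[xgboost.ray-rank={world_rank:08d}]:{ray.get_runtime_context().get_actor_id()}"
)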

python/ray/train/xgboost/v2.py (resolved review thread)
python/ray/train/xgboost/xgboost_trainer.py (resolved review thread)
Comment on lines 156 to 161
# TODO(justinvyu): [Deprecated] Remove in 2.11
if dmatrix_params != _DEPRECATED_VALUE:
raise DeprecationWarning(
"`dmatrix_params` is deprecated, since XGBoostTrainer no longer "
"depends on the `xgboost_ray.RayDMatrix` utility."
)
Contributor:

Is there any alternative needed here for functional parity?

justinvyu (author):

I think the closest thing would be passing these params into the xgboost.DMatrix(...).

This would be a new feature though, since the original usage was to pass extra params as xgboost_ray.RayDMatrix constructor args, but none of those apply anymore.

Let's just keep it as deprecated?
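
A small hypothetical illustration of that alternative: parameters that were previously forwarded to RayDMatrix would instead be passed by the user directly to xgboost.DMatrix inside their training function.

import numpy as np
import xgboost

X, y = np.random.rand(16, 4), np.random.rand(16)
dtrain = xgboost.DMatrix(
    X,
    label=y,
    missing=np.nan,                             # value to treat as missing
    weight=np.ones(len(y)),                     # per-row sample weights
    feature_names=[f"f{i}" for i in range(4)],  # optional feature names
)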

python/ray/train/xgboost/xgboost_trainer.py (resolved review thread)
python/ray/train/tests/test_xgboost_trainer.py (resolved review thread)
python/ray/train/xgboost/xgboost_trainer.py (resolved review thread)
matthewdeng (Contributor) left a review comment:


[approval GIF via Minions on GIPHY]

eval_X, eval_y = eval_df.drop(label_column, axis=1), eval_df[label_column]
evals.append((xgboost.DMatrix(eval_X, label=eval_y), eval_name))

with CommunicatorContext():
Contributor:

Will we (eventually) move this into the train_func_context?

justinvyu (author):

Yes, I'll add that in a followup!

justinvyu merged commit 62dbcb2 into ray-project:master on Feb 23, 2024
8 of 9 checks passed
justinvyu deleted the simplify_xgb branch on February 23, 2024 at 18:49