
[train] Simplify ray.train.xgboost/lightgbm (2/n): Re-implement XGBoostTrainer as a lightweight DataParallelTrainer #42767

Merged
45 commits merged into ray-project:master on Feb 23, 2024

Conversation

justinvyu (Contributor) commented Jan 27, 2024

TL;DR

This PR re-implements XGBoostTrainer as a DataParallelTrainer that does not use xgboost_ray under the hood, in an effort to unify the trainer implementations and remove that external dependency.

Motivation

  1. High maintenance burden that requires a release process every time incompatibilities come up between Ray and the latest xgboost_ray/lightgbm_ray version.
    • For example, see the last few months of commit history.
    • Here’s what happens: a change in Ray (e.g. a Ray Tune deprecation/moved package) causes the latest release of xgboost_ray to no longer work → ray.train.xgboost.XGBoostTrainer breaks → Ray Train team needs to patch a fix in this separate package, make a release, then update the pinned package version in CI.
    • This is all manual, so the entire process, including getting CI to pass, takes at least 2 hours.
  2. Reducing code complexity: Directly using xgboost_ray introduces significant code complexity.
    XGBoostTrainer and LightGBMTrainer are data parallel trainers, but they go through a completely different code path instead of sharing the DataParallelTrainer implementation.
    • After this migration, all Ray Train entrypoints will be using the DataParallelTrainer execution logic.
  3. Usability: The current xgboost and lightgbm trainers are hard to use because they are a pass-through API shell on top of the xgboost.train API that people are familiar with.
    • Let’s use this opportunity to cut down these bulky external packages to a simple, lightweight integration in Ray Train, where users need to change minimal code to distribute their workload.
    • This is the same motivation as the TorchTrainer unification effort.
  4. Minor point: Removing duplicate logic
    • xgboost_ray and lightgbm_ray are designed to be run independently, so they implement their own execution loop with resource scheduling logic and error handling. There is a huge overlap between these external libraries and Tune, and it's very difficult to navigate between the two codebases as a maintainer.
  5. See more reasons here.

PR Summary

  1. Introduce simplified ray.train.xgboost.v2.XGBoostTrainer and ray.train.lightgbm.v2.LightGBMTrainer that do not depend on xgboost_ray and lightgbm_ray.
    • These will have a different API compared to the existing trainers, since they are subclasses of DataParallelTrainer: users pass in their own training function (see the sketch after this list).
  2. Re-implement the existing ray.train.xgboost.XGBoostTrainer and ray.train.lightgbm.LightGBMTrainer on top of the v2 counterparts.
    • We are not going to force an API migration immediately -- let's wait for this new implementation to stabilize for 1 release before making that decision.
  3. Remove the xgboost_ray and lightgbm_ray dependencies from Ray Train starting from Ray 2.10. (Will do in a follow-up PR.)
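
To make the new usage concrete, here is a rough sketch of what training with the v2 trainer might look like. The dataset, column names, and params are illustrative; the shard-loading pattern mirrors the snippet shown later in this thread, and details such as the collective communicator setup discussed below are omitted.

import xgboost

import ray.data
import ray.train
from ray.train import ScalingConfig
from ray.train.xgboost.v2 import XGBoostTrainer


def train_func():
    # Each worker reads its shard of the Ray Dataset and builds a native DMatrix.
    train_ds = ray.train.get_dataset_shard("train")
    train_df = train_ds.to_pandas()
    X, y = train_df.drop("y", axis=1), train_df["y"]
    dtrain = xgboost.DMatrix(X, label=y)

    # Users call the familiar xgboost.train API directly in their own training loop.
    xgboost.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=10)


trainer = XGBoostTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": ray.data.from_items([{"x": float(i), "y": 2.0 * i} for i in range(64)])},
)
result = trainer.fit()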

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@woshiyyya (Member) commented Jan 29, 2024

Very neat solution! I have some general questions:

  • Would we support or remove support for elastic training? (Not sure how many users are using it, but my intuition is that we are removing this feature.)
  • Shall we provide a simple API for users to get an xgboost.DMatrix? I noticed that in xgboost_ray, there is no further processing logic in the train_func after passing the RayDMatrix into train().
  • What would the checkpointing code look like? Will that be an XGBoost native callback that calls ray.train.report on some hooks?

@justinvyu (author) commented Jan 29, 2024

@woshiyyya

Would we support or remove support for elastic training? (not sure how many users are using it, but my intuition is we are removing this feature)

  • This PR would remove the elastic training part. I'm not actually sure if the old XGBoostTrainer even allows usage of the elastic training feature. xgboost_ray re-implements the execution loop, which waits on distributed worker futures and launches new workers when one fails.
  • However, a worker failure would get caught by Ray Train, and the entire placement group would get removed and restarted. Or, this would hang if the user includes a ray.train.report call in our provided callback.
  • We can revisit xgb elastic training in the future along with other data parallel trainers.

Shall we provide a simple API for users to get an xgboost.DMatrix? I noticed that in xgboost_ray, there is no further processing logic in the train_func after passing the RayDMatrix into train().

  • I don't think we need to add another API on top of DMatrix. RayDMatrix is an abstraction used to shard the dataset across multiple Ray workers, but we don't need that since Ray Data already does it, so I don't want to keep RayDMatrix around. Plus, Ray Data is the only data ingestion method that we recommend for XGBoostTrainer in the first place.
  • Here's what regular xgboost.DMatrix usage looks like.
  • Here's what it looks like with Ray Train + Ray Data. It matches up with the native way of using xgboost.
train_ds = ray.train.get_dataset_shard("train")
train_df = train_ds.to_pandas()
X, y = train_df.drop("y", axis=1), train_df["y"]
dtrain = xgboost.DMatrix(X, label=y)

What would the checkpointing code look like? Will that be an XGBoost native callback that calls ray.train.report on some hooks?

Two options:

  1. TuneReportCheckpointCallback, which is already our recommendation; users can implement their own if needed, similar to the lightning/transformers integrations.
  2. Call ray.train.report manually by iteratively training more models:
bst_model = None
num_boost_rounds_per_iter = ...  # e.g., 10 boosting rounds per reported iteration
for i in range(num_iters):
    bst_model = xgboost.train(
        ...,
        xgb_model=bst_model,  # continue boosting from bst_model
        num_boost_round=num_boost_rounds_per_iter,
    )
    ray.train.report(..., checkpoint=...)
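
For concreteness, a hedged sketch of what option 2 might look like with the checkpointing details filled in. The objective, round counts, metric name, and file name are illustrative choices, not a prescribed API.

import os
import tempfile

import numpy as np
import xgboost

import ray.train
from ray.train import Checkpoint


def train_func():
    # Placeholder training data; in practice this comes from the worker's dataset shard.
    dtrain = xgboost.DMatrix(np.random.rand(64, 4), label=np.random.rand(64))

    bst_model = None
    num_boost_rounds_per_iter = 10  # illustrative reporting granularity
    for i in range(5):
        bst_model = xgboost.train(
            {"objective": "reg:squarederror"},
            dtrain,
            xgb_model=bst_model,  # continue boosting from the previous model
            num_boost_round=num_boost_rounds_per_iter,
        )
        # Save the booster to a temporary directory and report it as a Ray Train checkpoint.
        with tempfile.TemporaryDirectory() as tmpdir:
            bst_model.save_model(os.path.join(tmpdir, "model.json"))
            ray.train.report(
                {"boost_rounds": (i + 1) * num_boost_rounds_per_iter},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )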

@justinvyu (author) commented Jan 29, 2024

Here's a summary of the enhancements achieved by this proposal, once we fully migrate to the v2.XGBoostTrainer.

| Feature | Status | Notes |
| --- | --- | --- |
| Elastic training | ↔️ | The elastic training implementation is no longer attached to XGBoostTrainer, but this is not technically a regression, since it didn't work before anyways. |
| Checkpointing | ↗️ | Still the same as before, except checkpoint loading / post-processing logic is even more flexible now. Before, everything had to go through the xgboost callback. |
| Ray Data integration | ↗️ | Previously, the integration was mostly done at the RayDMatrix level instead of through the existing DataParallelTrainer logic. Now, the integration is unified across more trainers. This enables the streaming data implementation to be used with xgboost's experimental iterator-based data loading feature (see the sketch below this comment). |
| Future xgboost features | ↗️ | Many features become easily accessible since the user has control of the training loop, including a federated distributed learning backend, iterator-based DMatrix loading, multi-output classification, and whatever else xgboost adds in the future. |
| Usability | ↗️ | The current XGBoostTrainer is 2 unnecessary layers on top of the native xgboost.train API that people are familiar with. You have to pass configs through XGBoostTrainer(params, label_column, **train_kwargs), which are then passed to xgboost_ray.train, which in turn passes a bunch of arguments through **kwargs to xgboost.train. This is really hard to use and can't utilize editor auto-complete. It's easier to let the user call xgboost.train directly. |

Let me know if there are any "regressions" that I'm missing with this change.
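
For context on the iterator-based data loading mentioned in the table, here is a rough standalone sketch of xgboost's DataIter interface; how it would be wired up to a Ray Data iterator is not shown here and is only an assumption about future work.

import numpy as np
import xgboost


class BatchIterator(xgboost.DataIter):
    # Minimal iterator that feeds pre-split (X, y) batches into xgboost one at a time.
    def __init__(self, batches):
        self._batches = batches
        self._i = 0
        super().__init__()

    def next(self, input_data):
        if self._i >= len(self._batches):
            return 0  # signal that there are no more batches
        X, y = self._batches[self._i]
        input_data(data=X, label=y)
        self._i += 1
        return 1

    def reset(self):
        self._i = 0


batches = [(np.random.rand(8, 4), np.random.rand(8)) for _ in range(4)]
dtrain = xgboost.QuantileDMatrix(BatchIterator(batches))
booster = xgboost.train({"tree_method": "hist"}, dtrain, num_boost_round=5)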

python/ray/train/xgboost/config.py (resolved review thread)
# Set up the rabit tracker on the Train driver.
num_workers = len(worker_group)
self.rabit_args = {"DMLC_NUM_WORKER": num_workers}
train_driver_ip = ray.util.get_node_ip_address()
Contributor:
Should this be this IP or the rank 0 worker?

justinvyu (author):

The rabit process should be on the "driver", which in this case is the Trainer. All ranks connect to the driver rabit process.

Take a look at how this dask distributed xgboost test sets it up: https://github.com/dmlc/xgboost/blob/662854c7d75ef1ec543ee0db73098227de5be59c/tests/test_distributed/test_with_dask/test_with_dask.py#L1619-L1654
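
To illustrate the setup being described (the values below are placeholders, not the PR's actual code): the tracker runs on the Train driver node, and each worker only needs a few standard DMLC_* environment variables to connect back to it.

import os

# Placeholder values for illustration only.
num_workers = 4
train_driver_ip = "10.0.0.1"  # node where the Train driver (and the rabit tracker) runs
tracker_port = 9091           # port chosen when the tracker was started

os.environ.update(
    {
        "DMLC_NUM_WORKER": str(num_workers),
        "DMLC_TRACKER_URI": train_driver_ip,
        "DMLC_TRACKER_PORT": str(tracker_port),
    }
)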

Comment on lines 182 to 184
from xgboost.collective import CommunicatorContext

with CommunicatorContext():
Contributor:

Is this needed?

justinvyu (author) commented Feb 2, 2024:

Yup, this is the thing that actually connects the worker to the collective group (kind of like torch.init_process_group).

Usually, you need to add a bunch of args in here, but the environment variables that I set above take care of that. We could consider making this a ray train utility, but I feel like keeping the native usage is pretty simple.

I was trying to do it for the user here, but that didn't end up working, since the context needs to be directly wrapping the user code it seems.
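
A minimal sketch of the usage being described, assuming the DMLC_* environment variables have already been set by the backend config so that CommunicatorContext can be entered without arguments; the data and params here are placeholders.

import numpy as np
import xgboost
from xgboost.collective import CommunicatorContext


def train_func():
    X, y = np.random.rand(32, 4), np.random.rand(32)
    dtrain = xgboost.DMatrix(X, label=y)

    # Entering the context joins this worker to the collective group, similar in spirit
    # to torch.distributed.init_process_group; the user's training code runs inside it.
    with CommunicatorContext():
        xgboost.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=5)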

Contributor:

Oh I see. @woshiyyya brought up a similar thing for Torch where we might want to set the torch device, which also needs to modify the user code since it has to be run in the same thread.

Maybe we can have some sort of decorator abstraction that surrounds the user's train function?

woshiyyya (Member) commented Feb 2, 2024:

Yes. I was trying to set the torch CUDA device by default, but it only works when we call it inside the training function. I am thinking we can have a function decorator, so that we can inject some environment setup in an elegant way.

e.g.

@ray.train.context(framework="xgboost")
def train_func():
    ...

(Need a better name for this decorator...)
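
A hypothetical sketch of that idea (the decorator name and behavior are made up for illustration, not an existing Ray API): the wrapper runs framework-specific setup in the same thread as the user's code.

import functools

from xgboost.collective import CommunicatorContext


def train_context(framework: str):
    # Hypothetical decorator: wraps the user's train_func so that environment setup
    # (here, joining the xgboost collective group) happens around the user code.
    def decorator(train_func):
        @functools.wraps(train_func)
        def wrapper(*args, **kwargs):
            if framework == "xgboost":
                with CommunicatorContext():
                    return train_func(*args, **kwargs)
            return train_func(*args, **kwargs)

        return wrapper

    return decorator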

Member:

Seems that it's more crucial for the new XGBoostTrainer API. We could consider having this decorator so users don't have to think about it.

justinvyu (author):

Interesting, we may be able to get rid of the Trainers and mirror the Ray Core API more. 😆

matthewdeng (Contributor) left a review comment:

So clean!

python/ray/train/xgboost/config.py (3 resolved review threads)

justinvyu changed the title from "[WIP][train] Simplify XGBoostTrainer as a lightweight DataParallelTrainer" to "[train] Simplify ray.train.xgboost/lightgbm (2/n): Re-implement XGBoostTrainer as a lightweight DataParallelTrainer" on Feb 15, 2024
@@ -20,6 +20,7 @@ def __init__(
self,
datasets_to_split: Union[Literal["all"], List[str]] = "all",
execution_options: Optional[ExecutionOptions] = None,
convert_to_data_iterator: bool = True,
Member:

Nit: This seems unclear for people who don't know what data_iterator refers to. Consider renaming it to streaming_execution=True or materialize=False?

justinvyu (author):

Update: we're gonna have the user call DataIterator.materialize instead.

python/ray/train/xgboost/xgboost_trainer.py (resolved review thread)
Comment on lines 82 to 87
# Ranks are assigned in increasing order of the worker's task id.
# This task id will be sorted by increasing world rank.
os.environ["DMLC_TASK_ID"] = (
f"[xgboost.ray-rank={ray.train.get_context().get_world_rank()}]:"
f"{ray.get_runtime_context().get_actor_id()}"
)
Contributor:

Is there a strict interface that this needs to follow?


Contributor:

Hmm so both here and in Dask it'll end up sorting by the rank strings rather than integers. I guess that's fine if Dask is doing it, but maybe we can update the documentation to match? (Or we can prepend some zeros to the world rank).

justinvyu (author):

Oh yeah, string comparison will mess up if you have more than 10 workers. I do actually want the ranks to all match up, so prepending a few 0s makes sense.
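
For illustration, one way the zero-padding could look inside the worker setup (the padding width of 8 is an arbitrary choice for this sketch, not the final implementation):

import os

import ray
import ray.train

world_rank = ray.train.get_context().get_world_rank()
# Zero-pad the rank so lexicographic ordering of DMLC_TASK_ID matches numeric rank order
# even with 10+ workers.
os.environ["DMLC_TASK_ID"] = (
    f"[xgboost.ray-rank={world_rank:08d}]:{ray.get_runtime_context().get_actor_id()}"
)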

python/ray/train/xgboost/v2.py (resolved review thread)
python/ray/train/xgboost/xgboost_trainer.py (resolved review thread)
Comment on lines 156 to 161
# TODO(justinvyu): [Deprecated] Remove in 2.11
if dmatrix_params != _DEPRECATED_VALUE:
raise DeprecationWarning(
"`dmatrix_params` is deprecated, since XGBoostTrainer no longer "
"depends on the `xgboost_ray.RayDMatrix` utility."
)
Contributor:

Is there any alternative needed here for functional parity?

justinvyu (author):

I think the closest thing would be passing these params into the xgboost.DMatrix(...).

This would be a new feature though, since the original usage was to pass extra params as xgboost_ray.RayDMatrix constructor args, but none of those apply anymore.

Let's just keep it as deprecated?
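
A small hypothetical illustration of that alternative: parameters that were previously forwarded to RayDMatrix would instead be passed by the user directly to xgboost.DMatrix inside their training function.

import numpy as np
import xgboost

X, y = np.random.rand(16, 4), np.random.rand(16)
dtrain = xgboost.DMatrix(
    X,
    label=y,
    missing=np.nan,                             # value to treat as missing
    weight=np.ones(len(y)),                     # per-row sample weights
    feature_names=[f"f{i}" for i in range(4)],  # optional feature names
)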

python/ray/train/xgboost/xgboost_trainer.py (resolved review thread)
python/ray/train/tests/test_xgboost_trainer.py (resolved review thread)
python/ray/train/xgboost/xgboost_trainer.py (resolved review thread)
matthewdeng (Contributor) left a review comment:


[approval GIF via Minions on GIPHY]

eval_X, eval_y = eval_df.drop(label_column, axis=1), eval_df[label_column]
evals.append((xgboost.DMatrix(eval_X, label=eval_y), eval_name))

with CommunicatorContext():
Contributor:

Will we (eventually) move this into the train_func_context?

justinvyu (author):

Yes, I'll add that in a followup!

justinvyu merged commit 62dbcb2 into ray-project:master on Feb 23, 2024
8 of 9 checks passed
justinvyu deleted the simplify_xgb branch on February 23, 2024 at 18:49