
[data] Add DataIterator.materialize #43210

Merged

merged 6 commits into ray-project:master from justinvyu:data_iter_to_dataset on Feb 16, 2024

Conversation

@justinvyu (Contributor) commented on Feb 15, 2024

Why are these changes needed?

This PR introduces a DataIterator.materialize API that fully executes/consumes a data iterator and returns the result as a MaterializedDataset that the user can continue processing.
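
As a minimal sketch of the new API (the dataset contents and variable names here are illustrative):

import ray.data

ds = ray.data.range(8)
it = ds.iterator()               # DataIterator
materialized = it.materialize()  # fully executes the iterator -> MaterializedDataset
print(materialized.count())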

The reason to add this API is to support model training in Ray Train that requires the full dataset up front. For example, xgboost needs to consider the full dataset to fit decision trees and expects that full dataset to be materialized in memory.

The get_dataset_shard API, which bridges Ray Data and Ray Train, calls streaming_split on the dataset, where the number of splits is the number of training workers. This works well for SGD-style training schemes (typical for Torch and TensorFlow users), since the usual training procedure estimates the gradient on a small batch of data at a time. Fitting decision trees, however, requires searching for the best split over the entire dataset, so batch-by-batch data loading is not suitable.
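
For context, a rough sketch of the streaming_split behavior described above (the worker count here is hypothetical):

import ray.data

ds = ray.data.range(100)
num_workers = 4  # hypothetical number of training workers
# One DataIterator per training worker; each worker consumes its shard batch by batch.
shards = ds.streaming_split(num_workers, equal=True)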

With this change, the following workflow is now possible:

import ray.data
import ray.train.xgboost
import xgboost

# NOTE: the import path for CommunicatorContext below is an assumption; it is the
# collective-communication context that sets up distributed xgboost training.
from xgboost.collective import CommunicatorContext

def train_fn_per_worker(config):
    # 1. Get the dataset shard for the worker and convert to an `xgboost.DMatrix`.
    train_ds_iter, eval_ds_iter = (
        ray.train.get_dataset_shard("train"),
        ray.train.get_dataset_shard("validation"),
    )

    train_ds, eval_ds = train_ds_iter.materialize(), eval_ds_iter.materialize()  # <-- new API usage

    train_df, eval_df = train_ds.to_pandas(), eval_ds.to_pandas()
    train_X, train_y = train_df.drop("y", axis=1), train_df["y"]
    eval_X, eval_y = eval_df.drop("y", axis=1), eval_df["y"]
    dtrain = xgboost.DMatrix(train_X, label=train_y)
    deval = xgboost.DMatrix(eval_X, label=eval_y)

    # 2. Do distributed data-parallel training.
    with CommunicatorContext():
        bst = xgboost.train(..., dtrain=dtrain)

# Launch the distributed training job with Ray Train.
train_ds = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])
eval_ds = ray.data.from_items([{"x": x, "y": x + 1} for x in range(16)])
trainer = ray.train.xgboost.XGBoostTrainer(
    train_fn_per_worker, datasets={"train": train_ds, "validation": eval_ds}
)
trainer.fit()

XGBoost training with a data iterator

Note that xgboost actually does have support for training with a data iterator, but it is experimental and possibly less performant: https://xgboost.readthedocs.io/en/latest/tutorials/external_memory.html#data-iterator
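
For reference, a minimal sketch of that experimental interface, assuming in-memory (X, y) batches are available (class and variable names here are illustrative):

import xgboost

class InMemoryBatchIter(xgboost.DataIter):
    """Yields pre-loaded (X, y) batches to xgboost one at a time."""

    def __init__(self, batches):
        self._batches = batches
        self._i = 0
        super().__init__()

    def next(self, input_data):
        if self._i == len(self._batches):
            return 0  # signal that iteration is done
        X, y = self._batches[self._i]
        input_data(data=X, label=y)
        self._i += 1
        return 1

    def reset(self):
        self._i = 0

# dtrain = xgboost.DMatrix(InMemoryBatchIter(batches))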

Related PR

This PR is a prerequisite for #42767.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Review comment on this diff hunk:

    feature_column_dtypes[key]
    if isinstance(feature_column_dtypes, dict)
    else feature_column_dtypes,
    (

Reviewer (Contributor): is this just a lint change?

@justinvyu (Contributor, Author): My auto-linting caught this for some reason.

@c21 (Contributor) left a comment:

LGTM, thanks @justinvyu!

Can we also add an end-to-end test for XGBoostTrainer? It can wait for a follow-up PR if you plan to do it later.


Review comment on this diff hunk:

    block_iter, stats, owned_by_consumer = self._to_block_iterator()

    # Consuming the block iterator triggers execution and materializes
    # all blocks of the iterator.
    block_refs_and_metadata = list(block_iter)

Reviewer (Contributor): Can we add a comment noting that this triggers execution and materializes all blocks of the iterator? I want to make it obvious for people reading this in the future.
@c21 (Contributor) commented Feb 16, 2024

Oh, forgot one more thing: please add the new API to the documentation - https://github.com/ray-project/ray/blob/master/doc/source/data/api/data_iterator.rst . Thanks.

@justinvyu (Contributor, Author):
@c21 Yes, I plan on adding the e2e test in the follow-up XGBoostTrainer PR that uses this API.
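
For context, a rough sketch of what such an end-to-end test could look like, reusing the train_fn_per_worker from the example above (the test name and assertion are hypothetical):

import ray.data
import ray.train.xgboost

def test_xgboost_trainer_materialize():
    # Hypothetical end-to-end test: train on small in-memory datasets and
    # verify that the run completes without error.
    train_ds = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])
    eval_ds = ray.data.from_items([{"x": x, "y": x + 1} for x in range(16)])
    trainer = ray.train.xgboost.XGBoostTrainer(
        train_fn_per_worker,  # from the example in the PR description
        datasets={"train": train_ds, "validation": eval_ds},
    )
    result = trainer.fit()
    assert result.error is None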

@c21 merged commit e221c6e into ray-project:master on Feb 16, 2024
9 checks passed
@justinvyu deleted the data_iter_to_dataset branch on February 16, 2024 at 21:01