[train] Implement DatasetManager by TimothySeah · Pull Request #63309 · ray-project/ray

TimothySeah · 2026-05-12T19:02:39Z

Summary

Here is a timeline that explains the history of the DatasetManager:

Before [Data] Remove _base_dataset from StreamSplitDataIterator #61607, StreamSplitDataIterator held a reference to the Dataset, which could be large as explained in this PR description.
[train] Make ray.train.get_dataset_shard lazily configure the dataset sharding #55230 changed how we send StreamSplitDataIterators to workers: we went from presharding the dataset before creating the worker group to configuring the dataset on the fly when ray.train.get_dataset_shard is called. Unfortunately, this led to a performance regression, so we had to revert it. See the revert PR ([train] Revert "Make ray.train.get_dataset_shard lazily configure the dataset sharding (#55230)" #55760) for a deeper explanation of what happened; essentially, sending StreamSplitDataIterators was slow because they contained Dataset references, which could get big. It's not clear why sending big StreamSplitDataIterators was slower in the get_dataset_shard case than in the presharding case.
[Data] Remove _base_dataset from StreamSplitDataIterator #61607 removed Dataset references from StreamSplitDataIterators.
This PR brings DatasetManager back now that StreamSplitDataIterators are small.
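The size effect in this timeline can be illustrated without Ray. The sketch below is not the actual StreamSplitDataIterator; it uses plain dicts as hypothetical stand-ins (`fake_dataset`, `iterator_with_base`, `slim_iterator`) to show how an object that references a large dataset pickles to roughly the dataset's size, while the same object without that reference stays tiny.

```python
import pickle


def serialized_size(obj) -> int:
    """Return the number of bytes produced by pickling obj."""
    return len(pickle.dumps(obj))


# Hypothetical stand-in for a dataset whose serialized form is large.
fake_dataset = {"plan": bytes(5_000_000)}

# An iterator-like record that still references its source dataset.
iterator_with_base = {
    "rank": 0,
    "coordinator": "split-coordinator",
    "base_dataset": fake_dataset,
}

# The same record once the dataset reference is dropped.
slim_iterator = {"rank": 0, "coordinator": "split-coordinator"}

heavy_bytes = serialized_size(iterator_with_base)
slim_bytes = serialized_size(slim_iterator)
print(f"with base dataset: {heavy_bytes} bytes, without: {slim_bytes} bytes")
```

Every worker that receives the heavier record pays for the full payload, which is why dropping the reference matters more as the number of workers grows.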

Testing

The revert PR's description (#55760) included a nice repro script which I modified slightly:

import time

import ray
import ray.data
import ray.train
from ray.data import DataContext
from ray.data.datasource.partitioning import Partitioning
from ray.train.v2._internal.data_integration.dataset_manager import DatasetManager
from ray.train.v2._internal.data_integration.interfaces import DatasetShardMetadata


train_dir = "s3://anyscale-imagenet/ILSVRC/Data/CLS-LOC/train"
train_partitioning = Partitioning(
    "dir", base_dir=train_dir, field_names=["class"]
)
train_ds = ray.data.read_images(
    train_dir,
    mode="RGB",
    include_paths=False,
    partitioning=train_partitioning,
)

num_workers = 16

datasets = {"train": train_ds, "val": train_ds}
dataset_manager = ray.remote(DatasetManager).remote(
    datasets=datasets,
    data_config=ray.train.DataConfig(),
    data_context=DataContext.get_current(),
    world_size=num_workers,
    worker_node_ids=None,
)


def get_size_bytes(obj):
    import ray.cloudpickle as ray_pickle
    return len(ray_pickle.dumps(obj))


@ray.remote(num_cpus=1)
def consumer(dm, rank):
    dataset_info = DatasetShardMetadata("train", rank)

    start = time.perf_counter()
    ds_shard = ray.get(dm.get_dataset_shard.remote(dataset_info))
    end = time.perf_counter()

    size_mb = get_size_bytes(ds_shard) / (1024 * 1024)
    print(f"[{rank=}] TIME TO GET THE DATASET SHARD (SIZE={size_mb}MB):", end - start)


start = time.perf_counter()
ray.get([consumer.remote(dataset_manager, rank) for rank in range(num_workers)])
end = time.perf_counter()
print("elapsed:", end - start)

Now we are sending ~1.7 KB DataIterators in ~5 s (the 5 s is spent waiting for streaming_split, not on the transfer), whereas the revert PR (#55760 (comment)) showed that we were sending ~780 MB DataIterators in up to 174 s. In other words, this is no longer an issue, so it's safe to merge DatasetManager for real this time.

(ray4) tseah@tseah-LV3607J62K ray % RAY_DEDUP_LOGS=0 python driver.py
2026-05-12 16:34:59,444	INFO worker.py:2018 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
(DatasetManager pid=69108) Using blocking ray.get inside async actor. This blocks the event loop. Please use `await` on object ref with asyncio.gather if you want to yield execution to the event loop instead.
elapsed: 6.097738708020188
(consumer pid=69103) [rank=3] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 5.295730707992334
(consumer pid=69106) [rank=0] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 5.3177806670137215
(consumer pid=69110) [rank=4] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 5.318447458994342
(consumer pid=69110) [rank=14] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 0.015125167003134266
(consumer pid=69104) [rank=1] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 5.295733000006294
(consumer pid=69111) [rank=2] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 5.337776000000304
(consumer pid=69102) [rank=5] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 5.340320540999528
(consumer pid=69101) [rank=7] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 5.306097583001247
(consumer pid=69105) [rank=6] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 5.326983959006611
(consumer pid=69107) [rank=8] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 5.3252247920027
(consumer pid=69109) [rank=9] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 5.325624417018844
(consumer pid=69109) [rank=15] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 0.014673249999759719
(consumer pid=69773) [rank=10] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 4.506475125002908
(consumer pid=69774) [rank=12] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 4.497576666995883
(consumer pid=69771) [rank=11] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 4.499152167001739
(consumer pid=69772) [rank=13] TIME TO GET THE DATASET SHARD (SIZE=0.0016326904296875MB): 1.9341769159946125
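In the log above, ranks 14 and 15 return in about 0.015 s, and they run on worker processes (pids 69110 and 69109) that had already served ranks 4 and 9. One plausible reading is that the shards are set up once and later requests are served from the already-built result. The sketch below illustrates that pattern with plain asyncio; `ToyShardManager` and its parameters are hypothetical and are not the DatasetManager implementation.

```python
import asyncio
import time


class ToyShardManager:
    """Hypothetical manager: builds all shards once, then serves them by rank."""

    def __init__(self, world_size: int, setup_seconds: float):
        self._world_size = world_size
        self._setup_seconds = setup_seconds
        self._shards = None
        self._lock = asyncio.Lock()

    async def get_shard(self, rank: int) -> str:
        async with self._lock:
            if self._shards is None:
                # Stand-in for the expensive one-time split setup.
                await asyncio.sleep(self._setup_seconds)
                self._shards = [f"shard-{i}" for i in range(self._world_size)]
        return self._shards[rank]


async def timed_get(manager: ToyShardManager, rank: int):
    """Return the shard for rank and the seconds spent waiting for it."""
    start = time.perf_counter()
    shard = await manager.get_shard(rank)
    return shard, time.perf_counter() - start


async def main():
    manager = ToyShardManager(world_size=4, setup_seconds=0.2)
    # First wave: every caller waits for the one-time setup to finish.
    first_wave = await asyncio.gather(*(timed_get(manager, r) for r in range(3)))
    # Late caller: the shards already exist, so the call returns quickly.
    late = await timed_get(manager, 3)
    return first_wave, late


first_wave, late = asyncio.run(main())
for shard, seconds in list(first_wave) + [late]:
    print(f"{shard}: waited {seconds:.3f}s")
```

The first three callers all report roughly the setup time, and the late caller reports close to zero, which mirrors the shape of the timings in the log.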

Testing 2

If the dataset(s) are not sharded, the DataIteratorImpl will contain the dataset object. I ran the training_ingest_benchmark-task=image_classification.skip_training.jpeg release tests on this PR (https://buildkite.com/ray-project/release/builds/93074) and on a master-branch PR (#63388, https://buildkite.com/ray-project/release/builds/93075) with dataset sharding temporarily disabled. Note that these jobs only fail at the "get stats" step after training has finished, so we can focus on the training time. The results show that this PR actually achieves better end-to-end runtime. Honestly, I'm not sure why this is the case; maybe pulling from the DatasetManager actor is faster than pushing from the TrainController actor, since the latter is responsible for many other activities as well?

| Test variant | With DatasetManager | Without DatasetManager |
| --- | --- | --- |
| (base) | 2m13s | 5m27s |
| .preserve_order | 4m48s | 36m37s |
| .local_fs | 32s | 33s |
| local_fs.preserve_order | 36s | 35s |
| local_fs_multi_gpus | 48s | 53s |
| local_fs_multi_gpus.preserve_order | 1m21s | 2m28s |
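To make the comparison easier to read, the durations above can be converted to seconds and expressed as a ratio. This is a small helper sketch; `parse_duration` is a hypothetical name, and the unlabeled first row of the results is called `(base)` here.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a string such as '4m48s' or '32s' to a number of seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


# (with DatasetManager, without DatasetManager), copied from the results above.
results = {
    "(base)": ("2m13s", "5m27s"),
    ".preserve_order": ("4m48s", "36m37s"),
    ".local_fs": ("32s", "33s"),
    "local_fs.preserve_order": ("36s", "35s"),
    "local_fs_multi_gpus": ("48s", "53s"),
    "local_fs_multi_gpus.preserve_order": ("1m21s", "2m28s"),
}

# A ratio above 1 means the run with DatasetManager was faster.
speedups = {
    name: parse_duration(without) / parse_duration(with_dm)
    for name, (with_dm, without) in results.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

The ratios come out to about 2.46x and 7.63x for the first two rows, between 0.97x and 1.10x for the next three, and about 1.83x for the last row.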

Signed-off-by: Timothy Seah <tseah@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces a DatasetManager actor to centralize and synchronize the management of Ray Dataset shards across training workers. It refactors RayDatasetShardProvider and DatasetsCallback to delegate dataset configuration and executor shutdown to this new manager. Feedback from the review highlights several improvement opportunities: ensuring safe access to coordinator actors to prevent AttributeError, removing redundant dataset resolution logic, converting blocking ray.get calls to asynchronous operations within the manager to avoid event loop stalls, and replacing assertions with conditional checks for safer lifecycle management during shutdown.
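The point about blocking calls stalling the event loop can be demonstrated with plain asyncio. In the sketch below, `time.sleep` stands in for a synchronous wait such as a blocking ray.get, and `asyncio.sleep` stands in for awaiting the result instead; the `heartbeat`, `blocking_handler`, `awaiting_handler`, and `count_ticks` names are hypothetical. A background heartbeat only ticks while the loop is free to run it.

```python
import asyncio
import time


async def heartbeat(ticks: list, interval: float) -> None:
    """Record a timestamp every `interval` seconds while the loop is free."""
    while True:
        ticks.append(time.perf_counter())
        await asyncio.sleep(interval)


async def blocking_handler(seconds: float) -> None:
    # Synchronous wait: nothing else on the event loop can run meanwhile.
    time.sleep(seconds)


async def awaiting_handler(seconds: float) -> None:
    # Awaited wait: control returns to the event loop until it completes.
    await asyncio.sleep(seconds)


async def count_ticks(handler) -> int:
    """Return how many heartbeat ticks happen while handler runs."""
    ticks = []
    task = asyncio.create_task(heartbeat(ticks, interval=0.02))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    before = len(ticks)
    await handler(0.2)
    after = len(ticks)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return after - before


async def main():
    blocked = await count_ticks(blocking_handler)
    free = await count_ticks(awaiting_handler)
    return blocked, free


blocked_ticks, free_ticks = asyncio.run(main())
print(f"ticks during blocking wait: {blocked_ticks}, during awaited wait: {free_ticks}")
```

The warning in the test log makes the same recommendation for async actors: use `await` on the object ref rather than a blocking ray.get.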

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>

JasonLi1909

Nice! One edge case to test: if the dataset(s) are not sharded, the DataIteratorImpl will contain the dataset object.

TimothySeah · 2026-05-14T22:34:47Z

TODO: try training_ingest_benchmark-task=image_classification.skip_training.jpeg without splitting datasets, with and without this PR. Reason: DataIterator (the non-split case) still holds a dataset, so we need to verify that this PR doesn't slow that case down.

justinvyu

Thanks!

Co-authored-by: Justin Yu <justin.v.yu@gmail.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>

…chmark Temporarily pass datasets_to_split=[] in RayDataLoaderFactory so the training_ingest_benchmark-task=image_classification.skip_training.jpeg integration test runs without splitting the dataset across workers. Intended to be reverted after measuring the behavioral difference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…gest benchmark" This reverts commit 77f0e4e.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit a1ae586.

[train] Open source DatasetManager

761744e

Signed-off-by: Timothy Seah <tseah@anyscale.com>

gemini-code-assist Bot reviewed May 12, 2026

TimothySeah and others added 3 commits May 12, 2026 17:52

Apply suggestion from @gemini-code-assist[bot]

3eeb254

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>

Apply suggestion from @gemini-code-assist[bot]

b5b86f1

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>

Apply suggestion from @gemini-code-assist[bot]

ba62ca5

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>

JasonLi1909 reviewed May 14, 2026

justinvyu approved these changes May 14, 2026

Comment thread python/ray/train/v2/_internal/data_integration/dataset_manager.py Outdated

TimothySeah and others added 2 commits May 15, 2026 17:03

Update python/ray/train/v2/_internal/data_integration/dataset_manager.py

c786c46

Co-authored-by: Justin Yu <justin.v.yu@gmail.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah mentioned this pull request May 19, 2026

do not submit - just provide comparison baseline #63388

Closed

Revert "[test] DO NOT MERGE: disable dataset splitting in training in…

a1ae586

…gest benchmark" This reverts commit 77f0e4e.

TimothySeah marked this pull request as ready for review May 19, 2026 22:52

TimothySeah requested a review from a team as a code owner May 19, 2026 22:52

cursor Bot reviewed May 19, 2026

Comment thread python/ray/train/v2/_internal/data_integration/dataset_manager.py

TimothySeah changed the title from "[train] Open source DatasetManager" to "[train] Implement DatasetManager" May 20, 2026

ray-gardener Bot added the train (Ray Train Related Issue) and data (Ray Data-related issues) labels May 20, 2026

justinvyu enabled auto-merge (squash) May 20, 2026 22:00

github-actions Bot added the go (add ONLY when ready to merge, run all tests) label May 20, 2026

justinvyu merged commit 844c38d into ray-project:master May 20, 2026
10 checks passed

justinvyu merged 7 commits into ray-project:master from TimothySeah:tseah/oss-dataset-manager
