[train] Implement DatasetManager#63309
Conversation
Signed-off-by: Timothy Seah <tseah@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a DatasetManager actor to centralize and synchronize the management of Ray Dataset shards across training workers. It refactors RayDatasetShardProvider and DatasetsCallback to delegate dataset configuration and executor shutdown to this new manager. Feedback from the review highlights several improvement opportunities: ensuring safe access to coordinator actors to prevent AttributeError, removing redundant dataset resolution logic, converting blocking ray.get calls to asynchronous operations within the manager to avoid event loop stalls, and replacing assertions with conditional checks for safer lifecycle management during shutdown.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>
JasonLi1909
left a comment
There was a problem hiding this comment.
Nice! One edge to test for: if the dataset(s) are not sharded the DataIteratorImpl will contain the dataset object
|
TODO: try |
Co-authored-by: Justin Yu <justin.v.yu@gmail.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>
…chmark Temporarily pass datasets_to_split=[] in RayDataLoaderFactory so the training_ingest_benchmark-task=image_classification.skip_training.jpeg integration test runs without splitting the dataset across workers. Intended to be reverted after measuring the behavioral difference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gest benchmark" This reverts commit 77f0e4e.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit a1ae586. Configure here.

Summary
Here is a timeline that explains the history of the
DatasetManagerStreamSplitDataIteratorheld a reference to theDataset, which could be large as explained in this PR description.ray.train.get_dataset_shardlazily configure the dataset sharding #55230 changed how we sendStreamSplitDataIteratorsto workers. Basically, we went from presharding the dataset before creating the worker group to configuring the dataset on the fly when callingray.train.get_dataset_shard. Unfortunately, this led to a performance regression, so we had to revert it - see the revert PR ([train] Revert "Make ray.train.get_dataset_shard lazily configure the dataset sharding (#55230)" #55760) for a deeper explanation of what happened, but essentially sending overStreamSplitDataIterators was slow because they containedDatasetreferences which could get big. It's not clear why sending bigStreamSplitDataIteratorswas slower in theget_dataset_shardcase than the presharding case.Datasetreferences fromStreamSplitDataIteratorsDatasetManagerback now thatStreamSplitDataIteratorsare small.Testing
The revert PR's description (#55760) included a nice repro script which I modified slightly:
Now we are sending ~1kb
DataIterators over in ~5s (the 5s is due to waiting forstreaming_split, not the transfer time), whereas the revert PR (#55760 (comment)) showed that we were sending ~780mbDataIterators over in up to 174s! In other words, this is no longer an issue, so it's safe to mergeDatasetManagerfor real this time.Testing 2
If the dataset(s) are not sharded the DataIteratorImpl will contain the dataset object. I ran the
training_ingest_benchmark-task=image_classification.skip_training.jpegrelease tests on this PR (https://buildkite.com/ray-project/release/builds/93074) and a master branch PR (#63388, https://buildkite.com/ray-project/release/builds/93075) with dataset sharding temporarily disabled. Note that these jobs are only failing at the "get stats" step after training is finished; we can just focus on the training time. The results show that this PR actually achieves better e2e runtime. Honestly, I'm not really sure why this is the case - maybe pulling from the datasetmanager actor is faster than pushing from the traincontroller actor since the latter is responsible for many other activities as well?Note