[Data] Allow split by column value in Dataset #45634
Comments
Is writing a partitioned dataset the primary use case? If so, this may be related to #42228 and #42288. If I understand your example correctly, writing a dataset into groups (or partitions) would look like this.

Afaik, the Ray Data write_* APIs are blocking in the current design. In #42288, I proposed a solution by aligning the blocks with keys. In that case, it would write each key-aligned block out separately.
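The align-blocks-with-keys idea can be sketched in plain Python. This is purely illustrative (the helper name `write_partitioned` and the Hive-style `key=value` layout are assumptions, not Ray Data's actual write path, which operates on blocks):

```python
import csv
import os
import tempfile
from collections import defaultdict

def write_partitioned(rows, key, out_dir):
    """Group rows by `key` and write each group to its own
    Hive-style partition directory: out_dir/key=<value>/part-0.csv.
    Illustrative sketch only, not Ray Data's implementation."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    paths = {}
    for value, group in groups.items():
        part_dir = os.path.join(out_dir, f"{key}={value}")
        os.makedirs(part_dir, exist_ok=True)
        path = os.path.join(part_dir, "part-0.csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)
        paths[value] = path
    return paths

rows = [
    {"group": "a", "x": 1},
    {"group": "b", "x": 2},
    {"group": "a", "x": 3},
]
out = tempfile.mkdtemp()
paths = write_partitioned(rows, "group", out)
```

With aligned blocks, each per-key write is independent, which is what lets the partitions be written in parallel.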
Well, I would also like the ability to treat each group as a separate Dataset, for example applying map_batches to each, but yes, ultimately the use case I'm targeting right now is writing.
To add to this, I am looking for a similar feature for general data processing, not necessarily for model training, but to stream the data to each node by group (single or multiple keys). Assuming I have 1 node for each group, I can just do something like this. If you don't have enough nodes (n=3) for each group, maybe send (g1, g2) to shard 1, (g3) to shard 2, and (g4) to shard 3, splitting roughly according to the number of rows.
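The "split groups across shards roughly by row count" idea above can be sketched as a greedy least-loaded assignment. The function name and shape are illustrative assumptions, not a Ray Data API:

```python
def assign_groups_to_shards(group_sizes, num_shards):
    """Greedily assign each group (largest first) to the shard with the
    fewest total rows so far. Returns one list of group names per shard.
    Illustrative only; not part of Ray Data."""
    shards = [[] for _ in range(num_shards)]
    loads = [0] * num_shards
    for name, size in sorted(group_sizes.items(), key=lambda kv: -kv[1]):
        i = loads.index(min(loads))  # least-loaded shard so far
        shards[i].append(name)
        loads[i] += size
    return shards

sizes = {"g1": 100, "g2": 90, "g3": 180, "g4": 120}
shards = assign_groups_to_shards(sizes, 3)
# → [["g3"], ["g4"], ["g1", "g2"]]
```

Keeping whole groups on one shard preserves per-group locality, at the cost of some load imbalance when group sizes are skewed.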
Is there any follow-up on this feature request? It would be super helpful when dealing with time-series datasets.
Description

Allow ray.data.Dataset to be grouped and then split into separate Datasets by a column value. In particular, ray.data.Dataset should have a split_by_key function that splits the Dataset into a dict or list of separate Datasets based on a particular column value. This is basically the groupby of Pandas.

Use case
Currently, the use case is taking a Ray Dataset and splitting it into shards by some column value in order to write them to separate files using a Ray DataSink. This is not possible today, because the groupby operation only returns a GroupedData, from which you have to use map_groups. The current workaround is to write custom file-writing logic inside map_groups and call materialize() on the resulting Dataset, which is not a good API and prevents other use cases, such as passing different Datasets to different workers.