
[Data] Allow split by column value in Dataset #45634

Open
terraflops1048576 opened this issue May 30, 2024 · 4 comments
Labels
data (Ray Data-related issues) · enhancement (Request for new feature and/or capability) · P2 (Important issue, but not time-critical)

Comments

@terraflops1048576 (Contributor)

Description

Allow a ray.data.Dataset to be grouped by a column value and then split into separate Datasets. In particular, ray.data.Dataset should have a split_by_key function that splits the Dataset into a dict or list of separate Datasets keyed on a particular column's value. This is essentially the groupby of pandas.
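
For reference, the pandas behaviour this is asking for looks roughly like the following (the column name is illustrative):

import pandas as pd

df = pd.DataFrame({"group_key": ["a", "a", "b"], "value": [1, 2, 3]})

# groupby iteration yields one sub-frame per distinct key value; the request
# is for an analogous dict/list of sub-Datasets in Ray Data.
groups = {key: sub_df for key, sub_df in df.groupby("group_key")}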

Use case

The immediate use case is taking a Ray Dataset, splitting it into shards by some column value, and writing each shard to separate files via a Ray Datasink. This is not possible today, because the groupby operation only returns a GroupedData, which must then be consumed through map_groups. The current workaround is to put custom file-writing logic inside map_groups and call materialize() on the resulting Dataset, which is not a good API and blocks other use cases, such as passing different Datasets to different workers.
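
For concreteness, a minimal sketch of that workaround, assuming pandas batches; the column name and output path are illustrative:

import os

import pandas as pd
import ray

os.makedirs("/tmp/out", exist_ok=True)

ds = ray.data.from_items([
    {"group_key": "a", "value": 1},
    {"group_key": "a", "value": 2},
    {"group_key": "b", "value": 3},
])

def write_group(df: pd.DataFrame) -> pd.DataFrame:
    key = df["group_key"].iloc[0]
    # Side effect: write each group to its own file.
    df.to_parquet(f"/tmp/out/group={key}.parquet")
    return df  # map_groups requires a batch to be returned

# materialize() forces execution so the per-group writes actually happen.
ds.groupby("group_key").map_groups(write_group, batch_format="pandas").materialize()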

terraflops1048576 added the enhancement and triage labels on May 30, 2024
terraflops1048576 changed the title from "[Data]" to "[Data] Allow split by column value in Dataset" on May 30, 2024
@wingkitlee0 (Contributor)

Is writing a partitioned dataset the primary use case? If so, this may be related to #42228 and #42288.

If I understand your example correctly, writing a dataset out by group (or partition) would look like:

for ds_by_group in ds.split_by_key("group_key"):
    ds_by_group.write_parquet("target")

AFAIK, the Ray Data write_* APIs are blocking in the current design.

In #42288, I proposed a solution that aligns the blocks with keys. In that case,

ds.repartition_by_key("group_key").write_parquet("target")

would write N_group files (one per group).

anyscalesam added the data label on Jun 3, 2024
@terraflops1048576 (Contributor, Author)

Well, I would also like the ability to treat each group as a separate Dataset, for example applying map_batches to each, but yes, ultimately the use case I'm targeting right now is writing.
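
A sketch of what that could look like if the proposed split_by_key existed (the function, its return type, and the paths below are hypothetical):

# Hypothetical API: returns something like a dict mapping key value -> Dataset.
per_group = ds.split_by_key("group_key")

for key, group_ds in per_group.items():
    # Each group is an ordinary Dataset, so per-group transforms and writes apply directly.
    group_ds.map_batches(lambda batch: batch).write_parquet(f"/tmp/out/{key}")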

@pinduzera commented Jun 10, 2024

To add to this, I am looking for a similar feature for general data processing, not necessarily for model training, but to stream the data to each node by group (single or multiple keys).
So, imagine I have groups g1, g2, g3, g4, g5 (each one all of the bank data for a given user). I want to be able to process each user independently (or even send some groups together efficiently), and send those groups to each node in one (or both) of the following ways:

Assuming I have 1 node for each group, I can just do something like shards = ds.streaming_split(key="KeyColumn", n=5) and each node would receive an iterable with a single group.

If you don't have enough nodes (say n=3) for each group, maybe send (g1, g2) to shard 1, (g3) to shard 2, and (g4, g5) to shard 3, split roughly according to the number of rows.
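
For contrast, a minimal sketch of today's streaming_split, which shards by blocks/rows rather than by column value (the key argument above is hypothetical):

import ray

ds = ray.data.range(100)

# Current API: no key/column argument, so each shard may mix rows from many groups.
it1, it2, it3 = ds.streaming_split(3, equal=True)

@ray.remote
def consume(it):
    for batch in it.iter_batches(batch_size=32):
        ...  # process the shard; per-group locality is not guaranteed here

ray.get([consume.remote(it) for it in (it1, it2, it3)])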

anyscalesam added the P2 label and removed the triage label on Jun 12, 2024
@NumberChiffre

Is there any follow-up on this feature request? It would be super helpful when dealing with time series datasets.
