
[Data] Allow split by column value in Dataset #45634

Open
terraflops1048576 opened this issue May 30, 2024 · 4 comments
Labels
data (Ray Data-related issues) · enhancement (Request for new feature and/or capability) · P2 (Important issue, but not time-critical)

Comments

@terraflops1048576 (Contributor)

Description

Allow a ray.data.Dataset to be grouped by a column value and then split into separate Datasets. In particular, ray.data.Dataset should have a split_by_key function that splits the Dataset into a dict or list of separate Datasets keyed on a particular column's value. This is essentially the groupby of pandas.
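
For reference, the pandas behaviour this is asking for looks roughly like the following (the column name is illustrative):

import pandas as pd

df = pd.DataFrame({"group_key": ["a", "a", "b"], "value": [1, 2, 3]})

# groupby iteration yields one sub-frame per distinct key value; the request
# is for an analogous dict/list of sub-Datasets in Ray Data.
groups = {key: sub_df for key, sub_df in df.groupby("group_key")}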

Use case

The immediate use case is taking a Ray Dataset, splitting it into shards by some column value, and writing each shard to separate files via a Ray Datasink. This is not possible today, because the groupby operation only returns a GroupedData, which must then be consumed through map_groups. The current workaround is to put custom file-writing logic inside map_groups and call materialize() on the resulting Dataset, which is not a good API and blocks other use cases, such as passing different Datasets to different workers.
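
For concreteness, a minimal sketch of that workaround, assuming pandas batches; the column name and output path are illustrative:

import os

import pandas as pd
import ray

os.makedirs("/tmp/out", exist_ok=True)

ds = ray.data.from_items([
    {"group_key": "a", "value": 1},
    {"group_key": "a", "value": 2},
    {"group_key": "b", "value": 3},
])

def write_group(df: pd.DataFrame) -> pd.DataFrame:
    key = df["group_key"].iloc[0]
    # Side effect: write each group to its own file.
    df.to_parquet(f"/tmp/out/group={key}.parquet")
    return df  # map_groups requires a batch to be returned

# materialize() forces execution so the per-group writes actually happen.
ds.groupby("group_key").map_groups(write_group, batch_format="pandas").materialize()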

terraflops1048576 added the enhancement and triage labels on May 30, 2024
terraflops1048576 changed the title from "[Data]" to "[Data] Allow split by column value in Dataset" on May 30, 2024
@wingkitlee0 (Contributor)

Is writing a partitioned dataset the primary use case? If so, this may be related to #42228 and #42288.

If I understand your example correctly, writing a dataset out by group (or partition) would look like:

for ds_by_group in ds.split_by_key("group_key"):
    ds_by_group.write_parquet("target")

AFAIK, the Ray Data write_* APIs are blocking in the current design.

In #42288, I proposed a solution that aligns the blocks with keys. In that case,

ds.repartition_by_key("group_key").write_parquet("target")

would write N_group files (one per group).

anyscalesam added the data label on Jun 3, 2024
@terraflops1048576 (Contributor, Author)

Well, I would also like the ability to treat each group as a separate Dataset, for example applying map_batches to each, but yes, ultimately the use case I'm targeting right now is writing.
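
A sketch of what that could look like if the proposed split_by_key existed (the function, its return type, and the paths below are hypothetical):

# Hypothetical API: returns something like a dict mapping key value -> Dataset.
per_group = ds.split_by_key("group_key")

for key, group_ds in per_group.items():
    # Each group is an ordinary Dataset, so per-group transforms and writes apply directly.
    group_ds.map_batches(lambda batch: batch).write_parquet(f"/tmp/out/{key}")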

@pinduzera commented Jun 10, 2024

To add to this, I am looking for a similar feature for general data processing, not necessarily for model training, but to stream the data to each node by group (single or multiple keys).
So, imagine I have groups g1, g2, g3, g4, g5 (each one all of the bank data for a given user). I want to be able to process each user independently (or even send some groups together efficiently), and send those groups to each node in one (or both) of the following ways:

Assuming I have 1 node for each group, I can just do something like shards = ds.streaming_split(key="KeyColumn", n=5) and each node would receive an iterable with a single group.

If you don't have enough nodes (say n=3) for each group, maybe send (g1, g2) to shard 1, (g3) to shard 2, and (g4, g5) to shard 3, split roughly according to the number of rows.
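
For contrast, a minimal sketch of today's streaming_split, which shards by blocks/rows rather than by column value (the key argument above is hypothetical):

import ray

ds = ray.data.range(100)

# Current API: no key/column argument, so each shard may mix rows from many groups.
it1, it2, it3 = ds.streaming_split(3, equal=True)

@ray.remote
def consume(it):
    for batch in it.iter_batches(batch_size=32):
        ...  # process the shard; per-group locality is not guaranteed here

ray.get([consume.remote(it) for it in (it1, it2, it3)])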

anyscalesam added the P2 label and removed the triage label on Jun 12, 2024
@NumberChiffre

Is there any follow-up on this feature request? It would be super helpful when dealing with time series datasets.
