[data] [streaming] Support a streaming_repartition() operator #36724

Open
ericl opened this issue Jun 22, 2023 · 2 comments
Labels
data (Ray Data-related issues), enhancement (Request for new feature and/or capability), P2 (Important issue, but not time-critical)

Comments

@ericl
Contributor

ericl commented Jun 22, 2023

In several use cases, it is useful to change the block size of datasets in a streaming way. The current repartition() operator is an all-to-all operator and is incompatible with streaming.

We could implement a general purpose streaming_repartition() operator that supports repartitioning in a few streaming-compatible ways:

  • Splitting/coalescing blocks into a certain number of rows
  • Splitting/coalescing blocks into a certain in-memory byte size
  • Splitting/coalescing blocks into K pieces

This could be implemented as a new PhysicalOperator that implements the online repartitioning. This could also replace the current SplitBlocks mechanism from #36352
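
As a rough illustration of the first mode above (splitting/coalescing blocks into a certain number of rows), here is a minimal sketch in plain Python. It is not the Ray Data implementation: the `Block` alias, the function name `streaming_repartition_by_rows`, and the `target_rows` parameter are all hypothetical, and a block is modeled as a simple list of rows. The point is only that the operation can be expressed as a generator that emits output blocks before the input stream is exhausted.

```python
from typing import Iterable, Iterator, List

# Hypothetical stand-in for a Ray Data block: a plain list of row dicts.
Block = List[dict]


def streaming_repartition_by_rows(
    blocks: Iterable[Block], target_rows: int
) -> Iterator[Block]:
    """Re-chunk a stream of blocks into blocks of `target_rows` rows.

    At most one input block plus one partial output block is buffered,
    so output is produced incrementally. The last block may be smaller.
    """
    if target_rows <= 0:
        raise ValueError("target_rows must be positive")
    buffer: Block = []
    for block in blocks:
        # Coalescing: small input blocks accumulate in the buffer.
        buffer.extend(block)
        # Splitting: emit every full output block the buffer can supply.
        while len(buffer) >= target_rows:
            yield buffer[:target_rows]
            buffer = buffer[target_rows:]
    if buffer:
        # Flush the remainder once the input stream ends.
        yield buffer


if __name__ == "__main__":
    # Input blocks of 5, 2, and 1 rows are re-chunked into blocks of 3 rows.
    stream = [[{"id": i} for i in range(5)], [{"id": 5}, {"id": 6}], [{"id": 7}]]
    for out in streaming_repartition_by_rows(stream, target_rows=3):
        print(len(out), [row["id"] for row in out])
```

A byte-size variant would follow the same shape, with the buffer threshold measured in estimated in-memory bytes rather than row count.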

@ericl ericl added enhancement Request for new feature and/or capability P1 Issue that should be fixed within a few weeks data Ray Data-related issues labels Jun 22, 2023
@luxunxiansheng

Suppose I have 20 large files and have implemented a specific datasource for them. I would like to load the dataset with read_datasource using a parallelism of, say, 200. I see that the SplitBlocks function splits each block into many smaller blocks. My question is: how does SplitBlocks work? Does it split a single big file in each row into many binary parts and then coalesce them somewhere downstream?

@ericl
Contributor Author

ericl commented Oct 5, 2023

SplitBlocks works within the read task to split the read output into multiple smaller pieces. These will remain as smaller individual blocks for the remainder of the computation unless the dataset is explicitly repartitioned.

Ray Data will automatically insert SplitBlocks to ensure the desired/autodetected parallelism is met after a read.
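
To make the idea concrete for the example in the question (20 files, parallelism 200), here is a minimal sketch of splitting one read task's output into K pieces. It is an illustration only, not Ray Data's SplitBlocks code: `split_block` is a hypothetical helper, the output is modeled as a plain sequence of rows, and the assumption that each of the 20 read tasks splits evenly into 200 / 20 = 10 pieces is mine rather than something stated in this thread.

```python
from typing import List, Sequence


def split_block(rows: Sequence, k: int) -> List[list]:
    """Split one block's rows into up to `k` contiguous, near-equal pieces.

    Hypothetical helper: pieces differ in size by at most one row, and
    empty pieces are skipped when there are fewer rows than `k`.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    base, extra = divmod(len(rows), k)
    pieces: List[list] = []
    start = 0
    for i in range(k):
        # The first `extra` pieces take one additional row each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        pieces.append(list(rows[start:start + size]))
        start += size
    return pieces


if __name__ == "__main__":
    # One read task's output (1,000 rows here) split into 10 smaller blocks.
    read_output = list(range(1000))
    pieces = split_block(read_output, k=10)
    print(len(pieces), [len(p) for p in pieces])
```

In this sketch the pieces are simply returned as separate blocks and nothing coalesces them again, which matches the description above that they remain individual blocks unless the dataset is explicitly repartitioned.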

@anyscalesam anyscalesam added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Nov 8, 2023