Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For a dataset comprised of both empty and non-empty blocks, let the non-empty blocks determine the schema #22834

Merged
merged 16 commits into from
Mar 8, 2022

Conversation

jianoaix
Copy link
Contributor

@jianoaix jianoaix commented Mar 5, 2022

Why are these changes needed?

There is a bug in combining the results from map_batches: if we create two dataset out of the same data, but with different num of partitions, we may get different results when run the same map_batches() on them. That is, num of partitions is affecting the map_batches() results, which should not.

Related issue number

Closes #22673

Checks

  • [ Y] I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ Y] Unit tests
    • Release tests
    • This PR is not tested :(

@jianoaix
Copy link
Contributor Author

jianoaix commented Mar 5, 2022

I realize I should have created a new branch for a different issue. The previous commit logs make it a bit messy.

Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@jjyao
Copy link
Collaborator

jjyao commented Mar 7, 2022

Could you explain the root cause of the issue? Schema of empty block is wrong? Should we always set schema of empty block to None?

@jianoaix
Copy link
Contributor Author

jianoaix commented Mar 7, 2022

Could you explain the root cause of the issue? Schema of empty block is wrong? Should we always set schema of empty block to None?

The schema for empty block is default to pyarrow (#22673 (comment)).

@jjyao
Copy link
Collaborator

jjyao commented Mar 7, 2022

If we don't want to trust the schema of empty block even if it's set, should we just set to None in the first place? Otherwise we will see inconsistent schemas for blocks in the same dataset?

@jianoaix
Copy link
Contributor Author

jianoaix commented Mar 7, 2022

The block is of type Union[List[T], "pyarrow.Table", "pandas.DataFrame", bytes], so even if it's empty, it already has a type and hence an implied the schema type.
Ideally if we know the output block type, we might pass it around and then use BlockBuilder.for_block_type(block_type).build() to create an empty one, which will have the right type and schema, even if it's empty. I was initially suggesting this approach (in the comment linked above), but it looks we might not know the output type, so here we use non-empty block's schema.

@ericl
Copy link
Contributor

ericl commented Mar 7, 2022

Seems tests are failing.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 7, 2022
@jianoaix
Copy link
Contributor Author

jianoaix commented Mar 7, 2022

Seems tests are failing.

Yeah. As I was working on fixing it, I realized that it is not feasible to rely on num_rows in block metadata to determine whether a block is empty. IIUC, we do not necessarily always know num_rows because we employ lazy execution/lazy data loading.

@jianoaix jianoaix removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 8, 2022
@ericl ericl merged commit c2908de into ray-project:master Mar 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Dataset][Bug] map_batches() handles the combining of results from empty and non-empty batches incorrectly
4 participants