For a dataset comprised of both empty and non-empty blocks, let the non-empty blocks determine the schema #22834

jianoaix · 2022-03-05T00:46:02Z

Why are these changes needed?

There is a bug in combining the results from map_batches(): if we create two datasets from the same data but with different numbers of partitions, we may get different results when running the same map_batches() on them. That is, the number of partitions affects the map_batches() results, which it should not.
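
A toy model of the symptom, not Ray's code. The names `split`, `block_schema`, `naive_dataset_schema`, `keep_large`, and the `DEFAULT_EMPTY_SCHEMA` placeholder are hypothetical, and the sketch assumes that the dataset-level schema is read from the first block while an empty block reports a default schema. Under those assumptions, the same rows and the same batch function report different schemas depending only on the partition count:

```python
from typing import Dict, List, Tuple, Union

Row = Dict[str, int]
Block = List[Row]
Schema = Union[str, Tuple[str, ...]]

# Hypothetical placeholder for whatever schema an empty block reports by default.
DEFAULT_EMPTY_SCHEMA = "default-empty-schema"


def split(rows: List[Row], num_blocks: int) -> List[Block]:
    """Split rows into num_blocks contiguous blocks."""
    size = -(-len(rows) // num_blocks)  # ceiling division
    return [rows[i * size:(i + 1) * size] for i in range(num_blocks)]


def block_schema(block: Block) -> Schema:
    """Column names of a non-empty block, or the default for an empty one."""
    if not block:
        return DEFAULT_EMPTY_SCHEMA
    return tuple(sorted(block[0]))


def naive_dataset_schema(blocks: List[Block]) -> Schema:
    """Assumed buggy rule: trust whichever block comes first."""
    return block_schema(blocks[0])


def keep_large(batch: Block) -> Block:
    """Batch function that may return an empty batch."""
    return [row for row in batch if row["x"] >= 2]


rows = [{"x": i} for i in range(4)]
one_block = [keep_large(b) for b in split(rows, 1)]
two_blocks = [keep_large(b) for b in split(rows, 2)]

# Same data, same function, different partition counts.
one_block_schema = naive_dataset_schema(one_block)
two_block_schema = naive_dataset_schema(two_blocks)
```

With two partitions the first output block is empty, so the toy rule reports the default schema instead of the one carried by the rows that survived.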

Related issue number

Closes #22673

Checks

- [Y] I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [Y] Unit tests
  - Release tests
  - This PR is not tested :(


jianoaix · 2022-03-05T00:48:44Z

I realize I should have created a new branch for a different issue. The previous commit logs make it a bit messy.

clarkzinzow

Nice!

jjyao · 2022-03-07T18:24:40Z

Could you explain the root cause of the issue? Schema of empty block is wrong? Should we always set schema of empty block to None?

jianoaix · 2022-03-07T18:32:58Z

> Could you explain the root cause of the issue? Schema of empty block is wrong? Should we always set schema of empty block to None?

The schema of an empty block defaults to pyarrow (see #22673 (comment)).

jjyao · 2022-03-07T18:58:03Z

If we don't want to trust the schema of an empty block even when it's set, should we just set it to None in the first place? Otherwise, won't we see inconsistent schemas for blocks in the same dataset?

jianoaix · 2022-03-07T19:30:33Z

The block is of type Union[List[T], "pyarrow.Table", "pandas.DataFrame", bytes], so even if it's empty, it already has a type and hence an implied schema type.
Ideally, if we knew the output block type, we could pass it around and then use BlockBuilder.for_block_type(block_type).build() to create an empty block, which would have the right type and schema even though it's empty. I initially suggested this approach (in the comment linked above), but it looks like we might not know the output type, so here we use the non-empty block's schema.
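
A minimal sketch of the "non-empty blocks determine the schema" rule, not the code in this PR. `ToyBlock` and `dataset_schema` are hypothetical names, schemas are plain strings for illustration, and the fallback for an all-empty dataset is an assumption the thread does not spell out:

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ToyBlock:
    """Hypothetical stand-in for a block: its rows plus the schema it reports."""
    schema: Optional[str]
    rows: List[Any] = field(default_factory=list)


def dataset_schema(blocks: List[ToyBlock]) -> Optional[str]:
    """Prefer the schema of the first non-empty block.

    An empty block still has a type, so it still reports a schema, but that
    schema may just be a default and is only used as a last resort here.
    """
    for block in blocks:
        if block.rows and block.schema is not None:
            return block.schema
    # Assumption: if every block is empty, any reported schema is acceptable.
    for block in blocks:
        if block.schema is not None:
            return block.schema
    return None


blocks = [
    ToyBlock(schema="default"),                  # empty, default schema
    ToyBlock(schema="pandas", rows=[{"x": 1}]),  # non-empty, real schema
]
chosen = dataset_schema(blocks)
all_empty = dataset_schema([ToyBlock(schema="default")])
```

The design choice is that the result no longer depends on whether an empty block happens to come first.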

ericl · 2022-03-07T22:44:29Z

Seems tests are failing.

jianoaix · 2022-03-07T23:04:16Z

> Seems tests are failing.

Yeah. As I was working on fixing it, I realized that it is not feasible to rely on num_rows in the block metadata to determine whether a block is empty. IIUC, we do not always know num_rows because we employ lazy execution/lazy data loading.
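
One way to cope with an unknown row count, sketched under assumptions rather than taken from this PR: treat a missing `num_rows` as "possibly non-empty" and skip only blocks known to be empty. `ToyBlockMetadata` and `select_schema` are hypothetical names, and schemas are plain strings for illustration:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyBlockMetadata:
    """Hypothetical metadata record; num_rows is None when not yet computed."""
    num_rows: Optional[int]
    schema: Optional[str]


def select_schema(metadata: List[ToyBlockMetadata]) -> Optional[str]:
    """Return the first schema from a block that is not known to be empty."""
    for m in metadata:
        known_empty = m.num_rows is not None and m.num_rows == 0
        if m.schema is not None and not known_empty:
            return m.schema
    return None


lazy = [
    ToyBlockMetadata(num_rows=0, schema="default"),    # known empty: skipped
    ToyBlockMetadata(num_rows=None, schema="pandas"),  # unknown size: trusted
]
picked = select_schema(lazy)
```

The trade-off in this sketch is that a lazily loaded block which later turns out to be empty can still supply the schema, since its emptiness cannot be known without executing it.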

jianoaix added 13 commits February 28, 2022 14:33

Support map_groups in dataset

138fabe

address comments

f85c66c

add test for sort

0c041a6

Merge branch 'master' of https://github.com/ray-project/ray into groupby

a96f056

fix comments

22be926

Merge branch 'master' of https://github.com/ray-project/ray into groupby

7e2db5b

fix

c49234f

address comments

9d8cc62

fix

6dc7c0d

fix lint

b4e3381

lint

228e65d

Merge branch 'master' of https://github.com/ray-project/ray into groupby

0a1e5cf

For a dataset comprised of both empty and non-empty blocks, let the n…

ebc5948

…on-empty blocks determine the schema

jianoaix requested review from ericl, scv119 and clarkzinzow as code owners March 5, 2022 00:46

jianoaix assigned ericl Mar 5, 2022

jianoaix requested a review from jjyao as a code owner March 5, 2022 00:46

jianoaix assigned clarkzinzow Mar 5, 2022

clarkzinzow approved these changes Mar 7, 2022


fix test

b730a3a

jjyao approved these changes Mar 7, 2022


Merge branch 'master' of https://github.com/ray-project/ray into groupby

220abf5

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 7, 2022

fix test

3cde9c1

jianoaix removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 8, 2022

ericl merged commit c2908de into ray-project:master Mar 8, 2022
