Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Fix Parquet partition filter bug #40947

Merged

Conversation

wingkitlee0
Copy link
Contributor

@wingkitlee0 wingkitlee0 commented Nov 4, 2023

Why are these changes needed?

This PR fixes some edge cases described in #40945 and #40946

Related issue number

Closes #40945 and #40946

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@wingkitlee0
Copy link
Contributor Author

Probably need advice on where to add unit test. test_consumption.py?

Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!! Could you add tests totest_parquet?

python/ray/data/datasource/parquet_datasource.py Outdated Show resolved Hide resolved
@wingkitlee0 wingkitlee0 force-pushed the read-parquet-fix-small-bugs branch 2 times, most recently from 22c9572 to 5b3f422 Compare November 8, 2023 04:12
@bveeramani bveeramani self-assigned this Nov 14, 2023
Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Let's merge once premerge passes

@bveeramani
Copy link
Member

@wingkitlee0 looks premerge is failing because test_parquet_read_partitioned_with_partition_filter doesn't work with Arrow 6

@wingkitlee0
Copy link
Contributor Author

@wingkitlee0 looks premerge is failing because test_parquet_read_partitioned_with_partition_filter doesn't work with Arrow 6

@bveeramani I will take a deeper look later because pyarrow dataset api looks different in version 6.x

@wingkitlee0
Copy link
Contributor Author

wingkitlee0 commented Nov 16, 2023

@bveeramani

To recap, one of the bugs is about the failure to detect partitioning when using one file or a list of one file (after filtering). In recent pyarrow versions, this is not an issue. Hence, this PR is to fix Ray.
However, in the older pyarrow 6, it actually had the same bug: https://github.com/apache/arrow/blob/release-6.0.1/python/pyarrow/parquet.py#L1686
So the unit test for partition_filter is expected to fail and Ray is consistent with pyarrow regarding this bug.

Maybe this test should simply be skipped for pyarrow 6

@bveeramani bveeramani changed the title [data] fix two small bugs [Data] Fix Parquet partition filter bug Nov 18, 2023
@bveeramani
Copy link
Member

@wingkitlee0 okay, sounds good. Could we pytest.skip the test when the PyArrow version is 6, and add a note explaining why we're skipping?

Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Kit Lee <wklee4993@gmail.com>
@wingkitlee0
Copy link
Contributor Author

@wingkitlee0 okay, sounds good. Could we pytest.skip the test when the PyArrow version is 6, and add a note explaining why we're skipping?

@bveeramani Done. Thanks

@bveeramani bveeramani merged commit c51acd4 into ray-project:master Nov 21, 2023
13 of 16 checks passed
@wingkitlee0 wingkitlee0 deleted the read-parquet-fix-small-bugs branch November 21, 2023 23:57
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023
This PR fixes some edge cases described in ray-project#40945 and ray-project#40946

Signed-off-by: Kit Lee <wklee4993@gmail.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[data] ArrowInvalid error when selecting only a subset of partition columns in read_parquet or similar
2 participants