
[BUG] Spark cannot do predicate push down on INTs and LONGs parquet columns written by CUDF #11626

Closed
revans2 opened this issue Aug 30, 2022 · 4 comments · Fixed by #11627
Labels
bug Something isn't working Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Aug 30, 2022

Describe the bug
You are going to love this one. To be clear, this is not really a bug in CUDF; I think this is a bug in Spark, and I will file an issue there for it. This issue is asking whether we can work around the problem in CUDF, because it is going to be a while before any fix goes into Spark, and even longer before customers are able to upgrade to a version of Spark with the fix in it.

Regarding signed numbers, the Parquet format specification says...

INT(8, true), INT(16, true), and INT(32, true) must annotate an int32 primitive type and INT(64, true) must annotate an int64 primitive type. INT(32, true) and INT(64, true) are implied by the int32 and int64 primitive types if no other annotation is present and should be considered optional.

The CUDF writer adds the INT(32, true) and INT(64, true) annotations to the int32 and int64 columns that it writes. You can see this by using the parquet-cli tool to dump the footer of a file.

Schema:
message schema {
  required int64 a (INTEGER(64,true));
  required int32 b (INTEGER(32,true));
}
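
The same footer can be inspected without parquet-cli. Below is a minimal sketch, assuming cudf and pyarrow are installed; the file name and data are illustrative only.

    # Write int64/int32 columns with cudf, then print the Parquet schema that
    # pyarrow reads back; any logical/converted type annotations on the columns
    # (e.g. Int(bitWidth=64, isSigned=true)) show up in the printed schema.
    import cudf
    import pyarrow.parquet as pq

    df = cudf.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
    df["a"] = df["a"].astype("int64")
    df["b"] = df["b"].astype("int32")
    df.to_parquet("cudf_ints.parquet")

    print(pq.ParquetFile("cudf_ints.parquet").schema)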

The Spark writer does not include those extra metadata tags.

Schema:
message spark_schema {
  required int64 a;
  required int32 b;
}

Both should be fine according to the spec. But when Spark sets up filters for predicate push down, like a > 500 and b < 5, and translates the Spark filter into a Parquet filter, the extra metadata on the integer column does not match what Spark expects, and Spark ends up not filtering anything. This results in a lot of extra data being read when we could have skipped over entire row groups. Could we please stop inserting this metadata into the footer for columns of type INT32 and INT64? All of the other types appear to be doing what Spark wants.
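
For context, this is a sketch of the kind of query involved, assuming PySpark; the path and threshold values are illustrative, not taken from a real workload.

    # When predicate push down works, Spark uses Parquet row-group statistics to
    # skip row groups that cannot match these filters. When the filter
    # translation fails, the predicates are silently dropped and every row
    # group is read.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("cudf_ints.parquet").where("a > 500 AND b < 5")
    df.explain()  # the PushedFilters entry shows whether predicates reached Parquet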

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Aug 30, 2022
@github-actions github-actions bot added this to Needs prioritizing in Bug Squashing Aug 30, 2022
@revans2
Contributor Author

revans2 commented Aug 30, 2022

I filed https://issues.apache.org/jira/browse/SPARK-40280 in Spark for this issue.

@revans2
Contributor Author

revans2 commented Aug 30, 2022

Just FYI, I tested a file written by pandas/pyarrow and its schema looks like Spark's, with no extra annotations.

Schema:
message schema {
  optional int64 a;
  optional int32 b;
}
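
A minimal sketch of that comparison, assuming pandas with the pyarrow engine; the file name and data are illustrative only.

    # pandas/pyarrow writes int32 and int64 columns with no converted type
    # annotation in the footer, matching Spark's output.
    import pandas as pd
    import pyarrow.parquet as pq

    pdf = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
    pdf["a"] = pdf["a"].astype("int64")
    pdf["b"] = pdf["b"].astype("int32")
    pdf.to_parquet("pandas_ints.parquet", engine="pyarrow")

    print(pq.ParquetFile("pandas_ints.parquet").schema)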

@etseidl
Contributor

etseidl commented Aug 30, 2022

I'm guessing this is due to changes I made in #11302. I needed the converted type to figure out which field from the stats union to use when deciding column ordering for the column index. Looking back, I changed the logic to always assume signed integers unless the converted type is unsigned, so col_schema.converted_type can probably now be left as UNKNOWN for INT32 and INT64.

Testing now.

@hyperbolic2346
Contributor

@etseidl That comment was an emotional rollercoaster. Thanks for the clarification there. I was remembering that PR as I read the comment, but had forgotten it!

Bug Squashing automation moved this from Needs prioritizing to Closed Aug 31, 2022
rapids-bot bot pushed a commit that referenced this issue Aug 31, 2022
…11627)

As brought up in #11626, writing the converted type on INT32 and INT64 columns is not out of spec, but it is unusual among Parquet writers. This change brings cudf's writer in line with pandas and Spark by not including converted type information for these types.

closes #11626

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Jim Brennan (https://github.com/jbrennan333)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11627
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024