
[BUG] Spark cannot do predicate push down on INTs and LONGs parquet columns written by CUDF #11626

Closed
revans2 opened this issue Aug 30, 2022 · 4 comments · Fixed by #11627
Labels
bug Something isn't working Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Aug 30, 2022

Describe the bug
You are going to love this one. To be clear, this is not really a bug in CUDF; I think this is a bug in Spark, and I will file an issue there for it. This issue is asking whether we can work around the problem in CUDF, because it is going to be a while before any fix goes into Spark, and even longer before customers are able to upgrade to a version of Spark with the fix in it.

Regarding signed numbers, the Parquet format specification says...

INT(8, true), INT(16, true), and INT(32, true) must annotate an int32 primitive type and INT(64, true) must annotate an int64 primitive type. INT(32, true) and INT(64, true) are implied by the int32 and int64 primitive types if no other annotation is present and should be considered optional.

The CUDF writer adds the INT(32, true) and INT(64, true) annotations to the int32 and int64 columns that it writes. You can see this by using the parquet-cli tool to dump the footer of a file.

Schema:
message schema {
  required int64 a (INTEGER(64,true));
  required int32 b (INTEGER(32,true));
}
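
The same footer can be inspected without parquet-cli. Below is a minimal sketch, assuming cudf and pyarrow are installed; the file name and data are illustrative only.

    # Write int64/int32 columns with cudf, then print the Parquet schema that
    # pyarrow reads back; any logical/converted type annotations on the columns
    # (e.g. Int(bitWidth=64, isSigned=true)) show up in the printed schema.
    import cudf
    import pyarrow.parquet as pq

    df = cudf.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
    df["a"] = df["a"].astype("int64")
    df["b"] = df["b"].astype("int32")
    df.to_parquet("cudf_ints.parquet")

    print(pq.ParquetFile("cudf_ints.parquet").schema)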

The Spark writer does not include those extra metadata tags.

Schema:
message spark_schema {
  required int64 a;
  required int32 b;
}

Both should be fine according to the spec. But when Spark sets up filters for predicate push down, like a > 500 and b < 5, and translates the Spark filter into a Parquet filter, the extra metadata on the integer column does not match what Spark expects, and Spark ends up not filtering anything. This results in a lot of extra data being read when we could have skipped over entire row groups. Could we please stop inserting this metadata into the footer for columns of type INT32 and INT64? All of the other types appear to be doing what Spark wants.
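
For context, this is a sketch of the kind of query involved, assuming PySpark; the path and threshold values are illustrative, not taken from a real workload.

    # When predicate push down works, Spark uses Parquet row-group statistics to
    # skip row groups that cannot match these filters. When the filter
    # translation fails, the predicates are silently dropped and every row
    # group is read.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("cudf_ints.parquet").where("a > 500 AND b < 5")
    df.explain()  # the PushedFilters entry shows whether predicates reached Parquet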

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Aug 30, 2022
@github-actions github-actions bot added this to Needs prioritizing in Bug Squashing Aug 30, 2022
@revans2
Contributor Author

revans2 commented Aug 30, 2022

I filed https://issues.apache.org/jira/browse/SPARK-40280 in Spark for this issue.

@revans2
Contributor Author

revans2 commented Aug 30, 2022

Just FYI, I tested a file written by pandas/pyarrow and its schema looks like Spark's, with no extra annotations.

Schema:
message schema {
  optional int64 a;
  optional int32 b;
}
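
A minimal sketch of that comparison, assuming pandas with the pyarrow engine; the file name and data are illustrative only.

    # pandas/pyarrow writes int32 and int64 columns with no converted type
    # annotation in the footer, matching Spark's output.
    import pandas as pd
    import pyarrow.parquet as pq

    pdf = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
    pdf["a"] = pdf["a"].astype("int64")
    pdf["b"] = pdf["b"].astype("int32")
    pdf.to_parquet("pandas_ints.parquet", engine="pyarrow")

    print(pq.ParquetFile("pandas_ints.parquet").schema)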

@etseidl
Contributor

etseidl commented Aug 30, 2022

I'm guessing this is due to changes I made in #11302. I needed the converted type to figure out which field from the stats union to use when deciding column ordering for the column index. Looking back, I changed the logic to always assume signed integers unless the converted type is unsigned, so col_schema.converted_type can probably now be left as UNKNOWN for INT32 and INT64.

Testing now.

@hyperbolic2346
Contributor

@etseidl That comment was an emotional rollercoaster. Thanks for the clarification there. I was remembering that PR as I read the comment, but had forgotten it!

Bug Squashing automation moved this from Needs prioritizing to Closed Aug 31, 2022
rapids-bot bot pushed a commit that referenced this issue Aug 31, 2022
…11627)

As brought up in #11626, writing the converted type on INT32 and INT64 columns is not out of spec, but it is unusual among Parquet writers. This change brings cudf's writer in line with pandas and Spark by not including converted type information for these types.

closes #11626

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Jim Brennan (https://github.com/jbrennan333)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11627
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024