Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded #14392

Merged
merged 15 commits into from
Dec 6, 2023

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Nov 9, 2023

Description

The current implementation of the DELTA_BINARY_PACKED encoder can in certain circumstances use more bits for encoding than are present in the data. Specifically, for INT32 data, if the range of values is large enough, the encoder will use up to 33 bits. While not a violation of the Parquet specification, this does cause problems with certain parquet readers (see this arrow-cpp issue, for instance). libcudf and parquet-mr have no issue with reading data encoded with 33 bits, but in the interest of a) greater interoperability and b) smaller files, this PR changes the DELTA_BINARY_PACKED encoder to use no more bits than are present in the physical type being encoded (32 for INT32, 64 for INT64).

The actual change made here is to perform all the delta computations using 32-bit integers for INT32 data, and 64-bit integers for INT64 data. The prior encoder used 64-bit integers exclusively. To deal with possible overflow, all math is done on unsigned integers to get defined wrapping behavior (overflow with signed integers is UB), with results cast back to signed afterwards. This is in line with the approach taken by parquet-mr, arrow-cpp, and arrow-rs.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested review from a team as code owners November 9, 2023 22:12
@etseidl etseidl requested review from shwina, galipremsagar, harrism and bdice and removed request for a team November 9, 2023 22:12
Copy link

copy-pr-bot bot commented Nov 9, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Nov 9, 2023
@GregoryKimball
Copy link
Contributor

Thank you @etseidl for contributing this. (Part of #13501, to target 24.02 release)

@vuule vuule self-requested a review November 15, 2023 19:01
@vuule vuule self-assigned this Nov 17, 2023
@vuule vuule added cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 17, 2023
@vuule
Copy link
Contributor

vuule commented Nov 17, 2023

/ok to test

@harrism harrism removed their request for review November 21, 2023 07:13
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, if a bit odd :D
Can you please add a note in the description about the 2s complement math going on with this change?

@PointKernel PointKernel self-requested a review December 5, 2023 18:24
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

cpp/src/io/parquet/delta_enc.cuh Show resolved Hide resolved
@PointKernel
Copy link
Member

/ok to test

@ttnghia ttnghia self-requested a review December 5, 2023 23:05
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this!

@vuule
Copy link
Contributor

vuule commented Dec 6, 2023

@ttnghia should we wait on your review before merging? I see that you self-requested a review.

@ttnghia
Copy link
Contributor

ttnghia commented Dec 6, 2023

I'm fine to merge it now. If it is still here then I can take a look later tonight.

@etseidl
Copy link
Contributor Author

etseidl commented Dec 6, 2023

Don't we need a python review?

@vuule
Copy link
Contributor

vuule commented Dec 6, 2023

Don't we need a python review?

Well, NOW we do! :D

Thanks for the reminder (I don't see github requiring the python review). I'll ping folks tomorrow :)

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python approval. 🐍 ✅

@vuule
Copy link
Contributor

vuule commented Dec 6, 2023

/merge

@rapids-bot rapids-bot bot merged commit d97b3e0 into rapidsai:branch-24.02 Dec 6, 2023
70 checks passed
@etseidl etseidl deleted the delta_bitwidth branch December 6, 2023 17:09
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
…hysical type being encoded (rapidsai#14392)

The current implementation of the DELTA_BINARY_PACKED encoder can in certain circumstances use more bits for encoding than are present in the data. Specifically, for INT32 data, if the range of values is large enough, the encoder will use up to 33 bits. While not a violation of the Parquet specification, this does cause problems with certain parquet readers (see [this](apache/arrow#20374) arrow-cpp issue, for instance). libcudf and parquet-mr have no issue with reading data encoded with 33 bits, but in the interest of a) greater interoperability and b) smaller files, this PR changes the DELTA_BINARY_PACKED encoder to use no more bits than are present in the physical type being encoded (32 for INT32, 64 for INT64).

The actual change made here is to perform all the delta computations using 32-bit integers for INT32 data, and 64-bit integers for INT64 data. The prior encoder used 64-bit integers exclusively. To deal with possible overflow, all math is done on unsigned integers to get defined wrapping behavior (overflow with signed integers is UB), with results cast back to signed afterwards. This is in line with the approach taken by parquet-mr, arrow-cpp, and arrow-rs.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#14392
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

7 participants