
Fixed page data truncation in parquet writer under certain conditions. #15474

Merged

Conversation

@nvdbaranec (Contributor) commented Apr 5, 2024

Fixes #15473

The issue is that in some cases, for example where we have all nulls, we can fail to update the size of the page output buffer, resulting in a missing byte expected by some readers. Specifically, we poke the value of dict_bits into the output buffer here:

dst[0] = dict_bits;

But if we have no leaf values (for example, because everything in the page is null), s->cur never gets updated here, because we never enter the containing loop:

if (t == 0) { s->cur = s->rle_out; }

The fix is to always update s->cur after this if-else block:

if (dict_bits >= 0 && physical_type != BOOLEAN) {

Note that this was already handled by our reader, but some third-party readers (Trino) expect that data to be there and crash if it's not.
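To make the control flow concrete, here is a minimal host-side sketch of the fixed logic. The names `page_state` and `encode_dict_page` are illustrative stand-ins, not the actual cudf kernel code, which keeps `cur` and `rle_out` as device pointers inside the page-encoding kernel:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the writer's page-encoding state.
struct page_state {
  uint8_t* cur;      // current end of valid page output
  uint8_t* rle_out;  // cursor advanced by the RLE leaf-value encoder
};

// Sketch of the fixed flow: the dict_bits byte is always written, and `cur`
// is always advanced past it, even when the page has no leaf values
// (e.g. an all-null page) and the encoding loop never runs.
size_t encode_dict_page(uint8_t* dst, int dict_bits, size_t num_leaf_values)
{
  page_state s{dst, dst};
  dst[0]    = static_cast<uint8_t>(dict_bits);  // byte some readers require
  s.rle_out = dst + 1;
  for (size_t i = 0; i < num_leaf_values; ++i) {
    // ... RLE-encode dictionary indices, advancing s.rle_out ...
    s.cur = s.rle_out;  // before the fix, this was the only update to cur
  }
  s.cur = s.rle_out;  // the fix: update cur unconditionally after the block
  return static_cast<size_t>(s.cur - dst);  // reported page data size
}
```

With zero leaf values the loop body never runs, so without the final unconditional update the returned size would be 0 and the dict_bits byte would be truncated from the page.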

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@nvdbaranec nvdbaranec added the labels bug, libcudf, cuIO, Spark, 5 - DO NOT MERGE, and non-breaking on Apr 5, 2024
@nvdbaranec nvdbaranec requested a review from a team as a code owner April 5, 2024 19:43
@nvdbaranec (Contributor Author)

@vuule Need expert eyes here. I'm not an expert on the writer.

@nvdbaranec (Contributor Author)

One additional question here is: how might we add a test for this? Our reader handles it just fine. The only symptom is a crash in a third-party reader. Maybe it's not necessary if we can verify this fixes things for the external use case.

@nvdbaranec nvdbaranec changed the title Fixed an issue where in some cases we would end up not writing out th… Fixed page data truncation in parquet writer under certain conditions. Apr 5, 2024
@nvdbaranec nvdbaranec requested a review from vuule April 5, 2024 20:32
@ttnghia (Contributor) commented Apr 5, 2024

> One additional question here is: how might we add a test for this? Our reader handles it just fine. The only symptom is a crash in a third-party reader. Maybe it's not necessary if we can verify this fixes things for the external use case.

How about pandas? Can pandas handle such a situation like our reader does?

@nvdbaranec (Contributor Author)

> One additional question here is: how might we add a test for this? Our reader handles it just fine. The only symptom is a crash in a third-party reader. Maybe it's not necessary if we can verify this fixes things for the external use case.

> How about pandas? Can pandas handle such a situation like our reader does?

I'll give it a try. I'm guessing the answer is yes though, since we've been generating data like this for a long time now and it would almost certainly have shown up in the python tests.

@nvdbaranec (Contributor Author)

Spark integration tests and cudf parquet+compute-sanitizer tests passed.

@nvdbaranec nvdbaranec added 5 - DO NOT MERGE Hold off on merging; see PR for details and removed 5 - DO NOT MERGE Hold off on merging; see PR for details labels Apr 5, 2024
@nvdbaranec (Contributor Author)

> One additional question here is: how might we add a test for this? Our reader handles it just fine. The only symptom is a crash in a third-party reader. Maybe it's not necessary if we can verify this fixes things for the external use case.

> How about pandas? Can pandas handle such a situation like our reader does?

> I'll give it a try. I'm guessing the answer is yes though, since we've been generating data like this for a long time now and it would almost certainly have shown up in the python tests.

Yeah, Pandas handles the problem files just fine.

@etseidl (Contributor) commented Apr 6, 2024

Seems like a bug in the reader that's failing. The page header will have the correct size if it leaves off the 0 for dict bits, so really the reader is reading beyond the end of the page. As usual, the parquet spec is silent on whether that single byte needs to be there if there's no data present.

> One additional question here is: how might we add a test for this?

A test could write a null column, then pull the header for the first data page and check that the uncompressed_page_size is 1. Check for use of read_page_header() in parquet_writer_test.cpp for examples.

Edit: uncompressed_page_size will also have to account for 4 bytes of definition level length indicator plus the number of bytes to encode num_values zeros, plus the one byte for dict_bits. FWIW arrow writes the 0, so I guess we should too.
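The expected size from that edit can be sketched as arithmetic. This is an illustrative helper, not code from the PR; it assumes the definition levels are written as a single RLE run with bit width 1 (one value byte), preceded by a 4-byte int32 length prefix, per the comment above:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes needed to ULEB128-encode a value (used by the RLE run
// header in Parquet's RLE/bit-packed hybrid encoding).
size_t uleb128_size(uint64_t v)
{
  size_t n = 1;
  while (v >= 0x80) { v >>= 7; ++n; }
  return n;
}

// Hypothetical expected uncompressed_page_size for the first data page of an
// all-null column: 4 bytes of definition-level length prefix, one RLE run of
// num_values zeros (header varint for run_len << 1, plus one value byte for
// bit width 1), plus the single dict_bits byte this PR guarantees is written.
size_t expected_all_null_page_size(uint64_t num_values)
{
  size_t const def_level_len_prefix = 4;                     // int32 length
  size_t const rle_run = uleb128_size(num_values << 1) + 1;  // header + value
  size_t const dict_bits_byte = 1;                           // the fixed byte
  return def_level_len_prefix + rle_run + dict_bits_byte;
}
```

For example, a 100-row all-null page would come to 4 + 3 + 1 = 8 bytes under these assumptions, which is the kind of value a `read_page_header()`-based test in parquet_writer_test.cpp could assert against.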

@mhaseeb123 (Member) left a comment

Agreed with @etseidl here: if Arrow writes the 0 in this case, we can do it too without impacting any other functionality.

@nvdbaranec nvdbaranec removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Apr 9, 2024
@nvdbaranec (Contributor Author)

/merge

@rapids-bot rapids-bot bot merged commit 3b48f8b into rapidsai:branch-24.06 Apr 9, 2024
69 checks passed
This pull request was closed.
Development

Successfully merging this pull request may close these issues.

[BUG] Trino cannot read map column written by GPU