
Fixed page data truncation in parquet writer under certain conditions. #15474

Merged

Conversation

@nvdbaranec (Contributor) commented Apr 5, 2024

Fixes #15473

The issue is that in some cases, for example where we have all nulls, we can fail to update the size of the page output buffer, resulting in a missing byte expected by some readers. Specifically, we poke the value of dict_bits into the output buffer here:

dst[0] = dict_bits;

But if we have no leaf values (for example, because everything in the page is null), s->cur never gets updated here, because we never enter the containing loop:

if (t == 0) { s->cur = s->rle_out; }

The fix is to always update s->cur after this if-else block:

if (dict_bits >= 0 && physical_type != BOOLEAN) {

Note that this was already handled by our reader, but some third-party readers (Trino) expect that data to be there and crash if it's not.
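To make the control flow concrete, here is a minimal host-side sketch of the fixed logic. The names `page_state` and `encode_dict_page` are illustrative stand-ins, not the actual cudf kernel code, which keeps `cur` and `rle_out` as device pointers inside the page-encoding kernel:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the writer's page-encoding state.
struct page_state {
  uint8_t* cur;      // current end of valid page output
  uint8_t* rle_out;  // cursor advanced by the RLE leaf-value encoder
};

// Sketch of the fixed flow: the dict_bits byte is always written, and `cur`
// is always advanced past it, even when the page has no leaf values
// (e.g. an all-null page) and the encoding loop never runs.
size_t encode_dict_page(uint8_t* dst, int dict_bits, size_t num_leaf_values)
{
  page_state s{dst, dst};
  dst[0]    = static_cast<uint8_t>(dict_bits);  // byte some readers require
  s.rle_out = dst + 1;
  for (size_t i = 0; i < num_leaf_values; ++i) {
    // ... RLE-encode dictionary indices, advancing s.rle_out ...
    s.cur = s.rle_out;  // before the fix, this was the only update to cur
  }
  s.cur = s.rle_out;  // the fix: update cur unconditionally after the block
  return static_cast<size_t>(s.cur - dst);  // reported page data size
}
```

With zero leaf values the loop body never runs, so without the final unconditional update the returned size would be 0 and the dict_bits byte would be truncated from the page.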

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@nvdbaranec nvdbaranec added the labels bug, libcudf, cuIO, Spark, 5 - DO NOT MERGE, and non-breaking on Apr 5, 2024
@nvdbaranec nvdbaranec requested a review from a team as a code owner April 5, 2024 19:43
@nvdbaranec (Contributor Author)

@vuule Need expert eyes here. I'm not an expert on the writer.

@nvdbaranec (Contributor Author)

One additional question here is: how might we add a test for this? Our reader handles it just fine. The only symptom is a crash in a third-party reader. Maybe it's not necessary if we can verify this fixes things for the external use case.

@nvdbaranec nvdbaranec changed the title Fixed an issue where in some cases we would end up not writing out th… Fixed page data truncation in parquet writer under certain conditions. Apr 5, 2024
@nvdbaranec nvdbaranec requested a review from vuule April 5, 2024 20:32
@ttnghia (Contributor) commented Apr 5, 2024

> One additional question here is: how might we add a test for this? Our reader handles it just fine. The only symptom is a crash in a third-party reader. Maybe it's not necessary if we can verify this fixes things for the external use case.

How about pandas? Can pandas handle such a situation like our reader does?

@nvdbaranec (Contributor Author)

> One additional question here is: how might we add a test for this? Our reader handles it just fine. The only symptom is a crash in a third-party reader. Maybe it's not necessary if we can verify this fixes things for the external use case.

> How about pandas? Can pandas handle such a situation like our reader does?

I'll give it a try. I'm guessing the answer is yes though, since we've been generating data like this for a long time now and it would almost certainly have shown up in the python tests.

@nvdbaranec (Contributor Author)

Spark integration tests and cudf parquet+compute-sanitizer tests passed.

@nvdbaranec nvdbaranec added 5 - DO NOT MERGE Hold off on merging; see PR for details and removed 5 - DO NOT MERGE Hold off on merging; see PR for details labels Apr 5, 2024
@nvdbaranec (Contributor Author)

> One additional question here is: how might we add a test for this? Our reader handles it just fine. The only symptom is a crash in a third-party reader. Maybe it's not necessary if we can verify this fixes things for the external use case.

> How about pandas? Can pandas handle such a situation like our reader does?

> I'll give it a try. I'm guessing the answer is yes though, since we've been generating data like this for a long time now and it would almost certainly have shown up in the python tests.

Yeah, Pandas handles the problem files just fine.

@etseidl (Contributor) commented Apr 6, 2024

Seems like a bug in the reader that's failing. The page header will have the correct size if it leaves off the 0 for dict bits, so really the reader is reading beyond the end of the page. As usual, the parquet spec is silent on whether that single byte needs to be there if there's no data present.

> One additional question here is: how might we add a test for this?

A test could write a null column, then pull the header for the first data page and check that the uncompressed_page_size is 1. Check for use of read_page_header() in parquet_writer_test.cpp for examples.

Edit: uncompressed_page_size will also have to account for 4 bytes of definition level length indicator plus the number of bytes to encode num_values zeros, plus the one byte for dict_bits. FWIW arrow writes the 0, so I guess we should too.
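The expected size from that edit can be sketched as arithmetic. This is an illustrative helper, not code from the PR; it assumes the definition levels are written as a single RLE run with bit width 1 (one value byte), preceded by a 4-byte int32 length prefix, per the comment above:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes needed to ULEB128-encode a value (used by the RLE run
// header in Parquet's RLE/bit-packed hybrid encoding).
size_t uleb128_size(uint64_t v)
{
  size_t n = 1;
  while (v >= 0x80) { v >>= 7; ++n; }
  return n;
}

// Hypothetical expected uncompressed_page_size for the first data page of an
// all-null column: 4 bytes of definition-level length prefix, one RLE run of
// num_values zeros (header varint for run_len << 1, plus one value byte for
// bit width 1), plus the single dict_bits byte this PR guarantees is written.
size_t expected_all_null_page_size(uint64_t num_values)
{
  size_t const def_level_len_prefix = 4;                     // int32 length
  size_t const rle_run = uleb128_size(num_values << 1) + 1;  // header + value
  size_t const dict_bits_byte = 1;                           // the fixed byte
  return def_level_len_prefix + rle_run + dict_bits_byte;
}
```

For example, a 100-row all-null page would come to 4 + 3 + 1 = 8 bytes under these assumptions, which is the kind of value a `read_page_header()`-based test in parquet_writer_test.cpp could assert against.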

@mhaseeb123 (Member) left a comment

Agreed with @etseidl here: if Arrow writes the 0 in this case, we can do it too without impacting any other functionality.

@nvdbaranec nvdbaranec removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Apr 9, 2024
@nvdbaranec (Contributor Author)

/merge

@rapids-bot rapids-bot bot merged commit 3b48f8b into rapidsai:branch-24.06 Apr 9, 2024
69 checks passed
This pull request was closed.
Development

Successfully merging this pull request may close these issues.

[BUG] Trino cannot read map column written by GPU