
Parquet sub-rowgroup reading. #14360

Merged

Conversation

@nvdbaranec (Contributor) commented Nov 3, 2023

closes #14270

Implementation of sub-rowgroup reading of Parquet files. This PR implements an additional layer on top of the existing chunking system. Currently, the reader takes two parameters: input_pass_read_limit, which specifies a limit on temporary memory usage when reading and decompressing file data, and output_pass_read_limit, which specifies a limit on how large an output chunk (a table) can be.

Currently, when the user specifies a limit via input_pass_read_limit, the reader performs multiple passes over the file at row-group granularity. That is, it controls how many row groups it reads at once to conform to the specified limit.

However, there are cases where this is not sufficient, so this PR introduces subpasses below the top-level passes. It works as follows:

  • We read a set of input chunks based on the input_pass_read_limit but we do not decompress them immediately. This constitutes a pass.
  • Within each pass of compressed data, we progressively decompress batches of pages as subpasses.
  • Within each subpass we apply the output limit to produce chunks.

So the overall structure of the reader is: (read) pass -> (decompress) subpass -> (decode) chunk
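The three-level structure above can be sketched as a nested loop. This is a minimal illustration only: the function names (split_by_limit, chunk_file) and the parameters are stand-ins invented for this sketch, not the actual libcudf API, and the real reader divides work along page and row boundaries rather than raw byte counts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split `total` bytes into pieces no larger than `limit` (0 means unlimited).
std::vector<std::size_t> split_by_limit(std::size_t total, std::size_t limit)
{
  if (limit == 0 || total <= limit) { return {total}; }
  std::vector<std::size_t> pieces;
  while (total > 0) {
    std::size_t const piece = total < limit ? total : limit;
    pieces.push_back(piece);
    total -= piece;
  }
  return pieces;
}

// (read) pass -> (decompress) subpass -> (decode) chunk
std::vector<std::size_t> chunk_file(std::size_t file_bytes,
                                    std::size_t input_pass_limit,
                                    std::size_t subpass_decompress_limit,
                                    std::size_t output_chunk_limit)
{
  std::vector<std::size_t> chunk_sizes;
  // Pass: read a set of compressed row groups within the input limit.
  for (auto pass_bytes : split_by_limit(file_bytes, input_pass_limit)) {
    // Subpass: progressively decompress batches of pages from this pass.
    for (auto subpass_bytes : split_by_limit(pass_bytes, subpass_decompress_limit)) {
      // Chunk: apply the output limit to produce output tables.
      for (auto chunk_bytes : split_by_limit(subpass_bytes, output_chunk_limit)) {
        chunk_sizes.push_back(chunk_bytes);
      }
    }
  }
  return chunk_sizes;
}
```

The key property the sketch captures is that only one subpass worth of data needs to be decompressed at a time, while the compressed pass data stays resident.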

Major sections of code changes:

  • Previously, the incoming page data in the file was unsorted. To handle this we later produced a page_index array that could be applied to the page array to put the pages in schema-sorted order. This was getting very unwieldy, so the pages are now sorted up front and the page_index array has gone away.

  • There are now two sets of pages to be aware of in the code. Within each pass_intermediate_data there is the set of all pages within the current set of loaded row groups, and within the subpass_intermediate_data struct there is a separate array of pages representing the current batch of decompressed data being processed. To reduce confusion, a good amount of code was changed to always reference each array through its associated struct, i.e. pass.pages or subpass.pages. In addition, the page_info field was removed from ColumnChunkDesc to help prevent the kernels from getting confused; ColumnChunkDesc now only has a dict_page field, which is constant across all subpasses.

  • The primary entry point for the chunking mechanism is in handle_chunking. Here we iterate through passes, subpasses and output chunks. Successive subpasses are computed and preprocessed through here.

  • The volume of diffs you'll see in reader_impl_chunking.cu is a little deceptive: a lot of it is functions (or pieces of functions) moved over from reader_impl_preprocess.cu or reader_impl_helpers.cpp. The most relevant actual changes are in handle_chunking, compute_input_passes, compute_next_subpass, and compute_chunks_for_subpass.
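The pass/subpass page bookkeeping described above can be sketched roughly as follows. All struct and field names here are simplified stand-ins for the real cudf internals (the actual pass_intermediate_data, subpass_intermediate_data, and PageInfo carry much more state), and the greedy batch-selection policy is illustrative, not cudf's exact algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for the real PageInfo; fields are illustrative.
struct PageInfo {
  int schema_idx;               // which output column the page belongs to
  std::size_t compressed_size;  // bytes of compressed page data
};

// All pages for the currently loaded row groups. Pages are sorted into
// schema order once, up front, so no separate page_index remapping array
// is needed later.
struct pass_intermediate_data {
  std::vector<PageInfo> pages;  // always referenced as pass.pages
};

// The batch of pages decompressed for the current subpass.
struct subpass_intermediate_data {
  std::vector<PageInfo> pages;              // always referenced as subpass.pages
  std::vector<std::size_t> page_src_index;  // maps each page back into pass.pages
};

// Sort pages into schema order up front (replacing the old page_index remap).
void sort_pages(pass_intermediate_data& pass)
{
  std::sort(pass.pages.begin(), pass.pages.end(),
            [](PageInfo const& a, PageInfo const& b) { return a.schema_idx < b.schema_idx; });
}

// Greedily select the next batch of pages that fits within `limit` bytes of
// temporary memory, always taking at least one page so we make progress.
subpass_intermediate_data compute_next_subpass(pass_intermediate_data const& pass,
                                               std::size_t start,
                                               std::size_t limit)
{
  subpass_intermediate_data sub;
  std::size_t used = 0;
  for (std::size_t i = start; i < pass.pages.size(); ++i) {
    if (!sub.pages.empty() && used + pass.pages[i].compressed_size > limit) { break; }
    used += pass.pages[i].compressed_size;
    sub.pages.push_back(pass.pages[i]);
    sub.page_src_index.push_back(i);
  }
  return sub;
}
```

The page_src_index mapping is the sketch's analogue of keeping the two page arrays connected: kernels operate on subpass.pages, while pass-level state stays indexed by pass.pages.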

Note on tests: I renamed parquet_chunked_reader_tests.cpp to parquet_chunked_reader_test.cu because I needed to use Thrust. The only actual changes in the file are the addition of the ParquetChunkedReaderInputLimitConstrainedTest and ParquetChunkedReaderInputLimitTest test suites at the bottom.

@nvdbaranec nvdbaranec requested review from a team as code owners November 3, 2023 22:08
@nvdbaranec nvdbaranec marked this pull request as draft November 3, 2023 22:08
@github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) and CMake (CMake build issue) labels Nov 3, 2023
@github-actions bot removed the CMake (CMake build issue) label Nov 6, 2023
@GregoryKimball (Contributor)

Hello @etseidl, we think that this work (#14360) will have some conflicts with your decoder addition in #14101. Our plan is to complete the work on #14101 first and then resolve the conflicts in this PR. In the meantime, would you please take a look at @nvdbaranec's work here?

@etseidl (Contributor) commented Nov 7, 2023

> we think that this work #14360 will have some conflicts with your decoder addition in #14101

FWIW, I tried a merge of #14101 into this branch, and the conflicts were pretty minor, with the biggest change being changes to the ComputePageStringSizes signature.

@etseidl (Contributor) commented Nov 7, 2023

It's a lot to digest, but it looks great so far. I have a few questions. First, will this help with skip_rows? I'm thinking of the predicate case where an index gives you a range of rows to read from the middle of a row group. Can this work be modified to process (or does it already process) just the pages needed to satisfy the predicate, along with any needed dictionary pages?

Second, if the size statistics from #14000 are available, would you still use this mechanism but feed in the stats, or would it be better to have an entirely different path for stats-driven chunked reading?

…uncompressed data. Add a couple of simple tests.
@vuule (Contributor) left a comment

still not done, but getting close

Review threads (resolved): cpp/src/io/parquet/reader_impl_chunking.hpp, cpp/src/io/parquet/parquet_gpu.hpp, cpp/src/io/parquet/reader_impl.cpp, cpp/src/io/parquet/reader_impl_chunking.cu
@vuule (Contributor) left a comment
the last couple of nits.
Thank you for leaving extensive comments; this would be unapproachable otherwise.

Review threads (resolved): cpp/src/io/parquet/reader_impl_chunking.cu, cpp/src/io/parquet/reader_impl_preprocess.cu
@nvdbaranec (Contributor, Author)
/merge

Labels: CMake (CMake build issue), cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change)

Successfully merging this pull request may close these issues.

[FEA] Proposal for sub-rowgroup reading in the parquet reader.
6 participants