
Parquet sub-rowgroup reading. #14360

Merged

Conversation

@nvdbaranec (Contributor) commented Nov 3, 2023

closes #14270

Implementation of sub-rowgroup reading of Parquet files. This PR implements an additional layer on top of the existing chunking system. Currently, the reader takes two parameters: input_pass_read_limit, which specifies a limit on temporary memory usage when reading and decompressing file data, and output_pass_read_limit, which specifies a limit on how large an output chunk (a table) can be.

Currently, when the user specifies a limit via input_pass_read_limit, the reader performs multiple passes over the file at row-group granularity. That is, it controls how many row groups it reads at once to conform to the specified limit.

However, there are cases where this is not sufficient, so this PR introduces subpasses below the top-level passes. It works as follows:

  • We read a set of input chunks based on the input_pass_read_limit but we do not decompress them immediately. This constitutes a pass.
  • Within each pass of compressed data, we progressively decompress batches of pages as subpasses.
  • Within each subpass we apply the output limit to produce chunks.

So the overall structure of the reader is: (read) pass -> (decompress) subpass -> (decode) chunk
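The three-level structure above can be sketched as a nested loop. This is a minimal illustration only: the function names (split_by_limit, chunk_file) and the parameters are stand-ins invented for this sketch, not the actual libcudf API, and the real reader divides work along page and row boundaries rather than raw byte counts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split `total` bytes into pieces no larger than `limit` (0 means unlimited).
std::vector<std::size_t> split_by_limit(std::size_t total, std::size_t limit)
{
  if (limit == 0 || total <= limit) { return {total}; }
  std::vector<std::size_t> pieces;
  while (total > 0) {
    std::size_t const piece = total < limit ? total : limit;
    pieces.push_back(piece);
    total -= piece;
  }
  return pieces;
}

// (read) pass -> (decompress) subpass -> (decode) chunk
std::vector<std::size_t> chunk_file(std::size_t file_bytes,
                                    std::size_t input_pass_limit,
                                    std::size_t subpass_decompress_limit,
                                    std::size_t output_chunk_limit)
{
  std::vector<std::size_t> chunk_sizes;
  // Pass: read a set of compressed row groups within the input limit.
  for (auto pass_bytes : split_by_limit(file_bytes, input_pass_limit)) {
    // Subpass: progressively decompress batches of pages from this pass.
    for (auto subpass_bytes : split_by_limit(pass_bytes, subpass_decompress_limit)) {
      // Chunk: apply the output limit to produce output tables.
      for (auto chunk_bytes : split_by_limit(subpass_bytes, output_chunk_limit)) {
        chunk_sizes.push_back(chunk_bytes);
      }
    }
  }
  return chunk_sizes;
}
```

The key property the sketch captures is that only one subpass worth of data needs to be decompressed at a time, while the compressed pass data stays resident.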

Major sections of code changes:

  • Previously, the incoming page data in the file was unsorted. To handle this we later produced a page_index array that could be applied to the page array to put the pages in schema-sorted order. This was getting very unwieldy, so the pages are now sorted up front and the page_index array has gone away.

  • There are now two sets of pages to be aware of in the code. Within each pass_intermediate_data there is the set of all pages within the current set of loaded row groups, and within the subpass_intermediate_data struct there is a separate array of pages representing the current batch of decompressed data being processed. To reduce confusion, a good amount of code was changed to always reference each array through its associated struct, i.e. pass.pages or subpass.pages. In addition, the page_info field was removed from ColumnChunkDesc to help prevent the kernels from getting confused; ColumnChunkDesc now only has a dict_page field, which is constant across all subpasses.

  • The primary entry point for the chunking mechanism is in handle_chunking. Here we iterate through passes, subpasses and output chunks. Successive subpasses are computed and preprocessed through here.

  • The volume of diffs you'll see in reader_impl_chunking.cu is a little deceptive: a lot of it is functions (or pieces of functions) moved over from reader_impl_preprocess.cu or reader_impl_helpers.cpp. The most relevant actual changes are in handle_chunking, compute_input_passes, compute_next_subpass, and compute_chunks_for_subpass.
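The pass/subpass page bookkeeping described above can be sketched roughly as follows. All struct and field names here are simplified stand-ins for the real cudf internals (the actual pass_intermediate_data, subpass_intermediate_data, and PageInfo carry much more state), and the greedy batch-selection policy is illustrative, not cudf's exact algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for the real PageInfo; fields are illustrative.
struct PageInfo {
  int schema_idx;               // which output column the page belongs to
  std::size_t compressed_size;  // bytes of compressed page data
};

// All pages for the currently loaded row groups. Pages are sorted into
// schema order once, up front, so no separate page_index remapping array
// is needed later.
struct pass_intermediate_data {
  std::vector<PageInfo> pages;  // always referenced as pass.pages
};

// The batch of pages decompressed for the current subpass.
struct subpass_intermediate_data {
  std::vector<PageInfo> pages;              // always referenced as subpass.pages
  std::vector<std::size_t> page_src_index;  // maps each page back into pass.pages
};

// Sort pages into schema order up front (replacing the old page_index remap).
void sort_pages(pass_intermediate_data& pass)
{
  std::sort(pass.pages.begin(), pass.pages.end(),
            [](PageInfo const& a, PageInfo const& b) { return a.schema_idx < b.schema_idx; });
}

// Greedily select the next batch of pages that fits within `limit` bytes of
// temporary memory, always taking at least one page so we make progress.
subpass_intermediate_data compute_next_subpass(pass_intermediate_data const& pass,
                                               std::size_t start,
                                               std::size_t limit)
{
  subpass_intermediate_data sub;
  std::size_t used = 0;
  for (std::size_t i = start; i < pass.pages.size(); ++i) {
    if (!sub.pages.empty() && used + pass.pages[i].compressed_size > limit) { break; }
    used += pass.pages[i].compressed_size;
    sub.pages.push_back(pass.pages[i]);
    sub.page_src_index.push_back(i);
  }
  return sub;
}
```

The page_src_index mapping is the sketch's analogue of keeping the two page arrays connected: kernels operate on subpass.pages, while pass-level state stays indexed by pass.pages.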

Note on tests: I renamed parquet_chunked_reader_tests.cpp to parquet_chunked_reader_test.cu because I needed to use Thrust. The only actual changes in the file are the addition of the ParquetChunkedReaderInputLimitConstrainedTest and ParquetChunkedReaderInputLimitTest test suites at the bottom.

@nvdbaranec nvdbaranec requested review from a team as code owners November 3, 2023 22:08
@nvdbaranec nvdbaranec marked this pull request as draft November 3, 2023 22:08
@github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) and CMake (CMake build issue) labels Nov 3, 2023
@github-actions bot removed the CMake (CMake build issue) label Nov 6, 2023
@GregoryKimball (Contributor)

Hello @etseidl, we think that this work (#14360) will have some conflicts with your decoder addition in #14101. Our plan is to complete the work on #14101 first and then resolve the conflicts in this PR. In the meantime, would you please take a look at @nvdbaranec's work here?

@etseidl (Contributor) commented Nov 7, 2023

> we think that this work #14360 will have some conflicts with your decoder addition in #14101

FWIW, I tried a merge of #14101 into this branch, and the conflicts were pretty minor, with the biggest change being changes to the ComputePageStringSizes signature.

@etseidl (Contributor) commented Nov 7, 2023

It's a lot to digest, but it looks great so far. I have a few questions. First, will this help with skip_rows? I'm thinking of the predicate case where an index gives you a range of rows to read from the middle of a row group. Can this work be modified to process (or does it already process) just the pages needed to satisfy the predicate, along with any needed dictionary pages?

Second, if the size statistics from #14000 are available, would you still use this mechanism but feed in the stats, or would it be better to have an entirely different path for stats-driven chunked reading?

…uncompressed data. Add a couple of simple tests.
@vuule (Contributor) left a comment

still not done, but getting close

Review threads (resolved): cpp/src/io/parquet/reader_impl_chunking.hpp, cpp/src/io/parquet/parquet_gpu.hpp, cpp/src/io/parquet/reader_impl.cpp, cpp/src/io/parquet/reader_impl_chunking.cu
@vuule (Contributor) left a comment
the last couple of nits.
Thank you for leaving extensive comments; this would be unapproachable otherwise.

Review threads (resolved): cpp/src/io/parquet/reader_impl_chunking.cu, cpp/src/io/parquet/reader_impl_preprocess.cu
@nvdbaranec (Contributor, Author)
/merge

Labels: CMake (CMake build issue), cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change)

Successfully merging this pull request may close these issues.

[FEA] Proposal for sub-rowgroup reading in the parquet reader.
6 participants