Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] AST filtering in parquet reader #13348

Merged

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented May 13, 2023

Description

The plan to support AST based filter predicate pushdown in parquet. This PR adds predicate pushdown on row group filtering.

The statistics of columns of each row group are loaded to a device column, and AST filter is applied on min, max of each column to select the row groups to read. The user given AST needs to be converted to another AST to be applied on min, max values of each column ('Statistics AST'). After the row groups are parsed, the user given AST is applied on the output columns to filter any remaining rows in the row groups.
New column_name_reference is introduced to help the users create AST's that reference columns by name, as the user may or may not have the column indices information before reading. Since AST engine takes only column index reference, a transformation is applied to the user given AST. So, 2 new AST transformation classes are introduced:

  1. named_to_reference_converter - Converts column name references to column index references
  2. stats_expression_converter - Converts the above output table filtering AST to 'Statistics AST'.

Note: This column_name_reference only supported for predicate pushdown filtering, but not supported for other AST operations such as transform, joins etc.

Depends on #13472

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 13, 2023
@karthikeyann karthikeyann added feature request New feature or request 2 - In Progress Currently a work in progress cuIO cuIO issue non-breaking Non-breaking change labels May 13, 2023
@karthikeyann karthikeyann self-assigned this May 13, 2023
Comment on lines 211 to 217
CompactProtocolFieldWriter c(*this);
if (s.max.size() != 0) { c.field_binary(1, s.max); }
if (s.min.size() != 0) { c.field_binary(2, s.min); }
if (s.null_count != -1) { c.field_int(3, s.null_count); }
if (s.distinct_count != -1) { c.field_int(4, s.distinct_count); }
if (s.max_value.size() != 0) { c.field_binary(5, s.max_value); }
if (s.min_value.size() != 0) { c.field_binary(6, s.min_value); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised to see a writer change here - was there something incorrect about the way we've been writing row group statistics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The statistics are pre-encoded during page encoding itself and stored as binary blob. I changed the datatype from binary blob to Statistics struct. So, this change is encoding that struct again.

@karthikeyann karthikeyann changed the base branch from branch-23.06 to branch-23.08 May 26, 2023 05:01
@github-actions github-actions bot added ci CMake CMake build issue Java Affects Java cuDF API. Python Affects Python cuDF API. labels May 26, 2023
@karthikeyann karthikeyann changed the base branch from branch-23.08 to branch-23.06 May 26, 2023 05:17
@karthikeyann karthikeyann changed the base branch from branch-23.06 to branch-23.08 May 26, 2023 05:17
- Add device column from Column chunk statistics in metadata
- Add Filter AST to StatsAST transformer
- Apply StatsAST on chunk statistics column in device
- Filter the row groups, and give to base parquet reader
@davidwendt davidwendt removed their request for review July 14, 2023 11:47
@karthikeyann karthikeyann requested a review from a team as a code owner July 19, 2023 22:23
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jul 19, 2023
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No further requests from me. I'm ok once Bradley signs off on this.

@bdice
Copy link
Contributor

bdice commented Jul 20, 2023

Thanks Mike — I’ll review again tomorrow.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments -- mostly minor changes but I definitely want to get stream conventions figured out before approving. I know that our I/O code might not be the best example of stream conventions today, but we need to get the entirety of libcudf moving in that direction. New APIs should accept and leverage streams correctly.

cpp/src/io/parquet/reader_impl.cpp Outdated Show resolved Hide resolved
// Add empty columns if needed.
return finalize_output(out_metadata, out_columns);
// Add empty columns if needed. Filter output columns based on filter.
return finalize_output(out_metadata, out_columns, filter);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have much familiarity with the chunked reader. Do we need to raise an error, or something like that? Just trying to understand if additional work is needed here, or if this can be resolved.

cpp/src/io/parquet/reader_impl_helpers.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.hpp Outdated Show resolved Hide resolved
cpp/tests/io/parquet_test.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/predicate_pushdown.cpp Show resolved Hide resolved
cpp/src/io/parquet/predicate_pushdown.cpp Show resolved Hide resolved
cpp/src/io/parquet/predicate_pushdown.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/predicate_pushdown.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/predicate_pushdown.cpp Outdated Show resolved Hide resolved
@karthikeyann karthikeyann requested a review from bdice July 21, 2023 21:21
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like all the requested changes in my previous review were addressed. I think this is close enough to approve, but we need to address the conversation about compatibility with the chunked reader in https://github.com/rapidsai/cudf/pull/13348/files#r1271118219 before merging.

@karthikeyann karthikeyann added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jul 26, 2023
@karthikeyann
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit fa09cca into rapidsai:branch-23.08 Jul 26, 2023
54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

7 participants