[FEA] AST filtering in parquet reader #13348

karthikeyann · 2023-05-13T18:38:04Z

Description

This PR is part of the plan to support AST-based filter predicate pushdown in the Parquet reader. It adds predicate pushdown for row group filtering.

The statistics of the columns in each row group are loaded into a device column, and the AST filter is applied to the min and max of each column to select the row groups to read. The user-given AST must first be converted to another AST that operates on the min and max values of each column (the 'Statistics AST'). After the selected row groups are parsed, the user-given AST is applied to the output columns to filter out any remaining non-matching rows in those row groups.
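
As a rough illustration of the row group selection described above, here is a minimal host-side sketch. It is not the libcudf implementation: `row_group_stats`, `cmp_op`, `may_match`, and `select_row_groups` are hypothetical names, the column is assumed to be a single integer column, and the rewrite rules shown (for example `col > lit` becoming `max > lit`) are the usual min/max pruning rules rather than anything confirmed by this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for per-row-group column statistics.
struct row_group_stats {
  int min;  // smallest value of the column in this row group
  int max;  // largest value of the column in this row group
};

enum class cmp_op { LESS, GREATER, EQUAL };

// Statistics form of the predicate `col <op> literal`:
//   col <  lit  ->  min <  lit
//   col >  lit  ->  max >  lit
//   col == lit  ->  min <= lit && lit <= max
// Returns true if the row group may contain at least one matching row.
bool may_match(row_group_stats const& s, cmp_op op, int literal)
{
  assert(s.min <= s.max);
  switch (op) {
    case cmp_op::LESS: return s.min < literal;
    case cmp_op::GREATER: return s.max > literal;
    case cmp_op::EQUAL: return s.min <= literal && literal <= s.max;
  }
  return true;  // unknown operator: keep the row group to stay conservative
}

// Returns the indices of the row groups that cannot be ruled out.
std::vector<std::size_t> select_row_groups(std::vector<row_group_stats> const& stats,
                                           cmp_op op,
                                           int literal)
{
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < stats.size(); ++i) {
    if (may_match(stats[i], op, literal)) { selected.push_back(i); }
  }
  return selected;
}
```

A selected row group can still contain rows that fail the predicate, which is why the original AST is applied again to the output columns after reading.
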
A new column_name_reference is introduced so that users can create ASTs that reference columns by name, since the user may not know the column indices before reading. Because the AST engine accepts only column index references, a transformation is applied to the user-given AST. Two new AST transformation classes are introduced:

named_to_reference_converter - Converts column name references to column index references
stats_expression_converter - Converts the above output table filtering AST to 'Statistics AST'.

Note: column_name_reference is supported only for predicate pushdown filtering; it is not supported for other AST operations such as transforms and joins.
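
The name-to-index conversion can be pictured with a small sketch. This is an assumption-laden toy, not the `named_to_reference_converter` class from this PR: the `expr` tree, the `make_*` helpers, and `to_index_refs` are hypothetical, and the schema is modeled as a plain name-to-position map.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical miniature expression tree; not the cudf::ast classes.
struct expr {
  enum class kind { COLUMN_NAME, COLUMN_INDEX, LITERAL, OPERATION };
  kind type = kind::LITERAL;
  std::string name;  // COLUMN_NAME: column name; OPERATION: operator symbol
  int value = 0;     // COLUMN_INDEX: column position; LITERAL: constant
  std::vector<std::unique_ptr<expr>> operands;  // used by OPERATION only
};

std::unique_ptr<expr> make_name(std::string n)
{
  auto e  = std::make_unique<expr>();
  e->type = expr::kind::COLUMN_NAME;
  e->name = std::move(n);
  return e;
}

std::unique_ptr<expr> make_literal(int v)
{
  auto e   = std::make_unique<expr>();
  e->type  = expr::kind::LITERAL;
  e->value = v;
  return e;
}

std::unique_ptr<expr> make_operation(std::string op,
                                     std::unique_ptr<expr> lhs,
                                     std::unique_ptr<expr> rhs)
{
  auto e  = std::make_unique<expr>();
  e->type = expr::kind::OPERATION;
  e->name = std::move(op);
  e->operands.push_back(std::move(lhs));
  e->operands.push_back(std::move(rhs));
  return e;
}

// Returns a copy of `e` in which every column name reference is replaced by
// a column index reference looked up in `name_to_index`.
std::unique_ptr<expr> to_index_refs(expr const& e,
                                    std::unordered_map<std::string, int> const& name_to_index)
{
  auto out = std::make_unique<expr>();
  if (e.type == expr::kind::COLUMN_NAME) {
    auto const it = name_to_index.find(e.name);
    if (it == name_to_index.end()) { throw std::invalid_argument("unknown column: " + e.name); }
    out->type  = expr::kind::COLUMN_INDEX;
    out->value = it->second;
    return out;
  }
  out->type  = e.type;
  out->name  = e.name;
  out->value = e.value;
  for (auto const& child : e.operands) {
    assert(child != nullptr);
    out->operands.push_back(to_index_refs(*child, name_to_index));
  }
  return out;
}
```

Building a new tree instead of mutating the input keeps the user's expression unchanged, so it can be reused across files with different column orders.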

Parse column chunk metadata statistics in parquet reader #13472
Convert column chunk min, max to cudf type column.
Add AST filter interface to parquet reader options
Convert AST to Statistics AST
Apply statistics AST on Stats values to get row_groups
Apply AST as filter on output columns.
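
The last step in the list above, filtering the output columns, amounts to computing a boolean mask from the predicate and keeping only the rows where it is true. The sketch below shows that idea on the host with hypothetical `column`, `table`, `compute_mask`, and `apply_mask` names; in libcudf the mask would come from evaluating the AST on device columns, which this toy does not attempt to reproduce.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-side stand-ins for device columns and tables.
using column = std::vector<int>;
using table  = std::vector<column>;  // all columns have the same length

// A predicate evaluated on one row of the table.
using row_predicate = std::function<bool(table const&, std::size_t)>;

// Evaluates the predicate on every row and returns one boolean per row.
std::vector<bool> compute_mask(table const& t, row_predicate const& pred)
{
  std::size_t const num_rows = t.empty() ? 0 : t.front().size();
  std::vector<bool> mask(num_rows);
  for (std::size_t row = 0; row < num_rows; ++row) { mask[row] = pred(t, row); }
  return mask;
}

// Returns a new table holding only the rows whose mask entry is true.
table apply_mask(table const& t, std::vector<bool> const& mask)
{
  table out(t.size());
  for (std::size_t c = 0; c < t.size(); ++c) {
    assert(t[c].size() == mask.size());
    for (std::size_t row = 0; row < mask.size(); ++row) {
      if (mask[row]) { out[c].push_back(t[c][row]); }
    }
  }
  return out;
}
```

The mask is computed once and applied to every column, so all output columns stay row-aligned.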

Depends on #13472

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

GregoryKimball · 2023-05-15T18:03:56Z

cpp/src/io/parquet/compact_protocol_writer.cpp

+  CompactProtocolFieldWriter c(*this);
+  if (s.max.size() != 0) { c.field_binary(1, s.max); }
+  if (s.min.size() != 0) { c.field_binary(2, s.min); }
+  if (s.null_count != -1) { c.field_int(3, s.null_count); }
+  if (s.distinct_count != -1) { c.field_int(4, s.distinct_count); }
+  if (s.max_value.size() != 0) { c.field_binary(5, s.max_value); }
+  if (s.min_value.size() != 0) { c.field_binary(6, s.min_value); }


I'm surprised to see a writer change here - was there something incorrect about the way we've been writing row group statistics?

No. The statistics are pre-encoded during page encoding and stored as a binary blob. I changed the data type from a binary blob to a Statistics struct, so this change encodes that struct again.
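
To make the pattern in the diff easier to follow, here is a small sketch of the "write a field only when it is set" logic. The `statistics` struct and `fields_to_write` are hypothetical mirrors of the fields shown in the diff, not the libcudf definitions, and no actual Thrift compact encoding is performed; the function only reports which field IDs would be emitted.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical mirror of the fields touched in the diff above.
struct statistics {
  std::vector<std::uint8_t> max;        // field 1
  std::vector<std::uint8_t> min;        // field 2
  std::int64_t null_count     = -1;     // field 3; -1 means "not set"
  std::int64_t distinct_count = -1;     // field 4; -1 means "not set"
  std::vector<std::uint8_t> max_value;  // field 5
  std::vector<std::uint8_t> min_value;  // field 6
};

// Returns the IDs of the fields that would be written, in ascending order.
// Empty binary values and counts equal to -1 are treated as absent.
std::vector<int> fields_to_write(statistics const& s)
{
  std::vector<int> ids;
  if (!s.max.empty()) { ids.push_back(1); }
  if (!s.min.empty()) { ids.push_back(2); }
  if (s.null_count != -1) { ids.push_back(3); }
  if (s.distinct_count != -1) { ids.push_back(4); }
  if (!s.max_value.empty()) { ids.push_back(5); }
  if (!s.min_value.empty()) { ids.push_back(6); }
  return ids;
}
```

Keeping the statistics as a struct rather than a pre-encoded blob is what lets the reader side inspect min and max values directly.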

hyperbolic2346

No further requests from me. I'm ok once Bradley signs off on this.

bdice · 2023-07-20T02:50:00Z

Thanks Mike — I’ll review again tomorrow.

bdice

Some comments -- mostly minor changes but I definitely want to get stream conventions figured out before approving. I know that our I/O code might not be the best example of stream conventions today, but we need to get the entirety of libcudf moving in that direction. New APIs should accept and leverage streams correctly.

bdice · 2023-07-20T19:43:12Z

cpp/src/io/parquet/reader_impl.cpp

-  // Add empty columns if needed.
-  return finalize_output(out_metadata, out_columns);
+  // Add empty columns if needed. Filter output columns based on filter.
+  return finalize_output(out_metadata, out_columns, filter);


I don't have much familiarity with the chunked reader. Do we need to raise an error, or something like that? Just trying to understand if additional work is needed here, or if this can be resolved.

bdice

It looks like all the requested changes in my previous review were addressed. I think this is close enough to approve, but we need to address the conversation about compatibility with the chunked reader in https://github.com/rapidsai/cudf/pull/13348/files#r1271118219 before merging.

karthikeyann · 2023-07-26T22:41:28Z

/merge

karthikeyann added 3 commits May 13, 2023 01:08

Read Statistics in parquet Reader, update writer to use Statistics

846ca2e

fix mistake in detail api signature

58d5472

add filter (AST) to parquet reader options

aeac4a9

github-actions bot added the "libcudf" label May 13, 2023

karthikeyann added the "feature request", "2 - In Progress", "cuIO", and "non-breaking" labels May 13, 2023

karthikeyann self-assigned this May 13, 2023

karthikeyann added 2 commits May 14, 2023 16:10

Apply AST as filter on output columns, unit test

0412e49

style check, doxygen check failure fixes

c2e6589

GregoryKimball reviewed May 15, 2023

karthikeyann added 2 commits May 26, 2023 10:28

add filter to required pq reader functions

dbcc3ce

Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…

26b124e

…uet_predicate_row_group

karthikeyann changed the base branch from branch-23.06 to branch-23.08 May 26, 2023 05:01

github-actions bot added the "ci", "CMake", "Java", and "Python" labels May 26, 2023

karthikeyann changed the base branch from branch-23.08 to branch-23.06 May 26, 2023 05:17

karthikeyann changed the base branch from branch-23.06 to branch-23.08 May 26, 2023 05:17

karthikeyann added 3 commits May 30, 2023 20:16

cleanup statistics_blob

0a0c8d1

add expression_transformer in AST

81c2026

add column chunk statistics based row group filtering

b1e9798

- Add device column from Column chunk statistics in metadata
- Add Filter AST to StatsAST transformer
- Apply StatsAST on chunk statistics column in device
- Filter the row groups, and give to base parquet reader

davidwendt removed their request for review July 14, 2023 11:47

karthikeyann added 4 commits July 14, 2023 21:38

add anoymous namespace

5d0f592

Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…

ab58f61

…uet_predicate_row_group

style fix

f2e4cbc

Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…

b8a751b

…uet_predicate_row_group

bdice reviewed Jul 18, 2023

cpp/src/io/parquet/predicate_pushdown.cpp (outdated, resolved)

cpp/src/io/parquet/predicate_pushdown.cpp (resolved)

karthikeyann added 4 commits July 20, 2023 03:47

add float nan test

b4c1e69

add Cython bindings

2de4e1d

style fix

3e270c7

Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…

2f04e5e

…uet_predicate_row_group

karthikeyann requested a review from a team as a code owner July 19, 2023 22:23

karthikeyann requested review from shwina, bdice and hyperbolic2346 July 19, 2023 22:23

github-actions bot added the "Python" label Jul 19, 2023

hyperbolic2346 approved these changes Jul 20, 2023

bdice requested changes Jul 20, 2023

karthikeyann added 2 commits July 22, 2023 02:36

address review comments (bdice)

ef2da72

Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…

18eeb88

…uet_predicate_row_group

karthikeyann requested a review from bdice July 21, 2023 21:21

bdice approved these changes Jul 24, 2023

raydouglass approved these changes Jul 25, 2023

karthikeyann added the "5 - Ready to Merge" label and removed the "3 - Ready for Review" label Jul 26, 2023

Merge branch 'branch-23.08' into fea-parquet_predicate_row_group

b507b28

rapids-bot bot merged commit fa09cca into rapidsai:branch-23.08 Jul 26, 2023
54 checks passed

karthikeyann mentioned this pull request Aug 18, 2023

Row group selection support in Parquet chunked reader #13913

Closed

GregoryKimball mentioned this pull request Sep 10, 2023

[FEA] Improve ORC reader filtering and performance #13882

Open
