Introduce batched parquet column readers for flat types #14423

Merged
merged 9 commits into trinodb:master from pqr-flat on Nov 20, 2022

Conversation

@raunaqmorarka (Member) commented Oct 1, 2022

Description

FlatDefinitionLevelDecoder implements an optimized decoder for primitive
columns where the definition level is either 0 or 1.
ApacheParquetValueDecoder provides implementations of batched decoders
that wrap the existing parquet-mr ValuesReader implementations (see the sketch below).
More optimized implementations of ValueDecoder will be added to ValueDecoders
in subsequent changes.
FilteredRowRangesIterator provides a way to iterate over the rows selected by the column
index, a batch of positions at a time.
FlatColumnReader uses the batched decoders and the row-ranges iterator to
implement an optimized ColumnReader with dedicated code paths for nullable and
non-nullable types.
ColumnReaderFactory is updated to use the optimized implementations for
boolean, tinyint, short, int, long, float, double and short decimal types.
ColumnReaderFactory falls back to the existing implementations where optimized
implementations are not yet available or when the flag
parquet.optimized-reader.enabled is disabled.
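
A minimal sketch of the batched-wrapper idea described above, assuming a simple bulk-read decoder interface; the interface and class names here are illustrative, and only parquet-mr's `ValuesReader` is the real API:

```java
import org.apache.parquet.column.values.ValuesReader;

// Illustrative sketch, not the actual Trino classes: a batched decoder that
// wraps a single-value parquet-mr ValuesReader behind a bulk-read interface.
interface IntValueDecoder
{
    void read(int[] values, int offset, int length);
}

final class ApacheParquetIntDecoder
        implements IntValueDecoder
{
    private final ValuesReader delegate;

    ApacheParquetIntDecoder(ValuesReader delegate)
    {
        this.delegate = delegate;
    }

    @Override
    public void read(int[] values, int offset, int length)
    {
        // Filling a whole output block per call amortizes the per-value
        // virtual-call overhead of the underlying reader across the batch
        for (int i = offset; i < offset + length; i++) {
            values[i] = delegate.readInteger();
        }
    }
}
```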

Non-technical explanation

Improve performance of reading Parquet files for boolean, tinyint, short, int, long, float, double and short decimal data types.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Iceberg, Delta, Hudi
* Improve performance of reading Parquet files for boolean, tinyint, short, int, long, float, double and short decimal data types. The catalog configuration property `parquet.optimized-reader.enabled` can be set to `false` to disable the optimized implementation. ({issue}`14423`)

@raunaqmorarka (Member Author)

Flat column reader sf1000 partitioned.pdf
Flat column reader sf1000 unpartitioned.pdf
Partitioned: both TPCH and TPCDS improve by roughly 10%
Unpartitioned: TPCH improves by 10% and TPCDS improves by 20%

@raunaqmorarka force-pushed the pqr-flat branch 2 times, most recently from fc6a5db to b1458b7 on October 4, 2022 08:41
@skrzypo987 (Member) left a comment

LGTM, though it should not count since I co-authored this.

@sopel39 (Member) left a comment

There are test failures.
It would be great to attach a benchmark report.

@raunaqmorarka (Member Author)

> There are test failures.
> It would be great to attach a benchmark report.

The test failure looks unrelated, but I'm re-running to confirm.
Benchmarks were already shared in #14423 (comment)

public static void unpack(byte[] values, int offset, long packedValue)
{
    // Unrolled bit extraction: each output byte receives one bit of packedValue
    values[offset] = (byte) (packedValue & 1);
    values[offset + 1] = (byte) ((packedValue >>> 1) & 1);
    // ... (excerpt truncated in the review view)

A member commented:

Are we sure this is better than a loop? What do the resulting assembly and performance look like, and how did we measure it? As in the comment above, the downside is more bytecode and, as a result, fewer opportunities for inlining.

Another member replied:

These unrollings have shown significant improvements in JMH benchmarks and some visible (though obviously not huge) gains in macrobenchmarks.
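
For context, a loop-based equivalent of the unrolled unpack shown above, assuming the method extracts all 64 bits of the packed long:

```java
// Loop-based equivalent of the unrolled unpack: each output byte receives
// one bit of packedValue. The unrolled form trades larger bytecode for the
// removal of loop-control overhead.
public static void unpackLoop(byte[] values, int offset, long packedValue)
{
    for (int i = 0; i < 64; i++) {
        values[offset + i] = (byte) ((packedValue >>> i) & 1);
    }
}
```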

throws Exception
{
    benchmark(BenchmarkReadUleb128Int.class)
            .withOptions(optionsBuilder -> optionsBuilder.jvmArgsAppend("-Xmx4g", "-Xms4g"))

A member commented:

Why is this needed?

Another member replied:

My guess is that this line has been mindlessly copied from one benchmark to another for centuries.

@raunaqmorarka (Member Author) replied:

Yes, this is carried over from existing benchmark classes. I was assuming that using a fixed heap size somehow leads to more consistent results, but I don't know if that's actually the case.
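
For reference, a minimal standalone JMH runner showing the same flags; the `OptionsBuilder` calls are JMH's real API, while the wrapper class is illustrative and assumes `BenchmarkReadUleb128Int` is on the classpath:

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunnerSketch
{
    public static void main(String[] args)
            throws Exception
    {
        // Pinning -Xms to -Xmx keeps the heap size of the forked benchmark
        // JVM constant, avoiding resize pauses that add run-to-run noise
        Options options = new OptionsBuilder()
                .include(BenchmarkReadUleb128Int.class.getSimpleName())
                .jvmArgsAppend("-Xmx4g", "-Xms4g")
                .build();
        new Runner(options).run();
    }
}
```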

raunaqmorarka and others added 3 commits November 19, 2022 11:44
Introduce a standalone ColumnReaderFactory class along with ColumnReader interface

Co-authored-by: Krzysztof Skrzypczynski <krzysztof.skrzypczynski@starburstdata.com>
Krzysztof Skrzypczynski and others added 5 commits November 20, 2022 20:55
NullsDecoder implements an optimized definition-level decoder for primitive
columns where the definition level is either 0 or 1.
ApacheParquetValueDecoders provides implementations of batched decoders
that wrap the existing parquet-mr ValuesReader implementations.
More optimized implementations of ValueDecoder will be added to ValueDecoders
in subsequent changes.
FilteredRowRangesIterator provides a way to iterate over the rows selected by the column
index, a batch of positions at a time.
FlatColumnReader uses the batched decoders and the row-ranges iterator to
implement an optimized ColumnReader with dedicated code paths for nullable and
non-nullable types.
ColumnReaderFactory is updated to use the optimized implementations for
boolean, tinyint, short, int, long, float, double and short decimal types.
ColumnReaderFactory falls back to the existing implementations where optimized
implementations are not yet available or when the flag
parquet.optimized-reader.enabled is disabled.

Co-authored-by: Raunaq Morarka <raunaqmorarka@gmail.com>
When a large enough number of rows is skipped due to a `seek` operation,
it is possible to skip decompressing and decoding Parquet pages entirely.
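
A hedged sketch of that idea, with invented helper names: pages fully covered by the skip are dropped while still compressed, and only the final, partially skipped page is decoded:

```java
// Hypothetical sketch of page skipping on seek; all names are illustrative
abstract class PageSkippingReader
{
    protected int remainingRowsInPage;

    // Advance by reading only the next page header; the compressed page body
    // is discarded without decompression. Updates remainingRowsInPage.
    protected abstract void advanceToNextPage();

    // Decode and discard the given number of rows within the current page
    protected abstract void skipWithinPage(int rows);

    void seek(int rowsToSkip)
    {
        // Whole pages inside the skipped range never get decompressed
        while (rowsToSkip > 0 && rowsToSkip >= remainingRowsInPage) {
            rowsToSkip -= remainingRowsInPage;
            advanceToNextPage();
        }
        skipWithinPage(rowsToSkip);
        remainingRowsInPage -= rowsToSkip;
    }
}
```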
A Parquet schema may declare a column as OPTIONAL even though
the actual data contains no nulls. Row-group column statistics
can be used to identify such cases and switch to the faster non-nullable read
paths in FlatColumnReader.
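
A minimal sketch of that statistics check; the `Statistics` methods are parquet-mr's real API, while the helper itself is illustrative:

```java
import org.apache.parquet.column.statistics.Statistics;

// Row-group statistics can prove an OPTIONAL column holds no nulls,
// allowing the reader to take the faster non-nullable path
static boolean provablyNoNulls(Statistics<?> statistics)
{
    return statistics != null
            && !statistics.isEmpty()
            && statistics.isNumNullsSet()
            && statistics.getNumNulls() == 0;
}
```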
@raunaqmorarka merged commit e629139 into trinodb:master on Nov 20, 2022
@raunaqmorarka deleted the pqr-flat branch on November 20, 2022 17:22
@github-actions bot added this to the 404 milestone on Nov 20, 2022
@colebow (Member) commented Nov 23, 2022

@raunaqmorarka I think we also need to add docs for this to the Hudi connector page. Not urgent, just let me know if you've got that covered or if you'd prefer me to follow up with that.

@raunaqmorarka (Member Author)

> @raunaqmorarka I think we also need to add docs for this to the Hudi connector page. Not urgent, just let me know if you've got that covered or if you'd prefer me to follow up with that.

Right, it would be great if you could take care of that.
