Read parquet column chunks in small steps #15374
I added tests for |
some comments.
most comments addressed
most comments addressed (javadoc for ChunkedInputStream still pending)
Comments addressed. Tests of memory accounting are pending.
memory accounting test added in |
TestParquetDataSource tests only generic AbstractParquetDataSource logic, so it should be in trino-parquet.
Before this change, Parquet column chunks were read in one go, copying everything into one big Slice. This had two issues. First, for limit queries, we potentially don't need to read the entire column chunk to finish the query, as the first page may satisfy the limit. Second, for files with a big row group size, the allocated Slice can exceed the JVM limit for a native byte array, and even if it doesn't, it makes memory usage inefficient due to how humongous allocations are implemented in the JVM.
We can allow for a big DiskRange to be passed to the ParquetDataSource.planRead, since it's going to split the ranges into small chunks anyway.
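The splitting mentioned above can be pictured roughly as follows. This is only a minimal sketch, not the code from this PR: `ByteRange` and `RangeSplitter` are hypothetical names standing in for Trino's `DiskRange` and whatever logic sits behind `planRead`, and the chunk size is passed in explicitly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a disk range (offset + length); not Trino's DiskRange.
final class ByteRange
{
    final long offset;
    final long length;

    ByteRange(long offset, long length)
    {
        this.offset = offset;
        this.length = length;
    }
}

// Hypothetical helper illustrating the idea; not the actual planRead implementation.
final class RangeSplitter
{
    private RangeSplitter() {}

    // Splits one large range into consecutive pieces of at most maxChunkSize bytes.
    static List<ByteRange> split(ByteRange range, long maxChunkSize)
    {
        if (maxChunkSize <= 0) {
            throw new IllegalArgumentException("maxChunkSize must be positive");
        }
        List<ByteRange> chunks = new ArrayList<>();
        long position = range.offset;
        long remaining = range.length;
        while (remaining > 0) {
            long size = Math.min(remaining, maxChunkSize);
            chunks.add(new ByteRange(position, size));
            position += size;
            remaining -= size;
        }
        return chunks;
    }
}
```

With this shape, the caller can hand over a range covering a whole column chunk and still end up with bounded allocations, because no single piece is larger than `maxChunkSize`.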
Comments were addressed. Thanks @lukasz-stec, @raunaqmorarka
Benchmark results for tpch/tpcds parquet sf1k partitioned with default |
Description
Currently, ParquetReader will read the entire row group at once. This means that:
- limit N queries will take longer than needed, because the reader might read hundreds of megabytes before returning a few rows.
- The reader might allocate and read hundreds of megabytes at once, which is problematic.
The issue is especially visible when the Parquet row group size is large.
This PR fixes this by reading parquet column chunks in small (8MB by default) pieces.
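To illustrate why reading in pieces helps limit queries, here is a minimal sketch of a stream that fetches each piece only when the consumer actually reaches it. It is an assumption-laden illustration, not the `ChunkedInputStream` from this PR: the class name `LazyChunkedStream`, the `Supplier<byte[]>` loaders and the `chunksLoaded` counter are all hypothetical.

```java
import java.io.InputStream;
import java.util.Iterator;
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical sketch of a lazily loaded, chunked stream; not Trino's ChunkedInputStream.
final class LazyChunkedStream
        extends InputStream
{
    // Each supplier stands for one deferred read of a small piece of the column chunk.
    private final Iterator<Supplier<byte[]>> loaders;
    private byte[] current = new byte[0];
    private int position;
    private int chunksLoaded;

    LazyChunkedStream(Iterator<Supplier<byte[]>> loaders)
    {
        this.loaders = Objects.requireNonNull(loaders, "loaders is null");
    }

    @Override
    public int read()
    {
        if (!ensureChunk()) {
            return -1;
        }
        return current[position++] & 0xFF;
    }

    @Override
    public int read(byte[] buffer, int offset, int length)
    {
        Objects.checkFromIndexSize(offset, length, buffer.length);
        if (length == 0) {
            return 0;
        }
        if (!ensureChunk()) {
            return -1;
        }
        // Serve at most the remainder of the current piece; the next piece is loaded on the next call.
        int count = Math.min(length, current.length - position);
        System.arraycopy(current, position, buffer, offset, count);
        position += count;
        return count;
    }

    // Loads the next piece only when the current one is exhausted.
    private boolean ensureChunk()
    {
        while (position >= current.length) {
            if (!loaders.hasNext()) {
                return false;
            }
            current = loaders.next().get();
            position = 0;
            chunksLoaded++;
        }
        return true;
    }

    int chunksLoaded()
    {
        return chunksLoaded;
    }
}
```

If a query is satisfied after the first page, only the first piece is ever fetched; the remaining suppliers are never invoked, so neither the I/O nor the allocation for them happens.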
Additional context and related issues
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( X) Release notes are required, with the following suggested text: