
Read parquet column chunks in small steps #15374

Merged

Conversation

@lukasz-stec commented on Dec 12, 2022

Description

Currently, ParquetReader will read the entire row group at once. This means that:

  • limit N queries take longer than necessary because the reader might read hundreds of megabytes before returning just a few rows.

  • The reader might allocate and read hundreds of megabytes at once. This is problematic because:

    • memory accounting for the reader happens only after the data has been read and the arrays have been created, so these large allocations are temporarily unaccounted for, which can lead to OOM and general system instability (e.g. GC pressure)
    • peak memory usage is higher than necessary (e.g. GBs instead of MBs)

The issue is especially visible when the Parquet row group size is large.

This PR fixes this by reading Parquet column chunks in small pieces (8MB by default).
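
To illustrate the approach, the sketch below reads a column chunk of known length in fixed-size pieces so that no single allocation exceeds the configured buffer size. This is a simplified example, not the actual ParquetReader or ChunkedInputStream code; the class and method names are made up, and the 8MB constant mirrors the default described in the release note below.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: read a column chunk of `length` bytes in fixed-size
// pieces instead of one monolithic allocation. Names are invented for the example.
final class ChunkedColumnChunkRead
{
    // Mirrors the default of the parquet.max-buffer-size property mentioned in the release note
    private static final int MAX_BUFFER_SIZE = 8 * 1024 * 1024;

    private ChunkedColumnChunkRead() {}

    static List<byte[]> readInPieces(InputStream input, long length)
            throws IOException
    {
        List<byte[]> pieces = new ArrayList<>();
        long remaining = length;
        while (remaining > 0) {
            int pieceSize = (int) Math.min(remaining, MAX_BUFFER_SIZE);
            byte[] piece = input.readNBytes(pieceSize);
            if (piece.length != pieceSize) {
                throw new IOException("Unexpected end of column chunk");
            }
            pieces.add(piece);
            remaining -= pieceSize;
        }
        return pieces;
    }
}
```

In the PR itself this role is played by ChunkedInputStream, which also lets limit queries stop early instead of reading the whole column chunk up front, as described above.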

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Delta, Hudi, Iceberg
* Avoid large memory allocations in parquet reader by limiting the maximum size of reads from file. This improves stability and reduces peak memory requirements. The catalog configuration property `parquet.max-buffer-size` can be used to change the maximum size of reads performed by the parquet reader from the default value of 8MB. ({issue}`15374`)
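
For context, a catalog property like the one above is typically bound through an Airlift-style configuration class. The sketch below is illustrative only; the class and method names are assumptions rather than the actual Trino code.

```java
import io.airlift.configuration.Config;
import io.airlift.units.DataSize;

import static io.airlift.units.DataSize.Unit.MEGABYTE;

// Hedged sketch of how a catalog property such as parquet.max-buffer-size is
// typically bound via an Airlift config class; not necessarily the real Trino class.
public class ExampleParquetReaderConfig
{
    private DataSize maxBufferSize = DataSize.of(8, MEGABYTE);

    public DataSize getMaxBufferSize()
    {
        return maxBufferSize;
    }

    @Config("parquet.max-buffer-size")
    public ExampleParquetReaderConfig setMaxBufferSize(DataSize maxBufferSize)
    {
        this.maxBufferSize = maxBufferSize;
        return this;
    }
}
```

With such a binding in place, operators set the property in the catalog properties file, for example `parquet.max-buffer-size=16MB`.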

@cla-bot added the cla-signed label on Dec 12, 2022
@lukasz-stec force-pushed the ls/054-parquet-small-buffer branch 5 times, most recently from fb2380e to 8c1f314 on December 13, 2022 09:50
@lukasz-stec force-pushed the ls/054-parquet-small-buffer branch 2 times, most recently from 1b732cf to 973e7e0 on December 15, 2022 08:51
@lukasz-stec marked this pull request as ready for review on December 15, 2022 08:52
@lukasz-stec

I added tests for PageReader and ChunkedInputStream. This is ready for review.

@lukasz-stec left a comment

some comments.

@lukasz-stec left a comment

most comments addressed

@lukasz-stec left a comment

most comments addressed (javadoc for ChunkedInputStream still pending)

@lukasz-stec left a comment

Comments addressed. Tests of memory accounting are pending.

@lukasz-stec

memory accounting test added in io.trino.parquet.reader.TestParquetDataSource#testMemoryAccounting
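
As a rough sketch of the property such a test can verify (not the contents of the actual test), a buffered piece of a column chunk should be registered with the reader's memory context when it is allocated and released when it is freed. Only the memory-context classes below are real Trino APIs; their use here is an assumption for illustration.

```java
import io.trino.memory.context.AggregatedMemoryContext;
import io.trino.memory.context.LocalMemoryContext;

import static io.trino.memory.context.AggregatedMemoryContext.newSimpleAggregatedMemoryContext;
import static org.assertj.core.api.Assertions.assertThat;

// Illustration of the accounting idea, not TestParquetDataSource#testMemoryAccounting itself.
public class MemoryAccountingSketch
{
    public static void main(String[] args)
    {
        AggregatedMemoryContext memoryContext = newSimpleAggregatedMemoryContext();
        LocalMemoryContext chunkContext = memoryContext.newLocalMemoryContext("column-chunk");

        // Simulate buffering one 8 MB piece of a much larger column chunk
        long pieceSize = 8L * 1024 * 1024;
        chunkContext.setBytes(pieceSize);
        assertThat(memoryContext.getBytes()).isEqualTo(pieceSize);

        // Releasing the piece returns the accounted memory to zero
        chunkContext.setBytes(0);
        assertThat(memoryContext.getBytes()).isEqualTo(0);
    }
}
```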

TestParquetDataSource tests only the generic AbstractParquetDataSource logic, so it belongs in trino-parquet.

Before this change, Parquet column chunks were read in one go, copying everything into one big Slice. This had two issues. First, for limit queries we potentially don't need to read the entire column chunk to finish the query, since the first page may already satisfy the limit. Second, for files with a big row group size, the allocated Slice can exceed the JVM limit for a native byte array, and even when it doesn't, memory usage is inefficient because of how humongous allocations are handled by the JVM.

We can allow a big DiskRange to be passed to ParquetDataSource.planRead, since it splits the ranges into small chunks anyway.
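
A minimal sketch of that splitting step is shown below; the Range record is a simplified stand-in for Trino's DiskRange, and the actual planRead chunking logic may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the splitting described above: a large range is broken
// into pieces no larger than the maximum buffer size before any read is issued.
final class DiskRangeSplitting
{
    record Range(long offset, int length) {}

    private DiskRangeSplitting() {}

    static List<Range> split(Range range, int maxBufferSize)
    {
        List<Range> pieces = new ArrayList<>();
        long position = range.offset();
        long remaining = range.length();
        while (remaining > 0) {
            int pieceLength = (int) Math.min(remaining, maxBufferSize);
            pieces.add(new Range(position, pieceLength));
            position += pieceLength;
            remaining -= pieceLength;
        }
        return pieces;
    }
}
```
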
@martint dismissed their stale review on December 21, 2022 18:53

Comments were addressed. Thanks @lukasz-stec, @raunaqmorarka

@lukasz-stec

Benchmark results for TPC-H/TPC-DS Parquet sf1k partitioned data, comparing the default parquet.max-buffer-size=8MB with parquet.max-buffer-size=128MB.


parquet-small-chunks-128MB-buffer-oss-sf1k-part-ext.pdf

Development

Successfully merging this pull request may close these issues.

ArithmeticException: integer overflow when querying large Parquet files with large row group (2GB)
4 participants