Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use ChunkedSliceOutput to chunk and buffer writes in parquet #18564

Merged
merged 3 commits into from Aug 10, 2023

Conversation

raunaqmorarka
Copy link
Member

@raunaqmorarka raunaqmorarka commented Aug 7, 2023

Description

Avoids separate calls to output stream for writing page headers and page data.
Avoids holding on to extra memory for each buffered page due to over-allocation by compressors.
Adds memory usage accounting for buffered pages.

Additional context and related issues

Should help with the memory usage accounting problems described in #18557

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Hudi, Iceberg, Delta
* Improve memory usage accounting of parquet writer. ({issue}`18564`)

Allows simplifying some of the code working with pages sizes
to use ints instead of longs
@findepi
Copy link
Member

findepi commented Aug 7, 2023

Avoids separate calls to output stream for writing page headers and page data.

not sure i understand the goal here.
are we saving on http requests? i think the output stream is buffering internally?

can you expand a bit the commit rationale so that it's more obvious what's the objective?

@raunaqmorarka
Copy link
Member Author

Avoids separate calls to output stream for writing page headers and page data.

not sure i understand the goal here. are we saving on http requests? i think the output stream is buffering internally?

can you expand a bit the commit rationale so that it's more obvious what's the objective?

Yes, there is varying levels of buffering in different output stream implementations. It's not clear to me that every implementation is doing that or that we should rely on that always being the case. In any case, that's not the main benefit of this change.

The main objective here is

Avoids holding on to extra memory for each buffered page due to over-allocation by compressors.
Adds memory usage accounting for buffered pages.

Also, ORC writer uses ChunkedSliceOutput in OrcOutputBuffer#writeChunkToOutputStream in a similar way.

@raunaqmorarka
Copy link
Member Author

raunaqmorarka commented Aug 7, 2023

Insert benchmarks.pdf

There is a big increase in the peak memory usage due to the improved accounting and small decrease in perf due to the extra byte array copy required to get rid of over-allocation in compression buffer.
Screenshot 2023-08-10 at 12 01 12 PM

Copy link
Member

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comments

@@ -297,6 +297,7 @@ public long getBufferedBytes()
public long getRetainedBytes()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it called after a flush? (is there something like a flush here)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no "flush" here, there is a "close" followed by a getBuffer to extract the buffered pages.
This is called by connector page sink for every writer after writing each page.

@raunaqmorarka raunaqmorarka force-pushed the pqw-output branch 3 times, most recently from f1df74a to dc324d9 Compare August 9, 2023 20:03
Avoids separate calls to output stream for writing page headers
and page data
Avoids holding on to extra memory for each buffered page due to
over-allocation by compressors
@raunaqmorarka raunaqmorarka merged commit ddf82b3 into trinodb:master Aug 10, 2023
67 checks passed
@raunaqmorarka raunaqmorarka deleted the pqw-output branch August 10, 2023 12:50
@colebow colebow added this to the 423 milestone Aug 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

5 participants