SNOW-835618 Snowpipe Streaming: send uncompressed chunk length from SDK to GS #521
Conversation
Does it make sense to also modify https://github.com/snowflakedb/snowflake/pull/105446 to take into account the uncompressed size @sfc-gh-azagrebin @sfc-gh-lsembera?
Indeed, we can log it there! I am on it.
Could we add a simple test?
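A minimal sketch of what such a test could look like, assuming the builder calls from the hunk below and a matching `getUncompressedChunkLength()` accessor (the test scaffolding here is an assumption, not code from this repository):

```java
import org.junit.Assert;
import org.junit.Test;

public class ChunkMetadataUncompressedLengthTest {
  @Test
  public void uncompressedChunkLengthIsCarriedInMetadata() {
    // Build metadata the same way constructBlobAndMetadata does in the hunk
    // below; other required builder fields are elided for brevity.
    ChunkMetadata metadata =
        ChunkMetadata.builder()
            .setChunkLength(128) // compressed, padded length
            .setUncompressedChunkLength(512) // estimated uncompressed length
            .build();

    // getUncompressedChunkLength() is assumed to mirror getChunkLength().
    Assert.assertEquals(Integer.valueOf(512), metadata.getUncompressedChunkLength());
  }
}
```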
```diff
@@ -114,6 +114,7 @@ static <T> Blob constructBlobAndMetadata(
         // The paddedChunkLength is used because it is the actual data size used for
         // decompression and md5 calculation on server side.
         .setChunkLength(paddedChunkLength)
+        .setUncompressedChunkLength((int) serializedChunk.chunkUncompressedSize)
```
This size will be different from our estimated uncompressed buffer size, right? Is that something you want?
I think this is what we estimate. We do not have the actual uncompressed size because it is tracked only internally in the Parquet lib, but I also do not think it matters much for gathering stats, which is the goal of this change.
Perhaps out of scope for this PR: do you know if it is a lot of additional work to gather the actual uncompressed chunk length? We have customers asking for throughput numbers.
Also, can we adjust the naming from `chunkUncompressedSize` to `chunkUncompressedSizeEstimate` to reduce this confusion?
True, I changed the variable name and the field in the flush result, but not in the server response.
The problem is that the server side can now accept only this name for the size.
The current intention was to send the estimated size so we can investigate the compression ratio based on that estimation: the way we use the Parquet lib now, we cannot know the actual size before we start the flush, so an estimate is all we have for the compression prediction before flushing.
Once we decide to rewrite the Parquet generation at a lower level and buffer directly into serialized Parquet buffers, it will become possible to get the actual size during buffering, before the flush.
This is a good point about the actual size. I looked into the code again, and I think it is now possible to report the actual uncompressed size from the Parquet lib after the flush; I have added it to the serialization result. As I understand it, the expected difference should be very small, close to negligible.
Nonetheless, we could send both if we changed the server side to accept both, but for now, as I mentioned at the beginning, I would prefer to send the estimation.
The estimation should be a good enough representation of the user data because it basically just sums up the actual user data bytes plus some small overhead of the Parquet format (data lengths and the nullable bit-mask).
Let me know what you think.
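To make the "user data bytes plus small Parquet overhead" point concrete, here is a hedged illustration of how such an estimate can be accumulated at insert time; the class, method names, and overhead constants are invented for illustration and are not the SDK's actual code:

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustration only: an uncompressed-size estimate that sums the raw user data
 * bytes plus a small per-value allowance for Parquet framing (the value length
 * prefix and the nullability bitmask mentioned above).
 */
final class UncompressedSizeEstimator {
  // Parquet stores BYTE_ARRAY values with a 4-byte length prefix.
  private static final long LENGTH_PREFIX_BYTES = 4;
  // Rough allowance for definition-level / null bookkeeping per value.
  private static final long PER_VALUE_OVERHEAD_BYTES = 1;

  private long estimate = 0;

  void onStringValue(String value) {
    estimate += LENGTH_PREFIX_BYTES
        + value.getBytes(StandardCharsets.UTF_8).length
        + PER_VALUE_OVERHEAD_BYTES;
  }

  void onLongValue() {
    estimate += Long.BYTES + PER_VALUE_OVERHEAD_BYTES;
  }

  long currentEstimate() {
    return estimate;
  }
}
```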
Makes sense, thank you for looking into this. I just want to make sure we have at least a rough plan for reporting the actual size in the future: can we change `chunk_length_uncompressed` to represent the actual size later? I'm worried about future confusion if the field's meaning changes across versions under the current naming.
I.e., should we have:
- `chunk_length_uncompressed` and `estimated_chunk_length_uncompressed`, or
- `chunk_length_uncompressed` and `actual_chunk_length_uncompressed`?
> can we change chunk_length_uncompressed to represent the actual size in the future?

This is not so easy because it is part of the client/server API. The server already accepts `chunk_length_uncompressed` and nothing else. We can simply add `actual_chunk_length_uncompressed` later and leave a comment on `chunk_length_uncompressed`. I agree the naming should have been better, but otherwise there are two ways:
- deprecating the old name, which would be a hassle because old clients would still send it;
- or changing the server side first and blocking this PR until the server side is released (two versions). That would mean we could not gather any stats for a longer time, and customers would keep using older post-GA SDK versions that do not send any stats.

Eventually, I think we should have only the actual-size stat if we decide to refactor the Parquet generator; otherwise we will always use the estimation. We do not really need both. The estimation reflects the user data well enough; the actual size would just include a bit more or less Parquet-specific overhead, which is not user data anyway.
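A hedged sketch of the backward-compatible option described above: keep the existing wire name and add the second field later. Only `chunk_length_uncompressed` is confirmed by this PR; `actual_chunk_length_uncompressed` is the hypothetical future field:

```java
import com.fasterxml.jackson.annotation.JsonProperty;

class ChunkMetadataSketch {
  // Existing field: the estimated uncompressed size. The server already
  // accepts exactly this name, so renaming it would break the wire protocol.
  @JsonProperty("chunk_length_uncompressed")
  Integer chunkLengthUncompressed;

  // Hypothetical future field: the actual uncompressed size reported by the
  // Parquet writer after flush. Added alongside the old field, so old servers
  // and old clients remain compatible.
  @JsonProperty("actual_chunk_length_uncompressed")
  Integer actualChunkLengthUncompressed;
}
```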
Moreover, if you are asking about customer throughput, we should actually compute it a bit differently and call it something completely different :) Parquet encoding with data page V2 (for now we use V1) acts like compression and can significantly reduce the actual customer data size even before compressing it. I assume that by customer throughput you mean the actual bytes customers insert, not how Parquet eventually packs them. Hence, it would be a different stat to report, something like `chunk_user_data_size`, which would better reflect customer throughput.
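For illustration, such a stat could feed a throughput number along these lines; `chunk_user_data_size` and the helper are hypothetical, not part of this PR:

```java
// Illustration only: customer throughput as raw inserted bytes over wall-clock
// time, independent of how Parquet encodes or compresses those bytes.
final class ThroughputStats {
  static double userThroughputMBps(long chunkUserDataSizeBytes, long elapsedMillis) {
    return (chunkUserDataSizeBytes / (1024.0 * 1024.0)) / (elapsedMillis / 1000.0);
  }
}
```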
```diff
@@ -152,6 +162,11 @@ Integer getChunkLength() {
     return chunkLength;
   }

+  @JsonProperty("chunk_length_uncompressed")
```
I believe this will require a server-side change first.
The server-side change is merged.
But yes, we have to wait.
Codecov Report

```
@@            Coverage Diff             @@
##           master     #521      +/-   ##
==========================================
- Coverage   78.25%   78.15%   -0.11%
==========================================
  Files          76       76
  Lines        4746     4751       +5
  Branches      426      426
==========================================
- Hits         3714     3713       -1
- Misses        852      856       +4
- Partials      180      182       +2
```

... and 1 file with indirect coverage changes.
lgtm
Please put a meaningful description instead of just linking the JIRA, thanks!
lgtm - left a small nit and a question about actual uncompressed chunk length
SNOW-835618
Currently, we send only the compressed chunk length in the blob metadata.
We can also send the uncompressed length and log it in ingestion events to analyze compression-ratio stats.
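For example, with both lengths in the blob metadata, the ingestion-events side can derive a compression ratio along these lines (an illustrative computation, not code from this PR):

```java
// Illustration only: ratio > 1.0 means compression reduced the chunk size.
final class CompressionStats {
  static double compressionRatio(int chunkLength, int chunkLengthUncompressed) {
    return chunkLengthUncompressed / (double) chunkLength;
  }
}
```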