SNOW-835618 Snowpipe Streaming: send uncompressed chunk length from SDK to GS #521
Conversation
Does it make sense to also modify https://github.com/snowflakedb/snowflake/pull/105446 to take into account the uncompressed size @sfc-gh-azagrebin @sfc-gh-lsembera?
Indeed, we can log it there! I am on it.
Could we add a simple test?
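A minimal sketch of what such a test could look like, assuming the builder calls from the hunk below and a matching `getUncompressedChunkLength()` accessor (the test scaffolding here is an assumption, not code from this repository):

```java
import org.junit.Assert;
import org.junit.Test;

public class ChunkMetadataUncompressedLengthTest {
  @Test
  public void uncompressedChunkLengthIsCarriedInMetadata() {
    // Build metadata the same way constructBlobAndMetadata does in the hunk
    // below; other required builder fields are elided for brevity.
    ChunkMetadata metadata =
        ChunkMetadata.builder()
            .setChunkLength(128) // compressed, padded length
            .setUncompressedChunkLength(512) // estimated uncompressed length
            .build();

    // getUncompressedChunkLength() is assumed to mirror getChunkLength().
    Assert.assertEquals(Integer.valueOf(512), metadata.getUncompressedChunkLength());
  }
}
```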
```diff
@@ -114,6 +114,7 @@ static <T> Blob constructBlobAndMetadata(
         // The paddedChunkLength is used because it is the actual data size used for
         // decompression and md5 calculation on server side.
         .setChunkLength(paddedChunkLength)
+        .setUncompressedChunkLength((int) serializedChunk.chunkUncompressedSize)
```
This size will be different from our estimated uncompressed buffer size, right? Is that something you want?
I think this is what we estimate. We do not have the actual uncompressed size because it is tracked only internally in the Parquet lib, but I also do not think it matters much for gathering stats, which is the goal of this change.
Perhaps out of scope for this PR: do you know if it is a lot of additional work to gather the actual uncompressed chunk length? We have customers asking for throughput numbers.
Also, can we adjust the naming from `chunkUncompressedSize` to `chunkUncompressedSizeEstimate` to reduce this confusion?
True, I changed the variable name and the field in the flush result, but not in the server response.
The problem is that the server side can now accept only this name for the size.
The current intention was to send the estimated size so we can investigate the compression ratio based on that estimation: the way we use the Parquet lib now, we cannot know the actual size before we start the flush, so an estimate is all we have for the compression prediction before flushing.
Once we decide to rewrite the Parquet generation at a lower level and buffer directly into serialized Parquet buffers, it will become possible to get the actual size during buffering, before the flush.
This is a good point about the actual size. I looked into the code again, and I think it is now possible to report the actual uncompressed size from the Parquet lib after the flush; I have added it to the serialization result. As I understand it, the expected difference should be very small, close to negligible.
Nonetheless, we could send both if we changed the server side to accept both, but for now, as I mentioned at the beginning, I would prefer to send the estimation.
The estimation should be a good enough representation of the user data because it basically just sums up the actual user data bytes plus some small overhead of the Parquet format (data lengths and the nullable bit-mask).
Let me know what you think.
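To make the "user data bytes plus small Parquet overhead" point concrete, here is a hedged illustration of how such an estimate can be accumulated at insert time; the class, method names, and overhead constants are invented for illustration and are not the SDK's actual code:

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustration only: an uncompressed-size estimate that sums the raw user data
 * bytes plus a small per-value allowance for Parquet framing (the value length
 * prefix and the nullability bitmask mentioned above).
 */
final class UncompressedSizeEstimator {
  // Parquet stores BYTE_ARRAY values with a 4-byte length prefix.
  private static final long LENGTH_PREFIX_BYTES = 4;
  // Rough allowance for definition-level / null bookkeeping per value.
  private static final long PER_VALUE_OVERHEAD_BYTES = 1;

  private long estimate = 0;

  void onStringValue(String value) {
    estimate += LENGTH_PREFIX_BYTES
        + value.getBytes(StandardCharsets.UTF_8).length
        + PER_VALUE_OVERHEAD_BYTES;
  }

  void onLongValue() {
    estimate += Long.BYTES + PER_VALUE_OVERHEAD_BYTES;
  }

  long currentEstimate() {
    return estimate;
  }
}
```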
Makes sense, thank you for looking into this. I just want to make sure we have at least a rough plan for reporting the actual size in the future: can we change `chunk_length_uncompressed` to represent the actual size later? I'm worried about future confusion if the field's meaning changes across versions under the current naming.
I.e., should we have:
- `chunk_length_uncompressed` and `estimated_chunk_length_uncompressed`, or
- `chunk_length_uncompressed` and `actual_chunk_length_uncompressed`?
> can we change chunk_length_uncompressed to represent the actual size in the future?

This is not so easy because it is part of the client/server API. The server already accepts `chunk_length_uncompressed` and nothing else. We can simply add `actual_chunk_length_uncompressed` later and leave a comment on `chunk_length_uncompressed`. I agree the naming should have been better, but otherwise there are two ways:
- deprecating the old name, which would be a hassle because old clients would still send it;
- or changing the server side first and blocking this PR until the server side is released (two versions). That would mean we could not gather any stats for a longer time, and customers would keep using older post-GA SDK versions that do not send any stats.

Eventually, I think we should have only the actual-size stat if we decide to refactor the Parquet generator; otherwise we will always use the estimation. We do not really need both. The estimation reflects the user data well enough; the actual size would just include a bit more or less Parquet-specific overhead, which is not user data anyway.
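A hedged sketch of the backward-compatible option described above: keep the existing wire name and add the second field later. Only `chunk_length_uncompressed` is confirmed by this PR; `actual_chunk_length_uncompressed` is the hypothetical future field:

```java
import com.fasterxml.jackson.annotation.JsonProperty;

class ChunkMetadataSketch {
  // Existing field: the estimated uncompressed size. The server already
  // accepts exactly this name, so renaming it would break the wire protocol.
  @JsonProperty("chunk_length_uncompressed")
  Integer chunkLengthUncompressed;

  // Hypothetical future field: the actual uncompressed size reported by the
  // Parquet writer after flush. Added alongside the old field, so old servers
  // and old clients remain compatible.
  @JsonProperty("actual_chunk_length_uncompressed")
  Integer actualChunkLengthUncompressed;
}
```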
Moreover, if you are asking about customer throughput, we should actually compute it a bit differently and call it something completely different :) Parquet encoding with data page V2 (for now we use V1) acts like compression and can significantly reduce the actual customer data size even before compressing it. I assume that by customer throughput you mean the actual bytes customers insert, not how Parquet eventually packs them. Hence, it would be a different stat to report, something like `chunk_user_data_size`, which would better reflect customer throughput.
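For illustration, such a stat could feed a throughput number along these lines; `chunk_user_data_size` and the helper are hypothetical, not part of this PR:

```java
// Illustration only: customer throughput as raw inserted bytes over wall-clock
// time, independent of how Parquet encodes or compresses those bytes.
final class ThroughputStats {
  static double userThroughputMBps(long chunkUserDataSizeBytes, long elapsedMillis) {
    return (chunkUserDataSizeBytes / (1024.0 * 1024.0)) / (elapsedMillis / 1000.0);
  }
}
```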
```diff
@@ -152,6 +162,11 @@ Integer getChunkLength() {
     return chunkLength;
   }

+  @JsonProperty("chunk_length_uncompressed")
```
I believe this will require a server-side change first.
The server-side change is merged.
But yes, we have to wait.
Codecov Report

```
@@            Coverage Diff             @@
##           master     #521      +/-   ##
==========================================
- Coverage   78.25%   78.15%   -0.11%
==========================================
  Files          76       76
  Lines        4746     4751       +5
  Branches      426      426
==========================================
- Hits         3714     3713       -1
- Misses        852      856       +4
- Partials      180      182       +2
```

... and 1 file with indirect coverage changes.
lgtm
Please put a meaningful description instead of just linking the JIRA, thanks!
lgtm - left a small nit and a question about actual uncompressed chunk length
SNOW-835618
Currently, we send only the compressed chunk length in the blob metadata.
We can also send the uncompressed length and log it in ingestion events to analyze compression-ratio stats.
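For example, with both lengths in the blob metadata, the ingestion-events side can derive a compression ratio along these lines (an illustrative computation, not code from this PR):

```java
// Illustration only: ratio > 1.0 means compression reduced the chunk size.
final class CompressionStats {
  static double compressionRatio(int chunkLength, int chunkLengthUncompressed) {
    return chunkLengthUncompressed / (double) chunkLength;
  }
}
```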