
SNOW-835618 Snowpipe Streaming: send uncompressed chunk length from SDK to GS #521

Merged

Conversation

sfc-gh-azagrebin
Contributor

@sfc-gh-azagrebin sfc-gh-azagrebin commented Jun 8, 2023

SNOW-835618

Currently, we send only the compressed chunk length in the blob metadata.
This PR also sends the uncompressed length and logs it in ingestion events so we can analyse compression-ratio stats.

@sfc-gh-kkloudas
Contributor

Does it make sense to also modify https://github.com/snowflakedb/snowflake/pull/105446 to take into account the uncompressed size @sfc-gh-azagrebin @sfc-gh-lsembera ?

@sfc-gh-azagrebin
Contributor Author

sfc-gh-azagrebin commented Jun 8, 2023

@sfc-gh-kkloudas

Does it make sense to also modify server side alert to take into account the uncompressed size

indeed, we can log it there! I am on it

Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment


Could we add a simple test?

@@ -114,6 +114,7 @@ static <T> Blob constructBlobAndMetadata(
// The paddedChunkLength is used because it is the actual data size used for
// decompression and md5 calculation on server side.
.setChunkLength(paddedChunkLength)
.setUncompressedChunkLength((int) serializedChunk.chunkUncompressedSize)
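The relationship between the two lengths in the diff above can be sketched as follows. This is a simplified, hypothetical class, not the SDK's actual `ChunkMetadata`; it only illustrates how carrying both the compressed and uncompressed lengths lets the server derive compression-ratio stats.

```java
// Hypothetical, simplified sketch of chunk metadata carrying both lengths.
// Names are illustrative, not the SDK's actual classes.
class ChunkMetadataSketch {
    private final int chunkLength;             // padded/compressed size sent today
    private final int uncompressedChunkLength; // new field added by this PR

    private ChunkMetadataSketch(int chunkLength, int uncompressedChunkLength) {
        this.chunkLength = chunkLength;
        this.uncompressedChunkLength = uncompressedChunkLength;
    }

    static ChunkMetadataSketch of(int chunkLength, int uncompressedChunkLength) {
        return new ChunkMetadataSketch(chunkLength, uncompressedChunkLength);
    }

    int getChunkLength() { return chunkLength; }

    int getUncompressedChunkLength() { return uncompressedChunkLength; }

    // The compression ratio that server-side ingestion events could derive
    // once both fields are present in the blob metadata.
    double compressionRatio() {
        return (double) uncompressedChunkLength / chunkLength;
    }
}
```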
Contributor


This size will be different compared with our estimated uncompressed buffer size, right? Is that something you want?

Contributor Author

@sfc-gh-azagrebin sfc-gh-azagrebin Jun 9, 2023


I think this is what we estimate. We do not have the actual uncompressed size because it is tracked only internally in the Parquet lib, but I do not think that matters much for gathering stats, which is the goal of this change.

Collaborator


Perhaps out of scope for this PR - do you know if it is a lot of additional work to gather the actual uncompressed chunk length? We have customers asking for throughput numbers.

Collaborator


Also can we adjust the naming from 'chunkUncompressedSize' to 'chunkUncompressedSizeEstimate' to reduce this confusion?

Contributor Author

@sfc-gh-azagrebin sfc-gh-azagrebin Jul 21, 2023


True, I changed the variable name in the flush result, but not in the server response.
The problem is that the server side can currently accept only this name for the size.

The original intention was to send the estimated size so we can investigate the compression ratio based on the estimation. The way we use the Parquet lib today, we cannot know the actual size before we start the flush, so an estimate is all we have at that point.
Once we decide to rewrite the Parquet generation at a lower level and buffer directly into serialised Parquet buffers, it will become possible to get the actual size during buffering, before flushing.

This is a good point about the actual size, though. I looked into the code again and I think it is
now possible to report the actual uncompressed size from the Parquet lib after the flush, and I have added it to the
serialisation result. As I understand it, the expected difference should be very small, close to negligible.
Nonetheless, we could send both if we change the server side to accept both, but for now,
as I mentioned at the beginning, I would prefer to send the estimate.
The estimate should be a good enough representation of the user data because it basically just sums up the actual user data bytes plus a small overhead for the Parquet format (data lengths and the nullable bit-mask).

Let me know what you think.
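The estimation scheme described above (actual user data bytes plus a small per-value overhead and a nullable bit-mask) could look roughly like this. The class and the overhead constant are made up for illustration; the SDK's real estimation logic differs in detail.

```java
// Illustrative estimate of a chunk's uncompressed size: the raw user data
// bytes plus a small per-value Parquet overhead, plus a nullable bit-mask.
// PER_VALUE_OVERHEAD_BYTES is a hypothetical value, not the SDK's constant.
class UncompressedSizeEstimator {
    private static final int PER_VALUE_OVERHEAD_BYTES = 4; // e.g. a length prefix
    private long estimate = 0;

    // Account for one user value: its raw bytes plus the per-value overhead.
    void addValue(byte[] userDataBytes) {
        estimate += userDataBytes.length + PER_VALUE_OVERHEAD_BYTES;
    }

    // Account for a nullable column's bit-mask: one bit per row, rounded up.
    void addNullBitmask(int rowCount) {
        estimate += (rowCount + 7) / 8;
    }

    long getEstimate() { return estimate; }
}
```

Because the estimate is dominated by the raw user data bytes, it tracks the actual uncompressed Parquet size closely, which is why the difference is expected to be negligible for compression-ratio stats.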

Collaborator


Makes sense, thank you for looking into this. I just want to make sure we have a rough plan for reporting the actual size in the future - can we change chunk_length_uncompressed to represent the actual size later? I'm worried about future confusion if the field's meaning changes across versions while the name stays the same.

Ie should we have:

  • chunk_length_uncompressed and estimated_chunk_length_uncompressed
  • chunk_length_uncompressed and actual_chunk_length_uncompressed

Contributor Author

@sfc-gh-azagrebin sfc-gh-azagrebin Jul 24, 2023


can we change chunk_length_uncompressed to represent the actual size in the future?

This is not so easy because it is part of the client/server API.
The server already accepts chunk_length_uncompressed and nothing else.
We can simply add actual_chunk_length_uncompressed later and document chunk_length_uncompressed with a comment. I agree the naming could have been better, but otherwise there are two options:

  • deprecating the old name, which would be a hassle because old clients would still be sending it
  • or changing the server side first and blocking this PR until the server side is released (2 versions). That would mean we could not gather any stats for a longer time, and customers would keep using older post-GA SDK versions that do not send any stats.

Eventually, I think we should have only the actual-size stats if we decide to refactor the Parquet generator; otherwise we will always use the estimate. We do not really need both. The estimate reflects the user data well enough; the actual size would just include a bit more or less Parquet-specific overhead, which is not user data anyway.

Contributor Author

@sfc-gh-azagrebin sfc-gh-azagrebin Jul 24, 2023


Moreover, if you are asking about customer throughput, we should actually compute it a bit differently and call it something else :) Parquet page format v2 encoding (for now we use v1) acts like compression and can significantly reduce the actual customer data size even before compressing it. I assume that by customer throughput you mean the actual data bytes that customers insert, not how Parquet eventually packs them. Hence, that would be a different stat to report, something like chunk_user_data_size, which would better reflect customer throughput.

@@ -152,6 +162,11 @@ Integer getChunkLength() {
return chunkLength;
}

@JsonProperty("chunk_length_uncompressed")
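For context on why this field name is hard to change later: once the server starts parsing chunk_length_uncompressed from the request payload, every deployed SDK version is pinned to that exact key. A minimal sketch of the JSON shape (hand-built here to stay dependency-free; the SDK itself uses Jackson's @JsonProperty as in the diff above, and the real payload has many more fields):

```java
// Sketch of the JSON fragment the server expects for the chunk lengths.
// Hand-rolled string building for illustration only; the SDK serializes
// ChunkMetadata with Jackson annotations instead.
class ChunkLengthsJson {
    static String toJson(int chunkLength, int uncompressedChunkLength) {
        return String.format(
            "{\"chunk_length\":%d,\"chunk_length_uncompressed\":%d}",
            chunkLength, uncompressedChunkLength);
    }
}
```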
Contributor


I believe this will require a server side change first

Contributor Author


The server-side change is merged.

Contributor Author


but yes, we have to wait

@sfc-gh-azagrebin sfc-gh-azagrebin force-pushed the azagrebin-SNOW-835618-send-uncompressed-chunk-len branch from 375ea7f to fde5fdb Compare June 9, 2023 10:26
@sfc-gh-lsembera sfc-gh-lsembera mentioned this pull request Jun 9, 2023
@sfc-gh-azagrebin sfc-gh-azagrebin force-pushed the azagrebin-SNOW-835618-send-uncompressed-chunk-len branch from fde5fdb to 175c989 Compare June 12, 2023 15:13
@sfc-gh-azagrebin sfc-gh-azagrebin force-pushed the azagrebin-SNOW-835618-send-uncompressed-chunk-len branch from 175c989 to 03eebf7 Compare July 20, 2023 14:50
@codecov

codecov bot commented Jul 20, 2023

Codecov Report

Merging #521 (65d8cc1) into master (79b4ea2) will decrease coverage by 0.11%.
The diff coverage is 100.00%.

❗ Current head 65d8cc1 differs from pull request most recent head 597864f. Consider uploading reports for the commit 597864f to get more accurate results

@@            Coverage Diff             @@
##           master     #521      +/-   ##
==========================================
- Coverage   78.25%   78.15%   -0.11%     
==========================================
  Files          76       76              
  Lines        4746     4751       +5     
  Branches      426      426              
==========================================
- Hits         3714     3713       -1     
- Misses        852      856       +4     
- Partials      180      182       +2     
Impacted Files Coverage Δ
...owflake/ingest/streaming/internal/BlobBuilder.java 97.46% <100.00%> (+0.03%) ⬆️
...flake/ingest/streaming/internal/ChunkMetadata.java 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes


Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment


lgtm

@sfc-gh-tzhang
Copy link
Contributor

sfc-gh-tzhang commented Jul 20, 2023

SNOW-835618

Please put a meaningful description instead of just linking the JIRA, thanks!

Collaborator

@sfc-gh-rcheng sfc-gh-rcheng left a comment


lgtm - left a small nit and a question about actual uncompressed chunk length

@sfc-gh-azagrebin sfc-gh-azagrebin force-pushed the azagrebin-SNOW-835618-send-uncompressed-chunk-len branch from 65d8cc1 to d95d1ba Compare July 21, 2023 10:20
@sfc-gh-azagrebin sfc-gh-azagrebin force-pushed the azagrebin-SNOW-835618-send-uncompressed-chunk-len branch from d95d1ba to 597864f Compare July 24, 2023 11:30
@sfc-gh-azagrebin sfc-gh-azagrebin merged commit 5d9b291 into master Jul 24, 2023
13 checks passed
@sfc-gh-azagrebin sfc-gh-azagrebin deleted the azagrebin-SNOW-835618-send-uncompressed-chunk-len branch July 24, 2023 12:36
5 participants