Feature request: Zstandard compression for sinks #2302

Closed
occasionallydavid opened this issue Apr 12, 2020 · 14 comments
Labels
domain: compression (Anything related to compressing data within Vector)
domain: networking (Anything related to Vector's networking)
domain: sinks (Anything related to Vector's sinks)
have: nice (This feature is nice to have. It is low priority.)
needs: approval (Needs review & approval before work can begin.)
needs: rfc (Needs an RFC before work can begin.)

Comments

@occasionallydavid

Vector mostly supports only gzip compression in its sinks, which is to say, a compressor specified in 1990 and based on methods that were already 20 years old, performing string deduplication over a tiny 32 KiB window. Deflate is quick neither to compress nor to decompress, has tragic ratios for bulk data, and over the past 30 years has been roundly obsoleted by numerous compressors in every metric except mass adoption.
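To make the 32 KiB window point concrete, here is a minimal sketch using only Python's standard library. It compresses a random block followed by an exact copy of itself, once with the copy inside Deflate's window and once outside it. LZMA stands in here for "a compressor with a larger window" only because it ships with Python; Zstandard is not used in this sketch, and the helper names are made up for illustration.

```python
import lzma
import random
import zlib

DEFLATE_WINDOW = 32 * 1024  # maximum back-reference distance in Deflate


def repeated_block(block_size: int, seed: int = 0) -> bytes:
    """Return an incompressible random block followed by an exact copy of itself."""
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_size))
    return block + block


def compressed_sizes(data: bytes) -> dict:
    """Compress with Deflate (zlib) and with LZMA; return sizes in bytes."""
    return {
        "input": len(data),
        "zlib": len(zlib.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }


# The copy starts 16 KiB back: inside the window, so Deflate can reference it.
near = compressed_sizes(repeated_block(16 * 1024))
# The copy starts 40 KiB back: outside the window, so Deflate cannot see it.
far = compressed_sizes(repeated_block(40 * 1024))

print("near:", near)
print("far: ", far)
```

In the "near" case both compressors should emit roughly one block's worth of data; in the "far" case zlib should emit roughly both blocks in full while LZMA still finds the duplicate.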

Of those modern compressors, LZMA and Zstandard have some level of adoption and are fit for general use, but for log analysis in particular, Zstandard hits a massive sweet spot, combining state-of-the-art compression ratios with best-in-class decompression speed.

It's possible to get 20x compression of logs with Zstandard and decompress those logs for analysis at almost 2 GiB/s with a single thread. This (theoretically) allows a 20-core machine to process 40 GiB/s of decompressed logs while saturating an underlying 2 GiB/s NVMe storage device (assuming no work other than decompression is being performed).
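The arithmetic behind those figures can be written out as a small sketch. The function and key names are hypothetical; the inputs (20 cores, 2 GiB/s per thread, 20x ratio) are the ones quoted above, and perfect scaling across cores is assumed.

```python
def decompression_budget(cores: int, per_core_gib_s: float, ratio: float) -> dict:
    """Estimate aggregate throughput, assuming decompression scales linearly with cores."""
    # Decompressed output produced by all cores together.
    decompressed = cores * per_core_gib_s
    # Compressed input that storage must deliver to keep every core busy.
    compressed_read = decompressed / ratio
    return {
        "decompressed_gib_s": decompressed,
        "compressed_read_gib_s": compressed_read,
    }


budget = decompression_budget(cores=20, per_core_gib_s=2.0, ratio=20.0)
print(budget)
```

With these inputs the required read rate works out to the same 2 GiB/s as the NVMe device in the example, which is why the storage is described as saturated.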

LZMA is competitive with Zstandard in ratio and overall performance, but Zstandard still enjoys a significant lead in absolute decompression performance, which for me is a major deciding factor in long-term log storage.

This is a request to consider modern gzip alternatives, or if there is no time for that, perhaps consider only my suggestion to go the Zstandard route. ;)

Thanks

@Hoverbear
Contributor

We definitely want to build on our compression feature in the near future! I think giving folks the option can be done similarly to how we do encoding.

@Hoverbear added the needs: approval, needs: rfc, and domain: networking labels Apr 13, 2020
@Hoverbear
Contributor

For whoever wants to tackle this: I think adding an encoding.compression field might be the way to go?

@binarylogic added the have: nice label Apr 20, 2020
@binarylogic assigned bruceg and unassigned bruceg Apr 20, 2020
@binarylogic
Contributor

@bruceg before we begin work, we should identify sinks where this is compatible.

@bruceg
Member

bruceg commented Apr 22, 2020

Sinks currently using gzip compression:

| Sink | Allowed Methods | Status |
| --- | --- | --- |
| aws_s3 | any | |
| clickhouse | brotli, deflate, gzip (reference) | |
| elasticsearch | gzip (?) | |
| gcp_cloud_storage | any | |
| http | any | |
| kafka | gzip, lz4, snappy, zstd (reference) | supported via librdkafka |
| splunk_hec | gzip (?) | |

@binarylogic
Contributor

So it looks like aws_s3, gcp_cloud_storage, http, and kafka are good sinks to target first.
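For reference, the selection above can be reproduced mechanically from the table in bruceg's comment. This is only an illustrative sketch: the dictionary and function names are made up, and the "(?)" entries are recorded as plain gzip.

```python
# Allowed compression methods per sink, transcribed from the table above.
# "any" means the receiving side does not restrict the method.
ALLOWED_METHODS = {
    "aws_s3": {"any"},
    "clickhouse": {"brotli", "deflate", "gzip"},
    "elasticsearch": {"gzip"},      # marked "(?)" in the table
    "gcp_cloud_storage": {"any"},
    "http": {"any"},
    "kafka": {"gzip", "lz4", "snappy", "zstd"},
    "splunk_hec": {"gzip"},         # marked "(?)" in the table
}


def sinks_accepting(method: str, table: dict = ALLOWED_METHODS) -> list:
    """Return the sinks whose allowed methods include `method` or are unrestricted."""
    return sorted(
        sink for sink, methods in table.items()
        if "any" in methods or method in methods
    )


zstd_candidates = sinks_accepting("zstd")
print(zstd_candidates)
```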

@lukesteensen
Member

Just to be clear, I don't think we have to implement anything for kafka beyond passing the configs down and enabling the relevant features on the crate.

@binarylogic added the domain: compression and domain: sinks labels Aug 7, 2020
@hdhoang
Contributor

hdhoang commented Dec 17, 2020

The file sink's new compression option should be tracked here as well. If the implementation is okay, I can send a PR: hdhoang@ea578aa

However, compression.level is still zlib-specific, whereas zstd levels range from 1 to 21 with a default of 3. Cf. #3032 regarding other algorithms' levels.
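One way to picture the problem is a per-algorithm level table, sketched below. This is a hypothetical illustration, not Vector's configuration code: the names are invented, the gzip row uses zlib's 0-9 range with its default of 6, and the zstd row simply reuses the range and default quoted in this comment.

```python
# (lowest level, highest level, default level) per algorithm.
LEVEL_RANGES = {
    "gzip": (0, 9, 6),   # zlib levels; 6 is zlib's default
    "zstd": (1, 21, 3),  # range and default as quoted in the comment above
}


def resolve_level(algorithm: str, level=None) -> int:
    """Return a level valid for `algorithm`, falling back to its own default."""
    try:
        low, high, default = LEVEL_RANGES[algorithm]
    except KeyError:
        raise ValueError(f"unknown compression algorithm: {algorithm!r}") from None
    if level is None:
        return default
    if not low <= level <= high:
        raise ValueError(
            f"level {level} is outside {low}-{high} for {algorithm}"
        )
    return level


print(resolve_level("gzip"), resolve_level("zstd"), resolve_level("zstd", 19))
```

A single shared numeric range cannot express both rows, which is the mismatch the comment points out.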

@venkat-sneller

n00b here: does Vector support zstd for the aws_s3 sink out of the box already? I don't see it among the allowed encoding.compression options in the docs.

@jszwedko
Member

Hey @venkat-sneller! That's correct, we only support gzip on the aws_s3 sink right now. That would be a good addition though.

@zamazan4ik
Contributor

@hdhoang Sorry for the late ping. Did you get a chance to send a zstd-related PR? If not, could you please give it a try? Thanks in advance!

@hdhoang
Contributor

hdhoang commented Aug 17, 2022

No worries! I'll try it again this month.

(currently we still do td-agent + exec zstd as a flush, fwiw)

@bruceg
Member

bruceg commented Sep 30, 2022

Related: #14349
In particular, see this comment: #14349 (comment)
We want to merge the compression support in batch buffers with that in the file sink, providing a unified configuration and capabilities across sinks that support compression.

@gaby

gaby commented Apr 7, 2023

Has there been any traction on this? I currently have multiple chained Vector instances. One of them is running with the HTTP Server sink using zstd. The other is using HTTP Output, but zstd is currently not supported there.

@jszwedko
Member

#17371 added support for zstd compression to a good number of sinks (including http and aws_s3). I'll close this out, but if there is a sink you'd still like zstd compression support for, please open an issue for that specific sink.


10 participants