Compress in-memory slices with Zstd #2268
Conversation
Out of curiosity, trying to understand the entropy: can you zstd-compress the index (on the filesystem) as well and report on the sizes?
667M to 229M when doing it per-file.
So down to ~34% of the original size. |
I guess it would be an easy win if this could be made configurable. There seem to be people out there using a compressed file system with VAST, and for those scenarios the additional compression might become a performance degradation, as there would be two layers attempting to compress the data. But still, an impressive win for such a small patch 👍
Very nice! I would like to measure the export performance and RSS as well.
Probably not even necessary, at least btrfs opts out of compression very quickly if it doesn't manage to compress the first few kB itself. This should also be much more effective than compressing individual filesystem blocks.
That is promising, but we should probably do it on a per-value-index basis instead of the whole file.
Yes, I think so. Even with Parquet stores coming up this is still valuable to reduce the memory usage.
Will do and report back!
I think that's for a separate story, let's keep this focused on the table slices only for now.
Definitely.
No noticeable difference for Zstd with default compression. Unpacking is probably slightly more expensive, but sending things over the wire is much cheaper. The higher compression level is especially bad when selecting subsets of table slices, because with the current approach that requires re-compressing the results before sending them over. But from this point of view alone, I don't see why we wouldn't just always enable Zstd with default compression.
For the Zstd default during export:
For the uncompressed version:
In a repeated test with 8'557'668 ingested Suricata events, enabling the compression had a negligible effect on import speed (under 0.5%), but reduced the size of the archive and also the in-memory size of the loaded slices by roughly the same amount. Here are the sizes of the database for three scenarios: uncompressed, Zstd with the default compression level, and Zstd with the max compression level (50% slower on import, so irrelevant in the grand scheme of things).
```
❯ du -csh vast.db.uncompressed/{index,archive}
667M    vast.db.uncompressed/index
1.8G    vast.db.uncompressed/archive
2.5G    total
❯ du -csh vast.db.zstd.default/{index,archive}
668M    vast.db.zstd.default/index
426M    vast.db.zstd.default/archive
1.1G    total
❯ du -csh vast.db.zstd.max/{index,archive}
675M    vast.db.zstd.max/index
361M    vast.db.zstd.max/archive
1.0G    total
```
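For readers less familiar with the Zstd API, here is a minimal sketch of the compress/decompress round trip at the default level. This is only an illustration of the general technique using the plain libzstd C API, not the code in this PR (which applies it to the table slice FlatBuffers); the helper names `compress_buffer` and `decompress_buffer` are made up for the example.

```cpp
// Sketch only: round-trip an in-memory buffer through Zstd at the default level.
#include <zstd.h>

#include <stdexcept>
#include <vector>

// Compress with the default level (ZSTD_CLEVEL_DEFAULT, currently 3).
std::vector<char> compress_buffer(const std::vector<char>& input) {
  std::vector<char> output(ZSTD_compressBound(input.size()));
  const size_t n = ZSTD_compress(output.data(), output.size(), input.data(),
                                 input.size(), ZSTD_CLEVEL_DEFAULT);
  if (ZSTD_isError(n))
    throw std::runtime_error(ZSTD_getErrorName(n));
  output.resize(n);
  return output;
}

// Decompress; the original size is read from the Zstd frame header.
std::vector<char> decompress_buffer(const std::vector<char>& input) {
  const auto size = ZSTD_getFrameContentSize(input.data(), input.size());
  if (size == ZSTD_CONTENTSIZE_ERROR || size == ZSTD_CONTENTSIZE_UNKNOWN)
    throw std::runtime_error("cannot determine decompressed size");
  std::vector<char> output(size);
  const size_t n = ZSTD_decompress(output.data(), output.size(), input.data(),
                                   input.size());
  if (ZSTD_isError(n))
    throw std::runtime_error(ZSTD_getErrorName(n));
  output.resize(n);
  return output;
}
```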
Now that we have Zstd-compression for table slices we can (1) increase the partition size 4-fold without running into size problems with the FlatBuffers builders, and (2) adjust the default table slice size upwards, which is a huge boost to overall efficiency.
I've extended this PR with the tuned defaults following my extensive measurements on our testbed server.
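Back-of-the-envelope from the numbers above, to connect the compression ratio with the 4x figure (my own reading, not spelled out in the PR): the archive shrank from 1.8G to 426M, i.e. 426M / 1.8G ≈ 0.23, so compressed slices take roughly a quarter of their original size. Holding four times as many events per partition therefore yields a FlatBuffers buffer of about the same byte size as before, staying clear of the builders' size limits.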
I think we can make these values accessible to those who haven't burnt the powers of two into their heads.
This has been extensively tested and co-reviewed by @dispanser.
In a repeated test with 8'557'668 ingested Suricata events, enabling the compression had a negligible effect on import speed (under 0.5%), but reduced the size of the archive and also the in-memory size of the loaded slices by roughly the same amount.
Here are the sizes of the database for three scenarios: Uncompressed, Zstd with the default compression level, and Zstd with the max compression level (50% slower on import, so irrelevant in the grand scheme of things).
📝 Checklist
🎯 Review Instructions
Discuss whether we want to enable this by default, or make this an option, or just not do it at all. After that comes testing and a changelog entry.