Migrate to alternative compression for crates.io crates #2526
Also as another note, I don't think that platform compatibility is relevant at all. The xz2 crate I've got going works on all our platforms and we're not relying on external tools, so we've got the bases covered.
The one place where you might see that sort of bandwidth is on EC2 machines. In my experience, their connection to S3 buckets is pretty unreal. I like the strategy of having both formats and letting the cargo client choose (with a reasonable default) given that the S3 cost is negligible. I'm not worried about our platforms, but for people trying to stand up rust+cargo on some wonky new embedded platform with a strange linux variant, every new package can be a nightmare -- I'm thinking of my experiences with these "custom" ARM linux builds for PandaBoard, etc. that I've had to play with for Servo. Though, admittedly, one could argue that you should always be cross-compiling to target them instead.
Did you look at brotli as well? I admit I don't know if it's usable like this, but it's been specifically designed to be fast at decompressing and to achieve a good ratio.
Nah, unfortunately I only tried out gzip + xz. I figure brotli would be pretty similar to gzip (not drastically smaller, but perhaps faster), but it'd be good to investigate just to make sure!
I read a little more, and brotli has a predefined dictionary that's suitable mostly for HTML and text files. So it might not be ideal for rust crates -- yet rust crates are also mostly text after all. zopfli could also be worth looking at, as it compresses into a gzip-compatible format.
When I was writing swaddle, I took a bit of a look at this.
Or we could support multiple compression formats, including none at all (e.g. for the EC2 case above). This is effectively what the Debian apt repositories do. It's why Ubuntu et al. could migrate from gzip to bzip2 to lzma to xz over several major releases, without having to worry about repacking or whether the necessary more modern compression tooling was available to all users. That said, I'd recommend choosing 2 or 3:
FYI, there's a short writeup comparing a few modern compression implementations here: Brotli seems to be the current leader for balancing compression ratio and decompression speed.
Apparently Dropbox wrote a pure-Rust brotli implementation:
zstd is another potential choice, supporting very high decompression speed at a near-lzma compression ratio at the high levels. It supports custom dictionaries, but they need to be shared by the sender and the receiver, so it'd require a specific client (making it cargo-only), so maybe not ideal. As for pipelining download and decompression, most compression formats already allow that.
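On the pipelining point, here's a minimal sketch of how download and decompression can already overlap with today's format, assuming the `flate2` and `tar` crates (`body` stands in for whatever streaming HTTP response reader Cargo has):

```rust
use std::io::Read;

// Any `Read` that yields bytes as they arrive off the network will do;
// wrapping it in a decoder means decompression and tar extraction proceed
// while the download is still in flight, instead of after it completes.
fn unpack_streaming(body: impl Read, dest: &str) -> std::io::Result<()> {
    let gz = flate2::read::GzDecoder::new(body);
    let mut archive = tar::Archive::new(gz);
    archive.unpack(dest)
}
```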
Brotli's a great compression algorithm, but it works best on textual and HTML-encoded data; it's not ideal for binaries. I think the debate really boils down to:
On a related note, is there any chance that the crate download filenames could include the compression algorithm? Format sniffing is error-prone and potentially risky, and I would greatly prefer an explicit extension.
I've tested using bindgen-0.19.0.crate as an example: zopfli: 5369537

BTW: I wouldn't mind if support for the new format was tied to a particular Rust version (i.e. when publishing I could choose to opt in to the new format). If I were to publish a package that uses language features from Rust 1.n, then I could as well require Cargo 1.n for decompressing it.
I'd really, really, really rather not. Especially as there's currently no way for Cargo to tell that any given crate requires Rust 1.n.
My reasoning is that the package won't work either way, so whether it fails to decompress or fails to build, it's still incompatible. A way for cargo to know the required Rust version could be added.
@pornel: How did you get these numbers? The
Oops, I had a dirty local directory containing massive
For small files, brotli wins thanks to its dictionary. The

EDIT: This brings up a good point: what is the average size of crates on crates.io? With the trend of having many smaller crates, compressing small files looks like a common case.
I recently blogged about this stuff. Though a bit more complicated (especially on the server side), I strongly suspect that delta compression (e.g. BSDiff or Courgette) will give us even bigger wins than using brotli.
Delta updates don't apply to the first download though, which is quite frequent.
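Concretely, a delta-capable client would need decision logic along these lines (a sketch; the `/patch` endpoint is entirely hypothetical, while the full-download URL mirrors today's API shape):

```rust
// A delta only helps when some base version is already cached locally;
// a fresh checkout or CI run always falls back to the full archive.
enum Fetch {
    Full { url: String },
    Delta { base: String, patch_url: String },
}

fn plan_fetch(name: &str, want: &str, cached: &[String]) -> Fetch {
    match cached.last() {
        // First download: nothing to diff against.
        None => Fetch::Full {
            url: format!("/api/v1/crates/{name}/{want}/download"),
        },
        // Otherwise request a patch from the newest cached version
        // (hypothetical endpoint; crates.io has nothing like it today).
        Some(base) => Fetch::Delta {
            base: base.clone(),
            patch_url: format!("/api/v1/crates/{name}/{base}..{want}/patch"),
        },
    }
}
```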
Because I was curious, I checked how different compressors/decompressors worked out on what's currently in my
Things like diff compression might make sense for saving space on the server side, but are not terribly useful on the user end of things, particularly for e.g. CI systems which are frequently set up to download everything from scratch every time. Conclusions:
I'm surprised about the decompression time for zopfli, shouldn't it be close to gzip?
I was also surprised. In my past experience, decompressing from zopfli has usually been, if anything, faster. It's possible the issue is with my benchmarking script, which for convenience used the same executable for compression and decompression; zopfli may just be bad at decompression and I'd have gotten different results with gunzip.
Related is Dropbox's re-implementation of brotli in Rust, under its new name broccoli, and the broccoli protocol they are using around it to get concatenable brotli streams:
@llogiq, @adam-azarchs: when you write "diff compression", would this solve it? @alexcrichton, what do you think?
Brotli is unrelated to diff compression. "Diff compression" refers to computing the diff between one version of something and another and only sending that difference (which could itself be compressed, of course). However, there are several problems with using diff compression for this use case:
Particularly due to that last point, download of complete archives needs to be considered the "normal" case, and needs to be well-supported. Diff compression might be a fun optimization, but can't really have priority over optimizing the common case.
Since a lot of this early discussion, we gained the
I redid the above comparison with somewhat newer tools and a bit more care to insulate things from the rest of what was going on on the system. I omitted zopfli this time because it's a bit awkward to include (it doesn't work on streams, only files on disk, because it needs to make repeated passes).
These results were streaming the content of each crate file (one at a time) from an in-memory buffer, on an Amazon EC2 instance (m5, Intel(R) Xeon(R) Platinum 8259CL CPU). Software versions:
Compression inputs were

I didn't redo the

If switching codecs is on the table, it looks to me like

(edited to clarify that the statistics are computed compressing each crate's content separately, and to add a few more codecs / settings to the party)
It sounds like you are compressing / decompressing your entire cache into one file. How representative is that compared to compressing / decompressing individual crate files?

Is there much of a difference between implementations we'd use within cargo and the standalone programs tested?

Something else for us to consider is how easy it is to index into the "tarball" if we decide to support rustc reading directly from compressed files. This isn't decided yet, and so not a hard requirement, but it is useful to consider.

Again, I know very little about compression: someone brought up LZ4 at the RustConf Unconf. Is that worth testing?
No, I was not. That was compressing/decompressing one crate at a time. Otherwise I would expect
That I can't answer without knowing which implementation you'd use. If you used C / bindgen then I'd expect the performance to be the same (or maybe slightly better since you wouldn't have the process start/stop overhead) but if it's an alternative implementation then who knows?
For that, you'd probably be best off using
Not really. lz4 is very fast for both compression and decompression but doesn't get good compression ratios. It's designed for things like transparently compressing data in an advanced filesystem like zfs or btrfs; it's not really appropriate for longer-term archival. For completeness, however, here is a comparison on a subset of the crate list:
For the curious, I created this comparison using the little go "script" I'm attaching here (go, because it's quick to iterate with, easy to inject parallelism into, and doesn't require any dependencies beyond the standard library for something like this). Disclaimer: this was just something I whipped up quickly to generate these comparisons; I make no warranty as to its robustness or maintainability. It's named
Alternatively, we could checksum before compression as requested in #7212, and then we can support as many different compression algorithms as we want without adding a new field each time.
That might risk decompression bombs (or malware using a decompressor vulnerability), because there is no way to ensure a valid download before decompression.
Yes. We are going to need more than one hash. The index is going to need to provide "the actual artifact you're about to download has this hash". Separately, we will need the ability for lock files to check "this package is the same as last time we built, even if we downloaded it under different compression".
I take it our existing zip bomb mitigations aren't sufficient?
Security vulnerabilities in decompression codecs are not unheard of, so it's nice to be able to verify the integrity of the downloaded artifact before decompressing it. It also makes it easier to implement things like content-addressable caches, since that way they don't actually need to be able to understand the compression format.
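A minimal sketch of that verify-before-decompress ordering, assuming the `sha2` and `hex` crates alongside today's `flate2`:

```rust
use sha2::{Digest, Sha256};
use std::io::Read;

// Check the downloaded bytes against the checksum from the index *before*
// the decompressor ever sees them, so a tampered download can't poke at
// codec bugs or act as a decompression bomb.
fn verify_then_decompress(download: &[u8], expected_hex: &str) -> Result<Vec<u8>, String> {
    let actual = hex::encode(Sha256::digest(download));
    if actual != expected_hex {
        return Err(format!("checksum mismatch: {actual} != {expected_hex}"));
    }
    // Only now decompress (gzip today; other codecs would slot in here).
    let mut out = Vec::new();
    flate2::read::GzDecoder::new(download)
        .read_to_end(&mut out)
        .map_err(|e| e.to_string())?;
    Ok(out)
}
```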
@epage it would at least raise the security level by being able to verify the authenticity of the download before doing anything with it. Of course attackers could still compromise the server, replacing both binaries and checksums, but then we'd have bigger fish to fry anyway.
Checksums are stored in |
The author of lzip went to great lengths to make extremely robust integrity checking, which makes it more suitable for long-term storage than xz. See "xz inadequate". I have been using lzip for quite some time, especially the

Compression size and speed are on par with xz, but outperformed by zstd. My current usage pattern:
For crate sizes, see https://lib.rs/stats#crate-sizes — almost all crates are under 50KB. The majority are under 10KB. The really large ones typically contain binary executables, libraries,
This is true. Looking at my own
So if your point was that, rather than looking at the compression algorithm, there should be more emphasis on reducing unnecessarily included files, I'd mostly agree with you. But only mostly, because of those 6 crates, one was

As for what compression algorithm to use, I think the argument for

A recent golang issue to add zstd support does a decent job of summarizing why it meets the fairly high bar for inclusion:
I'd probably also take the opportunity to rejigger the format a little bit to make it easier to extract
My conclusions from crate sizes are:
I think this favors Brotli as the compressor, which performs especially well on small text files (and still fine on binary files). Other arguments in favor of Brotli:
Theory is nice, but |
I don't see zstd being Pareto-optimal on all three variables, so the conclusion depends on which criteria you pick. If prioritizing only file size, then

For a certain balance of file size & decoding time,

The file sizes involved here differ by multiple orders of magnitude, which makes proper analysis much harder. Datasets with such large differences are likely to show Simpson's paradox (you may get different conclusions depending on whether you compare compression on pairs of files, per bucket of similar file sizes, or just add up the sizes of all files first and then compare).

A single total/average is likely to be strongly affected by outliers: de/compression time increases superlinearly with window size (which is forced to be small on small files), so a handful of the largest files can dominate the timings. A single total also can't disprove that Brotli compresses small files better. Brotli could be better on all files except the largest few, and lose on the summed total because of them.
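One way to avoid those aggregation pitfalls is to report per-bucket ratios instead of one grand total; a sketch (the bucket boundaries are arbitrary):

```rust
// Compare codecs per size bucket so a handful of huge crates can't
// dominate the conclusion the way they do in a single summed total.
fn bucket(original_size: u64) -> usize {
    match original_size {
        0..=9_999 => 0,         // < 10 KB
        10_000..=99_999 => 1,   // < 100 KB
        100_000..=999_999 => 2, // < 1 MB
        _ => 3,                 // the long tail of big crates
    }
}

// `files` holds (original_size, compressed_size) pairs for one codec.
fn per_bucket_ratios(files: &[(u64, u64)]) -> [f64; 4] {
    let (mut orig, mut comp) = ([0u64; 4], [0u64; 4]);
    for &(o, c) in files {
        orig[bucket(o)] += o;
        comp[bucket(o)] += c;
    }
    // .max(1) guards against empty buckets.
    std::array::from_fn(|b| comp[b] as f64 / orig[b].max(1) as f64)
}
```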
I was curious what impact a custom dictionary would have on the zstd numbers. The main reason that brotli is able to compress so well is its dictionary. zstd makes it easy to train a custom dictionary on a collection of files. Then, especially for small crates, zstd appears to achieve better compression:
Note: it is quite plausible that, at least for mimalloc, there is some overfitting happening here. The dictionary is about 100k, so for the others that is less likely. As the inputs get larger, this gap narrows and eventually brotli wins out:
In this case the dictionary version also gives worse results than zstd without a dictionary. That's weird because, at least in gzip, not providing a dictionary is equal to providing a dictionary of all zeros. In any case, the dictionary provides another level of tuning for zstd that appears to be beneficial w.r.t. size for the majority of crates.

Scripts to generate this output from a folder of .crate files are provided below.

Hacky bash scripts:

```bash
for file in *.crate; do
  # Rename the file to .crate.gz
  mv "$file" "$file.gz"
  # Decompress the .gz file, keeping the original
  gzip -d -c "$file.gz" > "${file%.crate}.crate.raw"
done
```

Train a dictionary on all `.crate.raw` files:

```bash
zstd --train -o raw-crates.dict *.crate.raw
```

This runs zstd and brotli on the `.crate.raw` files:

```bash
for file in $(find . -maxdepth 1 -name "*.crate.raw" -exec stat --format='%s %n' {} + | sort -n | cut -d' ' -f2-); do
  echo "==="
  # Original file size
  original_size=$(stat --format='%s' "$file")
  zstd --ultra -22 "$file" -f
  zstd --ultra -22 "$file" -f -D raw-crates.dict
  brotli -Z "$file" -f
  brotli_compressed_file="${file}.br"
  brotli_size=$(stat --format='%s' "$brotli_compressed_file")
  brotli_ratio=$(awk "BEGIN {printf \"%.2f\", $brotli_size / $original_size * 100}")
  echo "$file : $brotli_ratio% ($original_size => $brotli_size bytes, $brotli_compressed_file)"
done
```
For a fair comparison you shouldn't train the dictionary on the same set of crates you're using to measure compression ratio, though it would be fair for there to be some overlap, since presumably the dictionary would be trained on a sample of the most popular crates.

A zstd dictionary is basically just initializing the stream state. It shouldn't make much difference at all for larger files; once you're past the context window, the initial state is irrelevant.
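A sketch of that fairer protocol, assuming the `zstd` crate's dictionary-training API: train on one half of the corpus, measure only on the held-out half.

```rust
// A train/test split keeps the dictionary from simply memorizing the
// files it is later evaluated on.
fn held_out_ratio(samples: &[Vec<u8>]) -> std::io::Result<f64> {
    let (train, test) = samples.split_at(samples.len() / 2);
    // ~110 KB target size, comparable to the dictionary used above.
    let dict = zstd::dict::from_samples(train, 110 * 1024)?;
    let mut compressor = zstd::bulk::Compressor::with_dictionary(19, &dict)?;
    let (mut original, mut compressed) = (0usize, 0usize);
    for sample in test {
        original += sample.len();
        compressed += compressor.compress(sample)?.len();
    }
    Ok(compressed as f64 / original as f64)
}
```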
There's a major downside to using a shared dictionary with

I also think it's a bit misleading to say that "almost all crates are under 50KB. Majority is under 10KB." Yes, this is true. It's also true that a majority haven't been updated in over a year, and any metric for "popularity" is dominated by a very tiny minority. I also think there's a strong argument that the compression ratio for small crates doesn't matter in most practical senses.

The only case where all crates should be considered equally is if you care about the storage costs for hosting the crates.io registry, in which case you should probably round everything up to a floor of 128 KB, which is the minimum billable size for an S3 object. I don't think anyone is arguing that this is what should be prioritized, especially given that, for a good long while after such an update was made, crates.io would likely need to host crates in both the gzip format and whatever was settled on to replace it (either that, or convert on the fly for older clients, but that costs compute and has to be done very carefully to ensure that checksums remain stable).

If you really want to cut storage costs on crates.io then

If you care about the space taken up by

If you care about network bandwidth (either for developers or egress charges for crates.io's hosting) then you should weight by recent downloads (and, if you want to be a pedant, also consider the fixed per-download overhead, which will dominate for the smallest crates). The top recently-downloaded crates on crates.io are
If you care about impact on build times, then you should again weight by recent downloads and consider the compressed size (assuming some download speed, and again probably assuming some fixed per-crate overhead) and decompression speed. |
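For illustration, the weighting being suggested is just a download-weighted sum (a sketch; the inputs would come from crates.io's download stats):

```rust
// Bandwidth saved by a codec switch, weighted by how often each version
// is actually downloaded: sum over versions of downloads * bytes saved.
fn weighted_bandwidth_saved(versions: &[(u64, u64, u64)]) -> u64 {
    versions
        .iter()
        // (recent_downloads, gzip_size, candidate_codec_size)
        .map(|&(downloads, gz, new)| downloads * gz.saturating_sub(new))
        .sum()
}
```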
Good point about the weighting. I've added `tarball_size * downloads` to the stats: https://lib.rs/stats#crate-bandwidth
Nice! That's very interesting data! From there, we can see that, for example, the majority of crates have 1-10MB worth of downloads per month, totaling O(100GB) for that bucket. I wouldn't be surprised if a lot of those are crates which basically no one but their author uses. Meanwhile there are 482 crates which total at least that much each, including 8 which download > 10TB each (which is pretty impressive, really). The list of crates in that bucket isn't terribly surprising given what I've noted above.

Breaking it down by bandwidth used per bucket (very approximately, given I don't know the distribution of sizes within each bucket, but assuming they follow the same trend we can probably assume the average is about 2x the minimum),
Put another way, there are only around 25 crates that are responsible for around half of the total bandwidth on crates.io. Impressive! (Though not necessarily in a good way. I hope those crates have at least put some effort into trimming out anything unnecessary, but I'd be willing to bet that at least some of them have unnecessary content.)

Of course, a lot of downloads are in CI (e.g. GitHub Actions, if the user hasn't set up caching properly), where bandwidth is plentiful and latency isn't so critical. So if I had time to work up another representative benchmark (I don't, at least not this week), I'd probably be looking at
Here's bandwidth from recent downloads & size of the latest release:
Just wanted to throw my chip in, for |
Why?
tl;dr: unless you have a 52MB/s internet connection, xz is probably faster, but my math should be checked!
The statistics here are operating over the entirety of crates.io at this time, including all crates and all versions ever published. Currently we use `flate2` (which is backed by miniz) using the best compression for tarballs, corresponding to level 9. The "zlib" numbers here are generated from compiling `flate2` against zlib instead of miniz. The xz numbers are generated with the `xz2` crate, also using compression level 9.

First up, let's take a look at what we're just storing on S3. This is just the size of all published crates.
Next, let's multiply each version's size by how many times it's been downloaded. This is, in theory, the number of bytes that have been transferred out of S3.
Next up is how long it took (in nanoseconds) in total to decompress all crates on my local computer.
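(For reference, a minimal sketch of the two code paths being compared, with `flate2` and `xz2` both at level 9 as described above; the measurement harness itself is omitted.)

```rust
use std::io::Write;

// gzip via flate2 (miniz or zlib backend chosen at build time), level 9.
fn gzip_level9(input: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut enc = flate2::write::GzEncoder::new(Vec::new(), flate2::Compression::new(9));
    enc.write_all(input)?;
    enc.finish()
}

// xz via the xz2 crate, also at level 9.
fn xz_level9(input: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut enc = xz2::write::XzEncoder::new(Vec::new(), 9);
    enc.write_all(input)?;
    enc.finish()
}
```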
Ok, so the claims of xz are correct in that it's about 30% smaller than gzip, but the decompression time is much larger! If we assume that these numbers hold on average, however, let's do some math to figure out how fast your bandwidth needs to be to break even.
First up we've got:
So if we assume that xz crates are on average 36.16% smaller and use the timings we found above for decompressing, we have:
Now that's 0.05258 bytes per nanosecond, which translates to 52,580,000 bytes per second, or 52.58 MB per second.
So... if my math is right, xz is faster for download + decompression unless you have a 52MB/s uplink to crates.io. I kinda doubt anyone really has that kind of bandwidth to our bucket, so that should mean that on average the smaller download size of xz-compressed crates will more than make up for the longer decompression times.
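As a sanity check on the unit conversion (only the 0.05258 bytes-per-nanosecond figure comes from the measurements above):

```rust
fn main() {
    // Bytes of download saved by xz per extra nanosecond spent
    // decompressing, from the measurements above.
    let saved_bytes_per_extra_ns = 0.05258_f64;
    // bytes/ns * 1e9 ns/s = bytes/s; divide by 1e6 for MB/s.
    let break_even_mb_per_s = saved_bytes_per_extra_ns * 1e9 / 1e6;
    println!("break-even bandwidth: {break_even_mb_per_s:.2} MB/s"); // 52.58
    // Any connection slower than this favors xz: the download-time
    // savings outweigh the extra decompression time.
}
```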
Note that I didn't bother getting statistics about compression times. xz is massively slower than gzip, but I don't think it matters much for most, as `cargo publish` is super rare.

How?
Ok, so I haven't actually thought much about this. I was basically curious recently and just wanted to make sure that these numbers and statistics didn't disappear. For backwards compatibility we need to continue to publish gzip crates for quite a while (maybe forever). Our S3 bill can take the hit though, that's fine. This means that `cargo publish` will create both tarballs and upload them.

The index would grow another field to store the sha256 hash of the xz crate, and new publishes would fill that in.
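A sketch of what that extended index entry might deserialize into; `cksum` is the field that exists in today's index format, while `cksum_xz` is the hypothetical addition (other existing fields omitted):

```rust
#[derive(serde::Deserialize)]
struct IndexEntry {
    name: String,
    vers: String,
    // SHA-256 of the gzip tarball (the existing field).
    cksum: String,
    // SHA-256 of the xz tarball; absent for versions published before
    // the change, filled in by new publishes. Hypothetical field name.
    #[serde(default)]
    cksum_xz: Option<String>,
}
```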
Next, when downloading a crate Cargo could start saying "I'm xz compatible" at which point crates.io would redirect to the xz url (currently it redirects to a gz url) for any crate which has an xz-compressed tarball available (crates.io would grow a flag for this for all published versions).
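And a sketch of that server-side choice during download (the capability signal and the flag are illustrative):

```rust
// crates.io picks which tarball URL to redirect to, based on what the
// client said it can decode and which formats exist for this version.
fn redirect_suffix(client_supports_xz: bool, version_has_xz: bool) -> &'static str {
    if client_supports_xz && version_has_xz {
        ".crate.xz" // smaller download for new Cargos
    } else {
        ".crate" // the gzip tarball every existing Cargo understands
    }
}
```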
I... think that would cover our bases? Certainly a lot of work so I probably won't be able to get around to this any time soon, but wanted to put this out there in case others were interested :)