Migrate to alternative compression for crates.io crates #2526
Also as another note, I don't think that platform compatibility is relevant at all. The xz2 crate I've got going works on all our platforms and we're not relying on external tools, so we've got the bases covered.
The one place where you might see that sort of bandwidth is on EC2 machines. In my experience, their connection to S3 buckets is pretty unreal. I like the strategy of having both formats and letting the cargo client choose (with a reasonable default) given that the S3 cost is negligible. I'm not worried about our platforms, but for people trying to stand up rust+cargo on some wonky new embedded platform with a strange linux variant, every new package can be a nightmare -- I'm thinking of my experiences with these "custom" ARM linux builds for PandaBoard, etc. that I've had to play with for Servo. Though, admittedly, one could argue that you should always be cross-compiling to target them instead.
Did you look at brotli as well? I admit I don't know if it's usable like this, but it's been specifically designed to be fast at decompressing and to achieve a good ratio.
Nah, unfortunately I only tried out gzip + xz. I figure brotli would be pretty similar to gzip (not drastically smaller, but perhaps faster), but it'd be good to investigate just to make sure!
I read a little more, and brotli has a predefined dictionary that's suitable mostly for HTML and text files. So it might not be ideal for rust crates -- yet rust crates are also mostly text after all. zopfli could also be worth looking at, as it compresses into a gzip-compatible format.
When I was writing swaddle, I took a bit of a look at this.
Or we could support multiple compression formats, including none at all (e.g. for the EC2 case above). This is effectively what the Debian apt repositories do. It's why Ubuntu et al. could migrate from gzip to bzip2 to lzma to xz over several major releases, without having to worry about repacking or whether the necessary more modern compression tooling was available to all users. That said, I'd recommend choosing 2 or 3:
FYI, there's a short writeup comparing a few modern compression implementations here: Brotli seems to be the current leader for balancing compression ratio and decompression speed.
Apparently Dropbox wrote a pure-Rust brotli implementation:
zstd is another potential choice, supporting very high decompression speed at a near-lzma compression ratio at the high levels. It supports custom dictionaries, but they need to be shared by the sender and the receiver, so it'd require a specific client (making it cargo-only), so maybe not ideal. As for pipelining download and decompression, most compression formats already allow that.
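On the pipelining point, here's a minimal sketch of how download and decompression can already overlap with today's format, assuming the `flate2` and `tar` crates (`body` stands in for whatever streaming HTTP response reader Cargo has):

```rust
use std::io::Read;

// Any `Read` that yields bytes as they arrive off the network will do;
// wrapping it in a decoder means decompression and tar extraction proceed
// while the download is still in flight, instead of after it completes.
fn unpack_streaming(body: impl Read, dest: &str) -> std::io::Result<()> {
    let gz = flate2::read::GzDecoder::new(body);
    let mut archive = tar::Archive::new(gz);
    archive.unpack(dest)
}
```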
Brotli's a great compression algorithm, but it works best on textual and HTML-encoded data; it's not ideal for binaries. I think the debate really boils down to:
On a related note, is there any chance that the crate download filenames could include the compression algorithm? Format sniffing is error-prone and potentially risky, and I would greatly prefer an explicit extension.
I've tested using bindgen-0.19.0.crate as an example: zopfli: 5369537

BTW: I wouldn't mind if support for the new format was tied to a particular Rust version (i.e. when publishing I could choose to opt in to the new format). If I were to publish a package that uses language features from Rust 1.n, then I could as well require Cargo 1.n for decompressing it.
I'd really, really, really rather not. Especially as there's currently no way for Cargo to tell that any given crate requires Rust 1.n.
My reasoning is that the package won't work either way, so whether it fails to decompress or fails to build, it's still incompatible. A way for cargo to know the required Rust version could be added.
@pornel: How did you get these numbers? The
Oops, I had a dirty local directory containing massive
For small files, brotli wins thanks to its dictionary. The

EDIT: This brings up a good point: what is the average size of crates on crates.io? With the trend of having many smaller crates, compressing small files looks like a common case.
I recently blogged about this stuff. Though a bit more complicated (especially on the server side), I strongly suspect that delta compression (e.g. BSDiff or Courgette) will give us even bigger wins than using brotli.
Delta updates don't apply to the first download though, which is quite frequent.
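Concretely, a delta-capable client would need decision logic along these lines (a sketch; the `/patch` endpoint is entirely hypothetical, while the full-download URL mirrors today's API shape):

```rust
// A delta only helps when some base version is already cached locally;
// a fresh checkout or CI run always falls back to the full archive.
enum Fetch {
    Full { url: String },
    Delta { base: String, patch_url: String },
}

fn plan_fetch(name: &str, want: &str, cached: &[String]) -> Fetch {
    match cached.last() {
        // First download: nothing to diff against.
        None => Fetch::Full {
            url: format!("/api/v1/crates/{name}/{want}/download"),
        },
        // Otherwise request a patch from the newest cached version
        // (hypothetical endpoint; crates.io has nothing like it today).
        Some(base) => Fetch::Delta {
            base: base.clone(),
            patch_url: format!("/api/v1/crates/{name}/{base}..{want}/patch"),
        },
    }
}
```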
Because I was curious, I checked how different compressors/decompressors worked out on what's currently in my
Things like diff compression might make sense for saving space on the server side, but are not terribly useful on the user end of things, particularly for e.g. CI systems which are frequently set up to download everything from scratch every time. Conclusions:
I'm surprised about the decompression time for zopfli, shouldn't it be close to gzip?
I was also surprised. In my past experience, decompressing from zopfli has usually been, if anything, faster. It's possible the issue is with my benchmarking script, which for convenience used the same executable for compression and decompression; zopfli may just be bad at decompression and I'd have gotten different results with gunzip.
Related is Dropbox's re-implementation of brotli in Rust, under its new name broccoli, and the broccoli protocol they are using around it to get concatenable brotli streams:
@llogiq, @adam-azarchs: when you write "diff compression", would this solve it? @alexcrichton, what do you think?
Brotli is unrelated to diff compression. "Diff compression" refers to computing the diff between one version of something and another and only sending that difference (which could itself be compressed, of course). However, there are several problems with using diff compression for this use case:
Particularly due to that last point, download of complete archives needs to be considered the "normal" case, and needs to be well-supported. Diff compression might be a fun optimization, but can't really have priority over optimizing the common case.
Since a lot of this early discussion, we gained the
I redid the above comparison with somewhat newer tools and a bit more care to insulate things from the rest of what was going on on the system. I omitted zopfli this time because it's a bit awkward to include (it doesn't work on streams, only files on disk, because it needs to make repeated passes).
These results were streaming the content of each crate file (one at a time) from an in-memory buffer, on an Amazon EC2 instance (m5, Intel(R) Xeon(R) Platinum 8259CL CPU). Software versions:
Compression inputs were

I didn't redo the

If switching codecs is on the table, it looks to me like

(edited to clarify that the statistics are computed compressing each crate's content separately, and to add a few more codecs / settings to the party)
It sounds like you are compressing / decompressing your entire cache into one file. How representative is that compared to compressing / decompressing individual crate files?

Is there much of a difference between implementations we'd use within cargo and the standalone programs tested?

Something else for us to consider is how easy it is to index into the "tarball" if we decide to support rustc reading directly from compressed files. This isn't decided yet, and so not a hard requirement, but it is useful to consider.

Again, I know very little about compression: someone brought up LZ4 at the RustConf Unconf. Is that worth testing?
No, I was not. That was compressing/decompressing one crate at a time. Otherwise I would expect
That I can't answer without knowing which implementation you'd use. If you used C / bindgen then I'd expect the performance to be the same (or maybe slightly better since you wouldn't have the process start/stop overhead) but if it's an alternative implementation then who knows?
For that, you'd probably be best off using
Not really. lz4 is very fast for both compression and decompression but doesn't get good compression ratios. It's designed for things like transparently compressing data in an advanced filesystem like zfs or btrfs; it's not really appropriate for longer-term archival. For completeness, however, here is a comparison on a subset of the crate list:
For the curious, I created this comparison using the little go "script" I'm attaching here (go, because it's quick to iterate with, easy to inject parallelism into, and doesn't require any dependencies beyond the standard library for something like this). Disclaimer: this was just something I whipped up quickly to generate these comparisons; I make no warranty as to its robustness or maintainability. It's named
Alternatively, we could checksum before compression as requested in #7212, and then we can support as many different compression algorithms as we want without adding a new field each time.
That might risk decompression bombs (or malware using a decompressor vulnerability), because there is no way to ensure a valid download before decompression.
Yes. We are going to need more than one hash. The index is going to need to provide "the actual artifact you're about to download has this hash". Separately, we will need the ability for lock files to check "this package is the same as last time we built, even if we downloaded it under different compression".
I take it our existing zip bomb mitigations aren't sufficient?
Security vulnerabilities in decompression codecs are not unheard of, so it's nice to be able to verify the integrity of the downloaded artifact before decompressing it. It also makes it easier to implement things like content-addressable caches, since that way they don't actually need to be able to understand the compression format.
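A minimal sketch of that verify-before-decompress ordering, assuming the `sha2` and `hex` crates alongside today's `flate2`:

```rust
use sha2::{Digest, Sha256};
use std::io::Read;

// Check the downloaded bytes against the checksum from the index *before*
// the decompressor ever sees them, so a tampered download can't poke at
// codec bugs or act as a decompression bomb.
fn verify_then_decompress(download: &[u8], expected_hex: &str) -> Result<Vec<u8>, String> {
    let actual = hex::encode(Sha256::digest(download));
    if actual != expected_hex {
        return Err(format!("checksum mismatch: {actual} != {expected_hex}"));
    }
    // Only now decompress (gzip today; other codecs would slot in here).
    let mut out = Vec::new();
    flate2::read::GzDecoder::new(download)
        .read_to_end(&mut out)
        .map_err(|e| e.to_string())?;
    Ok(out)
}
```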
@epage it would at least raise the security level by being able to verify the authenticity of the download before doing anything with it. Of course attackers could still compromise the server, replacing both binaries and checksums, but then we'd have bigger fish to fry anyway.
Checksums are stored in |
The author of lzip went to great lengths to make extremely robust integrity checking, which makes it more suitable for long-term storage than xz. See "xz inadequate". I have been using lzip for quite some time, especially the

Compression size and speed are on par with xz, but outperformed by zstd. My current usage pattern:
For crate sizes, see https://lib.rs/stats#crate-sizes — almost all crates are under 50KB. The majority are under 10KB. The really large ones typically contain binary executables, libraries,
This is true. Looking at my own
So if your point was that, rather than looking at the compression algorithm, there should be more emphasis on reducing unnecessarily included files, I'd mostly agree with you. But only mostly, because of those 6 crates, one was

As for what compression algorithm to use, I think the argument for

A recent golang issue to add zstd support does a decent job of summarizing why it meets the fairly high bar for inclusion:
I'd probably also take the opportunity to rejigger the format a little bit to make it easier to extract
My conclusions from crate sizes are:
I think this favors Brotli as the compressor, which performs especially well on small text files (and still fine on binary files). Other arguments in favor of Brotli:
Theory is nice, but |
I don't see zstd being Pareto-optimal on all three variables, so the conclusion depends on which criteria you pick. If prioritizing only file size, then

For a certain balance of file size & decoding time,

The file sizes involved here differ by multiple orders of magnitude, which makes proper analysis much harder. Datasets with such large differences are likely to show Simpson's paradox (you may get different conclusions depending on whether you compare compression on pairs of files, per bucket of similar file sizes, or just add up the sizes of all files first and then compare).

A single total/average is likely to be strongly affected by outliers: de/compression time increases superlinearly with window size (which is forced to be small on small files), so a handful of the largest files can dominate the timings. A single total also can't disprove that Brotli compresses small files better. Brotli could be better on all files except the largest few, and lose on the summed total because of them.
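One way to avoid those aggregation pitfalls is to report per-bucket ratios instead of one grand total; a sketch (the bucket boundaries are arbitrary):

```rust
// Compare codecs per size bucket so a handful of huge crates can't
// dominate the conclusion the way they do in a single summed total.
fn bucket(original_size: u64) -> usize {
    match original_size {
        0..=9_999 => 0,         // < 10 KB
        10_000..=99_999 => 1,   // < 100 KB
        100_000..=999_999 => 2, // < 1 MB
        _ => 3,                 // the long tail of big crates
    }
}

// `files` holds (original_size, compressed_size) pairs for one codec.
fn per_bucket_ratios(files: &[(u64, u64)]) -> [f64; 4] {
    let (mut orig, mut comp) = ([0u64; 4], [0u64; 4]);
    for &(o, c) in files {
        orig[bucket(o)] += o;
        comp[bucket(o)] += c;
    }
    // .max(1) guards against empty buckets.
    std::array::from_fn(|b| comp[b] as f64 / orig[b].max(1) as f64)
}
```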
I was curious what impact a custom dictionary would have on the zstd numbers. The main reason that brotli is able to compress so well is its dictionary. zstd makes it easy to train a custom dictionary on a collection of files. Then, especially for small crates, zstd appears to achieve better compression:
Note: it is quite plausible that, at least for mimalloc, there is some overfitting happening here. The dictionary is about 100k, so for the others that is less likely. As the inputs get larger, this gap narrows and eventually brotli wins out:
In this case the dictionary version also gives worse results than zstd without a dictionary. That's weird because, at least in gzip, not providing a dictionary is equal to providing a dictionary of all zeros. In any case, the dictionary provides another level of tuning for zstd that appears to be beneficial w.r.t. size for the majority of crates.

Scripts to generate this output from a folder of .crate files are provided below.

Hacky bash scripts:

```bash
for file in *.crate; do
  # Rename the file to .crate.gz
  mv "$file" "$file.gz"
  # Decompress the .gz file, keeping the original
  gzip -d -c "$file.gz" > "${file%.crate}.crate.raw"
done
```

Train a dictionary on all `.crate.raw` files:

```bash
zstd --train -o raw-crates.dict *.crate.raw
```

This runs zstd and brotli on the `.crate.raw` files:

```bash
for file in $(find . -maxdepth 1 -name "*.crate.raw" -exec stat --format='%s %n' {} + | sort -n | cut -d' ' -f2-); do
  echo "==="
  # Original file size
  original_size=$(stat --format='%s' "$file")
  zstd --ultra -22 "$file" -f
  zstd --ultra -22 "$file" -f -D raw-crates.dict
  brotli -Z "$file" -f
  brotli_compressed_file="${file}.br"
  brotli_size=$(stat --format='%s' "$brotli_compressed_file")
  brotli_ratio=$(awk "BEGIN {printf \"%.2f\", $brotli_size / $original_size * 100}")
  echo "$file : $brotli_ratio% ($original_size => $brotli_size bytes, $brotli_compressed_file)"
done
```
For a fair comparison you shouldn't train the dictionary on the same set of crates you're using to measure compression ratio, though it would be fair for there to be some overlap, since presumably the dictionary would be trained on a sample of the most popular crates.

A zstd dictionary is basically just initializing the stream state. It shouldn't make much difference at all for larger files; once you're past the context window, the initial state is irrelevant.
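A sketch of that fairer protocol, assuming the `zstd` crate's dictionary-training API: train on one half of the corpus, measure only on the held-out half.

```rust
// A train/test split keeps the dictionary from simply memorizing the
// files it is later evaluated on.
fn held_out_ratio(samples: &[Vec<u8>]) -> std::io::Result<f64> {
    let (train, test) = samples.split_at(samples.len() / 2);
    // ~110 KB target size, comparable to the dictionary used above.
    let dict = zstd::dict::from_samples(train, 110 * 1024)?;
    let mut compressor = zstd::bulk::Compressor::with_dictionary(19, &dict)?;
    let (mut original, mut compressed) = (0usize, 0usize);
    for sample in test {
        original += sample.len();
        compressed += compressor.compress(sample)?.len();
    }
    Ok(compressed as f64 / original as f64)
}
```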
There's a major downside to using a shared dictionary with

I also think it's a bit misleading to say that "almost all crates are under 50KB. Majority is under 10KB." Yes, this is true. It's also true that a majority haven't been updated in over a year, and any metric for "popularity" is dominated by a very tiny minority. I also think there's a strong argument that the compression ratio for small crates doesn't matter in most practical senses.

The only case where all crates should be considered equally is if you care about the storage costs for hosting the crates.io registry, in which case you should probably round everything up to a floor of 128 KB, which is the minimum billable size for an S3 object. I don't think anyone is arguing that this is what should be prioritized, especially given that, for a good long while after such an update was made, crates.io would likely need to host crates in both the gzip format and whatever was settled on to replace it (either that, or convert on the fly for older clients, but that costs compute and has to be done very carefully to ensure that checksums remain stable).

If you really want to cut storage costs on crates.io then

If you care about the space taken up by

If you care about network bandwidth (either for developers or egress charges for crates.io's hosting) then you should weight by recent downloads (and, if you want to be a pedant, also consider the fixed per-download overhead, which will dominate for the smallest crates). The top recently-downloaded crates on crates.io are
If you care about impact on build times, then you should again weight by recent downloads and consider the compressed size (assuming some download speed, and again probably assuming some fixed per-crate overhead) and decompression speed. |
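For illustration, the weighting being suggested is just a download-weighted sum (a sketch; the inputs would come from crates.io's download stats):

```rust
// Bandwidth saved by a codec switch, weighted by how often each version
// is actually downloaded: sum over versions of downloads * bytes saved.
fn weighted_bandwidth_saved(versions: &[(u64, u64, u64)]) -> u64 {
    versions
        .iter()
        // (recent_downloads, gzip_size, candidate_codec_size)
        .map(|&(downloads, gz, new)| downloads * gz.saturating_sub(new))
        .sum()
}
```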
Good point about the weighting. I've added `tarball_size * downloads` to the stats: https://lib.rs/stats#crate-bandwidth
Nice! That's very interesting data! From there, we can see that, for example, the majority of crates have 1-10MB worth of downloads per month, totaling O(100GB) for that bucket. I wouldn't be surprised if a lot of those are crates which basically no one but their author uses. Meanwhile there are 482 crates which total at least that much each, including 8 which download > 10TB each (which is pretty impressive, really). The list of crates in that bucket isn't terribly surprising given what I've noted above.

Breaking it down by bandwidth used per bucket (very approximately, given I don't know the distribution of sizes within each bucket, but assuming they follow the same trend we can probably assume the average is about 2x the minimum),
Put another way, there are only around 25 crates that are responsible for around half of the total bandwidth on crates.io. Impressive! (Though not necessarily in a good way. I hope those crates have at least put some effort into trimming out anything unnecessary, but I'd be willing to bet that at least some of them have unnecessary content.)

Of course, a lot of downloads are in CI (e.g. GitHub Actions, if the user hasn't set up caching properly), where bandwidth is plentiful and latency isn't so critical. So if I had time to work up another representative benchmark (I don't, at least not this week), I'd probably be looking at
Here's bandwidth from recent downloads & size of the latest release:
Just wanted to throw my chip in, for |
Why?
tl;dr: unless you have a 52MB/s internet connection, xz is probably faster, but my math should be checked!
The statistics here are operating over the entirety of crates.io at this time, including all crates and all versions ever published. Currently we use `flate2` (which is backed by miniz) using the best compression for tarballs, corresponding to level 9. The "zlib" numbers here are generated from compiling `flate2` against zlib instead of miniz. The xz numbers are generated with the `xz2` crate, also using compression level 9.

First up, let's take a look at what we're just storing on S3. This is just the size of all published crates.
Next, let's multiply each version's size by how many times it's been downloaded. This is, in theory, the number of bytes that have been transferred out of S3.
Next up is how long it took (in nanoseconds) in total to decompress all crates on my local computer.
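(For reference, a minimal sketch of the two code paths being compared, with `flate2` and `xz2` both at level 9 as described above; the measurement harness itself is omitted.)

```rust
use std::io::Write;

// gzip via flate2 (miniz or zlib backend chosen at build time), level 9.
fn gzip_level9(input: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut enc = flate2::write::GzEncoder::new(Vec::new(), flate2::Compression::new(9));
    enc.write_all(input)?;
    enc.finish()
}

// xz via the xz2 crate, also at level 9.
fn xz_level9(input: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut enc = xz2::write::XzEncoder::new(Vec::new(), 9);
    enc.write_all(input)?;
    enc.finish()
}
```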
Ok, so the claims of xz are correct in that it's about 30% smaller than gzip, but the decompression time is much larger! If we assume that these numbers hold on average, however, let's do some math to figure out how fast your bandwidth needs to be to break even.
First up we've got:
So if we assume that xz crates are on average 36.16% smaller and use the timings we found above for decompressing, we have:
Now that's 0.05258 bytes per nanosecond, which translates to 52,580,000 bytes per second, or 52.58 MB per second.
So... if my math is right, xz is faster for download + decompression unless you have a 52MB/s uplink to crates.io. I kinda doubt anyone really has that kind of bandwidth to our bucket, so that should mean that on average the smaller download size of xz-compressed crates will more than make up for the longer decompression times.
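As a sanity check on the unit conversion (only the 0.05258 bytes-per-nanosecond figure comes from the measurements above):

```rust
fn main() {
    // Bytes of download saved by xz per extra nanosecond spent
    // decompressing, from the measurements above.
    let saved_bytes_per_extra_ns = 0.05258_f64;
    // bytes/ns * 1e9 ns/s = bytes/s; divide by 1e6 for MB/s.
    let break_even_mb_per_s = saved_bytes_per_extra_ns * 1e9 / 1e6;
    println!("break-even bandwidth: {break_even_mb_per_s:.2} MB/s"); // 52.58
    // Any connection slower than this favors xz: the download-time
    // savings outweigh the extra decompression time.
}
```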
Note that I didn't bother getting statistics about compression times. xz is massively slower than gzip, but I don't think it matters much for most, as `cargo publish` is super rare.

How?
Ok, so I haven't actually thought much about this. I was basically curious recently and just wanted to make sure that these numbers and statistics didn't disappear. For backwards compatibility we need to continue to publish gzip crates for quite a while (maybe forever). Our S3 bill can take the hit though, that's fine. This means that `cargo publish` will create both tarballs and upload them.

The index would grow another field to store the sha256 hash of the xz crate, and new publishes would fill that in.
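A sketch of what that extended index entry might deserialize into; `cksum` is the field that exists in today's index format, while `cksum_xz` is the hypothetical addition (other existing fields omitted):

```rust
#[derive(serde::Deserialize)]
struct IndexEntry {
    name: String,
    vers: String,
    // SHA-256 of the gzip tarball (the existing field).
    cksum: String,
    // SHA-256 of the xz tarball; absent for versions published before
    // the change, filled in by new publishes. Hypothetical field name.
    #[serde(default)]
    cksum_xz: Option<String>,
}
```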
Next, when downloading a crate Cargo could start saying "I'm xz compatible" at which point crates.io would redirect to the xz url (currently it redirects to a gz url) for any crate which has an xz-compressed tarball available (crates.io would grow a flag for this for all published versions).
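And a sketch of that server-side choice during download (the capability signal and the flag are illustrative):

```rust
// crates.io picks which tarball URL to redirect to, based on what the
// client said it can decode and which formats exist for this version.
fn redirect_suffix(client_supports_xz: bool, version_has_xz: bool) -> &'static str {
    if client_supports_xz && version_has_xz {
        ".crate.xz" // smaller download for new Cargos
    } else {
        ".crate" // the gzip tarball every existing Cargo understands
    }
}
```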
I... think that would cover our bases? Certainly a lot of work so I probably won't be able to get around to this any time soon, but wanted to put this out there in case others were interested :)