Tarballs we deliver should be compressed better #21724
Comments
Apparently our s3 distribution includes docs. I downloaded the tarball we ship and recompressed it for a fairer comparison: …
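For reference, a recompression pass like the one described can be sketched in a few lines of Python. This is a minimal sketch, not the actual tooling used: the file names are hypothetical, and the stdlib lzma module stands in for the xz command-line tool.

```python
import gzip
import lzma
import shutil

def recompress_gz_to_xz(gz_path, xz_path, preset=6):
    """Decode a .gz stream and re-encode it with xz (LZMA2).

    preset=6 matches xz's default level; preset=9 trades decompression
    memory for a smaller archive, as discussed in this thread.
    """
    with gzip.open(gz_path, "rb") as src, \
         lzma.open(xz_path, "wb", preset=preset) as dst:
        shutil.copyfileobj(src, dst)
```

Streaming from one codec to the other avoids ever keeping the decompressed tree on disk, which is convenient when only the archive sizes are being compared.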
nagisa changed the title from "Tarballs we deliver are compressed terribly" to "Tarballs we deliver should be compressed better" on Jan 28, 2015
steveklabnik added the A-infrastructure label on Jan 29, 2015
The tarball (2015-03-03) has since increased to 138MiB in size. Recompressing that to …
Do we know statistics on how many systems can decompress an xz archive?
As I and others discussed on IRC: on Linux systems, some very common packages (with default-ish configure options) depend on liblzma5/xz (gdb and systemd come to mind as examples), so it is very likely to be available on a standard Linux system.
lifthrasiir referenced this issue on Mar 5, 2015: rustdoc: Move sidebar items into shared JavaScript. #23060 (merged)
OS X does support xz by default in its tar command (which is bsdtar; not sure exactly when support was introduced, but I think it was in 10.9) and in Archive Utility (apparently new in 10.10). This works via a library rather than the xz command-line utility, which is not provided.
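As a toy parallel (this is Python, not bsdtar, but it illustrates the same library-level approach): Python's tarfile handles xz through the stdlib lzma bindings rather than by shelling out to an xz binary.

```python
import io
import tarfile

# Build a .tar.xz entirely in memory; no external xz tool is invoked.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:xz") as tar:
    data = b"hello"
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Reading it back likewise goes through liblzma-backed library code.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:xz") as tar:
    extracted = tar.extractfile("hello.txt").read()
print(extracted)  # b'hello'
```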
Are you sure it is not just shelling out to the xz command? Does tar xJf … EDIT: never mind, apparently bsdtar just links against and uses liblzma5 directly. This is very nice; I should consider dropping GNU tar and using bsdtar myself. Either way, this means that using xz will benefit most users on both Linux and OS X.
The 7-Zip utility for Windows can decompress xz archives, according to Wikipedia. That's the only archiver I use on Windows. It's free and open source, but it is third-party. However, isn't the installer the preferred way of getting Rust on Windows? That's what I use. I don't know how the installer does decompression, but xz support will probably have to be implemented for it.
ssokolow commented on Apr 3, 2015
In addition to 7-Zip (which invented the underlying LZMA2 compression format), xz is also supported by all the other major tools not already mentioned: …
Edit: I need to stop forgetting to double-check my memory of what download pages are offering before posting. I've trimmed out some irrelevant bits.
Strong +1 to @alexcrichton's suggestion that we simply provide both. It costs us relatively nothing to construct both artifacts; is there any serious cost to providing them both (e.g. are we worried about connection charges or storage space on our servers)?
An update since the metadata reform: the gzipped tarball is now 107MB, the xzipped one 78MB. Still an easy 30MB win. EDIT: docless xz: 75MB, docless gzip: 100MB.
Is this still an issue today?
Recompressing the gzip from https://static.rust-lang.org/dist/rust-1.4.0-x86_64-unknown-linux-gnu.tar.gz to xz still goes from 97MB to 75MB, a win of 22MB.
FYI, nowadays even Busybox supports xz.
We have now switched to the full Rust solution for distribution, so we can easily switch from tar.gz to tar.xz in that case.
alexcrichton added the P-low label on Aug 22, 2016
I recently had a chance to live on a data-capped tether for two weeks, and it hurt very hard when the new stage0 compiler got in. It took me a considerable amount of time to download the new compiler, and it put a noticeable dent into my data allowance. Both of those would have been much more bearable with xz.
shlomif commented on Sep 8, 2016
Please provide .xz downloads for the source tarballs. I am a packager for Mageia Linux, and downloading the tar.gz and then uploading it to our tarballs server over my slow ADSL upstream is time-consuming. I tried to recompress the tar.gz tarball using … That's a 34% saving.
I'd like to make this happen, but it's quite complex to do. I think the basic way to do it is to recompress all the tarballs in one batch job at the same time, during final manifest generation. It would be great to do it in a way that isn't conflated with other parts of the build infrastructure, so that it can be developed and tested independently of buildbot. Unfortunately, the way the entire set of artifacts is put together is quite complex. I tried to write up a design that somebody else could implement but got pretty discouraged. But some requirements I think: …
I do want to redesign the entire release build process, and it might be easier to make this happen as part of a redesign.
Compressing just the source tarball can probably be done relatively easily by modifying the build system with a …
If we're counting calories, stripping … There are also a few spots of debuginfo, but that saves less.
OK, I think we're now ready to do this! Specifically, I believe the steps would look like: …
I did some experiments with the compression, also tuning the order in which files are included in the archive, and it looks like we might get further improvements. This is basically achieved by storing duplicate files one after the other, so that the stream compressor can encode them more efficiently.
I would like to work on this issue, but I might not be able to do so until the end of the month. If somebody else starts implementing the new system, please ping here :)
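A toy illustration of why adjacency matters, using zlib's 32 KiB deflate window (xz's far larger dictionary extends the same effect to whole duplicated libraries; the block sizes here are made up for the demo): two identical blocks compress far better when adjacent than when separated by more data than the window holds.

```python
import os
import zlib

a = os.urandom(20_000)    # a "file" that appears twice in the archive
b = os.urandom(100_000)   # an unrelated, larger "file"

adjacent = zlib.compress(a + a + b, 9)   # duplicates next to each other
separated = zlib.compress(a + b + a, 9)  # duplicates pushed far apart

# The second copy of `a` can only be back-referenced while the first
# copy is still inside the 32 KiB deflate window, i.e. when adjacent.
print(len(adjacent) < len(separated))  # True
```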
@ranma42 holy cow, I had no idea we could get such a drastic improvement by reordering files; that's awesome! FWIW the tarball creation itself is likely buried in rust-installer, which may be difficult to modify, but not impossible! Eventually I'd love to completely rewrite the rust-installer repo in Rust itself (e.g. …
I've tried to get better results than @ranma42 with both brotli and zstd at maximum compression settings (…
For rust-src-nightly.tar.gz they were behind as well, just not as far: …
Also note that brotli takes far longer to compress than any other algo. In the second diagram, you can see that reverse sorting in fact has a tiny negative impact for source code. I too would suggest going with xz at level 9 with reverse ordering, as it is a) far more widespread than zstd/brotli, and b) the possibly better decompression speed of zstd/brotli is an unimportant advantage.
I wonder if we can improve the sorting by either using a similarity hash and ordering by hash value, or even using a distance metric and Floyd-Warshall to find the cheapest path through all files. Then again, that's probably overdoing it.
@llogiq the "reverse name sorting" trick is a cheap approximation of that, because it clusters files with the same extension. In the case of Rust object files, it effectively also sorts them by their hash, ensuring that identical libraries are adjacent in the list. If we want to squeeze the tarball further, I would suggest investigating the biggest files in the release: …
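For reference, the reverse-name-sorting trick can be sketched as follows (the paths are made up for illustration): sorting by the reversed path string makes the file extension the most significant part of the key, so files of the same kind, including Rust's hash-suffixed libraries, land next to each other in the archive.

```python
paths = [
    "lib/librustc_driver-a1b2c3.so",
    "share/doc/index.html",
    "lib/libstd-d4e5f6.so",
    "share/doc/book/index.html",
]

# Sort by the reversed path so the end of the name (the extension)
# dominates the ordering, clustering similar files together.
ordered = sorted(paths, key=lambda p: p[::-1])
```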
Perhaps we should set up stripped binaries after all, as the savings are substantial. It may allow some people to use Rust who currently cannot afford it.
The difference between fully stripped and not stripped, when decompressed, is 120MB. The difference when compressed (for sorted files) is 8MB.
Being bold, we could also think of every single function as one "file", reorder those using similarity hashes (or Floyd-Warshall, although I guess the number would be too high for pure Floyd-Warshall), and provide a self-extracting archive or something. That would solve the "cargo links everything statically" problem.
Just in case, I have tested other options with …
[1] All dictionary compression schemes require a certain amount of previously decoded data. In gzip this is not significant (~64K), but for costlier options of …
I followed the first steps suggested by @alexcrichton without encountering any significant issue.
@ranma42 oh, @brson and I discussed this a long time ago actually, and we were both on board with just adding a new key to the manifest. Right now all artifacts have …
Oh, and similar to …
@alexcrichton a dash in the field name will prevent …
Given the proposed approach, I assume that there are no plans to add other formats in the future. Another option might be to add a …
Oh, the serde version of toml takes care of that just fine (via serde attributes), and the old rustc-serialize version actually handled it as well (translating deserialization into a Rust field named … I think we're definitely open to new formats in the future; we'd just add more keys. We could support a generic container (like a list) for the formats, but it didn't really seem to give much benefit over just listing everything manually. Downloaders will basically always look for an exact one and otherwise fall back to tarballs.
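A downloader's fallback under that scheme might look like the following sketch. The xz_url/xz_hash key names are assumptions based on this discussion, and the URLs are placeholders.

```python
def pick_artifact(pkg):
    """Prefer the xz artifact when the manifest advertises one,
    otherwise fall back to the gzip tarball."""
    if "xz_url" in pkg and "xz_hash" in pkg:
        return pkg["xz_url"], pkg["xz_hash"]
    return pkg["url"], pkg["hash"]

# Hypothetical manifest entry for one target.
manifest_entry = {
    "url": "https://example.invalid/rust-nightly.tar.gz",
    "hash": "aaaa",
    "xz_url": "https://example.invalid/rust-nightly.tar.xz",
    "xz_hash": "bbbb",
}
```

An entry that lacks the xz keys (e.g. one produced before the recompression job ran) silently falls back to the gzip tarball, so old and new manifests can coexist.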
I implemented the changes required to get the xz url and hash here, but I keep getting the …
Oh, ideally we'd switch to serde, but I wouldn't really worry about it; it's not that important. Due to bootstrapping, using serde in the compiler is difficult right now, unfortunately.
Then I will leave the manifest fields as …
bors added a commit that referenced this issue on Apr 30, 2017
bors added a commit that referenced this issue on Apr 30, 2017
bors added a commit that referenced this issue on May 3, 2017
frewsxcv added a commit to frewsxcv/rust that referenced this issue on May 3, 2017
bors added a commit that referenced this issue on May 3, 2017
frewsxcv added a commit to frewsxcv/rust that referenced this issue on May 3, 2017
bors added a commit to rust-lang/rustup.rs that referenced this issue on May 23, 2017
bors added a commit to rust-lang/rustup.rs that referenced this issue on May 24, 2017
The next version of rustup should include rust-lang/rustup.rs#1100, so it should use xz by default (if available).
And rustup has now shipped!
nagisa commented on Jan 28, 2015

Today's rust-nightly-x86_64-unknown-linux-gnu.tar.gz is 125MiB in size. I did a make dist-tar-bins, which output the same tarball, but only 88MiB in size. This is 70% of whatever we publish to s3. I took the liberty to also test:

- xz (the default level, -6) → 69MiB (55% of original);
- xz -9 → 59MiB (47% of original, but with high memory requirements to decompress);
- bz2 → 82MiB (65% of original);
- lzma → 69MiB, but took longer than xz.

I strongly propose that we either migrate to a more modern compression algorithm (xz) or at least investigate why gzip does such a bad job on the build bots.

cc @brson