Tarballs we deliver should be compressed better #21724

Closed
nagisa opened this issue Jan 28, 2015 · 40 comments
Labels
P-low Low priority

Comments

@nagisa
Member

nagisa commented Jan 28, 2015

Today’s rust-nightly-x86_64-unknown-linux-gnu.tar.gz is 125MiB in size. I did a make dist-tar-bins which output the same tarball, but only 88MiB in size. This is 70% of whatever we publish to s3.

I took the liberty of also testing:

  • xz (the default level, -6) → 69MiB (55% original);
  • xz -9 → 59MiB (47% original, but has high memory requirements to decompress)
  • bz2 → 82MiB (65% original);
  • lzma → 69MiB, but took longer than xz.

I strongly propose either migrating to a more modern compression algorithm (xz) or at least investigating why gzip does such a bad job on the build bots.
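For anyone who wants to repeat this kind of comparison without invoking each tool by hand, here is a minimal Python sketch using only the standard library. The `compare` helper and the synthetic `sample` data are made up for illustration; meaningful numbers have to come from the real tarball contents.

```python
import bz2
import gzip
import lzma


def compare(data: bytes) -> dict[str, int]:
    """Return the compressed size in bytes for each codec."""
    return {
        "gzip -9": len(gzip.compress(data, compresslevel=9)),
        "bz2 -9": len(bz2.compress(data, compresslevel=9)),
        "xz -6": len(lzma.compress(data, preset=6)),
    }


# Synthetic stand-in for tarball contents (an assumption, not real data).
sample = b"".join(b"fn item_%d() -> u32 { %d }\n" % (i, i % 97) for i in range(20000))

for codec, size in compare(sample).items():
    # Report each result as a percentage of the uncompressed input.
    print(f"{codec}: {size} bytes ({100 * size / len(sample):.1f}% of input)")
```

To try it on a real archive, read the uncompressed `.tar` bytes from disk and pass them to `compare` instead of `sample`.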

cc @brson

@nagisa
Member Author

nagisa commented Jan 28, 2015

Apparently our s3 distribution includes docs, which my make dist-tar-bins didn’t include for some reason. That’s the cause of the discrepancy between the gzip sizes on my system and what’s shipped through s3.

I downloaded the tarball we ship and recompressed it for a fairer comparison:

  • xz → 75MiB;
  • xz -9 → 73MiB (honestly, surprised about how small the improvement is over xz -6);
  • bz2 → 95MiB;

@nagisa nagisa changed the title Tarballs we deliver are compressed terribly Tarballs we deliver should be compressed better Jan 28, 2015
@nagisa
Member Author

nagisa commented Mar 4, 2015

The tarball (2015-03-03) has since increased to 138MiB in size. Recompressing that to xz results in 82MiB .tar.xz, for a 40% gain.

@alexcrichton
Member

Do we have statistics on how many systems can decompress an xz archive? I'm mostly just curious, as I suspect that we would continue to produce both and then systems like multirust could figure out which to download based on the host system.

@nagisa
Member Author

nagisa commented Mar 4, 2015

As I and others discussed on IRC:

  • Macs don’t appear to have xz by default (I’m unsure), but there is a recipe on Homebrew;
  • Windows has neither gzip, nor xz, nor tar by default, so it doesn’t matter either way;

On linux systems some very common packages (with defaultish configure options) depend on liblzma5/xz (gdb and systemd come to mind as examples) so it is very likely it will be available on a standard linux system.

@comex
Contributor

comex commented Mar 6, 2015

OS X does support xz by default in its 'tar' command (which is bsdtar - not sure exactly when support was introduced, but I think it was in 10.9) and Archive Utility (apparently newly in 10.10). This works via a library rather than the xz command line utility, which is not provided.

@nagisa
Member Author

nagisa commented Mar 6, 2015

Are you sure it is not just shelling out to the xz command? Does 'tar xJf file' work without xz executable in the path?

EDIT: never mind, apparently bsdtar just links to liblzma5 and uses it directly. This is very nice; I should consider dropping gnu tar and using bsdtar myself.

Either way, this means that using xz will benefit most of the users of both linux and os x.

@abonander
Contributor

The 7-Zip Utility for Windows can decompress xz archives, according to Wikipedia. That's the only archiver I use on Windows. It's free and open source, but it is third-party.

However, isn't the installer the preferred way of getting Rust on Windows? That's what I use. I don't know how the installer does decompression but xz support will probably have to be implemented for it.

@ssokolow

ssokolow commented Apr 3, 2015

In addition to 7-Zip (which invented the underlying LZMA2 compression format) XZ is also supported by all the other major tools not already mentioned:

  • WinRAR (If you're running Windows, technical enough to be interested in Rust, and not using 7-zip, it's probably because you prefer WinRAR)
  • The Unarchiver (Serves a similar role to 7-zip as the "extract anything" utility of choice for OSX power users)
  • PeaZip (As far as I can tell, what people generally fall back to when they want an open-source, GUI-based "do 99% of what I need" archive tool for Windows but don't like 7-Zip's UI. Of course, given that most people who don't like 7-zip just put up with WinRAR's shareware nags, it's not got much market share even as the next in line.)

Edit: I need to stop forgetting to double-check my memory of what download pages are offering before posting. I've trimmed out some irrelevant bits.

@pnkfelix
Member

pnkfelix commented Apr 3, 2015

Strong +1 to @alexcrichton 's suggestion that we simply provide both. It costs us relatively nothing to construct both artifacts; is there any serious cost to provide them both (e.g. are we worried about connection charges or storage space on our servers?)

@nagisa
Member Author

nagisa commented Apr 9, 2015

An update since the metadata reform. Now gzipped tarball is 107MB. xzipped is 78MB. Still an easy 30MB win.

EDIT: docless xz: 75MB, docless gzip: 100MB.

@steveklabnik
Member

Is this still an issue today?

@nagisa
Member Author

nagisa commented Nov 4, 2015

Recompressing gzip from https://static.rust-lang.org/dist/rust-1.4.0-x86_64-unknown-linux-gnu.tar.gz to xz still goes from 97MB to 75MB, a win of 22MB.

@nagisa
Member Author

nagisa commented Nov 4, 2015

@nodakai
Contributor

nodakai commented Feb 28, 2016

FYI, nowadays even Busybox supports xz

@lifthrasiir
Contributor

lifthrasiir commented Jun 16, 2016

We now have switched to the full Rust solution for distribution, so we can easily switch from tar.gz to tar.xz in that case.

@alexcrichton alexcrichton added the P-low Low priority label Aug 22, 2016
@nagisa
Member Author

nagisa commented Aug 31, 2016

I recently had a chance to live on data-capped tether for 2 weeks and it hurt me very hard when the new stage0 compiler got in. It took me a considerable amount of time to download the new compiler and put a noticeable dent into my data allowance. Both of those would have been much more bearable with xz.

@shlomif

shlomif commented Sep 8, 2016

Please provide .xz downloads for the source tarballs. I am a packager for Mageia Linux, and downloading the tar.gz and then uploading it to our tarball server over my slow ADSL upstream is time-consuming. I tried compressing the tar.gz tarball using xz -9 --extreme and the savings are significant:

shlomif[rpms]:$mageia/rust/SOURCES$ ls -l rustc-1.11.0-src.tar.gz ~/rustc-1.11.0-src.tar.xz 
-rw-r--r-- 1 shlomif shlomif 17108400 Sep  8 20:56 /home/shlomif/rustc-1.11.0-src.tar.xz
-rw-r--r-- 1 shlomif shlomif 26126471 Aug 16 13:39 rustc-1.11.0-src.tar.gz

That's a 34% saving.
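The figure can be checked from the two byte counts in the listing above. A small Python sketch of the arithmetic (the function name is just for illustration):

```python
def saving_percent(original_bytes: int, recompressed_bytes: int) -> float:
    """Percentage by which the recompressed file is smaller than the original."""
    return 100.0 * (original_bytes - recompressed_bytes) / original_bytes


gz_size = 26126471  # rustc-1.11.0-src.tar.gz, from the listing
xz_size = 17108400  # rustc-1.11.0-src.tar.xz, from the listing
print(f"{saving_percent(gz_size, xz_size):.1f}% smaller")
```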

@brson
Contributor

brson commented Sep 8, 2016

I'd like to make this happen but it's quite complex to do. I think the basic way to do it is to recompress all the tarballs in one batch job at the same time during final manifest generation. It would be great to do it in a way that isn't conflated with other parts of the build infrastructure, so that it can be developed and tested independently of buildbot. Unfortunately the way the entire set of artifacts is put together is quite complex. I tried to write up a design that somebody else could implement but got pretty discouraged.

But here are some requirements, I think:

  • It should be one big batch job that recompresses all the tarballs
  • The names of the xz files and their hashes need to end up in the manifest files so they can be validated by rustup. This requirement makes things particularly hard since the format has to be expanded backwards compatibly, and the recompression must be done before creation of the manifest.
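A rough sketch of the "names and hashes" half of that requirement, in Python. It assumes SHA-256 as the hash (the comment above only says "hashes") and a hypothetical `dist_dir` holding the recompressed files; how the result is merged into the real manifest format is left open.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large tarballs are never fully loaded in memory."""
    digest = hashlib.sha256()  # assumption: the manifest hash is SHA-256
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_entries(dist_dir: Path) -> dict[str, str]:
    """Map each *.tar.xz file name in dist_dir to its hex digest."""
    return {p.name: sha256_of(p) for p in sorted(dist_dir.glob("*.tar.xz"))}
```

Because the hashes depend on the final bytes of each `.tar.xz`, this step can only run after recompression, which is the ordering constraint described above.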

I do want to redesign the entire release build process, and it might be easier to make this happen as part of a redesign.

@brson
Contributor

brson commented Sep 8, 2016

Compressing just the source tarball can probably be done relatively easily by modifying the build system with a --xz-source-tarball flag which we could enable on the linux bots.

@cuviper
Member

cuviper commented Oct 13, 2016

If we're counting calories, stripping .rustc from rustc/lib/*.so (but not rustlib/!) saves about 9MB from the current unpacked nightly dist, and that savings even translates directly to compressed forms since .rustc was already compressed.

There are also a few spots of debuginfo, but that saves less.

@alexcrichton
Member

Ok, I think nowadays we're well poised to do this! Specifically, I believe the steps would look like:

  1. First, familiarize yourself with dist.rs where all distribution related code lives.
  2. Update Travis/AppVeyor to have the required software to perform xz compression (or whatever we choose)
  3. For all tarballs we create, create xz versions as well. I think this'd basically look like:
    • after the tarball is created
    • execute cat $tarball | gunzip | xz > $tarball.xz
  4. Update Travis/AppVeyor to upload xz tarballs (this may not actually require any changes)
  5. Update the manifest generator to list xz in the manifest. The precise format is somewhat undecided but we can likely add sibling xz-url = '...' keys to the existing url keys (along with a hash). @brson or I should be contacted about this.
  6. Update rustup.rs to parse the new xz keys in the manifest
  7. Add xz decompression support to rustup (via a library). This is one example library, there may be more
  8. Change rustup to prefer xz by default (if it proves itself)
  9. Rejoice!
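Step 3 can also be done without a shell pipeline. The following is a minimal Python sketch of the same gunzip-then-xz stream using only the standard library; the function name and the default preset are assumptions, and this is not what dist.rs or rust-installer actually do.

```python
import gzip
import lzma
import shutil
from pathlib import Path


def recompress_gz_to_xz(gz_path: Path, preset: int = 6) -> Path:
    """Stream foo.tar.gz into foo.tar.xz next to it and return the new path."""
    xz_path = gz_path.with_suffix(".xz")  # replaces only the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, lzma.open(xz_path, "wb", preset=preset) as dst:
        # Copy in chunks so the uncompressed tarball never sits in memory at once.
        shutil.copyfileobj(src, dst)
    return xz_path
```

Note that this names the output `foo.tar.xz`, whereas the shell one-liner in step 3 would produce `$tarball.xz`, i.e. `foo.tar.gz.xz` if `$tarball` already ends in `.tar.gz`.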

@ranma42
Contributor

ranma42 commented Mar 24, 2017

I did some experiments with the compression by also tuning the order in which files are included in the archive and it looks like we might get further improvements. This is basically achieved by storing duplicate files one after the other, so that the stream compressor can encode them more efficiently.
The results and the Makefile I used for my experiments are available here.

cat $tarball | gunzip | xz > $tarball.xz reduces the latest tarball from 135 MB to 95 / 90 MB (without / with the -9 flag).
Changing the order in which tar stores the files makes it possible to compress it down to 79 / 58 MB (without / with the -9 flag).
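To illustrate why placing duplicates next to each other helps, here is a toy Python sketch. It shrinks the LZMA2 dictionary to 64 KiB so the effect shows up with small inputs; the sizes and the random blobs are made up, and real xz presets use much larger dictionaries (so real duplicates must be correspondingly further apart before they stop matching).

```python
import lzma
import os


def xz_size(data: bytes, dict_size: int) -> int:
    """Compressed size using LZMA2 with an explicit dictionary size."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


blob = os.urandom(32 * 1024)     # stands in for a file that appears twice
filler = os.urandom(256 * 1024)  # unrelated data
small_dict = 64 * 1024           # deliberately tiny dictionary for the demo

# Second copy directly follows the first: it is still inside the dictionary.
adjacent = xz_size(blob + blob + filler, small_dict)
# Second copy comes after the filler: the first copy has already been forgotten.
separated = xz_size(blob + filler + blob, small_dict)
print(adjacent, separated)
```

The adjacent layout should come out smaller by roughly the size of one blob, since the second copy can be encoded as back-references instead of being stored again.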

I would like to work on this issue, but I might not be able to do so until the end of the month. If somebody else starts implementing the new system, please ping here :)

@alexcrichton
Member

@ranma42 holy cow I had no idea we could get such a drastic improvement by reordering files, that's awesome!

FWIW the tarball creation itself is likely buried in rust-installer which may be difficult to modify but not impossible! Eventually I'd love to completely rewrite the rust-installer repo in Rust itself (e.g. src/tools/rust-installer) but that doesn't have to necessarily happen before this.

@est31
Member

est31 commented Mar 25, 2017

I've tried to get better results than @ranma42 with both brotli and zstd at maximum compression settings (22 for zstd, 11 for brotli) but they were far behind xz compression at default setting (for rust-nightly-x86_64-unknown-linux-gnu.tar.gz):

460M    files.gnu.tar
125M    files.gnu.tar.bz2
96M     files.gnu.tar.xz
92M     files.gnu.tar.xz9
103M    files.gnu.tar.zst
460M    rev-sorted-files.gnu.tar
88M     rev-sorted-files.gnu.tar.bro
123M    rev-sorted-files.gnu.tar.bz2
79M     rev-sorted-files.gnu.tar.xz
58M     rev-sorted-files.gnu.tar.xz9
85M     rev-sorted-files.gnu.tar.zst

For the rust-src-nightly.tar.gz they were behind as well, just not as far:

181M    files.gnu.tar
25M     files.gnu.tar.bz2
31M     files.gnu.tar.gz
22M     files.gnu.tar.xz
21M     files.gnu.tar.xz9
23M     rev-sorted-files.gnu.tar.bro
32M     rev-sorted-files.gnu.tar.gz
22M     rev-sorted-files.gnu.tar.xz
22M     rev-sorted-files.gnu.tar.xz9
23M     rev-sorted-files.gnu.tar.zst

Also note that brotli takes far longer to compress than any other algorithm. In the second listing, you can see that reverse sorting in fact has a tiny negative impact for source code. I too would suggest going with xz at level 9 with reverse ordering, as a) it's far more widespread than zstd/brotli and b) the possibly better decompression speed of zstd/brotli is an unimportant advantage.

@llogiq
Contributor

llogiq commented Mar 25, 2017

I wonder if we can improve the sorting by either using a similarity hash and order by hash value, or even use a distance metric and Floyd-Warshall to find out the cheapest path through all files.

Then again that's probably overdoing it.

@ranma42
Contributor

ranma42 commented Mar 25, 2017

@llogiq the "reverse name sorting" trick is a cheap approximation of that, because it clusters files with the same extension. In the case of rust object files, it is effectively also sorting them by their hash, ensuring that identical libraries are adjacent in the list.
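A minimal Python sketch of that ordering, assuming "reverse name sorting" means sorting paths by their reversed string (so names are compared from the last character backwards). The file names in the usage note are invented, and this is not the Makefile used for the experiments.

```python
import io
import tarfile


def reversed_name_key(name: str) -> str:
    """Sort key that compares names from their last character backwards."""
    return name[::-1]


def build_tar(files: dict[str, bytes]) -> bytes:
    """Build an uncompressed tar whose members are ordered by reversed name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files, key=reversed_name_key):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()
```

With hypothetical members such as `lib/libstd-abc123.so` and `rustlib/lib/libstd-abc123.so`, the two copies end up adjacent in the archive because their names share a long common suffix, which is the clustering effect described above.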

If we want to squeeze the tarball further, I would suggest investigating the biggest files in the release:

  • librustc_llvm.so is 62 MB, but I do not expect it to squeeze easily
  • the cargo binary is 37 MB and it looks like it statically links all of its dependencies; maybe it would be possible to dynamically link some of them to reduce its size? (strip can squeeze it down to 9 MB, but this would make the debugging experience worse)

@llogiq
Contributor

llogiq commented Mar 25, 2017

Perhaps we should set up stripped binaries after all, as the savings are substantial. It may allow some people to use Rust who currently cannot afford it.

@nagisa
Member Author

nagisa commented Mar 25, 2017

The difference between fully stripped and not stripped when decompressed is 120MB. Difference when compressed (for sorted files) is 8MB.

@est31
Member

est31 commented Mar 25, 2017

Being bold, we could also think of every single function as one "file", reorder those using similarity hashes (or floyd-warshall, although I guess the number would be too high for pure floyd-warshall), and provide a self extracting archive or something. That would solve the "cargo links everything statically" problem.

@lifthrasiir
Contributor

Just in case, I have tested other xz options in search of more size reduction and lower decompression memory [1]: (The archive used is the 2017-03-15 nightly, which should be the same as @ranma42's)

-rw-rw-r--  1 lifthrasiir  81628064 Mar 26 22:41 rev-sorted-files.bsd.tar.xz
-rw-rw-r--  1 lifthrasiir  81080832 Mar 26 22:47 rev-sorted-files.bsd.tar.xz6e
-rw-rw-r--  1 lifthrasiir  74531756 Mar 26 22:50 rev-sorted-files.bsd.tar.xz7
-rw-rw-r--  1 lifthrasiir  73955700 Mar 26 22:56 rev-sorted-files.bsd.tar.xz7e
-rw-rw-r--  1 lifthrasiir  74053348 Mar 26 23:00 rev-sorted-files.bsd.tar.xz8
-rw-rw-r--  1 lifthrasiir  73494812 Mar 26 23:06 rev-sorted-files.bsd.tar.xz8e
-rw-rw-r--  1 lifthrasiir  60213532 Mar 26 23:10 rev-sorted-files.bsd.tar.xz9
-rw-rw-r--  1 lifthrasiir  59672056 Mar 26 23:17 rev-sorted-files.bsd.tar.xz9e

The *.xz file corresponds to the default option (-6). I've also tested -6e through -9e, which try to compress more at the expense of compression speed (about 2x slower in my testing); they do have some impact but not much, so I guess -9 is the best option overall as long as users have enough memory (see the footnote below). Note that the difference in decompression speed was insignificant, except that -9/-9e were slightly faster than the others (probably due to less I/O overhead).

[1] Every dictionary compression scheme requires a certain amount of previously decoded data to be kept around. In gzip this is not significant (~64K), but for the costlier options of xz it may be: -9 requires 65 megabytes of memory to decompress, for example.
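The footnote can be observed from Python, whose `lzma` module lets a decompressor be given a memory budget. This sketch uses preset 6 (8 MiB dictionary) rather than -9 so that the compression side stays cheap; the payload and the two budgets are arbitrary choices for the demo.

```python
import lzma

payload = b"rust " * 200_000
archive = lzma.compress(payload, preset=6)  # preset 6 uses an 8 MiB dictionary


def try_decompress(data: bytes, memlimit: int) -> bool:
    """Return True if data can be decoded within memlimit bytes of memory."""
    try:
        lzma.LZMADecompressor(memlimit=memlimit).decompress(data)
        return True
    except lzma.LZMAError:
        # liblzma refuses to decode when the dictionary does not fit the budget.
        return False


print(try_decompress(archive, 1 << 20))   # 1 MiB budget
print(try_decompress(archive, 64 << 20))  # 64 MiB budget
```

The budget that matters is set by the dictionary size recorded in the archive, not by how large the payload is, which is why a higher preset raises the memory floor for every user who unpacks the file.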

@ranma42
Contributor

ranma42 commented Apr 17, 2017

I followed the first steps suggested by @alexcrichton without encountering any significant issue.
@brson, should we start designing the new format of the manifest? What is the best place for doing that?

@alexcrichton
Member

@ranma42 oh @brson and I discussed this a long time ago actually and we were both on board with just adding a new key to the manifest. Right now all artifacts have url = "..." which points to a *.tar.gz, and we'd just add a new key, xz-url = "..." which points to a *.tar.xz (or whatever format we select).

@alexcrichton
Member

oh and similar to hash = "..." we'd have xz-hash = "..." for each artifact
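A sketch of how a downloader might consume those sibling keys, written in Python for brevity. The key names follow the proposal above (`url`, `hash`, `xz-url`, `xz-hash`); the function, the `xz_supported` switch and the example values are hypothetical and not rustup's actual code.

```python
def pick_download(artifact: dict[str, str], xz_supported: bool = True) -> tuple[str, str]:
    """Return (url, hash), preferring the xz variant when present and usable."""
    if xz_supported and "xz-url" in artifact and "xz-hash" in artifact:
        return artifact["xz-url"], artifact["xz-hash"]
    # Older manifests, or clients without xz support, fall back to gzip.
    return artifact["url"], artifact["hash"]


# Hypothetical manifest entry; the URLs and hashes are placeholders.
artifact = {
    "url": "https://example.invalid/dist/rust-nightly.tar.gz",
    "hash": "gz-hash-placeholder",
    "xz-url": "https://example.invalid/dist/rust-nightly.tar.xz",
    "xz-hash": "xz-hash-placeholder",
}
print(pick_download(artifact))
print(pick_download(artifact, xz_supported=False))
```

Since the new keys sit beside the existing ones, a client that only knows `url` and `hash` keeps working unchanged, which is the backwards-compatibility property asked for earlier in the thread.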

@ranma42
Contributor

ranma42 commented Apr 18, 2017

@alexcrichton a dash in the field name will prevent Target from being RustcEncodable and instead require explicit serialization/deserialization as mentioned here. Is that ok?

Given the proposed approach, I assume that there are no plans to add other formats in the future. Another option might be to add a sources array containing structs that have format, hash and url keys (or a BTreeMap in which the format is the key?). This would make it trivial to add/remove formats from the manifest without changing its schema. The tools (is there any other tool beside rustup consuming the manifest?) could just ignore those that are not supported or disabled for some reason.

@alexcrichton
Member

Oh so the serde version of toml takes care of that just fine (via serde attributes), and the old rustc-serialize version actually handled it as well (deserializing into a Rust field named foo_bar reads from a TOML key foo-bar).

I think we're definitely open to new formats in the future, we'd just add more keys. We could support a generic container (like a list) for the formats but it didn't really seem to give much benefit over just listing everything manually. Downloaders will basically always look for an exact one and otherwise fall back to tarballs.

@ranma42
Contributor

ranma42 commented Apr 25, 2017

I implemented the changes required to get the xz url and hash here, but I keep getting the _ in the field names in the manifests. What should I do to get the fields written as xz-url? Should I use a different version of rustc-serialize?

@alexcrichton
Member

Oh ideally we'd switch to serde, but I wouldn't really worry about it, it's not that important. Due to bootstrapping using serde in the compiler is difficult right now, unfortunately.

@ranma42
Contributor

ranma42 commented Apr 25, 2017

Then I will leave the manifest fields as xz_url and xz_hash for the time being and start updating rustup :)
I have opened rust-lang/rust-installer#57 and I was waiting for that to be merged before submitting the PR against rust, to update the submodule to the merge commit. Should I just open the new PR and then update it as needed?

bors added a commit that referenced this issue Apr 30, 2017
Generate XZ-compressed tarballs

Integrate the new `rust-installer` and extend manifests with keys for xz-compressed tarballs.

One of the steps required for #21724
frewsxcv added a commit to frewsxcv/rust that referenced this issue May 3, 2017
Generate XZ-compressed tarballs

Integrate the new `rust-installer` and extend manifests with keys for xz-compressed tarballs.

One of the steps required for rust-lang#21724
bors added a commit to rust-lang/rustup that referenced this issue May 23, 2017
Add support for XZ-compressed packages

When XZ-compressed packages are available, prefer them in place of the
GZip-compressed ones as they can provide significant savings in terms
of download size.

This should be the last step towards fixing rust-lang/rust#21724
bors added a commit to rust-lang/rustup that referenced this issue May 24, 2017
Add support for XZ-compressed packages

When XZ-compressed packages are available, prefer them in place of the
GZip-compressed ones as they can provide significant savings in terms
of download size.

This should be the last step towards fixing rust-lang/rust#21724
@ranma42
Contributor

ranma42 commented May 24, 2017

The next version of rustup should include rust-lang/rustup#1100, hence it should use XZ by default (if available).

@alexcrichton
Member

And rustup has now shipped!
