Consider disabling compression for rlibs and bytecode files #37086

Closed
nnethercote opened this Issue Oct 11, 2016 · 29 comments

@nnethercote
Contributor

nnethercote commented Oct 11, 2016

One of the hottest functions in rustc is tdefl_compress, which is called from deflate_bytes. It's used in two places: crate metadata in rlibs, and LLVM bytecode files.

If we simply turned off compression in these two places we would get sizeable speed-ups. The following numbers are for a proof-of-concept patch, doing debug builds with a stage1 compiler.

futures-rs-test  4.632s vs  4.588s --> 1.009x faster (variance: 1.013x, 1.012x)
helloworld       0.249s vs  0.250s --> 0.997x faster (variance: 1.014x, 1.015x)
html5ever-2016-  7.967s vs  7.791s --> 1.023x faster (variance: 1.004x, 1.016x)
hyper.0.5.0      5.424s vs  5.177s --> 1.048x faster (variance: 1.004x, 1.006x)
inflate-0.1.0    5.013s vs  4.945s --> 1.014x faster (variance: 1.009x, 1.017x)
issue-32062-equ  0.367s vs  0.364s --> 1.008x faster (variance: 1.013x, 1.017x)
issue-32278-big  1.812s vs  1.810s --> 1.001x faster (variance: 1.007x, 1.008x)
jld-day15-parse  1.638s vs  1.606s --> 1.020x faster (variance: 1.001x, 1.012x)
piston-image-0. 12.522s vs 12.236s --> 1.023x faster (variance: 1.029x, 1.004x)
regex.0.1.30     2.684s vs  2.511s --> 1.069x faster (variance: 1.018x, 1.013x)
rust-encoding-0  2.232s vs  2.134s --> 1.046x faster (variance: 1.008x, 1.010x)
syntex-0.42.2   34.353s vs 33.205s --> 1.035x faster (variance: 1.011x, 1.013x)
syntex-0.42.2-i 18.848s vs 17.033s --> 1.107x faster (variance: 1.004x, 1.035x) 

regex and syntex-incr are the biggest wins.

The obvious downside is that the relevant files get larger. So we need to decide whether we are happy with this trade-off: larger files in exchange for faster compilation.
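The patch itself isn't shown here, but in spirit it amounts to the following sketch (illustrative only, written against the flate2 crate with a made-up `compress` flag; rustc's real code path goes through its internal `flate::deflate_bytes`):

```
use std::io::Write;

use flate2::write::DeflateEncoder;
use flate2::Compression;

/// Hypothetical illustration of the proof-of-concept: skip deflate
/// entirely when a (made-up) `compress` flag is off, trading larger
/// rlibs/bytecode files for less compile-time work.
fn maybe_deflate(bytes: &[u8], compress: bool) -> Vec<u8> {
    if !compress {
        return bytes.to_vec(); // write the payload uncompressed
    }
    let mut enc = DeflateEncoder::new(Vec::new(), Compression::default());
    enc.write_all(bytes).expect("compression failed");
    enc.finish().expect("compression failed")
}

fn main() {
    let payload = vec![7u8; 1024];
    assert_eq!(maybe_deflate(&payload, false).len(), 1024);
    assert!(maybe_deflate(&payload, true).len() < 1024);
}
```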


@arielb1
Contributor

arielb1 commented Oct 11, 2016

Maybe we should switch to a faster compression algorithm - e.g. lz4?

@eddyb
Member

eddyb commented Oct 11, 2016

Something that could be considered is a seekable algorithm, so that we could also decompress on demand (as we use absolute/relative positioning in the metadata blob).
If we can go back to compressing the metadata blobs in rlibs without noticeable slowdown, we could get rid of the leb128 encoding and just write integers in little-endian as they would be in memory.
Then again, it might not be worth it.
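For context, a minimal sketch of the trade-off between leb128 and raw little-endian integers (illustrative; not rustc's actual encoder):

```
/// Variable-length LEB128 encoding: small values take 1 byte,
/// but every read is a data-dependent loop.
fn write_unsigned_leb128(out: &mut Vec<u8>, mut value: u64) {
    loop {
        let mut byte = (value & 0x7f) as u8;
        value >>= 7;
        if value != 0 {
            byte |= 0x80; // more bytes follow
        }
        out.push(byte);
        if value == 0 {
            break;
        }
    }
}

/// Fixed-width little-endian encoding: always 8 bytes, but a read is
/// a single load -- and the bytes match the in-memory layout.
fn write_u64_le(out: &mut Vec<u8>, value: u64) {
    out.extend_from_slice(&value.to_le_bytes());
}

fn main() {
    let (mut leb, mut le) = (Vec::new(), Vec::new());
    write_unsigned_leb128(&mut leb, 300);
    write_u64_le(&mut le, 300);
    assert_eq!(leb.len(), 2); // LEB128 uses 2 bytes here...
    assert_eq!(le.len(), 8); // ...little-endian always uses 8.
}
```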

@nagisa
Contributor

nagisa commented Oct 11, 2016

> we would get sizeable speed-ups

How much larger do the rlibs become, though? 30%?

@michaelwoerister
Contributor

michaelwoerister commented Oct 11, 2016

For reference:

> Alright so given the numbers #6954, I don't think LZ4 is worth it, and neither does @brson.

from #6902 (comment)

To be honest, I don't think that the speed gains listed above warrant turning compression off. How hard would it be to compress things on a background thread? Compression could be done in parallel with codegen, maybe?
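A minimal sketch of the background-thread idea, using std::thread and the flate2 crate purely for illustration (rustc's actual compression entry point differs):

```
use std::io::Write;
use std::thread;

use flate2::write::DeflateEncoder;
use flate2::Compression;

/// Kick off compression on a background thread so it can overlap
/// with codegen; join the handle only when the artifact is written.
fn compress_in_background(data: Vec<u8>) -> thread::JoinHandle<Vec<u8>> {
    thread::spawn(move || {
        let mut enc = DeflateEncoder::new(Vec::new(), Compression::default());
        enc.write_all(&data).expect("compression failed");
        enc.finish().expect("compression failed")
    })
}

fn main() {
    let metadata = vec![0u8; 1 << 20]; // stand-in for the metadata blob
    let handle = compress_in_background(metadata);
    // ... codegen would proceed here, in parallel ...
    let compressed = handle.join().unwrap();
    println!("compressed to {} bytes", compressed.len());
}
```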

@eddyb
Member

eddyb commented Oct 11, 2016

rlibs are already uncompressed except bitcode. Metadata in dylibs is compressed.

I think we turned it off in rlibs not because of the one-time compression cost, but because of decompression in each user crate (it also means no mmap is possible).

@alexcrichton
Member

alexcrichton commented Oct 11, 2016

> It's used in two places: crate metadata in rlibs, and LLVM bytecode files.

As @eddyb mentioned, rlib metadata isn't compressed, as it's intended to be mmap'd. So that makes me curious where this compression is called from. A standard compilation should not look at the bytecode (e.g. it shouldn't need to decompress it); it's only there for LTO builds. Or at least that was the state of the world a few years ago when I implemented LTO...

@nnethercote are you sure that this function is mostly being called from decompression of metadata and/or bytecode? I could imagine this showing up when compressing bytecode, but not during a normal compile...

@eddyb
Member

eddyb commented Oct 11, 2016

tdefl_compress is compression, i.e. "deflate" ("inflate" being decompression - the "de" of "deflate" might be a bit confusing).

@alexcrichton
Member

alexcrichton commented Oct 11, 2016

Oh OK, I think I misread.

Then yes, I think this is purely being called from compressing bytecode, not the metadata itself (which isn't compressed in rlibs). We can likely stomach larger rlibs (the size rarely comes up), but to truly fix this we in theory want to disable bytecode-in-rlib entirely. It's only used for LTO, which is almost never used, so we should arguably require a flag to opt in to LTO-able rlibs, which Cargo passes by default if need be.

@nnethercote
Contributor

nnethercote commented Oct 11, 2016

> Then yes, I think this is purely being called from compressing bytecode, not the metadata itself (which isn't compressed in rlibs).

As the first comment says, compression occurs in two places. More specifically, here:

  • write_metadata, in src/librustc_trans/base.rs
  • link_rlib, in src/librustc_trans/back/link.rs

Both of them are significant, though how significant varies by benchmark. E.g. the former affects regex more, the latter affects syntex-incr more.

> we could get rid of the leb128 encoding

leb128 encoding is also hot. E.g. see #37083, where a tiny improvement to read_unsigned_leb128 had a small but noticeable effect.
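For readers following along, a sketch of the shape of such a decoder and why it shows up in profiles (illustrative; rustc's actual read_unsigned_leb128 differs in detail):

```
/// Decode one LEB128 value starting at `pos`, returning (value, new_pos).
/// The data-dependent loop and per-byte branch are why this function
/// is hot when metadata is read.
fn read_unsigned_leb128(data: &[u8], mut pos: usize) -> (u64, usize) {
    let mut result = 0u64;
    let mut shift = 0;
    loop {
        let byte = data[pos];
        pos += 1;
        result |= ((byte & 0x7f) as u64) << shift;
        if byte & 0x80 == 0 {
            return (result, pos);
        }
        shift += 7;
    }
}

fn main() {
    // 300 encodes as [0xac, 0x02].
    let (value, next) = read_unsigned_leb128(&[0xac, 0x02], 0);
    assert_eq!((value, next), (300, 2));
}
```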

@nnethercote
Contributor

nnethercote commented Oct 12, 2016

Any suggestions on how to move forward here? The speed-ups are large, but the disk space trade-off is such that it will be easy for inertia to win here, and I'd like to avoid that. I can take file measurements if someone suggests good examples to measure.

@alexcrichton
Member

alexcrichton commented Oct 12, 2016

I suppose my personal ideal world would look a little like:

  • The compiler works with rlibs that have either compressed or uncompressed bytecode; some metadata says which.
  • A flag is added to the compiler to compress the bytecode (or basically "generate the smallest, yet fastest to read rlib").
  • The compiler's build system passes this flag on release builders.

That way the Rust installation stays the same size, yet everyone gets the benefit of not having to compress bytecode. I've found that, size-wise, Cargo build directories generally don't matter as much as the Rust installation itself.

@michaelwoerister
Contributor

michaelwoerister commented Oct 12, 2016

I'd prefer to have a future-proof concept before doing anything here. Unless I'm overlooking something, there are only two crates where this makes a difference of more than a second, and that's for debug builds of big crates. In a release build, I would suspect that the difference is well under 1% of total build time. So I don't see an urgent need for action here. (Sorry, @nnethercote, I don't want to put a damper on your enthusiasm. It's great that you are looking into this, I just don't want to needlessly rush things.)

Some questions that I'd like to have answered before proceeding:

  • Do we really want to make LTO opt-in? How would that interact with the possibility of machine-code-less rlibs that has been talked about from time to time? I'm worried about introducing another stable command-line option just to deprecate it again a few months later.
  • Is there a compression algorithm that works well with small, individually addressable items in a stream? Out of the box, LZ77 variants don't provide that, afaik.
  • Do we need a general-purpose algorithm at all? Using leb128 for all numbers probably provides pretty OK compression (but mixing leb128 with general-purpose compression, like we do now, is probably a bad idea).
  • How much of the compressed data is crate metadata and how much is LLVM bitcode?
  • Could we disable compression implicitly, e.g. for debug builds, incremental compilation, and release builds that do not specify -Os or -Oz?
@oyvindln
Contributor

oyvindln commented Oct 12, 2016

A relatively easy change that might help would be to simply reduce the compression level. There are some flags that can be changed, see here, or here. Maybe there is a better trade-off that could be found without making massive changes. Another thing worth noting is that the deflate function as it's written now makes miniz allocate memory for the compressor each time it's called (I haven't counted the exact size, but it may be close to 1 MB), which may not be ideal if it's used many times in short succession.
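On the per-call allocation point, a sketch of reusing one compressor state across calls, using the flate2 crate for illustration (rustc at the time called miniz through its own wrapper):

```
use flate2::{Compress, Compression, FlushCompress};

fn main() {
    // Reuse one compressor across calls instead of paying the
    // per-call state allocation each time.
    let mut comp = Compress::new(Compression::fast(), false);
    for chunk in [&b"first blob"[..], &b"second blob"[..]] {
        let mut out = Vec::with_capacity(chunk.len() * 2 + 64);
        comp.compress_vec(chunk, &mut out, FlushCompress::Finish)
            .expect("compression failed");
        println!("{} -> {} bytes", chunk.len(), out.len());
        comp.reset(); // keep the allocated state, clear the stream
    }
}
```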

@joshtriplett
Member

joshtriplett commented Oct 14, 2016

I don't think we want to make LTO opt-in; Rust should support it by default, without needing to rebuild the world with it.

@jld
Contributor

jld commented Oct 15, 2016

Not sure how much it helps, but deflate can be parallelized.
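A sketch of one simple way to parallelize, assuming the container format records chunk boundaries (each chunk becomes an independent deflate stream; this is unlike pigz, which splices chunks into a single stream):

```
use std::io::Write;
use std::thread;

use flate2::write::DeflateEncoder;
use flate2::Compression;

/// Compress fixed-size chunks in parallel as independent deflate
/// streams. The decompressor must know the chunk boundaries, so
/// this implies a container-format change, not a drop-in swap.
fn parallel_deflate(data: &[u8], chunk_size: usize) -> Vec<Vec<u8>> {
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    let mut enc =
                        DeflateEncoder::new(Vec::new(), Compression::default());
                    enc.write_all(chunk).unwrap();
                    enc.finish().unwrap()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let data = vec![42u8; 4 << 20];
    let chunks = parallel_deflate(&data, 1 << 20);
    println!("{} chunks", chunks.len());
}
```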

@briansmith

briansmith commented Oct 17, 2016

> It's only used for LTO, which is almost never used

I disagree that LTO is almost never used. But, also, in the long run it should be used more. So I'd rather have all libraries support LTO by default.

> Rust should support it by default, without needing to rebuild the world with it.

I personally am OK with rebuilding the world to switch from non-LTO to LTO builds, as long as it doesn't require every library to opt into anything manually.

Perhaps, when building an application, Cargo could inspect the configuration, and if any configuration enables LTO, it could build every library with LTO support; otherwise it wouldn't. I don't know if this would need to inspect just the application's Cargo.toml or recursively inspect all the Cargo.tomls, but this seems like just a detail.
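For reference, the application-level switch being discussed is Cargo's profile setting, e.g.:

```
# In the application's Cargo.toml: request LTO for release builds,
# which only works if the rlibs of all dependencies carry bytecode.
[profile.release]
lto = true
```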

@lu-zero
Contributor

lu-zero commented Oct 18, 2016

Just to point out the obvious: using something different from deflate might help more, and since you are already encoding the data, it might be worth considering doing it in a single pass with lz4, lzo, or even zstd, sparing one step.

How hard would it be to plug in any of those?

@arielb1
Contributor

arielb1 commented Oct 18, 2016

zstd is supposedly a faster, equal-compression-rate version of zlib. Someone should check that.

@nnethercote
Contributor

nnethercote commented Oct 19, 2016

> As the first comment says, compression occurs in two places. More specifically, here:
>
>   • write_metadata, in src/librustc_trans/base.rs
>   • link_rlib, in src/librustc_trans/back/link.rs
>
> Both of them are significant, though how significant varies by benchmark. E.g. the former affects regex more, the latter affects syntex-incr more.

@eddyb worked out that write_metadata was compressing metadata unnecessarily for rlibs! #37267 addresses this. With that change, write_metadata still compresses for dylibs and "proc macro" crates, but AIUI they are much rarer. So that's the first part of this issue addressed.

That still leaves the bytecode compression, which is the larger part of the potential speed-up. I tried disabling bytecode compression, and the size of the rlibs for syntex increased from 54,508,120 to 75,265,964 bytes, a 1.38x increase, i.e. quite a bit. So unconditionally disabling it probably isn't feasible.

eddyb added a commit to eddyb/rust that referenced this issue Oct 19, 2016

Rollup merge of #37267 - nnethercote:opt-write_metadata, r=eddyb
Optimize `write_metadata`.

`write_metadata` currently generates metadata unnecessarily in some
cases, and also compresses it unnecessarily in some cases. This commit
fixes that. It speeds up three of the rustc-benchmarks by 1--4%.

r? @eddyb, who deserves much of the credit because he (a) identified the problem from the profile data I provided in #37086, and (b) explained to me how to fix it. Thank you, @eddyb!
@alexcrichton
Member

alexcrichton commented Oct 19, 2016

I did a small bit of analysis of compression algorithms. I extracted all the bytecode from the standard distribution rlibs (i.e. everything we're shipping), decompressed it, and then recompressed it with a bunch of algorithms, keeping track of how long everything took as well as the compression ratios.

The raw data is here where each section is the statistics for one particular piece of bytecode. The final entry is the summation of everything previous. I tested:

  • xz compression levels 0-9
  • deflate compression levels (fast, default, best)
  • brotli compression 0-9
  • zstd compression 0-9

Basically what I think this tells me is that zstd is blazingly fast and gets better compression than deflate (what we're using today) at lower levels. Otherwise xz is super slow (surprise surprise) and brotli also isn't faring too well on this data set.

Now that being said, this seems like it's a tiny portion of compiles. The deflate times for any particular bytecode are in the handfuls of milliseconds at most it looks like. If we really want speed though, zstd seems the way to go.
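The measurement loop amounts to something like the following sketch (deflate only, via flate2; the actual script also covered xz, brotli, and zstd, and `crate.bc` is a hypothetical input file):

```
use std::io::Write;
use std::time::Instant;

use flate2::write::DeflateEncoder;
use flate2::Compression;

/// Measure wall time and compression ratio for one level on one blob.
fn measure(data: &[u8], level: Compression) -> (f64, f64) {
    let start = Instant::now();
    let mut enc = DeflateEncoder::new(Vec::new(), level);
    enc.write_all(data).unwrap();
    let out = enc.finish().unwrap();
    let secs = start.elapsed().as_secs_f64();
    (secs, out.len() as f64 / data.len() as f64)
}

fn main() {
    let data = std::fs::read("crate.bc").expect("bytecode blob");
    for (name, level) in [
        ("fast", Compression::fast()),
        ("default", Compression::default()),
        ("best", Compression::best()),
    ] {
        let (secs, ratio) = measure(&data, level);
        println!("deflate/{name}: {secs:.3}s, ratio {ratio:.3}");
    }
}
```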

@nnethercote
Contributor

nnethercote commented Oct 19, 2016

Thank you for the analysis, @alexcrichton. What parameters define the "fast", "default", and "best" modes for deflate?

> Now that being said, this seems like it's a tiny portion of compiles.

For syntex-incr it's ~10% for a debug build! And that's a particularly interesting example, given that "incremental compilation is coming soon" is the standard response to any complaints about rustc's speed...

@oyvindln
Contributor

oyvindln commented Oct 19, 2016

Based on the tests done by @alexcrichton (maybe you could put up the test in a GitHub repo or something?), it seems that lowering the deflate compression level could provide a nice speedup without a huge loss in compression efficiency. As it would probably only require changing 1-2 lines of code, it may be a good idea to do this while deciding on, and possibly implementing, a change to a different compression algorithm.

@michaelwoerister
Contributor

michaelwoerister commented Oct 19, 2016

> For syntex-incr it's ~10% for a debug build! And that's a particularly interesting example, given that "incremental compilation is coming soon" is the standard response to any complaints about rustc's speed...

Is that still true now that metadata is no longer compressed for rlibs?

@michaelwoerister
Contributor

michaelwoerister commented Oct 19, 2016

I think a good first step towards improvement here would be to allow for LLVM bitcode to be either compressed or not. The way we store bitcode already contains a small header that tells us about the format, so this would be easy to implement in a backwards compatible way.

With that implemented we can just forgo compression in some scenarios (like debug builds or incr. comp.) in a transparent way.

Adding support for zstd would be nice too.
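To make the idea concrete, a hypothetical header layout with a compression-kind byte (made-up magic and offsets; the format rustc actually used differs in detail):

```
/// Hypothetical bytecode header, for illustration only:
/// magic, format version, compression kind, payload length.
const MAGIC: &[u8; 4] = b"RBCD"; // made-up magic for this sketch

#[derive(Debug, PartialEq)]
enum CompressionKind {
    None = 0,
    Deflate = 1,
}

fn write_header(out: &mut Vec<u8>, kind: u8, payload_len: u64) {
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&1u32.to_le_bytes()); // format version
    out.push(kind);                             // 0 = none, 1 = deflate
    out.extend_from_slice(&payload_len.to_le_bytes());
}

fn read_kind(data: &[u8]) -> Option<CompressionKind> {
    if data.len() < 9 || data[..4] != *MAGIC {
        return None;
    }
    match data[8] {
        0 => Some(CompressionKind::None),
        1 => Some(CompressionKind::Deflate),
        _ => None, // unknown kind, e.g. from a newer compiler
    }
}

fn main() {
    let mut buf = Vec::new();
    write_header(&mut buf, CompressionKind::None as u8, 0);
    assert_eq!(read_kind(&buf), Some(CompressionKind::None));
}
```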

@alexcrichton
Member

alexcrichton commented Oct 19, 2016

@nnethercote

The fast/default/best correspond to what's in flate2.

@oyvindln

It's pretty janky, but the script I used is here.

@michaelwoerister

Agreed we should support uncompressed bytecode! I'd also be fine with it behind a flag that we disable by default and only enable for our own releases.
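Concretely (per today's flate2 API; the 2016 version may have differed slightly), the named modes map to zlib-style numeric levels:

```
use flate2::Compression;

fn main() {
    // Named modes and their zlib-style numeric levels in flate2.
    assert_eq!(Compression::fast().level(), 1);
    assert_eq!(Compression::default().level(), 6);
    assert_eq!(Compression::best().level(), 9);
}
```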

@briansmith

briansmith commented Oct 19, 2016

> That way the Rust installation stays the same size, yet everyone gets the benefit of not having to compress bytecode. I've found that, size-wise, Cargo build directories generally don't matter as much as the Rust installation itself.

All said and done, my Visual Studio + Windows SDK setup is ~5 GB. I wouldn't blink at the Rust installation doubling in size if it meant compilation was even slightly (a few percent) faster and/or the toolchain was easier to maintain (i.e. if it didn't need to support any compression at all).

@nnethercote
Contributor

nnethercote commented Oct 20, 2016

> lowering the deflate compression level could provide a nice speedup without a huge loss in compression efficiency. As it would probably only require changing 1-2 lines of code, it may be a good idea to do this while deciding on, and possibly implementing, a change to a different compression algorithm.

This is exactly what I've done in #37298. Thank you for the suggestion.

Manishearth added a commit to Manishearth/rust that referenced this issue Oct 22, 2016

Rollup merge of #37298 - nnethercote:faster-deflate, r=alexcrichton
Use a faster `deflate` setting

In #37086 we have considered various ideas for reducing the cost of LLVM bytecode compression. This PR implements the simplest of these: use a faster `deflate` setting. It's very simple and reduces the compression time by almost half while increasing the size of the resulting rlibs by only about 2%.

I looked at using zstd, which might be able to halve the compression time again. But integrating zstd is beyond my Rust FFI integration abilities at the moment -- it consists of a few dozen C files, has a non-trivial build system, etc. I decided it was worth getting a big chunk of the possible improvement with minimum effort.

The following table shows the before and after percentages of instructions executed during compression while doing debug builds of some of the rustc-benchmarks with a stage1 compiler.
```
html5ever-2016-08-25      1.4% -> 0.7%
hyper.0.5.0               3.8% -> 2.4%
inflate-0.1.0             1.0% -> 0.5%
piston-image-0.10.3       2.9% -> 1.8%
regex.0.1.30              3.4% -> 2.1%
rust-encoding-0.3.0       4.8% -> 2.9%
syntex-0.42.2             2.9% -> 1.8%
syntex-0.42.2-incr-clean 14.2% -> 8.9%
```
The omitted ones spend 0% of their time in compression.

And here are actual timings:
```
futures-rs-test  4.110s vs  4.102s --> 1.002x faster (variance: 1.017x, 1.004x)
helloworld       0.223s vs  0.226s --> 0.986x faster (variance: 1.012x, 1.022x)
html5ever-2016-  4.218s vs  4.186s --> 1.008x faster (variance: 1.008x, 1.010x)
hyper.0.5.0      4.746s vs  4.661s --> 1.018x faster (variance: 1.002x, 1.016x)
inflate-0.1.0    4.194s vs  4.143s --> 1.012x faster (variance: 1.007x, 1.006x)
issue-32062-equ  0.317s vs  0.316s --> 1.001x faster (variance: 1.013x, 1.005x)
issue-32278-big  1.811s vs  1.825s --> 0.992x faster (variance: 1.014x, 1.006x)
jld-day15-parse  1.412s vs  1.412s --> 1.001x faster (variance: 1.019x, 1.008x)
piston-image-0. 11.058s vs 10.977s --> 1.007x faster (variance: 1.008x, 1.039x)
reddit-stress    2.331s vs  2.342s --> 0.995x faster (variance: 1.019x, 1.006x)
regex.0.1.30     2.294s vs  2.276s --> 1.008x faster (variance: 1.007x, 1.007x)
rust-encoding-0  1.963s vs  1.924s --> 1.020x faster (variance: 1.009x, 1.006x)
syntex-0.42.2   29.667s vs 29.391s --> 1.009x faster (variance: 1.002x, 1.023x)
syntex-0.42.2-i 15.257s vs 14.148s --> 1.078x faster (variance: 1.018x, 1.008x)
```

r? @alexcrichton


bors added a commit that referenced this issue Oct 22, 2016

Auto merge of #37298 - nnethercote:faster-deflate, r=alexcrichton
Use a faster `deflate` setting

@nnethercote
Contributor

nnethercote commented Nov 22, 2016

#37267 removed some useless compression, and #37298 tweaked the deflate settings to make compression faster. Compression still shows up moderately high in some profiles but I think that's enough progress to close this issue.
