Slower performance caused only by using LTO #48371

Open
pmarcelll opened this issue Feb 20, 2018 · 14 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@pmarcelll
Contributor

pmarcelll commented Feb 20, 2018

The Computer Language Benchmarks Game was on the Rust subreddit recently, and while checking the numbers for Rust I noticed that the Rust solution for the fasta benchmark is much slower than the C version, even though the two work fairly similarly (the multithreading in the C version is based on the Rust version). It turned out that the Rust benchmarks are compiled with LTO by default, and when I tested the code on my machine without LTO (on both stable and nightly), it was almost as fast as the C version. I tried to find an existing issue, but most of them are about slow compilation, not slow runtime.

The interesting thing is that the CPU monitor graph clearly shows all CPU cores at only 70% usage during the last part of the benchmark (as if a mutex were held for too long). I also checked the binary size: it went down from 4.4 MB to 3.1 MB with LTO.

EDIT: I also tested it with the Mutex from parking_lot. It's still slow with LTO, but without LTO it's a tiny bit faster than the C version.

@pietroalbini pietroalbini added I-slow Issue: Problems and improvements with respect to performance of generated code. C-enhancement Category: An issue proposing an enhancement or a PR with one. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 20, 2018
@pietroalbini
Member

Could this be related to thinlto + multiple codegen units?

@nagisa
Member

nagisa commented Feb 20, 2018

Please try with -C codegen-units=1.

@pmarcelll
Contributor Author

With -C codegen-units=1 the LTO version is the fastest.

@pietroalbini
Member

Yep, this is a known issue with multiple codegen units on release builds, which were enabled in the latest release: #47745
At the moment the solution to this, as you saw, is setting the codegen units to 1 (either via rustc args or Cargo.toml).

Closing as a duplicate.
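
For readers looking for the Cargo.toml form of the workaround mentioned above, a minimal sketch follows. It assumes the standard `lto` and `codegen-units` profile keys; whether these values help a particular project is something to measure rather than take for granted.

```toml
[profile.release]
lto = true          # full ("fat") LTO, the configuration discussed in this issue
codegen-units = 1   # workaround: compile each crate as a single codegen unit
```

The rustc-argument form is the one already used in this thread, -C codegen-units=1.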

@pmarcelll
Contributor Author

I saw that issue but I thought that ThinLTO works differently than regular LTO, so I opened this one.

@pmarcelll
Contributor Author

Compiling with either cargo +nightly rustc --release -- -Clto=thin or cargo +nightly build --release produces results similar to the C version.

Compiling with cargo +nightly rustc --release -- -Clto=fat gives the slow numbers (as seen in the actual benchmark). So this is not caused by ThinLTO.

@robsmith11

Another example with a pretty significant slowdown:

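extern crate rand;

use rand::{Rng, SeedableRng, XorShiftRng};

fn main() {
    // Fixed seed so every run draws the same sequence.
    let mut rng: XorShiftRng = SeedableRng::from_seed([1, 2, 3, 4]);
    // Running maximum of the samples drawn so far.
    let mut m: f64 = 0.0;
    for _ in 0..1_000_000_000 {
        // `x` is inferred as f64 from its use with `m`.
        let x = rng.gen();
        if x > m {
            m = x;
        }
    }
    println!("{}", m);
}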

Without lto: 3.065 seconds
With lto: 5.116 seconds
With lto and codegen-units=1: 3.106 seconds
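
For anyone who wants to try a similar loop without depending on the rand crate, here is a rough, dependency-free sketch. Everything in it is an assumption made for illustration: `XorShift64` and `max_sample` are hypothetical names, the generator is a plain 64-bit xorshift rather than rand's XorShiftRng, and the default iteration count is reduced so it finishes quickly. It is not guaranteed to show the same slowdown as the code above.

```rust
use std::time::Instant;

/// Hand-rolled 64-bit xorshift generator (a stand-in, not rand's XorShiftRng).
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform f64 in [0, 1), built from the top 53 bits of the next output.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Running maximum over `iters` samples, the same shape of loop as above.
fn max_sample(seed: u64, iters: u64) -> f64 {
    let mut rng = XorShift64(seed);
    let mut m = 0.0_f64;
    for _ in 0..iters {
        let x = rng.next_f64();
        if x > m {
            m = x;
        }
    }
    m
}

fn main() {
    // Iteration count from the first argument; the default is deliberately small.
    let iters: u64 = std::env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(10_000_000);

    let start = Instant::now();
    let m = max_sample(0x1234_5678_9abc_def1, iters);
    println!("max = {}, elapsed = {:?}", m, start.elapsed());
}
```

Passing 1000000000 as the first argument matches the iteration count used above; building once with fat LTO alone and once with fat LTO plus -C codegen-units=1 gives the two timings to compare.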

@pmarcelll
Contributor Author

@nagisa I noticed that you added this issue to the list in #47745, but as I showed in my previous comment, my problem is not with ThinLTO but with regular/fat LTO. If the underlying issue really is the same, can you note there that fat LTO can also cause a slowdown with multiple codegen units? And if it's a different issue, can you please reopen this one (or find the appropriate one)? Thanks!

@nagisa
Member

nagisa commented Feb 24, 2018

Removed it.

@robsmith11

robsmith11 commented Feb 24, 2018

I agree that this bug should be reopened. My performance regression is also only with traditional "fat" LTO.

Why are multiple codegen units being used by default anyway? I thought they were only going to be used with thin LTO.

EDIT:
Ah, I didn't realize that thin LTO was also enabled by default. I suppose if thin LTO always results in run-time performance as good as fat LTO, then this is fine. But if fat LTO will still be useful in the future, maybe it should force codegen-units=1.

@nagisa nagisa reopened this Feb 24, 2018
@ollie27
Member

ollie27 commented Feb 25, 2018

FWIW #47866 was another issue where multiple codegen units + fat LTO produced worse code.

@pmarcelll
Contributor Author

@ollie27 I didn't find that issue because it was already closed when I opened this one, but it is probably the same problem. The conclusion in #47866 was that this is just how fat LTO works. Unfortunately, I'm not the one who compiles the code in my case, so the best I can do is convince the maintainers of the benchmarks game to compile with codegen-units=1.

I won't close this issue yet since @robsmith11's code is pretty small, so it might be good for further investigation.

@robsmith11

As I mentioned in my previous comment, I think the solution to this is that fat LTO should default to codegen-units=1, not 16 or whatever the new default is. Fat LTO isn't designed for good run-time performance with multiple codegen-units, only thin LTO is.

@steveklabnik
Member

Triage: we've changed the defaults around this a bunch of times, but I'm not sure what they are today.

6 participants