Slower performance caused only by using LTO #48371

pmarcelll · 2018-02-20T14:28:30Z

The Computer Language Benchmarks Game came up on the Rust subreddit recently, and while checking the numbers for Rust I noticed that the Rust solution for the fasta benchmark is much slower than the C version, even though the two work fairly similarly (the multithreading in the C version is based on the Rust version). It turned out that the Rust benchmarks are compiled with LTO by default; when I tested the code on my machine without LTO (both stable and nightly), it was almost as fast as the C version. I tried to find an existing issue, but most of them are about slow compilation, not slow runtime.

The interesting thing is that the CPU monitor graph clearly shows all CPU cores at only 70% usage during the last part of the benchmark (as if a mutex were locked for too long). I also checked the binary size: it went down from 4.4 MB to 3.1 MB with LTO.

EDIT: I also tested it with the Mutex from parking_lot. It's still slow with LTO, but without LTO it's a tiny bit faster than the C version.

pietroalbini · 2018-02-20T15:25:34Z

Could this be related to thinlto + multiple codegen units?

nagisa · 2018-02-20T16:42:27Z

Please try with -C codegen-units=1.

pmarcelll · 2018-02-20T20:32:11Z

With -C codegen-units=1 the LTO version is the fastest.

pietroalbini · 2018-02-20T20:41:26Z

Yep, this is a known issue with multiple codegen units on release builds, which were enabled in the latest release: #47745
At the moment the solution, as you saw, is to set the codegen units to 1 (either via rustc args or Cargo.toml).
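
For the Cargo.toml route, a minimal sketch of the release profile might look like the fragment below. It assumes the project already builds with LTO turned on in its release profile; the rustc-args equivalent is the -C codegen-units=1 flag mentioned earlier in the thread.

```toml
# Cargo.toml (sketch; assumes LTO is wanted for release builds)
[profile.release]
lto = true          # regular ("fat") LTO
codegen-units = 1   # one codegen unit, the workaround discussed in this thread
```

With this profile, a plain cargo build --release picks up both settings without extra command-line flags.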

Closing as a duplicate.

pmarcelll · 2018-02-20T20:52:24Z

I saw that issue but I thought that ThinLTO works differently than regular LTO, so I opened this one.

pmarcelll · 2018-02-20T21:05:12Z

Compiling with cargo +nightly rustc --release -- -Clto=thin and with cargo +nightly build --release both produce results similar to the C version.

Compiling with cargo +nightly rustc --release -- -Clto=fat gives the slow numbers (as seen in the actual benchmark). So this is not caused by ThinLTO.

robsmith11 · 2018-02-21T14:17:13Z

Another example with a pretty significant slowdown:

extern crate rand;

use rand::{Rng,SeedableRng,XorShiftRng};

fn main() {
    let mut rng:XorShiftRng = SeedableRng::from_seed([1,2,3,4]);
    let mut m:f64 = 0.0;
    for _ in 0..1_000_000_000 {
        let x = rng.gen();
        if x > m { m = x; }
    }
    println!("{}", m);
}

Without lto: 3.065 seconds
With lto: 5.116 seconds
With lto and codegen-units=1: 3.106 seconds
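
For anyone who wants to try the same kind of loop without pulling in the rand crate, here is a dependency-free sketch. The XorShift64 type, its shift constants, the seed, and the max_of_samples helper are all assumptions made for illustration, not what rand's XorShiftRng does internally, and the iteration count is reduced. Whether this variant shows the same LTO slowdown has not been measured here.

```rust
use std::time::Instant;

/// Minimal xorshift64 generator (hypothetical stand-in for rand's XorShiftRng).
/// The state must be seeded with a nonzero value.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform f64 in [0, 1), built from the top 53 bits of the next output.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Running maximum over `n` samples, mirroring the loop in the report above.
fn max_of_samples(seed: u64, n: u64) -> f64 {
    let mut rng = XorShift64(seed);
    let mut m = 0.0_f64;
    for _ in 0..n {
        let x = rng.next_f64();
        if x > m {
            m = x;
        }
    }
    m
}

fn main() {
    let start = Instant::now();
    let m = max_of_samples(0x2545_F491_4F6C_DD1D, 100_000_000);
    println!("max = {}, elapsed = {:?}", m, start.elapsed());
}
```

Building it once per configuration (no LTO, LTO, LTO with codegen-units=1) and comparing the elapsed times would be one way to check whether the difference reproduces on a given toolchain.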

pmarcelll · 2018-02-23T21:37:17Z

@nagisa I noticed that you added this issue to the list in #47745, but as I showed in my previous comment, my problem is not with ThinLTO but with regular/fat LTO. If the underlying issue is really the same, can you note that fat LTO can also cause a slowdown with multiple codegen units? And if it's a different issue, can you please reopen this issue (or find the appropriate one)? Thanks!

nagisa · 2018-02-24T08:28:20Z

Removed it.

robsmith11 · 2018-02-24T18:05:29Z

I agree that this bug should be reopened. My performance regression is also only with traditional "fat" LTO.

Why are multiple codegen units being used by default anyway? I thought they were only going to be used with thin LTO.

EDIT:
Ah, I didn't realize that thin LTO was also enabled by default. I suppose if thin LTO always results in run-time performance as good as fat LTO, then this is fine. But if fat LTO will still be useful in the future, maybe it should force codegen-units=1.

ollie27 · 2018-02-25T16:26:00Z

FWIW #47866 was another issue where multiple codegen units + fat lto produced worse code.

pmarcelll · 2018-02-25T16:50:23Z

@ollie27 I didn't find that issue since it was already closed when I opened this one, but it is probably the same issue. The conclusion in #47866 was that it's just how fat LTO works. Unfortunately, I'm not the one who compiles the code in my case, so the best I can do is to convince the maintainers of the Benchmarks Game to compile with codegen-units=1.

I won't close this issue yet since @robsmith11's code is pretty small, so it might be good for further investigation.

robsmith11 · 2018-02-26T07:23:29Z

As I mentioned in my previous comment, I think the solution to this is that fat LTO should default to codegen-units=1, not 16 or whatever the new default is. Fat LTO isn't designed for good run-time performance with multiple codegen-units, only thin LTO is.

steveklabnik · 2019-09-24T02:09:13Z

Triage; we've changed the defaults around this a bunch of times, but I'm not sure what they are today.

pietroalbini added the I-slow, C-enhancement, and T-compiler labels Feb 20, 2018

pietroalbini closed this as completed Feb 20, 2018

nagisa mentioned this issue Feb 20, 2018

codegen-units + ThinLTO is not as good as codegen-units = 1 #47745

Open

nagisa reopened this Feb 24, 2018

manuelVo mentioned this issue Aug 28, 2018

Iterating over ranges generates overflow check with opt-level = "z" and lto enabled #53627

Closed
