println!() prevents optimization by capturing pointers #50519

Open
df5602 opened this Issue May 7, 2018 · 5 comments

df5602 commented May 7, 2018

This weekend I ran some benchmarks on some of my code. After making a seemingly insignificant code change I noticed a small but measurable performance regression. While investigating the generated assembly, I stumbled upon a case where the compiler emits code that is not optimal.

This minimal example shows the same behaviour (Playground link):

extern crate rand;

use std::f32;
use rand::Rng;

fn main() {
    let mut list = [0.0; 16];
    let mut rg = rand::thread_rng();

    // Random initialization to prevent the compiler from optimizing the whole example away
    for i in 0..list.len() {
        list[i] = rg.gen_range(0.0, 0.1);
    }

    let mut lowest = f32::INFINITY;

    for i in 0..list.len() {
        lowest = if list[i] < lowest {    // <<<<<<<<<<<<<<<
            list[i]
        } else {
            lowest
        };
    }

    println!("{}", lowest);
}

When compiling with the --release flag, the compiler generates the following instructions for the marked block:

...
minss	%xmm0, %xmm1
movss	88(%rsp), %xmm0
minss	%xmm1, %xmm0
movss	92(%rsp), %xmm1
...

However, if I replace those lines with the following:

if list[i] < lowest {
    lowest = list[i];
}

the compiler emits a strange series of float compare and jump instructions:

.LBB5_38:
	movss	92(%rsp), %xmm1
	ucomiss	%xmm1, %xmm0
	ja	.LBB5_39
...
.LBB5_42:
	movss	100(%rsp), %xmm1
	ucomiss	%xmm1, %xmm0
	ja	.LBB5_43
...
.LBB5_39:
	movss	%xmm1, 12(%rsp)
	movaps	%xmm1, %xmm0
	movss	96(%rsp), %xmm1
	ucomiss	%xmm1, %xmm0
	jbe	.LBB5_42

As a comparison, both gcc and clang can optimize a similar C++ example:

#include <stdlib.h>
#include <iostream>

using namespace std;

int main() {
    float list[16];
    for(size_t i = 0; i < 16; ++i) {
        list[i] = rand();
    }

    float lowest = 1000.0f;

    for (size_t i = 0; i < 16; ++i) {
        
        /* Variant A: */
        //lowest = list[i] < lowest ? list[i] : lowest;

        /* Variant B: */
        if (list[i] < lowest) {
            lowest = list[i];
        }
    }

    cout << lowest;
}

Both compilers generate minss instructions for both variants.
(Godbolt)

I wasn't sure whether rustc or LLVM were responsible for this behaviour, however after a quick glance at the generated LLVM IR, I'm tending towards rustc, since in the first case it emits fcmp and select instructions, while in the latter it generates fcmp and br.

What do you think?

nikic commented Dec 15, 2018

In this reduced example, minss is generated for both cases since rust 1.25: https://godbolt.org/z/wz8Kmk

Replacing the return with println brings the problem back.
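
The contents of the Godbolt link are not reproduced here. As a rough sketch of what a reduced example along these lines might look like, both variants can be written as functions that return the result instead of printing it. The function names `min_select` and `min_branch` are made up for illustration and are not taken from the link.

```rust
/// Variant A: assign the result of an `if` expression on every iteration.
/// (Hypothetical name, for illustration only.)
pub fn min_select(list: &[f32; 16]) -> f32 {
    let mut lowest = f32::INFINITY;
    for i in 0..list.len() {
        lowest = if list[i] < lowest { list[i] } else { lowest };
    }
    lowest
}

/// Variant B: overwrite `lowest` only when a smaller element is found.
/// (Hypothetical name, for illustration only.)
pub fn min_branch(list: &[f32; 16]) -> f32 {
    let mut lowest = f32::INFINITY;
    for i in 0..list.len() {
        if list[i] < lowest {
            lowest = list[i];
        }
    }
    lowest
}
```

Since neither function formats `lowest`, its address is never taken, which matches the situation in which minss is reported for both cases.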

nikic added the A-LLVM label Dec 15, 2018

nikic commented Dec 15, 2018

Okay, the relevant difference that println! introduces is that it takes the address of lowest. If you use println!("{}", {lowest}) instead, the issue goes away.

Taking the address prevents the conversion of lowest from an alloca into an SSA value, and that's going to inhibit lots of optimizations (including the select formation desired here).
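
A minimal sketch of the two source-level patterns, assuming the loop from the original report. The names `print_by_ref` and `print_by_copy` are hypothetical. Whether a given compiler version actually emits minss for either one would have to be checked in the generated assembly; the sketch only shows where the address of `lowest` is or is not taken.

```rust
/// Formats `lowest` directly. `println!` borrows the variable, so its
/// address is taken and handed to the formatting machinery.
pub fn print_by_ref(list: &[f32; 16]) -> f32 {
    let mut lowest = f32::INFINITY;
    for i in 0..list.len() {
        if list[i] < lowest {
            lowest = list[i];
        }
    }
    println!("{}", lowest);
    lowest
}

/// Formats a temporary copy. The block expression `{ lowest }` copies the
/// value out, so `println!` borrows the temporary rather than `lowest`.
pub fn print_by_copy(list: &[f32; 16]) -> f32 {
    let mut lowest = f32::INFINITY;
    for i in 0..list.len() {
        if list[i] < lowest {
            lowest = list[i];
        }
    }
    println!("{}", { lowest });
    lowest
}
```

Both functions print and return the same value; the only difference is which memory location the formatting code receives a reference to.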

The good news is that this is probably not going to affect real code much, though I am concerned about cases where you have conditional debugging code that includes formatting.

Two ways this could be fixed:

  • On the LLVM side: Perform calculation on SSA values and only write it back into the alloca slot when the address is taken. Might not always be profitable and probably isn't even possible with the information LLVM has (e.g. it does not know that we won't modify memory through the pointer).
  • On the rust side: Avoid taking the reference, at least in cases where the value is small.

nikic changed the title from "Missed optimization: Compiler sometimes emits float compare + jump instead of MINSS/MAXSS" to "println!() prevents optimization by capturing pointers" Dec 23, 2018

nikic commented Dec 23, 2018

@rkruppe @nagisa @eddyb Any idea what we can do here? I think it's pretty bad that println!() (or anything else using formatting, such as logging) breaks unrelated optimizations by capturing pointers. Formatting any part of a structure is enough to break optimization on all of it, as LLVM doesn't know which part of the structure we're actually going to use; it only sees that some pointer into it is stored.

It would be great if we could force a copy of the formatted value before taking the pointer, but I'm not sure how to do that on a technical level. We'd only want to do this for specific types (integers and floats), but println! is expanded long before this type information is available.
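
Until something like that exists in the macro expansion, a copy can be forced by hand at the call site. The following is a sketch under assumptions: the `Stats` struct and the `scan` function are hypothetical, and whether the manual copy actually restores the optimization in a given build would need to be confirmed in the generated code.

```rust
/// Hypothetical accumulator struct, used only for illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Stats {
    pub lowest: f32,
    pub highest: f32,
}

/// Tracks the minimum and maximum of `list`, then prints the minimum.
pub fn scan(list: &[f32]) -> Stats {
    let mut stats = Stats {
        lowest: f32::INFINITY,
        highest: f32::NEG_INFINITY,
    };
    for &x in list {
        if x < stats.lowest {
            stats.lowest = x;
        }
        if x > stats.highest {
            stats.highest = x;
        }
    }

    // Writing `println!("{}", stats.lowest)` would store a pointer into
    // `stats` itself. Copying the field into a local first means only the
    // local is borrowed by the formatting code.
    let lowest = stats.lowest;
    println!("{}", lowest);

    stats
}
```

The manual copy is only practical for small `Copy` types such as integers and floats, which are the types the comment above singles out.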

rkruppe commented Dec 23, 2018

If changing how println etc. are expanded is on the table, maybe this is one more reason to consider a more performance-oriented expansion that generates direct inlineable calls, as outlined by @seanmonstar here. If for example the <f32 as Debug>::fmt calls are inlined (at least enough to not need to pass a reference), would that solve this issue?

nagisa commented Dec 23, 2018

> It would be great if we could force a copy of the formatted value before taking the pointer, but I'm not sure how to do that on a technical level. We'd only want to do this for specific types (integers and floats), but println! is expanded long before this type information is available.

The formatting machinery has been specifically crafted to minimize code size rather than maximize speed (desirable for panics). That trade-off will eventually come at some cost somewhere, which is what we are seeing here. If we can find ways to improve the println! expansion without sacrificing size, that would be great...
