Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async performance regression #70488

Closed
goffrie opened this issue Mar 28, 2020 · 28 comments
Closed

Async performance regression #70488

goffrie opened this issue Mar 28, 2020 · 28 comments
Labels
A-async-await A-codegen A-LLVM E-needs-mcve I-slow ICEBreaker-LLVM P-low T-compiler

Comments

@goffrie
Copy link
Contributor

@goffrie goffrie commented Mar 28, 2020

See #69033 (comment); @kpp reports a 200%+ regression on some futures microbenchmarks, e.g.:

future::ready/async_combinators                                                                             
                        time:   [1.5699 ns 1.5743 ns 1.5793 ns]
                        change: [+279.32% +282.34% +285.20%] (p = 0.00 < 0.05)
                        Performance has regressed.

The benchmarks are at https://github.com/kpp/futures-async-combinators/.

I bisected this to 57e1da5, which most likely implicates #69837.

@Centril Centril added T-compiler I-slow I-nominated A-async-await P-high labels Mar 28, 2020
@jonas-schievink jonas-schievink self-assigned this Mar 28, 2020
@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 28, 2020

Can reproduce. The regression comes from changing the discriminant type from u32 to u8 (or whatever fits best, which almost always is u8), exposing the niche does not regress performance much.

Isn't x86 generally less efficient when dealing with <32 bit words? In that case we might want to consider doing this optimization conditionally, since it can be a somewhat significant RAM saving on smaller microcontroller targets. I could imagine using -C opt-level=s to decide this, but I don't think there's any precedence for doing something like that at the moment.

@Mark-Simulacrum
Copy link
Member

@Mark-Simulacrum Mark-Simulacrum commented Mar 28, 2020

I think the slowdown is likely some sort of bug in our codegen or LLVM perhaps?

This is a single iteration on nightly (unrolled 8 times in the actual benchmark code):

    0.00 :   18a260:       mov    BYTE PTR [rsp+0x8],0x0
    0.80 :   18a265:       movzx  ecx,BYTE PTR [rsp]
    0.99 :   18a269:       movzx  ecx,BYTE PTR [rsp+0x1]
    1.09 :   18a26e:       movzx  ecx,BYTE PTR [rsp+0x2]
    0.87 :   18a273:       movzx  ecx,BYTE PTR [rsp+0x3]
    0.97 :   18a278:       movzx  ecx,BYTE PTR [rsp+0x4]
    0.83 :   18a27d:       movzx  ecx,BYTE PTR [rsp+0x5]
    1.32 :   18a282:       movzx  ecx,BYTE PTR [rsp+0x6]
    0.87 :   18a287:       movzx  ecx,BYTE PTR [rsp+0x7]
    0.84 :   18a28c:       movzx  ecx,BYTE PTR [rsp+0x9]
    0.86 :   18a291:       movzx  ecx,BYTE PTR [rsp+0xa]
    1.01 :   18a296:       movzx  ecx,BYTE PTR [rsp+0xb]
    1.04 :   18a29b:       movzx  ecx,BYTE PTR [rsp+0x8]

On the good commit, it looks like this (and is only unrolled 4 times):

   17.76 :   183f62:       mov    DWORD PTR [rsp],0x0
    0.00 :   183f69:       mov    edx,DWORD PTR [rsp+0x4]
    3.26 :   183f6d:       mov    edx,DWORD PTR [rsp+0x8]
    0.00 :   183f71:       mov    edx,DWORD PTR [rsp]

Trying to reduce this down, I came up with this example:

use std::future::Future;

pub async fn foo(v: i32) -> i32 { v }

macro_rules! wrap {
    ($v:expr) => {
        async move { ($v).await }
    }
}

pub fn bar(v: i32) -> impl Future<Output = i32> {
    wrap!(wrap!(wrap!(wrap!(wrap!(wrap!(foo(v)))))))
}

In godbolt, the nightly has an offset of 52 -- and that increases by 8 each additional wrap! -- whereas stable stays constant at 4, the size of i32 (presumably). I think that's the bug here, though I'm not sure how the discriminant size managed to affect that. Maybe previously we were getting split into two separate LLVM "words" via SROA or something and now we aren't since we fit into a single register or something like that? That could I guess plausibly lead to problems here; my assumption is that the benchmark is loading many more words from the stack on nightly because the "size" of the future has increased drastically? OTOH, size_of_val etc do not report a size increase -- so why the offset is increasing I'm not sure. But I suspect it's for the same reason we're seeing more stack loads.

Nightly:

example::foo:
        mov     eax, edi
        ret

example::bar:
        mov     rax, rdi
        mov     dword ptr [rdi], esi
        mov     byte ptr [rdi + 52], 0
        ret

and on stable:

example::foo:
        mov     eax, edi
        ret

example::bar:
        mov     rax, rdi
        mov     dword ptr [rdi], esi
        mov     dword ptr [rdi + 4], 0
        ret

@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 28, 2020

Thanks, that's helpful!

@jonas-schievink jonas-schievink added the A-codegen label Mar 28, 2020
@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 28, 2020

The dword vs. byte moves can be explained by the alignment changing from 4 (due to using u32 as the discriminant) to 1 (since we now use u8 as the discriminant), as long as nothing with a larger alignment is contained.

@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 28, 2020

In godbolt, the nightly has an offset of 52 -- and that increases by 8 each additional wrap! -- whereas stable stays constant at 4, the size of i32 (presumably).

Not entirely certain, but I think this is just automatic field reordering doing its thing. The discriminant is now a u8 with alignment 1, so it gets moved to the very end of the memory layout to not waste space with padding. Previously it was a u32 with alignment 4, just like all the other fields in the generator, so reordering anything there doesn't make sense.

If I take your example and replace i32 with bool, the size of foo goes from 8 to 2 bytes (and its alignment from 4 to 1), while bar goes from 56 to 14 bytes (its alignment also goes from 4 to 1).

So far this looks like a fairly expected outcome of my PR; it's pretty well-known that smaller alignment can generate less efficient code (OTOH apparently modern x86 chips can do unaligned moves just as fast as aligned ones, so maybe the byte-wise moves should not be emitted, but in the end this is completely up to LLVM). Since, AFAICT, it should only affect futures that do not store any data with alignment in them, the impact should be pretty minimal in practice, so we can also consider closing this without a fix.

(this does make me wish that we had better layout debugging capabilities in rustc though)

@memoryruins
Copy link
Contributor

@memoryruins memoryruins commented Mar 28, 2020

this does make me wish that we had better layout debugging capabilities in rustc

A #[rustc_layout(debug)] attribute was added in #69901 (edit: heh, just noticed you're the first comment on it ^^). With the help of type_alias_impl_trait, rustc output some layout information for bar's future: playground

@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 28, 2020

Ah, nice to see that land, that's certainly a step in the right direction

@nagisa
Copy link
Member

@nagisa nagisa commented Mar 28, 2020

@jonas-schievink a quick test to see if this is indeed an alignment problem is to stick a #[repr(align(4))]-equivalent onto the generator.

@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 28, 2020

That does not seem to do anything, hmm... Maybe I'm doing it wrong though. Would be useful to get an MCVE that exhibits the inefficient codegen.

@jonas-schievink jonas-schievink added the E-needs-mcve label Mar 28, 2020
@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 28, 2020

I've just checked the benchmark code, and it does not actually benchmark anything except copying the futures around. They are never polled because they are never spawned onto an executor.

The benchmark results would make sense here if the alignment is really the cause.

@kpp
Copy link
Contributor

@kpp kpp commented Mar 28, 2020

How so?

    executor::block_on(async {
        let mut group = c.benchmark_group("future::ready");

        group.bench_function("futures", |b| {
            b.iter(move || async {
                black_box(futures::future::ready(42)).await
            })
        });

We got block_on and .await.

@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 28, 2020

Enabling Criterion's real_blackbox feature significantly speeds up the future::ready benchmarks. The default implementation does more copying.

With the feature enabled the benchmarks now take <1ns, so yeah, they were really just benchmarking the memcpy in black_box.

@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 28, 2020

@kpp block_on only blocks on the outer async block, which does not await anything. The futures created by the closure passed to iter are just dropped without being spawned or awaited (which is what is actually getting benchmarked here). You can test this by putting a panic!() inside the inner async blocks. It is never reached.

@kpp
Copy link
Contributor

@kpp kpp commented Mar 28, 2020

Ooops. Never felt more embarrassed.

@kpp
Copy link
Contributor

@kpp kpp commented Mar 28, 2020

Jonas, does this sounds better to you?

fn bench_ready(c: &mut Criterion) {
    let mut group = c.benchmark_group("future::ready");

    group.bench_function("futures", |b| {
        b.iter(|| {
            executor::block_on(async {
                black_box(futures::future::ready(42)).await;
            })
        })
    });

@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 28, 2020

I'd wrap the block_on in a black_box to make sure the entire thing gets evaluated (and not just the expression producing the future)

@kpp
Copy link
Contributor

@kpp kpp commented Mar 28, 2020

black_box(futures::future::ready(42).await); ?

@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 29, 2020

Reproduction:

use std::{mem, ptr};

use std::future::Future;

async fn ready<T>(t: T) -> T {
    t
}

fn black_box<T>(dummy: T) -> T {
    unsafe {
        let ret = std::ptr::read_volatile(&dummy);
        std::mem::forget(dummy);
        ret
    }
}

#[inline(never)]
fn iter<O, R>(mut routine: R)
where
    R: FnMut() -> O,
{
    loop {
        black_box(routine());
    }
}

pub unsafe fn ready_bench() {
    iter(move || async {
        black_box(ready(42)).await
    });
}

On stable:

playground::iter:
	sub	rsp, 16

.LBB0_1:
	mov	dword ptr [rsp], 0
	mov	eax, dword ptr [rsp + 4]
	mov	eax, dword ptr [rsp + 8]
	mov	eax, dword ptr [rsp]
	jmp	.LBB0_1

playground::ready_bench:
	push	rax
	call	playground::iter
	ud2

On nightly:

playground::iter: # @playground::iter
# %bb.0:
	sub	rsp, 16

.LBB0_1:                                # =>This Inner Loop Header: Depth=1
	mov	byte ptr [rsp + 8], 0
	movzx	eax, byte ptr [rsp]
	movzx	eax, byte ptr [rsp + 1]
	movzx	eax, byte ptr [rsp + 2]
	movzx	eax, byte ptr [rsp + 3]
	movzx	eax, byte ptr [rsp + 4]
	movzx	eax, byte ptr [rsp + 5]
	movzx	eax, byte ptr [rsp + 6]
	movzx	eax, byte ptr [rsp + 7]
	movzx	eax, byte ptr [rsp + 9]
	movzx	eax, byte ptr [rsp + 10]
	movzx	eax, byte ptr [rsp + 11]
	movzx	eax, byte ptr [rsp + 8]
	jmp	.LBB0_1
                                        # -- End function

playground::ready_bench: # @playground::ready_bench
# %bb.0:
	push	rax
	call	playground::iter
	ud2
                                        # -- End function

@goffrie
Copy link
Contributor Author

@goffrie goffrie commented Mar 29, 2020

@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 29, 2020

The actual LLVM type did change to have an alignment of 1:

; stable
%"std::future::GenFuture<ready_bench::{{closure}}::{{closure}}>" = type { [0 x i32], %"ready_bench::{{closure}}::{{closure}}", [0 x i32] }
%"ready_bench::{{closure}}::{{closure}}" = type { [0 x i32], i32, [2 x i32] }

; nightly
%"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>" = type { [0 x i32], %"ready_bench::{{closure}}::{{closure}}", [0 x i32] }
%"ready_bench::{{closure}}::{{closure}}" = type { [8 x i8], i8, [3 x i8] }

One odd thing is that the volatile load in the LLVM IR is passed align 8, so I'm wondering why that isn't used to make the load more efficient here.

; playground::iter
; Function Attrs: noinline noreturn nounwind nonlazybind uwtable
define internal fastcc void @_ZN10playground4iter17h1130617e37f08519E() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %_3 = alloca %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", align 8
  %_3.0.sroa_cast = bitcast %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3 to i8*
  %_3.8.sroa_idx = getelementptr inbounds %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3, i64 0, i32 1, i32 1
  br label %bb1

bb1:                                              ; preds = %bb1, %start
  call void @llvm.lifetime.start.p0i8(i64 12, i8* nonnull %_3.0.sroa_cast)
  store i8 0, i8* %_3.8.sroa_idx, align 8, !alias.scope !2
  %_3.0. = load volatile %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3, align 8, !alias.scope !8, !noalias !11
  call void @llvm.lifetime.end.p0i8(i64 12, i8* nonnull %_3.0.sroa_cast)
  br label %bb1
}

@LifeIsStrange
Copy link

@LifeIsStrange LifeIsStrange commented Mar 30, 2020

@jonas-schievink

Isn't x86 generally less efficient when dealing with <32 bit words?

Typically, CPUs are fastest at operating on integers of their native integer size. On x86, 32-bit arithmetics are can be twice as fast compare to 16-bit operands.

Cf: http://www.idryman.org/blog/2012/11/21/integer-promotion/

In C this potential cause of performance regression cannot happen because of Integer promotion

If so, couldn't rust support integer promotion? At the very least as an optional flag in order to enforce performance garanties?
In the context of this issue, toggling on this flag would allow to test the hypothesis and instantaneously resolve or not the root cause of the performance regression.

@jonas-schievink jonas-schievink added the A-LLVM label Mar 30, 2020
@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 30, 2020

The PR that caused this intentionally did the opposite of integer promotion (narrowing a u32 to a u8), just not in a way that's visible to users. Generally rustc/LLVM is free to do integer promotion or narrowing whenever it sees fit, as long as it doesn't change the observed behavior of the program.

I haven't completely nailed down this issue yet as I'm not an LLVM expert, but it seems like it's caused by LLVM using only the LLVM type for alignment decisions, not the explicit alignment of the load (Presumably this doesn't always happen for loads or it would have been caught earlier? Are only volatile loads affected?).

@LifeIsStrange
Copy link

@LifeIsStrange LifeIsStrange commented Mar 30, 2020

I have no idea if this is really relevant to the conversation but LLVM is introducing a new alignment type that could solve some issues http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html

@tmandry
Copy link
Contributor

@tmandry tmandry commented Mar 31, 2020

We discussed this in the wg-async-foundations meeting and several people wanted to close as won't fix. Without evidence that this causes performance regressions in real-world code, it may not be worth pursuing further. We also don't think it's worth making generators bigger or changing their alignment to improve a microbenchmark.

That said, it does seem like LLVM could be generating better assembly, so pinging those folks for more input:

@rustbot ping llvm

@rustbot rustbot added the ICEBreaker-LLVM label Mar 31, 2020
@rustbot
Copy link
Collaborator

@rustbot rustbot commented Mar 31, 2020

Hey LLVM ICE-breakers! This bug has been identified as a good
"LLVM ICE-breaking candidate". In case it's useful, here are some
instructions for tackling these sorts of bugs. Maybe take a look?
Thanks! <3

cc @comex @cuviper @DutchGhost @hanna-kruppe @hdhoang @heyrutvik @JOE1994 @jryans @mmilenko @nagisa @nikic @Noah-Kennedy @SiavoshZarrasvand @spastorino @vertexclique @vgxbj

@jonas-schievink jonas-schievink removed their assignment Mar 31, 2020
@tmandry tmandry added P-low and removed P-high labels Mar 31, 2020
@comex
Copy link
Contributor

@comex comex commented Mar 31, 2020

For the "reproduction", the issue isn't the alignment; it's the way the example reimplements black_box in terms of a volatile load of an entire struct, as an attempt to make it work on stable. The actual std::hint::black_box is not implemented this way; if you change the example to use it, it will only work on nightly, but it doesn't show the wasteful assembly.

Longer explanation:

LLVM implements volatile loads/stores of structs as volatile loads/stores of their fields; if the struct consists of N fields, it will always emit N separate load/store instructions. (Without volatile it will optimize them into something nicer.) Note that LLVM does this regardless of the size of struct, as long as the LLVM IR has a load of struct type (load volatile %AlignedBytes, %AlignedBytes* %src). But if the struct is small enough (say, 8 u8s), then rustc will instead produce LLVM IR containing a load of integer type (load volatile i64, i64* %1), so you get a single machine load instruction even though it's volatile. Interestingly, Clang does the same thing for C code. I wonder why.

Regardless, the result is unpredictable, nonintuitive, and undocumented, in both Rust and C. Luckily, nobody actually uses volatile loads/stores of entire structs in cases where it matters, like accessing hardware MMIO registers: the registers may or may not be represented in a struct, but the actual loads and stores are done to individual fields.

@jonas-schievink
Copy link
Member

@jonas-schievink jonas-schievink commented Mar 31, 2020

LLVM implements volatile loads/stores of structs as volatile loads/stores of their fields; if the struct consists of N fields, it will always emit N separate load/store instructions.

Ah, I see, that explains the emitted code, thanks! It also explains what was observed in #66595 I suppose.

Well, I don't think there's much of a point in leaving this issue open then, so closing.

@MikailBag
Copy link
Contributor

@MikailBag MikailBag commented Apr 1, 2020

To sum up.
Am I right than performance regression happens only when all conditions are met?

  1. Custom black_box, based on volatile (instead of std::hint::black_box)
  2. Future is created, but neither polled not awaited.
  3. Future is tiny.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-async-await A-codegen A-LLVM E-needs-mcve I-slow ICEBreaker-LLVM P-low T-compiler
Projects
None yet
Development

No branches or pull requests