
Bad codegen for boolean reductions on thumbv7neon #146

Open

hsivonen opened this issue Jul 16, 2021 · 29 comments

Labels: A-codegen (Area: Code generation), A-LLVM (Area: LLVM), C-bug (Category: Bug)
@hsivonen (Member)

Using cargo build --release --target thumbv7neon-unknown-linux-gnueabihf:

With packed_simd_2, this:

use packed_simd_2::m8x16;

#[no_mangle]
#[inline(never)]
fn packed_any(s: m8x16) -> bool {
    s.any()
}

compiles to this:

00003e66 <packed_any>:
    3e66:       f960 0acf       vld1.64 {d16-d17}, [r0]
    3e6a:       ff40 0aa1       vpmax.u8        d16, d16, d17
    3e6e:       ec51 0b30       vmov    r0, r1, d16
    3e72:       4308            orrs    r0, r1
    3e74:       bf18            it      ne
    3e76:       2001            movne   r0, #1
    3e78:       4770            bx      lr
        ...

With core_simd, this:

use core_simd::mask8x16;

#[no_mangle]
#[inline(never)]
fn core_any(s: mask8x16) -> bool {
    s.any()
}

compiles to this:

00003e66 <core_any>:
    3e66:       b5f0            push    {r4, r5, r6, r7, lr}
    3e68:       f960 0acf       vld1.64 {d16-d17}, [r0]
    3e6c:       eed0 0bb0       vmov.u8 r0, d16[1]
    3e70:       eed0 1b90       vmov.u8 r1, d16[0]
    3e74:       eed0 2bd0       vmov.u8 r2, d16[2]
    3e78:       eed0 3bf0       vmov.u8 r3, d16[3]
    3e7c:       eef0 cb90       vmov.u8 ip, d16[4]
    3e80:       eef0 ebb0       vmov.u8 lr, d16[5]
    3e84:       eef0 4bd0       vmov.u8 r4, d16[6]
    3e88:       eef0 7bf0       vmov.u8 r7, d16[7]
    3e8c:       eed1 5bf0       vmov.u8 r5, d17[3]
    3e90:       eef1 6b90       vmov.u8 r6, d17[4]
    3e94:       4308            orrs    r0, r1
    3e96:       eed1 1b90       vmov.u8 r1, d17[0]
    3e9a:       4310            orrs    r0, r2
    3e9c:       eed1 2bb0       vmov.u8 r2, d17[1]
    3ea0:       4318            orrs    r0, r3
    3ea2:       eed1 3bd0       vmov.u8 r3, d17[2]
    3ea6:       ea40 000c       orr.w   r0, r0, ip
    3eaa:       ea40 000e       orr.w   r0, r0, lr
    3eae:       4320            orrs    r0, r4
    3eb0:       eef1 4bb0       vmov.u8 r4, d17[5]
    3eb4:       4338            orrs    r0, r7
    3eb6:       eef1 7bd0       vmov.u8 r7, d17[6]
    3eba:       4308            orrs    r0, r1
    3ebc:       eef1 1bf0       vmov.u8 r1, d17[7]
    3ec0:       4310            orrs    r0, r2
    3ec2:       4318            orrs    r0, r3
    3ec4:       4328            orrs    r0, r5
    3ec6:       4330            orrs    r0, r6
    3ec8:       4320            orrs    r0, r4
    3eca:       4338            orrs    r0, r7
    3ecc:       4308            orrs    r0, r1
    3ece:       f000 0001       and.w   r0, r0, #1
    3ed2:       bdf0            pop     {r4, r5, r6, r7, pc}

Additional info

This seriously regresses performance (packed_simd_2 vs. core_simd) for encoding_rs.

A similar regression came up previously, when migrating from simd to packed_simd.

Meta

rustc --version --verbose:

rustc 1.55.0-nightly (b1f8e27b7 2021-07-15)
binary: rustc
commit-hash: b1f8e27b74c541d3d555149c8efa4bfe9385cd56
commit-date: 2021-07-15
host: armv7-unknown-linux-gnueabihf
release: 1.55.0-nightly
LLVM version: 12.0.1

stdsimd rev 715f9ac

hsivonen added the C-bug (Category: Bug) label on Jul 16, 2021
@calebzulawski (Member)

packed-simd uses specific arm instructions which we won't be able to do (since target_feature doesn't apply to std). stdsimd uses an integer "or" reduction, so unfortunately LLVM doesn't know that the vector must contain only 0 or -1. Perhaps we can add an intrinsic that does a truncation before the reduction and hope that LLVM is smart enough to produce similar code.
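
At the library level, the closest expressible version of that trick is masking each lane down to its low bit before the reduction. A rough sketch (written against the current nightly std::simd API rather than the core_simd API of the time):

#![feature(portable_simd)]
use std::simd::{mask8x16, num::SimdInt, Simd};

fn any_via_low_bit(m: mask8x16) -> bool {
    // Keep only the low bit of each lane before the OR reduction,
    // mirroring a trunc-to-i1 at the library level.
    (m.to_int() & Simd::splat(1)).reduce_or() != 0
}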

@hsivonen (Member, Author)

Doesn't target_feature apply to std when std itself is compiled with the feature? I expected this to work, since NEON availability is part of the target (as with thumbv7neon-unknown-linux-gnueabihf) rather than an application compile-time option.

@calebzulawski (Member)

Good point, I missed the neon in the target. It would work for this particular target, but not for any other arm target that doesn't have neon baked in (even if using the target_feature attribute or compiler flag). I'm guessing other architectures likely see a similar codegen issue as well.

@programmerjake (Member)

> packed-simd uses specific arm instructions which we won't be able to do (since target_feature doesn't apply to std). stdsimd uses an integer "or" reduction, so unfortunately LLVM doesn't know that the vector must contain only 0 or -1. Perhaps we can add an intrinsic that does a truncation before the reduction and hope that LLVM is smart enough to produce similar code.

LLVM produces better code if truncating to i1 first, but it's still way worse than packed-simd:
https://llvm.godbolt.org/z/3v71P9Wdo

reduce_or:
        vorr    d16, d1, d0
        vzip.8  d16, d17
        vorr    d16, d17, d16
        vmov.u16        r0, d16[0]
        vmov.32 d17[0], r0
        vmov.u16        r0, d16[1]
        vmov.32 d17[1], r0
        vmov.u16        r0, d16[2]
        vmov.32 d18[0], r0
        vmov.u16        r0, d16[3]
        vmov.32 d18[1], r0
        vorr    d16, d18, d17
        vmov.32 r0, d16[0]
        vmov.32 r1, d16[1]
        orr     r0, r1, r0
        and     r0, r0, #1
        bx      lr

@hsivonen (Member, Author)

> It would work for this particular target, but not for any other arm target that doesn't have neon baked in (even if using the target_feature attribute or compiler flag).

The thumbv7neon-* targets exist precisely to address this problem: to allow things to perform better than after-the-fact +neon by baking NEON in at std compile time.

@calebzulawski (Member) commented Jul 16, 2021

We actually use the reduce-or intrinsic, which does exceptionally poorly: https://llvm.godbolt.org/z/aEha7dh5b

> > It would work for this particular target, but not for any other arm target that doesn't have neon baked in (even if using the target_feature attribute or compiler flag).

> The thumbv7neon-* targets exist precisely to address this problem: to allow things to perform better than after-the-fact +neon by baking NEON in at std compile time.

Yes, it does, but that doesn't help anyone who wants to runtime-check for NEON. It looks to me like this is a case that should be built into the LLVM reduce-or intrinsic for i1. We could partially solve it by bypassing codegen with explicit neon intrinsics, but the subpar LLVM codegen is the root of the problem.

@hsivonen (Member, Author)

> Yes, it does, but that doesn't help anyone who wants to runtime-check for NEON. It looks to me like this is a case that should be built into the LLVM reduce-or intrinsic for i1.

Ideally, yes, but in the interim it would be great to have a core_simd-level, conditionally-compiled implementation that produces the same instructions as the same reduction in simd and in packed_simd.
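
For concreteness, a minimal sketch of such a conditionally-compiled path, assuming the core::arch::arm NEON intrinsics (still unstable on 32-bit Arm) and mirroring the vpmax fold in the packed_simd output above; the function name and cfg gates are illustrative only:

#[cfg(all(target_arch = "arm", target_feature = "neon"))]
unsafe fn neon_any(mask: core::arch::arm::uint8x16_t) -> bool {
    use core::arch::arm::{
        vget_high_u8, vget_lane_u64, vget_low_u8, vpmax_u8, vreinterpret_u64_u8,
    };
    // Pairwise max folds the 16 mask lanes down to 8; any true lane
    // (0xff) survives the fold. This is the vpmax.u8 in the good output.
    let folded = vpmax_u8(vget_low_u8(mask), vget_high_u8(mask));
    // View the surviving 8 bytes as one u64 and test for any set bit,
    // matching the vmov + orrs tail above.
    vget_lane_u64::<0>(vreinterpret_u64_u8(folded)) != 0
}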

@calebzulawski (Member)

That's a possibility, but I'm concerned about special-casing every codegen issue for every target (this is certainly not the only one) and having a heap of target-specific code getting in the way of making stdsimd actually portable and stable. At a minimum, we should also report this to LLVM (it may be relevant to more targets than just armv7).

@hsivonen (Member, Author) commented Jul 16, 2021

I agree that reporting to LLVM makes sense and that it's problematic to have a lot of conditional target-specific code. However, another point of view is that if the application programmer can't trust that core::simd compiles to the best instructions, the value proposition for portable SIMD in general suffers. I think it's less bad to put the target-specific code in core::simd than e.g. in encoding_rs.

@calebzulawski (Member)

I would argue that even with the sometimes-enabled target specific code, it still can't be trusted to compile to the best instructions. Regardless, I opened https://bugs.llvm.org/show_bug.cgi?id=51122. It appears there was a similar issue on x86-64 a couple years ago, but that was fixed.

@calebzulawski (Member)

I was able to fix the codegen on AArch64 by implementing any as self.to_int().horizontal_min() == -1, but this still fails on ARM and produces:

reduce_or:
 push    {r4, r5, r6, r7, r11, lr}
 vld1.64 {d16, d17}, [r0]
 vmin.s8 d16, d16, d17
 vmov.u8 r0, d16[0]
 vmov.u8 r1, d16[1]
 vmov.u8 r2, d16[2]
 vmov.u8 r3, d16[3]
 vmov.u8 r12, d16[4]
 vmov.u8 lr, d16[5]
 vmov.u8 r4, d16[6]
 vmov.u8 r5, d16[7]
 lsl     r0, r0, #24
 lsl     r1, r1, #24
 asr     r6, r0, #24
 asr     r7, r1, #24
 cmp     r6, r1, asr, #24
 asrlt   r7, r0, #24
 lsl     r0, r2, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, r3, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, r12, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, lr, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, r4, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, r5, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 and     r0, r7, #255
 sub     r0, r0, #255
 clz     r0, r0
 lsr     r0, r0, #5
 pop     {r4, r5, r6, r7, r11, pc}

LLVM does not seem to be aware of vpmin at all.
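
Spelled out, the approach is roughly this (a sketch; horizontal_min was the API name at the time and has since been renamed reduce_min):

#![feature(portable_simd)]
use std::simd::{mask8x16, num::SimdInt};

fn any_via_min(m: mask8x16) -> bool {
    // Mask lanes are 0 or -1, so the signed minimum is -1 exactly when
    // at least one lane is true.
    m.to_int().reduce_min() == -1
}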

@workingjubilee (Contributor)

We changed some stuff around and LLVM is now LLVM 13. No action on the bug, but is this still an issue today?

@calebzulawski (Member)

I just checked godbolt and yes, it's still an issue (for both armv7 and aarch64).

@thomcc (Member) commented Dec 10, 2021

We've been concerned about this kind of thing for a while; it's... part of what the whole mask-representation arguments were about near the start (and this looks somewhat like my examples of LLVM doing bad things from back then 😬).

While I'm unsure we need to be above a target-specific fix (libcore even has these in the scalar numerics, after all), it sounds like this is not just an arm problem... (the indication I'm getting is that this is bad on several targets)

Boolean reductions are a super common operation. Do... we actually have a feeling this is fixable? (Perhaps with a new intrinsic which is UB to use if the input is not a mask type?)
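
A plain-function stand-in for the contract such an intrinsic would carry (the name, signature, and debug check are invented for illustration):

#![feature(portable_simd)]
use std::simd::{num::SimdInt, Simd};

/// # Safety
/// Every lane of `mask` must be all-zeros or all-ones.
unsafe fn simd_any_mask(mask: Simd<i8, 16>) -> bool {
    debug_assert!(mask.to_array().iter().all(|&l| l == 0 || l == -1));
    // A real intrinsic would let the backend pick vpmax/vminv here
    // instead of lowering a generic OR reduction lane by lane.
    mask.reduce_or() != 0
}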

@calebzulawski (Member) commented Dec 11, 2021

The issue doesn't really have anything to do with our mask representation, but with LLVM not lowering it well. The x86 backend has special logic for boolean reductions, and either that logic needs to be made generic or ported to all targets.

@workingjubilee (Contributor) commented Dec 24, 2021

The upstream of this issue is now tracked at llvm/llvm-project#50466

@workingjubilee (Contributor)

Hrm. I think I would like to know slightly more about how our highly generic code gets included in crates from std, if at all (since we mark as many things #[inline] as possible), before addressing this. The code generated is obviously bad; I am just wary of going off only one test example. And one alternative we haven't discussed is the possibility of tweaking the LLVM IR we emit more directly.

If we did solve it here, we might have to do so as a specialization on a length...

https://rustc.godbolt.org/z/9sEo9GPWY

@workingjubilee (Contributor)

Mmm, I did some investigation and I have begun to doubt that it's worth fixing this in std::simd before codegen, precisely because this IS a super common operation.

You see, miri attempts to run std's tests, so miri needs to be able to handle whatever we do here. We could do an end-run around miri by giving more explicit directions in LLVM lowering, but in that case there is no reason to bother at the library level at all, and whatever fix we implemented by manipulating LLVMIR would be better off upstreamed anyway.

@thomcc (Member) commented Feb 6, 2022

FWIW, this wouldn't be the first code in the stdlib that does #[cfg(miri)] to perform an optimization that miri doesn't support (or even does support, except under stricter MIRIFLAGS).
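
That pattern looks roughly like this (illustrative, not actual stdlib code):

// Keep a straightforward path that Miri can interpret, and a tuned path
// for native builds.
#[cfg(miri)]
fn reduce_any(lanes: [i8; 16]) -> bool {
    lanes.iter().any(|&l| l != 0)
}

#[cfg(not(miri))]
fn reduce_any(lanes: [i8; 16]) -> bool {
    // Stand-in for whatever target-specific lowering the real build
    // would use; the OR fold matches the reduction discussed here.
    lanes.iter().fold(0i8, |acc, &l| acc | l) != 0
}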

It's also not really that much worse than tweaks made in other places in libcore -- for example, https://github.com/rust-lang/rust/blob/master/library/core/src/num/uint_macros.rs#L1645-L1657 was tuned to match what some specific LLVM vectorization pattern looks for.

IOW I'm not sure there's any reason to favor purity over practicality here... People expect the stdlib to contain the gross hacks so they don't need to, especially for common operations (but also less common ones that are big enough wins, like with that abs_diff).

@workingjubilee (Contributor)

The abs_diff example there doesn't seem to be related at all, as it doesn't use cfg.

I am not exactly clear on why I would be "favoring purity over practicality" when I suggest directly altering the LLVMIR lowering of the intrinsic on some targets in order to get the desired results when it genuinely has begun to seem simpler than a mash of cfg at the library level which complicates life for everyone who lives upstream of LLVM's competence or lack thereof. I was not aware "tweak the LLVMIR conditionally in response to some compilation states" was a pure operation. It seems rather stateful to me. Is the compiler a monad?

The problem is that most of the solutions I have looked at seem to depend intensely on foreknowledge of the exact types and lengths involved, or on language-level capabilities like specialization, or on exhaustively expressing almost identical functions for every possible composition of types and lengths. This quickly risks exponential blowup in pages of code for one improvement.

Perhaps I made it sound like I intend to defer a solution indefinitely? On the contrary, I arrived at this conclusion after learning considerably more about how to generate LLVMIR and coming to the conclusion that directly complicating one of the simd_reduce intrinsics would be shorter and more maintainable.

@thomcc (Member) commented Feb 6, 2022

Well, the abs_diff is an x86-specific thing, and plausibly should have been cfged that way.

But more broadly: fair enough. Historically, "this is an LLVM codegen bug" tends to imply things won't get fixed until an LLVM dev cares enough to fix it on their end.

@workingjubilee (Contributor) commented Feb 6, 2022

I mean, LLVM should fix it, but on our end we can emit a sequence of llvm.aarch64.neon.{s,u}minp.v{16,8,4}i{8,16,32} intrinsics when we recognize that the compilation target is aarch64 and we have an applicable vector. Likewise with Armv7 if Neon is enabled, etc.
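
On aarch64 the library-level equivalent is already reachable through core::arch (a sketch; vminvq_s8 is the horizontal smin rather than the pairwise minp form named above, but it lowers through the same llvm.aarch64.neon.* family):

#[cfg(target_arch = "aarch64")]
fn neon_any(mask: core::arch::aarch64::int8x16_t) -> bool {
    // Any true lane (-1) drags the signed horizontal minimum to -1;
    // an all-false mask gives 0.
    unsafe { core::arch::aarch64::vminvq_s8(mask) == -1 }
}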

Someone else will probably have to upstream it, though.
I can program in JavaScript, Rust, x86 Assembly, and LLVMIR, not C++.

@calebzulawski (Member)

As always, compile-time detection is only somewhat useful, as many applications want runtime detection.

@workingjubilee (Contributor)

That's true. I'll take the partial win, though.

@calebzulawski (Member)

I fixed this in llvm/llvm-project@71dc3de via rust-lang/rust#114048, but oddly enough the optimization only occurs without -Copt-level=3:

https://rustc.godbolt.org/z/a7njqsocr

@programmerjake (Member)

> https://rustc.godbolt.org/z/a7njqsocr

It looks like the optimizer is converting the reductions to a bitcast and compare, so to get it to work you'll need to special-case that too in instruction selection (or wherever else in the Arm backend).

@calebzulawski (Member)

Ah, you're right. I thought I made that change too, but it was only for wasm.

@hsivonen (Member, Author) commented Apr 8, 2024

> I fixed this in llvm/llvm-project@71dc3de via rust-lang/rust#114048, but oddly enough the optimization only occurs without -Copt-level=3
>
> https://rustc.godbolt.org/z/a7njqsocr

Looks like the bad codegen also occurs with lower opt-levels.

@hsivonen (Member, Author)

Since packed_simd doesn't work with rustc anymore, I need to migrate encoding_rs in Firefox to core::simd. To enable that without regressing performance, I published a workaround crate for this bug. (It's great that, apart from this bug, core::simd works. Thank you!)
