
Bad codegen for boolean reductions on thumbv7neon #146

Open

hsivonen opened this issue Jul 16, 2021 · 29 comments

Labels: A-codegen (Area: Code generation), A-LLVM (Area: LLVM), C-bug (Category: Bug)
@hsivonen (Member)

Using cargo build --release --target thumbv7neon-unknown-linux-gnueabihf:

With packed_simd_2, this:

use packed_simd_2::m8x16;

#[no_mangle]
#[inline(never)]
fn packed_any(s: m8x16) -> bool {
    s.any()
}

compiles to this:

00003e66 <packed_any>:
    3e66:       f960 0acf       vld1.64 {d16-d17}, [r0]
    3e6a:       ff40 0aa1       vpmax.u8        d16, d16, d17
    3e6e:       ec51 0b30       vmov    r0, r1, d16
    3e72:       4308            orrs    r0, r1
    3e74:       bf18            it      ne
    3e76:       2001            movne   r0, #1
    3e78:       4770            bx      lr
        ...

With core_simd, this:

use core_simd::mask8x16;

#[no_mangle]
#[inline(never)]
fn core_any(s: mask8x16) -> bool {
    s.any()
}

compiles to this:

00003e66 <core_any>:
    3e66:       b5f0            push    {r4, r5, r6, r7, lr}
    3e68:       f960 0acf       vld1.64 {d16-d17}, [r0]
    3e6c:       eed0 0bb0       vmov.u8 r0, d16[1]
    3e70:       eed0 1b90       vmov.u8 r1, d16[0]
    3e74:       eed0 2bd0       vmov.u8 r2, d16[2]
    3e78:       eed0 3bf0       vmov.u8 r3, d16[3]
    3e7c:       eef0 cb90       vmov.u8 ip, d16[4]
    3e80:       eef0 ebb0       vmov.u8 lr, d16[5]
    3e84:       eef0 4bd0       vmov.u8 r4, d16[6]
    3e88:       eef0 7bf0       vmov.u8 r7, d16[7]
    3e8c:       eed1 5bf0       vmov.u8 r5, d17[3]
    3e90:       eef1 6b90       vmov.u8 r6, d17[4]
    3e94:       4308            orrs    r0, r1
    3e96:       eed1 1b90       vmov.u8 r1, d17[0]
    3e9a:       4310            orrs    r0, r2
    3e9c:       eed1 2bb0       vmov.u8 r2, d17[1]
    3ea0:       4318            orrs    r0, r3
    3ea2:       eed1 3bd0       vmov.u8 r3, d17[2]
    3ea6:       ea40 000c       orr.w   r0, r0, ip
    3eaa:       ea40 000e       orr.w   r0, r0, lr
    3eae:       4320            orrs    r0, r4
    3eb0:       eef1 4bb0       vmov.u8 r4, d17[5]
    3eb4:       4338            orrs    r0, r7
    3eb6:       eef1 7bd0       vmov.u8 r7, d17[6]
    3eba:       4308            orrs    r0, r1
    3ebc:       eef1 1bf0       vmov.u8 r1, d17[7]
    3ec0:       4310            orrs    r0, r2
    3ec2:       4318            orrs    r0, r3
    3ec4:       4328            orrs    r0, r5
    3ec6:       4330            orrs    r0, r6
    3ec8:       4320            orrs    r0, r4
    3eca:       4338            orrs    r0, r7
    3ecc:       4308            orrs    r0, r1
    3ece:       f000 0001       and.w   r0, r0, #1
    3ed2:       bdf0            pop     {r4, r5, r6, r7, pc}

Additional info

This seriously regresses performance (packed_simd_2 vs. core_simd) for encoding_rs.

A similar regression came up previously, when migrating from simd to packed_simd.

Meta

rustc --version --verbose:

rustc 1.55.0-nightly (b1f8e27b7 2021-07-15)
binary: rustc
commit-hash: b1f8e27b74c541d3d555149c8efa4bfe9385cd56
commit-date: 2021-07-15
host: armv7-unknown-linux-gnueabihf
release: 1.55.0-nightly
LLVM version: 12.0.1

stdsimd rev 715f9ac

hsivonen added the C-bug (Category: Bug) label on Jul 16, 2021
@calebzulawski (Member)

packed-simd uses specific arm instructions which we won't be able to do (since target_feature doesn't apply to std). stdsimd uses an integer "or" reduction, so unfortunately LLVM doesn't know that the vector must contain only 0 or -1. Perhaps we can add an intrinsic that does a truncation before the reduction and hope that LLVM is smart enough to produce similar code.
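
At the library level, the closest expressible version of that trick is masking each lane down to its low bit before the reduction. A rough sketch (written against the current nightly std::simd API rather than the core_simd API of the time):

#![feature(portable_simd)]
use std::simd::{mask8x16, num::SimdInt, Simd};

fn any_via_low_bit(m: mask8x16) -> bool {
    // Keep only the low bit of each lane before the OR reduction,
    // mirroring a trunc-to-i1 at the library level.
    (m.to_int() & Simd::splat(1)).reduce_or() != 0
}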

@hsivonen (Member, Author)

Doesn't target_feature apply to std when std itself is compiled with the feature? I expected this to work, since NEON availability is part of the target (as with thumbv7neon-unknown-linux-gnueabihf) rather than an application compile-time option.

@calebzulawski (Member)

Good point, I missed the neon in the target. It would work for this particular target, but not for any other arm target that doesn't have neon baked in (even if using the target_feature attribute or compiler flag). I'm guessing other architectures likely see a similar codegen issue as well.

@programmerjake (Member)

> packed-simd uses specific arm instructions which we won't be able to do (since target_feature doesn't apply to std). stdsimd uses an integer "or" reduction, so unfortunately LLVM doesn't know that the vector must contain only 0 or -1. Perhaps we can add an intrinsic that does a truncation before the reduction and hope that LLVM is smart enough to produce similar code.

LLVM produces better code if truncating to i1 first, but it's still way worse than packed-simd:
https://llvm.godbolt.org/z/3v71P9Wdo

reduce_or:
        vorr    d16, d1, d0
        vzip.8  d16, d17
        vorr    d16, d17, d16
        vmov.u16        r0, d16[0]
        vmov.32 d17[0], r0
        vmov.u16        r0, d16[1]
        vmov.32 d17[1], r0
        vmov.u16        r0, d16[2]
        vmov.32 d18[0], r0
        vmov.u16        r0, d16[3]
        vmov.32 d18[1], r0
        vorr    d16, d18, d17
        vmov.32 r0, d16[0]
        vmov.32 r1, d16[1]
        orr     r0, r1, r0
        and     r0, r0, #1
        bx      lr

@hsivonen (Member, Author)

> It would work for this particular target, but not for any other arm target that doesn't have neon baked in (even if using the target_feature attribute or compiler flag).

The thumbv7neon-* targets exist precisely to address this problem: to allow things to perform better than after-the-fact +neon by baking NEON in at std compile time.

@calebzulawski (Member) commented Jul 16, 2021

We actually use the reduce-or intrinsic, which does exceptionally poorly: https://llvm.godbolt.org/z/aEha7dh5b

> > It would work for this particular target, but not for any other arm target that doesn't have neon baked in (even if using the target_feature attribute or compiler flag).

> The thumbv7neon-* targets exist precisely to address this problem: to allow things to perform better than after-the-fact +neon by baking NEON in at std compile time.

Yes, it does, but that doesn't help anyone who wants to runtime-check for NEON. It looks to me like this is a case that should be built into the LLVM reduce-or intrinsic for i1. We could partially solve it by bypassing codegen with explicit neon intrinsics, but the subpar LLVM codegen is the root of the problem.

@hsivonen (Member, Author)

> Yes, it does, but that doesn't help anyone who wants to runtime-check for NEON. It looks to me like this is a case that should be built into the LLVM reduce-or intrinsic for i1.

Ideally, yes, but in the interim it would be great to have a core_simd-level, conditionally-compiled implementation that produces the same instructions as the same reduction in simd and in packed_simd.
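
For concreteness, a minimal sketch of such a conditionally-compiled path, assuming the core::arch::arm NEON intrinsics (still unstable on 32-bit Arm) and mirroring the vpmax fold in the packed_simd output above; the function name and cfg gates are illustrative only:

#[cfg(all(target_arch = "arm", target_feature = "neon"))]
unsafe fn neon_any(mask: core::arch::arm::uint8x16_t) -> bool {
    use core::arch::arm::{
        vget_high_u8, vget_lane_u64, vget_low_u8, vpmax_u8, vreinterpret_u64_u8,
    };
    // Pairwise max folds the 16 mask lanes down to 8; any true lane
    // (0xff) survives the fold. This is the vpmax.u8 in the good output.
    let folded = vpmax_u8(vget_low_u8(mask), vget_high_u8(mask));
    // View the surviving 8 bytes as one u64 and test for any set bit,
    // matching the vmov + orrs tail above.
    vget_lane_u64::<0>(vreinterpret_u64_u8(folded)) != 0
}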

@calebzulawski (Member)

That's a possibility, but I'm concerned about special-casing every codegen issue for every target (this is certainly not the only one) and having a heap of target-specific code getting in the way of making stdsimd actually portable and stable. At a minimum, we should also report this to LLVM (it may be relevant to more targets than just armv7).

@hsivonen (Member, Author) commented Jul 16, 2021

I agree that reporting to LLVM makes sense and that it's problematic to have a lot of conditional target-specific code. However, another point of view is that if the application programmer can't trust that core::simd compiles to the best instructions, the value proposition for portable SIMD in general suffers. I think it's less bad to put the target-specific code in core::simd than e.g. in encoding_rs.

@calebzulawski (Member)

I would argue that even with the sometimes-enabled target specific code, it still can't be trusted to compile to the best instructions. Regardless, I opened https://bugs.llvm.org/show_bug.cgi?id=51122. It appears there was a similar issue on x86-64 a couple years ago, but that was fixed.

@calebzulawski (Member)

I was able to fix the codegen on AArch64 by implementing any as self.to_int().horizontal_min() == -1, but this still fails on ARM and produces:

reduce_or:
 push    {r4, r5, r6, r7, r11, lr}
 vld1.64 {d16, d17}, [r0]
 vmin.s8 d16, d16, d17
 vmov.u8 r0, d16[0]
 vmov.u8 r1, d16[1]
 vmov.u8 r2, d16[2]
 vmov.u8 r3, d16[3]
 vmov.u8 r12, d16[4]
 vmov.u8 lr, d16[5]
 vmov.u8 r4, d16[6]
 vmov.u8 r5, d16[7]
 lsl     r0, r0, #24
 lsl     r1, r1, #24
 asr     r6, r0, #24
 asr     r7, r1, #24
 cmp     r6, r1, asr, #24
 asrlt   r7, r0, #24
 lsl     r0, r2, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, r3, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, r12, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, lr, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, r4, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, r5, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 and     r0, r7, #255
 sub     r0, r0, #255
 clz     r0, r0
 lsr     r0, r0, #5
 pop     {r4, r5, r6, r7, r11, pc}

LLVM does not seem to be aware of vpmin at all.
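
Spelled out, the approach is roughly this (a sketch; horizontal_min was the API name at the time and has since been renamed reduce_min):

#![feature(portable_simd)]
use std::simd::{mask8x16, num::SimdInt};

fn any_via_min(m: mask8x16) -> bool {
    // Mask lanes are 0 or -1, so the signed minimum is -1 exactly when
    // at least one lane is true.
    m.to_int().reduce_min() == -1
}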

@workingjubilee (Contributor)

We changed some stuff around and LLVM is now LLVM 13. No action on the bug, but is this still an issue today?

@calebzulawski (Member)

I just checked godbolt and yes, it's still an issue (for both armv7 and aarch64).

@thomcc (Member) commented Dec 10, 2021

We've been concerned about this kind of thing for a while; it's... part of what the whole mask-representation arguments were about near the start (and this looks somewhat like my examples of LLVM doing bad things from back then 😬).

While I'm unsure we need to be above a target-specific fix (libcore even has these in the scalar numerics, after all), it sounds like this is not just an arm problem... (the indication I'm getting is that this is bad on several targets)

Boolean reductions are a super common operation. Do... we actually have a feeling this is fixable? (Perhaps with a new intrinsic which is UB to use if the input is not a mask type?)
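
A plain-function stand-in for the contract such an intrinsic would carry (the name, signature, and debug check are invented for illustration):

#![feature(portable_simd)]
use std::simd::{num::SimdInt, Simd};

/// # Safety
/// Every lane of `mask` must be all-zeros or all-ones.
unsafe fn simd_any_mask(mask: Simd<i8, 16>) -> bool {
    debug_assert!(mask.to_array().iter().all(|&l| l == 0 || l == -1));
    // A real intrinsic would let the backend pick vpmax/vminv here
    // instead of lowering a generic OR reduction lane by lane.
    mask.reduce_or() != 0
}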

@calebzulawski (Member) commented Dec 11, 2021

The issue doesn't really have anything to do with our mask representation, but with LLVM not lowering it well. The x86 backend has special logic for boolean reductions, and either that logic needs to be made generic or ported to all targets.

@workingjubilee (Contributor) commented Dec 24, 2021

The upstream of this issue is now tracked at llvm/llvm-project#50466

@workingjubilee (Contributor)

Hrm. I think I would like to know slightly more about how our highly generic code gets included in crates from std, if at all (since we mark as many things #[inline] as possible), before addressing this. The code generated is obviously bad; I am just wary of going off only one test example. And one alternative we haven't discussed is the possibility of tweaking the LLVM IR we emit more directly.

If we did solve it here, we might have to do so as a specialization on a length...

https://rustc.godbolt.org/z/9sEo9GPWY

@workingjubilee (Contributor)

Mmm, I did some investigation and I have begun to doubt that it's worth fixing this in std::simd before codegen, precisely because this IS a super common operation.

You see, miri attempts to run std's tests, so miri needs to be able to handle whatever we do here. We could do an end-run around miri by giving more explicit directions in LLVM lowering, but in that case there is no reason to bother at the library level at all, and whatever fix we implemented by manipulating LLVMIR would be better off upstreamed anyway.

@thomcc (Member) commented Feb 6, 2022

FWIW, this wouldn't be the first code in the stdlib that does #[cfg(miri)] to perform an optimization that miri doesn't support (or even does support, except under stricter MIRIFLAGS).
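
That pattern looks roughly like this (illustrative, not actual stdlib code):

// Keep a straightforward path that Miri can interpret, and a tuned path
// for native builds.
#[cfg(miri)]
fn reduce_any(lanes: [i8; 16]) -> bool {
    lanes.iter().any(|&l| l != 0)
}

#[cfg(not(miri))]
fn reduce_any(lanes: [i8; 16]) -> bool {
    // Stand-in for whatever target-specific lowering the real build
    // would use; the OR fold matches the reduction discussed here.
    lanes.iter().fold(0i8, |acc, &l| acc | l) != 0
}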

It's also not really that much worse than tweaks made in other places in libcore -- for example, https://github.com/rust-lang/rust/blob/master/library/core/src/num/uint_macros.rs#L1645-L1657 was tuned to match what some specific LLVM vectorization pattern looks for.

IOW I'm not sure there's any reason to favor purity over practicality here... People expect the stdlib to contain the gross hacks so they don't need to, especially for common operations (but also less common ones that are big enough wins, like with that abs_diff).

@workingjubilee (Contributor)

The abs_diff example there doesn't seem to be related at all, as it doesn't use cfg.

I am not exactly clear on why I would be "favoring purity over practicality" when I suggest directly altering the LLVMIR lowering of the intrinsic on some targets in order to get the desired results when it genuinely has begun to seem simpler than a mash of cfg at the library level which complicates life for everyone who lives upstream of LLVM's competence or lack thereof. I was not aware "tweak the LLVMIR conditionally in response to some compilation states" was a pure operation. It seems rather stateful to me. Is the compiler a monad?

The problem is that most of the solutions I have looked at seem to depend intensely on foreknowledge of the exact types and lengths involved, or on language-level capabilities like specialization, or on exhaustively expressing almost identical functions for every possible composition of types and lengths. This quickly risks exponential blowup in pages of code for one improvement.

Perhaps I made it sound like I intend to defer a solution indefinitely? On the contrary, I arrived at this conclusion after learning considerably more about how to generate LLVMIR and coming to the conclusion that directly complicating one of the simd_reduce intrinsics would be shorter and more maintainable.

@thomcc (Member) commented Feb 6, 2022

Well, the abs_diff is an x86-specific thing, and plausibly should have been cfged that way.

But more broadly: fair enough. Historically, "this is an LLVM codegen bug" tends to imply things won't get fixed until an LLVM dev cares enough to fix it on their end.

@workingjubilee (Contributor) commented Feb 6, 2022

I mean, LLVM should fix it, but on our end we can emit a sequence of llvm.aarch64.neon.{s,u}minp.v{16,8,4}i{8,16,32} intrinsics when we recognize that the compilation target is aarch64 and we have an applicable vector. Likewise with Armv7 if Neon is enabled, etc.
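
On aarch64 the library-level equivalent is already reachable through core::arch (a sketch; vminvq_s8 is the horizontal smin rather than the pairwise minp form named above, but it lowers through the same llvm.aarch64.neon.* family):

#[cfg(target_arch = "aarch64")]
fn neon_any(mask: core::arch::aarch64::int8x16_t) -> bool {
    // Any true lane (-1) drags the signed horizontal minimum to -1;
    // an all-false mask gives 0.
    unsafe { core::arch::aarch64::vminvq_s8(mask) == -1 }
}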

Someone else will probably have to upstream it, though.
I can program in JavaScript, Rust, x86 Assembly, and LLVMIR, not C++.

@calebzulawski (Member)

As always, compile-time detection is only somewhat useful, as many applications want runtime detection.

@workingjubilee (Contributor)

That's true. I'll take the partial win, though.

@calebzulawski (Member)

I fixed this in llvm/llvm-project@71dc3de via rust-lang/rust#114048, but oddly enough the optimization only occurs without -Copt-level=3:

https://rustc.godbolt.org/z/a7njqsocr

@programmerjake (Member)

> https://rustc.godbolt.org/z/a7njqsocr

It looks like the optimizer is converting the reductions to a bitcast and compare, so to get it to work you'll need to special-case that too in instruction selection (or wherever else in the Arm backend).

@calebzulawski (Member)

Ah, you're right. I thought I made that change too, but it was only for wasm.

@hsivonen (Member, Author) commented Apr 8, 2024

> I fixed this in llvm/llvm-project@71dc3de via rust-lang/rust#114048, but oddly enough the optimization only occurs without -Copt-level=3
>
> https://rustc.godbolt.org/z/a7njqsocr

Looks like the bad codegen also occurs with lower opt-levels.

@hsivonen (Member, Author)

Since packed_simd doesn't work with rustc anymore, I need to migrate encoding_rs in Firefox to core::simd. To enable that without regressing performance, I published a workaround crate for this bug. (It's great that, apart from this bug, core::simd works. Thank you!)
