Bad codegen for boolean reductions on thumbv7neon #146
packed-simd uses specific ARM instructions, which we won't be able to do (since target_feature doesn't apply to std). stdsimd uses an integer "or" reduction, so unfortunately LLVM doesn't know that the vector must contain only 0 or -1. Perhaps we could add an intrinsic that does a truncation before the reduction and hope that LLVM is smart enough to produce similar code.
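The two lowerings under discussion can be modeled in scalar Rust (illustrative only; the real intrinsics operate on whole SIMD vectors and these function names are made up):

```rust
// A portable-simd mask lane is either 0 (false) or -1 (all bits set, true).

// Integer "or" reduction: the current lowering. LLVM cannot assume the
// lanes are only 0 or -1, so it must OR the full-width lanes together.
fn any_via_or_reduce(lanes: &[i32]) -> bool {
    lanes.iter().fold(0, |acc, &x| acc | x) != 0
}

// Truncate-then-reduce: each lane is first narrowed to a single bit (the
// proposed intrinsic), which tells the backend the values are boolean.
fn any_via_trunc_reduce(lanes: &[i32]) -> bool {
    lanes.iter().fold(false, |acc, &x| acc | ((x & 1) != 0))
}

fn main() {
    let mask = [0, -1, 0, 0];
    // For well-formed masks (all lanes 0 or -1) the two reductions agree.
    assert_eq!(any_via_or_reduce(&mask), any_via_trunc_reduce(&mask));
    println!("{}", any_via_or_reduce(&mask)); // prints "true"
}
```

For a well-formed mask the low bit of each lane already encodes the boolean, which is why the truncation is semantics-preserving while giving LLVM strictly more information.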
Doesn't
Good point -- I missed the neon in the target. It would work for this particular target, but not for any other ARM target that doesn't have NEON baked in (even when using the target_feature attribute or a compiler flag). I'm guessing other architectures likely see a similar codegen issue as well.
LLVM produces better code if truncating to reduce_or:
vorr d16, d1, d0
vzip.8 d16, d17
vorr d16, d17, d16
vmov.u16 r0, d16[0]
vmov.32 d17[0], r0
vmov.u16 r0, d16[1]
vmov.32 d17[1], r0
vmov.u16 r0, d16[2]
vmov.32 d18[0], r0
vmov.u16 r0, d16[3]
vmov.32 d18[1], r0
vorr d16, d18, d17
vmov.32 r0, d16[0]
vmov.32 r1, d16[1]
orr r0, r1, r0
and r0, r0, #1
bx lr
The
We actually use the reduce-or intrinsic, which does exceptionally poorly: https://llvm.godbolt.org/z/aEha7dh5b
Yes, it does, but that doesn't help anyone who wants to runtime-check for NEON. It looks to me like this is a case that should be built into the LLVM reduce-or intrinsic for
Ideally, yes, but in the interim it would be great to have a
That's a possibility, but I'm concerned about special-casing every codegen issue for every target (this is certainly not the only one) and having a heap of target-specific code getting in the way of making stdsimd actually portable and stable. At a minimum, we should also report this to LLVM (it may be relevant to more targets than just armv7).
I agree that reporting to LLVM makes sense and that it's problematic to have a lot of conditional target-specific code. However, another point of view is that if the application programmer can't trust that
I would argue that even with the sometimes-enabled target-specific code, it still can't be trusted to compile to the best instructions. Regardless, I opened https://bugs.llvm.org/show_bug.cgi?id=51122. It appears there was a similar issue on x86-64 a couple of years ago, but that was fixed.
I was able to fix the codegen on Aarch64 by implementing
LLVM does not seem to be aware of |
We changed some stuff around and LLVM is now LLVM 13. No action on the bug, but is this still an issue today? |
I just checked godbolt and yes, it's still an issue (for both armv7 and aarch64). |
We've been concerned about this kind of thing for a while; it's part of what the whole mask-representation arguments were about near the start (and this looks somewhat like my examples of LLVM doing bad things from back then 😬). While I'm unsure we need to be above a target-specific fix (libcore even has these in the scalar numerics, after all), it sounds like this is not just an ARM problem (the indication I'm getting is that this is bad on several targets). Boolean reductions are a super common operation. Do we actually have a feeling this is fixable? (Perhaps with a new intrinsic which is UB to use if the input is not a mask type?)
The issue doesn't really have anything to do with our mask representation, but with LLVM not lowering it well. The x86 backend has special logic for boolean reductions; either that logic needs to be made generic or ported to all targets.
The upstream of this issue is now tracked at llvm/llvm-project#50466 |
Hrm. I think I would like to know slightly more about how our highly generic code gets included in crates from std, if at all, since we mark as many things If we did solve it here, we might have to do so as a specialization on a length...
Mmm, I did some investigation and I have begun to doubt that it's worth fixing this in You see, miri attempts to run std's tests, so miri needs to be able to handle what we do in tests. We could do an end-run around miri and give more explicit directions in LLVM lowering, but then we shouldn't really bother at the library level. And whatever fix we implemented by manipulating the LLVM IR would be better off upstreamed.
FWIW, this wouldn't be the first code in the stdlib that does #[cfg(miri)] to perform an optimization that miri doesn't support (or even does support, except under stricter MIRIFLAGS). It's also not really that much worse than tweaks made in other places in libcore -- for example, https://github.com/rust-lang/rust/blob/master/library/core/src/num/uint_macros.rs#L1645-L1657 was tuned to match what a specific LLVM vectorization pattern looks for. IOW, I'm not sure there's any reason to favor purity over practicality here... People expect the stdlib to contain the gross hacks so they don't need to, especially for common operations (but also for less common ones that are big enough wins, like that abs_diff).
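For context, the abs_diff tuning mentioned above has roughly this shape (a sketch; the exact libcore implementation at the linked location may differ):

```rust
// Branch-shaped unsigned absolute difference. Written as a compare-and-
// select rather than via signed conversion, reportedly because this is the
// shape a specific LLVM vectorization pattern recognizes. It also cannot
// overflow: the larger operand is always on the left of the subtraction.
fn abs_diff_u32(a: u32, b: u32) -> u32 {
    if a < b { b - a } else { a - b }
}

fn main() {
    assert_eq!(abs_diff_u32(10, 3), 7);
    assert_eq!(abs_diff_u32(3, 10), 7);
    println!("{}", abs_diff_u32(u32::MAX, 0)); // prints "4294967295"
}
```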
The I am not exactly clear on why I would be "favoring purity over practicality" when I suggest directly altering the LLVM IR lowering of the intrinsic on some targets in order to get the desired results, when it has genuinely begun to seem simpler than a mash of cfg at the library level that complicates life for everyone who lives upstream of LLVM's competence or lack thereof. I was not aware that "tweak the LLVM IR conditionally in response to some compilation states" was a pure operation. It seems rather stateful to me. Is the compiler a monad?

The problem is that most of the solutions I have looked at seem to depend intensely on foreknowledge of the exact types and lengths involved, or on language-level capabilities like specialization, or on exhaustively writing out almost identical functions for every possible combination of types and lengths. This quickly risks exponential blowup in pages of code for one improvement.

Perhaps I made it sound like I intend to defer a solution indefinitely? On the contrary, I arrived at this conclusion after learning considerably more about how to generate LLVM IR and coming to the conclusion that directly complicating one of the
Well, the abs_diff is an x86-specific thing, and plausibly should have been But more broadly: Fair enough -- Historically "this is an LLVM codegen bug" tends to imply things won't get fixed until an LLVM dev cares enough to fix it on their end. |
I mean, LLVM should fix it, but on our end, we can emit a sequence of Someone else will probably have to upstream it, though. |
As always, compile-time detection is only somewhat useful, as many applications want runtime detection.
That's true. I'll take the partial win, though. |
I fixed this in llvm/llvm-project@71dc3de via rust-lang/rust#114048, but oddly enough, the optimization only occurs without
It looks like the optimizer is converting the reductions to a bitcast and compare, so to get it to work you'll need to special-case that too in instruction selection (or wherever else in the ARM backend).
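The "bitcast and compare" shape can be modeled in scalar Rust (the names and explicit lane packing here are illustrative; LLVM actually bitcasts an `<N x i1>` vector to an iN integer in one step):

```rust
// Pack an N-lane boolean mask into the low N bits of an integer -- the
// scalar analogue of bitcasting <N x i1> to iN.
fn mask_to_bits(lanes: &[bool]) -> u32 {
    lanes
        .iter()
        .enumerate()
        .fold(0u32, |acc, (i, &b)| acc | ((b as u32) << i))
}

// "any" becomes a single compare against zero...
fn any(lanes: &[bool]) -> bool {
    mask_to_bits(lanes) != 0
}

// ...and "all" a compare against the all-ones pattern.
// Assumes fewer than 32 lanes so the pattern fits in a u32.
fn all(lanes: &[bool]) -> bool {
    mask_to_bits(lanes) == (1u32 << lanes.len()) - 1
}

fn main() {
    assert!(any(&[false, true, false, false]));
    assert!(!all(&[false, true, false, false]));
    assert!(all(&[true, true, true, true]));
    println!("{:#06b}", mask_to_bits(&[true, false, true, true])); // prints "0b1101"
}
```

This is the same trick x86's movemask-style lowering exploits: once the mask lives in a scalar register, any/all reductions collapse to one integer comparison.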
Ah, you're right, I thought I made that change too but it was only for wasm. |
Looks like the bad codegen also occurs with lower opt-levels. |
Since
Using cargo build --release --target thumbv7neon-unknown-linux-gnueabihf:

With packed_simd_2, this: (source omitted from this capture) compiles to this: (assembly omitted from this capture)

With core_simd, this: (source omitted from this capture) compiles to this: (assembly omitted from this capture)

Additional info

This seriously regresses performance (packed_simd_2 vs. core_simd) for encoding_rs. Previously when migrating from simd to packed_simd.

Meta

rustc --version --verbose: (output omitted from this capture)
stdsimd rev 715f9ac