Strange ASM pessimizations as a result of algorithmic optimization #92119
Labels
A-autovectorization
Issue related to autovectorization, which can impact perf or code size.
A-codegen
Area: Code generation
A-LLVM
Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-bug
Category: This is a bug.
I-slow
Issue: Problems and improvements with respect to performance of generated code.
T-compiler
Relevant to the compiler team, which will review and decide on the PR/issue.
So, I was trying to sum a slice of f32s quickly on stable Rust.
But like pretty much all floating-point reductions, the naive algorithm (`floats.iter().sum::<f32>()`) does not autovectorize, because its "natural" summation order introduces a serial dependency between successive sums. That makes SIMD parallelization illegal in the eyes of a compiler that guarantees bitwise floating-point reproducibility, like rustc does. Fair enough.
I saw this as a good motivation to move to explicit SIMD programming, but did not want to lose hardware portability (or, more precisely, wanted to keep it easy), so I tried to see how close I could get to `stdsimd` on stable Rust with only autovectorization and a pinch of hardware-specific vector size tuning.
Some hours of trial and error later, I got into a reasonably satisfactory state. In particular, the core of the algorithm...
...translated pretty much into the assembly that I would have written by hand, which made me very happy...
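For readers without the listing at hand, here is a minimal sketch of what such an autovectorization-friendly sum can look like on stable Rust. It is not the exact code from this issue: the names `LANES`, `ACCUMULATORS`, `Vector` and `sum_chunked` are hypothetical, and the values 4 and 4 are assumptions (one SSE register of `f32`s, four independent accumulators). The idea is to make the reassociation explicit in the source, so the compiler no longer needs permission to reorder floating-point additions.

```rust
/// Assumed SIMD width in f32 lanes (4 matches one SSE register).
const LANES: usize = 4;
/// Assumed number of independent accumulators.
const ACCUMULATORS: usize = 4;

/// One "vector" worth of f32 lanes, as a plain array.
type Vector = [f32; LANES];

/// Sums `floats` using several independent lane-wise accumulators.
/// The result may differ in rounding from `floats.iter().sum::<f32>()`,
/// because the additions are performed in a different order.
pub fn sum_chunked(floats: &[f32]) -> f32 {
    let mut accumulators: [Vector; ACCUMULATORS] = [[0.0; LANES]; ACCUMULATORS];

    // Main loop: each block feeds one vector into each accumulator.
    let mut blocks = floats.chunks_exact(LANES * ACCUMULATORS);
    for block in &mut blocks {
        for (acc, vector) in accumulators.iter_mut().zip(block.chunks_exact(LANES)) {
            for (a, &x) in acc.iter_mut().zip(vector) {
                *a += x;
            }
        }
    }

    // Serial merge of the accumulators: each addition depends on the last.
    let mut total: Vector = [0.0; LANES];
    for acc in &accumulators {
        for (t, &a) in total.iter_mut().zip(acc) {
            *t += a;
        }
    }

    // Elements that did not fill a whole block are summed the naive way.
    let tail: f32 = blocks.remainder().iter().sum();
    total.iter().sum::<f32>() + tail
}
```

Calling `sum_chunked(&data)` on any `&[f32]` works regardless of length, since `chunks_exact` hands back the leftover elements through `remainder()`.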
Nitpicky as I am, however, I was still a little bit unhappy about the part afterwards, which introduced a chain of serial dependencies that could become a bit long if I were to use a lot of accumulators...
...because I knew that in this particular case, there should be an easy way to avoid that, which is to interleave the SIMD accumulator merging with the summation of remaining data.
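To make the interleaving idea concrete, here is one possible shape for it, again as a sketch under assumptions rather than the code from this issue. `sum_interleaved`, `add_assign`, `LANES` and `ACCUMULATORS` are hypothetical names, and the halving scheme assumes `ACCUMULATORS` is a power of two. After the main loop, fewer than `ACCUMULATORS` whole vectors remain, so each halving step merges the upper half of the accumulators into the lower half and, when enough vectors are left, also feeds one leftover vector into each surviving accumulator.

```rust
const LANES: usize = 4; // assumed SIMD width in f32 lanes
const ACCUMULATORS: usize = 4; // assumed; must be a power of two here

type Vector = [f32; LANES];

/// Lane-wise `acc += x`.
fn add_assign(acc: &mut Vector, x: &[f32]) {
    for (a, &v) in acc.iter_mut().zip(x) {
        *a += v;
    }
}

pub fn sum_interleaved(floats: &[f32]) -> f32 {
    let mut accumulators: [Vector; ACCUMULATORS] = [[0.0; LANES]; ACCUMULATORS];

    // Main loop, same as in the non-interleaved version.
    let mut blocks = floats.chunks_exact(LANES * ACCUMULATORS);
    for block in &mut blocks {
        for (acc, vector) in accumulators.iter_mut().zip(block.chunks_exact(LANES)) {
            add_assign(acc, vector);
        }
    }

    // Fewer than ACCUMULATORS whole vectors are left, plus a scalar tail.
    let mut vectors = blocks.remainder().chunks_exact(LANES);

    // Tree-style merge: halve the number of live accumulators each step,
    // and use the step to also absorb `stride` leftover vectors if available.
    let mut stride = ACCUMULATORS / 2;
    while stride > 0 {
        let (low, high) = accumulators.split_at_mut(stride);
        for (acc, other) in low.iter_mut().zip(high.iter()) {
            add_assign(acc, other);
        }
        if vectors.len() >= stride {
            for acc in low.iter_mut() {
                if let Some(vector) = vectors.next() {
                    add_assign(acc, vector);
                }
            }
        }
        stride /= 2;
    }

    // Everything has been folded into accumulators[0]; finish with scalars.
    let tail: f32 = vectors.remainder().iter().sum();
    accumulators[0].iter().sum::<f32>() + tail
}
```

Because the number of leftover vectors is below `ACCUMULATORS`, the strides `ACCUMULATORS / 2, ..., 2, 1` are enough to consume all of them, and the longest dependency chain grows with the logarithm of the accumulator count instead of linearly.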
However, much to my surprise, performing this algorithmic optimization leads rustc to heavily pessimize the inner loop code by spilling all but one accumulator on every iteration:
Why would that happen? The only explanation I have is that rustc is somehow unable to prove that the `accumulators` slice does not alias with the `vectors`/`remainder` slices, and thus spills to memory just in case accumulator changes would affect the input of the next computations.
But this sounds like a bug to me: given that I have an &mut to the accumulators, my understanding is that rustc should be able to infer that no other code can see the accumulators, and thus they can remain resident in registers for the entire duration of the accumulation loop.
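One way to probe the aliasing hypothesis is to take the `&mut` out of the hot loop entirely: copy the accumulators into a by-value local, run the loop on the local, and write the result back once at the end. The sketch below illustrates that experiment; `accumulate_local` and the constants are hypothetical names, and whether this actually changes the generated assembly on a given compiler version has not been checked here.

```rust
const LANES: usize = 4; // assumed SIMD width in f32 lanes
const ACCUMULATORS: usize = 4; // assumed number of accumulators

type Vector = [f32; LANES];

/// Adds every whole block of `input` into `accumulators`, lane by lane.
/// Elements past the last whole block are ignored in this sketch.
pub fn accumulate_local(accumulators: &mut [Vector; ACCUMULATORS], input: &[f32]) {
    // By-value copy: the loop below only touches a local variable,
    // so no pointer to the accumulators exists while it runs.
    let mut local = *accumulators;

    for block in input.chunks_exact(LANES * ACCUMULATORS) {
        for (acc, vector) in local.iter_mut().zip(block.chunks_exact(LANES)) {
            for (a, &x) in acc.iter_mut().zip(vector) {
                *a += x;
            }
        }
    }

    // Single write-back through the exclusive reference.
    *accumulators = local;
}
```

If the spills disappear with this variant, that would point toward the compiler not exploiting the exclusivity of `&mut` in the original loop; if they remain, the cause is more likely elsewhere.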
Can someone with more knowledge of how rustc and LLVM do their optimization magic cross-check this and tell if my understanding is correct or if the register spills are truly necessary to preserve the semantics of my code?
Also, this is on stable release 1.57.0. On beta and nightly, the generated code becomes even stranger:
Here, rustc generates the code I would expect for the last three accumulators, but then it goes crazy with the first accumulator and generates the least efficient SSE load I have ever seen.
So it seems the aliasing issue got resolved, but was replaced by another issue beyond my comprehension... Here again, compiler expert help would be useful.