
Strange ASM pessimizations as a result of algorithmic optimization #92119

Open
HadrienG2 opened this issue Dec 20, 2021 · 5 comments
Labels: A-autovectorization, A-codegen, A-LLVM, C-bug, I-slow, T-compiler

Comments

@HadrienG2

So, I was trying to sum a slice of f32s quickly on stable Rust.

But like pretty much all floating-point reductions, the naive algorithm (floats.iter().sum::<f32>()) does not autovectorize, because its "natural" summation order introduces a serial dependency between successive sums. That makes SIMD parallelization illegal in the eyes of a compiler which, like rustc, guarantees bitwise floating-point reproducibility. Fair enough.
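For concreteness, here is a minimal sketch (illustrative values, not from my benchmark) of why the summation order is observable in the first place:

    fn main() {
        let (a, b, c) = (1.0e8_f32, -1.0e8_f32, 1.0_f32);
        // Left-to-right order, which iter().sum::<f32>() must preserve:
        let serial = (a + b) + c; // (1e8 - 1e8) + 1 == 1.0
        // Reassociated order, which a SIMD reduction would effectively use:
        let reassociated = a + (b + c); // 1e8 + (-1e8) == 0.0, the 1.0 is lost to rounding
        assert_ne!(serial, reassociated);
    }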

I saw this as a good motivation to move to explicit SIMD programming, but I did not want to lose hardware portability (or, more precisely, I wanted portability to remain easy), so I tried to see how close I could get to stdsimd on stable Rust using only autovectorization and a pinch of hardware-specific vector size tuning.


Some hours of trial and error later, I reached a reasonably satisfactory state. In particular, the core of the algorithm...

    // Perform concurrent SIMD accumulation
    let mut accumulators = [SimdF32::default(); CONCURRENCY];
    for chunk in chunks {
        for (accumulator, &vector) in accumulators.iter_mut().zip(chunk) {
            *accumulator += vector;
        }
    }

    // Merge the SIMD accumulators into one
    let mut stride = CONCURRENCY / 2;
    while stride > 0 {
        for i in 0..stride {
            accumulators[i] += accumulators[i + stride];
        }
        stride /= 2;
    }
    let mut accumulator = accumulators[0];

...translated pretty much into the assembly that I would have written by hand, which made me very happy...

.LBB0_17:
	addps	(%rdi), %xmm1
	addps	16(%rdi), %xmm3
	addps	32(%rdi), %xmm2
	addps	48(%rdi), %xmm4
	addq	$64, %rdi
	addq	$4, %rcx
	jne	.LBB0_17
	jmp	.LBB0_18

...

.LBB0_18:
	addps	%xmm4, %xmm3
	addps	%xmm2, %xmm1
	addps	%xmm3, %xmm1

Nitpicky as I am, however, I was still a little unhappy about the part that comes afterwards, which introduces a chain of serial dependencies that could grow long if I were to use many accumulators...

    for &vector in remainder {
        accumulator += vector;
    }

.LBB0_31:
	addps	(%rcx), %xmm1
	addq	$16, %rcx
	incq	%rax
	jne	.LBB0_31

...because I knew that in this particular case there should be an easy way to avoid it, namely interleaving the merging of the SIMD accumulators with the summation of the remaining data.

    // Reinterpret input as aligned SIMD vectors + some extra floats.
    let (peel, mut vectors, tail) = unsafe { input.align_to::<SimdF32>() };
    
    // Perform concurrent SIMD accumulation, starting at maximal concurrency and
    // decreasing once the number of input vectors gets too small
    let mut accumulators = [SimdF32::default(); MAX_CONCURRENCY];
    let mut concurrency = MAX_CONCURRENCY;
    while concurrency > 0 {
        // Set up accumulators and chunked SIMD data according to current concurrency
        let accumulators = &mut accumulators[..concurrency];
        let chunks = vectors.chunks_exact(concurrency);
        let remainder = chunks.remainder();

        // Perform concurrent SIMD accumulation
        for chunk in chunks {
            for (accumulator, &vector) in accumulators.iter_mut().zip(chunk) {
                *accumulator += vector;
            }
        }
        
        // Halve SIMD concurrency to accumulate remaining data
        concurrency /= 2;
        for i in 0..concurrency {
            accumulators[i] += accumulators[i+concurrency];
        }
        vectors = remainder;
    }
    let accumulator = accumulators[0];

However, much to my surprise, performing this algorithmic optimization leads rustc to heavily pessimize the inner loop code by spilling all but one accumulator on every iteration:

.LBB0_16:
	addps	(%rax), %xmm1
	movaps	16(%rsp), %xmm2
	movaps	32(%rsp), %xmm3
	movaps	48(%rsp), %xmm4
	addps	16(%rax), %xmm2
	movaps	%xmm2, 16(%rsp)
	addps	32(%rax), %xmm3
	addps	48(%rax), %xmm4
	movaps	%xmm3, 32(%rsp)
	movaps	%xmm4, 48(%rsp)
	addq	$64, %rax
	addq	$4, %rdi
	jne	.LBB0_16

Why would that happen? The only explanation I can think of is that rustc is somehow unable to prove that the accumulators slice does not alias the vectors/remainder slices, and therefore spills to memory just in case changes to the accumulators would affect the input of subsequent computations.

But this sounds like a bug to me: given that I have an &mut to the accumulators, my understanding is that rustc should be able to infer that no other code can see the accumulators, and thus they can remain resident in registers for the entire duration of the accumulation loop.
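
If that is what happens, I would expect a sketch along these lines (hypothetical, untested) to sidestep the problem by keeping the hot-loop accumulators in a by-value local array that never escapes:

    // Hypothetical workaround sketch: accumulate into a by-value local array,
    // so the hot loop only manipulates plain local values instead of storing
    // through a &mut slice that LLVM must prove disjoint from `chunks`.
    let mut locals = [SimdF32::default(); MAX_CONCURRENCY];
    locals[..concurrency].copy_from_slice(&accumulators[..concurrency]);
    for chunk in chunks {
        for (accumulator, &vector) in locals[..concurrency].iter_mut().zip(chunk) {
            *accumulator += vector;
        }
    }
    accumulators[..concurrency].copy_from_slice(&locals[..concurrency]);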

Can someone with more knowledge of how rustc and LLVM do their optimization magic cross-check this, and tell me whether my understanding is correct or whether the register spills are truly necessary to preserve the semantics of my code?


Also, this is on stable release 1.57.0. On beta and nightly, the generated code becomes even stranger:

.LBB0_16:
	movq	(%rdx), %rcx
	movq	8(%rdx), %rbx
	movd	%ebx, %xmm5
	shrq	$32, %rbx
	movd	%ecx, %xmm6
	shrq	$32, %rcx
	movd	%ecx, %xmm7
	punpckldq	%xmm6, %xmm7
	movd	%ebx, %xmm6
	punpckldq	%xmm5, %xmm6
	punpcklqdq	%xmm7, %xmm6
	addps	%xmm6, %xmm2
	addps	16(%rdx), %xmm1
	addps	32(%rdx), %xmm4
	addps	48(%rdx), %xmm3
	addq	$64, %rdx
	addq	$4, %rax
	jne	.LBB0_16

Here, rustc generates the code I would expect for the last three accumulators, but then it goes crazy on the first accumulator and emits the least efficient SSE load I have ever seen: two 64-bit scalar loads, split into 32-bit halves and reassembled into a 128-bit vector through a series of punpck shuffles, where a plain memory operand to addps would do, as it does for the other three accumulators.

So it seems the aliasing issue got resolved, but was replaced by another issue beyond my comprehension... Here again, compiler expert help would be useful.

HadrienG2 added the C-bug label Dec 20, 2021
@hkratz
Contributor

hkratz commented Dec 20, 2021

@rustbot label +T-compiler +I-slow +A-codegen +A-llvm

rustbot added the A-codegen, A-LLVM, I-slow, T-compiler labels Dec 20, 2021
@AngelicosPhosphoros
Contributor

Your splitting into CONCURRENCY interleaved streams is suboptimal.
You should split your vectors into chunks of size len() / CONCURRENCY + 1 instead of chunks of size CONCURRENCY. Your current version interacts badly with caches.
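
Something like this sketch, perhaps (hypothetical code, reusing the SimdF32 type sketched above): each accumulator reduces one contiguous sub-slice instead of the accumulators striding across the input with step CONCURRENCY.

    // Hypothetical sketch of the suggested layout: one contiguous sub-slice of
    // roughly len() / CONCURRENCY vectors per accumulator.
    let chunk_len = vectors.len() / CONCURRENCY + 1;
    for (accumulator, chunk) in accumulators.iter_mut().zip(vectors.chunks(chunk_len)) {
        for &vector in chunk {
            *accumulator += vector;
        }
    }

Note that each sub-slice is then reduced by a single serial chain of additions, so this layout trades instruction-level parallelism for spatial locality.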

@HadrienG2
Author

@AngelicosPhosphoros: Nice catch, this might be the reason why I have so far not managed to get this code above ~50% of the CPU's peak load throughput (with inputs in the L1 cache). I'll try this suggestion out, among other ideas.

@HadrienG2
Author

@AngelicosPhosphoros On a Zen 2 CPU at least, increasing the load stride makes no difference to throughput. What did make a difference, however, was remembering to add codegen-units = 1 to my bench cargo profile, so that rustc stops generating bad code in unpredictable ways. With extra tuning, this gave me a peak throughput of 85% of the theoretical hardware maximum in SSE mode and 73% in AVX mode, which is good enough for me.

@AngelicosPhosphoros
Contributor

Yeah, I really love codegen-units=1 and lto="fat" for production builds.
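
For reference, a sketch of the corresponding Cargo.toml profile settings:

    [profile.release]
    codegen-units = 1
    lto = "fat"

    # Benches inherit from the release profile unless overridden.
    [profile.bench]
    codegen-units = 1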

workingjubilee added the A-autovectorization label Jul 22, 2023