Sub-optimal codegen: Unnecessarily dumping AVX registers to stack #71025

@ejmahler

Description

I tried this code (example 1), in which we have a public function mutate_array that internally calls mutate_chunk:

use std::arch::x86_64::*;

#[inline(always)]
pub unsafe fn mutate_chunk(rows: [__m256d; 4]) -> [__m256d; 4] {
    [
        _mm256_permute2f128_pd(rows[0], rows[1], 0x20),
        _mm256_permute2f128_pd(rows[2], rows[3], 0x20),
        _mm256_permute2f128_pd(rows[0], rows[1], 0x31),
        _mm256_permute2f128_pd(rows[2], rows[3], 0x31),
    ]
}


#[target_feature(enable = "avx")]
pub unsafe fn mutate_array(input: *const f64, output: *mut f64) {
    let mut input_data = [_mm256_setzero_pd(); 4];

    for i in 0..4 {
        input_data[i] = _mm256_loadu_pd(input.add(4*i));
    }

    let output_data = mutate_chunk(input_data);

    for i in 0..4 {
        _mm256_storeu_pd(output.add(4*i), output_data[i]);
    }
}

This is a very stripped-down example of a pattern that appears all over my project: we load data into AVX registers, do some operation on the loaded data, then store the results back to memory. The (more or less) optimal assembly for this example code is:

example::mutate_array:
        vmovups ymm0, ymmword ptr [rdi]
        vmovups ymm1, ymmword ptr [rdi + 32]
        vmovups ymm2, ymmword ptr [rdi + 64]
        vmovups ymm3, ymmword ptr [rdi + 96]
        vperm2f128      ymm4, ymm0, ymm1, 32
        vperm2f128      ymm5, ymm2, ymm3, 49
        vperm2f128      ymm0, ymm0, ymm1, 49
        vperm2f128      ymm1, ymm2, ymm3, 32
        vmovups ymmword ptr [rsi], ymm4
        vmovups ymmword ptr [rsi + 32], ymm1
        vmovups ymmword ptr [rsi + 64], ymm0
        vmovups ymmword ptr [rsi + 96], ymm5
        vzeroupper
        ret

4 loads, 4 permutes, 4 stores.

As you can see from the godbolt link (example 1), the actual generated assembly is quite a bit longer:

example::mutate_array:
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 288
        vmovups ymm0, ymmword ptr [rdi]
        vmovups ymm1, ymmword ptr [rdi + 32]
        vmovups ymm2, ymmword ptr [rdi + 64]
        vmovups ymm3, ymmword ptr [rdi + 96]
        vmovaps ymmword ptr [rsp + 96], ymm3
        vmovaps ymmword ptr [rsp + 64], ymm2
        vmovaps ymmword ptr [rsp + 32], ymm1
        vmovaps ymmword ptr [rsp], ymm0
        vmovaps ymm0, ymmword ptr [rsp]
        vmovaps ymm1, ymmword ptr [rsp + 32]
        vmovaps ymm2, ymmword ptr [rsp + 64]
        vmovaps ymm3, ymmword ptr [rsp + 96]
        vperm2f128      ymm4, ymm0, ymm1, 32
        vmovaps ymmword ptr [rsp + 128], ymm4
        vperm2f128      ymm4, ymm2, ymm3, 32
        vperm2f128      ymm0, ymm0, ymm1, 49
        vperm2f128      ymm1, ymm2, ymm3, 49
        vmovaps ymmword ptr [rsp + 160], ymm4
        vmovaps ymmword ptr [rsp + 192], ymm0
        vmovaps ymmword ptr [rsp + 224], ymm1
        vmovaps ymm0, ymmword ptr [rsp + 224]
        vmovups ymmword ptr [rsi + 96], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 192]
        vmovups ymmword ptr [rsi + 64], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 160]
        vmovups ymmword ptr [rsi + 32], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 128]
        vmovups ymmword ptr [rsi], ymm0
        mov     rsp, rbp
        pop     rbp
        vzeroupper
        ret

The second assembly block is the same as the first, except for the added reads and writes through rsp (i.e., the stack). It loads the 4 values from memory fine -- but before running the permutes, it spills the values to the stack, then immediately reads them back. The same thing happens after the permutes: before writing the data to the output, it spills it to the stack, then immediately reads it back.
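For anyone reproducing this outside godbolt, assembly like the above can be generated with something along these lines (these exact flags are my assumption; the report itself used godbolt):

rustc -O --crate-type=lib --emit=asm -C llvm-args=-x86-asm-syntax=intel example.rs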

It's possible to nudge the compiler into generating the optimal output by partially unrolling the input and output loops.

By changing the input loop

for i in 0..4 {
    input_data[i] = _mm256_loadu_pd(input.add(4*i));
}

to

for i in 0..2 {
    input_data[i*2] =   _mm256_loadu_pd(input.add(8*i));
    input_data[i*2+1] = _mm256_loadu_pd(input.add(8*i + 4));
}

we get a functionally identical loop, but the compiler no longer writes the inputs to the stack (example 2).

We can apply the same treatment to the output loop, completely eliminating the stack reads and writes: example 3.
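For reference, the output-loop half of that change would look something like this (my sketch of the same mechanical transformation; example 3 in the godbolt links is the authoritative version):

for i in 0..2 {
    _mm256_storeu_pd(output.add(8*i),     output_data[i*2]);
    _mm256_storeu_pd(output.add(8*i + 4), output_data[i*2+1]);
}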

Without knowing anything about the internals of the compiler, I can imagine two possibilities here:

  1. Example 1 demonstrates a trivial missed optimization: the compiler unrolls the loop, but fails to determine that it can eliminate the array. As a result, it more than doubles the function's instruction count, tanking performance.
  2. Alternatively, something in Rust's semantics requires all arrays to have an in-memory representation, so they aren't allowed to be optimized away entirely to register storage, even when they never escape a function. If this is the case, then examples 2 and 3 demonstrate a code generation bug, because in them we can clearly see that the storage for the array was completely optimized away. (One way to probe this hypothesis is sketched below.)
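If possibility 2 were true, the spills should come back whenever an array is involved, so one way to probe it is to avoid the array entirely and pass the four vectors individually. The sketch below is hypothetical (mutate_chunk_tuple is my name, not something from the examples); if it also compiles to the 4-load/4-permute/4-store form, that points at the array handling rather than any rule about arrays:

use std::arch::x86_64::*;

#[inline(always)]
pub unsafe fn mutate_chunk_tuple(
    r0: __m256d, r1: __m256d, r2: __m256d, r3: __m256d,
) -> (__m256d, __m256d, __m256d, __m256d) {
    // Identical shuffle to mutate_chunk, with no array anywhere
    (
        _mm256_permute2f128_pd(r0, r1, 0x20),
        _mm256_permute2f128_pd(r2, r3, 0x20),
        _mm256_permute2f128_pd(r0, r1, 0x31),
        _mm256_permute2f128_pd(r2, r3, 0x31),
    )
}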

Meta

rustc --version --verbose:

rustc 1.44.0-nightly (42abbd887 2020-04-07)
binary: rustc
commit-hash: 42abbd8878d3b67238f3611b0587c704ba94f39c
commit-date: 2020-04-07
host: x86_64-pc-windows-msvc
release: 1.44.0-nightly
LLVM version: 9.0

Labels

A-SIMD: Area: SIMD (Single Instruction Multiple Data)
A-codegen: Area: Code generation
C-enhancement: Category: An issue proposing an enhancement or a PR with one
C-optimization: Category: An issue highlighting optimization opportunities or PRs implementing such
I-slow: Issue: Problems and improvements with respect to performance of generated code
T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue
