Sub-optimal codegen: Unnecessarily dumping AVX registers to stack #71025

@ejmahler

Description

I tried this code (example 1), in which we have a public function mutate_array that internally calls mutate_chunk:

use std::arch::x86_64::*;

#[inline(always)]
pub unsafe fn mutate_chunk(rows: [__m256d; 4]) -> [__m256d; 4] {
    [
        _mm256_permute2f128_pd(rows[0], rows[1], 0x20),
        _mm256_permute2f128_pd(rows[2], rows[3], 0x20),
        _mm256_permute2f128_pd(rows[0], rows[1], 0x31),
        _mm256_permute2f128_pd(rows[2], rows[3], 0x31),
    ]
}


#[target_feature(enable = "avx")]
pub unsafe fn mutate_array(input: *const f64, output: *mut f64) {
    let mut input_data = [_mm256_setzero_pd(); 4];

    for i in 0..4 {
        input_data[i] = _mm256_loadu_pd(input.add(4*i));
    }

    let output_data = mutate_chunk(input_data);

    for i in 0..4 {
        _mm256_storeu_pd(output.add(4*i), output_data[i]);
    }
}

This is a very stripped-down example of a pattern that appears all over my project: we load data into AVX registers, do some operation on the loaded data, then store the results back to memory. The (more or less) optimal assembly for this example code is:

example::mutate_array:
        vmovups ymm0, ymmword ptr [rdi]
        vmovups ymm1, ymmword ptr [rdi + 32]
        vmovups ymm2, ymmword ptr [rdi + 64]
        vmovups ymm3, ymmword ptr [rdi + 96]
        vperm2f128      ymm4, ymm0, ymm1, 32
        vperm2f128      ymm5, ymm2, ymm3, 49
        vperm2f128      ymm0, ymm0, ymm1, 49
        vperm2f128      ymm1, ymm2, ymm3, 32
        vmovups ymmword ptr [rsi], ymm4
        vmovups ymmword ptr [rsi + 32], ymm1
        vmovups ymmword ptr [rsi + 64], ymm0
        vmovups ymmword ptr [rsi + 96], ymm5
        vzeroupper
        ret

4 loads, 4 permutes, 4 stores.

As you can see from the godbolt link (example 1), the actual generated assembly is quite a bit longer:

example::mutate_array:
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 288
        vmovups ymm0, ymmword ptr [rdi]
        vmovups ymm1, ymmword ptr [rdi + 32]
        vmovups ymm2, ymmword ptr [rdi + 64]
        vmovups ymm3, ymmword ptr [rdi + 96]
        vmovaps ymmword ptr [rsp + 96], ymm3
        vmovaps ymmword ptr [rsp + 64], ymm2
        vmovaps ymmword ptr [rsp + 32], ymm1
        vmovaps ymmword ptr [rsp], ymm0
        vmovaps ymm0, ymmword ptr [rsp]
        vmovaps ymm1, ymmword ptr [rsp + 32]
        vmovaps ymm2, ymmword ptr [rsp + 64]
        vmovaps ymm3, ymmword ptr [rsp + 96]
        vperm2f128      ymm4, ymm0, ymm1, 32
        vmovaps ymmword ptr [rsp + 128], ymm4
        vperm2f128      ymm4, ymm2, ymm3, 32
        vperm2f128      ymm0, ymm0, ymm1, 49
        vperm2f128      ymm1, ymm2, ymm3, 49
        vmovaps ymmword ptr [rsp + 160], ymm4
        vmovaps ymmword ptr [rsp + 192], ymm0
        vmovaps ymmword ptr [rsp + 224], ymm1
        vmovaps ymm0, ymmword ptr [rsp + 224]
        vmovups ymmword ptr [rsi + 96], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 192]
        vmovups ymmword ptr [rsi + 64], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 160]
        vmovups ymmword ptr [rsi + 32], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 128]
        vmovups ymmword ptr [rsi], ymm0
        mov     rsp, rbp
        pop     rbp
        vzeroupper
        ret

The second assembly block is the same as the first, except for the added reads and writes through rsp (i.e., the stack). It loads the 4 values from memory fine -- but before running the permutes, it spills the values to the stack, then immediately reads them back. The same thing happens after the permutes: before writing the data to the output, it spills it to the stack, then immediately reads it back.
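For anyone reproducing this outside godbolt, assembly like the above can be generated with something along these lines (these exact flags are my assumption; the report itself used godbolt):

rustc -O --crate-type=lib --emit=asm -C llvm-args=-x86-asm-syntax=intel example.rs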

It's possible to nudge the compiler into generating the optimal output by partially unrolling the input and output loops.

By changing the input loop

for i in 0..4 {
    input_data[i] = _mm256_loadu_pd(input.add(4*i));
}

to

for i in 0..2 {
    input_data[i*2] =   _mm256_loadu_pd(input.add(8*i));
    input_data[i*2+1] = _mm256_loadu_pd(input.add(8*i + 4));
}

we get a functionally identical loop, but the compiler no longer writes the inputs to the stack (example 2).

We can apply the same treatment to the output loop, completely eliminating the stack reads and writes: example 3.
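For reference, the output-loop half of that change would look something like this (my sketch of the same mechanical transformation; example 3 in the godbolt links is the authoritative version):

for i in 0..2 {
    _mm256_storeu_pd(output.add(8*i),     output_data[i*2]);
    _mm256_storeu_pd(output.add(8*i + 4), output_data[i*2+1]);
}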

Without knowing anything about the internals of the compiler, I can imagine two possibilities here:

  1. Example 1 demonstrates a trivial missed optimization: the compiler unrolls the loop, but fails to determine that it can eliminate the array. As a result, it more than doubles the function's instruction count, tanking performance.
  2. Alternatively, something in Rust's semantics requires all arrays to have an in-memory representation, so they aren't allowed to be optimized away entirely to register storage, even when they never escape a function. If this is the case, then examples 2 and 3 demonstrate a code generation bug, because in them we can clearly see that the storage for the array was completely optimized away. (One way to probe this hypothesis is sketched below.)
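If possibility 2 were true, the spills should come back whenever an array is involved, so one way to probe it is to avoid the array entirely and pass the four vectors individually. The sketch below is hypothetical (mutate_chunk_tuple is my name, not something from the examples); if it also compiles to the 4-load/4-permute/4-store form, that points at the array handling rather than any rule about arrays:

use std::arch::x86_64::*;

#[inline(always)]
pub unsafe fn mutate_chunk_tuple(
    r0: __m256d, r1: __m256d, r2: __m256d, r3: __m256d,
) -> (__m256d, __m256d, __m256d, __m256d) {
    // Identical shuffle to mutate_chunk, with no array anywhere
    (
        _mm256_permute2f128_pd(r0, r1, 0x20),
        _mm256_permute2f128_pd(r2, r3, 0x20),
        _mm256_permute2f128_pd(r0, r1, 0x31),
        _mm256_permute2f128_pd(r2, r3, 0x31),
    )
}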

Meta

rustc --version --verbose:

rustc 1.44.0-nightly (42abbd887 2020-04-07)
binary: rustc
commit-hash: 42abbd8878d3b67238f3611b0587c704ba94f39c
commit-date: 2020-04-07
host: x86_64-pc-windows-msvc
release: 1.44.0-nightly
LLVM version: 9.0

Labels

A-SIMD: Area: SIMD (Single Instruction Multiple Data)
A-codegen: Area: Code generation
C-enhancement: Category: An issue proposing an enhancement or a PR with one
C-optimization: Category: An issue highlighting optimization opportunities or PRs implementing such
I-slow: Issue: Problems and improvements with respect to performance of generated code
T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue
