vshrq_n_u8 does not generate ushr instruction when used in loop #82072

cberner · 2021-02-13T21:33:52Z

I tried this code:

#![feature(stdsimd)]
#![feature(aarch64_target_feature)]

#[target_feature(enable = "neon")]
pub unsafe fn fused_addassign_mul_scalar_neon(output_ptr: *mut u8, input_ptr: *const u8) {
    use std::arch::aarch64::*;

    // Process two 16-byte vectors; each iteration is expected to compile to one `ushr`.
    for i in 0..2 {
        let input = vld1q_u8(input_ptr.add(i * 16));
        // Logical right shift of every byte lane by 4, keeping the high nibble.
        let hi_bits = vshrq_n_u8(input, 4);
        *(output_ptr as *mut uint8x16_t).add(i) = hi_bits;
    }
}

https://godbolt.org/z/5dzxf6

I expected to see this happen: the ushr instruction is emitted twice (once per loop iteration), since vshrq_n_u8 is documented as generating the ushr instruction.

Instead, this happened: ushr is emitted only once, followed by 16 single-byte load and shift instructions.

If the loop range is changed to 0..1, a single ushr instruction is generated, so the problem seems to be in how the second iteration of the loop is optimized.
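For comparison, the per-byte result that both iterations should produce can be sketched in portable scalar Rust. This is only an illustrative reference and not part of the original report: the name `shift_hi_bits_scalar` is hypothetical, and the fixed 32-byte arrays are an assumption that mirrors the two 16-byte vectors handled by the loop above.

```rust
/// Scalar reference for the loop in the report: every input byte is
/// logically shifted right by 4, which is what `ushr #4` does per byte lane.
/// `shift_hi_bits_scalar` is a hypothetical name used only for illustration.
pub fn shift_hi_bits_scalar(output: &mut [u8; 32], input: &[u8; 32]) {
    for (out, byte) in output.iter_mut().zip(input.iter()) {
        // `>>` on `u8` is a logical (zero-filling) shift.
        *out = *byte >> 4;
    }
}

fn main() {
    // Fill the input with a simple pattern: 0, 8, 16, ..., 248.
    let mut input = [0u8; 32];
    for (i, b) in input.iter_mut().enumerate() {
        *b = (i as u8) * 8;
    }

    let mut output = [0u8; 32];
    shift_hi_bits_scalar(&mut output, &input);
    println!("{:?}", output);
}
```

A reference like this can be used to check the NEON version on an aarch64 target by running both on the same input and comparing the output buffers.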

Meta

rustc --version --verbose:

rustc 1.52.0-nightly (3f5aee2d5 2021-02-12)
binary: rustc
commit-hash: 3f5aee2d5241139d808f4fdece0026603489afd1
commit-date: 2021-02-12
host: aarch64-unknown-linux-gnu
release: 1.52.0-nightly
LLVM version: 11.0.1


workingjubilee · 2021-10-05T23:02:15Z

On the current nightly:

example::fused_addassign_mul_scalar_neon:
        ldr     q0, [x1]
        ushr    v0.16b, v0.16b, #4
        str     q0, [x0]
        ldr     q0, [x1, #16]
        ushr    v0.16b, v0.16b, #4
        str     q0, [x0, #16]
        ret

So this seems to be fixed. Thank you for reporting!


cberner added the C-bug Category: This is a bug. label Feb 13, 2021

workingjubilee closed this as completed Oct 5, 2021

cberner added a commit to cberner/raptorq that referenced this issue Oct 17, 2021

Use vshrq_n_u8 in neon optimizations

73c24e8

Now that rust-lang/rust#82072 is fixed this intrinsic works and improves mulassign & FMA performance by ~30% on Raspberry Pi 3 B+. End to end speedup is ~5%

cberner added a commit to cberner/raptorq that referenced this issue Oct 17, 2021

Use vshrq_n_u8 in neon optimizations

88959e0

Now that rust-lang/rust#82072 is fixed this intrinsic works and improves mulassign & FMA performance by ~30% on Raspberry Pi 3 B+. End to end speedup is ~5%
