Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vshrq_n_u8 does not generate ushr instruction when used in loop #82072

Closed
cberner opened this issue Feb 13, 2021 · 1 comment
Closed

vshrq_n_u8 does not generate ushr instruction when used in loop #82072

cberner opened this issue Feb 13, 2021 · 1 comment
Labels
A-codegen Area: Code generation A-SIMD Area: SIMD (Single Instruction Multiple Data) C-bug Category: This is a bug. O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state requires-nightly This issue requires a nightly compiler in some way. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@cberner
Copy link

cberner commented Feb 13, 2021

I tried this code:

#![feature(stdsimd)]
#![feature(aarch64_target_feature)]


#[target_feature(enable = "neon")]
pub unsafe fn fused_addassign_mul_scalar_neon(output_ptr: *mut u8, input_ptr: *const u8) {
    use std::arch::aarch64::*;
  
    for i in 0..2 {
        let input = vld1q_u8(input_ptr.add(i * 16));
        let hi_bits = vshrq_n_u8(input, 4);
        *(output_ptr as *mut uint8x16_t).add(i) = hi_bits;
    }
}

https://godbolt.org/z/5dzxf6

I expected to see this happen: the ushr instruction used twice, since vshrq_n_u8 is documented as generating the ushr instruction

Instead, this happened: the ushr is used once, and then 16 single byte load and shift instructions are used

If the loop range is changed to 0..1 then a single ushr instruction is generated, so it seems to be an issue optimizing the second iteration of the loop correctly.

Meta

rustc --version --verbose:

rustc 1.52.0-nightly (3f5aee2d5 2021-02-12)
binary: rustc
commit-hash: 3f5aee2d5241139d808f4fdece0026603489afd1
commit-date: 2021-02-12
host: aarch64-unknown-linux-gnu
release: 1.52.0-nightly
LLVM version: 11.0.1
@cberner cberner added the C-bug Category: This is a bug. label Feb 13, 2021
@jonas-schievink jonas-schievink added A-codegen Area: Code generation A-SIMD Area: SIMD (Single Instruction Multiple Data) O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state requires-nightly This issue requires a nightly compiler in some way. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 13, 2021
@workingjubilee
Copy link
Member

On the current nightly:

example::fused_addassign_mul_scalar_neon:
        ldr     q0, [x1]
        ushr    v0.16b, v0.16b, #4
        str     q0, [x0]
        ldr     q0, [x1, #16]
        ushr    v0.16b, v0.16b, #4
        str     q0, [x0, #16]
        ret

So this seems to be fixed. Thank you for reporting!

cberner added a commit to cberner/raptorq that referenced this issue Oct 17, 2021
Now that rust-lang/rust#82072 is fixed this
intrinsic works and improves mulassign & FMA performance by ~30% on
Raspberry Pi 3 B+. End to end speedup is ~5%
cberner added a commit to cberner/raptorq that referenced this issue Oct 17, 2021
Now that rust-lang/rust#82072 is fixed this
intrinsic works and improves mulassign & FMA performance by ~30% on
Raspberry Pi 3 B+. End to end speedup is ~5%
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-SIMD Area: SIMD (Single Instruction Multiple Data) C-bug Category: This is a bug. O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state requires-nightly This issue requires a nightly compiler in some way. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

3 participants