Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sub-optimal code-gen for branchless pointer increment #117128

Open
Voultapher opened this issue Oct 24, 2023 · 0 comments
Open

Sub-optimal code-gen for branchless pointer increment #117128

Voultapher opened this issue Oct 24, 2023 · 0 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@Voultapher
Copy link
Contributor

A common idiom in branchless (jumpless) code is to either increment a pointer by 0 or 1 which can be derived from a bool. The obvious way to do this is ptr = ptr.add(bool_val as usize) however doing so results in slightly sub-optimal code-gen on both x86_64 and Arm.

A minimal example:

pub unsafe fn inc_direct(v: &[u64], mut side_effect_out: *mut *const u64) {
    for elem in v {
        let cond = *elem < 30;
        *side_effect_out = elem;
        side_effect_out = side_effect_out.add(cond as usize);
    }
}

Generated machine-code for the loop:

x86_64

.LBB0_2:
        xor     eax, eax
        cmp     qword ptr [rdi], 30
        mov     qword ptr [rdx], rdi
        lea     rdi, [rdi + 8]
        setb    al
        lea     rdx, [rdx + 8*rax]
        add     rsi, -8
        jne     .LBB0_2

Arm

.LBB0_2:
        ldr     x10, [x9], #8
        str     x0, [x2]
        mov     x0, x9
        cmp     x10, #30
        cset    w10, lo
        subs    x8, x8, #8
        add     x2, x2, w10, uxtw #3
        b.ne    .LBB0_2

Notably, on x86_64 the use of xor + setb + lea. And cset on Arm.

Compared to a version using an additional counter with counter += cond as usize;

pub unsafe fn inc_via_counter(v: &[u64], mut side_effect_out: *mut *const u64) {
    let mut cond_count = 0;
    for elem in v {
        let cond = *elem < 30;
        *(side_effect_out.add(cond_count)) = elem;
        cond_count += cond as usize;
    }
}

Generated machine-code for the loop:

x86_64

.LBB1_2:
        cmp     qword ptr [rdi], 30
        mov     qword ptr [rdx + 8*rax], rdi
        lea     rdi, [rdi + 8]
        adc     rax, 0
        add     rsi, -8
        jne     .LBB1_2

Arm

.LBB1_2:
        ldr     x11, [x10], #8
        str     x0, [x2, x8, lsl #3]
        mov     x0, x10
        cmp     x11, #30
        cinc    x8, x8, lo
        subs    x9, x9, #8
        b.ne    .LBB1_2

Notably the use of adc on x86_64 and cinc on Arm. Using uica and LLVM to simulate the performance of the respective loops we get:

inc_direct cycles per loop iteration:

Tool Skylake Sunny Cove
uiCA 2.01 1.87
llvm-mca 1.59 1.59

inc_via_counter cycles per loop iteration:

Tool Skylake Sunny Cove
uiCA 1.31 1.56
llvm-mca 1.33 1.33

Simulations as part of some larger code structure are less conclusive, but one thing is certain, for both architectures it's less instructions and worst case same perf. Ideally rustc could generate the more efficient code for the obvious direct increment version.

Meta

rustc --version --verbose:

rustc 1.75.0-nightly (187b8131d 2023-10-03)
binary: rustc
commit-hash: 187b8131d4f760f856b214fce34534903276f2ef
commit-date: 2023-10-03
host: x86_64-unknown-linux-gnu
release: 1.75.0-nightly
LLVM version: 17.0.2
@Voultapher Voultapher added the C-bug Category: This is a bug. label Oct 24, 2023
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Oct 24, 2023
@nikic nikic added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Oct 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

3 participants