
Suboptimal eq compilation on structs compared to equivalent C++ code #106269

Closed
Kmeakin opened this issue Dec 29, 2022 · 9 comments · Fixed by #124299
Labels
A-LLVM  Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-bug  Category: This is a bug.
E-needs-test  Call for participation: Writing correctness tests.
I-slow  Issue: Problems and improvements with respect to performance of generated code.
T-compiler  Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

Kmeakin (Contributor) commented Dec 29, 2022

The eq implementation produces worse code than the equivalent C++ code (Rust, C++)

I tried this code:

pub struct S {
    a: u8,
    b: u8,
    c: u8,
    d: u8,
}

pub fn eq(s1: &S, s2: &S) -> bool {
    s1.a == s2.a && s1.b == s2.b && s1.c == s2.c && s1.d == s2.d
}

I expected to see this happen:

The resulting assembly should load and compare a single u32:

example::eq:
        ldr     w8, [x0]
        ldr     w9, [x1]
        cmp     w8, w9
        cset    w0, eq
        ret

Instead, this happened:

The assembly loads and compares each u8 individually:

example::eq:
        ldrb    w9, [x0]
        mov     w8, wzr
        ldrb    w10, [x1]
        cmp     w9, w10
        b.ne    .LBB0_4
        ldrb    w9, [x0, #1]
        ldrb    w10, [x1, #1]
        cmp     w9, w10
        b.ne    .LBB0_4
        ldrb    w8, [x0, #2]
        ldrb    w9, [x1, #2]
        cmp     w8, w9
        b.ne    .LBB0_5
        ldrb    w8, [x0, #3]
        ldrb    w9, [x1, #3]
        cmp     w8, w9
        cset    w8, eq
.LBB0_4:
        mov     w0, w8
        ret
.LBB0_5:
        mov     w0, wzr
        ret

Meta

rustc --version --verbose:

rustc 1.68.0-nightly (88c58e3c2 2022-12-26)
binary: rustc
commit-hash: 88c58e3c2c097ebffac425d9e080dcb1aadf790e
commit-date: 2022-12-26
host: x86_64-unknown-linux-gnu
release: 1.68.0-nightly
LLVM version: 15.0.6
Kmeakin added the C-bug label on Dec 29, 2022
Nilstrieb added the I-slow label on Dec 29, 2022
leonardo-m commented
With more aggressive compilation flags I am seeing (latest Nightly):

example::eq:
        mov     eax, dword ptr [rdi]
        cmp     eax, dword ptr [rsi]
        sete    al
        ret

Kmeakin (Contributor, Author) commented Dec 29, 2022

With more aggressive compilation flags I am seeing (latest Nightly):

example::eq:
        mov     eax, dword ptr [rdi]
        cmp     eax, dword ptr [rsi]
        sete    al
        ret

What flags did you use?

leonardo-m commented
-O -Z mir-opt-level=4

nikic added the A-LLVM label on Dec 29, 2022
nikic (Contributor) commented Dec 29, 2022

Looks like the first two comparisons get flattened into one block, which prevents detection of the whole pattern.

clubby789 (Contributor) commented
SeparateConstSwitch (enabled at mir-opt-level >= 4) is the MIR opt that merges the short-circuiting failure paths into one, which allows SimplifyCFG to reduce this to a pattern that can be turned into a memcmp.

scottmcm (Member) commented Jan 6, 2023

#106294 might improve this.

I noticed in https://rust.godbolt.org/z/q768EYdqq that it's doing <2 x i8> dances:

  %0 = load <2 x i8>, ptr %s1, align 1, !dbg !11
  %1 = load <2 x i8>, ptr %s2, align 1, !dbg !12
  %2 = icmp eq <2 x i8> %0, %1, !dbg !11
  %3 = extractelement <2 x i1> %2, i64 0, !dbg !11
  %4 = extractelement <2 x i1> %2, i64 1, !dbg !11

which might no longer be needed once it's !noundef, so it can just compare them both directly.


There's also something going on here related to the short-circuiting that I don't understand.

If I change it to

pub fn eq(s1: &S, s2: &S) -> bool {
    (s1.a == s2.a) & (s1.b == s2.b) & (s1.c == s2.c) & (s1.d == s2.d)
}

Then it optimizes as expected

define noundef zeroext i1 @_ZN7example2eq17h1d431ecac5604099E(ptr noalias nocapture noundef readonly align 1 dereferenceable(4) %s1, ptr noalias nocapture noundef readonly align 1 dereferenceable(4) %s2) unnamed_addr #0 !dbg !6 {
  %0 = load i32, ptr %s1, align 1, !dbg !11
  %1 = load i32, ptr %s2, align 1, !dbg !12
  %2 = icmp eq i32 %0, %1, !dbg !13
  ret i1 %2, !dbg !14
}

https://rust.godbolt.org/z/ThfqaW3cr

It's definitely marked dereferenceable(4) in both versions, so LLVM ought to be able to move the loads early either way.

nikic (Contributor) commented Jan 6, 2023

@scottmcm Short-circuiting is relevant in that moving everything into one block is generally non-profitable. There is a special pass that detects this pattern (MergeICmps) and converts it into memcmp/bcmp, which is then later expanded into a target-specific efficient comparison sequence. This is a backend IR pass, so you don't see it in --emit=llvm-ir output.

However, it requires a pretty specific pattern, and does not handle the case where the control flow has been partially flattened and one pair of comparisons uses a select rather than a branch.

Your X86 example adds insult to injury by actually vectorizing those two i8 comparisons: https://llvm.godbolt.org/z/qPe6Kfq6e That looks pretty clearly non-profitable: https://llvm.godbolt.org/z/dT85WE8xn

Edit: Filed llvm/llvm-project#59867 for the SLPVectorizer issue.
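
For comparison, a source-level pattern that already takes the memcmp/bcmp path today is comparing the two structs as raw byte arrays. This is only an illustrative sketch (the helper name eq_bytes is made up for illustration); it assumes S is exactly four u8 fields with no padding, and that equality on [u8; 4] lowers to a bcmp that the backend then expands into a single 4-byte compare:

pub fn eq_bytes(s1: &S, s2: &S) -> bool {
    // Assumption: S is four u8 fields, so it is exactly 4 bytes with no
    // padding and every bit pattern is valid, making this reinterpretation
    // sound.
    let a: &[u8; 4] = unsafe { &*(s1 as *const S as *const [u8; 4]) };
    let b: &[u8; 4] = unsafe { &*(s2 as *const S as *const [u8; 4]) };
    // Byte-array equality goes through the memcmp/bcmp path, which the
    // backend expands into one 4-byte load-and-compare.
    a == b
}

With optimizations enabled this should produce the single load/load/cmp/cset sequence from the issue description, without relying on MergeICmps recognizing the short-circuiting CFG.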

workingjubilee (Contributor) commented Feb 20, 2023

I suppose this opt would specifically be enabled by our optimization semantics allowing spurious reads. No change since the noundef addition, though. The SeparateConstSwitch pass may be affected by #107009 being accepted.

Nilstrieb added the T-compiler label on Apr 5, 2023
Kmeakin closed this as completed on Apr 13, 2024
clubby789 (Contributor) commented
Looks like this is fixed now; we should have a test for this if we don't already.
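
A rough sketch of what such an assembly test could look like (the compiletest directives, target restrictions, and CHECK lines here are assumptions for illustration, not the actual test added in #124299):

//@ assembly-output: emit-asm
//@ compile-flags: -Copt-level=3
//@ only-x86_64
//@ only-linux

#![crate_type = "lib"]

pub struct S {
    a: u8,
    b: u8,
    c: u8,
    d: u8,
}

// The key property is that the whole struct is compared with a single
// 4-byte compare and no conditional branch per field.
// CHECK-LABEL: eq:
// CHECK: cmp
// CHECK-NOT: jne
// CHECK: sete
// CHECK: ret
#[no_mangle]
pub fn eq(s1: &S, s2: &S) -> bool {
    s1.a == s2.a && s1.b == s2.b && s1.c == s2.c && s1.d == s2.d
}

Checking for the absence of a conditional branch is the important part; the exact instruction sequence matters less and may vary between LLVM versions.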

clubby789 reopened this on Apr 13, 2024
clubby789 added the E-needs-test label on Apr 13, 2024
bors added a commit to rust-lang-ci/rust that referenced this issue Apr 29, 2024
Add test for issue 106269

Closes rust-lang#106269

Made this an assembly test as the LLVM codegen is still quite verbose and doesn't really indicate the behaviour we want
bors closed this as completed in 4b6c191 on Apr 30, 2024
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Apr 30, 2024
Rollup merge of rust-lang#124299 - clubby789:106269-test, r=nikic
