Bad codegen using u16::to/from_be_bytes #126419

@mkeeter

Description

I wrote code to increment a big-endian u16 within a circular buffer of u8 values. There are two implementations: one using u16::to/from_be_bytes, the other a naive mask-and-shift version.

pub fn incr_u16_a(slice: &mut [u8; 256], i: u8) {
    let hi = slice[i as usize];
    let lo = slice[i.wrapping_add(1) as usize];
    let mut v = u16::from_be_bytes([hi, lo]);
    v = v.wrapping_add(1);
    let [hi, lo] = v.to_be_bytes();
    slice[i as usize] = hi;
    slice[i.wrapping_add(1) as usize] = lo;
}

pub fn incr_u16_b(slice: &mut [u8; 256], i: u8) {
    let hi = slice[i as usize];
    let lo = slice[i.wrapping_add(1) as usize];
    let mut v = ((hi as u16) << 8) | (lo as u16);
    v = v.wrapping_add(1);
    let hi = (v >> 8) as u8;
    let lo = v as u8;
    slice[i as usize] = hi;
    slice[i.wrapping_add(1) as usize] = lo;
}
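As a sanity check, the two versions really are behaviorally equivalent; a brute-force sketch over every starting index (re-declaring the functions above so the snippet compiles standalone) passes:

```rust
// Equivalence check: both functions must leave the buffer in the same state
// for every starting index, including the wrap-around case at i == 255.
pub fn incr_u16_a(slice: &mut [u8; 256], i: u8) {
    let hi = slice[i as usize];
    let lo = slice[i.wrapping_add(1) as usize];
    let v = u16::from_be_bytes([hi, lo]).wrapping_add(1);
    let [hi, lo] = v.to_be_bytes();
    slice[i as usize] = hi;
    slice[i.wrapping_add(1) as usize] = lo;
}

pub fn incr_u16_b(slice: &mut [u8; 256], i: u8) {
    let hi = slice[i as usize];
    let lo = slice[i.wrapping_add(1) as usize];
    let v = (((hi as u16) << 8) | (lo as u16)).wrapping_add(1);
    slice[i as usize] = (v >> 8) as u8;
    slice[i.wrapping_add(1) as usize] = v as u8;
}

fn main() {
    for i in 0..=255u8 {
        // 0xFF everywhere exercises the carry into the high byte.
        let (mut a, mut b) = ([0xFFu8; 256], [0xFFu8; 256]);
        incr_u16_a(&mut a, i);
        incr_u16_b(&mut b, i);
        assert!(a.iter().eq(b.iter()), "mismatch at i = {i}");
    }
    println!("ok");
}
```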

I was expecting both versions to be basically equivalent.

Instead, the version using u16::to/from_be_bytes generates several extra instructions (note the rev / extra lsr pairs below):

incr_u16_a:
        add     w8, w1, #1
        and     x9, x1, #0xff
        and     x8, x8, #0xff
        ldrb    w10, [x0, x9]
        ldrb    w11, [x0, x8]
        orr     w10, w10, w11, lsl #8
        rev     w10, w10
        lsr     w10, w10, #16
        add     w10, w10, #1
        rev     w10, w10
        lsr     w11, w10, #16
        lsr     w10, w10, #24
        strb    w11, [x0, x9]
        strb    w10, [x0, x8]
        ret

incr_u16_b:
        add     w8, w1, #1
        and     x9, x1, #0xff
        and     x8, x8, #0xff
        ldrb    w10, [x0, x9]
        ldrb    w11, [x0, x8]
        orr     w10, w11, w10, lsl #8
        add     w10, w10, #1
        strb    w10, [x0, x8]
        lsr     w11, w10, #8
        strb    w11, [x0, x9]
        ret

(Godbolt link, compiled with --edition 2021 -O --target=aarch64-unknown-linux-gnu)

There's similar inefficiency when compiling for x86-64:

incr_u16_a:
        movzx   edx, sil
        mov     eax, edx
        inc     al
        movzx   eax, al
        movzx   esi, byte ptr [rdi + rdx]
        movzx   ecx, byte ptr [rdi + rax]
        shl     ecx, 8
        or      ecx, esi
        rol     cx, 8
        inc     ecx
        rol     cx, 8
        mov     byte ptr [rdi + rdx], cl
        mov     byte ptr [rdi + rax], ch
        ret

incr_u16_b:
        movzx   eax, sil
        mov     ecx, eax
        inc     cl
        movzx   ecx, cl
        movzx   edx, byte ptr [rdi + rcx]
        movzx   esi, byte ptr [rdi + rax]
        shl     esi, 8
        add     edx, esi
        inc     edx
        mov     byte ptr [rdi + rax], dh
        mov     byte ptr [rdi + rcx], dl
        ret

Looking at the extra instructions, I suspect that the call to intrinsics::bswap (deep in uint_macros.rs) is failing to be optimized out.
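For reference, the standard library defines u16::from_be_bytes as from_be(from_ne_bytes(..)) and u16::to_be_bytes as to_be(..).to_ne_bytes(), so on a little-endian target each from_be/to_be is a bswap. A desugared sketch of incr_u16_a (my reconstruction, not the actual library source) makes the pair of swaps visible:

```rust
// Desugared sketch of incr_u16_a. On a little-endian target, from_be and
// to_be each lower to intrinsics::bswap; the pair surrounding wrapping_add
// is what apparently fails to cancel (the two `rev` instructions above).
pub fn incr_u16_desugared(slice: &mut [u8; 256], i: u8) {
    let hi = slice[i as usize];
    let lo = slice[i.wrapping_add(1) as usize];
    let v = u16::from_be(u16::from_ne_bytes([hi, lo])).wrapping_add(1); // bswap #1 (on LE)
    let [hi, lo] = v.to_be().to_ne_bytes();                             // bswap #2 (on LE)
    slice[i as usize] = hi;
    slice[i.wrapping_add(1) as usize] = lo;
}
```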

Meta

rustc --version --verbose:

rustc 1.78.0 (9b00956e5 2024-04-29)

This also happens on nightly (tested on Godbolt)

Metadata


Labels

A-codegen: Area: Code generation
C-bug: Category: This is a bug.
C-optimization: Category: An issue highlighting optimization opportunities or PRs implementing such
T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.
