Bad codegen using `u16::to/from_be_bytes`

I wrote code to increment a big-endian `u16` within a circular buffer of `u8` values.  There are two implementations: one using `u16::to/from_be_bytes`, and the other using a naive mask-and-shift implementation.

```rust
pub fn incr_u16_a(slice: &mut [u8; 256], i: u8) {
    let hi = slice[i as usize];
    let lo = slice[i.wrapping_add(1) as usize];
    let mut v = u16::from_be_bytes([hi, lo]);
    v = v.wrapping_add(1);
    let [hi, lo] = v.to_be_bytes();
    slice[i as usize] = hi;
    slice[i.wrapping_add(1) as usize] = lo;
}

pub fn incr_u16_b(slice: &mut [u8; 256], i: u8) {
    let hi = slice[i as usize];
    let lo = slice[i.wrapping_add(1) as usize];
    let mut v = ((hi as u16) << 8) | (lo as u16);
    v = v.wrapping_add(1);
    let hi = (v >> 8) as u8;
    let lo = v as u8;
    slice[i as usize] = hi;
    slice[i.wrapping_add(1) as usize] = lo;
}
```

I was expecting both versions to be basically equivalent.

Instead, the version using `u16::to/from_be_bytes` generates a bunch of extra instructions
```asm
incr_u16_a:
        add     w8, w1, #1
        and     x9, x1, #0xff
        and     x8, x8, #0xff
        ldrb    w10, [x0, x9]
        ldrb    w11, [x0, x8]
        orr     w10, w10, w11, lsl #8
        rev     w10, w10
        lsr     w10, w10, #16
        add     w10, w10, #1
        rev     w10, w10
        lsr     w11, w10, #16
        lsr     w10, w10, #24
        strb    w11, [x0, x9]
        strb    w10, [x0, x8]
        ret

incr_u16_b:
        add     w8, w1, #1
        and     x9, x1, #0xff
        and     x8, x8, #0xff
        ldrb    w10, [x0, x9]
        ldrb    w11, [x0, x8]
        orr     w10, w11, w10, lsl #8
        add     w10, w10, #1
        strb    w10, [x0, x8]
        lsr     w11, w10, #8
        strb    w11, [x0, x9]
        ret
```

([Godbolt link](https://godbolt.org/z/cTnaro7vj), running using `--edition 2021 -O --target=aarch64-unknown-linux-gnu`)

There's similar inefficiency when compiling for x86
```asm
incr_u16_a:
        movzx   edx, sil
        mov     eax, edx
        inc     al
        movzx   eax, al
        movzx   esi, byte ptr [rdi + rdx]
        movzx   ecx, byte ptr [rdi + rax]
        shl     ecx, 8
        or      ecx, esi
        rol     cx, 8
        inc     ecx
        rol     cx, 8
        mov     byte ptr [rdi + rdx], cl
        mov     byte ptr [rdi + rax], ch
        ret

incr_u16_b:
        movzx   eax, sil
        mov     ecx, eax
        inc     cl
        movzx   ecx, cl
        movzx   edx, byte ptr [rdi + rcx]
        movzx   esi, byte ptr [rdi + rax]
        shl     esi, 8
        add     edx, esi
        inc     edx
        mov     byte ptr [rdi + rax], dh
        mov     byte ptr [rdi + rcx], dl
        ret
```

Looking at the extra instructions, I suspect that the call to `intrinsics::bswap` (deep in [`uint_macros.rs`](https://doc.rust-lang.org/beta/src/core/num/uint_macros.rs.html#278)) is failing to be optimized out.

### Meta


`rustc --version --verbose`:
```
rustc 1.78.0 (9b00956e5 2024-04-29)
```

This also happens on nightly (tested on Godbolt)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bad codegen using `u16::to/from_be_bytes` #126419

Meta

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Bad codegen using u16::to/from_be_bytes #126419

Description

Meta

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Bad codegen using `u16::to/from_be_bytes` #126419