-
Notifications
You must be signed in to change notification settings - Fork 13.8k
Open
Labels
A-codegenArea: Code generationArea: Code generationC-bugCategory: This is a bug.Category: This is a bug.C-optimizationCategory: An issue highlighting optimization opportunities or PRs implementing suchCategory: An issue highlighting optimization opportunities or PRs implementing suchT-compilerRelevant to the compiler team, which will review and decide on the PR/issue.Relevant to the compiler team, which will review and decide on the PR/issue.
Description
I wrote code to increment a big-endian u16
within a circular buffer of u8
values. There are two implementations: one using u16::to/from_be_bytes
, and the other using a naive mask-and-shift implementation.
pub fn incr_u16_a(slice: &mut [u8; 256], i: u8) {
let hi = slice[i as usize];
let lo = slice[i.wrapping_add(1) as usize];
let mut v = u16::from_be_bytes([hi, lo]);
v = v.wrapping_add(1);
let [hi, lo] = v.to_be_bytes();
slice[i as usize] = hi;
slice[i.wrapping_add(1) as usize] = lo;
}
pub fn incr_u16_b(slice: &mut [u8; 256], i: u8) {
let hi = slice[i as usize];
let lo = slice[i.wrapping_add(1) as usize];
let mut v = ((hi as u16) << 8) | (lo as u16);
v = v.wrapping_add(1);
let hi = (v >> 8) as u8;
let lo = v as u8;
slice[i as usize] = hi;
slice[i.wrapping_add(1) as usize] = lo;
}
I was expecting both versions to be basically equivalent.
Instead, the version using u16::to/from_be_bytes
generates a bunch of extra instructions
incr_u16_a:
add w8, w1, #1
and x9, x1, #0xff
and x8, x8, #0xff
ldrb w10, [x0, x9]
ldrb w11, [x0, x8]
orr w10, w10, w11, lsl #8
rev w10, w10
lsr w10, w10, #16
add w10, w10, #1
rev w10, w10
lsr w11, w10, #16
lsr w10, w10, #24
strb w11, [x0, x9]
strb w10, [x0, x8]
ret
incr_u16_b:
add w8, w1, #1
and x9, x1, #0xff
and x8, x8, #0xff
ldrb w10, [x0, x9]
ldrb w11, [x0, x8]
orr w10, w11, w10, lsl #8
add w10, w10, #1
strb w10, [x0, x8]
lsr w11, w10, #8
strb w11, [x0, x9]
ret
(Godbolt link, running using --edition 2021 -O --target=aarch64-unknown-linux-gnu
)
There's similar inefficiency when compiling for x86
incr_u16_a:
movzx edx, sil
mov eax, edx
inc al
movzx eax, al
movzx esi, byte ptr [rdi + rdx]
movzx ecx, byte ptr [rdi + rax]
shl ecx, 8
or ecx, esi
rol cx, 8
inc ecx
rol cx, 8
mov byte ptr [rdi + rdx], cl
mov byte ptr [rdi + rax], ch
ret
incr_u16_b:
movzx eax, sil
mov ecx, eax
inc cl
movzx ecx, cl
movzx edx, byte ptr [rdi + rcx]
movzx esi, byte ptr [rdi + rax]
shl esi, 8
add edx, esi
inc edx
mov byte ptr [rdi + rax], dh
mov byte ptr [rdi + rcx], dl
ret
Looking at the extra instructions, I suspect that the call to intrinsics::bswap
(deep in uint_macros.rs
) is failing to be optimized out.
Meta
rustc --version --verbose
:
rustc 1.78.0 (9b00956e5 2024-04-29)
This also happens on nightly (tested on Godbolt)
Metadata
Metadata
Assignees
Labels
A-codegenArea: Code generationArea: Code generationC-bugCategory: This is a bug.Category: This is a bug.C-optimizationCategory: An issue highlighting optimization opportunities or PRs implementing suchCategory: An issue highlighting optimization opportunities or PRs implementing suchT-compilerRelevant to the compiler team, which will review and decide on the PR/issue.Relevant to the compiler team, which will review and decide on the PR/issue.