suboptimal codegen when initializing an array of padded structs #122274

Open

iximeow opened this issue Mar 10, 2024 · 1 comment
Labels

A-codegen - Area: Code generation
A-LLVM - Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-optimization - Category: An issue highlighting optimization opportunities or PRs implementing such
I-heavy - Issue: Problems and improvements with respect to binary size of generated code.
T-compiler - Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

iximeow (Contributor) commented Mar 10, 2024

in the following code, reset_implicit_padding generates somewhat worse assembly under -C opt-level=3 than reset_explicit_padding, whose explicit padding compiles down to a memset - i'd hoped both would become a memset:

#[repr(Rust, align(2))]
#[derive(Default, Copy, Clone)]
pub struct ImplicitPadding {
    v: u8,
}

#[repr(Rust, align(2))]
#[derive(Default, Copy, Clone)]
pub struct ExplicitPadding {
    v: u8,
    _pad: [u8; 1],
}

#[no_mangle]
pub fn reset_implicit_padding(array: &mut [ImplicitPadding; 160]) {
    *array = [ImplicitPadding::default(); 160];
}

#[no_mangle]
pub fn reset_explicit_padding(array: &mut [ExplicitPadding; 160]) {
    *array = [ExplicitPadding::default(); 160];
}

which compiles to these two implementations:

reset_implicit_padding:
        mov     eax, 7
.LBB0_1:
        mov     byte ptr [rdi + 2*rax - 14], 0
        mov     byte ptr [rdi + 2*rax - 12], 0
        mov     byte ptr [rdi + 2*rax - 10], 0
        mov     byte ptr [rdi + 2*rax - 8], 0
        mov     byte ptr [rdi + 2*rax - 6], 0
        mov     byte ptr [rdi + 2*rax - 4], 0
        mov     byte ptr [rdi + 2*rax - 2], 0
        mov     byte ptr [rdi + 2*rax], 0
        add     rax, 8
        cmp     rax, 167
        jne     .LBB0_1
        ret

reset_explicit_padding:
        mov     edx, 320
        xor     esi, esi
        jmp     qword ptr [rip + memset@GOTPCREL]

godbolt for reference.

it turns out that when an array assignment like this does become a memset, it's seemingly often because llvm's loop-idiom pass realized the stores can be turned into a memset, rather than rustc ever emitting a memset directly?

i don't know enough to know whether this would need rustc to initialize padding bytes as undef (?), or whether they're already undef and there's some other reason llvm doesn't figure out this could be a memset. i also can't tell if this would conflict with this issue about padding copies or the regression test for it. there are certainly cases where copying large amounts of padding would be counterproductive, though if writing a padding byte could turn a 4-byte, a 2-byte, and a 1-byte store into one 8-byte store, that's probably always a net win.
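
(to make that last point concrete - an illustration with a hypothetical type, not from the original code: a struct with 7 data bytes and align(8) carries one trailing padding byte, so zeroing only the data bytes takes, on x86-64 at least, a 4-byte, a 2-byte, and a 1-byte store, while clobbering the padding byte as well allows a single 8-byte store:)

#[repr(Rust, align(8))]
#[derive(Default, Copy, Clone)]
pub struct SevenBytes {
    v: [u8; 7], // size_of::<SevenBytes>() == 8: one implicit trailing padding byte
}

// zeroing while preserving the padding byte (7 bytes of stores):
//     mov dword ptr [p], 0
//     mov word ptr [p + 4], 0
//     mov byte ptr [p + 6], 0
// zeroing while clobbering the padding byte (one store):
//     mov qword ptr [p], 0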

(the original code i've minimized here had 6-byte and 14-byte structs aligned to 8 and 16 respectively, and for entirely indecipherable reasons the assignment also got fully unrolled, yielding ~600 mov instructions rather than a memset. i would much rather that be a call to memset, and have added a _pad field for the time being.)
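
(for reference, one alternative to the _pad field - a sketch (hypothetical function name) using core::ptr::write_bytes, on the assumption that the all-zero byte pattern is a valid value here (the derived Default is v == 0): overwriting the whole array as raw bytes makes every byte, padding included, a defined store, and write_bytes lowers to a memset intrinsic:)

#[no_mangle]
pub fn reset_via_write_bytes(array: &mut [ImplicitPadding; 160]) {
    // SAFETY: all-zero bytes are a valid ImplicitPadding, the pointer is
    // valid for all 160 elements, and &mut gives us exclusive access.
    unsafe {
        core::ptr::write_bytes(array.as_mut_ptr(), 0, array.len());
    }
}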

@rustbot added the needs-triage label Mar 10, 2024
@workingjubilee added the A-LLVM, A-codegen, T-compiler, C-optimization, and I-heavy labels Mar 10, 2024
clubby789 (Contributor) commented Mar 10, 2024

Looks somewhat similar to #101685. iirc, LLVM eliminates store undef instructions (which is what ends up being emitted for the padding bytes here) before running the pass that transforms stores into memset/memcpy.

store i8 val, ptr
store i8 val, ptr + 2
store i8 val, ptr + 4

etc., can't be turned into a memset at that point, as LLVM only sees the holey stores. with the explicit _pad field every byte of each element gets a defined store, so the stores are contiguous and the transform can fire.

iximeow added a commit to iximeow/yaxgbc that referenced this issue Mar 10, 2024
this one is a little odd, see
rust-lang/rust#122274

as reported there,
> for entirely indecipherable reasons also got the assignment fully
> unrolled, yielding ~600 mov rather than a memset

that string of 600 movs would be executed on each hblank when resetting
pixel buffers to `Pixel::default()`. it seems to be fast enough, but it
makes `GBC::run()` quite a bit harder to read in perf.
@saethlin removed the needs-triage label Mar 11, 2024