Codegen significantly worse when using u128 rather than two u64 #123627

Open · linkmauve opened this issue Apr 8, 2024 · 3 comments
Labels
A-codegen (Area: Code generation)
A-LLVM (Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.)
C-bug (Category: This is a bug.)
I-slow (Issue: Problems and improvements with respect to performance of generated code.)
S-has-mcve (Status: A Minimal Complete and Verifiable Example has been found for this issue.)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Comments

linkmauve (Contributor) commented Apr 8, 2024

I tried this code:

fn read_bits(data: &mut u128, n_bits: u32) -> u32 {
    let ret = *data & ((1 << n_bits) - 1);
    *data >>= n_bits;
    ret as u32
}

This replaces the original code, which emulated the same operation using two u64 fields:

struct Bitstream {
    low: u64,
    high: u64,
}

impl Bitstream {
    fn read_bits(&mut self, num_bits: u32) -> u32 {
        let mask = (1 << num_bits) - 1;
        let bits = self.low & mask;
        self.low >>= num_bits;
        // Refill the vacated top bits of `low` from the bottom of `high`.
        self.low |= (self.high & mask) << (u64::BITS as u64 - num_bits as u64);
        self.high >>= num_bits;
        bits as u32
    }
}

I expected to see this happen: no conditional code should be generated, resulting in the same performance as the two-u64 code.

Instead, this happened: the codegen is significantly worse. For a BC7 decoding crate, blocks decode about 18 ns slower when using u128 instead of the existing high and low u64 (163.48 ns per block with u128 vs 145.83 ns per block with two u64; reverting to u64 measured as a -10.749% change, including randomness overhead). The same happens with an ASTC decoding crate. This is a follow-up of #122252.

Meta

rustc --version --verbose:

rustc 1.79.0-nightly (9d5cdf75a 2024-04-07)
joboet (Contributor) commented Apr 8, 2024

The operation is not actually the same. The behaviour of the u128 version differs from the custom version when n_bits is 64 or more: the u128 version correctly shifts bits from the upper 64 bits into the lower ones, whereas the custom version's u64 shift amounts overflow, producing some very weird behaviour.
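For illustration, a minimal sketch of that divergence, using hypothetical test values with n_bits = 65:

fn main() {
    // u128 version: a shift by 65 is well-defined on a 128-bit integer.
    let mut data: u128 = (7u128 << 64) | 1;
    let ret = data & ((1u128 << 65) - 1);
    data >>= 65;
    println!("ret = {ret:#x}, data = {data:#x}"); // the expected low 65 bits

    // Two-u64 version with num_bits = 65: `1u64 << 65`, `self.low >>= 65`,
    // and `u64::BITS as u64 - 65` all overflow/underflow, panicking in
    // debug builds and producing masked shift counts in release builds.
}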

linkmauve (Contributor, Author) commented:

This function is always inlined into functions where num_bits is far less than 64, usually at most 4, and codegen actually gets worse when I add assert!(num_bits < 64);. I would expect the compiler to be able to optimize based on the constant the function gets called with, since it is always known at compile time.
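The asserted variant presumably looks something like this (a sketch reconstructed from the comment; the real call sites are not shown in the issue):

fn read_bits(data: &mut u128, n_bits: u32) -> u32 {
    // n_bits is a small compile-time constant at every inlined call site,
    // yet spelling the bound out reportedly makes codegen worse, not better.
    assert!(n_bits < 64);
    let ret = *data & ((1 << n_bits) - 1);
    *data >>= n_bits;
    ret as u32
}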

tylanphear commented Apr 18, 2024

@linkmauve at least for obvious examples where n_bits is a compile-time constant, the compiler does appear to avoid cmov sequences when it can also prove n_bits < 64. Even when n_bits is not known at compile time, explicitly masking it with 63 also allows the compiler to avoid the cmov sequences (godbolt); see the sketch below.
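A sketch of that masking workaround (the function name is hypothetical, and the mask assumes callers never pass n_bits >= 64, so it does not change behavior):

fn read_bits_masked(data: &mut u128, n_bits: u32) -> u32 {
    // Bounding the shift amount to 0..=63 lets LLVM drop the branch/cmov
    // pair it otherwise emits to handle shifts of 64 or more on a u128.
    let n = n_bits & 63;
    let ret = *data & ((1u128 << n) - 1);
    *data >>= n;
    ret as u32
}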

The fact that assert!() doesn't help here is unfortunate (I even tried using unreachable_unchecked() to force the generation of an llvm.assume, but that also didn't help) -- most likely there is room for improvement in LLVM's lowering of shr on types with bitwidth > 64.
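One way that hint might have been written (a hypothetical sketch; the issue does not show the exact code that was tried):

fn read_bits_hinted(data: &mut u128, n_bits: u32) -> u32 {
    if n_bits >= 64 {
        // SAFETY: callers are required to pass n_bits < 64. This emits an
        // llvm.assume-style hint, but per the comment above it still does
        // not improve the lowering of the 128-bit shifts.
        unsafe { std::hint::unreachable_unchecked() }
    }
    let ret = *data & ((1 << n_bits) - 1);
    *data >>= n_bits;
    ret as u32
}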
