ptr::align_offset generates surprisingly bad code #75579

Closed
thomcc opened this issue Aug 16, 2020 · 2 comments · Fixed by #75728
Labels
C-bug (Category: This is a bug.)
I-slow (Issue: Problems and improvements with respect to performance of generated code.)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Comments

thomcc (Member) commented Aug 16, 2020

I happened to look more closely at the generated assembly for some code that uses align_offset, and noticed that align_offset does not compile as efficiently as one might hope for the case of "align pointer to size_of::<usize>()".

For example (ignore that I omit handling of align_offset's error return value):

pub unsafe fn align_offset(p: *const u8) -> *const u8 {
    p.add(p.align_offset(core::mem::size_of::<usize>()))
}

compiles to

example::align_offset:
        movl    %edi, %ecx
        andl    $7, %ecx
        movl    $8, %eax
        subq    %rcx, %rax
        testq   %rcx, %rcx
        cmoveq  %rcx, %rax
        addq    %rdi, %rax
        retq

Whereas performing the same alignment manually (forgive my convoluted way of doing this; my usual pattern is slightly different, and I don't have a memorized idiom for this that avoids going through usize, since, well, I figured I'd just use align_offset):

const USIZE_SIZE: usize = core::mem::size_of::<usize>();

pub unsafe fn manual_align_offset(p: *const u8) -> *const u8 {
    // IIRC just doing `arithmetic(p as usize) as *const u8` makes LLVM sad
    let aligned = ((p as usize + USIZE_SIZE - 1) & !(USIZE_SIZE - 1)) as isize;
    p.offset(aligned - (p as isize))
}

compiles to

example::manual_align_offset:
        leaq    7(%rdi), %rax
        andq    $-8, %rax
        retq

Which is substantially better along a variety of metrics, including but not limited to actual runtime.
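
(As an aside, since I told you to ignore it above: handling align_offset's "cannot align" sentinel, which is usize::MAX, would look something like the sketch below; try_align is just a name I made up for it.)

pub fn try_align(p: *const u8) -> Option<*const u8> {
    // `align_offset` returns `usize::MAX` when the pointer cannot be aligned.
    let off = p.align_offset(core::mem::size_of::<usize>());
    if off == usize::MAX { None } else { Some(p.wrapping_add(off)) }
}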

Taking a look at the source for align_offset reveals that it, uh... well, it does some stuff.

/// Any questions go to @nagisa.
#[lang = "align_offset"]
pub(crate) unsafe fn align_offset<T: Sized>(p: *const T, a: usize) -> usize {
    /// Calculate multiplicative modular inverse of `x` modulo `m`.
    ///
    /// This implementation is tailored for align_offset and has following preconditions:
    ///
    /// * `m` is a power-of-two;
    /// * `x < m`; (if `x ≥ m`, pass in `x % m` instead)
    ///
    /// Implementation of this function shall not panic. Ever.
    #[inline]
    fn mod_inv(x: usize, m: usize) -> usize {
        /// Multiplicative modular inverse table modulo 2⁴ = 16.
        ///
        /// Note, that this table does not contain values where inverse does not exist (i.e., for
        /// `0⁻¹ mod 16`, `2⁻¹ mod 16`, etc.)
        const INV_TABLE_MOD_16: [u8; 8] = [1, 11, 13, 7, 9, 3, 5, 15];
        /// Modulo for which the `INV_TABLE_MOD_16` is intended.
        const INV_TABLE_MOD: usize = 16;
        /// INV_TABLE_MOD²
        const INV_TABLE_MOD_SQUARED: usize = INV_TABLE_MOD * INV_TABLE_MOD;

        let table_inverse = INV_TABLE_MOD_16[(x & (INV_TABLE_MOD - 1)) >> 1] as usize;
        if m <= INV_TABLE_MOD {
            table_inverse & (m - 1)
        } else {
            // We iterate "up" using the following formula:
            //
            // $$ xy ≡ 1 (mod 2ⁿ) → xy (2 - xy) ≡ 1 (mod 2²ⁿ) $$
            //
            // until 2²ⁿ ≥ m. Then we can reduce to our desired `m` by taking the result `mod m`.
            let mut inverse = table_inverse;
            let mut going_mod = INV_TABLE_MOD_SQUARED;
            loop {
                // y = y * (2 - xy) mod n
                //
                // Note, that we use wrapping operations here intentionally – the original formula
                // uses e.g., subtraction `mod n`. It is entirely fine to do them `mod
                // usize::MAX` instead, because we take the result `mod n` at the end
                // anyway.
                inverse = inverse.wrapping_mul(2usize.wrapping_sub(x.wrapping_mul(inverse)));
                if going_mod >= m {
                    return inverse & (m - 1);
                }
                going_mod = going_mod.wrapping_mul(going_mod);
            }
        }
    }

    let stride = mem::size_of::<T>();
    let a_minus_one = a.wrapping_sub(1);
    let pmoda = p as usize & a_minus_one;

    if pmoda == 0 {
        // Already aligned. Yay!
        return 0;
    }

    if stride <= 1 {
        return if stride == 0 {
            // If the pointer is not aligned, and the element is zero-sized, then no amount of
            // elements will ever align the pointer.
            !0
        } else {
            a.wrapping_sub(pmoda)
        };
    }

    let smoda = stride & a_minus_one;
    // SAFETY: a is power-of-two so cannot be 0. stride = 0 is handled above.
    let gcdpow = unsafe { intrinsics::cttz_nonzero(stride).min(intrinsics::cttz_nonzero(a)) };
    let gcd = 1usize << gcdpow;

    if p as usize & (gcd.wrapping_sub(1)) == 0 {
        // This branch solves for the following linear congruence equation:
        //
        // ` p + so = 0 mod a `
        //
        // `p` here is the pointer value, `s` - stride of `T`, `o` offset in `T`s, and `a` - the
        // requested alignment.
        //
        // With `g = gcd(a, s)`, and the above asserting that `p` is also divisible by `g`, we can
        // denote `a' = a/g`, `s' = s/g`, `p' = p/g`, then this becomes equivalent to:
        //
        // ` p' + s'o = 0 mod a' `
        // ` o = (a' - (p' mod a')) * (s'^-1 mod a') `
        //
        // The first term is "the relative alignment of `p` to `a`" (divided by the `g`), the
        // second term is "how does incrementing `p` by `s` bytes change the relative alignment
        // of `p`" (again divided by `g`). Division by `g` is necessary to make the inverse well
        // formed if `a` and `s` are not co-prime.
        //
        // Furthermore, the result produced by this solution is not "minimal", so it is necessary
        // to take the result `o mod lcm(s, a)`. We can replace `lcm(s, a)` with just a `a'`.
        let a2 = a >> gcdpow;
        let a2minus1 = a2.wrapping_sub(1);
        let s2 = smoda >> gcdpow;
        let minusp2 = a2.wrapping_sub(pmoda >> gcdpow);
        return (minusp2.wrapping_mul(mod_inv(s2, a2))) & a2minus1;
    }

    // Cannot be aligned at all.
    usize::MAX
}
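
(Tangent: I did convince myself that the inverse-lifting step in mod_inv works. Here's a quick standalone check I threw together, with arbitrary odd starting values; the identity is: if x*y ≡ 1 (mod 2^n), then x*(y*(2 - x*y)) ≡ 1 (mod 2^(2n)).)

fn main() {
    let x: usize = 11; // odd, hence invertible modulo any power of two
    let mut y: usize = 3; // 11 * 3 = 33 ≡ 1 (mod 16)
    let mut m: usize = 16;
    while m < (1 << 16) {
        // One lifting step: y = y * (2 - x*y), doubling the number of valid bits.
        y = y.wrapping_mul(2usize.wrapping_sub(x.wrapping_mul(y)));
        m = m * m;
        assert_eq!(x.wrapping_mul(y) % m, 1);
    }
}
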
Anyway, I'm just gonna take it at face value that all of this is needed, since it tracks that sometimes you might need the GCD... And hell, maybe mine misses a case like "p + USIZE_SIZE - 1 wraps around the address space but the aligned value wouldn't".
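
(FWIW, a wrap-free way to compute the same offset, assuming the alignment is a power of two, is -p mod a, which can't overflow. This is just my sketch, reusing USIZE_SIZE from the snippet above; it is not anything in std:)

pub unsafe fn manual_align_offset_wrap_free(p: *const u8) -> *const u8 {
    // `-p mod a` in two's-complement arithmetic: the distance in bytes to the
    // next multiple of `a`. Unlike `p + a - 1`, no intermediate value can wrap.
    let offset = (p as usize).wrapping_neg() & (USIZE_SIZE - 1);
    p.add(offset)
}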

Anyway, IIUC align_offset is really considered the way forward for all pointer aligning, as miri will throw your code straight into the trash if it catches it dereferencing a pointer that you manually aligned... (I have Opinions on this, but I'll spare you the rant).

So, for that and a lot of reasons, I think we'd like the codegen for align_offset to look a lot closer to what I provided at opt-level=3, even if it means special-casing size_of::<T>() == 1... (I mean, I'd also love for it not to generate the whole code mountain in debug builds, but one thing at a time, I guess).

Anyway, the function's documentation comment tells me that @nagisa has taken up the sacred burden of "keeper of align_offset's secrets"... I have questions: is this fixable? And if not, is there an interface that we can provide that lets us produce good code here? Am I missing something?

P.S. Godbolt link: https://godbolt.org/z/388Enf

@thomcc added the C-bug label Aug 16, 2020
RalfJung (Member) commented

> miri will throw your code straight into the trash if it catches it dereferencing a pointer that you manually aligned...

FWIW this is something we'd like to fix (rust-lang/miri#1074).

@jonas-schievink added the I-slow and T-compiler labels Aug 16, 2020
nagisa (Member) commented Aug 16, 2020

> is this fixable?

Most of the function is constant-evaluatable (with the only unknown usually being the pointer value), so there is no reason it could not have a special case (based on the input types) for situations where a more optimal implementation is possible.

bors added a commit to rust-lang-ci/rust that referenced this issue Aug 19, 2020
Improve codegen for `align_offset`

In this PR the `align_offset` implementation is changed/improved to produce better code in certain scenarios, such as when the pointee type has a stride of 1 or when building at low optimisation levels.

While these changes do not achieve the "ideal" codegen referenced in rust-lang#75579, they get significantly closer to it. I’m not actually sure the codegen can be much better with this function returning the offset, rather than the aligned pointer.

See the descriptions of the individual commits for further information.
nagisa added a commit to nagisa/rust that referenced this issue Aug 20, 2020
The `stride == 1` case can be computed more efficiently through `-p (mod
a)`. That then translates to a nice and short sequence of LLVM
instructions:

    %address = ptrtoint i8* %p to i64
    %negptr = sub i64 0, %address
    %offset = and i64 %negptr, %a_minus_one

And produces pretty much ideal code-gen when this function is used in
isolation.

Typical use of this function will, however, involve use of
the result to offset a pointer, i.e.

    %aligned = getelementptr inbounds i8, i8* %p, i64 %offset

This still looks very good, but LLVM does not really translate that to
what would be considered ideal machine code (on any target). For
example, this is the codegen we obtain for an unknown alignment:

    ; x86_64
    dec     rsi
    mov     rax, rdi
    neg     rax
    and     rax, rsi
    add     rax, rdi

In particular, negating a pointer is not something that CISC
architectures like x86_64 are designed to optimise for. They are much
better at offsetting pointers. And so we’d love to utilize this ability
and produce code that's more like this:

    ; x86_64
    lea     rax, [rsi + rdi - 1]
    neg     rsi
    and     rax, rsi

To achieve this we need to give LLVM an opportunity to apply its
various peephole optimisations that it does during DAG selection. In
particular, the `and` instruction appears to be a major inhibitor here.
We cannot, sadly, get rid of this load-bearing operation, but we can
reorder operations such that LLVM has more to work with around this
instruction.

One such ordering is proposed in rust-lang#75579 and results in LLVM IR that
looks broadly like this:

    ; using add enables `lea` and similar CISCisms
    %offset_ptr = add i64 %address, %a_minus_one
    %mask = sub i64 0, %a
    %masked = and i64 %offset_ptr, %mask
    ; can be folded with `gepi` that may follow
    %offset = sub i64 %masked, %address

…and generates the intended x86_64 machine code. One might also wonder
how the increased amount of code would impact a RISC target. Turns out
not much:

    ; aarch64 previous                 ; aarch64 new
    sub     x8, x1, #1                 add     x8, x1, x0
    neg     x9, x0                     sub     x8, x8, #1
    and     x8, x9, x8                 neg     x9, x1
    add     x0, x0, x8                 and     x0, x8, x9

    (and similarly for ppc, sparc, mips, riscv, etc)

The only target that seems to do worse is… wasm32.
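
For reference, at the Rust level the proposed ordering corresponds to
roughly the following sketch (the function name is made up here; it
assumes `a` is a power of two and uses wrapping ops so the intermediate
overflow is well-defined):

    fn reordered_offset(p: *const u8, a: usize) -> usize {
        let address = p as usize;
        let offset_ptr = address.wrapping_add(a - 1); // folds into `lea` on x86_64
        let masked = offset_ptr & a.wrapping_neg();   // -a == !(a - 1) in two's complement
        masked.wrapping_sub(address)                  // folds with a following GEP/add
    }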

Onto actual measurements – the best way to evaluate snippets like these
is to use llvm-mca. Much as the AArch64 assembly would lead one to
suspect, there isn’t any performance difference to be found: both
snippets execute in the same number of cycles on the CPUs I tried. On
x86_64, however, we get a throughput improvement of >50%!
bors closed this as completed in 69e68cf Oct 26, 2020