
Check for adcx instruction in the addcarryx intrinsics instead of just adc #666

Open
gnzlbg opened this issue Jan 31, 2019 · 31 comments

@gnzlbg
Contributor

gnzlbg commented Jan 31, 2019

cc @jethrogb who reported this here #631 (comment)

@jethrogb
Contributor

Title should say "addcarryx"

@gnzlbg gnzlbg changed the title Check for adcx instruction in the addcarry intrinsics instead of just adc Check for adcx instruction in the addcarryx intrinsics instead of just adc Jan 31, 2019
@gnzlbg
Contributor Author

gnzlbg commented Jan 31, 2019

LLVM doesn't emit this intrinsic anymore: https://reviews.llvm.org/D51754

The rationale given is:

There's no advantage to this instruction unless you need to avoid touching other flag bits. Its encoding is longer, it can't fold an immediate, and it doesn't write all the flags.

So I'm closing this. If you feel we should emit this instruction here, feel free to re-open this issue, file an LLVM bug pinging the people involved in the submission and review of that patch, and link it here.

@gnzlbg gnzlbg closed this as completed Jan 31, 2019
@jethrogb
Contributor

jethrogb commented Jan 31, 2019

The public interface we expose has nothing to do with what LLVM does internally. For simple functions such as this, our public interface needs to match the Intel Intrinsics guide. If we can't implement it properly in LLVM, we should not stabilize the intrinsic.

If you didn't care about using ADX, you could use the addcarry intrinsic, which always works.
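For reference, here's a minimal sketch (not code from this issue) of the plain-ADC path: _addcarry_u64 lowers to ADC and needs no target feature, so it works on every x86_64 CPU.

use std::arch::x86_64::_addcarry_u64;

/// Multi-precision add: out = a + b, returning the final carry.
pub fn mp_add(a: &[u64], b: &[u64], out: &mut [u64]) -> u8 {
    assert!(a.len() == b.len() && a.len() == out.len());
    let mut carry = 0u8;
    for i in 0..a.len() {
        // SAFETY: no CPU feature beyond baseline x86_64 is required here.
        carry = unsafe { _addcarry_u64(carry, a[i], b[i], &mut out[i]) };
    }
    carry
}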

@gnzlbg
Contributor Author

gnzlbg commented Jan 31, 2019

You are right; I think that LLVM revision is unrelated. We should be using llvm.addcarryx instead of llvm.addcarry here. I'll start by adding some tests and see if I can get that intrinsic to work.

@gnzlbg gnzlbg reopened this Jan 31, 2019
@gnzlbg
Contributor Author

gnzlbg commented Jan 31, 2019

FWIW, we don't promise which instructions any intrinsic will generate. We promise that the running executable won't be able to tell the difference.

@gnzlbg
Contributor Author

gnzlbg commented Jan 31, 2019

So looking at the LLVM tests, it appears that addcarryx is indeed replaced with addcarry when possible. AFAICT the replacement would only be incorrect if the user were to inspect the overflow flag, which we are not doing (and I am unsure whether that's even possible or easy to do in LLVM - cc @rkruppe).

@jethrogb I have a PR (#672) that updates these to use the LLVM addcarryx intrinsic, but the code-generation issue persists. We could work around it using inline assembly, but before we add any workarounds one would need to file an LLVM bug showing an MWE in which the transformation is incorrect.

@gnzlbg
Contributor Author

gnzlbg commented Apr 19, 2019

@jethrogb do you think you could help with an MWE for an LLVM bug report for this?

@jethrogb
Contributor

jethrogb commented Apr 19, 2019

I came up with this:

use std::arch::x86_64::_addcarryx_u64;

/// Returns `(a + b, c + d)`
#[inline(never)]
#[no_mangle]
pub fn dual_mp_add(a: &[u64], b: &[u64], c: &[u64], d: &[u64]) -> (Vec<u64>, Vec<u64>) {
    unsafe {
        let len = a.len();
        assert_eq!(len, b.len());
        assert_eq!(len, c.len());
        assert_eq!(len, d.len());
        let mut ab = vec![0; len];
        let mut cd = vec![0; len];

        let mut ab_carry = 0;
        let mut cd_carry = 0;
        for i in 0..len {
            ab_carry = _addcarryx_u64(ab_carry, *a.get_unchecked(i), *b.get_unchecked(i), ab.get_unchecked_mut(i));
            cd_carry = _addcarryx_u64(cd_carry, *c.get_unchecked(i), *d.get_unchecked(i), cd.get_unchecked_mut(i));
        }

        (ab, cd)
    }
}

But the assembly is so laughably inefficient (the addcarry call isn't even inlined) that not using the ADX instruction isn't really going to do much in terms of performance.

@jethrogb
Contributor

Sorry, I forgot that you need to add a + in -C target-feature, and there is no error if you don't. With that fixed, I get this:

    34a0:       49 8b 4c f5 00          mov    0x0(%r13,%rsi,8),%rcx
    34a5:       40 80 c7 ff             add    $0xff,%dil
    34a9:       49 13 0c f4             adc    (%r12,%rsi,8),%rcx
    34ad:       40 0f 92 c7             setb   %dil
    34b1:       48 89 4c f5 00          mov    %rcx,0x0(%rbp,%rsi,8)
    34b6:       49 8b 0c f7             mov    (%r15,%rsi,8),%rcx
    34ba:       80 c2 ff                add    $0xff,%dl
    34bd:       49 13 0c f0             adc    (%r8,%rsi,8),%rcx
    34c1:       0f 92 c2                setb   %dl
    34c4:       48 89 0c f0             mov    %rcx,(%rax,%rsi,8)
    34c8:       48 8d 76 01             lea    0x1(%rsi),%rsi
    34cc:       48 39 f3                cmp    %rsi,%rbx
    34cf:       75 cf                   jne    34a0 <dual_mp_add+0xf0>

I think this is supposed to be slower than using ADX.

@jethrogb
Contributor

jethrogb commented Apr 19, 2019

Actually, you can't implement dual integer addition with ADX because AFAICT there's no way to check the loop variable without touching CF/OF. I'll try to come up with something else.

Edit: You can, using JRCXZ!

@jethrogb
Contributor

Ok, so I replaced that code with this assembly:

   neg    %rbx    
   xor    %edx,%edx
   xor    %edi,%edi
   xor    %esi,%esi
1:
   mov    0x0(%r13,%rsi,8),%rcx
   adcx   (%r12,%rsi,8),%rcx
   mov    %rcx,0x0(%rbp,%rsi,8)
   mov    (%r15,%rsi,8),%rcx
   adox   (%r8,%rsi,8),%rcx
   mov    %rcx,(%rax,%rsi,8)
   lea    0x1(%rsi),%rsi
   lea    (%rsi,%rbx),%rcx
   jrcxz  2f
   jmp 1b
2:

This is basically the same speed in my (admittedly rough) benchmark.

@gnzlbg
Contributor Author

gnzlbg commented Apr 20, 2019

If the code is correct and the speed is the same, there might be an advantage to not using adcx that we are not seeing but LLVM is (code size, port pressure, ...). We can still ask in the LLVM Bugzilla why adcx isn't being picked, but we do allow LLVM to perform these "optimizations" as long as the resulting machine code is not slower.

@jethrogb
Contributor

jethrogb commented Apr 20, 2019

I decided to implement MP multiply, which is documented by Intel as a main use case for these instructions:

#![feature(test, asm)]

extern crate test;

use core::arch::x86_64::{_addcarryx_u64, _mulx_u64};

#[inline(never)]
#[no_mangle]
pub fn mp_mul_intrin(a: &[u64], b: &[u64]) -> Vec<u64> {
    unsafe {
        let len = a.len();
        assert_eq!(len, b.len());
        let mut c = vec![0; len * 2];

        for b_i in (0..len).into_iter().rev() {
            let b_elem = *b.get_unchecked(b_i);
            let mut carry_lo = 0;
            let mut carry_hi = 0;
            for a_i in (0..len).into_iter().rev() {
                let mut hi = 0;
                let lo = _mulx_u64(*a.get_unchecked(a_i), b_elem, &mut hi);
                carry_lo = _addcarryx_u64(carry_lo, lo, *c.get_unchecked(b_i + a_i + 1), c.get_unchecked_mut(b_i + a_i + 1));
                carry_hi = _addcarryx_u64(carry_hi, hi, *c.get_unchecked(b_i + a_i), c.get_unchecked_mut(b_i + a_i));
            }
            *c.get_unchecked_mut(b_i) += carry_lo as u64;
        }

        c
    }
}

#[inline(never)]
#[no_mangle]
pub fn mp_mul_asm(a: &[u64], b: &[u64]) -> Vec<u64> {
    unsafe {
        let len = a.len();
        assert_eq!(len, b.len());
        let mut c = vec![0; len * 2];
        
        asm!("
    # start iteration of `b_i`: `len`-1 downto 0
    lea -1($3), %rsi
1:  # every iteration of `b_i`
        # load `b_elem`
        mov ($1, %rsi, 8), %rdx
        # clear `carry_lo`, `carry_hi`
        xor %r9, %r9
        # start iteration of `a_i`+1: `len` downto 1
        mov $3, %rcx
        jmp 3f
2:          # the end of every iteration of `a_i`+1, except the last iteration
            # no displacement, RCX has already been decremented
            mov %r10, (%r11, %rcx, 8)
3:      # every iteration of `a_i`+1
            # compute c + b_i
            lea ($2, %rsi, 8), %r11
            # compute a[a_i]*b_elem
            mulx -8($0, %rcx, 8), %r8, %r9
            # add low word
            mov (%r11, %rcx, 8), %r10
            adcx %r8, %r10
            mov %r10, (%r11, %rcx, 8)
            # add high word
            mov -8(%r11, %rcx, 8), %r10
            adox %r9, %r10
            # this is done later to be able to add in last carry in last iteration of `a_i`+1
            # mov %r10, -8(%r11, %rcx, 8)
            # advance `a_i`+1
            lea -1(%rcx), %rcx
            jrcxz 4f
            jmp 2b
4:          # the end of the last iteration of `a_i`+1
            adc $$0, %r10
            mov %r10, (%r11)
        # advance `b_i`
        dec %rsi
        jns 1b
"
            :
            : "r"(a.as_ptr()), "r"(b.as_ptr()), "r"(c.as_mut_ptr()), "r"(len)
            : "rsi", "rcx", "rdx", "r8", "r9", "r10", "r11", "flags"
        );

        c
    }
}

use num_bigint::{RandBigInt, BigUint};

fn bigint_to_slice(i: &BigUint) -> Vec<u64> {
    i.to_bytes_be().chunks(8).map(|chunk| {
        let mut a = [0u8; 8];
        a.copy_from_slice(chunk);
        u64::from_be_bytes(a)
    }).collect()
}


#[test]
fn test_rand() {
    for _ in 0..1_000 {
        let a = rand::thread_rng().gen_biguint(1024);
        let b = rand::thread_rng().gen_biguint(1024);
        if a.bits() + b.bits() <= 2041 {
            continue;
        }
        let c = &a * &b;
        assert_eq!(
            mp_mul_intrin(&bigint_to_slice(&a), &bigint_to_slice(&b)),
            bigint_to_slice(&c)
        );
        assert_eq!(
            mp_mul_asm(&bigint_to_slice(&a), &bigint_to_slice(&b)),
            bigint_to_slice(&c)
        );
        println!("ok");
    }
}

#[bench]
fn bench_intrin(bench: &mut test::Bencher) {
    loop {
        let a = rand::thread_rng().gen_biguint(1024);
        let b = rand::thread_rng().gen_biguint(1024);
        if a.bits() + b.bits() <= 2041 {
            continue;
        }
        let a = bigint_to_slice(&a);
        let b = bigint_to_slice(&b);
        bench.iter(|| mp_mul_intrin(&a, &b));
        break;
    }
}

#[bench]
fn bench_asm(bench: &mut test::Bencher) {
    loop {
        let a = rand::thread_rng().gen_biguint(1024);
        let b = rand::thread_rng().gen_biguint(1024);
        if a.bits() + b.bits() <= 2041 {
            continue;
        }
        let a = bigint_to_slice(&a);
        let b = bigint_to_slice(&b);
        bench.iter(|| mp_mul_asm(&a, &b));
        break;
    }
}

Here you can see a pretty significant difference:

test bench_asm    ... bench:         390 ns/iter (+/- 58)
test bench_intrin ... bench:         507 ns/iter (+/- 65)

For reference, here's the assembly I get for the intrinsics version:

   1638d:       49 8d 3c d8             lea    (%r8,%rbx,8),%rdi
   16391:       48 83 c7 f0             add    $0xfffffffffffffff0,%rdi
   16395:       49 89 da                mov    %rbx,%r10
   16398:       4d 8d 4a ff             lea    -0x1(%r10),%r9
   1639c:       48 85 db                test   %rbx,%rbx
   1639f:       75 24                   jne    163c5 <mp_mul_intrin+0xb5>
   163a1:       eb 6d                   jmp    16410 <mp_mul_intrin+0x100>
   163a3:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
   163aa:       00 00 00 
   163ad:       0f 1f 00                nopl   (%rax)
   163b0:       48 8b 5c 24 08          mov    0x8(%rsp),%rbx
   163b5:       48 83 c7 f8             add    $0xfffffffffffffff8,%rdi
   163b9:       4d 89 ca                mov    %r9,%r10
   163bc:       4d 8d 4a ff             lea    -0x1(%r10),%r9
   163c0:       48 85 db                test   %rbx,%rbx
   163c3:       74 4b                   je     16410 <mp_mul_intrin+0x100>
   163c5:       4b 8b 74 d7 f8          mov    -0x8(%r15,%r10,8),%rsi
   163ca:       31 ed                   xor    %ebp,%ebp
   163cc:       31 c9                   xor    %ecx,%ecx
   163ce:       66 90                   xchg   %ax,%ax
   163d0:       48 89 f0                mov    %rsi,%rax
   163d3:       49 f7 64 dc f8          mulq   -0x8(%r12,%rbx,8)
   163d8:       80 c1 ff                add    $0xff,%cl
   163db:       48 11 44 df 08          adc    %rax,0x8(%rdi,%rbx,8)
   163e0:       0f 92 c1                setb   %cl
   163e3:       40 80 c5 ff             add    $0xff,%bpl
   163e7:       48 11 14 df             adc    %rdx,(%rdi,%rbx,8)
   163eb:       40 0f 92 c5             setb   %bpl
   163ef:       48 83 c3 ff             add    $0xffffffffffffffff,%rbx
   163f3:       75 db                   jne    163d0 <mp_mul_intrin+0xc0>
   163f5:       0f b6 c1                movzbl %cl,%eax
   163f8:       4b 01 44 d0 f8          add    %rax,-0x8(%r8,%r10,8)
   163fd:       4d 85 c9                test   %r9,%r9
   16400:       75 ae                   jne    163b0 <mp_mul_intrin+0xa0>
   16402:       eb 1b                   jmp    1641f <mp_mul_intrin+0x10f>
   16404:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
   1640b:       00 00 00 
   1640e:       66 90                   xchg   %ax,%ax
   16410:       31 c9                   xor    %ecx,%ecx
   16412:       0f b6 c1                movzbl %cl,%eax
   16415:       4b 01 44 d0 f8          add    %rax,-0x8(%r8,%r10,8)
   1641a:       4d 85 c9                test   %r9,%r9
   1641d:       75 91                   jne    163b0 <mp_mul_intrin+0xa0>

@gnzlbg
Contributor Author

gnzlbg commented Apr 20, 2019

Thank you, I think that's very convincing. Could you file an LLVM bug for this and link it here? (Otherwise I can do it for you if you want.)

@jethrogb
Contributor

I'd prefer it if you did it; I'm guessing you have a better picture of what should be in such a bug report.

@gnzlbg
Contributor Author

gnzlbg commented Apr 20, 2019

While investigating for the LLVM bug report, I produced the following slightly more minimal example (godbolt - no_std, no debug info, no println, no panics, ...):

#![feature(adx_target_feature)]
#![no_std]
use core::{arch::x86_64::*, hint::unreachable_unchecked};

#[target_feature(enable = "adx,bmi2")]
pub unsafe fn mp_mul_intrin(a: &[u64], b: &[u64], c: &mut [u64]) {
    let len = a.len();
    if len != b.len() {
        unreachable_unchecked();
    }
    if c.len() != len * 2 {
        unreachable_unchecked();
    }

    for b_i in (0..len).into_iter().rev() {
        let b_elem = *b.get_unchecked(b_i);
        let mut carry_lo = 0;
        let mut carry_hi = 0;
        for a_i in (0..len).into_iter().rev() {
            let mut hi = 0;
            let lo = _mulx_u64(*a.get_unchecked(a_i), b_elem, &mut hi);
            carry_lo = _addcarryx_u64(
                carry_lo,
                lo,
                *c.get_unchecked(b_i + a_i + 1),
                c.get_unchecked_mut(b_i + a_i + 1),
            );
            carry_hi = _addcarryx_u64(
                carry_hi,
                hi,
                *c.get_unchecked(b_i + a_i),
                c.get_unchecked_mut(b_i + a_i),
            );
        }
        *c.get_unchecked_mut(b_i) += carry_lo as u64;
    }
}

Inspecting the LLVM IR in debug mode, we are not emitting llvm.x86.addcarryx calls, and I don't know why. Looking at the core::arch source of _addcarryx_u64: https://github.com/rust-lang-nursery/stdsimd/blob/master/crates/core_arch/src/x86_64/adx.rs#L33 we are calling llvm_addcarryx_u64 https://github.com/rust-lang-nursery/stdsimd/blob/master/crates/core_arch/src/x86_64/adx.rs#L9 which links llvm.x86.addcarryx.u64, so in debug mode that's what we should be seeing. I've checked the nightly version on godbolt; it is from 2019-04-18, so core::arch appears to be recent enough. Any ideas?

@gnzlbg
Contributor Author

gnzlbg commented Apr 20, 2019

This reproduces the issue (godbolt):

#![feature(link_llvm_intrinsics, abi_unadjusted, adx_target_feature)]

#[allow(improper_ctypes)]
extern "unadjusted" {
    #[link_name = "llvm.x86.addcarryx.u64"]
    fn llvm_addcarryx_u64(a: u8, b: u64, c: u64, d: *mut u8) -> u8;
}

#[target_feature(enable = "adx")]
pub unsafe fn _addcarryx_u64(c_in: u8, a: u64, b: u64, out: &mut u64) -> u8 {
    llvm_addcarryx_u64(c_in, a, b, out as *mut _ as *mut u8)
}

produces

source_filename = "example.3a1fbbbh-cgu.0"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define i8 @_ZN7example14_addcarryx_u6417hf98d062ffaad3a58E(i8 %c_in, i64 %a, i64 %b, i64* align 8 dereferenceable(8) %out) unnamed_addr #0 {
start:
  %0 = bitcast i64* %out to i8*
  %1 = call { i8, i64 } @llvm.x86.addcarry.64(i8 %c_in, i64 %a, i64 %b)
  %2 = extractvalue { i8, i64 } %1, 1
  %3 = bitcast i8* %0 to i64*
  store i64 %2, i64* %3, align 1
  %4 = extractvalue { i8, i64 } %1, 0
  br label %bb1

bb1: ; preds = %start
  ret i8 %4
}

declare { i8, i64 } @llvm.x86.addcarry.64(i8, i64, i64) #1

attributes #0 = { nounwind nonlazybind "probe-stack"="__rust_probestack" "target-cpu"="x86-64" "target-features"="+adx" }
attributes #1 = { nounwind readnone }

!llvm.module.flags = !{!0}

!0 = !{i32 2, !"RtLibUseGOT", i32 1}

when compiled with "-C opt-level=0 -C debuginfo=0 -C panic=abort --edition=2018 -C lto=no --emit=llvm-ir".

@gnzlbg
Contributor Author

gnzlbg commented Apr 20, 2019

It appears that this is just LLVM replacing the addcarryx intrinsic with addcarry (e.g. https://llvm.godbolt.org/z/vfVH70), even with opt-level=0.

@jethrogb
Contributor

jethrogb commented Apr 20, 2019

It looks like the mulx intrinsic also doesn't work correctly? Edit: there we don't even try to call the LLVM intrinsic...

@gnzlbg
Contributor Author

gnzlbg commented Apr 20, 2019 via email

@jethrogb
Contributor

jethrogb commented Apr 20, 2019

That is #27. In the code I shared, without mulx there's no point emitting adcx/adox, because the multiply would overwrite the flags we were so carefully trying to keep around.
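To make the flag dependency concrete, here is a minimal sketch (a hypothetical helper distilled from the mp_mul code above, assuming a toolchain where the adx and bmi2 target features are usable) of the inner step: MULX computes the full 128-bit product without touching any flags, which is the only reason the two _addcarryx_u64 chains could stay live across the multiply; a plain MUL would clobber both CF and OF.

use core::arch::x86_64::{_addcarryx_u64, _mulx_u64};

/// One multiply-accumulate step of the inner loop above.
#[target_feature(enable = "adx,bmi2")]
pub unsafe fn mac_step(
    a: u64,
    b: u64,
    acc_lo: &mut u64,
    acc_hi: &mut u64,
    carry_lo: u8,
    carry_hi: u8,
) -> (u8, u8) {
    let mut hi = 0u64;
    let lo = _mulx_u64(a, b, &mut hi); // MULX: leaves CF and OF untouched
    let lo_in = *acc_lo;
    let hi_in = *acc_hi;
    let c_lo = _addcarryx_u64(carry_lo, lo, lo_in, acc_lo); // intended ADCX (CF chain)
    let c_hi = _addcarryx_u64(carry_hi, hi, hi_in, acc_hi); // intended ADOX (OF chain)
    (c_lo, c_hi)
}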

@jethrogb
Contributor

#27 claims the 64-bit version "works fine" but that's not (or no longer) the case.

@gnzlbg
Contributor Author

gnzlbg commented Apr 20, 2019 via email

@gnzlbg gnzlbg added the A-llvm label Apr 22, 2019
@gnzlbg
Contributor Author

gnzlbg commented Apr 22, 2019

So this is now blocked on LLVM: https://bugs.llvm.org/show_bug.cgi?id=41546

It seems LLVM is not set up to generate code like this.

@jon-chuang

Does this affect using the asm! macro? Or does it only affect the intrinsics in std_arch?

@Amanieu
Member

Amanieu commented Mar 30, 2020

This only affects the intrinsics.

@jon-chuang

Doesn't u128's mul_assign depend on mulx?

@tarcieri
Contributor

I saw @gnzlbg mention this earlier:

We could work around it using inline assembly, but before we add any workarounds one would need to file an LLVM bug showing an MWE in which the transformation is incorrect.

...and @Amanieu mentioned that inline assembly is not impacted here.

It seems that the LLVM bug report has been filed, so is an inline assembly workaround potentially on the table here now?

@bjorn3
Member

bjorn3 commented Oct 27, 2021

AFAICT this is only a perf issue and not a correctness issue. Using inline assembly in the body of just this intrinsic definition will likely only slow things down, as LLVM won't be able to fold an immediate into the instruction. It can't cause a speedup either, since adcx still clobbers the carry flag and there is no way to tell LLVM that only the carry flag is clobbered and not the other flags. To get a perf improvement from inline assembly you will likely need to write the entire code sequence that contains the adcx usage in inline asm.
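To make that concrete, here is a minimal sketch of what "the entire sequence in inline asm" could look like for the simplest dual-chain case, using current asm! syntax (a hypothetical dual_add_2 helper, not code from this issue; it assumes the CPU supports ADX and drops the final carries):

use core::arch::asm;

/// a += b and c += d for two-limb little-endian integers, done in one asm!
/// block so the ADCX (CF) and ADOX (OF) carry chains cannot be split apart.
pub unsafe fn dual_add_2(a: &mut [u64; 2], b: &[u64; 2], c: &mut [u64; 2], d: &[u64; 2]) {
    asm!(
        "xor {t0:e}, {t0:e}",            // clears both CF and OF
        // limb 0
        "mov {t0}, qword ptr [{a}]",
        "mov {t1}, qword ptr [{c}]",
        "adcx {t0}, qword ptr [{b}]",    // consumes/produces CF only
        "adox {t1}, qword ptr [{d}]",    // consumes/produces OF only
        "mov qword ptr [{a}], {t0}",
        "mov qword ptr [{c}], {t1}",
        // limb 1
        "mov {t0}, qword ptr [{a} + 8]",
        "mov {t1}, qword ptr [{c} + 8]",
        "adcx {t0}, qword ptr [{b} + 8]",
        "adox {t1}, qword ptr [{d} + 8]",
        "mov qword ptr [{a} + 8], {t0}",
        "mov qword ptr [{c} + 8], {t1}",
        a = in(reg) a.as_mut_ptr(),
        b = in(reg) b.as_ptr(),
        c = in(reg) c.as_mut_ptr(),
        d = in(reg) d.as_ptr(),
        t0 = out(reg) _,
        t1 = out(reg) _,
        options(nostack),
    );
}

Because the whole ADCX/ADOX sequence lives in a single asm block, LLVM can neither reorder it nor fold immediates into it, which is exactly the trade-off described above.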

@tarcieri
Contributor

tarcieri commented Oct 27, 2021

@bjorn3 yeah, I was worried about that.

As @jethrogb mentioned earlier, what would be really nice here is some way to combine ADC/MULX such that the flags aren't clobbered. But if they are, it defeats the point.

To make matters even more complicated, we're looking for options that can potentially work in a const context, which is why we're looking for an intrinsics-based approach rather than using inline assembly ourselves. But if that isn't possible, we can potentially pursue these sorts of optimizations using inline assembly outside of a const context.
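For the const-context side of this, a common fallback is plain u128 arithmetic, which is const-compatible but leaves instruction selection entirely to the compiler; a minimal sketch (hypothetical, not tied to any particular crate):

/// Const-friendly add-with-carry: returns (sum, carry_out), carry_out in {0, 1}.
pub const fn adc(a: u64, b: u64, carry: u64) -> (u64, u64) {
    let wide = (a as u128) + (b as u128) + (carry as u128);
    (wide as u64, (wide >> 64) as u64)
}

This typically compiles down to add/adc, but it gives no control over whether ADX is ever used, which is the limitation being discussed.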

@dignifiedquire

dignifiedquire commented Oct 27, 2021

Disclaimer: this might be unsound & unsafe.

So I have this prototype, which uses only functions + asm! to build a carry chain. It does generate the expected assembly:

playground::carry_chain: # @playground::carry_chain
# %bb.0:
	pushq	%rax
	movq	%rdx, %rax

	xorq	%rdx, %rdx
	movq	%rdi, %rdx
	mulxq	%rax, %rdx, %rdi
	adoxq	%rdi, %rdx
	adcxq	%rax, %rdi

	movq	%rsi, %rdx
	mulxq	%rcx, %rax, %rdx
	adoxq	%rdx, %rax
	adcxq	%rcx, %rdx

	popq	%rcx
	retq
                                        # -- End function

(removed #APP comments for better readability)

I haven't been able to get rid of the mov instructions in between, but with some tweaking that should probably be possible as well (those would not be interleaved in hand-written assembly).

Rust Playground
