Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement all ARM NEON intrinsics #148

Open
gnzlbg opened this issue Oct 24, 2017 · 32 comments
Open

Implement all ARM NEON intrinsics #148

gnzlbg opened this issue Oct 24, 2017 · 32 comments
Labels

Comments

@gnzlbg
Copy link
Contributor

gnzlbg commented Oct 24, 2017

Steps for implementing an intrinsic:

  • Select an intrinsic below
  • Review coresimd/arm/neon.rs and coresimd/aarch64/neon.rs
  • Consult ARM official documentation about your intrinsic
  • Consult godbolt for how the intrinsic should be codegen'd, using clang as an example. Use the links below and replace the name of the intrinsic in the code with your intrinsic. Note that if ARM is an error then your intrinsic may be AArch64-only
  • If the codegen is the same on ARM/AArch64, place the intrinsic in coresimd/arm/neon.rs. If it's different place it in both with appropriate #[cfg] in coresimd/arm/neon.rs. If it's only AArch64 place it in coresimd/aarch64/neon.rs
  • Write a test for your intrinsic at the bottom of the file as well
  • Test! Probably use rustup run nightly sh ci/run-docker.sh aarch64-unknown-linux-gnu.
  • When ready, send a PR!

All unimplemented NEON intrinsics

@gnzlbg

This comment has been minimized.

@oconnor663
Copy link
Contributor

Is there a blocker for these, or is it just finding time to do it? I'd like to help, but I'd need a more experienced compiler/SIMD person to point me in the right direction.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Nov 15, 2018

I can mentor. Start by taking a look at some of the intrinsics in the coresimd/aarch64/neon.rs module :)

@oconnor663
Copy link
Contributor

Is there some upstream source that these all get copied from, or are they actually written by hand?

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Nov 16, 2018

I am not sure I understand the question ? The neon modules in this repository are written by hand, although @Amanieu has expressed interest into generating some parts of them automatically.

@oconnor663
Copy link
Contributor

oconnor663 commented Nov 16, 2018 via email

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Nov 16, 2018

Ah, I see, that would be the ARM NEON spec: https://developer.arm.com/technologies/neon/intrinsics

@alexcrichton
Copy link
Member

Now might be a great time to help make some more progress on this! We've got tons of intrinsics already implemented (thanks @gnzlbg!), and I've just implemented automatic verification of all added intrinsics, so we know if they're added they've got the correct signature at least!

I've updated the OP of this issue with more detailed instructions about how to bind NEON intrinsics. Hopefully it's not too bad any more!

We'll probably want to reorganize modules so they're a bit smaller and more manageable over time, but for now if anyone's interested to add more intrinsics and needs some help let me know!

@valpackett
Copy link
Contributor

more manageable

I have a proposal for this: using a macro to make definitions one-line e.g.:

neon_op!(binary vadd_s8 : int8x8_t == simd_add, assert vadd / add, doc "Vector add");
neon_op!(binary vaddl_s8 : int8x8_t -> int16x8_t == simd_add, assert vaddl / saddl, doc "Vector long add");
neon_op!(unary vmovn_s16 : int16x8_t -> int8x8_t == simd_cast, assert vmovn / xtn, doc "Vector narrow integer");

This will make adding new ones easier (scrolling through a bolierplate-filled file just feels awful), and I'll add a lot more simd_sub simd_mul simd_lt etc. ones. Would this be accepted?

macro definition I currently have
macro_rules! neon_op {
    (binary $name:ident : $type:ident == $op:ident, assert $instr32:ident / $instr64:ident, doc $doc:literal) => {
        #[inline]
        #[doc = $doc]
        #[target_feature(enable = "neon")]
        #[cfg_attr(target_arch = "arm", target_feature(enable = "v7"))]
        #[cfg_attr(all(test, target_arch = "arm"), assert_instr($instr32))]
        #[cfg_attr(all(test, target_arch = "aarch64"), assert_instr($instr64))]
        pub unsafe fn $name(a: $type, b: $type) -> $type {
            $op(a, b)
        }
    };
    (binary $name:ident : $type:ident -> $result_type:ident == $op:ident, assert $instr32:ident / $instr64:ident, doc $doc:literal) => {
        #[inline]
        #[doc = $doc]
        #[target_feature(enable = "neon")]
        #[cfg_attr(target_arch = "arm", target_feature(enable = "v7"))]
        #[cfg_attr(all(test, target_arch = "arm"), assert_instr($instr32))]
        #[cfg_attr(all(test, target_arch = "aarch64"), assert_instr($instr64))]
        pub unsafe fn $name(a: $type, b: $type) -> $result_type {
            let a: $result_type = simd_cast(a);
            let b: $result_type = simd_cast(b);
            $op(a, b)
        }
    };
    (unary $name:ident : $type:ident -> $result_type:ident == $op:ident, assert $instr32:ident / $instr64:ident, doc $doc:literal) => {
        #[inline]
        #[doc = $doc]
        #[target_feature(enable = "neon")]
        #[cfg_attr(target_arch = "arm", target_feature(enable = "v7"))]
        #[cfg_attr(all(test, target_arch = "arm"), assert_instr($instr32))]
        #[cfg_attr(all(test, target_arch = "aarch64"), assert_instr($instr64))]
        pub unsafe fn $name(a: $type) -> $result_type {
            $op(a)
        }
    };
}

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Jul 28, 2019

For the definitions, I think that using macros is ok.

I am not sure I follow how does macros generate run-time tests for the intrinsics, that's usually the bulk of the work.

@aloucks
Copy link

aloucks commented Jul 8, 2020

What is the reasoning behind some intrinsics linking in the LLVM intrinsic directly while others are using the generic simd_XXX functions?

For example:

/// Halving add
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(target_arch = "arm", target_feature(enable = "v7"))]
#[cfg_attr(all(test, target_arch = "arm"), assert_instr("vhadd.u16"))]
#[cfg_attr(all(test, target_arch = "aarch64"), assert_instr(uhadd))]
pub unsafe fn vhadd_u16(a: uint16x4_t, b: uint16x4_t) -> uint16x4_t {
#[allow(improper_ctypes)]
extern "C" {
#[cfg_attr(target_arch = "arm", link_name = "llvm.arm.neon.vhaddu.v4i16")]
#[cfg_attr(target_arch = "aarch64", link_name = "llvm.aarch64.neon.uhadd.v4i16")]
fn vhadd_u16_(a: uint16x4_t, b: uint16x4_t) -> uint16x4_t;
}
vhadd_u16_(a, b)
}

Versus:

/// Compare bitwise Equal (vector)
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(cmeq))]
pub unsafe fn vceq_u64(a: uint64x1_t, b: uint64x1_t) -> uint64x1_t {
simd_eq(a, b)
}

Given the sheer volume of neon intrinsics, it seems rather daunting to implement them all by hand using the guide in the first post. I'm wondering if there's a deterministic data driven way to generate all of them using #[link_name = "llvm.*"] as done in the first example. Maybe the llvm c headers could be useful?

@bjorn3
Copy link
Member

bjorn3 commented Jul 8, 2020

What is the reasoning behind some intrinsics linking in the LLVM intrinsic directly while others are using the generic simd_XXX functions?

Not all intrinsics have a corresponding simd_* platform-intrinsic.

I'm wondering if there's a deterministic data driven way to generate all of them using #[link_name = "llvm.*"] as done in the first example. Maybe the llvm c headers could be useful?

Please don't. The simd_* platform intrinsics are much easier to implement in alternative codegen backends than the llvm intrinsics, as they are generic over vector types and they are backend independent.

@alexcrichton
Copy link
Member

@aloucks most of the intrinsics (AFAIK) have been added piecemeal over time, so it's sort of expected that they're not 100% consistent. Otherwise though I'd imagine that whatever works best would be fine to add to this repository. Auto-generation sounds pretty reasonable to me, and for an implementation we strive to match what Clang does in its implementation of these intrinsics.

@alexcrichton
Copy link
Member

Also, to be clear, this library is not designed for ease of implementation in alternate codegen backends. The purpose of this crate is to get the LLVM backend up and running with SIMD. Discussions and design constraints for alternate backends should be discussed in a separate issue.

@Lokathor
Copy link
Contributor

Lokathor commented Jul 17, 2020

Hey all, some friends and I have made a google sheet of all the Neon intrinsics, their inputs, output, and the ARM summary comment.

There could easily have been errors when copying around and manipulating thousands of entries of text, but I think that it's got all the bugs sorted out.

If you want to try some auto-generation, this is a good place to start. There's even a column where I've marked what we have in nightly so far, so if you just auto-gen all the functions that aren't checked you shouldn't hit any duplicate definitions.

I hope to find the time to actually contribute some functions, but for now this will have to do.

EDIT: also I just subscribed to the entire repo, so if there's any PRs that add more functions I'll try to check those boxes on the sheet and keep it up to date.

@Lokathor
Copy link
Contributor

Working with @Shnatsel, I described the "godbolt process" and they were kind enough to make it a bash script that you can run locally

#!/bin/bash
set -e
INTRINSIC_NAME="$1"
TEMP_DIR="$(mktemp -d)"
cleanup() {
    rm -r "$TEMP_DIR"
}
trap cleanup EXIT
(
cd "$TEMP_DIR"
echo "#include <arm_neon.h>
int test() {
  return (int) $INTRINSIC_NAME;
}" > ./in.c

clang -emit-llvm -O2 -S -target armv7-unknown-linux-gnueabihf -g0 in.c
ARM_NAME=$(grep --only-matching '@llvm.arm.neon.[A-Za-z0-9.]\+' ./*.ll | tr -d '@' | head -n 1)

clang -emit-llvm -O2 -S -target aarch64-unknown-linux-gnu -g0 in.c
AARCH64_NAME=$(grep --only-matching '@llvm.aarch64.neon.[A-Za-z0-9.]\+' ./*.ll | tr -d '@' | head -n 1)

echo "$INTRINSIC_NAME, $ARM_NAME, $AARCH64_NAME"
)

You will probably need the gcc-multilib package or similar installed so that the correct headers are available.

Note that many functions don't have an associated llvm intrinsic that can be as easily scrapped out this way, but maybe 1/4th or so of them do.

@SparrowLii
Copy link
Member

@Lokathor Several instructions have been added recently: vaddhn, vbic, vorn, vceqz, vtst, vabd, vaba. Though some of them are not fully supported( like vceqzd). If you don’t have time to maintain this google sheet, I think I can help

@nano-bot
Copy link

nano-bot commented Mar 9, 2021

@Lokathor Several instructions have been added recently: vaddhn, vbic, vorn, vceqz, vtst, vabd, vaba. Though some of them are not fully supported( like vceqzd). If you don’t have time to maintain this google sheet, I think I can help

Awesome, looking forward to this!

@fzyzcjy
Copy link

fzyzcjy commented Sep 27, 2021

Any updates after a long time...? Thanks

@bjorn3
Copy link
Member

bjorn3 commented Sep 27, 2021

If you look at the pull request list you can see that there has been activity on this quite recently. For example #1224 was opened yesterday.

@fzyzcjy
Copy link

fzyzcjy commented Sep 27, 2021

@bjorn3 Thanks! Indeed I mostly want to know when can we see it in stable version.
By the way do you suggest use nightly in production environment? If so I can use it now.

@CryZe
Copy link
Contributor

CryZe commented Oct 21, 2021

@SparrowLii You marked the following instructions as completed (same for min):

https://i.imgur.com/OipjDCy.png

It doesn't seem like those instructions are actually part of your recent PR (nor were they on the master branch before that) so I unmarked them again.

@SparrowLii
Copy link
Member

SparrowLii commented Oct 21, 2021

@CryZe They can be found in the master branch now:
https://github.com/rust-lang/stdarch/blob/master/crates/core_arch/src/aarch64/neon/generated.rs#L8519-L8539
https://github.com/rust-lang/stdarch/blob/master/crates/core_arch/src/aarch64/neon/generated.rs#L8545-L8565
Sorry I marked them before #1230 merged, this is to prevent others from submitting duplicate PRs

@CryZe
Copy link
Contributor

CryZe commented Oct 21, 2021

Welp, I'll mark them again then. Somehow the GitHub Pull Request UI doesn't show them as diffs at all: https://i.imgur.com/BsHR5in.gif

@SparrowLii
Copy link
Member

Github’s comparison tool will always have problems when changing a large amount of code XD

@SparrowLii
Copy link
Member

SparrowLii commented Oct 21, 2021

As in #1230, except for the following instructions and those use 16-bit floating-point, other instructions have been implemented:

  1. The following instructions are only available in aarch64 now, because the corresponding target_feature cannot be found in the available features of arm:
    vcadd_rotvcmlavdot

  2. The feature i8mm is not valid:
    vmmlavusmmla: https://rust.godbolt.org/z/8GbKW5ef4

  3. LLVM ERROR(Can be reproduced in godbolt):
    vsm4e: https://rust.godbolt.org/z/xhT1xvGTP

  4. LLVM ERROR(Normal in gotbolt, but LLVM ERROR: Cannot select: intrinsic raises at runtime)
    vsudotvusdot: https://rust.godbolt.org/z/aMnEvab3n
    vqshlu: https://rust.godbolt.org/z/hvGhrhdMT

  5. Not implmented in LLVM and cannot be implemented manually:
    vmull_p64(for arm)、vsm3vrax1q_u64vxarq_u64vrnd32vrnd64vsha512

@Amanieu
Copy link
Member

Amanieu commented Oct 21, 2021

As in #1230, except for the following instructions and those use 16-bit floating-point, other instructions have been implemented:

1. The following instructions are only available in aarch64 now, because the corresponding `target_feature` cannot be found in the available features of arm:
   `vcadd_rot`、`vcmla`、`vdot`

On LLVM's ARM backend, vcadd_rot and vcmla are under the v8.3a feature. vdot is under the dotprod feature. I got this information from llvm-project/llvm/lib/Target/ARM/ARMInstrNEON.td.

2. The feature `i8mm` is not valid:
   `vmmla`、`vusmmla`: [rust.godbolt.org/z/8GbKW5ef4](https://rust.godbolt.org/z/8GbKW5ef4)

Already discussed in rust-lang/rust#90079.

3. LLVM ERROR(Can be reproduced in godbolt):
   `vsm4e`: [rust.godbolt.org/z/xhT1xvGTP](https://rust.godbolt.org/z/xhT1xvGTP)

Use llvm.aarch64.crypto.sm4ekey instead of llvm.aarch64.sve.sm4ekey.

4. LLVM ERROR(Normal in gotbolt, but `LLVM ERROR: Cannot select: intrinsic` raises at runtime)
   `vsudot`、`vusdot`: [rust.godbolt.org/z/aMnEvab3n](https://rust.godbolt.org/z/aMnEvab3n)
   `vqshlu`: [rust.godbolt.org/z/hvGhrhdMT](https://rust.godbolt.org/z/hvGhrhdMT)

You need to make you test function pub in godbolt, otherwise it will be optimized away as unreachable by rustc before LLVM.

vsudot/vusdot require the i8mm target feature. vqshlu seems to work fine in godbolt after changing the pub.

5. Not implmented in LLVM and cannot be implemented manually:
   `vmull_p64`(for arm)、`vsm3`、`vrax1q_u64`、`vxarq_u64`、`vrnd32`、`vrnd64`、`vsha512`

These all seem to exist in LLVM at least for AArch64. For ARM we can just leave these out for now.

@SparrowLii
Copy link
Member

Hope someone can help implement the remaining instructions.

@SparrowLii
Copy link
Member

SparrowLii commented Nov 9, 2021

@Amanieu v8.5a feature is non-runtime detected so we can't use #[simd_test(enable = "neon,v8.5a")]. So how do we add tests for instructions that use v8.5a, like vrnd32x and vrnd64x?

@hkratz
Copy link
Contributor

hkratz commented Nov 9, 2021

@SparrowLii Shouldn't that work with the frintts feature?

@SparrowLii
Copy link
Member

@SparrowLii Shouldn't that work with the frintts feature?

Looks useful: https://rust.godbolt.org/z/894W8cndG

@Amanieu
Copy link
Member

Amanieu commented Nov 9, 2021

LLVM only supports frintts on AArch64, so it's fine to not support this intrinsic on ARM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests