
Implement all x86 vendor intrinsics #40

Closed · 1 of 43 tasks · alexcrichton opened this issue Sep 25, 2017 · 50 comments · Fixed by #870

Comments

@alexcrichton
Member

alexcrichton commented Sep 25, 2017

This is intended to be a tracking issue for implementing all vendor intrinsics in this repository.
This issue is also intended to be a guide for documenting the process of adding new vendor intrinsics to this crate.

If you decide to implement a set of vendor intrinsics, please check the list below to make sure somebody else isn't already working on them. If it's not checked off or has a name next to it, feel free to comment that you'd like to implement it!

At a high level, each vendor intrinsic should correspond to a single exported Rust function with an appropriate target_feature attribute. Here's an example for _mm_adds_epi16:

/// Add packed 16-bit integers in `a` and `b` using saturation.
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(test, assert_instr(paddsw))]
pub unsafe fn _mm_adds_epi16(a: __m128i, b: __m128i) -> __m128i {
    unsafe { paddsw(a, b) }
}

Let's break this down:

  • The #[inline] is added because vendor intrinsic functions should essentially always be inlined: the intent of a vendor intrinsic is to correspond to a single particular CPU instruction, and a vendor intrinsic that is compiled into an actual function call could be quite disastrous for performance.
  • The #[target_feature(enable = "sse2")] attribute instructs the compiler to generate code with the sse2 target feature enabled, regardless of the target platform. That is, even if you're compiling for a platform that doesn't support sse2, the compiler will still generate code for _mm_adds_epi16 as if sse2 support existed. Without this attribute, the compiler might not generate the intended CPU instruction.
  • The #[cfg_attr(test, assert_instr(paddsw))] attribute indicates that when we're testing the crate we'll assert that the paddsw instruction is generated inside this function, ensuring that the SIMD intrinsic truly is an intrinsic for the instruction!
  • The types of the vectors given to the intrinsic should exactly match the types provided in the vendor interface (with things like int64_t translated to i64 in Rust).
  • The implementation of the vendor intrinsic is generally very simple. Remember, the goal is to compile a call to _mm_adds_epi16 down to a single particular CPU instruction. As such, the implementation typically defers to a compiler intrinsic (in this case, paddsw; a sketch of the corresponding extern block follows this list) when one is available. More on this below as well.
  • The intrinsic itself is unsafe due to the usage of #[target_feature].
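
For reference, the paddsw used in the body above is brought into scope with an extern block along these lines. This is a minimal sketch: the argument types and the llvm.x86.sse2.padds.w link_name are my assumptions and should be checked against what the crate already does.

#[allow(improper_ctypes)]
extern "C" {
    // LLVM's compiler intrinsic for the paddsw instruction (link name assumed).
    #[link_name = "llvm.x86.sse2.padds.w"]
    fn paddsw(a: __m128i, b: __m128i) -> __m128i;
}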

Once a function has been added, you should also add at least one test for basic functionality. Here's an example for _mm_adds_epi16:

#[simd_test = "sse2"]
unsafe fn test_mm_adds_epi16() {
    let a = _mm_set_epi16(0, 1, 2, 3, 4, 5, 6, 7);
    let b = _mm_set_epi16(8, 9, 10, 11, 12, 13, 14, 15);
    let r = _mm_adds_epi16(a, b);
    let e = _mm_set_epi16(8, 10, 12, 14, 16, 18, 20, 22);
    assert_eq_m128i(r, e);
}

Note that #[simd_test] is the same as #[test]; it's just a custom macro that enables the target feature in the test and generates a wrapper to ensure the feature is available on the local CPU as well.
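
Conceptually, a #[simd_test = "sse2"] test behaves roughly like the following. This is an approximation of the macro's output, not its literal expansion, and it assumes the crate's cfg_feature_enabled! runtime-detection macro:

#[test]
fn test_mm_adds_epi16() {
    // The actual test body is compiled with the target feature enabled...
    #[target_feature(enable = "sse2")]
    unsafe fn body() {
        // ... the assertions from the example above go here ...
    }

    // ...and only runs when the local CPU actually supports the feature.
    if cfg_feature_enabled!("sse2") {
        unsafe { body() }
    } else {
        println!("skipping test_mm_adds_epi16: sse2 not available on this CPU");
    }
}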

Finally, once that's done, send a PR!

Writing the implementation

An implementation of an intrinsic (so far) generally has one of three shapes:

  1. The vendor intrinsic does not have any corresponding compiler intrinsic, so you must write the implementation in such a way that the compiler will recognize it and produce the desired codegen. For example, the _mm_add_epi16 intrinsic (note the missing s in add) is implemented via simd_add(a, b), which compiles down to LLVM's cross-platform SIMD vector API (a sketch follows this list).
  2. The vendor intrinsic does have a corresponding compiler intrinsic, so you must write an extern block to bring that intrinsic into scope and then call it. The example above (_mm_adds_epi16) uses this approach.
  3. The vendor intrinsic has a parameter that must be a constant value when given to the CPU instruction, where that constant is often a parameter that impacts the operation of the intrinsic. This means the implementation of the vendor intrinsic must guarantee that a particular parameter is a constant. This is tricky because Rust doesn't (yet) have a stable way of doing this, so we have to do it ourselves. How you do it can vary, but one particularly gnarly example is _mm_cmpestri (make sure to look at the constify_imm8! macro); a trimmed-down sketch of the pattern also follows this list.
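
To make shapes 1 and 3 concrete, here are two minimal sketches. simd_add and constify_imm8! are real names from this crate; constify_imm2!, some_op, and _mm_example below are hypothetical stand-ins, and the real constify_imm8! has 256 match arms rather than 4:

// Shape 1: defer to a platform-independent intrinsic such as simd_add, which
// compiles down to LLVM's generic vector addition.
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(test, assert_instr(paddw))]
pub unsafe fn _mm_add_epi16(a: __m128i, b: __m128i) -> __m128i {
    simd_add(a, b)
}

// Shape 3: force an argument to be a constant by matching on its runtime value
// and passing a literal constant in every arm (a 2-bit toy version of the
// constify_imm8! pattern).
macro_rules! constify_imm2 {
    ($imm2:expr, $expand:ident) => {
        match ($imm2) & 0b11 {
            0 => $expand!(0),
            1 => $expand!(1),
            2 => $expand!(2),
            _ => $expand!(3),
        }
    };
}

// Hypothetical intrinsic whose last argument must be an immediate; `some_op`
// stands in for whatever compiler intrinsic actually needs the constant.
#[inline]
#[target_feature(enable = "sse2")]
pub unsafe fn _mm_example(a: __m128i, imm2: i32) -> __m128i {
    macro_rules! call {
        ($imm2:expr) => {
            some_op(a, $imm2)
        };
    }
    constify_imm2!(imm2, call)
}

Either way, the point is the same: the compiler must see a literal constant at every call of the underlying intrinsic.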

References

All Intel intrinsics can be found here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5236

The compiler intrinsics available to us through LLVM can be found here: https://gist.github.com/anonymous/a25d3e3b4c14ee68d63bd1dcb0e1223c

The Intel vendor intrinsic API can be found here: https://gist.github.com/anonymous/25d752fda8521d29699a826b980218fc

The Clang header files for vendor intrinsics can also be incredibly useful. When in doubt, Do What Clang Does:
https://github.com/llvm-mirror/clang/tree/master/lib/Headers

TODO

["AVX2"]

["MMX"]

["SSE"]

["SSE2"]

["SSE4.1"]



@alexcrichton
Member Author

cc @BurntSushi @gnzlbg, I've opened this up and moved TODO.md out here; I figure it may be easier to collaborate here to ensure we can attach names everywhere!

@mattico

mattico commented Sep 25, 2017

Could you edit the guide to suggest unsafe functions for the intrinsics? #21

@alexcrichton
Member Author

@mattico makes sense yeah! Although we may want to wait until #21 is closed out to avoid inconsistencies

@AdamNiederer
Contributor

AdamNiederer commented Sep 26, 2017

For those wishing to implement intrinsics above SSE2, make sure you're running your tests with RUSTFLAGS="-C target-cpu=native" cargo test on something which supports that instruction set extension. It looks like it's only running the SSE2 tests otherwise.

@gnzlbg
Contributor

gnzlbg commented Sep 26, 2017

You can use `RUSTFLAGS="-C target-feature=+avx2"` to enable a particular extension. Note, however, that a CPU that supports the extension is needed to run the tests. To develop tests for a different architecture (e.g. developing for ARM on x86) you can use cross-compilation. To run the tests... Travis is an option. I don't know if there is a better option, though.

@AdamNiederer
Contributor

AdamNiederer commented Sep 26, 2017

It looks like Travis only runs SSE2 and below with our current config. I wonder if their machines support AVX...

@alexcrichton
Member Author

@AdamNiederer oh, that's actually a bug! I think I see what's going on, though; I'll submit a fix.

@gnzlbg
Contributor

gnzlbg commented Sep 26, 2017

@alexcrichton https://github.com/rust-lang-nursery/stdsimd/blob/master/ci/run.sh probably needs to set RUSTFLAGS="-C target-cpu=native" to run most tests. @AdamNiederer makes a good point though: what instruction sets does Travis support? If it doesn't support AVX2, those will never be tested (I am pretty sure Travis does not support AVX512, so we'll need a different solution for that).

@AdamNiederer
Contributor

AdamNiederer commented Sep 26, 2017

Added in #45. Let's see what Travis has to say about it.

EDIT: The build is failing, but those same 20 tests were failing for me on my Ivy Bridge box last night. I think LLVM might be spitting out wider versions of 128-bit or 64-bit-wide instructions on CPUs which support them. It also looks like Travis supports AVX2 🎉

@alexcrichton
Member Author

@gnzlbg oh, I'm going to add cfg_feature_enabled! to all tests and enable them all unconditionally all the time; that way, whatever your CPU supports, we'll be testing everything (without any required interaction).

@AdamNiederer thanks! I'll look into the failures and see if I can fix them.

@dlrobertson
Contributor

Interested in helping out with this. Figured I'd start super small with cvtps2dq #65

@vbarrielle
Contributor

Hello, I've given _mm256_div_ps and its double counterpart a try; see #73.

@dlrobertson
Contributor

After #81, SSE 4.2 should be covered.

@BurntSushi
Member

@dlrobertson Awesome! I've updated the checklist.

@vbarrielle
Contributor

I've got an implementation for _mm256_{hadd,hsub}_{ps,pd} in #95.

@rroohhh

rroohhh commented Oct 6, 2017

What is the plan for FMA? Is there a reason behind omitting it from the list above?

@p32blo
Contributor

p32blo commented Oct 6, 2017

Here are some intrinsics that are in the TODO, but are already implemented.

sse

_mm_getcsr _mm_setcsr _MM_GET_EXCEPTION_STATE _MM_SET_EXCEPTION_STATE _MM_GET_EXCEPTION_MASK _MM_SET_EXCEPTION_MASK _MM_GET_ROUNDING_MODE _MM_SET_ROUNDING_MODE _MM_GET_FLUSH_ZERO_MODE _MM_SET_FLUSH_ZERO_MODE _mm_prefetch _mm_sfence

sse2

_mm_cvtpd_epi32 _mm_cvtsd_si32 _mm_cvtsd_ss _mm_cvtss_sd _mm_cvttpd_epi32 _mm_cvttsd_si32 _mm_cvttps_epi32 _mm_load_pd (no tests) _mm_store_pd (no tests) _mm_load1_pd

sse3

_mm_addsub_pd _mm_addsub_ps _mm_hadd_pd _mm_hadd_ps _mm_hsub_pd _mm_hsub_ps _mm_lddqu_si128 _mm_movedup_pd _mm_loaddup_pd _mm_movehdup_ps _mm_moveldup_ps

ssse3

_mm_alignr_epi8

avx

_mm256_and_pd _mm256_and_ps _mm256_andnot_pd _mm256_andnot_ps _mm256_blend_pd _mm256_blend_ps _mm256_blendv_pd _mm256_blendv_ps _mm256_div_pd _mm256_div_ps _mm256_dp_ps _mm256_hadd_pd _mm256_hadd_ps _mm256_hsub_pd _mm256_hsub_ps _mm256_or_pd _mm256_or_ps _mm256_shuffle_pd _mm256_shuffle_ps _mm256_xor_pd _mm256_xor_ps _mm256_cvtepi32_pd _mm256_cvtepi32_ps _mm256_cvtpd_ps _mm256_cvtps_epi32 _mm256_cvtps_pd _mm256_cvttpd_epi32 _mm256_cvtpd_epi32 _mm256_cvttps_epi32 _mm256_extractf128_ps _mm256_extractf128_pd _mm256_extractf128_si256 _mm256_extract_epi8 _mm256_extract_epi16 _mm256_extract_epi32 _mm256_extract_epi64 _mm256_zeroall _mm256_zeroupper _mm256_permutevar_ps _mm_permutevar_ps _mm256_permute_ps _mm256_undefined_ps _mm256_undefined_pd _mm256_undefined_si256

avx2

_mm256_alignr_epi8 _mm256_movemask_epi8

@alexcrichton
Member Author

@p32blo updated!

@gwenn
Contributor

gwenn commented Oct 8, 2017

_mm256_blend_ps and _mm256_shuffle_ps are not implemented.
When I try, I have to kill cargo/rustc: it seems that the macro expansion is too complex (8 levels).

@gnzlbg
Contributor

gnzlbg commented Oct 12, 2017

This post should also explain how to document the intrinsics.

@gnzlbg
Contributor

gnzlbg commented Oct 12, 2017

@rroohhh it should be part of AVX2 although we might want to implement it in its own module.

@GabrielMajeri
Contributor

@alexcrichton this issue's description is quite long and hard to browse; could you please use something like the mechanism described in this comment to allow collapsing individual sections?

Something like this

  • Some intrinsic

Code for the above:

<details><summary>Something like this</summary><p>
       << This line break is necessary!
- [ ] Some intrinsic
</p></details>

@nominolo
Contributor

@alexcrichton Could you please check off the following tasks in the SSE section?

  • everything from _mm_and_ps until _mm_ucomineq_ss
  • everything from _mm_set_ss until _mm_loadr_ps

For _mm_stream_ps please annotate it with a link to #114

@alexcrichton
Member Author

@nominolo done!

MaloJaffre added a commit to MaloJaffre/stdsimd that referenced this issue Nov 2, 2017
`_mm_cvtsd_f64`, `_mm_cvtsd_si64x` and `_mm_cvttsd_si64x`.
See rust-lang#40.
@tvladyslav
Contributor

@alexcrichton long story short:

The Intel® C++ Compiler provides short vector math library (SVML) intrinsics to compute vector math functions. ... The SVML intrinsics do not have any corresponding instructions. The prototypes for the SVML intrinsics are available in the immintrin.h file.

https://software.intel.com/en-us/node/524289

@AdamNiederer
Contributor

AdamNiederer commented Jan 29, 2018

The SVML is just a bunch of inlining-friendly assembly-level subroutines which use SSE/AVX instructions to compute higher-level mathematical primitives. Otherwise, I'm pretty sure it's "just another library". It's heavily optimized for Intel CPUs, much like ICC. I'm also pretty sure it's not open source or readily available.

@tvladyslav
Contributor

@alexcrichton, the SSE intrinsics are split into 3 folders: i586, i686 and x86_64. How should I know where to put an implementation for _mm_log2_pd, for example? It is not obvious to me.

@alexcrichton
Member Author

@crypto-universe @AdamNiederer ok cool, thanks for the info! Sounds like I should omit those intrinsics. I've updated the OP to omit the SVML intrinsics.

@crypto-universe oh, the division between those modules is somewhat unimportant now. The main distinction is that x86_64 is only compiled on 64-bit targets, but 32-bit targets compile both i586 and i686. If the intrinsic only works on x86_64 it should go there; otherwise either of the other modules is fine.

@alexcrichton
Member Author

Ok I think this is effectively "done enough" that we can close and follow up with more specific issues if need be. Thanks so much for everyone's help on this!

@et-tommythorn

Is this the right place to mention that core::arch is missing RISC-V support or should I open a tracking bug? (I'm specifically interested in adding support for the equivalent of rdtsc).

@Amanieu
Member

Amanieu commented Sep 17, 2020

We generally try to stick to vendor-specified intrinsics, e.g. SSE intrinsics and ARM NEON intrinsics. AFAIK RISC-V doesn't have any target-specific intrinsics defined in GCC or Clang.

@tommythorn

Ough. Thanks. I can see your reasoning, but that raises the bar by orders of magnitude and pushes the problem to all clients of core::arch :(

@Amanieu
Member

Amanieu commented Sep 17, 2020

You can always just use inline assembly if you really want a specific instruction...

@tommythorn

That's literally what "pushes the problem to all clients" means.

@Lokathor
Contributor

Probably best to just open a new issue where it can get eyes and discussion. The tail end of a long-closed issue isn't a good way to bring your problem to light.

@jack-pappas

@Amanieu It doesn't look like there are any RISC-V intrinsics in llvm/clang yet, but there is some recent work in that area: https://www.sifive.com/blog/risc-v-vector-extension-intrinsic-support

@Amanieu
Member

Amanieu commented Sep 18, 2020

Those are actually much trickier than they seem, since they involve scalable vectors with a size not known at compile time. This requires special support in the compiler. The same issue applies to the ARM SVE intrinsics.

@mjptree

mjptree commented Dec 9, 2020

Out of interest, and because it has recently become relevant: VMX intrinsics would be helpful.

@newpavlov
Contributor

Maybe it's worth opening a separate issue for each target feature? For example, I wanted to use _mm_stream_load_si128 and was quite surprised that std::arch::x86_64 does not have it.

Is there a reason why streaming load intrinsics were omitted?

@Amanieu
Member

Amanieu commented May 14, 2021

Please open a new issue if there are any missing intrinsics.
