
There are several other projects that occupy a similar space. This page is intended to provide a brief comparison between SIMDe and the alternatives. If you are aware of any other similar projects, please add them; we'd love to take a look, and maybe we can steal some ideas for SIMDe :)

Implementing one API with another

  • Implements: ARM, Intel
  • Language: C++
  • License: Apache 2.0

Iris is the only other project I'm aware of which is attempting to create portable implementations like SIMDe.

It's C++-only and Apache 2.0-licensed. AFAICT there are no accelerated fallbacks, nor is there a good way to add them, since it relies extensively on templates. It is easy to add new portable implementations (much more so than in SIMDe), though.

  • Implements: NEON
  • Language: C
  • License: Apache 2.0

Written by an Intel employee to implement NEON on top of SSE, up to SSE 4.2. There are no portable fallbacks, so depending on which functions you use it may require anything up to SSE 4.2.

  • Implements: SSE
  • Language: C
  • License: MIT

This project implements full SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, and the AES extension using NEON. The implementations are generally quite good. However, overall, SIMDe is capable of supporting more extensions, such as AVX. A large number of the implementations from sse2neon were merged into SIMDe early on (this is actually the main reason SIMDe is MIT licensed). Occasionally, we review and merge new implementations into SIMDe, especially if they are improvements over our current ones.
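For context, here is a rough sketch of how a drop-in header like this is typically used; it is not lifted from sse2neon's documentation, and the guard macro and file layout are just the common pattern. You include "sse2neon.h" on ARM instead of the Intel headers and keep the Intel intrinsic names unchanged.

```c
/* Sketch: drop-in use of sse2neon. On x86 include the real Intel header;
 * on ARM include sse2neon.h and keep using the Intel intrinsic names. */
#if defined(__ARM_NEON)
  #include "sse2neon.h"   /* assumes the header has been copied into the project */
#else
  #include <emmintrin.h>  /* SSE2 */
#endif
#include <stdint.h>
#include <stdio.h>

int main(void) {
  __m128i a = _mm_set_epi32(1, 2, 3, 4);
  __m128i b = _mm_set_epi32(10, 20, 30, 40);
  __m128i sum = _mm_add_epi32(a, b);  /* same intrinsic name on both targets */

  int32_t out[4];
  _mm_storeu_si128((__m128i *) out, sum);
  printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
  return 0;
}
```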

  • Implements: SSE
  • Language: D
  • License: Boost

Implements the Intel intrinsics API in D, using either AArch64 or a portable implementation.

Abstraction Layers

There are lots of projects which aim to create an abstraction layer which can be ported to different architectures.

This approach can work well, but it doesn't always. The benefits and drawbacks are roughly the same as with the SIMDe approach; the difference is that these projects define their own functions instead of using an existing API as the abstraction layer. This means they can iron out some oddities in the API, but code has to be rewritten to target the new API, and there is always a bit of a mismatch between the API and the platform.

Furthermore, abstraction layers tend to result in lowest-common-denominator APIs. Take _mm_maddubs_epi16, a favorite example of mine because it's so specific… I honestly have no idea when it's useful, but I'm sure Intel had a particular use case in mind when they added it. It multiplies each unsigned 8-bit integer in one vector by the corresponding signed 8-bit integer in the other, producing a signed 16-bit intermediate result for each lane, then adds each horizontal pair of intermediates with signed saturation and returns the 16-bit results.

It's highly unlikely that an abstraction layer will have a single function which does all of that. It's much more likely that you'll have to call a few functions: maybe one for each input to widen to 16 bits, then at least one multiplication (remember, the widened data now spans 256 bits). Next you'll need to perform pairwise addition. There might be a function for that, but if not you'll need some shuffles to extract the even and odd values from each vector. Then you perform saturated addition on those, which again may or may not be supported by your abstraction layer; if not, you'll need to emulate it with min/max calls (if supported) or a couple of comparisons and blends.
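To make the example concrete, here is a minimal scalar sketch of what a single _mm_maddubs_epi16 call computes, based on Intel's documented semantics rather than SIMDe's actual implementation; the function and variable names are just illustrative, and the comments mark the steps an abstraction layer would have to express as separate operations.

```c
#include <stdint.h>

/* Scalar reference model for _mm_maddubs_epi16 (sketch, not SIMDe code).
 * a: 16 unsigned 8-bit lanes, b: 16 signed 8-bit lanes, r: 8 signed 16-bit lanes. */
static int16_t saturate_s16(int32_t v) {
  if (v > INT16_MAX) return INT16_MAX;
  if (v < INT16_MIN) return INT16_MIN;
  return (int16_t) v;
}

static void maddubs_epi16_ref(const uint8_t a[16], const int8_t b[16], int16_t r[8]) {
  for (int i = 0 ; i < 8 ; i++) {
    /* Widen and multiply: an abstraction layer needs separate widening
     * and multiplication calls for this. */
    int32_t even = (int32_t) a[2 * i]     * (int32_t) b[2 * i];
    int32_t odd  = (int32_t) a[2 * i + 1] * (int32_t) b[2 * i + 1];
    /* Pairwise addition with signed saturation: the pairwise add and the
     * saturating add may each need their own call (or emulation). */
    r[i] = saturate_s16(even + odd);
  }
}
```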

On the other hand, with SIMDe we can also add optimized implementations of various functions, and _mm_maddubs_epi16 is no exception. There is already an AArch64 implementation which should be pretty fast, and an ARMv7 NEON implementation which isn't too bad. We may also be able to add some more implementations in the future.

With SIMDe, what you get isn't the lowest common denominator of functionality; it's the union of everything that's available. SIMDe's _mm_maddubs_epi16 may not be any faster than an abstraction layer if you're targeting a platform without an optimized implementation, but if you're targeting one that has one, SIMDe is going to be a lot faster.
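As a rough sketch of what that looks like in practice (using SIMDe's prefixed names and its simde/x86/ header layout; the data values are arbitrary), the same call compiles on x86, ARM, or anything else, and SIMDe picks the best available implementation:

```c
#include <simde/x86/ssse3.h>  /* pulls in SSE2 etc. as dependencies */
#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint8_t a_bytes[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
  int8_t  b_bytes[16] = { 1, -1, 2, -2, 3, -3, 4, -4, 5, -5, 6, -6, 7, -7, 8, -8 };

  simde__m128i a = simde_mm_loadu_si128((const simde__m128i *) a_bytes);
  simde__m128i b = simde_mm_loadu_si128((const simde__m128i *) b_bytes);

  /* One call; SIMDe maps it to SSSE3, NEON, or a portable fallback. */
  simde__m128i r = simde_mm_maddubs_epi16(a, b);

  int16_t out[8];
  simde_mm_storeu_si128((simde__m128i *) out, r);
  for (int i = 0 ; i < 8 ; i++) printf("%d ", out[i]);
  printf("\n");
  return 0;
}
```

Defining SIMDE_ENABLE_NATIVE_ALIASES before the include also lets existing code keep the unprefixed Intel names, though the prefixed forms are the safer default.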

SIMDe's approach isn't without drawbacks, of course. For one, it can be hard to know whether a particular function will be fast or slow on a given architecture, whereas lowest-common-denominator libraries will pretty much be fast everywhere but functionality will be a bit more basic. It's also a lot more work… there are around 6500 SIMD functions in x86 alone, and IIRC NEON is at around 2500.

Using an abstraction layer in SIMDe is something we're willing to consider. We would still be able to provide implementations directly in the APIs we're implementing so a lowest-common-denominator API wouldn't be a problem, and it would help us keep some logic centralized and make it easier to add new code. We'll probably end up creating something designed specifically for SIMDe, but hopefully other projects can provide some ideas, too.

  • Language: C++
  • License: Apache 2.0

Highway does dynamic dispatch, which is interesting, but there will be a performance hit. OTOH, it makes dynamic dispatch much easier to adopt, since users don't have to do anything special.

It's also width-agnostic, which is very nice.

  • Language : C++
  • License: 3-clause BSD

This seems to be private to OpenCV, but you should be able to rip it out and plug it into another codebase without too much effort.

It's obviously targeted at computer vision, so there may be some holes in the API if you want to use it for other purposes, but it seems pretty well designed. If you already depend on OpenCV it's definitely worth taking a look.

This is an effort to develop a standard interface to SIMD functions for C++. AFAICT it relies pretty heavily on auto-vectorization, and the idea is that you basically annotate your code with std::experimental::simd stuff to help the compiler.

It actually seems pretty cool, but limited. If you want to try it now, see VcDevel/std-simd. (It should already be available for Clang.)

  • Language: C++11
  • License: 3-clause BSD

Looks like a pretty standard, but good, SIMD abstraction layer.

  • Language: C89, C++98, C++11 and C++14
  • License: MIT (Expat)

Similar to xsimd, but differs slightly in scope. Has SVE, GPU, and SPMD support, but no STL algorithms.