Option to build without "-march=native" #20

Closed
paoloczi opened this issue Apr 15, 2019 · 46 comments

Comments

@paoloczi

The code builds with "-march=native" which is best for performance but reduces portability. There are use cases where portability is more important than performance, and therefore it would be nice to also have the option to build without "-march=native".

@rvaser
Owner

rvaser commented Apr 16, 2019

Implemented in the newest version (3.0.0). You can remove -march=native from the flags with -Dspoa_optimize_for_native=OFF when running cmake.

Best regards,
Robert

@paoloczi
Author

Thank you for the quick response!

@Hikoyu

Hikoyu commented Jun 18, 2019

Using -mssse3 or -msse4.1 instead of -march=native clearly affects the performance on recent CPUs (Haswell or later).
I benchmarked the calculation time under each compilation condition: -mssse3, -msse4.1, -mavx2, and -march=native.
The number of sequences was 30.
This test was performed on my Linux server (Ubuntu 18.04, Xeon E3-1240L v3).
[figure: calculation time under each compilation condition]

@rvaser
Owner

rvaser commented Jun 18, 2019

Thanks for the evaluation; -march=native is the default build parameter. I was wondering how you managed to compile with -mssse3. I am looking at the Intel Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/) and the function _mm_max_epi32 is supported only in SSE4.1 and onwards.

@Hikoyu

Hikoyu commented Jun 18, 2019

I just edited line 20 of 'CMakeLists.txt' and executed cmake with the following options: -DCMAKE_CXX_FLAGS=-fPIC -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON -Dspoa_build_executable=ON.
The compiler is gcc 7.4.0.
Disassembling the binary with objdump showed that specifying -mssse3 produced little vectorization. (Of course, no pmaxsd or vpmaxsd instructions were included.)

@rvaser
Owner

rvaser commented Jun 18, 2019

Ah, I forgot that spoa also has SISD code; not specifying SSE4.1 or AVX2 will fall back to it :)

@paoloczi
Author

My request for an option to build without -march=native was motivated by situations where portability of compiled code is more important than performance (thank you @rvaser for adding that option!). Spoa correctly makes -march=native the default.

@Hikoyu it would be useful if you could add to your plot a curve for compilation without -march and without any -m options (this is what happens when spoa is built with -Dspoa_optimize_for_native=OFF). That results in code with maximum portability (that is, you can run the code on a machine different from the one on which it was built). That would give us a quantitative measure of the performance cost of portability for spoa. Extrapolating from your curves, it could be close to a factor of two, but it would be nice to know.

@Hikoyu

Hikoyu commented Jun 20, 2019

Sorry guys, the previous benchmark results were strange (too slow). I might have done something inappropriate during compilation.
So, I recompiled spoa v3.0.0 for each condition without editing 'CMakeLists.txt' and reran the benchmark tests. I also performed a depth-dependence test.

The CMake options are shown below:
native: -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON -Dspoa_build_executable=ON -DCMAKE_CXX_FLAGS='-fPIC'
without native: -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON -Dspoa_build_executable=ON -Dspoa_optimize_for_native=OFF -DCMAKE_CXX_FLAGS='-fPIC'
ssse3: -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON -Dspoa_build_executable=ON -Dspoa_optimize_for_native=OFF -DCMAKE_CXX_FLAGS='-fPIC -mssse3'
sse4.1: -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON -Dspoa_build_executable=ON -Dspoa_optimize_for_native=OFF -DCMAKE_CXX_FLAGS='-fPIC -msse4.1'
avx2: -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON -Dspoa_build_executable=ON -Dspoa_optimize_for_native=OFF -DCMAKE_CXX_FLAGS='-fPIC -mavx2'

[figure: benchmark results for each build configuration]
[figure: depth-dependence benchmark results for each build configuration]

The results for 'without native' and 'native' coincided with those for ssse3 and avx2, respectively, so the green and purple plots are hidden behind them.

@rvaser
I see. Presumably 'without native' also used the SISD code?

@paoloczi
I see your point, but building without native caused more than a 3x performance degradation according to my recent data. I recommend adding at least -msse4.1, which is available on Penryn (2nd generation Intel Core 2) or later CPUs.

@paoloczi
Author

Thank you @Hikoyu for doing this. The performance ratio between native and portable is 3 to 5, much greater than I expected. I use spoa in the Shasta long read assembler, and in that case I also build with -march=native by default. However, most users of this type of software expect to be able to download a pre-built release and use it on whatever machine they choose. For that reason the release build is done without -march=native, and the documentation explains that for better performance you should rebuild the code on the run machine. Not an ideal situation, but a reasonable compromise.

Fortunately for the Shasta assembler spoa takes only a small fraction of the total assembly time (5% or so when using -march=native). However, it is likely/possible that the rest of the assembly code will also be impacted. I have not yet had the time to do a comparison with/without -march=native for assembly of a large genome.

Any additional thoughts/suggestions/perspective on this?

@Hikoyu

Hikoyu commented Jun 20, 2019

As you pointed out, -march=native should be removed for binary distribution. But the barely vectorized SISD code appears to be used when -Dspoa_optimize_for_native=OFF is specified in cmake.
Well, how old a CPU would you like to support?
For example, recent (2013-) Intel processors (Haswell or later) support AVX2, and semi-old (2008-) Intel processors (Penryn or later) support SSE4.1.
If you want to support older processors, just remove -march=native.
Otherwise, I suppose that specifying -msse4.1 or -mavx2 instead of -march=native maintains portability without losing performance.

@paoloczi
Author

Thanks @Hikoyu for your perspective. I suspect the next step on this for me would be to quantify the cost of portability for my case like you did for spoa, and then make a decision based on that information and your considerations.

@rvaser
Owner

rvaser commented Jun 20, 2019

@Hikoyu yes, when compiling without -march=native the SISD code is used. Either SSE4.1 or AVX2 needs to be set in order to run the SIMD code. Also, thanks for the follow-up evaluation!

@paoloczi I agree with @Hikoyu that you should consider shipping with -msse4.1, as most CPUs nowadays support it.

@paoloczi
Author

Thanks @rvaser, I will definitely consider your suggestion for my next release, after some performance experiments.

On the other hand, here are some TensorFlow users complaining that it does not run on their machine without SSE4.1. I think the answer for these people should be "if you want to do this type of computation, get a new computer"...

@Hikoyu

Hikoyu commented Jun 20, 2019

@paoloczi I think they should get a new computer because their machines are probably over 10 years old. In addition, 32-bit Intel processors do not support SSE4.1, but almost all 64-bit Intel processors (except for the 1st generation of Intel Core 2) support it.

@rvaser
Owner

rvaser commented Mar 8, 2020

A quick follow-up: v3.0.2 now has a new flag (spoa_optimize_for_portability) which compiles with -msse4.1. This flag is used in Bioconda in order to avoid problems tied to -march=native.

@paoloczi
Author

paoloczi commented Mar 9, 2020

Thank you for the update!

Ideally, I would like the ability to install on a single machine both a native library (-march=native) and a portable/semi-portable library (no -m option or -msse4.1), built with different names. That way the decision between the two versions is postponed until build time. The way it is, that decision has to be made when installing spoa.

Of course, I could imitate this behavior by renaming the library, but that would be a bit messy and I would prefer to avoid it.

@mr-c

mr-c commented Apr 13, 2020

Hello from the Debian Med team. Thank you for the option to turn off -march=native; however, we can't ship an SSE4.1-only library, as the AMD64 (a.k.a. x86_64) ISA baseline is SSE2 and no higher.

However, if you can add a dispatcher routine and compile multiple versions (-mavx2, -mavx, -msse4.1, .. down to -msse2) so that the dispatcher chooses the correct routine at run time, then that would be okay by us.

For compatibility below SSE4.1 I suggest looking at https://github.com/nemequ/simde/; an example patch to spoa using the SIMD Everywhere header-only library is at https://salsa.debian.org/med-team/spoa/-/blob/master/debian/patches/simde, but since there is no dispatching going on in the library, we only compile with -msse2 on 64-bit Intel/AMD.

TL;DR: please add SIMD dispatching to your library/build system so that Debian can ship a performant and truly portable library. Thanks!

@rvaser
Owner

rvaser commented Apr 14, 2020

Hi Michael,
I cannot downgrade to SSE2 as one or two instructions are only present in SSE4.1+. I could try and patch everything with SIMD Everywhere (thank you for the link). However, do you know if there are any significant performance penalties when simulating intrinsics that are not supported?

Best regards,
Robert
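
(For context on the cost question: emulating an SSE4.1 intrinsic such as _mm_max_epi32 on plain SSE2 typically adds only a few instructions. Below is a minimal, hypothetical C++ sketch of the general technique; it is not code taken from SIMDe or spoa, and the function name is made up.)

#include <emmintrin.h>  // SSE2 intrinsics only

// Hypothetical SSE2-only stand-in for the SSE4.1 intrinsic _mm_max_epi32:
// compare the signed 32-bit lanes, then blend the larger values together.
static inline __m128i max_epi32_sse2(__m128i a, __m128i b) {
  __m128i mask = _mm_cmpgt_epi32(a, b);            // 0xFFFFFFFF where a > b
  return _mm_or_si128(_mm_and_si128(mask, a),      // keep a where a > b
                      _mm_andnot_si128(mask, b));  // keep b elsewhere
}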

@mr-c

mr-c commented Apr 14, 2020

@rvaser: It won't be as good as AVX2/SSE4.1, but it means that spoa will be able to run on systems where it could not before. If you add a SIMD dispatcher routine and compile accordingly, then there will be no performance difference when running on systems that support AVX2/SSE4.1, as the native instructions will be used.

@rvaser
Owner

rvaser commented May 5, 2020

@mr-c, is this PR sufficient?

@mr-c

mr-c commented May 5, 2020

@rvaser Yes, that looks like a great SIMDe usage; thank you!

@mr-c

mr-c commented May 5, 2020

@rvaser I would note that there still isn't a function to switch between different levels of native SIMD support, so for Debian the AMD64 (x86_64) binary package of the spoa library will only use SSE2 native SIMD intrinsics.

@mbrcic
Contributor

mbrcic commented May 5, 2020

@mr-c, a parameterized compiler flag in cmake would do the trick (like -mno-sse4), or you could even supply it to cmake through the command line. SIMDe adjusts the used intrinsics accordingly.

@mr-c

mr-c commented May 5, 2020

@mbrcic Of course. But for Debian we can only ship a single shared libspoa library for x86-64, so we'd need a dispatcher to allow for full performance when the user has an AVX2- or SSE4.1-capable processor. We sometimes compile applications at multiple SIMD levels and add our own shell-script CPU-based dispatcher, but in this case all of the main processing is in libspoa, so there would be no advantage.

See https://gcc.gnu.org/wiki/FunctionMultiVersioning and https://lwn.net/Articles/691932/ for some gcc tips on the topic

@mbrcic
Contributor

mbrcic commented May 6, 2020

@mr-c, do you have any example of SIMDe working with function multiversioning? I am not sure that such a combo can work.

One way SIMD dispatching could work would be to make SimdAlignmentEngine a template and instantiate and compile it for different instruction sets, and then put a simple CPU dispatcher into createSimdAlignmentEngine that selects the appropriate SIMD engine. Maybe createSimdAlignmentEngine could take advantage of function multiversioning to make CPU dispatching easier.

However, @rvaser should see if the potential benefit justifies the breadth of changes to his code.

@rvaser
Owner

rvaser commented May 7, 2020

I am not sure about the scope of changes needed for your request @mr-c, and I do not have the time at the moment to explore this. Regarding the PR by @mbrcic, it is merged in the master branch.

@mr-c

mr-c commented May 7, 2020

For GCC, in theory, the following, added to one or more non-virtual SIMD-using functions, should be enough (though I haven't tested this):

#if defined(SIMDE_ARCH_AMD64)
__attribute__ ((target_clones("avx2","avx","sse4.2","sse4.1","ssse3","sse3","default"),flatten))
#elif defined(SIMDE_ARCH_X86)
__attribute__ ((target_clones("ssse3","sse3","sse2","sse","mmx","default"),flatten))
#endif
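
(For illustration, here is a hypothetical, self-contained example of where such an attribute would go: applied to a plain non-virtual function, GCC emits one clone per listed target plus an ifunc resolver that picks one at run time. This is not spoa code, and the function is made up.)

#include <cstddef>

__attribute__ ((target_clones("avx2","sse4.1","default")))
void add_rows(const int* a, const int* b, int* out, std::size_t n) {
  // The loop body is auto-vectorized differently in each emitted clone.
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}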

@rvaser
Owner

rvaser commented May 7, 2020

Could you maybe test that somewhere?

@mr-c

mr-c commented May 7, 2020

@rvaser Sure. Can you tell me which non-virtual function I should apply these attributes to?

@rvaser
Owner

rvaser commented May 7, 2020

The SIMD function is https://github.com/rvaser/spoa/blob/master/src/simd_alignment_engine.hpp#L32, but it is an overridden virtual function. Will that work?

It calls three functions which are non-virtual: https://github.com/rvaser/spoa/blob/master/src/simd_alignment_engine.hpp#L47-L57.

@mr-c

mr-c commented May 7, 2020

The SIMD function is https://github.com/rvaser/spoa/blob/master/src/simd_alignment_engine.hpp#L32, but it is an overridden virtual function.

/build/spoa-3.0.2/src/simd_alignment_engine.cpp:1834:1: sorry, unimplemented: virtual function multiversioning not supported

It calls three functions which are non-virtual: https://github.com/rvaser/spoa/blob/master/src/simd_alignment_engine.hpp#L47-L57.

If I put the attributes there, I get an internal compiler error. If I add the attributes to the functions themselves in simd_alignment_engine.cpp then it compiles, but elfx86exts shows the same profile as without the attributes. I confirmed the functionality of elfx86exts by adding -march=native, which did produce AVX2 and friends :-/

@mr-c

mr-c commented May 7, 2020

Tagging in @nemequ for his advice

@nemequ

nemequ commented May 7, 2020

SIMDe detects available ISA extensions using preprocessor macros, and those don't change with target_clones, so while the compiler may be able to autovectorize the portable implementations in SIMDe, it's not guaranteed. Basically, target_clones is fantastic for autovectorization, but not if you have explicit calls to SIMD intrinsics.

It sounds like you have the SIMD portions of your code pretty well isolated to the SimdAlignmentEngine class, so I think the best solution would be to make SimdAlignmentEngine an abstract class, then compile simd_alignment_engine.cpp with different flags (one with the defaults, one with -msse4.1, one with -mavx2, etc.) and just have slightly different symbols for each (e.g., SimdAlignmentEngineDefault, SimdAlignmentEngineSse4_1, SimdAlignmentEngineAvx2), all of which implement SimdAlignmentEngine. You would just need a bit of CMake logic coupled with some preprocessor macros.

Then you just choose the best one once at runtime; that's fairly straightforward, but the APIs tend to be architecture- and/or compiler-specific, so the question is how much you care about portability. With SIMDe you could easily have the code running on ARM (with NEON for the vectorized versions), POWER (AltiVec), WASM, etc.; you'd just need to put a bit of effort into the portability of the resolver. If all you care about is GCC/clang, use __builtin_cpu_supports. For ICC, _may_i_use_cpu_feature. For MSVC, __cpuid. If you want something highly portable, cpu_features is probably your best bet.
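
(A minimal sketch of such a resolver for GCC/clang only; the SimdLevel enum and function name are hypothetical, not spoa's actual API, and only the runtime check is shown.)

// Hypothetical resolver: detect the best supported ISA level once at startup.
// __builtin_cpu_supports is GCC/clang-specific, as noted above.
enum class SimdLevel { kScalar, kSse41, kAvx2 };

inline SimdLevel detect_simd_level() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  if (__builtin_cpu_supports("avx2"))   return SimdLevel::kAvx2;
  if (__builtin_cpu_supports("sse4.1")) return SimdLevel::kSse41;
#endif
  return SimdLevel::kScalar;  // portable SISD fallback
}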

If you want, I can probably throw something together for you pretty quickly. This could actually be a nice example to link to for SIMDe since it's pretty straightforward. I just need to know what compilers and architectures you want to support.

@mbrcic
Contributor

mbrcic commented May 8, 2020

I've implemented CPU dispatching using SIMDe in a personal branch.

I need to test it a bit more (it seems to work well on AVX2 and SSE4.1). I have used FeatureDetector for CPU dispatching since cpu_features has a bug that needs to be resolved.

After confirming success in testing, we'll see if @rvaser wants a PR.

@rvaser
Owner

rvaser commented May 8, 2020

@nemequ, no idea which compilers/architectures should be included. Would saying all make this a tedious attempt? :)

@mbrcic, thanks again for your continuous effort! Please let me know how the tests go.

@mr-c

mr-c commented May 8, 2020

@rvaser regarding architectures, my feelings on the subject from a bioinformatics / Debian perspective are at https://github.com/mr-c/misc/wiki/Debian-Packaging-Checklist#arch-feelings :-)

As for compilers, I personally make no efforts beyond gcc :-)

@mbrcic
Contributor

mbrcic commented May 8, 2020

@rvaser, I've tested the dispatching code on AVX2, SSE4.1 and SSE2 from the same binary. I've tested it with gcc and clang. And it works! :)

So, the code currently has only AVX2, SSE4.1, and SSE2 dispatching, but other architectures are easy to add:

  1. expand the Arch enumeration with the new architecture
  2. add a dispatching case in dispatcher.cpp
  3. add a new case at the beginning of simd_alignment_engine_dispatch.cpp
  4. add a case with the compiler flags to the CMake configuration

It amounts to a small amount of added code per new architecture; a hypothetical sketch of the pattern is shown below.
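
(The sketch below is illustrative only; the type and function names are made up and do not match the actual code in dispatcher.cpp or simd_alignment_engine_dispatch.cpp.)

#include <memory>

// Illustrative ISA levels the dispatcher distinguishes.
enum class Arch { kSSE2, kSSE4_1, kAVX2 };

struct Engine { virtual ~Engine() = default; virtual void align() = 0; };
struct EngineSse2  : Engine { void align() override { /* built with baseline flags */ } };
struct EngineSse41 : Engine { void align() override { /* built with -msse4.1 */ } };
struct EngineAvx2  : Engine { void align() override { /* built with -mavx2 */ } };

// Adding a new ISA level means one new enumerator, one new case here,
// and one new per-ISA translation unit with its compiler flags in CMake.
std::unique_ptr<Engine> createEngine(Arch arch) {
  switch (arch) {
    case Arch::kAVX2:   return std::make_unique<EngineAvx2>();
    case Arch::kSSE4_1: return std::make_unique<EngineSse41>();
    default:            return std::make_unique<EngineSse2>();
  }
}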

Also, the non-dispatching case works just as before.

Should I open a PR?

@nemequ

nemequ commented May 8, 2020

@nemequ, no idea which compilers/architectures should be included. Would saying all make this a tedious attempt? :)

SIMDe should work basically everywhere, so to just get a basic port running everywhere there isn't really anything to do; if the SIMD code is your only dependency on x86 you should be able to just compile on any given architecture and you're done.

The problem is getting accelerated builds working. Architectures have different ISA extensions, and compilers have different flags to enable them. ARM is particularly irksome since there are lots of possible combinations of SIMD, FPU, and architecture version you might want to specify.

Compilers also support different options. There are lots of examples like -maltivec (GCC, clang) vs. -qaltivec (XL C/C++).

And of course MSVC does something completely different; you can't even specify that you want, say, SSE4.1 support but not AVX. /arch:AVX will enable AVX, but there is no /arch:SSE4.1. They just assume everyone is checking CPUID at runtime and building their own system, so the best thing to do there is /arch:SSE2 -DSIMDE_X86_SSE4_1_NATIVE, which enables SSE4.1 in SIMDe: explicitly calling SSE4.1 functions (via SIMDe) will generate SSE4.1 instructions, but if you're not explicitly calling the relevant functions, the compiler will only auto-vectorize to SSE2.

I've implemented CPU dispatching using SIMDe in a personal branch.

Nice. Not sure about the choice of FeatureDetector, but the code looks solid. Definitely the right idea IMHO.

I have used FeatureDetector for CPU dispatching since cpu_features has a bug that needs to be resolved.

I'm not sure that bug is worth using FeatureDetector instead of cpu_features. The latter supports other architectures whereas FeatureDetector only supports different ISA extensions within x86. In fact, it doesn't even work on non-x86 (i.e., it won't compile).

The number of CPUs out there with SSE4.1 support but not AVX is pretty small compared to the number of CPUs with NEON support, for example. It also looks like there is a PR open to fix that issue, so if you don't want to wait until it's merged you can always use a fork with the PR until then.

I also have some code for feature detection which at least works on x86 and ARM, though not anywhere else, which you're welcome to steal.

That said, it should be pretty easy to replace FeatureDetector at some point in the future when cpu_features is fixed.

@mbrcic
Contributor

mbrcic commented May 8, 2020

@nemequ first of all, thanks for an awesome library! I might contribute at some point in the future :)

Actually, the first dispatcher I wrote was using cpu_features, but then in testing I encountered that bug so I substituted it with FeatureDetector, for now. It is very easy to replace it later, as soon as they merge.
Don't worry about ARM NEON; the CPU dispatcher path is bypassed for it. I have tested the non-dispatched version on ARM NEON and it works like a charm, with a big performance improvement thanks to SIMDe.

Since @mr-c asked for CPU dispatching to accommodate non-AVX machines, this bug in cpu_features is problematic.

@nemequ

nemequ commented May 8, 2020

@nemequ first of all, thanks for an awesome library! I might contribute at some point in the future :)

Thanks, I'm glad it's proving useful to you. You'd be most welcome; there are a lot of functions to get through. If you're ever bored it's a decent way to kill some time ;)

Actually, the first dispatcher I wrote was using cpu_features, but then in testing I encountered that bug so I substituted it with FeatureDetector, for now. It is very easy to replace it later, as soon as they merge.

Okay, works for me. It sounds like you have a good handle on the situation.

Don't worry about ARM NEON; the CPU dispatcher path is bypassed for it. I have tested the non-dispatched version on ARM NEON and it works like a charm, with a big performance improvement thanks to SIMDe.

That's good to know, thanks!

I don't know if it's a problem or not that you're not checking for NEON support at runtime. I'm not sure if the Debian baseline for ARM includes NEON or not; @mr-c would probably know?

Since @mr-c asked for CPU dispatching to accommodate non-AVX machines, this bug in cpu_features is problematic.

I think the main concern is that the drop from AVX to SSE2 (the x86_64 baseline) is significant, and Debian packages have to work on plain old SSE2. AVX to SSE4.1 isn't as big an issue since the number of machines with SSE4.1 but not AVX is pretty small (according to that bug report in cpu_features, but I know the numbers are pretty close on the Steam Hardware Survey, too). I suspect Debian would be happy with just choosing between baseline and AVX for x86_64. For x86, maybe baseline (no vector instructions at all), SSE2, and AVX? Again, @mr-c probably has a better feel for this.

@mr-c

mr-c commented May 9, 2020

I'm not sure if the Debian baseline for ARM includes NEON or not; @mr-c would probably know?

There are several "arm" architectures in Debian, but the 32-bit versions (armel and armhf) are on their way out.

Checking the Debian architecture baselines for arm64:

aarch64 (ARMv8-A) with no optional extensions. This includes VFPv4 and NEON, but ARMv8.1-A and up are not guaranteed.


For x86, maybe baseline (no vector instructions at all), SSE2, and AVX? Again, @mr-c probably has a better feel for this.

There is no x86 (32-bit) CPU with AVX; the highest is SSSE3 :-)

@mbrcic
Contributor

mbrcic commented May 11, 2020

@nemequ @mr-c, okay guys, I've replaced FeatureDetector with the patched version of cpu_features, tested it, and it works. I think this makes everyone happy :) The code currently deals with SSE2, SSE4.1, and AVX2, but with cpu_features in place it is now easily extendable to dispatching on any platform.

I've made a PR.

@rvaser
Owner

rvaser commented May 18, 2020

I am really sorry for the delay. The PR is merged. Thanks @mbrcic!

@bagashe
Contributor

bagashe commented May 26, 2020

@rvaser : Are you planning another release soon with these changes? Thanks.

@rvaser
Owner

rvaser commented May 28, 2020

@bagashe, I have made a couple of releases; the latest is v3.3.0. Sorry for the delay.

@mr-c

mr-c commented Aug 25, 2020

Thank you @rvaser ! I think this issue can be closed.

@rvaser rvaser closed this as completed Aug 25, 2020