Detect AVX2 support at runtime #67

Open
jakirkham opened this issue Feb 17, 2018 · 8 comments

Comments

@jakirkham
Member

Currently users have to decide at compile time whether to build a binary that uses AVX2 intrinsics. If they build with AVX2 intrinsics and deploy to a machine that lacks AVX2 support, the process crashes on an illegal instruction. Users can instead build without AVX2 intrinsics, which works regardless of whether the target hardware supports AVX2, but the compression algorithms here may then run slower than they would in an AVX2 build. Admittedly, avoiding a crash is much more important than avoiding degraded performance.

However, in the ideal case, we could build numcodecs both with and without AVX2 support, detect at runtime whether AVX2 instructions are available, and choose the appropriate code path without crashing in either case. This will take a bit of work to understand where AVX2 instructions are being introduced and how to avoid them, though some of that was already done in the first referenced issue below.

xref: zarr-developers/zarr-python#136
xref: #24
xref: #26
xref: #27

@alimanfoo
Member

The only use of AVX2 intrinsics AFAIK is within c-blosc. @FrancescAlted could you confirm that c-blosc does not perform runtime dispatching based on hardware capabilities? If so, is this feasible?

@jakirkham
Member Author

When we last investigated zarr-developers/zarr-python#136 (admittedly about a year ago), we narrowed it down to an AVX2 instruction, vinserti128, popping up in __pyx_pw_4zarr_5blosc_19compress, which was used by all compression code paths (except Zlib). @FrancescAlted had previously looked and found that there was no vinserti128 in Blosc. This means it had to have been in the Zarr Cython-generated C code. We decided the solution was to allow one to disable AVX2 instructions at compile time. This works, but comes with the caveat that we cannot use AVX2 instructions at run time should they be available.

Now I have not investigated the analogous case since the Zarr/Numcodecs split, but I suspect the issue still exists. I can try to generate a new reproducer using newer versions of Zarr and Numcodecs, which should help us understand where this problem occurs now. Looking back at the C code now, I would suspect this line to have caused the issue. Fixing this sort of issue may require some trickery on the building end of things.
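For anyone trying to reproduce the vinserti128 finding on a newer build, here is a minimal sketch of one way to locate which symbols contain a given instruction. It assumes you already have the text output of `objdump -d` for the extension module; the helper name `functions_using` is hypothetical, and this is not necessarily how the original investigation was done.

```python
import re

# Symbol header lines in `objdump -d` output look like:
#   0000000000001130 <some_symbol>:
_SYMBOL_RE = re.compile(r"^[0-9a-f]+ <(.+)>:\s*$")


def functions_using(disassembly, mnemonics):
    """Map each symbol to the sorted list of the given mnemonics it contains."""
    wanted = set(mnemonics)
    hits = {}
    current = None
    for line in disassembly.splitlines():
        header = _SYMBOL_RE.match(line)
        if header:
            current = header.group(1)
            continue
        # Instruction lines are tab-separated: address, raw bytes, instruction.
        fields = line.split("\t")
        if current is None or len(fields) < 3:
            continue
        words = fields[2].split()
        if words and words[0] in wanted:
            hits.setdefault(current, set()).add(words[0])
    return {name: sorted(found) for name, found in hits.items()}
```

The disassembly text could come from something like `subprocess.run(["objdump", "-d", path], capture_output=True, text=True).stdout`, where `path` points at the compiled `.so` in question.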

@alimanfoo
Member

alimanfoo commented Feb 19, 2018 via email

@FrancescAlted

I confirm that C-Blosc does perform runtime dispatching based on hardware capabilities. To make it easier to assess which acceleration paths are available to Blosc, I have just implemented the possibility of printing the CPU capabilities that will be used, via the BLOSC_PRINT_SHUFFLE_ACCEL environment variable. And yes, it should be possible to activate the AVX2 path only on processors that have this capability.
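As a rough sketch of how the BLOSC_PRINT_SHUFFLE_ACCEL variable could be used from Python, the helper below runs a snippet in a fresh interpreter with the variable set and returns whatever was printed. The helper name `run_with_shuffle_report` is hypothetical, and the value `"1"` is an assumption (the comment above only names the variable, not what it should be set to).

```python
import os
import subprocess
import sys


def run_with_shuffle_report(code):
    """Run `code` in a fresh interpreter with BLOSC_PRINT_SHUFFLE_ACCEL set."""
    env = dict(os.environ)
    # Assumption: any value enables the report; "1" is an arbitrary choice.
    env["BLOSC_PRINT_SHUFFLE_ACCEL"] = "1"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Passing a snippet that exercises the Blosc codec, for example `from numcodecs import Blosc; Blosc(shuffle=Blosc.SHUFFLE).encode(bytes(1024))`, should show the report if the bundled c-blosc is recent enough to include this feature. A fresh process is used so that the variable is in place before the library is loaded.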

@alimanfoo
Member

alimanfoo commented Feb 23, 2018 via email

@FrancescAlted

Yeah, I am not an expert either. My hunch is that the -mavx2 flag should not be enforced in the compiler invocation when building the library; instead, the cmake machinery should decide whether the compiler supports AVX2 so that it can generate the corresponding code paths in the binaries. That possibly means that this is going to be difficult to achieve in environments that are not using cmake, but I may be wrong here.

At any rate, I am pinging the guy who did most of the SSE2/AVX2 runtime detection in Blosc some years ago. @juliantaylor any hints on this would be highly appreciated. Thanks in advance!

@juliantaylor

juliantaylor commented Feb 23, 2018

Correct, -mavx2 allows the compiler to place AVX2 code wherever it likes.

This piece of code looks like it compiles in AVX2 unconditionally, though I am not familiar with this Cython feature; it might just be an annotation that is not used during compilation:
https://github.com/zarr-developers/numcodecs/blob/master/numcodecs/vlen.c#L9

If the code that profits from AVX2 is inside non-public Cython code called from Python, it should be pretty easy to compile it twice and wrap the appropriate call in Python, depending on the runtime environment.
Code to determine CPU features at runtime can be found in e.g. Blosc, or you can use compiler features (like __builtin_cpu_supports in gcc and newer clang versions).
This also assumes that Cython cannot yet do automatic CPU-specific function cloning and dispatching by itself, as gcc (or icc) can.
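A minimal sketch of the "compile twice, dispatch in Python" idea follows. It assumes Linux, where CPU feature flags can be read from /proc/cpuinfo; the function names and the two backend objects are hypothetical stand-ins for an AVX2 build and a generic build of the same extension, not existing numcodecs APIs.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def cpu_has_avx2(cpuinfo_path="/proc/cpuinfo"):
    """Best-effort check; returns False when the flags cannot be read."""
    try:
        with open(cpuinfo_path) as f:
            return "avx2" in parse_cpu_flags(f.read())
    except OSError:
        return False


def pick_backend(avx2_backend, generic_backend, have_avx2=None):
    """Return the AVX2 build only when the CPU reports AVX2 support."""
    if have_avx2 is None:
        have_avx2 = cpu_has_avx2()
    return avx2_backend if have_avx2 else generic_backend
```

Falling back to the generic build whenever detection fails keeps the failure mode at "slower" rather than "crash". On other platforms the detection step would need a different source of CPU information, such as the feature detection code in Blosc mentioned above.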

@detrout

detrout commented May 16, 2019

I'm experiencing this issue on some of my machines.

The kernel thinks the illegal instruction is in the blosc library, which seems to be provided by numcodecs:

[7486855.845681] traps: python3[201485] trap invalid opcode ip:7f24c9a68b46 sp:7fffd0a84760 error:0
[7486855.845688]  in blosc.cpython-37m-x86_64-linux-gnu.so[7f24c9a64000+a8000]
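To narrow this down further, the instruction pointer can be turned into an offset inside the shared object by subtracting the mapping base printed in the second line. The sketch below does that for a kernel log in the format shown above; the helper name `faulting_offset` is hypothetical, and it assumes the reported mapping starts at file offset 0 of the `.so`.

```python
import re

# Matches the faulting instruction pointer, e.g. "ip:7f24c9a68b46".
TRAP_RE = re.compile(r"trap invalid opcode ip:([0-9a-f]+)")
# Matches "in <module>[<base>+<size>]".
MAP_RE = re.compile(r"in (\S+)\[([0-9a-f]+)\+([0-9a-f]+)\]")


def faulting_offset(log_text):
    """Return (module name, offset of the faulting ip within its mapping)."""
    trap = TRAP_RE.search(log_text)
    mapping = MAP_RE.search(log_text)
    if trap is None or mapping is None:
        raise ValueError("no trap/mapping pair found in log text")
    ip = int(trap.group(1), 16)
    base = int(mapping.group(2), 16)
    size = int(mapping.group(3), 16)
    offset = ip - base
    if not 0 <= offset < size:
        raise ValueError("instruction pointer lies outside the mapping")
    return mapping.group(1), offset
```

For the log above this yields offset 0x4b46 in blosc.cpython-37m-x86_64-linux-gnu.so. Disassembling the module around that offset (for example with `objdump -d`) should show which instruction triggered the trap.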
