Illegal instruction on Opteron (detects SSE3 when no sse3 present) #2794
Probably not with detection as such - at the very least it is consistent between compile-time (cpuid_x86.c) and runtime (driver/others/dynamic.c), both using ecx&1 as the flag for SSE3 capability. More likely an SSE3 instruction crept into an OPTERON-only BLAS kernel, or the same SSE3-using kernel is configured for both OPTERON and OPTERON-SSE3. Does |
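As a minimal sketch of the `ecx&1` check mentioned above: in CPUID leaf 1, bit 0 of ECX is the SSE3 flag, and bits 9, 19 and 20 are SSSE3, SSE4.1 and SSE4.2. The helper names below are made up for illustration and are not taken from cpuid_x86.c or dynamic.c; the register value is passed in as an integer so the decoding can be tried without executing `cpuid`.

```python
# Feature bits of CPUID leaf 1, register ECX (bit positions per the x86 CPUID layout).
LEAF1_ECX_BITS = {"sse3": 0, "ssse3": 9, "sse4_1": 19, "sse4_2": 20}


def has_sse3(ecx):
    """Same test as the `ecx & 1` check described in the comment above."""
    return bool(ecx & 1)


def decode_leaf1_ecx(ecx):
    """Return the names of the features (from the small table above) set in ECX."""
    return {name for name, bit in LEAF1_ECX_BITS.items() if (ecx >> bit) & 1}


# Hypothetical ECX value with only SSE3 and SSE4.1 set.
example_ecx = (1 << 0) | (1 << 19)
print(sorted(decode_leaf1_ecx(example_ecx)))
```

A first-generation Opteron would be expected to report bit 0 as clear, so `has_sse3` returning true inside a VM would point at the emulated CPUID rather than at the BLAS kernels.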
|
Hmm. WRT detection, it would not be the first time a VM environment does not emulate all details of cpuid correctly. |
|
Ok, this is weird... I could have sworn it did, but it's the reverse...?
So you're right, it's correctly interpreting the system as Opteron, but running SSE3 code regardless... Why the reverse still gives a SIGILL, I don't know. Opteron isn't the only CPU family having SIGILL issues since we switched from MKL to OpenBLAS; Phenom and Core 2 Duo systems also fail, but I don't have the time or hardware to get backtraces on those. There's a thread over in our forums with the full list, which includes a DB dump of all the CPUs I've seen errors on: https://www.mlcathome.org/mlcathome/forum_thread.php?id=63 . |
|
The systems listed on the forum from our internal DB were running a client linked against the libopenblas.so shipped with Ubuntu 14.04 (so very old), so perhaps those issues are fixed in a newer release. What you see above with the Opteron testing is a new client I'm testing, which compiles the latest libopenblas (from git) statically into the binary in the hope of fixing these issues. This Opteron issue is the only one where I can test and verify that it's still a problem. |
|
Thanks. It seems the logic (if one can even call it that) which maps the environment variable to the list of supported core types simply has the two Opteron models reversed. This does not explain the SIGILL though, as sgemm_oncopy maps to the same code in both. An unmet alignment requirement seems more likely than an actual unsupported instruction. I will set up a qemu VM as I do not have the actual hardware. (Same goes for Phenom; I may be able to revive a Core2Duo later. Chances are that the OpenBLAS code for these is unchanged from the original GotoBLAS of ten years ago, while compilers have advanced to expose formerly harmless mistakes.) |
|
An alignment issue would make sense. I'm building with gcc9 and gfortran9 . Good luck and let me know if I can help. The code is open source https://gitlab.com/clemej/mlds , the embedded openblas build is in the Thanks again for looking into it. |
|
CPUID is from 2nd generation It has 3dnow, so another kind of sigill will (maybe) hit. It is a bug in libvirt that
|
|
/usr/share/libvirt/cpu_map/x86_Opteron_G1.xml and to be pedantic (and not start on later Opterons) |
|
@brada4 not sure if any of this is actually relevant when (a) it does not look like the SIGILL involves SSE3 instructions at all and (b) as I understand it the problem was/is observed on actual hardware. |
|
Just an observation that the emulator reports wrong bits and may not reflect the real hardware, though Opteron crashes have been reported on both sides. |
|
Actually toplevel CPUID never detects OPTERON_SSE3 |
|
The only non-portable instruction in that code is an mmx or 3dnow "emms/femms" instruction towards the end of sgemm_ncopy_4_opteron.S - Ubuntu happens to have a stale ticket with something similar https://answers.launchpad.net/ubuntu/+question/284671 where disassembling the crash address in gdb pointed at just that statement. |
|
I'm posting the binary and instructions on how to run it, if that would help. It's a multi-step process since it's an appimage bundle, but in case it's helpful:
To debug under gdb, run:
NOTE: This binary includes libopenblas 0.2.18 as a shared library, not the latest git version compiled in statically. Since you said the code hadn't changed, I hope that's not an issue. The library is in the same lib runtime directory. I'm now experiencing another issue where the build sometimes(?) fails with a segv when compiling openblas master as part of my full build, but I haven't determined whether that's pilot error or an actual openblas issue (binutils in Ubuntu 16.04 might be too old). |
|
Depends - is it a segv in the compiler, or in one of the tests that are run at the end of the build ? (also this might just be semantics, but "master" branch is outdated, you'll want "develop"). |
|
The segv is in the tests I'm pretty sure, but I'm too distracted to run that down at the moment. I'll open another issue if that turns out to be a problem. And yes, I meant (and have been using) the default |
|
zen 2300U with Opteron_G1 VM - no crash to get backtrace from, and detects prescott, EDIT no crash with OPTERON_SSE3 forced either |
|
@clemej - can you run mlds inside gdb i.e |
|
Might take a day or so to get back to you on this. day job needs attention at the moment. |
|
No rush, if there is a bug, it is 15+ years old. |
|
@clemej if/when you get back to this and you are in gdb, could you please run |
|
Hi, sorry for the delay. Yes, the instruction points directly to the This leads to an interesting (if frustrating) question:
Note that I pushed out an update to the client that sets I'm still swamped with my dayjob, so I'll be sporadic in responding over the next few days, but thank you again for looking into it. |
|
I've dug a little deeper and we might be chasing a bit of a red herring here. Here's a list of the CPUs from my database that have shown SIGILL errors; note that the Opterons I see are not G1-based. Also, obviously not all of them are caused by this issue. Perhaps these systems are too new for 3DNow!? I apologize for not looking this deep sooner... I just assumed the SIGILL I was seeing was the same. Now I suspect the one I see is a matter of using KVM to emulate an old Opteron on a system without 3DNow!. |
|
Yes, it is frustrating. I cannot answer that question as I do not have old Opteron hardware. Ideally "we" would coerce one of your users into running your software under gdb to see if they get the same location in their traceback. Or you could supply them a modified build with the |
|
Agreed. Without a stack trace on real hardware it's likely impossible. I own a Core2 system I can try on (thanks, ThinkPad T400!), and I may have access to one of the newer Opterons to get a stack trace and experiment a little on real hardware. Give me another day or so. |
|
The Opteron 2431 in your list looks new enough to expect it to handle all of MMX, 3DNOW and SSE3, in fact it has its separate ISTANBUL target in OpenBLAS. The gcc compile farm has an Opteron 2212 system that would probably correspond to OpenBLAS' OPTERON_SSE3, I can see if I can repeat the SIGILL there (hope their OS is new enough to have glibc >=2.14 as required by your appimage). |
|
Also related: https://community.amd.com/thread/159993 |
|
Not that it matters for this bug, but I can't reproduce an issue on my penryn-based Core2 system. However, I note that all the core2 systems in that list are Merom, one (tiny) generation behind. |
|
FEMMS is a 3DNow! instruction. It works on an authentic old Opteron, but not on the emulator. So what instruction fails on that old Opteron? |
|
@clemej interesting find but 3DNOW is probed via cpuid instructions in a dedicated build, and only assumed to be available in Athlon and Opteron cpus in DYNAMIC_ARCH builds. |
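To make the difference between probing and assuming concrete, here is a rough sketch. 3DNow! is reported in bit 31 of EDX for the extended CPUID leaf 0x80000001. The set of core names that "assume" 3DNow! is purely illustrative and not copied from the OpenBLAS sources, and the function names are hypothetical.

```python
# Bit 31 of EDX in extended CPUID leaf 0x80000001 reports 3DNow! support.
THREEDNOW_BIT = 31

# Illustrative only: core types for which a runtime-dispatch build might
# assume 3DNow! without probing. Not taken from the OpenBLAS sources.
ASSUMED_3DNOW_CORES = {"ATHLON", "OPTERON", "OPTERON_SSE3"}


def has_3dnow(ext_edx):
    """Probe: decode the 3DNow! flag from the extended-leaf EDX value."""
    return bool((ext_edx >> THREEDNOW_BIT) & 1)


def assumption_mismatch(core_name, ext_edx):
    """True when 3DNow! would be assumed for this core but the CPU does not report it."""
    return core_name in ASSUMED_3DNOW_CORES and not has_3dnow(ext_edx)


# A guest that presents itself as an Opteron but clears the 3DNow! bit:
print(assumption_mismatch("OPTERON", 0x0))
```

A mismatch of this kind is what an emulated Opteron on a host without 3DNow! would be expected to produce, which is consistent with `femms` faulting in the VM but not on authentic hardware.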
|
Unfortunately one of the two Opteron-2212 hosts in the gcc compile farm runs ancient debian5 (glibc 2.7, no problems seen there with BLAS-Tester though), |
|
Update - user error w.r.t. my gcc login on the other Opteron. However |
|
you mean |
|
Right, just caught this mistake.. However trying to run the mlds binary from gdb after successful extraction, with dataset.hdf5 placed in squash-fs/usr/lib, again results in a SIGILL in the same location. |
|
Alright, then this might not actually be a libopenblas bug. If you would be so kind as to post the disassembler output and then close this as not-a-bug, I would appreciate it. And my sincerest apologies for leading you on a wild goose chase. |
|
Full backtrace (same when single-threaded, i.e. OMP_NUM_THREADS=1, and with OPENBLAS_CORETYPE=GENERIC) gdb disassemble /r at this point gives no idea what is going wrong, but I do notice that you ship a lot of libraries but rely on the system libpthread.so.0 |
|
That's an SSE4.1 instruction. It is being emitted by my compiler, not openblas. :( Thank you for the help, I'll take it from here. |
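For anyone wanting to repeat this kind of check, a small sketch that scans disassembly text for SSE4.1 mnemonics follows. The specific instruction found in this thread is not preserved above, so the sample line is invented. The mnemonic set is a partial list of SSE4.1 instructions, and the parser assumes tab-separated address, bytes and instruction columns as printed by `objdump -d`.

```python
# Partial, non-exhaustive set of SSE4.1 mnemonics.
SSE41_MNEMONICS = {
    "ptest", "pblendw", "blendps", "blendpd", "blendvps", "blendvpd",
    "pblendvb", "dpps", "dppd", "insertps", "extractps", "pinsrd",
    "pextrd", "pmulld", "pmuldq", "roundps", "roundpd", "roundss",
    "roundsd", "pminsd", "pmaxsd", "pminud", "pmaxud", "packusdw",
    "pcmpeqq", "movntdqa", "mpsadbw", "phminposuw",
}
# SSE4.1 sign/zero-extension instructions share these prefixes.
SSE41_PREFIXES = ("pmovsx", "pmovzx")


def find_sse41(disassembly):
    """Return (address, mnemonic) pairs for lines that use an SSE4.1 instruction."""
    hits = []
    for line in disassembly.splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or not fields[-1].split():
            continue  # not an instruction line
        address = fields[0].strip().rstrip(":")
        mnemonic = fields[-1].split()[0]
        if mnemonic in SSE41_MNEMONICS or mnemonic.startswith(SSE41_PREFIXES):
            hits.append((address, mnemonic))
    return hits


# Invented sample in objdump layout: one SSE4.1 instruction, one plain SSE one.
sample = (
    "  401000:\t66 0f 38 17 c1 \tptest  %xmm1,%xmm0\n"
    "  401005:\t0f 58 c1 \taddps  %xmm1,%xmm0\n"
)
print(find_sse41(sample))
```

AVX encodings such as `vptest` are deliberately not matched, since they require AVX rather than SSE4.1.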
|
Ugh. gcc assuming |
|
It's from the Intel DNNL library, which is now part of oneAPI, and which is embedded in the pytorch build by default. I'm gonna assume they just don't bother to a) check for and b) support CPUs that don't have SSE4.1 or higher... which was introduced with... drumroll... Penryn Core 2 Duos! That would explain all the Merom and P4 failures in the above list, and probably most of the K10 AMD failures too. |
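A client could guard against this up front by checking the CPU flags before loading the library. The sketch below parses text in the Linux /proc/cpuinfo format, where SSE3 appears as `pni` and SSE4.1 as `sse4_1`. The function names and the sample text are made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(flags, required=("sse4_1",)):
    """List the required flags that the CPU does not report."""
    return [name for name in required if name not in flags]


# Invented sample resembling a pre-Penryn CPU: SSE3 (pni) and SSSE3, but no SSE4.1.
sample_cpuinfo = (
    "processor\t: 0\n"
    "model name\t: Example CPU\n"
    "flags\t\t: fpu mmx sse sse2 pni ssse3\n"
)
print(missing_features(cpu_flags(sample_cpuinfo), ("pni", "sse4_1")))
```

On a real Linux system the text would come from reading /proc/cpuinfo; a non-empty result is a reason to refuse to start (or fall back to a different build) instead of dying with SIGILL later.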
|
Ah ok. Assuming oneDNN is basically the same thing, the build options section of its documentation mentions a build option DNNL_ARCH_OPT_FLAGS which indeed defaults to requiring sse4.1 |
clemej commented Aug 25, 2020
I run MLC@Home, and recently recompiled the client to use OpenBLAS instead of MKL. However, volunteers running on older Opterons reported crashes with SIGILL. This is easy to reproduce: simply launch a new VM (virt-manager/kvm includes a CPU profile for an Opteron 240 (gen 1)). Here's the lscpu output for this VM:
Then run my client, with libopenblas compiled into pytorch, under GDB. Here's the output:
Obviously, there's no SSE3 in this generation of Opteron. Setting OPENBLAS_CORETYPE=GENERIC allows it to run fine. However, setting OPENBLAS_CORETYPE=OPTERON still crashes in the same OPTERON_SSE3 function. So something is very messed up with Opteron detection.
OpenBLAS master is compiled with BINARY=64 TARGET=GENERIC USE_THREAD=1 USE_OPENMP=1 DYNAMIC_ARCH=1 DYNAMIC_OLDER=1 MAX_THREADS=64 NO_AFFFINITY=1 NO_WARMUP=1 NO_SHARED=1.
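The core-type experiment described above can be scripted. This is a sketch only: the command to run is whatever client binary is under test, and the helper names are hypothetical. It relies on the `OPENBLAS_CORETYPE` environment variable mentioned in the report and on the fact that, on POSIX, `subprocess` reports death by signal as a negative return code.

```python
import os
import signal
import subprocess


def env_with_coretype(coretype, base_env=None):
    """Copy the environment and force a core type for a DYNAMIC_ARCH build."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENBLAS_CORETYPE"] = coretype
    return env


def died_of_sigill(returncode):
    """On POSIX, a child killed by a signal has returncode == -signal_number."""
    return returncode == -signal.SIGILL


def try_coretypes(cmd, coretypes):
    """Run cmd once per core type and collect the return codes."""
    results = {}
    for coretype in coretypes:
        proc = subprocess.run(cmd, env=env_with_coretype(coretype))
        results[coretype] = proc.returncode
    return results
```

For example, `try_coretypes(["./client"], ["GENERIC", "OPTERON", "OPTERON_SSE3"])` (with `./client` standing in for the real binary) would show at a glance which settings end in SIGILL. As the rest of the thread shows, a crash that persists under GENERIC suggests the faulting instruction comes from somewhere other than OpenBLAS.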