maybe a problem with 0.3.12 and NumPy? #2970
No idea. If anything, thread memory requirements of 0.3.12 should be less than before, and I thought NumPy was satisfied with the state of OpenBLAS after application of the fmod workarounds. Are you bundling the actual 0.3.12 release or some snapshot from around that time? And the older OpenBLAS you went back to is what - 0.3.10 or even older?
@martin-frbg Our CI testing showed no problems and the wheel builds succeeded, so 0.3.12 looked good prior to release. EDIT: I also expect a fair number of people tried 1.19.3 because of the Python 3.9 support. If the problem was universal there would probably have been more bug reports.
Well, the only change in 0.3.11/12 that I would expect to have any effect at library/thread initialization time is the reduction of the BLAS3_MEM_ALLOC_THRESHOLD variable, and its previous default is still available in Makefile.rule (or as a parameter to
Unfortunately, it's 0.3.9
Original issues got closed without code change
@brada4 We switched back to the earlier library for 1.19.4.
Hm, thanks. 0.3.10 (or at least the changes that looked important enough to label with the milestone) was mostly CMake build improvements and a few thread race fixes, nothing that I would expect to blow up as soon as the library gets loaded.
The build tag (a32f1dca) on the OpenBLAS .so file is not related to any tags here. Where does it come from, and how do we trace back to the OpenBLAS source code used to build the library?
The descriptive string
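One possible way to inspect what a given binary reports about itself is OpenBLAS's openblas_get_config() API. A hypothetical sketch via ctypes follows; the library name below is an assumption and will differ for a wheel's bundled, hash-mangled copy under numpy/.libs:

```python
import ctypes

# Assumed SONAME for a system-wide install; adjust the path for a
# NumPy wheel, which bundles its own renamed copy of the library.
lib = ctypes.CDLL("libopenblas.so.0")
lib.openblas_get_config.restype = ctypes.c_char_p
print(lib.openblas_get_config().decode())
# Prints something like:
# "OpenBLAS 0.3.12 DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=24"
```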
According to the comment in the issue, there is information at https://drive.google.com/drive/folders/18mcYQu4GGPzwCRj14Pzy8LvfC9bcLid8?usp=sharing on the docker environment where the segfault occurs.
It would take ages to reconstruct the failing environment 1:1. The question is whether it is possible to clearly attribute the problem to "OpenBLAS threading" by confirming there is no problem whatsoever when threading is turned off via an env variable. Issue mentions
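For reference, a minimal sketch of that experiment: OpenBLAS honors the OPENBLAS_NUM_THREADS environment variable, which must be set before the library is loaded (i.e. before the first `import numpy`):

```python
import os
# Must happen before numpy (and thus OpenBLAS) is imported/loaded.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

a = np.random.rand(1024, 1024)
print((a @ a).sum())  # exercises dgemm single-threaded if the variable took effect
```

If the crash disappears with this in place, that would point squarely at the threading code paths.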
@brada4 I am not sure what you are asking. A static build of OpenBLAS is downloaded and built into the .so you see, along with some other things, as part of the NumPy build process. The reason it has a hash is to uniquely bind that .so to the NumPy build; it does not directly reflect a version of OpenBLAS. The exact version of OpenBLAS used in the build of the wheel is the code tagged with the 1.19.3 tag, and is here - you can see the tag is
@mattip thanks, that fills the gap in my understanding.
One thing that was changed in 0.3.12 is x86_64: clobber all xmm registers after vzeroupper. See also Omit clobbers from vzeroupper until final [PR92190]. And the Microsoft x64 calling convention says that:
Another bit can be found here: Software consequences of extending XMM to YMM (Agner Fog). However, this scenario is only relevant if the legacy function saves and restores the XMM register, and this happens only in 64-bit Windows. The ABI for 64-bit Windows specifies that registers XMM6 - XMM15 have callee-save status, i.e. these registers must be saved and restored if they are used. All other x86 operating systems (32-bit Windows, 32- and 64-bit Linux, BSD and Mac) have no XMM registers with callee-save status. So this discussion is relevant only to 64-bit Windows. There can be no problem in any other operating system because there are no legacy functions that save these registers anyway.
Both users who reported this issue are using Docker on Linux, right?
Callee-saves is implemented in the assembly PROLOGUE/EPILOGUE for Windows, so this particular change should play no role here
That then is probably the GEMM buffer, configurable as BUFFERSIZE at compile time (in Makefile.rule). The default was increased a few times recently to fix overflows at huge matrix sizes - there is certainly a trade-off, and possibly a fundamental design flaw in OpenBLAS, involved. 0.3.12 built with
If we shrink the BUFFERSIZE, would we run the risk of running out of room with 24 threads and large matrices? Should NumPy cap the number of threads with
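One runtime option for capping threads (a hedged sketch, not necessarily what NumPy ended up adopting) is the threadpoolctl package, which can limit the BLAS thread pool from Python without rebuilding anything:

```python
import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

a = np.random.rand(2048, 2048)
with threadpool_limits(limits=4, user_api="blas"):
    b = a @ a  # OpenBLAS restricted to at most 4 threads inside this block
```

Capping threads this way also bounds the total per-thread GEMM buffer memory, which is the quantity at issue here.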
Bumping from 0.3.10 to 0.3.12 in nixpkgs has now caused numpy to fail its tests. Stacktrace:
lscpu info
@jonringer this looks like something else, not the BUFFERSIZE thing. Wonder why the backtrace shows the OpenMP library and a liblapack but nothing that calls itself OpenBLAS - do you have libopenblas aliased to liblapack, or could this be a mixup of library builds? (So far I have not seen mention of numpy itself failing tests with 0.3.12, cf. charris' comment above #2970 (comment))
I am a little hesitant about overriding the OpenBLAS default BUFFERSIZE constant in NumPy's build process. OpenBLAS changed it for good reasons, and it could cause incompatibilities. Is refactoring this on the roadmap?
"prod with error" is not expected to work. The Ubuntu 16 kernel has no provision for AVX512 XSAVE.
@brada4, do you mean some individual SIMD instruction set extensions are disabled (using the XCR0 register) by the virtual OS within Docker, and this is the reason for the segfaults?
AVX512 extensions are visible in a KVM guest, but using them leads to numeric eurekas, which likely signify register corruption.
What are numeric eurekas? If I understand correctly, you are saying that anyone with a CPU that supports AVX512 extensions should be able to crash OpenBLAS by running OpenBLAS/NumPy tests in a KVM guest?
Seems quite unlikely to me - there was code using AVX512 extensions in 0.3.9 already, and the DYNAMIC_ARCH builds are supposed to do a runtime check for actual availability. I'd be much more interested in results from a build with the smaller 0.3.9 BUFFERSIZE, or a simple self-contained reproducer. If anything I would expect AVX512 mishandling to result in SIGILL or SIGFPE (or silent garbage results), not SIGSEGV
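As a quick sanity check of what the hypervisor actually exposes to the guest (a minimal, Linux-only sketch reading /proc/cpuinfo; not part of either test suite):

```python
# List the avx512* feature flags the (guest) kernel advertises.
with open("/proc/cpuinfo") as f:
    flags = next(line for line in f if line.startswith("flags")).split()
print(sorted(flag for flag in flags if flag.startswith("avx512")))
# An empty list would mean a DYNAMIC_ARCH OpenBLAS should never
# select AVX512 kernels at runtime on this machine.
```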
Sorry, I corrected that in the description. FWIW, the information provided indicates the host machine has 128GB and 31 cores. I can't replicate that locally, unfortunately. How much memory does the default BUFFERSIZE allocate per thread?
128MB; with 0.3.9 it was only 32MB (written as a bitshift in common_x86_64.h, it is "32<<22" vs "32<<20", hence the suggestion to specify "20" in #2970 (comment)).
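For reference, the arithmetic behind those two defaults, using the 24-thread figure from the question above (a small illustrative sketch):

```python
# Per-thread GEMM buffer defaults, written as bitshifts as in common_x86_64.h.
old = 32 << 20   # 0.3.9 default:  33_554_432 bytes  = 32 MB
new = 32 << 22   # 0.3.12 default: 134_217_728 bytes = 128 MB
print(old >> 20, new >> 20)                  # -> 32 128 (MB per thread)

# With 24 threads, the buffer reservations alone grow fourfold:
print(24 * old / 2**30, 24 * new / 2**30)    # -> 0.75 3.0 (GB total)
```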
Completely corrupt numeric results - basically all OpenBLAS tests corrupt, ssh drops out often, etc. For anyone running Ubuntu 16 on both sides of KVM and rigging the compiler to assume AVX-512 support: yes, it stops working.
This may be unrelated to the other discussion items, but I was successful in finding the offending commit. I also verified that reverting the commit allowed me to build numpy and run the numpy tests on a 3990X. Testing steps:

default.nix

```nix
let
  openblas_overlay = self: super: {
    # replace all references of openblas in nixpkgs with local version
    openblas = super.openblas.overrideAttrs (oldAttrs: {
      src = super.lib.cleanSource ./.;
    });
  };
in
# create new package set, referencing local version of openblas
import ../nixpkgs { overlays = [ openblas_overlay ]; }
```

bisect command: git bisect run nix-build default.nix -A python3Packages.numpy --no-build-output --cores 128

gist of run: https://gist.github.com/jonringer/3d9351b8f1e153f5a7275975880b4319

This also seems to align with my previous comment #2970 (comment), in which the seg fault would occur inside libgomp
libgomp segfault not reproduced on Ryzen5-4600H with current develop (with and without interface64 and symbol suffixing) and gcc 7.5. Rerunning the builds with gcc 9.3 now, but no segfault so far.
@jonringer can you try out the NumPy wheels from https://anaconda.org/scipy-wheels-nightly/numpy? We built the last round of wheels with
Thx @mattip - obviously my 4600 is a poor substitute for a Threadripper, but at least this does now appear to be some separate issue with OpenMP and forks after all. (Wonder how much memory that TR has; obviously a fork() would be a perfect place to run out of it)
@mattip according to #2982 (comment) his problem appears to be unrelated to BUFFERSIZE after all. Curious whether the change fixes "your" numpy problems though
@mattip I don't actually use this library personally or professionally. My main concern is that I'm no longer able to test numpy without a segfault while curating packages on nixpkgs.
Hm... only instance of "fork" in the numpy testsuite seems to be in a case that looks even more benign than the 1024x1024 dgemm
Sorry for the noise. I was grasping at straws; seems I chose the wrong one.
FWIW: Some Python stuff does a fork behind the scenes, such as querying the system name/type/arch, which spawns a process via fork while looking innocent at the usage site, like
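To hunt for such hidden forks, one option (a sketch assuming CPython 3.7+ on a POSIX system) is os.register_at_fork, which can log a stack trace whenever the interpreter is about to fork:

```python
import os
import traceback

def _report_fork():
    # Runs in the parent just before each fork(); shows who triggered it.
    print(f"fork() about to happen in pid {os.getpid()}")
    traceback.print_stack()

os.register_at_fork(before=_report_fork)  # CPython 3.7+, POSIX only
```

Registering this at the top of a test run would reveal innocent-looking calls that fork behind the scenes.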
If I read the comments on numpy/numpy#17759 correctly, reducing BUFFERSIZE apparently did not fix the
I think we can close this, or at least treat it as a use-case OpenBLAS will not support. It does not seem reasonable to me to open 36 threads on a machine with less than 1GB of memory available to support the threads. NumPy is now using
Just a heads up that NumPy is seeing some problems with OpenBLAS 0.3.12. We released NumPy 1.19.3 with OpenBLAS 0.3.12 in order to fix the Windows fmod problems. We had to back it out and are releasing a 1.19.4 with the previous OpenBLAS and code to error out if the fmod bug is detected instead. We got reports that 1.19.3 crashes, xref numpy/numpy#17674 and numpy/numpy#17684. In the first issue, the reporter provided output of some system diagnostics. The second issue is a lack of memory in hstack, which is difficult to attribute to OpenBLAS, so there may be something else going on. Both issues were "fixed" (or covered up) by the 1.19.4 release candidate with an older OpenBLAS.

Does any of this make sense to you?

Edit: no docker image, just diagnostics