Segfaults in the conda-forge numpy and scipy traced to 0.3.10 #2728
I've spent the last few days trying to find the source of this segfault as well, and I can also confirm that it is only present in 0.3.10. During my debugging process (mostly done on an HPC system with a job scheduler, where I can request one or more CPUs per job), I noticed that the segfault only occurs when I have multiple CPUs available to me; on a single CPU the issue does not occur. Moreover, setting the environment variable that limits OpenBLAS to a single thread avoids the crash. Here's the very simple script I ended up arriving at for debugging:
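(The script itself was not preserved in this thread; below is a minimal sketch of the kind of reproducer described. The `OPENBLAS_NUM_THREADS` variable and the matrix shape are assumptions, not the original values.)

```python
# Minimal sketch of the reproducer: a single large double-precision matmul.
# OPENBLAS_NUM_THREADS and the matrix shape are assumptions; the original
# comment's specific shape was not preserved.
import os

# Must be set before numpy loads OpenBLAS; remove this line to trigger
# the crash on a multi-core machine with an affected 0.3.10 build.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

a = np.random.rand(20000, 100)
b = np.random.rand(100, 20000)
print((a @ b).sum())
```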
This should not cause the segfault even on multi-core systems, while removing the environment variable and running on a system with multiple cores available triggers the issue. Finally, I arrived at a matrix shape that reliably reproduces it by narrowing down the failing cases. Hopefully this helps someone!
My current hypothesis is that both multiple cores and a DYNAMIC_ARCH build are required for this crash. One possibility is that TARGET=PRESCOTT for the common code presets the BUFFERSIZE for the distributed GEMM to a smaller value than what is actually expected by the HASWELL kernels. At least the crash now seems vaguely familiar from earlier "big matrix - small blocksize" situations, and the default BUFFERSIZE for the various x86_64 cpus did not diverge until fairly recently.
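To make that hypothesis concrete, here is a small illustrative sketch; the blocking numbers below are made up for illustration, not OpenBLAS's actual per-target parameters. The idea is that if the shared GEMM driver sizes its work buffer from the common-code (PRESCOTT) parameter set while the dispatched copy kernel writes blocks sized by the HASWELL parameter set, the kernel writes past the end of the allocation.

```python
# Hypothetical blocking parameters for illustration only; the real values
# are defined per target in OpenBLAS and differ from these.
PRESCOTT_P, PRESCOTT_Q = 256, 256  # common-code (TARGET=PRESCOTT) blocking
HASWELL_P, HASWELL_Q = 384, 384    # blocking the HASWELL kernels expect

buffer_elems = PRESCOTT_P * PRESCOTT_Q  # what the shared driver allocates
needed_elems = HASWELL_P * HASWELL_Q    # what the dispatched kernel writes

# If the kernel's block does not fit in the driver's buffer, the copy routine
# scribbles past the allocation -- the classic "big matrix, small blocksize" crash.
print(f"buffer={buffer_elems}, needed={needed_elems}, "
      f"overflow={needed_elems > buffer_elems}")
```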
Mhhh, indeed DYNAMIC_ARCH=1 TARGET=HASWELL no longer wants to fail on Haswell; the plot thickens.
Thanks, @martin-frbg, for the fix. A user confirmed that your PR fixed the issue.
Origin: upstream, OpenMathLib/OpenBLAS@6c33764
Bug: OpenMathLib/OpenBLAS#2728
Bug-Debian: https://bugs.debian.org/966175
Last-Update: 2020-07-29
Gbp-Pq: Name fix-dynamic-arch-gemm-crashes.patch
Recently a segfault in the scipy testsuite was reported to occur in conda packages for x86_64 based on 0.3.10 (conda-forge/scipy-feedstock#130). This was initially suspected to be related to PR #2516, as the only major change affecting AVX2 code in that release. However, more recent events (conda-forge/openblas-feedstock#101, with the simple reproducer in #2516 (comment))
make it more likely to be either a more general memory management problem or a fault in the Haswell DGEMM kernel that appears to occur in DYNAMIC_ARCH builds only. In my test (currently with Python 3.6.4), valgrind reports an attempt by the Python parser code to free an unallocated memory region shortly before the segfault occurs, ostensibly in line 255 of the DGEMM_ONCOPY kernel. Replacing the Haswell/SkylakeX dgemm_ncopy_8_skylakex.c with its generic gemm_ncopy_8.c counterpart has no effect.
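One way to probe the DYNAMIC_ARCH angle is to force the core type at runtime and see whether the crash follows the Haswell kernels. A sketch of that bisection is below: OPENBLAS_CORETYPE is honored by DYNAMIC_ARCH builds of OpenBLAS, but `reproducer.py` is a placeholder for whatever matmul script is used.

```python
# Sketch: run the same reproducer under different DYNAMIC_ARCH core types.
# "reproducer.py" stands in for the numpy matmul script above.
import os
import subprocess

for core in ("PRESCOTT", "NEHALEM", "HASWELL"):
    env = dict(os.environ, OPENBLAS_CORETYPE=core)
    result = subprocess.run(["python", "reproducer.py"], env=env)
    # On POSIX, a negative return code means the child died from a signal
    # (e.g. -11 for SIGSEGV).
    status = "crashed" if result.returncode < 0 else "ok"
    print(f"{core}: {status}")
```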