
Segfaults in the conda-forge numpy and scipy traced to 0.3.10 #2728

Closed
martin-frbg opened this issue Jul 21, 2020 · 4 comments · Fixed by #2729

Comments

@martin-frbg
Collaborator

Recently a segfault in the scipy testsuite was reported to occur in conda packages for x86_64 based on 0.3.10 (conda-forge/scipy-feedstock#130). This was initially suspected to be related to PR #2516, as that was the only major change affecting AVX2 code in that release. However, more recent events (conda-forge/openblas-feedstock#101) with the simple reproducer from #2516 (comment)

    $ python
    Python 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50)
    [GCC 7.5.0] on linux
    >>> import numpy as np
    >>> t = np.array([[1, 0.5], [-1, 0.5]])
    >>> matrix_a = np.dot([(1,2) for i in range(1000000)], t)

make it more likely that this is either a more general memory management problem or a fault in the Haswell DGEMM kernel that appears to occur in DYNAMIC_ARCH builds only. In my test (currently with Python 3.6.4), valgrind reports an attempt by the Python parser code to free an unallocated memory region shortly before the segfault, which ostensibly occurs in line 255 of the DGEMM_ONCOPY kernel. Replacing the Haswell/SkylakeX dgemm_ncopy_8_skylakex.c with its generic gemm_ncopy_8.c counterpart has no effect.
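For anyone trying to reproduce this, a minimal diagnostic sketch (assuming the third-party threadpoolctl package is installed; it is not part of the reproducer above) can show which core a DYNAMIC_ARCH build actually selected at runtime and how many threads it will use:

    # Diagnostic sketch: report the runtime-selected OpenBLAS core and thread count.
    # Assumes the third-party threadpoolctl package is available; it is only used
    # here for illustration and is not part of the original reproducer.
    import numpy as np  # importing numpy loads the OpenBLAS it was built against
    from threadpoolctl import threadpool_info

    for entry in threadpool_info():
        if entry.get("internal_api") == "openblas":
            print("library :", entry.get("filepath"))
            print("version :", entry.get("version"))
            print("core    :", entry.get("architecture"))  # runtime-selected kernel set
            print("threads :", entry.get("num_threads"))

On an affected 0.3.10 DYNAMIC_ARCH build this would be expected to report a Haswell core and more than one thread.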

@proever

proever commented Jul 22, 2020

I've spent the last few days trying to find the source of this segfault as well, and I can also confirm that it is only present in 0.3.10.

During my debugging process (mostly done on an HPC system with a job scheduler, where I can request one or more CPUs per job), I noticed that the segfault only occurs when I have multiple CPUs available to me. On a single CPU the issue does not occur. Moreover, setting the OPENBLAS_NUM_THREADS environment variable to 1, either in the shell or using os.environ in the Python script, fixes the issue even when multiple CPUs are available.

Here's the very simple script I ended up arriving at for debugging:

    import os

    os.environ['OPENBLAS_NUM_THREADS'] = '1'

    import numpy as np

    a = np.random.randn(100000, 3)
    b = np.random.randn(3, 3)

    np.matmul(a, b)

As written, this should not cause the segfault, even on multi-core systems; removing the line that sets the environment variable and running on a system with multiple cores available triggers the issue.

Finally, I arrived at the shape (100000, 3) for a somewhat arbitrarily, and I'm sure this number is not particularly meaningful, but I did notice that reducing the size of a fixes the issue as well: for example, when I change a's dimensions to only (50000, 3), the issue goes away. Of course, this is probably related to how much memory was available to start with. My (uninformed) guess would be that if the matrices are above a certain size, they get shared in memory across CPUs, and then somehow a threading issue causes memory to be accessed when it shouldn't be, leading to the segfault.
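As a rough sketch of how one might narrow down that size threshold (the row counts below are arbitrary, not from my runs): running each matmul in a subprocess means a segfault only kills the child, which shows up as a negative return code on Linux.

    # Rough sketch (illustrative only): probe increasing row counts in subprocesses
    # so a segfault in the child does not take down the probing script itself.
    import subprocess
    import sys

    SNIPPET = """
    import numpy as np
    a = np.random.randn({rows}, 3)
    b = np.random.randn(3, 3)
    np.matmul(a, b)
    """

    for rows in (25000, 50000, 100000, 200000):
        result = subprocess.run([sys.executable, "-c", SNIPPET.format(rows=rows)])
        status = "ok" if result.returncode == 0 else "crashed (return code %d)" % result.returncode
        print("rows=%d: %s" % (rows, status))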

Hopefully this helps someone!

@martin-frbg
Collaborator Author

martin-frbg commented Jul 22, 2020

My current hypothesis is that both multiple cores and a DYNAMIC_ARCH build are required for this crash. One possibility is that the TARGET=PRESCOTT used for the common code presets the BUFFERSIZE for the distributed GEMM to a smaller value than what is actually expected by the HASWELL kernels. At least the crash now seems vaguely familiar from earlier "big matrix, small blocksize" situations, and the default BUFFERSIZE for the various x86_64 CPUs did not diverge until fairly recently.
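Purely as an illustration of that failure mode (the constants below are made up; the real BUFFERSIZE and blocking parameters live in OpenBLAS's build headers): if the shared buffer is sized from the common-code target's parameters but the runtime kernel copies panels assuming a larger blocking, the copy runs past the end of the buffer. NumPy refuses the mismatched assignment, whereas the C code would silently overrun:

    # Illustrative toy only -- the constants are invented, not OpenBLAS's real values.
    import numpy as np

    COMMON_CODE_BUFFER_ELEMS = 16 * 1024   # hypothetical: buffer sized for the generic target
    HASWELL_PANEL_ELEMS = 24 * 1024        # hypothetical: panel size the Haswell kernel assumes

    buffer = np.empty(COMMON_CODE_BUFFER_ELEMS)
    panel = np.random.randn(HASWELL_PANEL_ELEMS)

    try:
        buffer[:HASWELL_PANEL_ELEMS] = panel   # NumPy raises here...
    except ValueError as exc:
        print("panel does not fit the buffer:", exc)  # ...while C would overrun the allocation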

@martin-frbg
Collaborator Author

Mhhh, indeed DYNAMIC_ARCH=1 TARGET=HASWELL no longer wants to fail on Haswell; the plot thickens.

@isuruf
Contributor

isuruf commented Jul 29, 2020

Thanks for the fix, @martin-frbg. A user confirmed that your PR fixed the issue.

raspbian-autopush pushed a commit to raspbian-packages/openblas that referenced this issue Aug 2, 2020
Origin: upstream, OpenMathLib/OpenBLAS@6c33764
Bug: OpenMathLib/OpenBLAS#2728
Bug-Debian: https://bugs.debian.org/966175
Last-Update: 2020-07-29
Gbp-Pq: Name fix-dynamic-arch-gemm-crashes.patch
raspbian-autopush pushed a commit to raspbian-packages/openblas that referenced this issue Aug 12, 2020
Origin: upstream, OpenMathLib/OpenBLAS@6c33764
Bug: OpenMathLib/OpenBLAS#2728
Bug-Debian: https://bugs.debian.org/966175
Last-Update: 2020-07-29
Gbp-Pq: Name fix-dynamic-arch-gemm-crashes.patch