
Segfaults in the conda-forge numpy and scipy traced to 0.3.10 #2728

Closed
martin-frbg opened this issue Jul 21, 2020 · 4 comments · Fixed by #2729

Comments

@martin-frbg
Collaborator

Recently a segfault in the scipy testsuite was reported to occur in conda packages for x86_64 based on 0.3.10 (conda-forge/scipy-feedstock#130). This was initially suspected to be related to PR #2516, as that was the only major change affecting AVX2 code in that release. However, more recent events (conda-forge/openblas-feedstock#101) with the simple reproducer from #2516 (comment)

    $ python
    Python 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50)
    [GCC 7.5.0] on linux
    >>> import numpy as np
    >>> t = np.array([[1, 0.5], [-1, 0.5]])
    >>> matrix_a = np.dot([(1,2) for i in range(1000000)], t)

make it more likely that this is either a more general memory management problem or a fault in the Haswell DGEMM kernel that appears to occur in DYNAMIC_ARCH builds only. In my test (currently with Python 3.6.4), valgrind reports an attempt by the Python parser code to free an unallocated memory region shortly before the segfault, which ostensibly occurs in line 255 of the DGEMM_ONCOPY kernel. Replacing the Haswell/SkylakeX dgemm_ncopy_8_skylakex.c with its generic gemm_ncopy_8.c counterpart has no effect.
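For anyone trying to reproduce this, a minimal diagnostic sketch (assuming the third-party threadpoolctl package is installed; it is not part of the reproducer above) can show which core a DYNAMIC_ARCH build actually selected at runtime and how many threads it will use:

    # Diagnostic sketch: report the runtime-selected OpenBLAS core and thread count.
    # Assumes the third-party threadpoolctl package is available; it is only used
    # here for illustration and is not part of the original reproducer.
    import numpy as np  # importing numpy loads the OpenBLAS it was built against
    from threadpoolctl import threadpool_info

    for entry in threadpool_info():
        if entry.get("internal_api") == "openblas":
            print("library :", entry.get("filepath"))
            print("version :", entry.get("version"))
            print("core    :", entry.get("architecture"))  # runtime-selected kernel set
            print("threads :", entry.get("num_threads"))

On an affected 0.3.10 DYNAMIC_ARCH build this would be expected to report a Haswell core and more than one thread.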

@proever

proever commented Jul 22, 2020

I've spent the last few days trying to find the source of this segfault as well, and I can also confirm that it is only present in 0.3.10.

During my debugging process (mostly done on an HPC system with a job scheduler, where I can request one or more CPUs per job), I noticed that the segfault only occurs when I have multiple CPUs available to me. On a single CPU the issue does not occur. Moreover, setting the OPENBLAS_NUM_THREADS environment variable to 1, either in the shell or using os.environ in the Python script, fixes the issue even when multiple CPUs are available.

Here's the very simple script I ended up arriving at for debugging:

    import os

    os.environ['OPENBLAS_NUM_THREADS'] = '1'

    import numpy as np

    a = np.random.randn(100000, 3)
    b = np.random.randn(3, 3)

    np.matmul(a, b)

As written, this should not cause the segfault, even on multi-core systems; removing the line that sets the environment variable and running on a system with multiple cores available triggers the issue.

Finally, I arrived at the shape (100000, 3) for a somewhat arbitrarily, and I'm sure this number is not particularly meaningful, but I did notice that reducing the size of a fixes the issue as well: for example, when I change a's dimensions to only (50000, 3), the issue goes away. Of course, this is probably related to how much memory was available to start with. My (uninformed) guess would be that if the matrices are above a certain size, they get shared in memory across CPUs, and then somehow a threading issue causes memory to be accessed when it shouldn't be, leading to the segfault.
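As a rough sketch of how one might narrow down that size threshold (the row counts below are arbitrary, not from my runs): running each matmul in a subprocess means a segfault only kills the child, which shows up as a negative return code on Linux.

    # Rough sketch (illustrative only): probe increasing row counts in subprocesses
    # so a segfault in the child does not take down the probing script itself.
    import subprocess
    import sys

    SNIPPET = """
    import numpy as np
    a = np.random.randn({rows}, 3)
    b = np.random.randn(3, 3)
    np.matmul(a, b)
    """

    for rows in (25000, 50000, 100000, 200000):
        result = subprocess.run([sys.executable, "-c", SNIPPET.format(rows=rows)])
        status = "ok" if result.returncode == 0 else "crashed (return code %d)" % result.returncode
        print("rows=%d: %s" % (rows, status))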

Hopefully this helps someone!

@martin-frbg
Collaborator Author

martin-frbg commented Jul 22, 2020

My current hypothesis is that both multiple cores and a DYNAMIC_ARCH build are required for this crash. One possibility is that the TARGET=PRESCOTT used for the common code presets the BUFFERSIZE for the distributed GEMM to a smaller value than what is actually expected by the HASWELL kernels. At least the crash now seems vaguely familiar from earlier "big matrix, small blocksize" situations, and the default BUFFERSIZE for the various x86_64 CPUs did not diverge until fairly recently.
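Purely as an illustration of that failure mode (the constants below are made up; the real BUFFERSIZE and blocking parameters live in OpenBLAS's build headers): if the shared buffer is sized from the common-code target's parameters but the runtime kernel copies panels assuming a larger blocking, the copy runs past the end of the buffer. NumPy refuses the mismatched assignment, whereas the C code would silently overrun:

    # Illustrative toy only -- the constants are invented, not OpenBLAS's real values.
    import numpy as np

    COMMON_CODE_BUFFER_ELEMS = 16 * 1024   # hypothetical: buffer sized for the generic target
    HASWELL_PANEL_ELEMS = 24 * 1024        # hypothetical: panel size the Haswell kernel assumes

    buffer = np.empty(COMMON_CODE_BUFFER_ELEMS)
    panel = np.random.randn(HASWELL_PANEL_ELEMS)

    try:
        buffer[:HASWELL_PANEL_ELEMS] = panel   # NumPy raises here...
    except ValueError as exc:
        print("panel does not fit the buffer:", exc)  # ...while C would overrun the allocation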

@martin-frbg
Collaborator Author

Mhhh, indeed DYNAMIC_ARCH=1 TARGET=HASWELL no longer wants to fail on Haswell; the plot thickens.

@isuruf
Contributor

isuruf commented Jul 29, 2020

Thanks for the fix, @martin-frbg. A user confirmed that your PR fixed the issue.

raspbian-autopush pushed a commit to raspbian-packages/openblas that referenced this issue Aug 2, 2020
Origin: upstream, OpenMathLib/OpenBLAS@6c33764
Bug: OpenMathLib/OpenBLAS#2728
Bug-Debian: https://bugs.debian.org/966175
Last-Update: 2020-07-29
Gbp-Pq: Name fix-dynamic-arch-gemm-crashes.patch
raspbian-autopush pushed a commit to raspbian-packages/openblas that referenced this issue Aug 12, 2020
Origin: upstream, OpenMathLib/OpenBLAS@6c33764
Bug: OpenMathLib/OpenBLAS#2728
Bug-Debian: https://bugs.debian.org/966175
Last-Update: 2020-07-29
Gbp-Pq: Name fix-dynamic-arch-gemm-crashes.patch