Test suite segfault on Linux/x86_64/Python 3.7 with old GCC #12483

Closed
pkgw opened this issue Oct 29, 2018 · 7 comments · Fixed by #12485
@pkgw

pkgw commented Oct 29, 2018

Description

In the conda-forge project, we see a segfault in the test suite when building scikit-learn packages against Python 3.7 using an older GCC. Here is the related issue report.

Steps/Code to Reproduce

This failure comes up in our CI-based automated build system. Here is the build info and here is the relevant log file.

My reproduction of the build environment isolates the segfault to this test:

test_common.py::test_non_meta_estimators[AgglomerativeClustering-AgglomerativeClustering-check_clustering(readonly_memmap=True)] <- $PREFIX/lib/python3.7/site-packages/sklearn/tests/test_common.py

The bit involving a memmap seems to me like it could potentially be fragile, but I see that there are other tests with the same flag that succeed.

Running in a debugger, I don't have any debug symbols, but the backtrace lands in scipy:

(gdb) bt
#0  0x00007fffcc189f3e in pdist_euclidean_double_wrap ()
   from /a/TEMP37/lib/python3.7/site-packages/scipy/spatial/_distance_wrap.cpython-37m-x86_64-linux-gnu.so
#1  0x0000000000431f3d in cfunction_call_varargs ()
#2  0x0000000000431fee in PyCFunction_Call ()
#3  0x00000000004ef5df in do_call_core ()

(Then the backtrace continues for many frames inside the Python interpreter.)

We do not see this failure on other versions of Python, or when building with GCC7, or on macOS.

I can upload the Conda package made out of the build that exhibits this problem, but since it doesn't have debugging symbols I think it would be a challenge to do much investigation with it.

Versions

System
------
    python: 3.7.0 | packaged by conda-forge | (default, Sep 30 2018, 14:56:18)  [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
executable: /a/TEMP37/bin/python
   machine: Linux-4.18.16-200.fc28.x86_64-x86_64-with-fedora-28-Twenty_Eight

BLAS
----
    macros: HAVE_CBLAS=None
  lib_dirs: /a/TEMP37/lib
cblas_libs: openblas, openblas

Python deps
-----------
       pip: 18.1
setuptools: 40.5.0
   sklearn: 0.20.0
     numpy: 1.15.3
     scipy: 1.1.0
    Cython: None
    pandas: None
@amueller
Member

Likely related to readonly_memmap=True. This is the 0.20.0 release, right, not a branch or something? We had one of these before. We're probably trying to write to a memmapped array.

@amueller
Member

Makes me think of #11133. Maybe check if doing np.require(something, requirements="W") like in #11133 helps.
Is this deterministic?
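
For readers unfamiliar with the suggestion above, here is a minimal sketch of what `np.require(..., requirements="W")` does with a read-only array. The helper name `ensure_writeable` and the toy array are assumptions made for illustration; this is not the code from #11133.

```python
import numpy as np


def ensure_writeable(X):
    """Hypothetical helper: return X if it is writeable, else a writeable copy."""
    # np.require copies the array only when a requested flag is not satisfied.
    return np.require(X, requirements="W")


# Toy stand-in for a read-only input (assumption: any read-only ndarray will do).
X = np.arange(6, dtype=np.float64).reshape(3, 2)
X.setflags(write=False)

Y = ensure_writeable(X)
print(X.flags.writeable, Y.flags.writeable)
```

The original array is left untouched and still read-only; only the returned array can be written to. An array that is already writeable is passed through without copying.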

@pkgw
Author

pkgw commented Oct 29, 2018

Yes, 0.20.0. Yes, deterministic.

For what it's worth, in my detailed debugging there were other tests that had readonly_memmap=True that passed without segfaulting. And the tests pass for other platforms, Python versions, and compiler versions. But perhaps for this particular build environment and this particular algorithm, something subtle gets triggered that causes the generated code to try to write to the array.

@jakirkham
Contributor

jakirkham commented Oct 29, 2018

FWIW here is the C code in SciPy 1.1.0 that defines pdist_euclidean_double_wrap (along with various other double-precision pdist functions).

Edit: The underlying computation is here.

@rth
Member

rth commented Oct 29, 2018

I can also reproduce this with the condaforge/linux-anvil Docker image and the following setup:

/usr/bin/sudo yum -y install wget
conda create -n sklearn-env numpy==1.15.3 scipy==1.1.0 pytest python=3.7
conda activate sklearn-env
wget -O scikit-learn.tar.gz https://files.pythonhosted.org/packages/0f/d7/136a447295adade38e7184618816e94190ded028318062a092daeb972073/scikit-learn-0.20.0.tar.gz
tar xzf scikit-learn.tar.gz
cd scikit-learn-0.20.0/
pip install -e .
pytest sklearn/tests/test_common.py::test_non_meta_estimators -vs -k 'AgglomerativeClustering-check_clustering(readonly_memmap'

It appears to be happening during a call to scipy.cluster.hierarchy.ward in ward_tree here. At this point the scipy ward function gets as input a (51, 2) array with the following array.flags,

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : False
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

However, saving this array and providing it to ward in read-only mode (by opening the data as mmap) does not segfault. Maybe it's due to the fact that the array is a slice of a larger array (and so I'm not sure what happens with alignment). I have not investigated further.

In any case, using np.require(something, requirements="W") does fix it (as does using the O or A flags, since I imagine a copy is triggered in either case).
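
To make the flags discussion concrete, here is a rough sketch that builds a read-only, unaligned, non-owning (51, 2) array and checks that each of the `W`, `O` and `A` requirements yields a copy. The file layout (a 1-byte header in front of the data) and the file name are assumptions chosen purely to force misalignment; this is not how the scikit-learn test constructs its memmap, and it does not attempt to reproduce the segfault.

```python
import os
import tempfile

import numpy as np

data = np.arange(102, dtype=np.float64).reshape(51, 2)

# Hypothetical file layout: a 1-byte header pushes the float64 payload
# off an 8-byte boundary, so the mapped array ends up unaligned.
path = os.path.join(tempfile.mkdtemp(), "unaligned.bin")
with open(path, "wb") as f:
    f.write(b"\x00")
    f.write(data.tobytes())

# Read-only memory map starting one byte into the file.
X = np.memmap(path, dtype=np.float64, mode="r", offset=1, shape=(51, 2))
print(X.flags)

# Each requirement is unmet by X, so np.require returns a fresh copy.
for flag in ("W", "O", "A"):
    X_safe = np.require(X, requirements=flag)
    print(flag, np.shares_memory(X, X_safe), X_safe.flags.writeable)
```

Since a copy is a freshly allocated array, it comes back writeable, aligned and owning its data regardless of which of the three flags was requested.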

Fix proposed in #12485

I have not tried to reproduce this on master, but I think it would still apply.

amueller added this to the 0.20.1 milestone Oct 30, 2018
@amueller
Member

I don't understand why this is platform specific but better safe than sorry?

@jakirkham
Contributor

Think it is compiler specific (not platform specific). Using a newer version of GCC on Linux fixes the issue.

amueller pushed a commit that referenced this issue Nov 6, 2018
…12485)

This fixes a segfault in AgglomerativeClustering with read-only mmaps that happens inside `ward_tree` when calling `scipy.cluster.hierarchy.ward`.

Closes #12483

(see the above issue for more details)