eigh() tests fail to pass, crash Python with seemingly random pattern #11709

Closed
congma opened this issue Mar 22, 2020 · 52 comments · Fixed by #11737
Contributor

congma commented Mar 22, 2020

This problem is related to #11601, which was closed by #11702 (@ilayn). However, the latter PR did not fix the crash.

The symptoms remain almost identical to those described in my comment in #11601 (comment).

In summary, when running the test for eigh(), Python tends to crash with SIGSEGV or SIGABRT. Sometimes this happens during the test_eigh() function, sometimes after it passed with "100%" but before pytest returns.

The test that triggers the crash is the following test function:

# Old eigh tests kept for backwards compatibility
@pytest.mark.parametrize('eigvals', (None, (2, 4)))
@pytest.mark.parametrize('turbo', (True, False))
@pytest.mark.parametrize('lower', (True, False))
@pytest.mark.parametrize('overwrite', (True, False))
@pytest.mark.parametrize('dtype_', ('f', 'd', 'F', 'D'))
@pytest.mark.parametrize('dim', (6,))
def test_eigh(self, dim, dtype_, overwrite, lower, turbo, eigvals):
    atol = 1e-11 if dtype_ in ('dD') else 1e-4
    a = _random_hermitian_matrix(n=dim, dtype=dtype_)
    w, z = eigh(a, overwrite_a=overwrite, lower=lower, eigvals=eigvals)
    assert_dtype_equal(z.dtype, dtype_)
    w = w.astype(dtype_)
    diag_ = diag(z.T.conj() @ a @ z).real
    assert_allclose(diag_, w, rtol=0., atol=atol)
    a = _random_hermitian_matrix(n=dim, dtype=dtype_)
    b = _random_hermitian_matrix(n=dim, dtype=dtype_, posdef=True)
    w, z = eigh(a, b, overwrite_a=overwrite, lower=lower,
                overwrite_b=overwrite, turbo=turbo, eigvals=eigvals)
    assert_dtype_equal(z.dtype, dtype_)
    w = w.astype(dtype_)
    diag1_ = diag(z.T.conj() @ a @ z).real
    assert_allclose(diag1_, w, rtol=0., atol=atol)
    diag2_ = diag(z.T.conj() @ b @ z).real
    assert_allclose(diag2_, ones(diag2_.shape[0]), rtol=0., atol=atol)

Some patterns from the histories of crashes

I ran the test script with runtests.py 100 times and saved the output of each run as a text file.
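A repeated-run experiment like this can be scripted. The following is only a minimal sketch, not the procedure actually used here: the helper names `classify_exit` and `tally_runs` are hypothetical, and the demo command is a harmless stand-in for the `./runtests.py` invocation given under "Reproducing code example". It relies on the POSIX convention that `subprocess` reports a process killed by signal N with return code -N.

```python
import signal
import subprocess
import sys
from collections import Counter


def classify_exit(returncode):
    """Map a subprocess return code to a short label (hypothetical helper)."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a return code of -N means the child was killed by signal N.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def tally_runs(cmd, runs):
    """Run `cmd` `runs` times and count how each run ended."""
    counts = Counter()
    for _ in range(runs):
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        counts[classify_exit(proc.returncode)] += 1
    return counts


# Stand-in command so the sketch runs anywhere. For the real experiment,
# substitute the ./runtests.py command line quoted in this issue.
demo_cmd = [sys.executable, "-c", "pass"]
print(tally_runs(demo_cmd, runs=3))
```

Counting by exit status alone does not say where in the test the crash happened; for that, the saved output of each run still has to be inspected.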

By grepping the output files of ./runtests.py, I noticed that the last known position in Python before the crash is one of three lines: 873, 876, or 877. Line 873 is the actual call to eigh(), while the crash can happen as late as line 876 or 877, where the arrays returned from eigh() are accessed.

Only 6 out of 100 runs passed without any problems.

In some cases (35 out of the 100), Python segfaults after nominally completing all the tests in TestEigh::test_eigh.

In the cases where Python was killed with SIGABRT, 36 were at line 873 (the call to eigh()), while 9 were at line 876, where the output z is used. In many other runs, the test script did not appear in the Python backtrace, if there was one at all.
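For readers who want to see how a "last known position in Python" can be obtained at all, the standard-library `faulthandler` module prints the Python traceback when a fatal signal arrives. This is a sketch under assumptions: the child program below delivers SIGSEGV to itself as a stand-in for a crash inside compiled code, and the function name `crash` is purely illustrative. It is not a claim about how the backtraces in this report were produced.

```python
import subprocess
import sys

# Child program: enable faulthandler, then deliberately deliver SIGSEGV to
# itself. This only imitates a crash in compiled code (POSIX only).
CHILD = """
import faulthandler, os, signal
faulthandler.enable()

def crash():
    os.kill(os.getpid(), signal.SIGSEGV)

crash()
"""


def run_child():
    """Run the crashing child and capture what it writes before dying."""
    return subprocess.run(
        [sys.executable, "-c", CHILD], capture_output=True, text=True
    )


proc = run_child()
print("return code:", proc.returncode)
# stderr holds faulthandler's report, including the Python-level frames
# that were active when the signal arrived.
print(proc.stderr)
```

The reported frame is the Python line that was executing when the signal was delivered, which is why memory corruption caused earlier can surface a few lines after the call that caused it.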

The parametrized inputs that triggered the crash were of the form test_eigh[6-D-XXX-YYY-ZZZ-eigvals1]. That is, the crashes happened for dimension 6, dtype double complex, with eigvals= keyword parameter set to the tuple (2, 4). The XXX--ZZZ parameters are boolean flags for keywords turbo, lower, and overwrite respectively.

An incomplete tally of the parameters (turbo, lower, and overwrite), where Python crashed before finishing all the tests, is as follows:

   5 False-False-False
  11 False-False-True
  13 False-True-False
   6 False-True-True
   7 True-False-False
   4 True-False-True
  15 True-True-True

The combination (turbo=True, lower=True, overwrite=False) is the only one of the 2^3 = 8 cases that has not been observed yet.
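The missing combination can be checked mechanically. The sketch below copies the tally from the list above and keeps the labels exactly as they appear there (three boolean flags joined by hyphens); it makes no claim of its own about which flag maps to which keyword.

```python
from itertools import product

# Crash counts copied from the tally above, keyed by the three-flag label.
tally = {
    "False-False-False": 5,
    "False-False-True": 11,
    "False-True-False": 13,
    "False-True-True": 6,
    "True-False-False": 7,
    "True-False-True": 4,
    "True-True-True": 15,
}

# All 2^3 possible labels for three boolean flags.
all_combos = {
    "-".join(str(flag) for flag in combo)
    for combo in product((True, False), repeat=3)
}

missing = sorted(all_combos - set(tally))
total = sum(tally.values())

print("combinations not seen:", missing)
print("crashes tallied:", total)
```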

Reproducing code example:

./runtests.py -vt scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh

Scipy/Numpy/Python version information:

Scipy master branch as of ae34ce4, Numpy 1.18.1, Python 3.7.6, conda macos with MKL 2019.4.

Member

ilayn commented Mar 22, 2020

I think we first need to isolate that piece of code and run it externally, without the test. The tests use a fixed seed, so it doesn't make any sense that it sometimes succeeds and sometimes doesn't.
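To illustrate the fixed-seed point outside the test suite, here is a minimal sketch. It is built on assumptions: `random_hermitian` is a hypothetical stand-in for the test helper `_random_hermitian_matrix` (whose implementation is not shown in this thread), the seed value is arbitrary, and `numpy.linalg.eigh` is swapped in for `scipy.linalg.eigh` so the example depends on NumPy only.

```python
import numpy as np


def random_hermitian(n, dtype, seed=1234):
    """Hypothetical stand-in for the test helper: seeded Hermitian matrix."""
    rng = np.random.RandomState(seed)
    if np.dtype(dtype).kind == "c":
        x = rng.rand(n, n) + 1j * rng.rand(n, n)
    else:
        x = rng.rand(n, n)
    # Averaging with the conjugate transpose makes the matrix Hermitian.
    a = (x + x.conj().T) / 2
    return a.astype(dtype)


# With the same seed, two calls produce bit-identical input matrices.
a1 = random_hermitian(6, "D")
a2 = random_hermitian(6, "D")
same_input = np.array_equal(a1, a2)

# Same residual check as in the test: diag(z^H a z) should equal w.
w, z = np.linalg.eigh(a1)
diag_ = np.diag(z.conj().T @ a1 @ z).real
residual_ok = np.allclose(diag_, w, rtol=0.0, atol=1e-11)

print("identical inputs:", same_input)
print("eigen-decomposition consistent:", residual_ok)
```

If the inputs are identical on every run, intermittent crashes would have to come from something other than the data, such as uninitialized or overrun memory in compiled code, which is a possibility rather than a diagnosis.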

Also what do you get if you run scipy.linalg.lapack.ilaver()?

Contributor Author

congma commented Mar 22, 2020

> Also what do you get if you run scipy.linalg.lapack.ilaver()?

(3, 7, 0)

Member

ilayn commented Mar 22, 2020

What do you get with the following?

import numpy as np
from scipy.linalg.lapack import zheevr as evr, zheevr_lwork as evr_lw, _compute_lwork

a = np.array([[0.57830695+0.j, 0.57732626+0.j, 0.50339199+0.j, 0.49955715+0.j, 0.65422828+0.j, 0.39991909+0.j],
              [0.57732626+0.j, 0.26035319+0.j, 0.98113963+0.j, 0.63757905+0.j, 0.79092958+0.j, 0.72842587+0.j],
              [0.50339199+0.j, 0.98113963+0.j, 0.26858524+0.j, 0.61785254+0.j, 0.62372901+0.j, 0.63216722+0.j],
              [0.49955715+0.j, 0.63757905+0.j, 0.61785254+0.j, 0.78260145+0.j, 0.77017289+0.j, 0.82255765+0.j],
              [0.65422828+0.j, 0.79092958+0.j, 0.62372901+0.j, 0.77017289+0.j, 0.78081655+0.j, 0.50652208+0.j],
              [0.39991909+0.j, 0.72842587+0.j, 0.63216722+0.j, 0.82255765+0.j, 0.50652208+0.j, 0.00767085+0.j]],
             dtype='D', order='F')

lwork, lrwork, liwork = _compute_lwork(evr_lw, n=6, lower=False)
evr(a, compute_v= 1, il=3, iu=5, lower=False, overwrite_a=True, range='I')

I'm on LAPACK 3.9.0 with OpenBLAS.

Contributor Author

congma commented Mar 22, 2020

> What do you get with the following?

I ran the script against the master version of scipy 50 times. Five of the runs segfaulted after printing the output below, and the rest exited without errors.

 (array([-0.01350094,  0.0714959 ,  0.24539661,  0.        ,  0.        ,
         0.        ]),
 array([[-0.60090663+0.j,  0.17259041+0.j,  0.69638956+0.j],
        [ 0.11969248+0.j, -0.46110847+0.j,  0.02372352+0.j],
        [ 0.04234416+0.j, -0.56691172+0.j, -0.03759849+0.j],
        [-0.28507343+0.j,  0.4913636 +0.j, -0.58940264+0.j],
        [ 0.71896113+0.j,  0.41897405+0.j,  0.28269445+0.j],
        [-0.15690741+0.j, -0.13865491+0.j, -0.29283701+0.j]]),
 3,
 array([], dtype=int32),
 0)

I also tested running eigh repeatedly on the complex array np.eye(6, dtype="D") and noticed a much lower chance of running into a crash. The crash rate under pytest seems much higher.

Member

ilayn commented Mar 22, 2020

I'm actually out of ideas. Random crashing cannot be related to SciPy as far as I can see. How do you build SciPy from the master branch?

Contributor Author

congma commented Mar 22, 2020

> How do you build SciPy from the master branch?

export NPY_DISTUTILS_APPEND_FLAGS=1
export SCIPY_USE_G77_ABI_WRAPPER=1
./runtests

conda also does some tricks with compiler flags by default when I activate an environment. The ones below are those I think are relevant:

CXX=x86_64-apple-darwin13.4.0-clang++
CXXFLAGS=-march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -stdlib=libc++ -fvisibility-inlines-hidden -std=c++14 -fmessage-length=0
DEBUG_CXXFLAGS=-march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -stdlib=libc++ -fvisibility-inlines-hidden -std=c++14 -fmessage-length=0 -Og -g -Wall -Wextra
LDFLAGS=-Wl,-pie -Wl,-headerpad_max_install_names -Wl,-dead_strip_dylibs
DEBUG_CFLAGS=-march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -Og -g -Wall -Wextra
LDFLAGS_LD=-pie -headerpad_max_install_names -dead_strip_dylibs
LD=/Users/congma/miniconda3/envs/scipy-ws/bin/x86_64-apple-darwin13.4.0-ld
CC=x86_64-apple-darwin13.4.0-clang
CFLAGS=-march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe

The Fortran compiler is gfortran-9.3.0.

Member

ilayn commented Mar 22, 2020

I am not familiar with the MacOSX stuff, even less with conda env, but `runtests.py` won't upgrade your installed scipy; it only builds it. After the build you need to install the package somehow. Hence the current master version should give

import scipy as sp
print(sp.__version__)
1.5.0.dev0+ae34ce4

Contributor Author

congma commented Mar 22, 2020

> I am not familiar with the MacOSX stuff, even less with conda env, but `runtests.py` won't upgrade your installed scipy; it only builds it. After the build you need to install the package somehow. Hence the current master version should give
>
> import scipy as sp
> print(sp.__version__)
> 1.5.0.dev0+ae34ce4

I'm running external snippets with the help of ./runtests.py after building scipy from master (which I assume will set the correct path variables to the built package). The command I'm running is

./runtests.py --ipython < FILE_NAME.py

where I put the Python statements in a file at FILE_NAME.py. To check that I'm running against the built scipy master, I'm using the same print statement which shows the same git tip SHA1 as the one here (ae34ce4).

I think the conda maintainers/testers may have more resources at hand to test with a similar environment. If there's something in master that crashes, they will be affected eventually. Should we bring the issue to their attention, and if so, how?

Member

ilayn commented Mar 22, 2020

I meant the code snippet I gave above. When you run that as a standalone code snippet, can you check what the print statement gives? No conda env, no scripts, as simple as possible. It helps to have everything as barebones as possible.

Contributor Author

congma commented Mar 22, 2020

> as a standalone code snippet can you check what the print statement gives. No conda env no scripts

1.4.1 with no conda env (i.e. base env), and no script, simply under python -c. This is the scipy version from conda's binary channel.

EDIT: I should add that the no-env (i.e. base env) conda setting is purely binary and only installs released versions, which can differ a lot from master. I'm pretty sure I'd better not install packages built from the master tip into it. The conda env is where I build, or even install, from master.

Member

ilayn commented Mar 22, 2020

I am trying to eliminate possibilities but only one at a time.

If you open a Python environment (whatever you use), import scipy, and check the version manually, and it says 1.5.0.dev0+ae34ce4, that's fine. Otherwise it is not suitable for the test.

Then, if you run the code snippet without any script, via IPython or a plain Python terminal, and it still crashes, that's a data point. If it doesn't, then it is related to how conda triggers runtests and not related to SciPy.

We haven't yet triangulated when and where it happens. You seem to use many things at once; that's why I am trying to isolate the tests. I don't mind which tool you use, but a straightforward test is needed.
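One way to make such a check unambiguous is to print which packages the interpreter actually imports before running the snippet. The sketch below is only an illustration: `describe_module` and `describe_environment` are hypothetical helper names, while `scipy.linalg.lapack.ilaver()` is the call already mentioned earlier in this thread. Packages that are not installed are reported as `None` instead of raising.

```python
import importlib
import platform


def describe_module(name):
    """Return version and file location of an importable module, else None."""
    try:
        mod = importlib.import_module(name)
    except ImportError:
        return None
    return {
        "version": getattr(mod, "__version__", None),
        "file": getattr(mod, "__file__", None),
    }


def describe_environment():
    """Collect the facts that matter for reproducing the crash."""
    info = {
        "python": platform.python_version(),
        "numpy": describe_module("numpy"),
        "scipy": describe_module("scipy"),
    }
    try:
        from scipy.linalg.lapack import ilaver
        # ilaver() reports the LAPACK version as (major, minor, patch).
        info["lapack"] = tuple(int(v) for v in ilaver())
    except ImportError:
        info["lapack"] = None
    return info


for key, value in describe_environment().items():
    print(key, value)
```

The `file` entry shows whether the imported scipy comes from the freshly built tree or from a released binary package, which is the distinction being asked about here.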

Member

ilayn commented Mar 22, 2020

Also, would it be possible for you to upgrade to LAPACK 3.9.0 in another environment? I don't know why MKL 2019.4 still uses 3.7.0.

Contributor Author

congma commented Mar 22, 2020

> If you open an Python environment (whatever you use) and import scipy and then check the version manually, it says 1.5.0.dev0+ae34ce4 that's fine. Otherwise it is not suitable for the test.

I just did that in a new environment with scipy==1.5.0.dev0+ae34ce4 (wheel freshly built from master source and then installed). The test script in #11709 (comment) succeeded in 90 out of 100 runs, and the rest were segfaults.

> Also is it possible that you upgrade to LAPACK 3.9.0 in another environment? I don't know why 2019.4 MKL still uses 3.7.0

I don't think I know how to do that right now. I could possibly try a recent-release build of OpenBLAS (but that won't be MKL) and have the scipy master build use that. The MKL dependency is rather awkwardly baked into the conda dependency relations; trying to circumvent it makes conda fail to find satisfiable dependencies.

Member

rlucas7 commented Mar 22, 2020

I was having the same issue as before too. I'm not sure what in the dependency closure fixed it, but I updated the numpy version in my conda virtual env via conda update numpy, and that fixed the failing test.

[Long scrollback buffer listing]

(scipy-dev) Lucass-MacBook:scipy rlucas$ conda update numpy
Collecting package metadata (repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.8.2
  latest version: 4.8.3

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /Users/rlucas/anaconda3/envs/scipy-dev

  added / updated specs:
    - numpy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    asn1crypto-1.3.0           |           py36_0         162 KB
    attrs-19.3.0               |             py_0          39 KB
    babel-2.8.0                |             py_0         6.0 MB
    cffi-1.14.0                |   py36hb5b8e2f_0         218 KB
    chardet-3.0.4              |        py36_1003         201 KB
    docutils-0.16              |           py36_0         742 KB
    idna-2.9                   |             py_1          56 KB
    imagesize-1.2.0            |             py_0          10 KB
    intel-openmp-2019.4        |              233         1.1 MB
    jinja2-2.11.1              |             py_0          97 KB
    kiwisolver-1.1.0           |   py36h0a44026_0          60 KB
    more-itertools-8.2.0       |             py_0          39 KB
    ncurses-6.2                |       h0a44026_0        1024 KB
    openssl-1.1.1e             |       h1de35cc_0         3.5 MB
    packaging-20.3             |             py_0          35 KB
    pip-20.0.2                 |           py36_1         1.9 MB
    py-1.8.1                   |             py_0          69 KB
    pycparser-2.20             |             py_0          93 KB
    pygments-2.6.1             |             py_0         687 KB
    pyopenssl-19.1.0           |           py36_0          87 KB
    pyparsing-2.4.6            |             py_0          64 KB
    pysocks-1.7.1              |           py36_0          30 KB
    python-dateutil-2.8.1      |             py_0         224 KB
    pytz-2019.3                |             py_0         231 KB
    requests-2.23.0            |           py36_0          91 KB
    setuptools-46.0.0          |           py36_0         645 KB
    six-1.14.0                 |           py36_0          27 KB
    snowballstemmer-2.0.0      |             py_0          58 KB
    sphinxcontrib-applehelp-1.0.2|             py_0          30 KB
    sphinxcontrib-devhelp-1.0.2|             py_0          24 KB
    sphinxcontrib-htmlhelp-1.0.3|             py_0          29 KB
    sphinxcontrib-qthelp-1.0.3 |             py_0          26 KB
    sphinxcontrib-serializinghtml-1.1.4|             py_0          25 KB
    sqlite-3.31.1              |       ha441bb4_0         2.4 MB
    tornado-6.0.4              |   py36h1de35cc_1         647 KB
    urllib3-1.25.8             |           py36_0         165 KB
    wheel-0.34.2               |           py36_0          49 KB
    zipp-2.2.0                 |             py_0          12 KB
    ------------------------------------------------------------
                                           Total:        20.8 MB

The following packages will be REMOVED:

  inflect-4.1.0-py36_0
  jaraco.itertools-5.0.0-py_0
  sphinxcontrib-1.0-py36_1
  sphinxcontrib-websupport-1.1.0-py36_1

The following packages will be UPDATED:

  asn1crypto                                  0.24.0-py36_0 --> 1.3.0-py36_0
  atomicwrites                                 1.2.1-py36_0 --> 1.3.0-py36_1
  attrs              pkgs/main/osx-64::attrs-18.2.0-py36h2~ --> pkgs/main/noarch::attrs-19.3.0-py_0
  babel                pkgs/main/osx-64::babel-2.6.0-py36_0 --> pkgs/main/noarch::babel-2.8.0-py_0
  cffi                                1.11.5-py36h6174b99_1 --> 1.14.0-py36hb5b8e2f_0
  chardet                                      3.0.4-py36_1 --> 3.0.4-py36_1003
  cryptography                         2.4.1-py36ha12b0ac_0 --> 2.8-py36ha12b0ac_0
  docutils                              0.14-py36hbfde631_0 --> 0.16-py36_0
  idna                    pkgs/main/osx-64::idna-2.7-py36_0 --> pkgs/main/noarch::idna-2.9-py_1
  imagesize          pkgs/main/osx-64::imagesize-1.1.0-py3~ --> pkgs/main/noarch::imagesize-1.2.0-py_0
  intel-openmp                                   2019.1-144 --> 2019.4-233
  jinja2               pkgs/main/osx-64::jinja2-2.10-py36_0 --> pkgs/main/noarch::jinja2-2.11.1-py_0
  kiwisolver                           1.0.1-py36h0a44026_0 --> 1.1.0-py36h0a44026_0
  libedit                           3.1.20170329-hb402a30_2 --> 3.1.20181209-hb402a30_0
  markupsafe                           1.1.0-py36h1de35cc_0 --> 1.1.1-py36h1de35cc_0
  mkl_fft                             1.0.12-py36h5e564d8_0 --> 1.0.15-py36h5e564d8_0
  mkl_random                           1.0.2-py36h27c97d8_0 --> 1.1.0-py36ha771720_0
  more-itertools     pkgs/main/osx-64::more-itertools-4.3.~ --> pkgs/main/noarch::more-itertools-8.2.0-py_0
  ncurses                                    6.1-h0a44026_1 --> 6.2-h0a44026_0
  openssl                                 1.1.1d-h1de35cc_3 --> 1.1.1e-h1de35cc_0
  packaging          pkgs/main/osx-64::packaging-18.0-py36~ --> pkgs/main/noarch::packaging-20.3-py_0
  pip                                           18.1-py36_0 --> 20.0.2-py36_1
  py                      pkgs/main/osx-64::py-1.7.0-py36_0 --> pkgs/main/noarch::py-1.8.1-py_0
  pycparser          pkgs/main/osx-64::pycparser-2.19-py36~ --> pkgs/main/noarch::pycparser-2.20-py_0
  pygments           pkgs/main/osx-64::pygments-2.2.0-py36~ --> pkgs/main/noarch::pygments-2.6.1-py_0
  pyopenssl                                   18.0.0-py36_0 --> 19.1.0-py36_0
  pyparsing          pkgs/main/osx-64::pyparsing-2.3.0-py3~ --> pkgs/main/noarch::pyparsing-2.4.6-py_0
  pysocks                                      1.6.8-py36_0 --> 1.7.1-py36_0
  python-dateutil    pkgs/main/osx-64::python-dateutil-2.7~ --> pkgs/main/noarch::python-dateutil-2.8.1-py_0
  pytz                 pkgs/main/osx-64::pytz-2018.7-py36_0 --> pkgs/main/noarch::pytz-2019.3-py_0
  requests                                    2.20.1-py36_0 --> 2.23.0-py36_0
  setuptools                                  40.6.2-py36_0 --> 46.0.0-py36_0
  six                                         1.11.0-py36_1 --> 1.14.0-py36_0
  snowballstemmer    pkgs/main/osx-64::snowballstemmer-1.2~ --> pkgs/main/noarch::snowballstemmer-2.0.0-py_0
  sphinxcontrib-app~                             1.0.1-py_0 --> 1.0.2-py_0
  sphinxcontrib-dev~                             1.0.1-py_0 --> 1.0.2-py_0
  sphinxcontrib-htm~                             1.0.2-py_0 --> 1.0.3-py_0
  sphinxcontrib-qth~                             1.0.2-py_0 --> 1.0.3-py_0
  sphinxcontrib-ser~                             1.1.3-py_0 --> 1.1.4-py_0
  sqlite                                  3.25.3-ha441bb4_0 --> 3.31.1-ha441bb4_0
  tornado                              5.1.1-py36h1de35cc_0 --> 6.0.4-py36h1de35cc_1
  urllib3                                       1.23-py36_0 --> 1.25.8-py36_0
  wheel                                       0.32.3-py36_0 --> 0.34.2-py36_0
  zipp                                           2.1.0-py_0 --> 2.2.0-py_0


Proceed ([y]/n)? y


Downloading and Extracting Packages
pytz-2019.3          | 231 KB    | ########################################################################################################################################## | 100% 
sphinxcontrib-appleh | 30 KB     | ########################################################################################################################################## | 100% 
pysocks-1.7.1        | 30 KB     | ########################################################################################################################################## | 100% 
wheel-0.34.2         | 49 KB     | ########################################################################################################################################## | 100% 
kiwisolver-1.1.0     | 60 KB     | ########################################################################################################################################## | 100% 
six-1.14.0           | 27 KB     | ########################################################################################################################################## | 100% 
babel-2.8.0          | 6.0 MB    | ########################################################################################################################################## | 100% 
urllib3-1.25.8       | 165 KB    | ########################################################################################################################################## | 100% 
imagesize-1.2.0      | 10 KB     | ########################################################################################################################################## | 100% 
sphinxcontrib-serial | 25 KB     | ########################################################################################################################################## | 100% 
pyparsing-2.4.6      | 64 KB     | ########################################################################################################################################## | 100% 
zipp-2.2.0           | 12 KB     | ########################################################################################################################################## | 100% 
sqlite-3.31.1        | 2.4 MB    | ########################################################################################################################################## | 100% 
packaging-20.3       | 35 KB     | ########################################################################################################################################## | 100% 
pycparser-2.20       | 93 KB     | ########################################################################################################################################## | 100% 
more-itertools-8.2.0 | 39 KB     | ########################################################################################################################################## | 100% 
ncurses-6.2          | 1024 KB   | ########################################################################################################################################## | 100% 
setuptools-46.0.0    | 645 KB    | ########################################################################################################################################## | 100% 
intel-openmp-2019.4  | 1.1 MB    | ########################################################################################################################################## | 100% 
sphinxcontrib-qthelp | 26 KB     | ########################################################################################################################################## | 100% 
docutils-0.16        | 742 KB    | ########################################################################################################################################## | 100% 
sphinxcontrib-htmlhe | 29 KB     | ########################################################################################################################################## | 100% 
asn1crypto-1.3.0     | 162 KB    | ########################################################################################################################################## | 100% 
jinja2-2.11.1        | 97 KB     | ########################################################################################################################################## | 100% 
python-dateutil-2.8. | 224 KB    | ########################################################################################################################################## | 100% 
idna-2.9             | 56 KB     | ########################################################################################################################################## | 100% 
attrs-19.3.0         | 39 KB     | ########################################################################################################################################## | 100% 
openssl-1.1.1e       | 3.5 MB    | ########################################################################################################################################## | 100% 
chardet-3.0.4        | 201 KB    | ########################################################################################################################################## | 100% 
snowballstemmer-2.0. | 58 KB     | ########################################################################################################################################## | 100% 
pygments-2.6.1       | 687 KB    | ########################################################################################################################################## | 100% 
sphinxcontrib-devhel | 24 KB     | ########################################################################################################################################## | 100% 
requests-2.23.0      | 91 KB     | ########################################################################################################################################## | 100% 
tornado-6.0.4        | 647 KB    | ########################################################################################################################################## | 100% 
pip-20.0.2           | 1.9 MB    | ########################################################################################################################################## | 100% 
py-1.8.1             | 69 KB     | ########################################################################################################################################## | 100% 
cffi-1.14.0          | 218 KB    | ########################################################################################################################################## | 100% 
pyopenssl-19.1.0     | 87 KB     | ########################################################################################################################################## | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(scipy-dev) Lucass-MacBook:scipy rlucas$ ./runtests.py -vt scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh
Building, see build.log...
Build OK (0:00:09.551111 elapsed)
================================================================================ test session starts ================================================================================
platform darwin -- Python 3.6.7, pytest-5.0.1, py-1.8.1, pluggy-0.13.1 -- /Users/rlucas/anaconda3/envs/scipy-dev/bin/python
cachedir: .pytest_cache
rootdir: /Users/rlucas/scipy-dev/scipy, inifile: pytest.ini
collected 64 items                                                                                                                                                                  

scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-True-True-True-None] PASSED                                                                                        [  1%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-True-True-True-eigvals1] PASSED                                                                                    [  3%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-True-True-False-None] PASSED                                                                                       [  4%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-True-True-False-eigvals1] PASSED                                                                                   [  6%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-True-False-True-None] PASSED                                                                                       [  7%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-True-False-True-eigvals1] PASSED                                                                                   [  9%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-True-False-False-None] PASSED                                                                                      [ 10%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-True-False-False-eigvals1] PASSED                                                                                  [ 12%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-False-True-True-None] PASSED                                                                                       [ 14%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-False-True-True-eigvals1] PASSED                                                                                   [ 15%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-False-True-False-None] PASSED                                                                                      [ 17%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-False-True-False-eigvals1] PASSED                                                                                  [ 18%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-False-False-True-None] PASSED                                                                                      [ 20%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-False-False-True-eigvals1] PASSED                                                                                  [ 21%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-False-False-False-None] PASSED                                                                                     [ 23%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-f-False-False-False-eigvals1] PASSED                                                                                 [ 25%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-True-True-True-None] PASSED                                                                                        [ 26%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-True-True-True-eigvals1] PASSED                                                                                    [ 28%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-True-True-False-None] PASSED                                                                                       [ 29%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-True-True-False-eigvals1] PASSED                                                                                   [ 31%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-True-False-True-None] PASSED                                                                                       [ 32%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-True-False-True-eigvals1] PASSED                                                                                   [ 34%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-True-False-False-None] PASSED                                                                                      [ 35%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-True-False-False-eigvals1] PASSED                                                                                  [ 37%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-False-True-True-None] PASSED                                                                                       [ 39%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-False-True-True-eigvals1] PASSED                                                                                   [ 40%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-False-True-False-None] PASSED                                                                                      [ 42%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-False-True-False-eigvals1] PASSED                                                                                  [ 43%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-False-False-True-None] PASSED                                                                                      [ 45%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-False-False-True-eigvals1] PASSED                                                                                  [ 46%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-False-False-False-None] PASSED                                                                                     [ 48%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-d-False-False-False-eigvals1] PASSED                                                                                 [ 50%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-True-True-True-None] PASSED                                                                                        [ 51%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-True-True-True-eigvals1] PASSED                                                                                    [ 53%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-True-True-False-None] PASSED                                                                                       [ 54%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-True-True-False-eigvals1] PASSED                                                                                   [ 56%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-True-False-True-None] PASSED                                                                                       [ 57%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-True-False-True-eigvals1] PASSED                                                                                   [ 59%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-True-False-False-None] PASSED                                                                                      [ 60%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-True-False-False-eigvals1] PASSED                                                                                  [ 62%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-False-True-True-None] PASSED                                                                                       [ 64%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-False-True-True-eigvals1] PASSED                                                                                   [ 65%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-False-True-False-None] PASSED                                                                                      [ 67%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-False-True-False-eigvals1] PASSED                                                                                  [ 68%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-False-False-True-None] PASSED                                                                                      [ 70%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-False-False-True-eigvals1] PASSED                                                                                  [ 71%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-False-False-False-None] PASSED                                                                                     [ 73%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-F-False-False-False-eigvals1] PASSED                                                                                 [ 75%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-True-True-True-None] PASSED                                                                                        [ 76%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-True-True-True-eigvals1] PASSED                                                                                    [ 78%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-True-True-False-None] PASSED                                                                                       [ 79%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-True-True-False-eigvals1] PASSED                                                                                   [ 81%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-True-False-True-None] PASSED                                                                                       [ 82%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-True-False-True-eigvals1] PASSED                                                                                   [ 84%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-True-False-False-None] PASSED                                                                                      [ 85%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-True-False-False-eigvals1] PASSED                                                                                  [ 87%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-True-True-None] PASSED                                                                                       [ 89%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-True-True-eigvals1] PASSED                                                                                   [ 90%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-True-False-None] PASSED                                                                                      [ 92%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-True-False-eigvals1] PASSED                                                                                  [ 93%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-False-True-None] PASSED                                                                                      [ 95%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-False-True-eigvals1] PASSED                                                                                  [ 96%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-False-False-None] PASSED                                                                                     [ 98%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-False-False-eigvals1] PASSED                                                                                 [100%]

============================================================================= 64 passed in 0.95 seconds =============================================================================
(scipy-dev) Lucass-MacBook:scipy rlucas$

so @congma if you can do that on your dev box, maybe give that a try?

@rlucas7
Member

rlucas7 commented Mar 22, 2020

@congma actually, perhaps I spoke too soon. I now seem to be getting the varying failure modes that you are seeing too, whereas before I was only seeing the consistent abort trap 6 failure when running the test case.

@rlucas7
Member

rlucas7 commented Mar 22, 2020

Not sure if this is helpful, but posting:

(scipy-dev) Lucass-MacBook:scipy rlucas$ ./runtests.py -vt scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh

... elided bulk of passing tests output ...
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-True-True-None] PASSED                                                                                       [ 89%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-True-True-eigvals1] FAILED                                                                                   [ 90%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-True-False-None] PASSED                                                                                      [ 92%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-True-False-eigvals1] PASSED                                                                                  [ 93%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-False-True-None] PASSED                                                                                      [ 95%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-False-True-eigvals1] PASSED                                                                                  [ 96%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-False-False-None] PASSED                                                                                     [ 98%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-False-False-eigvals1] PASSED                                                                                 [100%]

===================================================================================== FAILURES ======================================================================================
_________________________________________________________________ TestEigh.test_eigh[6-D-False-True-True-eigvals1] __________________________________________________________________
scipy/linalg/tests/test_decomp.py:876: in test_eigh
    diag_ = diag(z.T.conj() @ a @ z).real
E   ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size -2308328180370374638 is different from 6)
======================================================================== 1 failed, 63 passed in 0.99 seconds ========================================================================
(scipy-dev) Lucass-MacBook:scipy rlucas$ 

@rlucas7
Member

rlucas7 commented Mar 22, 2020

So sometimes I'm also getting an abort trap 6 on the test (same as before), and sometimes the full set of parametrized tests passes here but there is still a segfault afterwards:

 ... all tests passing, test output elided...
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-False-False-eigvals1] PASSED                                                                                 [100%]

============================================================================= 64 passed in 0.77 seconds =============================================================================
Fatal Python error: Segmentation fault

Current thread 0x00007fff8e2c5380 (most recent call first):
Segmentation fault: 11

@congma
Contributor Author

congma commented Mar 22, 2020

@rlucas7 great, how did you catch the (apparent) integer overflow? Are there any specific triggers?

@ilayn
Member

ilayn commented Mar 22, 2020

@rlucas7 Could you also run the snippet I've posted above? It's the stripped down version of that test.

@congma
Contributor Author

congma commented Mar 22, 2020

@ilayn , the appearance of a gigantic integer in @rlucas7's test run is only caught when the return value of eigh() (or zheevr) is accessed, which seemed consistent with some of the test runs I did while making the OP.

As long as @rlucas7 and I can test, shall we also change the last bit of the stripped-down test snippet into something like this?

w, z, *others = evr(...)  # Last line of your original snippet
np.diag(z.T.conj() @ a @ z).real  # Or some other way to access w or z

@rlucas7
Member

rlucas7 commented Mar 22, 2020

@rlucas7 great, how did you catch the (apparent) integer overflow? Are there any specific triggers?

I kept repeatedly running the 1-liner:

(scipy-dev) Lucass-MacBook:scipy rlucas$ ./runtests.py -vt scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh

That has happened only that one time; I tried another 20 or so times and could not reproduce it.

NOTE: I somehow mistakenly edited your original post here @congma, sorry about that. I've removed and quoted the response here now.

@rlucas7
Member

rlucas7 commented Mar 22, 2020

@rlucas7 Could you also run the snippet I've posted above? It's the stripped down version of that test.

Sure can, here is the output:

(scipy-dev) Lucass-MacBook:scipy rlucas$ python
Python 3.6.7 |Anaconda, Inc.| (default, Oct 23 2018, 14:01:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> scipy.linalg.lapack.ilaver()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'scipy' is not defined
>>> import scipy
>>> scipy.linalg.lapack.ilaver()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'scipy' has no attribute 'linalg'
>>> import numpy as np
>>> from scipy.linalg.lapack import zheevr as evr, zheevr_lwork as evr_lw, _compute_lwork
>>> a = np.array([[0.57830695+0.j, 0.57732626+0.j, 0.50339199+0.j, 0.49955715+0.j, 0.65422828+0.j, 0.39991909+0.j],
...               [0.57732626+0.j, 0.26035319+0.j, 0.98113963+0.j, 0.63757905+0.j, 0.79092958+0.j, 0.72842587+0.j],
...               [0.50339199+0.j, 0.98113963+0.j, 0.26858524+0.j, 0.61785254+0.j, 0.62372901+0.j, 0.63216722+0.j],
...               [0.49955715+0.j, 0.63757905+0.j, 0.61785254+0.j, 0.78260145+0.j, 0.77017289+0.j, 0.82255765+0.j],
...               [0.65422828+0.j, 0.79092958+0.j, 0.62372901+0.j, 0.77017289+0.j, 0.78081655+0.j, 0.50652208+0.j],
...               [0.39991909+0.j, 0.72842587+0.j, 0.63216722+0.j, 0.82255765+0.j, 0.50652208+0.j, 0.00767085+0.j]],
...              dtype='D', order='F')
>>> lwork, lrwork, liwork = _compute_lwork(evr_lw, n=6, lower=False)
>>> evr(a, compute_v= 1, il=3, iu=5, lower=False, overwrite_a=True, range='I')
(array([-0.01350094,  0.0714959 ,  0.24539661,  0.        ,  0.        ,
        0.        ]), array([[-0.60090663+0.j,  0.17259041+0.j,  0.69638956+0.j],
       [ 0.11969248+0.j, -0.46110847+0.j,  0.02372352+0.j],
       [ 0.04234416+0.j, -0.56691172+0.j, -0.03759849+0.j],
       [-0.28507343+0.j,  0.4913636 +0.j, -0.58940264+0.j],
       [ 0.71896113+0.j,  0.41897405+0.j,  0.28269445+0.j],
       [-0.15690741+0.j, -0.13865491+0.j, -0.29283701+0.j]]), 3, array([], dtype=int32), 0)
>>> import scipy.linalg
>>> scipy.linalg.lapack.ilaver()
(3, 7, 0)
>>> 

The failing test with the large int is an intermittent case, though, so I'm not sure it is something I can consistently reproduce.

@ilayn do you still think this is an MKL issue?

@ilayn
Member

ilayn commented Mar 22, 2020

OK could both of you run again but this time making compute_v=0? Let's see if it is indeed the w, z problem. I have a smell now.

@rlucas7
Member

rlucas7 commented Mar 22, 2020

OK could both of you run again but this time making compute_v=0? Let's see if it is indeed the w, z problem. I have a smell now.

>>> evr(a, compute_v= 0, il=3, iu=5, lower=False, overwrite_a=True, range='I')
(array([-0.51520357, -0.03123833,  2.29315688,  0.        ,  0.        ,
        0.        ]), array([], shape=(0, 0), dtype=complex128), 3, array([], dtype=int32), 0)

not sure it matters but run in the same python session as above. Kept everything else the same.

update: Tried a couple more times and no issues, but when I closed the python session I get a segfault, weird, @ilayn is that what you expect?

>>> evr(a, compute_v= 0, il=3, iu=5, lower=False, overwrite_a=True, range='I')
(array([-0.51520357, -0.03123833,  2.29315688,  0.        ,  0.        ,
        0.        ]), array([], shape=(0, 0), dtype=complex128), 3, array([], dtype=int32), 0)
>>> evr(a, compute_v= 0, il=3, iu=5, lower=False, overwrite_a=True, range='I')
(array([-0.51520357, -0.03123833,  2.29315688,  0.        ,  0.        ,
        0.        ]), array([], shape=(0, 0), dtype=complex128), 3, array([], dtype=int32), 0)
>>> evr(a, compute_v= 0, il=3, iu=5, lower=False, overwrite_a=True, range='I')
(array([-0.51520357, -0.03123833,  2.29315688,  0.        ,  0.        ,
        0.        ]), array([], shape=(0, 0), dtype=complex128), 3, array([], dtype=int32), 0)
>>> 
Segmentation fault: 11
(scipy-dev) Lucass-MacBook:scipy rlucas$ 

@congma
Contributor Author

congma commented Mar 22, 2020

update: Tried a couple more times and no issues, but when I closed the python session I get a segfault, weird

Same here with the first version of the test snippet. Still building scipy with as clean an env as possible and will post later. I'd suggest wrapping the test snippet in a big loop and collecting the output in a log file.
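
The loop-and-log idea could be sketched in Python roughly as follows. This is only a sketch: the helper names `classify` and `tally_runs` are made up for illustration and are not part of scipy, and the signal decoding assumes a POSIX system, where `subprocess` reports a child killed by signal N as return code -N.

```python
import collections
import signal
import subprocess
import sys


def classify(returncode):
    """Map a subprocess return code to a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code -N means the child was killed by signal N.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return "signal %d" % -returncode
    return "exit %d" % returncode


def tally_runs(cmd, n_runs, log_path):
    """Run cmd n_runs times, append all output to log_path, count outcomes."""
    counts = collections.Counter()
    with open(log_path, "a") as log:
        for i in range(1, n_runs + 1):
            proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                                  stderr=subprocess.STDOUT,
                                  universal_newlines=True)
            label = classify(proc.returncode)
            counts[label] += 1
            # Keep the full output of every run so rare failures can be inspected later.
            log.write("=== run %d: %s ===\n%s\n" % (i, label, proc.stdout))
    return counts
```

Something like `tally_runs([sys.executable, "snippet.py"], 100, "runs.log")`, where `snippet.py` is a placeholder for the file holding the test snippet, would then return a counter with entries such as `ok`, `SIGSEGV` and `SIGABRT`.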

@rlucas7
Member

rlucas7 commented Mar 22, 2020

update: Tried a couple more times and no issues, but when I closed the python session I get a segfault, weird

Same here with the first version of the test snippet. Still building scipy with as clean an env as possible and will post later. I'd suggest wrapping the test snippet in a big loop and collecting the output in a log file.

I'm getting segfaults with both compute_v values, i.e. 0 and 1:

(scipy-dev) Lucass-MacBook:scipy rlucas$ for i in `seq 1 20`; do  echo $i; python tmp.py;  done
1
2
3
4
5
6
7
8
9
10
11
12
13
Segmentation fault: 11
14
15
16
17
18
19
20
Segmentation fault: 11
(scipy-dev) Lucass-MacBook:scipy rlucas$ for i in `seq 1 20`; do  echo $i; python tmp1.py;  done
1
2
Segmentation fault: 11
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Segmentation fault: 11
19
20
(scipy-dev) Lucass-MacBook:scipy rlucas$ diff tmp.py tmp1.py 
13c13
< evr(a, compute_v=0, il=3, iu=5, lower=False, overwrite_a=True, range='I')
---
> evr(a, compute_v= 1, il=3, iu=5, lower=False, overwrite_a=True, range='I')
(scipy-dev) Lucass-MacBook:scipy rlucas$ 

@congma
Contributor Author

congma commented Mar 22, 2020

OK could both of you run again but this time making compute_v=0? Let's see if it is indeed the w, z problem. I have a smell now.

Ran 100 times, out of which 84 exited without errors and with apparently correct values, 1 aborted, and 15 segfaulted. Haven't got the one like in #11709 (comment) yet.

@congma
Contributor Author

congma commented Mar 22, 2020

I'm running another 5000 tests and already seeing some weird overflow-like Python exceptions, but here's also a question regarding the test script:

w, z = eigh(a, overwrite_a=overwrite, lower=lower, eigvals=eigvals)
assert_dtype_equal(z.dtype, dtype_)
w = w.astype(dtype_)
diag_ = diag(z.T.conj() @ a @ z).real

If the code at line 873 allows the array a to be overwritten, why is a used again in the matrix computation at line 876?
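
A minimal sketch of how such a check could avoid relying on `a` after the call is shown below. It is not the actual test code: the matrix construction and the name `a_ref` are assumptions made for illustration. Only `eigh` and its `overwrite_a` and `lower` arguments come from the snippet above.

```python
import numpy as np
from scipy.linalg import eigh

# Build a random 6x6 Hermitian matrix (illustrative, not the test's helper).
rng = np.random.RandomState(0)
m = rng.rand(6, 6) + 1j * rng.rand(6, 6)
a = np.asfortranarray(m + m.conj().T)

# Keep a pristine copy, because `a` may be destroyed by the call below.
a_ref = a.copy()

w, z = eigh(a, overwrite_a=True, lower=False)

# After the call, treat the contents of `a` as unspecified and
# verify the decomposition against the copy instead.
diag_ = np.diag(z.T.conj() @ a_ref @ z).real
print(np.allclose(diag_, w, rtol=0.0, atol=1e-11))
```

Whether `a` is really modified depends on its dtype and memory layout, so a check written against `a` itself can pass on some inputs and fail on others.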

@ilayn
Member

ilayn commented Mar 22, 2020

This is really weird. Can you also print the in-between variables?

@congma
Contributor Author

congma commented Mar 22, 2020

Thank you all for the ideas. I'll go on testing tomorrow and hopefully will find something of interest to scipy (instead of chasing weird MKL bugs).

@congma
Contributor Author

congma commented Mar 23, 2020

I'm running another 5000 tests and already seeing some weird overflow-like Python exceptions

The script I ran was identical to the first test snippet (with compute_v=True), except that the return values of evr() were saved and printed, and the matrix product z.T.conj() @ a @ z was printed as well.

The tally from those runs was:

The three runs that captured Python errors:

Traceback (most recent call last):
  File "ilayn.py", line 18, in <module>
    print(np.diag(z.T.conj() @ a @ z).real)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 71163161 is different from 6)
Traceback (most recent call last):
  File "ilayn.py", line 18, in <module>
    print(np.diag(z.T.conj() @ a @ z).real)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 6 is different from 3)
Traceback (most recent call last):
  File "ilayn.py", line 18, in <module>
    print(np.diag(z.T.conj() @ a @ z).real)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 6 is different from 3)

@congma
Contributor Author

congma commented Mar 23, 2020

This is really weird. Can you also print the in-between variables?

Which in-between variables would help the most?

@congma
Contributor Author

congma commented Mar 23, 2020

Got a potentially meaningful error message:

import numpy as np
from scipy.linalg.lapack import zheevr as evr, zheevr_lwork as evr_lw, _compute_lwork

a = np.array([[0.57830695+0.j, 0.57732626+0.j, 0.50339199+0.j, 0.49955715+0.j, 0.65422828+0.j, 0.39991909+0.j],
              [0.57732626+0.j, 0.26035319+0.j, 0.98113963+0.j, 0.63757905+0.j, 0.79092958+0.j, 0.72842587+0.j],
              [0.50339199+0.j, 0.98113963+0.j, 0.26858524+0.j, 0.61785254+0.j, 0.62372901+0.j, 0.63216722+0.j],
              [0.49955715+0.j, 0.63757905+0.j, 0.61785254+0.j, 0.78260145+0.j, 0.77017289+0.j, 0.82255765+0.j],
              [0.65422828+0.j, 0.79092958+0.j, 0.62372901+0.j, 0.77017289+0.j, 0.78081655+0.j, 0.50652208+0.j],
              [0.39991909+0.j, 0.72842587+0.j, 0.63216722+0.j, 0.82255765+0.j, 0.50652208+0.j, 0.00767085+0.j]],
             dtype='D', order='F')
ap = a.copy()

lwork, lrwork, liwork = _compute_lwork(evr_lw, n=6, lower=False)
print("lwork parameters: %r" % dict(lwork=lwork, lrwork=lrwork, liwork=liwork))
w, z, *others = evr(a, compute_v=1, il=3, iu=5, lower=False, overwrite_a=True, range='I')
print("shape w:", w.shape)
print("shape z:", z.shape)
print("w: ", w)
print("z: ", z)
print("z.H @ a:", z.T.conj() @ ap)
l = z.T.conj() @ ap
print("z.H @ a @ z:", l @ z)

Output from a run:

lwork parameters: {'lwork': 12, 'lrwork': 144, 'liwork': 60}
shape w: (6,)
shape z: (6, 3)
w:  [-0.01350094  0.0714959   0.24539661  0.          0.          0.        ]
z:  [[-0.60090663+0.j  0.17259041+0.j  0.69638956+0.j]
 [ 0.11969248+0.j -0.46110847+0.j  0.02372352+0.j]
 [ 0.04234416+0.j -0.56691172+0.j -0.03759849+0.j]
 [-0.28507343+0.j  0.4913636 +0.j -0.58940264+0.j]
 [ 0.71896113+0.j  0.41897405+0.j  0.28269445+0.j]
 [-0.15690741+0.j -0.13865491+0.j -0.29283701+0.j]]
z.H @ a: [[ 0.00811281+0.j -0.00161596+0.j -0.00057169+0.j  0.00384876+0.j
  -0.00970665+0.j  0.0021184 +0.j]
 [ 0.01233951+0.j -0.03296736+0.j -0.04053186+0.j  0.03513048+0.j
   0.02995493+0.j -0.00991326+0.j]
 [ 0.17089164+0.j  0.00582167+0.j -0.00922654+0.j -0.14463741+0.j
   0.06937226+0.j -0.07186121+0.j]]
z.H @ a @ z: [[-1.35009443e-02+0.j -9.89334482e-17+0.j  4.25007252e-17+0.j]
 [-8.02309608e-17+0.j  7.14958992e-02+0.j -4.64038530e-17+0.j]
 [ 3.12250226e-17+0.j -7.63278329e-17+0.j  2.45396612e-01+0.j]]
python(17997,0x1132a6dc0) malloc: *** error for object 0x29416aa800000000: pointer being freed was not allocated
python(17997,0x1132a6dc0) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6

This suggests a free() of a pointer that was never returned by malloc(), but I'm not sure where it happens.
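
To narrow down where the abort is raised, at least on the Python side, the standard-library `faulthandler` module may help: once enabled, it dumps the Python traceback when the process receives SIGSEGV, SIGABRT and similar signals. The sketch below uses `os.abort()` in a child process as a stand-in for the real crash. The `CHILD` program and the `run_child` helper are illustrative only, and a POSIX system is assumed.

```python
import signal
import subprocess
import sys

# Child program: enable faulthandler, then die from SIGABRT via os.abort(),
# which stands in for the crash inside the real test snippet.
CHILD = (
    "import faulthandler, os\n"
    "faulthandler.enable()\n"
    "def crash():\n"
    "    os.abort()\n"
    "crash()\n"
)


def run_child():
    """Run the child program and return its return code and stderr text."""
    proc = subprocess.run([sys.executable, "-c", CHILD],
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    return proc.returncode, proc.stderr


returncode, stderr = run_child()
print("return code:", returncode)
print(stderr)
```

The same behaviour should be available without editing a script by starting it with `python -X faulthandler`. If the dump lists no Python frames, that would suggest the crash happens outside of any running Python code, for example during interpreter shutdown.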

@congma
Contributor Author

congma commented Mar 23, 2020

Two additional Python exceptions with strange shape information:

Traceback (most recent call last):
  File "ilayn.py", line 20, in <module>
    print("z.H @ a:", z.T.conj() @ ap)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 44154137 is different from 6)
Traceback (most recent call last):
  File "ilayn.py", line 20, in <module>
    print("z.H @ a:", z.T.conj() @ ap)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 17050905 is different from 6)

Another 5000 runs, with

  • 2 Python ValueError (see above)
  • 70 Aborted
  • 630 Segfaulted

@ilayn
Member

ilayn commented Mar 23, 2020

It's not the failures but actually the runs that pass that worry me more. This signals that I have to dig into the LAPACK wrappers instead, because this really doesn't make any sense.

The returned array sizes are almost random. I don't see any underflow pattern, so they are probably random memory values, which implies that f2py is not returning a proper object, which in turn means our wrappers are not correct.

These tests were converted to pytest decorators very recently, and maybe we have discovered a bug that was not tested properly. I'll check this properly on a Linux box (I'm on Win10) when I can.
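
One way to look for a pattern in the reported sizes is to print them as 64-bit two's-complement bit patterns. The helper `to_hex64` below is a hypothetical name used only for this sketch. The sizes are the ones quoted in the matmul error messages in this thread, and the sketch draws no conclusion about what they mean.

```python
def to_hex64(value):
    """Format an integer as its 64-bit two's-complement bit pattern."""
    return "0x%016x" % (value & 0xFFFFFFFFFFFFFFFF)


# Sizes reported in the matmul ValueError messages in this thread.
reported_sizes = [-2308328180370374638, 71163161, 44154137, 17050905]

for size in reported_sizes:
    print("%22d  %s" % (size, to_hex64(size)))
```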

@andyfaff
Contributor

I'll check this properly on a Linux box (I'm on Win10) when I can.

docker is your friend here... reproducibility in setup every time.

docker run -it --rm -v $PWD/:/home/scipy scipy/scipy-dev /bin/bash

@congma
Contributor Author

congma commented Mar 23, 2020

This may well be more noise, but just in case it may help, I created a stripped down version in pure C that mostly performs the LAPACK operations done by the Python test snippet, in the hope that one may verify whether this is an underlying C-level MKL issue. The program appears to run fine without any problem. I ran the program in loops that repeat 5000 times per loop, and I have yet to run into a crash.


The C program is as follows:

#include <stdio.h>
#include "mkl.h"


#define ARR_SIZE	(36)
#define ARR_DIM		(6)
#define ARR_SUPP	(12)
#define ARR_C		(18)


static void print_matrix_colmajor(const lapack_complex_double * restrict a,
                                  lapack_int nr, lapack_int nc)
{
    int i, j;
    for ( i = 0; i < nr; ++i )
    {
	for ( j = 0; j < nc; ++j )
	{
	    lapack_complex_double ze;
	    int k;
	    k = i + j * nr;
	    ze = a[k];
	    printf(" %# .8f%+.8gj", ze.real, ze.imag);
	}
	printf(",\n");
    }
}


int main(int argc, char * argv[])
{
    /* Initial input matrix data. */
    const double ar[ARR_SIZE] =
    {
	0.57830695, 0.57732626, 0.50339199, 0.49955715, 0.65422828, 0.39991909,
	0.57732626, 0.26035319, 0.98113963, 0.63757905, 0.79092958, 0.72842587,
	0.50339199, 0.98113963, 0.26858524, 0.61785254, 0.62372901, 0.63216722,
	0.49955715, 0.63757905, 0.61785254, 0.78260145, 0.77017289, 0.82255765,
	0.65422828, 0.79092958, 0.62372901, 0.77017289, 0.78081655, 0.50652208,
	0.39991909, 0.72842587, 0.63216722, 0.82255765, 0.50652208, 0.00767085
    };
    /* Complexified input matrix */
    lapack_complex_double a[ARR_SIZE];
    lapack_complex_double ap[ARR_SIZE];
    int i;

    /* Create the input matrix with the correct type that will be fed to
     * zheevr() */
    for ( i = 0; i < ARR_SIZE; ++i )
    {
	a[i].real = ar[i];
	a[i].imag = 0.0;
	ap[i].real = ar[i];
	ap[i].imag = 0.0;
    }

    lapack_int m, info;
    double w[ARR_DIM];
    lapack_complex_double z[ARR_SIZE];
    lapack_int isuppz[ARR_SUPP];

    info = LAPACKE_zheevr(LAPACK_COL_MAJOR, 'V', 'I', 'U', ARR_DIM, a, ARR_DIM,
	    0.0, 0.0, 3, 5, 0.0, &m, w, z, ARR_DIM, isuppz);

    printf("info: %d\n", info);
    printf("m: %d\n", m);
    printf("w: [");
    for ( i = 0; i < m; ++i )
    {
	printf(" %# .8f", w[i]);
    }
    printf("]\n");

    printf("z: [");
    print_matrix_colmajor(z, ARR_DIM, m);
    printf("]\n");

    /* Calculate the matrix-matrix product z.H @ a */
    lapack_complex_double alpha = {1.0, 0.0};
    lapack_complex_double beta = {0.0, 0.0};
    lapack_complex_double c[ARR_C];

    cblas_zgemm(CblasColMajor, CblasConjTrans, CblasNoTrans,
	    m, ARR_DIM, ARR_DIM,
	    &alpha, z, ARR_DIM, ap, ARR_DIM,
	    &beta, c, m);

    printf("c = z.H @ a: [");
    print_matrix_colmajor(c, m, ARR_DIM);
    printf("]\n");
}

Running the compiled program linked against MKL 2019.4 gives the following output:

info: 0
m: 3
w: [ -0.01350094  0.07149590  0.24539661]
z: [ -0.60090663+0j  0.17259041+0j  0.69638956+0j,
  0.11969248+0j -0.46110847+0j  0.02372352+0j,
  0.04234416+0j -0.56691172+0j -0.03759849+0j,
 -0.28507343+0j  0.49136360+0j -0.58940264+0j,
  0.71896113+0j  0.41897405+0j  0.28269445+0j,
 -0.15690741+0j -0.13865491+0j -0.29283701+0j,
]
c = z.H @ a: [  0.00811281+0j -0.00161596+0j -0.00057169+0j  0.00384876+0j -0.00970665+0j  0.00211840+0j,
  0.01233951+0j -0.03296736+0j -0.04053186+0j  0.03513048+0j  0.02995493+0j -0.00991326+0j,
  0.17089164+0j  0.00582167+0j -0.00922654+0j -0.14463741+0j  0.06937226+0j -0.07186121+0j,
]

@ilayn
Member

ilayn commented Mar 23, 2020

@congma Could you please open the file scipy/scipy/linalg/flapack_sym_herm.pyf.src and find line 843

<ftype2> intent(hide),dimension(lrwork) :: rwork

and change it to

<ftype2> intent(hide),dimension(lrwork),depend(lrwork) :: rwork

and then build and test again?

@rlucas7
Member

rlucas7 commented Mar 24, 2020

@ilayn Thanks for looking further into the issue.
I tried the change you proposed and am still getting errors.
The same linalg test is failing when I run the tests (after the change), and I'm still getting segfaults.

# changed file
(scipy-dev) Lucass-MacBook:scipy rlucas$ git add -p . 
diff --git a/scipy/linalg/flapack_sym_herm.pyf.src b/scipy/linalg/flapack_sym_herm.pyf.src
index d3ac672b5..e6c8f23a4 100644
--- a/scipy/linalg/flapack_sym_herm.pyf.src
+++ b/scipy/linalg/flapack_sym_herm.pyf.src
@@ -840,7 +840,7 @@ subroutine <prefix2c>heevr(compute_v,range,lower,n,a,lda,vl,vu,il,iu,abstol,w,z,
     integer intent(hide),depend(n) :: lda=max(1,n)
     integer intent(hide),depend(z) :: ldz=max(1,shape(z,0))
     <ftype2c>  intent(hide),dimension(lwork),depend(lwork) :: work
-    <ftype2> intent(hide),dimension(lrwork) :: rwork
+    <ftype2> intent(hide),dimension(lrwork),depend(lrwork)  :: rwork
     integer intent(hide),dimension(liwork),depend(liwork) :: iwork
 
     <ftype2> intent(out),dimension(n),depend(n) :: w
Stage this hunk [y,n,q,a,d,e,?]? n

# redid build (from test runner)

(scipy-dev) Lucass-MacBook:scipy rlucas$ python runtests.py -v -s special
Building, see build.log...
Build OK (0:00:08.037006 elapsed)
================================================================================ test session starts ================================================================================
platform darwin -- Python 3.6.7, pytest-5.0.1, py-1.8.1, pluggy-0.13.1 -- /Users/rlucas/anaconda3/envs/scipy-dev/bin/python
cachedir: .pytest_cache
rootdir: /Users/rlucas/scipy-dev/scipy, inifile: pytest.ini
collected 1709 items / 179 deselected / 1530 selected                                                                                                                               

scipy/special/tests/test_basic.py::TestCephes::test_airy PASSED                                                                                                               [  0%]
scipy/special/tests/test_basic.py::TestCephes::test_airye PASSED                                                                                                              [  0%]
scipy/special/tests/test_basic.py::TestCephes::test_binom PASSED                                                                                                              [  0%]
scipy/special/tests/test_basic.py::TestCephes::test_binom_2 PASSED                                                                                                            [  0%]
scipy/special/tests/test_basic.py::TestCephes::test_binom_exact PASSED
... output elided
scipy/special/tests/test_zeta.py::test_riemann_zeta_special_cases PASSED                                                                                                      [ 99%]
scipy/special/tests/test_zeta.py::test_riemann_zeta_avoid_overflow PASSED                                                                                                     [100%]

================================================== 1482 passed, 24 skipped, 179 deselected, 23 xfailed, 1 xpassed in 14.05 seconds ==================================================
(scipy-dev

# now rerun script... 

(scipy-dev) Lucass-MacBook:scipy rlucas$ for i in `seq 1 20`; do  echo $i; python tmp.py;  done
1
2
3
4
5
6
7
8
Segmentation fault: 11
9
10
11
12
Segmentation fault: 11
13
14
15
16
17
18
19
20
(scipy-dev)
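
The shell loop above shows only that some runs die. A small harness can tally how each run ended. This is a minimal sketch, not part of the original report: the helper names `classify` and `run_repeatedly` are made up for illustration, and it assumes a POSIX system, where a negative return code means the child process was killed by that signal.

```python
import signal
import subprocess
import sys
from collections import Counter


def classify(returncode):
    """Map a subprocess return code to a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by that signal.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return "signal %d" % -returncode
    return "exit %d" % returncode


def run_repeatedly(cmd, runs=20):
    """Run cmd `runs` times and count how each run ended."""
    outcomes = Counter()
    for _ in range(runs):
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        outcomes[classify(proc.returncode)] += 1
    return outcomes
```

For the loop above, the call would be `run_repeatedly([sys.executable, "tmp.py"], runs=20)`, which returns a `Counter` of labels such as `ok` and `SIGSEGV`.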

### also still getting intermittent segfaults

# ending output of: `runtests.py -v -s linalg`

scipy/linalg/tests/test_special_matrices.py::test_fiedler PASSED                                                                                                              [ 99%]
scipy/linalg/tests/test_special_matrices.py::test_fiedler_companion PASSED                                                                                                    [100%]

==================================================== 1708 passed, 4 skipped, 5 deselected, 2 xfailed, 3 xpassed in 24.63 seconds ====================================================
Fatal Python error: Segmentation fault

Current thread 0x00007fff8e2c5380 (most recent call first):
Segmentation fault: 11

not sure if @congma is seeing similar.

@congma
Contributor Author

congma commented Mar 24, 2020

and then build and test again?

not sure if @congma is seeing similar.

Yes I am. While the loop is running, a glimpse shows that the random crashes were still there and not substantially different from the earlier ones.

@congma
Contributor Author

congma commented Mar 24, 2020

One more data point amid the noise: I now have a Python process that is apparently stuck and doing nothing (consuming 0% CPU time) while executing the test snippet. I may have to kill it later, but I have no idea why it is doing what it's doing (i.e. nothing).

EDIT: Sorry, please disregard that. I didn't make the correct change specified in #11709 (comment)

EDIT2: Sorry, disregard the disregard. The stuck-process problem still happens (in my current setting, at the 2333rd of 5000 runs).

Also a new ValueError was caught during the runs:

Traceback (most recent call last):
  File "ilayn.py", line 20, in <module>
    print("z.H @ a:", z.T.conj() @ ap)
ValueError: A global iterator flag was passed as a per-operand flag to the iterator constructor
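
A long loop like this can hit three failure modes: a crash, a hang, or a Python exception. Running each iteration with a timeout makes a hang visible instead of stalling the whole loop. The sketch below is an illustration only: the helper name `run_with_timeout` and the 60-second default are assumptions, not anything used in the runs above.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=60.0):
    """Run cmd once; return (status, stderr_text).

    status is 'ok', 'failed' (non-zero exit or killed by a signal),
    or 'hung' (no exit within timeout_s seconds).
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising on timeout.
        return "hung", ""
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stderr
```

The captured stderr keeps tracebacks such as the `ValueError` above, so they can be logged alongside the run number.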

@congma
Contributor Author

congma commented Mar 24, 2020

There's a slight hint that I might have hit something when playing around with the f2py wrapper for ?heevr.

In the wrapper, I changed the declaration and attributes for the isuppz array to something much simpler:

integer intent(out),dimension((compute_v?2*max(1,n):0)),depend(n,compute_v) :: isuppz

In the Python test snippet for zheevr, I also added a print() statement to show all the return values from the evr() call (including the isuppz output array), not just w and z. The dimension for isuppz in the modified f2py file is intended to err on the side of being large enough.

This time, the crashes seemed to be gone. The isuppz output now looks like the following:

array([          1,           1,           0, XXXXXXXX,           0,
                 0,           0,           0,           0,           0,
                 0,           0], dtype=int32)

where XXXXXXXX is a seemingly random integer that differs in each run.


Rationale

To quote verbatim from the netlib LAPACK docs:

          ISUPPZ is INTEGER array, dimension ( 2*max(1,M) )
          The support of the eigenvectors in Z, i.e., the indices
          indicating the nonzero elements in Z. The i-th eigenvector
          is nonzero only in elements ISUPPZ( 2*i-1 ) through
          ISUPPZ( 2*i ). This is an output of ZSTEMR (tridiagonal
          matrix). The support of the eigenvectors of A is typically
          1:N because of the unitary transformations applied by ZUNMTR.
          Implemented only for RANGE = 'A' or 'I' and IU - IL = N - 1

Notice that the doc says "Implemented only for RANGE = 'A' or 'I' and IU - IL = N - 1" (emphasis mine). It says nothing about the conditions under which the isuppz array is not dereferenced at all. I suspect that with certain LAPACK implementations, the isuppz array might be used internally anyway, so its dimensions might have to match (otherwise out-of-bounds access might occur).
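
To make the sizing argument concrete, here is a rough sketch comparing the length the netlib docs state, 2*max(1,M), with the conservative length from the modified f2py declaration, 2*max(1,n) when eigenvectors are requested. The function names are made up for illustration. Since M (the number of eigenvalues found) can never exceed n, the conservative length is always at least the documented one.

```python
def isuppz_length_documented(m):
    """Length stated in the netlib docs: 2*max(1, M)."""
    return 2 * max(1, m)


def isuppz_length_conservative(n, compute_v=True):
    """Mirror of the modified f2py expression (compute_v ? 2*max(1,n) : 0)."""
    return 2 * max(1, n) if compute_v else 0


# For a 6x6 input, the conservative length is 12 entries.
print(isuppz_length_conservative(6))
```

A length of 12 is consistent with the 12-element isuppz array printed above, assuming that run used a 6x6 matrix.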

@congma
Contributor Author

congma commented Mar 24, 2020

Before I have to wrap up for the day (brain hardly functioning right now), for reference here's the f2py wrappers as they currently stand on my side. I realized I've made other changes too, and I don't want to mislead anyone working on this. Notice I changed the attributes for ldz and z too.

subroutine <prefix2c>heevr(compute_v,range,lower,n,a,lda,vl,vu,il,iu,abstol,w,z,m,ldz,isuppz,work,lwork,rwork,lrwork,iwork,liwork,info)
    ! Standard Symmetric/Hermitian Eigenvalue Problem
    ! Complex - Single precision / Double precision
    !
    ! if jobz = 'N' there are no eigvecs hence 0x0 'z' returned
    ! if jobz = 'V' and range = 'A', z is (nxn)
    ! if jobz = 'V' and range = 'V', z is (nxn) since returned number of eigs is unknown beforehand
    ! if jobz = 'V' and range = 'I', z is (nx(iu-il+1))

    callstatement (*f2py_func)((compute_v?"V":"N"),range,(lower?"L":"U"),&n,a,&lda,&vl,&vu,&il,&iu,&abstol,&m,w,z,&ldz,isuppz,work,&lwork,rwork,&lrwork,iwork,&liwork,&info)
    callprotoargument char*,char*,char*,int*,<ctype2c>*,int*,<ctype2>*,<ctype2>*,int*,int*,<ctype2>*,int*,<ctype2>*,<ctype2c>*,int*,int*,<ctype2c>*,int*,<ctype2>*,int*,int*,int*,int*

    <ftype2c> intent(in,copy,aligned8),check(shape(a,0)==shape(a,1)),dimension(n,n) :: a
    integer optional,intent(in),check(compute_v==1||compute_v==0):: compute_v = 1
    character optional,intent(in),check(*range=='A'||*range=='V' ||*range=='I') :: range='A'
    integer optional,intent(in),check(lower==0||lower==1) :: lower = 0
    integer optional,intent(in) :: il=1
    integer optional,intent(in),depend(a) :: iu=shape(a,0)
    <ftype2> optional,intent(in) :: vl=0.0
    <ftype2> optional,intent(in),check(vu>vl),depend(vl) :: vu=1.0
    <ftype2> intent(in) :: abstol=0.0
    integer optional,intent(in),depend(a),check(lwork>=max(2*shape(a,0),1)||lwork==-1) :: lwork=max(2*shape(a,0),1)
    integer optional,intent(in),depend(a),check(lrwork>=max(24*shape(a,0),1)||lrwork==-1) :: lrwork=max(24*shape(a,0),1)
    integer optional,intent(in),depend(a),check(liwork>=max(1,10*shape(a,0))||liwork==-1):: liwork= max(1,10*shape(a,0))

    integer intent(hide),depend(a) :: n=shape(a,0)
    integer intent(hide),depend(n) :: lda=max(1,n)
    integer intent(hide),depend(n) :: ldz=(compute_v?max(1,n):1)
    <ftype2c>  intent(hide),dimension(lwork),depend(lwork) :: work
    <ftype2> intent(hide),dimension(lrwork),depend(lrwork) :: rwork
    integer intent(hide),dimension(liwork),depend(liwork) :: iwork

    <ftype2> intent(out),dimension(n),depend(n) :: w
    <ftype2c> intent(out),dimension((compute_v?ldz:0),(compute_v?(*range=='I'?iu-il+1:MAX(1,n)):0)),depend(n,ldz,compute_v,range,iu,il) :: z
    integer intent(out) :: m
    ! Only returned if range=='A' or range=='I' and il, iu = 1, n
    integer intent(out),dimension((compute_v?2*max(1,n):0)),depend(n,compute_v) :: isuppz
    integer intent(out) :: info

end subroutine <prefix2c>heevr

subroutine <prefix2c>heevr_lwork(n,lower,a,lda,vl,vu,il,iu,abstol,m,w,z,ldz,isuppz,work,lwork,rwork,lrwork,iwork,liwork,info)
    ! LWORK routines for (c/z)heevr
    fortranname <prefix2c>heevr
    callstatement (*f2py_func)("N","A",(lower?"L":"U"),&n,&a,&lda,&vl,&vu,&il,&iu,&abstol,&m,&w,&z,&ldz,&isuppz,&work,&lwork,&rwork,&lrwork,&iwork,&liwork,&info)
    callprotoargument char*,char*,char*,int*,<ctype2c>*,int*,<ctype2>*,<ctype2>*,int*,int*,<ctype2>*,int*,<ctype2>*,<ctype2c>*,int*,int*,<ctype2c>*,int*,<ctype2>*,int*,int*,int*,int*

    ! Inputs
    integer intent(in):: n
    integer optional,intent(in),check(lower==0||lower==1) :: lower = 0
    ! Not referenced
    <ftype2c> intent(hide) :: a
    integer intent(hide),depend(n) :: lda = max(1,n)
    <ftype2> intent(hide) :: vl=0.
    <ftype2> intent(hide) :: vu=1.
    integer intent(hide) :: il=1
    integer intent(hide) :: iu=2
    <ftype2> intent(hide) :: abstol=0.
    integer  intent(hide) :: m=1
    <ftype2> intent(hide) :: w
    <ftype2c> intent(hide) :: z
    integer intent(hide),depend(n):: ldz = max(1,n)
    integer intent(hide) :: isuppz
    integer intent(hide) :: lwork = -1
    integer intent(hide) :: lrwork = -1
    integer intent(hide) :: liwork = -1
    ! Outputs
    <ftype2c> intent(out) :: work
    <ftype2> intent(out) :: rwork
    integer intent(out) :: iwork
    integer intent(out) :: info

end subroutine <prefix2c>heevr_lwork
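
For readers who find the nested ternaries in the `z` declaration hard to parse, the shape logic in the heevr wrapper above can be sketched in plain Python. The function name `z_shape` is hypothetical; the logic only transcribes the `ldz` and `dimension(...)` expressions from the wrapper and is not an independent statement of what LAPACK requires.

```python
def z_shape(n, compute_v=True, range_="A", il=1, iu=None):
    """Shape of the z output array as declared in the wrapper above.

    rows: ldz when eigenvectors are requested, else 0
    cols: iu - il + 1 for range 'I', max(1, n) otherwise, 0 without eigenvectors
    """
    if iu is None:
        iu = n  # wrapper default: iu = shape(a, 0)
    if not compute_v:
        return (0, 0)
    ldz = max(1, n)
    cols = iu - il + 1 if range_ == "I" else max(1, n)
    return (ldz, cols)
```

For example, `z_shape(6)` gives `(6, 6)`, and `z_shape(6, range_="I", il=2, iu=4)` gives `(6, 3)`.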

congma added a commit to congma/scipy that referenced this issue Mar 28, 2020
In the LAPACK wrapper for ?heevr, enlarge the size of the isuppz array
parameter in order to prevent out-of-bound access by the underlying
library. For both ?heevr and ?syevr, more robust checks and default
values are also put in place.

Reference: scipy#11709
@ilayn
Member

ilayn commented Mar 28, 2020

I am going to check now, but if what you found is the culprit then this needs to be reported to MKL, because the reference implementation and OpenBLAS don't show this behavior.

@charris
Member

charris commented Mar 28, 2020

@oleksandr-pavlyk Ping.

@congma
Contributor Author

congma commented Mar 28, 2020

I am going to check now, but if what you found is the culprit then this needs to be reported to MKL, because the reference implementation and OpenBLAS don't show this behavior.

In that case, is there a method to selectively compile the wrapper as a stopgap measure?

@ilayn
Member

ilayn commented Mar 28, 2020

That's really something we want to avoid, because every implementation flavor has different problems and it bloats the code base, not to mention the maintenance burden.

MKL's implementation of ?stemr is somehow missing the alleig switch according to your experiments.

@congma
Contributor Author

congma commented Mar 29, 2020

Interesting observation! I think a more actionable issue for MKL is that it violates its own documentation (contract) of zheevr, which did state that the array was meant to be "[r]eferenced only if eigenvectors are needed (jobz = 'V') and all eigenvalues are needed, that is, range = 'A' or range = 'I' and il = 1 and iu = n."

@ilayn
Member

ilayn commented Apr 5, 2020

I don't know if the original poster Zoe is anyone from this thread, but apparently there is an open MKL bug ticket for exactly the same reason.

https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/851351

@oleksandr-pavlyk Do you mind nudging this ticket if possible ?

@oleksandr-pavlyk
Contributor

@ilayn Done.

@ilayn
Member

ilayn commented Apr 29, 2020

The Intel team confirmed the bug and included the fix in the upcoming MKL 2020 Update 2.

@ilayn ilayn added this to the 1.5.0 milestone May 6, 2020
@ilayn
Member

ilayn commented Aug 5, 2020

For the record, the fix is included in the official release. Last week
