eigh() tests fail to pass, crash Python with seemingly random pattern #11709
Comments
I think we first need to isolate that piece of code and run it externally, without the test. The tests use a fixed seed, so it doesn't make any sense that they succeed sometimes and fail other times.

What do you get with the following?

import numpy as np
from scipy.linalg.lapack import zheevr as evr, zheevr_lwork as evr_lw, _compute_lwork
a = np.array([[0.57830695+0.j, 0.57732626+0.j, 0.50339199+0.j, 0.49955715+0.j, 0.65422828+0.j, 0.39991909+0.j],
[0.57732626+0.j, 0.26035319+0.j, 0.98113963+0.j, 0.63757905+0.j, 0.79092958+0.j, 0.72842587+0.j],
[0.50339199+0.j, 0.98113963+0.j, 0.26858524+0.j, 0.61785254+0.j, 0.62372901+0.j, 0.63216722+0.j],
[0.49955715+0.j, 0.63757905+0.j, 0.61785254+0.j, 0.78260145+0.j, 0.77017289+0.j, 0.82255765+0.j],
[0.65422828+0.j, 0.79092958+0.j, 0.62372901+0.j, 0.77017289+0.j, 0.78081655+0.j, 0.50652208+0.j],
[0.39991909+0.j, 0.72842587+0.j, 0.63216722+0.j, 0.82255765+0.j, 0.50652208+0.j, 0.00767085+0.j]],
dtype='D', order='F')
lwork, lrwork, liwork = _compute_lwork(evr_lw, n=6, lower=False)
evr(a, compute_v=1, il=3, iu=5, lower=False, overwrite_a=True, range='I')

I'm on LAPACK 3.9.0 with OpenBLAS.
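As an independent sanity check (not part of the original exchange), the il=3..iu=5 eigenvalues of this exact matrix can also be obtained with `numpy.linalg.eigh`, which bypasses SciPy's `zheevr` wrapper entirely; a healthy run of the snippet above should return the same three values:

```python
import numpy as np

# The same 6x6 real-symmetric (hence Hermitian) matrix as in the snippet above.
a = np.array([[0.57830695, 0.57732626, 0.50339199, 0.49955715, 0.65422828, 0.39991909],
              [0.57732626, 0.26035319, 0.98113963, 0.63757905, 0.79092958, 0.72842587],
              [0.50339199, 0.98113963, 0.26858524, 0.61785254, 0.62372901, 0.63216722],
              [0.49955715, 0.63757905, 0.61785254, 0.78260145, 0.77017289, 0.82255765],
              [0.65422828, 0.79092958, 0.62372901, 0.77017289, 0.78081655, 0.50652208],
              [0.39991909, 0.72842587, 0.63216722, 0.82255765, 0.50652208, 0.00767085]],
             dtype=complex)

w_ref = np.linalg.eigh(a)[0]   # full spectrum, sorted ascending
# LAPACK's il=3, iu=5 are 1-based, so they select the 3rd..5th smallest
# eigenvalues, i.e. indices 2..4 of w_ref.
print(w_ref[2:5])
```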
I run the script with master version of
I was also testing running eigh on the complex array.
I'm actually out of ideas. Random crashing cannot be related to SciPy as far as I can see. How do you build SciPy from the master branch?
export NPY_DISTUTILS_APPEND_FLAGS=1
export SCIPY_USE_G77_ABI_WRAPPER=1
./runtests

conda is also doing some tricks with compiler flags by default when I activate an environment, and these are the ones I think are relevant.
The Fortran compiler is gfortran-9.3.0.
I am not familiar with the macOS stuff, even less with conda envs, but `runtests.py` won't upgrade your installed scipy; it only builds it. After the build you need to install the package somehow. Hence the current master version should give:

import scipy as sp
print(sp.__version__)
1.5.0.dev0+ae34ce4
I'm running external snippets with the help of ./runtests.py --ipython < FILE_NAME.py, where I put the Python statements in a file.

I think conda maintainers/tests may have more resources at hand to test with a similar environment. If there's something in master that crashes, they will be hit (and affected) eventually. Should we bring the issue to their attention, and if so, how?
I meant the code snippet I gave above. When you run that as a standalone snippet, can you check what the print statement gives? No conda env, no scripts, as simple as possible. It helps to have everything as barebones as possible.
EDIT: I should add that the no-env (i.e. base env) conda setting is purely binary and only installs released versions, which can differ by a lot from master. I'm pretty sure I'd better not install packages built from the master tip into that. The conda env is where I build or even install from master.
I am trying to eliminate possibilities, but only one at a time. If you open a Python environment (whatever you use), import scipy, and then check the version manually, what does it say? Then if you run the code snippet without any script, via IPython or a python terminal, and it still crashes, that's a data point. If it doesn't, then it is related to how conda triggers things. We haven't triangulated yet when and where it happens. You seem to use many things at once; that's why I am trying to isolate the tests. I don't mind which tool you use, but a straightforward test is needed.
Also, is it possible that you upgrade to LAPACK 3.9.0 in another environment? I don't know why 2019.4 MKL still uses 3.7.0.
Just done that in a new environment with
I don't think I know how to do that right now. I could possibly try a recent OpenBLAS release build (but that won't be MKL) and have the scipy master build use that. The MKL dependency is kind of weirdly baked into the conda dependency relations; trying to circumvent that makes conda fail to find satisfiable dependencies.
I was having the same issue as before too. I'm not sure what in the dependency closure fixed the issue, but I updated the following. [Long scrollback buffer listing]
So @congma, if you can do that on your dev box, maybe give it a try?
@congma Actually, perhaps I spoke too soon: I seem to now be getting the varying failure modes that you are seeing too, whereas before I was only seeing the consistent "abort trap 6" failure when running the test case.
Not sure if this is helpful, but posting:
So sometimes I'm also getting an abort trap 6 on the test (same as before), and sometimes the full set of parametrized tests passes here, but there is still a segfault error:
@rlucas7 Great, how did you catch the (apparent) integer overflow? Are there any specific triggers?
@rlucas7 Could you also run the snippet I've posted above? It's the stripped-down version of that test.
@ilayn, the appearance of a gigantic integer in @rlucas7's test run is only caught when the return value of the call is accessed. As long as @rlucas7 and I can test, shall we also change the last bit of the stripped-down test snippet into something like this?

w, z, *others = evr(...)  # Last line of your original snippet
np.diag(z.T.conj() @ a @ z).real  # Or some other way to access w or z
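The suggested check works because, for a correct Hermitian eigensolver, the columns of z are orthonormal eigenvectors, so the diagonal of z^H A z reproduces w; garbage z from a broken wrapper would fail this immediately. A self-contained sketch of the property, using `numpy.linalg.eigh` as a stand-in for `evr` (the point is the invariant being checked, not the specific routine):

```python
import numpy as np

# Build a random Hermitian matrix (fixed seed for reproducibility).
rng = np.random.default_rng(0)
b = rng.random((6, 6)) + 1j * rng.random((6, 6))
a = b + b.conj().T                      # Hermitian by construction

w_all, z_all = np.linalg.eigh(a)        # reference decomposition
w, z = w_all[2:5], z_all[:, 2:5]        # mimic il=3, iu=5 (1-based)

# For correct (w, z), diag(z^H A z) must reproduce w exactly (up to rounding).
recovered = np.diag(z.conj().T @ a @ z).real
print(np.allclose(recovered, w))
```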
I kept repeatedly running the one-liner:
That has only happened that one time; I tried another 20-ish times and did not reproduce it. NOTE: I somehow mistakenly edited your original post here, @congma, sorry about that. I've removed my edit and quoted the response here now.
Sure can, here is the output:
The failing test is the one with the large integer. @ilayn, do you still think this is an MKL issue?
OK, could both of you run again, but this time making
Not sure it matters, but this was run in the same Python session as above; I kept everything else the same. Update: tried a couple more times with no issues, but when I closed the Python session I got a segfault. Weird. @ilayn, is that what you expect?
Same here with the first version of the test snippet. Still building scipy in as clean an env as possible; will post later. I'd suggest wrapping the test snippet in a big loop and collecting the output in a log file.
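One possible shape for that loop (a sketch, not a harness anyone in the thread actually posted): drive the snippet as a subprocess so that a crashing child cannot take down the tallying process, and count exit codes. On POSIX, a negative return code encodes the killing signal, e.g. -11 for SIGSEGV and -6 for SIGABRT, so crashes show up directly in the tally.

```python
import collections
import subprocess
import sys

def tally_runs(cmd, n):
    """Run `cmd` n times and return a Counter of exit codes.

    Zero means a clean exit; negative codes (POSIX) mean death by signal,
    so segfaults and aborts are counted separately from successes.
    """
    counts = collections.Counter()
    for _ in range(n):
        result = subprocess.run(cmd, capture_output=True)
        counts[result.returncode] += 1
    return counts

# Illustration only: a trivially passing command stands in for the real
# test-snippet file (e.g. [sys.executable, "snippet.py"]).
print(tally_runs([sys.executable, "-c", "pass"], 3))
```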
I'm getting with both
Ran 100 times, out of which 84 exited without errors and apparently correct values, 1 aborted, 15 segfaulted. Haven't got the one like in #11709 (comment) yet. |
I'm running another 5000 tests and already seeing some weird overflow-like Python exceptions, but here's also a question regarding the test script:

scipy/scipy/linalg/tests/test_decomp.py, lines 873 to 876 in ae34ce4
If at line 873 the code requests that the
This is really weird. Can you also print the in-between variables?
Thank you all for the ideas. I'll go on testing tomorrow and hopefully will find something of interest to scipy (instead of chasing weird MKL bugs).
The script I ran was identical to the first test snippet. The tally from those runs was:
The three runs that captured Python errors:
What are the in-between variables that could help the most?
Got a potentially meaningful error message:

import numpy as np
from scipy.linalg.lapack import zheevr as evr, zheevr_lwork as evr_lw, _compute_lwork
a = np.array([[0.57830695+0.j, 0.57732626+0.j, 0.50339199+0.j, 0.49955715+0.j, 0.65422828+0.j, 0.39991909+0.j],
[0.57732626+0.j, 0.26035319+0.j, 0.98113963+0.j, 0.63757905+0.j, 0.79092958+0.j, 0.72842587+0.j],
[0.50339199+0.j, 0.98113963+0.j, 0.26858524+0.j, 0.61785254+0.j, 0.62372901+0.j, 0.63216722+0.j],
[0.49955715+0.j, 0.63757905+0.j, 0.61785254+0.j, 0.78260145+0.j, 0.77017289+0.j, 0.82255765+0.j],
[0.65422828+0.j, 0.79092958+0.j, 0.62372901+0.j, 0.77017289+0.j, 0.78081655+0.j, 0.50652208+0.j],
[0.39991909+0.j, 0.72842587+0.j, 0.63216722+0.j, 0.82255765+0.j, 0.50652208+0.j, 0.00767085+0.j]],
dtype='D', order='F')
ap = a.copy()
lwork, lrwork, liwork = _compute_lwork(evr_lw, n=6, lower=False)
print("lwork parameters: %r" % dict(lwork=lwork, lrwork=lrwork, liwork=liwork))
w, z, *others = evr(a, compute_v=1, il=3, iu=5, lower=False, overwrite_a=True, range='I')
print("shape w:", w.shape)
print("shape z:", z.shape)
print("w: ", w)
print("z: ", z)
print("z.H @ a:", z.T.conj() @ ap)
l = z.T.conj() @ ap
print("z.H @ a @ z", l @ z)

Output from a run:
This suggests a
Two additional Python exceptions with strange shape information:
Another 5000 runs, with
Not the failures but actually the runs that pass worry me more. This signals that I have to dig into the LAPACK wrappers instead, because this really doesn't make any sense. The returned array sizes are almost random. I don't see any underflow pattern, hence they are probably random memory values, which implies f2py is not returning a proper object, which in turn means our wrappers are not correct. These tests were converted to pytest decorators very recently, and maybe we have discovered a bug that was not tested properly. I'll check this properly on a Linux box (I'm on Win10) when I can.
docker is your friend here... reproducibility in setup every time.
This may well be more noise, but just in case it helps, I created a stripped-down version in pure C that mostly performs the LAPACK operations done by the Python test snippet, in the hope that one may verify whether this is an underlying C-level MKL issue. The program appears to run fine without any problem. I ran the program in loops that repeat 5000 times per loop, and I have yet to run into a crash. The C program is as follows:

#include <stdio.h>
#include "mkl.h"
#define ARR_SIZE (36)
#define ARR_DIM (6)
#define ARR_SUPP (12)
#define ARR_C (18)
static void print_matrix_colmajor(const lapack_complex_double * restrict a,
lapack_int nr, lapack_int nc)
{
int i, j;
for ( i = 0; i < nr; ++i )
{
for ( j = 0; j < nc; ++j )
{
lapack_complex_double ze;
int k;
k = i + j * nr;
ze = a[k];
printf(" %# .8f%+.8gj", ze.real, ze.imag);
}
printf(",\n");
}
}
int main(int argc, char * argv[])
{
/* Initial input matrix data. */
const double ar[ARR_SIZE] =
{
0.57830695, 0.57732626, 0.50339199, 0.49955715, 0.65422828, 0.39991909,
0.57732626, 0.26035319, 0.98113963, 0.63757905, 0.79092958, 0.72842587,
0.50339199, 0.98113963, 0.26858524, 0.61785254, 0.62372901, 0.63216722,
0.49955715, 0.63757905, 0.61785254, 0.78260145, 0.77017289, 0.82255765,
0.65422828, 0.79092958, 0.62372901, 0.77017289, 0.78081655, 0.50652208,
0.39991909, 0.72842587, 0.63216722, 0.82255765, 0.50652208, 0.00767085
};
/* Complexified input matrix */
lapack_complex_double a[ARR_SIZE];
lapack_complex_double ap[ARR_SIZE];
int i;
/* Create the input matrix with the correct type that will be fed to
* zheevr() */
for ( i = 0; i < ARR_SIZE; ++i )
{
a[i].real = ar[i];
a[i].imag = 0.0;
ap[i].real = ar[i];
ap[i].imag = 0.0;
}
lapack_int m, info;
double w[ARR_DIM];
lapack_complex_double z[ARR_SIZE];
lapack_int isuppz[ARR_SUPP];
info = LAPACKE_zheevr(LAPACK_COL_MAJOR, 'V', 'I', 'U', ARR_DIM, a, ARR_DIM,
0.0, 0.0, 3, 5, 0.0, &m, w, z, ARR_DIM, isuppz);
printf("info: %d\n", info);
printf("m: %d\n", m);
printf("w: [");
for ( i = 0; i < m; ++i )
{
printf(" %# .8f", w[i]);
}
printf("]\n");
printf("z: [");
print_matrix_colmajor(z, ARR_DIM, m);
printf("]\n");
/* Calculate the vector-matrix dot product z.H @ a */
lapack_complex_double alpha = {1.0, 0.0};
lapack_complex_double beta = {0.0, 0.0};
lapack_complex_double c[ARR_C];
cblas_zgemm(CblasColMajor, CblasConjTrans, CblasNoTrans,
m, ARR_DIM, ARR_DIM,
&alpha, z, ARR_DIM, ap, ARR_DIM,
&beta, c, m);
printf("c = z.H @ a: [");
print_matrix_colmajor(c, m, ARR_DIM);
printf("]\n");
}

Running the compiled program linked against MKL 2019.4 gives the following output:
@congma Could you please open the file, find the line

<ftype2> intent(hide),dimension(lrwork) :: rwork

and change it to

<ftype2> intent(hide),dimension(lrwork),depend(lrwork) :: rwork

and then build and test again?
@ilayn Thanks for looking further into the issue.
Not sure if @congma is seeing similar results.
Yes, I am. While the loop is running, a glimpse shows that the random crashes are still there, not substantially different from the earlier ones.
One more data point amid the noise: I now have a python process that is apparently stuck and doing nothing (as in consuming 0% CPU time) while executing the test snippet. I may have to kill it later, but I have no idea why it is doing what it's doing (i.e. nothing).
EDIT 2: Sorry, disregard the disregard. The stuck-process problem still happens (in my current setting, at the 2333rd of 5000 runs). Also, a new ValueError was caught during the runs:
There's a slight hint that I might have hit something when playing around with the f2py wrapper for zheevr. In the wrapper, I changed the declaration and attributes for the isuppz array to

integer intent(out),dimension((compute_v?2*max(1,n):0)),depend(n,compute_v) :: isuppz

and adjusted the Python test snippet accordingly. This time, the crashes seemed gone.
Rationale

To quote verbatim from the netlib LAPACK docs:
Notice that the doc says "Implemented only for RANGE = 'A' or 'I' and IU - IL = N - 1".
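To make the size arithmetic concrete, here is a small illustrative helper (not scipy code) contrasting the isuppz length the LAPACK documentation specifies, 2*max(1, M), with the fixed 2*max(1, N) the amended f2py declaration allocates; since M <= N always, the larger allocation is safe even if the library writes beyond the documented M-dependent size:

```python
def isuppz_doc_len(m, compute_v):
    """isuppz length per the netlib docs: 2*max(1, M) entries,
    referenced only when eigenvectors are requested."""
    return 2 * max(1, m) if compute_v else 0

def isuppz_wrapper_len(n, compute_v):
    """isuppz length per the amended f2py declaration:
    (compute_v ? 2*max(1, n) : 0)."""
    return 2 * max(1, n) if compute_v else 0

# For the failing test case: n=6, range='I' with il=3, iu=5, so m = iu-il+1 = 3.
n, m = 6, 3
print(isuppz_doc_len(m, 1), isuppz_wrapper_len(n, 1))
```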
Before I have to wrap up for the day (brain hardly functioning right now), here are the f2py wrappers as they currently stand on my side, for reference. I realized I've made other changes too, and I don't want to mislead anyone working on this. Notice I changed the attributes for several parameters.

subroutine <prefix2c>heevr(compute_v,range,lower,n,a,lda,vl,vu,il,iu,abstol,w,z,m,ldz,isuppz,work,lwork,rwork,lrwork,iwork,liwork,info)
! Standard Symmetric/Hermitian Eigenvalue Problem
! Complex - Single precision / Double precision
!
! if jobz = 'N' there are no eigvecs hence 0x0 'z' returned
! if jobz = 'V' and range = 'A', z is (nxn)
! if jobz = 'V' and range = 'V', z is (nxn) since returned number of eigs is unknown beforehand
! if jobz = 'V' and range = 'I', z is (nx(iu-il+1))
callstatement (*f2py_func)((compute_v?"V":"N"),range,(lower?"L":"U"),&n,a,&lda,&vl,&vu,&il,&iu,&abstol,&m,w,z,&ldz,isuppz,work,&lwork,rwork,&lrwork,iwork,&liwork,&info)
callprotoargument char*,char*,char*,int*,<ctype2c>*,int*,<ctype2>*,<ctype2>*,int*,int*,<ctype2>*,int*,<ctype2>*,<ctype2c>*,int*,int*,<ctype2c>*,int*,<ctype2>*,int*,int*,int*,int*
<ftype2c> intent(in,copy,aligned8),check(shape(a,0)==shape(a,1)),dimension(n,n) :: a
integer optional,intent(in),check(compute_v==1||compute_v==0):: compute_v = 1
character optional,intent(in),check(*range=='A'||*range=='V' ||*range=='I') :: range='A'
integer optional,intent(in),check(lower==0||lower==1) :: lower = 0
integer optional,intent(in) :: il=1
integer optional,intent(in),depend(a) :: iu=shape(a,0)
<ftype2> optional,intent(in) :: vl=0.0
<ftype2> optional,intent(in),check(vu>vl),depend(vl) :: vu=1.0
<ftype2> intent(in) :: abstol=0.0
integer optional,intent(in),depend(a),check(lwork>=max(2*shape(a,0),1)||lwork==-1) :: lwork=max(2*shape(a,0),1)
integer optional,intent(in),depend(a),check(lrwork>=max(24*shape(a,0),1)||lrwork==-1) :: lrwork=max(24*shape(a,0),1)
integer optional,intent(in),depend(a),check(liwork>=max(1,10*shape(a,0))||liwork==-1):: liwork= max(1,10*shape(a,0))
integer intent(hide),depend(a) :: n=shape(a,0)
integer intent(hide),depend(n) :: lda=max(1,n)
integer intent(hide),depend(n) :: ldz=(compute_v?max(1,n):1)
<ftype2c> intent(hide),dimension(lwork),depend(lwork) :: work
<ftype2> intent(hide),dimension(lrwork),depend(lrwork) :: rwork
integer intent(hide),dimension(liwork),depend(liwork) :: iwork
<ftype2> intent(out),dimension(n),depend(n) :: w
<ftype2c> intent(out),dimension((compute_v?ldz:0),(compute_v?(*range=='I'?iu-il+1:MAX(1,n)):0)),depend(n,ldz,compute_v,range,iu,il) :: z
integer intent(out) :: m
! Only returned if range=='A' or range=='I' and il, iu = 1, n
integer intent(out),dimension((compute_v?2*max(1,n):0)),depend(n,compute_v) :: isuppz
integer intent(out) :: info
end subroutine <prefix2c>heevr
subroutine <prefix2c>heevr_lwork(n,lower,a,lda,vl,vu,il,iu,abstol,m,w,z,ldz,isuppz,work,lwork,rwork,lrwork,iwork,liwork,info)
! LWORK routines for (c/z)heevr
fortranname <prefix2c>heevr
callstatement (*f2py_func)("N","A",(lower?"L":"U"),&n,&a,&lda,&vl,&vu,&il,&iu,&abstol,&m,&w,&z,&ldz,&isuppz,&work,&lwork,&rwork,&lrwork,&iwork,&liwork,&info)
callprotoargument char*,char*,char*,int*,<ctype2c>*,int*,<ctype2>*,<ctype2>*,int*,int*,<ctype2>*,int*,<ctype2>*,<ctype2c>*,int*,int*,<ctype2c>*,int*,<ctype2>*,int*,int*,int*,int*
! Inputs
integer intent(in):: n
integer optional,intent(in),check(lower==0||lower==1) :: lower = 0
! Not referenced
<ftype2c> intent(hide) :: a
integer intent(hide),depend(n) :: lda = max(1,n)
<ftype2> intent(hide) :: vl=0.
<ftype2> intent(hide) :: vu=1.
integer intent(hide) :: il=1
integer intent(hide) :: iu=2
<ftype2> intent(hide) :: abstol=0.
integer intent(hide) :: m=1
<ftype2> intent(hide) :: w
<ftype2c> intent(hide) :: z
integer intent(hide),depend(n):: ldz = max(1,n)
integer intent(hide) :: isuppz
integer intent(hide) :: lwork = -1
integer intent(hide) :: lrwork = -1
integer intent(hide) :: liwork = -1
! Outputs
<ftype2c> intent(out) :: work
<ftype2> intent(out) :: rwork
integer intent(out) :: iwork
integer intent(out) :: info
end subroutine <prefix2c>heevr_lwork
In the LAPACK wrapper for ?heevr, enlarge the size of the isuppz array parameter in order to prevent out-of-bound access by the underlying library. For both ?heevr and ?syevr, more robust checks and default values are also put in place. Reference: scipy#11709
I am going to check now, but if what you found is the culprit, then this needs to be reported to MKL, because the reference implementation and OpenBLAS don't show this behavior.
@oleksandr-pavlyk Ping.
In that case, is there a method to selectively compile the wrapper as a stopgap measure?
That's really something we want to avoid, because every implementation flavor has different problems, and it bloats the code base, not to mention the maintainability burden. MKL's implementation of
Interesting observation! I think a more actionable issue for MKL is that it violates its own documentation (contract) of
I don't know if the original poster Zoe is anyone from this thread, but apparently an MKL bug ticket is open for exactly the same reason: https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/851351 @oleksandr-pavlyk Do you mind nudging this ticket if possible?
@ilayn Done.
The Intel team confirmed the bug and included the fix in the upcoming MKL 2020 Update 2.
For the record, the fix is included in the official release from last week.
This problem is related to #11601, which has been closed by #11702 (@ilayn). However, the crash has not been fixed by the latter PR. The symptoms remain almost identical to the ones described in my comment in #11601 (comment).

In summary, when running the test for `eigh()`, Python tends to crash with SIGSEGV or SIGABRT. Sometimes this happens during the `test_eigh()` function, sometimes after it passes with "100%" but before `pytest` returns. The test that triggers the crash is the following test function:

scipy/scipy/linalg/tests/test_decomp.py, lines 863 to 888 in ae34ce4

Some patterns from the histories of crashes

I ran the test script with `runtests.py` 100 times and saved the output as text files. By grepping the output files of `./runtests.py`, I noticed that the last-known position in Python before the crash could be one of three lines, namely 873, 876, and 877. Line 873 is the actual call to `eigh()`, while the crash can happen as late as line 876 or 877, where the arrays returned from `eigh()` are accessed. Only 6 out of 100 runs passed without any problems.

In some cases (35 out of the 100), Python segfaulted after nominally completing all the tests in `TestEigh::test_eigh`. Of the cases where Python was killed with SIGABRT, 36 were at line 873 (the call to `eigh()`), while 9 were at line 876, where the output `z` was used. In many other runs, the test script was not featured in the Python backtrace, if any.

The parametrized inputs that triggered the crash were of the form `test_eigh[6-D-XXX-YYY-ZZZ-eigvals1]`. That is, the crashes happened for dimension 6, dtype double complex, with the `eigvals=` keyword parameter set to the tuple `(2, 4)`. The `XXX`-`ZZZ` parameters are boolean flags for the keywords `turbo`, `lower`, and `overwrite` respectively.

An incomplete tally of the parameters (`turbo`, `lower`, and `overwrite`) for the runs where Python crashed before finishing all the tests shows that the combination `(turbo=True, lower=True, overwrite=False)` is the only one of the 2^3 = 8 cases not yet observed.

Reproducing code example:
Scipy/Numpy/Python version information:
Scipy master branch as of ae34ce4, Numpy 1.18.1, Python 3.7.6, conda on macOS with MKL 2019.4.