BUG/TST: Appveyor failing on statespace test_simulate #3386
In one of Kerby's PRs it fails on Python 2.7, but in a different simulation test.
There are probably issues with C long size differences between Unix and Windows. To avoid these I always use things
This seems to happen a lot. I saw one locally and attached the debugger after the crash. Since it wasn't a debug build, there wasn't much helpful information. However, I did see that the crash happened in mkl_avx2.dll, which is where BLAS calls are executed. I suspect this is an obscure data alignment issue with SIMD. In most cases data is aligned on the right-sized boundary (e.g. 32 bytes). Occasionally it is not, and then the error occurs.
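One way to test the alignment suspicion is to look at the raw address of each array before it reaches BLAS. The sketch below assumes NumPy's `ndarray.ctypes.data` (the integer address of the first element); the helper name `is_aligned` is made up for illustration. `arr.flags.aligned` is not enough here, because it only checks alignment for the dtype (8 bytes for float64), not a wider boundary such as the 32 bytes mentioned above.

```python
import numpy as np


def is_aligned(arr, alignment=32):
    """Return True if the first element of ``arr`` sits on an ``alignment``-byte boundary."""
    # ctypes.data is the integer memory address of the array's first element.
    return arr.ctypes.data % alignment == 0


base = np.zeros(9)   # float64, so at least 8-byte aligned by dtype
shifted = base[1:]   # a view that starts 8 bytes later in the same buffer

# Slicing shifts the start address, so a view can lose wider alignment
# even when the parent buffer had it.
print(is_aligned(base, 16), is_aligned(shifted, 16))
```

Exactly one of the two prints `True` for a 16-byte check, since the addresses differ by 8 bytes; which one depends on where the allocator placed `base`.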
Another possibility is that one of the inputs to the BLAS function calls does not have the correct type on Windows.
The integer types you pass to BLAS functions might need to be np.npy_intp, which would give 32-bit integers on 32-bit platforms and 64-bit integers on 64-bit platforms.
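For the integer-type question, the relevant difference is the data model: 64-bit Windows is LLP64, so a C `long` is 4 bytes, while 64-bit Linux and macOS are LP64, where `long` is 8 bytes. `npy_intp` is pointer-sized on both. A minimal stdlib-only sketch that prints these widths on the current machine; it does not say which type a particular BLAS wrapper expects, which has to be checked against that wrapper's declarations.

```python
import ctypes

# Width in bytes of C integer types that commonly appear in extension code.
sizes = {
    "long": ctypes.sizeof(ctypes.c_long),        # 4 on 64-bit Windows (LLP64), 8 on 64-bit Linux/macOS (LP64)
    "pointer": ctypes.sizeof(ctypes.c_void_p),   # 8 on any 64-bit platform
    "ssize_t": ctypes.sizeof(ctypes.c_ssize_t),  # pointer-sized on mainstream platforms, like npy_intp
}

for name, nbytes in sizes.items():
    print(f"{name:>8}: {nbytes} bytes")

if sizes["long"] != sizes["pointer"]:
    print("C long is narrower than a pointer here: sizes or indices stored in a long can truncate")
```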
I never managed to run into problems with the simulate tests. I tried for a while yesterday with Windows 8.1 and 64-bit Python 3.4 (WinPython), with statsmodels compiled with MinGW/gcc.
It is very hard to trigger. I have intentionally tried many times and cannot get a failure locally aside from the one I caught.
From: Josef Perktold
Sent: Thursday, February 2, 2017 2:03 PM
To: statsmodels/statsmodels
Cc: Kevin Sheppard; Comment
Subject: Re: [statsmodels/statsmodels] BUG/TST: Appveyor failing on statespace test_simulate (#3386)
This is really strange. I have done some more runs with a lot of printing to see where the failure happens, and it seems to occur between tests. I'm not quite sure what this implies. Maybe it is crashing when garbage collecting the various extension classes from the Cython code.
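To narrow down whether the crash happens inside a test or while its objects are being torn down, a harness along these lines could help. It is hypothetical (`run_with_gc_checkpoints` is not part of statsmodels or any test runner): it enables the standard-library `faulthandler`, which on a fatal error, including an access violation on Windows, prints the Python traceback, and it forces `gc.collect()` after each test with a flushed marker before each step. Running the real suite with `python -X faulthandler` gives the traceback part without any code changes.

```python
import faulthandler
import gc


def run_with_gc_checkpoints(tests):
    """Run (name, callable) pairs, forcing a garbage collection after each one.

    Markers are flushed before every step, so if the process dies hard the
    last marker shows whether it died in a test body or during collection.
    """
    try:
        # On a fatal error (SIGSEGV, or an access violation on Windows),
        # dump the Python traceback of every thread to stderr.
        faulthandler.enable(all_threads=True)
    except (AttributeError, OSError, ValueError, RuntimeError):
        pass  # stderr has no usable file descriptor (e.g. captured output)

    completed = []
    for name, func in tests:
        print(f"START {name}", flush=True)
        func()
        print(f"GC after {name}", flush=True)
        # Objects kept alive only by reference cycles are deallocated here.
        gc.collect()
        completed.append(name)
    return completed
```

If the last marker before the crash is `GC after <name>`, the failure happened while collecting cyclic garbage. Objects freed by plain reference counting are released when the test function returns, so a crash there would still show `START <name>` as the last marker.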
The exception code in hex is 0xC0000005, which means access violation: something is trying to read or write memory it doesn't own. It could also be freeing something that has already been freed.
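For reading CI logs: the NTSTATUS value of an unhandled exception usually becomes the process exit code, so 0xC0000005 can show up as 3221225477, or as -1073741819 when the same 32 bits are read as a signed integer. A small decoding sketch; the table lists only two codes and is not exhaustive. STATUS_HEAP_CORRUPTION (0xC0000374) is included because a double free is often, though not always, reported that way.

```python
# Windows NTSTATUS codes that can appear as process exit codes when an
# exception is not handled. Not an exhaustive list.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000374: "STATUS_HEAP_CORRUPTION",
}


def describe_exit_code(code):
    """Map a raw exit code (signed or unsigned) to its 32-bit hex form and name."""
    unsigned = code & 0xFFFFFFFF  # reinterpret a negative value as unsigned 32-bit
    name = KNOWN_STATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08X} ({name})"


print(describe_exit_code(-1073741819))  # 0xC0000005 (STATUS_ACCESS_VIOLATION)
print(describe_exit_code(3221225477))   # 0xC0000005 (STATUS_ACCESS_VIOLATION)
```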
There is one in-place modification of … It might also be worth trying to make copies of all eps when calling the …
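A sketch of the defensive-copy idea, assuming NumPy. `prepare_shocks` is a hypothetical helper, and the actual names of the disturbance arguments in the statespace simulate code are not assumed here. An explicit C-contiguous float64 copy means an in-place write in compiled code cannot touch the caller's array, and no strided or Fortran-ordered view is handed to BLAS.

```python
import numpy as np


def prepare_shocks(shocks):
    """Hypothetical helper: give compiled code its own C-contiguous float64 copy.

    If the extension modifies the buffer in place, the caller's array is
    unaffected, and the memory layout is what the Cython code expects.
    """
    return np.array(shocks, dtype=np.float64, order="C", copy=True)


user_eps = np.asfortranarray(np.ones((4, 2)))  # Fortran-ordered input
eps = prepare_shocks(user_eps)
eps[0, 0] = 99.0                               # stand-in for an in-place write
print(user_eps[0, 0], eps.flags.c_contiguous)  # 1.0 True
```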
The difficulty is that it fails right after each of these tests. The actual simulation code is all Python, so it is very doubtful that it is causing the problem. I now suspect some issue with MKL. Trying a newer MKL on AppVeyor to see if that helps.
I'm having trouble replicating this. I have a conda environment with the same packages as AppVeyor on my Windows 10 machine, and can't reproduce the problem. @bashtage, was the local failure you observed a fluke, or did you get it from repeated test runs? And was it a segfault?
It seems to be hard to trigger. I have seen it locally twice, both times by accident while running the entire test suite. My experiences:
Going to be hard to debug.
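Since the only reliable way to hit it seems to be repeated full runs, a small loop that stops at the first non-zero exit can do the waiting. This is a hypothetical helper; the command in the comment is only an example and should be replaced with whatever invocation the test suite actually uses.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs):
    """Re-run ``cmd`` until it exits non-zero.

    Returns (run_number, exit_code) for the first failing run, or None if
    all ``max_runs`` runs succeed.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return run, result.returncode
    return None


# Example command (an assumption, adjust to the project's real test invocation):
# cmd = [sys.executable, "-m", "pytest", "--pyargs", "statsmodels.tsa.statespace"]
# print(run_until_failure(cmd, max_runs=200))
```

On Windows, `subprocess.run` reports an access violation as a positive exit code (3221225477); on POSIX a child killed by a signal gives a negative value, e.g. -11 for SIGSEGV.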
Thanks for those notes! I'll keep trying.
@jbrockmendel @josef-pkt @ChadFulton This appears to have been fixed after Anaconda updated MKL to 2017.0.3. I think it can be closed.
It still fails every once in a while.
Does it still fail?
Yes, but not very often anymore. I had a few cases in the last few days, in fewer than a quarter of test runs as a rough guess.
Has the new MKL (2018) fixed this? I haven't seen one in a long time.
The last one, ignoring Chad's statespace PR, was 6 days ago, AFAICS.
As an update: another hanging test run.
@bashtage, if you have a moment, would you mind telling me a bit about the setup where you caught the error locally with the debugger and saw that something was trying to access (I assume no one has seen the segfault anywhere but Windows?)
@ChadFulton I can't reproduce the simulate segfault at the moment.
Not reproducible with
No failures, errors, or segfault, but if I additionally do
then I get one failure.
I will add results here when this is finished (in a restarted Python session in the console).
Results: no statespace problem or failure; cvxopt failure as before.
cvxopt-1.1.7.dist-info; I don't see a version number in the imported cvxopt.
Same problem on Python 3.6 after pip install of the 4.6 wheel
-> segfault in test_simulate.py. Running the statespace tests again finishes successfully without failures or errors. No cvxopt failure with cvxopt-1.1.9.dist-info.
Looks like a different import path in 0.22.0 :(
I guess the pandas formatting test is not run on Travis or AppVeyor.
I'm using pandas 0.19.1, which has pd.formats but not pd.io.formats.
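One common way to cope with a module moving between releases is to try the new location first and fall back to the old one. A generic sketch; the pandas module paths in the comment are assumptions based on the `pd.formats` / `pd.io.formats` attributes mentioned above, not verified against every release.

```python
import importlib


def import_first(*module_names):
    """Import and return the first module in ``module_names`` that can be imported."""
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass
            errors.append(f"{name}: {exc}")
    raise ImportError("none of the candidate modules could be imported: " + "; ".join(errors))


# Newer location first, then the older one (paths assumed, see above):
# fmt = import_first("pandas.io.formats.format", "pandas.formats.format")
```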
This may be tangential, but have you tried moving
I could not deterministically trigger the exception. I triggered it by running the full test suite in a loop until it was hit. I instrumented the Cython code with lots of notifications and was never able to find an error. Considering this never happens on Linux or OS X, I suspect that it is a bug in either Visual Studio or MKL. The exception was always raised in an MKL DLL, which also suggests this. Ideally one would use 32-byte aligned memory, but this isn't easy with NumPy arrays, which only align on 16 bytes IIRC.
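The usual NumPy workaround for stricter alignment is to over-allocate a byte buffer and return a view that starts on the desired boundary. This is a well-known idiom, not something statsmodels is known to do; `aligned_empty` is a hypothetical name. It only covers buffers allocated explicitly this way: arrays created later by ordinary NumPy operations get the default allocator's alignment.

```python
import numpy as np


def aligned_empty(shape, dtype=np.float64, alignment=32):
    """Return an uninitialised array whose data starts on an ``alignment``-byte boundary.

    Over-allocates a byte buffer by ``alignment`` bytes and returns a view
    that starts at the first suitably aligned address inside it.
    """
    dtype = np.dtype(dtype)
    shape = (shape,) if isinstance(shape, int) else tuple(shape)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    buf = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = -buf.ctypes.data % alignment  # bytes to skip to reach the boundary
    # The view keeps ``buf`` alive through its ``base`` attribute.
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)


a = aligned_empty((3, 5))
print(a.shape, a.ctypes.data % 32)  # (3, 5) 0
```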
FWIW my setup was current Cython on Python 3.6 as of October 2017, so probably NumPy 1.13.
I think it's better to disable the simulate tests on Windows for the release, so users don't get a segfault as their first impression of the release when running the test suite. Then this should be fine for the release candidate.
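A sketch of a platform-conditional skip, written with the standard library's `unittest.skipIf` so it is self-contained; with pytest the equivalent is `pytest.mark.skipif(sys.platform == "win32", reason="...")`. The class and the `simulate_ar1` stand-in are hypothetical and are not the statsmodels simulate tests.

```python
import random
import sys
import unittest


def simulate_ar1(n, phi=0.5, seed=0):
    """Tiny stand-in for a simulation routine (hypothetical, not statsmodels code)."""
    rng = random.Random(seed)
    y, out = 0.0, []
    for _ in range(n):
        y = phi * y + rng.gauss(0.0, 1.0)
        out.append(y)
    return out


@unittest.skipIf(sys.platform == "win32",
                 "simulate tests segfault intermittently on Windows (GH #3386)")
class TestSimulate(unittest.TestCase):
    def test_length(self):
        self.assertEqual(len(simulate_ar1(50)), 50)
```

The test runner picks the class up as usual; on Windows every test in it is reported as skipped instead of being run.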
Good news, maybe. I did find an illegal memory access issue in the simulation smoother that could cause segmentation faults in … I don't know whether this fixes the ongoing problem, since we haven't been able to reliably replicate the AppVeyor segfault, but it seems promising.
Closing as it appears that #4580 fixed the problem.
It's not clear what's happening, except that tests are failing (i.e. exiting without finishing). Error message:
My first guess would be some kind of segfault in the simulation code, i.e. in the simulation smoother.