BUG/TST: Appveyor failing on statespace test_simulate #3386

Closed
ChadFulton opened this issue Jan 22, 2017 · 33 comments

Comments

@ChadFulton
Member

It's not clear what's happening, except tests are failing (i.e. exiting without finishing). Error message:

statsmodels.tsa.statespace.tests.test_simulate.test_structural ...
Command exited with code -1073741819

First guess would be some kind of segfault in simulation, so in the simulation smoother.

@josef-pkt
Member

In one of Kerby's PRs it fails on python 2.7 but in a different simulation test
statsmodels.tsa.statespace.tests.test_simulate.test_arma_lfilter ... Command exited with code -1073741819

@bashtage
Member

This is probably an issue with the size of long differing between Unix and Windows. To avoid this, I always use fixed-width types like int64_t/np.int64 rather than long or long long.

@ChadFulton ChadFulton added this to the 0.9 milestone Feb 2, 2017
@bashtage
Member

bashtage commented Feb 2, 2017

This seems to happen a lot. I saw one locally and loaded the debugger after the crash. Since it wasn't a debug build, there wasn't a lot of helpful information. However, I did see that the crash happened in mkl_avx2.dll, which is where BLAS calls are executed. I have a suspicion that this is an obscure data alignment issue with SIMD. In many cases data is aligned on the right sized boundary (e.g. 32 bytes). Occasionally it is not, and so the error occurs.
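To make the alignment idea concrete, here is a minimal sketch of how one could check whether a buffer starts on a 16, 32 or 64 byte boundary. It uses only the standard library; the helper name is_aligned and the buffer size are assumptions for illustration, and nothing here confirms that alignment was the actual cause.

```python
import ctypes


def is_aligned(address, boundary):
    """Return True if a memory address is a multiple of `boundary` bytes."""
    return address % boundary == 0


# Allocate a raw buffer and look at where it starts in memory.
buf = ctypes.create_string_buffer(1024)
address = ctypes.addressof(buf)

for boundary in (16, 32, 64):
    print(f"{boundary}-byte aligned: {is_aligned(address, boundary)}")
```

For a NumPy array, the start address of the data is available as arr.ctypes.data, so the same check applies to the arrays handed to BLAS.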

@bashtage
Member

bashtage commented Feb 2, 2017

Another possibility is that one of the inputs to the BLAS function calls does not have the correct type on Windows.

@bashtage
Member

bashtage commented Feb 2, 2017

The integer types you pass to BLAS functions might need to be np.npy_intp which would result in 32 bit integers on 32 bit platforms and 64 bit integers on 64 bit platforms.
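As a quick way to see the width differences on a given machine, here is a small sketch using ctypes. On 64-bit Windows a C long is 4 bytes, while on 64-bit Linux and macOS it is 8 bytes. Using c_ssize_t as a stand-in for a pointer-sized integer such as npy_intp is an assumption for illustration.

```python
import ctypes

# Byte widths of the C integer types discussed here, on the running platform.
sizes = {
    "long": ctypes.sizeof(ctypes.c_long),
    "long long": ctypes.sizeof(ctypes.c_longlong),
    "int64_t": ctypes.sizeof(ctypes.c_int64),
    "pointer": ctypes.sizeof(ctypes.c_void_p),
    # Stand-in for a pointer-sized integer such as npy_intp (assumption).
    "ssize_t": ctypes.sizeof(ctypes.c_ssize_t),
}

for name, nbytes in sizes.items():
    print(f"{name:>9}: {nbytes} bytes")
```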

@josef-pkt
Member

I never managed to run into problems with the simulate tests. I tried for a while yesterday with Windows 8.1, 64 bit python 3.4 with statsmodels compiled with MingW/gcc, winpython.

@bashtage
Member

bashtage commented Feb 2, 2017 via email

@bashtage
Member

bashtage commented Feb 3, 2017

This is really strange. I have done some more runs where I printed a lot to see where the failure was -- it seems that it occurs between tests. I'm not quite sure what this implies. Maybe it is crashing when garbage collecting the various Cython extension classes.

@bashtage
Member

bashtage commented Feb 3, 2017

The exception code in hex is C0000005, which means access violation - something is trying to read/write data that isn't owned by it. It could also be freeing something that has already been freed.
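For reference, the conversion from the signed exit code that Appveyor printed to the hex status code is just a reinterpretation as an unsigned 32-bit integer:

```python
# Exit code reported by Appveyor (a signed 32-bit value).
exit_code = -1073741819

# Reinterpret it as an unsigned 32-bit integer to get the Windows status code.
ntstatus = exit_code & 0xFFFFFFFF

print(f"0x{ntstatus:08X}")  # prints 0xC0000005
```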

@josef-pkt
Member

There is one in-place modification of eps while it is attached to a model.
The last two cases in test_structural don't make a copy before modifying it:
desired = eps.copy()

It might also be worth a try to make all eps copies when calling the simulate method.
(I don't know much about the internals and Cython, but an eps.copy() would prevent them all from using the same array memory.)
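A minimal sketch of the difference between sharing and copying, using a stand-in array rather than the actual test data (the array contents and the name alias are assumptions for illustration):

```python
import numpy as np

eps = np.zeros(5)

# A plain reference shares the buffer with `eps`.
alias = eps
# A copy owns separate memory, so in-place edits to `eps` do not reach it.
desired = eps.copy()

eps[:] = 1.0  # in-place modification

print(np.shares_memory(eps, alias))    # True
print(np.shares_memory(eps, desired))  # False
print(desired)                         # still all zeros
```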

@bashtage
Member

bashtage commented Feb 3, 2017

The difficulty is that it fails right after each of these tests. The actual simulation code is all python so very doubtful that this is causing a problem. I now suspect there is some issue with mkl. Trying a new mkl on appveyor to see if it works.

@ChadFulton
Member Author

I'm having trouble replicating this. I have a conda environment with the same packages as appveyor on my Windows 10 machine, and can't get a problem.

@bashtage was the local failure you observed a fluke? or did you get it from repeated test runs? And was it a segfault?

@bashtage
Member

bashtage commented Feb 9, 2017

It seems to be hard to trigger. I have seen it locally twice. Both were accidental when doing entire runs of the test suite. My experiences:

  • It won't trigger by just running test_simulate. I ran all tests in this module 10K times and didn't see a single failure.
  • It requires at a minimum running all of nosetests statsmodels\tsa.
  • It occurs between tests. I put a lot of print statements into the runs, basically outputting an increasing count after each statement and a 'TEST_STARTED' and 'TEST_ENDED' at the first and last lines. The crash always happened after 'TEST_ENDED' but before the next 'TEST_STARTED'. Since these were the first and last lines, it is something in the garbage collection.
  • When it occurred locally, I was able to open Visual Studio's debugger. I didn't have a debug build, but I could see it was stop 0xC0000005 in mkl_avx2.dll -- a BLAS library. I have a suspicion that there is a very subtle alignment issue in how NumPy arrays are boundary aligned (I think they don't use any alignment, but mkl_avx2 might be expecting 16 or 32 byte alignment). This isn't technically a NumPy issue, since they don't vendor a BLAS.
  • I tried setting up an Appveyor run that uses pip NumPy 1.12, which comes with ATLAS, to see if it occurs, but I didn't succeed since SciPy isn't pip-installable and SciPy's BLAS interface is very important here.
  • I did set up some runs using an older version of MKL, but this also crashed on Appveyor.

Going to be hard to debug.
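A rough sketch of the marker approach, with one variation: the end marker is printed by a wrapper outside the test, after a forced collection, so that a crash while freeing the test's objects would land between the two markers instead of after them. The helper run_with_markers and the example test are hypothetical, not the instrumentation that was actually used.

```python
import gc


def run_with_markers(test_func):
    """Run one test between flushed markers and collect garbage before the end marker."""
    name = test_func.__name__
    # flush=True so the marker reaches the log even if the process dies right after.
    print(f"TEST_STARTED {name}", flush=True)
    test_func()
    # The test's locals are freed when it returns; gc.collect() additionally
    # forces collection of any reference cycles it left behind.
    unreachable = gc.collect()
    print(f"TEST_ENDED {name}", flush=True)
    return unreachable


def example_test():  # hypothetical stand-in for a real test function
    assert sum(range(4)) == 6


run_with_markers(example_test)
```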

@ChadFulton
Member Author

Thanks for those notes! I'll keep trying.

@bashtage
Member

bashtage commented Jul 8, 2017

@jbrockmendel @josef-pkt @ChadFulton This appears to have been fixed after anaconda updated MKL to 2017.0.3. I think it can be closed.

@josef-pkt
Member

It still fails every once in a while
for example my PR
mkl: 2017.0.3-0
https://ci.appveyor.com/project/josef-pkt/statsmodels/build/1.0.2020/job/1x31rc0k5e3wgmc0

@bashtage
Member

Does it still fail?

@josef-pkt
Member

Yes, but not very often anymore. I had a few cases in the last few days, less than one quarter of test runs as a rough guess.

@bashtage
Member

Has the new MKL (2018) fixed this? I haven’t seen one in a long time.

@josef-pkt
Member

The last one, ignoring Chad's statespace PR, was 6 days ago, AFAICS
https://ci.appveyor.com/project/josef-pkt/statsmodels/build/1.0.2403/job/prru22dxk854jflu
mkl: 2018.0.0-h36b65af_4

@josef-pkt
Member

As an update, another hanging test run:
https://ci.appveyor.com/project/josef-pkt/statsmodels/build/1.0.2491/job/dodtbv25v4n4j24b
It was quiet for some time, but it still doesn't last.

@bashtage
Member

I just caught one locally. I didn't have debugging enabled, but it seems to occur in mkl_avx2.dll, which I don't have symbols for, so it wouldn't help.

image

@josef-pkt josef-pkt removed this from the 0.9 milestone Apr 18, 2018
@josef-pkt josef-pkt added this to the 0.10 milestone Apr 18, 2018
@ChadFulton
Member Author

@bashtage if you have a moment, would you mind telling me a bit about your setup where you locally caught the error with the debug enabled and you saw that something was trying to access self.nobs? e.g. version of Windows, Python, Numpy / Scipy, BLAS / LAPACK...

(I assume no one has seen the segfault anywhere but Windows?)

@josef-pkt
Member

@ChadFulton I can't reproduce the simulate segfault at the moment.
Neither pytest ...\test_simulate.py nor pytest ...\statespace\tests segfaulted now.
I'm trying to run the entire test suite in different ways, but that takes time.

@josef-pkt
Member

josef-pkt commented Apr 28, 2018

not reproducible with pytest and the path to statsmodels either
running in the interpreter

import statsmodels.tsa.statespace as ss
ss.test()

no failures, errors or segfault

but if I do additionally

import statsmodels.api as sm
ss.test()

then I get one failure


================================== FAILURES ===================================
_______________________ TestCompanionMatrix.test_cases ________________________
..\try_py34\lib\site-packages\statsmodels\tsa\statespace\tests\test_tools.py:39: in test_cases
    assert_equal(tools.companion_matrix(polynomial), result)
..\try_py34\lib\site-packages\statsmodels\tsa\statespace\tools.py:320: in companion_matrix
    elif polynomial[0] == 1:
E   ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
======= 1 failed, 1600 passed, 30 skipped, 29 warnings in 96.57 seconds =======

I will add results here when this is finished (in a restarted python in console)

>>> import statsmodels.api as sm
>>> sm.test()

results: no statespace problem or failure

cvxopt failure as before

================================== FAILURES ===================================
________________________________ test_testers _________________________________
..\try_py34\lib\site-packages\statsmodels\stats\tests\test_knockoff.py:81: in test_testers
    RegressionFDR(y, x, tv, design_method=method)
..\try_py34\lib\site-packages\statsmodels\stats\_knockoff.py:83: in __init__
    exog1, exog2, _ = _design_knockoff_sdp(exog)
..\try_py34\lib\site-packages\statsmodels\stats\_knockoff.py:161: in _design_knockoff_sdp
    sol = solvers.sdp(c, G0, h0, [G1], [h1])
c:\users\josef\downloads\winpython-64bit-3.4.4.5qt5\python-3.4.4.amd64\lib\site-packages\cvxopt\coneprog.py:4129: in sdp
    = ds)
c:\users\josef\downloads\winpython-64bit-3.4.4.5qt5\python-3.4.4.amd64\lib\site-packages\cvxopt\coneprog.py:1396: in conelp
    misc.update_scaling(W, lmbda, ds, dz)
c:\users\josef\downloads\winpython-64bit-3.4.4.5qt5\python-3.4.4.amd64\lib\site-packages\cvxopt\misc.py:614: in update_scaling
    offsetU = ind2, offsetVt = ind2)
E   ArithmeticError: 49

cvxopt-1.1.7.dist-info, I don't see a version number in the imported cvxopt.
We might not have any CI testing for cvxopt

@josef-pkt
Member

same problem on python 3.6

after pip install of the 4.6 wheel
segfault on first run

>python
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import statsmodels.regression as smr
>>> smr.test()
Running pytest C:\WinPython64-3.6.5.0Qt5\python-3.6.5.amd64\lib\site-packages\statsmodels\regression --tb=short --disable-pytest-warnings
============================= test session starts =============================
platform win32 -- Python 3.6.5, pytest-3.5.0, py-1.5.3, pluggy-0.6.0

-> segfault in test_simulate.py

Running the statespace tests again finishes successfully without failures or errors.
When I run the same sm.test() again in a new interpreter session but the same command window, the test finishes without problems apart from one failure with pandas.

no cvxopt failure with cvxopt-1.1.9.dist-info

================================== FAILURES ===================================
_____________________________ test_getframe_smoke _____________________________
C:\WinPython64-3.6.5.0Qt5\python-3.6.5.amd64\lib\site-packages\statsmodels\multivariate\tests\test_factor.py:227: in test_getframe_smoke
    assert_(isinstance(ldf, pd.formats.style.Styler))
E   AttributeError: module 'pandas' has no attribute 'formats'
 1 failed, 6291 passed, 63 skipped, 2 xfailed, 100 warnings in 692.96 seconds =
>>> import pandas
>>> pandas.__version__
'0.22.0'
>>> pandas.formats
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'pandas' has no attribute 'formats'
>>> import pandas.io.formats
>>> pandas.io.formats.style.Styler
<class 'pandas.io.formats.style.Styler'>

looks like a different import path in 0.22.0 :(
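One way to cope with a class that moved between versions is to try the candidate module paths in order. The helper import_first below is hypothetical and only a sketch; it is not the compat code statsmodels uses. The demonstration at the end uses standard-library modules so that it runs without pandas.

```python
import importlib


def import_first(module_paths, attr):
    """Return `attr` from the first importable module in `module_paths`."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # this location does not exist in the installed version
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"{attr!r} not found in any of {module_paths}")


# Intended use (needs pandas and jinja2 installed):
# Styler = import_first(["pandas.io.formats.style", "pandas.formats.style"], "Styler")

# Self-contained demonstration with standard-library modules:
OrderedDict = import_first(["no_such_module_xyz", "collections"], "OrderedDict")
print(OrderedDict)
```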

@josef-pkt
Member

I guess the pandas formatting test is not run on Travis and Appveyor

    # The Styler option require jinja2, skip if not available
    try:
        from jinja2 import Template
    except ImportError:
        return

I'm using pandas '0.19.1' which has pd.formats but no pd.io.formats
This needs another compat PR if the code fails and not just the unit test.
But given that it is the assert that raises an error, the call to the formatting method completed without an exception:
ldf = res.get_loadings_frame(style='display')

@jbrockmendel
Contributor

This may be tangential, but have you tried moving test_simulate higher in the file? Elsewhere I've noticed tests that can be broken by re-ordering. If e.g. moving it causes the segfault to occur in a different test, that would be useful information.
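A variation on the same idea at the file level: generate reproducible orderings of the test files and run each one, so a failing order can be replayed from its seed. The helper shuffled_order and the file list are hypothetical; by default pytest collects files in the order they are given on the command line.

```python
import random


def shuffled_order(test_paths, seed):
    """Return a reproducible permutation of `test_paths` for the given seed."""
    order = list(test_paths)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(order)
    return order


# Hypothetical file names for illustration.
paths = ["test_simulate.py", "test_tools.py", "test_kalman.py"]
for seed in range(3):
    print(seed, " ".join(["pytest"] + shuffled_order(paths, seed)))
```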

@bashtage
Member

I could not deterministically trigger the exception. I triggered it by running the full test suite in a loop until it was hit. I instrumented the Cython code with lots of notifications and was never able to find an error. Considering this never happens on Linux or OSX, I suspect that it is either a bug in Visual Studio or MKL. The exception was always raised in an MKL DLL, which also suggests this. Ideally one would use 32 byte aligned memory, but this isn't easy with NumPy arrays, which only align on 16 bytes IIRC.
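The loop can be sketched roughly as follows. The function name run_until_crash and the placeholder command are assumptions for illustration; substitute the real test invocation for cmd.

```python
import subprocess
import sys

ACCESS_VIOLATION = 0xC0000005  # Windows status code seen in this issue


def run_until_crash(cmd, max_runs=100):
    """Run `cmd` repeatedly; return (run_number, returncode) of the first crash, else None."""
    for run in range(1, max_runs + 1):
        proc = subprocess.run(cmd)
        # Mask to 32 bits so signed and unsigned reports of the code compare equal.
        if proc.returncode & 0xFFFFFFFF == ACCESS_VIOLATION:
            return run, proc.returncode
    return None


# Placeholder command that always exits cleanly.
cmd = [sys.executable, "-c", "pass"]
print(run_until_crash(cmd, max_runs=3))
```

Running the suite in a fresh process each time also means a crash only kills that one run and not the loop itself.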

@bashtage
Member

FWIW my setup was current Cython on Python 3.6 as of October 2017, so probably NumPy 1.13.

@josef-pkt
Member

I think it's better to disable the simulate tests on Windows for the release, so users don't get a segfault as a first impression of the release when running the test suite.
And I will fix the pandas format compatibility issue.
I will leave the test failure on cvxopt. (We might have to look at it for the final release if Debian also has problems, which it had in the past with most likely buggy cvxopt.)

Then this should be fine for the release rc.
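A platform skip could look roughly like this. The sketch uses the standard library's unittest.skipIf so that it is self-contained; with pytest the equivalent marker is pytest.mark.skipif. The test class is a hypothetical placeholder, and this is not necessarily how the tests were disabled for the release.

```python
import sys
import unittest

ON_WINDOWS = sys.platform.startswith("win")


@unittest.skipIf(ON_WINDOWS, "intermittent access violation on Windows, see #3386")
class TestSimulate(unittest.TestCase):  # hypothetical placeholder test class
    def test_placeholder(self):
        self.assertEqual(1 + 1, 2)


# Run the class programmatically and report how many tests were skipped.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSimulate)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("skipped:", len(result.skipped))
```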

@ChadFulton
Member Author

Good news, maybe. I did find an illegal memory access issue in the simulation smoother that could cause segmentation faults in test_simulate.py, and the fix is in #4580.

I don't know if this fixes the ongoing problem or not since we haven't been able to reliably replicate the Appveyor segfault, but it seems promising.

@ChadFulton
Member Author

Closing as it appears that #4580 fixed the problem.
