Crash from python.exe using linear regression of statsmodels #1443

Closed
bertrandhaut opened this Issue Mar 4, 2014 · 21 comments

@bertrandhaut

For a particular regression performed using the statsmodels package (sm.OLS(y,X).fit()), the python.exe executable crashes.

With other data (i.e. with other y, X matrices) it works well.
This crash can be reproduced using the attached script and with attached data.

The versions used are the ones provided by the Enthought Canopy distribution:

  • python 2.7.6 32 bits
  • numpy 1.8.0
  • statsmodels 0.5.0

According to Enthought, they were able to reproduce the problem on their 32-bit version, but not on their 64-bit version.

@bertrandhaut

I was unable to find how to add an attachment on GitHub.

I will post the files on another website this evening.

@jseabold
Member
jseabold commented Mar 4, 2014

My guess is that the crash is a numpy bug. There's no C code in our linear regressions that's not called from numpy.linalg. Can you confirm where the crash comes from? It will be a bit of work for me to get set up in a 32-bit environment. First, can you split it up?

Does the crash happen in

mod = sm.OLS(y, X)

or in

mod.fit()

? I can then hopefully isolate the code for you. Alternatively, if you don't need my help trying to isolate the line that crashes, can you step through the code and find out exactly which line causes the crash? I don't know the debugger tools on Windows well.
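
One possible way to locate the crashing line without a debugger is the faulthandler module, which dumps the Python traceback when the interpreter receives a fatal signal. This is only a sketch: faulthandler is in the standard library from Python 3.3, and to my knowledge it is available as a separate backport package for 2.7. The log file below is a hypothetical temporary file, not something from this report.

```python
import faulthandler
import tempfile

# Hypothetical log location; any writable file works.
crash_log = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False)

# On a fatal error (e.g. a segfault inside LAPACK), write the Python
# traceback of all threads to crash_log before the process dies.
faulthandler.enable(file=crash_log, all_threads=True)

# ... run the suspect code here, e.g.:
# mod = sm.OLS(y, X)
# res = mod.fit()

print("faulthandler enabled:", faulthandler.is_enabled())
print("traceback will be written to:", crash_log.name)
```

The file object has to stay open for as long as the handler is enabled, which is why it is kept in a module-level variable here.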

@jseabold
Member
jseabold commented Mar 4, 2014

Also, does

import numpy as np
np.test()

Pass for you?

@argriffing

If numpy is installed correctly then I would guess this is feeding nan or inf into svd or some other LAPACK function. If this is the case, numpy devs aren't too keen to call this a numpy bug so maybe it would have to be called a bug in whatever LAPACK is used. Maybe lapack-lite or maybe MKL. But MKL says this should not be considered an MKL bug because the caller should know better than to pass infs or nans to LAPACK...
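
A quick way to test that guess is to count the special values before fitting. A minimal sketch, where describe_nonfinite is a hypothetical helper name and X_demo is a synthetic stand-in, not the data from this report:

```python
import numpy as np

def describe_nonfinite(a):
    """Count NaN and +/-inf entries in an array (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    return {
        "nan": int(np.isnan(a).sum()),
        "inf": int(np.isinf(a).sum()),
        "all_finite": bool(np.isfinite(a).all()),
    }

# Synthetic stand-in for the design matrix.
X_demo = np.array([[1.0, 2.0], [3.0, np.nan], [np.inf, 6.0]])
report = describe_nonfinite(X_demo)
print(report)
```

Running the same helper on the real y and X would show whether LAPACK is being handed anything it is documented not to handle.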

@jseabold
Member
jseabold commented Mar 4, 2014

Yeah, that's what I suspected too. You can also try

mod = sm.OLS(y, X, missing='drop')
res = mod.fit()

But I'd really prefer to see numpy do something sensible about this rather than have it look like a segfault from our end.

@josef-pkt
Member

I would be glad to have a clear example to see where this happens.
Even if it's numpy's svd, pinv, then we should already have some checks and other code that should raise, e.g. I think we use scipy.linalg.svdvals first for df_model, IIRC.
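
For illustration, a check of that kind could look roughly like the sketch below. It uses np.linalg.svd with compute_uv=False as a stand-in for scipy.linalg.svdvals, and singular_value_summary is a hypothetical name, not a statsmodels function. The tolerance follows the usual "largest singular value times max dimension times machine epsilon" convention. Note that this still calls into LAPACK, so it would not protect against a crash inside the SVD routine itself.

```python
import numpy as np

def singular_value_summary(exog):
    """Return singular values and numerical rank, raising on NaN/inf input."""
    exog = np.asarray(exog, dtype=float)
    if not np.isfinite(exog).all():
        raise ValueError("exog contains NaN or inf")
    # Singular values only; no U or V matrices are computed.
    s = np.linalg.svd(exog, compute_uv=False)
    tol = s.max() * max(exog.shape) * np.finfo(exog.dtype).eps
    rank = int((s > tol).sum())
    return s, rank

# Synthetic, rank-deficient design: third column = first + second.
X_demo = np.array([[1.0, 2.0, 3.0],
                   [1.0, 3.0, 4.0],
                   [1.0, 4.0, 5.0],
                   [1.0, 5.0, 6.0]])
s, rank = singular_value_summary(X_demo)
print(s, rank)
```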

@jseabold
Member
jseabold commented Mar 4, 2014

We need to replicate to be sure that this is really the issue, but...

Assuming that it is, given our decision to not check for missing values by default, then I really think this is on numpy not to do things that lead to a segfault. Especially given @argriffing's comments from the MKL documentation in the linalg.svd thread on numpy-discussion.

WARNING
LAPACK routines assume that input matrices do not contain IEEE 754 special values such as INF or NaN values. Using these special values may cause LAPACK to return unexpected results or become unstable.

@argriffing

also fwiw, comments on answers here suggest that enthought canopy indeed uses mkl

Edit: and from a better source than random stackexchange comments:

the developer license that we purchased from Intel gives us the right to redistribute the MKL as part of our offering.

Also while I'm linking things, here's the original source of the MKL warning jseabold quoted above.

@jseabold
Member
jseabold commented Mar 4, 2014

Yeah, that was my assumption for the above.

@bertrandhaut

To answer some of the questions above:

  • It's during the call to the ".fit()" method that the crash occurs
  • The np.test() leads to "OK (KNOWNFAIL=9, SKIP=7)"
  • The "mod = sm.OLS(y, X, missing='drop')" leads to the same error.
  • I've checked that the y and X matrices do not contain any inf and/or nan

@bertrandhaut

The crash happens at linalg.py::1327 u, s, vt = gufunc(a, signature=signature, extobj=extobj)
where "gufunc = _umath_linalg.svd_n_s"

I also think that the Canopy distribution uses the Intel MKL.
Doing the same type of work in MATLAB on that matrix (i.e. mdl = fitlm(X,y,'intercept',false)) works fine, and MATLAB also uses the Intel MKL

version -blas
ans =
Intel(R) Math Kernel Library Version 11.0.2 Product Build 20130123 for 32-bit applications

version -lapack
ans =
Intel(R) Math Kernel Library Version 11.0.2 Product Build 20130123 for 32-bit applications
Linear Algebra PACKage Version 3.4.1

@argriffing

@bertrandhaut Thanks for this information! It seems consistent with our speculation above. More information will probably be needed before the bug can be tracked down. For example, y and X will probably be needed (even if these arrays do not have inf or nan). I personally do not have access to Windows, MKL, or 32-bit python, so I will not be able to reproduce it but I would be curious to see more of the debug trace information if this is available.

In the short term I think the idea will be to add more checks where statsmodels calls various numpy functions, and for the longer term I think we will want to find a smallish example matrix that reproducibly segfaults 32-bit MKL with numpy svd with the idea of eventually fixing the problems that cause numpy calls to segfault.
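
As a rough sketch of what such a short-term check might look like (checked_pinv is a hypothetical wrapper, not existing statsmodels code): it refuses non-finite input before anything reaches LAPACK. It would not have caught this particular report if the data really contain no inf or nan, but it would turn the nan/inf case into an ordinary Python exception.

```python
import numpy as np

def checked_pinv(a):
    """Hypothetical wrapper: reject NaN/inf before calling np.linalg.pinv."""
    a = np.asarray(a, dtype=float)
    if not np.isfinite(a).all():
        raise ValueError("array must not contain NaN or inf")
    return np.linalg.pinv(a)

good = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
print(checked_pinv(good))

try:
    checked_pinv(np.array([[1.0, np.nan], [0.0, 1.0]]))
except ValueError as exc:
    print("rejected:", exc)
```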

@bertrandhaut

The files to reproduce the error are there:
https://docs.google.com/file/d/0Bzz_ZaP_wS_HOTJra3ZJM1d6ckk/edit

@jseabold
Member
jseabold commented Mar 5, 2014

Thanks. Can you run exactly this and see which line fails?

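import numpy as np
import statsmodels.api as sm

mod = sm.OLS(y, X)

# whitened endog/exog, as used internally by fit()
endog = mod.wendog
exog = mod.wexog
pinv_wexog = np.linalg.pinv(exog)
ncp = np.dot(pinv_wexog, np.transpose(pinv_wexog))
beta = np.dot(pinv_wexog, endog)
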
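
For anyone who wants to try the same steps without statsmodels or the reported data, here is a numpy-only sketch on synthetic inputs. X_demo, y_demo and true_beta are made up for illustration; on a full-rank design the product pinv(X) pinv(X)' should match inv(X'X).

```python
import numpy as np

# Synthetic regression problem: intercept plus one regressor.
rng = np.random.RandomState(0)
n_obs = 50
X_demo = np.column_stack([np.ones(n_obs), rng.standard_normal(n_obs)])
true_beta = np.array([1.0, 2.0])
y_demo = X_demo.dot(true_beta) + 0.01 * rng.standard_normal(n_obs)

# Same three steps as above, on the synthetic data.
pinv_X = np.linalg.pinv(X_demo)          # this is the call that goes through SVD
normalized_cov = pinv_X.dot(pinv_X.T)    # equals inv(X'X) for full-rank X
beta = pinv_X.dot(y_demo)                # least-squares coefficients

print(beta)
```
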
@jseabold
Member
jseabold commented Mar 5, 2014

I'm almost certain it's the pinv. I just want to be sure that it's in that pinv, and not some other call to svd somewhere in the results.
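
To make the pinv/svd connection concrete: the pseudoinverse can be built from the SVD as V diag(1/s) U', with small singular values zeroed out. The sketch below is a simplified illustration, not numpy's actual source; pinv_via_svd and its rcond default are assumptions made for the example.

```python
import numpy as np

def pinv_via_svd(a, rcond=1e-15):
    """Pseudoinverse from a thin SVD (illustrative, not numpy's implementation)."""
    a = np.asarray(a, dtype=float)
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    # Invert only singular values above the relative cutoff.
    cutoff = rcond * s.max()
    s_inv = np.zeros_like(s)
    mask = s > cutoff
    s_inv[mask] = 1.0 / s[mask]
    # V * diag(s_inv) * U'
    return (vt.T * s_inv).dot(u.T)

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(np.allclose(pinv_via_svd(A), np.linalg.pinv(A)))
```

So a crash inside the SVD routine would surface in any caller that computes a pseudoinverse this way.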

@jseabold
Member
jseabold commented Mar 5, 2014

I'm also not entirely certain that we can reproduce given the results from savetxt. We may need .npz files or pickled arrays. I'm going to post a bug to the numpy tracker and see what turns up.
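
A sketch of the .npz route, which stores the arrays in binary form so the values come back bit for bit. The arrays here are synthetic and the in-memory buffer is only for the demonstration; in practice a file name (for example a hypothetical "ols_crash_data.npz") would be passed instead.

```python
import io
import numpy as np

# Synthetic stand-ins for the real y and X.
rng = np.random.RandomState(0)
X_demo = rng.standard_normal((5, 3))
y_demo = rng.standard_normal(5)

# Save both arrays into one .npz archive (in memory here).
buffer = io.BytesIO()
np.savez(buffer, X=X_demo, y=y_demo)

# Load them back and confirm the round trip is exact.
buffer.seek(0)
loaded = np.load(buffer)
X_back, y_back = loaded["X"], loaded["y"]
print(np.array_equal(X_back, X_demo), np.array_equal(y_back, y_demo))
```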

@jseabold
Member
jseabold commented Mar 5, 2014

Or rather, can you just check whether np.linalg.pinv(X) also crashes for you?

@bertrandhaut

Indeed, it's the call to .pinv which causes the crash.

@jseabold
Member
jseabold commented Mar 5, 2014

Great, thanks.

@jseabold
Member
jseabold commented Mar 5, 2014

You might also follow up with Enthought and let them know it's a numpy problem, not statsmodels, if you've filed a bug report there. They should have some interest in getting to the bottom of this.

@jseabold jseabold added this to the 0.6 milestone Mar 5, 2014
@jseabold
Member

Closed in numpy/numpy@5a3b0ab

There's nothing we can do about this. You'll have to upgrade/downgrade numpy to get around it AFAICT.

@jseabold jseabold closed this Mar 12, 2014