Segfault when running statsmodels.tsa.stattools.adfuller #4703

Closed
lucashu1 opened this issue Jun 1, 2018 · 13 comments


@lucashu1

lucashu1 commented Jun 1, 2018

Hi, I'm trying to run the adfuller stationarity test on a 50,000-element sequence, but I keep getting a segmentation fault on the adfuller() call. I've tried passing the input in as a pandas Series and as a regular Python list. There are no NaN/infinite values in the sequence. Besides the input sequence itself, all other arguments were left at their defaults.

The error message is just Segmentation fault (core dumped).

When I run it through GDB, I get:

Installing openjdk unwinder
Traceback (most recent call last):
  File "/usr/share/gdb/auto-load/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so-gdb.py", line 52, in <module>
    class Types(object):
  File "/usr/share/gdb/auto-load/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so-gdb.py", line 66, in Types
    nmethodp_t = gdb.lookup_type('nmethod').pointer()
gdb.error: No type named nmethod.

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fff2cb552b4 in ?? ()
(gdb) bt
#0  0x00007fff2cb552b4 in ?? ()
#1  0x0000000000000246 in ?? ()
#2  0x00007fff2cb55160 in ?? ()
#3  0x00007fffd5c8b2ac in ?? ()
   from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#4  0x00007fffffffa110 in ?? ()
#5  0x00007fffd57bf08d in ?? ()
   from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

I know for a fact that the segfault occurs during the call to adfuller(). Any thoughts on what could be going wrong? I'm running this in a Conda Python 3.6 environment on Ubuntu 16.04.

@josef-pkt
Member

50,000 nobs might be too large for the default lag search; try autolag=None, maxlag=10, IIRC.
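
For concreteness, here is a minimal sketch of that suggestion; the random-walk series is a synthetic stand-in for the real data:

import numpy as np
from statsmodels.tsa.stattools import adfuller

x = np.random.randn(50_000).cumsum()  # synthetic random walk standing in for the real series

# Skip the automatic lag search and cap the lag order at 10
adf_stat, pvalue, usedlag, nobs, crit = adfuller(x, maxlag=10, autolag=None)
print(adf_stat, pvalue, crit)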

How long does adfuller run before segfaulting, i.e. almost instantly or after several seconds?
Also check by how much the memory increases; adfuller is not memory efficient for a large number of lags in the lag search.
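
One way to measure that memory growth is a sketch like the following, using the stdlib resource module (Linux-only; ru_maxrss is reported in kilobytes there, and the random series is a placeholder):

import resource
import numpy as np
from statsmodels.tsa.stattools import adfuller

values = np.random.randn(50_000).cumsum()  # stand-in for the real 50,000-point series

before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
result = adfuller(values)  # default lag search, the memory-hungry path
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak RSS grew by about %.1f MB' % ((after - before) / 1024.0))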

inf and nan can cause errors in the LAPACK library. But if you don't have any in your data, then something is going wrong during the lag search, I guess.
But it's just plain OLS, so my only real guess is that there is a memory problem (which should raise a MemoryError if everything goes well).
But make sure the data in the call doesn't have NaNs; e.g. if you do some pre-processing with pandas, then pandas could introduce NaNs.

@lucashu1
Author

lucashu1 commented Jun 1, 2018

It segfaults almost instantly, even with autolag=None, maxlag=10. You might be right about it being a memory issue, though.

I tried running two versions of the script to test this:

  1. Read the sequence data from the original source (HDFS) into a Pandas dataframe, do some preprocessing, pass the sequence into adfuller() --> get a segfault
  2. Write 1 script to read the sequence and save the sequence as an array to a .npy file, then have a separate script that just reads the .npy file and runs adfuller on the sequence --> no segfault

The first version does involve creating a connection to HDFS and using pyspark to create a local SparkSession, which might take up some extra memory/CPU. Commenting out the line that initializes the HDFS connection and the SparkSession makes the segfault go away. I'll see if I can find a workaround.

@josef-pkt
Member

Can you inspect the data just before it goes into adfuller, e.g. using pdb or printing some information? Mainly as a sanity check that the data is really what it's supposed to be.

@lucashu1
Author

lucashu1 commented Jun 1, 2018

It's definitely not the data that's the problem -- I checked for NaNs, infinities, etc. All the values themselves are fine.

Even when just loading the values from the .npy file, the segfaults start as soon as I add the line that initializes an HDFS connection and creates the SparkSession, which leads me to believe the issue has to be something related to that.

Update: I've narrowed the issue down to the creation of a pyarrow.HadoopFileSystem connection.

Without the line that initializes the HDFS connection, everything works fine. As soon as I add the line from pyarrow import HadoopFileSystem; hdfs = HadoopFileSystem([host], [port]), I get segfaults when calling adfuller. I haven't really seen anything like this before...

@josef-pkt
Member

Is the data coming from that HDFS?

Another possible check is to make a copy of the data with np.array(data) before calling adfuller to make sure the underlying data/memory is really a numpy array, in case it's something that looks like a numpy array but doesn't behave like it when it goes into linalg/LAPACK.
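
A sketch of that defensive copy; `values` here is a placeholder for whatever array-like object comes out of the pipeline:

import numpy as np
from statsmodels.tsa.stattools import adfuller

values = list(np.random.randn(1000))  # placeholder for the pipeline output

clean = np.array(values, dtype=float)  # np.array copies into a fresh, contiguous ndarray
print(adfuller(clean))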

@lucashu1
Author

lucashu1 commented Jun 2, 2018

When I ran the debugging tests, the data was coming straight from the .npy file that I had saved. The .npy data was originally pulled from HDFS, although I'm pretty sure that's not the cause of the issue.

The debugging script I'm running is:

import numpy as np
from statsmodels.tsa.stattools import adfuller
from pyarrow import HadoopFileSystem

# hdfs = HadoopFileSystem(<host>, <port>)

# ADFuller
print('Loading values...')
values = np.load('temp-values.npy')
print('Calling adfuller...')
print(adfuller(values))

When I leave the hdfs = ... line in, I get a segfault as soon as adfuller() is called. When I comment it out, adfuller() runs just fine, and the result prints out correctly (it takes about 2 seconds).

I do get this warning when I initiate the HDFS connection (WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable), and the file /usr/share/gdb/auto-load/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so-gdb.py is mentioned in the GDB backtrace (see above), but I'm not sure how that would affect the call to adfuller().

Another Update:
This script with only a tiny sequence:

import numpy as np
from statsmodels.tsa.stattools import adfuller
from pyarrow import HadoopFileSystem

hdfs = HadoopFileSystem(<host>, <port>)

x = [1, 1, 1, 1, 1, 1]

# ADFuller
print('Calling adfuller...')
print(adfuller(x, maxlag=1))

still causes adfuller() to segfault immediately.

Does statsmodels interact with Java at all behind the scenes? Maybe that has something to do with it?

@josef-pkt
Member

josef-pkt commented Jun 2, 2018

adfuller is pure numpy-based in-memory computation; the bulk of the actual computation is in the Fortran linalg libraries, which might be OpenBLAS on your system.
Edit: conda normally ships MKL as the BLAS/LAPACK library.

For example, you could try replacing the adfuller call with np.linalg.svd(x). If x is 1-dim it wouldn't have much to do, but it would have to access the same Fortran linalg libraries.
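
A sketch of that smoke test. Note that np.linalg.svd wants a 2-D input, so the 1-dim series is reshaped into a single column first; temp-values.npy is the file from the repro above:

import numpy as np

x = np.load('temp-values.npy')

# Exercises the same Fortran linalg (LAPACK) code path that adfuller's OLS uses
u, s, vt = np.linalg.svd(x.reshape(-1, 1), full_matrices=False)
print(s)  # if this also segfaults, the problem is in the BLAS/LAPACK layer, not statsmodels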

I don't know anything about Hadoop, but it looks like adfuller isn't running in a standard CPython environment anymore.

@lucashu1
Author

lucashu1 commented Jun 3, 2018

np.show_config() returns:

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/home/root/.conda/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/home/root/.conda/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/home/root/.conda/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/home/root/.conda/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

So it looks like it's OpenBLAS. For context, I'm running all this inside an Ubuntu 16.04 Docker image (inside a Kubeflow/JupyterHub pod), so some configs might be different. I'm not too familiar with the inner workings of numpy myself.

I guess my next step is to see if I can find a way to get a full backtrace to figure out where exactly things are going wrong.

@jbrockmendel
Contributor

Does statsmodels interact with Java at all behind the scenes? Maybe that has something to do with it?

Based on the gdb output, I was going to ask if you were using Jython or something. IIRC Hadoop uses Java.

I guess my next step is to see if I can find a way to get a full backtrace to figure out where exactly things are going wrong.

I've never gotten this to give me anything useful personally, but supposedly import faulthandler; faulthandler.enable() is intended for this kind of thing.
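
A sketch of wiring that into the top of the failing script:

import faulthandler
faulthandler.enable()  # on SIGSEGV, dump the Python-level traceback to stderr

# ... then the usual imports and the pyarrow/adfuller code that crashes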

@bashtage
Member

What happens when you run

import numpy as np
from statsmodels.tsa.stattools import adfuller

# ADFuller on a plain random series
n = 100
values = np.random.randn(n)
print(adfuller(values))

And then what happens with n = 50000?

If this works then the problem is somewhere else. It is possible that another extension module (pyarrow) is clobbering some memory which leads to the segfault.

@lucashu1
Author

@bashtage both those scenarios worked just fine. It must be something related to pyarrow (or the underlying HDFS connection) then.

@bashtage
Member

bashtage commented Aug 2, 2018

@ChadFulton @josef-pkt closeable.

@ChadFulton
Member

Thanks @bashtage
