
Segmentation Faults and Memory Errors when slicing large csr_matrix (75 million rows) #7966

Closed
8li opened this issue Oct 3, 2017 · 6 comments

8li commented Oct 3, 2017

For a large csr_matrix (75 million rows), I cannot slice all rows without running into a segmentation fault or memory error.

I tried to reproduce the error by creating a dummy csr matrix of similar density using sp.rand(), and matrices with as few as 30 million rows also produce segmentation faults (using sp.rand(30000000,4097,density=0.02,format='csr')).
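For scale, a rough memory estimate (my own arithmetic, not from the report) of what the sp.rand reproduction implies, assuming the usual float64 data and int32 index arrays of a CSR matrix:

```python
# Rough memory footprint of sp.rand(30000000, 4097, density=0.02, format='csr'),
# assuming float64 values, int32 column indices, and int32 row pointers.
rows, cols, density = 30_000_000, 4097, 0.02
nnz = int(rows * cols * density)            # expected number of stored values

data_bytes = nnz * 8                        # float64 data array
indices_bytes = nnz * 4                     # int32 column-index array
indptr_bytes = (rows + 1) * 4               # int32 row-pointer array

total_gb = (data_bytes + indices_bytes + indptr_bytes) / 1e9
print(nnz, round(total_gb, 1))              # ~2.46e9 nonzeros, ~29.6 GB
```

So the reproduction needs roughly 30 GB of RAM before any slicing happens, which is consistent with seeing memory errors as well as segfaults.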

Reproducing code example:

test_array.py:

import sys
import numpy as np
from e3fp.fingerprint.db import FingerprintDatabase as FpDb

fpdb = FpDb.load('all_0.5_1e-300_e4096.fpz')

for i in np.arange(fpdb.fp_num):
	x = fpdb.array[i,:]

The e3fp.fingerprint.db module is here; all_0.5_1e-300_e4096.fpz is a 7.8 GB file.

Error message:

Segmentation fault (no Python traceback); gdb backtrace below:

(gdb) run test_array.py
Starting program: /srv/home/ali/miniconda2/envs/nn+pip/bin/python test_array.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
get_csr_submatrix<long, unsigned short> (n_row=<optimized out>, n_col=<optimized out>, Ap=0x7fff7620e010, 
    Aj=0x7ff495f30010, Ax=0x7ff1dde78010, ir0=27662451, ir1=27662452, ic0=0, ic1=4097, Bp=0x144f334a0, 
    Bj=0x144f51e10, Bx=0xa422e0) at scipy/sparse/sparsetools/csr.h:1168
1168    scipy/sparse/sparsetools/csr.h: No such file or directory.
(gdb) bt
#0  get_csr_submatrix<long, unsigned short> (n_row=<optimized out>, n_col=<optimized out>, Ap=0x7fff7620e010, 
    Aj=0x7ff495f30010, Ax=0x7ff1dde78010, ir0=27662451, ir1=27662452, ic0=0, ic1=4097, Bp=0x144f334a0, 
    Bj=0x144f51e10, Bx=0xa422e0) at scipy/sparse/sparsetools/csr.h:1168
#1  0x00007fffe72bad88 in get_csr_submatrix_thunk (I_typenum=<optimized out>, T_typenum=<optimized out>, 
    a=<optimized out>) at scipy/sparse/sparsetools/csr_impl.h:2576
#2  0x00007fffe72b9c35 in call_thunk (ret_spec=118 'v', spec=0x7fffe75792e5 "iiIITiiii*V*V*W", 
    thunk=0x7fffe72ba8d0 <get_csr_submatrix_thunk(int, int, void**)>, args=0x7fffe7c8fd50)
    at scipy/sparse/sparsetools/sparsetools.cxx:359
#3  0x00007ffff7add1e5 in call_function (oparg=<optimized out>, pp_stack=0x7fffffffce78) at Python/ceval.c:4352
#4  PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2989
#5  0x00007ffff7adec3e in PyEval_EvalCodeEx (co=0x7fffe7c5a030, globals=<optimized out>, 
    locals=<optimized out>, args=<optimized out>, argcount=3, kws=0x144f32ea0, kwcount=0, defs=0x0, 
    defcount=0, closure=0x0) at Python/ceval.c:3584
#6  0x00007ffff7ade1f7 in fast_function (nk=<optimized out>, na=3, n=<optimized out>, pp_stack=0x7fffffffd098, 
    func=0x7fffe7c79f50) at Python/ceval.c:4447
#7  call_function (oparg=<optimized out>, pp_stack=0x7fffffffd098) at Python/ceval.c:4372
#8  PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2989
#9  0x00007ffff7adec3e in PyEval_EvalCodeEx (co=0x7fffe7c53e30, globals=<optimized out>, 
    locals=<optimized out>, args=<optimized out>, argcount=3, kws=0x6c62d8, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:3584
#10 0x00007ffff7ade1f7 in fast_function (nk=<optimized out>, na=3, n=<optimized out>, pp_stack=0x7fffffffd2b8, 
    func=0x7fffe7c79ed8) at Python/ceval.c:4447
#11 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd2b8) at Python/ceval.c:4372
#12 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2989
#13 0x00007ffff7adec3e in PyEval_EvalCodeEx (co=0x7fffe7c53cb0, globals=<optimized out>, 
    locals=<optimized out>, args=<optimized out>, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:3584
#14 0x00007ffff7a59a61 in function_call (func=0x7fffe7c79d70, arg=0x7feea5dce638, kw=0x0)
    at Objects/funcobject.c:523
#15 0x00007ffff7a29e93 in PyObject_Call (func=0x7fffe7c79d70, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2547
#16 0x00007ffff7a3c64f in instancemethod_call (func=0x7fffe7c79d70, arg=0x7feea5dce638, kw=0x0)
    at Objects/classobject.c:2602
#17 0x00007ffff7a29e93 in PyObject_Call (func=0x7fffe5bee280, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2547
#18 0x00007ffff7a96d47 in call_method (o=<optimized out>, name=<optimized out>, 
    nameobj=0x7ffff7dbc898 <cache_str.16340>, format=0x7ffff7b31e47 "(O)") at Objects/typeobject.c:1283
#19 0x00007ffff7ad69e7 in PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:1539
#20 0x00007ffff7adec3e in PyEval_EvalCodeEx (co=0x7ffff7f459b0, globals=<optimized out>, 
    locals=<optimized out>, args=<optimized out>, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:3584
#21 0x00007ffff7aded52 in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
    at Python/ceval.c:669
#22 0x00007ffff7aff450 in run_mod (arena=0x6781c0, flags=0x7fffffffdc40, locals=0x7ffff7f71168, 
    globals=0x7ffff7f71168, filename=<optimized out>, mod=0x636ed8) at Python/pythonrun.c:1376
#23 PyRun_FileExFlags (fp=0x699230, filename=<optimized out>, start=<optimized out>, globals=0x7ffff7f71168, 
    locals=0x7ffff7f71168, closeit=1, flags=0x7fffffffdc40) at Python/pythonrun.c:1362
#24 0x00007ffff7aff62f in PyRun_SimpleFileExFlags (fp=0x699230, filename=0x7fffffffe039 "test_array.py", 
    closeit=1, flags=0x7fffffffdc40) at Python/pythonrun.c:948
#25 0x00007ffff7b14fd4 in Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:645
#26 0x00007ffff6d1bb35 in __libc_start_main () from /lib64/libc.so.6
#27 0x0000000000400729 in _start ()

Scipy/Numpy/Python version information:

SciPy 0.19.1, NumPy 1.13.1, Python 2.7.13
pv commented Oct 3, 2017 via email

8li commented Oct 3, 2017

fpdb.array.check_format(True) (as well as fpdb.array.check_format()) gives the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-9e9a44999045> in <module>()
----> 1 fpdb.array.check_format(True)

/srv/home/ali/miniconda2/envs/nn+pip/lib/python2.7/site-packages/scipy/sparse/compressed.pyc in check_format(self, full_check)
    182                                         minor_name)
    183                 if np.diff(self.indptr).min() < 0:
--> 184                     raise ValueError("index pointer values must form a "
    185                                         "non-decreasing sequence")
    186

ValueError: index pointer values must form a non-decreasing sequence
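For context (my own sketch, not part of the thread), the invariant that fails here is simply that a CSR matrix's indptr array must be non-decreasing, because each consecutive difference indptr[i+1] - indptr[i] is the nonzero count of row i:

```python
def indptr_is_valid(indptr):
    """CSR indptr must be non-decreasing: each consecutive
    difference is a row's nonzero count, which cannot be negative."""
    return all(b >= a for a, b in zip(indptr, indptr[1:]))

print(indptr_is_valid([0, 3, 5, 9]))    # True: well-formed row pointers
print(indptr_is_valid([0, 3, -2, 9]))   # False: corrupted/overflowed pointers
```

Negative or decreasing indptr values therefore indicate corrupted row pointers, not merely an empty row.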

pv commented Oct 3, 2017 via email

8li commented Oct 4, 2017

So, the sparse matrix data in question was created using vstack on smaller arrays.

If I run check_format(True) on the separate, smaller arrays, they all pass:

check_format.py:

import sys
from e3fp.fingerprint.db import FingerprintDatabase as FpDb

for filename in sys.argv[1:]:
    fpdb = FpDb.load(filename)
    fpdb.array.check_format(True)

python check_format.py [files] completes without error.

However, if I try to combine them into a single sparse matrix using vstack, the ValueError shows up around the fifth database.

check_format_combined.py:

import sys
import scipy.sparse as sp
from e3fp.fingerprint.db import FingerprintDatabase as FpDb

array = None
arrays = []

for filename in sys.argv[1:]:
    fpdb = FpDb.load(filename)
    print "Checking sub array from {}".format(filename)
    fpdb.array.check_format(True)
    arrays.append(fpdb.array)
    array = sp.vstack(arrays)
    print "Checking vstacked array of size {},{}".format(array.shape[0],array.shape[1])
    array.check_format(True)

python check_format_combined.py [files] output below:

Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/0_0.5_1e-300_e4096.fpz
Checking vstacked array of size 4687126,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/1_0.5_1e-300_e4096.fpz
Checking vstacked array of size 9301821,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/2_0.5_1e-300_e4096.fpz
Checking vstacked array of size 13968005,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/3_0.5_1e-300_e4096.fpz
Checking vstacked array of size 18603442,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/4_0.5_1e-300_e4096.fpz
Checking vstacked array of size 23255493,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/5_0.5_1e-300_e4096.fpz
Checking vstacked array of size 27951284,4097
Traceback (most recent call last):
  File "check_format_combined_git.py", line 15, in <module>
    array.check_format(True)
  File "/srv/home/ali/miniconda2/envs/nn+pip/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 184, in check_format
    raise ValueError("index pointer values must form a "
ValueError: index pointer values must form a non-decreasing sequence

I tried it again with different files:

Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/a_0.5_1e-300_e4096.fpz
Checking vstacked array of size 4701894,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/b_0.5_1e-300_e4096.fpz
Checking vstacked array of size 9368698,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/c_0.5_1e-300_e4096.fpz
Checking vstacked array of size 14093041,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/d_0.5_1e-300_e4096.fpz
Checking vstacked array of size 18756458,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/e_0.5_1e-300_e4096.fpz
Checking vstacked array of size 23448599,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/f_0.5_1e-300_e4096.fpz
Checking vstacked array of size 28183472,4097
Traceback (most recent call last):
  File "check_format_combined_git.py", line 15, in <module>
    array.check_format(True)
  File "/srv/home/ali/miniconda2/envs/nn+pip/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 184, in check_format
    raise ValueError("index pointer values must form a "
ValueError: index pointer values must form a non-decreasing sequence

I wonder whether I am hitting some maximum size limit for this data structure. Previously, I noticed that a 30-million-row random csr matrix from sp.rand(30000000,4097,density=0.02,format='csr') would segfault, while a 25-million-row one from sp.rand(25000000,4097,density=0.02,format='csr') could still be created. Noticeably, the ValueErrors from running check_format_combined.py also crop up between 23 and 28 million rows.
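A quick estimate (my own arithmetic, not from the thread) is consistent with that 23–28 million row window: the cumulative nonzero count of a matrix with 4097 columns at 2% density crosses the signed 32-bit integer limit at roughly 26 million rows:

```python
# Estimate the row count at which cumulative nnz exceeds the int32 range
# for a matrix with 4097 columns at 2% density.
INT32_MAX = 2**31 - 1

def rows_at_int32_overflow(cols, density):
    """Approximate rows at which nnz ~ rows * cols * density hits INT32_MAX."""
    return INT32_MAX / (cols * density)

limit = rows_at_int32_overflow(4097, 0.02)
print(int(limit))  # ~26.2 million rows, inside the observed 23-28M window
```

That would explain why the 25-million-row sp.rand matrix works while the 30-million-row one does not.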

8li commented Oct 4, 2017

So upon reading into issue #3212, it seems that I am also running up against an int32 index limit, which I guess was fixed in #442.

I added print "Negative indptrs = {}".format(np.sum(array.indptr < 0)) to check_format_combined.py, and indeed there are negative values starting after the 4th dataset:

Checking vstacked array of size 23448599,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/f_0.5_1e-300_e4096.fpz
Negative indptrs = 617406
Checking vstacked array of size 28183472,4097

Is it possible that the logic vstack is using to choose between int32 and int64 for the indices is flawed in our case for some reason?

idx_dtype = get_index_dtype(maxval=max(shape))
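As a hypothetical illustration of why choosing the index dtype from max(shape) alone can go wrong: indptr must hold cumulative nonzero counts up to nnz, which can exceed the int32 range even when both dimensions fit comfortably. A pure-Python sketch of the resulting two's-complement wraparound (the dimension and density figures are taken from the runs above):

```python
def to_int32(value):
    """Emulate storing a Python int into a 32-bit signed integer
    (C-style two's-complement wraparound)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

shape_max = 28_183_472                       # largest dimension: fits int32 easily
approx_nnz = int(28_183_472 * 4097 * 0.02)   # cumulative count indptr must hold

print(to_int32(shape_max))   # unchanged: dimensions fit, so int32 is chosen
print(to_int32(approx_nnz))  # wraps negative, matching the bad indptr values
```

With max(shape) well under 2**31 the int32 dtype is selected, yet the final indptr entries overflow, producing exactly the negative values counted above.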

8li commented Oct 5, 2017

After stumbling upon #7871, I grabbed the SciPy 1.1.0.dev0+Unknown development build.

With this version, fpdb.array.check_format(True) no longer raises ValueErrors on matrices larger than 23 million rows, and slicing through the combined 75-million-row CSR matrix no longer segfaults.

Thanks, @pv!

@8li 8li closed this as completed Oct 5, 2017