
Segmentation Faults and Memory Errors when slicing large csr_matrix (75 million rows) #7966

Closed
8li opened this issue Oct 3, 2017 · 6 comments

8li commented Oct 3, 2017

For a large csr_matrix (75 million rows), I cannot slice all rows without running into a segmentation fault or memory error.

I tried to reproduce the error by creating a dummy csr matrix of similar density using sp.rand(), and matrices with as few as 30 million rows also produce segmentation faults (using sp.rand(30000000,4097,density=0.02,format='csr')).
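For scale, a rough memory estimate (my own arithmetic, not from the report) of what the sp.rand reproduction implies, assuming the usual float64 data and int32 index arrays of a CSR matrix:

```python
# Rough memory footprint of sp.rand(30000000, 4097, density=0.02, format='csr'),
# assuming float64 values, int32 column indices, and int32 row pointers.
rows, cols, density = 30_000_000, 4097, 0.02
nnz = int(rows * cols * density)            # expected number of stored values

data_bytes = nnz * 8                        # float64 data array
indices_bytes = nnz * 4                     # int32 column-index array
indptr_bytes = (rows + 1) * 4               # int32 row-pointer array

total_gb = (data_bytes + indices_bytes + indptr_bytes) / 1e9
print(nnz, round(total_gb, 1))              # ~2.46e9 nonzeros, ~29.6 GB
```

So the reproduction needs roughly 30 GB of RAM before any slicing happens, which is consistent with seeing memory errors as well as segfaults.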

Reproducing code example:

test_array.py:

import sys
import numpy as np
from e3fp.fingerprint.db import FingerprintDatabase as FpDb

fpdb = FpDb.load('all_0.5_1e-300_e4096.fpz')

for i in np.arange(fpdb.fp_num):
	x = fpdb.array[i,:]

The e3fp.fingerprint.db module is here; all_0.5_1e-300_e4096.fpz is a 7.8 GB file.

Error message:

Segmentation fault (no Python traceback); gdb backtrace below:

(gdb) run test_array.py
Starting program: /srv/home/ali/miniconda2/envs/nn+pip/bin/python test_array.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
get_csr_submatrix<long, unsigned short> (n_row=<optimized out>, n_col=<optimized out>, Ap=0x7fff7620e010, 
    Aj=0x7ff495f30010, Ax=0x7ff1dde78010, ir0=27662451, ir1=27662452, ic0=0, ic1=4097, Bp=0x144f334a0, 
    Bj=0x144f51e10, Bx=0xa422e0) at scipy/sparse/sparsetools/csr.h:1168
1168    scipy/sparse/sparsetools/csr.h: No such file or directory.
(gdb) bt
#0  get_csr_submatrix<long, unsigned short> (n_row=<optimized out>, n_col=<optimized out>, Ap=0x7fff7620e010, 
    Aj=0x7ff495f30010, Ax=0x7ff1dde78010, ir0=27662451, ir1=27662452, ic0=0, ic1=4097, Bp=0x144f334a0, 
    Bj=0x144f51e10, Bx=0xa422e0) at scipy/sparse/sparsetools/csr.h:1168
#1  0x00007fffe72bad88 in get_csr_submatrix_thunk (I_typenum=<optimized out>, T_typenum=<optimized out>, 
    a=<optimized out>) at scipy/sparse/sparsetools/csr_impl.h:2576
#2  0x00007fffe72b9c35 in call_thunk (ret_spec=118 'v', spec=0x7fffe75792e5 "iiIITiiii*V*V*W", 
    thunk=0x7fffe72ba8d0 <get_csr_submatrix_thunk(int, int, void**)>, args=0x7fffe7c8fd50)
    at scipy/sparse/sparsetools/sparsetools.cxx:359
#3  0x00007ffff7add1e5 in call_function (oparg=<optimized out>, pp_stack=0x7fffffffce78) at Python/ceval.c:4352
#4  PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2989
#5  0x00007ffff7adec3e in PyEval_EvalCodeEx (co=0x7fffe7c5a030, globals=<optimized out>, 
    locals=<optimized out>, args=<optimized out>, argcount=3, kws=0x144f32ea0, kwcount=0, defs=0x0, 
    defcount=0, closure=0x0) at Python/ceval.c:3584
#6  0x00007ffff7ade1f7 in fast_function (nk=<optimized out>, na=3, n=<optimized out>, pp_stack=0x7fffffffd098, 
    func=0x7fffe7c79f50) at Python/ceval.c:4447
#7  call_function (oparg=<optimized out>, pp_stack=0x7fffffffd098) at Python/ceval.c:4372
#8  PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2989
#9  0x00007ffff7adec3e in PyEval_EvalCodeEx (co=0x7fffe7c53e30, globals=<optimized out>, 
    locals=<optimized out>, args=<optimized out>, argcount=3, kws=0x6c62d8, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:3584
#10 0x00007ffff7ade1f7 in fast_function (nk=<optimized out>, na=3, n=<optimized out>, pp_stack=0x7fffffffd2b8, 
    func=0x7fffe7c79ed8) at Python/ceval.c:4447
#11 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd2b8) at Python/ceval.c:4372
#12 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2989
#13 0x00007ffff7adec3e in PyEval_EvalCodeEx (co=0x7fffe7c53cb0, globals=<optimized out>, 
    locals=<optimized out>, args=<optimized out>, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:3584
#14 0x00007ffff7a59a61 in function_call (func=0x7fffe7c79d70, arg=0x7feea5dce638, kw=0x0)
    at Objects/funcobject.c:523
#15 0x00007ffff7a29e93 in PyObject_Call (func=0x7fffe7c79d70, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2547
#16 0x00007ffff7a3c64f in instancemethod_call (func=0x7fffe7c79d70, arg=0x7feea5dce638, kw=0x0)
    at Objects/classobject.c:2602
#17 0x00007ffff7a29e93 in PyObject_Call (func=0x7fffe5bee280, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2547
#18 0x00007ffff7a96d47 in call_method (o=<optimized out>, name=<optimized out>, 
    nameobj=0x7ffff7dbc898 <cache_str.16340>, format=0x7ffff7b31e47 "(O)") at Objects/typeobject.c:1283
#19 0x00007ffff7ad69e7 in PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:1539
#20 0x00007ffff7adec3e in PyEval_EvalCodeEx (co=0x7ffff7f459b0, globals=<optimized out>, 
    locals=<optimized out>, args=<optimized out>, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:3584
#21 0x00007ffff7aded52 in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
    at Python/ceval.c:669
#22 0x00007ffff7aff450 in run_mod (arena=0x6781c0, flags=0x7fffffffdc40, locals=0x7ffff7f71168, 
    globals=0x7ffff7f71168, filename=<optimized out>, mod=0x636ed8) at Python/pythonrun.c:1376
#23 PyRun_FileExFlags (fp=0x699230, filename=<optimized out>, start=<optimized out>, globals=0x7ffff7f71168, 
    locals=0x7ffff7f71168, closeit=1, flags=0x7fffffffdc40) at Python/pythonrun.c:1362
#24 0x00007ffff7aff62f in PyRun_SimpleFileExFlags (fp=0x699230, filename=0x7fffffffe039 "test_array.py", 
    closeit=1, flags=0x7fffffffdc40) at Python/pythonrun.c:948
#25 0x00007ffff7b14fd4 in Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:645
#26 0x00007ffff6d1bb35 in __libc_start_main () from /lib64/libc.so.6
#27 0x0000000000400729 in _start ()

Scipy/Numpy/Python version information:

SciPy 0.19.1, NumPy 1.13.1, Python 2.7.13
pv commented Oct 3, 2017 via email

8li commented Oct 3, 2017

fpdb.array.check_format(True) (as well as fpdb.array.check_format()) gives the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-9e9a44999045> in <module>()
----> 1 fpdb.array.check_format(True)

/srv/home/ali/miniconda2/envs/nn+pip/lib/python2.7/site-packages/scipy/sparse/compressed.pyc in check_format(self, full_check)
    182                                         minor_name)
    183                 if np.diff(self.indptr).min() < 0:
--> 184                     raise ValueError("index pointer values must form a "
    185                                         "non-decreasing sequence")
    186

ValueError: index pointer values must form a non-decreasing sequence
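For context (my own sketch, not part of the thread), the invariant that fails here is simply that a CSR matrix's indptr array must be non-decreasing, because each consecutive difference indptr[i+1] - indptr[i] is the nonzero count of row i:

```python
def indptr_is_valid(indptr):
    """CSR indptr must be non-decreasing: each consecutive
    difference is a row's nonzero count, which cannot be negative."""
    return all(b >= a for a, b in zip(indptr, indptr[1:]))

print(indptr_is_valid([0, 3, 5, 9]))    # True: well-formed row pointers
print(indptr_is_valid([0, 3, -2, 9]))   # False: corrupted/overflowed pointers
```

Negative or decreasing indptr values therefore indicate corrupted row pointers, not merely an empty row.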

pv commented Oct 3, 2017 via email

8li commented Oct 4, 2017

So, the sparse matrix data in question was created using vstack on smaller arrays.

If I run check_format(True) on the separate, smaller arrays, they all pass:

check_format.py:

import sys
from e3fp.fingerprint.db import FingerprintDatabase as FpDb

for filename in sys.argv[1:]:
    fpdb = FpDb.load(filename)
    fpdb.array.check_format(True)

python check_format.py [files] completes without error.

However, if I try to combine them into a single sparse matrix using vstack, the ValueError shows up around the fifth database.

check_format_combined.py:

import sys
import scipy.sparse as sp
from e3fp.fingerprint.db import FingerprintDatabase as FpDb

array = None
arrays = []

for filename in sys.argv[1:]:
    fpdb = FpDb.load(filename)
    print "Checking sub array from {}".format(filename)
    fpdb.array.check_format(True)
    arrays.append(fpdb.array)
    array = sp.vstack(arrays)
    print "Checking vstacked array of size {},{}".format(array.shape[0],array.shape[1])
    array.check_format(True)

python check_format_combined.py [files] output below:

Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/0_0.5_1e-300_e4096.fpz
Checking vstacked array of size 4687126,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/1_0.5_1e-300_e4096.fpz
Checking vstacked array of size 9301821,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/2_0.5_1e-300_e4096.fpz
Checking vstacked array of size 13968005,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/3_0.5_1e-300_e4096.fpz
Checking vstacked array of size 18603442,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/4_0.5_1e-300_e4096.fpz
Checking vstacked array of size 23255493,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/5_0.5_1e-300_e4096.fpz
Checking vstacked array of size 27951284,4097
Traceback (most recent call last):
  File "check_format_combined_git.py", line 15, in <module>
    array.check_format(True)
  File "/srv/home/ali/miniconda2/envs/nn+pip/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 184, in check_format
    raise ValueError("index pointer values must form a "
ValueError: index pointer values must form a non-decreasing sequence

I tried it again with different files:

Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/a_0.5_1e-300_e4096.fpz
Checking vstacked array of size 4701894,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/b_0.5_1e-300_e4096.fpz
Checking vstacked array of size 9368698,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/c_0.5_1e-300_e4096.fpz
Checking vstacked array of size 14093041,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/d_0.5_1e-300_e4096.fpz
Checking vstacked array of size 18756458,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/e_0.5_1e-300_e4096.fpz
Checking vstacked array of size 23448599,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/f_0.5_1e-300_e4096.fpz
Checking vstacked array of size 28183472,4097
Traceback (most recent call last):
  File "check_format_combined_git.py", line 15, in <module>
    array.check_format(True)
  File "/srv/home/ali/miniconda2/envs/nn+pip/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 184, in check_format
    raise ValueError("index pointer values must form a "
ValueError: index pointer values must form a non-decreasing sequence

I wonder whether I am hitting some maximum size limit for this data structure. Previously, I noticed that a 30-million-row random csr matrix from sp.rand(30000000,4097,density=0.02,format='csr') would segfault, while a 25-million-row one from sp.rand(25000000,4097,density=0.02,format='csr') could still be created. Noticeably, the ValueErrors from running check_format_combined.py also crop up between 23 and 28 million rows.
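A quick estimate (my own arithmetic, not from the thread) is consistent with that 23–28 million row window: the cumulative nonzero count of a matrix with 4097 columns at 2% density crosses the signed 32-bit integer limit at roughly 26 million rows:

```python
# Estimate the row count at which cumulative nnz exceeds the int32 range
# for a matrix with 4097 columns at 2% density.
INT32_MAX = 2**31 - 1

def rows_at_int32_overflow(cols, density):
    """Approximate rows at which nnz ~ rows * cols * density hits INT32_MAX."""
    return INT32_MAX / (cols * density)

limit = rows_at_int32_overflow(4097, 0.02)
print(int(limit))  # ~26.2 million rows, inside the observed 23-28M window
```

That would explain why the 25-million-row sp.rand matrix works while the 30-million-row one does not.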

8li commented Oct 4, 2017

So upon reading into issue #3212, it seems that I am also running up against an int32 index limit, which I guess was fixed in #442.

I added print "Negative indptrs = {}".format(np.sum(array.indptr < 0)) to check_format_combined.py, and indeed there are negative values starting after the 4th dataset:

Checking vstacked array of size 23448599,4097
Checking sub array from /srv/home/ali/projects/e3fp/postprocess/combined/gpu2/f_0.5_1e-300_e4096.fpz
Negative indptrs = 617406
Checking vstacked array of size 28183472,4097

Is it possible that the logic vstack is using to choose between int32 and int64 for the indices is flawed in our case for some reason?

idx_dtype = get_index_dtype(maxval=max(shape))
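As a hypothetical illustration of why choosing the index dtype from max(shape) alone can go wrong: indptr must hold cumulative nonzero counts up to nnz, which can exceed the int32 range even when both dimensions fit comfortably. A pure-Python sketch of the resulting two's-complement wraparound (the dimension and density figures are taken from the runs above):

```python
def to_int32(value):
    """Emulate storing a Python int into a 32-bit signed integer
    (C-style two's-complement wraparound)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

shape_max = 28_183_472                       # largest dimension: fits int32 easily
approx_nnz = int(28_183_472 * 4097 * 0.02)   # cumulative count indptr must hold

print(to_int32(shape_max))   # unchanged: dimensions fit, so int32 is chosen
print(to_int32(approx_nnz))  # wraps negative, matching the bad indptr values
```

With max(shape) well under 2**31 the int32 dtype is selected, yet the final indptr entries overflow, producing exactly the negative values counted above.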

8li commented Oct 5, 2017

After stumbling upon #7871, I grabbed the SciPy 1.1.0.dev0+Unknown development build.

With this version, fpdb.array.check_format(True) no longer raises ValueErrors on matrices larger than 23 million rows, and slicing through the combined 75-million-row CSR matrix no longer segfaults.

Thanks, @pv!

@8li 8li closed this as completed Oct 5, 2017