Segmentation Faults and Memory Errors when slicing large csr_matrix (75 million rows) #7966
Does `fpdb.array.check_format(True)` find issues (e.g. out-of-bound
indices) in the matrix data?
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-9e9a44999045> in <module>()
----> 1 fpdb.array.check_format(True)

/srv/home/ali/miniconda2/envs/nn+pip/lib/python2.7/site-packages/scipy/sparse/compressed.pyc in check_format(self, full_check)
    182                                  minor_name)
    183         if np.diff(self.indptr).min() < 0:
--> 184             raise ValueError("index pointer values must form a "
    185                              "non-decreasing sequence")
    186
ValueError: index pointer values must form a non-decreasing sequence
```
This likely implies the sparse matrix data structure is corrupted, and
the segfault likely follows from that. If so, the problem may be in the
library (e3fp.fingerprint.db) that constructs the sparse matrix.
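To illustrate the kind of corruption `check_format` detects (this is a hand-built toy matrix, not the original data): a small, valid CSR matrix whose index pointer is then deliberately overwritten with a decreasing sequence trips exactly this ValueError.

```python
import numpy as np
import scipy.sparse as sp

# Build a small, valid 2x2 CSR matrix from its raw (data, indices, indptr) arrays.
m = sp.csr_matrix((np.array([1.0, 2.0]),   # nonzero values
                   np.array([0, 1]),       # column indices
                   np.array([0, 1, 2])),   # valid, non-decreasing indptr
                  shape=(2, 2))

# Corrupt the index pointer: 0 -> 2 -> 1 is decreasing, which is invalid.
m.indptr = np.array([0, 2, 1])

try:
    m.check_format(full_check=True)
except ValueError as e:
    print(e)  # index pointer values must form a non-decreasing sequence
```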
So, the sparse matrix data in question was created with e3fp.fingerprint.db. If I run check_format.py:

```python
import sys

from e3fp.fingerprint.db import FingerprintDatabase as FpDb

for filename in sys.argv[1:]:
    fpdb = FpDb.load(filename)
    fpdb.array.check_format(True)
```

every individual array passes the check.
However, if I try to combine them into a single sparse matrix with check_format_combined.py:

```python
import sys

import scipy.sparse as sp

from e3fp.fingerprint.db import FingerprintDatabase as FpDb

arrays = []
for filename in sys.argv[1:]:
    fpdb = FpDb.load(filename)
    print "Checking sub array from {}".format(filename)
    fpdb.array.check_format(True)
    arrays.append(fpdb.array)

array = sp.vstack(arrays)
print "Checking vstacked array of size {},{}".format(array.shape[0], array.shape[1])
array.check_format(True)
```
I tried it again with different files:
I am wondering if I am hitting some maximum size limit for this data structure. Previously, I noticed that I would also get a segmentation fault for a 30-million-row random csr matrix built with `sp.rand`.
So upon reading into issue #3212, it seems that I am also running up against an int32 limit, which I guess was fixed in #442.
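The arithmetic is consistent with an int32 limit: the expected nonzero count of the `sp.rand(30000000, 4097, density=0.02)` matrix mentioned above already exceeds what a 32-bit index pointer can address (a rough back-of-the-envelope check, not part of the original report):

```python
import numpy as np

rows, cols = 30000000, 4097
# density = 0.02, written as a fraction to keep the arithmetic exact
expected_nnz = rows * cols * 2 // 100   # 2458200000 nonzeros
int32_max = np.iinfo(np.int32).max      # 2147483647

print(expected_nnz)              # 2458200000
print(expected_nnz > int32_max)  # True: indptr values overflow int32
```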
Is it possible that the logic at scipy/scipy/sparse/construct.py line 602 (commit 03b1092) is involved?
After stumbling upon #7871, I grabbed SciPy 1.1.0.dev0+Unknown. With this version, there are no more ValueErrors when I run check_format_combined.py. Thanks, @pv!
For a large csr_matrix (75 million rows), I cannot slice all rows without running into a segmentation fault or memory error.
I tried to reproduce the error by creating a dummy csr matrix of similar density with `sp.rand()`, but matrices with as few as 30 million rows produce segmentation faults as well (using `sp.rand(30000000, 4097, density=0.02, format='csr')`).

Reproducing code example:
test_array.py:
e3fp.fingerprint.db is here
all_0.5_1e-300_e4096.fpz is a 7.8G file
Error message:
Segmentation fault (no Traceback); gdb found below
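For reference, a scaled-down sketch of the same reproduction (3000 rows instead of 30 million, so the nonzero count stays far below the int32 limit and the script runs safely; this small case is an illustration, not the failing one):

```python
import scipy.sparse as sp

# Same shape pattern and density as the failing case, at 1/10000 the rows.
m = sp.rand(3000, 4097, density=0.02, format='csr')
m.check_format(full_check=True)  # passes at this scale

# Slice all rows, the operation that segfaults at full scale.
sliced = m[:, :]
print(sliced.shape, sliced.nnz == m.nnz)
```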
Scipy/Numpy/Python version information: