Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Dot product of csr_matrix causes segmentation fault #3212

Closed
omerlevy opened this Issue Jan 14, 2014 · 18 comments

Comments

Projects
None yet
5 participants

I have two (scipy) CSR sparse matrices:

A (12414693, 235470)
B (235470, 48063)
Performing:

A.dot(B)
causes a segmentation fault.

Note that both matrices are extremely sparse; A contains "one-hot vectors", i.e. all-zero rows except for one element.

I am using scipy version 0.12.0.

Owner

pv commented Jan 14, 2014

The following information would be useful: (i) what is the nnz of A and B, (ii) what is the nnz (expected) of the matrix product?

As mentioned, A has 12414693 nnz values. B has 2678333 nnz values, an average of ~55.72 per row.

Member

jakevdp commented Jan 14, 2014

Just back-of-the-envelope, then: Each entry of dot(A,B) will have 235470 options for a "hit", and the probability of a hit is (2678333 / 235470 / 48063) = 2.37E-4.
This means the probability of a non-hit is (1 - 2.37E-4), and the probability of 235470 non-hits is (1 - 2.37E-4) ^ 235470 ~ 1E-25. So it is likely that every entry in the final matrix will be nonzero.
Your final matrix if of size 12414693 x 48063, so you're effectively trying to allocate ~4 x 10^13 entries, which is on the order of 10 TB double-precision (~15TB in sparse format because of the indices). I'd say that's probably where your problem lies 😄

Member

jakevdp commented Jan 14, 2014

Also, sparse matrices are currently limited to ~2^32 entries, because the indices are stored as int32. This is likely why you get a segfault rather than a memory error. I think I remember an issue/PR filed to try to upgrade this to int64, but I can't find it right now...

Contributor

argriffing commented Jan 14, 2014

maybe #442

Member

jakevdp commented Jan 14, 2014

Edit: I think I made a mistake here. The prob. of a hit is actually
(2678333 / 235470 / 48063) / 235470 = 1E-9
So the probability of 235470 non-hits is (1 - 1E-9) ^ 235470 = (1 - 3E-4)
So the number of nonzero entries should be 3E-4 * 12414693 x 48063 ~ 179 million. This could run into the int32 limit.

Member

jaimefrio commented Jan 14, 2014

@omerlevy If you run the code in my answer to your stackoveflow question, the number of non-zero entries of the resulting matrix will be c[-1].

Owner

pv commented Jan 14, 2014

The seg fault may also be caused by a failing C++ memory allocation. Some of the routines allocate workspace via stuff like std::vector<T> sums(n_col, 0);, and I don't think std::bad_alloc is caught in the SWIG wrappers.
EDIT: gh-3215 addresses this.

Needs: (i) reproducible simple example, and (ii) running in debugger to proceed further.

@jaimefrio c[-1] = 275764616

Owner

pv commented Jan 14, 2014

@omerlevy: please try to get a C stack trace (see https://wiki.python.org/moin/DebuggingWithGdb for instructions) and paste it to gist.github.com and post a link to it here.
Alternatively, write a small test program that reproduces the issue so that we can reproduce the issue.

Owner

pv commented Jan 16, 2014

@omerlevy: thanks --- that clarifies that the issue is not in c++ exceptions, but something else.

I did not manage to reproduce the issue in simple testing --- it would help here if you could upload the sparsity structure of your matrices somewhere on the net:
numpy.savez_compressed('dump.npz', A_indices=A.indices, A_indptr=A.indptr, B_indices=B.indices, B_indptr=B.indptr), assuming the data is in CSR format.

Thanks, I've uploaded it here:
http://www.filedropper.com/dump

Owner

pv commented Jan 16, 2014

@omerlevy: thanks, I'm able to reproduce the crash with this information.

Owner

pv commented Jan 16, 2014

The crash is due to 32-bit indices wrapping around. When tested with gh-442, it turns out that the matrix C = A.dot(B) would have nnz = 4570731912 (which takes 30+GB of memory; with gh-442 this fails gracefully with a MemoryError rather than crashing on systems with less memory than this available).

It might be possible to hack around this by using larger types for accumulators. OTOH, a better fix would be to finally get gh-442 merged.

Member

jaimefrio commented Jan 16, 2014

And of course 4570731912 - 2^32 = 275764616, the value in c[-1], which is positive because the indices manage to make it all the way to positive again after wrapping around...

Thank you all for the quick and helpful answers.
I will try to find a mathematical workaround for this issue in my application, that reduces the matrix's size.

Owner

pv commented Jan 19, 2014

gh-3224 converts the crash to a friendly error message

@pv pv closed this in #3224 Feb 1, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment