ENH: sparse: enable 64-bit index arrays & nnz > 2**31 #442

Merged
merged 18 commits into scipy:master from pv:ticket/1307 on Feb 2, 2014

Conversation

6 participants
@pv
Member

pv commented Feb 18, 2013

Currently Scipy's sparse matrices are limited to nnz < 2**31; crashes and other undefined behavior occur if nnz grows beyond that and the int32 indices start to wrap around.

This PR enables sparse matrices to use either int32 or int64 as the data type of
the indices. The data type is chosen automatically based on the nnz of the data
and requirements of the different sparse matrix types.
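In outline, the selection works like the sketch below. The helper is named get_index_dtype in this PR (see the commit messages later in the thread), but the body here is illustrative, not the actual implementation:

```python
import numpy as np

def get_index_dtype(arrays=(), maxval=None):
    """Pick int32 unless some index value cannot be represented in it (sketch)."""
    int32max = np.iinfo(np.int32).max
    if maxval is not None and maxval > int32max:
        return np.int64
    for arr in arrays:
        arr = np.asarray(arr)
        # Upcast if any existing index array holds values outside int32 range.
        if arr.size > 0 and (arr.max() > int32max or arr.min() < -int32max - 1):
            return np.int64
    return np.int32
```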

The threshold selection is written so that it can be monkeypatched to a lower
value for testing purposes. The 64-bit-specific test suite, however, currently only
consists of running the existing tests with lower thresholds and randomly assigned
index array data types (these tests pass).

The direct linear algebra solvers (SuperLU, probably also UMFPACK) do not work with int64 indices, but implementing that is out of scope for this PR. For matrices with nnz < 2**31, things should be backwards compatible.

However, more tests are needed, so this PR is still somewhat of a WIP.
Some help would be appreciated:

Write tests / audit the code to check that the data type selection
works appropriately where needed, e.g. in assignment.

Closes Trac #1307

@larsmans


Contributor

larsmans commented Feb 21, 2013

Good feature in principle, but won't this break current Cython code that works with scipy.sparse? We have quite a lot of that in scikit-learn.

@pv


Member

pv commented Feb 21, 2013

It won't break it --- the matrices switch to int64 only when the amount of data is so large that int32 won't cut it any more, so third-party code continues to work as long as the input data is small enough. To handle larger nnz, third-party code would need to be written for int64 anyway.

(As currently written, int64 is 'viral', so int32-index matrix times int64-index matrix is also int64-index. This can be adjusted, however, by checking nnz.)
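The 'viral' rule and the possible nnz-based adjustment can be sketched as follows (the helper name and signature are hypothetical, for illustration only):

```python
import numpy as np

def result_index_dtype(a_dtype, b_dtype, result_nnz=None):
    """Hypothetical sketch of the index-dtype rule for binary operations."""
    if result_nnz is not None and result_nnz <= np.iinfo(np.int32).max:
        # The adjustment mentioned above: downcast when the result's nnz
        # is known to fit in int32.
        return np.dtype(np.int32)
    if np.dtype(a_dtype) == np.int64 or np.dtype(b_dtype) == np.int64:
        return np.dtype(np.int64)  # int64 is 'viral'
    return np.dtype(np.int32)
```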

@larsmans


Contributor

larsmans commented Feb 21, 2013

Alright. I guess we can just do a check for int32 indices for the time being; huge sparse matrices would have to be broken up into batches anyway.
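Such a check could look like the guard below (a hypothetical helper in the spirit of this comment, not an actual scikit-learn function; it works on any object with indices/indptr attributes, such as CSR/CSC matrices):

```python
import numpy as np

def check_int32_indices(X):
    """Raise if a CSR/CSC-like matrix does not use int32 index arrays."""
    if X.indices.dtype != np.int32 or X.indptr.dtype != np.int32:
        raise ValueError("only int32 sparse indices are supported; "
                         "split the matrix into smaller batches")
```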

@larsmans larsmans referenced this pull request in scikit-learn/scikit-learn Feb 22, 2013

Merged

WIP: LinearClassifierMixin support for sparse coef_ #1702

@larsmans larsmans referenced this pull request in scikit-learn/scikit-learn Aug 29, 2013

Open

[WIP] Matrix factorization with missing values #2387

@larsmans


Contributor

larsmans commented Aug 29, 2013

Just a question: there are now npy_int{32,64} instantiations. Wouldn't it be wiser to have plain int (backward compatible) and/or npy_intp versions? That way, there would only be one build on a 32-bit box, where 64-bit indices don't make much sense, and there's no need to hardcode the number of bits.

@pv


Member

pv commented Jan 14, 2014

Rebased, plus some additional fixes. The __setitem__ and stacking calls still need checking.

@larsmans: intp would probably work. However, I don't see a good reason for dropping support for 64-bit indices on 32-bit platforms, as that adds a platform dependency that can cause problems e.g. with pickled matrices and so on (also small matrices can end up with 64-bit indices).
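The portability issue is that np.intp is a platform-sized alias, while a pickle records the concrete dtype the writing platform used; a quick illustration:

```python
import numpy as np

# np.intp is 32-bit on a 32-bit Python and 64-bit on a 64-bit one, so an array
# created with dtype=np.intp pickles with whatever concrete width the writing
# platform had -- the reading platform may get index arrays it does not expect.
print("intp width (platform-dependent):", np.dtype(np.intp).itemsize * 8, "bits")

# Fixed-width dtypes are the same everywhere:
assert np.dtype(np.int32).itemsize == 4
assert np.dtype(np.int64).itemsize == 8
```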

@larsmans


Contributor

larsmans commented Jan 14, 2014

Good point. In scikit-learn we switched to intp in a couple of places and now some of the pickles are no longer portable :(

@coveralls

coveralls Jan 14, 2014

Coverage Status

Changes unknown when pulling 4aa2be4 on pv:ticket/1307 into scipy:master.



@pv


Member

pv commented Jan 16, 2014

Another rebase. This should handle all index usage in scipy.sparse (grepping for intc returns no unnecessary hits any more).

Unit tests exercising the >32-bit index space require quite a lot of memory.

Currently, there is a unit test that reduces the threshold of the type switch (to a much smaller nnz of ~10) or makes it random, and runs the sparse base test suite. This probably covers most of the basic operations.
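The pattern is roughly the following self-contained sketch; in the actual test suite the patched attribute lives inside scipy.sparse, and the stand-in helper below is illustrative:

```python
import numpy as np

def get_index_dtype(arrays=(), maxval=None):
    """Stand-in for the real selection helper: int32 below the threshold."""
    if maxval is not None and maxval > np.iinfo(np.int32).max:
        return np.int64
    return np.int32

def lowered_threshold_version(threshold):
    """Return a drop-in replacement that switches to int64 above `threshold`,
    so that tiny test matrices already exercise the 64-bit code paths."""
    def patched(arrays=(), maxval=None):
        if maxval is not None and maxval > threshold:
            return np.int64
        return np.int32
    return patched

# A test would then monkeypatch the module attribute, e.g.:
#   some_module.get_index_dtype = lowered_threshold_version(10)
# and re-run the base sparse test suite.
```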

@coveralls


coveralls Jan 16, 2014

Coverage Status

Coverage remained the same when pulling fc81527 on pv:ticket/1307 into 3b30d25 on scipy:master.


@rgommers


Member

rgommers commented Jan 18, 2014

Some casting issues for 64-bit indices on a 32-bit system:

======================================================================
ERROR: test_base.Test64Bit.test_resiliency_limit(<class 'test_base.TestLIL'>, 'test_tobsr')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rgommers/.local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/rgommers/Code/numpy/numpy/testing/decorators.py", line 146, in skipper_func
    return f(*args, **kwargs)
  File "<string>", line 2, in check
  File "/home/rgommers/Code/bldscipy/scipy/sparse/tests/test_base.py", line 90, in deco
    return func(*a, **kw)
  File "/home/rgommers/Code/bldscipy/scipy/sparse/tests/test_base.py", line 3376, in check
    getattr(instance, method_name)()
  File "/home/rgommers/Code/bldscipy/scipy/sparse/tests/test_base.py", line 1214, in test_tobsr
    assert_equal(fn(blocksize=(X,Y)).todense(), A)
  File "/home/rgommers/Code/bldscipy/scipy/sparse/base.py", line 599, in todense
    return np.asmatrix(self.toarray(order=order, out=out))
  File "/home/rgommers/Code/bldscipy/scipy/sparse/compressed.py", line 796, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/rgommers/Code/bldscipy/scipy/sparse/bsr.py", line 424, in tocoo
    row = (R * np.arange(M//R)).repeat(np.diff(self.indptr))
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

There are 15 test errors like this.
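The underlying mechanism is NumPy's 'safe' casting rule: int64 to int32 may lose information, so the cast is refused. On a 32-bit build np.intp is int32, so internal casts of index arguments to intp (as presumably happens for the repeat counts above) fail once the index arrays are int64. A minimal reproduction of the casting rule itself:

```python
import numpy as np

a = np.arange(3, dtype=np.int64)

a.astype(np.int64, casting='safe')                             # fine: same width
np.arange(3, dtype=np.int32).astype(np.int64, casting='safe')  # fine: widening

try:
    a.astype(np.int32, casting='safe')  # refused: narrowing is not 'safe'
except TypeError as exc:
    print(exc)
```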

@pv


Member

pv commented Jan 18, 2014

Fixed.

@coveralls


coveralls Jan 18, 2014

Coverage Status

Coverage remained the same when pulling cc71eea on pv:ticket/1307 into 3b30d25 on scipy:master.


@rgommers


Member

rgommers commented Jan 22, 2014

After the comment from Nathan Bell about compile time in the original ticket, I measured it (very scientifically: a single run), and it was actually ~5% faster with this PR. Memory usage is also not significantly higher, so no problems there.

Note that this matters somewhat for sparsetools; my old Mac (still in use) chokes on compiling scipy unless I close Firefox and other programs that use a lot of memory.

@coveralls


coveralls Jan 25, 2014

Coverage Status

Coverage remained the same when pulling 567b78d on pv:ticket/1307 into 3b30d25 on scipy:master.


pv added some commits May 28, 2012

BUG: sparse: swap long long -> int64 & int -> int32 in sparsetools
This ensures things keep working as expected also on platforms
where int is not 32 bits or long long is not 64 bits
BUG: sparse: matrix multiply pass 1 must allow for fill-in
Since we don't know the resulting fill-in from the matrix
multiplication, choose a large enough data type to accommodate also a
dense result. If the result nnz proves to be smaller, the data type is
reduced afterward.
BUG: sparse: be more fussy about 64-bit index data types
- never downcast
- rename nnz to maxval in get_index_dtype
- CSR maxval=max(nnz, shape[1])
- CSC maxval=max(nnz, shape[0])
- COO maxval=max(shape)
- DIA maxval=max(shape)
- BSR maxval=max(bnnz, bshape[1])
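The per-format maxval rules listed in the last commit message can be collected in one place (an illustrative helper, not scipy API):

```python
def index_maxval(fmt, shape, nnz, bnnz=None, bshape=None):
    """Largest value an index array of the given sparse format must hold."""
    if fmt == 'csr':
        return max(nnz, shape[1])   # indptr stores up to nnz, indices store columns
    if fmt == 'csc':
        return max(nnz, shape[0])   # indices store row numbers
    if fmt in ('coo', 'dia'):
        return max(shape)           # coordinates/offsets bounded by the shape
    if fmt == 'bsr':
        return max(bnnz, bshape[1]) # block counts and block columns
    raise ValueError("unknown format: %s" % fmt)
```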
@pv


Member

pv commented Feb 1, 2014

Rebased

@coveralls


coveralls Feb 1, 2014

Coverage Status

Coverage remained the same when pulling 981da90 on pv:ticket/1307 into d378b0d on scipy:master.


rgommers added a commit that referenced this pull request Feb 2, 2014

Merge pull request #442 from pv/ticket/1307
ENH: sparse: enable 64-bit index arrays & nnz > 2**31

@rgommers rgommers merged commit bd0c484 into scipy:master Feb 2, 2014

1 check passed

The Travis CI build passed
@rgommers


Member

rgommers commented Feb 2, 2014

OK, time to merge this. It seems to work now; it should get a few weeks in master before the 0.14.x branch. Thanks @pv.

@benbowen


benbowen commented Sep 2, 2016

Is there a plan to merge this? I would like to run minimum_spanning_tree on a graph that has more edges than currently allowed; the maximum seems to be 2**31 - 1 (about 2.1 billion) edges.
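For reference, the use case looks like this on a small graph; the int32 limit applies to the index arrays of the CSR input and to the csgraph internals:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

# 4-node weighted graph; the upper triangle lists the edge weights.
graph = csr_matrix(np.array([[0, 8, 0, 3],
                             [0, 0, 2, 5],
                             [0, 0, 0, 6],
                             [0, 0, 0, 0]]))

# Kruskal keeps the 3 cheapest edges that connect all nodes:
# (1,2)=2, (0,3)=3, (1,3)=5.
mst = minimum_spanning_tree(graph)
print(mst.toarray())
```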

@pv


Member

pv commented Sep 3, 2016

It is already merged. Nobody has taken up enabling 64-bit indices for the csgraph routines, however.

@pv


Member

pv commented Sep 3, 2016

So if you need it, that could be a useful small (but not fully trivial) project to contribute to Scipy.

@benbowen


benbowen commented Sep 6, 2016

Thanks, I'll take a closer look and see if I can set up a development branch of Scipy on our large-memory nodes.

@pv

Member

pv commented Sep 6, 2016

@perimosocordiae


Member

perimosocordiae commented Sep 6, 2016

Unfortunately, many csgraph functions require square CSR input, so you do actually need lots of memory for the indptr, even if nnz is small.
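The point is easy to quantify: for an n x n CSR matrix, indptr has n + 1 entries no matter how sparse the matrix is. A rough sketch (the ~17 GB figure assumes 64-bit indices and 2**31 nodes):

```python
import numpy as np

def indptr_bytes(n, index_dtype=np.int64):
    """indptr for an n x n CSR matrix has n + 1 entries regardless of nnz."""
    return (n + 1) * np.dtype(index_dtype).itemsize

# A square graph with 2**31 nodes needs ~17 GB for indptr alone,
# even if it is nearly empty.
print(indptr_bytes(2**31) / 1e9, "GB")
```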

@pv

Member

pv commented Sep 6, 2016
