ENH: sparse: enable 64-bit index arrays & nnz > 2**31 #442

Merged
merged 18 commits into scipy:master from pv:ticket/1307 on Feb 2, 2014

Conversation

6 participants
@pv
Member

pv commented Feb 18, 2013

Currently Scipy's sparse matrices are limited to nnz < 2**31; crashes and other undefined behavior occur if nnz grows beyond that and the int32 indices start to wrap around.

This PR enables sparse matrices to use either int32 or int64 as the data type of
the indices. The data type is chosen automatically based on the nnz of the data
and requirements of the different sparse matrix types.
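In outline, the selection works like the sketch below. The helper is named get_index_dtype in this PR (see the commit messages later in the thread), but the body here is illustrative, not the actual implementation:

```python
import numpy as np

def get_index_dtype(arrays=(), maxval=None):
    """Pick int32 unless some index value cannot be represented in it (sketch)."""
    int32max = np.iinfo(np.int32).max
    if maxval is not None and maxval > int32max:
        return np.int64
    for arr in arrays:
        arr = np.asarray(arr)
        # Upcast if any existing index array holds values outside int32 range.
        if arr.size > 0 and (arr.max() > int32max or arr.min() < -int32max - 1):
            return np.int64
    return np.int32
```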

The threshold selection is written so that it can be monkeypatched to a lower
value for testing purposes. The 64-bit-specific test suite, however, currently only
consists of running the existing tests with lower thresholds and randomly assigned
index array data types (these tests pass).

The direct linear algebra solvers (SuperLU, probably also UMFPACK) do not work with int64 indices, but implementing that is out of scope for this PR. For matrices with nnz < 2**31, things should be backwards compatible.

However, more tests are needed, so this PR is still somewhat of a WIP.
Some help would be appreciated:

Write tests / audit the code to check that the data type selection
works appropriately where needed, e.g. in assignment.

Closes Trac #1307

@larsmans


Contributor

larsmans commented Feb 21, 2013

Good feature in principle, but won't this break current Cython code that works with scipy.sparse? We have quite a lot of that in scikit-learn.

@pv


Member

pv commented Feb 21, 2013

It won't break it --- the matrices switch to int64 only when the amount of data is so large that int32 won't cut it any more, so third-party code continues to work as long as the input data is small enough. To handle larger nnz, third-party code would need to be written for int64 anyway.

(As currently written, int64 is 'viral', so int32-index matrix times int64-index matrix is also int64-index. This can be adjusted, however, by checking nnz.)
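The 'viral' rule and the possible nnz-based adjustment can be sketched as follows (the helper name and signature are hypothetical, for illustration only):

```python
import numpy as np

def result_index_dtype(a_dtype, b_dtype, result_nnz=None):
    """Hypothetical sketch of the index-dtype rule for binary operations."""
    if result_nnz is not None and result_nnz <= np.iinfo(np.int32).max:
        # The adjustment mentioned above: downcast when the result's nnz
        # is known to fit in int32.
        return np.dtype(np.int32)
    if np.dtype(a_dtype) == np.int64 or np.dtype(b_dtype) == np.int64:
        return np.dtype(np.int64)  # int64 is 'viral'
    return np.dtype(np.int32)
```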

@larsmans


Contributor

larsmans commented Feb 21, 2013

Alright. I guess we can just do a check for int32 indices for the time being; huge sparse matrices would have to be broken up into batches anyway.
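Such a check could look like the guard below (a hypothetical helper in the spirit of this comment, not an actual scikit-learn function; it works on any object with indices/indptr attributes, such as CSR/CSC matrices):

```python
import numpy as np

def check_int32_indices(X):
    """Raise if a CSR/CSC-like matrix does not use int32 index arrays."""
    if X.indices.dtype != np.int32 or X.indptr.dtype != np.int32:
        raise ValueError("only int32 sparse indices are supported; "
                         "split the matrix into smaller batches")
```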

@larsmans larsmans referenced this pull request in scikit-learn/scikit-learn Feb 22, 2013

Merged

WIP: LinearClassifierMixin support for sparse coef_ #1702

@larsmans larsmans referenced this pull request in scikit-learn/scikit-learn Aug 29, 2013

Open

[WIP] Matrix factorization with missing values #2387

@larsmans


Contributor

larsmans commented Aug 29, 2013

Just a question: there are now npy_int{32,64} instantiations. Wouldn't it be wiser to have plain int (backward compatible) and/or npy_intp versions? That way, there would only be one build on a 32-bit box, where 64-bit indices don't make much sense, and there's no need to hardcode the number of bits.

@pv


Member

pv commented Jan 14, 2014

Rebased, plus some additional fixes. The __setitem__ and stacking calls still need checking.

@larsmans: intp would probably work. However, I don't see a good reason for dropping support for 64-bit indices on 32-bit platforms, as that adds a platform dependency that can cause problems e.g. with pickled matrices and so on (also small matrices can end up with 64-bit indices).
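The portability issue is that np.intp is a platform-sized alias, while a pickle records the concrete dtype the writing platform used; a quick illustration:

```python
import numpy as np

# np.intp is 32-bit on a 32-bit Python and 64-bit on a 64-bit one, so an array
# created with dtype=np.intp pickles with whatever concrete width the writing
# platform had -- the reading platform may get index arrays it does not expect.
print("intp width (platform-dependent):", np.dtype(np.intp).itemsize * 8, "bits")

# Fixed-width dtypes are the same everywhere:
assert np.dtype(np.int32).itemsize == 4
assert np.dtype(np.int64).itemsize == 8
```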

@larsmans


Contributor

larsmans commented Jan 14, 2014

Good point. In scikit-learn we switched to intp in a couple of places and now some of the pickles are no longer portable :(

@coveralls

coveralls Jan 14, 2014

Coverage Status

Changes unknown when pulling 4aa2be4 on pv:ticket/1307 into scipy:master.



@pv


Member

pv commented Jan 16, 2014

Another rebase. This should handle all index usage in scipy.sparse (grepping for intc returns no unnecessary hits any more).

Unit tests exercising the >32-bit index space require quite a lot of memory.

Currently, there is a unit test that reduces the threshold of the type switch (to a much smaller nnz of ~10) or makes it random, and runs the sparse base test suite. This probably covers most of the basic operations.
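The pattern is roughly the following self-contained sketch; in the actual test suite the patched attribute lives inside scipy.sparse, and the stand-in helper below is illustrative:

```python
import numpy as np

def get_index_dtype(arrays=(), maxval=None):
    """Stand-in for the real selection helper: int32 below the threshold."""
    if maxval is not None and maxval > np.iinfo(np.int32).max:
        return np.int64
    return np.int32

def lowered_threshold_version(threshold):
    """Return a drop-in replacement that switches to int64 above `threshold`,
    so that tiny test matrices already exercise the 64-bit code paths."""
    def patched(arrays=(), maxval=None):
        if maxval is not None and maxval > threshold:
            return np.int64
        return np.int32
    return patched

# A test would then monkeypatch the module attribute, e.g.:
#   some_module.get_index_dtype = lowered_threshold_version(10)
# and re-run the base sparse test suite.
```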

@coveralls


coveralls Jan 16, 2014

Coverage Status

Coverage remained the same when pulling fc81527 on pv:ticket/1307 into 3b30d25 on scipy:master.


@rgommers


Member

rgommers commented Jan 18, 2014

Some casting issues for 64-bit indices on a 32-bit system:

======================================================================
ERROR: test_base.Test64Bit.test_resiliency_limit(<class 'test_base.TestLIL'>, 'test_tobsr')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rgommers/.local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/rgommers/Code/numpy/numpy/testing/decorators.py", line 146, in skipper_func
    return f(*args, **kwargs)
  File "<string>", line 2, in check
  File "/home/rgommers/Code/bldscipy/scipy/sparse/tests/test_base.py", line 90, in deco
    return func(*a, **kw)
  File "/home/rgommers/Code/bldscipy/scipy/sparse/tests/test_base.py", line 3376, in check
    getattr(instance, method_name)()
  File "/home/rgommers/Code/bldscipy/scipy/sparse/tests/test_base.py", line 1214, in test_tobsr
    assert_equal(fn(blocksize=(X,Y)).todense(), A)
  File "/home/rgommers/Code/bldscipy/scipy/sparse/base.py", line 599, in todense
    return np.asmatrix(self.toarray(order=order, out=out))
  File "/home/rgommers/Code/bldscipy/scipy/sparse/compressed.py", line 796, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/rgommers/Code/bldscipy/scipy/sparse/bsr.py", line 424, in tocoo
    row = (R * np.arange(M//R)).repeat(np.diff(self.indptr))
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

There are 15 test errors like this.
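The underlying mechanism is NumPy's 'safe' casting rule: int64 to int32 may lose information, so the cast is refused. On a 32-bit build np.intp is int32, so internal casts of index arguments to intp (as presumably happens for the repeat counts above) fail once the index arrays are int64. A minimal reproduction of the casting rule itself:

```python
import numpy as np

a = np.arange(3, dtype=np.int64)

a.astype(np.int64, casting='safe')                             # fine: same width
np.arange(3, dtype=np.int32).astype(np.int64, casting='safe')  # fine: widening

try:
    a.astype(np.int32, casting='safe')  # refused: narrowing is not 'safe'
except TypeError as exc:
    print(exc)
```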

@pv


Member

pv commented Jan 18, 2014

Fixed.

@coveralls


coveralls Jan 18, 2014

Coverage Status

Coverage remained the same when pulling cc71eea on pv:ticket/1307 into 3b30d25 on scipy:master.


@rgommers


Member

rgommers commented Jan 22, 2014

After the comment from Nathan Bell about compile time in the original ticket, I measured it (very scientifically: a single run), and it was actually ~5% faster with this PR. Memory usage is also not significantly higher, so no problems there.

Note that this matters somewhat for sparsetools; my old Mac (still in use) chokes on compiling scipy unless I close Firefox and other programs that use a lot of memory.

@coveralls


coveralls Jan 25, 2014

Coverage Status

Coverage remained the same when pulling 567b78d on pv:ticket/1307 into 3b30d25 on scipy:master.


pv added some commits May 28, 2012

BUG: sparse: swap long long -> int64 & int -> int32 in sparsetools
This ensures things keep working as expected also on platforms
where int is not 32 bits or long long is not 64 bits
BUG: sparse: matrix multiply pass 1 must allow for fill-in
Since we don't know the resulting fill-in from the matrix
multiplication, choose a large enough data type to accommodate also a
dense result. If the result nnz proves to be smaller, the data type is
reduced afterward.
BUG: sparse: be more fussy about 64-bit index data types
- never downcast
- rename nnz to maxval in get_index_dtype
- CSR maxval=max(nnz, shape[1])
- CSC maxval=max(nnz, shape[0])
- COO maxval=max(shape)
- DIA maxval=max(shape)
- BSR maxval=max(bnnz, bshape[1])
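The per-format maxval rules listed in the last commit message can be collected in one place (an illustrative helper, not scipy API):

```python
def index_maxval(fmt, shape, nnz, bnnz=None, bshape=None):
    """Largest value an index array of the given sparse format must hold."""
    if fmt == 'csr':
        return max(nnz, shape[1])   # indptr stores up to nnz, indices store columns
    if fmt == 'csc':
        return max(nnz, shape[0])   # indices store row numbers
    if fmt in ('coo', 'dia'):
        return max(shape)           # coordinates/offsets bounded by the shape
    if fmt == 'bsr':
        return max(bnnz, bshape[1]) # block counts and block columns
    raise ValueError("unknown format: %s" % fmt)
```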
@pv


Member

pv commented Feb 1, 2014

Rebased

@coveralls


coveralls Feb 1, 2014

Coverage Status

Coverage remained the same when pulling 981da90 on pv:ticket/1307 into d378b0d on scipy:master.


rgommers added a commit that referenced this pull request Feb 2, 2014

Merge pull request #442 from pv/ticket/1307
ENH: sparse: enable 64-bit index arrays & nnz > 2**31

@rgommers rgommers merged commit bd0c484 into scipy:master Feb 2, 2014

1 check passed

The Travis CI build passed
@rgommers


Member

rgommers commented Feb 2, 2014

OK, time to merge this. It seems to work now; it should get a few weeks in master before the 0.14.x branch. Thanks @pv.

@benbowen


benbowen commented Sep 2, 2016

Is there a plan to merge this? I would like to run minimum_spanning_tree on a graph that has more edges than currently allowed; the maximum seems to be 2**31 - 1 (about 2.1 billion) edges.
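For reference, the use case looks like this on a small graph; the int32 limit applies to the index arrays of the CSR input and to the csgraph internals:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

# 4-node weighted graph; the upper triangle lists the edge weights.
graph = csr_matrix(np.array([[0, 8, 0, 3],
                             [0, 0, 2, 5],
                             [0, 0, 0, 6],
                             [0, 0, 0, 0]]))

# Kruskal keeps the 3 cheapest edges that connect all nodes:
# (1,2)=2, (0,3)=3, (1,3)=5.
mst = minimum_spanning_tree(graph)
print(mst.toarray())
```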

@pv


Member

pv commented Sep 3, 2016

It is already merged. Nobody has taken up enabling 64-bit indices for the csgraph routines, however.

@pv


Member

pv commented Sep 3, 2016

So if you need it, that could be a useful small (but not fully trivial) project to contribute to Scipy.

@benbowen


benbowen commented Sep 6, 2016

Thanks, I'll take a closer look and see if I can set up a development branch of Scipy on our large-memory nodes.

@pv

Member

pv commented Sep 6, 2016

@perimosocordiae


Member

perimosocordiae commented Sep 6, 2016

Unfortunately, many csgraph functions require square CSR input, so you do actually need lots of memory for the indptr, even if nnz is small.
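The point is easy to quantify: for an n x n CSR matrix, indptr has n + 1 entries no matter how sparse the matrix is. A rough sketch (the ~17 GB figure assumes 64-bit indices and 2**31 nodes):

```python
import numpy as np

def indptr_bytes(n, index_dtype=np.int64):
    """indptr for an n x n CSR matrix has n + 1 entries regardless of nnz."""
    return (n + 1) * np.dtype(index_dtype).itemsize

# A square graph with 2**31 nodes needs ~17 GB for indptr alone,
# even if it is nearly empty.
print(indptr_bytes(2**31) / 1e9, "GB")
```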

@pv

Member

pv commented Sep 6, 2016
