
ENH: sparse: enable 64-bit index arrays & nnz > 2**31 #442

Merged

18 commits merged into scipy:master from pv:ticket/1307 on Feb 2, 2014

Conversation

pv
Member

pv commented Feb 18, 2013

Currently SciPy's sparse matrices are limited to nnz < 2**31; crashes and other undefined behavior occur if nnz grows larger and the int32 indices start to wrap around.

This PR enables sparse matrices to use either int32 or int64 as the data type of
the indices. The data type is chosen automatically based on the nnz of the data
and requirements of the different sparse matrix types.
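The selection rule can be illustrated with a small sketch (illustrative only, not the actual scipy implementation; `choose_index_dtype` is a hypothetical name):

```python
import numpy as np

# Hypothetical sketch of the automatic selection described above: keep
# int32 indices while every value (nnz, dimensions) fits, and switch
# to int64 only beyond the int32 range.
def choose_index_dtype(maxval):
    if maxval > np.iinfo(np.int32).max:  # 2**31 - 1
        return np.int64
    return np.int32

assert choose_index_dtype(10_000) is np.int32
assert choose_index_dtype(2**31) is np.int64
```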

The threshold selection is written so that it can be monkeypatched to a lower
value for testing purposes. The 64-bit-specific test suite, however, currently
consists only of running the existing tests with lowered thresholds and randomly
assigned index array dtypes (these tests pass).

The direct linear algebra solvers (SuperLU, probably also UMFPACK) do not work with int64 indices, but implementing that is out of scope for this PR. For matrices with nnz < 2**31, things should be backwards compatible.

However, more tests would be needed, so this PR is still a bit in a WIP state.
Some help would be appreciated:

Write tests / audit the code to check that the data type selection
works appropriately where needed, e.g. in assignment.

Closes Trac #1307

@larsmans
Contributor

Good feature in principle, but won't this break current Cython code that works with scipy.sparse? We have quite a lot of that in scikit-learn.

@pv
Member Author

pv commented Feb 21, 2013

It won't break it --- the matrices switch to int64 only when the amount of data is so large that int32 won't cut it any more, so 3rd-party code continues to work as long as the input data is small enough. To handle larger nnz, 3rd-party code would need to be written for int64 anyway.

(As currently written, int64 is 'viral', so int32-index matrix times int64-index matrix is also int64-index. This can be adjusted, however, by checking nnz.)
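For third-party code that cannot handle int64 indices, one defensive pattern (a sketch; `ensure_int32_indices` is a hypothetical helper, not scipy API) is to downcast the index arrays only when the values provably fit:

```python
import numpy as np
import scipy.sparse as sp

def ensure_int32_indices(A):
    """Hypothetical guard: downcast CSR index arrays to int32 when safe."""
    A = A.tocsr()
    # For CSR, indptr values go up to nnz and indices up to shape[1].
    if max(A.nnz, A.shape[1]) <= np.iinfo(np.int32).max:
        A.indices = A.indices.astype(np.int32)
        A.indptr = A.indptr.astype(np.int32)
        return A
    raise ValueError("matrix too large for int32-indexed code")

A = ensure_int32_indices(sp.random(10, 10, density=0.3, format="csr"))
assert A.indices.dtype == np.int32
```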

@larsmans
Contributor

Alright. I guess we can just do a check for int32 indices for the time being; huge sparse matrices would have to be broken up into batches anyway.

@larsmans
Contributor

Just a question: there are now npy_int{32,64} instantiations. Wouldn't it be wiser to have plain int (backward compat) and/or npy_intp versions? That way, there could be only one build on a 32-bit box, where 64-bit indices don't make much sense, and there's no need to hardcode the number of bits.

@pv
Member Author

pv commented Jan 14, 2014

Rebased, plus some additional fixes. The __setitem__ and stacking calls still need checking.

@larsmans: intp would probably work. However, I don't see a good reason for dropping support for 64-bit indices on 32-bit platforms, as that adds a platform dependency that can cause problems e.g. with pickled matrices and so on (also small matrices can end up with 64-bit indices).

@larsmans
Contributor

Good point. In scikit-learn we switched to intp in a couple of places and now some of the pickles are no longer portable :(

@coveralls

Coverage Status

Changes Unknown when pulling 4aa2be4 on pv:ticket/1307 into scipy:master.


@pv
Member Author

pv commented Jan 16, 2014

Another rebase. This should handle all index usage in scipy.sparse (grep for intc returns no unnecessary hits any more).

Unit tests exercising the >32-bit index space require quite a lot of memory.

Currently, there is a unit test that reduces the threshold of the type switch (to a much smaller nnz of ~10) or makes it random, and runs the sparse base test suite. This probably covers most of the basic operations.

@coveralls

Coverage Status

Coverage remained the same when pulling fc81527 on pv:ticket/1307 into 3b30d25 on scipy:master.

@rgommers
Member

Some casting issues for 64-bit indices on a 32-bit system:

======================================================================
ERROR: test_base.Test64Bit.test_resiliency_limit(<class 'test_base.TestLIL'>, 'test_tobsr')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rgommers/.local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/rgommers/Code/numpy/numpy/testing/decorators.py", line 146, in skipper_func
    return f(*args, **kwargs)
  File "<string>", line 2, in check
  File "/home/rgommers/Code/bldscipy/scipy/sparse/tests/test_base.py", line 90, in deco
    return func(*a, **kw)
  File "/home/rgommers/Code/bldscipy/scipy/sparse/tests/test_base.py", line 3376, in check
    getattr(instance, method_name)()
  File "/home/rgommers/Code/bldscipy/scipy/sparse/tests/test_base.py", line 1214, in test_tobsr
    assert_equal(fn(blocksize=(X,Y)).todense(), A)
  File "/home/rgommers/Code/bldscipy/scipy/sparse/base.py", line 599, in todense
    return np.asmatrix(self.toarray(order=order, out=out))
  File "/home/rgommers/Code/bldscipy/scipy/sparse/compressed.py", line 796, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/rgommers/Code/bldscipy/scipy/sparse/bsr.py", line 424, in tocoo
    row = (R * np.arange(M//R)).repeat(np.diff(self.indptr))
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

There are 15 test errors like this.
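The root cause is NumPy's 'safe' casting rule, which can be demonstrated in isolation:

```python
import numpy as np

# int64 -> int32 may lose information, so it is rejected under 'safe';
# the widening direction int32 -> int64 is always allowed.
assert not np.can_cast(np.int64, np.int32, casting="safe")
assert np.can_cast(np.int32, np.int64, casting="safe")

# The same TypeError as in the traceback above:
try:
    np.arange(3, dtype=np.int64).astype(np.int32, casting="safe")
except TypeError as e:
    print("raises:", e)
```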

@pv
Member Author

pv commented Jan 18, 2014

Fixed.

@coveralls

Coverage Status

Coverage remained the same when pulling cc71eea on pv:ticket/1307 into 3b30d25 on scipy:master.

@rgommers
Member

After the comment from Nathan Bell on compile time in the original ticket, I measured it (very scientifically, with a single run), and it was actually ~5% faster with this PR. Memory usage is also not significantly higher, so no problems there.

Note that that's somewhat important for sparsetools; my old Mac (still in use) chokes on compiling SciPy unless I close Firefox and other programs that use a lot of memory.

@coveralls

Coverage Status

Coverage remained the same when pulling 567b78d on pv:ticket/1307 into 3b30d25 on scipy:master.

Since we don't know the resulting fill-in from the matrix
multiplication, choose a large enough data type to also accommodate a
dense result. If the result nnz proves to be smaller, the data type is
reduced afterward.

- never downcast
- rename nnz to maxval in get_index_dtype
- CSR: maxval = max(nnz, shape[1])
- CSC: maxval = max(nnz, shape[0])
- COO: maxval = max(shape)
- DIA: maxval = max(shape)
- BSR: maxval = max(bnnz, bshape[1])
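The per-format rules listed above can be sketched as a small helper (illustrative; `maxval_for_format` is a hypothetical name, and the BSR rule would use block counts analogously):

```python
def maxval_for_format(fmt, nnz, shape):
    # Largest value the index dtype must represent: the number of
    # stored entries (for indptr) and the largest index that can
    # appear in the format's index arrays.
    M, N = shape
    if fmt == "csr":
        return max(nnz, N)     # indptr goes up to nnz, indices hold columns
    if fmt == "csc":
        return max(nnz, M)     # indices hold rows
    if fmt in ("coo", "dia"):
        return max(M, N)
    raise ValueError("unknown format: %s" % fmt)

assert maxval_for_format("csr", 50, (10, 2**32)) == 2**32
assert maxval_for_format("coo", 50, (10, 20)) == 20
```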
@pv
Member Author

pv commented Feb 1, 2014

Rebased

@coveralls

Coverage Status

Coverage remained the same when pulling 981da90 on pv:ticket/1307 into d378b0d on scipy:master.

rgommers added a commit that referenced this pull request Feb 2, 2014
ENH: sparse: enable 64-bit index arrays & nnz > 2**31
rgommers merged commit bd0c484 into scipy:master on Feb 2, 2014
@rgommers
Member

rgommers commented Feb 2, 2014

OK time to merge this. Seems to work now, should get a few weeks in master before the 0.14.x branch. Thanks @pv.

@benbowen

benbowen commented Sep 2, 2016

Is there a plan to merge this? I would like to run minimum_spanning_tree on a graph that has more edges than allowed; it seems the current maximum is ~2.1 billion edges.

@pv
Member Author

pv commented Sep 3, 2016

It is already merged. Nobody has taken up enabling 64-bit indices for the csgraph routines, however.

@pv
Member Author

pv commented Sep 3, 2016

So if you need it, that could be a useful small (but not fully trivial) project to contribute to Scipy.

@benbowen

benbowen commented Sep 6, 2016

Thanks, I'll take a closer look and see if I can set up a development branch for SciPy on our large-memory nodes.

@pv
Member Author

pv commented Sep 6, 2016 via email

@perimosocordiae
Member

Unfortunately, many csgraph functions require square CSR input, so you do actually need lots of memory for the indptr, even if nnz is small.
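A quick back-of-envelope calculation illustrates the point: for an n x n CSR matrix, indptr has n + 1 entries regardless of nnz.

```python
# For a square CSR graph with 2**31 vertices, the int64 indptr alone
# needs (n + 1) * 8 bytes, about 16 GiB, even if the graph has few edges.
n = 2**31
indptr_gib = (n + 1) * 8 / 2**30
assert round(indptr_gib) == 16
print(f"indptr alone: {indptr_gib:.1f} GiB")
```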

@pv
Member Author

pv commented Sep 6, 2016 via email

Labels: enhancement (A new feature or improvement), scipy.sparse