Skip to content

ENH: sparse: enable 64-bit index arrays & nnz > 2**31 #442

Merged
merged 18 commits into from Feb 2, 2014

4 participants

@pv
SciPy member
pv commented Feb 18, 2013

Currently Scipy's sparse matrices are limited to nnz < 2**31, and crashes and other unspecified behavior occurs if nnz becomes larger so that the int32 indices start to wrap around.

This PR enables sparse matrices to use either int32 or int64 as the data type of
the indices. The data type is chosen automatically based on the nnz of the data
and requirements of the different sparse matrix types.

The threshold selection is written so that it can be monkeypatched to be lower
for testing purposes. The 64-bit specific test suite however currently only consists
of running the existing tests with lower thresholds and randomly assigned index
array data types (these tests pass).

The direct linear algebra (superlu, probably also umfpack) do not work with int64 indices, but implementing that is out of scope for this PR. For matrices with nnz < 2**31, things should be backwards compatible.

However, more tests would be needed, so this PR is still a bit in a WIP state.
Some help would be appreciated:

Write tests / audit the code checking that the data type selection
works appropriately when needed e.g. in assignment etc.

Closes Trac #1307

@larsmans

Good feature in principle, but won't this break current Cython code that works with scipy.sparse? We have quite a lot of that in scikit-learn.

@pv
SciPy member
pv commented Feb 21, 2013

It won't break it --- the matrices switch to int64 only when the amount of data is so large that int32 won't cut it any more, so that 3rd party code continues to work as long as the input data is small enough. To handle larger nnz, 3rd party code would anyway need to be written with int64.

(As currently written, int64 is 'viral', so int32-index matrix times int64-index matrix is also int64-index. This can be adjusted, however, by checking nnz.)

@larsmans

Alright. I guess we can just do a check for int32 indices for the time being; huge sparse matrices would have to be broken up into batches anyway.

@larsmans larsmans referenced this pull request in scikit-learn/scikit-learn Aug 29, 2013
Open

[WIP] Matrix factorization with missing values #2387

0 of 8 tasks complete
@larsmans

Just a question: there are now npy_int{32,64} instantiations. Wouldn't it be wiser to have plain int (backward compat) and or npy_intp versions? That way, there could only be one build on a 32-bit box, where 64-bit indices don't make much sense, and there's no need to hardcode the number of bits.

@pv
SciPy member
pv commented Jan 14, 2014

Rebased, plus some additional fixed. The __setitem__ and stacking calls still need checking.

@larsmans: intp would probably work. However, I don't see a good reason for dropping support for 64-bit indices on 32-bit platforms, as that adds a platform dependency that can cause problems e.g. with pickled matrices and so on (also small matrices can end up with 64-bit indices).

@larsmans

Good point. In scikit-learn we switched to intp in a couple of places and now some of the pickles are no longer portable :(

@coveralls

Coverage Status

Changes Unknown when pulling 4aa2be4 on pv:ticket/1307 into ** on scipy:master**.

@coveralls

Coverage Status

Changes Unknown when pulling 4aa2be4 on pv:ticket/1307 into ** on scipy:master**.

@pv
SciPy member
pv commented Jan 16, 2014

Another rebase. This should handle all of index usage in scipy.sparse (grep intc returns no unnecessary hits any more).

Unit tests exercising the >32-bit index space requires quite a lot of memory.

Currently, there is a unit test that reduces the threshold of the type switch (to a much smaller nnz of ~10) or makes it random, and runs the sparse base test suite. This probably covers most of the basic operations.

@coveralls

Coverage Status

Coverage remained the same when pulling fc81527 on pv:ticket/1307 into 3b30d25 on scipy:master.

@rgommers
SciPy member

Some casting issues for 64-bit indices on a 32-bit system:

======================================================================
ERROR: test_base.Test64Bit.test_resiliency_limit(<class 'test_base.TestLIL'>, 'test_tobsr')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rgommers/.local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/rgommers/Code/numpy/numpy/testing/decorators.py", line 146, in skipper_func
    return f(*args, **kwargs)
  File "<string>", line 2, in check
  File "/home/rgommers/Code/bldscipy/scipy/sparse/tests/test_base.py", line 90, in deco
    return func(*a, **kw)
  File "/home/rgommers/Code/bldscipy/scipy/sparse/tests/test_base.py", line 3376, in check
    getattr(instance, method_name)()
  File "/home/rgommers/Code/bldscipy/scipy/sparse/tests/test_base.py", line 1214, in test_tobsr
    assert_equal(fn(blocksize=(X,Y)).todense(), A)
  File "/home/rgommers/Code/bldscipy/scipy/sparse/base.py", line 599, in todense
    return np.asmatrix(self.toarray(order=order, out=out))
  File "/home/rgommers/Code/bldscipy/scipy/sparse/compressed.py", line 796, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/rgommers/Code/bldscipy/scipy/sparse/bsr.py", line 424, in tocoo
    row = (R * np.arange(M//R)).repeat(np.diff(self.indptr))
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

There are 15 test errors like this.

@pv
SciPy member
pv commented Jan 18, 2014

Fixed.

@coveralls

Coverage Status

Coverage remained the same when pulling cc71eea on pv:ticket/1307 into 3b30d25 on scipy:master.

@rgommers
SciPy member

After the comment from Nathan Bell on compile time in the original ticket I measured it (very scientifically, single run), and it was actually ~5% faster with this PR. Memory usage also not significantly more, so no problems there.

Note that that's somewhat important for sparsetools, my old Mac (still in use) chokes on compiling scipy unless I closed Firefox and other programs that use a lot of memory.

@coveralls

Coverage Status

Coverage remained the same when pulling 567b78d on pv:ticket/1307 into 3b30d25 on scipy:master.

pv added some commits May 28, 2012
@pv pv MAINT: mark SWIG-generated files as binary 074f8b7
@pv pv ENH: sparse: enable sparsetools to support long long indices
Thanks to peb for the patch.
0daf018
@pv pv ENH: sparse: replace intc index type by automatically determined 0938438
@pv pv MAINT: sparse: add a generate_sparsetools.py d776d68
@pv pv TST: sparse: add initial tests for 64-bit indices 8adba21
@pv pv BUG: sparse: trigger 64-bit index cast in more locations 457ee3f
@pv pv BUG: sparse: play safe and do 64-bit cast earlier e3e7323
@pv pv BUG: sparse: swap long long -> int64 & int -> int32 in sparsetools
This ensures things keep working as expected also on platforms
with sizeof(int) != 32 and sizeof(long long) != 64
87deb7b
@pv pv GEN: sparse: regenerate sparsetools files 061d375
@pv pv BUG: sparse: matrix multiply pass 1 must allow for fill-in
Since we don't know the resulting fill-in from the matrix
multiplication, choose a large enough data type to accommodate also a
dense result. If the result nnz proves to be smaller, the data type is
reduced afterward.
63fd870
@pv pv BUG: sparse: fix one missing index type determination e69bf1a
@pv pv BUG: sparse: fix sparse.randn to select index type properly dba0943
@pv pv BUG: sparse: select correct index type on empty matrix init for BSR/C…
…SR/CSC
4ae96d0
@pv pv BUG: sparse: correct index data type in bmat 0af65ec
@pv pv BUG: sparse: fix BSR to COO conversion on 32-bit 84524b2
@pv pv TST: sparse: enable more 64-bit index tests 491e8b8
@pv pv BUG: sparse: be more fussy about 64-bit index data types
- never downcast
- rename nnz to maxval in get_index_dtype
- CSR maxval=max(nnz, shape[1])
- CSC maxval=max(nnz, shape[0]),
- COO maxval=max(shape)
- DIA maxval=max(shape),
- BSR maxval=max(bnnz, bshape[1])
62ec92f
@pv pv BUG: sparse: fix 64-bit index handling in _cs_matrix._insert_many 981da90
@pv
SciPy member
pv commented Feb 1, 2014

Rebased

@coveralls

Coverage Status

Coverage remained the same when pulling 981da90 on pv:ticket/1307 into d378b0d on scipy:master.

@rgommers rgommers merged commit bd0c484 into scipy:master Feb 2, 2014

1 check passed

Details default The Travis CI build passed
@rgommers
SciPy member
rgommers commented Feb 2, 2014

OK time to merge this. Seems to work now, should get a few weeks in master before the 0.14.x branch. Thanks @pv.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.