ENH: sparse: enable 64-bit index arrays & nnz > 2**31 #442
Good feature in principle, but won't this break current Cython code that works with |
It won't break it --- the matrices switch to int64 only when the amount of data is so large that int32 won't cut it any more, so that 3rd party code continues to work as long as the input data is small enough. To handle larger nnz, 3rd party code would anyway need to be written with int64. (As currently written, int64 is 'viral', so int32-index matrix times int64-index matrix is also int64-index. This can be adjusted, however, by checking nnz.) |
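The 'viral' rule and the nnz-based adjustment mentioned above can be sketched roughly as follows. This is only an illustration: combined_index_dtype is a hypothetical helper, not a function from scipy.sparse, and dtypes are represented as plain strings.

```python
INT32_MAX = 2**31 - 1  # largest value an int32 index can hold


def combined_index_dtype(dtype_a, dtype_b, result_nnz=None):
    """Hypothetical helper: index dtype for the result of a binary operation."""
    if result_nnz is not None:
        # Adjusted rule: decide from the size of the result alone.
        return "int32" if result_nnz <= INT32_MAX else "int64"
    # 'Viral' rule: a single int64 operand makes the result int64.
    return "int64" if "int64" in (dtype_a, dtype_b) else "int32"
```

With the viral rule, combining an int32-index operand with an int64-index operand gives int64; passing a small result_nnz shows how the adjusted rule would fall back to int32.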
Alright. I guess we can just do a check for |
Just a question: there are now |
Rebased, plus some additional fixes. The @larsmans: intp would probably work. However, I don't see a good reason for dropping support for 64-bit indices on 32-bit platforms, as that adds a platform dependency that can cause problems, e.g. with pickled matrices (small matrices can also end up with 64-bit indices). |
Good point. In scikit-learn we switched to |
Changes Unknown when pulling 4aa2be4 on pv:ticket/1307 into * on scipy:master*. |
Another rebase. This should handle all index usage in scipy.sparse (grep intc returns no unnecessary hits any more). Unit tests exercising the >32-bit index space require quite a lot of memory. Currently, there is a unit test that reduces the threshold of the type switch (to a much smaller nnz of ~10) or makes it random, and runs the sparse base test suite. This probably covers most of the basic operations. |
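The idea of lowering the switch-over threshold for a test run could look roughly like this. It is a self-contained sketch: index_policy, int32_limit and choose_index_dtype are hypothetical stand-ins, not the names used in the PR.

```python
import types
from unittest import mock

# Hypothetical stand-in for the module that holds the switch-over threshold.
index_policy = types.SimpleNamespace(int32_limit=2**31 - 1)


def choose_index_dtype(nnz):
    """Pick the index dtype from nnz, using the current threshold."""
    return "int32" if nnz <= index_policy.int32_limit else "int64"


def choose_with_low_threshold(nnz, limit=10):
    """Run the selection with the threshold temporarily lowered."""
    with mock.patch.object(index_policy, "int32_limit", limit):
        return choose_index_dtype(nnz)
```

Inside the patched context even tiny matrices take the int64 code path, so the existing test suite can exercise it without allocating large arrays; the original threshold is restored when the context exits.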
Some casting issues for 64-bit indices on a 32-bit system:
There are 15 test errors like this. |
Fixed. |
After the comment from Nathan Bell on compile time in the original ticket I measured it (very scientifically, single run), and it was actually ~5% faster with this PR. Memory usage is also not significantly higher, so no problems there. Note that this is somewhat important for sparsetools: my old Mac (still in use) chokes on compiling scipy unless I close Firefox and other programs that use a lot of memory. |
Thanks to peb for the patch.
This ensures things keep working as expected also on platforms where int is not 32 bits wide and long long is not 64 bits wide
Since we don't know the resulting fill-in from the matrix multiplication, choose a large enough data type to accommodate also a dense result. If the result nnz proves to be smaller, the data type is reduced afterward.
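A rough sketch of that two-step choice, assuming the worst case is a fully dense result with n_rows * n_cols entries. The helper names are hypothetical and dtypes are plain strings.

```python
INT32_MAX = 2**31 - 1  # largest value an int32 index can hold


def index_dtype_for(maxval):
    """Smallest index dtype that can represent maxval."""
    return "int32" if maxval <= INT32_MAX else "int64"


def matmul_index_dtypes(n_rows, n_cols, actual_nnz):
    """Return (dtype used while computing, dtype after reduction)."""
    worst_case_nnz = n_rows * n_cols  # assumption: fully dense result
    working = index_dtype_for(max(worst_case_nnz, n_cols))
    # Once the true nnz is known, a smaller dtype may be sufficient.
    final = index_dtype_for(max(actual_nnz, n_cols))
    return working, final
```

For a 100000 x 100000 product the worst case exceeds 2**31, so the computation would use int64 even if the actual result is small enough for int32 afterward.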
- never downcast
- rename nnz to maxval in get_index_dtype
- CSR maxval=max(nnz, shape[1])
- CSC maxval=max(nnz, shape[0])
- COO maxval=max(shape)
- DIA maxval=max(shape)
- BSR maxval=max(bnnz, bshape[1])
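The per-format maxval rules listed above can be written out as a small lookup. This is a sketch only: index_maxval and index_dtype are hypothetical helpers, and for BSR the shape and nnz arguments stand for bshape and bnnz as named in the commit note.

```python
INT32_MAX = 2**31 - 1  # largest value an int32 index can hold


def index_maxval(fmt, shape, nnz=0):
    """Largest value the index arrays of the given format must hold."""
    if fmt in ("csr", "bsr"):  # for bsr: pass bshape and bnnz
        return max(nnz, shape[1])
    if fmt == "csc":
        return max(nnz, shape[0])
    if fmt in ("coo", "dia"):
        return max(shape)
    raise ValueError("unknown sparse format: %r" % (fmt,))


def index_dtype(fmt, shape, nnz=0):
    """Index dtype (as a string) implied by the maxval rule."""
    return "int32" if index_maxval(fmt, shape, nnz) <= INT32_MAX else "int64"
```

A matrix with few nonzeros can still need int64 indices: a CSR matrix of shape (5, 2**31) stores column numbers beyond the int32 range, while the same shape in CSC does not.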
Rebased |
ENH: sparse: enable 64-bit index arrays & nnz > 2**31
OK time to merge this. Seems to work now, should get a few weeks in master before the 0.14.x branch. Thanks @pv. |
Is there a plan to merge this? I would like to run minimum_spanning_tree on a graph that has more edges than allowed. Seems like current maximum is 2.1 billion edges. |
It is already merged. Nobody has taken up enabling 64-bit for csgraph routines however. |
So if you need it, that could be a useful small (but not fully trivial) project to contribute to Scipy. |
Thanks, I'll take a closer look, and see if I can setup a development branch on our large-memory nodes for Scipy. |
Note that large memory is not necessarily needed for development ---
sparse matrices with int64 index data types can be created manually.
|
Unfortunately, many csgraph functions require square CSR input, so you do actually need lots of memory for the |
Yes, but note that you can consider small matrices with 64-bit indices.
|
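A minimal sketch of building a small matrix with 64-bit indices by hand, as suggested above. It assumes the index arrays are replaced after construction, because the constructor may pick a smaller index dtype on its own depending on the SciPy version.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 3x3 matrix with one entry per row.
data = np.array([1.0, 2.0, 3.0])
indices = np.array([0, 2, 1])
indptr = np.array([0, 1, 2, 3])
A = csr_matrix((data, indices, indptr), shape=(3, 3))

# Force 64-bit index arrays; both arrays must share the same dtype.
A.indices = A.indices.astype(np.int64)
A.indptr = A.indptr.astype(np.int64)

dense = A.toarray()
```

Such a matrix needs only a few bytes of memory but should drive routines through their int64 code path, which is enough for developing and testing 64-bit support.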
Currently Scipy's sparse matrices are limited to nnz < 2**31, and crashes and other unspecified behavior occur if nnz becomes larger so that the int32 indices start to wrap around.
This PR enables sparse matrices to use either int32 or int64 as the data type of the indices. The data type is chosen automatically based on the nnz of the data and the requirements of the different sparse matrix types.
The threshold selection is written so that it can be monkeypatched to be lower for testing purposes. The 64-bit specific test suite, however, currently only consists of running the existing tests with lower thresholds and randomly assigned index array data types (these tests pass).
The direct linear algebra routines (superlu, probably also umfpack) do not work with int64 indices, but implementing that is out of scope for this PR. For matrices with nnz < 2**31, things should be backwards compatible.
However, more tests would be needed, so this PR is still a bit in a WIP state.
Some help would be appreciated:
Closes Trac #1307