ENH: reshape sparse matrices #3957
Conversation
Thanks for working on this! My stackoverflow code returned COO out of convenience. As a method of a sparse matrix class, I would expect the result to have the same format as the input.
OK, I'll change this, but there is no scipy convention about returned types, right? I don't have any data, but I think many functions do not return the same type as their inputs.
Implemented in the latest commit.
Would it ever make sense to leave the exact return format unspecified, therefore allowing future modifications that change the sparse format of the returned matrix to be technically backwards compatible? Maybe it would not make sense for this particular function, but a more flexible interface could potentially avoid multiple redundant format conversions.
I would also expect to get the same type back from `reshape`. Aside, I'd also like to know whether I can expect the new matrix to share memory with the input if at all possible (e.g., when flattening CSR/CSC/COO to a vector), and maybe get a `copy` flag to control that.
Should the custom `lil_matrix` reshape implementation be removed, then?
For the sake of completeness, and once the functionality is in there: if you want to fully replicate the behavior of numpy, then ...
I can't see error-on-copy happening for reshaping sparse matrices. Numpy can do it because ndarrays are dense and flexible enough with striding that they can always reshape without copying. On the other hand, it could be technically possible to force ...
Without any profiling data, I think the new base class implementation of reshaping looks more efficient than the custom `lil_matrix` one.
Added a direct custom lil->coo conversion which should be more efficient than the default, which I think was through csr. The implementation of the lil->coo conversion is actually simpler than the lil->csr conversion.
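As an illustration of why the direct route is simple, here is a hedged sketch of a lil->coo conversion; `lil_to_coo` is a hypothetical helper, not the PR's actual code, and it assumes the LIL attributes `rows` and `data` hold one Python list per row:

```python
import numpy as np
import scipy.sparse

def lil_to_coo(A):
    # Repeat each row index once per stored entry in that row...
    row = np.repeat(np.arange(A.shape[0]), [len(r) for r in A.rows])
    # ...then flatten the per-row column-index and value lists.
    col = np.concatenate([np.asarray(r, dtype=np.intp) for r in A.rows])
    data = np.concatenate([np.asarray(d, dtype=A.dtype) for d in A.data])
    return scipy.sparse.coo_matrix((data, (row, col)), shape=A.shape)
```

Unlike csr construction, no `indptr` cumulative-sum bookkeeping is needed, which is the sense in which this path is simpler.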
By the way, one example of a sparse matrix member function whose returned matrix has a different format is ...
simplified the csr `indptr` calculation in the lil->csr conversion
Are you suggesting to add a ...?
Unit testing exceeded Travis's 50 minute limit.
Ahh... I didn't realise that setting ...
I don't think sharing memory without in-place modification is feasible for these formats in general, unless you mean that the reshape should be a virtual reshape, wherein it just reinterprets indexing operations as a result of the shape (and the true reshape is forced before a matrix multiplication, for instance).
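The index-reinterpretation idea can be made concrete for COO: map each (row, col) pair to its flat C-order position in the old shape, then unravel that position under the new shape. A minimal sketch under those assumptions (`coo_reshape` is a hypothetical name, not the PR's implementation):

```python
import numpy as np
import scipy.sparse

def coo_reshape(A, shape):
    # Flatten old 2-D indices to linear (C-order) positions...
    flat = np.ravel_multi_index((A.row, A.col), A.shape)
    # ...then reinterpret those positions under the new shape.
    new_row, new_col = np.unravel_index(flat, shape)
    return scipy.sparse.coo_matrix((A.data, (new_row, new_col)), shape=shape)
```

Note that even this cheap version allocates fresh index arrays, which is why a fully memory-sharing reshape is hard for sparse formats.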
Reading this again, I would guess the answer is yes for sparse matrix formats derived from ... Adding a `copy` kwarg to reshape could be possible, but would be more straightforward if all sparse matrix classes had a `copy` kwarg in their `.tocoo()`. Some do, but some don't.
In order to support setting ...
Ahh yes, perhaps @larsmans is right that ...
@larsmans When I added the `copy` flag to `coo_matrix` reshape, I set the default to True because of the precedent that the 'compressed' sparse matrices have a ...
This PR now also includes some unrelated changes to the lil->csr conversion implementation, which should not affect the behavior of the function. The historical context is that for some reason ...
@argriffing ...
@larsmans That makes sense; I've changed the default to allow data sharing.
@larsmans, it's true that numpy defaults to sharing memory under ...
For the moment, ... Perhaps the solution is indeed to keep ...
The documentation for ... It seems like many of the sparse matrix formats are pretty leaky abstractions, which makes it harder for us to optimize them. (For example, replacing ...)
Yes, it can be confusing because there are different levels of in-place. Reshaping (with copy=True) is out-of-place at all levels. To be clearer about the multiple levels of in-place-ness: the coo ... and the implementation contains the line ... which indicates in-place modification at one level, by changing attributes of the matrix object.
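An assumed illustration of the two levels, not taken from the PR: rebinding an attribute modifies the matrix object while leaving the old buffer alone, whereas slice assignment mutates the buffer itself:

```python
import numpy as np
import scipy.sparse

A = scipy.sparse.coo_matrix(np.eye(3))
buf = A.row                # hold a reference to the current index buffer

A.row = A.row.copy()       # object-level in-place: A's attribute is rebound,
                           # but the array `buf` points at is untouched
print(np.may_share_memory(buf, A.row))   # the two no longer share memory

A.row[:] = A.row[::-1]     # buffer-level in-place: mutates the array itself
```

Reshaping with copy=True is out-of-place even at the object level, since a new matrix object is returned.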
@argriffing Yep, makes sense. Is there a way to make this distinction more visible to users via documentation? I suggest that we explicitly state that internal members of sparse matrices are private. That would remove the confusion from the user's point of view, and allow us to rely on invariants (like ...).
Yes, I was thinking in terms of the problem at hand:
For ..., I think this simply cannot be true for user code that requires efficiency. One can instead provide the caveat that internal structures may be mutated by operations, so one should not keep external references to the matrix's members when calling ops on the matrix directly.
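For example (an illustrative sketch, not from this PR), `sum_duplicates` on a COO matrix can rebind its internal arrays, leaving a previously saved reference stale:

```python
import numpy as np
import scipy.sparse

row, col = [0, 0], [1, 1]   # two entries at the same position (0, 1)
A = scipy.sparse.coo_matrix((np.array([1.0, 2.0]), (row, col)), shape=(2, 2))

d = A.data            # external reference to an internal member
A.sum_duplicates()    # collapses duplicates; rebinds A.data internally

print(A.toarray()[0, 1])   # the matrix now holds the summed value
print(d.shape)             # the old reference still sees the pre-op buffer
```

This is the kind of caveat users would need if internal members remain nominally public.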
This sounds reasonable. Would it be expected to go into this PR?
I don't think so. It's a more general issue.
The issue here is that `copy` shouldn't be used without being aware of those caveats.
What else is needed for this to be merged? I will gladly help with that, if anything. Regarding in-place reshaping, I've argued for its complete removal in #3964.
@memeplex I think the complication is the lack of a clear abstraction boundary for scipy sparse matrices. As @perimosocordiae has mentioned, it is 'leaky'. For example, I think the conversion functions ...

Another 'leak' is that the internal structures of ...

Yet another leak is the canonical vs. non-canonical internal representation of the sparse matrices, in terms of sorting or duplication of entries. This has been discussed in #4409, where I accidentally wrote something that was incorrect, which seems to have concluded the discussion. I think that some of these leaks may be under-documented or under-tested, so that if you change the ...
Here's an example of the current behavior:

>>> import numpy as np
>>> import scipy.sparse
>>> data = np.array([0, 0, 0, 1, 1, 1], dtype=int)
>>> row = [0, 0, 0, 0, 0, 0]
>>> col = [0, 1, 1, 2, 3, 3]
>>> A = scipy.sparse.coo_matrix((data, (row, col)), shape=(4, 4), dtype=int)
>>> np.may_share_memory(data, A.data)
False
>>> A = scipy.sparse.coo_matrix((data, (row, col)), shape=(4, 4))
>>> np.may_share_memory(data, A.data)
True
>>> A.nnz, A.has_canonical_format
(6, False)
>>> A = A.tocsr()
>>> A.nnz, A.has_canonical_format
(4, True)
>>> A.sum_duplicates()
>>> A.nnz, A.has_canonical_format
(4, True)
>>> A = scipy.sparse.csr_matrix(A)
>>> A.nnz, A.has_canonical_format
(4, True)
>>> A = scipy.sparse.csr_matrix(A.A)
>>> A.nnz, A.has_canonical_format
(2, True)
@argriffing I notice the current implementation does not have an ... The implementation is quite straightforward. I suggest we add this to the current implementation.
closes #3507

Before this PR, sparse matrix reshaping raised `NotImplementedError` except with the `lil_matrix` format. The coordinate conversion implementation is due to @WarrenWeckesser at http://stackoverflow.com/questions/16511879 so please do not merge until he approves.

The test added to the base class is modeled on the only existing sparse matrix reshaping test -- the `test_reshape` test for `lil_matrix`.