
ENH: reshape sparse matrices #3957

Closed (wants to merge 14 commits)

Conversation

@argriffing (Contributor)

closes #3507

Before this PR, sparse matrix reshaping raised NotImplementedError except with the lil_matrix format.

The coordinate conversion implementation is due to @WarrenWeckesser at http://stackoverflow.com/questions/16511879 so please do not merge until he approves.

The test added to the base class is modeled on the only existing sparse matrix reshaping test -- the test_reshape test for lil_matrix.
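For reference, the coordinate-conversion trick behind this PR can be sketched in plain NumPy. This is a sketch of the idea, not the PR's exact code, and `reshape_coo_coords` is a hypothetical helper name:

```python
import numpy as np

def reshape_coo_coords(row, col, old_shape, new_shape):
    """Map COO (row, col) coordinates from old_shape to new_shape, C order."""
    if np.prod(old_shape) != np.prod(new_shape):
        raise ValueError("total size of new array must be unchanged")
    # Flatten each coordinate pair to a linear index, then unflatten
    # against the new shape.
    flat = np.asarray(row) * old_shape[1] + np.asarray(col)
    return divmod(flat, new_shape[1])

# A (2, 3) matrix with entries at (0, 2) and (1, 0) reshaped to (3, 2):
new_row, new_col = reshape_coo_coords([0, 1], [2, 0], (2, 3), (3, 2))
# flat indices 2 and 3 map to (1, 0) and (1, 1)
```

The nonzero values themselves never move; only their coordinates are recomputed, which is why COO is such a convenient intermediate format.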

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 291f163 on argriffing:sparse-reshape into 1eb12db on scipy:master.

@WarrenWeckesser (Member)

Thanks for working on this; a reshape method will be a nice enhancement.

My stackoverflow code returned COO out of convenience. As a method of a sparse matrix class, I would expect reshape to return a matrix of the same matrix class. The use of COO is an implementation detail that shouldn't be exposed to the caller. For example, some day a more efficient reshape might be implemented for csr_matrix. It would be unfortunate if it had to convert to COO just to maintain backwards compatibility.

@argriffing (Contributor, Author)

OK I'll change this, but there is no scipy convention about returned types, right? I don't have any data but I think many functions do not return the same type as their inputs.

@argriffing (Contributor, Author)

As a method of a sparse matrix class, I would expect reshape to return a matrix of the same matrix class.

Implemented in the latest commit.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 90a0d38 on argriffing:sparse-reshape into 1eb12db on scipy:master.

@argriffing (Contributor, Author)

It would be unfortunate if it had to convert to COO just to maintain backwards compatibility.

Would it ever make sense to leave the exact return format unspecified, therefore allowing future modifications that change the sparse format of the returned matrix to be technically backwards compatible? Maybe it would not make sense for this particular function, but a more flexible interface could potentially avoid multiple redundant format conversions.

@larsmans (Contributor)

larsmans commented Sep 3, 2014

I would also expect to get the same type back from reshape, or otherwise have the docs tell me that I shouldn't rely on the actual format being returned. I think it makes a lot of sense to leave the format unspecified, since it allows for further optimizations. We've done this in scikit-learn, where some feature extraction code used to return COO until we figured out how to build a CSR matrix directly.

Aside, I'd also like to know whether I can expect the new matrix to share memory with the input if at all possible (e.g., when flattening CSR/CSC/COO to a vector), and maybe get a copy argument that forces a copy even if not necessary. Given how reshape works in NumPy, I would expect this function to at least claim the possibility of reusing input memory.

@argriffing (Contributor, Author)

Should the custom lil_matrix reshaping function at https://github.com/scipy/scipy/blob/master/scipy/sparse/lil.py#L352 be kept, or would it be OK to delete it and let lil_matrix use the new base class implementation? I kind of want to delete it.


@jaimefrio (Member)

For the sake of completeness, and once the functionality is in there, if you want to fully replicate the behavior of numpy, then a.shape = new_shape should perform an in-place reshape. In numpy it raises an error if it requires a copy of the data, so this may tie in with @larsmans comments.
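NumPy's behaviour here is easy to demonstrate: assigning to `.shape` succeeds only when the reshape is possible without a copy, and raises otherwise:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
a.shape = (3, 2)                    # fine: contiguous buffer, reshaped in place

b = np.arange(6).reshape(2, 3).T    # a non-contiguous (transposed) view
try:
    b.shape = (6,)                  # would require a copy, so numpy refuses
    copied_silently = True
except AttributeError:
    copied_silently = False
```

Sparse formats have no analogue of strides, so an equivalent no-copy guarantee would be much harder to offer.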

@argriffing (Contributor, Author)

if you want to fully replicate the behavior of numpy, then a.shape = new_shape should perform an in-place reshape. In numpy it raises an error if it requires a copy of the data,

I can't see error-on-copy happening for reshaping sparse matrices. Numpy can offer it because ndarrays are dense and flexible enough with striding that reshaping rarely requires a copy.

On the other hand it could be technically possible to force s.shape = new_shape to reshape s. If this is possible I hope it is out of scope for this PR.

@coveralls

Coverage Status

Coverage increased (+0.07%) when pulling 90a0d38 on argriffing:sparse-reshape into 1eb12db on scipy:master.

@argriffing (Contributor, Author)

Without any profiling data, I think the new base class implementation of reshaping looks more efficient than the custom lil_matrix reshape function, so I removed the custom function. I also removed the custom test in favor of the base class test.

@argriffing (Contributor, Author)

Added a direct custom lil->coo conversion which should be more efficient than the default, which I think was through csr. The implementation of the lil->coo conversion is actually simpler than the lil->csr conversion.
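The idea of a direct lil->coo conversion can be sketched with plain lists and NumPy. This is a hypothetical standalone helper; the real lil_matrix keeps these per-row lists in its `.rows` and `.data` attributes:

```python
import numpy as np

def lil_to_coo_arrays(lil_rows, lil_data):
    """Build COO row/col/data arrays straight from LIL-style per-row lists."""
    lengths = [len(r) for r in lil_rows]
    # Row index i is repeated once per stored entry in that row.
    row = np.repeat(np.arange(len(lil_rows)), lengths)
    col = np.array([c for r in lil_rows for c in r], dtype=np.intp)
    data = np.array([v for d in lil_data for v in d])
    return row, col, data

# rows 0 and 2 have entries; row 1 is empty
row, col, data = lil_to_coo_arrays([[0, 2], [], [1]], [[1.0, 2.0], [], [3.0]])
# row -> [0, 0, 2], col -> [0, 2, 1], data -> [1.0, 2.0, 3.0]
```

No indptr bookkeeping is needed, which is what makes this simpler than the lil->csr path.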

@argriffing (Contributor, Author)

By the way, one example of a sparse matrix member function whose returned matrix has a different format is the transpose() of csr or csc matrices, which returns the complementary format for reasons of efficiency. But this is not implemented consistently: lil transpose first converts to csr, takes the transpose (yielding a csc), and then takes the extra step of converting back to lil, presumably through csr.
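The format switch is easy to observe directly (this assumes a scipy installation):

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.eye(2))
B = sparse.csc_matrix(np.eye(2))

# Transposing swaps which axis is compressed, so the complementary
# format comes for free, with no data movement at all:
assert A.T.format == 'csc'
assert B.T.format == 'csr'
```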

@argriffing (Contributor, Author)

simplified the csr indptr calculation in the lil->csr conversion
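The indptr computation referred to here is essentially a cumulative sum of per-row entry counts (a sketch of the idea, not scipy's actual code):

```python
import numpy as np

# CSR's indptr has length nrows + 1; row i's entries occupy
# data[indptr[i]:indptr[i+1]], so indptr is a cumsum of row lengths.
row_lengths = np.array([2, 0, 3])   # nonzeros stored in each row
indptr = np.concatenate(([0], np.cumsum(row_lengths)))
# indptr -> [0, 2, 2, 5]
```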

@argriffing (Contributor, Author)

maybe get a copy argument that forces a copy even if not necessary. Given how reshape works in NumPy, I would expect this function to at least claim the possibility of reusing input memory.

Are you suggesting to add a copy keyword argument? Numpy reshape doesn't have this, and numpy also may return an object that shares memory. Numpy has a may_share_memory function, but I don't think this is implemented for scipy sparse matrices.
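For dense arrays, at least, the view behaviour is easy to check with np.may_share_memory:

```python
import numpy as np

a = np.arange(6)
view = a.reshape(2, 3)            # reshape of a contiguous array is a view
copy = a.reshape(2, 3).copy()     # an explicit copy gets its own buffer

assert np.may_share_memory(a, view)
assert not np.may_share_memory(a, copy)
```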

@argriffing (Contributor, Author)

Unit testing exceeded Travis's 50-minute limit.

@jnothman (Contributor)

jnothman commented Sep 4, 2014

For the sake of completeness, and once the functionality is in there, if you want to fully replicate the behavior of numpy, then a.shape = new_shape should perform an in-place reshape.

Ahh... I didn't realise that setting shape in numpy performed a reshape; thus in #3963 I have it do a "resize", which is a more natural in-place operation for sparse matrices. Currently the code for LIL is broken: it does self = self.reshape(shape) which is meaningless, all the more so because reshape returns a copy.

@jnothman (Contributor)

jnothman commented Sep 4, 2014

@larsmans

Aside, I'd also like to know whether I can expect the new matrix to share memory with the input if at all possible (e.g., when flattening CSR/CSC/COO to a vector)

I don't think sharing memory without in-place modification is feasible for these formats in general, unless you mean that the reshape should be a virtual reshape, wherein it just reinterprets indexing operations as a result of the shape (and the true reshape is forced before a matrix multiplication for instance).

@argriffing (Contributor, Author)

Aside, I'd also like to know whether I can expect the new matrix to share memory with the input if at all possible (e.g., when flattening CSR/CSC/COO to a vector)

Reading this again, I would guess the answer is yes for sparse matrix formats derived from _data_matrix, but I'll try adding a test. I can't find a scipy sparse test aggregation that tests only the 'data matrix' sparse formats and not the other sparse formats.

Adding a copy kwarg to reshape could be possible but would be more straightforward if all sparse matrix classes had a copy kwarg in their .tocoo(). Some do, but some don't.

@jnothman (Contributor)

jnothman commented Sep 4, 2014

In order to support setting .shape (assuming this is a good idea for sparse matrices), it might be worth having a helper method on each matrix format that copies the vital slots (e.g. data, indices, indptr, _shape) from another instance. This would enable making operations appear to be in-place when they involve a copy for ease of implementation.
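A minimal sketch of that idea follows; the class name `FakeCSR` and the method name `_become` are hypothetical, and the slots assume a CSR/CSC-like format:

```python
class FakeCSR:
    """Stand-in for a compressed sparse matrix, for illustration only."""

    def __init__(self, data, indices, indptr, shape):
        self.data, self.indices, self.indptr = data, indices, indptr
        self._shape = shape

    def _become(self, other):
        # Adopt the other instance's vital slots so that an operation
        # implemented out-of-place can still appear to mutate self.
        self.data, self.indices, self.indptr, self._shape = (
            other.data, other.indices, other.indptr, other._shape)

a = FakeCSR([1, 2], [0, 1], [0, 1, 2], (2, 2))
b = FakeCSR([5], [1], [0, 0, 1], (2, 2))
a._become(b)
# a now presents b's contents while remaining the same Python object,
# so external references to the matrix `a` itself stay valid
```

A `.shape` setter could then delegate to reshape and adopt the result via such a helper.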

@jnothman (Contributor)

jnothman commented Sep 4, 2014

I would guess the answer is yes for sparse matrix formats derived from _data_matrix, but I'll try adding a test.

Ahh yes, perhaps @larsmans is right that data can be shared non-destructively, but not indices or indptr. data can't be shared when the major axis size is increased (e.g. csr (2, 3) -> (6, 1)) unless its indices are sorted.

@argriffing (Contributor, Author)

@larsmans When I added the copy flag to coo_matrix reshape, I set the default to True because of the precedent that the 'compressed' sparse matrices have a tocoo copy flag that defaults to True. Do you know whether this default is the expected convention for optional copy flags?

@argriffing (Contributor, Author)

This PR now also includes some unrelated changes to the lil->csr conversion implementation which should not affect the behavior of the function. The historical context is that for some reason reshape had been implemented for lil_matrix but not for any other sparse matrix format.

@coveralls

Coverage Status

Coverage increased (+0.08%) when pulling 7a34a1f on argriffing:sparse-reshape into 1eb12db on scipy:master.


@larsmans (Contributor)

larsmans commented Sep 4, 2014

@argriffing copy=True is a common default for other operations, but in this case emulating NumPy's non-copying behavior by defaulting to False is a sensible option too (IMHO).

@argriffing (Contributor, Author)

@larsmans that makes sense, I've changed the default to allow data sharing

@coveralls

Coverage Status

Coverage increased (+0.17%) when pulling ec0686e on argriffing:sparse-reshape into 1eb12db on scipy:master.

@jnothman (Contributor)

jnothman commented Sep 5, 2014

@larsmans, it's true that numpy defaults to sharing memory under reshape, but there are operations in scipy.sparse in which the user does not intend to modify the matrix, but in which the underlying data is mutated. Specifically, the following may call sum_duplicates:

  • coo_matrix.todok()
  • _cs_matrix.__{eq,ne,gt,lt,le,ge}__()
  • _cs_matrix.{max,min}imum()
  • _cs_matrix.__setitem__() (e.g. replacing an existing element will still trigger this if the element is duplicate)

For the moment, coo_matrix.sum_duplicates() does not appear to operate in-place the way _cs_matrix's does (despite its docstring saying that it does); had it modified in place, my_coo.reshape(...).todok() would break my_coo.

Perhaps the solution is indeed to keep reshape with copy=False, but to make sure the above operations copy data when duplicates are summed (i.e. add a copy_data param to sum_duplicates which is set to True when the method is called internally).

@perimosocordiae (Member)

The documentation for coo_matrix.sum_duplicates() refers to the sparse matrix itself being modified in-place, rather than its internal arrays.

It seems like many of the sparse matrix formats are pretty leaky abstractions, which makes it harder for us to optimize them. (For example, replacing coo_matrix.row with a different data structure would break lots of user code that expects a plain numpy array.)

@argriffing (Contributor, Author)

Yes it can be confusing because there are different levels of in-place. Reshaping (with copy=True) is out-of-place at all levels. coo_matrix sum_duplicates is in-place in the sense that it modifies the coo_matrix itself rather than just returning a new coo_matrix, but it is out-of-place at the level of the internal arrays. It would theoretically be possible to do coo sum_duplicates at the deeper level of in-place-ness by changing the contents of the internal arrays themselves and then using numpy's in-place array re-sizing.

To be clearer about the multiple levels of in-place-ness, the coo sum_duplicates docstring says

This is an *in place* operation

and the implementation contains the line

self.row = self.row[order]

which indicates in-place modification at one level by changing attributes of self, but not at a deeper level because the ndarray operation itself is not in-place.
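The distinction is easy to demonstrate with a bare ndarray, outside scipy.sparse entirely:

```python
import numpy as np

row = np.array([3, 1, 2])
external = row                      # an outside reference to the same buffer
order = np.argsort(row)

# Attribute-style rebinding: allocates a new array; the old buffer,
# still visible through `external`, is untouched.
row = row[order]
assert external.tolist() == [3, 1, 2]

# Genuinely in-place at the array level: the shared buffer is mutated,
# and every reference to it sees the change.
external[:] = external[np.argsort(external)]
assert external.tolist() == [1, 2, 3]
```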

@perimosocordiae (Member)

@argriffing Yep, makes sense. Is there a way to make this distinction more visible to users via documentation?

I suggest that we explicitly state that internal members of sparse matrices are private. That would remove the confusion from the user's point of view, and allow us to rely on invariants (like self.has_canonical_format) without wondering if the user has modified internal structures since the last call to sum_duplicates.

@jnothman (Contributor)

jnothman commented Sep 6, 2014

Yes it can be confusing because there are different levels of in-place.

Yes, I was thinking in terms of the problem at hand: copy=False operations. coo_matrix.sum_duplicates() could well be written with in-place modification of rows/cols/data, and may be in the future. This is still a problem we need to consider, as distinct from numpy, when offering copy=False as a default.

I suggest that we explicitly state that internal members of sparse matrices are private.

I think this simply cannot be true for user code that requires efficiency. One can instead provide the caveat that internal structures may be mutated by operations, so one should not keep external references to the matrix's members when calling ops on the matrix directly.

@argriffing (Contributor, Author)

One can instead provide the caveat that internal structures may be mutated by operations so one should not keep external references to the matrix's members when calling ops on the matrix directly.

This sounds reasonable. Would it be expected to go into this PR?

@jnothman (Contributor)

jnothman commented Sep 9, 2014

I don't think so. It's a more general issue.


@jnothman (Contributor)

jnothman commented Sep 9, 2014

The issue here is that copy shouldn't be used without being aware of those
sorts of things...


@memeplex

What else is needed for this to be merged? I will gladly help with that, if anything. Regarding in-place reshaping, I've argued for its complete removal in #3964.

@argriffing (Contributor, Author)

@memeplex I think the complication is the lack of a clear abstraction boundary for scipy sparse matrices. As @perimosocordiae has mentioned, it is 'leaky'.

For example, I think the conversion functions
https://github.com/scipy/scipy/blob/master/scipy/sparse/lil.py#L372
https://github.com/scipy/scipy/blob/master/scipy/sparse/csc.py#L126
https://github.com/scipy/scipy/blob/master/scipy/sparse/coo.py#L325
all have the secret requirement to preserve the distinction between zeros that are represented through the sparsity structure vs. zeros that are represented by a value of zero. The interface of csgraph depends on this distinction.
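The stored-zero distinction can be seen directly (assuming a scipy installation):

```python
import numpy as np
from scipy import sparse

# One explicitly stored zero at (0, 1) and one ordinary nonzero at (1, 0).
A = sparse.coo_matrix((np.array([0.0, 5.0]), ([0, 1], [1, 0])), shape=(2, 2))
B = A.tocsr()
assert B.nnz == 2        # the explicit zero survives the format conversion
B.eliminate_zeros()      # only an explicit request drops stored zeros
assert B.nnz == 1
```

csgraph relies on exactly this distinction: a stored zero can mean a zero-weight edge, while an unstored entry means no edge at all.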

Another 'leak' is that the internal structures of scipy.sparse matrices may share memory with other structures. Of course, this could be seen as a useful feature.

Yet another leak is the canonical vs. non-canonical internal representation of the sparse matrices, in terms of sorting or duplication of entries. This has been discussed in #4409 where I accidentally wrote something that was incorrect, which seems to have concluded the discussion.

I think that some of these leaks may be under-documented or under-tested, so that if you change the scipy.sparse code in a way that seems to agree with the documentation and passes the scipy unit tests, it still may break code written by a user. If this is true, then maybe the situation could be improved by adding more tests and docs as @jnothman has suggested in #4409 for csgraph.

@argriffing (Contributor, Author)

Here's an example of the current behavior of scipy.sparse. I'm not sure how much of it is relevant to sparse matrix reshaping, or how much is documented or tested, or which parts are considered bugs or incidental undefined behavior, or features.

>>> import numpy as np
>>> import scipy.sparse
>>> data = np.array([0, 0, 0, 1, 1, 1], dtype=int)
>>> row = [0, 0, 0, 0, 0, 0]
>>> col = [0, 1, 1, 2, 3, 3]
>>> A = scipy.sparse.coo_matrix((data, (row, col)), shape=(4, 4), dtype=int)
>>> np.may_share_memory(data, A.data)
False
>>> A = scipy.sparse.coo_matrix((data, (row, col)), shape=(4, 4))
>>> np.may_share_memory(data, A.data)
True
>>> A.nnz, A.has_canonical_format
(6, False)
>>> A = A.tocsr()
>>> A.nnz, A.has_canonical_format
(4, True)
>>> A.sum_duplicates()
>>> A.nnz, A.has_canonical_format
(4, True)
>>> A = scipy.sparse.csr_matrix(A)
>>> A.nnz, A.has_canonical_format
(4, True)
>>> A = scipy.sparse.csr_matrix(A.A)
>>> A.nnz, A.has_canonical_format
(2, True)

@zaxliu

zaxliu commented Sep 18, 2015

@argriffing I notice the current implementation does not have an order argument. This caused me some trouble, as I would like the reshape to work in 'Fortran' order rather than 'C' order.

The implementation is quite straightforward. I suggest we add this to the current implementation.
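An order-aware version of the coordinate mapping is indeed a small change; np.ravel_multi_index and np.unravel_index already take an order argument. A sketch, not scipy's actual code:

```python
import numpy as np

def reshape_coords(row, col, old_shape, new_shape, order='C'):
    """Map sparse (row, col) coordinates between shapes in C or Fortran order."""
    flat = np.ravel_multi_index((row, col), old_shape, order=order)
    return np.unravel_index(flat, new_shape, order=order)

# C order: entry (0, 2) of a (2, 3) matrix lands at (1, 0) in (3, 2)
r, c = reshape_coords([0], [2], (2, 3), (3, 2))
# Fortran order: entry (1, 0) of (2, 3) maps to (1, 0) in (3, 2)
rf, cf = reshape_coords([1], [0], (2, 3), (3, 2), order='F')
```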

Labels: enhancement, scipy.sparse