
ENH: reshape sparse matrices #3957

Closed (wants to merge 14 commits)

Conversation

@argriffing (Contributor)

closes #3507

Before this PR, sparse matrix reshaping raised NotImplementedError except with the lil_matrix format.

The coordinate conversion implementation is due to @WarrenWeckesser at http://stackoverflow.com/questions/16511879 so please do not merge until he approves.

The test added to the base class is modeled on the only existing sparse matrix reshaping test -- the test_reshape test for lil_matrix.
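For reference, the coordinate-conversion trick behind this PR can be sketched in plain NumPy. This is a sketch of the idea, not the PR's exact code, and `reshape_coo_coords` is a hypothetical helper name:

```python
import numpy as np

def reshape_coo_coords(row, col, old_shape, new_shape):
    """Map COO (row, col) coordinates from old_shape to new_shape, C order."""
    if np.prod(old_shape) != np.prod(new_shape):
        raise ValueError("total size of new array must be unchanged")
    # Flatten each coordinate pair to a linear index, then unflatten
    # against the new shape.
    flat = np.asarray(row) * old_shape[1] + np.asarray(col)
    return divmod(flat, new_shape[1])

# A (2, 3) matrix with entries at (0, 2) and (1, 0) reshaped to (3, 2):
new_row, new_col = reshape_coo_coords([0, 1], [2, 0], (2, 3), (3, 2))
# flat indices 2 and 3 map to (1, 0) and (1, 1)
```

The nonzero values themselves never move; only their coordinates are recomputed, which is why COO is such a convenient intermediate format.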

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 291f163 on argriffing:sparse-reshape into 1eb12db on scipy:master.

@WarrenWeckesser (Member)

Thanks for working on this; a reshape method will be a nice enhancement.

My stackoverflow code returned COO out of convenience. As a method of a sparse matrix class, I would expect reshape to return a matrix of the same matrix class. The use of COO is an implementation detail that shouldn't be exposed to the caller. For example, some day a more efficient reshape might be implemented for csr_matrix. It would be unfortunate if it had to convert to COO just to maintain backwards compatibility.

@argriffing (Contributor, Author)

OK I'll change this, but there is no scipy convention about returned types, right? I don't have any data but I think many functions do not return the same type as their inputs.

@argriffing (Contributor, Author)

As a method of a sparse matrix class, I would expect reshape to return a matrix of the same matrix class.

Implemented in the latest commit.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 90a0d38 on argriffing:sparse-reshape into 1eb12db on scipy:master.

@argriffing (Contributor, Author)

It would be unfortunate if it had to convert to COO just to maintain backwards compatibility.

Would it ever make sense to leave the exact return format unspecified, therefore allowing future modifications that change the sparse format of the returned matrix to be technically backwards compatible? Maybe it would not make sense for this particular function, but a more flexible interface could potentially avoid multiple redundant format conversions.

@larsmans (Contributor)

larsmans commented Sep 3, 2014

I would also expect to get the same type back from reshape, or otherwise have the docs tell me that I shouldn't rely on the actual format being returned. I think it makes a lot of sense to leave the format unspecified, since it allows for further optimizations. We've done this in scikit-learn, where some feature extraction code used to return COO until we figured out how to build a CSR matrix directly.

Aside, I'd also like to know whether I can expect the new matrix to share memory with the input if at all possible (e.g., when flattening CSR/CSC/COO to a vector), and maybe get a copy argument that forces a copy even if not necessary. Given how reshape works in NumPy, I would expect this function to at least claim the possibility of reusing input memory.

@argriffing (Contributor, Author)

Should the custom lil_matrix reshaping function at https://github.com/scipy/scipy/blob/master/scipy/sparse/lil.py#L352 be kept, or would it be OK to delete it and let lil_matrix use the new base class implementation? I kind of want to delete it.


@jaimefrio (Member)

For the sake of completeness, and once the functionality is in there, if you want to fully replicate the behavior of numpy, then a.shape = new_shape should perform an in-place reshape. In numpy it raises an error if it requires a copy of the data, so this may tie in with @larsmans comments.
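NumPy's behaviour here is easy to demonstrate: assigning to `.shape` succeeds only when the reshape is possible without a copy, and raises otherwise:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
a.shape = (3, 2)                    # fine: contiguous buffer, reshaped in place

b = np.arange(6).reshape(2, 3).T    # a non-contiguous (transposed) view
try:
    b.shape = (6,)                  # would require a copy, so numpy refuses
    copied_silently = True
except AttributeError:
    copied_silently = False
```

Sparse formats have no analogue of strides, so an equivalent no-copy guarantee would be much harder to offer.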

@argriffing (Contributor, Author)

if you want to fully replicate the behavior of numpy, then a.shape = new_shape should perform an in-place reshape. In numpy it raises an error if it requires a copy of the data,

I can't see error-on-copy happening for reshaping sparse matrices. Numpy can offer it because ndarrays are dense and flexible enough with striding that reshaping rarely requires a copy.

On the other hand it could be technically possible to force s.shape = new_shape to reshape s. If this is possible I hope it is out of scope for this PR.

@coveralls

Coverage Status

Coverage increased (+0.07%) when pulling 90a0d38 on argriffing:sparse-reshape into 1eb12db on scipy:master.

@argriffing (Contributor, Author)

Without any profiling data, I think the new base class implementation of reshaping looks more efficient than the custom lil_matrix reshape function, so I removed the custom function. I also removed the custom test in favor of the base class test.

@argriffing (Contributor, Author)

Added a direct custom lil->coo conversion which should be more efficient than the default, which I think was through csr. The implementation of the lil->coo conversion is actually simpler than the lil->csr conversion.
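The idea of a direct lil->coo conversion can be sketched with plain lists and NumPy. This is a hypothetical standalone helper; the real lil_matrix keeps these per-row lists in its `.rows` and `.data` attributes:

```python
import numpy as np

def lil_to_coo_arrays(lil_rows, lil_data):
    """Build COO row/col/data arrays straight from LIL-style per-row lists."""
    lengths = [len(r) for r in lil_rows]
    # Row index i is repeated once per stored entry in that row.
    row = np.repeat(np.arange(len(lil_rows)), lengths)
    col = np.array([c for r in lil_rows for c in r], dtype=np.intp)
    data = np.array([v for d in lil_data for v in d])
    return row, col, data

# rows 0 and 2 have entries; row 1 is empty
row, col, data = lil_to_coo_arrays([[0, 2], [], [1]], [[1.0, 2.0], [], [3.0]])
# row -> [0, 0, 2], col -> [0, 2, 1], data -> [1.0, 2.0, 3.0]
```

No indptr bookkeeping is needed, which is what makes this simpler than the lil->csr path.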

@argriffing (Contributor, Author)

By the way, one example of a sparse matrix member function whose returned matrix has a different format is the transpose() of csr or csc matrices, which returns the complementary format for reasons of efficiency. But this is not implemented consistently: lil transpose first converts to csr, takes the transpose (yielding a csc), and then takes the extra step of converting back to lil, presumably through csr.
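The format switch is easy to observe directly (this assumes a scipy installation):

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.eye(2))
B = sparse.csc_matrix(np.eye(2))

# Transposing swaps which axis is compressed, so the complementary
# format comes for free, with no data movement at all:
assert A.T.format == 'csc'
assert B.T.format == 'csr'
```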

@argriffing (Contributor, Author)

simplified the csr indptr calculation in the lil->csr conversion
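The indptr computation referred to here is essentially a cumulative sum of per-row entry counts (a sketch of the idea, not scipy's actual code):

```python
import numpy as np

# CSR's indptr has length nrows + 1; row i's entries occupy
# data[indptr[i]:indptr[i+1]], so indptr is a cumsum of row lengths.
row_lengths = np.array([2, 0, 3])   # nonzeros stored in each row
indptr = np.concatenate(([0], np.cumsum(row_lengths)))
# indptr -> [0, 2, 2, 5]
```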

@argriffing (Contributor, Author)

maybe get a copy argument that forces a copy even if not necessary. Given how reshape works in NumPy, I would expect this function to at least claim the possibility of reusing input memory.

Are you suggesting to add a copy keyword argument? Numpy reshape doesn't have this, and numpy also may return an object that shares memory. Numpy has a may_share_memory function, but I don't think this is implemented for scipy sparse matrices.
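For dense arrays, at least, the view behaviour is easy to check with np.may_share_memory:

```python
import numpy as np

a = np.arange(6)
view = a.reshape(2, 3)            # reshape of a contiguous array is a view
copy = a.reshape(2, 3).copy()     # an explicit copy gets its own buffer

assert np.may_share_memory(a, view)
assert not np.may_share_memory(a, copy)
```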

@argriffing (Contributor, Author)

Unit testing exceeded Travis's 50-minute limit.

@jnothman (Contributor)

jnothman commented Sep 4, 2014

For the sake of completeness, and once the functionality is in there, if you want to fully replicate the behavior of numpy, then a.shape = new_shape should perform an in-place reshape.

Ahh... I didn't realise that setting shape in numpy performed a reshape; thus in #3963 I have it do a "resize", which is a more natural in-place operation for sparse matrices. Currently the code for LIL is broken: it does self = self.reshape(shape) which is meaningless, all the more so because reshape returns a copy.

@jnothman (Contributor)

jnothman commented Sep 4, 2014

@larsmans

Aside, I'd also like to know whether I can expect the new matrix to share memory with the input if at all possible (e.g., when flattening CSR/CSC/COO to a vector)

I don't think sharing memory without in-place modification is feasible for these formats in general, unless you mean that the reshape should be a virtual reshape, wherein it just reinterprets indexing operations as a result of the shape (and the true reshape is forced before a matrix multiplication for instance).

@argriffing (Contributor, Author)

Aside, I'd also like to know whether I can expect the new matrix to share memory with the input if at all possible (e.g., when flattening CSR/CSC/COO to a vector)

Reading this again, I would guess the answer is yes for sparse matrix formats derived from _data_matrix, but I'll try adding a test. I can't find a scipy sparse test aggregation that tests only the 'data matrix' sparse formats and not the other sparse formats.

Adding a copy kwarg to reshape could be possible but would be more straightforward if all sparse matrix classes had a copy kwarg in their .tocoo(). Some do, but some don't.

@jnothman (Contributor)

jnothman commented Sep 4, 2014

In order to support setting .shape (assuming this is a good idea for sparse matrices), it might be worth having a helper method on each matrix format that copies the vital slots (e.g. data, indices, indptr, _shape) from another instance. This would enable making operations appear to be in-place when they involve a copy for ease of implementation.
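A minimal sketch of that idea follows; the class name `FakeCSR` and the method name `_become` are hypothetical, and the slots assume a CSR/CSC-like format:

```python
class FakeCSR:
    """Stand-in for a compressed sparse matrix, for illustration only."""

    def __init__(self, data, indices, indptr, shape):
        self.data, self.indices, self.indptr = data, indices, indptr
        self._shape = shape

    def _become(self, other):
        # Adopt the other instance's vital slots so that an operation
        # implemented out-of-place can still appear to mutate self.
        self.data, self.indices, self.indptr, self._shape = (
            other.data, other.indices, other.indptr, other._shape)

a = FakeCSR([1, 2], [0, 1], [0, 1, 2], (2, 2))
b = FakeCSR([5], [1], [0, 0, 1], (2, 2))
a._become(b)
# a now presents b's contents while remaining the same Python object,
# so external references to the matrix `a` itself stay valid
```

A `.shape` setter could then delegate to reshape and adopt the result via such a helper.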

@jnothman (Contributor)

jnothman commented Sep 4, 2014

I would guess the answer is yes for sparse matrix formats derived from _data_matrix, but I'll try adding a test.

Ahh yes, perhaps @larsmans is right that data can be shared non-destructively, but not indices or indptr. data can't be shared when the major axis size is increased (e.g. csr (2, 3) -> (6, 1)) unless its indices are sorted.

@argriffing (Contributor, Author)

@larsmans When I added the copy flag to coo_matrix reshape, I set the default to True because of the precedent that the 'compressed' sparse matrices have a tocoo copy flag that defaults to True. Do you know whether this default is the expected convention for optional copy flags?

@argriffing (Contributor, Author)

This PR now also includes some unrelated changes to the lil->csr conversion implementation which should not affect the behavior of the function. The historical context is that for some reason reshape had been implemented for lil_matrix but not for any other sparse matrix format.

@coveralls

Coverage Status

Coverage increased (+0.08%) when pulling 7a34a1f on argriffing:sparse-reshape into 1eb12db on scipy:master.


@larsmans (Contributor)

larsmans commented Sep 4, 2014

@argriffing copy=True is a common default for other operations, but in this case emulating NumPy's non-copying behavior by defaulting to False is a sensible option too (IMHO).

@argriffing (Contributor, Author)

@larsmans that makes sense, I've changed the default to allow data sharing

@coveralls

Coverage Status

Coverage increased (+0.17%) when pulling ec0686e on argriffing:sparse-reshape into 1eb12db on scipy:master.

@jnothman (Contributor)

jnothman commented Sep 5, 2014

@larsmans, it's true that numpy defaults to sharing memory under reshape, but there are operations in scipy.sparse in which the user does not intend to modify the matrix, but in which the underlying data is mutated. Specifically, the following may call sum_duplicates:

  • coo_matrix.todok()
  • _cs_matrix.__{eq,ne,gt,lt,le,ge}__()
  • _cs_matrix.{max,min}imum()
  • _cs_matrix.__setitem__() (e.g. replacing an existing element will still trigger this if the element is duplicate)

For the moment, coo_matrix.sum_duplicates() does not appear to operate in-place the way _cs_matrix's does (despite its docstring saying that it does); had it modified in place, my_coo.reshape(...).todok() would break my_coo.

Perhaps the solution is indeed to keep reshape with copy=False, but to make sure the above operations copy data when duplicates are summed (i.e. add a copy_data param to sum_duplicates which is set to True when the method is called internally).

@perimosocordiae (Member)

The documentation for coo_matrix.sum_duplicates() refers to the sparse matrix itself being modified in-place, rather than its internal arrays.

It seems like many of the sparse matrix formats are pretty leaky abstractions, which makes it harder for us to optimize them. (For example, replacing coo_matrix.row with a different data structure would break lots of user code that expects a plain numpy array.)

@argriffing (Contributor, Author)

Yes it can be confusing because there are different levels of in-place. Reshaping (with copy=True) is out-of-place at all levels. coo_matrix sum_duplicates is in-place in the sense that it modifies the coo_matrix itself rather than just returning a new coo_matrix, but it is out-of-place at the level of the internal arrays. It would theoretically be possible to do coo sum_duplicates at the deeper level of in-place-ness by changing the contents of the internal arrays themselves and then using numpy's in-place array re-sizing.

To be clearer about the multiple levels of in-place-ness, the coo sum_duplicates docstring says

This is an *in place* operation

and the implementation contains the line

self.row = self.row[order]

which indicates in-place modification at one level by changing attributes of self, but not at a deeper level because the ndarray operation itself is not in-place.
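The distinction is easy to demonstrate with a bare ndarray, outside scipy.sparse entirely:

```python
import numpy as np

row = np.array([3, 1, 2])
external = row                      # an outside reference to the same buffer
order = np.argsort(row)

# Attribute-style rebinding: allocates a new array; the old buffer,
# still visible through `external`, is untouched.
row = row[order]
assert external.tolist() == [3, 1, 2]

# Genuinely in-place at the array level: the shared buffer is mutated,
# and every reference to it sees the change.
external[:] = external[np.argsort(external)]
assert external.tolist() == [1, 2, 3]
```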

@perimosocordiae (Member)

@argriffing Yep, makes sense. Is there a way to make this distinction more visible to users via documentation?

I suggest that we explicitly state that internal members of sparse matrices are private. That would remove the confusion from the user's point of view, and allow us to rely on invariants (like self.has_canonical_format) without wondering if the user has modified internal structures since the last call to sum_duplicates.

@jnothman (Contributor)

jnothman commented Sep 6, 2014

Yes it can be confusing because there are different levels of in-place.

Yes, I was thinking in terms of the problem at hand: copy=False operations. coo_matrix.sum_duplicates() could well be written with in-place modification of rows/cols/data, and may be in the future. This is still a problem we need to consider, as distinct from numpy, when offering copy=False as a default.

I suggest that we explicitly state that internal members of sparse matrices are private.

I think this simply cannot be true for user code that requires efficiency. One can instead provide the caveat that internal structures may be mutated by operations, so one should not keep external references to the matrix's members when calling ops on the matrix directly.

@argriffing (Contributor, Author)

One can instead provide the caveat that internal structures may be mutated by operations so one should not keep external references to the matrix's members when calling ops on the matrix directly.

This sounds reasonable. Would it be expected to go into this PR?

@jnothman (Contributor)

jnothman commented Sep 9, 2014

I don't think so. It's a more general issue.


@jnothman (Contributor)

jnothman commented Sep 9, 2014

The issue here is that copy shouldn't be used without being aware of those
sorts of things...


@memeplex

What else is needed for this to be merged? I will gladly help with that, if anything. Regarding in-place reshaping, I've argued for its complete removal in #3964.

@argriffing (Contributor, Author)

@memeplex I think the complication is the lack of a clear abstraction boundary for scipy sparse matrices. As @perimosocordiae has mentioned, it is 'leaky'.

For example, I think the conversion functions
https://github.com/scipy/scipy/blob/master/scipy/sparse/lil.py#L372
https://github.com/scipy/scipy/blob/master/scipy/sparse/csc.py#L126
https://github.com/scipy/scipy/blob/master/scipy/sparse/coo.py#L325
all have the secret requirement to preserve the distinction between zeros that are represented through the sparsity structure vs. zeros that are represented by a value of zero. The interface of csgraph depends on this distinction.
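The stored-zero distinction can be seen directly (assuming a scipy installation):

```python
import numpy as np
from scipy import sparse

# One explicitly stored zero at (0, 1) and one ordinary nonzero at (1, 0).
A = sparse.coo_matrix((np.array([0.0, 5.0]), ([0, 1], [1, 0])), shape=(2, 2))
B = A.tocsr()
assert B.nnz == 2        # the explicit zero survives the format conversion
B.eliminate_zeros()      # only an explicit request drops stored zeros
assert B.nnz == 1
```

csgraph relies on exactly this distinction: a stored zero can mean a zero-weight edge, while an unstored entry means no edge at all.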

Another 'leak' is that the internal structures of scipy.sparse matrices may share memory with other structures. Of course, this could be seen as a useful feature.

Yet another leak is the canonical vs. non-canonical internal representation of the sparse matrices, in terms of sorting or duplication of entries. This has been discussed in #4409 where I accidentally wrote something that was incorrect, which seems to have concluded the discussion.

I think that some of these leaks may be under-documented or under-tested, so that if you change the scipy.sparse code in a way that seems to agree with the documentation and passes the scipy unit tests, it still may break code written by a user. If this is true, then maybe the situation could be improved by adding more tests and docs as @jnothman has suggested in #4409 for csgraph.

@argriffing (Contributor, Author)

Here's an example of the current behavior of scipy.sparse. I'm not sure how much of it is relevant to sparse matrix reshaping, or how much is documented or tested, or which parts are considered bugs or incidental undefined behavior, or features.

>>> import numpy as np
>>> import scipy.sparse
>>> data = np.array([0, 0, 0, 1, 1, 1], dtype=int)
>>> row = [0, 0, 0, 0, 0, 0]
>>> col = [0, 1, 1, 2, 3, 3]
>>> A = scipy.sparse.coo_matrix((data, (row, col)), shape=(4, 4), dtype=int)
>>> np.may_share_memory(data, A.data)
False
>>> A = scipy.sparse.coo_matrix((data, (row, col)), shape=(4, 4))
>>> np.may_share_memory(data, A.data)
True
>>> A.nnz, A.has_canonical_format
(6, False)
>>> A = A.tocsr()
>>> A.nnz, A.has_canonical_format
(4, True)
>>> A.sum_duplicates()
>>> A.nnz, A.has_canonical_format
(4, True)
>>> A = scipy.sparse.csr_matrix(A)
>>> A.nnz, A.has_canonical_format
(4, True)
>>> A = scipy.sparse.csr_matrix(A.A)
>>> A.nnz, A.has_canonical_format
(2, True)

@zaxliu

zaxliu commented Sep 18, 2015

@argriffing I notice the current implementation does not have an order argument. This caused me some trouble, as I would like the reshape to work in 'Fortran' order rather than 'C' order.

The implementation is quite straightforward. I suggest we add this to the current implementation.
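An order-aware version of the coordinate mapping is indeed a small change; np.ravel_multi_index and np.unravel_index already take an order argument. A sketch, not scipy's actual code:

```python
import numpy as np

def reshape_coords(row, col, old_shape, new_shape, order='C'):
    """Map sparse (row, col) coordinates between shapes in C or Fortran order."""
    flat = np.ravel_multi_index((row, col), old_shape, order=order)
    return np.unravel_index(flat, new_shape, order=order)

# C order: entry (0, 2) of a (2, 3) matrix lands at (1, 0) in (3, 2)
r, c = reshape_coords([0], [2], (2, 3), (3, 2))
# Fortran order: entry (1, 0) of (2, 3) maps to (1, 0) in (3, 2)
rf, cf = reshape_coords([1], [0], (2, 3), (3, 2), order='F')
```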

Labels: enhancement, scipy.sparse