
RFC: sparse: establish an official policy for duplicate elements #5807

Open · perimosocordiae opened this issue Feb 4, 2016 · 14 comments

@perimosocordiae (Member) commented Feb 4, 2016

Background
As a concession to performance and ease of construction, most of our sparse matrix formats admit duplicate entries. For example:

>>> foo = coo_matrix(([4, 5], ([0, 0], [1, 1])), shape=(2, 2))
>>> foo.nnz          # both duplicate entries are counted
2
>>> foo.A            # but they are summed when densifying: 4 + 5 == 9
array([[0, 9],
       [0, 0]])

Unfortunately, this causes a lot of problems for sparse matrix operations (#4409, #5394, #5806) and creates confusion about the true meaning of nnz, even ignoring the issue of explicit zeros (#3343).

Most sparse matrix formats have a method sum_duplicates() which operates in-place to canonicalize the internal storage, but it's unclear whether other methods are allowed to call this without first making a copy (see #5741 (comment)).
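
For reference, calling sum_duplicates() on the example above canonicalizes it in place (a minimal illustration of the current behavior):

>>> foo.sum_duplicates()
>>> foo.nnz
1
>>> foo.data
array([9])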

Proposal
I think that allowing duplicate entries in internal sparse matrix representations was an API mistake, but now that we allow it, it's difficult to disallow them without breaking lots of existing user code. Therefore, I propose that we make it a policy that:

  1. Duplicate entries are not preserved. That is, it's okay to canonicalize in-place.
  2. Whenever a method other than sum_duplicates() triggers in-place canonicalization, a SparseEfficiencyWarning is raised, to alert the user that something potentially unexpected is going on.
  3. The presence/lack of duplicate entries is remembered with a boolean flag, which we will document and encourage users to toggle if they manually modify a sparse matrix's internal members (see the sketch below).
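
A rough sketch of how such a flag would be used, modeled on the has_canonical_format attribute that the CSR/CSC formats already expose (hypothetical usage, not a committed API):

import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.eye(3))

# Suppose the user rewrites A.data / A.indices / A.indptr by hand and may
# have introduced duplicate column indices within a row; under (3) they
# would record that fact and re-canonicalize before relying on A:
A.has_canonical_format = False
A.sum_duplicates()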

If this gets traction here, I'll send it along to the Scipy-dev mailing list as well. Thoughts?

@rgommers (Member) commented Feb 14, 2016

+10 for documenting this kind of thing clearly.

(1) and (2) of the proposal seem fine to me, I'm not sure about (3). I was trying to find a place in our docs where we explain the structure of the matrix formats and if/how to manually modify data. But there's very little in the tutorial or refguide - the best description is probably this one: http://www.scipy-lectures.org/advanced/scipy_sparse/index.html

@pv (Member) commented Feb 15, 2016

This is blocking: (at least) gh-5394

Agreed that it would have been better not to permit duplicates. I understand they are useful at the matrix assembly phase, but I think this would be better handled by a specialized matrix format rather than forcing it everywhere. It is unclear to me whether the present support for duplicates has good uses; I guess most people either use lil/dok (which are also inefficient due to Python overhead) or accumulate coo-format arrays manually.

I would go with allowing canonicalization at any time (1).

@pv (Member) commented Feb 15, 2016

The other RFC to decide is gh-3343

@ewmoore (Member) commented Feb 15, 2016

Adding a flag that the user is supposed to toggle is just asking for subtly broken programs. +1 to canonicalization at any time (1).

@perimosocordiae (Member, Author) commented Feb 16, 2016

Yeah, it seems like (3) could be better handled by adding some documentation, rather than more complexity in the code.

It looks like (1) is agreed to be a good choice. We should figure out a good place to document it, then inform the mailing list.

I think we can implement (2) pretty cheaply by adding a warn=False keyword argument to sum_duplicates() and forcing internal callers to pass warn=True. It's not a particularly elegant solution, but I like it better than using inspect to peek at the caller's stack frame.
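
A minimal sketch of what that could look like (hypothetical signature and class, not current scipy code):

import warnings

from scipy.sparse import SparseEfficiencyWarning


class coo_sketch:
    """Illustrative stand-in for a sparse matrix class."""

    def sum_duplicates(self, warn=False):
        if warn and not getattr(self, "has_canonical_format", False):
            # Only internal callers pass warn=True, so direct user calls
            # stay silent.
            warnings.warn("in-place canonicalization of duplicate entries",
                          SparseEfficiencyWarning)
        # ... existing in-place summation of duplicate (i, j) entries ...
        self.has_canonical_format = True

    def tocsr(self):
        self.sum_duplicates(warn=True)
        # ... build and return the CSR representation ...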

@hpaulj commented May 6, 2016

I first used sparse matrices in MATLAB for finite element problems (in the 1990s). Coming from that background I regard the summation of duplicates as a useful, if not essential, feature. I just used it in a Stackoverflow answer, http://stackoverflow.com/a/37040831/901925

The only documentation that I'm aware of is on the coo_matrix page:

By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)

I haven't studied the compiled code much, so I don't know exactly where and how the summation is implemented. My observation is that the .row and .col attributes of a coo format might actually be the arrays provided as input in the (data, (i, j)) pattern (if they are arrays of the correct dtype). So at least in some cases, duplicates are present simply because the inputs have not been processed or checked.

I haven't observed whether duplicates are allowed with the other formats. bmat uses coo matrices as intermediates, but doesn't allow overlapping blocks. It is obvious how a lil could have duplicates. dok is a dictionary, so it can't have duplicates, can it?

I haven't taken time yet to look at the cited issues where this seems to be a problem. But I just wanted to express support for the existing practice. Documentation can always be improved without affecting legacy practices.
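
For context, the assembly pattern being defended here looks roughly like this (a small sketch, not taken from any particular finite element code):

import numpy as np
from scipy.sparse import coo_matrix

# Two "elements" each contribute to global entry (0, 0); COO stores both
# contributions as duplicates during assembly ...
rows = np.array([0, 0, 1])
cols = np.array([0, 0, 1])
vals = np.array([1.0, 2.0, 5.0])
K = coo_matrix((vals, (rows, cols)), shape=(2, 2))

# ... and the duplicates are summed on conversion to CSR:
K_csr = K.tocsr()
print(K_csr[0, 0])   # 3.0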

@jni (Contributor) commented Sep 28, 2016

Hi all,

I'd like to contribute a big -1 to this discussion. #5394 already sent me on a bug hunt and reduced efficiency in my own code (since I have to explicitly make a copy of a coo_matrix before converting to CSR). As mentioned in #6603, canonization in that PR occurs even when copy=True is passed to tocsr().

One of the motivating examples is #4409, which in my reading points to an application (csgraph) that absolutely cannot work when sparse is canonizing arrays left, right, and center: it relies on the existence of graph edges with weight 0 (as opposed to non-edges).

Another, #5806, handled the non-canonical case cleanly.

I can't count right now the number of places I use the internal structure of COO and CSR. It's certainly used within scikit-image. In my own code I've implemented a CSR matrix that can get expanded with new rows (here). I'd expect arbitrary canonization to wreak havoc with it, along with any number of user subclasses.

My suggestion is not to mess with users' objects unless explicitly allowed by the user. For example, you could add canonize=False to all the places where canonization would be useful. If False, make a copy of the original array if necessary. This mirrors the common pattern of having an output=input kwarg for in-place operations.
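
A rough sketch of that opt-in API (the canonize keyword and the wrapper below are hypothetical, not scipy's actual interface):

from scipy.sparse import coo_matrix

def to_csr(coo, canonize=False):
    # Hypothetical wrapper: never touch the caller's COO matrix unless the
    # caller explicitly opts in to in-place canonicalization.
    if not canonize:
        coo = coo.copy()      # the original keeps its duplicate entries
    coo.sum_duplicates()      # in place, on the copy or (if opted in) the original
    return coo.tocsr()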

In short, this isn't a small case of "we might break some users' code but it's ok since it's undocumented". This has massive downstream ramifications.

@jni (Contributor) commented Sep 28, 2016

To be clear, my argument is -1 for the policy being in-place canonization by default, not for establishing a policy. =P

@perimosocordiae (Member, Author) commented Sep 28, 2016

@jni: Note that "canonicalization" in this context refers only to duplicate entries, not explicit zeros. The discussion on explicit zeros (gh-3343) seems to have converged on preservation, specifically for the csgraph example you mentioned. If there are operations which don't preserve explicit zeros (other than, say, eliminate_zeros()), those should be reported as bugs.

That said, there is also a case for using duplicate entries in a sparse matrix to represent a multigraph. I personally think that this goes a little too far, and that a specialized data structure would be more appropriate. I'd like to hear what others think on the topic.

My ideal is to allow duplicate entries only during construction via __init__: once the spmatrix subclass is initialized, all of its internal members should be duplicate-free and unchanged by normal operations. This would allow us to get rid of the numerous self.has_canonical_format checks and calls to sum_duplicates() which currently litter the codebase, eliminating all the problems with in-place canonicalization. It does mean that the arguments passed to the __init__ method may end up getting copied in some cases, but judicious use of kwargs should allow the user to say "I know what I'm doing, just use these arrays directly" with the understanding that duplicate entries will cause incorrect results for some operations.
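
Roughly, the proposed constructor semantics might look like this (a hypothetical sketch; the canonicalize keyword does not exist today):

import numpy as np
from scipy.sparse import coo_matrix

def make_coo(data, ij, shape, canonicalize=True):
    # Hypothetical "canonicalize-at-init" behavior: duplicates are summed
    # once, at construction, so no later method has to check for them.
    A = coo_matrix((data, ij), shape=shape)
    if canonicalize:
        A.sum_duplicates()
    return A

A = make_coo([4, 5], ([0, 0], [1, 1]), shape=(2, 2))
print(A.nnz)   # 1 -- internal storage is duplicate-free from the start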

I'm interested to find out which use-cases would break under this "canonicalize-at-init" proposal. I can't think of any at the moment, but I'm sure some exist!

@pv (Member) commented Sep 28, 2016

perimosocordiae added a commit to perimosocordiae/scipy that referenced this issue Sep 28, 2016: "Duplicates are allowed in the constructor, but not handled anywhere else. See scipy gh-5807 for the initial discussion."

@perimosocordiae (Member, Author) commented Sep 28, 2016

@pv My attempt is linked above. If you pass copy=False, canonicalize=False, then no checking is done and the arrays are used unmodified.

The default as of now is copy=False, canonicalize=True, which is slightly counterintuitive because it does make a copy in the case where canonicalization is actually needed.

Perhaps something like copy=None could be the default, indicating that we'll try to avoid copying if possible, but don't guarantee it. In that case, a warning would be raised if the user asks for canonicalization with copy=False directly.
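
Sketching the copy=None idea (a hypothetical helper, not code from the PR):

import warnings
from scipy.sparse import SparseEfficiencyWarning

def needs_copy(copy, canonicalize, has_duplicates):
    # Hypothetical decision logic: copy=None means "avoid copying when
    # possible, but don't guarantee it".
    must_copy = canonicalize and has_duplicates
    if copy is None:
        return must_copy
    if copy is False and must_copy:
        warnings.warn("canonicalization requires a copy despite copy=False",
                      SparseEfficiencyWarning)
    return bool(copy) or must_copy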

perimosocordiae added further commits to perimosocordiae/scipy that referenced this issue Sep 29 and Sep 30, 2016, with the same commit message as above.

@jni (Contributor) commented Oct 2, 2016

@perimosocordiae

If you pass copy=False, canonicalize=False, then no checking is done and the arrays are used unmodified.

... And everything you do with that matrix from then onwards is potentially broken. This isn't a very reassuring statement.

The default as of now is copy=False, canonicalize=True

How so? In 0.18, if I create a COO matrix, thankfully, the row, column, and data arrays are used untouched.

I'm interested to find out which use-cases would break under this "canonicalize-at-init" proposal

  • I have a row-expandable CSR class that subclasses csr_matrix, here. It works by replacing data, indices, and indptr with properties that return views into a larger array (and this larger array gets reallocated if needed). I suspect that it depends quite strongly on the array internals not being messed with.
  • I wasn't thinking of it in terms of a multi-graph, but that is actually in a strong sense what I am doing here (also used in scikit-image here). Note that both cases depend on the COO matrix not being canonicalized at build time. In the first case, the call to copy() on this line was made necessary by canonicalization during tocsr() in 0.18, which, as noted in #6429 and #6603, is unexpected in a function that is intended to return a copy of the input array.
@pv (Member) commented Oct 2, 2016

It seems to me the only fully backward-compatible behavior is to not canonicalize at any stage. Canonicalization could be done at construction time with copy=True, or when constructing new arrays in arithmetic operations, etc. That is also, in principle, a backward-incompatible change, but it would be friendlier towards code that relies on the undefined behavior than modifying the internal state of the matrices.

@nchisholm commented Oct 6, 2016

I would like to contribute my use case here, since this caused some recent issues after upgrading to 0.18. I've got some finite element code where I was using coo_matrix primarily as storage during the matrix assembly process. This assembly happens multiple times, since I'm working with a non-linear problem that must be solved iteratively. After assembly at each iteration, I convert the coo_matrix to a csr_matrix, where duplicate entries are summed automatically. This sort of thing is encouraged by the documentation describing the intended usage of coo_matrix (as mentioned above):

  • COO is a fast format for constructing sparse matrices
  • Once a matrix has been constructed, convert to CSR or CSC format for fast arithmetic and matrix vector operations
  • By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)

The problem for me happens after the first iteration, because calling .tocsc() (among other methods) unexpectedly eliminates the duplicate entries in the coo_matrix. This did not happen in versions < 0.18. When the assembly process happens for the second iteration, the coo_matrix has been modified, and there is a problem because some of the entries are now gone. My workaround was simply to store the matrix data in my own custom class (since I no longer have the needed control over coo_matrix), and then convert MyCooMatrix -> coo_matrix -> csr_matrix at each iteration.

The documentation above gives the impression that coo_matrix should be used for assembly, and my particular use case seems to illustrate a potential problem with the current behavior. The documentation also suggests that coo_matrix is more for convenience than performance, so one should expect to trade some inefficiency for the added convenience of having duplicates allowed.
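
The pattern above, with a copy-based workaround, roughly (a sketch that assumes tocsr()/tocsc() in 0.18 canonicalize the COO matrix in place):

import numpy as np
from scipy.sparse import coo_matrix

# Assembled once, with duplicate contributions, and reused across
# nonlinear iterations.
rows = np.array([0, 0, 1])
cols = np.array([0, 0, 1])
vals = np.array([1.0, 2.0, 5.0])
K = coo_matrix((vals, (rows, cols)), shape=(2, 2))

for iteration in range(3):
    # Workaround: convert a copy, so K keeps its duplicate entries and can
    # be updated and reconverted on the next iteration.
    K_csr = K.copy().tocsr()
    # ... solve with K_csr and update K.data for the next iteration ...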
