
RFC: sparse: establish an official policy for duplicate elements #5807

@perimosocordiae

Background
As a concession to performance and ease of construction, most of our sparse matrix formats admit duplicate entries. For example:

>>> from scipy.sparse import coo_matrix
>>> foo = coo_matrix(([4,5], ([0,0],[1,1])), shape=(2,2))
>>> foo.nnz
2
>>> foo.A
array([[0, 9],
       [0, 0]])

Unfortunately, this causes problems for many sparse matrix operations (#4409, #5394, #5806) and obscures the true meaning of nnz, even setting aside the issue of explicit zeros (#3343).

Most sparse matrix formats have a method sum_duplicates() which operates in-place to canonicalize the internal storage, but it's unclear whether other methods are allowed to call this without first making a copy (see #5741 (comment)).
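To make the in-place behaviour concrete, here is a minimal sketch using the same matrix as above. It only relies on coo_matrix.sum_duplicates(), nnz and toarray(); the variable names are just for illustration.

```python
from scipy.sparse import coo_matrix

# Same matrix as in the example above: two stored entries at position (0, 1).
foo = coo_matrix(([4, 5], ([0, 0], [1, 1])), shape=(2, 2))

nnz_before = foo.nnz      # counts both duplicate entries
foo.sum_duplicates()      # canonicalizes foo's internal storage in place
nnz_after = foo.nnz       # duplicates have been merged into one entry

dense = foo.toarray()     # dense view is unchanged by canonicalization
print(nnz_before, nnz_after, dense[0, 1])
```

Note that foo itself is modified, so any caller holding a reference to it sees the merged storage. That is exactly the side effect the question about "without first making a copy" is concerned with.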

Proposal
I think that allowing duplicate entries in internal sparse matrix representations was an API mistake, but now that we allow it, it's difficult to disallow without breaking lots of existing user code. Therefore, I propose that we make it a policy that:

  1. Duplicate entries are not preserved. That is, it's okay to canonicalize in-place.
  2. Whenever a method other than sum_duplicates() triggers in-place canonicalization, a SparseEfficiencyWarning is issued, to alert the user that something potentially unexpected is going on.
  3. The presence/lack of duplicate entries is remembered with a boolean flag, which we will document and encourage users to toggle if they manually modify a sparse matrix's internal members.
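The three points above could fit together roughly as follows. This is only a sketch of the proposed policy, not existing SciPy behaviour: the TrackedCOO wrapper, its may_have_duplicates flag and the true_nnz() method are hypothetical names invented for illustration. Only coo_matrix, sum_duplicates() and SparseEfficiencyWarning are real SciPy names.

```python
import warnings

from scipy.sparse import SparseEfficiencyWarning, coo_matrix


class TrackedCOO:
    """Hypothetical wrapper illustrating the proposed policy."""

    def __init__(self, matrix):
        self.matrix = matrix
        # Point 3: a boolean flag remembers whether duplicates may be present.
        # Start conservatively; users who edit the internals would reset it.
        self.may_have_duplicates = True

    def sum_duplicates(self):
        # Explicit canonicalization: no warning is needed here.
        self.matrix.sum_duplicates()
        self.may_have_duplicates = False

    def true_nnz(self):
        # Hypothetical method that needs canonical storage to be correct.
        if self.may_have_duplicates:
            # Point 2: warn when some other method canonicalizes implicitly.
            warnings.warn(
                "summing duplicate entries in place",
                SparseEfficiencyWarning,
                stacklevel=2,
            )
            # Point 1: duplicates are not preserved, so no copy is made.
            self.sum_duplicates()
        return self.matrix.nnz


m = TrackedCOO(coo_matrix(([4, 5], ([0, 0], [1, 1])), shape=(2, 2)))
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    first = m.true_nnz()   # canonicalizes in place and warns
    second = m.true_nnz()  # flag is cleared, so no second warning
print(first, second, len(caught))
```

With the flag in place the warning fires only once per matrix, since later calls can skip canonicalization entirely.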

If this gets traction here, I'll send it along to the Scipy-dev mailing list as well. Thoughts?
