-
-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAINT: sparse: non-canonical test cleanup and fixes #5394
Conversation
53f135e
to
1072f2e
Compare
The PEP8 issue I missed is now fixed and Travis is green. I'll admit that this is a big change, so I'm happy to split it up into smaller chunks if desired. |
Nice. No need to split I'd think - it's not that large. |
This looks OK to me, but it will cause trouble with code that is sneaky with the sparse matrix encoding details, for example breaking code that uses a csc graph representation that distinguishes between missing edges vs. edges with value zero. I disagree with the idea that scipy sparse matrices should support that kind of 'masking' use-case, but I'm not sure whether there was any consensus. Here's an inconsistency in the development branch which I guess is fixed by this PR (both csr and csc will behave like csr below). >>> from scipy.sparse import csr_matrix, csc_matrix
>>>
>>> a = sparse.csr_matrix(np.identity(4))
>>> for i, j in product(range(4), repeat=2): a[i, j] = (1 if i == j else 0)
...
>>> a.nonzero()
(array([0, 1, 2, 3], dtype=int32), array([0, 1, 2, 3], dtype=int32))
>>>
>>> x = sparse.csc_matrix(np.identity(4))
>>> for i, j in product(range(4), repeat=2): x[i, j] = (1 if i == j else 0)
...
>>> x.nonzero()
(array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], dtype=int32), array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)) |
I hope that @jnothman would look at this before it is merged. |
I can confirm that CSC now behaves like CSR in your example with this PR, because the CSC I'm inclined to call the old behavior a bug, but you're right that we should try to find users who were relying on it. |
1072f2e
to
f1ce87f
Compare
@@ master #5394 diff @@
======================================
Files 235 237 +2
Stmts 43379 43540 +161
Branches 8168 8164 -4
Methods 0 0
======================================
+ Hit 33765 33975 +210
- Partial 2596 2599 +3
+ Missed 7018 6966 -52
|
|
||
The is an *in place* operation | ||
""" | ||
self.sort_indices() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add if self.has_canonical_format:
handling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Based on the discussion in gh-5807, the deduplications could be done in-place |
The has_canonical_format logic is inherited from _cs_matrix
# Sort them to be in C-style order | ||
ind = np.lexsort((col, row)) | ||
ind = np.argsort(row, kind='mergesort') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this require self.has_sorted_indices
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, col
will always be in sorted order here, as it gets generated by _sparsetools.expandptr
, so we only need to stable-sort by row
to ensure the correct ordering.
One remaining question, otherwise LGTM |
MAINT: sparse: non-canonical test cleanup and fixes
Thanks, @pv! |
I noticed that there were a lot of tests for scipy.sparse marked as known failures, and a lot of confusion regarding what is actually supported vs. what is truly a bug (i.e., #3343, #4409, #4530 (comment)).
This PR attempts to clean that up a bit, by:
sum_duplicates()
for the BSR format. This means that all classes with asum_duplicates
method now actually sum their duplicates, which simplifies control flow and enables many useful operations for BSR matrices.knownfailure
tests which actually pass, or could be easily fixed. This necessitated adding a calls tosum_duplicates
in various methods, so we should be careful about the run-time impact of these changes.nnz
for non-canonical matrices into skips, rather than known failures. Hopefully this will make it clearer that this behavior is not a bug (it's just annoying).