Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added sparse nonzero functionality to min, max, argmin, argmax #11899

Closed
wants to merge 1 commit into from

Conversation

dloney
Copy link

@dloney dloney commented Apr 20, 2020

Added nonzero parsing functionality into sparse matrix min, max, argmin, and argmax functions.

What does this implement/fix?

This allows a user to selectively consider only the nonzero values in the sparse min, max, argmin, and argmax functions. A programmer will often use a sparse matrix when zeros do not matter. Previous to this change, the programmer will have needed to mathematically manipulate the sparse matrix so that these functions gave the correct result when considering nonzero values. By passing in the nonzero keyword, these functions are now able to consider only nonzero values within the matrix. The change defaults to the existing functionality and only activates these changes if the user activates the input keyword.

This issue exists in several of my existing programs. As a benchmark, this implementation improves performance between 1.15x and 1.35x compared to manipulating the sparse matrix to identify nonzero values.

Additional information

This broadly mirrors the same functionality as implemented in cupy, which should help the compatibility of the packages.

@tylerjereddy tylerjereddy added scipy.sparse enhancement A new feature or improvement labels Apr 20, 2020
Copy link
Member

@perimosocordiae perimosocordiae left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the proposed API makes sense, though I think we may want to post to the mailing list to solicit feedback once this PR is ready to go.

if out is not None:
raise ValueError(("Sparse matrices do not support "
"an 'out' parameter."))

validateaxis(axis)

if hasattr(self, 'has_canonical_format'):
if not self.has_canonical_format and nonzero:
self.eliminate_zeros()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be eliminating zeros here. There are use-cases that distinguish between explicit and implicit zeros (like the csgraph module), so we can't always assume that explicit zeros should be ignored. It's also an in-place update, which I wouldn't expect to happen when calling .max().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I introduced this mainly to have compatibility with the non-canonical forms. There could be a couple of different alternative implementations. The first is it could consider explicit zeros that are left in the matrix. The second could be a forced zero elimination if any values are zeros in the data. The main issue with the former is that it could give unexpected behavior if the user didn't call eliminate zeros prior to calling the function. That may be the best alternative especially if the flag is updated from 'nonzero' to 'explicit' along with the documentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for changing the name rather than eliminating zeros.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new keyword should be in place. I used "explicit", since it's considering values that are within the actual matrix.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should also mention that I reworked the tests to allow the non-canonical sparse matrices to be correctly validated in the 'explicit' case.

scipy/sparse/data.py Outdated Show resolved Hide resolved
return am
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the if-block always returns, I'd drop the else here and preserve the original indentation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you're thinking here. The if block was introduced to bypass the compare operation, otherwise the zero would be considered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I wasn't clear. What I meant was:

if nonzero:
    if m == 0:
        return []
    return am

if compare(m, zero):
    return mat.row...

...etc...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I understand now that the elses are redundant. I made those changes.

Added nonzero parsing funcitonality into sparse matrix methods

Added sparse nonzero functionality

Removed redundant else block

Changed working and removed enforced summation

Corrected formatting issues
@dloney
Copy link
Author

dloney commented Jul 9, 2020

@perimosocordiae Any additional thoughts about the proposed changes?

@perimosocordiae
Copy link
Member

I haven't had a chance to look this over again recently, but I'm still +1 on the general idea.

I'll be on vacation for the next week, so I probably won't have more concrete comments for a little while. In the mean time, could you send an email to the scipy-dev@python.org mailing list with a summary of the proposed changes (and a link to this PR)?

Comment on lines +316 to +317
considered. If a row/column is empty, a zero will be returned
to indicate it contains no nonzero values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a row/column is empty, a zero will be returned indicate it contains no nonzero values.

This is highly nonstandard behavior, no? max([]) is an exception in Python and csr_matrix((2,0)).max(axis=1) is an exception in SciPy. I do not think returning 0 from max/min of an empty list is considered correct behavior.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The row/column is empty only in the sense there are no explicit values contained in the row/column, so that in dense form it would only contain zeros. Returning a zero under those conditions is correct because only zeros are present in the row/column.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but when explicit=True, the implicit zeros are not considered in the min/max operation. It is min/maxing over an empty list of explicit elements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but when explicit=True, the implicit zeros are not considered in the min/max operation. It is min/maxing over an empty list of explicit elements.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would returning a NaN be more appropriate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, min/max would return None on an empty vector, but you can't do that with a NumPy array. NaN is a fine sentinel for float32 and float64, but there is no equivalent for integer types.

Another possibility is min([]) == inf and max([]) == -inf, for float dtypes, and 232-1 and -232 for integer dtypes. But that is pretty ugly.

This is admittedly a hard problem when the data structure does not allow for missing integer values.

@quantresearch1
Copy link

Hello, I am interested in picking up this PR if @dloney is not involved anymore.
I have merged main into this code and resolved the conflicts here https://github.com/quantresearch1/scipy/tree/sparse_nonzero_dev . Should I raise a PR ?

@tupui
Copy link
Member

tupui commented Jun 25, 2022

Thank you both for working on this. As said in #16467 everyone agreed to continue there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.sparse
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants