Skip to content

Conversation

dschult
Copy link
Contributor

@dschult dschult commented Feb 20, 2024

Adds support for indexing with 1D arrays.

Builds on #19833 (csr-1d) to provide indexing for 1D CSR arrays, though this code also works for DOK with minor changes after #19715 (dok-1d) gets merged.

The diff will be much easier to read after #19833 is merged. Look at only the last commit to see the diff for just this PR. I will rebase as needed, but I think the CSR PR will be merged soon and the diff here will become cleaner.

  • adds test_indexing1d.py
  • refactors IndexMixin._validate_indices to use an ndim-independent approach for both 1d and 2d.
  • the supported formats will need methods _get_int, _get_slice and _get_array. These are dispatched to after processing the index.
  • this tries to incorporate all improvements from the recently merged defect: sparse: 1d bool mask with wrong shape should raise IndexError #19957 (and doesn't change those tests).
  • the ndim-independent approach means some helper functions are no longer needed and/or have been inlined. The two remaining helper functions are now methods so they have access to self.ndim. They are _as_indices and _compatible_boolean_array.

@github-actions github-actions bot added scipy.sparse Meson Items related to the introduction of Meson as the new build system for SciPy enhancement A new feature or improvement labels Feb 20, 2024
@lucascolley lucascolley removed the Meson Items related to the introduction of Meson as the new build system for SciPy label Feb 20, 2024
dschult added 7 commits March 17, 2024 21:53
Move TestGetSet1d to indexing1d.py. Simplify helper functions.
… fix fancy

This should set us up for nD indexing when the time comes.
Note: ndindex could work for much of this -- but not for sparse array boolean
In the future we could maybe implement sparse integer array indexing.

We should think about whether 0D return should be sparse array or ndarray.
Currently ndarray.

This doesn't make reduction functions return sparse arrays yet -- only indexing
@dschult
Copy link
Contributor Author

dschult commented Mar 18, 2024

I've changed the tests to use np.testing assert_equal and assert_allclose in these new tests. I'll create a separate PR to change the new-but-merged-tests to use them too.

I've also added indexing support for np.newaxis/None (though it can't make 3D+). And that was an excuse to revamp some of the getitem/setitem code to be easier to move to nd. Quanstide's ndindex library would give almost everything we need for validate_indices, but it doesn't support boolean sparse arrays as indexes. Anyway, we can now index 2d and 1d sparse arrays.

I believe the test failures are not related to this PR.

Note: This PR does not change reduction operations to have them return 1D sparse arrays. That will be a separate PR.

Note: Another separate PR: We need to decide how to handle 0D. A[3,4] should return a 0D object as per array_api. But should it return a numpy or scipy.sparse 0D object? In general, an array type "should" return it's own type when indexed. But sparse is not a standard array type -- we don't expect the entire array api to be implemented. But the answer to this question will impact this same code.

@lucascolley lucascolley added the needs-release-note a maintainer should add a release note written by a reviewer/author to the wiki label Jun 3, 2024
@dschult
Copy link
Contributor Author

dschult commented Jun 4, 2024

I've made the suggested changes. Thanks for the note about Python 3.10+ being supported (3.9 dropped)

@dschult
Copy link
Contributor Author

dschult commented Jun 4, 2024

I updated as suggested (and realized why the else block suggestion was correct -- we don't need to convert the dtype at that location.

This should be ready. The failing test seems unrelated.

@stefanv you looked at this some a while back -- can you take another look?

@perimosocordiae
Copy link
Member

This looks ready to merge from my end. @stefanv feel free to push the button if you agree.

@dschult
Copy link
Contributor Author

dschult commented Jun 6, 2024

Thoughts for release notes:
In New features section for sparse:

  • sparse arrays now support indexing for 1D-arrays using ints, slices and arrays.
    Methods min, max, nanmin, nanmax return 1D COO arrays when axis
    is specified.

Getting ahead of myself, but feedback is welcome:
Once the migration guide document is available we can make a Highlights entry something like:

  • sparse arrays are now fully functional replacements for sparse matrices.
    Consider converting your code to use sparse arrays. See migration_to_sparray.rst
    for guidelines to this process and open an issue if you have questions. We have
    dedicated reviewer bandwidth to help with these conversion over the next two releases.

if spot.size:
return self.data[spot[0]]
return self.data.dtype.type(0)
raise IndexError(f'index ({idx}) out of range')
Copy link
Member

@stefanv stefanv Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit strange to have the error removed from the check so far; it's not necessary to change for this PR, but I think it'd be more intuitive to write:

if (idx < 0) or (idx >= self.shape[0]):
    raise IndexError(...)

Is the current if check correct? It has <= in both positions, but presumably idx should be strictly less than self.shape[0].

What about negative indices?

Copy link
Contributor Author

@dschult dschult Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your style comment -- and I agree that the <= should be < self.shape[0].

Negative indices are handled in the _validate_indices method.

Which reminded me that _validate_indices also checks the size of idx, so there is no need to check it here. That is why the incorrect <= check wasn't flagged by the existing test in test_indexing1d.py::test_getelement for index values just beyond the limits of shape.

Thanks for flagging this -- I have removed this "if/raise" statement as it is redundant with logic in _validate_indices.

return self._get_array(list(i_range))

def _get_array(self, idx):
idx = np.asarray(idx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call can blow up the range, is that something we are worried about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how the method _asindices handles the input idx. I can guess that we aren't worried about it blowing up because we don't support sparse non-boolean arrays as indices, and we handle boolean arrays (in _validate_indices) by converting them using ix.nonzero().

Comment on lines +48 to +52
if res.shape == () and new_shape != ():
if len(new_shape) == 1:
return self.__class__([res], shape=new_shape, dtype=self.dtype)
if len(new_shape) == 2:
return self.__class__([[res]], shape=new_shape, dtype=self.dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably fine, but I don't 100% get which case we're handling here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment indicating that this is to handle multiple np.newaxis indices, i.e. A[3, 4, None, None].

N = len(idx_range)
if N == 1 and x.size == 1:
self._set_int(idx_range[0], x.flat[0])
idx = np.arange(*idx.indices(self.shape[0]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does densify the range above; is that OK?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was basing this idiom on the 2D case from this module, and get_arrayXslice in _csr.py. I think the comment above isn't clear. We use a Python range to check for the special case when the slice has a single element -- and we call _set_int instead of _set_array. Part of the lack of clarity is that this shortcut should be finished with a return and it isn't. So, in fact the code in the PR was setting the element once in _set_int and then again in _set_array later on. That's my mistake. And the tests don't catch it because it is just doing extra work -- not creating a wrong answer.

I have added a comment and a return statement to make this more clear, and actually take the shortcut when we can.

def _compatible_boolean_index(idx, desired_ndim):
"""Check for boolean array or array-like. peek before asarray for array-like"""
# assume already an array if attr ndim exists: skip to bottom
if not hasattr(idx, 'ndim'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit risky; but not sure there's any single attribute that would tell us whether we are dealing with a numpy array or any array-like derivative. Maybe .__array__, but again not sure what external arrays would implement; we'd perhaps have to check with someone who works on the Array API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it feels risky -- but it is following the logic of the previous code (the removed line 384 below) which used ndim as an indicator of a compatible array type.

I will look for a more appropriate check for compatible array types, but I think that should be a separate PR.
I have added/improved comments explaining that this assumption is being made. Also for better clarity, I added return None at the end of this function to be more clear that returning None carries information here .

remove superfluous check of idx
add multple comments to clarify logic
add clarifying `return None` at end of _compatible_boolean_index
@stefanv stefanv merged commit 66ec333 into scipy:main Jun 20, 2024
@stefanv
Copy link
Member

stefanv commented Jun 20, 2024

Thank you, @dschult!

@dschult dschult deleted the index-1d branch July 9, 2024 18:03
@lucascolley lucascolley removed the needs-release-note a maintainer should add a release note written by a reviewer/author to the wiki label Jun 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.sparse
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants