# Fancy indexing for CSR and CSC matrices. #2689

Merged
merged 4 commits into from Aug 13, 2013

## Conversation

Projects
None yet
2 participants
Contributor

### cowlicks commented Aug 5, 2013

 This is based off of @Daniel-B-Smith's work (thanks!). I also add support for indexing with sparse boolean matrices, but it does not have tests yet. I will add tests for this tomorrow. For now I just wanted to get this out because it finishes what Daniel started. Here's a demo of what I mean by indexing with sparse boolean matrices. ``````>>> from scipy.sparse import csr_matrix >>> from numpy.random import randint >>> x = csr_matrix(randint(6, size=(4,4))) >>> x.todense() matrix([[3, 5, 0, 4], [4, 3, 4, 0], [0, 3, 2, 0], [2, 2, 0, 1]], dtype=int64) >>> x[x > 3] = 1 >>> x.todense() matrix([[3, 1, 0, 1], [1, 3, 1, 0], [0, 3, 2, 0], [2, 2, 0, 1]], dtype=int64) ``````

### pv reviewed Aug 5, 2013

scipy/sparse/compressed.py Outdated
 def _set_boolean_spmatrix(self, index, x): if not issubclass(index.dtype.type, (np.bool_, np.bool, bool)): raise VauleError('index must be a boolean sparse matrix.')

#### pv Aug 5, 2013

Member

Typo `ValueError`, run `pyflakes` to check if there are more of those

### pv reviewed Aug 5, 2013

scipy/sparse/compressed.py Outdated
 if not issubclass(index.dtype.type, (np.bool_, np.bool, bool)): raise VauleError('index must be a boolean sparse matrix.') """This probably has the worst roundoff."""

#### pv Aug 5, 2013

Member

Comments as `#`.

Indexing with a boolean matrix can also be implemented as `self[index.nonzero()]` which may be a better way than this.

#### cowlicks Aug 5, 2013

Contributor

Oops, yeah I didn't mean to push that comment.

### pv reviewed Aug 5, 2013

scipy/sparse/coo.py Outdated
 self.shape = M.shape elif M.size == 0: self.shape = (1,1)

#### pv Aug 5, 2013

Member

Remove empty line here

Contributor

### cowlicks commented Aug 6, 2013

 I think it may be better to submit this without sparse boolean indexing, since it is taking a while to complete it. But all other fancy indexing is complete. I'll submit the sparse boolean indexing in my next PR.
Member

### pv commented Aug 6, 2013

 @cowlicks: why does the boolean indexing path need to be guarded by `isinstance(..., np.ndarray)`? Since it only calls `index.nonzero()` shouldn't it work as-is also for sparse matrices?

### pv reviewed Aug 6, 2013

scipy/sparse/compressed.py Outdated
 newindx = start def _set_one(self, row, col, val): """Set one value at a time.""" if not (isscalarlike(row) and isscalarlike(col)):

#### pv Aug 6, 2013

Member

This check is unnecessary --- this internal routine will always be called with scalar row and col?

### pv reviewed Aug 6, 2013

scipy/sparse/csc.py Outdated
 if isintlike(row) or isinstance(row, slice): if (isintlike(row) or isinstance(row, slice) or isintlike(col) or

#### pv Aug 6, 2013

Member

This needs a source code comment on the logic going on there.

I think the idea is that if the index is slice or int, you need to swap the axes, but if it's fancy indexing, the output shape comes from the index arrays and doesn't need to be swapped.

Is the case `sparr[3, array_2d]` handled correctly?

#### cowlicks Aug 6, 2013

Contributor

The way it looks now. The result through this path is sparse and the result through the else statement is always dense. Your case is handled, but it has a dense result. I'm not sure why it is done this way. But sparse results should always be returned.

### pv reviewed Aug 6, 2013

scipy/sparse/sputils.py Outdated
 def _unpack_index(self, index): """ Parse index. """ if isinstance(index, tuple):

#### pv Aug 6, 2013

Member

The case of indexing with `arr[bool_matrix] -> arr[bool_matrix.nonzero()]` needs to be dealt with in this function, I think.

It's not correct to map this case to indexing with `(bool_matrix, slice(None))`, see http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean

According to that documentation, boolean indexing always uses `nonzero()` so I think it doesn't need `isinstance(ndarray)` guards? Does the documentation match reality here?

### pv reviewed Aug 6, 2013

scipy/sparse/sputils.py Outdated
 return i, j def _index_to_arrays(self, i, j): i, j = self._check_boolean(i, j)

#### pv Aug 6, 2013

Member

I think the boolean checking should be moved to `_unpack_index`, so that it is guaranteed the result is always either integers, slices, or arrays of integers. There's currently no case in the downstream code that deal specially with boolean indices, so doing this will simplify things (and reduce the number of public functions in IndexMixing).

This result type invariant should then be explained in the `_unpack_index` docstring.

Contributor

### cowlicks commented Aug 6, 2013

 @pv Yeah I noticed this last night. It almost works, and it is simpler than what I tried at first. But indexing with vector-like sparse matrices doesn't work yet. This is because `spmatrix.nonzero()` always returns a tuple of length 2, but length 1 is expected for indexing with vector-like things. There could be more problems further down the line but I'm trying to fix this atm.
Member

### pv commented Aug 6, 2013

 I don't think indexing with vector-like sparse matrices should be special cased. They are in reality always 2-dimensional, `len(x.shape) == 2`. I think they should be treated similarly as 2-dimensional arrays.
Contributor

### cowlicks commented Aug 6, 2013

 So basically... Only indexing with boolean sparse matrices that are the same shape as the indexee should be supported?
Member

### pv commented Aug 6, 2013

 Yes, I think so, at least for this PR. Note that Numpy itself doesn't enforce shape restriction --- it just does `x[bx.nonzero()]`. The decision whether it's a good idea to try to start guessing the user's intent needs some more thought. Numpy itself is not consistent with how it treats 2-d arrays: ``````>>> x = np.random.rand(20, 30) >>> ix = np.array([[True,False,True]]) >>> x[ix] array([ 0.79933457, 0.09600886]) >>> x[5:4,ix] array([ 0.79933457, 0.09600886]) `````` I think Scipy should raise an error for the second case, for now.
Member

### pv commented Aug 6, 2013

 Anyway, other than that, looks good to me!
Contributor

### cowlicks commented Aug 6, 2013

 I'm clueless on this build error in py 2.6. But 2.7 and 3.3 passed. ``````\$ sudo apt-get install -qq libatlas-dev libatlas-base-dev liblapack-dev gfortran No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself. The build has been terminated `````` edit: close & reopen fixed it.

### cowlicks reviewed Aug 7, 2013

scipy/sparse/csr.py Outdated
 return self[row,:] * P elif isinstance(row, slice): # [1:2,??] if ((isintlike(col) and

Contributor

### EDIT:

What I said below was incorrect. Disregard.

This path is not necessary because we can do the same thing with `extractor` as `_get_submatrix`. But `_get_submatrix` is limited to slices with `step = 1`, however I left it in because, @Daniel-B-Smith did. I figured there must be some reason for this, like `_get_submatrix` being faster for `step = 1` or something. Is this the case?

### pv reviewed Aug 8, 2013

scipy/sparse/sputils.py Outdated
 # First check for single boolean index. from .base import spmatrix # This feels dirty but... if (isinstance(index, (spmatrix, np.ndarray)) and (index.shape == self.shape) and

#### pv Aug 8, 2013

Member

Numpy doesn't do this shape check for boolean indexing: you can index with any boolean matrix.
This probably should read `index.ndim == 2` instead.

### pv reviewed Aug 8, 2013

scipy/sparse/csc.py Outdated
 or isintlike(col) or isinstance(col, slice)): return self.T[col, row].T # ndarrays or something else. Dense result. elif isinstance(key, tuple):

#### pv Aug 8, 2013

Member

If you change this to `else:` and remove the following two code paths, do things work? AFAICS, `_unpack_index` already takes care of boolean indexing.

### pv reviewed Aug 8, 2013

scipy/sparse/csc.py Outdated
 return self.T[:,key].T # [[1,2]] return self.T[:, key].T def _bl_to_tl_sort(self, row, col):

#### pv Aug 8, 2013

Member

I think the transformation here can also be achieved by `numpy.lexsort`.
But it seems to me it is not needed.

#### cowlicks Aug 8, 2013

Contributor

Oh wow, lexsort indeed does this. I spent more time than I'd like to admit figuring out a simple way to do that sort.

(╯°□°）╯︵ ┻━┻

However I think it is needed to give the correct order of the elements in the result for CSC's boolean indexing since it is transposed/converted to CSR to do the operation.

#### pv Aug 8, 2013

Member

If the order of indices matters, then also the `elif isinstance(key, tuple):` code path should be affected, right?
Boolean indexing reduces via `.nonzero()` to integer indexing.

What I'm trying to get at is that after the call to `_unpack_index(key)`, the `__getitem__` method should not refer to the `key` variable any more.

### pv reviewed Aug 8, 2013

scipy/sparse/sputils.py Outdated
 " shapes.") if (isinstance(row, np.ndarray) and row.dtype.kind == 'b'): # Check for equal shapes. if (row.shape == self.shape):

#### pv Aug 8, 2013

Member

I think also this shape check should be removed to be consistent with Numpy.

I.e., instead of this check, just do `row = _boolean_index_to_array(row)`, similarly as is done for `col` below.

### pv reviewed Aug 8, 2013

scipy/sparse/sputils.py Outdated
 col = self._boolean_index_to_array(col)[0] return row, col def _boolean_index_to_array(self, i):

#### pv Aug 8, 2013

Member

I think this routine should just do

``````if i.ndim > 1:
raise IndexError('invalid index shape')
return i.nonzero()[0]
``````

Numpy actually does not do the ndim check --- if the number of dimensions in some boolean index array matches the number of dimensions of the array itself, numpy simply ignores the other indexes. It's perhaps better to deviate from this behavior and raise an indexerror instead here.

Contributor

### cowlicks commented Aug 9, 2013

 This introduces a failing test due to a bug in CSC's `.nonzero()` where the indices are not sorted C-style as described here: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean I will try to fix the bug causing this after lunch.
Contributor

### cowlicks commented Aug 10, 2013

 I'm not sure if giving CSC its own `.nonzero` method was the right thing to do. Any other ideas?
Member

### pv commented Aug 10, 2013

 I think if the CSC `.nonzero` method gives output different from Numpy ndarrays, that is a bug that should be fixed.

### pv reviewed Aug 10, 2013

 return row_out, col_out def _bl_to_tl_sort(self, row, col):

Member

Unused routine

### pv reviewed Aug 10, 2013

 # Sort them to be in C-style order ind = np.lexsort((col, row)) for i, j in enumerate(ind):

#### pv Aug 10, 2013

Member

Same as (?)

``````row_out = row[ind]
col_out = col[ind]
``````

If so, can drop the `.copy()` lines above, too.

### pv reviewed Aug 10, 2013

 """CSC can't use _cs_matrix's .nonzero method because it returns the indices sorted for self transposed. """ # Get row and col indices

#### pv Aug 10, 2013

Member

Add a comment stating that this code is the same as in `_cs_matrix.tocoo`

### pv reviewed Aug 11, 2013

 @@ -181,8 +181,11 @@ def __init__(self, arg1, shape=None, dtype=None, copy=False): if np.rank(M) != 2: raise TypeError('expected rank <= 2 array or matrix') elif M.size == 0: self.shape = (1,1)

#### pv Aug 11, 2013

Member

This is probably acceptable work-around for now. The matrices should however be changed to accept also zero shapes...

The reason is that people will end up adapting their code to the bug, and this then will make it difficult to make it work properly when sparse matrices are fixed so that size-0 matrices is allowed...

Another possibility for now could be to return a LIL matrix, as those can be 0-sized.

### pv reviewed Aug 11, 2013

 if b != -1: x = A[a, -1] y = B[a, -1] yield dec.skipif(

#### pv Aug 11, 2013

Member

These yield statements have no effect: the rest of the test is not written as a test generator.
Also, `else:` seems to be missing, and the comparison is the wrong way around?

Member

### pv commented Aug 11, 2013

 Cleaned up the test, sent a PR. cowlicks#1
``` Merge pull request #1 from pv/pr-2689 ```
`Pr 2689`
``` d9ae272 ```
Contributor

### cowlicks commented Aug 11, 2013

 Looks like more problems related not having size 0 matrices. When I change the check in `base.py` to allow size zero matrices, all current tests pass. But there are no tests for size 0 matrices. Every way I try to assign an index to them fails. I think supporting this would be a lot of work and out of the scope of this PR. So I think I'm stuck tracking everything down that could give a size zero matrix, and adding checks so it will return a 1x1 empty matrix.
Contributor

### cowlicks commented Aug 11, 2013

 Also I think there would be a lot of non-obvious decisions to be made when squishing sparse matrix schemes into a 1D form. For example would COO just store things as (None, col, val) and (row, None, val) or a 2 element tuple thing? What would compressed formats do? How would the squished scheme show it was horizontal or vertical? I'm not sure there is a reasonable way to do this for many formats. I think we would just have to invent something special that would masquerade as the other formats whenever they are size zero.
``` WIP: Add check for size zero slice result. ```
``` 0e523ee ```
Member

### pv commented Aug 12, 2013

 I think returning size-1 result from a slice that should give size 0 is incorrect. It's better to have it raise an exception instead. So I think you should remove all special-casing where 0-size result is converted to 1-size, and have the corresponding tests be knownfailures. The correct fix would then be to fix the system so that it allows for 0 x n and n x 0 sparse matrices.
Contributor

### cowlicks commented Aug 12, 2013

 @pv That sounds like the right thing to do. Would this case would be a `ValueError` instead of an `IndexError`?
Member

### pv commented Aug 12, 2013

 I think either one will do. Hopefully we'll manage to fix this for 0.14 (or even 0.13 if someone is fast enough).

### cowlicks added some commits Aug 12, 2013

``` WIP: Disallow indexes which return size 0 sparse matrices. ```
`Also add errors in this case.`
``` c1fee97 ```
``` WIP, TST: Add knownfailures to tests for size == 0 cases. ```
`Also removed tests that previously checked the size 0 workaround.`
``` c859927 ```
Contributor

### cowlicks commented Aug 12, 2013

 Okay, I reverted the work around for size 0 matrices, and added the `knownfail` tests. I also added catches for cases when slicing would create a size 0 matrix.

### pv added a commit that referenced this pull request Aug 13, 2013

``` Merge pull request #2689 from cowlicks/fancy-indexing ```
```ENH: sparse: Fancy indexing for CSR and CSC matrices

Add initial fancy indexing support for CSR and CSC.```
``` 76bce8b ```

### pv merged commit `c859927` into scipy:master Aug 13, 2013 1 check failed

#### 1 check failed

default The Travis CI build could not complete due to an error
Details
Member

### pv commented Aug 13, 2013

 Merged + some additional fixes in 76bce8b. Thanks @cowlicks & @Daniel-B-Smith
Member

### pv commented Aug 13, 2013

 There's probably a lot to optimize in the CSR/CSC indexing business --- it's probably not the most efficient thing to have a Python loop that insert elements one by one... Correctness trumps speed, though.

Closed

Closed

### thouis pushed a commit to thouis/scipy that referenced this pull request Oct 30, 2013

``` WIP: Refactor _unpack_index. ```
```Addresses comments by @pv:
scipy#2689 (diff)
scipy#2689 (diff)
scipy#2689 (diff)```
``` 62008ad ```

### thouis pushed a commit to thouis/scipy that referenced this pull request Oct 30, 2013

``` WIP: Refactored CSC __getitem__ to use changes in _unpack_index. ```
```This introduces a failing test due to a bug in CSC's .nonzero() where
the indices are not sorted C-style as described here:
http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean

scipy#2689 (diff)
scipy#2689 (diff)```
``` 80c406a ```

### thouis pushed a commit to thouis/scipy that referenced this pull request Oct 30, 2013

``` WIP: Remove unused method. Refactor sort in nonzero. ```
```Adresses @pv's comments:
scipy#2689 (diff)
scipy#2689 (diff)
scipy#2689 (diff)```
``` ab1733d ```

Closed