New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fancy indexing for CSR and CSC matrices. #2689

Merged
merged 4 commits into from Aug 13, 2013

Conversation

Projects
None yet
2 participants
@cowlicks
Copy link
Contributor

cowlicks commented Aug 5, 2013

This is based off of @Daniel-B-Smith's work (thanks!).

I also add support for indexing with sparse boolean matrices, but it does not have tests yet. I will add tests for this tomorrow. For now I just wanted to get this out because it finishes what Daniel started.

Here's a demo of what I mean by indexing with sparse boolean matrices.

>>> from scipy.sparse import csr_matrix
>>> from numpy.random import randint

>>> x = csr_matrix(randint(6, size=(4,4)))
>>> x.todense()

matrix([[3, 5, 0, 4],
        [4, 3, 4, 0],
        [0, 3, 2, 0],
        [2, 2, 0, 1]], dtype=int64)

>>> x[x > 3] = 1
>>> x.todense()

matrix([[3, 1, 0, 1],
        [1, 3, 1, 0],
        [0, 3, 2, 0],
        [2, 2, 0, 1]], dtype=int64)

@pv

pv reviewed Aug 5, 2013
View changes

scipy/sparse/compressed.py Outdated

def _set_boolean_spmatrix(self, index, x):
if not issubclass(index.dtype.type, (np.bool_, np.bool, bool)):
raise VauleError('index must be a boolean sparse matrix.')

This comment has been minimized.

@pv

pv Aug 5, 2013

Member

Typo ValueError, run pyflakes to check if there are more of those

@pv

pv reviewed Aug 5, 2013
View changes

scipy/sparse/compressed.py Outdated
if not issubclass(index.dtype.type, (np.bool_, np.bool, bool)):
raise VauleError('index must be a boolean sparse matrix.')

"""This probably has the worst roundoff."""

This comment has been minimized.

@pv

pv Aug 5, 2013

Member

Comments as #.

Indexing with a boolean matrix can also be implemented as self[index.nonzero()] which may be a better way than this.

This comment has been minimized.

@cowlicks

cowlicks Aug 5, 2013

Contributor

Oops, yeah I didn't mean to push that comment.

@pv

pv reviewed Aug 5, 2013
View changes

scipy/sparse/coo.py Outdated
self.shape = M.shape
elif M.size == 0:
self.shape = (1,1)

This comment has been minimized.

@pv

pv Aug 5, 2013

Member

Remove empty line here

@cowlicks

This comment has been minimized.

Copy link
Contributor

cowlicks commented Aug 6, 2013

I think it may be better to submit this without sparse boolean indexing, since it is taking a while to complete it. But all other fancy indexing is complete. I'll submit the sparse boolean indexing in my next PR.

@pv

This comment has been minimized.

Copy link
Member

pv commented Aug 6, 2013

@cowlicks: why does the boolean indexing path need to be guarded by isinstance(..., np.ndarray)? Since it only calls index.nonzero() shouldn't it work as-is also for sparse matrices?

@pv

pv reviewed Aug 6, 2013
View changes

scipy/sparse/compressed.py Outdated
newindx = start
def _set_one(self, row, col, val):
"""Set one value at a time."""
if not (isscalarlike(row) and isscalarlike(col)):

This comment has been minimized.

@pv

pv Aug 6, 2013

Member

This check is unnecessary --- this internal routine will always be called with scalar row and col?

@pv

pv reviewed Aug 6, 2013
View changes

scipy/sparse/csc.py Outdated

if isintlike(row) or isinstance(row, slice):
if (isintlike(row) or isinstance(row, slice) or isintlike(col) or

This comment has been minimized.

@pv

pv Aug 6, 2013

Member

This needs a source code comment on the logic going on there.

I think the idea is that if the index is slice or int, you need to swap the axes, but if it's fancy indexing, the output shape comes from the index arrays and doesn't need to be swapped.

Is the case sparr[3, array_2d] handled correctly?

This comment has been minimized.

@cowlicks

cowlicks Aug 6, 2013

Contributor

The way it looks now. The result through this path is sparse and the result through the else statement is always dense. Your case is handled, but it has a dense result. I'm not sure why it is done this way. But sparse results should always be returned.

@pv

pv reviewed Aug 6, 2013
View changes

scipy/sparse/sputils.py Outdated
def _unpack_index(self, index):
""" Parse index.
"""
if isinstance(index, tuple):

This comment has been minimized.

@pv

pv Aug 6, 2013

Member

The case of indexing with arr[bool_matrix] -> arr[bool_matrix.nonzero()] needs to be dealt with in this function, I think.

It's not correct to map this case to indexing with (bool_matrix, slice(None)), see http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean

According to that documentation, boolean indexing always uses nonzero() so I think it doesn't need isinstance(ndarray) guards? Does the documentation match reality here?

@pv

pv reviewed Aug 6, 2013
View changes

scipy/sparse/sputils.py Outdated
return i, j

def _index_to_arrays(self, i, j):
i, j = self._check_boolean(i, j)

This comment has been minimized.

@pv

pv Aug 6, 2013

Member

I think the boolean checking should be moved to _unpack_index, so that it is guaranteed the result is always either integers, slices, or arrays of integers. There's currently no case in the downstream code that deal specially with boolean indices, so doing this will simplify things (and reduce the number of public functions in IndexMixing).

This result type invariant should then be explained in the _unpack_index docstring.

@cowlicks

This comment has been minimized.

Copy link
Contributor

cowlicks commented Aug 6, 2013

@pv Yeah I noticed this last night. It almost works, and it is simpler than what I tried at first. But indexing with vector-like sparse matrices doesn't work yet. This is because spmatrix.nonzero() always returns a tuple of length 2, but length 1 is expected for indexing with vector-like things. There could be more problems further down the line but I'm trying to fix this atm.

@pv

This comment has been minimized.

Copy link
Member

pv commented Aug 6, 2013

I don't think indexing with vector-like sparse matrices should be special cased.

They are in reality always 2-dimensional, len(x.shape) == 2. I think they should be treated similarly as 2-dimensional arrays.

@cowlicks

This comment has been minimized.

Copy link
Contributor

cowlicks commented Aug 6, 2013

So basically... Only indexing with boolean sparse matrices that are the same shape as the indexee should be supported?

@pv

This comment has been minimized.

Copy link
Member

pv commented Aug 6, 2013

Yes, I think so, at least for this PR.

Note that Numpy itself doesn't enforce shape restriction --- it just does x[bx.nonzero()].

The decision whether it's a good idea to try to start guessing the user's intent needs some more thought. Numpy itself is not consistent with how it treats 2-d arrays:

>>> x = np.random.rand(20, 30)
>>> ix = np.array([[True,False,True]])
>>> x[ix]
array([ 0.79933457,  0.09600886])
>>> x[5:4,ix]
array([ 0.79933457,  0.09600886])

I think Scipy should raise an error for the second case, for now.

@pv

This comment has been minimized.

Copy link
Member

pv commented Aug 6, 2013

Anyway, other than that, looks good to me!

@cowlicks

This comment has been minimized.

Copy link
Contributor

cowlicks commented Aug 6, 2013

I'm clueless on this build error in py 2.6. But 2.7 and 3.3 passed.

$ sudo apt-get install -qq libatlas-dev libatlas-base-dev liblapack-dev gfortran
No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
The build has been terminated

edit: close & reopen fixed it.

@cowlicks cowlicks closed this Aug 7, 2013

@cowlicks cowlicks reopened this Aug 7, 2013

@cowlicks

View changes

scipy/sparse/csr.py Outdated
return self[row,:] * P
elif isinstance(row, slice):
# [1:2,??]
if ((isintlike(col) and

This comment has been minimized.

@cowlicks

cowlicks Aug 7, 2013

Contributor

EDIT:

What I said below was incorrect. Disregard.

This path is not necessary because we can do the same thing with extractor as _get_submatrix. But _get_submatrix is limited to slices with step = 1, however I left it in because, @Daniel-B-Smith did. I figured there must be some reason for this, like _get_submatrix being faster for step = 1 or something. Is this the case?

@pv

pv reviewed Aug 8, 2013
View changes

scipy/sparse/sputils.py Outdated
# First check for single boolean index.
from .base import spmatrix # This feels dirty but...
if (isinstance(index, (spmatrix, np.ndarray)) and
(index.shape == self.shape) and

This comment has been minimized.

@pv

pv Aug 8, 2013

Member

Numpy doesn't do this shape check for boolean indexing: you can index with any boolean matrix.
This probably should read index.ndim == 2 instead.

@pv

pv reviewed Aug 8, 2013
View changes

scipy/sparse/csc.py Outdated
or isintlike(col) or isinstance(col, slice)):
return self.T[col, row].T
# ndarrays or something else. Dense result.
elif isinstance(key, tuple):

This comment has been minimized.

@pv

pv Aug 8, 2013

Member

If you change this to else: and remove the following two code paths, do things work? AFAICS, _unpack_index already takes care of boolean indexing.

@pv

pv reviewed Aug 8, 2013
View changes

scipy/sparse/csc.py Outdated
return self.T[:,key].T # [[1,2]]
return self.T[:, key].T

def _bl_to_tl_sort(self, row, col):

This comment has been minimized.

@pv

pv Aug 8, 2013

Member

I think the transformation here can also be achieved by numpy.lexsort.
But it seems to me it is not needed.

This comment has been minimized.

@cowlicks

cowlicks Aug 8, 2013

Contributor

Oh wow, lexsort indeed does this. I spent more time than I'd like to admit figuring out a simple way to do that sort.

(╯°□°)╯︵ ┻━┻

However I think it is needed to give the correct order of the elements in the result for CSC's boolean indexing since it is transposed/converted to CSR to do the operation.

This comment has been minimized.

@pv

pv Aug 8, 2013

Member

If the order of indices matters, then also the elif isinstance(key, tuple): code path should be affected, right?
Boolean indexing reduces via .nonzero() to integer indexing.

What I'm trying to get at is that after the call to _unpack_index(key), the __getitem__ method should not refer to the key variable any more.

@pv

pv reviewed Aug 8, 2013
View changes

scipy/sparse/sputils.py Outdated
" shapes.")
if (isinstance(row, np.ndarray) and row.dtype.kind == 'b'):
# Check for equal shapes.
if (row.shape == self.shape):

This comment has been minimized.

@pv

pv Aug 8, 2013

Member

I think also this shape check should be removed to be consistent with Numpy.

I.e., instead of this check, just do row = _boolean_index_to_array(row), similarly as is done for col below.

@pv

pv reviewed Aug 8, 2013
View changes

scipy/sparse/sputils.py Outdated
col = self._boolean_index_to_array(col)[0]
return row, col

def _boolean_index_to_array(self, i):

This comment has been minimized.

@pv

pv Aug 8, 2013

Member

I think this routine should just do

if i.ndim > 1:
    raise IndexError('invalid index shape')
return i.nonzero()[0]

Numpy actually does not do the ndim check --- if the number of dimensions in some boolean index array matches the number of dimensions of the array itself, numpy simply ignores the other indexes. It's perhaps better to deviate from this behavior and raise an indexerror instead here.

@cowlicks

This comment has been minimized.

Copy link
Contributor

cowlicks commented Aug 9, 2013

This introduces a failing test due to a bug in CSC's .nonzero() where
the indices are not sorted C-style as described here:
http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean

I will try to fix the bug causing this after lunch.

@cowlicks

This comment has been minimized.

Copy link
Contributor

cowlicks commented Aug 10, 2013

I'm not sure if giving CSC its own .nonzero method was the right thing to do. Any other ideas?

@pv

This comment has been minimized.

Copy link
Member

pv commented Aug 10, 2013

I think if the CSC .nonzero method gives output different from Numpy ndarrays, that is a bug that should be fixed.


return row_out, col_out

def _bl_to_tl_sort(self, row, col):

This comment has been minimized.

@pv

pv Aug 10, 2013

Member

Unused routine


# Sort them to be in C-style order
ind = np.lexsort((col, row))
for i, j in enumerate(ind):

This comment has been minimized.

@pv

pv Aug 10, 2013

Member

Same as (?)

row_out = row[ind]
col_out = col[ind]

If so, can drop the .copy() lines above, too.

"""CSC can't use _cs_matrix's .nonzero method because it
returns the indices sorted for self transposed.
"""
# Get row and col indices

This comment has been minimized.

@pv

pv Aug 10, 2013

Member

Add a comment stating that this code is the same as in _cs_matrix.tocoo

@@ -181,8 +181,11 @@ def __init__(self, arg1, shape=None, dtype=None, copy=False):

if np.rank(M) != 2:
raise TypeError('expected rank <= 2 array or matrix')
elif M.size == 0:
self.shape = (1,1)

This comment has been minimized.

@pv

pv Aug 11, 2013

Member

This is probably acceptable work-around for now. The matrices should however be changed to accept also zero shapes...

EDIT: Ok, I changed my mind about this: it's better to raise an IndexError here.
The reason is that people will end up adapting their code to the bug, and this then will make it difficult to make it work properly when sparse matrices are fixed so that size-0 matrices is allowed...

Another possibility for now could be to return a LIL matrix, as those can be 0-sized.

if b != -1:
x = A[a, -1]
y = B[a, -1]
yield dec.skipif(

This comment has been minimized.

@pv

pv Aug 11, 2013

Member

These yield statements have no effect: the rest of the test is not written as a test generator.
Also, else: seems to be missing, and the comparison is the wrong way around?

@pv

This comment has been minimized.

Copy link
Member

pv commented Aug 11, 2013

Cleaned up the test, sent a PR. cowlicks#1

@cowlicks

This comment has been minimized.

Copy link
Contributor

cowlicks commented Aug 11, 2013

Looks like more problems related not having size 0 matrices.

When I change the check in base.py to allow size zero matrices, all current tests pass. But there are no tests for size 0 matrices. Every way I try to assign an index to them fails. I think supporting this would be a lot of work and out of the scope of this PR.

So I think I'm stuck tracking everything down that could give a size zero matrix, and adding checks so it will return a 1x1 empty matrix.

@cowlicks

This comment has been minimized.

Copy link
Contributor

cowlicks commented Aug 11, 2013

Also I think there would be a lot of non-obvious decisions to be made when squishing sparse matrix schemes into a 1D form. For example would COO just store things as (None, col, val) and (row, None, val) or a 2 element tuple thing? What would compressed formats do? How would the squished scheme show it was horizontal or vertical? I'm not sure there is a reasonable way to do this for many formats. I think we would just have to invent something special that would masquerade as the other formats whenever they are size zero.

@pv

This comment has been minimized.

Copy link
Member

pv commented Aug 12, 2013

I think returning size-1 result from a slice that should give size 0 is incorrect. It's better to have it raise an exception instead.

So I think you should remove all special-casing where 0-size result is converted to 1-size, and have the corresponding tests be knownfailures. The correct fix would then be to fix the system so that it allows for 0 x n and n x 0 sparse matrices.

@cowlicks

This comment has been minimized.

Copy link
Contributor

cowlicks commented Aug 12, 2013

@pv That sounds like the right thing to do. Would this case would be a ValueError instead of an IndexError?

@pv

This comment has been minimized.

Copy link
Member

pv commented Aug 12, 2013

I think either one will do. Hopefully we'll manage to fix this for 0.14 (or even 0.13 if someone is fast enough).

cowlicks added some commits Aug 12, 2013

WIP, TST: Add knownfailures to tests for size == 0 cases.
Also removed tests that previously checked the size 0 workaround.
@cowlicks

This comment has been minimized.

Copy link
Contributor

cowlicks commented Aug 12, 2013

Okay, I reverted the work around for size 0 matrices, and added the knownfail tests. I also added catches for cases when slicing would create a size 0 matrix.

pv added a commit that referenced this pull request Aug 13, 2013

Merge pull request #2689 from cowlicks/fancy-indexing
ENH: sparse: Fancy indexing for CSR and CSC matrices

Add initial fancy indexing support for CSR and CSC.

@pv pv merged commit c859927 into scipy:master Aug 13, 2013

1 check failed

default The Travis CI build could not complete due to an error
Details
@pv

This comment has been minimized.

Copy link
Member

pv commented Aug 13, 2013

Merged + some additional fixes in 76bce8b. Thanks @cowlicks & @Daniel-B-Smith

@pv

This comment has been minimized.

Copy link
Member

pv commented Aug 13, 2013

There's probably a lot to optimize in the CSR/CSC indexing business --- it's probably not the most efficient thing to have a Python loop that insert elements one by one... Correctness trumps speed, though.

@pv pv referenced this pull request Aug 13, 2013

Closed

CSR and CSC fancy indexing #2474

thouis pushed a commit to thouis/scipy that referenced this pull request Oct 30, 2013

thouis pushed a commit to thouis/scipy that referenced this pull request Oct 30, 2013

WIP: Refactored CSC __getitem__ to use changes in _unpack_index.
This introduces a failing test due to a bug in CSC's .nonzero() where
the indices are not sorted C-style as described here:
http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean

This addresses @pv comments:
scipy#2689 (diff)
scipy#2689 (diff)

thouis pushed a commit to thouis/scipy that referenced this pull request Oct 30, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment