@cowlicks
Contributor

@cowlicks cowlicks commented Aug 5, 2013

This is based on @Daniel-B-Smith's work (thanks!).

I have also added support for indexing with sparse boolean matrices, but it does not have tests yet. I will add tests for this tomorrow. For now I just wanted to get this out because it finishes what Daniel started.

Here's a demo of what I mean by indexing with sparse boolean matrices.

>>> from scipy.sparse import csr_matrix
>>> from numpy.random import randint

>>> x = csr_matrix(randint(6, size=(4,4)))
>>> x.todense()

matrix([[3, 5, 0, 4],
        [4, 3, 4, 0],
        [0, 3, 2, 0],
        [2, 2, 0, 1]], dtype=int64)

>>> x[x > 3] = 1
>>> x.todense()

matrix([[3, 1, 0, 1],
        [1, 3, 1, 0],
        [0, 3, 2, 0],
        [2, 2, 0, 1]], dtype=int64)

Member

Typo in ValueError; run pyflakes to check whether there are more of those.

@cowlicks
Contributor Author

cowlicks commented Aug 6, 2013

I think it may be better to submit this without sparse boolean indexing, since it is taking a while to complete it. But all other fancy indexing is complete. I'll submit the sparse boolean indexing in my next PR.

@pv
Member

pv commented Aug 6, 2013

@cowlicks: why does the boolean indexing path need to be guarded by isinstance(..., np.ndarray)? Since it only calls index.nonzero() shouldn't it work as-is also for sparse matrices?
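
The point about index.nonzero() can be illustrated with a small sketch. It assumes a boolean mask with the same shape as the data and reuses the matrix values from the demo above; the variable names are illustrative and not taken from the PR. The coordinates are compared as sets, so the check does not depend on the order in which either nonzero() reports them.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[3, 5, 0, 4],
                  [4, 3, 4, 0],
                  [0, 3, 2, 0],
                  [2, 2, 0, 1]])

mask = dense > 3                 # dense boolean mask, same shape as the data
sparse_mask = csr_matrix(mask)   # the same mask stored as a sparse matrix

# Both objects expose nonzero(), returning (row_indices, col_indices).
dense_rows, dense_cols = mask.nonzero()
sparse_rows, sparse_cols = sparse_mask.nonzero()

# Compare as sets of (row, col) pairs so ordering does not matter here.
dense_pairs = set(zip(dense_rows.tolist(), dense_cols.tolist()))
sparse_pairs = set(zip(sparse_rows.tolist(), sparse_cols.tolist()))
print(dense_pairs == sparse_pairs)
```

If the two sets agree, code that only consumes the result of nonzero() does not need to know whether the mask was dense or sparse.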

Member

This check is unnecessary --- this internal routine will always be called with scalar row and col?

@cowlicks
Contributor Author

cowlicks commented Aug 6, 2013

@pv Yeah I noticed this last night. It almost works, and it is simpler than what I tried at first. But indexing with vector-like sparse matrices doesn't work yet. This is because spmatrix.nonzero() always returns a tuple of length 2, but length 1 is expected for indexing with vector-like things. There could be more problems further down the line but I'm trying to fix this atm.
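
A minimal sketch of the tuple-length mismatch described above, assuming a 1-D NumPy mask on one side and a 1 x 3 sparse mask on the other (the variable names are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

flat_mask = np.array([True, False, True])               # 1-D ndarray
row_mask = csr_matrix(np.array([[True, False, True]]))  # sparse, always 2-D

flat_index = flat_mask.nonzero()  # one index array: positions only
row_index = row_mask.nonzero()    # two index arrays: rows and columns

print(len(flat_index), len(row_index))
print(row_mask.shape)
```

The sparse mask has no 1-D form, so its nonzero() cannot return a length-1 tuple the way a 1-D ndarray does.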

@pv
Member

pv commented Aug 6, 2013

I don't think indexing with vector-like sparse matrices should be special cased.

They are in reality always 2-dimensional, len(x.shape) == 2. I think they should be treated similarly as 2-dimensional arrays.

@cowlicks
Contributor Author

cowlicks commented Aug 6, 2013

So basically... Only indexing with boolean sparse matrices that are the same shape as the indexee should be supported?

@pv
Member

pv commented Aug 6, 2013

Yes, I think so, at least for this PR.

Note that Numpy itself doesn't enforce a shape restriction --- it just does x[bx.nonzero()].
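
For a mask with the same shape as the array, the x[bx.nonzero()] equivalence can be checked directly. This is only a sketch with illustrative names; it sticks to the same-shape case, since the mismatched-shape behavior discussed here may not hold in other NumPy versions.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
bx = (x % 5 == 0)             # boolean mask with the same shape as x

via_mask = x[bx]              # direct boolean indexing
via_nonzero = x[bx.nonzero()] # indexing with the coordinates of the True entries

print(via_mask, via_nonzero)
```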

The decision whether it's a good idea to try to start guessing the user's intent needs some more thought. Numpy itself is not consistent with how it treats 2-d arrays:

>>> x = np.random.rand(20, 30)
>>> ix = np.array([[True,False,True]])
>>> x[ix]
array([ 0.79933457,  0.09600886])
>>> x[5:4,ix]
array([ 0.79933457,  0.09600886])

I think Scipy should raise an error for the second case, for now.

@pv
Member

pv commented Aug 6, 2013

Anyway, other than that, looks good to me!

@cowlicks
Contributor Author

cowlicks commented Aug 6, 2013

I'm clueless on this build error in py 2.6. But 2.7 and 3.3 passed.

$ sudo apt-get install -qq libatlas-dev libatlas-base-dev liblapack-dev gfortran
No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
The build has been terminated

edit: close & reopen fixed it.

@cowlicks cowlicks closed this Aug 7, 2013
@cowlicks cowlicks reopened this Aug 7, 2013
Contributor Author

EDIT:

What I said below was incorrect. Disregard.

This path is not necessary because we can do the same thing with extractor as with _get_submatrix. _get_submatrix is limited to slices with step = 1, but I left it in because @Daniel-B-Smith did. I figured there must be some reason for this, like _get_submatrix being faster for step = 1. Is this the case?

@cowlicks
Contributor Author

cowlicks commented Aug 9, 2013

This introduces a failing test due to a bug in CSC's .nonzero() where
the indices are not sorted C-style as described here:
http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean

I will try to fix the bug causing this after lunch.
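
To make the ordering issue concrete, here is a NumPy-only sketch. It is not the fix from this PR: the column-major listing is simulated by calling nonzero() on the transpose, and np.lexsort is one possible way to bring such a listing back into C (row-major) order.

```python
import numpy as np

dense = np.array([[0, 7, 0],
                  [8, 0, 9],
                  [0, 6, 0]])

# NumPy reports nonzero coordinates in C order: sorted by row, then column.
rows, cols = dense.nonzero()

# Simulated column-major listing: nonzero() of the transpose walks the
# original matrix column by column, so the coordinates come out swapped.
f_cols, f_rows = dense.T.nonzero()

# lexsort uses the LAST key as the primary key: sort by row, then by column.
order = np.lexsort((f_cols, f_rows))
sorted_rows, sorted_cols = f_rows[order], f_cols[order]

print(list(zip(f_rows.tolist(), f_cols.tolist())))            # column-major
print(list(zip(sorted_rows.tolist(), sorted_cols.tolist())))  # row-major
```

The values picked out by the two listings are the same set, but x[bx] returns them in a different order unless the coordinates are sorted C-style.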

@cowlicks
Contributor Author

I'm not sure if giving CSC its own .nonzero method was the right thing to do. Any other ideas?

@pv
Member

pv commented Aug 10, 2013

I think if the CSC .nonzero method gives output different from Numpy ndarrays, that is a bug that should be fixed.

Member

Unused routine

Member

These yield statements have no effect: the rest of the test is not written as a test generator.
Also, else: seems to be missing, and the comparison is the wrong way around?

@pv
Member

pv commented Aug 11, 2013

Cleaned up the test, sent a PR. cowlicks#1

@cowlicks
Contributor Author

Looks like there are more problems related to not supporting size 0 matrices.

When I change the check in base.py to allow size zero matrices, all current tests pass. But there are no tests for size 0 matrices. Every way I try to assign an index to them fails. I think supporting this would be a lot of work and out of the scope of this PR.

So I think I'm stuck tracking everything down that could give a size zero matrix, and adding checks so it will return a 1x1 empty matrix.

@cowlicks
Contributor Author

Also I think there would be a lot of non-obvious decisions to be made when squishing sparse matrix schemes into a 1D form. For example would COO just store things as (None, col, val) and (row, None, val) or a 2 element tuple thing? What would compressed formats do? How would the squished scheme show it was horizontal or vertical? I'm not sure there is a reasonable way to do this for many formats. I think we would just have to invent something special that would masquerade as the other formats whenever they are size zero.

@pv
Member

pv commented Aug 12, 2013

I think returning a size-1 result from a slice that should give size 0 is incorrect. It's better to have it raise an exception instead.

So I think you should remove all special-casing where a 0-size result is converted to 1-size, and have the corresponding tests be knownfailures. The correct fix would then be to fix the system so that it allows for 0 x n and n x 0 sparse matrices.

@cowlicks
Contributor Author

@pv That sounds like the right thing to do. Would this case be a ValueError instead of an IndexError?

@pv
Member

pv commented Aug 12, 2013

I think either one will do. Hopefully we'll manage to fix this for 0.14 (or even 0.13 if someone is fast enough).

Also removed tests that previously checked the size 0 workaround.
@cowlicks
Contributor Author

Okay, I reverted the workaround for size 0 matrices, and added the knownfail tests. I also added catches for cases when slicing would create a size 0 matrix.
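
A rough sketch of what such a catch could look like. The helper names slice_length and require_nonempty are hypothetical and are not the code added in this PR; IndexError is used here only because either exception type was considered acceptable above.

```python
def slice_length(s, dim):
    """Number of elements slice `s` selects from an axis of length `dim`."""
    start, stop, step = s.indices(dim)
    return len(range(start, stop, step))


def require_nonempty(s, dim):
    """Hypothetical guard: refuse slices that would give a size-0 result."""
    n = slice_length(s, dim)
    if n == 0:
        raise IndexError("slice would produce a size-0 result, "
                         "which is not supported")
    return n


print(require_nonempty(slice(0, 10, 3), 20))  # selects indices 0, 3, 6, 9
try:
    require_nonempty(slice(5, 4), 20)         # empty: start is past stop
except IndexError as exc:
    print("rejected:", exc)
```

Raising here keeps the failure visible instead of silently returning a 1x1 matrix for a slice that should be empty.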

pv added a commit that referenced this pull request Aug 13, 2013
ENH: sparse: Fancy indexing for CSR and CSC matrices

Add initial fancy indexing support for CSR and CSC.
@pv pv merged commit c859927 into scipy:master Aug 13, 2013
@pv
Member

pv commented Aug 13, 2013

Merged + some additional fixes in 76bce8b. Thanks @cowlicks & @Daniel-B-Smith

@pv
Member

pv commented Aug 13, 2013

There's probably a lot to optimize in the CSR/CSC indexing business --- it's probably not the most efficient thing to have a Python loop that inserts elements one by one... Correctness trumps speed, though.
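
As an illustration of the trade-off, the sketch below builds the same matrix twice: once with a Python loop that sets one element at a time, and once in bulk from coordinate arrays. It is a generic example with made-up data, not the indexing code from this PR, and it makes no timing claims.

```python
import numpy as np
from scipy.sparse import coo_matrix, lil_matrix

rows = np.array([0, 1, 3])
cols = np.array([2, 0, 1])
vals = np.array([10, 20, 30])

# Element-by-element: one Python-level assignment per stored value.
looped = lil_matrix((4, 4), dtype=vals.dtype)
for r, c, v in zip(rows, cols, vals):
    looped[r, c] = v
looped = looped.tocsr()

# Bulk: hand all coordinates to the constructor at once, then convert.
bulk = coo_matrix((vals, (rows, cols)), shape=(4, 4)).tocsr()

print(np.array_equal(looped.toarray(), bulk.toarray()))
```

Both routes give the same matrix; the bulk route simply moves the per-element work out of the Python loop.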

@pv pv mentioned this pull request Aug 13, 2013
thouis pushed a commit to thouis/scipy that referenced this pull request Oct 30, 2013
thouis pushed a commit to thouis/scipy that referenced this pull request Oct 30, 2013
This introduces a failing test due to a bug in CSC's .nonzero() where
the indices are not sorted C-style as described here:
http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean

This addresses @pv comments:
scipy#2689 (diff)
scipy#2689 (diff)
thouis pushed a commit to thouis/scipy that referenced this pull request Oct 30, 2013