
Add lots of indexing features. #57

Merged
merged 10 commits into from Jan 2, 2018

Conversation

@hameerabbasi (Collaborator) commented Dec 31, 2017

Adds support for:

  • Returning scalars when indexing, e.g. x[1, 1] is now a scalar if x is 2-D.
  • Returning arrays when the last index is an Ellipsis, e.g. x[1, 1, ...] is now a ()-shaped COO object if x is 2-D.
  • Slices with steps other than None and 1 (even negative steps).
  • Indexing with custom dtypes: if x has a custom dtype, x['field'] is now supported.
  • Throwing IndexErrors consistent with NumPy for string and out-of-range indices.
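For reference, the target semantics mirror plain NumPy indexing; the sketch below uses dense NumPy arrays (not COO) purely to illustrate the behaviors listed above:

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# Integer indexing of every axis yields a scalar.
s = x[1, 1]                # 5, a 0-d scalar

# A trailing Ellipsis yields a ()-shaped array instead of a scalar.
e = x[1, 1, ...]           # shape ()

# Slices with arbitrary, even negative, steps.
r = x[::-2, 1]             # rows 2 and 0 of column 1

# Custom (record) dtypes: field access returns an ordinary array.
y = np.zeros(3, dtype=[('field', np.float64)])
f = y['field']             # float64 array of shape (3,)
```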

@hameerabbasi (Collaborator, Author)

@nils-werner, your input here would be valuable.

@hameerabbasi (Collaborator, Author)

cc @mrocklin Since you wrote the original __getitem__, it'd be helpful if you took a look at this.

@mrocklin (Collaborator) commented Dec 31, 2017 via email

@mrocklin (Collaborator) left a comment:

Nice work. Some questions/comments below.

sparse/core.py Outdated
coords.append(self.coords[i, idx[0]])

for i in range(1, np.ndim(data)):
coords.append(idx[i])
mrocklin (Collaborator):

I don't understand the purpose of this. Can you help explain?

hameerabbasi (Collaborator, Author):

Custom dtypes can have multiple dimensions inside a single scalar 'field'. For example, np.dtype([('grades', np.float64, (2, 2))]) will have a (2, 2) element "inside" each "scalar" element. self.data['grades'] gets a (nnz, 2, 2) array in this case, but the resulting COO array has to have shape self.shape + (2, 2). We first use np.where to get the nonzero indices. Then we append the matching parts of the original coordinates, which we get from idx[0]. After that, we append the coordinates that come from the dimensions of the field itself.

Finally, we flatten the data since we're done calculating all the coords of the resulting array.
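To make the bookkeeping concrete, here is a standalone sketch on made-up data; the names data, idx, and coords mirror the snippet under review, but none of this is the library code itself:

```python
import numpy as np

# A 1-D record array whose field carries a (2, 2) sub-shape.
x = np.zeros(4, dtype=[('grades', np.float64, (2, 2))])
x['grades'][1] = [[1., 2.], [3., 4.]]

data = x['grades']              # shape (4, 2, 2): array dims + field dims
idx = np.where(data)            # nonzero positions across all axes

# Stand-in for self.coords: the (ndim, nnz) coordinates of the 1-D array.
orig_coords = np.arange(4)[None, :]

coords = []
for i in range(orig_coords.shape[0]):
    coords.append(orig_coords[i, idx[0]])   # dims of the original array
for i in range(1, data.ndim):
    coords.append(idx[i])                   # dims contributed by the field

nonzero_data = data[idx]        # 1-D array of only the nonzero entries
```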

mrocklin (Collaborator):

Right, I see. I had expected the dtype to continue being of shape (2, 2), but I see that NumPy rolls this into the normal array's shape:

In [1]: import numpy as np

In [2]: np.ones((5,), dtype=np.dtype([('grades', np.float64, (2, 2))]))
Out[2]: 
array([([[ 1.,  1.], [ 1.,  1.]],), ([[ 1.,  1.], [ 1.,  1.]],),
       ([[ 1.,  1.], [ 1.,  1.]],), ([[ 1.,  1.], [ 1.,  1.]],),
       ([[ 1.,  1.], [ 1.,  1.]],)],
      dtype=[('grades', '<f8', (2, 2))])

In [3]: x = np.ones((5,), dtype=np.dtype([('grades', np.float64, (2, 2))]))

In [4]: x.shape
Out[4]: (5,)

In [5]: x['grades'].shape
Out[5]: (5, 2, 2)

In [6]: x['grades'].dtype
Out[6]: dtype('float64')

sparse/core.py Outdated
coords = []

for i in range(self.ndim):
coords.append(self.coords[i, idx[0]])
mrocklin (Collaborator):

This is just self.coords[:, idx[0]]?
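For what it's worth, the equivalence is easy to check directly (hypothetical coords array, purely illustrative):

```python
import numpy as np

# Hypothetical (ndim, nnz) coords array.
coords_arr = np.array([[0, 1, 2],
                       [3, 4, 5]])
idx0 = np.array([0, 2])

# The loop under review...
looped = np.stack([coords_arr[i, idx0] for i in range(coords_arr.shape[0])])

# ...collapses to a single fancy-indexing expression.
vectorized = coords_arr[:, idx0]
```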

sparse/core.py Outdated
return self
mask = np.ones(self.nnz, dtype=bool)
mask = np.ones(self.nnz, dtype=np.bool)
mrocklin (Collaborator):

This change doesn't do anything

In [5]: np.bool is bool
Out[5]: True

In [6]: np.bool_
Out[6]: numpy.bool_

hameerabbasi (Collaborator, Author):

My intent was to make it more performant, but it looks like NumPy treats them the same...

mrocklin (Collaborator):

Not just NumPy: these are exactly the same object; the is operator tests object identity in Python.

hameerabbasi (Collaborator, Author):

I meant np.bool_ and bool. If you create arrays with both, they end up with the same dtype.

sparse/core.py Outdated
step = ind.step if ind.step is not None else 1
if step > 0:
start = ind.start if ind.start is not None else 0
start = max(start, 0)
mrocklin (Collaborator):

What if someone does something like x[-10:] ?

mrocklin (Collaborator):

It may be useful to separate out a normalize_slice function that reduces slice objects to a canonical form. There are some functions in dask/array/slicing.py that might be helpful here.

mrocklin (Collaborator):

The normalize_* and check_index functions there might be helpful in particular
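A minimal sketch of what such a normalize_slice helper could look like, built on Python's own slice.indices (which already clips negative and out-of-range bounds); this is illustrative only, not the dask implementation:

```python
def normalize_slice(s, length):
    """Return an equivalent slice with canonical start/stop/step
    for an axis of the given length."""
    start, stop, step = s.indices(length)
    return slice(start, stop, step)

# x[-10:] on a length-5 axis clips to the whole axis.
a = normalize_slice(slice(-10, None), 5)

# Negative steps also get explicit bounds.
b = normalize_slice(slice(None, None, -2), 5)
```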

hameerabbasi (Collaborator, Author):

Is there anything similar for bool indexing?

mrocklin (Collaborator):

On a brief look I'm not seeing much.

sparse/core.py Outdated
for i in range(1, np.ndim(data)):
coords.append(idx[i])

return COO(coords, data.flatten(),
mrocklin (Collaborator):

The data.flatten() call seems odd to me here. This is for struct dtypes?

hameerabbasi (Collaborator, Author):

You're right, it should be data[idx].flatten(). We have to keep only the nonzero entries.
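On a toy array, the difference is easy to see (illustrative sketch, not library code):

```python
import numpy as np

data = np.array([[0., 5.],
                 [3., 0.]])
idx = np.where(data)

flat_all = data.flatten()   # keeps the zeros
flat_nz = data[idx]         # keeps only the nonzero entries,
                            # which is what the COO constructor needs
```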

@hameerabbasi (Collaborator, Author)

Whoops. Looks like some old build results need to be cleaned.

@hameerabbasi (Collaborator, Author)

It turns out boolean slices were already handled in slicing.py, but the exceptions weren't, so I added those.

@hameerabbasi (Collaborator, Author)

Merging at 20:00 German time if there are no other proposed changes.
