
Add lots of indexing features. #57

Merged
merged 10 commits into pydata:master from hameerabbasi:enhance-indexing on Jan 2, 2018

Conversation

@hameerabbasi
Collaborator

hameerabbasi commented Dec 31, 2017

Adds support for:

  • Returning scalars when slicing, e.g. x[1, 1] if x is 2-D is now scalar.
  • Returning arrays when the last index is an Ellipsis, e.g. x[1, 1, ...] if x is 2-D is now a ()-shaped COO object.
  • Slices with steps other than None and 1 (even negative steps).
  • Indexing with custom dtypes. So if x has a custom dtype, x['field'] is now supported.
  • Now throws IndexErrors consistent with NumPy for string and out-of-range indices.
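The dense NumPy behaviors these features mirror can be sketched as follows (a minimal illustration of the semantics being matched, not the sparse implementation itself):

```python
import numpy as np

x = np.arange(9).reshape(3, 3)

# Fully indexing a 2-D array yields a scalar element.
assert x[1, 1] == 4

# A trailing Ellipsis yields a ()-shaped (0-d) array instead.
assert x[1, 1, ...].shape == ()

# Slices with steps other than 1, including negative steps.
assert x[::-1, ::2].shape == (3, 2)

# Field access on a structured (custom) dtype.
y = np.zeros(3, dtype=np.dtype([('field', np.float64)]))
assert y['field'].shape == (3,)

# Out-of-range indices raise IndexError.
try:
    x[5, 0]
except IndexError:
    pass
```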

hameerabbasi added some commits Dec 31, 2017

@hameerabbasi


Collaborator

hameerabbasi commented Dec 31, 2017

@nils-werner, your input here would be valuable.

@hameerabbasi


Collaborator

hameerabbasi commented Dec 31, 2017

cc @mrocklin: since you wrote the original __getitem__, it'd be helpful if you took a look at this.

@mrocklin


Collaborator

mrocklin commented Dec 31, 2017


Nice work. Some questions/comments below.

coords.append(self.coords[i, idx[0]])
for i in range(1, np.ndim(data)):
    coords.append(idx[i])


@mrocklin

mrocklin Jan 1, 2018

Collaborator

I don't understand the purpose of this. Can you help explain?


@hameerabbasi

hameerabbasi Jan 1, 2018

Collaborator

Custom dtypes can have multiple dimensions inside a single scalar 'field'. For example, np.dtype([('grades', np.float64, (2, 2))]) will have a (2, 2) element "inside" each "scalar" element. self.data['grades'] gets a (nnz, 2, 2) array in this case, but the resulting COO array has to be self.shape + (2, 2). We first use np.where to get the nonzero indices. Then we append the matching parts of the original coordinates above, we get those from idx[0]. After this, we get the parts that come from the dimensions of the field itself, and append those to coords.

Finally, we flatten the data since we're done calculating all the coords of the resulting array.
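The bookkeeping described above can be sketched with a dense analogue; the names (data, idx, coords) are illustrative, not the library's actual internals:

```python
import numpy as np

dt = np.dtype([('grades', np.float64, (2, 2))])
x = np.zeros(4, dtype=dt)
x['grades'][1] = [[1., 0.], [0., 2.]]

data = x['grades']            # shape (4, 2, 2): field dims roll into the shape
idx = np.where(data)          # nonzero positions across all dimensions
coords = [idx[0]]             # coordinates in the original (here 1-D) array
for i in range(1, data.ndim):
    coords.append(idx[i])     # coordinates within the field's own (2, 2) dims
flat = data[idx]              # matching nonzero values, already flattened

assert flat.tolist() == [1.0, 2.0]
assert coords[0].tolist() == [1, 1]
```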


@mrocklin

mrocklin Jan 1, 2018

Collaborator

Right, I see. I had expected the dtype to continue being of shape (2, 2), but I see that NumPy rolls this into the normal array's shape:

In [1]: import numpy as np

In [2]: np.ones((5,), dtype=np.dtype([('grades', np.float64, (2, 2))]))
Out[2]: 
array([([[ 1.,  1.], [ 1.,  1.]],), ([[ 1.,  1.], [ 1.,  1.]],),
       ([[ 1.,  1.], [ 1.,  1.]],), ([[ 1.,  1.], [ 1.,  1.]],),
       ([[ 1.,  1.], [ 1.,  1.]],)],
      dtype=[('grades', '<f8', (2, 2))])

In [3]: x = np.ones((5,), dtype=np.dtype([('grades', np.float64, (2, 2))]))

In [4]: x.shape
Out[4]: (5,)

In [5]: x['grades'].shape
Out[5]: (5, 2, 2)

In [6]: x['grades'].dtype
Out[6]: dtype('float64')

coords = []
for i in range(self.ndim):
    coords.append(self.coords[i, idx[0]])


@mrocklin

mrocklin Jan 1, 2018

Collaborator

This is just self.coords[:, idx[0]]?
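For reference, the per-row loop and the single fancy-indexing expression are equivalent (coords here is a stand-in for a (ndim, nnz) coordinate array):

```python
import numpy as np

coords = np.arange(12).reshape(3, 4)   # hypothetical (ndim, nnz) coords array
idx0 = np.array([0, 2])

looped = np.stack([coords[i, idx0] for i in range(3)])
assert (looped == coords[:, idx0]).all()
```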

  return self
- mask = np.ones(self.nnz, dtype=bool)
+ mask = np.ones(self.nnz, dtype=np.bool)


@mrocklin

mrocklin Jan 1, 2018

Collaborator

This change doesn't do anything

In [5]: np.bool is bool
Out[5]: True

In [6]: np.bool_
Out[6]: numpy.bool_


@hameerabbasi

hameerabbasi Jan 1, 2018

Collaborator

My intent was to make it more performant, but it looks like NumPy treats them the same...


@mrocklin

mrocklin Jan 1, 2018

Collaborator

Not just NumPy: these are exactly the same object; the is operator tests object identity in Python.


@hameerabbasi

hameerabbasi Jan 1, 2018

Collaborator

I meant np.bool_ and bool. If you create arrays with both, they end up with the same dtype.
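This is easy to verify directly; arrays created from bool and from np.bool_ carry the same dtype:

```python
import numpy as np

a = np.ones(3, dtype=bool)
b = np.ones(3, dtype=np.bool_)
assert a.dtype == b.dtype == np.dtype('bool')
```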

step = ind.step if ind.step is not None else 1
if step > 0:
    start = ind.start if ind.start is not None else 0
    start = max(start, 0)


@mrocklin

mrocklin Jan 1, 2018

Collaborator

What if someone does something like x[-10:] ?
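For comparison, NumPy clamps a negative start that runs past the front of the array, and Python's built-in slice.indices shows the normalized form:

```python
import numpy as np

x = np.arange(5)
assert (x[-10:] == x).all()                      # clamps to the full array
assert slice(-10, None).indices(5) == (0, 5, 1)  # normalized (start, stop, step)
```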


@mrocklin

mrocklin Jan 1, 2018

Collaborator

It may be useful to separate out a normalize_slice function that reduces slice objects to a canonical form. It looks like there might be some useful functions in dask/array/slicing.py that might be helpful here.


@mrocklin

mrocklin Jan 1, 2018

Collaborator

The normalize_* and check_index functions there might be helpful in particular
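A minimal normalize_slice sketch can be built on Python's slice.indices (a hypothetical helper for illustration, not dask's actual implementation):

```python
def normalize_slice(s, length):
    """Reduce a slice to an equivalent canonical (start, stop, step) form."""
    start, stop, step = s.indices(length)
    return slice(start, stop, step)

assert normalize_slice(slice(-10, None), 5) == slice(0, 5, 1)
assert normalize_slice(slice(None, None, -1), 4) == slice(3, -1, -1)
```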


@hameerabbasi

hameerabbasi Jan 1, 2018

Collaborator

Is there anything similar for bool indexing?


@mrocklin

mrocklin Jan 1, 2018

Collaborator

On a brief look I'm not seeing much.

for i in range(1, np.ndim(data)):
    coords.append(idx[i])
return COO(coords, data.flatten(),


@mrocklin

mrocklin Jan 1, 2018

Collaborator

The data.flatten() call seems odd to me here. This is for struct dtypes?


@hameerabbasi

hameerabbasi Jan 1, 2018

Collaborator

You're right, it should be data[idx].flatten(). We have to filter out the zero entries.
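The distinction matters because data.flatten() keeps zeros while data[idx] keeps only the nonzero entries:

```python
import numpy as np

data = np.array([[0., 3.], [4., 0.]])
idx = np.where(data)
assert data.flatten().tolist() == [0., 3., 4., 0.]  # zeros survive
assert data[idx].tolist() == [3., 4.]               # zeros filtered out
```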

@hameerabbasi


Collaborator

hameerabbasi commented Jan 1, 2018

Whoops. Looks like some old build results need to be cleaned.

@hameerabbasi


Collaborator

hameerabbasi commented Jan 1, 2018

It turns out boolean slices were already handled in slicing.py, but the exceptions weren't, so I added those.

@hameerabbasi


Collaborator

hameerabbasi commented Jan 2, 2018

Merging at 20:00 German time if there are no other proposed changes.

@hameerabbasi hameerabbasi merged commit df4c0c0 into pydata:master Jan 2, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@hameerabbasi hameerabbasi deleted the hameerabbasi:enhance-indexing branch Jan 2, 2018
