Numba indexing speedups #128

Merged
merged 22 commits into pydata:master from hameerabbasi:numba-indexing on Mar 25, 2018

Conversation

3 participants
@hameerabbasi
Collaborator

hameerabbasi commented Mar 20, 2018

Significant speedups to indexing. Benchmark:

import sparse
x = sparse.random((1000, 1000, 1000), density=0.01)
%timeit x[5]

master:

46.9 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This branch:

174 µs ± 34.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that this speedup is partially because of Numba but also because our old indexing was completely suboptimal in big-O complexity.

The partial support that we had for advanced indexing is currently broken, but I plan to work on full-blown advanced indexing soon.

As far as I can tell, however, it too requires list[list[int]] on Numba's end. xref numba/numba#2560

Fixes #60

@codecov-io

codecov-io commented Mar 20, 2018

Codecov Report

Merging #128 into master will increase coverage by 0.52%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #128      +/-   ##
==========================================
+ Coverage   95.06%   95.59%   +0.52%     
==========================================
  Files          10       11       +1     
  Lines        1217     1227      +10     
==========================================
+ Hits         1157     1173      +16     
+ Misses         60       54       -6
Flag Coverage Δ
#python27 94.83% <100%> (+0.53%) ⬆️
#python36 95.27% <100%> (+0.53%) ⬆️
Impacted Files Coverage Δ
sparse/slicing.py 99.2% <100%> (+2.7%) ⬆️
sparse/coo/core.py 92.98% <100%> (-0.73%) ⬇️
sparse/coo/indexing.py 100% <100%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f19f0a...bfb4f3b. Read the comment docs.

@hameerabbasi

Collaborator

hameerabbasi commented Mar 20, 2018

I will note, however, that this indexing is geared towards integer indices or tuples of integers. Large slices WILL make this slow as molasses. Maybe a fallback if the slices are too large? See below. :-)

Edit: Turns out, it isn't as bad as I thought, even asymptotically. Worst case for the new algorithm: O(output-nnz log output-nnz + prod(slice-lengths)); worst case for the old algorithm: O(input-nnz), where nnz is that of the array being indexed. This should yield significant speedups if you're using it in e.g. Dask (splitting apart a big array).

I'm working on a hybrid approach that will be faster in most, if not all, cases. :-)

First benchmark:

%timeit x[:500, :500]

Old:

297 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

New:

147 ms ± 5.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Second benchmark:

%timeit x[:500, :500, :500]

Old:

285 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

New:

2.71 s ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@hameerabbasi

Collaborator

hameerabbasi commented Mar 21, 2018

Latest benchmarks:

In[2]: import sparse
In[3]: x = sparse.random((1000, 1000, 1000), density=0.01)
In[4]: %timeit x[5] # Compilation
189 µs ± 39.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[5]: %timeit x[5]
157 µs ± 637 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In[6]: %timeit x[:500]
76.2 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[7]: %timeit x[:500, :500]
144 ms ± 984 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[8]: %timeit x[:500, :500, :500]
126 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Faster than the old implementation in all cases. Falls back to simple matching based on a "guesstimate" of which approach will be faster.

@hameerabbasi

Collaborator

hameerabbasi commented Mar 21, 2018

cc @mrocklin Ready for review. :-)

@mrocklin

Collaborator

mrocklin commented Mar 21, 2018

Cool. I'm looking forward to it. I took a brief glance at it and noticed that it was fairly involved, including many new routines with complex logic. This will probably take some time to review properly and so I may not get to this today.

If you're inclined I suspect that you could make reviewing (and thus reading by future devs) easier by including small and simple doctests within each utility function that show what it does with a minimal example.

@hameerabbasi

Collaborator

hameerabbasi commented Mar 21, 2018

I had already partially done that, but I made it more complete.

if len(index) != 0 and all(not isinstance(ind, Iterable) and ind == slice(None) for ind in index):
if len(index) != 0 and all(not isinstance(ind, Iterable) and ind == slice(0, dim, 1)
for ind, dim in zip_longest(index, self.shape)):

@mrocklin

mrocklin Mar 21, 2018

Collaborator

Is this change fixing an error that previously existed? If so then is this tested somewhere?

@hameerabbasi

hameerabbasi Mar 21, 2018

Collaborator

Not really. For the purposes of this project, I decided that slice objects without any None in them were better than those with None in the start, stop or step.

This (and the relevant changes in slicing.py) are to accommodate those changes.
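The convention described here (no None in start, stop, or step) can be illustrated with the built-in slice.indices method, which normalizes a slice against a dimension length. This is a standalone sketch, not the project's code:

```python
# Normalize a slice against a dimension so start, stop and step are
# concrete integers rather than None (illustrative sketch).
dim = 10
s = slice(None)  # the whole axis, i.e. start/stop/step all None
start, stop, step = s.indices(dim)
print(start, stop, step)  # 0 10 1
assert slice(start, stop, step) == slice(0, dim, 1)
```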

@mrocklin

mrocklin Mar 24, 2018

Collaborator

For the slice check why

not isinstance(ind, Iterable) and ind == slice(0, dim, 1)

Rather than

ind == slice(0, dim, 1)

Separately, I'm curious about the use of zip_longest.

mask = _mask(self.coords, index, self.shape)
if isinstance(mask, slice):
n = len(range(mask.start, mask.stop, mask.step))

@mrocklin

mrocklin Mar 21, 2018

Collaborator

This may be somewhat inefficient in Python 2

@hameerabbasi

hameerabbasi Mar 21, 2018

Collaborator

In Python 2, we are using xrange (imported from compatibility.py as range). It doesn't construct a list here, if that's what you're worried about.
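For reference, counting the length of a slice-shaped mask this way is cheap: range in Python 3 (and xrange in Python 2) is lazy, and len() on it is computed arithmetically rather than by iteration. A quick illustration:

```python
# len() of a range is O(1), so this is instantaneous even for a
# billion-element slice-shaped mask.
m = slice(0, 10**9, 3)
n = len(range(m.start, m.stop, m.step))
print(n)  # 333333334
```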

@numba.jit(nopython=True)
def _get_mask_pairs_inner(starts_old, stops_old, c, idx): # pragma: no cover

@mrocklin

mrocklin Mar 21, 2018

Collaborator

It looks like there are a number of functions like this that are only used once. I'm somewhat inclined to just inline these directly into where they are called rather than construct a new function. I personally find it difficult to follow a new codebase if I have to jump between several functions. Giant long functions are also unpleasant of course, but I find them easier than those same lines of code split out to many places in the same file.

@mrocklin

mrocklin Mar 21, 2018

Collaborator

This seems to happen a number of times here, and I suspect it is partially what is causing me to be unable to review this PR efficiently. I wonder if the explanation could be handled as well with comments.

@hameerabbasi

hameerabbasi Mar 21, 2018

Collaborator

I did it that way at first, but apparently Numba has some unfixed bugs with type inference if I inline these into their parent functions. If I keep the logic unchanged and inline these functions, Numba compilation fails as it fails to infer the type of some of the lists.

@hameerabbasi

hameerabbasi Mar 21, 2018

Collaborator

If it helps, bottom-up reviewing may be better in this case than top-down.

@hameerabbasi

hameerabbasi Mar 24, 2018

Collaborator

I've merged at least two functions. I tried to remove another one, but that unfortunately tanks the performance.

@hameerabbasi

Collaborator

hameerabbasi commented Mar 21, 2018

A couple of replies.

@hameerabbasi

Collaborator

hameerabbasi commented Mar 21, 2018

I added a bit of comments to the main function.

@hameerabbasi hameerabbasi force-pushed the hameerabbasi:numba-indexing branch from f4f14fe to 279cc40 Mar 22, 2018

@hameerabbasi hameerabbasi force-pushed the hameerabbasi:numba-indexing branch from 279cc40 to 4fbfc82 Mar 22, 2018

@hameerabbasi

Collaborator

hameerabbasi commented Mar 22, 2018

Rebased onto infrastructure fix commits.

@hameerabbasi

Collaborator

hameerabbasi commented Mar 24, 2018

@mrocklin I'm done working on this. Do you have time to review in the near future?

@mrocklin

Sorry for the delay on this. There is enough new logic in this PR that it requires time to review. It's hard for me to justify spending time on this during working hours in the week. I have a bit more time on the weekend.

I'm still fairly confused about how this works. It's not clear to me what you mean at a high level by a mask or a pair. I wonder if it would make sense to move all of this logic to a separate slicing module and add some high-level documentation at the module level (just a docstring at the top of the file) about how things are organized. Ideally this would be sufficient to explain to someone in the future who is tracking down a bug how things work, so that they could quickly understand the skeleton of your code.

Sorry to be slow here and also somewhat dense. I'm not as good as I used to be at focusing on something deeply for a long time.


mask = np.zeros(len(coords), dtype=np.bool)
for item in idx:
mask |= _mask(coords, item, shape)
ind_ar[i] = [idx, idx + 1, 1]

@mrocklin

mrocklin Mar 24, 2018

Collaborator

Just checking, but this will never get other numpy arrays of indices or boolean values?

@hameerabbasi

hameerabbasi Mar 24, 2018

Collaborator

Right now advanced indexing isn't supported but it will be in the future. That is, technically, advanced indexing.

@mrocklin

mrocklin Mar 24, 2018

Collaborator

Is the organization of code as you've made it here easy to extend to advanced cases?

@hameerabbasi

hameerabbasi Mar 24, 2018

Collaborator

I'm fairly certain how it will be done, but I haven't mapped it out in code yet. However, it requires list[list[int]] on Numba's end.

stop = idx.stop if idx.stop is not None else shape
return (coords >= start) & (coords < stop) & \
(coords % step == start % step)
def _mask(coords, indices, shape):

@mrocklin

mrocklin Mar 24, 2018

Collaborator

Request small docstring here. Reviewing this I don't know what you mean by a mask. I can take a guess, but it's unlikely that someone who has to fix bugs in this code in the future will understand immediately.

@mrocklin

mrocklin Mar 24, 2018

Collaborator

Maybe it would make sense to move all of these functions to a separate file (slicing.py?) and add a module-level docstring describing how they work together?

@hameerabbasi

hameerabbasi Mar 24, 2018

Collaborator

This sounds like a good idea, but slicing.py isn't suitable as it's common to COO and DOK.

@mrocklin

mrocklin Mar 24, 2018

Collaborator

A different name then?
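The slice filter quoted a few comments up, (coords >= start) & (coords < stop) & (coords % step == start % step), can be checked on a toy coordinate array. This is an illustrative standalone sketch, not the project's function:

```python
import numpy as np

def slice_mask(coords, start, stop, step):
    # True where a coordinate lies on the slice's lattice: inside
    # [start, stop) and congruent to start modulo step.
    return (coords >= start) & (coords < stop) & (coords % step == start % step)

coords = np.arange(8)
mask = slice_mask(coords, 1, 7, 2)
print(coords[mask])  # [1 3 5]
```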

Returns
-------
mask : np.ndarray
The starts and stops in the mask.

@mrocklin

mrocklin Mar 24, 2018

Collaborator

and steps?

@hameerabbasi

hameerabbasi Mar 24, 2018

Collaborator

Steps turn out to always be 1 in this algorithm.

mask : np.ndarray
The starts and stops in the mask.
is_slice : bool
Whether or not the array represents a continuous slice.

@mrocklin

mrocklin Mar 24, 2018

Collaborator

Continuous slice meaning step == 1, or just "a slice" of any sort?

@hameerabbasi

hameerabbasi Mar 24, 2018

Collaborator

step == 1.

i = 0
while i < len(indices):
# Guesstimate whether working with pairs is more efficient or

@mrocklin

mrocklin Mar 24, 2018

Collaborator

What do you mean pairs here?

@hameerabbasi

hameerabbasi Mar 24, 2018

Collaborator

Start-stop pairs.
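Concretely, the pairs are start/stop offsets into the sorted coords array; each pair delimits a half-open run of candidate nonzeros. A tiny illustration with made-up numbers:

```python
# Each (start, stop) pair delimits a half-open run of positions in the
# sorted coords array that survive the indices processed so far.
starts = [2, 7]
stops = [5, 9]
runs = [list(range(a, b)) for a, b in zip(starts, stops)]
print(runs)  # [[2, 3, 4], [7, 8]]
```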

@@ -815,10 +809,24 @@ def test_gt():
(slice(2, 0, -1), slice(None, None), -1),
(slice(-2, None, None),),
(slice(-1, None, None), slice(-2, None, None)),

@mrocklin

mrocklin Mar 24, 2018

Collaborator

Does your new code add any new functionality? For example are there types of slices that we now support that we didn't before?

@hameerabbasi

hameerabbasi Mar 24, 2018

Collaborator

No, it doesn't at the moment.

@hameerabbasi

Collaborator

hameerabbasi commented Mar 24, 2018

I think your idea of moving this to a separate module and adding a module-level docstring is probably best. I'll request you to hold off reviewing further until I make that change (it'll make things easier). I'll hopefully do this tomorrow at some point.

The algorithm itself probably isn't that difficult, but I can see the need for terminology being clarified.

@mrocklin

Collaborator

mrocklin commented Mar 24, 2018

Thanks. In general I'm excited about the performance gains here. Another option would just be to merge and move ahead without review. Ideally in the future we'll have more people who have time to both write and review code.

@hameerabbasi

Collaborator

hameerabbasi commented Mar 24, 2018

I'll leave the choice of whether or not to review to you (you're the one doing it), but I see the need for clear code either way. I'll make the changes tomorrow, and if you still find it difficult to review (or can't/don't want to for other reasons, which is perfectly fine), then I'll go ahead and merge. :-)

@hameerabbasi

Collaborator

hameerabbasi commented Mar 24, 2018

I've clarified everything, hopefully. :-) Feel free to review and/or merge.

@mrocklin

This is a great read. Thank you. I'm only about halfway through and need to stop, but I just wanted to say that I really appreciate the care spent in making this a pleasure to go through. I'm pretty confident that a new reader can sit down and look through this from top to bottom and come away with a sense for how things work (or at least the top half :)). I've left some small comments, but presumably my thinking will refine as I finish things (which may not happen for a day or so).

@numba.jit(nopython=True)
def _compute_mask(coords, indices): # pragma: no cover
"""
Gets the mask for the coords given the indices in slice format.

@mrocklin

mrocklin Mar 24, 2018

Collaborator

We should define slice format somewhere. Either here or in _mask above (and then refer to that function here). I think you mean the (start, stop, step) triples.

@hameerabbasi

hameerabbasi Mar 24, 2018

Collaborator

That seems like a good idea.

@hameerabbasi

hameerabbasi Mar 25, 2018

Collaborator

It's already defined in the parameters section.
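As the review settles on, "slice format" means every index is encoded as a (start, stop, step) triple, so integers and slices share one uniform integer array (mirroring the ind_ar[i] = [idx, idx + 1, 1] line quoted earlier). A hypothetical encoder, for illustration only:

```python
import numpy as np

def to_slice_format(index, shape):
    # Encode each index as a (start, stop, step) row: an integer i
    # becomes (i, i + 1, 1); a slice is normalized against its dimension.
    ind_ar = np.empty((len(index), 3), dtype=np.int64)
    for i, (ind, dim) in enumerate(zip(index, shape)):
        if isinstance(ind, slice):
            ind_ar[i] = ind.indices(dim)
        else:
            ind_ar[i] = [ind, ind + 1, 1]
    return ind_ar

print(to_slice_format((5, slice(None, 500)), (1000, 1000)))
# [[  5   6   1]
#  [  0 500   1]]
```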

on which is faster.
Exploits the structure in sorted coords, which is that for a constant
value of coords[i - 1], coords[i] is sorted. It uses this sortedness

@mrocklin

mrocklin Mar 24, 2018

Collaborator

What does it mean for coords[i] to be sorted?

@hameerabbasi

hameerabbasi Mar 24, 2018

Collaborator

It means, concretely, that for all valid i and all values, coords[i, (coords[i-1] == v1) & (coords[i-2] == v2) & ...] is sorted (due to the lexicographic sorting order). I'll edit this into the docstring.
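This sortedness can be seen on a tiny coords array: within a fixed value of the first row, the second row is itself sorted, so each dimension can be narrowed with binary search. An illustrative sketch using np.searchsorted rather than the PR's Numba routines:

```python
import numpy as np

# coords for a 2-d COO array, sorted lexicographically by (row, col).
coords = np.array([[0, 0, 1, 1, 1, 2],
                   [3, 7, 0, 2, 5, 1]])

# Binary-search the span where the first coordinate equals 1 ...
start = np.searchsorted(coords[0], 1, side='left')
stop = np.searchsorted(coords[0], 1, side='right')
# ... and within that span the second coordinate is itself sorted,
# so it can be narrowed the same way.
print(coords[1, start:stop])  # [0 2 5]
```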

# Guesstimate whether working with pairs is more efficient or
# working with the mask directly
n_current_slices = _get_slice_len(indices[i]) * len(starts) + 2
if n_current_slices * np.log(n_current_slices / max(len(starts), 1)) > n_matches:

@mrocklin

mrocklin Mar 24, 2018

Collaborator

What is the motivation for these values? Should they be documented somewhere?

@hameerabbasi

hameerabbasi Mar 24, 2018

Collaborator

It's a binary search vs linear filter consideration. One is bigger sometimes than the other. But they are done in different domains, binary search to find start-end pairs and linear filter for all possible values across those pairs.
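Restated outside Numba, the quoted guesstimate compares an estimated n log n binary-search cost over the candidate slices against the cost of linearly filtering the current matches. The names mirror the snippet above; which branch the PR takes when the condition holds is not shown here, so this is only an illustration of the arithmetic:

```python
import math

def pair_cost_exceeds_matches(slice_len, n_starts, n_matches):
    # Estimated work for the pairs/binary-search strategy ...
    n_current_slices = slice_len * n_starts + 2
    pair_cost = n_current_slices * math.log(n_current_slices / max(n_starts, 1))
    # ... versus a linear filter over the currently matching entries.
    return pair_cost > n_matches

print(pair_cost_exceeds_matches(500, 1, 10))    # True: tiny match set
print(pair_cost_exceeds_matches(1, 1, 10**6))   # False: huge match set
```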

i = 0
while i < len(indices):
# Guesstimate whether working with pairs is more efficient or
# working with the mask directly

@mrocklin

mrocklin Mar 24, 2018

Collaborator

I'm curious, what is the performance cost to just using one strategy or the other? If it is not too great then we might consider simplifying to one.

@hameerabbasi

hameerabbasi Mar 24, 2018

Collaborator

Unfortunately both are needed. In some cases, one is better, in some cases the other, and the difference is usually large.

For example, simple integer slicing mask construction is close to O(log nnz) in one and O(nnz) in the other.

And for slices such as x[:500, :500, :500], the first approach boils down to O(500^3) worst case. However, if we work with pairs for the first two indices and switch to a mask later, it's a lot faster.

I made sure that the tests hit both cases, so we should be good. :-)

@mrocklin

Collaborator

mrocklin commented Mar 24, 2018

It seems like a lot of the logic here is based around performance considerations. Do we want to add benchmarks for these various cases?

@hameerabbasi

Collaborator

hameerabbasi commented Mar 24, 2018

It seems like a lot of the logic here is based around performance considerations. Do we want to add benchmarks for these various cases?

Seems like an excellent idea!

@hameerabbasi

Collaborator

hameerabbasi commented Mar 25, 2018

I noticed there were a few cases we supported but didn't actually test for. Those cases aren't new in this PR, but it was nice to add tests for them as well.

@mrocklin

Collaborator

mrocklin commented Mar 25, 2018

I'm fairly certain how it will be done, but I haven't mapped it out in code yet. However, it requires list[list[int]] on Numba's end.

How confident are we that this will be implemented soon and, if implemented, that it will be fast enough? If neither of those are true then do we have a backup plan?

@hameerabbasi

Collaborator

hameerabbasi commented Mar 25, 2018

How confident are we that this will be implemented soon and, if implemented, that it will be fast enough? If neither of those are true then do we have a backup plan?

numba/numba#2560 is the issue we're looking for. It has a milestone of 0.38 (RC), which is due out April 11. Unless it's pushed to the next release, we should be okay.

There may be a way with flat data structures as well, but it's slightly more complicated. May be more performant, though. I'm shooting for next weekend. It requires some Numba trickery in any case (I think that trickery should be possible in nopython mode and if so, it'll be fast, roughly equivalent to n integer indexes where n is the size of the advanced index).

If nothing works out we can simply loop over the relevant indices in Python.

Do you need it urgently in your work?
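The Python-loop fallback mentioned here can be sketched with NumPy standing in for the sparse array. The helper name is made up, and a real implementation would avoid the per-element loop; this only illustrates the idea of reducing an advanced index to n integer indexes plus a stack:

```python
import numpy as np

def fancy_index_fallback(x, idx):
    # Fall back to one integer index per element of the advanced index,
    # then stack the resulting slices back together.
    return np.stack([x[int(i)] for i in idx])

x = np.arange(12).reshape(4, 3)
out = fancy_index_fallback(x, [2, 0, 2])
assert (out == x[[2, 0, 2]]).all()
```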

@mrocklin

Collaborator

mrocklin commented Mar 25, 2018

No, I just want to make sure that we don't over-engineer into a solution that cannot be extended. I suspect that accepting lists and arrays of boolean and integer values will be important for XArray support, which is a nice use case to target.

@hameerabbasi

Collaborator

hameerabbasi commented Mar 25, 2018

Good thinking! But don't worry, I'm fairly sure that with integer indexing and some transposing and reshaping, we'll be good to go.

self.x[:50]
def time_index_slice2(self):
self.x[:50, :50]

@mrocklin

mrocklin Mar 25, 2018

Collaborator

Are different steps likely to make a difference here?

@hameerabbasi

hameerabbasi Mar 25, 2018

Collaborator

What matters most here is the length of the slice and the number of nonzeros within it.

-------
starts, stops : list[int]
The reduced starts and stops.
Examples

@mrocklin

mrocklin Mar 25, 2018

Collaborator

Style nit, add line breaks before section titles.

@hameerabbasi

hameerabbasi Mar 25, 2018

Collaborator

I'm not too used to RST; I believe that's required for clean rendering as well.

@mrocklin

Collaborator

mrocklin commented Mar 25, 2018

I still haven't gone through all of the logic here, but I've probably gone as far as I'm likely to. In general things seem fine to me. +1

Thanks for slogging through this @hameerabbasi . I hope you're enjoying yourself :)

@hameerabbasi hameerabbasi merged commit 222fb8c into pydata:master Mar 25, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@hameerabbasi hameerabbasi deleted the hameerabbasi:numba-indexing branch Mar 25, 2018
