
Numba indexing speedups #128

Merged
merged 22 commits into master on Mar 25, 2018

Conversation

hameerabbasi
Collaborator

@hameerabbasi hameerabbasi commented Mar 20, 2018

Significant speedups to indexing. Benchmark:

import sparse
x = sparse.random((1000, 1000, 1000), density=0.01)
%timeit x[5]

master:

46.9 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This branch:

174 µs ± 34.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that this speedup is partially because of Numba but also because our old indexing was completely suboptimal in big-O complexity.
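The big-O point can be illustrated in miniature (a sketch of the idea, not the PR's actual Numba routines): with the coords array sorted lexicographically, an integer index on the first axis reduces to two binary searches instead of a linear scan over every nonzero.

```python
import numpy as np

# Hypothetical sketch: integer indexing on sorted COO coordinates.
# With the first coordinate row sorted, x[5] reduces to two binary
# searches that bracket the matching run of nonzeros.
first_axis = np.array([0, 0, 2, 5, 5, 5, 9])  # sorted first-axis coords
start = np.searchsorted(first_axis, 5, side='left')
stop = np.searchsorted(first_axis, 5, side='right')
# rows start..stop hold the nonzeros of x[5]
print(start, stop)  # 3 6
```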

The partial support that we had for advanced indexing is currently broken, but I plan to work on full-blown advanced indexing soon.

As far as I can tell, however, it, too, requires list[list[int]] on Numba's end. xref numba/numba#2560

Fixes #60

@codecov-io

codecov-io commented Mar 20, 2018

Codecov Report

Merging #128 into master will increase coverage by 0.52%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #128      +/-   ##
==========================================
+ Coverage   95.06%   95.59%   +0.52%     
==========================================
  Files          10       11       +1     
  Lines        1217     1227      +10     
==========================================
+ Hits         1157     1173      +16     
+ Misses         60       54       -6
Flag       | Coverage Δ
#python27  | 94.83% <100%> (+0.53%) ⬆️
#python36  | 95.27% <100%> (+0.53%) ⬆️

Impacted Files         | Coverage Δ
sparse/slicing.py      | 99.2% <100%> (+2.7%) ⬆️
sparse/coo/core.py     | 92.98% <100%> (-0.73%) ⬇️
sparse/coo/indexing.py | 100% <100%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f19f0a...bfb4f3b. Read the comment docs.

@hameerabbasi
Collaborator Author

hameerabbasi commented Mar 20, 2018

I will note, however, that this indexing is geared towards an integer or a tuple of integers. Large slices WILL make this slow as molasses. Maybe a fallback if the slices are too large? See below. :-)

Edit: Turns out, it isn't as bad as I thought, even asymptotically. Worst case for this branch: O(output-nnz log output-nnz + prod(slice-lengths)); worst case for the old algorithm: O(input-nnz), where input-nnz is the nnz of the array being indexed. This should yield significant speedups if you're using it in e.g. Dask (splitting apart a big array).

I'm working on a hybrid approach that will be faster in most, if not all, cases. :-)

First benchmark:

%timeit x[:500, :500]

Old:

297 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

New:

147 ms ± 5.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Second benchmark:

%timeit x[:500, :500, :500]

Old:

285 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

New:

2.71 s ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@hameerabbasi
Collaborator Author

hameerabbasi commented Mar 21, 2018

Latest benchmarks:

In[2]: import sparse
In[3]: x = sparse.random((1000, 1000, 1000), density=0.01)
In[4]: %timeit x[5] # Compilation
189 µs ± 39.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[5]: %timeit x[5]
157 µs ± 637 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In[6]: %timeit x[:500]
76.2 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[7]: %timeit x[:500, :500]
144 ms ± 984 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[8]: %timeit x[:500, :500, :500]
126 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Faster than the old code in all cases. Falls back to simple matching based on a "guesstimate" of which approach will be faster.
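The "guesstimate" fallback can be sketched as a tiny cost model (the function name and the exact formula below are illustrative, not the PR's actual code): compare the estimated work of expanding start-stop pairs by binary search against filtering the boolean mask linearly, and dispatch to whichever looks cheaper.

```python
import numpy as np

def choose_strategy(slice_len, n_pairs, n_matches):
    # Hypothetical cost model: expanding every start-stop pair over the
    # slice costs roughly slice_len * n_pairs binary-search probes,
    # while filtering the mask directly costs about n_matches work.
    pair_cost = slice_len * n_pairs * max(np.log(max(n_pairs, 1)), 1.0)
    return 'pairs' if pair_cost < n_matches else 'mask'
```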

@hameerabbasi
Collaborator Author

cc @mrocklin Ready for review. :-)

@mrocklin
Collaborator

Cool. I'm looking forward to it. I took a brief glance at it and noticed that it was fairly involved, including many new routines with complex logic. This will probably take some time to review properly and so I may not get to this today.

If you're inclined I suspect that you could make reviewing (and thus reading by future devs) easier by including small and simple doctests within each utility function that show what it does with a minimal example.

@hameerabbasi
Collaborator Author

I had already partially done that, but I made it more complete.

if len(index) != 0 and all(not isinstance(ind, Iterable) and ind == slice(None) for ind in index):

if len(index) != 0 and all(not isinstance(ind, Iterable) and ind == slice(0, dim, 1)
for ind, dim in zip_longest(index, self.shape)):
Collaborator

Is this change fixing an error that previously existed? If so then is this tested somewhere?

Collaborator Author

Not really. For the purposes of this project, I decided that slice objects without any None in them were better than those with None in the start, stop or step.

This (and the relevant changes in slicing.py) are to accommodate those changes.

Collaborator

For the slice check why

not isinstance(ind, Iterable) and ind == slice(0, dim, 1)

Rather than

ind == slice(0, dim, 1)

Separately, I'm curious about the use of zip_longest.

mask = _mask(self.coords, index, self.shape)

if isinstance(mask, slice):
n = len(range(mask.start, mask.stop, mask.step))
Collaborator

This may be somewhat inefficient in Python 2

Collaborator Author

In Python 2, we are using xrange (imported from compatibility.py as range). It doesn't construct a list here, if that's what you're worried about.
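The point about xrange can be made concrete: the length of a range is computable in O(1) without materializing anything. An illustrative helper (not from the PR) that mirrors what len(range(...)) does on Python 3:

```python
def slice_len(start, stop, step):
    # O(1) length of range(start, stop, step) -- no list is ever built.
    if step > 0:
        return max(0, -(-(stop - start) // step))   # ceiling division
    return max(0, -((stop - start) // -step))
```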



@numba.jit(nopython=True)
def _get_mask_pairs_inner(starts_old, stops_old, c, idx): # pragma: no cover
Collaborator

It looks like there are a number of functions like this that are only used once. I'm somewhat inclined to just inline these directly into where they are called rather than construct a new function. I personally find it difficult to follow a new codebase if I have to jump between several functions. Giant long functions are also unpleasant of course, but I find them easier than those same lines of code split out to many places in the same file.

Collaborator

This seems to happen a number of times here, and I suspect it is partially what is causing me to be unable to review this PR efficiently. I wonder if the explanation could be handled as well with comments.

Collaborator Author

I did it that way at first, but apparently Numba has some unfixed bugs with type inference if I inline these into their parent functions. If I keep the logic unchanged and inline these functions, Numba compilation fails as it fails to infer the type of some of the lists.

Collaborator Author

If it helps, bottom-up reviewing may be better in this case than top-down.

Collaborator Author

I've merged at least two functions. I tried to remove another one, but that unfortunately tanks the performance.

@hameerabbasi
Collaborator Author

A couple of replies.

@hameerabbasi
Collaborator Author

I added a bit of comments to the main function.

@hameerabbasi
Collaborator Author

Rebased onto infrastructure fix commits.

@hameerabbasi
Collaborator Author

@mrocklin I'm done working on this. Do you have time to review in the near future?

Collaborator

@mrocklin mrocklin left a comment


Sorry for the delay on this. There is enough new logic in this PR that it requires time to review. It's hard for me to justify spending time on this during working hours in the week. I have a bit more time on the weekend.

I'm still fairly confused about how this works. It's not clear to me what you mean at a high level by a mask or a pair. I wonder if it would make sense to move all of this logic to a separate slicing module and add some high-level documentation at the module level (just a docstring at the top of the file) about how things are organized. Ideally this would be sufficient to explain to someone in the future who is tracking down a bug how things work, so that they could quickly understand the skeleton of your code.

Sorry to be slow here and also somewhat dense. I'm not as good as I used to be at focusing on something deeply for a long time.

if len(index) != 0 and all(not isinstance(ind, Iterable) and ind == slice(None) for ind in index):

if len(index) != 0 and all(not isinstance(ind, Iterable) and ind == slice(0, dim, 1)
for ind, dim in zip_longest(index, self.shape)):
Collaborator

For the slice check why

not isinstance(ind, Iterable) and ind == slice(0, dim, 1)

Rather than

ind == slice(0, dim, 1)

Separately, I'm curious about the use of zip_longest.

mask = np.zeros(len(coords), dtype=np.bool)
for item in idx:
mask |= _mask(coords, item, shape)
ind_ar[i] = [idx, idx + 1, 1]
Collaborator

Just checking, but this will never get other numpy arrays of indices or boolean values?

Collaborator Author

Right now advanced indexing isn't supported but it will be in the future. That is, technically, advanced indexing.

Collaborator

Is the organization of code as you've made it here easy to extend to advanced cases?

Collaborator Author

I'm fairly certain how it will be done, but I haven't mapped it out in code yet. However, it requires list[list[int]] on Numba's end.

stop = idx.stop if idx.stop is not None else shape
return (coords >= start) & (coords < stop) & \
(coords % step == start % step)
def _mask(coords, indices, shape):
Collaborator

Request small docstring here. Reviewing this I don't know what you mean by a mask. I can take a guess, but it's unlikely that someone who has to fix bugs in this code in the future will understand immediately.

Collaborator

Maybe it would make sense to move all of these functions to a separate file (slicing.py?) and add a module-level docstring describing how they work together?

Collaborator Author

This sounds like a good idea, but slicing.py isn't suitable as it's common to COO and DOK.

Collaborator

A different name then?
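As a side note on the quoted mask expression itself, it can be exercised standalone. A small demo (illustrative; positive step assumed) of how a (start, stop, step) triple selects matching coordinate values:

```python
import numpy as np

# Demo of the quoted filter: keep coordinates lying on start:stop:step.
coords = np.arange(10)
start, stop, step = 1, 9, 3
mask = (coords >= start) & (coords < stop) & (coords % step == start % step)
print(coords[mask])  # [1 4 7]
```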

Returns
-------
mask : np.ndarray
The starts and stops in the mask.
Collaborator

and steps?

Collaborator Author

Steps turn out to always be 1 in this algorithm.

mask : np.ndarray
The starts and stops in the mask.
is_slice : bool
Whether or not the array represents a continuous slice.
Collaborator

Continuous slice meaning step == 1, or just "a slice" of any sort?

Collaborator Author

step == 1.


i = 0
while i < len(indices):
# Guesstimate whether working with pairs is more efficient or
Collaborator

What do you mean pairs here?

Collaborator Author

Start-stop pairs.
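A minimal sketch of the start-stop-pair idea (illustrative, not the PR's Numba code): each pair delimits a run of rows that matched the indices handled so far, and the next integer index refines each run with two binary searches, since within a run the next coordinate level is sorted.

```python
import numpy as np

col = np.array([0, 1, 1, 4, 0, 2, 2, 2])  # next coordinate level
pairs = [(0, 4), (4, 8)]                   # runs surviving previous levels
new_pairs = []
for lo, hi in pairs:
    # Within each run the next level is sorted, so binary search works.
    s = lo + np.searchsorted(col[lo:hi], 2, side='left')
    e = lo + np.searchsorted(col[lo:hi], 2, side='right')
    if s < e:
        new_pairs.append((int(s), int(e)))
print(new_pairs)  # [(5, 8)]
```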

@@ -815,10 +809,24 @@ def test_gt():
(slice(2, 0, -1), slice(None, None), -1),
(slice(-2, None, None),),
(slice(-1, None, None), slice(-2, None, None)),
Collaborator

Does your new code add any new functionality? For example are there types of slices that we now support that we didn't before?

Collaborator Author

No, it doesn't at the moment.

@hameerabbasi
Collaborator Author

I think your idea of moving this to a separate module and adding a module-level docstring is probably best. I'll request you to hold off reviewing further until I make that change (it'll make things easier). I'll hopefully do this tomorrow at some point.

The algorithm itself probably isn't that difficult, but I can see the need for terminology being clarified.

@mrocklin
Collaborator

Thanks. In general I'm excited about the performance gains here. Another option would just be to merge and move ahead without review. Ideally in the future we'll have more people who have time to both write and review code.

@hameerabbasi
Collaborator Author

I'll leave the choice of whether or not to review to you (you're the one doing it), but I see the need for clear code either way. I'll make the changes tomorrow, and if you still find it difficult to review (or can't/don't want to for other reasons, which is perfectly fine), then I'll go ahead and merge. :-)

@hameerabbasi
Collaborator Author

I've clarified everything, hopefully. :-) Feel free to review and/or merge.

Collaborator

@mrocklin mrocklin left a comment


This is a great read. Thank you. I'm only about half way through and need to stop, but I just wanted to say that I really appreciate the care spent in making this a pleasure to go through. I'm pretty confident that a new reader can sit down and look through this from top to bottom and come away with a sense for how things work (or at least the top half :)). I've left some small comments, but presumably my thinking will refine as I finish things (which may not happen for a day or so).

@numba.jit(nopython=True)
def _compute_mask(coords, indices): # pragma: no cover
"""
Gets the mask for the coords given the indices in slice format.
Collaborator

We should define slice format somewhere. Either here or in _mask above (and then refer to that function here). I think you mean the (start, stop, step) triples.

Collaborator Author

That seems like a good idea.

Collaborator Author

It's already defined in the parameters section.

on which is faster.

Exploits the structure in sorted coords, which is that for a constant
value of coords[i - 1], coords[i] is sorted. It uses this sortedness
Collaborator

What does it mean for coords[i] to be sorted?

Collaborator Author

It means, concretely, that for every valid i and any fixed values v1, v2, …, the subarray coords[i, (coords[i - 1] == v1) & (coords[i - 2] == v2) & ...] is sorted, due to the lexicographical sort order of the coordinates. I'll edit this into the docstring.
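This sortedness property can be checked directly on a toy coords array (an illustrative example, not code from the PR):

```python
import numpy as np

# For lexicographically sorted COO coords, fixing the value of the
# previous coordinate row leaves the next row's coordinates sorted
# within that group.
coords = np.array([[0, 0, 0, 1, 1],
                   [0, 2, 5, 1, 3]])
ok = all(np.all(np.diff(coords[1, coords[0] == v]) >= 0)
         for v in np.unique(coords[0]))
print(ok)  # True
```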

# Guesstimate whether working with pairs is more efficient or
# working with the mask directly
n_current_slices = _get_slice_len(indices[i]) * len(starts) + 2
if n_current_slices * np.log(n_current_slices / max(len(starts), 1)) > n_matches:
Collaborator

What is the motivation for these values? Should they be documented somewhere?

Collaborator Author

It's a binary search vs. linear filter trade-off: sometimes one cost is bigger, sometimes the other. They also operate in different domains: binary search finds start-stop pairs, while the linear filter checks all possible values across those pairs.

i = 0
while i < len(indices):
# Guesstimate whether working with pairs is more efficient or
# working with the mask directly
Collaborator

I'm curious, what is the performance cost to just using one strategy or the other? If it is not too great then we might consider simplifying to one.

Collaborator Author

Unfortunately both are needed. In some cases, one is better, in some cases the other, and the difference is usually large.

For example, simple integer slicing mask construction is close to O(log nnz) in one and O(nnz) in the other.

And for slices such as x[:500, :500, :500], the first approach boils down to O(500^3) worst case. However, if we work with pairs for the first two indices and switch to a mask later, it's a lot faster.

I made sure that the tests hit both cases, so we should be good. :-)

@mrocklin
Collaborator

It seems like a lot of the logic here is based around performance considerations. Do we want to add benchmarks for these various cases?

@hameerabbasi
Collaborator Author

It seems like a lot of the logic here is based around performance considerations. Do we want to add benchmarks for these various cases?

Seems like an excellent idea!

@hameerabbasi
Collaborator Author

I noticed there were a few cases we supported but didn't actually test for. That isn't additional in this PR, but it was nice to add tests for those cases as well.

@mrocklin
Collaborator

I'm fairly certain how it will be done, but I haven't mapped it out in code yet. However, it requires list[list[int]] on Numba's end.

How confident are we that this will be implemented soon and, if implemented, that it will be fast enough? If neither of those are true then do we have a backup plan?

@hameerabbasi
Collaborator Author

How confident are we that this will be implemented soon and, if implemented, that it will be fast enough? If neither of those are true then do we have a backup plan?

numba/numba#2560 is the issue we're looking for. It has a milestone of 0.38 (RC), which is due out April 11. Unless it's pushed to the next release, we should be okay.

There may be a way with flat data structures as well, but it's slightly more complicated. May be more performant, though. I'm shooting for next weekend. It requires some Numba trickery in any case (I think that trickery should be possible in nopython mode and if so, it'll be fast, roughly equivalent to n integer indexes where n is the size of the advanced index).

If nothing works out we can simply loop over the relevant indices in Python.

Do you need it urgently in your work?

@mrocklin
Collaborator

No, I just want to make sure that we don't over-engineer into a solution that can not be extended. I suspect that accepting lists and arrays of boolean and integer values will be important for XArray support, which is a nice use case to target.

@hameerabbasi
Collaborator Author

Good thinking! But don't worry, I'm fairly sure that with integer indexing and some transposing and reshaping, we'll be good to go.
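The reduction of advanced indexing to integer indexing can be illustrated with NumPy (a sketch of the equivalence, not the eventual sparse implementation):

```python
import numpy as np

# Outer/advanced indexing along one axis equals stacking the results of
# repeated integer indexing -- the reduction hinted at above.
x = np.arange(12).reshape(3, 4)
rows = [2, 0]
via_loop = np.stack([x[i] for i in rows])  # one integer index per entry
assert np.array_equal(via_loop, x[rows])   # matches advanced indexing
```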

self.x[:50]

def time_index_slice2(self):
self.x[:50, :50]
Collaborator

Are different steps likely to make a difference here?

Collaborator Author

What matters most here is the length of the slice and the number of nonzeros within it.

-------
starts, stops : list[int]
The reduced starts and stops.
Examples
Collaborator

Style nit, add line breaks before section titles.

Collaborator Author

I'm not too used to RST; I believe that's required for clean rendering as well.

@mrocklin
Collaborator

I still haven't gone through all of the logic here, but I've probably gone as far as I'm likely to. In general things seem fine to me. +1

Thanks for slogging through this @hameerabbasi. I hope you're enjoying yourself :)
