Add k-way merge adaptor. #97

bsteinb · 2016-02-12T20:20:05Z

Merges an arbitrary number of iterators in ascending order.

Uses std's BinaryHeap to decide which iterator to take from next. This seems quite heavyweight: two-way merge benchmarks take roughly ten times longer than with the dedicated two-way merge adaptor. Profiling identifies BinaryHeap's sift_up as the hot spot.
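
As a rough illustration of the approach described above, here is a minimal sketch of a heap-based k-way merge. It is not the code from this PR: `HeadTail` and `kmerge_sorted` are hypothetical names, and the function collects eagerly into a `Vec` instead of being a lazy adaptor. Each non-empty iterator is stored together with its next item, and the ordering is reversed so that std's max-heap hands out the smallest head first.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical helper: an iterator paired with its already-fetched next item.
struct HeadTail<I: Iterator> {
    head: I::Item,
    tail: I,
}

impl<I: Iterator> HeadTail<I> {
    /// Returns `None` if the iterator is already exhausted.
    fn new(mut it: I) -> Option<Self> {
        it.next().map(|head| HeadTail { head, tail: it })
    }
}

impl<I: Iterator> PartialEq for HeadTail<I> where I::Item: Ord {
    fn eq(&self, other: &Self) -> bool {
        self.head == other.head
    }
}

impl<I: Iterator> Eq for HeadTail<I> where I::Item: Ord {}

impl<I: Iterator> PartialOrd for HeadTail<I> where I::Item: Ord {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl<I: Iterator> Ord for HeadTail<I> where I::Item: Ord {
    /// Reversed on purpose: BinaryHeap is a max-heap, the merge needs the minimum.
    fn cmp(&self, other: &Self) -> Ordering {
        self.head.cmp(&other.head).reverse()
    }
}

/// Hypothetical eager merge of already-sorted iterators into one sorted Vec.
fn kmerge_sorted<I>(iters: Vec<I>) -> Vec<I::Item>
where
    I: Iterator,
    I::Item: Ord,
{
    let mut heap: BinaryHeap<HeadTail<I>> =
        iters.into_iter().filter_map(HeadTail::new).collect();
    let mut out = Vec::new();
    // Every step is a pop followed by a push, which is the cost discussed above.
    while let Some(HeadTail { head, tail }) = heap.pop() {
        out.push(head);
        if let Some(next) = HeadTail::new(tail) {
            heap.push(next);
        }
    }
    out
}
```

Calling `kmerge_sorted` with the iterators of `[1, 4, 7]`, `[2, 5]` and `[3, 6, 8]` yields the items in ascending order.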

Not completely sure about the interface, the double use of IntoIterator in the free-standing function in particular.

bluss · 2016-02-12T21:49:22Z

src/adaptors.rs

+    I::Item: PartialOrd
+{
+    fn partial_cmp(&self, other: &NonEmpty<I>) -> Option<Ordering> {
+        self.head.partial_cmp(&other.head).map(Ordering::reverse)


Instead of mapping reverse I'd just use other.head.partial_cmp(&self.head) here. Simple and less noise.

What's more important is that it should implement all of lt, le, gt, ge. Implementing the specific comparison operators should have a noticeable effect in benchmarks. BinaryHeap uses >, >=, <= it looks like (we just impl all).

A comment somewhere that this type implements comparisons reversed to be used in a min heap would be good.
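
The review suggestions above might look roughly like the following sketch. `MinFirst` is a hypothetical stand-in for the PR's `NonEmpty` type (which compares iterators by their head item). The arguments are swapped instead of mapping `Ordering::reverse`, and the four operators are written out so they do not go through `partial_cmp`.

```rust
use std::cmp::Ordering;

/// Hypothetical wrapper. Comparisons are deliberately reversed so that a
/// max-heap such as std's BinaryHeap behaves like a min-heap.
#[derive(Debug, PartialEq)]
struct MinFirst<T>(T);

impl<T: PartialOrd> PartialOrd for MinFirst<T> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        // Swapped operands instead of `.map(Ordering::reverse)`.
        other.0.partial_cmp(&self.0)
    }

    // Explicit operators, each forwarding to the inner type with swapped operands.
    fn lt(&self, other: &Self) -> bool {
        other.0 < self.0
    }
    fn le(&self, other: &Self) -> bool {
        other.0 <= self.0
    }
    fn gt(&self, other: &Self) -> bool {
        other.0 > self.0
    }
    fn ge(&self, other: &Self) -> bool {
        other.0 >= self.0
    }
}
```

With this wrapper, `MinFirst(1) > MinFirst(2)` holds, and incomparable values such as a float NaN still produce `None` from `partial_cmp`.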

bluss · 2016-02-12T21:52:35Z

Hey, this is interesting. We want this in itertools.

@frankmcsherry talked to me about the same algorithm, but here you are, with the first PR and that's completely OK. I'm just wondering if we can glean some tricks from his implementation in https://github.com/frankmcsherry/differential-dataflow/blob/master/src/iterators/merge.rs

The main trick is that it doesn't use the binary heap at all, so that it can be a bit more efficient. But we don't need to do that optimization now. I don't think there is anything in BinaryHeap that lets us do something similar.

I guess the quickcheck tests are still not working? I should fix that. This thing definitely deserves a quickcheck test.

bsteinb · 2016-02-12T22:16:09Z

Cool. Sorry about cutting in line here, I was extra careful to check for open pull requests and issues.

@frankmcsherry's implementation does look similar, but might manage to cut a few corners, here and there. Although it does implement half of a heap somewhere in there. I would be interested in seeing the difference in performance.

Ideally, I would like to be able to get around the Ord bound and offer a variant that merges based on a predicate, but can't see an elegant way to achieve that using BinaryHeap.
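
For comparison, a predicate-based merge is easy to write when BinaryHeap is left out entirely. The sketch below is an assumption-laden illustration, not something proposed in this PR: `kmerge_by_scan` is a hypothetical name, it picks the next source with a linear scan (O(k) per item instead of O(log k)), and it collects eagerly.

```rust
/// Hypothetical helper: merges `iters` using a caller-supplied predicate.
/// `less(a, b)` should return true when `a` must come before `b`.
fn kmerge_by_scan<I, F>(iters: Vec<I>, mut less: F) -> Vec<I::Item>
where
    I: Iterator,
    F: FnMut(&I::Item, &I::Item) -> bool,
{
    // Pair every iterator with its next item; empty iterators are dropped.
    let mut sources: Vec<(I::Item, I)> = iters
        .into_iter()
        .filter_map(|mut it| it.next().map(|head| (head, it)))
        .collect();
    let mut out = Vec::new();
    while !sources.is_empty() {
        // Linear scan for the source whose head should be emitted next.
        let mut best = 0;
        for i in 1..sources.len() {
            if less(&sources[i].0, &sources[best].0) {
                best = i;
            }
        }
        match sources[best].1.next() {
            // Emit the old head and install the next item in its place.
            Some(next) => out.push(std::mem::replace(&mut sources[best].0, next)),
            // Source exhausted: emit its last head and forget the source.
            None => out.push(sources.swap_remove(best).0),
        }
    }
    out
}
```

Because no `Ord` bound is involved, the same function merges descending inputs when called with `|a, b| a > b`. Note that `swap_remove` reorders the sources, so ties between equal heads are not resolved in a stable way.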

The quickcheck tests are indeed broken, quickcheck_macros is missing from the list of dependencies. I have it fixed locally and can submit another PR, if you'd like.

I'll get to working on your remarks now.


bluss · 2016-02-12T22:36:02Z

I'm fixing quickcheck, it's going to be without quickcheck_macros; syntax extensions break too often so that's annoying.

bsteinb · 2016-02-12T22:43:25Z

OK. I have tried to address your remarks in the last two commits.

bluss · 2016-02-12T22:45:41Z

Great. Did you see the idea about implementing lt, le, gt, ge? I think it makes a difference.

bsteinb · 2016-02-12T23:03:24Z

Nope, skipped right past it. I have added explicit implementations, but I do not observe a significant change in the benchmark results.

bluss · 2016-02-12T23:05:45Z

Oh 😞. I'm a bit surprised. It might be different for other element types. Thank you anyway.

bluss · 2016-02-12T23:51:38Z

I want to merge this. I don't have time today, but I'll get to it. I would in fact remove the kmerge method from the Itertools trait (this is not an operation on a single iterator, imo). I would also fix the bounds on Clone to not use NonEmpty at all.

bsteinb · 2016-02-13T10:37:23Z

Yeah, I did not like NonEmpty showing up in the public interface the way it did. Looks like I simply gave up too soon when the compiler would not stop complaining about missing impls. You motivated me to take another stab at it and lo and behold, NonEmpty is gone from the bounds and can now be made private.

pczarn · 2016-02-13T10:38:33Z

I had an idea for the binary heap in an unrelated algorithm. Since I has unknown size, maybe you shouldn't move iterators and instead add a binary heap of indices or pointers to I as an indirection. Further, perhaps you could implement and use decrease-key or increase-key operations instead of pop and push.

This PR is completely fine though.
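
One way to read the two suggestions above is sketched below, under several assumptions: the iterators stay in a `Vec` and the heap holds only `(head, index)` pairs, and the top entry is replaced in place through `BinaryHeap::peek_mut` from current versions of std, which restores the heap invariant when the guard is dropped. That is roughly an increase-key on the top element. `kmerge_indexed` is a hypothetical name and the function collects eagerly.

```rust
use std::cmp::Reverse;
use std::collections::binary_heap::PeekMut;
use std::collections::BinaryHeap;

/// Hypothetical eager merge that keeps the iterators in place and only
/// moves (head, index) pairs around inside the heap.
fn kmerge_indexed<I>(mut iters: Vec<I>) -> Vec<I::Item>
where
    I: Iterator,
    I::Item: Ord,
{
    let mut heap = BinaryHeap::with_capacity(iters.len());
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some(head) = it.next() {
            // `Reverse` turns std's max-heap into a min-heap.
            heap.push(Reverse((head, idx)));
        }
    }
    let mut out = Vec::new();
    while let Some(mut top) = heap.peek_mut() {
        let idx = (top.0).1;
        match iters[idx].next() {
            Some(next) => {
                // Overwrite the top entry; the heap is repaired once `top` is dropped.
                let Reverse((head, _)) =
                    std::mem::replace(&mut *top, Reverse((next, idx)));
                out.push(head);
            }
            None => {
                // This source is exhausted, so remove its entry for good.
                let Reverse((head, _)) = PeekMut::pop(top);
                out.push(head);
            }
        }
    }
    out
}
```

The index in each pair doubles as a tie-breaker, so equal items come out in the order of their source iterators.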

bsteinb · 2016-02-13T10:44:05Z

I disagree about kmerge not being an adaptor. It operates on a sequence of Iterators which can be modelled by an Iterator of Iterators. kmerge in this view adapts the "outer" Iterator.

That way, one can think of kmerge as a variant of flat_map with a different order of the elements of the resulting sequence. Surely, if flat_map is worthy of being a member of the cadre of adaptors, kmerge can be accepted as well.

bsteinb · 2016-02-13T11:00:44Z

@pczarn Your concern about moving around instances of I seems justified, but IMO could better be tackled by making changes to the heap data structure itself.

I am not entirely sure what increase-key and decrease-key operations are (have not had a lot of formal CS training), but I think I have accidentally used them in my experiments. I will write more in a separate reply.

bsteinb · 2016-02-13T11:27:02Z

@bluss So, I have been looking at the McSherry implementation. The main difference seems to be that it does not pop the largest element, modify it and push it back (incurring a sift_down and a sift_up), but rather modifies it in place and re-establishes the heap invariant with a single sift_down. I guess this is @pczarn's increase-key operation, which was all new to me.

Since I cannot get at the guts of std's BinaryHeap that way, I have copied its implementation into a separate crate and added two new methods:

pub fn pop_push<F>(&mut self, f: F)
    where F: FnOnce(Option<T>) -> Option<T>;

pub fn pop_push_back<F>(&mut self, f: F)
    where F: FnOnce(T) -> Option<T>;

They pass the top element of the heap to f and f can then decide whether it wants to push something back. The heap invariant is only re-enforced once f is done. This gets around the sift_down sift_up pair of the pop push implementation, but might still use one more swap than the McSherry version (some unsafe might help here).

Using this new API cuts run-time in half in the benchmarks.
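
To make the idea concrete, here is a minimal sketch of how a `pop_push_back`-style method could work on a hand-rolled max-heap. It is not the code from the separate crate mentioned above: `TinyHeap` is a hypothetical, stripped-down type, and only the method signature is taken from this comment. The top element is handed to `f`, and the heap invariant is restored with one `sift_down` afterwards.

```rust
/// Hypothetical, stripped-down max-heap used only for illustration.
struct TinyHeap<T: Ord> {
    data: Vec<T>,
}

impl<T: Ord> TinyHeap<T> {
    fn new() -> Self {
        TinyHeap { data: Vec::new() }
    }

    /// Ordinary push: append, then sift the new item up.
    fn push(&mut self, item: T) {
        self.data.push(item);
        let mut pos = self.data.len() - 1;
        while pos > 0 {
            let parent = (pos - 1) / 2;
            if self.data[pos] <= self.data[parent] {
                break;
            }
            self.data.swap(pos, parent);
            pos = parent;
        }
    }

    /// Passes the top element to `f`; if `f` returns an item it becomes the
    /// new root. Either way a single sift_down repairs the heap.
    fn pop_push_back<F>(&mut self, f: F)
    where
        F: FnOnce(T) -> Option<T>,
    {
        if self.data.is_empty() {
            return;
        }
        // Moves the last element into slot 0 and returns the old top.
        let top = self.data.swap_remove(0);
        if let Some(item) = f(top) {
            // Put the old last element back at the end and `item` at the root.
            self.data.push(item);
            let last = self.data.len() - 1;
            self.data.swap(0, last);
        }
        self.sift_down(0);
    }

    fn sift_down(&mut self, mut pos: usize) {
        let len = self.data.len();
        loop {
            let left = 2 * pos + 1;
            if left >= len {
                break;
            }
            // Pick the larger of the two children.
            let right = left + 1;
            let child = if right < len && self.data[right] > self.data[left] {
                right
            } else {
                left
            };
            if self.data[pos] >= self.data[child] {
                break;
            }
            self.data.swap(pos, child);
            pos = child;
        }
    }
}
```

The extra `swap` in the replacement path corresponds to the "one more swap" mentioned above. If `f` panics, the heap is left in a memory-safe state but its ordering invariant may be violated.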

How best to proceed with this? Is this API something that should be made part of the BinaryHeap in std? Would you prefer to host an ad-hoc heap implementation in itertools that implements this in some form? Or is the PR fine the way it is for now?

bluss · 2016-02-13T16:00:51Z

It's cool, we don't need to do any optimizations now (#97 (comment)), as long as we have an API that permits them later (which we have). The indirection suggestion is pretty interesting, but mcsherry's improvement is the most important part. Until now I've preferred not to depend on any special-case data structures in itertools. If we include one, I imagine we would use a very stripped-down version of the binary heap.

bsteinb · 2016-02-20T14:16:21Z

Are there any open questions left here or is getting this merged just a matter of you finding the time to do it?

bluss · 2016-02-21T13:16:33Z

There aren't. I was just wondering where the contain-rs discussion would go.

bsteinb · 2016-02-21T17:38:17Z

Heh, wherever it is going, it is not going there fast.

bluss · 2016-02-21T17:49:12Z

I'm sorry that it's already been a week, I haven't put in as much time as I used to. We'll merge this, then I can push my quickcheck fix too.

bsteinb · 2016-02-21T17:52:57Z

No need to feel sorry. I assume we both do this in our free time. It was not my intention to rush you, just wanted to make sure I had not missed one of your suggestions again.


bluss · 2016-02-22T18:12:28Z

Thank you! Issue #98 is the follow up issue for future improvements.

bsteinb added 5 commits February 12, 2016 20:53

Add kmerge adaptor.

dad8972

Add kmerge free function.

6f10911

Add kmerge tests.

d8b821a

Add kmerge quickcheck tests.

17ab1ad

Add kmerge benchmarks.

5f0d718

bluss reviewed Feb 12, 2016

bsteinb added 2 commits February 12, 2016 23:24

Swap order of cmp and partial_cmp arguments instead of using Ordering::reverse.

4974286

Document the equality and ordering implementations for NonEmpty.

7d00758

Add explicit implementations for lt et al.

ff0608f

bsteinb added 2 commits February 13, 2016 10:22

Remove NonEmpty from bounds on Clone impl for KMerge.

9134a5e

Make NonEmpty private.

e5caca1

bsteinb mentioned this pull request Feb 14, 2016

Extension to BinaryHeap for efficiently modifying the greatest element contain-rs/discuss#13

Closed

bluss added a commit that referenced this pull request Feb 22, 2016

Merge pull request #97 from bsteinb/kmerge

18278c5

Add k-way merge adaptor.

bluss merged commit 18278c5 into rust-itertools:master Feb 22, 2016

bluss mentioned this pull request Feb 22, 2016

Optimize kmerge #98

Closed
