Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add k-way merge adaptor. #97

Merged
merged 10 commits into from
Feb 22, 2016
Merged

Add k-way merge adaptor. #97

merged 10 commits into from
Feb 22, 2016

Conversation

bsteinb
Copy link
Contributor

@bsteinb bsteinb commented Feb 12, 2016

Merges an arbitrary number of iterators in ascending order.

Uses std's BinaryHeap to decide which iterator to take from next. This seems quite heavyweight. Two-way merge benchmarks take roughly ten times longer than the dedicated two-way merge adaptor. Profiling identifies BinaryHeaps sift_up as the hot-spot.

Not completely sure about the interface, the double use of IntoIterator in the free-standing function in particular.

I::Item: PartialOrd
{
fn partial_cmp(&self, other: &NonEmpty<I>) -> Option<Ordering> {
self.head.partial_cmp(&other.head).map(Ordering::reverse)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of mapping reverse I'd just use other.head.partial_cmp(&self.head) here. Simple and less noise.

What's more important is that it should implement all of lt, le, gt, ge. Implementing the specific comparison operators should have a noticable effect in benchmarks. BinaryHeap uses >, >=, <= it looks like (we just impl all).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment somewhere that this type implements comparisons reversed to be used in a min heap would be good.

@bluss
Copy link
Member

bluss commented Feb 12, 2016

Hey, this is interesting. We want this in itertools.

@frankmcsherry talked to me about the same algorithm, but here you are, with the first PR and that's completely OK. I'm just wondering if we can glean some tricks from his implementation in https://github.com/frankmcsherry/differential-dataflow/blob/master/src/iterators/merge.rs

The main trick is that it doesn't use the binary heap at all, so that it can be a bit more efficient. But we don't need to do that optimization now. I don't think there is anything in BinaryHeap that lets us do something similar.

I guess the quickcheck tests are still not working? I should fix that. This thing definitely deserves a quickcheck test.

@bsteinb
Copy link
Contributor Author

bsteinb commented Feb 12, 2016

Cool. Sorry about cutting in line here, I was extra careful to check for open pull requests and issues.

@frankmcsherry's implementation does look similar, but might manage to cut a few corners, here and there. Although it does implement half of a heap somewhere in there. I would be interested in seeing the difference in performance.

Ideally, I would like to be able to get around the Ord bound and offer a variant that merges based on a predicate, but can't see an elegant way to achieve that using BinaryHeap.

The quickcheck tests are indeed broken, quickcheck_macros is missing from the list of dependencies. I have it fixed locally and can submit another PR, if you'd like.

I'll get to working on your remarks now.

@bluss
Copy link
Member

bluss commented Feb 12, 2016

I'm fixing quickcheck, it's going to be without quickcheck_macros; syntax extensions break too often so that's annoying.

@bsteinb
Copy link
Contributor Author

bsteinb commented Feb 12, 2016

OK. I have tried to address your remarks in the last two commits.

@bluss
Copy link
Member

bluss commented Feb 12, 2016

Great. did you see the idea about implementing lt, le, gt, ge? I think it makes a difference.

@bsteinb
Copy link
Contributor Author

bsteinb commented Feb 12, 2016

Nope, skipped right past it. I have added explicit implementations, but I do not observe a significant change in the benchmark results.

@bluss
Copy link
Member

bluss commented Feb 12, 2016

Oh 😞. I'm a bit surprised it might be for other element types. Thank you anyway.

@bluss
Copy link
Member

bluss commented Feb 12, 2016

I want to merge this, don't have time today, but I'll get to it. I would remove the kmerge method in the Itertools trait in fact (this is not an operation on a single iterator imo). I would also fix the bounds on Clone to not use NotEmpty at all.

@bsteinb
Copy link
Contributor Author

bsteinb commented Feb 13, 2016

Yeah, I did not like NonEmpty showing up in the public interface the way it did. Looks like I simply gave up too soon when the compiler would not stop complaining about missing impls. You motivated me to take another stab at it and lo and behold, NonEmpty is gone from the bounds and can now be made private.

@pczarn
Copy link

pczarn commented Feb 13, 2016

I had an idea for the binary heap in an unrelated algorithm. Since I has unknown size, maybe you shouldn't move iterators and instead add a binary heap of indices or pointers to I as an indirection. Further, perhaps you could implement and use decrease-key or increase-key operations instead of pop and push.

This PR is completely fine though.

@bsteinb
Copy link
Contributor Author

bsteinb commented Feb 13, 2016

I disagree about kmerge not being an adaptor. It operates on a sequence of Iterators which can be modelled by an Iterator of Iterators. kmerge in this view adapts the "outer" Iterator.

That way, one can think of kmerge as a variant of flat_map with a different order of the elements of the resulting sequence. Surely, if flat_map is worthy of being a member of the cadre of adaptors, kmerge can be accepted as well.

@bsteinb
Copy link
Contributor Author

bsteinb commented Feb 13, 2016

@pczarn Your concern about moving around instances of I seems justified, but IMO could better be tackled by making changes to the heap data structure itself.

I am not entirely sure what increase-key and decrease-key operations are (have not had a lot of formal CS training), but I think I have accidentally used them in my experiments. I will write more in a separate reply.

@bsteinb
Copy link
Contributor Author

bsteinb commented Feb 13, 2016

@bluss So, I have been looking at the McSherry implementation and the main difference seems to be in the way it does not pop the largest element, modifies it and pushes it back (incurring a sift_down and a sift_up), but rather modifies it in place and re-establishes the heap invariant by doing a sift_down. I guess this is @pczarn's increase-key operation, which was all new to me.

Since I cannot get at the guts of std's BinaryHeap that way, I have copied its implementation into a separate crate and added two new methods:

pub fn pop_push<F>(&mut self, f: F)
    where F: FnOnce(Option<T>) -> Option<T>;
pub fn pop_push_back<F>(&mut self, f: F)
    where F: FnOnce(T) -> Option<T>;

They pass the top element of the heap to f and f can then decide whether it wants to push something back. The heap invariant is only re-enforced once f is done. This gets around the sift_down sift_up pair of the pop push implementation, but might still use one more swap than the McSherry version (some unsafe might help here).

Using this new API cuts run-time in half in the benchmarks.

How best to proceed with this? Is this API something that should be made part of the BinaryHeap in std? Would you prefer to host an ad-hoc heap implementation of a heap in itertools that implements this in some form? Is the PR fine the way it is for now?

@bluss
Copy link
Member

bluss commented Feb 13, 2016

It's cool, we don't need to do any optimizations now (#97 (comment)), as long as we have an API that permits them later (which we have). The indirection suggestion is pretty interesting, but mcsherry's improvement is the most important part. Until now I've preferred to not depend on any special case datastructures in itertools. If we include it, I imagine we use a very stripped down version of the binary heap.

@bsteinb
Copy link
Contributor Author

bsteinb commented Feb 20, 2016

Are there any open questions left here or is getting this merged just a matter of you finding the time to do it?

@bluss
Copy link
Member

bluss commented Feb 21, 2016

There isn't, I was just wondering where the contains-rs discussion would go.

@bsteinb
Copy link
Contributor Author

bsteinb commented Feb 21, 2016

Heh, wherever it is going, it is not going there fast.

@bluss
Copy link
Member

bluss commented Feb 21, 2016

I'm sorry that it's already been a week, I haven't put in as much time as I used to. We'll merge this, then I can push my quickcheck fix too.

@bsteinb
Copy link
Contributor Author

bsteinb commented Feb 21, 2016

No need to feel sorry. I assume we both do this in our free time. It was not my intention to rush you, just wanted to make sure I had not missed one of your suggestions again.

bluss added a commit that referenced this pull request Feb 22, 2016
Add k-way merge adaptor.
@bluss bluss merged commit 18278c5 into rust-itertools:master Feb 22, 2016
@bluss bluss mentioned this pull request Feb 22, 2016
@bluss
Copy link
Member

bluss commented Feb 22, 2016

Thank you! Issue #98 is the follow up issue for future improvements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants