Improve SliceExt::binary_search performance #45333

Merged
merged 1 commit into rust-lang:master from alkis:master on Nov 11, 2017

Conversation

@alkis
Contributor

alkis commented Oct 16, 2017

Improve the performance of binary_search by reducing the number of unpredictable conditional branches in the loop. In addition improve the benchmarks to test performance in l1, l2 and l3 caches on sorted arrays with or without dups.

Before:

```
test slice::binary_search_l1                               ... bench:          48 ns/iter (+/- 1)
test slice::binary_search_l2                               ... bench:          63 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         152 ns/iter (+/- 12)
test slice::binary_search_l1_with_dups                     ... bench:          36 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups                     ... bench:          64 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups                     ... bench:         153 ns/iter (+/- 6)
```

After:

```
test slice::binary_search_l1                               ... bench:          15 ns/iter (+/- 0)
test slice::binary_search_l2                               ... bench:          23 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         100 ns/iter (+/- 17)
test slice::binary_search_l1_with_dups                     ... bench:          15 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups                     ... bench:          23 ns/iter (+/- 0)
test slice::binary_search_l3_with_dups                     ... bench:          98 ns/iter (+/- 14)
```
@rust-highfive

Collaborator

rust-highfive commented Oct 16, 2017

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @aturon (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

```
@@ -20,6 +20,6 @@ extern crate test;
mod any;
mod hash;
mod iter;
mod mem;
```

@Mark-Simulacrum

Mark-Simulacrum Oct 16, 2017

Member

Why is this removed?

@alkis

alkis Oct 16, 2017

Contributor

It doesn't exist.

@Mark-Simulacrum

Mark-Simulacrum Oct 16, 2017

Member

@alexcrichton How was this able to get past CI? Do we not run benches as part of libcore's tests?

@cuviper

cuviper Oct 17, 2017

Member

I don't think we run benches in CI. The file was moved in #44943.

@bluss

Contributor

bluss commented Oct 17, 2017

I like it; trading an unpredictable branch for a predictable one (the bounds check) is neat.

It doesn't exit when it finds the first equal element -- so does this PR change the result for some inputs?

@alkis

Contributor

alkis commented Oct 17, 2017

The biggest benefit comes from trading all the unpredictable branches inside the loop with conditional moves. This is otherwise known as "branchless binary search" and has been shown to be faster than "traditional binary search" and as fast as linear search for small inputs. There is a recent paper that covers different layouts for comparison-based searching: https://arxiv.org/abs/1509.05053. It covers a lot more than "branchless binary search"; nevertheless, it is a good read.

Answering your question: the PR should not change the results for any input. I added some extra test cases to increase confidence.
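
For readers unfamiliar with the technique, here is a minimal Rust sketch of a branchless binary search in the style described above. It is an illustration of the idea, not necessarily the exact code in this PR, and the function name is mine.

```rust
use std::cmp::Ordering::{self, Equal, Greater, Less};

// Sketch of a branchless binary search: the loop always runs ~log2(len)
// iterations, and the only data-dependent decision is which half to keep,
// which the compiler can lower to a conditional move instead of a branch.
fn branchless_binary_search_by<T, F>(s: &[T], mut f: F) -> Result<usize, usize>
where
    F: FnMut(&T) -> Ordering,
{
    let mut size = s.len();
    if size == 0 {
        return Err(0);
    }
    let mut base = 0usize;
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        // Keep the lower or upper half; note there is no early exit on Equal.
        base = if f(&s[mid]) == Greater { base } else { mid };
        size -= half;
    }
    // A single final comparison decides between "found" and insertion point.
    let cmp = f(&s[base]);
    if cmp == Equal {
        Ok(base)
    } else {
        Err(base + (cmp == Less) as usize)
    }
}

fn main() {
    let v = [1, 2, 4, 4, 4, 7, 9];
    // This particular sketch lands on the last index of a run of equal
    // elements (index 4 here); the std docs promise only *some* match.
    assert_eq!(branchless_binary_search_by(&v, |x| x.cmp(&4)), Ok(4));
    assert_eq!(branchless_binary_search_by(&v, |x| x.cmp(&3)), Err(2));
}
```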

@arthurprs

Contributor

arthurprs commented Oct 17, 2017

Cool! Although I think this needs to be tested further with other payloads, like (u64, u64), where the comparison isn't compiled into a single cmov*. My impression is that it's slower when that's not the case, as it's forced to go through all the branches and then the conditional move.

@alkis

Contributor

alkis commented Oct 17, 2017

@arthurprs I don't think anything will change if (u64, u64) is used (or even strings for that matter). The cmov is used to select the new base after the comparison is done and its result is placed in the flags register: it is either base or mid. If the comparison operator has branches of its own that are not predictable there is not much we can do about it at this level. It has to be done by code external to binary_search_by.

@arthurprs

Contributor

arthurprs commented Oct 17, 2017

Good point. I also ran a few benches just to confirm and it checks out.

@bluss

Contributor

bluss commented Oct 17, 2017

@alkis there are no test cases that cover this — with duplicate elements. I've had time to check this at my computer now, and the new algorithm does produce different results for some inputs:

The example (playground link) is searching for 0 in a vector of zeros.

  1. It's important to walk into breaking changes, however small, with open eyes
  2. Now we know that it exists — can we find out if it matters? We've had bugs filed for breakage for less.

(Less important, but still interesting: the old algorithm finishes as quickly as possible on this input, while for the new algorithm it's a worst case.)
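
For concreteness, here is a hedged reconstruction of the kind of check being described; the playground code itself is not shown above, so this is my own version.

```rust
// Illustration only: with a slice of all-equal elements, the documented
// contract promises *some* matching index, so two implementations can
// return different, equally valid positions.
fn main() {
    let zeros = vec![0u8; 13];
    match zeros.binary_search(&0) {
        // Old and new implementations may disagree on `i`; both are allowed.
        Ok(i) => println!("found 0 at index {} of {}", i, zeros.len()),
        Err(_) => unreachable!("0 is present, so a match must be reported"),
    }
}
```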

@alkis

Contributor

alkis commented Oct 17, 2017

@bluss: thanks for testing this. I don't think these changes are material and very unlikely to result in breakages. Let me explain why:

  • old: returns the first match the search happens to hit. Which match that is is implementation defined, and it is not a well-defined element: it is neither the lower bound nor the one before the upper bound. It is hard to make a case for relying on such a match with any reasonable expectations.
  • new: if a match exists it returns the lower bound. This is a stronger guarantee than the old implementation gives. It is still an implementation detail, but it carries the risk of falling into Hyrum's Law in the future.

In addition, the two implementations behave differently when the list is not sorted according to the order defined by the predicate. I also think this is not important.

For performance, it is true that the new algorithm always does log2(size) steps while the old one does at most log2(size) steps. This is both good and bad: bad because it can be slower in contrived situations, good because its performance is predictable.

I can add a few more benchmarks if you think they add value:

  • string slices as keys
  • with dups
  • with different sizes: sizeof array fits in L1 cache, L2 cache and L3 cache

About possible breakages, if you have suggestions on how to investigate them before merging I would be glad to take a look.

@alkis

Contributor

alkis commented Oct 17, 2017

I improved the benchmarks a bit. Please take another look.

@Mark-Simulacrum

Member

Mark-Simulacrum commented Oct 18, 2017

I think we should definitely get a crater run on this to at least note any test failures that would be caused by landing this PR. Is it possible to keep the old behavior without removing the performance gains this PR makes? I'm somewhat hesitant to change the behavior of this, and I'm not sure I entirely agree with all of your points about this being unlikely to hurt anyone in practice.

I agree that the new behavior on equal data is perhaps more elegant, but the old behavior (as I understand) is stable, if defined only by the algorithm used. Since it's stable, changing it now would lead me to believe that someone could depend on it -- even if they shouldn't be -- and we should be very hesitant to change it. Perhaps a survey of Rust code on GH that uses binary search could help here -- how much, if any, of it will change behavior given stretches of equal data? If we determine that probably none, then I'd be more inclined to make this change. If at least a couple different crates, then I'm pretty much against this.

With regards to unsorted data, I don't think there's any problem in changing behavior there -- a binary search on unsorted data isn't going to be reliable, at all, and this is something we explicitly document.

So, to summarize: I think that we should be very hesitant to land this without more data that this doesn't break people in practice (crater run and GH survey). It's "silent" breakage in that code continues to compile but changes in behavior, which is nearly the worst kind of breakage we can introduce.

r? @BurntSushi

cc @rust-lang/libs -- how do we feel about the potential breakage here? Author's thoughts and the breakage are outlined in #45333 (comment), mine are above.

@sfackler

Member

sfackler commented Oct 18, 2017

I'd be interested in seeing what a crater run turned up, but I'm not particularly worried about this change in behavior. We've never documented which of multiple equal values is arrived at, and it seems like you're in a bit of a weird place if you have that kind of data anyway.

@retep998

Member

retep998 commented Oct 18, 2017

Honestly, if it was going to pick a value from among equals, then I'd expect it to pick the lower bound. Because the current implementation picks effectively at random (even if it is deterministic), there's absolutely no way I'd be able to rely on which value it picked. The lower bound is still deterministic, so it won't break any code that wants deterministic behavior, but I have a really hard time imagining what sort of code can manage to rely on which specific element the current algorithm picks at random. Of course we should still do a crater run to be sure, but if we can't find any legitimate cases, then we should absolutely make this change.

@arthurprs

Contributor

arthurprs commented Oct 18, 2017

I found myself wanting the lower bound of the equal subsequence recently. On the other hand I'd argue against guaranteeing this sort of behavior.

Also, this discussion resembles the stable/unstable sort thing.

@kennytm

Member

kennytm commented Oct 18, 2017

The documentation of binary_search never guaranteed which index will be returned when there's an equal range. The new behavior is compatible with the existing definition.

Looks up a series of four elements. The first is found, with a uniquely determined position; the second and third are not found; the fourth could match any position in [1, 4].

```rust
let s = [0, 1, 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55];

assert_eq!(s.binary_search(&13),  Ok(9));
assert_eq!(s.binary_search(&4),   Err(7));
assert_eq!(s.binary_search(&100), Err(13));
let r = s.binary_search(&1);
assert!(match r { Ok(1...4) => true, _ => false, });
```

@alkis

Contributor

alkis commented Oct 18, 2017

@arthurprs I think the lack of lower_bound and upper_bound is a hole in the std library. We can definitely add those. I can send a separate PR if there is consensus.

@bluss

Contributor

bluss commented Oct 18, 2017

ping @frankmcsherry, you might be interested in this

@frankmcsherry

Contributor

frankmcsherry commented Oct 18, 2017

This appeals to me greatly. I can't recall if I complained to @bluss and that is why I got the ping, but (i) better performance is great, and (ii) being able to get to the lower bound is important for lots of applications. While this PR doesn't commit to that, it does mean that I could in principle start using binary_search which I couldn't do before (I mean, I could, but then I'd have to do geometric search backwards, and .. =/).

My understanding is that this could be slower if the comparison is very expensive, much like quicksort can be slower than mergesort if the comparison is expensive. Benchmarking on large randomly permuted String slices could tease that out (if each comparison is now an access to a random location in GBs of data). This would be more visible if there would be an early match (e.g. "find a string starting with 'q'").

Also, if I understand the linked article, there is the potential downside that most architectures will not prefetch through a computed address, as produced by cmov. In the article they recover the performance with explicit prefetching (see Figure 8), but it seems plausible that there could be more memory stalls with the conditional move approach than with a branching approach. No opinion on which is better / worse, though.

Edit: Also, my understanding is that a lot of the "branchless" benefits go away if your comparison function has a branch in it, e.g. for String.

@Mark-Simulacrum Rust has changed behavior a few times (for me wrt sort) and as the docs don't commit to any specific semantics other than "a" match, I'd be optimistic that this could land. I'm pro "check and see what breaks and try hard to avoid any", having just complained about that to Niko, but I'm also pro "actually commit to some semantics" and the first element is what I typically find needed.

@kennytm The counter-point that I made recently (even though I'd love to have this land asap), is that no matter what the docs say if the change breaks code it breaks code. Not knowing anything about this stuff, a crater run seems like a great thing to do to grok whether people unknowingly over-bound to semantics that weren't documented. If they did, breaking their stuff and saying "sorry, but" still hurts (them, and the perception of stability).

@frankmcsherry

Contributor

frankmcsherry commented Oct 18, 2017

```rust
if cmp == Equal { Ok(base) } else { Err(base + (cmp == Less) as usize) }
```

I don't know either way, but is (cmp == Less) as usize idiomatic? Is the compiler unable to turn a match statement into the right assembly? I didn't even realize that Rust commits to this being 0 or 1.
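
For what it's worth, here is a small side-by-side of the two spellings, as my own illustration rather than anything from the PR: `bool as usize` is guaranteed by the language to be 0 or 1, and a match can express the same computation; whether the optimizer emits identical code for both would need checking.

```rust
use std::cmp::Ordering::{self, Equal, Greater, Less};

// Illustration only: the cast form quoted above versus an equivalent match
// for computing the insertion slot in the not-found case.
fn slot_cast(base: usize, cmp: Ordering) -> usize {
    // `bool as usize` is defined to be 0 or 1.
    base + (cmp == Less) as usize
}

fn slot_match(base: usize, cmp: Ordering) -> usize {
    match cmp {
        Less => base + 1,
        _ => base,
    }
}

fn main() {
    for &cmp in &[Less, Equal, Greater] {
        assert_eq!(slot_cast(10, cmp), slot_match(10, cmp));
    }
}
```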

@frankmcsherry

Contributor

frankmcsherry commented Oct 18, 2017

In talking out the issue of intended semantics, what are the use cases where a collection may have multiple matching keys and returning "an arbitrary element" is acceptable? I'm familiar with "collection has distinct keys" in which things are unambiguous, and "collection has multiple keys, return the range please". I admit to never having seen an application that calls for what Rust currently offers; if someone could clue me in I would appreciate it!

@bluss

Contributor

bluss commented Oct 18, 2017

I think this implementation is really neat but just using it in a new set of methods (for lower and upper bounds), leaving binary search unchanged sounds like the best solution. It makes the faster implementation available, it makes the useful lower bound algorithm available, and it avoids doing more steps than needed in binary_search. (Even if it looks like this extra work often has “negative cost”, we know there are cases where it has a cost.)

@alkis

Contributor

alkis commented Oct 18, 2017

@bluss it would be extremely interesting to see cases that consistently regress only because the old implementation bails out early. Also, I don't think having a slow binary_search with a note saying "if you want performance use lower_bound" is a good choice for the standard library.

@alkis

Contributor

alkis commented Oct 18, 2017

@frankmcsherry

  1. I don't see how this can be slower if the comparison is very expensive unless you are talking about getting lucky and landing on the correct element before you do N comparisons. @arthurprs benchmarked a tuple (which presumably is both more expensive and has an extra branch in the comparison) and found it having the same results as this implementation.
  2. "branchless" benefits might go away if the comparison has branches in it. Maybe the CPU can detect dependent branches in this case and elide the cost of the second one. We can only know if we have a benchmark.
  3. `(cmp == Less) as usize`: not sure if this is idiomatic or not. I would be happy to switch to the idiomatic version if it has the same performance and is as succinct.
  4. For sets with multiple matches the usual operation you want on it is equal_range. Perhaps we need to add that as well on top of lower_bound and upper_bound.
@arthurprs

Contributor

arthurprs commented Oct 18, 2017

It's clear to me that lower_bound and upper_bound are desirable, whether or not binary_search is deprecated. I feel that its name (self-documenting/discoverable) is reason alone not to deprecate it, though.

@alkis If you have "duplicates" and/or an expensive cmp function the new code might be slower.

@frankmcsherry

Contributor

frankmcsherry commented Oct 18, 2017

@alkis

  1. It's not necessarily about getting very lucky. If you are doing a search by a key, rather than by element, you can find a key very early in the search and then spend the rest of your time finding the first element. The example from above was sorted strings and searching using just the first character to find a string starting with 'q'.

Edit: btw, I much prefer your code to what exists at the moment, which I don't use because it doesn't do what I need. I'm just trying to call out the things folks should worry about and be sure they are ok with.

@bluss

Contributor

bluss commented Oct 18, 2017

@alkis To be brief, you can (at least I could) reproduce the slower case using `let v = (0..size).map(|i| i / 32).collect::<Vec<_>>();` for the sorted sequence in the benchmark in the PR.
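
For anyone wanting to try this outside the in-tree benchmarks, here is a rough, hedged harness using that input; the sizes, names, and iteration counts are my own choices, not the PR's benchmark code.

```rust
use std::time::Instant;

// Illustration only: time lookups over a duplicate-heavy sorted vector
// (`i / 32` produces runs of 32 equal elements, as suggested above).
fn main() {
    let size: usize = 1 << 16;
    let v: Vec<usize> = (0..size).map(|i| i / 32).collect();

    let start = Instant::now();
    let mut hits = 0usize;
    for i in 0..size {
        // Every probed value is present; any index in its run is a valid Ok.
        if v.binary_search(&(i / 32)).is_ok() {
            hits += 1;
        }
    }
    println!("{} lookups in {:?}", hits, start.elapsed());
}
```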

@alkis

Contributor

alkis commented Oct 18, 2017

@frankmcsherry and @bluss see updated benchmarks:

Before:

```
test slice::binary_search_l1                               ... bench:          48 ns/iter (+/- 1)
test slice::binary_search_l2                               ... bench:          63 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         152 ns/iter (+/- 12)
test slice::binary_search_l1_with_dups                     ... bench:          36 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups                     ... bench:          64 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups                     ... bench:         153 ns/iter (+/- 6)
```

After:

```
test slice::binary_search_l1                               ... bench:          17 ns/iter (+/- 1)
test slice::binary_search_l2                               ... bench:          24 ns/iter (+/- 2)
test slice::binary_search_l3                               ... bench:         139 ns/iter (+/- 27)
test slice::binary_search_l1_with_dups                     ... bench:          17 ns/iter (+/- 1)
test slice::binary_search_l2_with_dups                     ... bench:          25 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups                     ... bench:         137 ns/iter (+/- 21)
```

I don't see the regression.

Furthermore, let's step back a bit. Do we expect no regressions at all? I do not think that is realistic. If we accept that there are going to be some regressions, we have to use Amdahl's Law to assess the tradeoff. The performance increase on arrays that fit in the L1 or L2 cache is about 2x, which is not trivial. So unless the cases that regress represent the majority of real usage, I find it unwise to block this PR.

Think of it in reverse: if the code in this PR were the current code, would we approve a change back to the current code because it is faster on the contrived cases you mention?

@alkis

Contributor

alkis commented Oct 18, 2017

FWIW: I think the biggest risk of this change is that unit tests will break if they depend on the element/position returned by the current implementation of binary_search.

@alkis

Contributor

alkis commented Nov 5, 2017

Btw I have implementations of lower_bound, upper_bound and equal_range ready for rust-lang/rfcs#2184. If we assume they will be part of stdlib then we do not have to make binary_search return the lower bound.
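
As a rough sketch of what such methods could look like, built here on the stable binary_search_by rather than the actual RFC 2184 or ordslice code (the function names are mine):

```rust
use std::cmp::Ordering::{Greater, Less};
use std::ops::Range;

// Sketch only: with a comparator that never reports Equal, binary_search_by
// always returns Err(i), and i is exactly the bound we are after.
fn lower_bound<T: Ord>(s: &[T], x: &T) -> usize {
    s.binary_search_by(|e| if e < x { Less } else { Greater })
        .unwrap_err()
}

fn upper_bound<T: Ord>(s: &[T], x: &T) -> usize {
    s.binary_search_by(|e| if e <= x { Less } else { Greater })
        .unwrap_err()
}

fn equal_range<T: Ord>(s: &[T], x: &T) -> Range<usize> {
    lower_bound(s, x)..upper_bound(s, x)
}

fn main() {
    let s = [0, 1, 1, 1, 1, 2, 3];
    assert_eq!(lower_bound(&s, &1), 1);
    assert_eq!(upper_bound(&s, &1), 5);
    assert_eq!(equal_range(&s, &1), 1..5);
}
```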

@ishitatsuyuki

Member

ishitatsuyuki commented Nov 5, 2017

I also settled on a different comparison operator for bounds. The order shouldn't matter then.

bluss assigned bluss and unassigned aturon Nov 6, 2017

@alkis

Contributor

alkis commented Nov 7, 2017

For the impatient, you can use fast_binary_search from https://crates.io/crates/ordslice.

@shepmaster

Member

shepmaster commented Nov 11, 2017

Triage reminder for @bluss

@bluss

Contributor

bluss commented Nov 11, 2017

I'll do a clarification and summary. Just correct me @alkis or others if I'm wrong.

  • This PR improves performance of the binary search methods for slices
  • The resulting index may change compared with the previous version when it comes to picking an element among a run of elements that are all equal
  • The lower bound guarantee that was discussed in this thread does not exist. (Important to underline so that there are no misunderstandings.)
  • This PR does not change what guarantees we are making for the binary_search methods
  • There were no regressions found in the crater run

I personally think it will be no problem to introduce lower_bound, upper_bound etc methods to the std lib sooner or later, so there's no pressure to cram that functionality into the binary search methods.

@alkis

Contributor

alkis commented Nov 11, 2017

@bluss this is a great summary. Perhaps the one thing missing is the reason binary_search does not guarantee the lower bound: performance.

@bluss

Please add a test that exercises sorted input data with runs of equal elements. This has been an important case in this PR and we should cover it going forward. It may also be a good idea to note in those tests that we don't guarantee a particular output; we are only testing the current behaviour of the algorithm.
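
As a hedged illustration of what such a test could look like (the name and data below are mine, not the test that was actually added):

```rust
// Illustration only: pin down the current behaviour on sorted input with a
// run of equal elements, without promising a particular index in the docs.
#[test]
fn binary_search_with_run_of_equal_elements() {
    let s = [0, 1, 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55];
    // The contract only says *some* matching index is returned.
    match s.binary_search(&1) {
        Ok(i) => assert!(1 <= i && i <= 4),
        Err(_) => panic!("1 is present and must be found"),
    }
    // Missing values still have a unique insertion point.
    assert_eq!(s.binary_search(&4), Err(7));
}
```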

@Manishearth

With this small change and the test @bluss mentioned this should be ready to go.

r=me if necessary, though preferably @bluss should sign off

@arthurprs

Contributor

arthurprs commented Nov 11, 2017

@alkis I'd like to point out that with this change using a binary search in the BTreeMap nodes could be a win. Maybe you want to take a stab at that.

@ishitatsuyuki

Member

ishitatsuyuki commented Nov 11, 2017

How? Binary trees are not contiguous arrays, and the optimization techniques should be different.

EDIT: I see your point. BTree uses partially contiguous arrays.

@arthurprs

Contributor

arthurprs commented Nov 11, 2017

They're B-trees, not binary trees.

@alkis

Contributor

alkis commented Nov 11, 2017

@ishitatsuyuki perhaps @arthurprs is talking about the array in each BTree node itself. Not sure what is happening there but it might be linear search instead of binary search.
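
For context, here is a small sketch of the two in-node strategies being contrasted; the node capacity and layout are illustrative assumptions, not the actual BTreeMap internals.

```rust
// Illustration only: searching the sorted key array of a single B-tree node.
const CAP: usize = 11; // made-up node capacity for this sketch

fn linear_search_node(keys: &[u32], x: u32) -> Result<usize, usize> {
    for (i, &k) in keys.iter().enumerate() {
        if k == x {
            return Ok(i);
        }
        if k > x {
            return Err(i);
        }
    }
    Err(keys.len())
}

fn main() {
    let keys = [3u32, 7, 9, 20, 41]; // a partially filled node
    assert!(keys.len() <= CAP);
    assert_eq!(linear_search_node(&keys, 9), Ok(2));
    assert_eq!(linear_search_node(&keys, 10), Err(3));
    // The suggestion above: since a node's keys form a contiguous sorted
    // array, the slice binary search could be used instead of this loop.
    assert_eq!(keys.binary_search(&9), Ok(2));
}
```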

Improve the performance of binary_search by reducing the number of
unpredictable conditional branches in the loop. In addition improve the
benchmarks to test performance in l1, l2 and l3 caches on sorted arrays
with or without dups.

Before:

```
test slice::binary_search_l1                               ... bench:  48 ns/iter (+/- 1)
test slice::binary_search_l2                               ... bench:  63 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench: 152 ns/iter (+/- 12)
test slice::binary_search_l1_with_dups                     ... bench:  36 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups                     ... bench:  64 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups                     ... bench: 153 ns/iter (+/- 6)
```

After:

```
test slice::binary_search_l1                               ... bench:  15 ns/iter (+/- 0)
test slice::binary_search_l2                               ... bench:  23 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench: 100 ns/iter (+/- 17)
test slice::binary_search_l1_with_dups                     ... bench:  15 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups                     ... bench:  23 ns/iter (+/- 0)
test slice::binary_search_l3_with_dups                     ... bench:  98 ns/iter (+/- 14)
```
@bluss

Contributor

bluss commented Nov 11, 2017

@bors r+

Thanks a lot @alkis for this improvement!

@bors

Contributor

bors commented Nov 11, 2017

📌 Commit 2ca111b has been approved by bluss

@bors

Contributor

bors commented Nov 11, 2017

⌛️ Testing commit 2ca111b with merge 24bb4d1...

bors added a commit that referenced this pull request Nov 11, 2017

Auto merge of #45333 - alkis:master, r=bluss
Improve SliceExt::binary_search performance

Improve the performance of binary_search by reducing the number of unpredictable conditional branches in the loop. In addition improve the benchmarks to test performance in l1, l2 and l3 caches on sorted arrays with or without dups.

Before:

```
test slice::binary_search_l1                               ... bench:          48 ns/iter (+/- 1)
test slice::binary_search_l2                               ... bench:          63 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         152 ns/iter (+/- 12)
test slice::binary_search_l1_with_dups                     ... bench:          36 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups                     ... bench:          64 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups                     ... bench:         153 ns/iter (+/- 6)
```

After:

```
test slice::binary_search_l1                               ... bench:          15 ns/iter (+/- 0)
test slice::binary_search_l2                               ... bench:          23 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         100 ns/iter (+/- 17)
test slice::binary_search_l1_with_dups                     ... bench:          15 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups                     ... bench:          23 ns/iter (+/- 0)
test slice::binary_search_l3_with_dups                     ... bench:          98 ns/iter (+/- 14)
```
@alkis

Contributor

alkis commented Nov 11, 2017

Thanks for the thorough reviews!

@bors

Contributor

bors commented Nov 11, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: bluss
Pushing 24bb4d1 to master...

bors merged commit 2ca111b into rust-lang:master Nov 11, 2017

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
homu: Test successful

alkis added a commit to alkis/superslice-rs that referenced this pull request Nov 30, 2017
