Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lower_bound, upper_bound and equal_range for slices where T: Ord to complement binary_search #2184

Open
alkis opened this issue Oct 20, 2017 · 45 comments
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.

Comments

@alkis
Copy link

alkis commented Oct 20, 2017

Motivation

Fom discussion on rust-lang/rust#45333.

There is demand for lower_bound, upper_bound and equal_range especially for sorted sequences of non-unique elements. In such sequences the aforementioned operations are more interesting and useful than binary_search. The addition of these three ops will complement the support of sorted slices in the standard library.

Proposed APIs

lower_bound
/// Finds the first element in the sorted slice that is _not less_ than `x`.
///
/// Returns the index `a` such that all elements in `self[..a]` are less than `x` and all elements in `self[a..]` are greater or equal to `x`.
///
/// # Examples
/// ```
/// let a = [10, 11, 13, 13, 15];
/// assert_eq!(a.lower_bound(9), 0);
/// assert_eq!(a.lower_bound(10), 0);
/// assert_eq!(a.lower_bound(11), 1);
/// assert_eq!(a.lower_bound(12), 2);
/// assert_eq!(a.lower_bound(13), 2);
/// assert_eq!(a.lower_bound(14), 4);
/// assert_eq!(a.lower_bound(15), 4);
/// assert_eq!(a.lower_bound(16), 5);
/// ```
fn lower_bound(&self, x: &T) -> usize 
where
    T: Ord;

/// `lower_bound` with a comparator.
fn lower_bound_by<'a, F>(&'a self, f: F) -> usize
where
    F: FnMut(&'a T) -> Ordering;

/// `lower_bound` with key extraction.
fn lower_bound_by_key<'a, B, F>(&'a self, b: &B, f: F)  -> usize
where
    B: Ord,
    F: FnMut(&'a T) -> B
upper_bound
/// Finds the first element in the sorted slice that is _greater_ than `x`.
///
/// Returns the index `a` such that all elements in `self[..a]` are less or equal to `x` and all elements in `self[a..]` are greater than `x`.
///
/// # Examples
/// ```
/// let a = [10, 11, 13, 13, 15];
/// assert_eq!(a.upper_bound(9), 0);
/// assert_eq!(a.upper_bound(10), 1);
/// assert_eq!(a.upper_bound(11), 2);
/// assert_eq!(a.upper_bound(12), 2);
/// assert_eq!(a.upper_bound(13), 4);
/// assert_eq!(a.upper_bound(14), 4);
/// assert_eq!(a.upper_bound(15), 5);
/// assert_eq!(a.upper_bound(16), 5);
/// ```
fn upper_bound(&self, x: &T) -> usize
where
    T: Ord;

/// `upper_bound` with a comparator.
fn upper_bound_by<'a, F>(&'a self, f: F) -> usize
where
    F: FnMut(&'a T) -> Ordering;

/// `upper_bound` with key extraction.
fn upper_bound_by_key<'a, B, F>(&'a self, b: &B, f: F)  -> usize
where
    B: Ord,
    F: FnMut(&'a T) -> B
equal_range
/// Finds all elements _equal_ to `x`.
///
/// Returns the `Range` `a..b` such that all elements in `self[..a]` are less than `x`, all elements in `self[a..b]` are equal to `x`, and all elements in `self[b..]` are greater than `x`.
///
/// # Examples
/// ```
/// let a = [10, 11, 13, 13, 15];
///  for i in (9..17) {
///    assert_eq!(a.equal_range(i), (a.lower_bound(i), a.upper_bound(i)));
///  }
/// ```

fn equal_range(&self, x: &T) -> std::ops::Range<usize> 
where
    T: Ord;

/// `equal_range` with a comparator.
fn equal_range_by<'a, F>(&'a self, f: F) -> std::ops::Range<usize> 
where
    F: FnMut(&'a T) -> Ordering;

/// `equal_range` with key extraction.
fn equal_range_by_key<'a, B, F>(&'a self, b: &B, f: F)  -> std::ops::Range<usize> 
where
    B: Ord,
    F: FnMut(&'a T) -> B
@ranma42
Copy link
Contributor

ranma42 commented Oct 20, 2017

Is is more convenient to have an Option return value or to just return the appropriate slice boundary?
There are use cases for both, it is not clear to me which one would be more ergonomic.
Also, how about returning a Range in equal_range*?

@alkis
Copy link
Author

alkis commented Oct 20, 2017

Returning a Range in equal_range sounds great. I updated the original post.

I don't understand the question about Option. What should we be the return types for upper_bound and lower_bound?

@ranma42
Copy link
Contributor

ranma42 commented Oct 20, 2017

Since they are looking for "the first element in the sorted slice that is ..." I would return the length of the slice. This would make it easy to use these function to find the position that would be the insertion point of an element so that the order is preserved (and the element is added as the first/last of a streak of elements that compare as equal).
I am unsure if my explanation is comprehensible... I will check it again tomorrow morning XD

@alkis
Copy link
Author

alkis commented Oct 20, 2017

We return the length of the slice or the position of the lower bound? What happens if the element does not exist (past the end of the slice)? Do you think returning 0 for lower bound and self.len() for upper bound makes sense in that case?

@ranma42
Copy link
Contributor

ranma42 commented Oct 21, 2017

I think 0 would not be a suitable substitute for None if the slice is not empty, because it would identify an element that does not satisfy the property.

For example, [21, 23, 23].lower_bound(26) should not return 0, because 21 is not "the first element in the sorted slice that is not less than" 26. Instead I think returning 3 would be ok, as it would correctly indicate that we could find no element that is not less then 26.

A similar reasoning also applies to upper_bound. This approach could be summarized in the following table:

x lower_bound upper_bound equal_range
20- 0 0 any empty range (ex: Range(0, 0))
21 0 1 Range(0, 1)
22 1 1 any empty range (ex: Range(1, 1))
23 1 3 (or None) Range(1, 3)
24+ 3 (or None) 3 (or None) any empty range (ex: Range(3, 3))

Note that while the current definition of equal_range allows for any empty range to be returned, we might constrain it more and require that it always returns the range that corresponds to the would-be position of the x elements, even when there are none.

@alkis
Copy link
Author

alkis commented Oct 21, 2017

Thanks Andrea. I updated the code above with examples. Let me know if you agree.

equal_range(x) is equivalent to (lower_bound(x), upper_bound(x)) so everything just follows if we nail down what lower_bound and upper_bound return.

@ranma42
Copy link
Contributor

ranma42 commented Oct 21, 2017

Yep, the examples match my expectations :)

Some nits (I think most of these are basically just typos):

  • upper_bound is still marked as returning Some<usize>
  • the descriptions for {lower,upper}_bound still state "If such an element exists it returns Some with the index of such element. Otherwise it returns None.". I would expect something like "If such an element exists it returns the index of such element. Otherwise it returns the length of the slice."
  • the description for equal_range does not match returning (lower_bound(x), upper_bound(x)) (12 does not exist in the example slice, yet the equal_range would not be (0, 0) nor (5, 5)).

Some proposals for an alternative wording of the description (very slice-oriented):

  • lower_bound returns the index a such that all elements in self[..a] are less than x and all elements in self[a..] are greater or equal to x.
  • upper_bound returns the index b such that all elements in self[..b] are less or equal to x and all elements in self[b..] are greater than x.
  • equal_range returns the Range a..b such that all elements in self[..a] are less than x, all elements in self[a..b] are equal to x, and all elements in self[b..] are greater than x.

This wording is more verbose, but it shows why the "element not found" is not actually a special case for these functions and highlights in a very direct way the relationship between equal_range and {lower,upper}_bound.

@alkis
Copy link
Author

alkis commented Oct 21, 2017

I like the new wording. Incorporated.

@scottmcm
Copy link
Member

scottmcm commented Oct 25, 2017

I think "any empty range" is unfortunate, because there's very useful information that throws away.

The other use of equal_range is, with an obviously-placeholder name,

/// Returns the range of positions at which `x` could be inserted such
/// that the sequence would remain sorted.
fn where_to_insert_in_sorted(&self, x: &T) -> RangeInclusive<usize>;

That return value would have the same start and end as equal_range. (C++ avoids this distinction by returning a pair that can be considered as [a,b) for "the items" and [a,b] for "the insertion positions".)

Phrased differently, I think it's important that upper_bound and lower_bound are the .start and .end of the equal_range return value (just calculated a bit more efficiently).

@arthurprs
Copy link

Isn't there any meaningful data to be returned from a "not found" lower_bound? Can't it behave similar to binary_search?

Same for upper_bound.

@alkis
Copy link
Author

alkis commented Oct 25, 2017

@scottmcm equal_range does not return "any empty range". What is returned is defined as your first suggestion.

@arthurprs can you give an example?

@arthurprs
Copy link

@alkis nevermind, that was a confusion moment.

I think lower_bound should return 0 and upper_bound should return slice.len() if no range can be found. Same as you proposed on #2184 (comment)

@alkis
Copy link
Author

alkis commented Oct 25, 2017

@arthurprs this is what it will return given the description. The examples also show these cases.

@arthurprs
Copy link

I see, it's just that fn upper_bound(&self, x: &T) -> Option<usize> still has the Option<> in the result type.

@alkis
Copy link
Author

alkis commented Oct 25, 2017

Fixed.

@scottmcm scottmcm added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Oct 26, 2017
@alkis
Copy link
Author

alkis commented Oct 30, 2017

What's the next step? Does this need to go through a PR submission to the RFC repository?

@bluss
Copy link
Member

bluss commented Nov 4, 2017

Is there a reason these functions don't use Result in their return type like binary_search does. Isn't it sometimes still interesting to know if there is an existing equal element or not?

@alkis
Copy link
Author

alkis commented Nov 4, 2017 via email

@alkis
Copy link
Author

alkis commented Nov 7, 2017

I published the implementation in a crate here: https://crates.io/crates/ordslice. When/if this lands the crate can be deprecated.

@Amanieu
Copy link
Member

Amanieu commented Nov 15, 2017

Could we add the option for the bound to be inclusive or exclusive using Bound? The result would look like this:

fn lower_bound(&self, x: Bound<&T>) -> usize 
where
    T: Ord;

This is similar to the API I used for RBTree in intrusive-collections.

@alkis
Copy link
Author

alkis commented Nov 15, 2017

Do you have an idea of the performance cost of such a change? Also do you have examples of using such API in practice that either allow use-cases that are not possible otherwise and/or more readable code?

@arthurprs
Copy link

Can't exclusive bounds be emulated by altering the comparison fn?

@sicking
Copy link

sicking commented Jul 12, 2018

the need for something like lower_bound/upper_bound in a number of situations. What I've ended up doing is to use binary_seach_by and using a closure which only returns Greater or Less. For example in this function in rand.

The main thing that was awkward with that solution, and which still feels awkward in the lower_bound_by/upper_bound_by API is the use of Ordering as the closure return type. In part because the enum names take up about half of the closure function.

But for lower_bound_by/upper_bound_by also because returning Equal effectively treated the same as either Less or Greater. I.e. lower_bound_by/upper_bound_by effectively want a binary answer, but uses a three-value enum to express that.

I think the lower_bound_by/upper_bound_by API would be a lot more pleasant to use if the closure returned a bool instead. That way you could write myslice.lower_bound_by(|item| item.foo <= 20).

The only problem is that this doesn't work well with equal_range_by, which would need to keep using Ordering.

@alkis
Copy link
Author

alkis commented Jul 12, 2018

You can find lower_bound, upper_bound, and equal_range in https://docs.rs/superslice/0.1.0/superslice/.

@sicking
Copy link

sicking commented Jul 12, 2018

I was hoping to move forward the discussion so that maybe these functions can be added to the standard library. But happy to do that elsewhere if that's preferred?

@sicking
Copy link

sicking commented Jul 20, 2018

For equal_range, I think it'd be more useful if you could provide a range of numbers to be searched for in the slice. Similar to BTreeMap::range. So something like:

let vec = vec![3, 5, 6, 6, 8, 10, 12];
assert_eq!(vec.search_something_range(5..9), (1, 5));
assert_eq!(vec.search_something_range(5..10), (1, 5));
assert_eq!(vec.search_something_range(5..=10), (1, 6));
assert_eq!(vec.search_something_range((Exclusive(6), Inclusive(20)), (4, 7));

It'd also be cool if you could do

let vec = vec![3, 5, 6, 6, 8, 10, 12];
assert_eq!(vec.search_something_range(6), (2, 4));

However that'd mean we couldn't use the RangeBounds trait.

@sicking
Copy link

sicking commented Jul 27, 2018

FWIW, if there's interest in having this use cases discussed in this RFC directly addressed in stdlib, independent of what API is used to do it, then I suggest speaking up. The PR in rust-lang/rust#52530 got rejected because it wasn't clear that the usecase needed addressing.

@ghost
Copy link

ghost commented Aug 16, 2018

I would be really happy to have that in stdlib

@spacejam
Copy link

spacejam commented Nov 21, 2018

I am hitting this issue in sled, and finding myself writing a few wrappers around standard library functionality that do not feel obviously correct (similar to rust-lang/rust#52529). It's important for me to quickly locate these bounds, both exclusively and inclusively, to enable various tree search and iteration features.

@matklad
Copy link
Member

matklad commented Feb 22, 2019

Could we add the option for the bound to be inclusive or exclusive using Bound

Why this would be useful? What does xs.lower_bound(Bound::Unbounded) mean, for example? In general, I find it hard to understand at a glance what is the difference between xs.lower_bound(Bound::Exclusive(&92)) and xs.lower_bound(Bound::Inclusive(&92)). std::collections::BTreeMap only uses Bound with range.

I rather strongly feel that the proposed signature of

fn (lower|upper)_bound(&self, x: &T) -> usize 
where
    T: Ord;

is "canonical". For a given x, there are number of possible insertion points for x in the underlying slice, and lower / upper return smallest/greatest one respectively. Adding Result or Bound into the mix seem to only complicate the API? It seems like we should just add these methods to std already :)

I am not so sure about equal_range:

  • while I've used lower/upper bound in a number of projects, I personally haven't found a use for equal_range, which could be an indication that this is not a super-useful method after all.
  • I imagine the Bounds API might be better for this case?
pub fn binary_search_range<K, R>(&self, range: R) -> std::ops::Range<usize> where
    K: Ord + ?Sized,
    R: RangeBounds<K>,
    T: Borrow<K>,

@Amanieu
Copy link
Member

Amanieu commented Feb 22, 2019

Why this would be useful? What does xs.lower_bound(Bound::Unbounded) mean, for example?

It would always return 0 (the first element).

In general, I find it hard to understand at a glance what is the difference between xs.lower_bound(Bound::Exclusive(&92)) and xs.lower_bound(Bound::Inclusive(&92)).

If the slice does not contain 92, then they are equivalent. Otherwise, Inclusive will point to the first instance of 92 in the slice (there can be more than one), and Exclusive will point to the element after the last 92.

@matklad
Copy link
Member

matklad commented Feb 22, 2019

So, lower_bound(Exclusive(&x)) is the same as upper_bound(Inclusive(&x))? And vice verse?

10   92  92  92  100
     ^           ^
     LI          LE
     UE          UI

If this is the case, looks like Bound does not by us any expressivity? EDIT: for "point" queries. I 100% see usefulness of Bound for "range" queries, where the return type is not usize, but a range/slice.

@Ixrec
Copy link
Contributor

Ixrec commented Feb 22, 2019

I haven't been following this thread, but having the "lower bound" of 92 ever be above all the 92s seems like we've either lost track of what "lower" means, or these are really bad names for whatever operations are being proposed.

Assuming these are the operations the names imply they are, and we do want Bounds that also mean what their names imply, wouldn't the expected behavior be more like this?

10  92  92  92  100
^   ^       ^   ^
LE  LI      UI  UE

@Amanieu
Copy link
Member

Amanieu commented Feb 22, 2019

I haven't been following this thread, but having the "lower bound" of 92 ever be above all the 92s seems like we've either lost track of what "lower" means, or these are really bad names for whatever operations are being proposed.

You need to think of the bounds in terms of a range of values that you are interested in. So lower_bound means "the range of values that I am interested in starts at 92 (inclusive/exclusive depending on the bound), give me the index of the first value in that range".

The definition for upper_bound is similar: "the range of values that I am interested in ends at 173 (inclusive/exclusive), give me the index of the value in that range".

@Ixrec
Copy link
Contributor

Ixrec commented Feb 22, 2019

Interesting. I think that means the names are off. lower_bound(x) seems like it'd mean the lower bound of values equal to x. If what we want is the first x value, then we should just call the method something obvious like first(x).

But "upper bound"/"upper bound (inclusive)"/"lower bound (exclusive)" apparently means "the first non-x value after an x value" which is a super weird concept I don't know of a concise name for. Hm...

@spacejam
Copy link

For a couple concrete "point" query examples I'm working with, consider a B+ tree where pointers to children are stored in a sorted slice as (separator_key, child_node) where the separator_key is the lowest inclusive bound of values that may be present in that node. While traversing the tree, we need to find the child with the highest separator_key that is less than or equal to the key being searched for in the tree.

In general, I find all of the below use cases useful for working with versioned trees:

  • find the highest value that is less than or equal to a key (B+ tree child separator keys while trying to find a leaf node that may contain a key)
  • find the highest value that is less than a key (get key+value before a given key, useful for GC in MVCC and reverse iterators in unidirectionally-linked trees)
  • find the lowest value that is greater than or equal to a key (build simple iterators)
  • find the lowest value that is greater than a key (get next value, useful for MVCC)

All of the above can be expressed with iterators, but that prevents their compatibility with binary_search*.

@matklad
Copy link
Member

matklad commented Feb 22, 2019

I believe that this is expressible with bound-less methods like this:

xs[..xs.upper_bound(key)].last() highest less or equal
xs[..xs.lower_bound(key)].last() highest less
xs.get(xs.lower_bound(key))      lowest greater or equal
xs.get(xs.upper_bound(key))      lowest greater

The first two seem trickier than ideal, but I don't think they can be better expressed with Bound semantics of #2184 (comment) (which I believe is what @Amanieu proposes and which coinsides with my intuition). The reason is that adding Bound does not change the set of indices you can get, you still need to -1 somehow.

The version of @Ixrec (#2184 (comment)) does make this easier though.

On the meta level, if we can come up with two semantics for Bound which are differ by 1, that probably means that Bound would be a problematic API.

@Amanieu
Copy link
Member

Amanieu commented Feb 22, 2019

It's not just a simple -1. If you have multiple identical values then the difference between an inclusive and exclusive lower bound can be more than 1 element:

10  92  92  92  100
    ^           ^
    LI          LE

@matklad
Copy link
Member

matklad commented Feb 22, 2019

@Amanieu sorry, I wasn't clear, the -1 referred to the difference between #2184 (comment) and #2184 (comment). That is, it's the difference between semantics one might thing Bound has, not the difference between LI and LE (which can be arbitrary in one semantics and zero or one in another semantics) .

@avl
Copy link

avl commented Sep 18, 2019

It would be nice if this could move forward somehow. I was surprised that there was a binary_search in std, but that it apparently selects an element at random if there are multiple elements with the same value.

@VillSnow
Copy link

VillSnow commented Jun 18, 2020

C++ has partition_point. It has clear definition like

let i = partition_point(slice, pred);
for x in &slice[..i] { assert!(  pred(x) ); }
for x in &slice[i..] { assert!( !pred(x) ); }

I can use this when I cannot remember the definition of lower/upper_bound :)
It is basic and primitive edition of lower/upper_bound and I want it in Rust rather than lower/upper_bound. (both is great)

@matklad
Copy link
Member

matklad commented Jun 19, 2020

@VillSnow this sounds like a great idea, as this is indeed more general than binary search, and has trivial design space. Would you like to send a PR to libcore to add this? I am not on the libs team, but seems like this API can be accepted without too much deliberation:

impl<T> [T] {
  fn partition_point<T, P>(&self, mut pred: P) -> usize
  where
      P: FnMut(&T) -> bool,
}

As basic check, rust also use partition and partition_in_place, so cargo-culting the name from C++ seems fine.

@VillSnow
Copy link

VillSnow commented Jun 21, 2020

@matklad
It would be my first PR in my life. How can I send a PR?
I cloned rust-lang/rust, passed x.py check, implement my partition_point and passed x.py check again. ( and accidentally my Linux machine downs so it tooks a bit time to recover :_( done)

my understanding is; first make a issue "Traking issue for partition_point", set issue number of unstable attribute and send PR?
How to set feature of unstable attribute? Can I make as I like?

VillSnow/rust@4c8ce48

Manishearth added a commit to Manishearth/rust that referenced this issue Jun 28, 2020
Add partition_point

Add partition_point in C++.
Although existing binary_search in rust does not suitable when the slice has multiple hits,
this function returns exact point of partition.
The definition of this function is very clear and able to accept general matter, therefore you can easily get index which you want like lower/upper_bound.

rust-lang/rfcs#2184
@kartva
Copy link

kartva commented Jun 3, 2021

Maybe this issue can be closed, now that this has merged?

@matklad
Copy link
Member

matklad commented Oct 22, 2022

I think equal_range is the next thing we can obviously just do and perhaps also should do. It's not as frequently needed as partition point, but:

  • it is needed sometimes (here's, eg, polyfil in rust-analyzer).
  • its a general-purpose utility with obvious implementation and not really any tradeoffs
  • "perfect" implementation with no duplicate binary searches and bounds checks, is tricky.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

No branches or pull requests