Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved Binary Search (API Change Proposal) #81

Closed
liborty opened this issue Aug 9, 2022 · 54 comments
Closed

Improved Binary Search (API Change Proposal) #81

liborty opened this issue Aug 9, 2022 · 54 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@liborty
Copy link

liborty commented Aug 9, 2022

Improved Binary Search

The problem

Quoting from: primitive.slice:
"If there are multiple matches, then any one of the matches could be returned. The index is chosen deterministically, but is subject to change in future versions of Rust. If the value is not found then Result::Err is returned, containing the index where a matching element could be inserted while maintaining sorted order."

Some of these issues are since to some extent addressed by partition_point with a different functionality.

Motivation, use-cases

  • A typical application is efficient removal of duplicates of a particular value. For that, we need to find all (partially) equal items.
  • In probability/statistics, we often want to count the number of occurrences of a particular event/value, intermixed amongst many other values/measurements within a sample. Then the solution is to sort the sample and then deploy my improved binary search, where range.end-range.start gives the answer. We can then keep the sorted list and possibly test for some other events/values in the same way.
  • Suppose you have a matrix/spreadsheet/database and instead of the usual retrieval of a value for some given coordinate(s), you need to solve the inverse problem, that is to find all the (pairs of) coordinates of where a particular value occurs.

Solution

The proposed solution has the following new benefits:

  • It only requires PartialOrd instead of Ord.
  • It works on both ascending and descending orders.
  • It only searches the specified input Range. This will often be quicker.
  • The data can come from any source in address (index) space of type T and is only sampled (lazily) at the binary search points.
  • The existing std binary_search varieties use Err mechanism to deliver a genuine sort-order result, which is arguably not quite right. This is corrected. No errors are returned.
  • It solves the ambiguity about which (of possibly several) match positions is returned. The result is now unambiguous: output range from the sort position of the first occurrence of the sought item till the last. Empty range n..n, means that the item was not found and n is its insert position. This mechanism is logically consistent and avoids doing Error processing of useful results.

Links and related work

I have implemented the proposal as function binary_find. It can be presently seen in my crate indxvec.

@liborty liborty added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Aug 9, 2022
@the8472
Copy link
Member

the8472 commented Aug 9, 2022

A typical application is efficient removal of duplicates. For that, we need to find all (partially) equal items, not just one random equal item.

Can yo provide more details? Like how that isn't covered by sorting (which has to be done anyway for a binary search) and then using Vec::dedup?

It works on both ascending and descending orders,

I don't see how this is necessary for deduplication.

It solves the ambiguity hinted at in the above quote from std documentation.

But so does partition_point?

@liborty
Copy link
Author

liborty commented Aug 9, 2022

Thanks for your perceptive questions.

Yes, it is true that Vec::dedup will, in principle, solve the outlined example application. In practice, my proposal will often make it more efficient. For example, when we want to remove only duplicates of a specific value (possibly from lots of lists). Then it is wasteful to search linearly all of them for all the duplicates. We might not even want to remove the other duplicates at all.
Sorry, I should have written 'duplicates of some value' instead of just 'duplicates', implying all duplicates of all values, which is a different thing. I hear you say: "ok, so I pass a closure to dedup_with that tests for equality of only that value." Yes but then, to my knowledge, it will not deploy binary search.

Different orders may not be necessary for deduplication but when you already have a sort of a very long list, you do not want to have to reverse it just because your binary search can cope with only one direction. Also, there are many other applications for a properly implemented general binary search.

Yes, I acknowledge that the partition_point also solves the Err returns but, then again, it has somewhat different functionality. I actually started by looking at binary_search, since it is there and I found it inadequate for my needs. Initially, I did not even suspect that there might be another method with an unrelated name which solves similar problem somewhat better.
Yes, I should have studied the documentation more thoroughly but I offer this as an argument for implementing binary_search properly, precisely because it is there and that is what people use first.

@the8472
Copy link
Member

the8472 commented Aug 9, 2022

but when you already have a sort of a very long list, you do not want to have to reverse it just because your binary search can cope with only one direction.

Or you can use Reverse:

    let slice = &[100usize, 80, 77, 4, 1];
    assert_eq!(Ok(1), slice.binary_search_by_key(&Reverse(&80), |e| Reverse(e)));

@liborty
Copy link
Author

liborty commented Aug 9, 2022

but when you already have a sort of a very long list, you do not want to have to reverse it just because your binary search can cope with only one direction.

Or you can use Reverse:

    let slice = &[100usize, 80, 77, 4, 1];
    assert_eq!(Ok(1), slice.binary_search_by_key(&Reverse(&80), |e| Reverse(e)));

Yes, or just use a complement index: slice.len()-idx. When you have several occurrences, it will also reverse your resulting range. I am not disputing that there are alternative ways of programming anything with anything :)
The point I would make is that if you try this with plain binary_search, which returns a random occurrence, you will end up in quite a mess.

@scottmcm
Copy link
Member

scottmcm commented Aug 9, 2022

I'll cc rust-lang/rfcs#2184, where something like this has been discussed (inspired by https://en.cppreference.com/w/cpp/algorithm/equal_range) for almost 5 years now.

I think the -> Range<usize> version of binary search makes sense to offer. I don't know how to spell its name, though.

@liborty
Copy link
Author

liborty commented Aug 9, 2022

Yes, if the existing binary_search is to remain, then we need a new short and distinguished name. How about: binary_find ?

As to the functionality, there is an argument for keeping that simple too, as per my suggestion. That is, using implied PartialOrd instead of passing in a full blown comparator closure. After all, as has been rightly pointed out, the partition_point can be bent to that purpose.

PartialOrd is already an improvement in generality on the Ord of the existing binary_search.

Regards the above discussion about working on both ascending and descending orders: yes, that is to some extent optional but I have included it because it is practically a zero cost exercise ( a single test of a bool ).

I note that if you accidentally use the wrong order, the existing binary_search gives misleading results.

I am going to detect the order automatically, for convenience and safety.

@scottmcm
Copy link
Member

scottmcm commented Aug 10, 2022

I'm not on libs-api, but I'd be very surprised if a bool like that for the different orders got in.

.binary_search_range(x, true) is neither readable nor rememberable, and while .binary_search_range(x, Order::Ascending) is readable, it's also super-verbose and the extra enum doesn't really pull its weight.

See also how there's no bool on sort for which way you want it sorted.


What you might consider instead -- as a separate ACP! -- would be something to make it easier to use cmp::Reverse to handle this situation.

For example, if there was a method to go &[T] -> &[cmp::Reverse<T>], then you'd have things like

[5, 4, 2, 1].bikeshed_this().binary_search(Reverse(3)) // Err(2)
[1, 4, 2, 3].bikeshed_this_mut().sort() // [4, 3, 2, 1]

@liborty
Copy link
Author

liborty commented Aug 10, 2022

See my above edit.
I propose .binary_find(x) with order detected automatically.

PS: there is bool on my implementations of mergesort and hashsort but that is another story.
I see ascending and descending orders as essentially equivalent and no reason to implicitly assume any one of them.

@shepmaster
Copy link
Member

You could go for a builder style API. Hand-wavy and without real names...

a_slice.binary_finder().first()
a_slice.binary_finder().last()
a_slice.binary_finder().any()
a_slice.binary_finder().range()

@liborty
Copy link
Author

liborty commented Aug 11, 2022

Your suggestion can be used to unite my proposal, let us call it .binary_find(), with the existing binary_search().

All of those fields can be trivially obtained from my binary_find returned range.
Except any, which is returned by the current binary_search.

@programmerjake
Copy link
Member

imho it'd be useful to be able to binary search on non-slice things, e.g. a custom data type that is indexable, but data is not contiguous in memory.

@liborty
Copy link
Author

liborty commented Aug 11, 2022

I am offering also my .binsearch_indexed(), which works much in the same way. The name can be changed.

@liborty
Copy link
Author

liborty commented Aug 13, 2022

It works like this (self is a slice of any type &[T]):

    /// Binary search of an index sorted slice in ascending or descending order. 
    /// Like `binsearch` but using indirection via sort index `idx`.
    fn binsearch_indexed(self, idx:&[usize], val:&T) -> Range<usize> 
        where T: PartialOrd;

Where idx is obtained by:

    /// Makes a sort index for self, using key generating closure `keyfn`
    fn keyindex(self, keyfn:fn(&T)->f64, ascending:bool) -> Vec<usize>;

The beauty of it is that T can be any arbitrarily complex user defined type but as long as you supply the keyfn (closure), the index is built using my super-fast hashsort (over f64 keys). I could subsume keyindex within binarysearch_indexed but I find that having the explicit index available is often useful for other purposes as well.

When calling keyindex, the closure can capture any additional variables it might need. I use it in rstats crate to sort whole multidimensional vectors (as T), with keys derived as any (scalar) function on the vectors, such as their distance from some point.
Where the distance constitutes a PartialOrd for the purpose of the search.

@programmerjake
Copy link
Member

It works like this (self is a slice of any type &[T]):

    /// Binary search of an index sorted slice in ascending or descending order. 
    /// Like `binsearch` but using indirection via sort index `idx`.
    fn binsearch_indexed(self, idx:&[usize], val:&T) -> Range<usize> 
        where T: PartialOrd;

that's nice, but it doesn't actually do what imho is needed, which is more like:

// TODO: pick better name
// TODO: add versions with u32, u64, and u128 instead of usize -- needed
// because you might be searching on disk or over the network rather
// than in-memory and usize can be too small.
// TODO: switch to using Try trait rather than Result.

// Note how this needs no slice to be constructed in order to
// search, not everything that can be searched fits in memory or
// wants to pay the up-front cost of building a slice, e.g. if you're
// searching in something with 400 trillion entries, nearly no one
// has that much memory to construct a slice of, nor wants to
// spend hours filling memory with indexes before it can conduct
// a simple <1s search.
pub fn try_bin_search_algo<E>(mut search_in: Range<usize>, mut cmp_at: impl FnMut(usize) -> Result<Ordering, E>) -> Result<Result<usize, usize>, E> {
    // hope i got this right
    while !search_in.is_empty() {
        let mid = (search_in.end - search_in.start) / 2 + search_in.start;
        match cmp_at(mid)? {
            Ordering::Less => search_in.end = mid,
            Ordering::Greater => search_in.start = mid + 1,
            Ordering::Equal => return Ok(Ok(mid)),
        }
    }
    Ok(Err(search_in.start))
}

@liborty
Copy link
Author

liborty commented Aug 14, 2022

So, you envisage combining this with a random disk/internet access? I have not thought that far.
Does that not warrant a separate method?

PS. I think you also need to check that the sought value is actually inside the range, i.e be more careful about the Err value.

@programmerjake
Copy link
Member

So, you envisage combining this with a random disk/internet access?

yes, or something else other than just an index-into-slice-and-compare, e.g.:

// compute sqrt(a) via binary search.
try_bin_search_algo(0..(1 << 32), |v| Ok(if Some(v2) = v.checked_mul(v) { a.cmp(&v2) } else { Ordering::Less }))

I have not thought that far. Does that not warrant a separate method?

probably...

@liborty
Copy link
Author

liborty commented Aug 14, 2022

Yes, like I said, I can subsume the keyindex method and just pass the closure to the binary search directly.
You are quite right that doing the comparisons with a closure probe 'lazily', saves time and space for generating all the key values. Thank you for that!

@cuviper
Copy link
Member

cuviper commented Aug 14, 2022

It seems to me that there is enough design space left that this isn't ready for the standard library yet.

@programmerjake
Copy link
Member

This makes me think that Rust needs something in the standard library like C++'s <algorithms> header, where we can put binary search, sorting algorithms (which also need a version that doesn't need a slice to sort), and more.

Maybe core::algo::*?

@CAD97
Copy link

CAD97 commented Aug 18, 2022

I decided against passing in a comparator closure, as the same effect can be achieved, imho in a cleaner way, by implementing PartialOrd for your custom data type T

Just to note, this isn't always possible. E.g., in string interners, it's really nice to store

type Span = Range<u32>;
type Symbol = NonZeroU32;
pub struct Interner<S> {
    hasher: S,
    string_to_symbol: HashSet<Symbol, ()>,
    symbol_to_span: Vec<Span>,
    span_to_string: String,
}

with then hash/comparison being of &str <=> &span_to_string[symbol_to_span[symbol]]. If the key actually stored in the data structure you're searching is a key and not an index, it cannot implement Partial/Ord/Eq directly; it needs to store an actual reference back to the shared resource. And in a case like this, that's not even really possible, since the String can be appended to and reallocate, invalidating extant references.

This is just what I'm the most familiar with, but this pattern of raw_entry usage to do "dependency injection" for comparisons is a surprisingly commonly available optimization once you know how to look for it.

@liborty
Copy link
Author

liborty commented Aug 18, 2022

I struggle to understand this.
Given that you are being asked to implement PartialOrd on your own custom type T,
how is it not possible to include the same struct fields in T?

@CAD97
Copy link

CAD97 commented Aug 18, 2022

See https://github.com/CAD97/strena as an example. The HashMap raw_entry API allows you to provide dependency injected fn hash(&K) -> u64 and fn eq(&Q, &K) for lookup.

@liborty
Copy link
Author

liborty commented Aug 18, 2022

Then why not just do your binary search over the symbols?

@CAD97
Copy link

CAD97 commented Aug 19, 2022

Because the symbols aren't sorted intrinsically. They're indexes, sorted by the string which they reference, which to recover you need to provide outside data. ex.:

impl Interner {
    pub fn get(&self, s: &str) -> Option<Symbol> {
        let string_to_symbol: &[Symbol] = &self.string_to_symbol;
        let symbol_to_span: &[Span] = &self.symbol_to_span;
        let span_to_string: &str = &self.span_to_string;

        let range = binary_find(0..string_to_symbol.len(), |ix| {
            span_to_string[symbol_to_span[string_to_symbol[ix]].clone()].cmp(s)
        });
        Some(range)
            .filter(|range| !range.is_empty())
            .map(|range| string_to_symbol[range.start])
    }
}
adjusted implementation
pub fn binary_find(range: Range<usize>, cmp: impl Fn(usize) -> Ordering) -> Range<usize> {
    use Ordering::*;

    // `last` finds the exclusive end of the range of the matching items
    let last = |ix: usize| -> usize {
        (ix + 1..range.end)
            .find(|&i| cmp(i) == Greater)
            .unwrap_or(range.end)
    };
    // `first` finds the inclusive start of the range of the matching items
    let first = |ix: usize| -> usize {
        (range.start..ix)
            .rfind(|&i| cmp(i) == Less)
            .map_or(range.start, |i| i + 1)
    };

    // Checking for errors, special cases and order
    if range.is_empty() {
        return range;
    };
    match (cmp(range.start), cmp(range.end - 1)) {
        // searchee is before range.start
        (Greater, _) => return range.start..range.start,
        // searchee is beyond range.end
        (_, Less) => return range.end..range.end,
        // search range is all equal
        (Equal, Equal) => return range,
        // searchee at range.start
        (Equal, _) => return range.start..last(range.start),
        // searchee at range.end
        (_, Equal) => return first(range.end - 1)..range.end,
        _ => (),
    }

    // Clean binary search
    let mut hi = range.end - 1; // initial high index
    let mut lo = range.start; // initial low index
    loop {
        let mid = lo + (hi - lo) / 2; // binary chop here with truncation
        if mid > lo {
            // still some range left
            match cmp(mid) {
                // still too low
                Less => lo = mid,
                // still too high
                Greater => hi = mid,
                // found it!
                Equal => return first(mid)..last(mid),
            }
        } else {
            // interval is exhausted, searchee not found
            return hi..hi;
        };
    }
}

@liborty
Copy link
Author

liborty commented Aug 20, 2022

Yes but now every cmp closure has to check for the descending order on every invocation, which is not practical. If you can solve that problem, I will be happy to adopt your proposal.

Also, we can not return an empty range when the searchee is outside it, as the empty range is reserved for indicating the insert order of a missing searchee within the range. However, this is not such a problem and if you consult my latest code, you will see that I have solved it already by returning the range limit in the direction of the outlying index as Err(limit).

I like the probe closure being explicit. I think it makes the code much easier to follow than burying it, the searchee and everything, inside the opaque captured environment of cmp. This seems just as bad as any global variables ever were. Any reason why we can not have both probe and cmp?

@CAD97
Copy link

CAD97 commented Aug 20, 2022

Yes but now every cmp closure has to check for the descending order on every invocation

No, because you just prestore that information in the closure the same way you're doing currently. Or use cmp::Reverse, or whatever; Rust's sorting APIs do not assume which order you want, they require you to put in the correct ordering.

If anything, this is better, because with this version you can pass in a static order whereas your code always does a dynamic comparison ordering.

Also, we can not return an empty range when the searchee is outside it, as the empty range is reserved for indicating the insert order of a missing searchee within the range.

If I return begin..begin, that means that the insertion position is before the first element, right? Same for end..end is after the end element. If you want to know if the element is strictly contained in the range, this isn't the API that's serving that purpose.

Though that does mean the end..end return at the end is wrong. (I mostly just copied your impl, assuming it good, fixing up types.) It should be mid..mid, where mid is the insertion point to retain a sorted order.

I like the probe closure being explicit. I think it makes the code much easier to follow than burying it, the searchee and everything inside the opaque cmp captured environment. This seems just as bad as any global variables ever were. Any reason why we can not have both probe and cmp?

I'm going to turn this question around. What benefit is there to having a separate index and compare step? Having it in one step saves a (potentially) dynamic call. Rust likes its stateful closures; in fact, this should probably even take FnMut.

Having a unified compare does have benefits, in that it (potentially) lowers monomorphization pressure, but more importantly, your API requires the key to be sized, copy, 'static (because you extract the key by value fn(usize) -> T) whereas using a single cmp closure allows the indexed values to be ephemeral to the comparison closure, by reference or by value.


Also, always indexing in u128 is going to be a big problem. u128 is emulated and slow on plenty of platforms, and you physically can't have more than isize::MAX district items anyway.

We probably need to get the bounds by continuing the binary search as well, rather than falling back to a linear probe. Or at a minimum use some sort of exponential probe to avoid the worst case performance of [0, [1; BIG]..., 2].

@CAD97
Copy link

CAD97 commented Aug 21, 2022

Actually, begin..begin means that the insertion position is at the first element.

AIUI, given a sorted two-element slice [ a , b ] and binary searching for x, the correct semantics are

0..0 => [ x , a , b ].is_sorted()
1..1 => [ a , x , b ].is_sorted()
2..2 => [ a , b , x ].is_sorted()
0..1 => x == a && x < b
1..2 => a < x && x == b
0..2 => x == a && x == b

and these are all possible outcomes of a binary search on a sorted list. Where do we disagree?

I suppose, if this is a subslice, insertion at the edges may not maintain the entire slice being sorted. However, this is still a determinable output condition, and there's no reason to assume that binary searching a subslice that your item is not in will give you a position that keeps the superslice still sorted; the correct answer is to say inserting at the edge would keep the input slice range sorted.

I think the more important distinction, though, is that "insertion at an element" is meaningless. You insert between elements. The range 0..0 is an empty range which is before the first element, as is observed with Vec::splice.

When I say "insertion position is before the first element," I do mean the position directly before the first element.


I'm also going to repeat my counterquery for visibility:

What benefit is there to having a separate index and compare step?

Because the benefit of not is that a separate index step really wants a signature of fn index(&mut self, ix: usize) -> Self::Key<'_>; that is, return a key which can optionally borrow from the container and only needs to be valid until the next call to index. (Think "random access lending iterator".)

@liborty
Copy link
Author

liborty commented Aug 21, 2022

0..0 => [ x , a , b ].is_sorted()
...
and these are all possible outcomes of a binary search on a sorted list. Where do we disagree?

Ah, I see what is going on! I was referring to the subscripts of the input range, as supplied. Assuming that any insertion process would on seing 0..0 insert x in position 0 and shift the rest one up. Whereas you are assuming that this has already been done and your subscripts are referring to the new slice. So, we do not disagree.

On the count of having an all-in-one comparator, I think you persuade me about the benefits.

I accept that the current functions/methods implicitly assume that everything is in ascending order but that seems to me an unwarranted assumption. In any case, I would like to relax it. I think that having both orders automatically recognised and correctly acted upon has benefits in generality and code robustness. Presumably, this will now have to be done externally to binary_find. Where an if statement similar to mine will be issuing two different calls on the binary_find, with two different codes for the cmp closure, one involving Reverse? Seems more demanding on the user but, as you say, not everyone will necessarily need that.

Thank you for your clarifications, I find them very useful.

@CAD97
Copy link

CAD97 commented Aug 21, 2022

Presumably, this will now have to be done externally, issuing two different calls on the binary_find

No, one call is sufficient; just do a let cmp = if ascending { |x| key.cmp(x) } else { |x| key.cmp(x).reverse() } like indxvec currently does internally, or let cmp = |x| { let c = key.cmp(x); if ascending { c } else { c.reverse() } }. (The latter might have better inlining behavior.)

I accept that the current functions/methods implicitly assume that everything is in ascending order but that seems to me an unwarranted assumption.

For intrinsic ordering, I might agree with you. But when injecting a comparator/ordering, it is absolutely reasonable to assume that the ordering is the same that the ordered input is ordered by.

Whereas you are assuming that this has already been done and your subscripts are referring to the new slice.

... no? Given index..index, I should do vec.insert(index, item) to insert it and maintain the sort order. Yes, that results in vec[index] being item, but this is because we inserted it at that index. The index is both the insertion index and the index after insertion, because that's just restating the same thing twice.

@liborty
Copy link
Author

liborty commented Aug 22, 2022

Thanks to your excellent suggestions we are making progress. I have now implemented all of them apart from replacing also the linear search of matching items with binary search. In a few extreme cases it will be much quicker but also, in many usual cases where there are only a few matching items, it will be slower. If there is a feeling that it is worth it, I can certainly do that: basically a recursive call with a comparator that looks at two adjacent items to find the edge cases.

The biggest latest improvement I made is generic search Range, so that the user can choose what index type suits best: usize, u128, or anything else (satisfying the generic trait bounds). Here is an example application to root finding, using f64:

use core::ops::Range;
#[test]
fn roottest() { 
    let num:f64 = 1234567890.0;
    let root:f64 = 5.3;
    let range = broot(num,root);
    println!("{} to the power of {YL}1/{}{UN}\nbinary_find:\t{} error: {}\npowf:\t\t{} error: {}", 
    num.yl(), root, range.start.gr(), (num-range.start.powf(5.3)).gr(),
    num.powf(1./root).gr(),(num-num.powf(1./root).powf(root)).gr()); 
}

///num.powf(1./root) using binary search
fn broot(num:f64,root:f64) -> Range<f64> {
    binary_find(1_f64..num,
        |&probe| { 
            if probe.powf(root) < num { Less }
            else if probe.powf(root) > num { Greater }
            else { Equal }})
}
running 1 test
1234567890 to the power of 1/5.3
binary_find:    51.92543988913025 error: -0.000000476837158203125
powf:           51.92543988913026 error: -0.0000011920928955078125
test roottest ... ok

@CAD97
Copy link

CAD97 commented Aug 22, 2022

In std, the range index type would probably be bound by Step1. This can theoretically even be accomplished exclusively with step_unchecked and without cloning the range keys, though I don't know what the performance impact of doing so would be; std would probably want to specialize on Copy to use the simple let mid = step_halfway(lo, hi) approach.

However, I'm bowing out on design here. I have strong opinions on the proper dependency injection API, but less so on the rest of the details.

(Though I do want to note that in parallel to raw_entry, perhaps this sh/could be named raw_binary_search.)

Footnotes

  1. Step is surprisingly close to C++'s named requirement LegacyRandomAccessIterator / concept random_access_iterator, minus the dereference for item access of course. (Note C++ copy constructors are Rust Clone.)

@programmerjake
Copy link
Member

more than just Step would be needed to search in 0..!0u128, since that exceeds the range of usize by a large factor. you'd probably want:

trait StepHalfway: Step {
    /// returns `(self + other) / 2` but without overflowing.
    fn halfway(&self, other: &Self) -> Self;
}

@programmerjake
Copy link
Member

programmerjake commented Aug 22, 2022

also it shouldn't be limited to just integer types, e.g. binary searching a list of SHA1 or SHA256 hashes would be very useful for git stuff. (admittedly that could be done with u128 indexes searching in a ordered list on disk, but >128-bit indexes would still be useful).

other examples: searching for the 10^600-th prime by binary searching on PrimePi, this would use BigInt indexes.

@CAD97
Copy link

CAD97 commented Aug 22, 2022

Step eventually should be able to be stabilized and implementable for any type that fits its interface.

@scottmcm
Copy link
Member

scottmcm commented Aug 22, 2022

As a meta-comment, this thread feels very much like what I'd expect in a crate design discussion. That's not saying that these ideas are bad, but the super-general version of all this feels like something that belongs in a crate, not in core, since it's far from the plausible original "hey, can we just get one method on slice that's a bit better for these cases".

Especially if it's already implemented in a crate, then people can just use that crate today, as opposed to the many months (at least) it'll take before they could use it in core.

@liborty
Copy link
Author

liborty commented Aug 22, 2022

I must be missing something in this discussion of step because binary_find works just fine on u128 already, since several versions back. Also on f64 which is not step, as you can see from my very example above your posts. For this reason, Range indexing is internally avoided. Range is only used to determine the limits of the search. (It is too easy in Rust to overcomplicate matters unnecessarily). I only ever need to add/subtract 1 and divide by 2. With this, I believe I am still within the scope of: "hey, can we just get one method on slice that's a bit better for these cases".

This is the signature:

pub fn binary_find<T,F>(range: Range<T>,cmpr: F ) -> Range<T>
    where T: PartialOrd+Copy+Add<Output=T>+Sub<Output=T>+Div<Output=T>+From::<u8>,
          F: Fn(&T)->Ordering {

Also, I miss the purpose of making the comparator closure FnMut rather than just Fn. Could anyone who thinks it is needed please give a simple example of how it might be actually useful? Would you want to change what you are looking for in the middle of the search? I think not.

@programmerjake
Copy link
Member

some simple reasons to have it be FnMut: reusing a http connection, or where you need to cache partial results.

@CAD97
Copy link

CAD97 commented Aug 23, 2022

I only ever need to add/subtract 1 and divide by 2. With this, I believe I am still within the scope of: "hey, can we just get one method on slice that's a bit better for these cases".

This is the signature:

pub fn binary_find<K, F>(range: Range<K>, cmp: F) -> Range<K>
where
    K: PartialOrd + Copy + Add<Output=K> + Sub<Output=K> + Div<Output=K> + From::<u8>,
    F: Fn(&K) -> Ordering,

(Reformatted to be legible. Please, use rustfmt, at least if you're posting code on rust-lang repositories. Maintaining a consistent style helps code to be quickly understood.)

This signature is still vastly more complicated than any other bound in the standard library APIs. And it's not even what you say it is! You're not requiring "add one and divide by two", you're requiring From<u8> and Div! How would this make any sense for e.g. Uuid, Sha1, or other types which are Step (have predecessor/successor operations) but not just numbers.

(x: Sha1) / Sha1::from(2) is nonsensical.

Step (or ) is the semantically correct bound here, as random-access successor/predecessor is what you need to do a binary search. If Step isn't powerful enough, then it should be generalized and/or extended to support this usecase.

That said, though, binary searching over a dataset with more elements (which is sorted in some fashion!) than the address space is an extremely niche use case. And will likely want to switch strategies when moving from impossibly sparse to some reasonable scale to work on.

@liborty
Copy link
Author

liborty commented Aug 23, 2022

Your formatting has actually destroyed my signature snippet. It mismatched the trait arguments and chopped off the end of the bounds. No wonder you have difficulties reading it.

Are we talking about the same thing here? My type T is the type of the index, not of the data.
The bounds may look complicated but it is not my fault that there is no Countable trait in Rust.
How do you propose to find quickly a midpoint of an uncountable index range? Step or no step?

@timvermeulen
Copy link

I second the request to please use rustfmt in these discussions. I've had some (slight) difficulties reading these code snippets because I'm so used to reading Rust code that is formatted using it.

@liborty
Copy link
Author

liborty commented Aug 23, 2022

OK, here it is:

pub fn binary_find<T, F>(range: Range<T>, cmpr: &mut  F) -> Range<T>
where
    T: PartialOrd + Copy + Add<Output = T> + Sub<Output = T> + Div<Output = T> + From<u8>,
    F: FnMut(&T) -> Ordering,

@liborty
Copy link
Author

liborty commented Aug 23, 2022

and here is the source listing of the latest version (Edit: made the comparator FnMut)

/// General Binary Search
/// Search within the specified Range<T>, which is always ascending.
/// The (indexing) range values can be of any generic type T satisfying the listed bounds.
/// Typically usize for searching efficiently in-memory, u128 for searching whole disks or internet,
/// or f64 for solving equations.
/// Comparator closure `cmpr` is comparing against a search item captured from its environment.
/// The sort order reflected by `cmpr` can be either ascending or descending (increasing/decreasing).
/// When item is in order before range.start, empty range range.start..range.start is returned.
/// When item is in order after range.end-1, range.end..range.end is returned.
/// Normally binary_find returns Range of all the consecutive values
/// that are PartiallyEqual to the sought item.
/// When item is not found, then the returned range will be empty and
/// its start (and end) will be the sort position where the item can be inserted.
pub fn binary_find<T, F>(range: Range<T>, cmpr: &mut F) -> Range<T>
where
    T: PartialOrd + Copy + Add<Output = T> + Sub<Output = T> + Div<Output = T> + From<u8>,
    F: FnMut(&T) -> Ordering,
{
    let one = T::from(1); // generic one
    let two = T::from(2); // generic two
    let lasti = range.end - one;

    // Closure to find the last matching item in direction up/down from idx
    // or till limit is reached. Equality is defined by `cmpr`.
    let scan = |idx: &T, limit: &T, cpr: &mut F, up: bool| -> T {
        let mut probe = *idx;
        let step = |p: &mut T| if up { *p = *p + one } else { *p = *p - one };
        step(&mut probe);
        while cpr(&probe) == Equal {
            // exits at the first non-equal item
            if probe == *limit {
                step(&mut probe);
                break;
            };
            step(&mut probe);
        }
        if up {
            probe
        } else {
            probe + one
        } // into Range limit
    };

    // Checking end cases
    if range.is_empty() {
        return range;
    };

    match cmpr(&range.start) {
        Greater => {
            return range.start..range.start;
        } // item is before the range
        Equal => {
            if cmpr(&range.end) == Equal {
                return range;
            }; // all in range match
            return range.start..scan(&range.start, &lasti, cmpr, true);
        }
        _ => (),
    };
    match cmpr(&lasti) {
        Less => {
            return range.end..range.end;
        } // item is after the range
        Equal => {
            return scan(&lasti, &range.start, cmpr, false)..range.end;
        }
        _ => (),
    };
    // Binary search
    let mut hi = lasti; // initial high index
    let mut lo = range.start; // initial low index
    loop {
        let mid = lo + (hi - lo) / two; // binary chop here with truncation
        if mid > lo {
            // still some range left
            match cmpr(&mid) {
                Less => lo = mid,
                Greater => hi = mid,
                Equal => {
                    return scan(&mid, &range.start, cmpr, false)..scan(&mid, &lasti, cmpr, true)
                }
            }
        } else {
            return hi..hi;
        }; // interval is exhausted, val not found
    }
}

@programmerjake
Copy link
Member

once you found one equal item, rather than using a linear search in scan, imho using an exponential search to find the first not-equal item and then binary searching between that and the last equal item would be much more efficient, with worst case runtime O(log N) rather than O(N)

@timvermeulen
Copy link

I've seen exponential search used for this, too — still O(log n) in the worst case, but clearly better (or at least fewer comparisons) than binary search for a small number of equal elements.

@liborty
Copy link
Author

liborty commented Aug 23, 2022

Great advice, thanks!
Anything else you can think of?

@liborty
Copy link
Author

liborty commented Aug 23, 2022

I have just realised that I can get even better results by reusing the last bounds of the already performed binary search!
Coming up in the next version.

@CAD97
Copy link

CAD97 commented Aug 23, 2022

The bounds may look complicated but it is not my fault that there is no Countable trait in Rust.

As I've been saying, that trait is Step.

How do you propose to find quickly a midpoint of an uncountable index range?

let distance = Step::steps_between(&start, &end)?;
let mid = Step::forward_unchecked(start, distance / 2);

With Range<impl Step>, it's trivially possible to bridge to working Range<usize> if you're okay with extra clones of the index type:

pub fn binary_find<K: Step>(range: Range<K>, cmp: impl Fn(&K) -> Ordering) -> Range<K> {
    let trans = |i: usize| Step::forward(range.start.clone()).unwrap();
    if let Some(length) = Step::distance_between(&range.start, &range.end) {
        let out = your_binary_find(0..length, |i| cmp(&trans(i));
        trans(out.start)..trans(out.end)
    } else {
        range
    }
}

(Sorry about the broken formatting in my last post I did it on a phone without checking the output and just before I went to sleep.)

@liborty
Copy link
Author

liborty commented Aug 23, 2022

That sounds good. Do you have any clearer idea, how it actually obtains the distance_between? Because if it just linearly counts the steps, that would be too slow. If not, then I will attempt to change to it next.

Here is my current version which implements everything up to here, except that Step. It now includes binary search for the end(s) of the matching range, restricted to the confines of the final range of the binary search that found the initial match. Which is a great idea, even if I say so myself :)

Edit: I can find only steps_between and that is nightly only experimental feature.

@programmerjake
Copy link
Member

pub fn binary_find<K: Step>(range: Range<K>, cmp: impl Fn(&K) -> Ordering) -> Range<K> {
    let trans = |i: usize| Step::forward(range.start.clone()).unwrap();
    if let Some(length) = Step::distance_between(&range.start, &range.end) {
        let out = your_binary_find(0..length, |i| cmp(&trans(i));
        trans(out.start)..trans(out.end)
    } else {
        range
    }
}

so if you want to search in the range 0..!0u128 it just always returns the input range!? that's just plain broken imho. Step by itself currently doesn't have the functionality needed to efficiently binary search ranges larger than usize.

@CAD97
Copy link

CAD97 commented Aug 23, 2022

I keep saying the same things.

  • Yes, Step is unstable.
    • But that doesn't matter.
    • We're not discussing the implementation of a crate that works on stable.
    • We're discussing a new unstable API to potentially add to std.
  • Yes, Step is limited to getting a max distance of usize.
    • But because Step is unstable, we can change that if desired.
      • Probably remove steps_between from Step and add it to a subtrait.
    • It's not necessarily needed, though, as you can binary search buckets and then binary search within the buckets.

e.g. smth like untested

fn split(x: i128) -> (i64, u64) {
    (((x as u128) >> 64) as u64 as i64, x as u128 as u64)
}
fn unsplit(hi: i64, lo: u64) -> i128 {
    lo as i128 & ((hi as u64 as u128) << 64) as i128
}

fn search(range: Range<i128>, cmp: impl Fn(i128) -> Ordering) -> Range<i128> {
    let cmp_hi = |&hi| cmp(unsplit(hi, 0));
    let hi = binary_find(split(range.start).0..=split(range.end).0);
    let cmp_lo = |hi| |&lo| cmp(unsplit(hi, lo));
    let lo = binary_find(0..=u64::MAX, cmp_lo(hi.start)).start..binary_find(0..=u64::MAX, cmp_lo(hi.end)).end;
    unsplit(hi.start, lo.start)..unsplit(hi.end, lo.end)
}

(requires binary_find support for RangeInclusive / finite impl RangeBounds, and I remain sad that ranges being Iterator instead of just IntoIterator means RangeInclusive doesn't expose fields.)

@programmerjake
Copy link
Member

fn split(x: i128) -> (i64, u64) {
    (((x as u128) >> 64) as u64 as i64, x as u128 as u64)
}
fn unsplit(hi: i64, lo: u64) -> i128 {
    lo as i128 & ((hi as u64 as u128) << 64) as i128
}

fn search(range: Range<i128>, cmp: impl Fn(i128) -> Ordering) -> Range<i128> {
    let cmp_hi = |&hi| cmp(unsplit(hi, 0));
    let hi = binary_find(split(range.start).0..=split(range.end).0);
    let cmp_lo = |hi| |&lo| cmp(unsplit(hi, lo));
    let lo = binary_find(0..=u64::MAX, cmp_lo(hi.start)).start..binary_find(0..=u64::MAX, cmp_lo(hi.end)).end;
    unsplit(hi.start, lo.start)..unsplit(hi.end, lo.end)
}

that has several issues:

  • it searches outside the input range, e.g. if start is 0x10000000001000000 it will search starting from 0x10000000000000000. This is an issue because some comparison functions will fail if an index outside the given range is passed in, e.g. |idx| (5000000000 / (idx - (1 << 64))).cmp(&0) will divide by zero if called on 0x10000000000000000
  • it only works if usize is 64-bit. if it's 32 or 16-bit it'll need to be orders of magnitude more complex.

imho if I'd have to go through all that just to binary search on u128, i'd rather just write the binary search myself to work on u128 -- defeating the entire purpose of having binary search in core in the first place.

Therefore, I think we need fn halfway(&self, other: &Self) -> Self in either Step or some other trait, since the current Step is totally insufficient.

@liborty
Copy link
Author

liborty commented Aug 24, 2022

it searches outside the input range

I will second that. I have gone into quite a lot of trouble to avoid any such access, even to range.end. If for no other reason than that will panic with 'out of range' error even on an ordinary Vec.

imho if I'd have to go through all that just to binary search on u128, i'd rather just write the binary search myself to work on u128 -- defeating the entire purpose of having binary search in core in the first place.

Which is what I have done and then generalised it to all enumerable range types T. The only benefit I see to Step is that it is somehow 'more idiomatic'. However, I always took 'idiomatic' as equivalent to 'simpler' and/or 'more general', otherwise there is no point in it. My explicit trait bounds fulfil the purpose right now, in stable Rust.

Therefore, I think we need fn halfway(&self, other: &Self) -> Self in either Step or some other trait, since the current Step is totally insufficient.

Yes, I agree, this would persuade me. Probably some other trait, as Step is embroiled in long discussions since 2017 and still marked as 'fly-by-night' only. Might it work for an ExactSizeIterator? I note that this ratio would probably have to be implemented internally in much the same way as what I had done already, i.e. as a numerical division. Where ratio is a more general division of the range by any (natural number) factor r, not just r==2. For a start, there are other than binary searches. I might have to implement that myself for my crate, so I do not have to wait.

fn ratio(&self, end: &Self, r: usize) -> Self

@CAD97
Copy link

CAD97 commented Aug 24, 2022

It is clear that we are at this point no longer discussing an API for std but instead the implementation of indxvec. I'm muting this.

We already have slice::pub fn binary_search_by<'a, F>(&'a self, f: F) -> Result<usize, usize>. Adding slice::binary_find[_by] -> Range<usize> seems reasonable. Indirected binary find that works just on Range (or perhaps a RandomAccessIterator like is already used internally in std) is possible but clearly not ready for std inclusion as there are too many open design questions.

(The implementation for i128 is an example of the tiered search technique, not a proposal of how to do it. It would be not too difficult to fix to avoid the mentioned deficiencies.)

@liborty liborty closed this as completed Aug 27, 2022
@liborty
Copy link
Author

liborty commented Aug 27, 2022

I have now completed the general design & implementation, satisfying all the requirements mentioned in these discussions. I commend it to your attention and adoption. It is in the trait Search. You can read the description of my new algorithm in README.md.

It is true that it now applies to Range rather than to a slice, thus formally justifying the closure of this proposal. However, I think you would do well to adopt it in some form or another. I am always ready to answer any more questions.

It is not using Step, which is not ready and would not, in any case, enable solving equations with Range<f64>, as my design does. Together with any other numerical indices, such as usize, u128, etc. This fully general implementation can easily be specialised, for instance only to usize, should you not wish to carry those trait bounds for converting numeric types.

Also, the Search trait is ready 'off the shelf' for implementing other searches with divisions not limited to 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

8 participants