Avoid bounds checking at slice::binary_search #30917


Merged: 1 commit merged into rust-lang:master on Jan 22, 2016

Conversation

@arthurprs
Contributor

Avoid bounds checking for binary search. All calculated indexes are safe and the branch is useless.
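
For context, here is a rough sketch of the change under discussion. The surrounding loop is reconstructed from the IR quoted later in the thread, so treat it as illustrative rather than the exact diff:

use std::cmp::Ordering::{self, Equal, Greater, Less};

// Illustrative reconstruction of the binary_search_by loop of the time.
// The loop maintains the invariant `base + lim <= s.len()`, so the computed
// index is always in bounds and the checked indexing can become unchecked.
pub fn binary_search_by<T, F>(s: &[T], mut f: F) -> Result<usize, usize>
    where F: FnMut(&T) -> Ordering
{
    let mut base = 0usize;
    let mut lim = s.len();

    while lim != 0 {
        let ix = base + (lim >> 1);
        // Before the PR: `f(&s[ix])`, i.e. bounds-checked indexing.
        // After: the unchecked get, asserting ourselves that `ix < s.len()`.
        match f(unsafe { s.get_unchecked(ix) }) {
            Equal => return Ok(ix),
            Less => {
                base = ix + 1;
                lim -= 1;
            }
            Greater => (),
        }
        lim >>= 1;
    }
    Err(base)
}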

@rust-highfive
Contributor

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

@sfackler
Member

Is there any benchmark data for this change?

@arthurprs arthurprs closed this Jan 14, 2016
@arthurprs arthurprs reopened this Jan 14, 2016
@arthurprs arthurprs closed this Jan 14, 2016
@arthurprs
Contributor Author

Sorry, wrong buttons. I thought this would be a no-brainer. If necessary I can put together some benchmarks to show the improvement; I saw gains in my burst trie implementation a while ago.

@arthurprs arthurprs reopened this Jan 14, 2016
@nikomatsakis
Contributor

On the topic of correctly implementing binary search being a no-brainer:

http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html ;)

But more seriously, any time we move away from safe code it definitely makes sense to have some definitive benchmarks (and, ideally, we'd be testing those, though we don't quite have the infrastructure for it right now).
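
For reference, the bug in that post is the midpoint computation; a minimal illustration (not code from this PR):

// The classic bug: `(low + high) / 2` overflows once `low + high` exceeds
// the integer's range (Java wraps silently to a bogus index; release-mode
// Rust wraps too, while debug builds panic).
fn midpoint_buggy(low: usize, high: usize) -> usize {
    (low + high) / 2
}

// The usual fix never overflows for `low <= high`:
fn midpoint_fixed(low: usize, high: usize) -> usize {
    low + (high - low) / 2
}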

@arthurprs
Contributor Author

Just to be extra clear, this doesn't change the logic at all; it just uses the unchecked get instead of the index operator.

Anyway I'll produce some numbers tomorrow.

@nikomatsakis
Contributor

@arthurprs

Just to be extra clear, this doesn't change the logic at all; it just uses the unchecked get instead of the index operator.

Yes, but it does raise the stakes on that logic being correct. :)

Anyway I'll produce some numbers tomorrow.

Sounds great!

@arthurprs
Contributor Author

Small but noticeable improvement. Here are the results on my machine; the code used is in a Gist.

As for the algorithm, I checked it for the common mistakes and I believe it's correct.

running 20 tests
test string_new_10     ... bench:          37 ns/iter (+/- 8)
test string_new_100    ... bench:          80 ns/iter (+/- 8)
test string_new_1000   ... bench:         117 ns/iter (+/- 9)
test string_new_10000  ... bench:         162 ns/iter (+/- 20)
test string_new_100000 ... bench:         222 ns/iter (+/- 31)
test string_std_10     ... bench:          38 ns/iter (+/- 9)
test string_std_100    ... bench:          94 ns/iter (+/- 8)
test string_std_1000   ... bench:         120 ns/iter (+/- 11)
test string_std_10000  ... bench:         180 ns/iter (+/- 16)
test string_std_100000 ... bench:         244 ns/iter (+/- 30)
test usize_new_10      ... bench:          22 ns/iter (+/- 2)
test usize_new_100     ... bench:          45 ns/iter (+/- 4)
test usize_new_1000    ... bench:          63 ns/iter (+/- 5)
test usize_new_10000   ... bench:          91 ns/iter (+/- 11)
test usize_new_100000  ... bench:         110 ns/iter (+/- 11)
test usize_std_10      ... bench:          22 ns/iter (+/- 3)
test usize_std_100     ... bench:          48 ns/iter (+/- 5)
test usize_std_1000    ... bench:          66 ns/iter (+/- 8)
test usize_std_10000   ... bench:          91 ns/iter (+/- 10)
test usize_std_100000  ... bench:         124 ns/iter (+/- 10)

@Gankra
Contributor

Gankra commented Jan 17, 2016

On the topic of profiling array search being a no-brainer:

http://cglab.ca/~morin/misc/arraylayout-v2/

😸

@arthurprs
Contributor Author

Well, I don't think changing memory layouts applies here. There are all sorts of fancy (microarchitecture-dependent) ways to try to speed up the binary search itself, like using conditional moves, but those are only suitable when searching over integers. If anyone knows how to improve a generic version like this, that'd be nice though.
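
For the curious, here is a sketch of the conditional-move style mentioned above, a branchless lower-bound search over integers (illustrative only, not something this PR proposes):

// Returns the index of the first element >= x in a sorted slice.
// The `if` body has no side effects, so compilers typically lower it to a
// conditional move instead of a branch.
pub fn lower_bound(a: &[u64], x: u64) -> usize {
    let mut base = 0usize;
    let mut n = a.len();
    if n == 0 {
        return 0;
    }
    while n > 1 {
        let half = n / 2;
        if a[base + half] < x {
            base += half;
        }
        n -= half;
    }
    base + (a[base] < x) as usize
}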

@ranma42
Contributor

ranma42 commented Jan 18, 2016

Would it be possible to get rid of the redundant check by updating the slice on which we are doing the search instead of keeping track of its boundaries manually?

@alexcrichton
Member

I was curious what was going on here in terms of codegen, as cases like this typically seem like something LLVM should be able to optimize fully.

I whipped up an example of what's going on right now in both versions of this core function, and the only difference in IR is listed below. Otherwise all basic blocks are the same between the two versions.

; version using get_unchecked()
while_body:                                                                                             
  %base.05 = phi i64 [ 0, %entry-block ], [ %base.1, %join ]                                            
  %lim.04 = phi i64 [ 3, %entry-block ], [ %11, %join ]                                                 
  %3 = lshr i64 %lim.04, 1                                                                              
  %4 = add i64 %base.05, %3                                                                             
  %5 = icmp eq i64 %4, 2                                                                                
  %6 = icmp ult i64 %4, 2                                                                               
  %..i.i = select i1 %6, i8 -1, i8 1                                                                    
  %sret_slot.0.i.i = select i1 %5, i8 0, i8 %..i.i                                                      
  switch i8 %sret_slot.0.i.i, label %match_else [                                                       
    i8 0, label %match_case                                                                             
    i8 -1, label %match_case3                                                                           
    i8 1, label %join                                                                                   
  ]                                                                                                     

; version using safe indexing                                                                                                 
while_body:                                                                                             
  %base.07 = phi i64 [ 0, %entry-block ], [ %base.1, %join ]                                            
  %lim.06 = phi i64 [ 3, %entry-block ], [ %12, %join ]                                                 
  %3 = lshr i64 %lim.06, 1                                                                              
  %4 = add i64 %base.07, %3                                                                             
  %5 = icmp ugt i64 %4, 2                                                                               
  br i1 %5, label %cond.i, label %"_ZN3bar16_$LT$closure$GT$12closure.3826E.exit", !prof !0             

cond.i:                                                                                                 
  %.lcssa = phi i64 [ %4, %while_body ]                                                                 
  tail call void @_ZN9panicking18panic_bounds_check20hd44ea11c616af168XYLE(...)        
  unreachable                                                                                           

"_ZN3bar16_$LT$closure$GT$12closure.3826E.exit":                                                        
  %6 = icmp eq i64 %4, 2                                                                                
  %7 = icmp ult i64 %4, 2                                                                               
  %..i.i = select i1 %7, i8 -1, i8 1                                                                    
  %sret_slot.0.i.i = select i1 %6, i8 0, i8 %..i.i                                                      
  switch i8 %sret_slot.0.i.i, label %match_else [                                                       
    i8 0, label %match_case                                                                             
    i8 -1, label %match_case3                                                                           
    i8 1, label %join                                                                                   
  ]                                                                                                     

So looking at these two snippets, LLVM today cannot prove that this is a constant branch which always takes the false edge:

br i1 %5, label %cond.i, label %"_ZN3bar16_$LT$closure$GT$12closure.3826E.exit", !prof !0

With the change to get_unchecked, we ourselves are making the assertion that this is indeed a constant branch and the out-of-bounds case is never hit. What this check is basically doing is:

if base + (lim >> 1) > 2 {
    panic!()
}

It is essentially ensuring that the index is less than the length of the slice, which in this case is 3. To me this does seem like something that's actually pretty hard for a compiler to statically determine. There are quite a few changes to base and lim, and it's pretty difficult to see how (using simple rules) their sum is always statically less than the length of the array.

I would prefer that we convince ourselves this property is indeed always true (before we start resorting to unsafe), but that may not be too easy. Additionally, the example in the gist I gave has lots of constant propagation, so overflow (e.g. the Java bug) was likely never really exercised there; that's another class of possible error to keep in mind.


Overall, the wins seem so negligible here and this is such a tricky area that I'd somewhat lean more towards sticking with the safe version.

@ranma42
Contributor

ranma42 commented Jan 20, 2016

@alexcrichton as you correctly say, the main issue is that LLVM does not manage to prove base + (lim >> 1) < self.len() for all iterations.
I was hoping that the slicing would help, because it makes the checks more "local", but it does not fix all of the difficulties. In particular, LLVM is unable to prove that (s.len() >> 1) < s.len() given that s.len() != 0. I think this should be easier to prove, but it is a pattern that InstSimplify does not capture yet.
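
A hypothetical minimal case of the pattern being described (the function is made up for illustration):

// With the LLVM of the time, the bounds check here was not elided even
// though `s.len() >> 1 < s.len()` holds whenever the slice is non-empty:
pub fn middle(s: &[u8]) -> Option<u8> {
    if s.is_empty() {
        None
    } else {
        Some(s[s.len() >> 1]) // provably in range, but the check stayed
    }
}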

@arthurprs
Contributor Author

This gives an easy ~10% improvement in a few cases. It's easily a win-win to me. It's not like there are no unsafe blocks in core/slice.rs already (there are 34).

@ranma42
Contributor

ranma42 commented Jan 20, 2016

@arthurprs Some of those unsafe blocks are required, as they perform operations that cannot be done in safe code (for example, creating slices from raw pointers). It looks like (all?) the unsafe blocks that are there just to avoid a bounds check have a comment that explicitly states their purpose.
I think your commit should at least follow this "convention".

NB: Actually, I would love it if we had a way to track when LLVM learns how to optimise these checks away, so that we can then remove the unsafe blocks... maybe manually trying them out once in a while is sufficient?
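
The convention in question looks roughly like this (a made-up illustration, not a quote from core/slice.rs):

pub fn middle_unchecked<T>(s: &[T]) -> Option<&T> {
    if s.is_empty() {
        return None;
    }
    // index is in bounds: `len >> 1 < len` because `len != 0`
    Some(unsafe { s.get_unchecked(s.len() >> 1) })
}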

@alexcrichton
Member

@arthurprs to me at least this may not necessarily be a win-win. While everything we've tested has shown that this continues to work, this quote from @nikomatsakis's link is particularly telling:

I was shocked to learn that the binary search program that Bentley proved correct and subsequently tested in Chapter 5 of Programming Pearls contains a bug

This just goes to show that unsafe code is incredibly tricky at its core and must not be taken lightly. The way LLVM does transformations is typically relatively simple (e.g. easy to verify), so if we could coerce LLVM into proving that the bounds check isn't needed here, that'd be the best situation. If that doesn't happen, then there's some nontrivial logic involved, which means the proof that the bounds check isn't necessary is more complicated (i.e. we can't trivially say "oh, it's faster, let's merge!").

I'm wary here because binary_search is such a core piece of functionality and could be critical if it's wrong. I would personally probably be ok merging, but I mean to point out that we shouldn't take this lightly.

@bluss
Member

bluss commented Jan 20, 2016

If you merge the sound unchecked indexing through "generativity" framework, you can implement binary search safely without bounds checks… 😄 (code is here!)

Jokes aside, this change seems like something libstd should do... we write unsafe code so that users don't have to (or is that a platitude without any connection to reality? not sure).

@ranma42
Contributor

ranma42 commented Jan 21, 2016

@bluss I tried the slicing approach again after your example (in particular, splitting at len >> 1, then checking the size of the tail, which is guaranteed to be at least as big as the head):

#![crate_type="lib"]

use std::cmp::Ordering::*;

pub fn binary_search_by_split(a: &[i32]) -> Result<usize, usize> {
    let mut s = a;
    let mut base = 0usize;

    loop {
        let (head, tail) = s.split_at(s.len() >> 1);
        if tail.is_empty() {
            return Err(base)
        }
        match tail[0].cmp(&0) {
            Equal => return Ok(base + head.len()),
            Less => {
                base += head.len() + 1;
                s = &tail[1..];
            }
            Greater => s = head,
        }
    }
}

This still has a check that might cause a panic (slice_index_len_fail), but it depends on a very simple condition. Adding an optimisation to InstructionSimplify for the case x >> y <= x makes the check go away, and everything is optimised in safe Rust.
I will try to propose this in a patch to LLVM shortly (in any case, independently of the results of this PR).

Notice that in order to completely get rid of bounds checks, the binary search code would still need to be modified, but it would then be possible to do so without any unsafe code.

@bluss
Member

bluss commented Jan 21, 2016

@ranma42 I love it! You can use .split_first() to make it even prettier.

@alexcrichton
Member

@ranma42 nice! Out of curiosity, is it the split_at whose bounds check isn't elided there?

That seems at least pretty easily verifiable by a human in terms of not having any problems, so I'd be fine changing to unsafe indexing there as well.

@ranma42
Contributor

ranma42 commented Jan 21, 2016

@bluss Thanks for pointing that out, I updated my local example and checked that everything works just fine with split_first(), too :)

@alexcrichton Yes, the call to split_at is inlined (up to the Index::index call). The check for start <= end (which would result in slice_index_order_fail) is optimised away, as start = 0, but the one for end <= len is not, and results in a call to slice_index_len_fail. I agree that the code looks "trivially correct", but unless there is an imminent need for this optimisation, I would prefer to avoid adding unsafe code and instead wait for LLVM to optimise this pattern.
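
A rough sketch of the two checks being described, with made-up names (the real logic lives behind the Index::index machinery; this only mirrors its observable behaviour):

use std::slice;

// What `&s[start..end]` has to verify before producing the subslice:
fn slice_checks_sketch<T>(s: &[T], start: usize, end: usize) -> &[T] {
    // Failing here is slice_index_order_fail; in the example above it is
    // optimised away because `start = 0`.
    assert!(start <= end);
    // Failing here is slice_index_len_fail; this is the check that remains.
    assert!(end <= s.len());
    unsafe { slice::from_raw_parts(s.as_ptr().add(start), end - start) }
}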

In fact, I am trying to go in the opposite direction: find out why checks in libcore/libstd are not optimised away and push the optimisations into LLVM, so that next time we update LLVM we can get rid of unsafe blocks "for free" without regressing performance.
(Ideally these optimisations should be trivial and/or checked with Souper or similar tools)

@bluss
Member

bluss commented Jan 21, 2016

@ranma42 a related example: I remember that a case of >> 1 was the only bounds check not elided in this function.

@alexcrichton
Member

If the code in the standard library were adapted to what @ranma42 has pasted then I'd be fine with the unsafe indexing in split_at. It seems pretty easy to verify that the unsafety there is justified.

@arthurprs
Contributor Author

These are some neat alternatives!

@alexcrichton what do you mean by unsafe indexing in split_at?

@alexcrichton
Member

@arthurprs oh so in @ranma42's example all the bounds checks are elided except for the call to split_at, in which case I'd be fine using an unsafe block to implement that manually (e.g. via slice_unchecked).
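
Concretely, something like the following sketch, where the safety argument is split_at's own invariant (get_unchecked with a range stands in here for the slice_unchecked mentioned above, which has since been deprecated):

// split_at without the bounds check, for a case where `mid <= s.len()`
// holds by construction:
pub fn split_at_middle<T>(s: &[T]) -> (&[T], &[T]) {
    let mid = s.len() >> 1;
    // in bounds: `mid = len >> 1 <= len`
    unsafe { (s.get_unchecked(..mid), s.get_unchecked(mid..)) }
}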

@alexcrichton
Member

@arthurprs yes, it sounds like once we upgrade LLVM, @ranma42's version will have no bounds checks in optimized code. And yeah, it's true that today we don't have unchecked slicing for slices (but we probably should...).

@arthurprs
Contributor Author

Ok sounds great!

Yeah, I think so. I use from_raw_parts a lot more than I'd like.

@arthurprs
Contributor Author

Question: with the current LLVM, wouldn't "/ 2" optimize the bounds check out? I tried it, but it doesn't. Judging by @ranma42's commit message and the diff, I assumed it would.

Here are the results for the new version

running 21 tests
test _warmup           ... bench:           0 ns/iter (+/- 0)
test string_new_10     ... bench:          38 ns/iter (+/- 4)
test string_new_100    ... bench:          85 ns/iter (+/- 7)
test string_new_1000   ... bench:         118 ns/iter (+/- 10)
test string_new_10000  ... bench:         167 ns/iter (+/- 18)
test string_new_100000 ... bench:         228 ns/iter (+/- 26)
test string_std_10     ... bench:          42 ns/iter (+/- 8)
test string_std_100    ... bench:          89 ns/iter (+/- 8)
test string_std_1000   ... bench:         130 ns/iter (+/- 12)
test string_std_10000  ... bench:         176 ns/iter (+/- 17)
test string_std_100000 ... bench:         240 ns/iter (+/- 29)
test usize_new_10      ... bench:          22 ns/iter (+/- 2)
test usize_new_100     ... bench:          45 ns/iter (+/- 4)
test usize_new_1000    ... bench:          58 ns/iter (+/- 6)
test usize_new_10000   ... bench:          85 ns/iter (+/- 6)
test usize_new_100000  ... bench:         106 ns/iter (+/- 8)
test usize_std_10      ... bench:          23 ns/iter (+/- 3)
test usize_std_100     ... bench:          50 ns/iter (+/- 7)
test usize_std_1000    ... bench:          65 ns/iter (+/- 7)
test usize_std_10000   ... bench:          93 ns/iter (+/- 8)
test usize_std_100000  ... bench:         115 ns/iter (+/- 10)

@ranma42
Contributor

ranma42 commented Jan 22, 2016

@arthurprs LLVM is clever enough to realise that the costly division in / 2 can be optimised to a shift >> 1, but not clever enough to realise that the optimisations that apply to the one also apply to the other. Unfortunately, because of the sequence of passes used by Rust, the division is replaced early, so the comparison is not simplified.
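
An illustration of the pass-ordering point; both of these reach the later simplification passes as the same lshr, so a rule keyed on udiv never sees the first one:

// instcombine rewrites the division into a shift early, so by the time the
// comparison could be simplified there is no `udiv` left to match against:
pub fn mid_div(len: usize) -> usize { len / 2 }   // emitted as `lshr %len, 1`
pub fn mid_shr(len: usize) -> usize { len >> 1 }  // same IR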

@arthurprs
Contributor Author

@ranma42 Thank you, that's very interesting.

        if tail.is_empty() {
            return Err(base)
        }
        match f(&tail[0]) {
Member

Can we use split_first here instead? It will do the job of .is_empty(), getting the first element, and slicing out the &tail[1..], all in one.

Contributor Author

The compiler adds another branch; I'm not sure why.

assembly diff https://www.diffchecker.com/5p8siwcp

        if let Some((pivot, rest)) = tail.split_first() {
            match f(pivot) {
                Equal => return Ok(base + head.len()),
                Less => {
                    base += head.len() + 1;
                    s = rest;
                }
                Greater => s = head,
            }
        } else {
            return Err(base)
        }

Member

hm ok

Contributor Author

@ranma42 do you know what's going on here?

Contributor

I do not know why the code isn't the same. Looking at the IR, the LLVM basic block that generates that jump seems to be:

"_ZN5slice12_$u5b$T$u5d$8split_at21h15129678983694567549E.exit": ; preds = %loop_body
  %5 = getelementptr inbounds i32, i32* %s.sroa.0.0.ph, i32 %3
  %6 = icmp eq i32 %s.sroa.7.0, %3
  %switchtmp = icmp eq i32* %5, null
  %or.cond = or i1 %6, %switchtmp
  br i1 %or.cond, label %match_case, label %match_case4

@nikomatsakis
Contributor

@arthurprs the new, slice-based version is pretty neat. Thanks for pushing on that.

@alexcrichton
Member

@bors: r+ 7e5b9d7

We can try to investigate the split_first vs tail implementation in a future PR I think, but it sounds like a newer LLVM will take this instantiation and optimize away all bounds checks, so yay!

bors added a commit that referenced this pull request Jan 22, 2016
Avoid bounds checking for binary search. All calculated indexes are safe and the branch is useless.
@bors
Collaborator

bors commented Jan 22, 2016

⌛ Testing commit 7e5b9d7 with merge cded89a...

@bors bors merged commit 7e5b9d7 into rust-lang:master Jan 22, 2016
waj pushed a commit to waj/llvm that referenced this pull request Feb 6, 2016
This commit extends the patterns recognised by InstSimplify to also handle (x >> y) <= x in the same way as (x /u y) <= x.

The missing optimisation was found investigating why LLVM did not optimise away bound checks in a binary search: rust-lang/rust#30917

Patch by Andrea Canciani!

Differential Revision: http://reviews.llvm.org/D16402
@arthurprs
Contributor Author

arthurprs commented Oct 24, 2016

💔 We still have the bounds check there; I'm unsure why, since we have the optimization in the Rust LLVM branch: https://github.com/rust-lang/llvm/blob/rust-llvm-2016-07-09/lib/Analysis/InstructionSimplify.cpp#L2826

Can somebody familiar with llvm take a quick look?

pub extern fn test_bs(s: &[u8], b: u8) -> bool {
  s.binary_search(&b).is_ok()
}

becomes

        jmp     .LBB0_1
.LBB0_11:
        inc     r9
        dec     r8
        mov     rdi, r9
        mov     rsi, r8
.LBB0_1:
        mov     rcx, rsi
        shr     rcx
        cmp     rsi, rcx
        jb      .LBB0_12
        mov     r8, rsi
        sub     r8, rcx
        je      .LBB0_3
        lea     r9, [rdi + rcx]
        xor     esi, esi
        cmp     byte ptr [r9], dl
        mov     r10b, -1
        mov     al, 1
        jb      .LBB0_7
        mov     r10b, 1
.LBB0_7:
        je      .LBB0_9
        mov     sil, r10b
.LBB0_9:
        test    sil, sil
        je      .LBB0_4
        cmp     sil, 1
        mov     rsi, rcx
        je      .LBB0_1
        jmp     .LBB0_11
.LBB0_3:
        xor     eax, eax
.LBB0_4:
        ret
.LBB0_12:
        push    rbp
        mov     rbp, rsp
        mov     rdi, rcx
        call    _ZN4core5slice20slice_index_len_fail17h98d51f66eb16cf45E@PLT

@bluss
Member

bluss commented Oct 24, 2016

From the Rust code, it ends up as

  %3 = lshr i64 %s.sroa.7.0.i.i, 1
  %4 = icmp ult i64 %s.sroa.7.0.i.i, %3

and the optimization is looking for lshr in combination with ugt or ule. Could that be it?

(ult is unsigned less than and so on)

@arthurprs
Contributor Author

arthurprs commented Oct 24, 2016

Good catch, it could be that. The optimization for ult is only valid if both the value and the shift amount are != 0 (e.g. 0 >> y == 0, which is not < 0; and x >> 0 == x, which is not < x).

@bluss
Member

bluss commented Oct 24, 2016

@arthurprs Can you open a new issue for this bug?

llvm-beanz pushed a commit to llvm-beanz/llvm-submodules that referenced this pull request Oct 26, 2016
Summary:
Extends InstSimplify to handle both `x >=u x >> y` and `x >=u x udiv y`.

This is a follow-up of rL258422 and
rust-lang/rust#30917 where llvm failed to
optimize away the bounds checking in a binary search.

Patch by Arthur Silva!

Reviewers: sanjoy

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D25941

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@285228 91177308-0d34-0410-b5e6-96231b3b80d8
@arthurprs
Contributor Author

arthurprs commented Oct 26, 2016

Just an update: LLVM just landed a patch that should finally solve this: https://github.com/llvm-project/llvm/commit/cf6e9a81f676b3e3885f86af704e834ef5c04264
