
Add shar's algorithm #6

Open · ryao wants to merge 1 commit into master from shar
Conversation

ryao commented May 27, 2023

This is based on the D version by Alex Muscar, which was based on an implementation in Writing Efficient Programs by Jon L. Bentley. It was originally published by Knuth and is attributed to Shar.

https://muscar.eu/shar-binary-search-meta.html

Previously, I had mistaken the monobound binary search for a loop version of Shar's algorithm, but after looking at them more closely, I realized that they are subtly different.

On my Ryzen 7 5800X, this outperforms the monobound binary search on array sizes that are at or near powers of 2.
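
For anyone skimming, the shape of the algorithm is roughly the following (a minimal sketch of the idea, not the exact code in this commit; the handling of non-power-of-two sizes here is my own):

/* Minimal sketch of Shar's algorithm. */
int shar_binary_search(const int *array, unsigned int array_size, int key)
{
        unsigned int bot, mid;

        if (array_size == 0)
                return -1;

        /* mid = largest power of two not exceeding array_size */
        mid = 1u << (31 - __builtin_clz(array_size));
        bot = 0;

        /* One extra probe picks the top- or bottom-aligned half when the
         * size is not a power of two. */
        if (array[array_size - mid] <= key)
                bot = array_size - mid;

        /* The probe offsets are mid/2, mid/4, ..., 1 no matter what the
         * earlier probes returned, so the loop body needs no branches. */
        for (mid >>= 1; mid > 0; mid >>= 1) {
                if (array[bot + mid] <= key)
                        bot += mid;
        }

        return (array[bot] == key) ? (int)bot : -1;
}

The inner loop is the part that compiles down to the lea/cmp/cmov/shr sequence shown in the comments below.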

ryao force-pushed the shar branch 2 times, most recently from 934d8f5 to 6c6c900 on May 27, 2023 at 23:17
ryao (Author) commented May 27, 2023

This implementation's loop body is very nice. On GCC 12.2 for amd64, it is:

.L204:
        leal    (%rcx,%rax), %r9d
        movl    %edx, checks(%rip)
        cmpl    %r8d, (%rdi,%r9,4)
        cmovle  %r9d, %eax
        addl    $1, %edx
        shrl    %ecx
        jne     .L204

If this were unrolled like in the D version and a switch statement were used to jump to the correct starting point, you would get unrolled iterations that look like this from GCC 12.x (later versions of GCC generate 4 instructions per unrolled loop iteration; a bug report has been filed):

.L33:
        leal    512(%rdx), %ecx
        cmpl    %esi, (%rdi,%rcx,4)
        cmovle  %ecx, %edx
.L35:
        leal    256(%rdx), %ecx
        cmpl    %esi, (%rdi,%rcx,4)
        cmovle  %ecx, %edx

That is 3 instructions per iteration, and through the magic of switch statements, you would only need a single branch, executed once, to make the unrolled version work in place of the current loop. If ++checks is still added to each iteration, then it is really 5 instructions per unrolled loop iteration. This particular implementation precomputes how many iterations it makes, so it is possible to increment checks once to eliminate some loop overhead; I did not do that in the interest of a fair comparison. At present, compilers do not notice that the total number of iterations is known up front such that they could hoist the increment out of the loop, although perhaps a future compiler might infer that.

Anyway, unrolling this via a switch statement would give us around a 25% faster binary search than monobound on powers of 2, but I did not submit that version because none of the other binary search implementations have the benefit of unrolling, even though this one is uniquely suited for it since it can achieve 3 instructions per unrolled loop iteration. It would not be unreasonable to unroll it enough to handle all array sizes representable in a 32-bit integer.
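
To make the switch idea concrete, here is a rough sketch (my own, limited to arrays of up to 1024 elements for brevity; the real thing would list a case for every power of two up to 2^30, and it is not the version I benchmarked):

/* Hypothetical switch-unrolled variant of the sketch above. */
#define SHAR_STEP(p)                                            \
        case (p):                                               \
                if (array[bot + (p)] <= key)                    \
                        bot += (p);                             \
                /* fall through to the next smaller step */

int shar_binary_search_unrolled(const int *array, unsigned int array_size, int key)
{
        unsigned int bot, mid;

        if (array_size == 0 || array_size > 1024)
                return -1;

        mid = 1u << (31 - __builtin_clz(array_size));
        bot = 0;

        if (array[array_size - mid] <= key)
                bot = array_size - mid;

        /* One computed jump lands on the first useful step; each case is
         * the 3-instruction lea/cmp/cmov body shown above. */
        switch (mid >> 1) {
        SHAR_STEP(512)
        SHAR_STEP(256)
        SHAR_STEP(128)
        SHAR_STEP(64)
        SHAR_STEP(32)
        SHAR_STEP(16)
        SHAR_STEP(8)
        SHAR_STEP(4)
        SHAR_STEP(2)
        SHAR_STEP(1)
        case 0:
                break;
        }

        return (array[bot] == key) ? (int)bot : -1;
}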

scandum (Owner) commented May 28, 2023

On my own system, monobound still outperforms on powers of 2. When compiled with Clang it slows down dramatically, as does monobound. This made me wonder whether either is a true algorithmic improvement, or whether they just work around compiler gimmicks.

I'm hesitant about merging, though it's definitely an interesting approach to binary searching. I'll take a closer look if I end up doing more work on binary searching in the future.

ryao (Author) commented May 28, 2023

The main innovation in monobound is explained fairly well here:

https://en.algorithmica.org/hpc/data-structures/binary-search/#the-bottleneck

This should have the same innovation, which in part is why I previously thought the two were both variants of Shar’s algorithm.

The differences between the two are probably more dependent on the quirks of CPU pipeline design than anything else.
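
Paraphrasing from memory (a sketch of the idea, not the exact code in this repository), the monobound shape keeps the remaining range length in its own variable, so the iteration count depends only on the array size and the probe can compile to a conditional move:

int monobound_sketch(const int *array, unsigned int array_size, int key)
{
        unsigned int bot = 0, top = array_size, mid;

        if (array_size == 0)
                return -1;

        /* top depends only on array_size, never on the keys probed, so
         * the number of iterations is fixed in advance. */
        while (top > 1) {
                mid = top / 2;
                if (key >= array[bot + mid])
                        bot += mid;
                top -= mid;
        }

        return (key == array[bot]) ? (int)bot : -1;
}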

ryao (Author) commented May 28, 2023

@scandum Out of curiosity, which processor, platform and compiler versions did you use?

To give the equivalent information for my own setup:

  • Ryzen 7 5800X
  • Linux 5.16.14-gentoo-x86_64
  • glibc 2.37-r1
  • gcc 12.2.1_p20221008 p1
  • llvm/clang 16.0.1

On the topic of compilers making things slow, I had filed llvm/llvm-project#62948 because in a modified version of your benchmark, the movb instruction killed performance. However, when I adapted the code for the original benchmark, the movb instruction was still emitted, but performance was unaffected. I wonder if your poor performance with Clang is the same issue. Modern x86 processors reportedly do not rename 8-bit registers, and instructions that operate on them do not change the upper bits of the corresponding 64-bit register, so performance can suffer when using 8-bit operations unless the data dependency is eliminated somehow (e.g. zeroing the 64-bit destination register before executing the 8-bit operation).

Another issue that I encountered when testing was that LLVM's x86 backend used branches for unpredictable comparisons instead of cmov. Performance of some variations of binary search that I tested was terrible when compiling with Clang until I passed -mllvm -x86-cmov-converter=false to avoid that problem.
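
For reference, the flag just gets forwarded to the LLVM backend, e.g. (the file name is a placeholder):

clang -O3 -mllvm -x86-cmov-converter=false benchmark.c -o benchmark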

That said, I was interested in optimizing binary search for 4KB B-Tree leaves in openzfs/zfs#14866. To avoid cache effects making the results unrealistic, I ended up allocating a huge array and then doing binary searches on a different 4KB chunk of it each time. I ended up settling on an unrolled version of the monobound binary search, although for use in B-Tree code, I had to tweak it to return a lower bound rather than only matches.

I also had to rework the code to inline a comparator function so that I could write the code once and use it on B-Trees storing different element types. I kept two versions of the generic comparator that were identical except that one was marked __attribute__((always_inline)) inline and had a different name. Then I passed -fno-inline when compiling so that I could benchmark inlined comparators against comparators that used function calls.

In addition, comparators often return -1, 0 or 1 by doing (((a) > (b)) - ((a) < (b))), and as it turned out, GCC generated terrible code when inlining such comparators, since we would have things like ((((a) > (b)) - ((a) < (b))) < 0) after inlining, and it would not reduce that to ((a) < (b)). Clang handled this in some cases, but not in others. I filed bugs against both GCC and LLVM regarding that:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109901
llvm/llvm-project#62790
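
To make the pattern concrete (illustrative names, not the actual OpenZFS code):

/* Generic three-way comparator; safe for any int values. */
static inline int cmp_int(int a, int b)
{
        return (a > b) - (a < b);       /* -1, 0 or 1, with no overflow */
}

int is_less(int a, int b)
{
        /* After inlining, this is ((((a) > (b)) - ((a) < (b))) < 0), which
         * should fold to ((a) < (b)); the bug reports above cover cases
         * where that folding is missed. */
        return cmp_int(a, b) < 0;
}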

In my local tests against a modified version of the benchmark that focused on 4KB search performance, an unrolled version of the monobound binary search performed second best with LLVM/Clang and okay with GCC. An unrolled version of the binary search here performed the absolute best when compiled with GCC, and terribly when compiled with Clang unless I manually edited the assembly to stop using movb/subb, after which it performed on par with the monobound binary search. Note that my previous comments referred to my test results for the unmodified benchmark (unmodified except for the changes necessary to evaluate Shar's algorithm).

That said, I ended up evaluating at least a dozen different variations of binary search (although the variations I tested were far less ambitious than the variations you tested). The two best were an unrolled monobound and an unrolled Shar's algorithm. Interestingly, a slight variation on unrolled Shar's algorithm that incremented the array pointer instead of an index generated the best output for RISC-V (although I did not run tests on RISC-V). Incrementing the pointer instead of an index resulted in a loop of only 4 RISC-V instructions, while incrementing the loop index used 8 instructions per loop. RISC-V's load instruction can only take a source, a destination and an offset, such that array accesses that typically take 1 instruction on Intel take 5 instructions on RISC-V.
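
A sketch of the pointer-stepping variation (illustrative, with the same power-of-two handling as the earlier sketch, not the exact code I tested):

int shar_binary_search_ptr(const int *array, unsigned int array_size, int key)
{
        const int *base = array;
        unsigned int step;

        if (array_size == 0)
                return -1;

        step = 1u << (31 - __builtin_clz(array_size));

        if (array[array_size - step] <= key)
                base = array + (array_size - step);

        /* Advancing the pointer avoids the base-plus-scaled-index
         * addressing that costs extra instructions on RISC-V. */
        for (step >>= 1; step > 0; step >>= 1) {
                if (base[step] <= key)
                        base += step;
        }

        return (*base == key) ? (int)(base - array) : -1;
}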

Unfortunately, without the yet-to-be-finalized RISC-V bit manipulation extension, the cost of calculating the leading zero count (lzc) is so high that Shar's algorithm is likely not practical for a general purpose binary search on RISC-V, although it is likely usable if you can guarantee that the array will always be the same size. I also peeked at compiler output for ARMv8, although I did not do runtime tests on it. Beyond saying that Clang's output was a work of art while GCC's had room for improvement, there is not much to say.

My apologies for the wall of text, but doing a "brain dump" of the knowledge that I have acquired over the past few weeks is my way of saying thanks for your prior work in this area. Instead of writing my own micro-benchmark, I modified yours to be suitable for my application and it gave me good enough results to convince me to spend what in hindsight seems like a ridiculous amount of time micro-optimizing a fundamental algorithm. In a way, you both saved me time and made me spend even more time, but I had fun doing this, so that is what counts. :)

scandum (Owner) commented May 29, 2023

WSL 2 gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) on an i3-8100.

I haven't looked at the assembly, but the benchmark shows that Clang runs the code with branches. Using -mllvm -x86-cmov-converter=false, it does indeed become branchless.

I have similar performance issues (a 2x slowdown) for my rotation algorithms with Clang, https://github.com/scandum/rotate. Using -mllvm -x86-cmov-converter=false doesn't fix those.

I actually never learned to read assembly properly, probably should, but I mostly focus on higher level improvements.

I've worked on B-tree-like structures myself as well: https://github.com/scandum/binary_cube, and one optimized for sorting, https://github.com/scandum/gridsort. It's how I originally got started on binary searches. It's good to see others work on them as well; many are still stuck in the node age.

I wrote most of my algorithms to accept a > b in the comparator function / macro, which is compatible with a - b for stable sorting. I'd say C++ messed up by using a < b as it's not compatible with C's a - b. Might be worth looking into.

Also keep in mind that my benchmark reports the best time, which works well when it comes to filtering out noise. Executions that take 150-500 microseconds are generally the most accurate. Making micro-optimizations is tricky, if not impossible, using average time. Glad to hear it's been useful to you. :-)

scandum (Owner) commented May 29, 2023

p.s.

Also keep in mind that when you start getting down to the microsecond, pretty irrelevant things start playing a role. You can make an irrelevant code change, and suddenly one variant starts doing better than another variant.

So it's possible, based on my own observations, that a minimal improvement is situational instead of absolute.

In case you publish anything, definitely add average time to the benchmark, since some people will go crazy (for silly reasons) if they find out you're benching the best time.

ryao (Author) commented May 29, 2023

> I wrote most of my algorithms to accept a > b in the comparator function / macro, which is compatible with a - b for stable sorting. I'd say C++ messed up by using a < b as it's not compatible with C's a - b. Might be worth looking into.

a - b does not work for unsigned integers, and I have also been told that it does not work for integers of very large magnitude. That is why a number of C projects use (((a) > (b)) - ((a) < (b))) to generate -1, 0 or 1 in comparators and then compare that result against 0 wherever the comparator is used, since that always works regardless of how big the numbers are or whether they are signed or unsigned. It is unfortunate that GCC (and to a lesser extent Clang) does not optimize that well when inlining is done; I seem to be the first to report the issue to them. I found C++'s approach to be better optimized by optimizing compilers, although hopefully things will change now that I have filed bug reports.
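
A quick illustration of where the subtraction idiom breaks (a made-up driver, not code from any project mentioned here):

#include <limits.h>
#include <stdio.h>

static int cmp_sub(int a, int b)  { return a - b; }             /* overflows */
static int cmp_safe(int a, int b) { return (a > b) - (a < b); } /* always -1, 0 or 1 */

int main(void)
{
        /* INT_MIN - INT_MAX overflows, which is undefined behavior in C;
         * on a typical build it wraps to +1, so the smaller value
         * compares as the larger one. */
        printf("cmp_sub:  %d\n", cmp_sub(INT_MIN, INT_MAX));
        printf("cmp_safe: %d\n", cmp_safe(INT_MIN, INT_MAX));
        return 0;
}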

> I actually never learned to read assembly properly, probably should, but I mostly focus on higher level improvements.

I learned the basics in college when I was required to write simple programs in MIPS assembly. That has been good enough for me to review compiler output (with the help of instruction documentation) for other assembly languages to find places where the compiler output is poor. When I find places in hot code paths where the output is poor, I can try to tweak the C code to make it into something on which existing compiler optimization passes generate better output.

I only started doing this recently, and the only other place where I have done it is a checksum algorithm. The PRs for that are still WIPs, but the preliminary results were good. You can see them in openzfs/zfs#14234 and openzfs/zfs#14219. I need to go back and rework one before it can be merged; the other, where I was a co-author, is blocked on the other author. The main idea there is to replace hand-written inline SIMD assembly with C code using GNU C's vector extension. This lets the compiler do optimizations between the C loop and what had previously been inline assembly that it could not touch.

The current results are faster than the existing handwritten code, and part of that is attributable to optimizations that the compiler was able to do on the GNU C vector code that it could not do with a C loop whose body was inline assembly. I would not have been able to produce that result had I not reviewed the assembly output and tweaked the code so that the compiler would do better. Interestingly, I found that Clang responded better than GCC to this, to the point where I plan to cache a version of Clang's output to use when building with GCC to avoid its poorer code generation.
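
To illustrate the technique with a toy example (not the actual checksum code):

/* The compiler sees ordinary C here, so it can unroll, schedule and
 * allocate registers across what used to be an inline-assembly boundary. */
typedef unsigned long long v4u64 __attribute__((vector_size(32)));

v4u64 sum_lanes(const unsigned long long *in, unsigned long long count)
{
        v4u64 acc = { 0, 0, 0, 0 };
        unsigned long long i;

        for (i = 0; i + 4 <= count; i += 4) {
                v4u64 v = { in[i], in[i + 1], in[i + 2], in[i + 3] };
                acc += v;       /* element-wise vector add */
        }
        return acc;
}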

> I've worked on B-tree-like structures myself as well: https://github.com/scandum/binary_cube, and one optimized for sorting, https://github.com/scandum/gridsort. It's how I originally got started on binary searches. It's good to see others work on them as well; many are still stuck in the node age.

We adopted in-memory B-Trees in 2019, which gave a nice performance boost. None of us really touched the code until the past year, when a few of us started making tweaks to improve performance:

openzfs/zfs@c0bf952
openzfs/zfs@9dcdee7
openzfs/zfs@677c6f8
openzfs/zfs#14835

Most of this is informed by profiling, although in my case, I was nerd-sniped. Coincidentally, all of the instances in which I reviewed assembly code for performance purposes in the past year were caused by nerd sniping.

> p.s.
>
> Also keep in mind that when you start getting down to the microsecond, pretty irrelevant things start playing a role. You can make an irrelevant code change, and suddenly one variant starts doing better than another variant.
>
> So it's possible, based on my own observations, that a minimal improvement is situational instead of absolute.
>
> In case you publish anything, definitely add average time to the benchmark, since some people will go crazy (for silly reasons) if they find out you're benching the best time.

Thanks for the tips. I do not plan to publish results outside of GitHub issues, and I am likely to put this down for now. There are a few things that I am leaving unresolved, such as why code with a single unpredictable branch outperforms code that avoids it:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110007#c5

My current suspicion is that the use of rand() in your original benchmark code causes that last branch to be unusually predictable. Historically, libc's rand() uses a problematic random number generation algorithm. Unfortunately, while webpages talking about it in detail were easily found 10 years ago, I cannot find them as easily right now. In particular, random.org used to have an excellent write-up, but I cannot find it. The best I can find is this:

https://www.reddit.com/r/cpp/comments/ct30do/the_fault_of_rand_visualized/

scandum (Owner) commented May 30, 2023

From a compatibility / performance perspective, the main point is that software that works with a > b (unstable if true) still works with a - b, even if the latter is restricted to signed integers.

Regarding B-Trees, the leaf size seems rather large at 4096 elements? Unless it's the size in bytes. Even with binary insertion that'd still be a bit iffy. My solution for gridsort was to sort leaves with a branchless merge sort I named quadsort. An actual B-tree would likely have to choose between merging and binary insertion for optimal performance.

As for rand(), I've observed something similar. Is there a good alternative in the standard library?

ryao (Author) commented May 31, 2023

> From a compatibility / performance perspective, the main point is that software that works with a > b (unstable if true) still works with a - b, even if the latter is restricted to signed integers.

The latter can break if you compare INT_MIN with INT_MAX, or do any other comparison that causes signed integer overflow, so unfortunately it does not work in general even for signed integers.

> Regarding B-Trees, the leaf size seems rather large at 4096 elements? Unless it's the size in bytes. Even with binary insertion that'd still be a bit iffy. My solution for gridsort was to sort leaves with a branchless merge sort I named quadsort. An actual B-tree would likely have to choose between merging and binary insertion for optimal performance.

It is in bytes. Typically, we have a maximum of 512 8-byte elements, but that number drops to 170 elements at the larger 24-byte element size used in at least one place. When we adopted B-Trees in an area of the code that is primarily used to store directory entry hashes, the leaf size was reduced to 512 bytes to reduce CPU time spent copying on insertions. Everywhere else, we use a 4096-byte leaf size.

The B-Tree code was enough of an improvement over the AVL tree code that it took a few years before anyone looked into further improvements. CPU time is more often spent elsewhere, such as in memory allocations, contention on locks protecting other data structures and parity calculations, so people often look for performance improvements elsewhere. There probably is room for additional improvements to our B-Tree code. However, those will likely happen slowly since people’s attention is often elsewhere.

> As for rand(), I've observed something similar. Is there a good alternative in the standard library?

Usually, the portable solution is to read /dev/urandom or /dev/random. This recently changed with glibc 2.36, which adopted the BSD arc4random functions:

https://www.gnu.org/software/libc/manual/html_node/High-Quality-Random.html
https://savannah.gnu.org/forum/forum.php?forum_id=10216

I just learned that these are usable on older versions of glibc, as well as on musl, through libbsd. The various BSDs support them, as do Solaris, illumos and macOS, so they are fairly portable now that glibc supports them too.
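
Usage is straightforward (glibc >= 2.36, the BSDs, macOS, or libbsd elsewhere):

#include <stdio.h>
#include <stdlib.h>     /* arc4random(), arc4random_uniform() */

int main(void)
{
        /* A uniform 32-bit value, and a bias-free value in [0, 1000). */
        printf("%u %u\n", arc4random(), arc4random_uniform(1000));
        return 0;
}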
