-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MISC] simplify enumerating all bins in a binning_bitvector #2930
Conversation
This pull request is being automatically deployed with Vercel (learn more). 🔍 Inspect: https://vercel.com/seqan/seqan3/4zhzTzrv7FZKp44KQDnEHN1PLwcN |
Codecov Report
@@ Coverage Diff @@
## master #2930 +/- ##
=======================================
Coverage 98.28% 98.28%
=======================================
Files 267 267
Lines 11462 11466 +4
=======================================
+ Hits 11265 11269 +4
Misses 197 197
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGFM, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice!
Just one 💅
Please merge when satisfied
Even though there are much fewer and easier to understand LOC as well as fewer instructions in the assembly, the benchmark isn't always faster. However, the benchmark parameters are quite small.
Click for details
Master
$ interleaved_bloom_filter_benchmark --benchmark_enable_random_interleaving=true --benchmark_repetitions=100 --benchmark_min_time=0.1 --benchmark_filter=bulk_count_benchmark --benchmark_display_aggregates_only=yes | grep --color=auto mean | sort
2022-01-25T17:40:17+01:00
Running interleaved_bloom_filter_benchmark
Run on (16 X 3892.69 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 512 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.17, 0.35, 0.28
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::compressed>>/64/16384/2/1000_mean 16095 ns 16095 ns 100 hashes/sec=62.1336M/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::compressed>>/64/512/2/1000_mean 16109 ns 16109 ns 100 hashes/sec=62.0838M/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::compressed>>/8192/128/2/1000_mean 1209346 ns 1209351 ns 100 hashes/sec=826.976k/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::compressed>>/8192/4/2/1000_mean 1208471 ns 1208476 ns 100 hashes/sec=827.572k/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::uncompressed>>/64/16384/2/1000_mean 7562 ns 7562 ns 100 hashes/sec=132.262M/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::uncompressed>>/64/512/2/1000_mean 7052 ns 7052 ns 100 hashes/sec=141.822M/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::uncompressed>>/8192/128/2/1000_mean 294799 ns 294800 ns 100 hashes/sec=3.39248M/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::uncompressed>>/8192/4/2/1000_mean 292890 ns 292891 ns 100 hashes/sec=3.41471M/s
PR
$ interleaved_bloom_filter_benchmark --benchmark_enable_random_interleaving=true --benchmark_repetitions=100 --benchmark_min_time=0.1 --benchmark_filter=bulk_count_benchmark --benchmark_display_aggregates_only=yes | grep mean | sort
2022-01-25T17:36:40+01:00
Running interleaved_bloom_filter_benchmark
Run on (16 X 3892.69 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 512 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.24, 0.25, 0.22
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::compressed>>/64/16384/2/1000_mean 16028 ns 16028 ns 100 hashes/sec=62.3961M/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::compressed>>/64/512/2/1000_mean 16024 ns 16024 ns 100 hashes/sec=62.413M/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::compressed>>/8192/128/2/1000_mean 1257416 ns 1257420 ns 100 hashes/sec=795.442k/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::compressed>>/8192/4/2/1000_mean 1258726 ns 1258733 ns 100 hashes/sec=794.643k/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::uncompressed>>/64/16384/2/1000_mean 7330 ns 7330 ns 100 hashes/sec=136.466M/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::uncompressed>>/64/512/2/1000_mean 6815 ns 6815 ns 100 hashes/sec=146.759M/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::uncompressed>>/8192/128/2/1000_mean 324572 ns 324574 ns 100 hashes/sec=3.08282M/s
bulk_count_benchmark<seqan3::interleaved_bloom_filter<seqan3::data_layout::uncompressed>>/8192/4/2/1000_mean 321442 ns 321444 ns 100 hashes/sec=3.11216M/s
uncompressed count | master | PR |
---|---|---|
64/16384/2/1000_mean | 132.262 M/s | 136.466 M/s |
64/512/2/1000_mean | 141.822 M/s | 146.759 M/s |
8192/128/2/1000_mean | 3.39248 M/s | 3.08282 M/s |
8192/4/2/1000_mean | 3.41471 M/s | 3.11216 M/s |
compressed count | master | PR |
---|---|---|
64/16384/2/1000_mean | 62.1336 M/s | 62.3961 M/s |
64/512/2/1000_mean | 62.0838 M/s | 62.4130 M/s |
8192/128/2/1000_mean | 826.976 k/s | 795.442 k/s |
8192/4/2/1000_mean | 827.572 k/s | 794.643 k/s |
// Each iteration can handle 64 bits, so we need to iterate `((rhs.size() + 63) >> 6` many times | ||
for (size_t batch = 0, bin = 0; batch < ((rhs.size() + 63) >> 6); bin = 64 * ++batch) | ||
// Jump to the next 1 and return the number of places jumped in the bit_sequence | ||
auto jump_to_next_1bit = [](size_t & x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto jump_to_next_1bit = [](size_t & x) | |
auto jump_to_next_1bit = [] (size_t & x) |
Honestly, wasn't expecting this. Thank you! I think you misunderstood me, the old code wasn't any more suboptimal than the current one, it was just harder to read. As you can see here, the https://godbolt.org/z/o64srTo38 instructions are pretty much the same (the compiler seems to optimise the additional if conditional completely out). What I meant with double the instructions was when converting this double, nested loop into an iterator approach. So all the differences should just be measurement variances. |
But the iterator-stuff is follow-up? |
No, this was the reason for me to drop the iterator idea completely and make it a callback approach, if we want to expose a way to enumerate the bins in the binning_vector. |
Ah alright, so
|
This PR simplifies how the bins of a binning_bitvector are enumerated.
It removes the special case if the bit_sequence is
1 << 63
(most significant bit set) by realizing that the before mentioned UB (undefined behaviour) only happens when shifting by1 << 64
, any smaller shift value is fine.This isn't the case here, as
std::countr_zero
will only return64
for0
, but this case is checked by the for loop. So we know that the number of zeros is at most<= 63
. The second insight is to do a secondtmp >>= 1
instead of incrementingstep++
before the actual shift, this will ensure that the shifted by value will be at most<= 63
and thus eliminating the previous edge case.Idea moved to seqan/product_backlog#412