Occ::get performance, redux #76
Conversation
anp
added some commits
Jul 25, 2016
johanneskoester
Jul 26, 2016
Contributor
This is very interesting, Adam! To me, the speedups are too small to justify the use of the loop. I would hope that these differences disappear at some point due to compiler optimizations. But I might be mistaken. What is the real speedup you get in your application (just to get a feeling)?
Aatch
Jul 27, 2016
One change I'm investigating is making k a u32 instead of a usize. On a 64-bit platform with the default settings, the way the vectorized code works means it is limited to processing 4 elements at a time, but limiting k to a u32 allows it to process 8 elements at a time. My testing shows that for k == 32, this increases the speed of the loop over the non-vectorized version by a reasonable amount.
For k == 32:
filter-count:
test search_index_seeds ... bench: 33,397 ns/iter (+/- 1,498)
loop:
test search_index_seeds ... bench: 37,146 ns/iter (+/- 1,266)
loop, k: u32:
test search_index_seeds ... bench: 31,221 ns/iter (+/- 1,982)
For k == 64
filter-count:
test search_index_seeds ... bench: 49,810 ns/iter (+/- 9,850)
loop, k: u32
test search_index_seeds ... bench: 33,503 ns/iter (+/- 1,824)
And for smaller k it is still slower, but I don't think that's a real problem, since 32/64 are apparently the more common sizes.
filter-count:
test search_index_seeds ... bench: 18,364 ns/iter (+/- 1,374)
loop, k: u32:
test search_index_seeds ... bench: 27,241 ns/iter (+/- 1,781)
At this point, you do need a more in-depth benchmark to see the impact, as the whole point of this structure is to control the space/time tradeoff for counting occurrences. Cache effects and even paging are going to start being relevant.
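The two loop shapes being benchmarked above can be sketched as follows. This is illustrative only, not the PR's actual code: the function names and the tiny driver are made up, and the real loops run over a slice of the BWT between two checkpoints.

```rust
// usize accumulator: on x86_64, LLVM vectorizes this with 64-bit lanes,
// so each vector register holds fewer elements.
fn count_usize(bwt: &[u8], a: u8) -> usize {
    let mut count: usize = 0;
    for &x in bwt {
        if x == a {
            count += 1;
        }
    }
    count
}

// u32 accumulator: 32-bit lanes fit twice as many elements per register,
// which is where the k == 32 / k == 64 speedups reported above come from.
fn count_u32(bwt: &[u8], a: u8) -> u32 {
    let mut count: u32 = 0;
    for &x in bwt {
        if x == a {
            count += 1;
        }
    }
    count
}

fn main() {
    let bwt = b"ACGTACGTAACC";
    // 'A' occurs at indices 0, 4, 8, 9.
    assert_eq!(count_usize(bwt, b'A'), 4);
    assert_eq!(count_u32(bwt, b'A'), 4);
}
```

The two functions compute the same result; the difference is purely in what width the optimizer can use for the vector lanes.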
anp
Jul 27, 2016
Contributor
@johanneskoester I'm not entirely sure what you mean by "these differences disappear at some point due to compiler optimizations." Re: speedup in my application, the speed of the read mapping I'm doing is bounded primarily by the speed of Occ::get. Here's a flamegraph of my application's core benchmark:
[flamegraph image]
So a 15% increase at k=128 results in about a 12% or so performance increase for my use case. That said, I'm working with abnormally large indices with very high sample rates (metagenomics), so it's entirely reasonable to think that it's a bad idea to overspecialize the library for my uses.
@Aatch good point about 32-bit ints on x86_64. This would impose a limitation on the size of the text the FM Index can contain, however. In a pathological case (a text consisting of a single repeated character), I think it would mean that you could index at most a 4GB text. If I'm thinking correctly, for a more reasonable case of a near-even distribution, it would limit use to a ~20GB text, assuming a DNA + N alphabet. As I write this, I have an index building from a 25GB DNA+N input file, and I hope to go higher if I can make it work.
Perhaps this is an argument for making the counting data structures generic over integer type? IIRC, this is what SeqAn does for their indices. Ideally though that would be based on the size of the input text, no? I'd have to try but I think it might be difficult to hide that implementation detail from a user without having either integer generics or impl Trait.
Trying to improve performance for this crate has made me desperately want integer generics.
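Lacking integer generics, one workaround available today is ordinary trait-based generics over the count type. The sketch below is entirely hypothetical: the CountInt trait and every name in it are invented for illustration and are not part of rust-bio or SeqAn.

```rust
use std::ops::Add;

// Hypothetical trait abstracting over the integer used to accumulate counts.
trait CountInt: Copy + Add<Output = Self> {
    fn zero() -> Self;
    fn one() -> Self;
    fn to_usize(self) -> usize;
}

impl CountInt for u32 {
    fn zero() -> Self { 0 }
    fn one() -> Self { 1 }
    fn to_usize(self) -> usize { self as usize }
}

impl CountInt for u64 {
    fn zero() -> Self { 0 }
    fn one() -> Self { 1 }
    fn to_usize(self) -> usize { self as usize }
}

// A counting routine parameterized over accumulator width: a caller indexing
// a text that fits in u32 could pick u32 lanes, larger texts u64.
fn count_matches<C: CountInt>(bwt: &[u8], a: u8) -> usize {
    let mut count = C::zero();
    for &x in bwt {
        if x == a {
            count = count + C::one();
        }
    }
    count.to_usize()
}

fn main() {
    let bwt = b"ACGTAAC";
    assert_eq!(count_matches::<u32>(bwt, b'A'), 3);
    assert_eq!(count_matches::<u64>(bwt, b'C'), 2);
}
```

As noted in the comment, the awkward part is picking the width from the input size at runtime without leaking the type parameter into the public API, which is where impl Trait or integer generics would help.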
Aatch
Jul 27, 2016
@dikaiosune it only limits the size of the "chunks" the input is split into, not the size of the data itself. Given that you're currently splitting into 128-byte chunks, the fact that using u32 limits you to 4GB chunks doesn't seem too ridiculous. The reason for limiting k here is actually so you can use a u32 for the count variable instead of a usize, meaning that instead of the vectorized loop using two <2 x u64> vectors, it can use two <4 x u32> vectors and process 8 elements at once instead.
anp
Jul 27, 2016
Contributor
@Aatch If we use a u32 for count, then counting >4B matching characters from the BWT will overflow, which is likely to happen when counting up to the final rows of very large BWTs. (I think)
anp
Jul 27, 2016
Contributor
Note for later: for texts which are small enough, the same optimization could apply nicely to the SuffixArray impl as well. SeqAn lets you pick the type of the suffix array items so you can take advantage of 32-bit speed on <4GB texts.
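The same width trade-off for suffix-array positions can be sketched as below. This is purely illustrative: build_small_sa is an invented name with a naive O(n² log n) construction, not the rust-bio SuffixArray API.

```rust
// Store suffix-array positions as u32 when the text fits, halving memory
// versus usize on 64-bit platforms and improving cache behavior.
fn build_small_sa(text: &[u8]) -> Option<Vec<u32>> {
    if text.len() > u32::MAX as usize {
        return None; // text too large for 32-bit positions
    }
    let mut sa: Vec<u32> = (0..text.len() as u32).collect();
    // Naive comparison-sort construction, for illustration only.
    sa.sort_by(|&i, &j| text[i as usize..].cmp(&text[j as usize..]));
    Some(sa)
}

fn main() {
    let sa = build_small_sa(b"banana").unwrap();
    // Suffixes in sorted order: "a", "ana", "anana", "banana", "na", "nana"
    assert_eq!(sa, vec![5, 3, 1, 0, 4, 2]);
}
```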
Aatch
Jul 27, 2016
@dikaiosune it won't. The maximum value of count (the "manual count") is k - 1, since you only process k - 1 elements. You then add that to the stored count for the beginning of the block you're in. That stored count should continue to be a usize, you just cast the "manual count" to a usize before adding it to the stored count.
Aha! Yeah, that makes total sense.
anp
Jul 27, 2016
Contributor
So, to clarify, you're suggesting something like:
pub fn get(&self, bwt: &BWTSlice, r: usize, a: u8) -> usize {
// self.k is our sampling rate, so find our last sampled checkpoint
let i = r / self.k;
let checkpoint = self.occ[i][a as usize];
// find the portion of the BWT past the checkpoint which we need to count
let start = (i * self.k) + 1;
let end = r + 1;
// count all the matching bytes b/t the closest checkpoint and our desired lookup
let mut count: u32 = 0;
for &x in &bwt[start..end] {
if x == a {
count += 1;
}
}
// return the sampled checkpoint for this character + the manual count we just did
checkpoint + (count as usize)
}
Is that accurate? Why does k also need to be a u32 here? I think that <4B sampling rates are completely reasonable, just curious.
EDIT: Nevermind. Because it prevents overflow in case someone passes a completely bonkers value for k.
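To make the checkpoint-plus-manual-count scheme concrete, here is a self-contained toy version of a sampled occurrence table. ToyOcc and everything in it is illustrative, not the rust-bio implementation, and it uses a slightly different checkpoint convention than the snippet above: checkpoint i here covers bwt[..i*k], so the manual count starts at i*k rather than i*k + 1.

```rust
struct ToyOcc {
    occ: Vec<[usize; 256]>, // checkpointed counts, one entry every k positions
    k: usize,               // sampling rate
}

impl ToyOcc {
    fn new(bwt: &[u8], k: usize) -> Self {
        let mut occ = Vec::new();
        let mut counts = [0usize; 256];
        for (i, &c) in bwt.iter().enumerate() {
            if i % k == 0 {
                occ.push(counts); // counts of bwt[..i] at this checkpoint
            }
            counts[c as usize] += 1;
        }
        ToyOcc { occ, k }
    }

    /// Count occurrences of `a` in bwt[..=r].
    fn get(&self, bwt: &[u8], r: usize, a: u8) -> usize {
        let i = r / self.k;
        let checkpoint = self.occ[i][a as usize];
        // Count the remainder past the checkpoint manually, using a narrow
        // accumulator as in the loop above, then widen before adding.
        let mut count: u32 = 0;
        for &x in &bwt[i * self.k..=r] {
            if x == a {
                count += 1;
            }
        }
        checkpoint + count as usize
    }
}

fn main() {
    let bwt = b"ACGTAACGGT";
    let occ = ToyOcc::new(bwt, 4);
    // 'A' occurs at indices 0, 4, 5.
    assert_eq!(occ.get(bwt, 9, b'A'), 3);
    assert_eq!(occ.get(bwt, 5, b'A'), 3);
}
```

The space/time trade-off discussed throughout the thread is visible here: a larger k means fewer checkpoint rows stored but a longer manual count per lookup.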
Aatch
Jul 27, 2016
@dikaiosune
johanneskoester
Jul 27, 2016
Contributor
@dikaiosune with compiler optimizations I meant that at some point the iterator implementation might become as fast as the vectorized one. But maybe I'm mistaken and the compiler will never be able to do that...
anp
Aug 14, 2016
Contributor
@johanneskoester I'm in the process of talking to some people on IRC about the optimizations at play here, and to what extent the filter/count method might be better optimized in the future. Will post back when I have more info.
anp
Aug 14, 2016
Contributor
@johanneskoester So... after digging around, it looks like using plain fold instead of map/fold or filter/count should vectorize: https://is.gd/jeOBHA -- turning on release mode + ASM shows that both the manual method and the fold method vectorize. Note however that this precludes using a u32 to accumulate the BWT count -- which does show some impressive gains from @Aatch's benchmarking.
I think that given the performance benefit from having a u32 accumulator type, and the fact that the manual loop seems to be more reliably optimized, the advantage of using a manual loop seems pretty strong.
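The three formulations being compared can be sketched as follows; bwt and a are stand-ins, and which versions vectorize depends on the compiler version, as discussed above.

```rust
// Iterator formulation using filter/count, as in the original code.
fn count_filter(bwt: &[u8], a: u8) -> usize {
    bwt.iter().filter(|&&x| x == a).count()
}

// Plain fold, the formulation reported above to vectorize reliably.
fn count_fold(bwt: &[u8], a: u8) -> usize {
    bwt.iter().fold(0usize, |acc, &x| acc + (x == a) as usize)
}

// Manual loop with a u32 accumulator, the version proposed in the PR.
fn count_manual(bwt: &[u8], a: u8) -> u32 {
    let mut count: u32 = 0;
    for &x in bwt {
        if x == a {
            count += 1;
        }
    }
    count
}

fn main() {
    let bwt = b"ACGTACGT";
    // 'G' occurs at indices 2 and 6.
    assert_eq!(count_filter(bwt, b'G'), 2);
    assert_eq!(count_fold(bwt, b'G'), 2);
    assert_eq!(count_manual(bwt, b'G'), 2);
}
```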
Aatch
Aug 15, 2016
@johanneskoester be careful to not fall into the "sufficiently smart compiler" trap. Given enough time and information, a compiler can theoretically produce perfectly-optimised code. Therefore, it's more important to consider what it can do now.
johanneskoester
Aug 15, 2016
Contributor
Ok, I'm fine with u32 and a custom loop then. Thanks a lot for evaluating this!!
johanneskoester
Aug 15, 2016
Contributor
Maybe you can add some comments to the code that explain why no iterators are used in the refined implementation. This way, we ensure that your insights on this snippet are not lost or even reverted at later times.
Any update on this?
@johanneskoester I'm hoping to clean this up this weekend.
johanneskoester added the enhancement and WIP labels Dec 9, 2016
coveralls
commented
Feb 5, 2017
anp
Feb 5, 2017
Contributor
I'm getting back to this after quite some time. To summarize my understanding of where this is:
- count is more efficient in vectorized versions when stored as a u32
- Occ.k needs to be a u32 to ensure that no one constructs Occ with a sampling rate which would cause count to overflow
- Changing k to u32 means that the signature of Occ::new needs to change, which would affect people upgrading to the new version. The tests in the module rely on type inference for numeric literals, so there's no need to change those, but downstream users will be affected. It's not a big change, and if you're regularly passing values greater than 4,000,000,000 for k, then you almost certainly have broken code.

@johanneskoester what do you think about the last point? It's a wrinkle that I don't think we covered when discussing previously. In the meantime I'll push a commit with a comment explaining the manual looping.
coveralls
commented
Feb 5, 2017
•
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
coveralls
commented
Feb 5, 2017
johanneskoester
requested changes
Feb 8, 2017
I'm fine with the changes (see minor comment below). API stability is not yet a major concern.
let mut count: u32 = 0;
for &x in &bwt[start..end] {
    if x == a {
        count += 1;
johanneskoester
Feb 8, 2017
Contributor
What happens if I do count += (x == a) as u32? I think it should be cheaper, because an unconditional addition avoids a potentially mispredicted branch.
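The two variants being compared here can be sketched side by side; the names are illustrative and the real loops run between Occ checkpoints.

```rust
// Branchy counting loop, as in the PR.
fn count_branchy(bwt: &[u8], a: u8) -> u32 {
    let mut count: u32 = 0;
    for &x in bwt {
        if x == a {
            count += 1;
        }
    }
    count
}

// Branchless form: casting the comparison result removes the explicit
// branch from the source, though the optimizer often emits comparable
// vectorized code for both shapes.
fn count_branchless(bwt: &[u8], a: u8) -> u32 {
    let mut count: u32 = 0;
    for &x in bwt {
        count += (x == a) as u32;
    }
    count
}

fn main() {
    let bwt = b"AACGTA";
    // 'A' occurs at indices 0, 1, 5.
    assert_eq!(count_branchy(bwt, b'A'), 3);
    assert_eq!(count_branchless(bwt, b'A'), 3);
}
```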
anp
Feb 9, 2017
Contributor
From running benches on my 2016 MBP.
The current PR:
test search_index_seeds ... bench: 33,975 ns/iter (+/- 1,272)
test search_index_seeds ... bench: 33,912 ns/iter (+/- 3,118)
test search_index_seeds ... bench: 34,024 ns/iter (+/- 963)
average: ~33970 ns/iter
with RUSTFLAGS="-C target-cpu=native"
test search_index_seeds ... bench: 33,715 ns/iter (+/- 2,859)
test search_index_seeds ... bench: 33,954 ns/iter (+/- 2,944)
test search_index_seeds ... bench: 34,207 ns/iter (+/- 1,287)
average: ~33958 ns/iter
With the suggested change:
test search_index_seeds ... bench: 34,387 ns/iter (+/- 3,337)
test search_index_seeds ... bench: 33,873 ns/iter (+/- 2,172)
test search_index_seeds ... bench: 33,947 ns/iter (+/- 3,346)
average: ~34069 ns/iter
with RUSTFLAGS="-C target-cpu=native"
test search_index_seeds ... bench: 34,067 ns/iter (+/- 2,697)
test search_index_seeds ... bench: 34,046 ns/iter (+/- 1,759)
test search_index_seeds ... bench: 33,992 ns/iter (+/- 3,879)
average: ~34035 ns/iter
The difference is miniscule (about 0.2-0.3%) between the average run times. However, the variance of the version you suggest is slightly higher on my machine (average variance of 1784 ns as PR'd vs. 2951 ns as suggested for a generic x86_64 arch, and 2363 ns vs 2778 ns for the target-cpu=native arch), suggesting that the branch may actually have more predictable performance, at least on the random test data I included in that benchmark.
That said, I'm running these benches on a different machine than I had when I opened the PR, the Rust nightly version has changed, and I don't have time right now to dig into the generated assembly to see if there's anything wonky.
My sense is that given a lack of clear evidence for the branchless version performing better, it would be better to opt for the version which more clearly communicates the intent of counting items which match our needle byte -- which I'd argue is the version currently submitted for review. That said, since it seems like a complete wash between the two versions (the 0.2% difference is well within the error bars of this measurement), I'm happy to switch it if you'd prefer.
johanneskoester
Feb 9, 2017
Contributor
I could not agree more. Let's keep it like it is. Ready for merging?
johanneskoester merged commit d1ef73b into rust-bio:master Feb 9, 2017
johanneskoester
Feb 9, 2017
Contributor
Awesome, this is really great. Btw. if you have any software based on Rust-Bio, please let me know. We can link it on the homepage.
anp
Feb 9, 2017
Contributor
Glad to hear it, thanks!
I built some research code in my previous position and rust-bio was phenomenally helpful, but I've since started a new job where I'm not working on it anymore. If I hear they've published and/or open-sourced the project, I will be sure to let you know!
johanneskoester
Feb 9, 2017
Contributor
Oh no, that's a loss! I hope you can at least go on with Rust in your free time :-).
So far so good :D
anp commented Jul 25, 2016
I ended up writing a blog post about the tools I used for profiling the change I proposed to Occ::get. Fun story, @Aatch pointed out on Reddit that the slowdown in some other versions I had tried was probably due to the overhead of dealing with alignment in a vectorized version of the function, and noted that 3 may be an odd sample rate to test with the FMIndex. As an example, bowtie2's FM index appears to default to a sampling rate of 32, and I've seen a few cases of using higher rates as well.
The version of Occ::get proposed by Aatch is at approximate parity with the filter/count version for performance at a sampling rate of 32 (just a tiny bit slower), but it quickly gets much, much faster as the sampling rate increases. For example, with a sampling rate of 128 (in use in one of my applications):
filter/count, k = 128
vectorized, k = 128
filter/count, k = 64
vectorized, k = 64
filter/count, k = 32
vectorized, k = 32
Seems like an overall win to me. At a k of 32, the averages are just slightly higher than the filter/count version, but well within the error margin reported by libtest. The one downside here is that at smaller sampling rates, a non-vectorized version wins out:
filter/count, k = 8
vectorized, k = 8
Thoughts here? Without integer generics (at all) and specialization (in stable), I'm not sure it's possible to cleanly maximize performance for both small sampling rates and large ones. I'm biased here because I'm working with very large indices that need high sampling rates, but I'm not sure what other values of k are "in the wild."