
Occ::get performance, redux #76

Merged: 4 commits from dikaiosune:master into rust-bio:master on Feb 9, 2017

Conversation

@anp (Contributor) commented Jul 25, 2016

I ended up writing a blog post about the tools I used for profiling the change I proposed to Occ::get. Fun story: @Aatch pointed out on Reddit that the slowdown in some other versions I had tried was probably due to the overhead of dealing with alignment in a vectorized version of the function, and noted that 3 may be an odd sample rate to test the FMIndex with. As an example, bowtie2's FM index appears to default to a sampling rate of 32, and I've seen a few cases of even higher rates in use as well.

The version of Occ::get proposed by Aatch is at approximate parity with the filter/count version for performance at a sampling rate of 32 (just a tiny bit slower), but it quickly gets much, much faster as the sampling rate increases. For example, with a sampling rate of 128 (in use in one of my applications):

filter/count, k = 128

test search_index_seeds ... bench:     179,442 ns/iter (+/- 1,947)

vectorized, k = 128

test search_index_seeds ... bench:     152,368 ns/iter (+/- 433)

filter/count, k = 64

test search_index_seeds ... bench:     105,141 ns/iter (+/- 1,665)

vectorized, k = 64

test search_index_seeds ... bench:      99,044 ns/iter (+/- 252)

filter/count, k = 32

test search_index_seeds ... bench:      74,427 ns/iter (+/- 5,166)

vectorized, k = 32

test search_index_seeds ... bench:      75,333 ns/iter (+/- 303)

Seems like an overall win to me. At k = 32 the vectorized averages are just slightly higher than the filter/count version, but well within the error margin reported by libtest.

The one downside here is that at smaller sampling rates, a non-vectorized version wins out:

filter/count, k = 8

test search_index_seeds ... bench:      40,598 ns/iter (+/- 1,170)

vectorized, k = 8

test search_index_seeds ... bench:      53,842 ns/iter (+/- 224)

Thoughts here? Without integer generics (at all) and specialization (on stable), I'm not sure it's possible to cleanly maximize performance for both small and large sampling rates. I'm biased here because I'm working with very large indices that need high sampling rates, but I'm not sure what other values of k are "in the wild."
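For reference, the two counting strategies being benchmarked above look roughly like this (a minimal sketch, not the library's exact code; the data and names here are illustrative):

fn main() {
    // Illustrative data only; in the library, `bwt` is a slice of the
    // Burrows-Wheeler transform and `a` is the character being counted.
    let bwt: Vec<u8> = b"ACGTACGTNACGT".to_vec();
    let (start, end, a) = (1usize, bwt.len(), b'A');

    // Iterator ("filter/count") style:
    let count_filter = bwt[start..end].iter().filter(|&&c| c == a).count();

    // Explicit-loop ("vectorized") style, which the optimizer can auto-vectorize:
    let mut count_loop = 0usize;
    for &c in &bwt[start..end] {
        if c == a {
            count_loop += 1;
        }
    }

    assert_eq!(count_filter, count_loop);
}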

@johanneskoester (Contributor) commented Jul 26, 2016

This is very interesting, Adam! To me, the speedups are too small to justify the use of the loop. I would hope that these differences disappear at some point due to compiler optimizations. But I might be mistaken. What is the real speedup you get in your application (just to get a feeling)?

@Aatch commented Jul 27, 2016

One change I'm investigating is making k a u32 instead of a usize. On a 64-bit platform, with the default settings, the way the vectorized code works means it is limited to processing 4 elements at a time, but limiting k to a u32 allows it to process 8 elements at a time. My testing shows that for k == 32 this gives the loop a reasonable speed advantage over the non-vectorized version.

For k == 32:

filter-count:
   test search_index_seeds ... bench:      33,397 ns/iter (+/- 1,498)

loop:
   test search_index_seeds ... bench:      37,146 ns/iter (+/- 1,266)

loop, k: u32:
   test search_index_seeds ... bench:      31,221 ns/iter (+/- 1,982)

For k == 64

filter-count:
   test search_index_seeds ... bench:      49,810 ns/iter (+/- 9,850)

loop, k: u32
   test search_index_seeds ... bench:      33,503 ns/iter (+/- 1,824)

And for smaller k it is still slower, but I don't think that's a real problem, since 32/64 are apparently the more common sizes.

filter-count:
   test search_index_seeds ... bench:      18,364 ns/iter (+/- 1,374)

loop, k: u32:
   test search_index_seeds ... bench:      27,241 ns/iter (+/- 1,781)

At this point, you do need a more in-depth benchmark to see the impact, as the whole point of this structure is to control the space/time tradeoff for counting occurrences. Cache effects and even paging are going to start being relevant.
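As a rough illustration of that tradeoff (an illustrative approximation, not the library's actual memory layout): for a BWT of n bytes over an alphabet of size sigma, the checkpoint table stores on the order of (n / k) * sigma counters, while each Occ::get scans at most k - 1 BWT bytes past the nearest checkpoint, so doubling k roughly halves the table and doubles the scan.

// Back-of-the-envelope helper (illustrative only, not library code).
fn approx_occ_counters(n: usize, sigma: usize, k: usize) -> usize {
    (n / k + 1) * sigma
}

fn main() {
    // e.g. a 25 GB DNA+N text (sigma = 5) at the sampling rates discussed above
    let n = 25_000_000_000usize;
    for &k in &[32usize, 64, 128] {
        println!("k = {:3}: ~{} checkpoint counters", k, approx_occ_counters(n, 5, k));
    }
}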

@anp (Contributor, Author) commented Jul 27, 2016

@johanneskoester I'm not entirely sure what you mean by "these differences disappear at some point due to compiler optimizations." Re: speedup in my application, the speed of the read mapping I'm doing is bounded primarily by the speed of Occ::get. Here's a flamegraph of my application's core benchmark:

[image: mgindex flamegraph]

So a ~15% speedup at k = 128 results in about a 12% performance increase for my use case. That said, I'm working with abnormally large indices and very high sample rates (metagenomics), so it's entirely reasonable to think it would be a bad idea to overspecialize the library for my uses.

@Aatch good point about 32-bit ints on x86_64. This would impose a limit on the size of the text the FM index can contain, however. In a pathological case (a text consisting of a single repeated character), I think it would mean you could index at most a 4GB text. If I'm thinking about this correctly, a more reasonable case of near-even character distribution would limit use to a ~20GB text, assuming a DNA + N alphabet. As I write this, I have an index building from a 25GB DNA+N input file, and I hope to go higher if I can make it work.
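(A quick sanity check of those limits; this is illustrative arithmetic only and assumes one byte per symbol and a u32 per-character occurrence count:)

fn main() {
    // A u32 occurrence count tops out at 2^32 - 1 (~4.29 billion).
    let max_count = u32::MAX as u64;
    // Pathological case: the text is one repeated character, so a single
    // counter has to cover the whole text -- roughly a 4 GiB cap.
    let pathological_cap = max_count + 1;
    // Near-even DNA + N distribution: each of the 5 characters hits the cap
    // separately, so the text can be about 5x larger -- roughly 20 GiB.
    let dna_n_cap = (max_count + 1) * 5;
    println!("{} bytes vs {} bytes", pathological_cap, dna_n_cap);
}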

Perhaps this is an argument for making the counting data structures generic over the integer type? IIRC, this is what SeqAn does for its indices. Ideally, though, that choice would be based on the size of the input text, no? I'd have to try, but I think it might be difficult to hide that implementation detail from a user without either integer generics or impl Trait.

Trying to improve performance for this crate has made me desperately want integer generics.

@Aatch commented Jul 27, 2016

@dikaiosune it only limits the size of the "chunks" the input is split into, not the size of the data itself. Given that you're currently splitting into 128-byte chunks, the fact that using u32 limits you to 4GB chunks doesn't seem too ridiculous. The reason for limiting k here is actually so you can use a u32 for the count variable instead of a usize: instead of the vectorized loop using two <2 x u64> vectors, it can use two <4 x u32> vectors and process 8 elements at once.

@anp (Contributor, Author) commented Jul 27, 2016

@Aatch If we use a u32 for count, then counting more than ~4 billion matching characters from the BWT will overflow, which is likely to happen when counting up to the final rows in very large BWTs. (I think.)

@anp (Contributor, Author) commented Jul 27, 2016

Note for later: for texts which are small enough, the same optimization could apply nicely to the SuffixArray impl as well. SeqAn lets you pick the type of the suffix array items so you can take advantage of 32-bit speed on <4GB texts.
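A hypothetical shape for that idea (the types below are illustrative sketches, not an existing API in this crate): make the stored position type a parameter, so texts under 2^32 bytes can use u32 entries and halve the suffix array's memory footprint.

// Hypothetical sketch only; rust-bio's actual SuffixArray types may differ.
struct SampledPositions<P> {
    positions: Vec<P>,
}

// Texts shorter than 2^32 bytes fit in u32 positions; larger texts need u64.
type SmallTextPositions = SampledPositions<u32>;
type LargeTextPositions = SampledPositions<u64>;

fn main() {
    let small = SmallTextPositions { positions: vec![0u32, 4, 9] };
    let large = LargeTextPositions { positions: vec![0u64, 4, 9] };
    println!("{} + {} entries", small.positions.len(), large.positions.len());
}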

@Aatch commented Jul 27, 2016

@dikaiosune it won't. The maximum value of count (the "manual count") is k - 1, since you only process k - 1 elements. You then add that to the stored count for the beginning of the block you're in. That stored count should remain a usize; you just cast the "manual count" to a usize before adding it to the stored count.

@anp (Contributor, Author) commented Jul 27, 2016

Aha! Yeah, that makes total sense.

@anp (Contributor, Author) commented Jul 27, 2016

So, to clarify, you're suggesting something like:

pub fn get(&self, bwt: &BWTSlice, r: usize, a: u8) -> usize {
    // self.k is our sampling rate, so find our last sampled checkpoint
    let i = r / self.k;
    let checkpoint = self.occ[i][a as usize];

    // find the portion of the BWT past the checkpoint which we need to count
    let start = (i * self.k) + 1;
    let end = r + 1;

    // count all the matching bytes between the closest checkpoint and our desired lookup
    let mut count: u32 = 0;
    for &x in &bwt[start..end] {
        if x == a {
            count += 1;
        }
    }
    // return the sampled checkpoint for this character + the manual count we just did
    checkpoint + (count as usize)
}

Is that accurate? Why does k also need to be a u32 here? I think sampling rates well below 4 billion are completely reasonable, just curious.

EDIT: Never mind; it's because it prevents overflow in case someone passes a completely bonkers value for k.

@Aatch commented Jul 27, 2016

@dikaiosune k only needs to be a u32 to ensure that count doesn't overflow.

@johanneskoester (Contributor) commented Jul 27, 2016

@dikaiosune by compiler optimizations I meant that at some point the iterator implementation might become as fast as the vectorized one. But maybe I'm mistaken and the compiler will never be able to do that...

@anp (Contributor, Author) commented Aug 14, 2016

@johanneskoester I'm in the process of talking to some people on IRC about the optimizations at play here, and to what extent the filter/count method might be better optimized in the future. Will post back when I have more info.

@anp (Contributor, Author) commented Aug 14, 2016

@johanneskoester So... after digging around, it looks like using a plain fold instead of map/fold or filter/count should vectorize: https://is.gd/jeOBHA -- turning on release mode + ASM shows that both the manual method and the fold method vectorize. Note, however, that this precludes using a u32 to accumulate the BWT count, which showed some impressive gains in @Aatch's benchmarking.
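For reference, the iterator shapes being compared look roughly like this (a sketch with illustrative data; per the linked experiment, the plain fold was the form that produced the same vectorized assembly as the manual loop):

fn main() {
    let bwt: Vec<u8> = b"ACGTACGTNACGT".to_vec();
    let a = b'A';

    // filter/count, as currently written:
    let c1 = bwt.iter().filter(|&&c| c == a).count();

    // map/fold:
    let c2 = bwt.iter().map(|&c| (c == a) as usize).fold(0, |acc, x| acc + x);

    // plain fold, which vectorized in the release-mode ASM experiment:
    let c3 = bwt.iter().fold(0usize, |acc, &c| acc + (c == a) as usize);

    assert!(c1 == c2 && c2 == c3);
}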

I think that, given the performance benefit of a u32 accumulator type and the fact that the manual loop seems to be more reliably optimized, the case for using a manual loop is pretty strong.

@Aatch commented Aug 15, 2016

@johanneskoester be careful not to fall into the "sufficiently smart compiler" trap. Given enough time and information, a compiler can theoretically produce perfectly optimised code, so the more useful question is what it can do now.

@johanneskoester (Contributor) commented Aug 15, 2016

Ok, I'm fine with u32 and a custom loop then. Thanks a lot for evaluating this!!

@johanneskoester (Contributor) commented Aug 15, 2016

Maybe you can add some comments to the code that explain why no iterators are used in the refined implementation. That way, we ensure that your insights on this snippet are not lost or reverted later.

@johanneskoester (Contributor) commented Sep 27, 2016

Any update on this?

@anp (Contributor, Author) commented Sep 30, 2016

@johanneskoester I'm hoping to clean this up this weekend.

Commit pushed: update to upstream
@coveralls commented Feb 5, 2017


Coverage increased (+0.02%) to 91.697% when pulling a8399a4 on dikaiosune:master into 796a583 on rust-bio:master.

@anp (Contributor, Author) commented Feb 5, 2017

I'm getting back to this after quite some time. To summarize my understanding of where this is:

  • count is more efficient in vectorized versions when stored as a u32
  • Occ.k needs to be a u32 to ensure that no one constructs Occ with a sampling rate which would cause count to overflow
  • Changing k to u32 means that the signature of Occ::new needs to change, which would affect people upgrading to the new version (see the sketch after this list). The tests in the module rely on type inference for numeric literals, so there's no need to change those, but downstream users will be affected. It's not a big change, and if you're regularly passing values greater than 4,000,000,000 for k, you almost certainly have broken code.
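
To make the API impact concrete, here's a rough sketch of the kind of call-site change involved (the signature shown is illustrative only, not necessarily the crate's exact parameter names or order):

use std::convert::TryFrom;

// Sketch: Occ::new takes the sampling rate as u32 instead of usize, e.g.
//     pub fn new(bwt: &BWTSlice, k: u32, alphabet: &Alphabet) -> Self
//
// Integer literals keep compiling via inference (Occ::new(&bwt, 32, &alphabet)),
// but callers holding the rate in a usize need an explicit conversion:
fn main() {
    let k_usize: usize = 128;
    let k: u32 = u32::try_from(k_usize).expect("sampling rate too large for u32");
    println!("sampling rate as u32: {}", k);
}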

@johanneskoester what do you think about the last point? It's a wrinkle that I don't think we covered when discussing previously. In the meantime I'll push a commit with a comment explaining the manual looping.

Commit pushed: Also, add comment explaining the need for a manual counter.
@coveralls commented Feb 5, 2017


Coverage increased (+0.02%) to 91.697% when pulling efef57f on dikaiosune:master into 796a583 on rust-bio:master.


@johanneskoester (Contributor) left a review comment

I'm fine with the changes (see minor comment below). API stability is not yet a major concern.

let mut count: u32 = 0;
for &x in &bwt[start..end] {
    if x == a {
        count += 1;
    }
}

@johanneskoester (Contributor) commented Feb 8, 2017

What happens if I do count += (x == a) as u32? I think it should be cheaper, because a plain addition avoids a potentially mispredicted branch, right?
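
For concreteness, the suggested branchless variant of the loop would look roughly like this (same bwt, start, end, and a as in the snippet above):

let mut count: u32 = 0;
for &x in &bwt[start..end] {
    // add the comparison result (0 or 1) instead of branching
    count += (x == a) as u32;
}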

@anp (Contributor, Author) commented Feb 9, 2017

From running benches on my 2016 MBP:

The current PR:

test search_index_seeds ... bench:      33,975 ns/iter (+/- 1,272)
test search_index_seeds ... bench:      33,912 ns/iter (+/- 3,118)
test search_index_seeds ... bench:      34,024 ns/iter (+/- 963)

average: ~33970 ns/iter

with RUSTFLAGS="-C target-cpu=native"

test search_index_seeds ... bench:      33,715 ns/iter (+/- 2,859)
test search_index_seeds ... bench:      33,954 ns/iter (+/- 2,944)
test search_index_seeds ... bench:      34,207 ns/iter (+/- 1,287)

average: ~33958 ns/iter

With the suggested change:

test search_index_seeds ... bench:      34,387 ns/iter (+/- 3,337)
test search_index_seeds ... bench:      33,873 ns/iter (+/- 2,172)
test search_index_seeds ... bench:      33,947 ns/iter (+/- 3,346)

average: ~34069 ns/iter

with RUSTFLAGS="-C target-cpu=native"

test search_index_seeds ... bench:      34,067 ns/iter (+/- 2,697)
test search_index_seeds ... bench:      34,046 ns/iter (+/- 1,759)
test search_index_seeds ... bench:      33,992 ns/iter (+/- 3,879)

average: ~34035 ns/iter

The difference between the average run times is minuscule (about 0.2-0.3%). However, the variance of the version you suggest is slightly higher on my machine (average variance of 1,784 ns as PR'd vs. 2,951 ns as suggested for a generic x86_64 target, and 2,363 ns vs. 2,778 ns with target-cpu=native), suggesting that the branch may actually have more predictable performance, at least on the random test data I included in that benchmark.

That said, I'm running these benches on a different machine than I had when I opened the PR, the Rust nightly version has changed, and I don't have time right now to dig into the generated assembly to see if there's anything wonky.

My sense is that, given the lack of clear evidence that the branchless version performs better, it would be better to opt for the version that more clearly communicates the intent of counting items matching our needle byte -- which I'd argue is the version currently submitted for review. That said, since it seems like a complete wash between the two (the 0.2% difference is well within the error bars of this measurement), I'm happy to switch if you'd prefer.

@johanneskoester (Contributor) commented Feb 9, 2017

I could not agree more. Let's keep it like it is. Ready for merging?

@anp (Contributor, Author) commented Feb 9, 2017

Yep!

@johanneskoester merged commit d1ef73b into rust-bio:master on Feb 9, 2017
2 checks passed:
  • continuous-integration/travis-ci/pr: The Travis CI build passed
  • coverage/coveralls: Coverage increased (+0.02%) to 91.697%
@johanneskoester (Contributor) commented Feb 9, 2017

Awesome, this is really great. By the way, if you have any software based on Rust-Bio, please let me know; we can link it on the homepage.

@anp (Contributor, Author) commented Feb 9, 2017

Glad to hear it, thanks!

I built some research code in my previous position, and rust-bio was phenomenally helpful, but I've since started a new job where I'm no longer working on it. If I hear they've published and/or open-sourced the project, I'll be sure to let you know!

@johanneskoester (Contributor) commented Feb 9, 2017

Oh no, that's a loss! I hope you can at least go on with Rust in your free time :-).

@anp (Contributor, Author) commented Feb 9, 2017

So far so good :D
