Occ::get performance, redux #76

Merged
merged 4 commits into rust-bio:master from dikaiosune:master on Feb 9, 2017

Conversation

4 participants
@anp
Contributor

anp commented Jul 25, 2016

I ended up writing a blog post about the tools I used for profiling the change I proposed to Occ::get. Fun story, @Aatch pointed out on Reddit that the slowdown in some other versions I had tried was probably due to the overhead of dealing with alignment in a vectorized version of the function, and noted that 3 may be an odd sample rate to test with the FMIndex. As an example, bowtie2's FM index appears to default to a sampling rate of 32, and I've seen a few cases of using higher rates as well.

The version of Occ::get proposed by Aatch is roughly at performance parity with the filter/count version at a sampling rate of 32 (just a tiny bit slower), but it quickly gets much, much faster as the sampling rate increases. For example, with a sampling rate of 128 (which is in use in one of my applications):

filter/count, k = 128
   test search_index_seeds ... bench:     179,442 ns/iter (+/- 1,947)

vectorized, k = 128
   test search_index_seeds ... bench:     152,368 ns/iter (+/- 433)

filter/count, k = 64
   test search_index_seeds ... bench:     105,141 ns/iter (+/- 1,665)

vectorized, k = 64
   test search_index_seeds ... bench:      99,044 ns/iter (+/- 252)

filter/count, k = 32
   test search_index_seeds ... bench:      74,427 ns/iter (+/- 5,166)

vectorized, k = 32
   test search_index_seeds ... bench:      75,333 ns/iter (+/- 303)

Seems like an overall win to me. At k = 32, the vectorized average is just slightly higher than the filter/count version's, but well within the error margin reported by libtest.

The one downside here is that at smaller sampling rates, a non-vectorized version wins out:

filter/count, k = 8
   test search_index_seeds ... bench:      40,598 ns/iter (+/- 1,170)

vectorized, k = 8
   test search_index_seeds ... bench:      53,842 ns/iter (+/- 224)

Thoughts here? Without integer generics (at all) or specialization (in stable), I'm not sure it's possible to cleanly maximize performance for both small and large sampling rates. I'm biased here because I'm working with very large indices that need high sampling rates, but I'm not sure what other values of k are "in the wild."
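For context, here is a minimal sketch of the filter/count shape used as the baseline above. The struct and field names mirror the snippet quoted later in this thread; it is illustrative, not the exact crate code.

struct Occ {
    occ: Vec<Vec<usize>>, // sampled checkpoint counts, one row per checkpoint
    k: usize,             // sampling rate
}

impl Occ {
    fn get(&self, bwt: &[u8], r: usize, a: u8) -> usize {
        // nearest checkpoint at or below r
        let i = r / self.k;
        // checkpointed count up to position i * k, plus a linear scan of the remainder
        self.occ[i][a as usize] + bwt[(i * self.k) + 1..r + 1].iter().filter(|&&x| x == a).count()
    }
}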

@johanneskoester

Contributor

johanneskoester commented Jul 26, 2016

This is very interesting, Adam! To me, the speedups are too small to justify the use of the loop. I would hope that these differences disappear at some point due to compiler optimizations. But I might be mistaken. What is the real speedup you get in your application (just to get a feeling)?

@Aatch

Aatch commented Jul 27, 2016

One change I'm investigating is making k a u32 instead of a usize. On a 64-bit platform, with the default settings, the way the vectorized code works means that it is limited to processing 4 elements at a time, but limiting k to a u32 allows it to process 8 elements at a time. My testing shows that for k == 32, this makes the loop reasonably faster than the non-vectorized version.

For k == 32:

filter-count:
   test search_index_seeds ... bench:      33,397 ns/iter (+/- 1,498)

loop:
   test search_index_seeds ... bench:      37,146 ns/iter (+/- 1,266)

loop, k: u32:
   test search_index_seeds ... bench:      31,221 ns/iter (+/- 1,982)

For k == 64:

filter-count:
   test search_index_seeds ... bench:      49,810 ns/iter (+/- 9,850)

loop, k: u32:
   test search_index_seeds ... bench:      33,503 ns/iter (+/- 1,824)

And for smaller k it is still slower, but I don't think that's a real problem, since 32/64 are apparently the more common sizes.

filter-count:
   test search_index_seeds ... bench:      18,364 ns/iter (+/- 1,374)

loop, k: u32:
   test search_index_seeds ... bench:      27,241 ns/iter (+/- 1,781)

At this point, you do need a more in-depth benchmark to see the impact, as the whole point of this structure is to control the space/time tradeoff for counting occurrences. Cache effects and even paging are going to start being relevant.

@anp

Contributor

anp commented Jul 27, 2016

@johanneskoester I'm not entirely sure what you mean by "these differences disappear at some point due to compiler optimizations." Re: speedup in my application, the speed of the read mapping I'm doing is bounded primarily by the speed of Occ::get. Here's a flamegraph of my application's core benchmark:

mgindex flamegraph

So a 15% increase at k=128 translates into roughly a 12% performance increase for my use case. That said, I'm working with abnormally large indices with very high sample rates (metagenomics), so it's entirely reasonable to think that it's a bad idea to overspecialize the library for my uses.

@Aatch good point about 32-bit ints on x86_64. This would impose a limitation on the size of the text the FM Index can contain, however. In a pathological case (a text consisting entirely of a single character), I think it would mean that you could index at most a 4GB text. If I'm thinking correctly, for a more reasonable case of a near-even distribution, it would limit use to a 20GB text, assuming a DNA + N alphabet. As I write this, I have an index building from a 25GB DNA+N input file, and I hope to go higher if I can make it work.
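(For what it's worth, the back-of-the-envelope arithmetic behind those 4GB/20GB figures looks roughly like the following; this sketch assumes the per-character counts themselves are stored as u32, which is the premise @Aatch clarifies in the next comment.)

fn main() {
    // a u32 occurrence counter tops out at ~4.29 billion
    let u32_max = std::u32::MAX as u64;
    // pathological text of one repeated character: its count equals the text length, so ~4 GB
    let worst_case_bytes = u32_max;
    // near-even distribution over a DNA + N alphabet (5 symbols): each count is ~length / 5,
    // so the limit is roughly 5x higher, ~21 GB -- the "20GB" figure above
    let even_case_bytes = u32_max * 5;
    println!("worst case: {} bytes, even case: {} bytes", worst_case_bytes, even_case_bytes);
}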

Perhaps this is an argument for making the counting data structures generic over integer type? IIRC, this is what SeqAn does for their indices. Ideally though that would be based on the size of the input text, no? I'd have to try but I think it might be difficult to hide that implementation detail from a user without having either integer generics or impl Trait.

Trying to improve performance for this crate has made me desperately want integer generics.

@Aatch

Aatch commented Jul 27, 2016

@dikaiosune it only limits the size of the "chunks" the input is split into, not the size of the data itself. Given that you're currently splitting into 128-byte chunks, the fact that using u32 limits you to 4GB chunks doesn't seem too ridiculous. The reason for limiting k here is actually so you can use a u32 for the count variable instead of a usize, meaning that instead of the vectorized loop using two <2 x u64> vectors, it can use two <4 x u32> vectors and process 8 elements at once.

@anp

Contributor

anp commented Jul 27, 2016

@Aatch If we use a u32 for count, then counting more than 4 billion matching characters from the BWT will overflow, which is likely to happen when counting up to the final rows of very large BWTs. (I think)

@anp

Contributor

anp commented Jul 27, 2016

Note for later: for texts which are small enough, the same optimization could apply nicely to the SuffixArray impl as well. SeqAn lets you pick the type of the suffix array items so you can take advantage of 32-bit speed on <4GB texts.

@Aatch

Aatch commented Jul 27, 2016

@dikaiosune it won't. The maximum value of count (the "manual count") is k - 1, since you only process k - 1 elements. You then add that to the stored count for the beginning of the block you're in. That stored count should continue to be a usize; you just cast the "manual count" to a usize before adding it to the stored count.

@anp

Contributor

anp commented Jul 27, 2016

Aha! Yeah, that makes total sense.

@anp

Contributor

anp commented Jul 27, 2016

So, to clarify, you're suggesting something like:

pub fn get(&self, bwt: &BWTSlice, r: usize, a: u8) -> usize {
    // self.k is our sampling rate, so find our last sampled checkpoint
    let i = r / self.k;
    let checkpoint = self.occ[i][a as usize];

    // find the portion of the BWT past the checkpoint which we need to count
    let start = (i * self.k) + 1;
    let end = r + 1;

    // count all the matching bytes b/t the closest checkpoint and our desired lookup
    let mut count: u32 = 0;
    for &x in &bwt[start..end] {
        if x == a {
            count += 1;
        }
    }
    // return the sampled checkpoint for this character + the manual count we just did
    checkpoint + (count as usize)
}

Is that accurate? Why does k also need to be a u32 here? I think that <4B sampling rates are completely reasonable, just curious.

EDIT: Nevermind. Because it prevents overflow in case someone passes a completely bonkers value for k.

@Aatch

Aatch commented Jul 27, 2016

@dikaiosune k only needs to be a u32 to ensure that count doesn't overflow.

@johanneskoester

Contributor

johanneskoester commented Jul 27, 2016

@dikaiosune by "compiler optimizations" I meant that at some point the iterator implementation might become as fast as the vectorized one. But maybe I'm mistaken and the compiler will never be able to do that...

@anp

Contributor

anp commented Aug 14, 2016

@johanneskoester I'm in the process of talking to some people on IRC about the optimizations at play here, and to what extent the filter/count method might be better optimized in the future. Will post back when I have more info.

@anp

Contributor

anp commented Aug 14, 2016

@johanneskoester So... after digging around, it looks like using plain fold instead of map/fold or filter/count should vectorize: https://is.gd/jeOBHA -- turning on release mode + ASM shows that both the manual method and the fold method vectorize. Note however that this precludes using a u32 to accumulate the BWT count -- which does show some impressive gains from @Aatch's benchmarking.

I think that, given the performance benefit of a u32 accumulator and the fact that the manual loop seems to be more reliably optimized, the case for using a manual loop is pretty strong.
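For the record, the fold shape being compared looks roughly like this (a standalone sketch with placeholder names, not the exact playground code behind the link above):

fn count_fold(haystack: &[u8], needle: u8) -> usize {
    // fold with an unconditional add of the comparison result; unlike
    // filter().count(), this form appears to get vectorized in release builds
    haystack.iter().fold(0usize, |acc, &x| acc + (x == needle) as usize)
}

fn main() {
    let bwt = b"acgtacgtaacc";
    assert_eq!(count_fold(bwt, b'a'), 4);
}

The generated code can be inspected from a release build, e.g. with cargo rustc --release -- --emit asm, to confirm whether a given shape actually vectorizes on the toolchain in use.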

@Aatch

Aatch commented Aug 15, 2016

@johanneskoester be careful not to fall into the "sufficiently smart compiler" trap. Given enough time and information, a compiler could theoretically produce perfectly-optimised code for almost anything, so "the compiler might manage this eventually" isn't a useful basis for a decision; it's more important to consider what it can do now.

@johanneskoester

Contributor

johanneskoester commented Aug 15, 2016

Ok, I'm fine with u32 and a custom loop then. Thanks a lot for evaluating this!!

@johanneskoester

Contributor

johanneskoester commented Aug 15, 2016

Maybe you can add some comments to the code that explain why no iterators are used in the refined implementation. That way we ensure that your insights on this snippet are not lost, or even accidentally reverted, later on.

@johanneskoester

Contributor

johanneskoester commented Sep 27, 2016

Any update on this?

@anp

Contributor

anp commented Sep 30, 2016

@johanneskoester I'm hoping to clean this up this weekend.

Merge pull request #1 from rust-bio/master
update to upstream
@coveralls

coveralls commented Feb 5, 2017

Coverage increased (+0.02%) to 91.697% when pulling a8399a4 on dikaiosune:master into 796a583 on rust-bio:master.

@anp

Contributor

anp commented Feb 5, 2017

I'm getting back to this after quite some time. To summarize my understanding of where this is:

  • count is more efficient in vectorized versions when stored as a u32
  • Occ.k needs to be a u32 to ensure that no one constructs Occ with a sampling rate which would cause count to overflow
  • Changing k to u32 means that the signature of Occ::new needs to change, which would affect people upgrading to the new version. The tests in the module rely on type inference for numeric literals, so there's no need to change those, but downstream users will be affected. It's not a big change, and if you're regularly passing values greater than 4,000,000,000 for k, then you almost certainly have broken code (a sketch of the call-site impact follows below).

@johanneskoester what do you think about the last point? It's a wrinkle that I don't think we covered when discussing previously. In the meantime I'll push a commit with a comment explaining the manual looping.
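To make the downstream impact concrete, a hypothetical call site after the change might look like the following. The module paths and the surrounding Occ::new parameters are assumed from typical rust-bio usage, not copied from this PR, so treat it as a sketch.

extern crate bio;

use bio::alphabets::dna;
use bio::data_structures::bwt::{bwt, Occ};
use bio::data_structures::suffix_array::suffix_array;

fn main() {
    let text = b"GCCTTAACATTATTACGCCTA$";
    let sa = suffix_array(text);
    let bwt = bwt(text, &sa);

    // k is now a u32; an untyped literal still infers correctly, so most callers
    // compile unchanged, and only explicitly typed usize values need a cast
    let occ = Occ::new(&bwt, 32, &dna::n_alphabet());
    println!("occurrences of 'T' up to position 6: {}", occ.get(&bwt, 6, b'T'));
}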

Switch Occ::k and Occ::get's count accumulator to u32.
Also, add comment explaining the need for a manual counter.
@coveralls

coveralls commented Feb 5, 2017

Coverage increased (+0.02%) to 91.697% when pulling efef57f on dikaiosune:master into 796a583 on rust-bio:master.

@johanneskoester

I'm fine with the changes (see one minor comment below). API stability is not yet a major concern.

+    let mut count: u32 = 0;
+    for &x in &bwt[start..end] {
+        if x == a {
+            count += 1;

@johanneskoester

johanneskoester Feb 8, 2017

Contributor

What happens if I do count += (x == a) as u32? I think it could be cheaper, because an unconditional addition avoids the branch (and any mispredictions)?

@anp

anp Feb 9, 2017

Contributor

Results from running the benches on my 2016 MBP:

The current PR:

test search_index_seeds ... bench:      33,975 ns/iter (+/- 1,272)
test search_index_seeds ... bench:      33,912 ns/iter (+/- 3,118)
test search_index_seeds ... bench:      34,024 ns/iter (+/- 963)

average: ~33970 ns/iter

with RUSTFLAGS="-C target-cpu=native"

test search_index_seeds ... bench:      33,715 ns/iter (+/- 2,859)
test search_index_seeds ... bench:      33,954 ns/iter (+/- 2,944)
test search_index_seeds ... bench:      34,207 ns/iter (+/- 1,287)

average: ~33958 ns/iter

With the suggested change:

test search_index_seeds ... bench:      34,387 ns/iter (+/- 3,337)
test search_index_seeds ... bench:      33,873 ns/iter (+/- 2,172)
test search_index_seeds ... bench:      33,947 ns/iter (+/- 3,346)

average: ~34069 ns/iter

with RUSTFLAGS="-C target-cpu=native"

test search_index_seeds ... bench:      34,067 ns/iter (+/- 2,697)
test search_index_seeds ... bench:      34,046 ns/iter (+/- 1,759)
test search_index_seeds ... bench:      33,992 ns/iter (+/- 3,879)

average: ~34035 ns/iter

The difference between the average run times is minuscule (about 0.2-0.3%). However, the variance of the version you suggest is slightly higher on my machine (average variance of 1,784 ns as PR'd vs. 2,951 ns as suggested for a generic x86_64 target, and 2,363 ns vs. 2,778 ns with target-cpu=native), suggesting that the branch may actually have more predictable performance, at least on the random test data I included in that benchmark.

That said, I'm running these benches on a different machine than I had when I opened the PR, the Rust nightly version has changed, and I don't have time right now to dig into the generated assembly to see if there's anything wonky.

My sense is that, absent clear evidence that the branchless version performs better, we should prefer the version that more clearly communicates the intent of counting bytes matching our needle, which I'd argue is the version currently submitted for review. That said, since it seems like a complete wash between the two (the 0.2% difference is well within the error bars of this measurement), I'm happy to switch if you'd prefer.
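For concreteness, the two counting loops under discussion look like this in standalone form (placeholder function names; the branchy variant matches the PR diff above, the branchless one the suggestion):

// branchy: count only when the byte matches
fn count_branchy(haystack: &[u8], needle: u8) -> u32 {
    let mut count: u32 = 0;
    for &x in haystack {
        if x == needle {
            count += 1;
        }
    }
    count
}

// branchless: always add the boolean comparison, cast to u32
fn count_branchless(haystack: &[u8], needle: u8) -> u32 {
    let mut count: u32 = 0;
    for &x in haystack {
        count += (x == needle) as u32;
    }
    count
}

fn main() {
    let bwt = b"acgtacgtaacc";
    assert_eq!(count_branchy(bwt, b'c'), count_branchless(bwt, b'c'));
}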

@johanneskoester

johanneskoester Feb 9, 2017

Contributor

I could not agree more. Let's keep it like it is. Ready for merging?

@anp

anp Feb 9, 2017

Contributor

Yep!

@johanneskoester johanneskoester merged commit d1ef73b into rust-bio:master Feb 9, 2017

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
coverage/coveralls Coverage increased (+0.02%) to 91.697%
@johanneskoester

Contributor

johanneskoester commented Feb 9, 2017

Awesome, this is really great. By the way, if you have any software based on Rust-Bio, please let me know; we can link it on the homepage.

@anp

Contributor

anp commented Feb 9, 2017

Glad to hear it, thanks!

I built some research code in my previous position and rust-bio was phenomenally helpful, but I've since started a new job where I'm not working on it anymore. If I hear they've published and/or open-sourced the project, I will be sure to let you know!

@johanneskoester

Contributor

johanneskoester commented Feb 9, 2017

Oh no, that's a loss! I hope you can at least keep going with Rust in your free time :-).

@anp

Contributor

anp commented Feb 9, 2017

So far so good :D
