[TEST] Index Benchmark #960

Merged
merged 3 commits into from
Feb 24, 2020

Clemapfel
Contributor

@Clemapfel Clemapfel commented May 10, 2019

benchmark construction of fm- and bi-fm-index and compare with seqan2

resolves #1566

@codecov

codecov bot commented May 14, 2019

Codecov Report

Merging #960 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #960   +/-   ##
=======================================
  Coverage   97.63%   97.63%           
=======================================
  Files         235      235           
  Lines        8904     8904           
=======================================
  Hits         8693     8693           
  Misses        211      211
Impacted Files Coverage Δ
...ude/seqan3/test/performance/sequence_generator.hpp 100% <ø> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1c2b720...a1866a7.

@Clemapfel
Contributor Author

Clemapfel commented Jul 17, 2019

Let me know if I should change anything; I want to get these off my backlog.

I'm not sure whether there's an error in my code or whether these results reflect reality, but it seems strange that seqan3 has so much trouble with small texts but not with longer ones.

construct_index_seqan3<fm_index<std::vector<dna4>>, 5>                    22375839 ns     22375746 ns           31
construct_index_seqan3<fm_index<std::vector<dna4>>, 50>                   11148532 ns     11148310 ns           66
construct_index_seqan3<fm_index<std::vector<dna4>>, 500>                   8372083 ns      8371841 ns           83
construct_index_seqan3<fm_index<std::vector<dna4>>, 5000>                  6904690 ns      6904545 ns          102
construct_index_seqan3<fm_index<std::vector<dna4>>, 50000>                10536404 ns     10536065 ns           67
construct_index_seqan3<fm_index<std::vector<dna4>>, 500000>               68870512 ns     68868419 ns           10
construct_index_seqan3<fm_index<std::vector<dna4>>, 5000000>             763673684 ns    763580355 ns            1
construct_index_seqan2<Index<String<Dna>, FMIndex<void, cfg>>, 5>             9627 ns         9627 ns        72073
construct_index_seqan2<Index<String<Dna>, FMIndex<void, cfg>>, 50>           13446 ns        13446 ns        52503
construct_index_seqan2<Index<String<Dna>, FMIndex<void, cfg>>, 500>          66011 ns        66009 ns        10746
construct_index_seqan2<Index<String<Dna>, FMIndex<void, cfg>>, 5000>        979003 ns       978978 ns          727
construct_index_seqan2<Index<String<Dna>, FMIndex<void, cfg>>, 50000>     10734212 ns     10733943 ns           66
construct_index_seqan2<Index<String<Dna>, FMIndex<void, cfg>>, 500000>   124561039 ns    124560337 ns            6
construct_index_seqan2<Index<String<Dna>, FMIndex<void, cfg>>, 5000000> 1874113125 ns   1811238968 ns            1

@rrahn
Contributor

rrahn commented Feb 6, 2020

@Clemapfel thank you for this work. Please let us know if you can finish this PR, otherwise we will take it over.

@eseiler
Member

eseiler commented Feb 16, 2020

Size = 50

-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------------------------
index_benchmark<std::vector<seqan3::dna4>>/fm_index_1                           9694997 ns      9791667 ns           75
index_benchmark<std::vector<seqan3::aa27>>/fm_index_2                           5568685 ns      5667892 ns          102
index_benchmark<std::string>/fm_index_3                                         5721029 ns      5719866 ns          112
index_benchmark<std::vector<std::vector<seqan3::dna4>>>/fm_index_4              8985347 ns      9027778 ns           90
index_benchmark<std::vector<std::vector<seqan3::aa27>>>/fm_index_5              9003509 ns      8958333 ns           75
index_benchmark<std::vector<std::string>>/fm_index_6                            8857841 ns      8958333 ns           75
index_benchmark<std::vector<seqan3::dna4>>/bi_fm_index_1                       19779454 ns     19847973 ns           37
index_benchmark<std::vector<seqan3::aa27>>/bi_fm_index_2                       18941547 ns     18841912 ns           34
index_benchmark<std::string>/bi_fm_index_3                                     19079965 ns     19003378 ns           37
index_benchmark<std::vector<std::vector<seqan3::dna4>>>/bi_fm_index_4          13235782 ns     13437500 ns           50
index_benchmark<std::vector<std::vector<seqan3::aa27>>>/bi_fm_index_5          13890164 ns     13750000 ns           50
index_benchmark<std::vector<std::string>>/bi_fm_index_6                        13224644 ns     13437500 ns           50

index_benchmark<std::vector<seqan3::dna4>>/seqan2_fm_index_1                      30303 ns        30483 ns        23579
index_benchmark<std::vector<seqan3::aa27>>/seqan2_fm_index_2                      31419 ns        31495 ns        21333
index_benchmark<std::string>/seqan2_fm_index_3                                    36707 ns        36830 ns        18667
index_benchmark<std::vector<std::vector<seqan3::dna4>>>/seqan2_fm_index_4         45773 ns        46039 ns        14933
index_benchmark<std::vector<std::vector<seqan3::aa27>>>/seqan2_fm_index_5         49000 ns        50000 ns        10000
index_benchmark<std::vector<std::string>>/seqan2_fm_index_6                       77448 ns        78125 ns        11200
index_benchmark<std::vector<seqan3::dna4>>/seqan2_bi_fm_index_1                   62199 ns        59375 ns        10000
index_benchmark<std::vector<seqan3::aa27>>/seqan2_bi_fm_index_2                   63646 ns        62779 ns        11200
index_benchmark<std::string>/seqan2_bi_fm_index_3                                 76926 ns        76730 ns         8960
index_benchmark<std::vector<std::vector<seqan3::dna4>>>/seqan2_bi_fm_index_4      94056 ns        94164 ns         7467
index_benchmark<std::vector<std::vector<seqan3::aa27>>>/seqan2_bi_fm_index_5      99653 ns       100442 ns         7467
index_benchmark<std::vector<std::string>>/seqan2_bi_fm_index_6                   154939 ns       153460 ns         4480

Size = 50'000

-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------------------------
index_benchmark<std::vector<seqan3::dna4>>/fm_index_1                           5314992 ns      5312500 ns          100
index_benchmark<std::vector<seqan3::aa27>>/fm_index_2                           5363862 ns      5301339 ns          112
index_benchmark<std::string>/fm_index_3                                         5965164 ns      5859375 ns          112
index_benchmark<std::vector<std::vector<seqan3::dna4>>>/fm_index_4             19752076 ns     19847973 ns           37
index_benchmark<std::vector<std::vector<seqan3::aa27>>>/fm_index_5             19496426 ns     19761029 ns           34
index_benchmark<std::vector<std::string>>/fm_index_6                           22063284 ns     21972656 ns           32
index_benchmark<std::vector<seqan3::dna4>>/bi_fm_index_1                       10786256 ns     10742188 ns           64
index_benchmark<std::vector<seqan3::aa27>>/bi_fm_index_2                       10948766 ns     10986328 ns           64
index_benchmark<std::string>/bi_fm_index_3                                     12070402 ns     12207031 ns           64
index_benchmark<std::vector<std::vector<seqan3::dna4>>>/bi_fm_index_4          38552500 ns     38651316 ns           19
index_benchmark<std::vector<std::vector<seqan3::aa27>>>/bi_fm_index_5          38460063 ns     38651316 ns           19
index_benchmark<std::vector<std::string>>/bi_fm_index_6                        44046100 ns     43945312 ns           16

index_benchmark<std::vector<seqan3::dna4>>/seqan2_fm_index_1                    6334793 ns      6250000 ns           90
index_benchmark<std::vector<seqan3::aa27>>/seqan2_fm_index_2                    8866397 ns      9027778 ns           90
index_benchmark<std::string>/seqan2_fm_index_3                                 12672620 ns     12834821 ns           56
index_benchmark<std::vector<std::vector<seqan3::dna4>>>/seqan2_fm_index_4      47076433 ns     47916667 ns           15
index_benchmark<std::vector<std::vector<seqan3::aa27>>>/seqan2_fm_index_5      54937820 ns     54687500 ns           10
index_benchmark<std::vector<std::string>>/seqan2_fm_index_6                    73960622 ns     72916667 ns            9
index_benchmark<std::vector<seqan3::dna4>>/seqan2_bi_fm_index_1                12616348 ns     12695312 ns           64
index_benchmark<std::vector<seqan3::aa27>>/seqan2_bi_fm_index_2                15400638 ns     14930556 ns           45
index_benchmark<std::string>/seqan2_bi_fm_index_3                              25356629 ns     25669643 ns           28
index_benchmark<std::vector<std::vector<seqan3::dna4>>>/seqan2_bi_fm_index_4   93827400 ns     93750000 ns            7
index_benchmark<std::vector<std::vector<seqan3::aa27>>>/seqan2_bi_fm_index_5   90320300 ns     89285714 ns            7
index_benchmark<std::vector<std::string>>/seqan2_bi_fm_index_6                190387775 ns    191406250 ns            4

Size = 500'000

-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------------------------
index_benchmark<std::vector<seqan3::dna4>>/fm_index_1                          33999543 ns     34226190 ns           21
index_benchmark<std::vector<seqan3::aa27>>/fm_index_2                          34804800 ns     34539474 ns           19
index_benchmark<std::string>/fm_index_3                                        38902278 ns     39062500 ns           18
index_benchmark<std::vector<std::vector<seqan3::dna4>>>/fm_index_4            189752975 ns    191406250 ns            4
index_benchmark<std::vector<std::vector<seqan3::aa27>>>/fm_index_5            220274033 ns    218750000 ns            3
index_benchmark<std::vector<std::string>>/fm_index_6                          231449933 ns    229166667 ns            3
index_benchmark<std::vector<seqan3::dna4>>/bi_fm_index_1                       70174778 ns     69444444 ns            9
index_benchmark<std::vector<seqan3::aa27>>/bi_fm_index_2                       73341122 ns     72916667 ns            9
index_benchmark<std::string>/bi_fm_index_3                                     80824457 ns     80357143 ns            7
index_benchmark<std::vector<std::vector<seqan3::dna4>>>/bi_fm_index_4         395090100 ns    398437500 ns            2
index_benchmark<std::vector<std::vector<seqan3::aa27>>>/bi_fm_index_5         438812950 ns    406250000 ns            2
index_benchmark<std::vector<std::string>>/bi_fm_index_6                       458149900 ns    460937500 ns            2

index_benchmark<std::vector<seqan3::dna4>>/seqan2_fm_index_1                   75215814 ns     73660714 ns            7
index_benchmark<std::vector<seqan3::aa27>>/seqan2_fm_index_2                  109371867 ns    109375000 ns            6
index_benchmark<std::string>/seqan2_fm_index_3                                209087800 ns    208333333 ns            3
index_benchmark<std::vector<std::vector<seqan3::dna4>>>/seqan2_fm_index_4     867676500 ns    859375000 ns            1
index_benchmark<std::vector<std::vector<seqan3::aa27>>>/seqan2_fm_index_5     902669700 ns    890625000 ns            1
index_benchmark<std::vector<std::string>>/seqan2_fm_index_6                  1136226400 ns   1125000000 ns            1
index_benchmark<std::vector<seqan3::dna4>>/seqan2_bi_fm_index_1               152813175 ns    152343750 ns            4
index_benchmark<std::vector<seqan3::aa27>>/seqan2_bi_fm_index_2               187564750 ns    187500000 ns            4
index_benchmark<std::string>/seqan2_bi_fm_index_3                             386712450 ns    390625000 ns            2
index_benchmark<std::vector<std::vector<seqan3::dna4>>>/seqan2_bi_fm_index_4 1479774000 ns   1484375000 ns            1
index_benchmark<std::vector<std::vector<seqan3::aa27>>>/seqan2_bi_fm_index_5 1729828200 ns   1718750000 ns            1
index_benchmark<std::vector<std::string>>/seqan2_bi_fm_index_6               2267391900 ns   2234375000 ns            1

SeqAn3 seems to be asymptotically faster than SeqAn2 😁
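As a rough check of that impression, the wall-clock times of the `std::vector<seqan3::dna4>` rows (`fm_index_1` and `seqan2_fm_index_1`) in the three tables above can be turned into SeqAn2/SeqAn3 ratios. The sketch below only redoes that arithmetic; `timing_pair`, `seqan2_over_seqan3` and `print_ratios` are made-up names for illustration and are not part of the PR.

```cpp
#include <cassert>
#include <cstdio>

// One row pair from the tables above (hypothetical helper type, not from the PR).
struct timing_pair
{
    long size;        // value of the "Size = ..." heading
    double seqan3_ns; // "Time" column of fm_index_1
    double seqan2_ns; // "Time" column of seqan2_fm_index_1
};

// A ratio above 1 means SeqAn3 built the index faster than SeqAn2.
constexpr double seqan2_over_seqan3(timing_pair const & t)
{
    return t.seqan2_ns / t.seqan3_ns;
}

// Values copied from the std::vector<seqan3::dna4> rows of the three tables.
constexpr timing_pair dna4_fm_index[] = {
    {50, 9694997.0, 30303.0},
    {50000, 5314992.0, 6334793.0},
    {500000, 33999543.0, 75215814.0},
};

// Prints one line per text size.
inline void print_ratios()
{
    for (timing_pair const & t : dna4_fm_index)
    {
        assert(t.seqan3_ns > 0.0);
        std::printf("size %ld: seqan2/seqan3 = %.3f\n", t.size, seqan2_over_seqan3(t));
    }
}
```

Calling `print_ratios()` gives roughly 0.003 for size 50, 1.19 for size 50'000 and 2.21 for size 500'000: SeqAn2 is far ahead on tiny texts, and SeqAn3 pulls ahead as the text grows.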

@eseiler eseiler requested review from a team, eseiler and wvdtoorn and removed request for a team and eseiler February 18, 2020 07:16
@eseiler eseiler requested review from joergi-w and removed request for wvdtoorn February 18, 2020 07:37

Member

@joergi-w joergi-w left a comment

I have some questions and ideas for improvement...

Member

@joergi-w joergi-w left a comment

Nice, thank you! 👍

b->Args({500, 1'000});
}

struct fm_index_seqan2;

Member

use an enum?

enum class index_tag
{
    fm_index_seqan2,
    fm_index_seqan3,
    bi_fm_index_seqan2,
    bi_fm_index_seqan3
};
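
For illustration, this is roughly how such an enum could drive compile-time dispatch once the tag is passed as a non-type template parameter. It is a minimal sketch under assumptions: only the enumerator names come from the comment above, while `is_seqan2`, `is_bidirectional` and `construction_passes` are hypothetical helpers with purely illustrative return values.

```cpp
// Tag enum; the enumerator names follow the suggestion above.
enum class index_tag
{
    fm_index_seqan2,
    fm_index_seqan3,
    bi_fm_index_seqan2,
    bi_fm_index_seqan3
};

// An enumerator is a value, so it is passed as a non-type template parameter.
template <index_tag tag>
constexpr bool is_seqan2()
{
    return tag == index_tag::fm_index_seqan2 || tag == index_tag::bi_fm_index_seqan2;
}

template <index_tag tag>
constexpr bool is_bidirectional()
{
    return tag == index_tag::bi_fm_index_seqan2 || tag == index_tag::bi_fm_index_seqan3;
}

// Stand-in for a benchmark body: the branch is selected at compile time.
// The returned numbers are illustrative only.
template <index_tag tag>
constexpr int construction_passes()
{
    if constexpr (is_bidirectional<tag>())
        return 2;
    else
        return 1;
}
```

Compared with one forward-declared struct per variant, the tags live in one place and can be compared with `==` instead of `std::same_as`.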


inner_rng_t make_single_seqan3(size_t seed = 0)
{
// The characters with rank 254 and 255 are reserved for the indices.

Member

Do we actually test a text with an alphabet of such a large size? Does our index (the SDSL) even work?
I cannot see a corresponding test in here.

Member

std::string ? 😁


#if SEQAN3_HAS_SEQAN2
// Map the SeqAn3 alphabet to its SeqAn2 equivalent.
using seqan2_alphabet_t = std::conditional_t<std::same_as<alphabet_t, seqan3::dna4>, seqan::Dna,

Member

why do you need to do this here?
Can't you write directly into the benchmark template:

BENCHMARK_TEMPLATE(index_benchmark, bi_fm_index_seqan3, std::vector<seqan3::dna4>)->Apply(arguments);
#if SEQAN3_HAS_SEQAN2
- BENCHMARK_TEMPLATE(index_benchmark, bi_fm_index_seqan2, std::vector<seqan3::dna4>)->Apply(arguments);
+ BENCHMARK_TEMPLATE(index_benchmark, bi_fm_index_seqan2, seqan::String<seqan::Dna>)->Apply(arguments);
#endif // SEQAN3_HAS_SEQAN2

This also seems error-prone: if you change the alphabet to dna5 for testing, this function will fall back to char for seqan2 without any error.
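
If a mapping were kept at all, one way to avoid the silent fallback described above would be a trait with explicit specialisations and no default. This is not what the comment proposes (it suggests passing the SeqAn2 type directly); it is only a sketch of the alternative. All type names below are stand-ins so that it compiles without SeqAn, and `seqan2_equivalent` is a hypothetical name.

```cpp
#include <type_traits>

// Stand-in alphabet types (all hypothetical, used instead of the real SeqAn types).
struct dna4_like {};
struct dna5_like {};
struct aa27_like {};
struct seqan2_dna_like {};
struct seqan2_amino_like {};

// The primary template is declared but never defined: asking for the
// equivalent of an unmapped alphabet is a compile-time error instead of
// a silent fallback to char.
template <typename alphabet_t>
struct seqan2_equivalent;

template <>
struct seqan2_equivalent<dna4_like>
{
    using type = seqan2_dna_like;
};

template <>
struct seqan2_equivalent<aa27_like>
{
    using type = seqan2_amino_like;
};

// Convenience alias, analogous to the _t aliases in <type_traits>.
template <typename alphabet_t>
using seqan2_equivalent_t = typename seqan2_equivalent<alphabet_t>::type;

static_assert(std::is_same_v<seqan2_equivalent_t<dna4_like>, seqan2_dna_like>,
              "dna4 stand-in maps to the SeqAn2 Dna stand-in");
```

With this layout, `seqan2_equivalent_t<dna5_like>` does not compile until someone adds a specialisation for it, so switching the alphabet cannot quietly change what the SeqAn2 side measures.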

using fm_index_seqan2_t = seqan::Index<seqan2_rng_t, seqan::FMIndex<void, index_cfg>>;
using bi_fm_index_seqan2_t = seqan::Index<seqan2_rng_t, seqan::BidirectionalIndex<seqan::FMIndex<void, index_cfg>>>;

seqan2_rng_t make_sequence_seqan2()

Member

We should introduce a seqan3::test::generate_seqan2_sequence into the global test include header since this is something we will need more often.

Member

auto generate_sequence_seqan2(size_t const len = 500,
                              size_t const variance = 0,
                              size_t const seed = 0)
{
    std::mt19937 gen(seed);
    std::uniform_int_distribution<uint8_t> dis_alpha(0, seqan::ValueSize<alphabet_t>::VALUE - 1);
    std::uniform_int_distribution<size_t> dis_length(len - variance, len + variance);
    seqan::String<alphabet_t> sequence;
    size_t length = dis_length(gen);
    for (size_t l = 0; l < length; ++l)
        appendValue(sequence, alphabet_t{dis_alpha(gen)});
    return sequence;
}

Member

or maybe templatize over the range_type?
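
A sketch of what templatizing over the range type might look like, restricted to standard containers so it compiles without SeqAn. Everything here is an assumption for illustration: the name `generate_sequence`, the explicit `alphabet_size` parameter, and the use of `push_back` (a `seqan::String` would need `appendValue` instead, as in the snippet above). It also draws ranks from a `std::uint16_t` distribution, because the standard does not list character-sized types such as `uint8_t` among the permitted `IntType` arguments of `std::uniform_int_distribution`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper: fills any container that offers value_type and push_back.
template <typename range_t>
range_t generate_sequence(std::size_t const alphabet_size,
                          std::size_t const len = 500,
                          std::size_t const variance = 0,
                          std::size_t const seed = 0)
{
    assert(alphabet_size > 0 && alphabet_size <= 256);
    assert(variance <= len); // otherwise len - variance would wrap around

    std::mt19937 gen(static_cast<std::mt19937::result_type>(seed));
    // Ranks are drawn as uint16_t and narrowed to the container's value type.
    std::uniform_int_distribution<std::uint16_t> dis_alpha(0, static_cast<std::uint16_t>(alphabet_size - 1));
    std::uniform_int_distribution<std::size_t> dis_length(len - variance, len + variance);

    using value_t = typename range_t::value_type;

    range_t sequence;
    std::size_t const length = dis_length(gen);
    for (std::size_t l = 0; l < length; ++l)
        sequence.push_back(static_cast<value_t>(dis_alpha(gen)));

    return sequence;
}
```

For example, `generate_sequence<std::vector<std::uint8_t>>(4, 100, 10, 42)` yields between 90 and 110 ranks below 4, and the same seed reproduces the same sequence.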

#if SEQAN3_HAS_SEQAN2
if constexpr (std::same_as<index_tag_t, fm_index_seqan2> || std::same_as<index_tag_t, bi_fm_index_seqan2>)
{
typename sequence_generator<index_tag_t, rng_t>::index_t index{generator.seqan2};

Member

Why do you store this in a member?

Suggested change
- typename sequence_generator<index_tag_t, rng_t>::index_t index{generator.seqan2};
+ typename sequence_generator<index_tag_t, rng_t>::index_t index{generator.make_seqan2_sequence()};

}
}

BENCHMARK_TEMPLATE(index_benchmark, fm_index_seqan3, std::vector<seqan3::dna4>)->Apply(arguments);

Member

You are generating the same sequences for every new benchmark, right? This is quite inefficient for this many benchmarks :/ Maybe we could store them somehow?
Sorry that this will require some more refactoring, but we don't want the benchmark to take too much time.

Member

@eseiler eseiler Feb 21, 2020

The sequence generation isn't actually taking long, but I can refactor the tests.

The whole point of the class was to use the same sequences for seqan2 and seqan3, but I can just generate them individually and store them.

In this case, I would also block this PR by #920 such that I have the numeric sequence generator for generating char-sequences for seqan3 (numeric-sequence -> to_char?), because I think this is easier than generating a char sequence and then filtering out the ranks that are too high.

Member

@eseiler eseiler Feb 21, 2020

I could just store a long enough sequence, and then do | seqan3::views::take(length) on it. the view should produce overhead, but then again I'm not entirely sure ... but it would be a quite elegant solution?

Member

In this case, I would also block this PR by #920 such that I have the numeric sequence generator for generating char-sequences for seqan3 (numeric-sequence -> to_char?), because I think this is easier than generating a char sequence and then filtering out the ranks that are too high.

👍

I could just store a long enough sequence, and then do | seqan3::views::take(length) on it. the view should produce overhead, but then again I'm not entirely sure ... but it would be a quite elegant solution?

Indeed, this would be elegant :) And I think it would be comparable if you use
seqan3::views::take(length) | seqan3::views::to<std::vector>? This, of course, involves copying, but that should not be a major performance bottleneck. Or you could use views::take for seqan3 and Infix<> for seqan2?

Member

Yes, something like this. I'll try it out and then we can talk about it on Monday 👍
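
The idea discussed above (generate one long sequence once, then hand out prefixes of the requested length) can be sketched with standard containers only. This is a plain-iterator analogue of `take(length)` followed by a conversion to `std::vector`, not the SeqAn3 view pipeline itself; `sequence_store` and `prefix` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical store: one long sequence created once, prefixes copied on demand.
template <typename value_t>
class sequence_store
{
public:
    explicit sequence_store(std::vector<value_t> text) : text_{std::move(text)} {}

    // Copies the first `length` elements into a new vector.
    std::vector<value_t> prefix(std::size_t const length) const
    {
        assert(length <= text_.size());
        return std::vector<value_t>(text_.begin(), text_.begin() + static_cast<std::ptrdiff_t>(length));
    }

    std::size_t size() const
    {
        return text_.size();
    }

private:
    std::vector<value_t> text_;
};
```

Each benchmark would then call `store.prefix(length)` before the timed loop, so the copy is paid once per benchmark and random generation only once per run.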

Member

@smehringer smehringer left a comment

💅

#include <seqan3/search/fm_index/all.hpp>
#include <seqan3/test/performance/sequence_generator.hpp>
#include <seqan3/test/seqan2.hpp>


Member

remove newline

| seqan3::views::to<std::string>};
};

sequence_store_seqan3 store3;

Member

Suggested change
- sequence_store_seqan3 store3;
+ sequence_store_seqan3 store{};

I like the design :)

Comment on lines 144 to 153
if constexpr (index_tag == tag::fm_index)
{
    seqan::Index<rng_t, seqan::FMIndex<void, index_cfg>> index{sequence};
    seqan::indexCreate(index, seqan::FibreSALF());
}
else
{
    seqan::Index<rng_t, seqan::BidirectionalIndex<seqan::FMIndex<void, index_cfg>>> index{sequence};
    seqan::indexCreate(index, seqan::FibreSALF());
}

Member

using index_type = std::conditional_t<index_tag == tag::fm_index,
                                      seqan::FMIndex<void, index_cfg>,
                                      seqan::BidirectionalIndex<seqan::FMIndex<void, index_cfg>>>;
seqan::Index<rng_t, index_type> index{sequence};
seqan::indexCreate(index, seqan::FibreSALF());

instead of the if else clause
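
The same pattern in isolation, as a sketch that compiles without SeqAn: `std::conditional_t` picks the type once from the tag, and the code that uses the type is written a single time. The two spec structs and `index_like` are stand-ins invented for this example; only `tag::fm_index` and the `std::conditional_t` idea come from the comment above, and `tag::bi_fm_index` is an assumed second enumerator.

```cpp
#include <type_traits>

// Stand-ins for the two SeqAn2 index specialisations (hypothetical names).
struct unidirectional_spec {};
struct bidirectional_spec {};

// Assumed shape of the tag enum used in the reviewed code.
enum class tag
{
    fm_index,
    bi_fm_index
};

// The type is chosen once from the tag value.
template <tag index_tag>
using index_spec_t = std::conditional_t<index_tag == tag::fm_index, unidirectional_spec, bidirectional_spec>;

// Everything that depends on the choice is then written a single time.
template <tag index_tag>
struct index_like
{
    using spec_type = index_spec_t<index_tag>;
};

static_assert(std::is_same_v<index_like<tag::fm_index>::spec_type, unidirectional_spec>,
              "fm_index selects the unidirectional stand-in");
```

The branch disappears from the function body, so the index construction and the `indexCreate` call are no longer duplicated.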

Member

@smehringer smehringer left a comment

and clean commit history

if constexpr (dimension == 1)
    sequence = std::move(inner_sequence);
else
    for (int32_t i = 0; i < state.range(1); ++i)

Member

multi line else block with braces

@smehringer smehringer merged commit ddd5e44 into seqan:master Feb 24, 2020
Development

Successfully merging this pull request may close these issues.

Add microbenchmark for index construction
5 participants