Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a sequence generator for the tests and benchmarks of the library such that it is easier to handle the generation of sequences inbetween different calls. #12

Closed
5 of 19 tasks
rrahn opened this issue Mar 26, 2020 · 4 comments
Assignees
Milestone

Comments

@rrahn
Copy link
Contributor

rrahn commented Mar 26, 2020

Description

Currently we use free functions to generate the sequences but have to pass the seed for every sequence invocation. This makes it difficult to generate different sequences over different test instances or to generate multiple "random" sequences without resetting the seed for every invocation. Accordingly, there should be generator class that has the random number generator as a state and is initialised during construction. Thus, it becomes much easier to handle a global sequence generator or a test case local one that generates random sequences for a single test but the same sequences for every test case, e.g. as member of the test fixture.

Acceptance criteria

  • intialise with a seed on construction (default seed is 42)
  • generate sequences by specifying the target alphabet
  • generate a sequence collection (number of sequences, average length of sequences, deviation parameter)
  • the generated sequences are std::vector over the given alphabet
  • the collection size is required and generates this many sequences
  • the sequence size is required and generates sequences with this size on average
  • the sequence size variation defaults to 0 if not provided (all sequences have same length)

Tasks

  • Implement and test the basic generator class that can be intialised with the seed and generate a single sequence
  • Extend and test the generator to produce a collection of sequences
  • Replace all instances of the free function calls with the generator
  • Adapt the sequence generator to also work with SeqAn2 by providing conditionally included generation functions if SeqAn2 is available.

Definition of Done

  • Implementation and design approved
  • Unit tests pass
  • Test coverage = 100%
  • Microbenchmarks added and/or affected microbenchmarks < 5% performance drop
  • API documentation added
  • Tutorial/teaching material added
  • Test suite compiles in less than 30 seconds (on travis)
  • Changelog entry added
@rrahn rrahn added ready to tackle This story was discussed and can be immidietly tackled needs refinement A story that was not discussed and/or estimated by the team yet but is planned for upcoming sprints. and removed ready to tackle This story was discussed and can be immidietly tackled labels Mar 26, 2020
@rrahn rrahn added this to Misc in SeqAn3 Library Mar 26, 2020
@rrahn rrahn added ready to tackle This story was discussed and can be immidietly tackled spike and removed needs refinement A story that was not discussed and/or estimated by the team yet but is planned for upcoming sprints. ready to tackle This story was discussed and can be immidietly tackled labels Jul 6, 2020
@rrahn rrahn added this to the Sprint 7 milestone Jul 6, 2020
@Irallia Irallia self-assigned this Jul 14, 2020
@marehr
Copy link
Member

marehr commented Jul 27, 2020

Since this ticket only describes the functional requirements, I thought up the following (iterative) design:

After some thought I think the base class for sequence generation should be called random_sequence instead of sequence_generator.

1) Implement call operator()

generate (single) fixed sized sequence
seqan3::test::random_sequence<seqan3::dna5> random_sequence{{/*.length = */4, /*.length_variation = */0}};
using sequence_type = decltype(random_sequence)::sequence_type;

sequence_type sequence1 = random_sequence(); // CAGT
sequence_type sequence2 = random_sequence(); // NCTG
generate (single) variable sized sequence, generates sequence with size length ± length_variation.
seqan3::test::random_sequence<seqan3::dna5> random_sequence{{/*.length = */4, /*.length_variation = */2}};
using sequence_type = decltype(random_sequence)::sequence_type;

sequence_type sequence1 = random_sequence(); // CAG
sequence_type sequence2 = random_sequence(); // NCTGA
generate a collection of variable/fixed sized sequences
seqan3::test::random_sequence<seqan3::dna5> random_sequence{{/*.length = */4, /*.length_variation = */2}};
using sequence_type = decltype(random_sequence)::sequence_type;

std::vector<sequence_type> sequences(5); // pre-allocates 5 sequences
std::ranges::generate(sequences, random_sequence); // and fill them with the content of the sequence generator

// sequences = {TTGCC, AT, TTC, AGTCAT, AGG}
generate sequence pairs
seqan3::test::random_sequence<seqan3::dna5> random_sequence{{/*.length = */4, /*.length_variation = */2}};
using sequence_type = decltype(random_sequence)::sequence_type;

std::vector<std::tuple<sequence_type, sequence_type>> sequences(5); // pre-allocates 5 sequences
std::ranges::generate(sequences, [&random_sequence]()
{
    return std::tuple{random_sequence(), random_sequence()};
});

// sequences = (TTGCC, AT), (TTC, AGTCAT), (AGG, CACG), (TTTGGC, GAT), (CA, GTAC)

This covers most of our use cases. (This is basically the same as seqan3::test::generate_sequence, but as function object).

2) Implement begin, end

that returns an iterator which calls random_sequence::operator().

The range interface is not really needed, because normally we want to persist data, but it might be still interesting.

generate infinte many sequences
seqan3::test::random_sequence<seqan3::dna5> random_sequence{{/*.length = */4, /*.length_variation = */2}};

for (auto && sequence: random_sequence);
// TTGCC, AT, TTC, AGTCAT, AGG, CACG, TTTGGC, GAT, CA, ...
generate finite many sequences
seqan3::test::random_sequence<seqan3::dna5> random_sequence{{/*.length = */4, /*.length_variation = */2}};

for (auto && sequence: random_sequence | std::views::take(5));
// TTGCC, AT, TTC, AGTCAT, AGG
generate sequence pairs
seqan3::test::random_sequence<seqan3::dna5> random_sequence{{}};

for (auto && [sequence1, sequence2] : seqan3::views::zip(random_sequence, random_sequence)
                                      | std::views::take(5));
// (TTGCC, AT), (TTC, AGTCAT), (AGG, CACG), (TTTGGC, GAT), (CA, GTAC)

3) Thought's on sequence collection generator

generate random sequences
seqan3::test::random_sequences<seqan3::dna5> random_sequences
{
    /*seqan3::test::random_sequence<seqan3::dna5>*/{/*.length = */4, /*.length_variation = */2},
    {/*.collection_size = */5, /*.collection_size_variation = */1}
};

auto sequences1 = random_sequences();
// sequences1 = {TTGCC, AT, TTC, AGTCAT, AGG} // 5 ± 1 sequences

auto sequences2 = random_sequences();
// sequences2 = {AGC, GCTAG, CCCA, TTGAC} // 5 ± 1 sequences

This would be basically a wrapper for

std::vector<sequence_type> sequences(5); // pre-allocates 5 sequences
std::ranges::generate(sequences, random_sequence); // and fill them with the content of the sequence generator

from section 1. I'm not sure if we really need this, since it does not really add any real value as it is right now.

generate random sequence pairs
seqan3::test::random_sequence_pairs<seqan3::dna5> random_sequence_pairs
{
    /*seqan3::test::random_sequence<seqan3::dna5>*/{/*.length = */4, /*.length_variation = */2},
    {/*.collection_size = */5, /*.collection_size_variation = */0}
};

auto sequences = random_sequence_pairs();
// (TTGCC, AT), (TTC, AGTCAT), (AGG, CACG), (TTTGGC, GAT), (CA, GTAC)

This would be basically a wrapper for

std::vector<std::tuple<sequence_type, sequence_type>> sequences(5); // pre-allocates 5 sequences
std::ranges::generate(sequences, [&random_sequence]()
{
    return std::tuple{random_sequence(), random_sequence()};
});

from section 1. Instead of an extra class seqan3::test::random_sequence_pairs, we might want to add a "transformation"
/ "projection" option to seqan3::test::random_sequences.

generate random sample of reference sequence
seqan3::test::random_sequence<seqan3::dna5> random_reference_sequence{/*.length = */40, /*.length_variation = */2};
auto reference_sequence = random_reference_sequence();
seqan3::test::random_sequence_fragment</*????*/> random_sequence_fragment
{
    // The reference_sequence to sample from
    /*.reference_sequence =*/ reference_sequence,
    // The size of each fragment / read
    {/*.sequence_length = */10, /*.sequence_length_variation = */2},
    // The changes done to each fragment / read.
    {/*.max_error_total = */3, /*.max_error_substitution = */ 3, /*.max_error_deletion = */ 0, /*.max_error_insertion = */ 0},
    // Quality modifiers?
    {/*.sequence_ends_have_worse_quality*/}
};

// some defaults for sequencing machines, e.g. illumina?

// random_reference_sequence = TTGCC ATTTC AGTCA TAGGC ACGTT TGGCG ATCAG TAC
auto read1 = random_sequence_fragment(); // TTGCC ATTTC AGTCA TAG[GC ACGTT TGG]CG ATCAG TAC
auto read2 = random_sequence_fragment(); // TTGCC ATTTC AG[TCA TAGAT] ACGTT TGGCG ATCAG TAC

This seems to be completely different from seqan3::test::random_sequences, because it uses a reference sequence to
sample from. I imagine this as a base for Mason (https://publications.imp.fu-berlin.de/962/)

4) Thought's on configurator for sequence generator

  • seqan3::random_sequence_cfg::sequence_length(base_length[, length_variation = 0])

    Produces sequence with length base_length ± length_variation

  • seqan3::random_sequence_cfg::sequence_length(distribution)

    Produces sequence with length distribution(random_number_engine())

  • seqan3::random_sequence_cfg::random_number_engine(engine)

    Sets random_number_engine, defaults to std::mt19937_64 (64bit mersenne twister)

    (OPEN: Take this as copy or as reference? When passed as copy this might always produce the same sequences; As
    reference would mean that class does not own it and the caller must guarantee life-time. Or as a hybrid?
    Owns it if passed as rvalue, otherwise does-not own it?)

5) Thought's on seqan2 sequence generator

We could have the following approaches:

  1. seqan3::test::random_sequence automatically detects that given alphabet is a seqan2 alphabet and changes it
    semantics
    Look and feel:

    seqan3::test::random_sequence<seqan::Dna> random_sequence{{/*.length = */4, /*.length_variation = */2}};
    using sequence_type = decltype(random_sequence)::sequence_type; // = seqan::String<seqan::Dna>
    
    std::vector<sequence_type> sequences(5); // pre-allocates 5 sequences
    std::ranges::generate(sequences, random_sequence); // and fill them with the content of the sequence generator
    
    // sequences = {TTGCC, AT, TTC, AGTCAT, AGG}

    (For SeqAn2 seqan::String<seqan::Dna> is a sequence, whereas seqan3 assumes std::vector<alphabet>)

  2. We have a special seqan3::test::seqan2_random_sequence class that basically has all this implemented.
    Look and feel:

    seqan3::test::random_sequence<seqan::Dna> random_sequence{{/*.length = */4, /*.length_variation = */2}};
    using sequence_type = decltype(random_sequence)::sequence_type; // = seqan::String<seqan::Dna>
    
    std::vector<sequence_type> sequences(5); // pre-allocates 5 sequences
    std::ranges::generate(sequences, random_sequence); // and fill them with the content of the sequence generator
    
    // sequences = {TTGCC, AT, TTC, AGTCAT, AGG}
  3. We have a special specialisation for seqan2 via a seqan3::test::basic_random_sequence base class
    Look and feel:

    // seqan3 implementation is basically this:
    template <typename alphabet_t>
    struct random_sequence : basic_random_sequence<alphabet_t, std::vector<alphabet_t>>;
    
    template <typename seqan2_alphabet_t>
    struct basic_random_sequence<seqan2_alphabet_t, seqan::String<seqan2_alphabet_t>>
    {
       // re-implement everything for seqan2 types
    };
    seqan3::test::basic_random_sequence<seqan::Dna, seqan::String<seqan::Dna>> random_sequence{{/*.length = */4, /*.length_variation = */2}};
    using sequence_type = decltype(random_sequence)::sequence_type; // = seqan::String<seqan::Dna>
    
    std::vector<sequence_type> sequences(5); // pre-allocates 5 sequences
    std::ranges::generate(sequences, random_sequence); // and fill them with the content of the sequence generator
    
    // sequences = {TTGCC, AT, TTC, AGTCAT, AGG}

    (Note: 3. is a super-set of 2., because seqan3::test::seqan2_random_sequence is just a type alias for
    seqan3::test::basic_random_sequence<seqan::Dna, seqan::String<seqan::Dna>>.

@rrahn
Copy link
Contributor Author

rrahn commented Aug 6, 2020

Some thoughts before I forget about them.

1)

I very much agree with the general style proposed in 1 with some additions:

  • I would not call it random_sequence but sequence_generator, because it is a callable object.
  • I am not completely satisfied with the design of specifying the length and variance during construction but would do it during the invocation.
seqan3::test::sequence_generator<seqan3::dna4> generate_sequence{/*.seed = */42};
auto sequence = generate_sequence(100, 10);

To make it invocable without arguments you could also do:

seqan3::test::sequence_generator<seqan3::dna4> generate_sequence{/*.seed = */42};
size_t average_size = 100;  // As an example for possible run time parameter.
size_t size_variance = 10;
std::vector<sequence_t> sequences(10);
std::ranges::generate(sequences, [&] () { return generate_sequence(average_size, size_variance); });

The rational is that it is natural to change the size or variance based on some run time parameters. In the proposed approach either a new generator has to be constructed or the generator offers setter interfaces for the respective parameters.
But in my opinion it feels more natural to call the generator in the function style with parameters.

2)

I would not add the range interface but would rather have a generic generator view that gets an invocable and generates an infinity range by calling the invocable. (Note: one could play around with co-routines here as well)

seqan3::test::sequence_generator<seqan3::dna4> generate_sequence{/*.seed = */42};
for (auto && sequence : seqan3::views::generate([&] () { return generate_sequence(100); }) | seqan3::views::take(1000))
{ ... }

The other points are quite far in the future so I haven't put more thoughts into it.

Implementation details:

Since the purpose is mainly for seqan3, I propose to use the standard seqan3 alphabet interfaces to generate a sequence. For the SeqAn2 alphabet (seqan::SimpleType) we should add overloads such that they can be found by the CPO.

Generate a single letter:

std::uniform_int_distribution<> letter_rank_distribution(0ull, seqan3::alphabet_size_v<alphabet_t>);
/// ...
constexpr alphabet_t generate_letter() 
{
    return seqan3::assign_rank(alphabet_t{}, this->letter_rank_distribution(this->random_number_generator));
}

Generate a sequence:

constexpr std::vector<alphabet_t> operator()(sequence_size, size_variance = 0)
{
    std::uniform_int_distribution<> size_distribution(sequence_size -  size_variance, sequence_size + size_variance);
    std::vector<alphabet_t> sequence(size_distribution(this->random_number_generator));
    std::ranges::generate(sequence, generate_letter);
    return sequence;
}

We then can either convert the vector to a seqan::String on the caller site:

auto sequence = generate_sequence(100);
seqan::String<seqan::Dna4> seqan_string(sequence.size());
std::ranges::move(sequence, seqan_string.begin());

or provide a convenient traits object to the sequence_generator:
Simple version 1 (should be fine for most cases):

seqan3::test::sequence_generator<seqan::Dna4, seqan::String /*defaults to std::vector*/> ...;

Verbose version 2 (allows more control over the sequence type specification):

template <typename alphabet_t>
struct seqan_traits
{
    using alphabet_type = alphabet_t;
    using sequence_type = seqan::String<alphabet_type, seqan::Alloc<>>;
};

seqan3::test::sequence_generator<seqan::Dna4, seqan_traits> ...;

@rrahn
Copy link
Contributor Author

rrahn commented Aug 11, 2020

Resolution: 11.08.2020

This would be an example for the look and feel of the seqan3::test::random_sequence_generator to generate a set of sequences:

template <std::uniform_random_bit_generator generator_t>
auto fn(generator_t && generator /*could be std::mt19937_64*/)
{
    using sequence_t = std::vector<seqan3::dna4>;
    std::vector<sequence_t> sequences(100);
    seqan3::test::random_sequence_generator sequence_gen{.size = 100, .size_variance = 10};
    std::ranges::generate(sequences, [&] () { return sequence_gen<sequence_t>(generator); });
    return sequences;
}

... we also considered a view based interface, but we decided for the above solution.

template <std::uniform_random_bit_generator generator_t>
auto fn(generator_t && generator /*could be std::mt19937_64*/)
{
    using sequence_t = std::vector<seqan3::dna4>;
    std::vector<sequence_t> sequences;
    seqan3::test::random_sequence_generator sequence_gen{.size = 100, .size_variance = 10};
    std::ranges::move(ranges::view::generate_n([&] () { return sequence_gen<sequence_t>(generator); }, 100), 
                      std::back_inserter(sequences));
    return sequences;
}

EDIT (14.08.2020)

This would be the scaffold of the random_sequence_generator.

class random_sequence_generator
{
public:
    // all constructors
    explicit random_sequence_generator(size_t size, size_t size_variance = 0) : size{size}, size_variance{size_variance}
    {}

    template <seqan3::sequence sequence_t, std::uniform_random_bit_generator generator_t>
        requires (std::ranges::output_range<sequence_t>) // possibly a specific interface that allows to create a std::back_inserter from the sequence type.
    sequence_t operator()(generator_t && random_generator) const
    {
        sequence_t random_sequence{};
        // implement sequence generation using random_generator
        return random_sequence;
    }

    size_t size{};
    size_t size_variance{};
};

User calling code:

int main()
{
    std::mt19937_64 random_engine{42};
    seqan3::test::random_sequence_generator random_sequence_gen{100};
    auto seq1 = random_sequence_gen<std::vector<seqan3::dna4>>(random_engine);
    auto seq2 = random_sequence_gen<seqan::String<seqan::Dna>>(random_engine);
}

@smehringer
Copy link
Member

Fixed by seqan/seqan3#1985

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

4 participants