Module: IO Sequence
- test/implementation for sequence_file_in.cpp [ME,SM]
- test/implementation for sequence_file_out.cpp [KN]
- definition of sequence_file_concept in sequence_file_format.hpp [JW, GU]
- implementation of sequence_file_format_fasta.hpp [TD, JK]
- copy/implement the tokenisation from seqan2 [RR]
io/sequence.hpp // meta-header
io/sequence/sequence_file_in.hpp // seq_file_in, seq_file_in_traits_concept, seq_file_default_traits
io/sequence/sequence_file_format.hpp // seq_file_in_formats_concept
io/sequence/sequence_file_format_fasta.hpp // seq_file_in_formats_fasta
io/sequence/sequence_file_format_fastq.hpp // seq_file_in_formats_fastq
io/sequence/sequence_file_format_embl.hpp // seq_file_in_formats_embl
io/sequence/sequence_file_format_genbank.hpp // seq_file_in_formats_genbank
io/sequence/sequence_file_format_raw.hpp // seq_file_in_formats_raw
io/sequence/sequence_file_format_sam.hpp // seq_file_in_formats_sam
io/sequence/sequence_file_format_bam.hpp // seq_file_in_formats_bam
+ the whole thing for seq_file_out
The seq_file_in is the data structure that regular users interact with. They instantiate an object of this type with a path to a file, and internally the correct format is selected from the file's extension (this is done via std::variant).
They can then call the .read(seq, id, qual) member of the seq_file_in to read either a single record or multiple records (the latter requires the container-of-container concept) into the provided out-parameters. The read call is forwarded to the actual implementation in the selected format via tag-dispatching through std::visit.
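The selection and forwarding described above can be sketched roughly as follows. This is only an illustration under several assumptions, not the actual implementation: the names format_fasta, format_fastq, select_format and seq_file_in_sketch are hypothetical, the formats expose a read() member instead of using tag-dispatching, and the stream is passed in from outside so that the example does not need a real file.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <variant>

// Hypothetical format type: reads one FASTA record ('>' id line, then sequence lines).
struct format_fasta
{
    void read(std::istream & in, std::string & seq, std::string & id, std::string & qual) const
    {
        seq.clear();
        id.clear();
        qual.clear(); // FASTA carries no qualities
        std::string line;
        if (!std::getline(in, line) || line.empty() || line[0] != '>')
            throw std::runtime_error{"expected FASTA header"};
        id = line.substr(1);
        // Append sequence lines until the next header or the end of the stream.
        while (in.peek() != '>' && std::getline(in, line))
            seq += line;
    }
};

// Hypothetical format type: reads one four-line FASTQ record.
struct format_fastq
{
    void read(std::istream & in, std::string & seq, std::string & id, std::string & qual) const
    {
        std::string header, plus;
        if (!std::getline(in, header) || header.empty() || header[0] != '@')
            throw std::runtime_error{"expected FASTQ header"};
        id = header.substr(1);
        if (!std::getline(in, seq) || !std::getline(in, plus) || !std::getline(in, qual))
            throw std::runtime_error{"truncated FASTQ record"};
    }
};

// The list of valid formats is a variant; exactly one alternative is active per file.
using format_variant = std::variant<format_fasta, format_fastq>;

// Assumed helper: pick the active alternative from the file extension.
inline format_variant select_format(std::string const & path)
{
    auto const dot = path.rfind('.');
    std::string const ext = (dot == std::string::npos) ? std::string{} : path.substr(dot + 1);
    if (ext == "fa" || ext == "fasta")
        return format_fasta{};
    if (ext == "fq" || ext == "fastq")
        return format_fastq{};
    throw std::invalid_argument{"unknown extension: " + ext};
}

class seq_file_in_sketch
{
public:
    seq_file_in_sketch(std::string const & path, std::istream & input) :
        format(select_format(path)), stream(input)
    {}

    // Forwards the read call to whichever format is active in the variant.
    void read(std::string & seq, std::string & id, std::string & qual)
    {
        std::visit([&](auto const & fmt) { fmt.read(stream, seq, id, qual); }, format);
    }

private:
    format_variant format;
    std::istream & stream;
};
```

With this sketch, constructing seq_file_in_sketch with a path ending in .fasta and calling read(seq, id, qual) fills the three out-parameters from the next record of the stream; the caller never names the format explicitly.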
The user may change the input stream type by providing their own traits type, which must satisfy the seq_file_in_traits_concept. The traits type also contains the list of valid formats. If the user wants to restrict the allowed formats, they can simply reduce the definition of the variant accordingly.
If the user (or a library developer) wants to add a new format, they just need to make sure that the format satisfies the seq_file_in_formats_concept and add it to the above-mentioned variant.
Output works accordingly.
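The traits-based customisation of the input side might look like the sketch below. All names in it (default_traits_sketch, fasta_only_traits, extended_traits, format_my_custom, is_seq_file_in_format, all_formats_valid) are hypothetical, and a C++17 detection idiom stands in for seq_file_in_formats_concept; the real concept definitions may require more than a read() member.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>

// Stand-in for the format concept: a format must offer read(stream, seq, id, qual).
template <typename format_t, typename = void>
struct is_seq_file_in_format : std::false_type
{};

template <typename format_t>
struct is_seq_file_in_format<
    format_t,
    std::void_t<decltype(std::declval<format_t const &>().read(std::declval<std::istream &>(),
                                                               std::declval<std::string &>(),
                                                               std::declval<std::string &>(),
                                                               std::declval<std::string &>()))>>
    : std::true_type
{};

// Hypothetical formats; the bodies are left empty because only the interface matters here.
struct format_fasta
{
    void read(std::istream &, std::string &, std::string &, std::string &) const {}
};
struct format_fastq
{
    void read(std::istream &, std::string &, std::string &, std::string &) const {}
};
// A user-defined format that is meant to be added to the list of valid formats.
struct format_my_custom
{
    void read(std::istream &, std::string &, std::string &, std::string &) const {}
};
// Lacks read(), so it does not satisfy the stand-in concept.
struct not_a_format
{};

// Assumed shape of the default traits: stream type plus the variant of valid formats.
struct default_traits_sketch
{
    using stream_type   = std::ifstream;
    using valid_formats = std::variant<format_fasta, format_fastq>;
};

// Restricting the allowed formats: reduce the variant.
struct fasta_only_traits : default_traits_sketch
{
    using valid_formats = std::variant<format_fasta>;
};

// Adding a new format: extend the variant.
struct extended_traits : default_traits_sketch
{
    using valid_formats = std::variant<format_fasta, format_fastq, format_my_custom>;
};

// Checks that every alternative of a variant satisfies the stand-in concept.
template <typename variant_t>
struct all_formats_valid;

template <typename... formats_t>
struct all_formats_valid<std::variant<formats_t...>>
    : std::conjunction<is_seq_file_in_format<formats_t>...>
{};

static_assert(all_formats_valid<default_traits_sketch::valid_formats>::value);
static_assert(all_formats_valid<extended_traits::valid_formats>::value);
```

A file type parameterised on such a traits type could run the all_formats_valid check in a static_assert, so that a format missing the required interface is rejected at compile time rather than at the first read.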
- we discussed using a record type that contains sequence, id and qualities, or defining a concept for this, but it proved more difficult architecturally and added unnecessary complexity
- for reading a single record it would have worked okay, and one could have used structured bindings and std::tie to still offer reading into individual sequence, id, ... variables; however, for reading batches of records this would have been impossible.
- since reading the entire file into individual sequence, id and quality sets is one of the most common use cases, we opted for individual parameters
- one could have used return values when reading instead of out-parameters
- the advantage is a cleaner design
- as above, this would have worked well for a single read operation, where one can use structured bindings and std::tie to work with existing data structures and containers in the calling function; however, it would result in unnecessary copies when reading batches (for vector-of-vector, one could move the function return value into the outer vector, but if concat types are involved or the type of the container-of-container differs, it would always imply a needless copy)
- additionally, all container types, including the sequence alphabet, would have had to be hard-coded in the traits objects, requiring more configuration from the user than necessary (e.g. just for reading amino acids they would have had to specify a different trait)
- we decided on plain old out-parameters.
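To illustrate the trade-off discussed in the list above, here is a minimal sketch of out-parameter based reading. It is not the module's code: read_record and read_all are hypothetical free functions, the input is a simplified FASTA-like stream, and the quality parameter is omitted for brevity. The point is that the container and element types are deduced from the arguments instead of being fixed in a traits object.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Reads one record into caller-provided containers; returns false when no record is left.
// seq_t and id_t are deduced, so the caller decides which containers are filled.
template <typename seq_t, typename id_t>
bool read_record(std::istream & in, seq_t & seq, id_t & id)
{
    std::string line;
    if (!std::getline(in, line) || line.empty() || line[0] != '>')
        return false;
    id.assign(line.begin() + 1, line.end());
    seq.clear();
    // Append sequence lines until the next header or the end of the stream.
    while (in.peek() != '>' && std::getline(in, line))
        seq.insert(seq.end(), line.begin(), line.end());
    return true;
}

// Batch variant: fills container-of-container out-parameters record by record.
template <typename seqs_t, typename ids_t>
void read_all(std::istream & in, seqs_t & seqs, ids_t & ids)
{
    typename seqs_t::value_type seq;
    typename ids_t::value_type id;
    while (read_record(in, seq, id))
    {
        // Each record is moved into the caller's outer container; read_record
        // resets seq and id before they are used again.
        seqs.push_back(std::move(seq));
        ids.push_back(std::move(id));
    }
}
```

Calling read_all with a std::vector<std::vector<char>> for the sequences and a std::vector<std::string> for the ids works without any extra configuration; a different alphabet or outer container would only require passing a different argument type.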
TODO
https://gist.github.com/h-2/bf8d707c8f86acbe456ee043a5a301d3