Module: IO Sequence

kneubert edited this page Mar 21, 2017 · 14 revisions

Module Session

  • test/implementation for sequence_file_in.cpp [ME,SM]
  • test/implementation for sequence_file_out.cpp [KN]
  • definition of sequence_file_concept in sequence_file_format.hpp [JW, GU]
  • implementation of sequence_file_format_fasta.hpp [TD, JK]
  • copy/implement the tokenisation from seqan2 [RR]

Layout

io/sequence.hpp                              // meta-header
io/sequence/sequence_file_in.hpp             // seq_file_in, seq_file_in_traits_concept, seq_file_default_traits
io/sequence/sequence_file_format.hpp         // seq_file_in_formats_concept
io/sequence/sequence_file_format_fasta.hpp   // seq_file_in_formats_fasta
io/sequence/sequence_file_format_fastq.hpp   // seq_file_in_formats_fastq
io/sequence/sequence_file_format_embl.hpp    // seq_file_in_formats_embl
io/sequence/sequence_file_format_genbank.hpp // seq_file_in_formats_genbank
io/sequence/sequence_file_format_raw.hpp     // seq_file_in_formats_raw
io/sequence/sequence_file_format_sam.hpp     // seq_file_in_formats_sam
io/sequence/sequence_file_format_bam.hpp     // seq_file_in_formats_bam


plus the equivalent set of headers for seq_file_out

Description

The seq_file_in is the data structure that regular users interact with. They instantiate an object of this type with a path to a file, and internally the correct format is selected from the file's extension (the selected format is stored in a std::variant). They can then call the .read(seq, id, qual) member of the seq_file_in to read either a single record or multiple records (the latter requires the container-of-container concept) into the provided out-parameters. The read call is forwarded to the actual implementation in the selected format via tag-dispatching through std::visit.
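The mechanism described above can be sketched as follows. This is only an illustration, not the prototype's code: the names format_fasta, format_raw, select_format and seq_file_in_sketch are hypothetical, the quality parameter is omitted for brevity, and the stream is passed in from outside so that the example does not depend on a file on disk.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <variant>

// Hypothetical format type: reads one FASTA record ('>' header, then sequence lines).
struct format_fasta
{
    bool read(std::istream & in, std::string & seq, std::string & id) const
    {
        std::string line;
        if (!std::getline(in, line) || line.empty() || line[0] != '>')
            return false;                       // no further record
        id = line.substr(1);
        seq.clear();
        while (in.peek() != '>' && std::getline(in, line))
            seq += line;                        // concatenate multi-line sequences
        return true;
    }
};

// Hypothetical format type: one sequence per line, no id.
struct format_raw
{
    bool read(std::istream & in, std::string & seq, std::string & id) const
    {
        id.clear();
        return static_cast<bool>(std::getline(in, seq));
    }
};

using format_variant = std::variant<format_fasta, format_raw>;

// Pick the format from the file extension (assumed extension list).
inline format_variant select_format(std::string const & path)
{
    auto const dot = path.rfind('.');
    std::string const ext = (dot == std::string::npos) ? std::string{} : path.substr(dot + 1);
    if (ext == "fa" || ext == "fasta")
        return format_fasta{};
    if (ext == "raw" || ext == "txt")
        return format_raw{};
    throw std::invalid_argument{"unknown sequence file extension: " + ext};
}

class seq_file_in_sketch
{
public:
    seq_file_in_sketch(std::string const & path, std::istream & in) :
        format(select_format(path)), stream(&in)
    {}

    // Forward the read call to whichever format is held by the variant.
    bool read(std::string & seq, std::string & id)
    {
        return std::visit([&](auto const & fmt) { return fmt.read(*stream, seq, id); }, format);
    }

private:
    format_variant format;
    std::istream * stream;
};
```

A caller would construct seq_file_in_sketch with a name such as "reads.fasta" and call read(seq, id) in a loop until it returns false.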

The user may change the input stream type by providing their own traits type, which must satisfy the seq_file_in_traits_concept. The traits type also contains the list of valid formats. If the user wants to restrict the allowed formats, they can simply reduce the definition of the variant accordingly.
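A traits type of this kind might look roughly like the sketch below. The member names stream_type and valid_formats, the empty format tags and the _sketch types are assumptions made for illustration; the actual traits concept may require different members.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <type_traits>
#include <variant>

// Placeholder format tags (hypothetical, empty for this sketch).
struct format_fasta {};
struct format_fastq {};
struct format_raw {};

// Assumed shape of the default traits: stream type plus the list of valid formats.
struct seq_file_default_traits_sketch
{
    using stream_type   = std::ifstream;
    using valid_formats = std::variant<format_fasta, format_fastq, format_raw>;
};

// User-provided traits: read from an in-memory stream and allow only FASTA.
struct my_traits : seq_file_default_traits_sketch
{
    using stream_type   = std::istringstream;
    using valid_formats = std::variant<format_fasta>;
};

// The file type takes everything configurable from its traits parameter.
template <typename traits_t = seq_file_default_traits_sketch>
struct seq_file_in_sketch
{
    using stream_type   = typename traits_t::stream_type;
    using valid_formats = typename traits_t::valid_formats;

    stream_type   stream;
    valid_formats format;
};
```

With this layout, seq_file_in_sketch<my_traits> can only ever hold format_fasta, so an unsupported format is ruled out at compile time instead of at run time.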

If the user (or a library developer) wants to add a new format, they just need to make sure that the format satisfies the seq_file_in_formats_concept and add it to the above-mentioned variant.
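What the format concept checks is not spelled out here. As a stand-in, the following sketch uses a C++17 detection trait in place of a real concept and assumes that a format only has to provide a const read(stream, seq, id) member; the trait name is_seq_file_in_format and both example types are hypothetical.

```cpp
#include <cassert>
#include <istream>
#include <string>
#include <type_traits>
#include <utility>

// Primary template: a type is not a valid input format by default.
template <typename format_t, typename = void>
struct is_seq_file_in_format : std::false_type {};

// Specialisation: valid if format.read(stream, seq, id) is a well-formed call.
template <typename format_t>
struct is_seq_file_in_format<
    format_t,
    std::void_t<decltype(std::declval<format_t const &>().read(std::declval<std::istream &>(),
                                                               std::declval<std::string &>(),
                                                               std::declval<std::string &>()))>>
    : std::true_type
{};

// A new user-defined format that meets the assumed requirement.
struct format_one_per_line
{
    bool read(std::istream & in, std::string & seq, std::string & id) const
    {
        id.clear();
        return static_cast<bool>(std::getline(in, seq));
    }
};

// A type that lacks read() and is therefore rejected.
struct not_a_format {};

static_assert(is_seq_file_in_format<format_one_per_line>::value);
static_assert(!is_seq_file_in_format<not_a_format>::value);
```

A file type could place such a static_assert on every alternative of its format variant, so that a non-conforming format is reported when it is added to the list rather than when it is first used.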

Output works accordingly: seq_file_out mirrors this design for writing.

Design decisions

individual parameters VS record type

  • we discussed using a record type that contains sequence, id and qualities, or defining a concept for this, but it proved more difficult architecturally and added unnecessary complexity
  • for reading a single record it would have worked okay, and one could have used structured bindings and std::tie to still offer reading into individual sequence, id, ... variables
  • however, for reading batches of records this would have been impossible.
  • since reading the entire file into individual sequence, id and quality sets is one of the most common use cases, we opted for individual parameters
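The contrast between the two options can be illustrated with a toy example. Everything in it is hypothetical: record_sketch, read_record and read_all are made-up names, and the "format" is just alternating id and sequence lines. It shows a record type working with structured bindings for a single record, and individual parameters filling separate id and sequence sets for a whole file.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Option 1 (rejected): a record type bundling the fields of one entry.
struct record_sketch
{
    std::string seq;
    std::string id;
};

// Reading a single record by value; callers can unpack it with structured bindings.
inline record_sketch read_record(std::istream & in)
{
    record_sketch r;
    std::getline(in, r.id);
    std::getline(in, r.seq);
    return r;
}

// Option 2 (chosen): individual out-parameters, here used for a whole batch.
// The ids and sequences end up in separate sets without an intermediate record.
inline void read_all(std::istream & in,
                     std::vector<std::string> & seqs,
                     std::vector<std::string> & ids)
{
    std::string id, seq;
    while (std::getline(in, id) && std::getline(in, seq))
    {
        ids.push_back(std::move(id));
        seqs.push_back(std::move(seq));
    }
}
```

With the record type, a batch read would yield a container of records that still has to be split into separate sets afterwards; with individual parameters, the sets are filled directly.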

return values VS out-parameters in read()

  • one could have used return values when reading instead of out-parameters
  • the advantage is a cleaner design
  • as above, this would have worked well for a single read operation, where one can use structured bindings and std::tie to work with existing data structures and containers in the calling function
  • however, it would result in unnecessary copies when reading batches (for vector-of-vector, one could move the function return value into the outer vector, but if concat types are involved or the type of the container-of-container differs, it would always imply a needless copy)
  • additionally, all container types, including the sequence alphabet, would have had to be hard-coded in the traits objects, requiring more configuration from the user than necessary (e.g. just for reading amino acids they would have had to specify a different trait)
  • we decided on plain old out-parameters.
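The configuration argument can be made concrete with a small sketch. It assumes a toy input of alternating id and sequence lines, and read_all_generic is a hypothetical name. Because the containers are out-parameters, their types are deduced from the call, so no traits type has to name them.

```cpp
#include <cassert>
#include <deque>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Out-parameters let the caller choose the container-of-container types freely.
// Only push_back and a value_type constructible from an iterator pair are required.
template <typename seqs_t, typename ids_t>
void read_all_generic(std::istream & in, seqs_t & seqs, ids_t & ids)
{
    using seq_t = typename seqs_t::value_type;
    using id_t  = typename ids_t::value_type;

    std::string id_line, seq_line;
    while (std::getline(in, id_line) && std::getline(in, seq_line))
    {
        // Build each element directly in the caller's element type.
        ids.push_back(id_t(id_line.begin(), id_line.end()));
        seqs.push_back(seq_t(seq_line.begin(), seq_line.end()));
    }
}
```

The same function then serves a std::vector<std::vector<char>> of sequences together with a std::deque<std::string> of ids, or any other combination that meets the two requirements in the comment. A by-value return type would instead have to be fixed somewhere, for example in the traits.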

filters

TODO

Prototype implementation

https://gist.github.com/h-2/bf8d707c8f86acbe456ee043a5a301d3