Commit

Merge pull request #2290 from Irallia/feature/alphabet/add_phred94
[FEATURE] Alphabet: Add the quality alphabet seqan3::phred94
smehringer committed Dec 8, 2020
2 parents 066b704 + d797403 commit c812a00
Showing 17 changed files with 331 additions and 15 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -28,6 +28,11 @@ If possible, provide tooling that performs the changes, e.g. a shell-script.

## New features

#### Alphabet

* Added `seqan3::phred94`, a quality type that represents the full Phred Score range (Sanger format) and is used for
PacBio Phred scores of HiFi reads ([\#2290](https://github.com/seqan/seqan3/pull/2290)).

#### Argument Parser

* We expanded the `seqan3::output_file_validator`, with a parameter `seqan3::output_file_open_options` to allow overwriting
19 changes: 12 additions & 7 deletions include/seqan3/alphabet/quality/all.hpp
@@ -7,6 +7,7 @@

/*!\file
* \author Marie Hoffmann <marie.hoffmann AT fu-berlin.de>
* \author Lydia Buntrock <lydia.buntrock AT fu-berlin.de>
* \brief Meta-header that includes all headers from alphabet/quality/
*/

@@ -18,6 +19,7 @@
#include <seqan3/alphabet/quality/phred42.hpp>
#include <seqan3/alphabet/quality/phred63.hpp>
#include <seqan3/alphabet/quality/phred68legacy.hpp>
#include <seqan3/alphabet/quality/phred94.hpp>

/*!\defgroup quality Quality
* \brief Provides the various quality score types.
@@ -51,11 +53,12 @@
*
* ###Encoding Schemes
*
* | Format | Quality Type | Phred Score Range | Rank Range | ASCII Range | Assert |
* |:---------------------------:|:----------------------|:------------------:|:------------:|:------------:|:-------------------------:|
* | Sanger, Illumina 1.8+ short | seqan3::phred42 | [0 .. 41] | [0 .. 41] | ['!' .. 'J'] | Phred score in [0 .. 61] |
* | Sanger, Illumina 1.8+ long | seqan3::phred63 | [0 .. 62] | [0 .. 62] | ['!' .. '_'] | Phred score in [0 .. 62] |
* | Solexa, Illumina [1.0; 1.8[ | seqan3::phred68legacy | [-5 .. 62] | [0 .. 67] | [';' .. '~'] | Phred score in [-5 .. 62] |
* | Standard Use Case | Format | Encoding | Alphabet Type | Phred Score Range | Rank Range | ASCII Range |
* |:-------------------:|:---------------------------:|:--------:|:----------------------|:-----------------:|:----------:|:------------:|
* | Sanger, Illumina | Sanger, Illumina 1.8+ | Phred+33 | seqan3::phred42 | [0 .. 41] | [0 .. 41] | ['!' .. 'J'] |
* | Sanger, Illumina | Sanger, Illumina 1.8+ | Phred+33 | seqan3::phred63 | [0 .. 62] | [0 .. 62] | ['!' .. '_'] |
* | PacBio | Sanger, Illumina 1.8+ | Phred+33 | seqan3::phred94 | [0 .. 93] | [0 .. 93] | ['!' .. '~'] |
* | Solexa | Solexa, Illumina [1.0; 1.8[ | Phred+64 | seqan3::phred68legacy | [-5 .. 62] | [0 .. 67] | [';' .. '~'] |
*
* The most widely used format is the *Sanger* or <I>Illumina 1.8+</I> format.
* Despite typical Phred scores for Illumina machines range from 0 to maximal
@@ -65,8 +68,10 @@
* For other formats, like Solexa and Illumina 1.0 to 1.7 the type
* seqan3::phred68legacy is provided. To cover also the Solexa format, the Phred
* score is stored as a <B>signed</B> integer starting at -5.
* If you want to store PacBio HiFi reads, we recommend using seqan3::phred94, as these reads use the full range of
* Phred quality scores.
* An overview of all the score formats and their encodings can be found here:
* https://en.wikipedia.org/wiki/FASTQ_format#Encoding.
* https://en.wikipedia.org/wiki/FASTQ_format#Encoding (last access 01.12.2020).
*
* ###Concept
*
@@ -92,5 +97,5 @@
* All quality alphabets are explicitly convertible to each other via their
* Phred representation. Values not present in one alphabet are mapped to the
* closest value in the target alphabet (e.g. a `seqan3::phred63` letter with
* value 60 will convert to a `seqan3::phred42` letter of score 41).
* value 60 will convert to a `seqan3::phred42` letter of score 41; this also applies to `seqan3::phred94`).
*/
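The encoding table and the clamping rule described in this comment can be sketched without SeqAn3. The block below is an illustration only: `quality_scheme`, `clamp_phred`, `to_rank` and `to_char` are hypothetical names, not part of the library, and the scheme parameters are copied from the table above.

```cpp
#include <algorithm>

// Hypothetical description of one encoding scheme; this is NOT a SeqAn3 type.
struct quality_scheme
{
    int  min_phred;   // smallest representable Phred score
    int  size;        // number of distinct values (alphabet size)
    char first_char;  // ASCII character that encodes rank 0
};

// Parameters taken from the "Encoding Schemes" table.
inline constexpr quality_scheme phred42_scheme{0, 42, '!'};
inline constexpr quality_scheme phred63_scheme{0, 63, '!'};
inline constexpr quality_scheme phred94_scheme{0, 94, '!'};
inline constexpr quality_scheme phred68legacy_scheme{-5, 68, ';'};

// Map a Phred score to the closest value the target scheme can represent.
constexpr int clamp_phred(quality_scheme const & s, int phred)
{
    return std::clamp(phred, s.min_phred, s.min_phred + s.size - 1);
}

// Rank is the zero-based index of the (clamped) score within the scheme.
constexpr int to_rank(quality_scheme const & s, int phred)
{
    return clamp_phred(s, phred) - s.min_phred;
}

// The character is the scheme's first character shifted by the rank.
constexpr char to_char(quality_scheme const & s, int phred)
{
    return static_cast<char>(s.first_char + to_rank(s, phred));
}
```

Under these assumptions, a score of 60 clamps to 41 in the 42-value scheme, and the largest score of each scheme maps to the last character of its ASCII range in the table.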
4 changes: 3 additions & 1 deletion include/seqan3/alphabet/quality/phred42.hpp
@@ -32,13 +32,15 @@ namespace seqan3
*
* \details
*
* The phred42 quality alphabet represents the zero-based phred score range
* The phred42 \ref quality alphabet represents the zero-based phred score range
* [0..41] mapped to the consecutive ASCII range ['!' .. 'J']. It therefore can
* represent the Illumina 1.8+ standard and the original Sanger score. If you
* intend to use phred scores exceeding 41, use the larger score type, namely
* seqan3::phred63, otherwise on construction exceeding scores are mapped to 41.
*
* \include test/snippet/alphabet/quality/phred42.cpp
*
* \see quality
*/
class phred42 : public quality_base<phred42, 42>
{
10 changes: 6 additions & 4 deletions include/seqan3/alphabet/quality/phred63.hpp
@@ -21,7 +21,7 @@
namespace seqan3
{

/*!\brief Quality type for traditional Sanger and modern Illumina Phred scores (full range).
/*!\brief Quality type for traditional Sanger and modern Illumina Phred scores.
* \implements seqan3::writable_quality_alphabet
* \if DEV \implements seqan3::detail::writable_constexpr_alphabet \endif
* \implements seqan3::trivially_copyable
@@ -32,11 +32,13 @@ namespace seqan3
*
* \details
*
* The phred63 quality alphabet represents the zero-based phred score range
* [0..62] mapped to the ASCII range ['!' .. '~']. It represents the Sanger and
* Illumina 1.8+ standard beyond the typical range of raw reads (0 to 41).
* The phred63 \ref quality alphabet represents the zero-based Phred score range [0..62] mapped to the ASCII range
* ['!' .. '_']. It covers Sanger and Illumina 1.8+ scores beyond the typical range of raw reads (0 to 41), which
* seqan3::phred42 already represents.
*
* \include test/snippet/alphabet/quality/phred63.cpp
*
* \see quality
*/
class phred63 : public quality_base<phred63, 63>
{
116 changes: 116 additions & 0 deletions include/seqan3/alphabet/quality/phred94.hpp
@@ -0,0 +1,116 @@
// -----------------------------------------------------------------------------------------------------
// Copyright (c) 2006-2020, Knut Reinert & Freie Universität Berlin
// Copyright (c) 2016-2020, Knut Reinert & MPI für molekulare Genetik
// This file may be used, modified and/or redistributed under the terms of the 3-clause BSD-License
// shipped with this file and also available at: https://github.com/seqan/seqan3/blob/master/LICENSE.md
// -----------------------------------------------------------------------------------------------------

/*!\file
* \author Lydia Buntrock <lydia.buntrock AT fu-berlin.de>
* \brief Provides seqan3::phred94 quality scores.
*/

#pragma once

#include <seqan3/alphabet/quality/quality_base.hpp>

namespace seqan3
{

/*!\brief Quality type for PacBio Phred scores of HiFi reads.
* \implements seqan3::writable_quality_alphabet
* \if DEV \implements seqan3::detail::writable_constexpr_alphabet \endif
* \implements seqan3::trivially_copyable
* \implements seqan3::standard_layout
* \implements std::regular
*
* \ingroup quality
*
* \details
*
* The phred94 quality alphabet represents the zero-based Phred score range [0..93] mapped to the ASCII range
* ['!' .. '~'] (Sanger, Illumina 1.8+ format). It is typically used for HiFi reads produced by PacBio. Sanger and
* Illumina Phred scores of raw reads typically lie in [0..41], which seqan3::phred42 represents. If you expect
* only slightly larger scores, you can use seqan3::phred63 ([0..62]), which still has memory advantages when used
* with seqan3::qualified.
*
* \include test/snippet/alphabet/quality/phred94.cpp
*
* You can find more information about Phred scores in our \ref quality submodule.
*/
class phred94 : public quality_base<phred94, 94>
{
private:
//!\brief The base class.
using base_t = quality_base<phred94, 94>;

//!\brief Befriend seqan3::quality_base.
friend base_t;
//!\cond \brief Befriend seqan3::alphabet_base.
friend base_t::base_t;
//!\endcond

public:
/*!\name Constructors, destructor and assignment
* \{
*/
constexpr phred94() noexcept = default; //!< Defaulted.
constexpr phred94(phred94 const &) noexcept = default; //!< Defaulted.
constexpr phred94(phred94 &&) noexcept = default; //!< Defaulted.
constexpr phred94 & operator=(phred94 const &) noexcept = default; //!< Defaulted.
constexpr phred94 & operator=(phred94 &&) noexcept = default; //!< Defaulted.
~phred94() noexcept = default; //!< Defaulted.

//!\brief Construct from phred value.
constexpr phred94(phred_type const p) : base_t{p} {}

// Inherit converting constructor
using base_t::base_t;
//!\}

/*!\name Member variables.
* \{
*/
//!\brief The projection offset between phred and rank score representation.
static constexpr phred_type offset_phred{0};

//!\brief The projection offset between char and rank score representation.
static constexpr char_type offset_char{'!'};
//!\}
};

/*!\name Literals
* \{
*/
/*!\brief The seqan3::phred94 char literal.
* \relates seqan3::phred94
* \returns seqan3::phred94
*/
constexpr phred94 operator""_phred94(char const c) noexcept
{
return phred94{}.assign_char(c);
}

/*!\brief The seqan3::phred94 string literal.
* \param[in] s A pointer to the character sequence to assign from.
* \param[in] n The length of the character sequence to assign from.
* \relates seqan3::phred94
* \returns std::vector<seqan3::phred94>
*
* You can use this string literal to easily assign to std::vector<seqan3::phred94>:
*
* \include test/snippet/alphabet/quality/phred94_literal.cpp
*/
inline std::vector<phred94> operator""_phred94(char const * s, std::size_t n)
{
std::vector<phred94> r;
r.resize(n);

for (std::size_t i = 0; i < n; ++i)
r[i].assign_char(s[i]);

return r;
}
//!\}

} // namespace seqan3
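The remark about memory advantages with `seqan3::qualified` can be made concrete with a little arithmetic. The sketch below assumes that a composite of a sequence alphabet and a quality alphabet stores a single rank whose range is the product of both alphabet sizes, held in the smallest unsigned integer that fits. `combined_size` and `rank_bytes` are hypothetical helpers for illustration, not SeqAn3 functions.

```cpp
#include <cstddef>

// Number of distinct values of a (sequence letter, quality letter) pair.
constexpr std::size_t combined_size(std::size_t sequence_size, std::size_t quality_size)
{
    return sequence_size * quality_size;
}

// Bytes of the smallest unsigned integer (8, 16 or 32 bit) that can hold
// every rank in [0, alphabet_size). Assumption: this mirrors how a combined
// rank would be stored; it is not taken from the library source.
constexpr std::size_t rank_bytes(std::size_t alphabet_size)
{
    if (alphabet_size <= 256u)
        return 1;
    if (alphabet_size <= 65536u)
        return 2;
    return 4;
}
```

Under this assumption, a 4-letter nucleotide alphabet combined with 42 or 63 quality values (168 or 252 combinations) still fits into one byte, whereas combining it with 94 quality values (376 combinations) needs two.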
2 changes: 2 additions & 0 deletions test/performance/alphabet/alphabet_assign_char_benchmark.cpp
@@ -46,6 +46,7 @@ BENCHMARK_TEMPLATE(assign_char, seqan3::aa20);
BENCHMARK_TEMPLATE(assign_char, seqan3::aa27);
BENCHMARK_TEMPLATE(assign_char, seqan3::phred42);
BENCHMARK_TEMPLATE(assign_char, seqan3::phred63);
BENCHMARK_TEMPLATE(assign_char, seqan3::phred94);
/* adaptations */
BENCHMARK_TEMPLATE(assign_char, char);
BENCHMARK_TEMPLATE(assign_char, char32_t);
@@ -58,6 +59,7 @@ BENCHMARK_TEMPLATE(assign_char, seqan3::alphabet_variant<seqan3::dna4, char>);
BENCHMARK_TEMPLATE(assign_char, seqan3::masked<seqan3::dna4>);
BENCHMARK_TEMPLATE(assign_char, seqan3::qualified<seqan3::dna4, seqan3::phred42>);
BENCHMARK_TEMPLATE(assign_char, seqan3::qualified<seqan3::dna5, seqan3::phred63>);
BENCHMARK_TEMPLATE(assign_char, seqan3::qualified<seqan3::dna5, seqan3::phred94>);

#if SEQAN3_HAS_SEQAN2
template <typename alphabet_t>
2 changes: 2 additions & 0 deletions test/performance/alphabet/alphabet_assign_rank_benchmark.cpp
@@ -53,6 +53,7 @@ BENCHMARK_TEMPLATE(assign_rank, seqan3::aa20);
BENCHMARK_TEMPLATE(assign_rank, seqan3::aa27);
BENCHMARK_TEMPLATE(assign_rank, seqan3::phred42);
BENCHMARK_TEMPLATE(assign_rank, seqan3::phred63);
BENCHMARK_TEMPLATE(assign_rank, seqan3::phred94);
/* adaptations */
BENCHMARK_TEMPLATE(assign_rank, char);
BENCHMARK_TEMPLATE(assign_rank, char32_t);
@@ -65,6 +66,7 @@ BENCHMARK_TEMPLATE(assign_rank, seqan3::alphabet_variant<seqan3::dna4, char>);
BENCHMARK_TEMPLATE(assign_rank, seqan3::masked<seqan3::dna4>);
BENCHMARK_TEMPLATE(assign_rank, seqan3::qualified<seqan3::dna4, seqan3::phred42>);
BENCHMARK_TEMPLATE(assign_rank, seqan3::qualified<seqan3::dna5, seqan3::phred63>);
BENCHMARK_TEMPLATE(assign_rank, seqan3::qualified<seqan3::dna5, seqan3::phred94>);

#if SEQAN3_HAS_SEQAN2
template <typename alphabet_t>
2 changes: 2 additions & 0 deletions test/performance/alphabet/alphabet_to_char_benchmark.cpp
@@ -66,6 +66,7 @@ BENCHMARK_TEMPLATE(to_char, seqan3::aa20);
BENCHMARK_TEMPLATE(to_char, seqan3::aa27);
BENCHMARK_TEMPLATE(to_char, seqan3::phred42);
BENCHMARK_TEMPLATE(to_char, seqan3::phred63);
BENCHMARK_TEMPLATE(to_char, seqan3::phred94);
/* adaptations */
BENCHMARK_TEMPLATE(to_char, char);
BENCHMARK_TEMPLATE(to_char, char32_t);
@@ -78,6 +79,7 @@ BENCHMARK_TEMPLATE(to_char, seqan3::alphabet_variant<seqan3::dna4, char>);
BENCHMARK_TEMPLATE(to_char, seqan3::masked<seqan3::dna4>);
BENCHMARK_TEMPLATE(to_char, seqan3::qualified<seqan3::dna4, seqan3::phred42>);
BENCHMARK_TEMPLATE(to_char, seqan3::qualified<seqan3::dna5, seqan3::phred63>);
BENCHMARK_TEMPLATE(to_char, seqan3::qualified<seqan3::dna5, seqan3::phred94>);

#if SEQAN3_HAS_SEQAN2
template <typename alphabet_t>
2 changes: 2 additions & 0 deletions test/performance/alphabet/alphabet_to_rank_benchmark.cpp
@@ -60,6 +60,7 @@ BENCHMARK_TEMPLATE(to_rank, seqan3::aa20);
BENCHMARK_TEMPLATE(to_rank, seqan3::aa27);
BENCHMARK_TEMPLATE(to_rank, seqan3::phred42);
BENCHMARK_TEMPLATE(to_rank, seqan3::phred63);
BENCHMARK_TEMPLATE(to_rank, seqan3::phred94);
/* adaptations */
BENCHMARK_TEMPLATE(to_rank, char);
BENCHMARK_TEMPLATE(to_rank, char32_t);
@@ -72,6 +73,7 @@ BENCHMARK_TEMPLATE(to_rank, seqan3::alphabet_variant<seqan3::dna4, char>);
BENCHMARK_TEMPLATE(to_rank, seqan3::masked<seqan3::dna4>);
BENCHMARK_TEMPLATE(to_rank, seqan3::qualified<seqan3::dna4, seqan3::phred42>);
BENCHMARK_TEMPLATE(to_rank, seqan3::qualified<seqan3::dna5, seqan3::phred63>);
BENCHMARK_TEMPLATE(to_rank, seqan3::qualified<seqan3::dna5, seqan3::phred94>);

#if SEQAN3_HAS_SEQAN2
template <typename alphabet_t>
4 changes: 4 additions & 0 deletions test/snippet/alphabet/quality/phred63.cpp
@@ -12,4 +12,8 @@ int main()
seqan3::phred63 another_phred{49};
seqan3::debug_stream << another_phred.to_phred() << "\n"; // 49
// we need to cast to (int) for human readable console output

seqan3::phred63 a_third_phred;
a_third_phred.assign_phred(75); // converted down to 62
seqan3::debug_stream << a_third_phred.to_phred() << "\n"; // 62
}
18 changes: 18 additions & 0 deletions test/snippet/alphabet/quality/phred94.cpp
@@ -0,0 +1,18 @@
#include <seqan3/alphabet/quality/phred94.hpp>
#include <seqan3/core/debug_stream.hpp>

int main()
{
seqan3::phred94 phred;
phred.assign_rank(2); // wrapper for assign_phred(2)
seqan3::debug_stream << phred.to_phred() << "\n"; // 2
seqan3::debug_stream << phred.to_char() << "\n"; // '#'
seqan3::debug_stream << phred.to_rank() << "\n"; // 2

seqan3::phred94 another_phred{75};
seqan3::debug_stream << another_phred.to_phred() << "\n"; // 75

seqan3::phred94 a_third_phred;
a_third_phred.assign_phred(105); // converted down to 93
seqan3::debug_stream << a_third_phred.to_phred() << "\n"; // 93
}
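The relation between rank, Phred score and character that this snippet exercises can be reproduced in a small standalone type. This is a minimal sketch, not the SeqAn3 class: `mini_phred94` is a hypothetical stand-in that only reuses the two offsets (`0` and `'!'`) and the alphabet size 94 from the header above, and it assumes out-of-range scores are clamped to the nearest representable value.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for a 94-value quality alphabet; NOT seqan3::phred94.
class mini_phred94
{
public:
    static constexpr int  alphabet_size{94};
    static constexpr int  offset_phred{0};   // Phred score of rank 0
    static constexpr char offset_char{'!'};  // character of rank 0

    // Store the score as a rank; scores outside [0..93] are clamped.
    constexpr mini_phred94 & assign_phred(int const p)
    {
        rank_ = static_cast<std::uint8_t>(std::clamp(p - offset_phred, 0, alphabet_size - 1));
        return *this;
    }

    constexpr int to_rank() const { return rank_; }
    constexpr int to_phred() const { return rank_ + offset_phred; }
    constexpr char to_char() const { return static_cast<char>(offset_char + rank_); }

private:
    std::uint8_t rank_{0};
};
```

Because `offset_phred` is 0 here, rank and Phred score coincide, which is why assigning rank 2 and assigning Phred score 2 give the same letter `'#'`.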
17 changes: 17 additions & 0 deletions test/snippet/alphabet/quality/phred94_literal.cpp
@@ -0,0 +1,17 @@
#include <seqan3/std/algorithm>

#include <seqan3/alphabet/quality/phred94.hpp>
#include <seqan3/core/debug_stream.hpp>

int main()
{
using seqan3::operator""_phred94;

// directly assign to a std::vector<phred94> using a string literal
std::vector<seqan3::phred94> qual_vec = "###!"_phred94;

// This is the same as a sequence of char literals:
std::vector<seqan3::phred94> qual_vec2 = {'#'_phred94, '#'_phred94, '#'_phred94, '!'_phred94};

seqan3::debug_stream << std::ranges::equal(qual_vec, qual_vec2) << '\n'; // prints 1 (true)
}
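How a character literal and a string literal can share one suffix may be easier to see in isolation. The sketch below uses a hypothetical suffix `_q94` that returns plain `int` Phred scores rather than alphabet letters, and it assumes every input character lies in `'!'` .. `'~'` (no clamping or validation is done).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical char literal: one Phred+33 character -> its Phred score.
constexpr int operator""_q94(char const c) noexcept
{
    return c - '!';
}

// Hypothetical string literal: every character of the literal is converted
// the same way and collected in a vector.
inline std::vector<int> operator""_q94(char const * s, std::size_t n)
{
    std::vector<int> scores;
    scores.reserve(n);

    for (std::size_t i = 0; i < n; ++i)
        scores.push_back(s[i] - '!');

    return scores;
}
```

The compiler picks the overload from the kind of literal: `'#'_q94` calls the `char` version, while `"###!"_q94` calls the version that takes a pointer and a length.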
3 changes: 1 addition & 2 deletions test/unit/alphabet/composite/composite_integration_test.cpp
@@ -19,14 +19,13 @@ using seqan3::operator""_aa27;
using seqan3::operator""_dna4;
using seqan3::operator""_rna4;

// tests various combinations of alphabet_variant and alphabet_tuple
using qualified_dna_phred42 = seqan3::qualified<seqan3::dna4, seqan3::phred42>;
using qualified_gapped_dna_phred42 = seqan3::qualified<seqan3::gapped<seqan3::dna4>, seqan3::phred42>;
using gapped_qualified_dna_phred42 = seqan3::gapped<qualified_dna_phred42>;
using qualified_qualified_gapped_dna_phred42_phred42 = seqan3::qualified<qualified_gapped_dna_phred42, seqan3::phred42>;
using gapped_alphabet_variant_dna_phred42 = seqan3::gapped<seqan3::alphabet_variant<seqan3::dna4, seqan3::phred42>>;

// Some haessllihckeiten-tests

TEST(composite, custom_constructors)
{
qualified_dna_phred42 t11{'C'_dna4};
1 change: 1 addition & 0 deletions test/unit/alphabet/quality/CMakeLists.txt
@@ -2,5 +2,6 @@
seqan3_test(phred42_test.cpp)
seqan3_test(phred63_test.cpp)
seqan3_test(phred68legacy_test.cpp)
seqan3_test(phred94_test.cpp)
seqan3_test(qualified_test.cpp)
seqan3_test(quality_conversion_integration_test.cpp)