[FEATURE] Alphabet: Add the quality alphabet seqan3::phred94 #2290

Merged
merged 5 commits into from
Dec 8, 2020
5 changes: 5 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,11 @@ If possible, provide tooling that performs the changes, e.g. a shell-script.

## New features

#### Alphabet

* Added `seqan3::phred94`, a quality type that represents the full Phred Score range (Sanger format) and is used for
PacBio Phred scores of HiFi reads ([\#2290](https://github.com/seqan/seqan3/pull/2290)).

#### Argument Parser

* We expanded the `seqan3::output_file_validator`, with a parameter `seqan3::output_file_open_options` to allow overwriting
Expand Down
19 changes: 12 additions & 7 deletions include/seqan3/alphabet/quality/all.hpp
Expand Up @@ -7,6 +7,7 @@

/*!\file
* \author Marie Hoffmann <marie.hoffmann AT fu-berlin.de>
* \author Lydia Buntrock <lydia.buntrock AT fu-berlin.de>
* \brief Meta-header that includes all headers from alphabet/quality/
*/

Expand All @@ -18,6 +19,7 @@
#include <seqan3/alphabet/quality/phred42.hpp>
#include <seqan3/alphabet/quality/phred63.hpp>
#include <seqan3/alphabet/quality/phred68legacy.hpp>
#include <seqan3/alphabet/quality/phred94.hpp>

/*!\defgroup quality Quality
* \brief Provides the various quality score types.
Expand Down Expand Up @@ -51,11 +53,12 @@
*
* ###Encoding Schemes
*
* | Format | Quality Type | Phred Score Range | Rank Range | ASCII Range | Assert |
* |:---------------------------:|:----------------------|:------------------:|:------------:|:------------:|:-------------------------:|
* | Sanger, Illumina 1.8+ short | seqan3::phred42 | [0 .. 41] | [0 .. 41] | ['!' .. 'J'] | Phred score in [0 .. 61] |
* | Sanger, Illumina 1.8+ long | seqan3::phred63 | [0 .. 62] | [0 .. 62] | ['!' .. '_'] | Phred score in [0 .. 62] |
* | Solexa, Illumina [1.0; 1.8[ | seqan3::phred68legacy | [-5 .. 62] | [0 .. 67] | [';' .. '~'] | Phred score in [-5 .. 62] |
* | Standard Use Case | Format | Encoding | Alphabet Type | Phred Score Range | Rank Range | ASCII Range |
* |:-------------------:|:---------------------------:|:--------:|:----------------------|:-----------------:|:----------:|:------------:|
* | Sanger, Illumina | Sanger, Illumina 1.8+ | Phred+33 | seqan3::phred42 | [0 .. 41] | [0 .. 41] | ['!' .. 'J'] |
* | Sanger, Illumina | Sanger, Illumina 1.8+ | Phred+33 | seqan3::phred63 | [0 .. 62] | [0 .. 62] | ['!' .. '_'] |
* | PacBio | Sanger, Illumina 1.8+ | Phred+33 | seqan3::phred94 | [0 .. 93] | [0 .. 93] | ['!' .. '~'] |
* | Solexa | Solexa, Illumina [1.0; 1.8[ | Phred+64 | seqan3::phred68legacy | [-5 .. 62] | [0 .. 67] | [';' .. '~'] |
*
* The most distributed format is the *Sanger* or <I>Illumina 1.8+</I> format.
* Despite typical Phred scores for Illumina machines range from 0 to maximal
Expand All @@ -65,8 +68,10 @@
* For other formats, like Solexa and Illumina 1.0 to 1.7 the type
* seqan3::phred68legacy is provided. To cover also the Solexa format, the Phred
* score is stored as a <B>signed</B> integer starting at -5.
* If you want to store PacBio HiFi reads, we recommend using seqan3::phred94, as these reads use the full range of the
* phred quality scores.
* An overview of all the score formats and their encodings can be found here:
* https://en.wikipedia.org/wiki/FASTQ_format#Encoding.
* https://en.wikipedia.org/wiki/FASTQ_format#Encoding (last access 01.12.2020).
*
* ###Concept
*
Expand All @@ -92,5 +97,5 @@
* All quality alphabets are explicitly convertible to each other via their
* Phred representation. Values not present in one alphabet are mapped to the
* closest value in the target alphabet (e.g. a `seqan3::phred63` letter with
* value 60 will convert to a `seqan3::phred42` letter of score 41).
* value 60 will convert to a `seqan3::phred42` letter of score 41; this also applies to `seqan3::phred94`).
*/
4 changes: 3 additions & 1 deletion include/seqan3/alphabet/quality/phred42.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -32,13 +32,15 @@ namespace seqan3
*
* \details
*
* The phred42 quality alphabet represents the zero-based phred score range
* The phred42 \ref quality alphabet represents the zero-based phred score range
* [0..41] mapped to the consecutive ASCII range ['!' .. 'J']. It therefore can
* represent the Illumina 1.8+ standard and the original Sanger score. If you
* intend to use phred scores exceeding 41, use the larger score type, namely
* seqan3::phred63, otherwise on construction exceeding scores are mapped to 41.
*
* \include test/snippet/alphabet/quality/phred42.cpp
*
* \see quality
*/
class phred42 : public quality_base<phred42, 42>
{
Expand Down
10 changes: 6 additions & 4 deletions include/seqan3/alphabet/quality/phred63.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@
namespace seqan3
{

/*!\brief Quality type for traditional Sanger and modern Illumina Phred scores (full range).
/*!\brief Quality type for traditional Sanger and modern Illumina Phred scores.
* \implements seqan3::writable_quality_alphabet
* \if DEV \implements seqan3::detail::writable_constexpr_alphabet \endif
* \implements seqan3::trivially_copyable
Expand All @@ -32,11 +32,13 @@ namespace seqan3
*
* \details
*
* The phred63 quality alphabet represents the zero-based phred score range
* [0..62] mapped to the ASCII range ['!' .. '~']. It represents the Sanger and
* Illumina 1.8+ standard beyond the typical range of raw reads (0 to 41).
* The phred63 \ref quality alphabet represents the zero-based phred score range [0..62] mapped to the ASCII range
* ['!' .. '_']. It represents the Sanger and Illumina 1.8+ standard beyond the typical range of raw reads (0 to 41),
* which is covered by seqan3::phred42.
*
* \include test/snippet/alphabet/quality/phred63.cpp
*
* \see quality
*/
class phred63 : public quality_base<phred63, 63>
{
Expand Down
116 changes: 116 additions & 0 deletions include/seqan3/alphabet/quality/phred94.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,116 @@
// -----------------------------------------------------------------------------------------------------
// Copyright (c) 2006-2020, Knut Reinert & Freie Universität Berlin
// Copyright (c) 2016-2020, Knut Reinert & MPI für molekulare Genetik
// This file may be used, modified and/or redistributed under the terms of the 3-clause BSD-License
// shipped with this file and also available at: https://github.com/seqan/seqan3/blob/master/LICENSE.md
// -----------------------------------------------------------------------------------------------------

/*!\file
* \author Lydia Buntrock <lydia.buntrock AT fu-berlin.de>
* \brief Provides seqan3::phred94 quality scores.
*/

#pragma once

#include <vector>

#include <seqan3/alphabet/quality/quality_base.hpp>

namespace seqan3
{

/*!\brief Quality type for PacBio Phred scores of HiFi reads.
* \implements seqan3::writable_quality_alphabet
* \if DEV \implements seqan3::detail::writable_constexpr_alphabet \endif
* \implements seqan3::trivially_copyable
* \implements seqan3::standard_layout
* \implements std::regular
*
* \ingroup quality
*
* \details
*
* The phred94 quality alphabet represents the zero-based phred score range [0..93] mapped to the ASCII range
* ['!' .. '~'] (Sanger, Illumina 1.8+ format). It is typically used for HiFi reads produced by PacBio. Phred scores of
* raw Sanger and Illumina reads typically range from 0 to 41 and are represented by seqan3::phred42. If you expect only
* slightly larger scores, you can use seqan3::phred63 (0 to 62), which still has memory advantages when used with
* seqan3::qualified.
*
* \include test/snippet/alphabet/quality/phred94.cpp
*
* You can find more information about the Phred scores in our \ref quality submodule.
*
* \see quality
*/
class phred94 : public quality_base<phred94, 94>
{
private:
//!\brief The base class.
using base_t = quality_base<phred94, 94>;

//!\brief Befriend seqan3::quality_base.
friend base_t;
//!\cond \brief Befriend seqan3::alphabet_base.
friend base_t::base_t;
//!\endcond

public:
/*!\name Constructors, destructor and assignment
* \{
*/
constexpr phred94() noexcept = default; //!< Defaulted.
constexpr phred94(phred94 const &) noexcept = default; //!< Defaulted.
constexpr phred94(phred94 &&) noexcept = default; //!< Defaulted.
constexpr phred94 & operator=(phred94 const &) noexcept = default; //!< Defaulted.
constexpr phred94 & operator=(phred94 &&) noexcept = default; //!< Defaulted.
~phred94() noexcept = default; //!< Defaulted.

//!\brief Construct from phred value.
constexpr phred94(phred_type const p) : base_t{p} {}

// Inherit converting constructor
using base_t::base_t;
//!\}

/*!\name Member variables.
* \{
*/
//!\brief The projection offset between phred and rank score representation.
static constexpr phred_type offset_phred{0};

//!\brief The projection offset between char and rank score representation.
static constexpr char_type offset_char{'!'};
//!\}
};

/*!\name Literals
* \{
*/
/*!\brief The seqan3::phred94 char literal.
* \relates seqan3::phred94
* \returns seqan3::phred94
*/
constexpr phred94 operator""_phred94(char const c) noexcept
{
return phred94{}.assign_char(c);
}

/*!\brief The seqan3::phred94 string literal.
* \param[in] s A pointer to the character sequence to assign from.
* \param[in] n The length of the character sequence to assign from.
* \relates seqan3::phred94
* \returns std::vector<seqan3::phred94>
*
* You can use this string literal to easily assign to std::vector<seqan3::phred94>:
*
* \include test/snippet/alphabet/quality/phred94_literal.cpp
*/
inline std::vector<phred94> operator""_phred94(char const * s, std::size_t n)
{
std::vector<phred94> r;
r.resize(n);

for (std::size_t i = 0; i < n; ++i)
r[i].assign_char(s[i]);

return r;
}
//!\}

} // namespace seqan3
2 changes: 2 additions & 0 deletions test/performance/alphabet/alphabet_assign_char_benchmark.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@ BENCHMARK_TEMPLATE(assign_char, seqan3::aa20);
BENCHMARK_TEMPLATE(assign_char, seqan3::aa27);
BENCHMARK_TEMPLATE(assign_char, seqan3::phred42);
BENCHMARK_TEMPLATE(assign_char, seqan3::phred63);
BENCHMARK_TEMPLATE(assign_char, seqan3::phred94);
/* adaptations */
BENCHMARK_TEMPLATE(assign_char, char);
BENCHMARK_TEMPLATE(assign_char, char32_t);
Expand All @@ -58,6 +59,7 @@ BENCHMARK_TEMPLATE(assign_char, seqan3::alphabet_variant<seqan3::dna4, char>);
BENCHMARK_TEMPLATE(assign_char, seqan3::masked<seqan3::dna4>);
BENCHMARK_TEMPLATE(assign_char, seqan3::qualified<seqan3::dna4, seqan3::phred42>);
BENCHMARK_TEMPLATE(assign_char, seqan3::qualified<seqan3::dna5, seqan3::phred63>);
BENCHMARK_TEMPLATE(assign_char, seqan3::qualified<seqan3::dna5, seqan3::phred94>);

#if SEQAN3_HAS_SEQAN2
template <typename alphabet_t>
Expand Down
2 changes: 2 additions & 0 deletions test/performance/alphabet/alphabet_assign_rank_benchmark.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@ BENCHMARK_TEMPLATE(assign_rank, seqan3::aa20);
BENCHMARK_TEMPLATE(assign_rank, seqan3::aa27);
BENCHMARK_TEMPLATE(assign_rank, seqan3::phred42);
BENCHMARK_TEMPLATE(assign_rank, seqan3::phred63);
BENCHMARK_TEMPLATE(assign_rank, seqan3::phred94);
/* adaptations */
BENCHMARK_TEMPLATE(assign_rank, char);
BENCHMARK_TEMPLATE(assign_rank, char32_t);
Expand All @@ -65,6 +66,7 @@ BENCHMARK_TEMPLATE(assign_rank, seqan3::alphabet_variant<seqan3::dna4, char>);
BENCHMARK_TEMPLATE(assign_rank, seqan3::masked<seqan3::dna4>);
BENCHMARK_TEMPLATE(assign_rank, seqan3::qualified<seqan3::dna4, seqan3::phred42>);
BENCHMARK_TEMPLATE(assign_rank, seqan3::qualified<seqan3::dna5, seqan3::phred63>);
BENCHMARK_TEMPLATE(assign_rank, seqan3::qualified<seqan3::dna5, seqan3::phred94>);

#if SEQAN3_HAS_SEQAN2
template <typename alphabet_t>
Expand Down
2 changes: 2 additions & 0 deletions test/performance/alphabet/alphabet_to_char_benchmark.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,7 @@ BENCHMARK_TEMPLATE(to_char, seqan3::aa20);
BENCHMARK_TEMPLATE(to_char, seqan3::aa27);
BENCHMARK_TEMPLATE(to_char, seqan3::phred42);
BENCHMARK_TEMPLATE(to_char, seqan3::phred63);
BENCHMARK_TEMPLATE(to_char, seqan3::phred94);
/* adaptations */
BENCHMARK_TEMPLATE(to_char, char);
BENCHMARK_TEMPLATE(to_char, char32_t);
Expand All @@ -78,6 +79,7 @@ BENCHMARK_TEMPLATE(to_char, seqan3::alphabet_variant<seqan3::dna4, char>);
BENCHMARK_TEMPLATE(to_char, seqan3::masked<seqan3::dna4>);
BENCHMARK_TEMPLATE(to_char, seqan3::qualified<seqan3::dna4, seqan3::phred42>);
BENCHMARK_TEMPLATE(to_char, seqan3::qualified<seqan3::dna5, seqan3::phred63>);
BENCHMARK_TEMPLATE(to_char, seqan3::qualified<seqan3::dna5, seqan3::phred94>);

#if SEQAN3_HAS_SEQAN2
template <typename alphabet_t>
Expand Down
2 changes: 2 additions & 0 deletions test/performance/alphabet/alphabet_to_rank_benchmark.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,7 @@ BENCHMARK_TEMPLATE(to_rank, seqan3::aa20);
BENCHMARK_TEMPLATE(to_rank, seqan3::aa27);
BENCHMARK_TEMPLATE(to_rank, seqan3::phred42);
BENCHMARK_TEMPLATE(to_rank, seqan3::phred63);
BENCHMARK_TEMPLATE(to_rank, seqan3::phred94);
/* adaptations */
BENCHMARK_TEMPLATE(to_rank, char);
BENCHMARK_TEMPLATE(to_rank, char32_t);
Expand All @@ -72,6 +73,7 @@ BENCHMARK_TEMPLATE(to_rank, seqan3::alphabet_variant<seqan3::dna4, char>);
BENCHMARK_TEMPLATE(to_rank, seqan3::masked<seqan3::dna4>);
BENCHMARK_TEMPLATE(to_rank, seqan3::qualified<seqan3::dna4, seqan3::phred42>);
BENCHMARK_TEMPLATE(to_rank, seqan3::qualified<seqan3::dna5, seqan3::phred63>);
BENCHMARK_TEMPLATE(to_rank, seqan3::qualified<seqan3::dna5, seqan3::phred94>);

#if SEQAN3_HAS_SEQAN2
template <typename alphabet_t>
Expand Down
4 changes: 4 additions & 0 deletions test/snippet/alphabet/quality/phred63.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -12,4 +12,8 @@ int main()
seqan3::phred63 another_phred{49};
seqan3::debug_stream << another_phred.to_phred() << "\n"; // 49
// we need to cast to (int) for human readable console output

seqan3::phred63 a_third_phred;
a_third_phred.assign_phred(75); // converted down to 62
seqan3::debug_stream << a_third_phred.to_phred() << "\n"; // 62
}
18 changes: 18 additions & 0 deletions test/snippet/alphabet/quality/phred94.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
#include <seqan3/alphabet/quality/phred94.hpp>
#include <seqan3/core/debug_stream.hpp>

int main()
{
seqan3::phred94 phred;
phred.assign_rank(2); // same result as assign_phred(2), because offset_phred is 0
seqan3::debug_stream << phred.to_phred() << "\n"; // 2
seqan3::debug_stream << phred.to_char() << "\n"; // '#'
seqan3::debug_stream << phred.to_rank() << "\n"; // 2

seqan3::phred94 another_phred{75};
seqan3::debug_stream << another_phred.to_phred() << "\n"; // 75

seqan3::phred94 a_third_phred;
a_third_phred.assign_phred(105); // converted down to 93
seqan3::debug_stream << a_third_phred.to_phred() << "\n"; // 93
}
17 changes: 17 additions & 0 deletions test/snippet/alphabet/quality/phred94_literal.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
#include <seqan3/std/algorithm>

#include <seqan3/alphabet/quality/phred94.hpp>
#include <seqan3/core/debug_stream.hpp>

int main()
{
using seqan3::operator""_phred94;

// directly assign to a std::vector<phred94> using a string literal
std::vector<seqan3::phred94> qual_vec = "###!"_phred94;

// This is the same as a sequence of char literals:
std::vector<seqan3::phred94> qual_vec2 = {'#'_phred94, '#'_phred94, '#'_phred94, '!'_phred94};

seqan3::debug_stream << std::ranges::equal(qual_vec, qual_vec2) << '\n'; // prints 1 (true)
}
3 changes: 1 addition & 2 deletions test/unit/alphabet/composite/composite_integration_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -19,14 +19,13 @@ using seqan3::operator""_aa27;
using seqan3::operator""_dna4;
using seqan3::operator""_rna4;

// tests various combinations of alphabet_variant and alphabet_tuple
using qualified_dna_phred42 = seqan3::qualified<seqan3::dna4, seqan3::phred42>;
using qualified_gapped_dna_phred42 = seqan3::qualified<seqan3::gapped<seqan3::dna4>, seqan3::phred42>;
using gapped_qualified_dna_phred42 = seqan3::gapped<qualified_dna_phred42>;
using qualified_qualified_gapped_dna_phred42_phred42 = seqan3::qualified<qualified_gapped_dna_phred42, seqan3::phred42>;
using gapped_alphabet_variant_dna_phred42 = seqan3::gapped<seqan3::alphabet_variant<seqan3::dna4, seqan3::phred42>>;

// Some haessllihckeiten-tests

TEST(composite, custom_constructors)
{
qualified_dna_phred42 t11{'C'_dna4};
Expand Down
1 change: 1 addition & 0 deletions test/unit/alphabet/quality/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -2,5 +2,6 @@
seqan3_test(phred42_test.cpp)
seqan3_test(phred63_test.cpp)
seqan3_test(phred68legacy_test.cpp)
seqan3_test(phred94_test.cpp)
seqan3_test(qualified_test.cpp)
seqan3_test(quality_conversion_integration_test.cpp)