Skip to content

Commit

Permalink
[MISC] Remove field::alignment from SAM file.
Browse files Browse the repository at this point in the history
  • Loading branch information
smehringer committed Oct 21, 2022
1 parent e723338 commit b138781
Show file tree
Hide file tree
Showing 26 changed files with 190 additions and 888 deletions.
9 changes: 1 addition & 8 deletions doc/cookbook/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -137,14 +137,7 @@ To search for either 1 insertion or 1 deletion you can use the seqan3::search_cf

\snippet doc/tutorial/09_search/search_small_snippets.cpp error_search

# Reading the CIGAR information into an actual alignment

In SeqAn, the conversion from a CIGAR string to an alignment (two `seqan3::aligned_sequence`s) is done automatically for
you. You can access the resulting alignment via seqan3::sam_record::alignment:

\snippet doc/tutorial/10_sam_file/sam_file_alignments_without_ref.cpp main

# Combining sequence and alignment files
# Reading the CIGAR information from a SAM file and construct an alignment

This recipe can be used to:
1. Read in a FASTA file with the reference and a SAM file with the alignment
Expand Down
37 changes: 16 additions & 21 deletions doc/tutorial/10_sam_file/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,7 @@ To make things clearer, here is the table of SAM columns and the corresponding f
| 3 | RNAME | seqan3::sam_record::reference_id | seqan3::field::ref_id |
| 4 | POS | seqan3::sam_record::reference_position | seqan3::field::ref_offset |
| 5 | MAPQ | seqan3::sam_record::mapping_quality | seqan3::field::mapq |
| 6 | CIGAR | implicitly stored in seqan3::sam_record::alignment <br> explicitly stored in seqan3::sam_record::cigar_sequence | seqan3::field::alignment <br> seqan3::field::cigar |
| 6 | CIGAR | seqan3::sam_record::cigar_sequence | seqan3::field::cigar |
| 7 | RNEXT | seqan3::sam_record::mate_reference_id | seqan3::field::mate |
| 8 | PNEXT | seqan3::sam_record::mate_position | seqan3::field::mate |
| 9 | TLEN | seqan3::sam_record::template_length | seqan3::field::mate |
Expand Down Expand Up @@ -138,14 +138,12 @@ Average: 27.4

\endsolution

# Alignment representation in the SAM format

In SeqAn, we represent an alignment as a tuple of two `seqan3::aligned_sequence`s.
# Alignment representation in SAM/BAM files

The SAM format is the common output format of read mappers where you align short read sequences
to one or more large reference sequences.
In fact, the SAM format stores those alignment information only partially:
It **does not store the reference sequence** but only the read sequence and a *CIGAR* string
It **does not store the reference sequence** but only the query/read sequence and a *CIGAR* string
representing the alignment based on the read.

Take this SAM record as an example:
Expand Down Expand Up @@ -183,26 +181,17 @@ into a std::vector\<seqan3::cigar\>:

\snippet doc/tutorial/10_sam_file/sam_file_read_cigar.cpp main

## Reading the CIGAR information into an actual alignment

In SeqAn, the conversion from a CIGAR string to an alignment (two seqan3::aligned_sequence's) is done automatically for you.
You can access it by accessing seqan3::sam_record::alignment from the record:

\snippet doc/tutorial/10_sam_file/sam_file_alignments_without_ref.cpp main
## Transforming the CIGAR information into an seqan3 alignment

In the example above, you can only safely access the aligned read.

\attention The **unknown** aligned reference sequence at the first position in the alignment tuple cannot be accessed
(e.g. via the `operator[]`). It is represented by a dummy type that throws on access.
In SeqAn, we represent an alignment as a tuple of two `seqan3::aligned_sequence`s.

Although the SAM format does not handle reference sequence information,
you can provide these information to the seqan3::sam_file_input which automatically fills the alignment object.
You can pass reference ids and reference sequences as additional constructor parameters:
The conversion from a CIGAR string to an alignment can be done with the function `seqan3::alignment_from_cigar`.
You need to pass the reference sequence with the position the read was aligned to and the read sequence:

\snippet doc/tutorial/10_sam_file/sam_file_alignments_with_ref.cpp main
\include snippet/alignment/cigar_conversion/alignment_from_cigar_io.cpp

The code will print the following:
\include doc/tutorial/10_sam_file/sam_file_sam_record.err
\include snippet/alignment/cigar_conversion/alignment_from_cigar_io.err

\assignment{Assignment 2: Combining sequence and alignment files}

Expand All @@ -218,14 +207,20 @@ Only use
* seqan3::sam_record::id,
* seqan3::sam_record::reference_id,
* seqan3::sam_record::mapping_quality, and
* seqan3::sam_record::alignment.
* seqan3::sam_record::cigar.

With that information do the following:
* Filter the alignment records and only take those with a mapping quality >= 30.
(Take a look at the tutorial \ref sequence_file_section_fun_with_ranges for a reminder about using views on files)
* For the resulting alignments, print which read was mapped against which reference id and
the number of `seqan3::gap`s in each sequence (aligned reference and read sequence).

Hint:

Given your reference sequences, you need to know which reference sequence to use for the alignment.
For this purpose, you can access `record.reference_id()` which gives you the position of the respective reference
sequence.

Your program should print the following:

```
Expand Down
30 changes: 0 additions & 30 deletions doc/tutorial/10_sam_file/sam_file_alignments_with_ref.cpp

This file was deleted.

5 changes: 0 additions & 5 deletions doc/tutorial/10_sam_file/sam_file_alignments_with_ref.err

This file was deleted.

25 changes: 0 additions & 25 deletions doc/tutorial/10_sam_file/sam_file_alignments_without_ref.cpp

This file was deleted.

5 changes: 0 additions & 5 deletions doc/tutorial/10_sam_file/sam_file_alignments_without_ref.err

This file was deleted.

10 changes: 8 additions & 2 deletions doc/tutorial/10_sam_file/sam_file_solution2.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ r003 2064 chr2 18 10 5M * 0 0 TAGGC *
#include <string>
#include <vector>

#include <seqan3/alignment/cigar_conversion/alignment_from_cigar.hpp>
#include <seqan3/alphabet/gap/gap.hpp>
#include <seqan3/alphabet/nucleotide/dna5.hpp>
#include <seqan3/core/debug_stream.hpp>
Expand Down Expand Up @@ -59,14 +60,19 @@ int main()

for (auto & record : mapping_file | mapq_filter)
{
auto alignment = seqan3::alignment_from_cigar(record.cigar_sequence(),
reference_sequences[record.reference_id().value()],
record.reference_position().value(),
record.sequence());

// as loop
size_t sum_reference{};
for (auto const & char_reference : std::get<0>(record.alignment()))
for (auto const & char_reference : std::get<0>(alignment))
if (char_reference == seqan3::gap{})
++sum_reference;

// or via std::ranges::count
size_t sum_read = std::ranges::count(std::get<1>(record.alignment()), seqan3::gap{});
size_t sum_read = std::ranges::count(std::get<1>(alignment), seqan3::gap{});

// The reference_id is ZERO based and an optional. -1 is represented by std::nullopt (= reference not known).
std::optional reference_id = record.reference_id();
Expand Down
22 changes: 7 additions & 15 deletions doc/tutorial/10_sam_file/sam_file_writing.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -5,35 +5,27 @@ seqan3::test::create_temporary_snippet_file example_sam{"out.sam", ""};
//![main]
#include <seqan3/io/sam_file/all.hpp>

using aligned_sequence_type = std::vector<seqan3::gapped<seqan3::dna5>>;
using alignment_type = std::pair<aligned_sequence_type, aligned_sequence_type>;

using types = seqan3::type_list<std::vector<seqan3::dna5>, std::string, alignment_type>;
using fields = seqan3::fields<seqan3::field::seq, seqan3::field::id, seqan3::field::alignment>;

int main()
{
using namespace seqan3::literals;

auto filename = std::filesystem::current_path() / "out.sam";

seqan3::sam_file_output fout{filename};

using types = seqan3::type_list<std::vector<seqan3::dna5>, std::string, std::vector<seqan3::cigar>>;
using fields = seqan3::fields<seqan3::field::seq, seqan3::field::id, seqan3::field::cigar>;
using sam_record_type = seqan3::sam_record<types, fields>;

// write the following to the file
// r001 0 * 0 0 4M2I2M2D * 0 0 ACGTACGT *
sam_record_type record{};
record.id() = "r001";
record.sequence() = "ACGTACGT"_dna5;
auto & [reference_sequence, read_sequence] = record.alignment();

// ACGT--GTTT
seqan3::assign_unaligned(reference_sequence, "ACGTGTTT"_dna5);
seqan3::insert_gap(reference_sequence, reference_sequence.begin() + 4, 2);

// ACGTACGT--
seqan3::assign_unaligned(read_sequence, record.sequence());
seqan3::insert_gap(read_sequence, read_sequence.end(), 2);
record.cigar_sequence() = {{4, 'M'_cigar_operation},
{2, 'I'_cigar_operation},
{2, 'M'_cigar_operation},
{2, 'D'_cigar_operation}};

fout.push_back(record);
}
Expand Down
7 changes: 4 additions & 3 deletions doc/tutorial/11_read_mapper/read_mapper_step4.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
# include <fstream>
# include <span>

# include <seqan3/alignment/cigar_conversion/cigar_from_alignment.hpp>
# include <seqan3/alignment/configuration/all.hpp>
# include <seqan3/alignment/pairwise/align_pairwise.hpp>
# include <seqan3/argument_parser/all.hpp>
Expand Down Expand Up @@ -53,7 +54,7 @@ void map_reads(std::filesystem::path const & query_path,
seqan3::field::id,
seqan3::field::ref_id,
seqan3::field::ref_offset,
seqan3::field::alignment,
seqan3::field::cigar,
seqan3::field::qual,
seqan3::field::mapq>{}};
//! [sam_file_output]
Expand All @@ -80,15 +81,15 @@ void map_reads(std::filesystem::path const & query_path,

for (auto && alignment : seqan3::align_pairwise(std::tie(text_view, query), align_config))
{
auto aligned_seq = alignment.alignment();
auto cigar = seqan3::cigar_from_alignment(alignment.alignment());
size_t ref_offset = alignment.sequence1_begin_position() + 2 + start;
size_t map_qual = 60u + alignment.score();

sam_out.emplace_back(query,
record.id(),
storage.ids[result.reference_id()],
ref_offset,
aligned_seq,
cigar,
record.base_qualities(),
map_qual);
}
Expand Down
70 changes: 0 additions & 70 deletions include/seqan3/io/sam_file/detail/format_sam_base.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -74,14 +74,6 @@ class format_sam_base
header_type & header,
ref_seqs_type & /*tag*/);

template <typename align_type, typename ref_seqs_type>
void construct_alignment(align_type & align,
std::vector<cigar> & cigar_vector,
[[maybe_unused]] int32_t rid,
[[maybe_unused]] ref_seqs_type & ref_seqs,
[[maybe_unused]] int32_t ref_start,
size_t ref_length);

void transfer_soft_clipping_to(std::vector<cigar> const & cigar_vector, int32_t & sc_begin, int32_t & sc_end) const;

template <typename stream_view_t>
Expand Down Expand Up @@ -199,68 +191,6 @@ inline void format_sam_base::transfer_soft_clipping_to(std::vector<cigar> const
sc_end = cigar_count_at(second_last_index);
}

/*!\brief Construct the field::alignment depending on the given information.
* \tparam align_type The alignment type.
* \tparam ref_seqs_type The type of reference sequences (might decay to ignore).
* \param[in,out] align The alignment (pair of aligned sequences) to fill.
* \param[in] cigar_vector The cigar information to convert to an alignment.
* \param[in] rid The index of the reference sequence in header.ref_ids().
* \param[in] ref_seqs The reference sequence information.
* \param[in] ref_start The start position of the alignment in the reference sequence.
* \param[in] ref_length The length of the aligned reference sequence.
*/
template <typename align_type, typename ref_seqs_type>
inline void format_sam_base::construct_alignment(align_type & align,
std::vector<cigar> & cigar_vector,
[[maybe_unused]] int32_t rid,
[[maybe_unused]] ref_seqs_type & ref_seqs,
[[maybe_unused]] int32_t ref_start,
size_t ref_length)
{
if (rid > -1 && ref_start > -1 && // read is mapped
!cigar_vector.empty() && // alignment field was not empty
!std::ranges::empty(get<1>(align))) // seq field was not empty
{
if constexpr (!detail::decays_to_ignore_v<ref_seqs_type>)
{
assert(static_cast<size_t>(ref_start + ref_length) <= std::ranges::size(ref_seqs[rid]));
// copy over unaligned reference sequence part
assign_unaligned(get<0>(align), ref_seqs[rid] | views::slice(ref_start, ref_start + ref_length));
}
else
{
using unaligned_t = std::remove_cvref_t<detail::unaligned_seq_t<decltype(get<0>(align))>>;
auto dummy_seq = views::repeat_n(std::ranges::range_value_t<unaligned_t>{}, ref_length)
| std::views::transform(detail::access_restrictor_fn{});
static_assert(std::same_as<unaligned_t, decltype(dummy_seq)>,
"No reference information was given so the type of the first alignment tuple position"
"must have an unaligned sequence type of a dummy sequence ("
"views::repeat_n(dna5{}, size_t{}) | "
"std::views::transform(detail::access_restrictor_fn{}))");

assign_unaligned(get<0>(align), dummy_seq); // assign dummy sequence
}

// insert gaps according to the cigar information
detail::alignment_from_cigar(align, cigar_vector);
}
else // not enough information for an alignment, assign an empty view/dummy_sequence
{
if constexpr (!detail::decays_to_ignore_v<ref_seqs_type>) // reference info given
{
assert(std::ranges::size(ref_seqs) > 0); // we assume that the given ref info is not empty
assign_unaligned(get<0>(align), ref_seqs[0] | views::slice(0, 0));
}
else
{
using unaligned_t = std::remove_cvref_t<detail::unaligned_seq_t<decltype(get<0>(align))>>;
assign_unaligned(get<0>(align),
views::repeat_n(std::ranges::range_value_t<unaligned_t>{}, 0)
| std::views::transform(detail::access_restrictor_fn{}));
}
}
}

/*!\brief Reads std::byte fields using std::from_chars.
* \tparam stream_view_t The type of the stream as a view.
*
Expand Down
Loading

0 comments on commit b138781

Please sign in to comment.