Skip to content

Commit

Permalink
[MISC] Remove field::alignment from SAM file.
Browse files Browse the repository at this point in the history
  • Loading branch information
smehringer committed Oct 21, 2022
1 parent 3630847 commit 4ad6eca
Show file tree
Hide file tree
Showing 30 changed files with 191 additions and 1,146 deletions.
9 changes: 1 addition & 8 deletions doc/cookbook/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -137,14 +137,7 @@ To search for either 1 insertion or 1 deletion you can use the seqan3::search_cf

\snippet doc/tutorial/09_search/search_small_snippets.cpp error_search

# Reading the CIGAR information into an actual alignment

In SeqAn, the conversion from a CIGAR string to an alignment (two `seqan3::aligned_sequence`s) is done automatically for
you. You can access the resulting alignment via seqan3::sam_record::alignment:

\snippet doc/tutorial/10_sam_file/sam_file_alignments_without_ref.cpp main

# Combining sequence and alignment files
# Reading the CIGAR information from a SAM file and construct an alignment

This recipe can be used to:
1. Read in a FASTA file with the reference and a SAM file with the alignment
Expand Down
37 changes: 16 additions & 21 deletions doc/tutorial/10_sam_file/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,7 @@ To make things clearer, here is the table of SAM columns and the corresponding f
| 3 | RNAME | seqan3::sam_record::reference_id | seqan3::field::ref_id |
| 4 | POS | seqan3::sam_record::reference_position | seqan3::field::ref_offset |
| 5 | MAPQ | seqan3::sam_record::mapping_quality | seqan3::field::mapq |
| 6 | CIGAR | implicitly stored in seqan3::sam_record::alignment <br> explicitly stored in seqan3::sam_record::cigar_sequence | seqan3::field::alignment <br> seqan3::field::cigar |
| 6 | CIGAR | seqan3::sam_record::cigar_sequence | seqan3::field::cigar |
| 7 | RNEXT | seqan3::sam_record::mate_reference_id | seqan3::field::mate |
| 8 | PNEXT | seqan3::sam_record::mate_position | seqan3::field::mate |
| 9 | TLEN | seqan3::sam_record::template_length | seqan3::field::mate |
Expand Down Expand Up @@ -138,14 +138,12 @@ Average: 27.4

\endsolution

# Alignment representation in the SAM format

In SeqAn, we represent an alignment as a tuple of two `seqan3::aligned_sequence`s.
# Alignment representation in SAM/BAM files

The SAM format is the common output format of read mappers where you align short read sequences
to one or more large reference sequences.
In fact, the SAM format stores those alignment information only partially:
It **does not store the reference sequence** but only the read sequence and a *CIGAR* string
It **does not store the reference sequence** but only the query/read sequence and a *CIGAR* string
representing the alignment based on the read.

Take this SAM record as an example:
Expand Down Expand Up @@ -183,26 +181,17 @@ into a std::vector\<seqan3::cigar\>:

\snippet doc/tutorial/10_sam_file/sam_file_read_cigar.cpp main

## Reading the CIGAR information into an actual alignment

In SeqAn, the conversion from a CIGAR string to an alignment (two seqan3::aligned_sequence's) is done automatically for you.
You can access it by accessing seqan3::sam_record::alignment from the record:

\snippet doc/tutorial/10_sam_file/sam_file_alignments_without_ref.cpp main
## Transforming the CIGAR information into an seqan3 alignment

In the example above, you can only safely access the aligned read.

\attention The **unknown** aligned reference sequence at the first position in the alignment tuple cannot be accessed
(e.g. via the `operator[]`). It is represented by a dummy type that throws on access.
In SeqAn, we represent an alignment as a tuple of two `seqan3::aligned_sequence`s.

Although the SAM format does not handle reference sequence information,
you can provide these information to the seqan3::sam_file_input which automatically fills the alignment object.
You can pass reference ids and reference sequences as additional constructor parameters:
The conversion from a CIGAR string to an alignment can be done with the function `seqan3::alignment_from_cigar`.
You need to pass the reference sequence with the position the read was aligned to and the read sequence:

\snippet doc/tutorial/10_sam_file/sam_file_alignments_with_ref.cpp main
\include snippet/alignment/cigar_conversion/alignment_from_cigar_io.cpp

The code will print the following:
\include doc/tutorial/10_sam_file/sam_file_sam_record.err
\include snippet/alignment/cigar_conversion/alignment_from_cigar_io.err

\assignment{Assignment 2: Combining sequence and alignment files}

Expand All @@ -218,14 +207,20 @@ Only use
* seqan3::sam_record::id,
* seqan3::sam_record::reference_id,
* seqan3::sam_record::mapping_quality, and
* seqan3::sam_record::alignment.
* seqan3::sam_record::cigar.

With that information do the following:
* Filter the alignment records and only take those with a mapping quality >= 30.
(Take a look at the tutorial \ref sequence_file_section_fun_with_ranges for a reminder about using views on files)
* For the resulting alignments, print which read was mapped against which reference id and
the number of `seqan3::gap`s in each sequence (aligned reference and read sequence).

Hint:

Given your reference sequences, you need to know which reference sequence to use for the alignment.
For this purpose, you can access `record.reference_id()` which gives you the position of the respective reference
sequence.

Your program should print the following:

```
Expand Down
30 changes: 0 additions & 30 deletions doc/tutorial/10_sam_file/sam_file_alignments_with_ref.cpp

This file was deleted.

5 changes: 0 additions & 5 deletions doc/tutorial/10_sam_file/sam_file_alignments_with_ref.err

This file was deleted.

25 changes: 0 additions & 25 deletions doc/tutorial/10_sam_file/sam_file_alignments_without_ref.cpp

This file was deleted.

5 changes: 0 additions & 5 deletions doc/tutorial/10_sam_file/sam_file_alignments_without_ref.err

This file was deleted.

10 changes: 8 additions & 2 deletions doc/tutorial/10_sam_file/sam_file_solution2.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ r003 2064 chr2 18 10 5M * 0 0 TAGGC *
#include <string>
#include <vector>

#include <seqan3/alignment/cigar_conversion/alignment_from_cigar.hpp>
#include <seqan3/alphabet/gap/gap.hpp>
#include <seqan3/alphabet/nucleotide/dna5.hpp>
#include <seqan3/core/debug_stream.hpp>
Expand Down Expand Up @@ -59,14 +60,19 @@ int main()

for (auto & record : mapping_file | mapq_filter)
{
auto alignment = seqan3::alignment_from_cigar(record.cigar_sequence(),
reference_sequences[record.reference_id().value()],
record.reference_position().value(),
record.sequence());

// as loop
size_t sum_reference{};
for (auto const & char_reference : std::get<0>(record.alignment()))
for (auto const & char_reference : std::get<0>(alignment))
if (char_reference == seqan3::gap{})
++sum_reference;

// or via std::ranges::count
size_t sum_read = std::ranges::count(std::get<1>(record.alignment()), seqan3::gap{});
size_t sum_read = std::ranges::count(std::get<1>(alignment), seqan3::gap{});

// The reference_id is ZERO based and an optional. -1 is represented by std::nullopt (= reference not known).
std::optional reference_id = record.reference_id();
Expand Down
22 changes: 7 additions & 15 deletions doc/tutorial/10_sam_file/sam_file_writing.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -5,35 +5,27 @@ seqan3::test::create_temporary_snippet_file example_sam{"out.sam", ""};
//![main]
#include <seqan3/io/sam_file/all.hpp>

using aligned_sequence_type = std::vector<seqan3::gapped<seqan3::dna5>>;
using alignment_type = std::pair<aligned_sequence_type, aligned_sequence_type>;

using types = seqan3::type_list<std::vector<seqan3::dna5>, std::string, alignment_type>;
using fields = seqan3::fields<seqan3::field::seq, seqan3::field::id, seqan3::field::alignment>;

int main()
{
using namespace seqan3::literals;

auto filename = std::filesystem::current_path() / "out.sam";

seqan3::sam_file_output fout{filename};

using types = seqan3::type_list<std::vector<seqan3::dna5>, std::string, std::vector<seqan3::cigar>>;
using fields = seqan3::fields<seqan3::field::seq, seqan3::field::id, seqan3::field::cigar>;
using sam_record_type = seqan3::sam_record<types, fields>;

// write the following to the file
// r001 0 * 0 0 4M2I2M2D * 0 0 ACGTACGT *
sam_record_type record{};
record.id() = "r001";
record.sequence() = "ACGTACGT"_dna5;
auto & [reference_sequence, read_sequence] = record.alignment();

// ACGT--GTTT
seqan3::assign_unaligned(reference_sequence, "ACGTGTTT"_dna5);
seqan3::insert_gap(reference_sequence, reference_sequence.begin() + 4, 2);

// ACGTACGT--
seqan3::assign_unaligned(read_sequence, record.sequence());
seqan3::insert_gap(read_sequence, read_sequence.end(), 2);
record.cigar_sequence() = {{4, 'M'_cigar_operation},
{2, 'I'_cigar_operation},
{2, 'M'_cigar_operation},
{2, 'D'_cigar_operation}};

fout.push_back(record);
}
Expand Down
7 changes: 4 additions & 3 deletions doc/tutorial/11_read_mapper/read_mapper_step4.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
# include <fstream>
# include <span>

# include <seqan3/alignment/cigar_conversion/cigar_from_alignment.hpp>
# include <seqan3/alignment/configuration/all.hpp>
# include <seqan3/alignment/pairwise/align_pairwise.hpp>
# include <seqan3/argument_parser/all.hpp>
Expand Down Expand Up @@ -53,7 +54,7 @@ void map_reads(std::filesystem::path const & query_path,
seqan3::field::id,
seqan3::field::ref_id,
seqan3::field::ref_offset,
seqan3::field::alignment,
seqan3::field::cigar,
seqan3::field::qual,
seqan3::field::mapq>{}};
//! [sam_file_output]
Expand All @@ -80,15 +81,15 @@ void map_reads(std::filesystem::path const & query_path,

for (auto && alignment : seqan3::align_pairwise(std::tie(text_view, query), align_config))
{
auto aligned_seq = alignment.alignment();
auto cigar = seqan3::cigar_from_alignment(alignment.alignment());
size_t ref_offset = alignment.sequence1_begin_position() + 2 + start;
size_t map_qual = 60u + alignment.score();

sam_out.emplace_back(query,
record.id(),
storage.ids[result.reference_id()],
ref_offset,
aligned_seq,
cigar,
record.base_qualities(),
map_qual);
}
Expand Down
4 changes: 1 addition & 3 deletions include/seqan3/alphabet/cigar/cigar.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -84,12 +84,10 @@ class cigar : public alphabet_tuple_base<cigar, uint32_t, exposition_only::cigar
* Example usage:
* \include test/snippet/alphabet/cigar/cigar_operation.cpp
*
* \if DEV
* \note Usually you do not want to manipulate cigar elements and vectors on
* your own but convert an alignment to a cigar and back. See
* seqan3::detail::get_cigar_vector for how to convert two aligned sequences into
* seqan3::cigar_from_alignment for how to convert two aligned sequences into
* a cigar_vector.
* \endif
*
* \sa https://samtools.github.io/hts-specs/SAMv1.pdf#page=8
*
Expand Down
Loading

0 comments on commit 4ad6eca

Please sign in to comment.