Changed rnext/pnext/tlen to operate per template alignment rather than for primary alignments only. #53

Closed
wants to merge 1 commit into from
61 changes: 36 additions & 25 deletions SAMv1.tex
@@ -118,6 +118,14 @@ \subsection{Terminologies and Concepts}
flags. Typically the alignment designated primary is the best alignment, but
the decision may be arbitrary.\footnotemark

\item[Template alignment]
A set of read alignments for all reads in the template. In the case
of multiple mappings for a template, multiple template alignments
may exist. All read alignments (whether chimeric or linear) within
a template alignment share the same value of the 0x100 flag
(secondary alignment). Template alignments consisting of more than
2 segments must have the TC auxiliary tag present.

\item[1-based coordinate system] A coordinate system where the first
base of a sequence is one. In this coordinate system, a region is
specified by a closed interval. For example, the region between the
@@ -296,6 +304,7 @@ \subsection{The alignment section: mandatory fields}
0x40 and 0x80 are unset, the index of the read in the template
is unknown. This may happen for a non-linear template or the index
is lost in data processing.
\item Bit 0x20 defines the next segment as per the definition in RNEXT and PNEXT.
Review comment (Contributor):

I'm not sure what this means. If the next segment is the third segment in a 5-segment template, what does having 0x20 set actually mean? That the alignments for segments 2 and 3 are consistent? 3 and 4? That all segment alignments are consistent with having no (large) SVs with respect to the set of reference alignments?

\item If 0x1 is unset, no assumptions can be made about 0x2, 0x8,
0x20, 0x40 and 0x80.
\end{itemize}
@@ -344,28 +353,30 @@ \subsection{The alignment section: mandatory fields}
\item Sum of lengths of the {\tt M/I/S/=/X} operations shall equal
the length of {\sf SEQ}.
\end{itemize}
\item {\sf RNEXT}: Reference sequence name of the primary alignment of the NEXT read in the
template. For the last read, the next read is the first
read in the template. If {\tt @SQ} header lines are present, {\sf
RNEXT} (if not `*' or `=') must be present in one of the {\tt SQ-SN}
tag. This field is set as `*' when the information is unavailable, and
set as `=' if {\sf RNEXT} is identical {\sf RNAME}. If not `=' and the
next read in the template has one primary mapping (see also bit
0x100 in {\sf FLAG}), this field is identical to {\sf RNAME} at the primary line of the
next read. If {\sf
RNEXT} is `*', no assumptions can be made on {\sf PNEXT} and bit
\item {\sf RNEXT}: Reference sequence name of the NEXT read in this
template alignment, where NEXT is defined to be the next read in
Review comment (Contributor):

So this PR essentially redefines secondary alignments as secondary alignments for the entire template, as opposed to the current definition, which implicitly (by requiring RNEXT to point to the primary alignment) defines secondary alignments as secondary alignments for the segment.

For read pairs that have one read uniquely mappable and the other read multi-mapping to 100 different locations, this change in definition will require spec-compliant implementations to write out the uniquely aligned segment 100 times, each with a different RNEXT.

It is a major change that will break many tools, as tools can no longer assume that a high mapq means that the record is unique. Something as simple as calculating read depth of reads with (e.g.) mapq>10 will now require read deduplication. Is this intended?
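
To illustrate the deduplication concern raised in this comment, here is a minimal sketch. It is not taken from the PR or from any existing tool: the `Record` type, the function name `count_unique_placements`, and the choice of deduplication key (query name, first/last bits 0x40/0x80, reference, position, strand bit 0x10) are all assumptions made for illustration, and the MAPQ values on the repeated records are hypothetical.

```python
from typing import Iterable, NamedTuple


class Record(NamedTuple):
    """Hypothetical minimal view of a SAM alignment line."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    rnext: str
    pnext: int


def count_unique_placements(records: Iterable[Record], min_mapq: int = 10) -> int:
    """Count placements with MAPQ > min_mapq, ignoring repeated copies.

    Two records are treated as the same placement when they agree on
    query name, segment index bits (0x40/0x80), reference, position and
    strand (0x10). RNEXT/PNEXT are deliberately left out of the key.
    """
    seen = set()
    for r in records:
        if r.mapq <= min_mapq:
            continue
        seen.add((r.qname, r.flag & 0xC0, r.rname, r.pos, r.flag & 0x10))
    return len(seen)


# One uniquely mapped first segment, written once per placement of its mate.
records = [
    Record("t1", 65, "chr1", 100, 60, "chr2", 500),   # primary template alignment
    Record("t1", 321, "chr1", 100, 60, "chr3", 700),  # same placement, 0x100 set
    Record("t1", 321, "chr1", 100, 60, "chr4", 900),  # same placement, 0x100 set
    Record("t1", 129, "chr2", 500, 0, "chr1", 100),
    Record("t1", 385, "chr3", 700, 0, "chr1", 100),
    Record("t1", 385, "chr4", 900, 0, "chr1", 100),
]

naive_depth = sum(1 for r in records if r.mapq > 10)
deduplicated_depth = count_unique_placements(records)
```

In this toy input the naive MAPQ filter counts the uniquely mapped segment three times, while the deduplicated count is one, which is the extra step the comment says depth calculations would need.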

template coordinates rather than mapping coordinates. For the last
read, the next read is the first read in this template alignment.
Multiple template alignments may exist, with RNEXT/PNEXT forming a
Review comment (Contributor):

Where supplementary alignments fit in should be explicitly stated. This is an issue with the current spec, as it doesn't explicitly state that RNEXT and PNEXT should point to the non-supplementary primary record.

We should explicitly state that RNEXT and PNEXT never point to supplementary alignment records.

circular list per template alignment. If {\tt @SQ} header lines are
present, {\sf RNEXT} (if not `*' or `=') must be present in one of
the {\tt SQ-SN} tag. This field is set as `*' when the information
is unavailable, and set as `=' if {\sf RNEXT} is identical to {\sf
RNAME}. If {\sf RNEXT} is `*', no assumptions can be made on {\sf
PNEXT} and bit 0x20.
\item {\sf PNEXT}: Position of the NEXT read in this template
alignment. Set as 0 when the information is unavailable. This field
equals {\sf POS} at the primary line of the next read. If {\sf
Review comment (Contributor):

Next read or next segment?

PNEXT} is 0, no assumptions can be made on {\sf RNEXT} and bit
0x20.
\item {\sf PNEXT}: Position of the primary alignment of the NEXT read in the template. Set as
0 when the information is unavailable. This field equals {\sf POS} at the primary line of
the next read. If {\sf PNEXT} is 0, no assumptions can be made on
{\sf RNEXT} and bit 0x20.
\item {\sf TLEN}: signed observed Template LENgth. If all segments are
mapped to the same reference, the unsigned observed template length
equals the number of bases from the leftmost mapped base to the
rightmost mapped base. The leftmost segment has a plus sign and the
rightmost has a minus sign. The sign of segments in the middle is
undefined. It is set as 0 for single-segment template or when the
information is unavailable.
\item {\sf TLEN}: signed observed Template LENgth. If the first and
last segments of this template alignment are mapped to the same
reference, the unsigned observed template length equals the number
of bases from the leftmost mapped base to the rightmost mapped
base. The leftmost segment has a plus sign and the rightmost has a
minus sign. The sign of segments in the middle is undefined. It is
set as 0 for single-segment template or when the information is
unavailable.
\item {\sf SEQ}: segment SEQuence. This field can be a `*' when the
sequence is not stored. If not a `*', the length of the sequence must
equal the sum of lengths of {\tt M/I/S/=/X} operations in {\sf CIGAR}.
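
The proposed TLEN wording in this hunk can be sketched as follows. This is one possible reading, not the PR author's implementation: it assumes that "leftmost" and "rightmost" are taken over the first and last segments only, that a tie in start position makes the first segment the leftmost, and that middle segments are reported as `None` because their sign is undefined. The `Placement` type and the function name `tlen_per_segment` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Placement(NamedTuple):
    """Hypothetical mapped extent of one segment on the reference."""
    rname: str
    start: int  # 1-based leftmost mapped base (POS)
    end: int    # 1-based rightmost mapped base, inclusive


def tlen_per_segment(segments: List[Placement]) -> List[Optional[int]]:
    """Return a TLEN value per segment of one template alignment.

    Segments are given in template order. 0 means "unavailable";
    None marks middle segments, whose sign the text leaves undefined.
    """
    n = len(segments)
    if n < 2:
        return [0] * n  # single-segment template
    first, last = segments[0], segments[-1]
    if first.rname != last.rname:
        return [0] * n  # first and last on different references
    length = max(first.end, last.end) - min(first.start, last.start) + 1
    result: List[Optional[int]] = [None] * n
    first_is_left = first.start <= last.start  # assumption: ties favour the first segment
    result[0] = length if first_is_left else -length
    result[-1] = -length if first_is_left else length
    return result


pair = tlen_per_segment([Placement("chr1", 100, 199), Placement("chr1", 350, 449)])
triple = tlen_per_segment([
    Placement("chr1", 500, 599),
    Placement("chr2", 10, 109),
    Placement("chr1", 100, 199),
])
```

For the pair, the span from base 100 to base 449 gives 350 bases, so the values are 350 and -350. For the three-segment case the last segment is leftmost, so the first segment gets the minus sign.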
@@ -434,7 +445,7 @@ \subsection{The alignment section: optional fields}
{\tt CS} & Z & Color read sequence on the original strand of the read. The primer base must be included.\\
{\tt CT} & Z & Complete read annotation tag, used for consensus annotation dummy features.\footnotemark\\
{\tt E2} & Z & The 2nd most likely base calls. Same encoding and same length as {\sf QUAL}.\\
{\tt FI} & i & The index of segment in the template.\\
{\tt FI} & i & The index of segment in the template, counting from 1 onwards.\\
{\tt FS} & Z & Segment suffix.\\
{\tt FZ} & B,S & Flow signal intensities on the original strand of the read, stored as {\tt (uint16\_t) round(value * 100.0)}. \\
{\tt LB} & Z & Library. Value to be consistent with the header {\tt RG-LB} tag if {\tt @RG} is present.\\
@@ -464,7 +475,7 @@ \subsection{The alignment section: optional fields}
Each element in the semi-colon delimited list represents a part of the chimeric alignment. Conventionally, at a supplementary line,
the first element points to the primary line.\\
{\tt SM} & i & Template-independent mapping quality \\
{\tt TC} & i & The number of segments in the template.\\
{\tt TC} & i & The number of segments in the template. Mandatory for templates with more than two segments.\\
{\tt U2} & Z & Phred probility of the 2nd call being wrong conditional on the best being wrong. The same encoding as {\sf QUAL}. \\
{\tt UQ} & i & Phred likelihood of the segment, conditional on the mapping being correct \\
\hline
@@ -559,8 +570,8 @@ \section{Recommended Practice for the SAM Format}
\begin{enumerate}[label=\arabic*]
\item When one segment is present in multiple lines to represent a multiple
mapping of the segment, only one of these records should have the secondary
Review comment (Contributor):

Again, the current spec is missing clarification as to what's allowed when the primary alignment is chimeric: you can have multiple non-secondary lines, but only one of these can be non-supplementary.

alignment flag bit (0x100) unset. {\sf RNEXT} and {\sf PNEXT} point to the
primary line of the next read in the template.
alignment flag bit (0x100) unset. Regardless of bit 0x100, {\sf RNEXT} and
{\sf PNEXT} point to the next segment in the current template alignment.
\item {\sf SEQ} and {\sf QUAL} of secondary alignments should be set
to `*' to reduce the file size.
\end{enumerate}
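
The circular RNEXT/PNEXT linkage proposed in this diff can be sketched as below. This is a minimal illustration under stated assumptions, not code from the PR: the function name `link_template_alignment` and the `(RNAME, POS)` tuple representation are hypothetical, segments are assumed to be given in template order, supplementary records are not modelled, and a template alignment with fewer than two segments is assumed to get `*` and 0 (information unavailable).

```python
from typing import List, Tuple

Segment = Tuple[str, int]  # (RNAME, 1-based POS); hypothetical minimal record


def link_template_alignment(segments: List[Segment]) -> List[Tuple[str, int]]:
    """Return (RNEXT, PNEXT) for each segment of one template alignment.

    Each segment points to the next one in template order, and the last
    segment points back to the first, forming a circular list.
    """
    n = len(segments)
    if n < 2:
        return [("*", 0)] * n  # assumption: nothing to link to
    links = []
    for i, (rname, _pos) in enumerate(segments):
        next_rname, next_pos = segments[(i + 1) % n]
        # '=' is used when the next segment is on the same reference.
        rnext = "=" if next_rname == rname else next_rname
        links.append((rnext, next_pos))
    return links


# Two template alignments for the same pair: each forms its own circular list.
primary_links = link_template_alignment([("chr1", 100), ("chr1", 350)])
secondary_links = link_template_alignment([("chr1", 100), ("chr5", 9000)])
three_segment_links = link_template_alignment(
    [("chr1", 100), ("chr1", 300), ("chr2", 50)]
)
```

Here `primary_links` is `[("=", 350), ("=", 100)]` and `secondary_links` is `[("chr5", 9000), ("chr1", 100)]`, so the same first segment carries a different RNEXT/PNEXT in each template alignment regardless of bit 0x100.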