Changed rnext/pnext/tlen to operate per template alignment rather than for primary alignments only. #53

jkbonfield · 2014-11-17T12:01:23Z

This for a discussion point at the next GA4GH conference call, rather than for acceptance immediately.

Also see discussions in issues #44 and #48.

Things still to resolve:

TLEN definition has not changed from left/right to 5'/3'. We need to come to a decision once and for all.
Should PNEXT/RNEXT/TLEN always be linking together to form circles per template alignment, or should we permit the old method of linking to next (aka "other end" in a 2-segment world) primary alignment for cases where it makes more sense to the aligner. Ie do we have to double up the other end in such cases?
There are no additional auxiliary tags defined here yet; specifically no SI (which secondary alignment number this is within), SC (count of secondary alignments) and AC (total count of all segment alignments in file for this template). Field names subject to discussion and change of course.
No commentary of CC/CP. These are already in use by novoalign, but do not form a circular linked list meaning that it is not possible to start at any point and do a full round-trip to find all segment alignments for all primary, secondary and supplementary alignment cases.

for primary alignments only.

jkbonfield · 2014-12-15T12:02:57Z

For the issues listed above here are my recollections from the conference call. Put here so the process is open and also for my poor memory.

Not discussed due to lack of time.
More debate needed as no clear consensus; hence likely siding on not forcibly requiring duplicate alignments for secondary reads.
OK in principle, but lack of time meant we didn't discuss specifics.
Not discussed due to lack of time.

Regarding the content of the pull request itself, I didn't hear any objections to relaxing the requirement for PNEXT/RNEXT/TLEN(/FLAGS) having to only link back to primary alignments. I also didn't hear objections to making FI/TC mandatory on templates with more than 2 alignments.

We didn't really discuss the issues of TLEN having middle segments unmapped while outer ones are mapped, again due to the time constraints. In order to try and get this moving faster I'll update the pull request with a few more bits so we cover the undealt with issues above.

Finally, there are concerns over taking random-access slices of the data and keeping fields up to date. What does TC really mean? If a template comes in 3 parts matching positions 1000, 2000 and 3000 in a reference, and we ask for 1900..2100 as a sub-range, TC won't really change, but it's also perhaps not useful for reducing algorithmic complexity? We need another field anyway (total number of alignments), so should that be updated on range queries?

d-cameron

Having whole-template secondary alignments is problematic as not only does it increase the number of records needing to be written, but it breaks existing piplines that can currently get away with a mapq check instead of looking at the secondary/supp flags.

To fully handle multi-mapping alignments we need to be able to arbitrary alignment graphs including having the ability to specify chimeric alignments for which part of a segment is multi-mapping.

Is there a particular use case driving this PR? Multi-segment template alignment above paired reads feels like a premature specification generalisation for which sequencing technology didn't advance in the manner expected. The only use case for it that I've seen with any significant traction is 10X's chromium linked reads but it's unusable for that since the reads from a single template are sampled randomly and the specs can't handle that.

d-cameron · 2018-01-11T16:59:58Z

SAMv1.tex

+    PNEXT} and bit 0x20.
+\item {\sf PNEXT}: Position of the NEXT read in this template
+  alignment.  Set as 0 when the information is unavailable. This field
+  equals {\sf POS} at the primary line of the next read. If {\sf


Next read or next segment?

d-cameron · 2018-01-11T17:03:19Z

SAMv1.tex

@@ -296,6 +304,7 @@ \subsection{The alignment section: mandatory fields}
    0x40 and 0x80 are unset, the index of the read in the template
    is unknown. This may happen for a non-linear template or the index
    is lost in data processing.
+  \item Bit 0x20 defines the next segment as per the definition in RNEXT and PNEXT.


I'm not sure what this means. If the next segment is the third segment in a 5 segment template, what does having 0x20 set actually mean? Alignments for segments 2 and 3 are consistent? 3 and 4? All segment alignment are consistent with having no (large) SVs wrt the set of reference alignment?

d-cameron · 2018-01-11T17:12:25Z

SAMv1.tex

-  next read.  If {\sf
-    RNEXT} is `*', no assumptions can be made on {\sf PNEXT} and bit
+\item {\sf RNEXT}: Reference sequence name of the NEXT read in this
+  template alignment, where NEXT is defined to be the next read in


So essentially this PR essentially redefines secondary alignments as secondary alignments for the entire template as opposed to the current definition that implicitly (by requiring RNEXT to point to the primary alignment) defines secondary alignments as secondary alignments for the segment.

For read pairs that have one read uniquely mappable and the other read multi-mapping to 100 different location this change in definition will require spec-compliant implementations to write out the uniquely aligned segment 100 times, each with a different RNEXT.

It is a major change that will break many tools as tools can no longer assume that a high mapq means that record is unique. This as simple as calculating read depth of reads with (e.g.) mapq>10 will now require read deduplication. Is this intended?

d-cameron · 2018-01-11T17:15:43Z

SAMv1.tex

+  template alignment, where NEXT is defined to be the next read in
+  template coordinates rather than mapping coordinates.  For the last
+  read, the next read is the first read in this template alignment.
+  Multiple template alignments may exist, with RNEXT/PNEXT forming a


Where supp alignment fit in should be explicitly stated. This is an issue with the current specs as it doesn't explicitly state that RNEXT and PNEXT should be to the non-supp primary record.

We should explicitly state that RNEXT and PNEXT never point to supp alignment records.

d-cameron · 2018-01-11T17:18:18Z

SAMv1.tex

@@ -559,8 +570,8 @@ \section{Recommended Practice for the SAM Format}
  \begin{enumerate}[label=\arabic*]
  \item When one segment is present in multiple lines to represent a multiple
 	mapping of the segment, only one of these records should have the secondary


Again, the current specs are missing clarification as to what's allowed when the primary alignment is chimeric -
you can have multiple non-secondary lines, but only one of these can be non-supplementary.

Discard the group if it is supported only by SUPPLEMENTARY reads. As the SAM specification requires (for each extant read) exactly one alignment record with neither SECONDARY nor SUPPLEMENTARY set, in fact we filter based on primary alignments with neither of those flags set. As we look in depth at only one read out of the pair, we use the flags of the observed read and assume that the mate is similarly primary / supplementary. This is good enough as the intended behaviour of pairing in the presence of supplementary alignments is unclear in any case -- see samtools/hts-specs#53 et al.

jkbonfield · 2018-12-13T09:25:39Z

Sailing against the wind and not how things are done right now.

For completeness sake, testing with bwa shows two primary reads and a supplementary read will link 1 to 2, 2 to 1 and 1s (supp) to 2. Thus supplementary reads are not part of the RNEXT/PNEXT chain, and must instead use the SA:Z tag to identify how they fit into the primary data. The spec is (now) clear on this too; primary is ONLY non-supplementary and so by definition a supplementary read can never be considered as primary.

Changed rnext/pnext/tlen to operate per template alignment rather than

7574e2c

for primary alignments only.

jkbonfield mentioned this pull request Dec 11, 2014

SAM: "next" vs "next primary". #44

Open

d-cameron requested changes Jan 11, 2018

View reviewed changes

jmarshall added the sam label May 2, 2018

jkbonfield closed this Dec 13, 2018

jkbonfield mentioned this pull request Jun 25, 2020

SAM: Clarification of primary vs secondary reads #48

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changed rnext/pnext/tlen to operate per template alignment rather than for primary alignments only. #53

Changed rnext/pnext/tlen to operate per template alignment rather than for primary alignments only. #53

jkbonfield commented Nov 17, 2014

jkbonfield commented Dec 15, 2014

d-cameron left a comment

d-cameron Jan 11, 2018

d-cameron Jan 11, 2018

d-cameron Jan 11, 2018

d-cameron Jan 11, 2018

d-cameron Jan 11, 2018

jkbonfield commented Dec 13, 2018

Changed rnext/pnext/tlen to operate per template alignment rather than for primary alignments only. #53

Changed rnext/pnext/tlen to operate per template alignment rather than for primary alignments only. #53

Conversation

jkbonfield commented Nov 17, 2014

jkbonfield commented Dec 15, 2014

d-cameron left a comment

Choose a reason for hiding this comment

d-cameron Jan 11, 2018

Choose a reason for hiding this comment

d-cameron Jan 11, 2018

Choose a reason for hiding this comment

d-cameron Jan 11, 2018

Choose a reason for hiding this comment

d-cameron Jan 11, 2018

Choose a reason for hiding this comment

d-cameron Jan 11, 2018

Choose a reason for hiding this comment

jkbonfield commented Dec 13, 2018