-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changed rnext/pnext/tlen to operate per template alignment rather than for primary alignments only. #53
Conversation
for primary alignments only.
For the issues listed above here are my recollections from the conference call. Put here so the process is open and also for my poor memory.
Regarding the content of the pull request itself, I didn't hear any objections to relaxing the requirement for PNEXT/RNEXT/TLEN(/FLAGS) having to only link back to primary alignments. I also didn't hear objections to making FI/TC mandatory on templates with more than 2 alignments. We didn't really discuss the issues of TLEN having middle segments unmapped while outer ones are mapped, again due to the time constraints. In order to try and get this moving faster I'll update the pull request with a few more bits so we cover the undealt with issues above. Finally, there are concerns over taking random-access slices of the data and keeping fields up to date. What does TC really mean? If a template comes in 3 parts matching positions 1000, 2000 and 3000 in a reference, and we ask for 1900..2100 as a sub-range, TC won't really change, but it's also perhaps not useful for reducing algorithmic complexity? We need another field anyway (total number of alignments), so should that be updated on range queries? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having whole-template secondary alignments is problematic as not only does it increase the number of records needing to be written, but it breaks existing piplines that can currently get away with a mapq check instead of looking at the secondary/supp flags.
To fully handle multi-mapping alignments we need to be able to arbitrary alignment graphs including having the ability to specify chimeric alignments for which part of a segment is multi-mapping.
Is there a particular use case driving this PR? Multi-segment template alignment above paired reads feels like a premature specification generalisation for which sequencing technology didn't advance in the manner expected. The only use case for it that I've seen with any significant traction is 10X's chromium linked reads but it's unusable for that since the reads from a single template are sampled randomly and the specs can't handle that.
PNEXT} and bit 0x20. | ||
\item {\sf PNEXT}: Position of the NEXT read in this template | ||
alignment. Set as 0 when the information is unavailable. This field | ||
equals {\sf POS} at the primary line of the next read. If {\sf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next read or next segment?
@@ -296,6 +304,7 @@ \subsection{The alignment section: mandatory fields} | |||
0x40 and 0x80 are unset, the index of the read in the template | |||
is unknown. This may happen for a non-linear template or the index | |||
is lost in data processing. | |||
\item Bit 0x20 defines the next segment as per the definition in RNEXT and PNEXT. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what this means. If the next segment is the third segment in a 5 segment template, what does having 0x20 set actually mean? Alignments for segments 2 and 3 are consistent? 3 and 4? All segment alignment are consistent with having no (large) SVs wrt the set of reference alignment?
next read. If {\sf | ||
RNEXT} is `*', no assumptions can be made on {\sf PNEXT} and bit | ||
\item {\sf RNEXT}: Reference sequence name of the NEXT read in this | ||
template alignment, where NEXT is defined to be the next read in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So essentially this PR essentially redefines secondary alignments as secondary alignments for the entire template as opposed to the current definition that implicitly (by requiring RNEXT to point to the primary alignment) defines secondary alignments as secondary alignments for the segment.
For read pairs that have one read uniquely mappable and the other read multi-mapping to 100 different location this change in definition will require spec-compliant implementations to write out the uniquely aligned segment 100 times, each with a different RNEXT.
It is a major change that will break many tools as tools can no longer assume that a high mapq means that record is unique. This as simple as calculating read depth of reads with (e.g.) mapq>10 will now require read deduplication. Is this intended?
template alignment, where NEXT is defined to be the next read in | ||
template coordinates rather than mapping coordinates. For the last | ||
read, the next read is the first read in this template alignment. | ||
Multiple template alignments may exist, with RNEXT/PNEXT forming a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where supp alignment fit in should be explicitly stated. This is an issue with the current specs as it doesn't explicitly state that RNEXT and PNEXT should be to the non-supp primary record.
We should explicitly state that RNEXT and PNEXT never point to supp alignment records.
@@ -559,8 +570,8 @@ \section{Recommended Practice for the SAM Format} | |||
\begin{enumerate}[label=\arabic*] | |||
\item When one segment is present in multiple lines to represent a multiple | |||
mapping of the segment, only one of these records should have the secondary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, the current specs are missing clarification as to what's allowed when the primary alignment is chimeric -
you can have multiple non-secondary lines, but only one of these can be non-supplementary.
Discard the group if it is supported only by SUPPLEMENTARY reads. As the SAM specification requires (for each extant read) exactly one alignment record with neither SECONDARY nor SUPPLEMENTARY set, in fact we filter based on primary alignments with neither of those flags set. As we look in depth at only one read out of the pair, we use the flags of the observed read and assume that the mate is similarly primary / supplementary. This is good enough as the intended behaviour of pairing in the presence of supplementary alignments is unclear in any case -- see samtools/hts-specs#53 et al.
Sailing against the wind and not how things are done right now. For completeness sake, testing with bwa shows two primary reads and a supplementary read will link 1 to 2, 2 to 1 and 1s (supp) to 2. Thus supplementary reads are not part of the RNEXT/PNEXT chain, and must instead use the SA:Z tag to identify how they fit into the primary data. The spec is (now) clear on this too; primary is ONLY non-supplementary and so by definition a supplementary read can never be considered as primary. |
This for a discussion point at the next GA4GH conference call, rather than for acceptance immediately.
Also see discussions in issues #44 and #48.
Things still to resolve:
TLEN definition has not changed from left/right to 5'/3'. We need to come to a decision once and for all.
Should PNEXT/RNEXT/TLEN always be linking together to form circles per template alignment, or should we permit the old method of linking to next (aka "other end" in a 2-segment world) primary alignment for cases where it makes more sense to the aligner. Ie do we have to double up the other end in such cases?
There are no additional auxiliary tags defined here yet; specifically no SI (which secondary alignment number this is within), SC (count of secondary alignments) and AC (total count of all segment alignments in file for this template). Field names subject to discussion and change of course.
No commentary of CC/CP. These are already in use by novoalign, but do not form a circular linked list meaning that it is not possible to start at any point and do a full round-trip to find all segment alignments for all primary, secondary and supplementary alignment cases.