Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed rnext/pnext/tlen to operate per template alignment rather than for primary alignments only. #53

Closed
wants to merge 1 commit into from

Conversation

jkbonfield
Copy link
Contributor

This for a discussion point at the next GA4GH conference call, rather than for acceptance immediately.

Also see discussions in issues #44 and #48.

Things still to resolve:

  1. TLEN definition has not changed from left/right to 5'/3'. We need to come to a decision once and for all.

  2. Should PNEXT/RNEXT/TLEN always be linking together to form circles per template alignment, or should we permit the old method of linking to next (aka "other end" in a 2-segment world) primary alignment for cases where it makes more sense to the aligner. Ie do we have to double up the other end in such cases?

  3. There are no additional auxiliary tags defined here yet; specifically no SI (which secondary alignment number this is within), SC (count of secondary alignments) and AC (total count of all segment alignments in file for this template). Field names subject to discussion and change of course.

  4. No commentary of CC/CP. These are already in use by novoalign, but do not form a circular linked list meaning that it is not possible to start at any point and do a full round-trip to find all segment alignments for all primary, secondary and supplementary alignment cases.

@jkbonfield
Copy link
Contributor Author

For the issues listed above here are my recollections from the conference call. Put here so the process is open and also for my poor memory.

  1. Not discussed due to lack of time.
  2. More debate needed as no clear consensus; hence likely siding on not forcibly requiring duplicate alignments for secondary reads.
  3. OK in principle, but lack of time meant we didn't discuss specifics.
  4. Not discussed due to lack of time.

Regarding the content of the pull request itself, I didn't hear any objections to relaxing the requirement for PNEXT/RNEXT/TLEN(/FLAGS) having to only link back to primary alignments. I also didn't hear objections to making FI/TC mandatory on templates with more than 2 alignments.

We didn't really discuss the issues of TLEN having middle segments unmapped while outer ones are mapped, again due to the time constraints. In order to try and get this moving faster I'll update the pull request with a few more bits so we cover the undealt with issues above.

Finally, there are concerns over taking random-access slices of the data and keeping fields up to date. What does TC really mean? If a template comes in 3 parts matching positions 1000, 2000 and 3000 in a reference, and we ask for 1900..2100 as a sub-range, TC won't really change, but it's also perhaps not useful for reducing algorithmic complexity? We need another field anyway (total number of alignments), so should that be updated on range queries?

Copy link
Contributor

@d-cameron d-cameron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having whole-template secondary alignments is problematic as not only does it increase the number of records needing to be written, but it breaks existing piplines that can currently get away with a mapq check instead of looking at the secondary/supp flags.

To fully handle multi-mapping alignments we need to be able to arbitrary alignment graphs including having the ability to specify chimeric alignments for which part of a segment is multi-mapping.

Is there a particular use case driving this PR? Multi-segment template alignment above paired reads feels like a premature specification generalisation for which sequencing technology didn't advance in the manner expected. The only use case for it that I've seen with any significant traction is 10X's chromium linked reads but it's unusable for that since the reads from a single template are sampled randomly and the specs can't handle that.

PNEXT} and bit 0x20.
\item {\sf PNEXT}: Position of the NEXT read in this template
alignment. Set as 0 when the information is unavailable. This field
equals {\sf POS} at the primary line of the next read. If {\sf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next read or next segment?

@@ -296,6 +304,7 @@ \subsection{The alignment section: mandatory fields}
0x40 and 0x80 are unset, the index of the read in the template
is unknown. This may happen for a non-linear template or the index
is lost in data processing.
\item Bit 0x20 defines the next segment as per the definition in RNEXT and PNEXT.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this means. If the next segment is the third segment in a 5 segment template, what does having 0x20 set actually mean? Alignments for segments 2 and 3 are consistent? 3 and 4? All segment alignment are consistent with having no (large) SVs wrt the set of reference alignment?

next read. If {\sf
RNEXT} is `*', no assumptions can be made on {\sf PNEXT} and bit
\item {\sf RNEXT}: Reference sequence name of the NEXT read in this
template alignment, where NEXT is defined to be the next read in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So essentially this PR essentially redefines secondary alignments as secondary alignments for the entire template as opposed to the current definition that implicitly (by requiring RNEXT to point to the primary alignment) defines secondary alignments as secondary alignments for the segment.

For read pairs that have one read uniquely mappable and the other read multi-mapping to 100 different location this change in definition will require spec-compliant implementations to write out the uniquely aligned segment 100 times, each with a different RNEXT.

It is a major change that will break many tools as tools can no longer assume that a high mapq means that record is unique. This as simple as calculating read depth of reads with (e.g.) mapq>10 will now require read deduplication. Is this intended?

template alignment, where NEXT is defined to be the next read in
template coordinates rather than mapping coordinates. For the last
read, the next read is the first read in this template alignment.
Multiple template alignments may exist, with RNEXT/PNEXT forming a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where supp alignment fit in should be explicitly stated. This is an issue with the current specs as it doesn't explicitly state that RNEXT and PNEXT should be to the non-supp primary record.

We should explicitly state that RNEXT and PNEXT never point to supp alignment records.

@@ -559,8 +570,8 @@ \section{Recommended Practice for the SAM Format}
\begin{enumerate}[label=\arabic*]
\item When one segment is present in multiple lines to represent a multiple
mapping of the segment, only one of these records should have the secondary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the current specs are missing clarification as to what's allowed when the primary alignment is chimeric -
you can have multiple non-secondary lines, but only one of these can be non-supplementary.

@jmarshall jmarshall added the sam label May 2, 2018
jmarshall added a commit to wwcrc/BRASS that referenced this pull request Nov 2, 2018
Discard the group if it is supported only by SUPPLEMENTARY reads.
As the SAM specification requires (for each extant read) exactly one
alignment record with neither SECONDARY nor SUPPLEMENTARY set, in fact
we filter based on primary alignments with neither of those flags set.

As we look in depth at only one read out of the pair, we use the flags
of the observed read and assume that the mate is similarly primary /
supplementary. This is good enough as the intended behaviour of pairing
in the presence of supplementary alignments is unclear in any case --
see samtools/hts-specs#53 et al.
@jkbonfield
Copy link
Contributor Author

Sailing against the wind and not how things are done right now.

For completeness sake, testing with bwa shows two primary reads and a supplementary read will link 1 to 2, 2 to 1 and 1s (supp) to 2. Thus supplementary reads are not part of the RNEXT/PNEXT chain, and must instead use the SA:Z tag to identify how they fit into the primary data. The spec is (now) clear on this too; primary is ONLY non-supplementary and so by definition a supplementary read can never be considered as primary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants