Oddly high mapping quality when aligning reads with Ns #4230
Comments
I'm not sure I have a full explanation for why this is happening, but one thing that I would recommend is dropping the base qualities on the Ns. If an N comes off of a sequencing machine, the base is usually given a base quality of |
With one N, you still get full-length alignments from gapless extension, because the full-length bonus is larger than the penalty for a single mismatch. With two or more Ns, the trailing Ns get trimmed, and Giraffe does tail alignment with Dozeu. What does Dozeu do when the read and/or the graph contain Ns? (Gapless extension masks all non-ACGT characters in the read with X, assuming that the graph does not contain character X.)
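The arithmetic behind the one-N versus two-N behavior can be sketched as follows. This is only an illustration, not Giraffe's code: the constants `MATCH`, `MISMATCH`, and `FULL_LENGTH_BONUS` are assumed values chosen so that the bonus exceeds one mismatch penalty but not two, and the function name is hypothetical. It considers only the end of the read that carries the Ns.

```python
# Assumed, illustrative scoring values (not read from any vg build).
MATCH = 1
MISMATCH = 4
FULL_LENGTH_BONUS = 5


def gapless_tail_choice(aligned_bases, trailing_ns):
    """Compare keeping trailing Ns in a gapless extension with trimming them.

    aligned_bases: number of matching bases before the trailing Ns.
    trailing_ns:   number of Ns at the end of the read, each scored as
                   an ordinary mismatch in this scheme.
    Returns (choice, full_length_score, trimmed_score).
    """
    # Extending through the Ns pays one mismatch per N but earns the bonus.
    full = aligned_bases * MATCH - trailing_ns * MISMATCH + FULL_LENGTH_BONUS
    # Trimming the Ns avoids the mismatches but forfeits the bonus.
    trimmed = aligned_bases * MATCH
    choice = "full-length" if full > trimmed else "trimmed"
    return choice, full, trimmed


for k in (1, 2, 3):
    print(k, gapless_tail_choice(100, k))
```

Under these assumed values the full-length alignment wins only when `FULL_LENGTH_BONUS > trailing_ns * MISMATCH`, which holds for one N and fails for two or more.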
I believe that Dozeu will treat all Ns as 0 score mismatches (even if both read and reference are N). The full length bonus is not applied during dynamic programming, but it is applied when choosing the optimal traceback location and in the final score.
I think I know what is happening. Let's concentrate on the read with two trailing Ns. There are three equally good alignments in the graph, with the problematic one being path
The first (internal) N corresponds to node 49. Haplotype The best extension has an unaligned 14 bp tail, which we align using Dozeu. Because Dozeu does not use the mismatch penalty for Ns, this alignment has a higher score than the equivalent alignments elsewhere in the graph.
A couple of questions:
Giraffe uses two alignment algorithms. First it tries to find an alignment without indels. If it can't find a full-length alignment with a few mismatches, it trims the partial alignments it found, as mismatches near the ends of an alignment often indicate the presence of indels. Then it aligns the tails of the trimmed alignment while allowing indels.

In the first (gapless extension) algorithm, Ns are mismatches with a normal mismatch penalty against everything. In the second algorithm (Dozeu), Ns are zero-cost mismatches against everything. This is effectively a rare bug that requires a complex fix, because the above behaviors are inherent to the way the alignment algorithms have been implemented.
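To make the difference between the two N-scoring rules concrete, here is a minimal sketch that scores the same ungapped column-by-column comparison under both rules. It is not Dozeu or the gapless extender: it ignores indels, the graph, and the full-length bonus, and the `MATCH` and `MISMATCH` values and the function name `score_columns` are assumptions made for illustration.

```python
# Assumed, illustrative scoring values (not read from any vg build).
MATCH = 1
MISMATCH = 4


def score_columns(read, ref, n_is_free):
    """Score an ungapped, equal-length comparison of read against ref.

    n_is_free=False: any column containing an N costs a normal mismatch
                     (the rule described for gapless extension).
    n_is_free=True:  any column containing an N scores 0, even N vs N
                     (the rule described for Dozeu).
    """
    if len(read) != len(ref):
        raise ValueError("read and ref must have the same length")
    score = 0
    for r, g in zip(read, ref):
        if r == "N" or g == "N":
            score += 0 if n_is_free else -MISMATCH
        elif r == g:
            score += MATCH
        else:
            score -= MISMATCH
    return score


read = "ACGTNN"
ref = "ACGTAC"
print(score_columns(read, ref, n_is_free=False))
print(score_columns(read, ref, n_is_free=True))
```

With the N-penalizing rule the two trailing Ns drag the score down, while the zero-cost rule leaves the score of the matching prefix untouched. A tail that goes through the zero-cost path can therefore outscore an otherwise equivalent alignment that was scored with the mismatch penalty.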
1. What were you trying to do?
I am running some experiments in which I align the following reads to the HPRC v1.1 pangenome index:
1. a read (r1_wo_n)
2. (1) with the last x bp trimmed (r1_cut)
3. (2) with 1 N added to the end (r1_w_1n)
4. (2) with 2 Ns added to the end (r1_w_2n)
5. (2) with x Ns added to the end (r1_w_n)
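For anyone who wants to build a comparable test set from their own read, the five variants listed above could be generated with a sketch like the one below. The helper names `make_variants` and `to_fastq` are hypothetical, and the quality character assigned to the added Ns (`n_qual`) is left to the caller because the issue does not state which value was used.

```python
def make_variants(seq, qual, x, n_qual):
    """Build the five read variants from one read.

    seq, qual: sequence and quality strings of the original read.
    x:         number of trailing bases to trim (and, for r1_w_n, to pad).
    n_qual:    single quality character given to every added N
               (caller's choice; not specified in the issue).
    """
    if len(seq) != len(qual):
        raise ValueError("seq and qual must have the same length")
    if not 0 < x < len(seq):
        raise ValueError("x must be between 1 and len(seq) - 1")
    if len(n_qual) != 1:
        raise ValueError("n_qual must be a single character")

    cut_seq, cut_qual = seq[:-x], qual[:-x]

    def with_ns(k):
        # Append k Ns to the trimmed read, with matching quality characters.
        return cut_seq + "N" * k, cut_qual + n_qual * k

    return {
        "r1_wo_n": (seq, qual),
        "r1_cut": (cut_seq, cut_qual),
        "r1_w_1n": with_ns(1),
        "r1_w_2n": with_ns(2),
        "r1_w_n": with_ns(x),
    }


def to_fastq(variants):
    """Render the variants as FASTQ text, one four-line record per read."""
    return "".join(
        f"@{name}\n{s}\n+\n{q}\n" for name, (s, q) in variants.items()
    )
```

Note that r1_w_n has the same length as the original read, so any difference in its score relative to r1_wo_n comes only from how the Ns are scored.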
Here's the FASTQ file:
2. What did you want to happen?
I expected the reads with tails of Ns to have a similar alignment score and mapping quality to the trimmed read. I also expected the reads with Ns to have worse mapping quality than the full read.
3. What actually happened?
Reads with at least 2 Ns have a better alignment score than the trimmed read, and consequently a better mapping quality than both the trimmed read and the full read. Oddly, the read with 2 Ns (r1_w_2n) has an even higher mapping quality than the read with the full amount of Ns (r1_w_n). Here's a filtered version of the GAM output. It's clear that the higher mapping quality results from the reads with Ns having higher-than-expected primary alignment scores.
5. What data and command can the vg dev team use to make the problem happen?
I used a re-tagged version of HPRC v1.1 (with GRCh38 specified as the only reference). Hopefully this has the same behavior as the original HPRC v1.1 index. I also enabled reporting of secondary alignments.
6. What does running vg version say?