Oddly high mapping quality when aligning reads with Ns #4230

alshai · 2024-02-26T15:46:14Z

1. What were you trying to do?

Doing some experiments where I am aligning the following reads to the HPRC v1.1 pangenome index:
1. a read r1_wo_n
2. (1) with the last x bp trimmed r1_cut
3. (2) but adding 1 Ns to the end r1_w_1n
4. (2) but adding 2 Ns to the end r1_w_2n
5. (2) but adding x Ns to the end r1_w_n

Here's the FASTQ file:

@r1_wo_n
GGATCGCTTGAGGCCAGGAGGCGGATGTTGCAGTGAGCCATGATCATGACACTGTAATCCAGTCTGGGTGACAGAGCGAAATTCCAGCAAAGGAAGGGAGAGAGGAAGGGAGAGAGGGAGTAGGGGAGAGAGAGAGAGAGAGAGTGAGGCAGGAAGGAAGGAAGAAGGGAAG
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@r1_cut
GGATCGCTTGAGGCCAGGAGGCGGATGTTGCAGTGAGCCATGATCATGACACTGTANTCCAGNCTGGG
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@r1_w_1n
GGATCGCTTGAGGCCAGGAGGCGGATGTTGCAGTGAGCCATGATCATGACACTGTANTCCAGNCTGGGN
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@r1_w_2n
GGATCGCTTGAGGCCAGGAGGCGGATGTTGCAGTGAGCCATGATCATGACACTGTANTCCAGNCTGGGNN
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@r1_w_n
GGATCGCTTGAGGCCAGGAGGCGGATGTTGCAGTGAGCCATGATCATGACACTGTANTCCAGNCTGGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2. What did you want to happen?

I expected the the reads with tails of Ns to have a similar alignment score and mapping quality to the trimmed read. And also expected the reads with Ns to have worse mapping quality than the full read.

3. What actually happened?

Reads with at least 2 Ns have a better alignment score than the trimmed read, and subsequently a better mapping quality than both the trimmed read and the full read. Oddly, the read with 2 Ns (r1_w_2n) has an even higher mapping quality than the read with the full amount of Ns (r1_w_n). Here's a filtered version of the GAM output. It's clear that the higher mapping quality results from the reads with Ns having higher-than-expected primary alignment scores.

{
  "name": "r1_wo_n",
  "score": 162,
  "mapping_quality": 1,
  "secondary_scores": [
    162,
    162,
    162
  ]
}
{
  "name": "r1_cut",
  "score": 68,
  "mapping_quality": 1,
  "secondary_scores": [
    68,
    68,
    68
  ]
}
{
  "name": "r1_w_1n",
  "score": 64,
  "mapping_quality": 1,
  "secondary_scores": [
    64,
    64,
    64
  ]
}
{
  "name": "r1_w_2n",
  "score": 76,
  "mapping_quality": 60,
  "secondary_scores": [
    76,
    60,
    60
  ]
}
{
  "name": "r1_w_n",
  "score": 76,
  "mapping_quality": 45,
  "secondary_scores": [
    76,
    68,
    68
  ]
}

5. What data and command can the vg dev team use to make the problem happen?

I used a re-tagged version of HPRC v1.1 (specified GRCh38 as the only reference). Hopefully this should have the same behavior as the original HPRC v1.1 index. Also enabled reporting of secondary alignments.

vg giraffe -M5 -Z hprc-v1.1-mc-grch38.d9.re-tagged.gbz -m hprc-v1.1-mc-grch38.d9.re-tagged.min -d hprc-v1.1-mc-grch38.d9.re-tagged.dist -x hprc-v1.1-mc-grch38.d9.re-tagged.xg -f reads.fq > out.high_qual.gam

6. What does running vg version say?

vg version v1.54.0 "Parafada"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by xian@octo

The text was updated successfully, but these errors were encountered:

jeizenga · 2024-02-26T21:46:32Z

I'm not sure I have a full explanation for why this is happening, but one thing that I would recommend is dropping the base qualities on the Ns. If an N comes off of a sequencing machine, the base is usually given a base quality of #, whereas you've given them a near-perfect base quality. If this still occurs with more realistic base qualities, it would also help to have the full alignments to look at (the path field in the JSON).

jltsiren · 2024-02-26T23:44:10Z

With one N, you still get full-length alignments from gapless extension, because the full-length bonus is larger than the penalty for a single mismatch. With two or more Ns, the trailing Ns get trimmed, and Giraffe does tail alignment with Dozeu. What does Dozeu do when the read and/or the graph contain Ns? (Gapless extension masks all non-ACGT characters in the read with X, assuming that the graph does not contain character X.)

jeizenga · 2024-02-27T00:58:04Z

I believe that Dozeu will treat all Ns as 0 score mismatches (even if both read and reference are N). The full length bonus is not applied during dynamic programming, but it is applied when choosing the optimal traceback location and in the final score.

jltsiren · 2024-02-27T04:26:22Z

I think I know what is happening. Let's concentrate on the read with two trailing Ns.

There are three equally good alignments in the graph, with the problematic one being path >45>47>48>49>50 in the following subgraph:

H	VN:Z:1.1	RS:Z:GRCh38
S	44686345	AGGATCGCTTGAGGCCAGGAGGCGGATGTTGCAGTGAGCCATGATCAT
S	44686346	A
S	44686347	G
S	44686348	ACACTGTA
S	44686349	C
S	44686350	TCCAGCCTGGGTGACAGAGCGAAATTCCAGCAAAGAAAGGGAGAGAGGAAGGGAGAGAGGGAGTAGGGGAGAGAGAGAGAGAGAG
L	44686345	+	44686346	+	0M
L	44686345	+	44686347	+	0M
L	44686346	+	44686348	+	0M
L	44686347	+	44686348	+	0M
L	44686348	+	44686349	+	0M
L	44686349	+	44686350	+	0M
W	GRCh38	0	chr5	176090011	176090154	>44686345>44686346>44686348>44686349>44686350	WT:i:51
W	unknown	1	chr5	0	57	>44686345>44686347>44686348	WT:i:1
W	unknown	2	chr5	0	143	>44686345>44686347>44686348>44686349>44686350	WT:i:36
W	unknown	3	chr5	0	85	>44686350	WT:i:1

The first (internal) N corresponds to node 49. Haplotype unknown 1 ends at node 48, because it would have a rare allele at the position of node 49. Meanwhile, haplotype unknown 2 would be the correct alignment. We get one right-maximal extension with no mismatches over unknown 1 and another with 10 additional matches, 4 additional mismatches (all Ns), and a full-length bonus over unknown 2. Because the extension over unknown 1 scores higher, we select it as the best extension.

The best extension has an unaligned 14 bp tail, which we align using Dozeu. Because Dozeu does not use the mismatch penalty for Ns, this alignment has a higher score than the equivalent alignments elsewhere in the graph.

alshai · 2024-02-28T16:40:09Z

A couple of questions:

Do the reads automatically get trimmed if there are at least 2 Ns. Or is it that in this example the 1-N read just happens to be small enough to get the full-length bonus?
Do the Ns receive a penalty during the right-maximal extension?
Is there anyway we can tell Dozeu to use a non-0 penalty for Ns?

jltsiren · 2024-02-29T04:16:08Z

Giraffe uses two alignment algorithms. First it tries to find an alignment without indels. If it can't find a full-length alignment with a few mismatches, it trims the partial alignments it found, as mismatches near the ends of an alignment often indicate the presence of indels. Then it aligns the tails of the trimmed alignment while allowing indels.

In the first (gapless extension) algorithm, Ns are mismatches with a normal mismatch penalty against everything. In the second algorithm (Dozeu), Ns are zero-cost mismatches against everything. This is effectively a rare bug that requires a complex fix, because the above behaviors are inherent to the way the alignment algorithms have been implemented.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Oddly high mapping quality when aligning reads with Ns #4230

Oddly high mapping quality when aligning reads with Ns #4230

alshai commented Feb 26, 2024

jeizenga commented Feb 26, 2024

jltsiren commented Feb 26, 2024

jeizenga commented Feb 27, 2024

jltsiren commented Feb 27, 2024

alshai commented Feb 28, 2024

jltsiren commented Feb 29, 2024

Oddly high mapping quality when aligning reads with Ns #4230

Oddly high mapping quality when aligning reads with Ns #4230

Comments

alshai commented Feb 26, 2024

jeizenga commented Feb 26, 2024

jltsiren commented Feb 26, 2024

jeizenga commented Feb 27, 2024

jltsiren commented Feb 27, 2024

alshai commented Feb 28, 2024

jltsiren commented Feb 29, 2024