Wrong output for two ~5Mb long sequences #7

lh3 · 2022-03-23T15:57:36Z

I am using WFA2-lib as a library to align two MHC sequences. The edit distance should be 123502 but WFA2 gives 2164110 in very short time. The two sequences are MHC-00GRCh38 and MHC-CHM13 from file "MHC-61.agc" at this Zenodo record. The code calling WFA2 is here:

https://github.com/lh3/lv89/blob/a16c30bba590e7de734edbdeeab55e4849e9ed20/main.c#L65-L71

PS: the output is correct for two ~100k sequences.

smarco · 2022-03-23T16:47:34Z

Hi,

You are not doing anything wrong. Probably, it is my fault. I decided to set the heuristic wf_heuristic_wfadaptive as the default. You should get what you want by adding the following line (i.e., full WFA):

attributes.heuristic.strategy = wf_heuristic_none;

In this case, I am getting a -123585. I have checked using:

./bin/align_benchmark -a edit-wfa -i hl.seq -o hl.alg --wfa-score-only -v

Let me know if that is what you expect or if the score difference (123502-123585) is a bug, and I should patch it.

That decision -- which I honestly don't like -- was made because many people tend to compare the full-WFA against other methods based on heuristics (e.g., banded, drops, ...) and dismiss it. In many cases, people fail to understand that the default is an exact method that can take time significantly more time than a heuristic one. Revisiting this decision, I should put a warning on the README and, I don't know, maybe switching the default back to full-WFA.

PS. The alignment.

lh3 · 2022-03-23T18:43:52Z

Thanks. This fixed the issue (lh3/lv89@7a740f7). 123585 is the correct edit distance for global alignment. 123502 is for extension alignment.

I mildly prefer to make the exact mode the default. The heuristic is more likely to fail on real data when we have long tandem duplications. One possible way is to add an explicit parameter to the main API and require the users to enable/disable heuristics:

wavefront_align(wavefront_aligner_t*, char*, int, char*, int heuristic_strategy);

PS: after a second thought, I feel adding an explicit parameter might not be a good idea because other parameters like type of alignment are more important.

lh3 added a commit to lh3/lv89 that referenced this issue Mar 23, 2022

fixed for WFA2; resolves smarco/WFA2-lib#7

7a740f7

lh3 closed this as completed Mar 23, 2022

lh3 mentioned this issue Mar 23, 2022

Other Landau-Vishkin implementations #6

Closed

lh3 mentioned this issue Apr 20, 2022

Use of kalloc in miniwfa, etc armintoepfer/aligner-testbed#1

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong output for two ~5Mb long sequences #7

Wrong output for two ~5Mb long sequences #7

lh3 commented Mar 23, 2022 •

edited

Loading

smarco commented Mar 23, 2022

lh3 commented Mar 23, 2022 •

edited

Loading

Wrong output for two ~5Mb long sequences #7

Wrong output for two ~5Mb long sequences #7

Comments

lh3 commented Mar 23, 2022 • edited Loading

smarco commented Mar 23, 2022

lh3 commented Mar 23, 2022 • edited Loading

lh3 commented Mar 23, 2022 •

edited

Loading

lh3 commented Mar 23, 2022 •

edited

Loading