Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong output for two ~5Mb long sequences #7

Closed
lh3 opened this issue Mar 23, 2022 · 2 comments
Closed

Wrong output for two ~5Mb long sequences #7

lh3 opened this issue Mar 23, 2022 · 2 comments

Comments

@lh3
Copy link

lh3 commented Mar 23, 2022

I am using WFA2-lib as a library to align two MHC sequences. The edit distance should be 123502 but WFA2 gives 2164110 in very short time. The two sequences are MHC-00GRCh38 and MHC-CHM13 from file "MHC-61.agc" at this Zenodo record. The code calling WFA2 is here:

https://github.com/lh3/lv89/blob/a16c30bba590e7de734edbdeeab55e4849e9ed20/main.c#L65-L71

PS: the output is correct for two ~100k sequences.

@smarco
Copy link
Owner

smarco commented Mar 23, 2022

Hi,

You are not doing anything wrong. Probably, it is my fault. I decided to set the heuristic wf_heuristic_wfadaptive as the default. You should get what you want by adding the following line (i.e., full WFA):

attributes.heuristic.strategy = wf_heuristic_none;

In this case, I am getting a -123585. I have checked using:

./bin/align_benchmark -a edit-wfa -i hl.seq -o hl.alg --wfa-score-only -v

Let me know if that is what you expect or if the score difference (123502-123585) is a bug, and I should patch it.

That decision -- which I honestly don't like -- was made because many people tend to compare the full-WFA against other methods based on heuristics (e.g., banded, drops, ...) and dismiss it. In many cases, people fail to understand that the default is an exact method that can take time significantly more time than a heuristic one. Revisiting this decision, I should put a warning on the README and, I don't know, maybe switching the default back to full-WFA.

PS. The alignment.

hl alg 001

lh3 added a commit to lh3/lv89 that referenced this issue Mar 23, 2022
@lh3
Copy link
Author

lh3 commented Mar 23, 2022

Thanks. This fixed the issue (lh3/lv89@7a740f7). 123585 is the correct edit distance for global alignment. 123502 is for extension alignment.

I mildly prefer to make the exact mode the default. The heuristic is more likely to fail on real data when we have long tandem duplications. One possible way is to add an explicit parameter to the main API and require the users to enable/disable heuristics:

wavefront_align(wavefront_aligner_t*, char*, int, char*, int heuristic_strategy);

PS: after a second thought, I feel adding an explicit parameter might not be a good idea because other parameters like type of alignment are more important.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants