Protein sequence wrongly identified as DNA #1574
Comments
Dear @gavinmdouglas, For file formats that do not allow metadata (e.g. FASTA), HyPhy uses a heuristic to guess which type of data it is. Basically it's a frequency-based heuristic, and your dataset seems to have a large number of
Not sure what is going on, but it looks like there may be a lot of motif repeats there, e.g. In any case, you can use an obscure
Best,
Thanks for clarifying, @spond. Yes, this does seem to be an unusual amino acid sequence, and I appreciate that problem-causing edge cases like this are very rare (~0.03% of the roughly 1.8 million bacterial gene alignments I have been processing). However, given that 'Q', for instance, is not a valid IUPAC nucleotide symbol but is an amino acid symbol, perhaps information like that could be used to improve the heuristic? Thanks, Gavin
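To make the suggestion concrete, here is a minimal Python sketch of a "disjoint character" check. It is not HyPhy's actual heuristic (which is frequency-based, as described above), and the function and constant names are hypothetical. It assumes the 20 standard amino acid letters and the IUPAC nucleotide codes, and flags an alignment as protein if any residue belongs only to the amino acid alphabet.

```python
# IUPAC nucleotide codes (including U and the ambiguity codes).
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")
# The 20 standard amino acid one-letter codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
# Letters that can only be amino acids: E, F, I, L, P, Q.
PROTEIN_ONLY = STANDARD_AMINO_ACIDS - IUPAC_NUCLEOTIDE


def looks_like_protein(sequences):
    """Return True if any sequence contains a protein-only letter.

    Hypothetical helper: gaps and other symbols are simply ignored.
    """
    for seq in sequences:
        if any(ch in PROTEIN_ONLY for ch in seq.upper()):
            return True
    return False
```

A check like this could only ever complement a frequency-based guess: an alignment with no protein-only letters is still ambiguous, so it can confirm protein data but cannot confirm nucleotide data.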
Dear @gavinmdouglas, That's a great suggestion! If you have more alignments that fail the heuristic, could you send them along? I'll see if adjusting it to use the information like you suggest (disjoint characters like Best,
Hey Sergei, Absolutely, you can see all of the alignments that failed due to this problem attached! All the best, Gavin
Hi there,
Thanks for making this extremely useful tool!
I ran into this error when running this command (following the workflow described here):
The input alignment must contain protein data in call to assert(alignments.AlphabetType(grnJDpsA.alphabet)==utility.getGlobalValue('terms.amino_acid'), error_msg);
I identified that this is because there is one line of my input protein MSA that is all gap characters. When this line is removed the command finishes correctly. I am also able to comment out that check in alignments.ReadProteinDataSet (temporarily) to avoid this issue.
I am using HYPHY 2.5.36(MP) for Linux on x86_64
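For anyone wanting to reproduce the workaround of removing the all-gap row, a rough Python sketch follows. The helper names and the set of gap characters are assumptions for illustration, not part of HyPhy; it parses FASTA records from an iterable of lines and keeps only records with at least one non-gap character.

```python
# Characters treated as gaps here; this set is an assumption.
GAP_CHARS = set("-.?")


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def drop_all_gap_records(records):
    """Keep records containing at least one non-gap character.

    Empty sequences are dropped as well.
    """
    return [(n, s) for n, s in records if any(ch not in GAP_CHARS for ch in s)]
```

If the protein MSA is paired with a nucleotide file, the same record would presumably need to be removed from both so that the two stay in sync.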
I have attached the two input files and the log output.
full_log.txt
group_2532.fna_nuc.fas.gz
group_2532.fna_protein.msa.gz
Thanks!
Gavin