Protein sequence wrongly identified as DNA #1574

Closed
gavinmdouglas opened this issue Feb 28, 2023 · 4 comments

Comments

@gavinmdouglas

Hi there,

Thanks for making this extremely useful tool!

I ran into this error when running this command (following the workflow described here):

hyphy hyphy-analyses/codon-msa/post-msa.bf --protein-msa group_2532.fna_protein.msa \
    --nucleotide-sequences group_2532.fna_nuc.fas \
    --output test \
    &> full_log.txt

The input alignment must contain protein data in call to assert(alignments.AlphabetType(grnJDpsA.alphabet)==utility.getGlobalValue('terms.amino_acid'), error_msg);

I identified that this is because there is one line of my input protein MSA that is all gap characters. When this line is removed the command finishes correctly. I am also able to comment out that check in alignments.ReadProteinDataSet (temporarily) to avoid this issue.
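
A quick way to locate such lines before running the analysis is to scan the alignment for sequence lines made up only of gap characters. The following is a minimal Python sketch, not part of HyPhy; the function name `all_gap_lines` and the assumption that `-` is the only gap character are illustrative choices.

```python
def all_gap_lines(fasta_text, gap_chars="-"):
    """Return (line_number, record_name) for each sequence line that is only gaps.

    Assumes FASTA-style input where record names start with '>' and that the
    characters in gap_chars are the only gap symbols (an assumption).
    """
    hits = []
    current = None
    for number, line in enumerate(fasta_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(">"):
            current = stripped[1:]          # remember which record we are in
        elif stripped and set(stripped) <= set(gap_chars):
            hits.append((number, current))  # non-empty line made only of gaps
    return hits


# Toy alignment (made-up data, not the attached files)
example = ">seq1\nMKTAPTGIT\n>seq2\n---------\n>seq3\nMKT-PTGIT\n"
print(all_gap_lines(example))
```

On a real file, the text could be read with `open(path).read()` and the reported lines inspected or removed before calling `post-msa.bf`.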

I am using HYPHY 2.5.36(MP) for Linux on x86_64

I have attached the two input files and the log output.

full_log.txt
group_2532.fna_nuc.fas.gz
group_2532.fna_protein.msa.gz

Thanks!

Gavin

@spond
Member

spond commented Mar 2, 2023

Dear @gavinmdouglas,

For file formats that do not allow metadata (e.g. FASTA), HyPhy uses a frequency-based heuristic to guess the data type. Your dataset has an unusually large share of A, C, G, and T residues (about 44% combined, per the frequencies below), so the heuristic breaks down and the file is read as nucleotide data.

A = 0.1047103051746964
C = 0.009619637328615606
D = 0.03409258440218512
E = 0.01831785345717487
F = 0.02355152587350925
G = 0.1734114698510993
H = 0.0119047619047622
I = 0.05900781365177676
K = 0.01337903582485723
L = 0.06999115435647334
M = 0.01337903582485723
N = 0.04072681704260974
P = 0.07795223352498308
Q = 0.02333038478549588
R = 0.01581158779301263
S = 0.05576441102757022
T = 0.1551304732419257
V = 0.06247235736399598
W = 0.009619637328615606
Y = 0.02782692024178385
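
To check whether other alignments are likely to hit the same problem, the per-residue frequencies and the combined A/C/G/T share can be computed directly. This is a rough Python sketch; it does not reproduce HyPhy's actual heuristic or its threshold (neither is stated here), and the helper names are hypothetical.

```python
from collections import Counter


def residue_frequencies(sequences, gap_chars="-"):
    """Relative frequency of each non-gap character across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch not in gap_chars)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in sorted(counts.items())}


def acgt_fraction(freqs):
    """Combined frequency of the letters that are also nucleotide symbols."""
    return sum(freqs.get(ch, 0.0) for ch in "ACGT")


# The four values reported above for this alignment
reported = {
    "A": 0.1047103051746964,
    "C": 0.009619637328615606,
    "G": 0.1734114698510993,
    "T": 0.1551304732419257,
}
print(round(acgt_fraction(reported), 4))  # combined A+C+G+T share

# Toy input (made-up sequences) to show the full path from sequences to a share
toy = residue_frequencies(["PTGITPTGIT", "AG-TT"])
print(round(acgt_fraction(toy), 4))
```

Alignments with a high combined share would be candidates for the explicit protein tag rather than auto-detection.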

Not sure what is going on, but it looks like there may be a lot of motif repeats there, e.g. PTGIT


In any case, you can use an obscure HyPhy tag to force it to read the FASTA file as protein sequences. Just add the following line to the top of your FASTA file: $BASESET :BASE20
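
When many files need the tag, adding it by hand gets tedious. The sketch below prepends the line to FASTA text and skips text that already starts with it. The tag string is copied verbatim from the sentence above; the function names and the choice to write a tagged copy rather than edit in place are assumptions for illustration.

```python
from pathlib import Path

# Tag copied verbatim from the comment above
PROTEIN_TAG = "$BASESET :BASE20"


def with_protein_tag(fasta_text, tag=PROTEIN_TAG):
    """Return fasta_text with the tag as its first line (no-op if already there)."""
    lines = fasta_text.splitlines()
    if lines and lines[0].strip() == tag:
        return fasta_text
    return tag + "\n" + fasta_text


def tag_file(src, dst):
    """Write a tagged copy of src to dst, leaving the original file untouched."""
    Path(dst).write_text(with_protein_tag(Path(src).read_text()))


# Toy record (made-up data)
tagged = with_protein_tag(">seq1\nMKTAPTGIT\n")
print(tagged)
```

Writing to a separate output path keeps the untagged alignment available for tools that do not understand the tag.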


hyphy /Users/sergei/Development/hyphy-analyses/codon-msa/post-msa.bf --protein-msa /Users/sergei/Downloads/group_2532.fna_protein.msa --nucleotide-sequences /Users/sergei/Downloads/group_2532.fna_nuc.fas --output /Users/sergei/Downloads/group_2532.fna_codon.msa
compress: Yes
code: Universal

Analysis Description
--------------------
 Map a protein MSA back onto nucleotide sequences 

- __Requirements__: A protein MSA and the corresponding nucleotide alignment

- __Citation__: TBD

- __Written by__: Sergei L Kosakovsky Pond

- __Contact Information__: spond@temple.edu

- __Analysis Version__: 0.01

Load the protein MSA
Load the unaligned in-frame sequences
[UNIQUE SEQUENCES] Retained 10 unique  sequences

Best,
Sergei

group_2532.fna_codon.msa.zip

@gavinmdouglas
Author

Thanks for clarifying, @spond. Yes, this does seem to be an unusual amino acid sequence, and I appreciate that edge cases like this are very rare (~0.03% of the roughly 1.8 million bacterial gene alignments I have been processing).

However, given that 'Q' for instance is not a valid IUPAC nucleotide symbol, but is an amino acid symbol, perhaps information like that could be used to improve the heuristic?
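
To make the idea concrete: of the 20 standard amino acid letters, six (E, F, I, L, P, Q) are not IUPAC nucleotide codes, so seeing any of them rules out nucleotide data. The Python sketch below only illustrates that check; it is not HyPhy's implementation, and the function name is hypothetical.

```python
# IUPAC nucleotide codes, including U and the ambiguity symbols
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")
# The 20 standard amino acid one-letter codes
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
# Letters that can only come from a protein sequence
PROTEIN_ONLY = AMINO_ACIDS - IUPAC_NUCLEOTIDE


def looks_like_protein(sequences):
    """True if any sequence contains a letter that is not a nucleotide code."""
    return any(ch in PROTEIN_ONLY for seq in sequences for ch in seq.upper())


print(sorted(PROTEIN_ONLY))
print(looks_like_protein(["PTGITPTGIT"]))  # contains P and I
print(looks_like_protein(["ACGTNNRYACGT"]))  # only nucleotide codes
```

A check like this could run before any frequency-based test, falling back to frequencies only when every character is also a valid nucleotide code.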

Thanks,

Gavin

@spond
Member

spond commented Mar 2, 2023

Dear @gavinmdouglas,

That's a great suggestion! If you have more alignments that fail the heuristic, could you send them along? I'll see whether adjusting it to use the kind of information you suggest (characters such as Q and I that occur only in the amino acid alphabet) improves auto-detection.

Best,
Sergei

@gavinmdouglas
Author

Hey Sergei,

Absolutely, you can see all of the alignments that failed due to this problem attached!

All the best,

Gavin

failed_alignments.tar.gz
