Protein sequence wrongly identified as DNA #1574

gavinmdouglas · 2023-02-28T23:12:45Z

Hi there,

Thanks for making this extremely useful tool!

I ran into this error when running this command (following the workflow described here):

hyphy hyphy-analyses/codon-msa/post-msa.bf --protein-msa group_2532.fna_protein.msa \
                                                                                 --nucleotide-sequences group_2532.fna_nuc.fas \
                                                                                 --output test \
                                                                                 &> full_log.txt

The input alignment must contain protein data in call to assert(alignments.AlphabetType(grnJDpsA.alphabet)==utility.getGlobalValue('terms.amino_acid'), error_msg);

I identified that this is because there is one line of my input protein MSA that is all gap characters. When this line is removed the command finishes correctly. I am also able to comment out that check in alignments.ReadProteinDataSet (temporarily) to avoid this issue.

I am using HYPHY 2.5.36(MP) for Linux on x86_64

I have attached the two input files and the log output.

full_log.txt
group_2532.fna_nuc.fas.gz
group_2532.fna_protein.msa.gz

Thanks!

Gavin

The text was updated successfully, but these errors were encountered:

spond · 2023-03-02T16:04:22Z

Dear @gavinmdouglas,

For file formats that do not allow metadata (e.g. FASTA), HyPhy uses a heuristic to guess which type of data it is. Basically it's a frequency-based heuristic, and your dataset seems to have a large number of ACGT amino-acids, so the heuristic breaks down.

A = 0.1047103051746964
C = 0.009619637328615606
D = 0.03409258440218512
E = 0.01831785345717487
F = 0.02355152587350925
G = 0.1734114698510993
H = 0.0119047619047622
I = 0.05900781365177676
K = 0.01337903582485723
L = 0.06999115435647334
M = 0.01337903582485723
N = 0.04072681704260974
P = 0.07795223352498308
Q = 0.02333038478549588
R = 0.01581158779301263
S = 0.05576441102757022
T = 0.1551304732419257
V = 0.06247235736399598
W = 0.009619637328615606
Y = 0.02782692024178385

Not sure what is going on, but it looks like there may be a lot of motif repeats there, e.g. PTGIT

In any case, you can use an obscure HyPhy tag to force it to read the FASTA file as protein sequences. Just add the following line to the top of your FASTA file: $BASESET :BASE20

hyphy /Users/sergei/Development/hyphy-analyses/codon-msa/post-msa.bf --protein-msa /Users/sergei/Downloads/group_2532.fna_protein.msa  --nucleotide-sequences /Users/sergei/Downloads/group_2532.fna_nuc.fas  --output  /Users/sergei/Downloads/group_2532.fna_codon.msa  
compress: Yes
code: Universal

Analysis Description
--------------------
 Map a protein MSA back onto nucleotide sequences 

- __Requirements__: A protein MSA and the corresponding nucleotide alignment

- __Citation__: TBD

- __Written by__: Sergei L Kosakovsky Pond

- __Contact Information__: spond@temple.edu

- __Analysis Version__: 0.01

Load the protein MSA
Load the unaligned in-frame sequences
[UNIQUE SEQUENCES] Retained 10 unique  sequences

Best,
Sergei

group_2532.fna_codon.msa.zip

gavinmdouglas · 2023-03-02T16:29:11Z

Thanks for clarifying, @spond. Yes it seems like this is an unusual amino acid sequence, and I can appreciate that edge cases like this that cause problems are very rare (~0.03% of bacterial gene alignments I have been processing of around 1.8 million).

However, given that 'Q' for instance is not a valid IUPAC nucleotide symbol, but is an amino acid symbol, perhaps information like that could be used to improve the heuristic?

Thanks,

Gavin

spond · 2023-03-02T16:34:00Z

Dear @gavinmdouglas,

That's a great suggestion! If you have more alignments that fail the heuristic, could you send them along? I'll see if adjusting it to use the information like you suggest (disjoint characters like Q and I) will improve auto-detection.

Best,
Sergei

gavinmdouglas · 2023-03-02T20:11:11Z

Hey Sergei,

Absolutely, you can see all of the alignments that failed due to this problem attached!

All the best,

Gavin

failed_alignments.tar.gz

gavinmdouglas closed this as completed Mar 2, 2023

spond mentioned this issue Apr 14, 2023

Issue with CleanStopCodons.bf: Incompatible matrix dimensions in call to CheckDimension: 1x64 and 8000x1 #1590

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Protein sequence wrongly identified as DNA #1574

Protein sequence wrongly identified as DNA #1574

gavinmdouglas commented Feb 28, 2023

spond commented Mar 2, 2023

gavinmdouglas commented Mar 2, 2023

spond commented Mar 2, 2023

gavinmdouglas commented Mar 2, 2023

Protein sequence wrongly identified as DNA #1574

Protein sequence wrongly identified as DNA #1574

Comments

gavinmdouglas commented Feb 28, 2023

spond commented Mar 2, 2023

gavinmdouglas commented Mar 2, 2023

spond commented Mar 2, 2023

gavinmdouglas commented Mar 2, 2023