I have a use case where I located a small "motif" in a protein sequence, and I'm interested in finding it again in the nucleotide sequence coding for the protein.
The sequence I was looking for, expressed as a regex, is the following, so let's use that as an example here (. matches any letter, as per standard regex syntax):
E.SM.YSDN
I would now like to be able to run seqkit grep with this pattern against not only protein sequences, but also nucleotide ones.
Using a genetic code table, I can do this by manually converting the pattern into a (DNA) nucleotide regex like this one (where [XY] is a character class allowing either X or Y in one position):
GA[AG]...AG[CT]ATG...TA[CT]AG[CT]GA[CT]AA[CT]
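The manual conversion above can be sketched in Python. This is only an illustration under stated assumptions, not an existing seqkit feature: it assumes the standard genetic code (NCBI translation table 1), supports only single-letter residues and the `.` wildcard, and the function names are hypothetical. Unlike the hand-written pattern, it also includes the `TCN` serine codons next to `AG[CT]`.

```python
import re
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1); codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid letter to the list of codons that encode it.
CODONS = {}
for (b1, b2, b3), aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append(b1 + b2 + b3)


def residue_to_regex(aa):
    """Return a regex fragment matching every codon of one amino acid."""
    # Group codons by their first two bases so the third can be a class.
    by_prefix = {}
    for codon in CODONS[aa]:
        by_prefix.setdefault(codon[:2], []).append(codon[2])
    parts = []
    for prefix, thirds in by_prefix.items():
        if len(thirds) == 4:
            parts.append(prefix + ".")
        elif len(thirds) == 1:
            parts.append(prefix + thirds[0])
        else:
            parts.append(prefix + "[" + "".join(sorted(thirds)) + "]")
    if len(parts) == 1:
        return parts[0]
    return "(?:" + "|".join(parts) + ")"


def protein_to_nucleotide_regex(pattern):
    """Reverse translate a simple protein pattern (letters and '.' only)."""
    out = []
    for ch in pattern:
        if ch == ".":
            out.append("...")  # any amino acid -> any three bases
        elif ch in CODONS and ch != "*":
            out.append(residue_to_regex(ch))
        else:
            raise ValueError(f"unsupported pattern character: {ch!r}")
    return "".join(out)


if __name__ == "__main__":
    print(protein_to_nucleotide_regex("E.SM.YSDN"))
    # GA[AG]...(?:TC.|AG[CT])ATG...TA[CT](?:TC.|AG[CT])GA[CT]AA[CT]
```

The resulting pattern matches in any reading frame and only on the strand it is applied to, so hits would still need to be checked against the actual coding frame.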
Now, it would be useful not to have to do this translation manually, but rather to be able to do something similar to:
seqkit grep --by-seq -r --protein-to-nucleotide -p "E.SM.YSDN" nucleotide_sequences.fa
Of course, something similar could be done using degenerate amino acids / bases too, if that is preferred over regular expressions.
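For the degenerate-base alternative, a comparable sketch could collapse each amino acid into one IUPAC-coded codon. The assumptions are the same as for any such illustration: the standard genetic code (NCBI table 1), `.` as the only wildcard, and hypothetical function names. Collapsing per codon position over-matches for amino acids with six codons: serine becomes `WSN`, which also covers threonine codons (`ACN`), for example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1); codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}
for (b1, b2, b3), aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append(b1 + b2 + b3)

# IUPAC nucleotide codes, keyed by the sorted set of bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def residue_to_degenerate(aa):
    """Collapse all codons of one amino acid into a single IUPAC codon."""
    codons = CODONS[aa]
    return "".join(
        IUPAC["".join(sorted({codon[i] for codon in codons}))]
        for i in range(3)
    )


def protein_to_degenerate(pattern):
    """Reverse translate a simple protein pattern into degenerate bases."""
    out = []
    for ch in pattern:
        if ch == ".":
            out.append("NNN")  # any amino acid -> any three bases
        elif ch in CODONS and ch != "*":
            out.append(residue_to_degenerate(ch))
        else:
            raise ValueError(f"unsupported pattern character: {ch!r}")
    return "".join(out)


if __name__ == "__main__":
    print(protein_to_degenerate("E.SM.YSDN"))
    # GARNNNWSNATGNNNTAYWSNGAYAAY
```

The regex form is therefore the more specific of the two; the degenerate form is shorter but returns extra candidates that need filtering.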
samuell changed the title from "[feature suggestion]Reverse translate protein search expression into nucleotide regex or degenerate base sequence" to "[feature suggestion] Reverse translate protein search expression into nucleotide regex or degenerate base sequence" on Oct 4, 2023
That could be achieved, but wouldn't tblastn be simpler and faster?
Perhaps! In my own quick test, it seemed that I needed to put my query sequence into a file before running it, but perhaps there is an easier way to do this.