[feature suggestion] Reverse translate protein search expression into nucleotide regex or degenerate base sequence #418

Open
3 tasks done
samuell opened this issue Oct 4, 2023 · 2 comments

@samuell
Contributor

samuell commented Oct 4, 2023

Prerequisites

  • make sure you're using the latest version by seqkit version
  • read the usage

Describe your issue

  • describe the problem
  • [-] provide a reproducible example

I have a use case where I located a small "motif" in a protein sequence and am interested in finding it again in the nucleotide sequence coding for the protein.

The motif I was looking for, expressed as a regex, is the following, so let's use it as an example here (. matches any letter, as per standard regex syntax):

E.SM.YSDN

I would now like to be able to run seqkit grep with this pattern against not only protein sequences, but also nucleotide ones.

Using a genetic code table, I can do this by manually converting the pattern into a (DNA) nucleotide regex like this one (where [XY] is a character class matching either X or Y in one position):

GA[AG]...AG[CT]ATG...TA[CT]AG[CT]GA[CT]AA[CT]
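As a rough illustration of this manual conversion, here is a minimal Python sketch that maps each residue of the protein pattern to a regex over its codons. It assumes the standard genetic code and a DNA alphabet; the names `CODON_REGEX` and `protein_to_nucleotide_regex` are hypothetical and not part of seqkit. Amino acids with six codons (L, R, S) need an alternation, so the output for S is broader than the AG[CT] written by hand above.

```python
# Hypothetical helper, not part of seqkit: reverse translate a simple protein
# pattern (amino acid letters plus "." wildcards) into a DNA regex.
# Assumption: standard genetic code (NCBI translation table 1).
CODON_REGEX = {
    "A": "GC.",  "R": "(?:CG.|AG[AG])", "N": "AA[CT]", "D": "GA[CT]",
    "C": "TG[CT]", "Q": "CA[AG]", "E": "GA[AG]", "G": "GG.",
    "H": "CA[CT]", "I": "AT[ACT]", "L": "(?:CT.|TT[AG])", "K": "AA[AG]",
    "M": "ATG", "F": "TT[CT]", "P": "CC.", "S": "(?:TC.|AG[CT])",
    "T": "AC.", "W": "TGG", "Y": "TA[CT]", "V": "GT.",
    "*": "(?:TA[AG]|TGA)",  # stop codons
}


def protein_to_nucleotide_regex(pattern: str) -> str:
    """Return a DNA regex matching any coding sequence for `pattern`.

    Only plain residues and "." are handled; other regex syntax
    (character classes, quantifiers) is rejected in this sketch.
    """
    parts = []
    for symbol in pattern.upper():
        if symbol == ".":
            parts.append("...")  # any amino acid -> any three bases
        elif symbol in CODON_REGEX:
            parts.append(CODON_REGEX[symbol])
        else:
            raise ValueError(f"unsupported symbol in pattern: {symbol!r}")
    return "".join(parts)


nucleotide_regex = protein_to_nucleotide_regex("E.SM.YSDN")
print(nucleotide_regex)
# GA[AG]...(?:TC.|AG[CT])ATG...TA[CT](?:TC.|AG[CT])GA[CT]AA[CT]
```

The printed pattern could then be passed to seqkit grep --by-seq -r -p by hand, assuming the regex engine accepts non-capturing groups.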

Now, it would be useful to not need to do this translation manually, but rather be able to do something similar to:

seqkit grep --by-seq -r --protein-to-nucleotide -p "E.SM.YSDN" nucleotide_sequences.fa

Of course, something similar could be done using degenerate amino acids / bases too, if that is preferred over regular expressions.
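For the degenerate-base variant, a comparable sketch might emit IUPAC codes instead of regex character classes. It again assumes the standard genetic code, and `DEGENERATE_CODONS` and `protein_to_degenerate` are hypothetical names. Since a single degenerate codon cannot cover the six codons of L, R or S exactly, this version returns one degenerate sequence per combination rather than a single string.

```python
from itertools import product

# Hypothetical helper, not part of seqkit. IUPAC codes used here:
# R = A/G, Y = C/T, H = A/C/T, N = any base.
# Assumption: standard genetic code (NCBI translation table 1).
DEGENERATE_CODONS = {
    "A": ["GCN"], "R": ["CGN", "AGR"], "N": ["AAY"], "D": ["GAY"],
    "C": ["TGY"], "Q": ["CAR"], "E": ["GAR"], "G": ["GGN"],
    "H": ["CAY"], "I": ["ATH"], "L": ["CTN", "TTR"], "K": ["AAR"],
    "M": ["ATG"], "F": ["TTY"], "P": ["CCN"], "S": ["TCN", "AGY"],
    "T": ["ACN"], "W": ["TGG"], "Y": ["TAY"], "V": ["GTN"],
    "*": ["TAR", "TGA"],  # stop codons
    ".": ["NNN"],         # wildcard residue -> any three bases
}


def protein_to_degenerate(pattern: str) -> list[str]:
    """Return all degenerate DNA sequences that together encode `pattern`."""
    options = []
    for symbol in pattern.upper():
        if symbol not in DEGENERATE_CODONS:
            raise ValueError(f"unsupported symbol in pattern: {symbol!r}")
        options.append(DEGENERATE_CODONS[symbol])
    # One output sequence per combination of alternative codon families.
    return ["".join(combo) for combo in product(*options)]


for seq in protein_to_degenerate("E.SM.YSDN"):
    print(seq)
# GARNNNTCNATGNNNTAYTCNGAYAAY
# GARNNNTCNATGNNNTAYAGYGAYAAY
# GARNNNAGYATGNNNTAYTCNGAYAAY
# GARNNNAGYATGNNNTAYAGYGAYAAY
```

The number of sequences doubles with every L, R, S or stop in the pattern, which is one reason a regex may be the more compact representation for longer motifs.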

@samuell samuell changed the title [feature suggestion]Reverse translate protein search expression into nucleotide regex or degenerate base sequence [feature suggestion] Reverse translate protein search expression into nucleotide regex or degenerate base sequence Oct 4, 2023
@shenwei356
Owner

That would be achieved, but is tblastn simpler and faster?

@samuell
Contributor Author

samuell commented Oct 10, 2023

> That would be achieved, but is tblastn simpler and faster?

Perhaps! In my own quick test, it seemed that I needed to put my query sequence into a file before running it, but perhaps there is an easier way to do this.

I can explore this option a little more.
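One possible way to avoid the temporary query file, sketched in Python: build the FASTA record in memory and pipe it to tblastn on standard input. This assumes BLAST+ is installed and that `-query -` reads the query from stdin (worth checking against tblastn -help); the function names are hypothetical. Wildcard positions are written as X, and tblastn reports alignments rather than exact pattern matches, so results may differ from a regex search, and very short queries may yield no hits under default settings.

```python
import subprocess


def build_tblastn_query(motif: str, name: str = "motif") -> str:
    """Return a FASTA record for `motif`, with "." wildcards written as X."""
    residues = motif.upper().replace(".", "X")
    return f">{name}\n{residues}\n"


def run_tblastn(motif: str, subject_fasta: str) -> str:
    """Run tblastn with the query on stdin and return tabular output.

    Assumptions: `tblastn` is on PATH and accepts "-query -" for stdin.
    """
    result = subprocess.run(
        ["tblastn", "-query", "-", "-subject", subject_fasta, "-outfmt", "6"],
        input=build_tblastn_query(motif),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tblastn exits non-zero
    )
    return result.stdout


# Example call (requires BLAST+ and the FASTA file from the example above):
# print(run_tblastn("E.SM.YSDN", "nucleotide_sequences.fa"))
```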
