How to keep Protein ID when retrieving coding sequences #107
Comments
Hi @santiagoha, I don't think there is any way to do this on the NCBI end (nucleotide records have nucleotide IDs). You can probably cook something up to replace the IDs yourself, along these lines:
fetch_cds <- function(prot_acc){
    # Find the protein record by accession
    search1 <- entrez_search(db="protein", term=paste0(prot_acc, "[Accn]"))
    # Follow the protein -> nucleotide links
    links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
    # Fetch the first linked mRNA record as FASTA
    rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])
    # Swap the XM_ accession (with any version number) for the protein accession
    sub("XM_\\d+\\.\\d+", prot_acc, rec)
}
cat(substr(fetch_cds("XP_012370245"), 1, 500), "\n")
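The header replacement step in the function above can be tried offline, without any calls to NCBI. This is only a sketch: `swap_header` is a hypothetical helper name, and the FASTA text below is a made-up placeholder rather than the real record.

```r
# Hypothetical helper: replace the leading XM_ accession in a FASTA
# header with the protein accession, leaving the rest of the record alone.
swap_header <- function(fasta, prot_acc) {
  sub("^>XM_\\d+\\.\\d+", paste0(">", prot_acc), fasta)
}

# Placeholder record (the description and sequence are invented for illustration)
example_fasta <- ">XM_012514791.1 PREDICTED: placeholder description\nATGGCCAAG\n"

renamed <- swap_header(example_fasta, "XP_012370245")
cat(renamed)
```

Anchoring the pattern at `^>` limits the substitution to the header line, so an accession-like string elsewhere in the record would be left untouched.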
Hi @dwinter, thank you so much for the hint! I will try to apply it now for a large number of IDs.
Hi @dwinter, this may be a silly question, but as you suggested earlier I am trying to check that the CDS actually exists, and if not, the function should break. I modified your
When I apply it to an existing CDS, it works fine. However, if I apply it to a protein ID that doesn't have a CDS, it returns the following error:
I have tried to solve it in different ways but I always get the same error. I understand that the error is telling me that I'm trying to access an element that is outside the limits of a list/array, but I don't understand which list/array is causing the error. Am I missing something? Thank you very much for your help.
Hey @santiagoha, can you provide me with an ID that throws this error? Are you sure there is a protein with this accession? A couple of general debugging pointers.
Hi @dwinter, I just checked in NCBI and apparently the record for this accession (XP_004623289.1) was removed: As you suggested, I previously broke the function and tried it with this ID: XP_004623289.1. The
So, it doesn't have the protein_nuccore_mrna database. And, actually, if I try to call this database it is NULL:
This is why I used this as the conditional statement to break the function. I used
But I still don't understand what is causing the error, is the
Hey @santiagoha, there are no IDs in the search result:
entrez_search(db="protein", term="XP_004623289.1[Accn]")
Entrez search result with 0 hits (object contains 0 IDs and no web_history object)
Search term (as translated): (XP_004623289.1[Accn])
As a result, elink is failing. We should probably check for non-zero length ID vectors within rentrez, but for now you probably want to catch these before you go hunting for links.
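One way to catch empty search results before asking for links might look like the sketch below. It assumes the rentrez package is installed; `fetch_cds_checked` is a hypothetical name, and returning `NA` (rather than stopping with an error) is a design choice made here for illustration, not something rentrez does for you.

```r
# Hypothetical variant of fetch_cds that bails out early when there is
# nothing to link or fetch. Uses rentrez:: prefixes, so no library() call is needed.
fetch_cds_checked <- function(prot_acc) {
  search1 <- rentrez::entrez_search(db = "protein",
                                    term = paste0(prot_acc, "[Accn]"))
  # No protein record for this accession: skip the elink call entirely
  if (length(search1$ids) == 0) {
    return(NA_character_)
  }
  links <- rentrez::entrez_link(dbfrom = "protein", db = "nuccore",
                                id = search1$ids)
  mrna_ids <- links$links$protein_nuccore_mrna
  # Protein exists but has no linked mRNA record
  if (is.null(mrna_ids)) {
    return(NA_character_)
  }
  rec <- rentrez::entrez_fetch(db = "nuccore", rettype = "fasta",
                               id = mrna_ids[1])
  # Swap the XM_ accession for the protein accession in the header
  sub("XM_\\d+\\.\\d+", prot_acc, rec)
}

# Needs network access, so it is left commented out here:
# fetch_cds_checked("XP_004623289.1")
```

Returning `NA` keeps the function usable inside a loop over many accessions; calling `stop()` with a descriptive message at the same two points would be the alternative if a hard failure is preferred.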
Hi @dwinter, I didn't notice that there were no IDs!
So, basically, it searches if there is any id for the corresponding Protein ID, and I added For example, for the list of IDs: "XP_004633320.1" "XP_004623289.1" "XP_004626331.1", the second Protein ID doesn't have a corresponding id with the
And the output looks as follows:
I don't know if this is the most efficient approach but it works. I hope this can be useful, thanks again for your help!
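For a list of IDs like the one above, a generic wrapper that records failures instead of stopping is one possible pattern. This is a sketch, not the code posted in this thread: `fetch_many` and `stub_fetch` are hypothetical names, and the stub stands in for a real fetching function (such as `fetch_cds`) so the example runs without contacting NCBI.

```r
# Hypothetical wrapper: apply a fetching function to each ID and turn
# any error into NA, so one missing record does not abort the whole run.
fetch_many <- function(prot_ids, fetch_one) {
  vapply(prot_ids, function(id) {
    tryCatch(fetch_one(id), error = function(e) NA_character_)
  }, character(1))
}

# Stub fetcher for illustration only: fails for the ID with no record,
# returns a placeholder FASTA string otherwise.
stub_fetch <- function(id) {
  if (id == "XP_004623289.1") stop("no CDS found")
  paste0(">", id, "\nATG\n")
}

ids <- c("XP_004633320.1", "XP_004623289.1", "XP_004626331.1")
cds <- fetch_many(ids, stub_fetch)

# IDs for which nothing could be retrieved
missing_ids <- names(cds)[is.na(cds)]
print(missing_ids)
```

Because `vapply` names its result after the input character vector, the IDs that failed can be read straight off the names of the `NA` entries.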
Closing now, feel free to submit more issues if you run into anything else.
I have a bunch of protein IDs and I need to retrieve the corresponding coding sequences (CDSs). I have managed to retrieve the CDSs but the names of each sequence change from XP* to XM*, and I need to retain the XP* header for each sequence.
Basically it looks like this:
And the output looks like this:
Is there a way to keep the protein id (XP_012370245) instead of the nucleotide id (XM_012514791.1)? Something like:
Any suggestion is very welcome, thanks!