Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to keep Protein ID when retrieving coding sequences #107

Closed
santiagoha opened this issue May 25, 2017 · 8 comments
Closed

How to keep Protein ID when retrieving coding sequences #107

santiagoha opened this issue May 25, 2017 · 8 comments

Comments

@santiagoha
Copy link

I have a bunch of protein IDs and I need to retrieve the corresponding coding sequences (CDSs). I have managed to retrieve the CDSs but the names of each sequence change from XP* to XM*, and I need to retain the XP* header for each sequence.

Basically it looks like this:

library(renters)
search1 <- entrez_search(db="protein", term="XP_012370245[Accn]")
links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])

And the output looks like this:

> rec
[1] ">XM_012514791.1 PREDICTED: Octodon degus Hermansky-Pudlak syndrome 6 (Hps6), mRNA
\nATGCAGAGGAAAAACTTTATGTCATTCTTCACAGGCTTCCTGGAGAAGCGGGCCTGGCCGGAGGCCCGCG\nCCGCGCCGCTGGACGCCTTCTTCCTGGCGTGGCCGGCGCAGCCCGCGGTGGCGCTGGTGTGGGAGAGCGG (...)

Is there a way to keep the protein id (XP_012370245) instead of the nucleotide id (XM_012514791.1)? Something like:

> rec
[1] ">XP_012370245
 \nATGCAGAGGAAAAACTTTATGTCATTCTTCACAGGCTTCCTGGAGAAGCGGGCCTGGCCGGAGGCCCGCG\nCCGCGCCGCTGGACGCCTTCTTCCTGGCGTGGCCGGCGCAGCCCGCGGTGGCGCTGGTGTGGGAGAGCGG (...)

Any suggestion is very welcome, thanks!

@dwinter
Copy link
Member

dwinter commented May 25, 2017

Hi @santiagoha,

I don't think there is any way to do this on the NCBI end (nucleotide records have nucleotide IDs).

You can probably cook something up to replace the IDs through stringr or the base string manipulation functions. This works for your example (though you probably want to add some checks to make sure you are actually findings CDS sequences etc):

fetch_cds <- function(prot_acc){
    search1 <- entrez_search(db="protein", term=paste0(prot_acc, "[Accn]"))
    links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
    rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])
    sub("XM_\\d+\\.\\d", prot_acc, rec)    
}

cat(substr(fetch_cds("XP_012370245"), 1, 500), "\n")
>XP_012370245 PREDICTED: Octodon degus Hermansky-Pudlak syndrome 6 (Hps6), mRNA
ATGCAGAGGAAAAACTTTATGTCATTCTTCACAGGCTTCCTGGAGAAGCGGGCCTGGCCGGAGGCCCGCG
CCGCGCCGCTGGACGCCTTCTTCCTGGCGTGGCCGGCGCAGCCCGCGGTGGCGCTGGTGTGGGAGAGCGG
CCTGGCGGAGCTCTGGGGTGCCGACCTGGGGTCCGCCTGGAGGCGGCTTCACGCCACCGAACTGTGTCCG
CGCGGCGCAGTCCGCGTGGTGGCAGCGGTGGCGCCGCGGGGCCGCCTGGTGTGGTGCGAGGAGCGCCCGG
GCGCGGGCGGACGCCGCGTGTGCGTCCGCACCCTAGAGCCTGGCGGCGAGACTGGTGCCCGCCTGGGCCG
CACGCACGTCCTGCTGCACCACTGCCCGCCCTTCCACCTGCTGGCCTCGCGCAAGGACGTCTTCC

@santiagoha
Copy link
Author

Hi @dwinter, thank you so much for the hint! I will try to apply it now for a large number of IDs

@santiagoha
Copy link
Author

Hi @dwinter, this may be a silly question, but as you suggested earlier I am trying to check that the CDS actually exists, and if not, the function should break. I modified your fetch_cds function like this:

fetch_cds <- function(prot_acc){
    search1 <- entrez_search(db="protein", term=paste0(prot_acc, "[Accn]"))
    links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
    if (is.null(links$links$protein_nuccore_mrna[1]))
    {
    	stop("No CDS found") 
    }
    else 
    {
    	rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])
    	sub("XM_\\d+\\.\\d", prot_acc, rec)
    }  	  
}

When I apply it to an extant CDS, it works fine, however if I apply it to a protein ID that don't have the CDS it returns the following error:

fetch_cds("XP_004623289.1")
Error in res[[1]] : subscript out of bounds

I have try to solve it in different ways but I always get the same error. I understand that the error is telling that I'm trying to access some element that is outside the limits of the list/array, however, I don't understand which list/array is causing the error. Am I missing something? Thank you very much for help.

@dwinter
Copy link
Member

dwinter commented May 26, 2017

Hey @santiagoha, can you provide me with an ID that throws this error. Are you sure ther is a protein with this accession?

A couple of general debugging pointers.

  1. You can use traceback() to get the particular series of calls that gives an error (whch could give you a clue as to what's gong on.
  2. It's always a good idea to break the function down and run each line outside of the function scope while using the same inputs that are producing errors. That way you can see exactly when the errors occur

@santiagoha
Copy link
Author

Hi @dwinter, I just checked in NCBI and apparently the record for this accession (XP_004623289.1) was removed:
https://www.ncbi.nlm.nih.gov/protein/XP_004623289.1?report=genpept
Maybe this is why it doesn't has a corresponding CDS?

As you suggested, I previously broke the function and tried it with this ID: XP_004623289.1. The entrez_search and entrez_link functions worked fine. I then checked what was inside links:

links$links
elink result with information from 1 databases:
[1] protein_nuccore

So, it doesn't have the protein_nuccore_mrna database. And, actually, if I try to call this database it is NULL:

links$links$protein_nuccore_mrna[1]
NULL

This is why I used this a the conditional statement to break the function.

I used traceback() and this is what I'm getting:

traceback()
3: parse_elink(record, cmd = cmd, by_id = by_id)
2: entrez_link(dbfrom = "protein", db = "nuccore", id = search1$ids) at #3
1: fetch_cds("XP_004623289.1")

But I still don't understand what is causing error, is the parse_elinkfunction? Thank you very much for your time and advises!

@dwinter
Copy link
Member

dwinter commented May 26, 2017

Hey @santiagoha ,

There are no IDs in the search result

entrez_search(db="protein", term="XP_004623289.1[Accn]")
Entrez search result with 0 hits (object contains 0 IDs and no web_history object)
 Search term (as translated):  (XP_004623289.1[Accn])

As a result, elink is failing. We should probably check for non-zero length ID vectors within rentrez, but for now you probably want to catch these before you go hunting for links.

@santiagoha
Copy link
Author

santiagoha commented May 26, 2017

Hi @dwinter, I didn't noticed that there were no IDs!
I modified the function as follows:

fetch_cds <- function(prot_acc){
    search1 <- entrez_search(db="protein", term=paste0(prot_acc, "[Accn]"))
    tryCatch({
    	if (search1$count == 0) stop("No CDS found")
    	else 
    	{
    	links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
    	rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])
    	sub("XM_\\d+\\.\\d", prot_acc, rec)
    	} 
    }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")}) 
}

So, basically, it searches if there is any id for the corresponding Protein ID, and I added tryCatch to avoid the function to break. This is useful if you want to iterate the function over a list of IDs.

For example, for the list of IDs: "XP_004633320.1" "XP_004623289.1" "XP_004626331.1", the second Protein ID don't have a corresponding id with the entrez_searchfunction, but the other two are extant ids.

ids.a <- c("XP_004633320.1", "XP_004623289.1" ,"XP_004626331.1")
CDSs <- list()
for(i in 1:length(ids.a))
{
	id <- ids.a[i]
	cds <- fetch_cds(id)
	CDSs[[i]] <- cds
}

And the output looks as follows:

> CDSs
[[1]]
[1] ">XP_004633320.1 PREDICTED: Octodon degus major facilitator superfamily domain containing 9 (Mfsd9), mRNA
\nATGTGGTGCCTGAGGTTCGGGGCCCCGGGCGGGCCAGCTTCAGCCACTCCGGAGCTGCCGCCGGAGCGAG\nGCCCGGGCGCGGTGGGAGCTCGCCGCTTCCTGCTTGGCCTCTACCTGGTGGGCTTCCTGG (...)

[[2]]
NULL

[[3]]
[1] ">XP_004626331.1 PREDICTED: Octodon degus endothelin receptor type B (Ednrb), mRNA
\nCGGAGCGGCCGGTACACGCACTGGAGCAAGCAGTAGCATGCAGTCGCCAACAAGTCTGTGCAGACTCGCC\nCTGGTGGCTCTGGTTTTTGCCTGCATTTCGGCCGGGGTCTGGGGAGAGGAGAGAGGATTCCC (...)

I don't know if this is the most efficient approximation but it works, I hope this can be useful, thanks again for your help!

dwinter added a commit that referenced this issue May 31, 2017
@dwinter
Copy link
Member

dwinter commented May 31, 2017

closing now, feel free to sumbmit more issues if you run into anything else.

@dwinter dwinter closed this as completed May 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants