How to keep Protein ID when retrieving coding sequences #107

santiagoha · 2017-05-25T21:18:35Z

I have a bunch of protein IDs and I need to retrieve the corresponding coding sequences (CDSs). I have managed to retrieve the CDSs but the names of each sequence change from XP* to XM*, and I need to retain the XP* header for each sequence.

Basically it looks like this:

library(renters)
search1 <- entrez_search(db="protein", term="XP_012370245[Accn]")
links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])

And the output looks like this:

> rec
[1] ">XM_012514791.1 PREDICTED: Octodon degus Hermansky-Pudlak syndrome 6 (Hps6), mRNA
\nATGCAGAGGAAAAACTTTATGTCATTCTTCACAGGCTTCCTGGAGAAGCGGGCCTGGCCGGAGGCCCGCG\nCCGCGCCGCTGGACGCCTTCTTCCTGGCGTGGCCGGCGCAGCCCGCGGTGGCGCTGGTGTGGGAGAGCGG (...)

Is there a way to keep the protein id (XP_012370245) instead of the nucleotide id (XM_012514791.1)? Something like:

> rec
[1] ">XP_012370245
 \nATGCAGAGGAAAAACTTTATGTCATTCTTCACAGGCTTCCTGGAGAAGCGGGCCTGGCCGGAGGCCCGCG\nCCGCGCCGCTGGACGCCTTCTTCCTGGCGTGGCCGGCGCAGCCCGCGGTGGCGCTGGTGTGGGAGAGCGG (...)

Any suggestion is very welcome, thanks!

The text was updated successfully, but these errors were encountered:

dwinter · 2017-05-25T21:33:42Z

Hi @santiagoha,

I don't think there is any way to do this on the NCBI end (nucleotide records have nucleotide IDs).

You can probably cook something up to replace the IDs through stringr or the base string manipulation functions. This works for your example (though you probably want to add some checks to make sure you are actually findings CDS sequences etc):

fetch_cds <- function(prot_acc){
    search1 <- entrez_search(db="protein", term=paste0(prot_acc, "[Accn]"))
    links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
    rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])
    sub("XM_\\d+\\.\\d", prot_acc, rec)    
}

cat(substr(fetch_cds("XP_012370245"), 1, 500), "\n")

>XP_012370245 PREDICTED: Octodon degus Hermansky-Pudlak syndrome 6 (Hps6), mRNA
ATGCAGAGGAAAAACTTTATGTCATTCTTCACAGGCTTCCTGGAGAAGCGGGCCTGGCCGGAGGCCCGCG
CCGCGCCGCTGGACGCCTTCTTCCTGGCGTGGCCGGCGCAGCCCGCGGTGGCGCTGGTGTGGGAGAGCGG
CCTGGCGGAGCTCTGGGGTGCCGACCTGGGGTCCGCCTGGAGGCGGCTTCACGCCACCGAACTGTGTCCG
CGCGGCGCAGTCCGCGTGGTGGCAGCGGTGGCGCCGCGGGGCCGCCTGGTGTGGTGCGAGGAGCGCCCGG
GCGCGGGCGGACGCCGCGTGTGCGTCCGCACCCTAGAGCCTGGCGGCGAGACTGGTGCCCGCCTGGGCCG
CACGCACGTCCTGCTGCACCACTGCCCGCCCTTCCACCTGCTGGCCTCGCGCAAGGACGTCTTCC

santiagoha · 2017-05-25T22:07:12Z

Hi @dwinter, thank you so much for the hint! I will try to apply it now for a large number of IDs

santiagoha · 2017-05-25T23:49:35Z

Hi @dwinter, this may be a silly question, but as you suggested earlier I am trying to check that the CDS actually exists, and if not, the function should break. I modified your fetch_cds function like this:

fetch_cds <- function(prot_acc){
    search1 <- entrez_search(db="protein", term=paste0(prot_acc, "[Accn]"))
    links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
    if (is.null(links$links$protein_nuccore_mrna[1]))
    {
    	stop("No CDS found") 
    }
    else 
    {
    	rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])
    	sub("XM_\\d+\\.\\d", prot_acc, rec)
    }  	  
}

When I apply it to an extant CDS, it works fine, however if I apply it to a protein ID that don't have the CDS it returns the following error:

fetch_cds("XP_004623289.1")
Error in res[[1]] : subscript out of bounds

I have try to solve it in different ways but I always get the same error. I understand that the error is telling that I'm trying to access some element that is outside the limits of the list/array, however, I don't understand which list/array is causing the error. Am I missing something? Thank you very much for help.

dwinter · 2017-05-26T00:06:13Z

Hey @santiagoha, can you provide me with an ID that throws this error. Are you sure ther is a protein with this accession?

A couple of general debugging pointers.

You can use traceback() to get the particular series of calls that gives an error (whch could give you a clue as to what's gong on.
It's always a good idea to break the function down and run each line outside of the function scope while using the same inputs that are producing errors. That way you can see exactly when the errors occur

santiagoha · 2017-05-26T00:29:58Z

Hi @dwinter, I just checked in NCBI and apparently the record for this accession (XP_004623289.1) was removed:
https://www.ncbi.nlm.nih.gov/protein/XP_004623289.1?report=genpept
Maybe this is why it doesn't has a corresponding CDS?

As you suggested, I previously broke the function and tried it with this ID: XP_004623289.1. The entrez_search and entrez_link functions worked fine. I then checked what was inside links:

links$links
elink result with information from 1 databases:
[1] protein_nuccore

So, it doesn't have the protein_nuccore_mrna database. And, actually, if I try to call this database it is NULL:

links$links$protein_nuccore_mrna[1]
NULL

This is why I used this a the conditional statement to break the function.

I used traceback() and this is what I'm getting:

traceback()
3: parse_elink(record, cmd = cmd, by_id = by_id)
2: entrez_link(dbfrom = "protein", db = "nuccore", id = search1$ids) at #3
1: fetch_cds("XP_004623289.1")

But I still don't understand what is causing error, is the parse_elinkfunction? Thank you very much for your time and advises!

dwinter · 2017-05-26T00:47:44Z

Hey @santiagoha ,

There are no IDs in the search result

entrez_search(db="protein", term="XP_004623289.1[Accn]")

Entrez search result with 0 hits (object contains 0 IDs and no web_history object)
 Search term (as translated):  (XP_004623289.1[Accn])

As a result, elink is failing. We should probably check for non-zero length ID vectors within rentrez, but for now you probably want to catch these before you go hunting for links.

santiagoha · 2017-05-26T16:03:09Z

Hi @dwinter, I didn't noticed that there were no IDs!
I modified the function as follows:

fetch_cds <- function(prot_acc){
    search1 <- entrez_search(db="protein", term=paste0(prot_acc, "[Accn]"))
    tryCatch({
    	if (search1$count == 0) stop("No CDS found")
    	else 
    	{
    	links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
    	rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])
    	sub("XM_\\d+\\.\\d", prot_acc, rec)
    	} 
    }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")}) 
}

So, basically, it searches if there is any id for the corresponding Protein ID, and I added tryCatch to avoid the function to break. This is useful if you want to iterate the function over a list of IDs.

For example, for the list of IDs: "XP_004633320.1" "XP_004623289.1" "XP_004626331.1", the second Protein ID don't have a corresponding id with the entrez_searchfunction, but the other two are extant ids.

ids.a <- c("XP_004633320.1", "XP_004623289.1" ,"XP_004626331.1")
CDSs <- list()
for(i in 1:length(ids.a))
{
	id <- ids.a[i]
	cds <- fetch_cds(id)
	CDSs[[i]] <- cds
}

And the output looks as follows:

> CDSs
[[1]]
[1] ">XP_004633320.1 PREDICTED: Octodon degus major facilitator superfamily domain containing 9 (Mfsd9), mRNA
\nATGTGGTGCCTGAGGTTCGGGGCCCCGGGCGGGCCAGCTTCAGCCACTCCGGAGCTGCCGCCGGAGCGAG\nGCCCGGGCGCGGTGGGAGCTCGCCGCTTCCTGCTTGGCCTCTACCTGGTGGGCTTCCTGG (...)

[[2]]
NULL

[[3]]
[1] ">XP_004626331.1 PREDICTED: Octodon degus endothelin receptor type B (Ednrb), mRNA
\nCGGAGCGGCCGGTACACGCACTGGAGCAAGCAGTAGCATGCAGTCGCCAACAAGTCTGTGCAGACTCGCC\nCTGGTGGCTCTGGTTTTTGCCTGCATTTCGGCCGGGGTCTGGGGAGAGGAGAGAGGATTCCC (...)

I don't know if this is the most efficient approximation but it works, I hope this can be useful, thanks again for your help!

Closes #107

dwinter · 2017-05-31T04:16:20Z

closing now, feel free to sumbmit more issues if you run into anything else.

dwinter added the use-cases label May 25, 2017

dwinter mentioned this issue May 26, 2017

Throw error if identifier sending function is given empty identifiers #108

Closed

dwinter added a commit that referenced this issue May 31, 2017

Useful error sending an empty ID set to NCBI

84ebd07

Closes #107

dwinter closed this as completed May 31, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to keep Protein ID when retrieving coding sequences #107

How to keep Protein ID when retrieving coding sequences #107

santiagoha commented May 25, 2017

dwinter commented May 25, 2017 •

edited

Loading

santiagoha commented May 25, 2017

santiagoha commented May 25, 2017

dwinter commented May 26, 2017

santiagoha commented May 26, 2017

dwinter commented May 26, 2017

santiagoha commented May 26, 2017 •

edited

Loading

dwinter commented May 31, 2017

How to keep Protein ID when retrieving coding sequences #107

How to keep Protein ID when retrieving coding sequences #107

Comments

santiagoha commented May 25, 2017

dwinter commented May 25, 2017 • edited Loading

santiagoha commented May 25, 2017

santiagoha commented May 25, 2017

dwinter commented May 26, 2017

santiagoha commented May 26, 2017

dwinter commented May 26, 2017

santiagoha commented May 26, 2017 • edited Loading

dwinter commented May 31, 2017

dwinter commented May 25, 2017 •

edited

Loading

santiagoha commented May 26, 2017 •

edited

Loading