accession number not matching with taxa ID #10
Hello, I just updated to the newest version of taxonomizr and ran the new code to create the SQL database. I have about 200,000 accession numbers from searching NCBI for rbcL genes. I have tried both the base and versioned forms of the accession numbers and they do not seem to be matching up correctly: I get some but not all of the taxa IDs, even though each accession number has a taxa ID if I search for it directly in NCBI. Could you check to see if you are having the same problem? Attached is a link to the dataframe of accession numbers (the taxa IDs are from Entrez code, but I would like to use your much faster functions) and example code is below.
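For reference, the difference between the base and versioned forms can be sketched in base R. This is only an illustration: `stripVersion` is a hypothetical helper, not a taxonomizr function, the accessions are made-up placeholders, and it assumes the version is a trailing dot followed by digits.

```r
# Hypothetical helper (not part of taxonomizr): drop a trailing ".<digits>"
# version suffix, leaving accessions without a suffix unchanged.
stripVersion <- function(accessions) {
  sub("\\.[0-9]+$", "", accessions)
}

# Made-up placeholder accessions, for illustration only.
acc <- c("AB123456.1", "AB123456", "NC_000001.11")
baseAcc <- stripVersion(acc)
print(baseAcc)
```

The base form is what would be paired with the 'base' argument of accessionToTaxa, as in the call shown later in this thread.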
I have gotten around the issue, so no hurry, but it would be nice to figure out the problem so I don't need to wait hours on entrez functions.
 2478980 2478980 2478980 88415 88415 88415 88415 1191690 1191690 1191690 1077399 1077399 1077399
So if I do:
library(taxonomizr)
x <- read.csv('ncbi_rbcl_lineage.csv', stringsAsFactors=FALSE)
taxaId <- accessionToTaxa(x$accession, 'db/taxo/nameNode.sqlite', 'base')
summary(is.na(taxaId))
range(which(is.na(taxaId)))
I get 602 NAs and 221,618 good taxa IDs and I see that the 602 NAs are all in the first 800 entries of your .csv. Just eyeballing a couple they seem to have a recent modification date e.g.:
Or are you having a problem different than I'm seeing here?
I was curious, so I redownloaded the database today (4 days later). Now I'm getting 222,172 good IDs and only 48 NAs. The 48 NAs are all in the first 64 entries of the .csv. That seems to reinforce that we're just seeing a delay in data moving through the NCBI pipelines.
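If the remaining NAs are a problem in the meantime, one option is to separate out the unmatched accessions so that only those few go through the slower Entrez lookup. A minimal sketch in base R, where `splitByMatch` is a hypothetical helper and the toy vectors stand in for x$accession and the accessionToTaxa result:

```r
# Hypothetical helper (not part of taxonomizr): split accessions by whether
# the fast lookup returned a taxa ID or NA.
splitByMatch <- function(accessions, taxaId) {
  stopifnot(length(accessions) == length(taxaId))
  missing <- is.na(taxaId)
  list(
    matched = data.frame(
      accession = accessions[!missing],
      taxaId = taxaId[!missing],
      stringsAsFactors = FALSE
    ),
    unmatched = accessions[missing],   # candidates for a slower fallback lookup
    naPositions = which(missing)       # where in the input the NAs sit
  )
}

# Toy input, for illustration only.
acc <- c("A1", "A2", "A3", "A4")
ids <- c(NA, 88415, NA, 1191690)
res <- splitByMatch(acc, ids)
print(res$unmatched)
print(range(res$naPositions))
```

Checking range(res$naPositions) is the same test used above to see whether the NAs cluster at the start of the file, which is what a delay in recently modified records would look like.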
Thanks for the report and good luck with your project.