accession number not matching with taxa ID #10

Closed
ngeraldi opened this Issue Nov 5, 2018 · 4 comments


ngeraldi commented Nov 5, 2018

Hello, I just updated to the newest version of taxonomizr and ran the new code to create the SQL database. I have about 200,000 accession numbers from searching NCBI for rbcL genes. I have tried both the base and version forms of the accession numbers and they do not seem to match up correctly: I get some but not all of the taxa IDs (each accession number does have a taxa ID if I search for it directly in NCBI). Could you check to see if you are having the same problem? Attached is a link to the data frame of accession numbers (the taxa IDs are from Entrez code, but I would like to use your much faster functions) and example code is below.
It is worth mentioning that getTaxonomy() works once I get the taxa ID from the Entrez code.

I have gotten around the issue, so no hurry, but it would be nice to figure out the problem so I don't need to wait hours on Entrez functions.
Thank you
Nathan

https://www.dropbox.com/s/ivygvjqb0zfa6rl/ncbi_rbcl_lineage.csv?dl=0
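# Read the accession table linked above
x <- data.table::fread(file = "ncbi_rbcl_lineage.csv", header = TRUE, sep = ",")
# Look up taxa IDs for the first 100 accessions, matching on the base accession
accessionToTaxa(x$accession[1:100], "accessionTaxa.sql", version = 'base')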

output

[1] 2478980 2478980 2478980 88415 88415 88415 88415 1191690 1191690 1191690 1077399 1077399 1077399
[14] 1077399 1077399 1077399 NA NA NA NA NA NA NA NA NA NA
[27] NA NA NA NA NA NA NA NA NA NA NA NA NA
[40] NA NA NA NA NA NA NA NA NA NA NA NA NA
[53] NA NA NA NA NA NA NA NA NA NA NA NA 1486651
[66] 1486654 1486646 1486646 1486646 1486654 1486646 373125 1486654 1486654 1486654 1486654 1486646 373125
[79] 373125 373125 1486646 1486646 1486646 1486646 1486646 1486646 373124 373124 340433 1486647 1486650
[92] 1486650 373124 373124 1486647 1486647 1486646 1486647 1486650 1486646
Warning messages:
1: In file.remove(tmp) :
cannot remove file 'C:\Users\geraldn\AppData\Local\Temp\RtmpM31OhV\file219c75ba49a3', reason 'Permission denied'
2: In file.remove(tmp) :
cannot remove file 'C:\Users\geraldn\AppData\Local\Temp\RtmpM31OhV\file219c75ba49a3', reason 'Permission denied'


Owner

sherrillmix commented Nov 5, 2018

My first guess would be that those NAs are from recently uploaded sequences and that a fresh database download from NCBI would catch them but that's just a guess. I'm downloading a fresh database copy and will update once that finishes.


Owner

sherrillmix commented Nov 5, 2018

So if I do:

library(taxonomizr)
x<-read.csv('ncbi_rbcl_lineage.csv',stringsAsFactors=FALSE)
taxaId<-accessionToTaxa(x$accession,'db/taxo/nameNode.sqlite','base')
summary(is.na(taxaId))
range(which(is.na(taxaId)))

I get 602 NAs and 221,618 good taxa IDs, and the 602 NAs are all within the first 800 entries of your .csv. Eyeballing a couple of them, they seem to have a recent modification date, e.g.:
MF070051
MH104899
MH748856
and do not show up in a zgrep of the raw accession2taxid.gz files. So I'm going to lay the blame on delays in new data percolating through NCBI into the downloadable archives and suggest waiting a few days and redownloading the archives.
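
For anyone wanting to repeat this kind of check, here is a minimal base R sketch for pulling out the unmatched accessions and their positions in the input. `findUnmatched` is a hypothetical helper, not part of taxonomizr. It only assumes a vector of accessions plus the vector of taxa IDs returned by accessionToTaxa(), with NA marking a miss. The toy vectors below are made-up placeholders (only MF070051 and MH104899 come from this thread).

```r
# Hypothetical helper (not part of taxonomizr): list the accessions that
# came back without a taxa ID, along with their position in the input.
findUnmatched <- function(accession, taxaId) {
  stopifnot(length(accession) == length(taxaId))
  missing <- is.na(taxaId)
  data.frame(
    position = which(missing),
    accession = accession[missing],
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for x$accession and the accessionToTaxa() result.
# The taxa IDs and the "AB..." accessions are placeholders for illustration.
acc <- c("AB000001", "MF070051", "AB000002", "MH104899")
ids <- c(1111L, NA, 2222L, NA)
unmatched <- findUnmatched(acc, ids)
print(unmatched)
```

With the real data, something like `findUnmatched(x$accession, taxaId)` should give the list of accessions to look up in the raw accession2taxid.gz files.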

Or are you having a problem different than I'm seeing here?


ngeraldi commented Nov 6, 2018

I am getting the exact same results. Guess I just need to be more patient.
Thanks for the quick reply.
Best
Nathan

ngeraldi closed this Nov 6, 2018


Owner

sherrillmix commented Nov 9, 2018

I was curious so I redownloaded the database today (4 days later). Now I'm getting 222,172 good IDs and only 48 NAs. The 48 NAs are all in the first 64 entries of the .csv. That seems to reinforce that we're just seeing delay in data moving through the NCBI pipelines.
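
A rough sketch of how the two downloads could be compared directly, in case it is useful. `compareSnapshots` is a hypothetical helper, not a taxonomizr function. It assumes the same accession vector was run through accessionToTaxa() against the old and the new database, giving two taxa ID vectors of equal length. The toy IDs are placeholders.

```r
# Hypothetical helper (not part of taxonomizr): tally how lookups changed
# between two database downloads for the same accession vector.
compareSnapshots <- function(oldIds, newIds) {
  stopifnot(length(oldIds) == length(newIds))
  c(
    stillMissing  = sum(is.na(oldIds) & is.na(newIds)),   # NA in both
    newlyResolved = sum(is.na(oldIds) & !is.na(newIds)),  # NA before, found now
    newlyMissing  = sum(!is.na(oldIds) & is.na(newIds))   # found before, NA now
  )
}

# Toy taxa ID vectors standing in for the two accessionToTaxa() results
oldIds <- c(10L, NA, NA, 20L, NA)
newIds <- c(10L, 30L, NA, 20L, 40L)
res <- compareSnapshots(oldIds, newIds)
print(res)
```

If the delay explanation holds, `newlyResolved` should account for the drop in NAs between downloads and `newlyMissing` should stay at zero.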

Thanks for the report and good luck with your project.
