-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace taxize with taxald #226
Conversation
Co-Authored-By: cboettig <cboettig@gmail.com>
Co-Authored-By: cboettig <cboettig@gmail.com>
I wish `spelling::spell_check_package()` would check README.Rmd too! gets me every time. ropensci/spelling#11
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See inline comments. I think it'd be nice to at least not overwrite existing otu
annotations, and that should be an easy change?
call. = FALSE) | ||
} | ||
get_uid <- getExportedValue("taxize", "get_uid") | ||
get_ids <- getExportedValue("taxald", "get_ids") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be simpler to just write taxld::get_ids?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but this change is intentional -- taxize::get_uid
is the NCBI-specific function (though you probably wouldn't guess that from the name), while taxize::get_ids
is a wrapper around a suite of potential authorities (including itis
, col
, and gbif
. The original RNeXML implementation only supported NCBI by hardwiring get_uid
, while at the same time promising to support future versions (we've always had a type
argument, but only supporting type="NCBI"
. So this PR delivers on the support for some alternate authorities and changes the name to match the corresponding taxize
function.
(Side note, but taxize::get_ids()
did nothing to standardize the return object provided by wrapping the different authorities, so you couldn't just swap this back to taxize::get_ids()
. taxald
mimics the input, but always provides a consistent output regardless of the authority source).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know the change to taxald::get_ids
is intentional. I was just saying, isn't
get_ids <- taxald::get_ids
the same as
get_ids <- getExportedValue("taxald", "get_ids")
except the former is shorter and (I find anyway) more readable as to what's happening?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, get_ids
only gets used a single time. Is there a good reason to not simply replace
taxa_ids <- get_ids(labels, type, format = "uri", ...)
with
taxa_ids <- taxald::get_ids(labels, type, format = "uri", ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I'm nitpicking here 😄 so feel free to just ignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I am intentionally avoiding the use of an explicit reference to the package (e.g. taxald::
) because that would cause R CMD CHECK to throw a WARNING, because taxald
is only suggested and not imported. Importing from a function from another package using getExportedValue
is something of an accepted hack to allow a package to have "soft" dependencies.
We did this with taxize
initially because taxize
is a huge dependency with lots of additional dependencies and listing it in Imports would make it required for a user who really just wants to do simple parsing of NeXML files and doesn't need this 'bonus' functionality. taxald
is a bit lighter, but still not 'core' functionality; and besides, we can't have it in Imports if taxald
isn't on CRAN, though we can have it in Suggests.
for(i in 1:length(taxa_ids)){ | ||
id <- taxa_ids[[i]] | ||
if(is.na(id)) | ||
warning(paste("ID for otu", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's been a warning previously, but I didn't like that before either. That's because issuing a warning on behalf of the user is assuming that user meant each and every taxon label to be resolvable against some database. Maybe the user already knew that not everything was going to resolve, and now has to use suppressWarnings()
to get rid of something they were fully anticipating. I think just passing back the fact that nothing was found is good enough, and I'm not sure why just not adding that meta
annotation wouldn't be good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I just left this in for compatibility, but I agree this is a warning that has always been more annoying than helpful. It's a lot less annoying than the previous one because taxald does a better job finding matches ... Ideally I think it might be preferable to collect these warnings in a log and return a single warning at the end? Possibly also include a way to suppress the warning as a function argument. I do feel it wouldn't be great to require inspecting the NeXML / meta elements manually to detect unresolved otu labels...
R/taxize_nexml.R
Outdated
"not found. Consider checking the spelling | ||
or alternate classification")) | ||
else | ||
nexml@otus[[j]]@otu[[i]]@meta <- New("ListOfmeta", list( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's been this way before, but note that this will overwrite any meta
annotations that the otu
elements might have had before. Not really a good idea, and maybe an opportunity to fix that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch, that's a good point. You've been playing more with S4 more recently then me -- is there a more concise way to fix that than checking if nexml@otus[[j]]@otu[[i]]@meta
is empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to care whether it's empty – simply combine (c()
) the current value (which is of type ListOfmeta
) with the new meta
, and the generic method for c(ListOfmeta, ...)
should catch it and treat it correctly. No?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e., the following:
nexml@otus[[j]]@otu[[i]]@meta <- c(nexml@otus[[j]]@otu[[i]]@meta,
meta(href = taxa_ids[[i]], rel = "tc:toTaxon"))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> length(c(new("ListOfmeta"), meta("foo:foo", "bar")))
[1] 1
> length(c(new("ListOfmeta", list(meta("baz","bang"))), meta("foo:foo", "bar")))
[1] 2
> class(c(new("ListOfmeta"), meta("foo:foo", "bar")))
[1] "ListOfmeta"
attr(,"package")
[1] "RNeXML"
So that works as I claimed 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, have now implemented this.
Very nice speedup! Shaved about 40% off of the CI test time. 👍 |
Thanks for the nice review. I've fixed the vignette (dropping taxize mention), append instead of overwriting meta, and added an argument to opt out of warnings. Also rebuilt docs. Merge when ready. |
@hlapp can you review this?
This implements the changes proposed in #224. Unlike the original taxize version, this approach vectorizes the call for id lookup, and also supports multiple authorities instead of only NCBI. The advantages of this are particularly evident in the
primates
example in the metadata vignette -- which previously ran very slowly and resolved no names, no almost all names are resolved (against ITIS taxonomy in this case), and which runs quite quickly.taxald
pulls a local copy of the complete id table for the authority on the fly if one has not been installed locally. Technically this can be a bit slow, but as these are compressed tables (coming from AWS S3 at the moment) they download more quickly than some single API calls. A user can install a local copy, (e.g. by runningtaxald::td_create(authority = "all")
), but for simplicity and consistency with the old version I haven't documented that here.taxald
isn't on CRAN yet and may change a bit still, though I think this basic behavior should not be impacted. So merging this would mean we would ideally wanttaxald
on CRAN before the next RNeXML release; though as this has always been an opt-in or soft dependency, it's not essential.This also includes and updated version of the pkgdown build based on taxald, so deprecates PR #223 (and resolves the travis-build error associated with taxize network errors).