Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace taxize with taxald #226

Merged
merged 14 commits into from
Dec 8, 2018
Merged

Replace taxize with taxald #226

merged 14 commits into from
Dec 8, 2018

Conversation

cboettig
Copy link
Member

@cboettig cboettig commented Dec 7, 2018

@hlapp can you review this?

This implements the changes proposed in #224. Unlike the original taxize version, this approach vectorizes the call for id lookup, and also supports multiple authorities instead of only NCBI. The advantages of this are particularly evident in the primates example in the metadata vignette -- which previously ran very slowly and resolved no names, no almost all names are resolved (against ITIS taxonomy in this case), and which runs quite quickly.

taxald pulls a local copy of the complete id table for the authority on the fly if one has not been installed locally. Technically this can be a bit slow, but as these are compressed tables (coming from AWS S3 at the moment) they download more quickly than some single API calls. A user can install a local copy, (e.g. by running taxald::td_create(authority = "all")), but for simplicity and consistency with the old version I haven't documented that here.

taxald isn't on CRAN yet and may change a bit still, though I think this basic behavior should not be impacted. So merging this would mean we would ideally want taxald on CRAN before the next RNeXML release; though as this has always been an opt-in or soft dependency, it's not essential.

This also includes and updated version of the pkgdown build based on taxald, so deprecates PR #223 (and resolves the travis-build error associated with taxize network errors).

Copy link
Contributor

@hlapp hlapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comments. I think it'd be nice to at least not overwrite existing otu annotations, and that should be an easy change?

vignettes/metadata.Rmd Outdated Show resolved Hide resolved
call. = FALSE)
}
get_uid <- getExportedValue("taxize", "get_uid")
get_ids <- getExportedValue("taxald", "get_ids")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be simpler to just write taxld::get_ids?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but this change is intentional -- taxize::get_uid is the NCBI-specific function (though you probably wouldn't guess that from the name), while taxize::get_ids is a wrapper around a suite of potential authorities (including itis, col, and gbif. The original RNeXML implementation only supported NCBI by hardwiring get_uid, while at the same time promising to support future versions (we've always had a type argument, but only supporting type="NCBI". So this PR delivers on the support for some alternate authorities and changes the name to match the corresponding taxize function.

(Side note, but taxize::get_ids() did nothing to standardize the return object provided by wrapping the different authorities, so you couldn't just swap this back to taxize::get_ids(). taxald mimics the input, but always provides a consistent output regardless of the authority source).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the change to taxald::get_ids is intentional. I was just saying, isn't

get_ids <- taxald::get_ids

the same as

get_ids <- getExportedValue("taxald", "get_ids")

except the former is shorter and (I find anyway) more readable as to what's happening?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, get_ids only gets used a single time. Is there a good reason to not simply replace

    taxa_ids <- get_ids(labels, type, format = "uri", ...)

with

    taxa_ids <- taxald::get_ids(labels, type, format = "uri", ...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I'm nitpicking here 😄 so feel free to just ignore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I am intentionally avoiding the use of an explicit reference to the package (e.g. taxald::) because that would cause R CMD CHECK to throw a WARNING, because taxald is only suggested and not imported. Importing from a function from another package using getExportedValue is something of an accepted hack to allow a package to have "soft" dependencies.

We did this with taxize initially because taxize is a huge dependency with lots of additional dependencies and listing it in Imports would make it required for a user who really just wants to do simple parsing of NeXML files and doesn't need this 'bonus' functionality. taxald is a bit lighter, but still not 'core' functionality; and besides, we can't have it in Imports if taxald isn't on CRAN, though we can have it in Suggests.

for(i in 1:length(taxa_ids)){
id <- taxa_ids[[i]]
if(is.na(id))
warning(paste("ID for otu",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's been a warning previously, but I didn't like that before either. That's because issuing a warning on behalf of the user is assuming that user meant each and every taxon label to be resolvable against some database. Maybe the user already knew that not everything was going to resolve, and now has to use suppressWarnings() to get rid of something they were fully anticipating. I think just passing back the fact that nothing was found is good enough, and I'm not sure why just not adding that meta annotation wouldn't be good enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I just left this in for compatibility, but I agree this is a warning that has always been more annoying than helpful. It's a lot less annoying than the previous one because taxald does a better job finding matches ... Ideally I think it might be preferable to collect these warnings in a log and return a single warning at the end? Possibly also include a way to suppress the warning as a function argument. I do feel it wouldn't be great to require inspecting the NeXML / meta elements manually to detect unresolved otu labels...

R/taxize_nexml.R Outdated
"not found. Consider checking the spelling
or alternate classification"))
else
nexml@otus[[j]]@otu[[i]]@meta <- New("ListOfmeta", list(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's been this way before, but note that this will overwrite any meta annotations that the otu elements might have had before. Not really a good idea, and maybe an opportunity to fix that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch, that's a good point. You've been playing more with S4 more recently then me -- is there a more concise way to fix that than checking if nexml@otus[[j]]@otu[[i]]@meta is empty?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to care whether it's empty – simply combine (c()) the current value (which is of type ListOfmeta) with the new meta, and the generic method for c(ListOfmeta, ...) should catch it and treat it correctly. No?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e., the following:

nexml@otus[[j]]@otu[[i]]@meta <- c(nexml@otus[[j]]@otu[[i]]@meta,
                                   meta(href = taxa_ids[[i]], rel = "tc:toTaxon"))

Copy link
Contributor

@hlapp hlapp Dec 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

> length(c(new("ListOfmeta"), meta("foo:foo", "bar")))
[1] 1
> length(c(new("ListOfmeta", list(meta("baz","bang"))), meta("foo:foo", "bar")))
[1] 2
> class(c(new("ListOfmeta"), meta("foo:foo", "bar")))
[1] "ListOfmeta"
attr(,"package")
[1] "RNeXML"

So that works as I claimed 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, have now implemented this.

@hlapp
Copy link
Contributor

hlapp commented Dec 7, 2018

Very nice speedup! Shaved about 40% off of the CI test time. 👍

@cboettig
Copy link
Member Author

cboettig commented Dec 7, 2018

Thanks for the nice review. I've fixed the vignette (dropping taxize mention), append instead of overwriting meta, and added an argument to opt out of warnings. Also rebuilt docs. Merge when ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants