Attaching nexml metadata to phylo objects #19

Closed
cboettig opened this issue Sep 6, 2013 · 8 comments

@cboettig
Member

cboettig commented Sep 6, 2013

Crazy idea: when reading into a phylo, should we create a new RNeXML environment, store the full nexml tree in there, and add a new slot to the phylo object storing an unevaluated get("<unique_tree_id>", envir=RNeXML) that methods could use to access the full NeXML??

This would let us do something like:

tr <- nexml_read("tree.xml")
metadata(tr)

instead of

tr <- nexml_read("tree.xml", type="nexml")
metadata(tr)

That is, reading in as the default (ape) type, and still calling functions that need the full nexml metadata, while also having an ape tree object that can still be passed around to the usual R packages.

Or maybe that's stupid and asking for trouble, and we should be explicit about what type of object we want.
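
A minimal sketch of the environment idea, in base R only. Everything here is an assumption made for illustration: the names nexml_store, attach_nexml and the stand-in metadata function are hypothetical, the "phylo" object is a hand-built placeholder rather than a real ape tree, and the NeXML document is faked as a plain list.

```r
# Hypothetical store for full NeXML documents, keyed by a tree id.
nexml_store <- new.env()

# Save the document in the store and attach an unevaluated lookup call
# to the tree as an attribute (the tree itself stays a plain S3 object).
attach_nexml <- function(phy, nexml_doc, id) {
  assign(id, nexml_doc, envir = nexml_store)
  attr(phy, "nexml") <- bquote(get(.(id), envir = nexml_store))
  phy
}

# Stand-in for a metadata accessor: evaluate the stored call on demand.
metadata <- function(phy) eval(attr(phy, "nexml"))

# Hand-built placeholder for a tree and for a parsed NeXML document.
phy <- structure(list(tip.label = c("A", "B")), class = "phylo")
doc <- list(title = "example tree", license = "CC0")

tr <- attach_nexml(phy, doc, "tree1")
metadata(tr)$license
```

Note that the document still lives in memory inside the store, so this does not reduce the footprint; it only keeps the tree object itself small.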

@sckott
Contributor

sckott commented Sep 6, 2013

@cboettig That seems reasonable at small scale, but with huge XML files we would be putting a lot of data into the user's workspace without them realizing it.

@cboettig
Member Author

cboettig commented Oct 7, 2013

In general it would be useful to read in a nexml object that could be passed directly to functions based on ape trees without requiring coercion and dropping of metadata.

I never understood why phylobase didn't do this -- but it appears that phylo4 objects do not inherit the phylo S3 class and cannot be passed to phylo functions without explicit coercion:

library(phylobase)
library(ape)
data(bird.orders)
bird.orders4 <- as(bird.orders, "phylo4") # make ape::phylo tree into phylobase::phylo4 S4 class
plot.phylo(bird.orders4) # attempting to use the S4 fails

Of course a plot function is defined for phylo4, but more interesting functions are not written for phylo4, so this is a huge handicap. Consider:

S <- c(10, 47, 69, 214, 161, 17, 355, 51, 56, 10, 39, 152,
       6, 143, 358, 103, 319, 23, 291, 313, 196, 1027, 5712)
bd.ext(bird.orders4, S)   # Fails again. Works with the S3 type

Anyway, it appears this problem can be solved using setOldClass. I've defined the class phyloS4, which inherits all methods for the S3 phylo class without having to explicitly declare those methods. In this way, we have the benefits of an S4 class while maintaining compatibility with all developers who only write functions based on the S3 class (as long as functions don't stupidly check the string identity class(obj) == "phylo" instead of using the proper class check is(obj, "phylo")).

I can then build a new class, nexmlTree by extending this class. Again my new class acts like an S3 phylo in any such functions, but adds a representation containing all the nexml data. This approach doesn't minimize memory footprint, but usually that is not a concern for R users (otherwise coercion is always an option). It does satisfy the need for an object that works with all existing functions while also containing any and all metadata we can express in nexml.

See R/extend_phylo.R for the definition.
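
For readers without the repository at hand, the setOldClass approach can be sketched roughly as follows. This is not the contents of R/extend_phylo.R: it collapses phyloS4 and nexmlTree into one hypothetical class, treeWithMeta, stores the metadata as a plain list rather than a nexml object, and uses a toy S3 generic (n_tips) and a hand-built placeholder tree so that it runs without ape.

```r
library(methods)

# Register the S3 class so that S4 classes can extend it.
setOldClass("phylo")

# Hypothetical class: an S3 "phylo" plus one extra slot for metadata.
setClass("treeWithMeta", representation(meta = "list"), contains = "phylo")

# Toy S3 generic with a method written only for the S3 class.
n_tips <- function(x) UseMethod("n_tips")
n_tips.phylo <- function(x) length(x$tip.label)

# Hand-built placeholder for an ape tree (not a complete phylo object).
phy <- structure(list(tip.label = c("A", "B", "C")), class = "phylo")

# The unnamed argument supplies the S3 part; meta fills the new slot.
tr <- new("treeWithMeta", phy, meta = list(license = "CC0"))

n_tips(tr)                 # S3 dispatch reaches n_tips.phylo
is(tr, "phylo")            # class check that respects inheritance
any(class(tr) == "phylo")  # string comparison does not see the parent class
tr@meta$license            # the extra metadata travels with the tree
```

The last two checks illustrate the caveat above: code that tests is(obj, "phylo") accepts the extended object, while code that compares class(obj) to the string "phylo" does not.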

@cboettig
Member Author

Looking for feedback on this approach.

It appears that phylobase didn't choose to extend the phylo class in a way that phylo4 objects could be simply passed to existing functions designed for the S3 phylo objects. This is possible, as I have now implemented with the tentatively named nexmlTree class, and describe here: http://carlboettiger.info/2013/10/07/nexml-phylo-class-extension.html

On one hand, it seems to make sense that we want an object that both has the metadata attached to it, with methods that can operate to extract, display, and potentially compute on that metadata, but still works as a tree object in all existing functions.

On the other hand, this makes a larger object, since it has all this metadata attached (possibly not a problem?). It can also introduce more potential trouble to have users using this object directly in their workflow, instead of converting to a vanilla phylo object and using that (for instance, as I describe in my linked notes, methods that check class with string matching instead of the built-in method will throw an error).

It seems an important design choice is whether we build methods around the extended class or have separate methods for working on RNeXML S4 object metadata and just convert that to an ape::phylo for tree methods. @schamberlain @hlapp @rvosa thoughts?

@sckott
Contributor

sckott commented Oct 15, 2013

whether we build methods around the extended class or have separate methods for working on RNeXML S4 object metadata and just convert that to an ape::phylo for tree methods

Do you have a feeling for which is better?

@hlapp
Contributor

hlapp commented Oct 16, 2013

Not clear to me what the concrete consequences for users would be. Can you explicate?

@cboettig
Member Author

With separate objects, users would have to decide to read in a NeXML file as nexml (and later convert it), or read it in directly as "phylo" and later read it in again to do anything with the metadata. e.g.:

tree <- nexml_read("file.xml", type="phylo") # object of class "phylo"
plot(tree)

or

nexml_tree <- nexml_read("file.xml", type="nexml") # object of class "nexml"
tree <- as(nexml_tree, "phylo")
plot(tree)

while to perform metadata functions they have to operate on the nexml object instead:

summary(nexml_tree) 
citation(nexml_tree)
license(nexml_tree)

(those methods not yet written btw).
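
A rough sketch of how the separate-objects workflow could be wired up with setAs, under stated assumptions: nexmlDoc is a hypothetical stand-in for the real nexml class, doc_license is a hypothetical accessor (named to avoid clashing with base R's license()), and the resulting "phylo" is a hand-built placeholder rather than a complete ape tree.

```r
library(methods)

# Register the S3 class so it can be a coercion target.
setOldClass("phylo")

# Hypothetical stand-in for an S4 representation of a NeXML document.
setClass("nexmlDoc",
         representation(tip_labels = "character", license = "character"))

# Coercion to a plain S3 tree: the metadata is dropped on the way.
setAs("nexmlDoc", "phylo", function(from) {
  structure(list(tip.label = from@tip_labels), class = "phylo")
})

# Metadata accessor defined only for the document class.
setGeneric("doc_license", function(x) standardGeneric("doc_license"))
setMethod("doc_license", "nexmlDoc", function(x) x@license)

doc <- new("nexmlDoc", tip_labels = c("A", "B"), license = "CC0")
tree <- as(doc, "phylo")   # use this object with tree functions
doc_license(doc)           # metadata functions need the original document
```

The user has to keep both doc and tree around, which is the bookkeeping cost of this option.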

In the second option, with a combined interface, the user would use the same object for all purposes:

tree <- nexml_read("file.xml")  # object of class "nexmlTree"
plot(tree)
metadata(tree)
summary(tree)
license(tree)

etc. Clearly the interface is cleaner in the latter case. The cost is a larger object in memory and a chance that poorly written phylogenetics functions (at least ones that check class using strings) will fail.

@cboettig
Member Author

(Um, note that plot(tree) is the ape method plot.phylo; I'm just using it to illustrate any existing method. It could be a richer function like bd.ext, or any function from geiger, OUwie, phytools etc. Meanwhile the other 'metadata' functions would be the unique functions provided in RNeXML to handle the metadata. I'm not sure quite what or how many such functions we'll have, but see ideas in #20)

@cboettig
Member Author

Okay, I think we can just support both and let the user decide. The metadata methods (now implemented, see #20 (comment) and commit 94996e6) are written for the "nexml" class and inherited by the "nexmlTree" class. By default, I support the second method; e.g. tree <- nexml_read("file.xml") will read in an object of class "nexmlTree" that acts like a phylo object and has all the metadata attached, with associated methods. Users who would prefer a pure phylo object can coerce this or read it in as such, as shown above.

Not sure if users will have any use for the raw nexml class, since the nexmlTree class has the added benefit of working in phylo methods. Still, it is available as an object for any user or developer just needing an R S4 representation of a nexml document.

I think this resolves this question. Re-open with outstanding issues, or feel free to add further questions or comments.

3 participants