
Implement metadata extraction/parsing tools #20

Closed
cboettig opened this issue Sep 8, 2013 · 21 comments

@cboettig
Member

cboettig commented Sep 8, 2013

We will want either/both of:

  • A function to extract all the metadata nodes from the S4 object and summarize this data neatly (more obvious for top-level metadata, less clear how to present the annotations of individual nodes)
  • RDFa based tool for extraction/reasoning on the ontological terms?

Additionally, we might allow some automatic calculation of metadata, such as the number of trees in a file, the number of taxa, the names of taxa, etc. (e.g. along the lines of the TreeBase metadata / PhyloWS terms). Perhaps we should add this summary data in meta nodes, or is that asking for trouble?

Really need to enumerate the use-cases for leveraging this metadata (may involve thinking more about additional metadata we want to add).

@rvosa
Contributor

rvosa commented Sep 9, 2013

I think the former is more immediately useful than the latter. For example, if there are numerical annotations on all the nodes in a tree (like a bootstrap value or posterior probability) it would probably be useful to have this as a list in parallel with the list of nodes, so that calculations can be done over the list.
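A minimal sketch of what that might look like, assuming the annotations have already been extracted into a numeric vector parallel to the node list (the node ids and values below are invented for illustration, not RNeXML output):

```r
# Hypothetical extracted support values: a numeric vector parallel to the
# node list, with node ids as names (values invented for illustration)
support <- c(n3 = 0.98, n5 = 0.87, n8 = 0.63, n9 = 0.91)

mean(support)                  # summary statistic over all annotated nodes
names(support)[support < 0.7]  # ids of poorly supported nodes
```

Any base-R vectorized calculation (quantiles, thresholds, plotting) then works directly over the annotations.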

@cboettig
Member Author

cboettig commented Oct 4, 2013

Metadata extraction function wish-list:

  • Provide visual/plain text, concise summary of all metadata available (e.g. extract RDF -> turtle)
  • A citation function providing the essential metadata needed to cite the nexml resource, including any publication involved.
  • otu summary: showing the species/otus involved.
  • Run a block of code for a method provided?
  • sample from a distribution given branch-length uncertainty annotations?

We might illustrate counting trees, tips, taxa, etc, but this is most naturally done using existing approaches for existing formats rather than providing a dedicated function.

Should probably break these into separate issues...

@cboettig
Member Author

Some methods are now provided to extract metadata. All could be improved, as I comment below.

  • summary currently just uses the phylo summary method. Should print some basic metadata when available as well (author, title, date, license perhaps).
  • get_metadata: returns a named list of all top-level meta elements' "content" or "href" values, named with their associated "property" or "rel", respectively. There may be a cleaner way to do this, but it does provide a concise way to access specific elements later, e.g.
nex <- nexml_read("nexml.xml")
meta <- get_metadata(nex)
meta["dc:title"]

We might need to think about how we handle namespaces, though. We are considering providing explicit functions for the most common metadata.

  • get_license: just extracts any dc:rights or license node. Too slow, since we do this by writing to XML and using XPath rather than searching the S4 structure. On the flip side, XPath search is more explicit and flexible -- e.g. it has an understanding of XML namespaces that we don't have as explicitly in S4. Unclear if we really should be providing such functions for a single, arbitrary metadata line, but perhaps for the most common ones (those we write automatically, like creator, title, license, date).
  • get_citation: extracts any dcterms:bibliographicCitation node (which is generated by the citation option to nexml_write, given an R bibentry class object; the various elements of the citation are also written out using dc:terms). Should probably be changed to return an R bibentry object rather than just this text.

Also thinking about naming these functions without the get_ prefix, but then we would need to deal with namespace collisions...

  • Perhaps metadata methods should be extended to work on the "nexml.xml" file directly without a call to nexml_read first?

@rvosa
Contributor

rvosa commented Oct 17, 2013

This looks really great, though the namespaces thing would need some work:

nex <- nexml_read("nexml.xml")

meta <- get_metadata(nex)

meta["dc:title"]


@cboettig
Member Author

Figuring out the right thing to do for the metadata parsing is quite tricky. In particular, it is difficult to be both flexible and user-friendly. For instance, as you point out in the example above, the user doesn't want to bother with namespaces they've never heard of -- e.g. it would make way more sense to most users to call meta["title"] than meta["dc:title"].

I suppose we could strip the prefixes from all the default namespaces (this might cause some trouble with overlapping terms between PRISM and Dublin Core, though perhaps we could resolve that as well), which would alleviate some of this problem. While in general resolving the namespace definitions is obviously important, I suspect most RNeXML users don't care that the title is actually a dc:title, etc. Any reason why that is a bad idea?

It seems like it would certainly be a bad idea to apply the same solution to namespaces beyond a few default ones known to RNeXML, so we would still need a solution for the more general metadata. This is the crux of the problem: it is hard to do something intelligent with data without knowing anything about it beforehand. (Or perhaps there's already some clever linked-data solution to this?)

Currently, the NeXML gets mapped into R's S4 objects, so all meta nodes are available by subsetting the S4 object at least, e.g. nexml@trees[[1]]@meta[[1]]["property"] is the property of the first meta annotation on the first trees node. While this syntax is perhaps intuitive to most R users, it is pretty clumsy, particularly when we want to find the node that has a given property instead.

It is infinitely easier to use XPath expressions to navigate metadata, particularly when dealing with namespaces and complex queries. R has great XPath tools available through the XML package. Unfortunately, most R users won't be familiar with XPath, and this would mean using R's XML representation, rather than our S4 representation, of the NeXML data as our underlying data structure.

@rvosa
Contributor

rvosa commented Nov 15, 2013

I don't think I agree that the prefixes could/should be stripped and that most users don't care for them. I think namespaces are crucial, and once people are working in their own little ecosystem of relevant vocabularies they will actually prefer to be dealing with dc:, cdao:, dwc:, phen:, etc. There is really never going to be a universally agreed-upon default namespace for anything, including 'title'; it is part of the semweb workflow to acknowledge this and to see everything as scoped within some vocabulary/context. In fact, many of the larger ontology projects go entirely the other way and use opaque identifiers for terms (GO:367472, and so on), which to my mind is the more general case that should be made easy.

One thing that's going to be important, though, is the ability to reconstruct which prefix goes with which namespace URI: for valid curies there is no reason to expect that their prefix is going to be constant, so users who want to write portable code will have to be able to look up the prefix for a given vocab namespace, and use that variable prefix to construct curies.

@hlapp
Contributor

hlapp commented Nov 16, 2013

On Nov 15, 2013, at 3:22 PM, Rutger Vos wrote:

I don't think I agree that the prefixes could/should be stripped and that most users don't care for them.

I too disagree with the notion that namespaces should be stripped. Doing so only complicates everything else: all of a sudden, it is no longer clear what you mean by "title", so now we have to think about ways to still say what you mean after the fact, and we need to teach people that yes, you don't need to care about namespaces, but then again, you probably should. It kind of catapults us back into the nasty world of taxonomy, where we use labels as identifiers while knowing full well that they cannot be unique, and pretend to ourselves that this is the path to happiness.

That said, I do think there's something to be said for not requiring users to define (by URI) each and every namespace every time they write a script. The package could make use of http://prefix.cc to standardize on namespaces, sparing users the effort of defining them when they're well known anyway. For example, here's the one for Darwin Core: http://prefix.cc/dwc.

@cboettig
Member Author

@rvosa @hlapp Thanks both for weighing in here. Of course we aren't stripping the prefixes when performing the original parsing into the "nexml" S4 object, which corresponds 1:1 with the NeXML schema; so this is only a question of how the user queries that metadata. Currently we have a metadata function that simply extracts all the metadata at the specified level (nexml, otus, trees, tree, etc.) and returns a named character vector, in which each name corresponds to the rel or property and each value corresponds to the content or href, e.g.:

birds <- read.nexml("birdOrders.xml")
meta <- get_metadata(birds) 

prints the named vector with the top-level (default-level) metadata elements like so:

> meta 
##                                             dc:date 
##                                        "2013-11-17" 
##                                          cc:license 
## "http://creativecommons.org/publicdomain/zero/1.0/"

We can subset this by name, e.g. meta["dc:date"]. This is probably simplest for most R users, though exactly what the namespace prefix means may be unclear if they haven't worked with namespaces before. (The user can always print a summary of the namespaces and prefixes in the nexml file using birds@namespaces.)
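As a self-contained illustration of that subsetting, here is a mock of the named character vector that get_metadata() returns (values copied from the output above):

```r
# Mock of the named character vector shown above (get_metadata() output):
meta <- c("dc:date"    = "2013-11-17",
          "cc:license" = "http://creativecommons.org/publicdomain/zero/1.0/")

meta["dc:date"]  # subset by the prefixed property name
names(meta)      # which property/rel terms are present
```

Standard named-vector idioms (grepl on names(meta), etc.) then apply without any new API.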

This approach is simple, albeit a bit limited.

The R user has a much more natural and powerful way to handle these issues of prefixes and namespaces using either the XML or rrdf libraries. For instance, if we extract the meta nodes into RDF/XML, we could handle the queries like so:

xpathSApply(meta, "//dc:title", xmlValue)

which uses the namespace prefix defined in the nexml; or

xpathSApply(meta, "//x:title", xmlValue, namespaces=c(x = "http://purl.org/dc/elements/1.1/"))

defining the custom prefix x for the URI; or, by SPARQL query,

library(rrdf)
sparql.rdf(ex, "SELECT ?title WHERE { ?x <http://purl.org/dc/elements/1.1/title> ?title }")

Obviously the XPath or SPARQL queries are more expressive / powerful than drawing out the metadata from the S4 structure directly. On the other hand, because both of these approaches use just the distilled metadata, the original connection between metadata elements and the structure of the XML tree is lost unless stated explicitly. An in-between solution is to use XPath on the nexml XML instead, though I think we cannot make use of the namespaces in that case, since they appear in attribute values rather than structure.

Anyway, it's nice to have these options in R, particularly for more complex queries where we might want to make some use of the ontology as well. On the other hand, simple presentation of basic metadata is probably necessary for most users.

@rvosa
Contributor

rvosa commented Nov 19, 2013

I am certainly deformed by having played around with curies in the past, but to me both "dc:date" and "cc:license" seem perfect. Certainly much better than "date" (what, the fruit?) and "license" (to kill?), especially if the common namespaces use conventional prefixes, as @hlapp suggests.

But the SPARQL query that you're showing -- that would be so awesome to have, wow! Wouldn't the extracted RDF still have the subject's ID, though? So in principle you could still match up the NeXML DOM node with the extracted triple, right?

@cboettig
Member Author

@rvosa Excellent point -- I had failed to add the about attributes (#35).

I guess it's not strictly a violation of the standard to omit this, since it passes the check, but it is still something we want to be sure not to omit.

@cboettig
Member Author

Okay, sounds like we will support both the simple extraction as character vectors named with property and prefix, and then optionally provide the triplestore representation for users to SPARQL query the metadata if they so desire. For the manuscript it would be nice to have an example SPARQL query that is easy to understand intuitively but still does some logical inference. Any suggestions for a good example?

@rvosa
Contributor

rvosa commented Nov 19, 2013

I think in R it would be especially cool to do some sort of statistics over a set of DOM nodes that are annotated with the same predicate and value. Can we come up with a use case to compute a posterior over something, for example?

@cboettig
Member Author

@rvosa Could you expand a bit on what you think might be a useful SPARQL metadata example to illustrate? I was thinking something very simple to start: e.g. would it be possible to declare species scientific names, or genus, etc, on the otus, and then write a sparql query that would just return all those species belonging to some higher taxonomic level?

An example such as that would be easy to understand but still illustrate the ability for the computer to infer something not actually specified in the data (e.g. we haven't explicitly declared the higher taxonomic classification in the metadata).

Poking around a bit I saw some nice examples on the darwin core google-code repo: https://code.google.com/p/tdwg-rdf/wiki/SparqlReasoning, https://code.google.com/p/darwin-sw/wiki/SparqlExamples but a bit complex/abstract. Not having experience working with sparql myself, I think at this stage the goal would be mostly to illustrate the concept of reasoning outside the explicit data (from within R) rather than actually teach the query language.

@rvosa
Contributor

rvosa commented Dec 2, 2013

I think that's a great example. You mean something like: the annotations just give the species URIs, but we can run a query to get all the species within a genus? I don't really know what ontology you would use for that, though. Maybe @hlapp has suggestions?

@cboettig
Member Author

cboettig commented Dec 2, 2013

@rvosa Yeah, exactly. Or within an order, etc -- e.g. "give me all the frogs (order:Anura) in this nexml".

I was wondering if such a thing could be done in Darwin Core, but perhaps resolving that kind of query with RDF is harder than it sounds? It would need a triplestore of all species ids and their higher classification, plus the concepts that genus is childOf family is childOf order, etc.?

It might be more straightforward to query a service that provides the higher classification and then search directly against that, but that doesn't really illustrate any semantic reasoning...
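A rough sketch of what such a query might look like, assuming (purely hypothetically) that each otu is annotated with a taxon URI and that the triplestore also holds the taxonomy as a subClassOf hierarchy; the hasTaxon predicate and the example.org URIs below are invented placeholders, not terms from any actual NeXML file:

```sparql
# All OTUs whose taxon falls under Anura (frogs), walking the class
# hierarchy with a SPARQL 1.1 property path (predicate and URIs illustrative)
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?otu ?taxon
WHERE {
  ?otu   <http://example.org/hasTaxon> ?taxon .
  ?taxon rdfs:subClassOf* <http://example.org/taxon/Anura> .
}
```

Note that rdfs:subClassOf* relies on SPARQL 1.1 property-path support in the triplestore; without it, some form of subClassOf reasoning in the store would be needed instead.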

cboettig added a commit that referenced this issue Dec 2, 2013
@hlapp
Contributor

hlapp commented Dec 2, 2013

You would just need a corresponding species taxonomy ontology, such as NCBI:
http://www.obofoundry.org/cgi-bin/detail.cgi?id=ncbi_taxonomy

Or the VTO: https://code.google.com/p/vertebrate-taxonomy-ontology/
(canonical URI http://purl.obolibrary.org/obo/vto.owl)

We do this in the Phenoscape KB all the time.

@cboettig
Member Author

cboettig commented Dec 2, 2013

@hlapp very cool. Might take a bit to wrap my head around this -- Can you point me to an example nexml file that contains such meta elements (or just RDF) and maybe an example sparql query?

@hlapp
Contributor

hlapp commented Dec 2, 2013

On Dec 2, 2013, at 2:26 PM, Carl Boettiger wrote:

@hlapp very cool. Might take a bit to wrap my head around this -- Can you point me to an example nexml file that contains such meta elements (or just RDF) and maybe an example sparql query?

The Phenoscape NeXML data files (I sent a pointer previously) use the TTO and VTO ontologies for OTU designations. The RDF for the KB is fairly large and thus loaded directly into the triple store. I'll leave it to @balhoff to point to specific examples in the codebase, but note that such SPARQL queries will necessarily be very specific to the use case at hand. If you were wondering how to leverage OWL reasoning in SPARQL, many triple stores will support rdfs:subClassOf reasoning, and some also support SPARQL 1.1 property paths (recursion over transitive properties).

@cboettig
Member Author

cboettig commented Dec 2, 2013

@hlapp thanks. I see the use of VTO on the OTUs in the phenoscape examples files you mentioned earlier, e.g.
https://github.com/phenoscape/phenoscape-data/blob/master/Curation%20Files/completed-phenex-files/gardiner_1984.xml#L8 (though I cannot resolve the linked resource in the browser?)

So just to make sure I understand: based on that meta element I should in principle be able to write a SPARQL query that tells me that the otus include some Palaeonisciformes fish? By specific to the use case, do you mean that such a query would be specific to, say, the VTO ontology vs the NCBI one? Thanks all for bringing me up to speed on this.

Also, if there's a more obvious example or use case to illustrate OWL/SPARQL reasoning for a researcher unfamiliar with these concepts, always happy to hear other suggestions.

@hlapp
Contributor

hlapp commented Dec 2, 2013

On Dec 2, 2013, at 4:52 PM, Carl Boettiger wrote:

(though I cannot resolve the linked resource in the browser?)

Yes, individual term identifiers don't resolve right now. We have to register the ontology with Ontobee to make that work.
So just to make sure I understand: based on that meta element I should in principle be able to write a sparql query that tells me that the otus include some Palaeonisciformes fish?

Yes.
By specific to the use case, you mean that such a query would be specific to, say, the VTO ontology vs the NCBII one?

That, and there are different ways of representing the hierarchy (a class hierarchy, a hierarchy of individuals linked by a transitive property, or a mixture), as well as of connecting it to data. The SPARQL queries will differ quite a lot between them, yet achieve the same thing.

@cboettig
Member Author

Lots of good discussion in this thread. I've done my best at implementing a kind of tiered approach for creating and editing metadata, going from assuming the user knows nothing about namespaces up through adding their own namespaces or parsing metadata with XPath and SPARQL queries. I'm really not sure that my explanations are meaningful, though, so I would love feedback on both the Writing metadata and Reading metadata sections of the manuscript before closing this. I've put the "Writing" section before the "Reading" section; not sure if that's the easiest order or not.
