Implement metadata extraction/parsing tools #20
I think the former is more immediately useful than the latter. For example,
Metadata extraction function wish-list:
We might illustrate counting trees, tips, taxa, etc., but this is most naturally done using existing approaches for existing formats rather than providing a dedicated function. Should probably break these into separate issues...
Some methods are now provided to extract metadata. All could be improved, as I comment below.
nex <- nexml_read("nexml.xml")
meta <- get_metadata(nex)
meta["dc:title"]

We might need to think about how we handle namespaces, though. Considering providing explicit functions for the most common metadata.
Also thinking about naming these functions without the
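A minimal sketch of what one of those explicit convenience functions might look like, simply wrapping get_metadata() so the user never types the "dc:" prefix themselves; get_title() is a hypothetical name for illustration, not an existing function:

library(RNeXML)

# hypothetical convenience accessor: hides the curie behind a plain function name
get_title <- function(nexml) {
  meta <- get_metadata(nexml)    # named character vector, names are curies like "dc:title"
  unname(meta["dc:title"])       # return just the value
}

nex <- nexml_read("nexml.xml")
get_title(nex)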
This looks really great, though the namespaces thing would need some work:

nex <- nexml_read("nexml.xml")
meta <- get_metadata(nex)
meta["dc:title"]

Dr. Rutger A. Vos
Figuring out the right thing to do for the metadata parsing is quite tricky. In particular, it is difficult to be both flexible and user friendly. For instance, as you point out in the example above, the user doesn't want to bother with namespaces they've never heard of -- e.g. it would make way more sense to most users to call something like meta["title"] than meta["dc:title"]. I suppose we could strip the prefixes from all the default namespaces (might cause some trouble with overlap between terms in prism and dublin core? perhaps we could resolve that as well), which would alleviate some of this problem. While in general resolving the namespace definitions is obviously important, I suspect most RNeXML users don't care that the title is actually a dc:title.

It seems like it would certainly be a bad idea to apply the same solution to namespaces beyond a few default ones known to RNeXML, so we would still need a solution for the more general metadata. This is the crux of the problem: it is hard to do something intelligent with data without knowing anything about it beforehand. (Or perhaps there's already some clever linked-data solution to this?)

Currently, the NeXML gets mapped into R's S4 objects, so all meta nodes are at least available by subsetting the S4 object directly. It is infinitely easier to use XPath expressions to navigate metadata, particularly when dealing with namespaces and complex queries, and R has great XPath tools available through the XML package. Unfortunately, most R users won't be familiar with XPath, and this means using R's XML representation, rather than our S4 representation, of the NeXML data as our underlying data structure.
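For concreteness, a rough sketch of the "in-between" XPath-on-the-nexml route mentioned above, using the XML package. Because NeXML puts the term in the property attribute of each meta element rather than in the element name, the query matches on the attribute value; the file name and the nex prefix binding below are placeholders for illustration.

library(XML)

doc <- xmlParse("birdOrders.xml")
ns  <- c(nex = "http://www.nexml.org/2009")

# pull the content of every meta node whose property is dc:title
xpathSApply(doc, "//nex:meta[@property='dc:title']",
            xmlGetAttr, "content", namespaces = ns)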
I don't think I agree that the prefixes could/should be stripped and that most users don't care for them. I think namespaces are crucial, and once people are working in their own little ecosystem of relevant vocabularies they will actually prefer to be dealing with dc:, cdao:, dwc:, phen:, etc. There is really not ever going to be a universally agreed-upon default namespace for anything, including 'title'; it is part of the semweb workflow to acknowledge this and to see everything as scoped within some vocabulary/context. In fact, many of the larger ontology projects go entirely the other way and use opaque identifiers for terms (GO:367472, and so on), which to my mind is the more general case that should be made easy.

One thing that's going to be important, though, is the ability to reconstruct which prefix goes with which namespace URI: for valid curies there is no reason to expect that their prefix is going to be constant, so users who want to write portable code will have to be able to look up the prefix for a given vocab namespace, and use that variable prefix to construct curies.
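A small sketch of that prefix lookup, under the assumption that the namespaces declared in a document are available as a named character vector of prefix-to-URI pairs; the get_namespaces() accessor name is an assumption here, not a confirmed function:

# look up whatever prefix the document binds to a vocabulary URI and build the curie
curie_for <- function(ns, uri, term) {
  prefix <- names(ns)[ns == uri][1]              # first prefix bound to this URI, if any
  if (is.na(prefix)) stop("no prefix declared for ", uri)
  paste0(prefix, ":", term)
}

ns   <- get_namespaces(nex)                      # assumed accessor: prefix -> URI
meta <- get_metadata(nex)
meta[curie_for(ns, "http://purl.org/dc/elements/1.1/", "title")]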
On Nov 15, 2013, at 3:22 PM, Rutger Vos wrote:
I too disagree with the notion that namespaces should be stripped. Doing so only complicates everything else: all of a sudden, it is no longer clear what you mean by "title", and so now we have to think about ways to still say what you mean after the fact, and we need to teach people that yes, you don't need to care about namespaces, but then again, you probably should. It kind of catapults us back into the nasty world of taxonomy where we use labels as identifiers, while knowing full well that they cannot be unique, and pretend to ourselves that this is the path to happiness.

That said, I do think there's something to be said for not asking people to define (by URI) each and every namespace each and every time they write a script. The package could make use of http://prefix.cc to standardize on namespaces, sparing users the effort of defining them when they're well known anyway. For example, here's the one for Darwin Core: http://prefix.cc/dwc.
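A minimal sketch of the prefix.cc idea; the .file.json endpoint used below is an assumption about the service's URL scheme, and jsonlite is just one convenient way to read it:

library(jsonlite)

# resolve a well-known prefix to its namespace URI via prefix.cc
lookup_prefix <- function(prefix) {
  unlist(fromJSON(paste0("http://prefix.cc/", prefix, ".file.json")))
}

lookup_prefix("dwc")   # should return the Darwin Core namespace URI, named by its prefix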
@rvosa @hlapp Thanks both for weighing in here. Of course we aren't stripping the prefixes when performing the original parsing into the "nexml" S4 object, which corresponds 1:1 with the NeXML schema; so this is only a question of how the user queries that metadata. Currently we have:

birds <- read.nexml("birdOrders.xml")
meta <- get_metadata(birds)

which prints the named string with the top-level (default-level) metadata elements as so:

> meta
## dc:date
## "2013-11-17"
## cc:license
## "http://creativecommons.org/publicdomain/zero/1.0/"

which we can subset by name, e.g. meta["cc:license"]. This approach is simple, albeit a bit limited. For instance, the R user has a much more natural and powerful way to handle these issues of prefixes and namespaces using either the XML or rrdf libraries. For instance, if we extract meta nodes into RDF-XML, we could handle the queries like so:

xpathSApply(meta, "//dc:title", xmlValue)

which uses the namespace prefix defined in the nexml; or

xpathSApply(meta, "//x:title", xmlValue, namespaces = c(x = "http://purl.org/dc/elements/1.1/"))

defining a custom prefix; or, with SPARQL:

library(rrdf)
sparql.rdf(ex, "SELECT ?title WHERE { ?x <http://purl.org/dc/elements/1.1/title> ?title }")

Obviously the XPath or SPARQL queries are more expressive / powerful than drawing out the metadata from the S4 structure directly. On the other hand, because both of these approaches use just the distilled metadata, the original connection between metadata elements and the structure of the XML tree is lost unless stated explicitly. An in-between solution is to use XPath on the nexml XML instead, though I think we cannot make use of the namespaces in that case, since they appear in attribute values rather than in the structure.

Anyway, it's nice to have these options in R, particularly for more complex queries where we might want to make some use of the ontology as well. On the other hand, simple presentation of basic metadata is probably necessary for most users.
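In the sparql.rdf() call above, `ex` is presumably a triplestore built from the extracted metadata; a minimal sketch of how it might be created, assuming the meta nodes have been serialized as RDF/XML to a file ("meta.rdf" is a placeholder name):

library(rrdf)

# load the extracted RDF/XML metadata into an in-memory store, then query it
ex <- load.rdf("meta.rdf", format = "RDF/XML")
sparql.rdf(ex, "SELECT ?title WHERE { ?x <http://purl.org/dc/elements/1.1/title> ?title }")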
I am certainly deformed by having played around with curies in the past. But the SPARQL query that you're showing - that would be so awesome to
Okay, sounds like we will support both the simple extraction as character vectors named with property and prefix, and then optionally provide the triplestore representation for users to SPARQL query the metadata if they so desire. For the manuscript it would be nice to have an example SPARQL query that is easy to understand intuitively but still does some logical inference. Any suggestions for a good example?
I think in R it would be especially cool to do some sort of statistics over
@rvosa Could you expand a bit on what you think might be a useful SPARQL metadata example to illustrate? I was thinking something very simple to start: e.g. would it be possible to declare species scientific names, or genus, etc., on the otus, and then write a sparql query that would just return all those species belonging to some higher taxonomic level? An example such as that would be easy to understand but still illustrate the ability of the computer to infer something not actually specified in the data (e.g. we haven't explicitly declared the higher taxonomic classification in the metadata). Poking around a bit I saw some nice examples on the Darwin Core google-code repo: https://code.google.com/p/tdwg-rdf/wiki/SparqlReasoning, https://code.google.com/p/darwin-sw/wiki/SparqlExamples, but they are a bit complex/abstract. Not having experience working with sparql myself, I think at this stage the goal would be mostly to illustrate the concept of reasoning outside the explicit data (from within R) rather than actually teach the query language.
I think that's a great example. You mean something like: the annotations
@rvosa Yeah, exactly. Or within an order, etc. -- e.g. "give me all the frogs (order:Anura) in this nexml". I was wondering if such a thing could be done in Darwin Core, but perhaps resolving that kind of query with RDF is harder than it sounds? We would need a triplestore of all species ids and higher classification, and concepts like the fact that a given genus belongs to a given order. It might be more straightforward to query a service that provides the higher classification and then search directly against that, but that doesn't really illustrate any semantic reasoning...
You would just need a corresponding species taxonomy ontology, such as NCBI, or the VTO: https://code.google.com/p/vertebrate-taxonomy-ontology/. We do this in the Phenoscape KB all the time.
@hlapp very cool. Might take a bit to wrap my head around this -- can you point me to an example nexml file that contains such meta elements (or just RDF) and maybe an example sparql query?
On Dec 2, 2013, at 2:26 PM, Carl Boettiger wrote:
@hlapp thanks. I see the use of VTO on the OTUs in the phenoscape example files you mentioned earlier. So just to make sure I understand: based on that meta element I should in principle be able to write a sparql query that tells me that the otus include some Palaeonisciformes fish? By "specific to the use case", you mean that such a query would be specific to, say, the VTO ontology vs the NCBI one? Thanks all for bringing me up to speed on this. Also, if there's a more obvious example or use case to illustrate OWL/SPARQL reasoning for a researcher unfamiliar with these concepts, always happy to hear other suggestions.
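To make the shape of such a query concrete, a hedged sketch: it assumes the OTU annotations and the taxonomy ontology (e.g. the VTO) have been combined into one triplestore, that the SPARQL engine supports SPARQL 1.1 property paths, and that every URI and file name shown is a placeholder rather than a real term:

library(rrdf)

# "combined.rdf" stands in for the nexml metadata merged with the taxonomy ontology
store <- load.rdf("combined.rdf", format = "RDF/XML")

# every OTU whose taxon falls under the target higher taxon (e.g. Anura),
# found via the subclass hierarchy of the ontology rather than anything
# stated explicitly in the nexml file
sparql.rdf(store, "
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?otu ?taxon WHERE {
    ?otu <http://example.org/taxon> ?taxon .
    ?taxon rdfs:subClassOf* <http://example.org/Anura> .
  }")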
On Dec 2, 2013, at 4:52 PM, Carl Boettiger wrote:
Lots of good discussion in this thread. I've done my best at implementing a kind of tiered approach for creating and editing metadata, going from assuming the user knows nothing about namespaces up through adding their own namespaces or parsing metadata with
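As a sketch of what those tiers might look like in use -- the helper names below (meta(), add_meta(), add_namespaces(), and the title/creator arguments to nexml_write()) are illustrative assumptions rather than a statement of the final API:

library(RNeXML)
data(bird.orders, package = "ape")

# Tier 1: no namespace knowledge needed -- plain arguments mapped to default terms
nexml_write(bird.orders, file = "birds.xml",
            title = "Bird orders", creator = "Carl Boettiger")

# Tier 2: an explicit property from a vocabulary the package already knows
nex <- nexml_read("birds.xml")
nex <- add_meta(meta(property = "dc:description",
                     content = "Phylogeny of bird orders"), nex)

# Tier 3: a user-defined namespace, then metadata scoped to that vocabulary
nex <- add_namespaces(c(skos = "http://www.w3.org/2004/02/skos/core#"), nex)
nex <- add_meta(meta(property = "skos:note",
                     content = "converted from an ape::phylo object"), nex)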
We will want either/both of:
Additionally, we might allow some automatic calculation of metadata such as the number of trees in a file, number of taxa, names of taxa, etc. (e.g. along the lines of the TreeBase metadata / PhyloWS terms); a rough sketch of computing such counts appears after this list. Perhaps we should add this summary data in meta nodes, or is that asking for trouble?
Really need to enumerate the use-cases for leveraging this metadata (may involve thinking more about additional metadata we want to add).
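As referenced above, a rough sketch of how such summary figures could be computed directly from the file with the XML package; the file name is a placeholder, and the nex prefix is bound to the NeXML namespace:

library(XML)

doc <- xmlParse("birdOrders.xml")
ns  <- c(nex = "http://www.nexml.org/2009")

# count tree nodes and collect the labels of the otu elements
n_trees <- length(getNodeSet(doc, "//nex:tree", namespaces = ns))
taxa    <- xpathSApply(doc, "//nex:otu", xmlGetAttr, "label", namespaces = ns)

c(trees = n_trees, taxa = length(taxa))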