SPARQL use case #73

Closed
rvosa opened this issue Jun 26, 2014 · 16 comments

@rvosa
Contributor

rvosa commented Jun 26, 2014

Now that it is so relatively painless to extract RDF and run SPARQL queries on it (incidentally: great job, supercool) I think it would be good to develop a more persuasive use case to demonstrate the power of this facility.

Here's an idea: let's say we have a tree, some trait data and some occurrences for a set of species. As usual, after all the data cleaning, we find that the species in the tree, in the trait data and the occurrences are only partially overlapping. It ought to be possible to extract the union of the taxa across these different data sources by way of a query.
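For reference, here is a minimal sketch in plain R (no SPARQL yet) of the result such a query should reproduce. The three taxon vectors are made-up placeholder data, not taken from any published data set; the point is only to pin down what "union" and "overlap" mean across the three sources.

```r
# Hypothetical taxon lists from three partially overlapping sources
sources <- list(
  tree       = c("Homo_sapiens", "Pan_troglodytes", "Gorilla_gorilla"),
  trait      = c("Homo_sapiens", "Pan_troglodytes", "Pongo_abelii"),
  occurrence = c("Pan_troglodytes", "Gorilla_gorilla", "Pongo_abelii")
)

# Every taxon mentioned by at least one source
all_taxa <- Reduce(union, sources)

# Taxa present in all three sources
shared_taxa <- Reduce(intersect, sources)

# Presence/absence table: one row per taxon, one column per source
coverage <- sapply(sources, function(x) all_taxa %in% x)
rownames(coverage) <- all_taxa
print(coverage)
```

A SPARQL version would have to return the same `all_taxa` set, with the advantage that name matching could go through shared taxon URIs rather than exact label strings.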

What do you guys think - is that the coolest we can come up with (hopefully not?) and do we have some published data lying around that we could use to demonstrate this?

@cboettig
Member

cboettig commented Jul 1, 2014

A meaningful SPARQL example would be great. Your proposed use case does sound like a common one that many researchers could relate to. My only thought is that most R users would be more familiar with simply importing the tree and trait data, etc., and then extracting the union (the treedata() function from the geiger package is probably the most common way users handle this, though it assumes that the tree and the trait data use perfectly matching species names).

I was wondering if we might have an example that emphasizes the logical reasoning of SPARQL that doesn't have an immediate SQL-like analog. For instance, a query that makes use of some ontology in identifying which species listed in the target dataset are a member of the queried taxonomic class or something (e.g. see our earlier thread: #20 (comment) ). Maybe that would be involved in the use case you already described.

Will give a thought to some good published data examples.

@rvosa
Contributor Author

rvosa commented Jul 2, 2014

Reasoning would be really great but might be hard to demonstrate - do you
know of any reasoning engines that are exposed to R?


@rvosa
Contributor Author

rvosa commented Jul 4, 2014

With commit a7c8ffd I have added some example data which I believe might be interesting to demonstrate (recursive?) SPARQL queries.

The NeXML file primates.xml contains a supertree of the Primates. The otus block contains both the terminal taxa and the higher taxa (genus through order). The nodes in the tree link to these taxa, so interior nodes may also have otu attributes that correspond with taxa (provided the tree makes these taxa monophyletic).

The general idea is that we should be able to query for all the members of a higher taxon - so given the URI of the higher taxon, give me all the direct descendants that specify rdfs:subClassOf for that taxon. Secondly, it might be nice to then be able to extract the subtree for those taxa (and plot it?), or show recursive calls to traverse the taxonomy.
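As a rough sketch of the first step, the query for direct children could be built like this. The `rdfs:` namespace is the standard RDF Schema one; the function name `children_query` is hypothetical, and the assumption that children point at their parent taxon URI via `rdfs:subClassOf` is taken from the description above, not verified against the extracted RDF. Running the query would still require a SPARQL engine, which is not shown here.

```r
# Build a SPARQL query that selects the direct children of a higher taxon,
# i.e. every subject that declares rdfs:subClassOf for the given URI.
children_query <- function(taxon_uri) {
  sprintf(
    paste(
      "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
      "SELECT ?child WHERE {",
      "  ?child rdfs:subClassOf <%s> .",
      "}",
      sep = "\n"
    ),
    taxon_uri
  )
}

# An NCBI taxon URI in the form that appears in this thread
query <- children_query("http://ncbi.nlm.nih.gov/taxonomy/34827")
cat(query, "\n")
```

Calling the same function on each returned `?child` gives the recursive traversal of the taxonomy.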

Unfortunately, there appear to be some bugs in how the RDF is extracted. In particular, the namespace prefixes are not extracted correctly in the file primates_meta.xml.

What we should be getting is:

xmlns:concept="http://rs.tdwg.org/ontology/voc/TaxonConcept#"
<concept:toTaxon rdf:resource="http://ncbi.nlm.nih.gov/taxonomy/34827"/>

But instead we are getting:

xmlns:ns1="concept:"
<ns1:rank rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonRank#Species"/>

I gather that this RDF is obtained by posting the NeXML to a web service, so its output is out of our control. I would like to suggest an alternative that could build on commit e3845d6. In that commit I have added an XSL stylesheet that extracts RDF/XML from RDFa. The output it produces is valid, and we should be able to run it locally, probably with better performance. However, this means we would create a dependency on a library that can process XSL stylesheets, such as this one: http://www.omegahat.org/Sxslt/

@rvosa
Contributor Author

rvosa commented Jul 4, 2014

With commit d61a0c5 I have added an example that shows how we can query the valid RDF/XML that the XSL stylesheet produces. The example shows how you can fetch the taxon whose taxonomic rank is "Order", and return the corresponding NCBI taxon URI. Subsequently, with that URI, the example shows how to fetch its children.

A person who actually knows R (so, not me ;-)) would be able to take these examples and write a simple recursive traversal from the root to the tips. As the URIs of the subjects in this graph are constructed from the id attributes in the input NeXML, it ought to be possible to get the taxa and tree nodes that correspond to these RDF subjects, e.g. to extract subtrees and plot them.

@rvosa
Contributor Author

rvosa commented Jul 4, 2014

I played around with sparql.R a bit more. It is failing, but I hope someone will be able to get the recursion to work so it generates a newick string which we then plot. Bonus points if the newick string can have the taxon names from the original NeXML.

@cboettig
Member

cboettig commented Jul 6, 2014

Very cool!! Look forward to digging in to your example when I'm back.



@rvosa
Contributor Author

rvosa commented Jul 7, 2014

As of 81da59b, the RDF/XML taxonomy is traversed by recursive SPARQL queries, whose results are serialized to a Newick string with unbranched interior nodes, no branch lengths, and (optionally) interior node labels. In other words: it's a classification tree, which can be plotted as a cladogram, as the example shows. I think this would be a pretty neat use case for the supplementary materials: it's a bit too long (72 lines) to put in the MS body. To clean this up I am going to need a little more help, still:

  • update get_rdf to use the XSL stylesheet instead of the web service
  • make my code idiomatic R that plays nice with the rest of the package (coding style, relative paths etc.)
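To illustrate the serialization step only, here is a minimal sketch in base R. It is not the code in sparql.R: the `edges` table is hypothetical toy data standing in for the results of the recursive SPARQL child queries, and `children_of` and `to_newick` are made-up names.

```r
# Toy parent/child table (hypothetical; stands in for SPARQL query results)
edges <- data.frame(
  parent = c("Primates", "Primates", "Strepsirrhini",
             "Indriidae", "Avahi", "Haplorrhini"),
  child  = c("Strepsirrhini", "Haplorrhini", "Indriidae",
             "Avahi", "Avahi_laniger", "Homo_sapiens"),
  stringsAsFactors = FALSE
)

# Direct children of a taxon; in the real example this is a SPARQL query
children_of <- function(taxon) edges$child[edges$parent == taxon]

# Recursively serialize the classification below `taxon` as Newick.
# Taxa with a single child become "unbranched" interior nodes, and no
# branch lengths are written.
to_newick <- function(taxon, label_interior = TRUE) {
  kids <- children_of(taxon)
  if (length(kids) == 0) return(taxon)  # tip: just the name
  inner <- paste(
    vapply(kids, to_newick, character(1), label_interior = label_interior),
    collapse = ","
  )
  paste0("(", inner, ")", if (label_interior) taxon else "")
}

newick <- paste0(to_newick("Primates"), ";")
cat(newick, "\n")
```

Note the `(Avahi_laniger)Avahi` pattern in the output: a labelled interior node with a single child, which not every Newick parser accepts.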

cboettig added a commit that referenced this issue Jul 9, 2014
This commit updates travis and documentation regarding the Sxslt dependency
Also updates the rdf unit test (by removing the XPath query and by checking document type)
Addresses #73.

Should probably update sparql.R example to actually use this new get_rdf() command
instead of starting with the already extracted RDF
@cboettig
Member

cboettig commented Jul 9, 2014

Very nice. I've just updated get_rdf, and will:

  • go over the code idioms to make the example a bit more native.
  • add this to the manuscript appendix (referencing it appropriately from the SPARQL section).
  • move sparql.R into a demo/ directory (the usual place for such scripts in R packages, which allows them to be run interactively from the R console via demo(); inst/examples is a more generic dumping ground for things that aren't necessarily R scripts).

@rvosa
Contributor Author

rvosa commented Jul 9, 2014

Excellent! Sorry I don't know the conventions (yet), but it's fun to learn
them.


@cboettig cboettig added this to the Manuscript milestone Jul 17, 2014
@cboettig
Member

@rvosa I was just thinking about trying to make the figure generated by sparql.R a bit easier to read, but I am running into trouble. My first thought was to plot just the internal node names (higher taxa levels), which would mean fewer labels crowding the plot and would also make it clear that the cladogram just reflects the taxonomy.

I followed the suggestion in your code about adding get_name(id) to the recurse function definition so I have a Newick tree with internal nodes labeled, but that seems to be giving me a Newick tree that I cannot parse for some reason. Maybe you can have a quick look? Thanks much!

cboettig added a commit that referenced this issue Jul 17, 2014
doesn't run because of error in newick parsing now
@cboettig
Member

@rvosa For quick reference, here's the Newick file I get when trying to add the node labels; not sure why it fails to parse (either using phylobase::readNewick, which uses the nexus class library, or using phytools::read.newick): https://github.com/ropensci/RNeXML/blob/96add29b379748a6dae302c483e6bbaf25297a7e/inst/examples/sparql.newick

@rvosa
Contributor Author

rvosa commented Jul 19, 2014

The tree description is valid in principle (you can paste it into figtree,
for example), but some of the newick parsers that I've played around with
seem to be picky about three things: i) the lack of branch lengths; ii) the
"unbranched" interior nodes; iii) the node labels.


@cboettig
Member

@fmichonneau Maybe you have some idea why I can't parse this Newick file successfully in R? E.g. with phylobase:

download.file("https://github.com/ropensci/RNeXML/raw/96add29b379748a6dae302c483e6bbaf25297a7e/inst/examples/sparql.newick", destfile = "sparql.newick", method = "wget")
phylobase::readNewick("sparql.newick")

Gives me:

Warning:  
 A TAXA block should be read before the TREES block (but no TAXA block was found).  Taxa will be inferred from their usage in the TREES block.
at line 1, column (approximately) 5105 (file position 5104)
storing implied block: TAXA
storing read block: TREES
Error: index out of bounds
In addition: Warning message:
In FUN(X[[1L]], ...) : NAs introduced by coercion

though it seems like a valid tree (e.g. can be read into figtree)...

@fmichonneau
Member

I think this is a bug in ape. Unfortunately, phylobase still relies on ape to parse the tree string: it uses NCL to extract information about the taxa, branch lengths, labels, etc., but relies on ape to convert the parentheses and commas into an R object. Apparently, ape doesn't support edge labels on terminal edges. To have edge labels on terminal edges, taxa need to be in parentheses by themselves, like so: (Avahi_laniger)Avahi,(... However, ape apparently does not support this:

ape::read.tree(text="(1,(2,3));")

gives


Phylogenetic tree with 3 tips and 2 internal nodes.

Tip labels:
[1] "1" "2" "3"

Rooted; no branch lengths.

But

ape::read.tree(text="((1),(2,3));")

gives

Error in if (sum(obj[[i]]$edge[, 1] == ROOT) == 1 && dim(obj[[i]]$edge)[1] >  : 
  missing value where TRUE/FALSE needed

This works with the phytools parser:

phytools::read.newick(text="((1),(2,3));")

but the string from the example doesn't work (R hangs).

I reported the ape bug to Emmanuel.

@cboettig
Member

Okay, thanks for taking a look. Yeah, I'd given phytools a try too and I
ping'd Liam about the issue. Keep me posted if you figure anything out
from Emmanuel, but nothing mission critical here.


@cboettig
Member

Okay, with Liam's bugfix http://blog.phytools.org/2014/07/new-version-of-readnewick-that-can-read.html we can read the tree in and just plot internal node labels to avoid over-crowding the figure (see https://github.com/ropensci/RNeXML/blob/devel/manuscripts/supplement.Rmd#L330)

I think we have a nice sparql use case now. We could possibly use a bit more text around this example, but I'll wait for others to weigh in.
