Anna review #10

Closed
cboettig opened this issue Feb 1, 2018 · 1 comment

cboettig commented Feb 1, 2018

Review Comments

This package is a great, lightweight addition for working with rdf and linked data in R. Coming after my review of the codemetar package, which introduced me to linked data, I found this a great learning experience in a topic I've become really interested in but am still quite a novice in, so I hope my feedback helps you appreciate that particular POV.

Overall I feel package functionality is complete and self-contained (apart from one error identified below). My main feedback is regarding documentation, specifically how it could be improved to help novice users to grasp the value of semantic data and better understand how the package works.

installation

  • The only install comment I'll add is that when I first ran install(pkg_dir, dependencies = T, build_vignettes = T), the building of the vignettes threw an error because suggests package ‘jqr’ had not been installed yet? It worked without build_vignettes = T

hmm, that's curious. jqr is listed in Suggests so it should be picked up by setting dependencies = TRUE before install attempts to build the vignette. Anyway glad it worked once jqr was installed.

tests and checks

All OK

documentation

My main suggestion is to try to define some terms and improve the concept map for the tools by adding some detail and broader context to the documentation. The following suggestions could also be addressed with links to further details if you think they are too superfluous for explicit documentation with the package.

  • a brief intro to the semantic web could be useful (eg something like):

The semantic web aims to link data in a machine readable way through the web, making data more alignable and interoperable, and much easier to search, enrich and compute on.

  • what a graph format for data is (eg triples etc).
  • the structure of an rdf S3 object
    (ie you introduce some aspects of the data format here: "user does not have to manage world, model and storage objects by default just to perform standard operations and conversions", which we are told we can ignore (which is great) but which actually creates more questions... what is this mysterious "world" object that forms an opaque slot of an rdf S3 object?) It would be nice to briefly explain the structure of the S3 rdf object. Is there useful metadata that can be extracted from the structure? (see comment later)
  • rdf file formats.
    I think it would especially aid in appreciating the rdf_serialize function to expand briefly (and potentially signpost to a resource like this) on the various serialization formats, perhaps even why one would use one over another, and particularly why serialization involves writing a file out. I feel these are important concepts to help appreciate use cases of the function. Indeed, the file-out aspect of the function could do with being flagged more prominently in the function man page. Just by looking at the description (quite cryptic if you don't know what serialization is) and running the example, you end up writing a file without realising. But that's also just my test code, ask questions later approach 😜
  • Similarly, parsing can then be seen/described as reading in/encoding an rdf from one of these specific string formats.

Spelling a few things out explicitly and in plain English could really help folks follow what's going on and understand which file types are inputs or outputs of different functions.

I agree 100% on this, and hope the new vignette is a move in the right direction. At this stage, I'd love to get your feedback, though it's not quite done/polished. If this is interesting to you, I'm debating fleshing this out just a bit more into a wee paper for R Journal; maybe you and Bryce would be interested in polishing it up and being co-authors?
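To make the file-out / file-in point above concrete, here is a minimal sketch of a serialize-then-parse round trip. It assumes rdflib's rdf(), rdf_add(), rdf_serialize() and rdf_parse() with the argument names used in the package documentation; the example.org URI and the temporary file are made up for illustration.

```r
library(rdflib)

# Build a one-triple graph by hand (the example.org URI is hypothetical)
x <- rdf()
x <- rdf_add(x,
             subject = "http://example.org/dataset1",
             predicate = "http://purl.org/dc/terms/title",
             object = "Example dataset")

# Serializing writes the graph OUT to a file in the chosen string format
path <- tempfile(fileext = ".nq")
rdf_serialize(x, doc = path, format = "nquads")

# Parsing reads a file in one of those formats back IN as an rdf object
y <- rdf_parse(doc = path, format = "nquads")
```

Here the input of rdf_serialize() is an rdf object and its output is a file on disk, while rdf_parse() goes the other way.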

how do I find info on URIs?

Some signposting/guidance on how I can find information on the semantics dictating what information I can extract from an rdf object would be really useful. eg. with a df or list you could use str to get an idea of how you could start indexing these objects. If confronted with a local rdf file, how would one go about figuring out even what they can query? I appreciate this is really one of the difficulties of working with rdf and semantic data in general (the flipside to the ease of being able to make unstructured queries is that we need to know how data are labelled) but I feel some brief guidance or demo on how one would approach this would go a long way.

agree 100%, now covered in the new vignette, thanks so much for the suggestions!

examples in general

For clarity to the reader who may not have looked at function documentation yet, I recommend using the full argument names when supplying arguments to functions in vignettes (if not always, at least the first time an argument is introduced).

SPARQL queries to JSON data section

At the end of the intro to the section, you write:

Here is a query that for all papers where I am an author, returns a table of given name, family name and year of publication:

Am I right in thinking though that you are co-author on all papers in the rdf but the query is in fact filtering the names of your co-authors? (through FILTER ( ?coi_family != "Boettiger" ))

Yes. (though this may not be relevant if the new vignette replaces the old one.)

Turning RDF-XML into more friendly JSON

It would be nice, if possible, to see sample printouts of the conversion of the different files, or at least of the effect of compaction.

In new vignette

rdf_add man page

  • Would be nice to see a demo of using one or more of the additional arguments.

Added!

Motivating example

I think an additional, more detailed motivating example might illustrate a more direct use case in a researcher's workflow. In particular, it would be good to highlight the great potential of triplestore APIs (and celebrate the efforts of many cool linked data initiatives, eg governmental ones). So an example that incorporates a query to a triplestore and then enrichment of a researcher's data could be a cool example. This could be a longer term project or even just an rOpenSci blogpost, but see comment re: rdf_query function below.

Yes!! Really curious what you think of the vignette; still fleshing this out a bit, but at this stage it would be great to get more input.

functionality

  • Serialising to turtle or trig throws an error

turtle fixed, trig deprecated.

  • In rdf_query, is there a way to return a non-regularised query result, ie return an rdf instead?
    I'm thinking about a use case where maybe it's better to enrich data by merging rdfs? ie, a researcher queries a triplestore through an API (yeyyy open data!), combines their not fully matching but interoperable rdf data with rdf_add (ie try to show how a triplestore is better than tabular non-linked data for merging) and then queries the merged rdf to extract an enriched analytical tabular dataset?

Great question. First, you inspired me to add a c() method for rdf, so a user can create a larger rdf by concatenating two smaller ones.

Second, the new vignette discussed above should do a much better job of illustrating / motivating the whole "it's easier to merge rdf data" theme.

Third, can we have SPARQL return rdf? In short, no, because the return value of many queries won't be a whole triple; often it's just the object, say. Hopefully this is a "good thing", because a user will probably ultimately want to extract a data.frame from said triplestore (for munging, plotting, etc); you can't do much with the rdf object itself. One should basically imagine rdf_query() as the method for getting data.frames / data rectangles from RDF graphs.
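As a rough sketch of the merge-then-query workflow described above (not the vignette's example): two small graphs are concatenated with c() and then flattened into a table with rdf_query(). The example.org subject and the literal values are invented for illustration, and the sketch assumes that c() and rdf_query() behave as described in this thread.

```r
library(rdflib)

# Graph a: what the researcher already has (hypothetical data)
a <- rdf()
a <- rdf_add(a,
             subject = "http://example.org/alice",
             predicate = "http://schema.org/name",
             object = "Alice")

# Graph b: extra facts about the same subject, e.g. fetched from elsewhere
b <- rdf()
b <- rdf_add(b,
             subject = "http://example.org/alice",
             predicate = "http://schema.org/email",
             object = "alice@example.org")

# Concatenate the two graphs; no join key is needed beyond the shared subject URI
merged <- c(a, b)

# rdf_query() returns a rectangle (one column per SPARQL variable), not an rdf
query <- "SELECT ?name ?email WHERE {
  ?s <http://schema.org/name> ?name .
  ?s <http://schema.org/email> ?email
}"
result <- rdf_query(merged, query)
```

The query only matches because both triples end up in the same graph, which is the point of merging before querying.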

Tests

  • Add tests for being able to serialise to trig and turtle, which at the moment throws an error? Perhaps a test for parsing/serialising each format would be good. Also, perhaps worth checking whether eg rdf_parse(format="turtle") is working.

Indeed! Test coverage didn't reflect this oversight. All methods are now tested in both parse and serialize.
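A per-format round-trip check of the kind suggested above could look something like the following. This is a base-R sketch rather than the package's actual test suite; the format names are assumed to match those accepted by rdf_serialize() and rdf_parse(), and the example.org URI is made up.

```r
library(rdflib)

# One-triple graph used as the fixture (hypothetical URI)
x <- rdf()
x <- rdf_add(x,
             subject = "http://example.org/dataset1",
             predicate = "http://purl.org/dc/terms/title",
             object = "Example dataset")

# Helper: count all triples in a graph via a catch-all SPARQL query
count_triples <- function(g) {
  nrow(rdf_query(g, "SELECT ?s ?p ?o WHERE { ?s ?p ?o }"))
}

# Serialize to each format, parse it back, and count what survived
formats <- c("rdfxml", "nquads", "ntriples", "turtle")
round_trip_counts <- vapply(formats, function(fmt) {
  path <- tempfile()
  on.exit(unlink(path))
  rdf_serialize(x, doc = path, format = fmt)
  count_triples(rdf_parse(doc = path, format = fmt))
}, integer(1))

# Every format should give back exactly the one triple we put in
stopifnot(all(round_trip_counts == 1L))
```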

@cboettig
Member Author

All resolved and approved.
