Implement utilities for working with codemeta files #3

Open
cboettig opened this issue Mar 31, 2017 · 6 comments

Comments

@cboettig
Member

A standard program interface would parse codemeta files, probably transform them into a standard tree structure using jsonld::jsonld_frame() (see codemeta/codemeta#128), and perhaps provide helper utilities for extracting data of interest, e.g. generating data.frame representations of metadata over a large set of codemeta files (though maybe that's best left to a vignette documenting a basic JSON parsing strategy with purrr).
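
As a rough sketch of the tabular helper described above, the snippet below turns a few codemeta documents into rows that could be loaded into a data frame. It is written in plain Python for illustration rather than with the R tooling discussed here; the function name `codemeta_rows`, the chosen fields, and the sample documents are all hypothetical.

```python
import json

# Hypothetical sample documents standing in for codemeta.json files on disk.
SAMPLE_DOCS = [
    '{"@type": "SoftwareSourceCode", "name": "pkgA", "version": "1.0.0", "license": "MIT"}',
    '{"@type": "SoftwareSourceCode", "name": "pkgB", "version": "0.2.1"}',
]


def codemeta_rows(json_texts, fields=("name", "version", "license")):
    """Return one dict per document, with None for any field that is absent."""
    rows = []
    for text in json_texts:
        doc = json.loads(text)
        # Only top-level fields are read; nested nodes would need framing first.
        rows.append({field: doc.get(field) for field in fields})
    return rows


rows = codemeta_rows(SAMPLE_DOCS)
for row in rows:
    print(row)
```

Reading only top-level keys like this is reliable only once every file has been brought into the same tree shape, which is the job the framing step is meant to do.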

@cboettig
Member Author

cboettig commented Apr 5, 2017

Added a draft vignette introducing json-ld framing: https://codemeta.github.io/codemetar/articles/JSON-LD-framing.html

It still needs to be modified; in particular, we should be able to write a frame that creates graph nodes complying with the current JSON schema. (Note that we can't enforce required elements, though we can provide defaults in place of missing elements.)

@mbjones
Member

mbjones commented Apr 5, 2017

@cboettig How does the framing relate to @gothub's use of JSON Schema (see his codemeta-json-schema.json file for schema validation)? It feels to me like these approaches may be overlapping.

@cboettig
Member Author

cboettig commented Apr 5, 2017

Indeed, the vignette is largely a way for me to try and answer that question for myself; you'll see it compares the results of framing to that of validation at several points.

My basic understanding so far is that the answer is yes: framing is JSON-LD's answer to JSON Schema. Since JSON-LD is fundamentally just linked data, just a collection of triples, it's somewhat artificial to impose a schema onto it -- the tree structure created by the schema contains no additional information, even though it's a huge convenience when working programmatically with such files. Framing lets anyone consuming JSON-LD (e.g. app developers) specify what tree structure/schema they want the data to appear in, rather than putting the onus on the data provider to conform to a particular structure.

For instance, I think we wrestled with this a bit when we were writing our context file -- the context file feels very flat, e.g. I remember feeling like I wanted the context file to nest elements like schema:name inside the context of agent, but everything was very top level. I think I appreciate this better now: that kind of nesting / tree structure is supposed to come from the frame, since it isn't inherent to linked data (e.g. that whole thing about how we aren't supposed to encode information purely in tree structure with linked data).
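
To make the "just a collection of triples" point concrete, here is a toy Python sketch showing that a nested document and a flat one can carry exactly the same statements. It is not a JSON-LD processor: the helper `to_triples` is hypothetical, every node is assumed to carry an `@id`, and contexts, blank nodes, and typed literals are ignored.

```python
def to_triples(node, triples=None):
    """Collect (subject, predicate, object) triples from a nested dict.

    Toy model: every node must have an "@id"; a dict value is treated as a
    linked node, anything else as a literal.
    """
    if triples is None:
        triples = set()
    subject = node["@id"]
    for key, value in node.items():
        if key == "@id":
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # A nested node becomes a link to its @id plus its own triples.
                triples.add((subject, key, item["@id"]))
                to_triples(item, triples)
            else:
                triples.add((subject, key, item))
    return triples


# The same (made-up) metadata, once as a tree and once as a flat list of nodes.
nested = {
    "@id": "http://example.org/pkg",
    "name": "pkg",
    "agent": {"@id": "http://example.org/alice", "name": "Alice", "email": "alice@example.org"},
}
flat_graph = [
    {"@id": "http://example.org/alice", "name": "Alice", "email": "alice@example.org"},
    {"@id": "http://example.org/pkg", "name": "pkg", "agent": {"@id": "http://example.org/alice"}},
]

flat = set()
for graph_node in flat_graph:
    to_triples(graph_node, flat)

same = to_triples(nested) == flat
print(same)
```

Both shapes reduce to the same set of statements, which is why the nesting itself adds no information.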

In principle I think we should be able to write a frame that returns a JSON tree that is essentially schema-valid (obviously there's an issue if the schema requires a field that the input JSON-LD doesn't have; framing supports defaults for missing values but that's not much use in this context). The user input doesn't even have to be JSON; it could be RDF/XML triples, and the framing should take care of creating the tree structure we want for programming purposes (e.g. knowing stuff like codemeta$agent[[1]]$email will always work as expected).

I haven't fully digested the upshot of the linked data design here. Clearly it relaxes some constraints on the creation of jsonld documents -- that can be good and bad: it's kinda nice to be able to tell users they cannot omit a given field, but on the other hand I think the json-ld approach is a bit less fragile and easier to extend for being less rigid (e.g. vs xml schema or json schema).

Anyway, curious for your thoughts on all this. Since we opted for JSON-LD to begin with, I wanted to embrace/explore that paradigm; otherwise I think we're shoehorning an XML-schema style approach around json-schema + json-ld, which feels a bit like reinventing the wheel (minus powerful tools like xpath and xslt and the rest we had for xml).

@gothub

gothub commented Apr 5, 2017

Does the JSON-LD framing API provide any validation capability? The example documents that we have authored impose a structure on JSON-LD that I don't see how the framing API can validate. For example, I had assumed that 'agent' must have a name and an email, but it can't have a 'funding' term or 'related' link.

If this isn't the case, and users can create whatever structure makes sense to them, then there is no need to check for a requisite structure.

If we do want to impose a requisite structure and the framing api can validate it, then I'd say let's use it.

@cboettig
Member Author

cboettig commented Apr 5, 2017

@gothub not entirely sure I follow your question, so apologies if this is off the mark.

The JSON document returned by applying a JSON-LD frame is guaranteed to obey the rules set by that frame. So for instance, if you want to ensure you have a tree in which the SoftwareSourceCode object has exactly one child element called "agents", which has exactly one child per "agent", and each "agent" can only have child elements "name" and "email" and can't have elements like "funding" or "related", you can state this in your frame. You make essentially no assumptions about the structure of the input document; remember, it's just a collection of triples of linked data. Then the JSON-LD framing just fills out the frame according to the rules you've set: e.g. it finds all the triples about the agent with id XXX and pulls out the name and email, since the frame requests those. If the input data for some reason includes claims about a funding term or a related link associated with agent id XXX, those are going to be dropped because they aren't asked for in the frame. Of course, if the input data doesn't include a triple assigning an "email" to the agent with id XXX, then the output isn't going to have it either. You can tell your frame to use a default (such as "NA" or "email not provided") if you like, which can be convenient in some applications to keep code that assumes the field exists from breaking.
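
The fill-out behaviour described above can be mimicked with a small toy model in Python. This is only a stand-in for the idea, not the JSON-LD framing algorithm: the graph layout, the frame format, and the function `fill_frame` are all hypothetical. In JSON-LD framing proper, I believe the corresponding controls are the `@explicit` and `@default` keywords, none of which is modeled here.

```python
# Hypothetical graph: node id -> properties; "agent" holds ids of linked nodes.
GRAPH = {
    "pkg": {"name": "codemetar", "agent": ["alice", "bob"]},
    "alice": {"name": "Alice", "email": "alice@example.org", "funding": "Grant 123"},
    "bob": {"name": "Bob"},
}

# Toy frame: keys are the properties to keep. A dict value is a sub-frame for
# linked nodes; any other value is the default used when the property is missing.
FRAME = {"name": None, "agent": {"name": None, "email": "email not provided"}}


def fill_frame(graph, node_id, frame):
    """Build a tree for node_id containing only what the frame asks for."""
    node = graph[node_id]
    out = {}
    for prop, spec in frame.items():
        if isinstance(spec, dict):
            # Follow each link and shape the linked node with the sub-frame.
            out[prop] = [fill_frame(graph, target, spec) for target in node.get(prop, [])]
        else:
            # Literal property: use the stored value, else the frame's default.
            out[prop] = node.get(prop, spec)
    return out


tree = fill_frame(GRAPH, "pkg", FRAME)
print(tree)
```

The "funding" claim attached to Alice never reaches the output because the frame does not ask for it, and Bob's missing email is filled with the default, so code that reads `tree["agent"][i]["email"]` does not break.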

So I don't think it's accurate to call this "validation", because this is really a different approach than validation to address the same problem. The problem we want to address is that, as programmers, we like our data to have a nested, tree-like structure: we want email and name to be sibling nodes, both of which are children of an agent node, and we don't want funding to be a child node of an agent, because it describes the parent (the software), not the agent. As long as the input is valid linked data, it can be structured according to whatever makes sense to the data creator, knowing that the data consumer can subset and re-cast it into their desired shape using frames.

So from a strict interpretation of the JSON-LD spec, I think it is wrong of us to say: codemeta is a JSON-LD file that must also validate against a particular JSON schema. In effect that is really saying that codemeta is just a plain JSON file that validates against a JSON schema and just happens to use JSON-LD namespaces. Nothing wrong with that, but I think it's a different approach. An application that claims it can consume JSON-LD should be able to recognize that any equivalent representation of the same LD graph is indeed equivalent.

@cboettig
Member Author

cboettig commented Apr 6, 2017

I've just updated the vignette to show how one could take any codemeta.json representation (e.g. flattened, compacted, expanded, or otherwise structured) and use a frame to create a new codemeta.json file that is (nearly ^[1]) valid according to Peter's schema, see: https://codemeta.github.io/codemetar/articles/JSON-LD-framing.html . Note that the "frame" to do this looks a lot like the schema (or more simply, just like a skeleton / template of a codemeta.json file; though I think it would be possible to write a far less verbose/ less explicit frame that does the same thing).

I hope this hasn't sounded like I'm against the JSON schema Peter created; only that I think it might not be necessary to insist that people creating codemeta.json files need to ensure that they validate in order to be compatible with other tooling.

Note that in the example, as a result of framing we automatically get more fields populated in the uploadedBy node because it's describing the same person as we are describing in the agent field. So had we omitted the email in uploadedBy but not in the agent field, it is automatically populated by the frame (assuming the frame asks for that data to be included as a leaf on both nodes). Under a schema-only approach, it would be left to the developer to resolve references to other nodes; they couldn't just request a json tree where all those references resolve automatically.
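
A toy Python illustration of that reference resolution follows. It only imitates the embedding idea and is not how a JSON-LD processor is implemented; the graph contents and the helper `embed` are hypothetical, and the sketch assumes the graph contains no cycles.

```python
# Hypothetical graph in which "agent" and "uploadedBy" point at the same node.
GRAPH = {
    "pkg": {
        "name": "codemetar",
        "agent": {"@id": "alice"},
        "uploadedBy": {"@id": "alice"},
    },
    "alice": {"name": "Alice", "email": "alice@example.org"},
}


def embed(graph, value):
    """Replace an {"@id": ...} reference with the full node it points to."""
    if isinstance(value, dict) and "@id" in value:
        node_id = value["@id"]
        target = graph[node_id]
        # Recurse so that references inside the target are resolved as well.
        resolved = {key: embed(graph, inner) for key, inner in target.items()}
        return {"@id": node_id, **resolved}
    return value


tree = {key: embed(GRAPH, val) for key, val in GRAPH["pkg"].items()}
print(tree["uploadedBy"])
```

Because both properties resolve against the same node, the email recorded once for Alice shows up under `agent` and under `uploadedBy` without the author having to repeat it.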
