Rethinking datasets and graphs? #30

Closed
iherman opened this issue Jul 2, 2018 · 27 comments

Comments

@iherman
Member

iherman commented Jul 2, 2018

I decided to start with some controversy:-)

In short, I have always been confused by the ways Datasets are treated in JSON-LD, and I propose to re-open that can of worms. I have jotted down my idea in a separate wiki page (it would have been too long for an issue).

TL;DR: My proposal is to start from scratch, ie, deprecating @graph and replacing the functionalities with something cleaner. See the wiki page...

@iherman
Member Author

iherman commented Jul 2, 2018

I realize we have an issue with backward compatibility. I would therefore propose that we declare @graph as deprecated, but not removed. I.e., the old features remain valid, and we use a new keyword instead (@dataset). This means that, alas!, 1.1 implementations should implement @graph, too, though we should not include, imho, the graph container feature (we will have to see what to replace it with if necessary).

@gkellogg
Member

gkellogg commented Jul 2, 2018

I updated the wiki with my thoughts. I think we should continue to use @graph, and we can adapt for some of the issues you mention.

The graph container feature is required for Verifiable Credentials, and I suspect that @msporny and @dlongley would object to its being removed.

Creating a top-level dataset using a map structure could be accomplished by leveraging the existing container semantics as reproduced below:

{
  "@context": {
    ...
    "dataset": {"@id": "@graph", "@container": ["@graph", "@id"]}
  },
  "dataset" : {
    "URL1" : {
        Some RDF statements here
    },
    "URL2" : [
        {
            We could also define a bush just like above
        },
        {

        }
    ],
    "@none" : [{
        Default graph statements here
    }]
  }
}

This just says that the use of the "dataset" term treats it like @graph, except to use the "@container": ["@graph", "@id"] mechanism to define a graph map. This avoids needing to introduce values in the default graph that reference the named graph identifiers.
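
To make the intended reading of the graph map concrete, here is a minimal Python sketch. It is not the JSON-LD expansion or to-RDF algorithm: `graph_map_to_quads` is a hypothetical helper that assumes every node object has an explicit `@id`, that property keys are already absolute IRIs, and that values are plain strings. It treats `@none` as the default graph, which is the behaviour proposed here and an assumption of the sketch, not something the spec is confirmed to do. The example IRIs are made up.

```python
from typing import Optional

# (subject, predicate, object, graph name or None for the default graph)
Quad = tuple[str, str, str, Optional[str]]

def graph_map_to_quads(graph_map: dict) -> list[Quad]:
    quads = []
    for graph_name, nodes in graph_map.items():
        # Assumption of this sketch: "@none" stands for the default graph.
        name = None if graph_name == "@none" else graph_name
        # A graph map value may be one node object or an array of them.
        if isinstance(nodes, dict):
            nodes = [nodes]
        for node in nodes:
            subject = node["@id"]
            for prop, value in node.items():
                if prop.startswith("@"):
                    continue  # skip keywords such as @id
                quads.append((subject, prop, value, name))
    return quads

dataset = {
    "http://www.ex.org/1": {"@id": "http://example.org/a", "http://a.b.c/1": "one"},
    "@none": [{"@id": "http://example.org/b", "http://a.b.c/2": "two"}],
}
for quad in graph_map_to_quads(dataset):
    print(quad)
```

The point of the map form is visible in the loop: the graph name comes from the key, so nothing in the default graph has to mention it.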

@msporny
Member

msporny commented Jul 2, 2018

TL;DR: My proposal is to start from scratch, ie, deprecating @graph and replacing the functionalities with something cleaner.

What is the problem or documented author issue we are attempting to solve here? (You will see that I will repeat this question for every new feature/deprecation proposed for JSON-LD 1.1). :)

@graph is something that was designed to be used in @contexts... now, some may be using it in JSON-LD markup, which is okay... but I hesitate to say that it's a best practice. We had originally designed @graph to hold data that is digitally signed and needed to exist in a separate graph from the signature. So, @graph was primarily designed so we can digitally sign information (and the most natural way to do that is to use datasets). @graph wasn't something that most developers/authors would be exposed to.

I get the conceptual purity argument, but I haven't seen folks complaining about @graph. I fully admit that we may not have been exposed to those authors/developers... but again, I'd like to see them writing about this issue rather than deprecating a JSON-LD feature before seeing that data.

also,

I decided to start with some controversy:-)

😆 -- nice to see that your sense of humor hasn't changed, @iherman.

@msporny
Member

msporny commented Jul 2, 2018

in a format [JSON-LD] that is, at the end of the day, the serialization of RDF

I know I'm sounding like a broken record at this point, but JSON-LD is not primarily a serialization of RDF. It's a graph-based syntax that just so happens to losslessly convert to and from RDF. I think people think I'm kidding when I say this, I'm only half-kidding... JSON-LD started off by attempting to create a new graph model syntax that Web developers would use... RDF compatibility was not as important to our organization as it was to the existing RDF community and it continues to not be a primary goal.

@iherman
Member Author

iherman commented Jul 3, 2018

@gkellogg not 100%...

I tested in the dev. version of playground:

{
  "@context": {
    "@version": 1.1,
    "dataset": {"@id": "@graph", "@container": ["@graph", "@id"]}
  },
  "dataset" : {
    "http://www.ex.org/1" : {
        "@id" : "http://www.ivan-herman.net",
        "http://a.b.c/1" : "Ivan Herman"
    },
    "http://www.ex.org/2" : [
        {
         "@id" : "http://p.q.r",
         "http://a.b.c/2" : "Somebody else"
        },
        {
         "@id" : "http://x.w.z",
         "http://a.b.c/3" : "And somebody else again"
        }
    ],
    "@none" : [{
        "@id" : "http://www.w3.org",
        "http://a.b.c/4" : "Nobody"
    }]
  }
}

and what I got was:

<http://p.q.r> <http://a.b.c/2> "Somebody else" <http://www.ex.org/2> .
<http://www.ivan-herman.net> <http://a.b.c/1> "Ivan Herman" <http://www.ex.org/1> .
<http://x.w.z> <http://a.b.c/3> "And somebody else again" <http://www.ex.org/2> .
<http://www.w3.org> <http://a.b.c/4> "Nobody" _:b0 .

Note the last line: I did not get statements in the default graph, but in yet another graph with a blank node as an id...

But yes, it is pretty close.
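
The difference is easy to check mechanically. The sketch below groups N-Quads lines by their graph label using only the Python standard library. `group_by_graph` and its regular expression are illustrative assumptions that handle only simple lines like the ones above (IRIs, blank nodes, plain literals without escapes), not the full N-Quads grammar.

```python
import re
from collections import defaultdict

# One RDF term: an IRI, a blank node label, or a plain literal without escapes.
TERM = r'(<[^>]*>|_:\w+|"[^"]*")'
# subject, predicate, object, optional graph label, terminating dot
LINE = re.compile(rf'^{TERM}\s+{TERM}\s+{TERM}(?:\s+{TERM})?\s*\.$')

def group_by_graph(nquads: str) -> dict:
    groups = defaultdict(list)
    for line in nquads.strip().splitlines():
        match = LINE.match(line.strip())
        if match is None:
            raise ValueError(f"unsupported line: {line!r}")
        s, p, o, g = match.groups()
        groups[g].append((s, p, o))  # g is None for the default graph
    return dict(groups)

output = '''
<http://p.q.r> <http://a.b.c/2> "Somebody else" <http://www.ex.org/2> .
<http://www.w3.org> <http://a.b.c/4> "Nobody" _:b0 .
'''
groups = group_by_graph(output)
print(sorted(k for k in groups if k is not None))
print(None in groups)  # False: nothing landed in the default graph
```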

@iherman
Member Author

iherman commented Jul 3, 2018

@msporny

What is the problem or documented author issue we are attempting to solve here?

If I am the only one who is constantly confused about how to use the @graph term then I will of course shut up, and put it down to my own deficiencies. I would have no problem accepting that. But I do believe that the usage of @graph is confusing. As I tried to show, it confuses terms and imposes restrictions (like the usage of blank nodes in graph containers, which generates RDF that would be unusable in SPARQL). Its current usage for representing bushes in JSON-LD is confusing, and it is not obvious how to represent elementary datasets (it is possible only via a conceptually complex trick). If the reader cannot gain a clear mental model of what is happening, then the only way of encoding data would be to copy-paste from the examples without really understanding them, which is a problem (in my view).

About the role of JSON-LD: history is what it is, but that is now bygone. JSON-LD has been "marketed", and I daresay extremely successfully so, as an RDF serialization format, too. This is what it has become today, and how it is used by various communities. We have to take this connection seriously and try to improve the purity of the relationship. I.e., if our syntax leads to a confusion of RDF Graphs and RDF Datasets I do see that as a problem.

@gkellogg
Member

gkellogg commented Jul 3, 2018

No, it’s not implemented in the spec just now, but would be a logical thing to do. Similar to adding `@container` on `@type`, which is also considered elsewhere.

@iherman
Member Author

iherman commented Jul 3, 2018

@gkellogg

The graph container feature is required for Verifiable Credentials, and I suspect that @msporny and @dlongley would object to its being removed.

As I said, I did not think through how to include the graph container feature; I was not saying that the feature itself should be removed, just its current syntax.

@iherman
Member Author

iherman commented Jul 3, 2018

@gkellogg

What about the separate proposal to represent bushes via a simple cross reference to contexts?

@ericprud
Member

ericprud commented Jul 3, 2018

I understand Manu's (provocative) point about JSON-LD being a graph language first and RDF-compatible second. I believe that the @container: @graph construct doesn't behave as one would expect in a graph language, i.e. that a property points at a non-rooted graph. For instance:

{ "@context": {
    "@version": 1.1,
    "p1": {
      "@id": "http://vocab.ex/p1",
      "@container": "@graph"
    },
    "p2": { "@id": "http://vocab.ex/p2" },
    "p3": { "@id": "http://vocab.ex/p3" },
    "p4": { "@id": "http://vocab.ex/p4" }
  },
  "p1": {
    "p2": {
      "p3": "v3",
      "p4": "v4"
    }
  } }

emits the dataset:

_:b0 ex:p1 _:b1 .
GRAPH _:b1 {
  _:b2 ex:p2 _:b3  .
  _:b3 ex:p3 "v3"  .
  _:b3 ex:p4 "v4"  .
}

Navigating in JSON-land, <p1> is strongly-connected to the object with a <p2> property (_:b2, in RDF land). If I'm navigating this as a graph (RDF graph, property graph, Spark's variant of Cypher, etc), <p1> connects to a bag of triples. The application has to be working with a known schema and valid data to discover _:b2.

A solution that will be irksome to some and blindingly obvious to others is to give the subjects of the nested properties (i.e. <p2>) the same identity as that of the graph. In such a schema,

  "p1": {
    "p2": {
      "p3": "v3"
    },
    "p5": "v5"

would look like:

_:b0 ex:p1 _:b1 .
GRAPH _:b1 {
  _:b1 ex:p2 _:b2  .
  _:b2 ex:p3 "v3"  .
  _:b1 ex:p5 "v5"  .
}

This would eliminate a lot of fuzzy heuristics from query/update/validation.
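
A small sketch of why this helps navigation, using the two datasets above as plain Python tuples. The `ex:` names and blank node labels are copied from the listings; `entry_point` is a hypothetical helper, not part of any JSON-LD or RDF API. With the proposed rule, a consumer that follows `ex:p1` to a graph name can look up that same name as a subject inside the graph; with the current output the same lookup finds nothing.

```python
def entry_point(graph_name, triples):
    """Return the triples in a named graph whose subject is the graph name itself."""
    return [t for t in triples if t[0] == graph_name]

# Current behaviour: graph _:b1 holds triples about _:b2 and _:b3 only.
current = [
    ("_:b2", "ex:p2", "_:b3"),
    ("_:b3", "ex:p3", '"v3"'),
    ("_:b3", "ex:p4", '"v4"'),
]
# Proposed behaviour: the graph name _:b1 is also the subject of the nested properties.
proposed = [
    ("_:b1", "ex:p2", "_:b2"),
    ("_:b2", "ex:p3", '"v3"'),
    ("_:b1", "ex:p5", '"v5"'),
]

print(entry_point("_:b1", current))   # empty list: no triple starts at the graph name
print(entry_point("_:b1", proposed))
```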

@BigBlueHat
Member

@ericprud looks like some wee typos in your last examples (i.e. what happened to p5 and v5? and where did p4 come from?). Could you fix those? Thanks!

@BigBlueHat
Member

We have to take this connection seriously and try to improve the purity of the relationship.

While I do agree with @iherman about the importance of JSON-LD as a serialization of RDF, I'll also 👍 @msporny's statements that (even regardless of history), JSON-LD's appeal reaches farther than just "RDF-land."

It may be a tricky balance to strike when addressing situations like this, but it's clear that "confusion" is subjective.

I've no clear technical suggestions at this point, other than that we at least have more to clarify and exemplify (i.e. improve our examples) and would prefer we start there...and see what's still missing.

@gkellogg
Member

gkellogg commented Jul 3, 2018

@iherman while adding referencable contexts is feasible, I think it’s a big step, and I don’t think it’s necessary.

@ericprud your thought about preserving the graph name as the implicit subject of triples in the referenced graph has merit, and does solve the nasty rooting problem.

@msporny we settled on using the RDF model as the basis for JSON-LD not just to appease the RDF 1.1 WG, but because it didn’t make sense to introduce yet another model. I think it’s important that the JSON-LD surface syntax remain usable by developers that don’t care, but we need to make sure the underpinnings have a good basis in theory. Perhaps we can use JSON-LD to push forward on some emerging areas of interest, such as property graph alignment via RDF*.

@iherman
Member Author

iherman commented Jul 3, 2018

@iherman while adding referencable contexts is feasible, I think it’s a big step, and I don’t think it’s necessary.

Because? I believe a proper and clean representation of bushes is very important and we do not have that (I do not consider the usage of @graph as "clean"...)

@msporny
Member

msporny commented Jul 3, 2018

@msporny we settled on using the RDF model as the basis for JSON-LD not just to appease the RDF 1.1 WG, but because it didn’t make sense to introduce yet another model.

We did introduce another model:

https://json-ld.org/spec/latest/json-ld/#data-model

Yes, it is compatible but the JSON-LD data model is a superset of the RDF data model. You can express things in JSON-LD that you cannot in RDF, that was a very intentional strategy. I suggest that we keep it that way to continue to push RDF 1.1 into the modern world. Native support for RDF lists, anyone? :)

That strategy pushed the RDF 1.1 WG to add a few important features (named graphs, some would argue dataset support). I'll note that the JSON-LD data model is an extension of the RDF data model because RDF 1.1 didn't adopt all of the JSON-LD data model features.

I think it’s important that the JSON-LD surface syntax remain usable by developers that don’t care, but we need to make sure the underpinnings have a good basis in theory.

+1, as long as the realignment to theory doesn't deprecate features that are working just fine for the rest of us. It feels like this discussion is trying to fix a non-issue for JSON-LD authors. Yes, I readily admit that maybe the theoretical underpinnings aren't clean, but if you want something that's clean -- use TRiG. There are other languages that will give that to you.

JSON-LD is meant to be an everyday developer/author tool... we don't need to expose every thing in RDF to those folks (and I'd argue that if we do, JSON-LD will eventually fail).

The primary design criteria for JSON-LD is to help developers build better systems... being theoretically clean is very far down the list of priorities... and I'm very concerned that if we focus too much on that, we will turn JSON-LD into something that has so many bells and whistles attached to it that it loses its value and we'll be forced to do a JSON-LD Lite just like we were forced to do that for RDFa.

Perhaps we can use JSON-LD to push forward on some emerging areas of interest, such as property graph alignment via RDF*.

Do we have a list of features for JSON-LD 1.1 with priority based on group interest? Can we do some ranked choice voting on that so we don't spend a lot of time discussing changes to JSON-LD that are low priority?

@msporny
Member

msporny commented Jul 3, 2018

Because? I believe a proper and clean representation of bushes is very important and we do not have that (I do not consider the usage of @graph as "clean"...)

What use case is not possible because of this missing feature?

@iherman
Member Author

iherman commented Jul 3, 2018

Wow, this discussion is getting a little bit out of hand. It was not my intention to start an RDF vs. non-RDF controversy. Can we avoid getting into this discussion?

@msporny

What use case is not possible because of this missing feature?

The question is not whether something is not possible. Yes, it is possible to express a bush with @graph, and I did not say otherwise. My claim is that it is complicated, way more complicated and counter-intuitive than necessary. The whole document is geared towards a special subset of graphs that are all rooted, but that is not the only use case out there.

My goal is to make JSON-LD easy to understand and use. In my experience, in the area of datasets and bushes, it is not. Obviously, you do not feel there is a problem with this. Let us try to pause here a bit, because whether it is easy or not is obviously a subjective statement; I would like to see the reactions of others in the group, that would begin to give a good sample.

@msporny
Member

msporny commented Jul 3, 2018

Wow, this discussion is getting a little bit out of hand. It was not my intention to start an RDF vs. non-RDF controversy.

Hey man, you're the one that wanted to be controversial. 😜

@BigBlueHat
Member

@iherman if you just want a bush, isn't this sufficient:

[
  {"@context": "http://schema.org/", "name": "Ivan"},
  {"@context": "http://example.com/schema", "name": "Pluto"}
]

...or if you want a default @context value, you'd reshape it this way (as I'm sure you know):

{
  "@context": "http://schema.org/",
  "@graph": [
    {"@context": "http://schema.org/", "name": "Ivan"},
    {"@context": "http://example.com/schema", "name": "Pluto"}
  ]
}

I've not been tripped up by that, conceptually. And even giving that last example an @id and/or other top-level properties has all made sense to me (for what little that's worth). I do, however, get a bit tangled up by the new "@container": "@graph" + @none (for statements about the "named graph" itself). That's likely a separate issue?

Overall, maybe there's a way we can narrow in on the exact concerns? The wiki page was great for an overview, but I guess I didn't find the solution(s) any clearer than the present approach.

@gkellogg
Member

gkellogg commented Jul 3, 2018

@iherman Looking at your "addressable context" mechanism from the wiki:

[
   {
       "@context" : {
           "@id" : "_:a"
           ...
       }
   },
   {
     "@context" : "_:a",
     "@id" : "http://www.example.org/1",
     "http://a.b.c" : "something"
   },{
     "@context" : "_:a",
     "@id" : "http://www.example.org/2",
     "http://d.e.f" : "something"
   }
]

My concern here is that this implies that a context with "@id": "http://example/ctx" would be the same as a context loaded from "http://example/ctx". This may be the case, but nowhere else in JSON-LD is there an assumption that loading a document from a location implies that the document has an @id that's the same as that location, unless it has "@id": "". This would seem to be creating a precedent for contexts.

Moreover, what if you had a remote context at "http://example.org/foo", which looked like the following:

{
  "@context": {
    "@id": "http://example.org/bar",
     ...
  }
}

What is the address of this context, "http://example.org/foo", or "http://example.org/bar"? Right now, if I use "@context": "http://schema.org", I either load it, or use the version already loaded from that address.

In short, I think that this raises some issues that may muddy the waters, all to "clean up" the use of @graph for describing a bush, which is well established practice by now.
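
The ambiguity can be sketched with a toy context cache. Everything in it is hypothetical: the `ContextCache` class, and the idea that a context object carries an `@id`, exist only for this illustration and describe no real JSON-LD processor. It shows that once a context can declare its own identifier, the same context becomes reachable under two different keys.

```python
class ContextCache:
    def __init__(self):
        self._by_key = {}

    def store(self, loaded_from, context):
        # Today: a context is keyed by the URL it was loaded from.
        self._by_key[loaded_from] = context
        # Hypothetical: with addressable contexts, a declared "@id" becomes a second key.
        declared = context.get("@id")
        if declared is not None:
            self._by_key[declared] = context

    def keys_for(self, context):
        return sorted(k for k, v in self._by_key.items() if v is context)

cache = ContextCache()
remote = {"@id": "http://example.org/bar"}
cache.store("http://example.org/foo", remote)
print(cache.keys_for(remote))
```

A processor would then have to decide which of the two keys is authoritative, and what to do when a later document reuses one of them for a different context.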

@iherman
Member Author

iherman commented Jul 4, 2018

@gkellogg yes, I find your argument compelling indeed. It may require a specific addressing mechanism, orthogonal to @id which, I admit, is not nice either.

@iherman
Member Author

iherman commented Jul 4, 2018

@BigBlueHat and others: it seems that I am in the minority with my uneasiness. I find the fact that the same keyword (@graph) is used both for a bush and for datasets extremely confusing and "dirty", but I obviously won't lie down in the road if I am the only one.

@iherman
Member Author

iherman commented Jul 4, 2018

One positive thing that may have come out of this discussion: #30 (comment) shows a way to produce very cleanly a dataset, provided that @none works. Personally, I would prefer to have @dataset as a standard, but it may be considered as a standard idiom by users...

@ericprud
Member

ericprud commented Jul 4, 2018

I believe that @iherman's proposal to distinguish datasets with @dataset has low cost and good value:

  1. The most compelling JSON-LD use cases demand dual access (JSON tree and RDF graph navigation). I'd estimate this at 90% of JSON-LD's value, though in my experience it's been closer to 100%.
  2. Some folks exchange non-framed JSON-LD but their use of it is as a commodity serialization. The group most affected is a smallish set of engineers writing serializers and parsers.

For the most part, non-expert human eyes rarely fall on non-framed JSON-LD with keywords like @graph and @dataset. For those folks (let's call them experts-to-be), a clear model which distinguishes graphs from datasets is of greater value than the adoption of the legacy keyword @graph to mean a dataset.

@BigBlueHat
Member

I'm not sure the introduction of another keyword makes any of this any clearer...and doing so would certainly raise the "expert" bar a bit by requiring an understanding of the differences between a Named Graph and a Dataset--which seems to be unclear to (or at least debated by) the people defining the terminology--see https://www.w3.org/TR/rdf11-datasets/ linked earlier.

From that Note:

Defining the semantics of RDF datasets requires an understanding of the two following issues:

  • what the graph names (IRI or blank node) denote, or what are the constraints on what the names can possibly denote;
  • how the triples in the named graph influence the meaning of the dataset.

...
Depending on the assumptions taken with respect to these two issues, the formalization of the semantics of RDF datasets can vary very much.

Perhaps it would be helpful if someone (who cares deeply about this issue) were to go through the list of interpretations represented in that Note and present the various JSON-LD expressions for each +/- any confusion they think is represented by the current expression options and/or proposals to fix them.

That would help me at least, and perhaps at least narrow the discussions here a bit more.

@gkellogg
Member

gkellogg commented Aug 2, 2018

So as not to lose @ericprud's comment about making a change to "@container": "@graph" to align the blank node table used to identify the graph with the implicit subject of the node contained within the graph, please create another issue for this to be considered (action on @ericprud).

I believe this directly relates to the ability to validate Verifiable Claim named graphs from a data-model perspective, rather than just a JSON Schema perspective. Potentially, the contents of such a named graph could have a very large number of statements, which makes it computationally impractical to find the "root" of the graph by searching for statements with a subject (@id) which is not the value of some other node (object of a statement). We might go so far as to describe a subset of named graphs where the graph name is the same as the primary subject of the graph.
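
For illustration, the search described here can be sketched as a scan that collects every subject that never appears as an object. `find_roots` is a hypothetical helper over plain tuples, not an API from any RDF library. It needs a full pass over the graph, whereas a graph named after its primary subject would allow a direct lookup.

```python
def find_roots(triples):
    """Return the subjects that are not the object of any triple in the graph."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    return subjects - objects

graph = [
    ("_:b2", "ex:p2", "_:b3"),
    ("_:b3", "ex:p3", '"v3"'),
    ("_:b3", "ex:p4", '"v4"'),
]
print(find_roots(graph))
```

Note that the result is a set: a graph may have several such nodes, or none at all if every subject is also referenced as an object (for example in a cycle), which is one more reason a heuristic root is fragile.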

@gkellogg gkellogg moved this from Discuss-F2F to Discuss-GH in JSON-LD Management DEPRECATED Oct 27, 2018
@gkellogg gkellogg moved this from Discuss-GH to Future Work in JSON-LD Management DEPRECATED Oct 27, 2018
@azaroth42 azaroth42 moved this from Future Work to Editorial work complete in JSON-LD Management DEPRECATED Feb 8, 2019
@iherman iherman closed this as completed Feb 8, 2019
@iherman
Member Author

iherman commented Feb 9, 2019

This issue was discussed in a meeting.

  • No actions or resolutions
Transcript: datasets and graphs
Rob Sanderson: ref: #30
Ivan Herman: I don’t like the way that this is done, but it turned into a philosophical argument, and I can just close it.
Ivan Herman: to clarify, I want to close it because it’s way too late.

@gkellogg gkellogg removed this from Editorial work complete in JSON-LD Management DEPRECATED Feb 15, 2019

6 participants