
Streaming Profiles for JSON-LD to/from RDF #4

Closed
gkellogg opened this issue Jul 8, 2018 · 16 comments

@gkellogg (Member) commented Jul 8, 2018

There have been some discussions on what it would take to be able to do a streaming parse of JSON-LD into Quads, and similarly to generate compliant JSON-LD from a stream of quads. Describing these as some kind of a profile would be useful for implementations that expect to work in a streaming environment, when it's not feasible to work on an entire document basis.

As currently specified, the JSON-LD to RDF algorithm requires expanding the document and creating a node map. A profile of JSON-LD that uses a flattened array of node objects, where each node object can be independently expanded and no further flattening is required, could facilitate deserializing an arbitrarily long JSON-LD source to quads. (Some simplifying restrictions on shared lists may be necessary.) The outer document would be an object containing @context and @graph only; this only works for systems that can access key/values in order, and that ensure @context comes lexically before @graph in the output. Consequently, only implementations that can read and write JSON objects with key ordering intact will be able to take advantage of such streaming capability.
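The streamable shape described above can be sketched in a few lines. The document and the order-checking helper below are illustrative assumptions, not part of any spec; the point is only that a parser which sees keys in document order can reject out-of-order input early and hand each @graph entry off for independent expansion.

```python
import json

# Hypothetical document in the streaming-friendly shape: the outer object
# contains only @context and @graph (in that order), and @graph is a
# flattened array of node objects, each independently expandable.
doc = """{
  "@context": {"name": "http://schema.org/name"},
  "@graph": [
    {"@id": "http://example.org/a", "name": "A"},
    {"@id": "http://example.org/b", "name": "B"}
  ]
}"""

# object_pairs_hook preserves key order, so we can verify that @context
# comes lexically before @graph before touching any node object.
def check_order(pairs):
    keys = [k for k, _ in pairs]
    if "@context" in keys and "@graph" in keys:
        assert keys.index("@context") < keys.index("@graph"), \
            "@context must precede @graph for streaming"
    return dict(pairs)

parsed = json.loads(doc, object_pairs_hook=check_order)
node_ids = [node["@id"] for node in parsed["@graph"]]
print(node_ids)  # ['http://example.org/a', 'http://example.org/b']
```

A real streaming implementation would of course consume the @graph array incrementally (e.g. with an event-based JSON parser) rather than loading the whole document; `json.loads` is used here only to keep the sketch self-contained.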

For serializing RDF to JSON-LD, expectations on the grouping of quads with the same graph name and subject are necessary to reduce serialization cost, and marshaling components of RDF lists is likely not feasible. Even if graph name/subject grouping is not maintained in the input, the resulting output will still represent a valid JSON-LD document, although it may require flattening for further processing. (Many triple stores will, in fact, generate statements/quads properly grouped, so this is likely not an issue in real-world applications.)

Original issue: Streaming Profiles for JSON-LD to/from RDF (#434).

@ajs6f (Member) commented Oct 26, 2018

This question arises on a regular basis (on user mailing lists) for Apache Jena, and our only current response is to admit that JSON-LD currently isn't a good choice when processing would benefit from or demands streaming.

@simonstey commented Feb 8, 2019

From @rubensworks, sent today via email:

Dear all,

I had a couple of use cases where I needed to be able to
parse JSON-LD documents to RDF in a streaming way.
To the best of my knowledge, current JavaScript implementations
don't support streaming parsing, which is why I implemented a streaming parser [1].
Such a parser is especially useful when you need to parse large documents
that don't fully fit into your memory.

This parser can be configured to be fully spec-compliant.
However, by default, it is not fully compliant for performance reasons.
For example, the parser will by default throw an error
if an @context is found as a non-first entry in an object.

Obviously, a streaming parser will never be as fast as a regular parser for all cases.
However, we still achieve comparable performance for parsing
typical JSON-LD documents, compared to jsonld.js [2].
Currently, this parser is significantly slower for expanded documents,
so I am still looking into optimizing this.

At the moment JSON-LD 1.0 is supported,
but I aim to look into supporting the new 1.1 features in the near future.

More information on how the streaming algorithm works
can be found in the readme [3].

[1] https://github.com/rubensworks/jsonld-streaming-parser.js
[2] https://github.com/rubensworks/jsonld-streaming-parser.js#performance
[3] https://github.com/rubensworks/jsonld-streaming-parser.js#how-it-works

Kind regards,
Ruben Taelman

@iherman (Member) commented Feb 9, 2019

This issue was discussed in a meeting.

  • RESOLVED: Streaming is interesting, but not high priority for work given current participants; highlight in a blog post
Transcript: 5.2. Streaming Profiles for JSON-LD to/from RDF
Rob Sanderson: ref: https://github.com/w3c/json-ld-api/issues/5
Gregg Kellogg: there are savings to be realized if one could spec a profile for streaming
Gregg Kellogg: this profile would say, “to be streamed, a JSON-LD serialization would need to have the following characteristics”
Ivan Herman: analysis of the format with this in mind
Ivan Herman: I’d say defer
… this might be interesting enough that someone might publish something before this WG ends
Gregg Kellogg: we could publish something that invites people to work on this
Proposed resolution: Streaming is interesting, but not high priority for work given current participants; highlight in a blog post (Rob Sanderson)
Gregg Kellogg: +1
Adam Soroka: +1
Rob Sanderson: +1
Benjamin Young: +1
Simon Steyskal: +1
Ivan Herman: +1
David I. Lehn: +1
Jeff Mixter: +1
Harold Solbrig: +1
David Newbury: +1
Resolution #6: Streaming is interesting, but not high priority for work given current participants; highlight in a blog post

@gkellogg (Member, Author) commented:

Just to jot down some thoughts after discussing with @rubensworks: the main issue is to encourage/require a key order in JSON objects. To properly decode values in an object, @context must be seen first, and to properly assign subject identifiers, @id must come early. Also, if keywords are aliased, those aliased keys should continue to sort in their unaliased lexicographical order when ordering the property values; at least @id (or its alias) should come before most everything else. Otherwise, keywords would naturally come before everything other than a number. So the general advice would be:

When serializing JSON-LD, order keys that are keywords (or aliased to keywords) before other keys, sorted lexicographically by the unaliased key, followed by all other keys in the object sorted lexicographically. (There is some aesthetic value to ordering @value before @index, @type, or @language, but it shouldn't matter for stream processing.)
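As a sketch of that advice (the keyword list is a subset and the alias map is hard-coded for illustration; in practice the alias-to-keyword mapping would come from the active context):

```python
# Subset of JSON-LD keywords, for illustration only.
KEYWORDS = {"@context", "@id", "@type", "@value", "@language", "@index",
            "@list", "@set", "@graph", "@reverse"}

def stream_key_order(keys, aliases=None):
    """Keys that are keywords (or aliased to keywords) come first, sorted
    lexicographically by their unaliased form; all other keys follow,
    sorted lexicographically."""
    aliases = aliases or {}
    def unalias(k):
        return aliases.get(k, k)
    kw = sorted((k for k in keys if unalias(k) in KEYWORDS), key=unalias)
    rest = sorted(k for k in keys if unalias(k) not in KEYWORDS)
    return kw + rest

# "id" is an alias for @id, so it sorts among the keywords by its
# unaliased form, ahead of @type and of all non-keyword keys.
print(stream_key_order(["name", "id", "@type", "aaa"], aliases={"id": "@id"}))
# ['id', '@type', 'aaa', 'name']
```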

@iherman (Member) commented Mar 12, 2019

Can you guys give some examples? It would help to understand...

@rubensworks (Member) commented:

Here's an example of the importance of @id coming as soon as possible:

Assuming a line-by-line parser, triples can be emitted immediately after each line in the following JSON-LD document:

{
  "@context": "http://schema.org/",
  "@id": "http://example.org/",
  "@type": "Person",               // --> <http://example.org/> a schema:Person.
  "name": "Jane Doe",              // --> <http://example.org/> schema:name "Jane Doe".
  "jobTitle": "Professor",         // --> <http://example.org/> schema:jobTitle "Professor".
  "telephone": "(425) 123-4567",   // --> <http://example.org/> schema:telephone "(425) 123-4567".
  "url": "http://www.janedoe.com"  // --> <http://example.org/> schema:url <http://www.janedoe.com>.
}

However, if @id appears on a lower line, then some information needs to be buffered until the @id becomes known (if no @id appears before the node closes, then the subject will be a blank node):

{
  "@context": "http://schema.org/",
  "@type": "Person",
  "name": "Jane Doe",
  "jobTitle": "Professor",
  "@id": "http://example.org/",    // --> <http://example.org/> a schema:Person.
                                   // --> <http://example.org/> schema:name "Jane Doe".
                                   // --> <http://example.org/> schema:jobTitle "Professor".
  "telephone": "(425) 123-4567",   // --> <http://example.org/> schema:telephone "(425) 123-4567".
  "url": "http://www.janedoe.com"  // --> <http://example.org/> schema:url <http://www.janedoe.com>.
}

(Source: https://github.com/rubensworks/jsonld-streaming-parser.js#how-it-works)
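The buffering behaviour in the second example can be sketched as a toy emitter over a node's key/value pairs (real parsers would also expand keys and values against the active context, which is omitted here):

```python
def emit_triples(pairs):
    """pairs: the (key, value) entries of one node object, in document
    order. Yields (subject, predicate, object) tuples as soon as each
    triple can be emitted."""
    subject = None
    buffered = []
    for key, value in pairs:
        if key == "@id":
            subject = value
            for pred, obj in buffered:       # flush buffered properties
                yield (subject, pred, obj)
            buffered = []
        elif key != "@context":
            if subject is None:
                buffered.append((key, value))  # subject not yet known
            else:
                yield (subject, key, value)    # emit immediately
    if subject is None:                        # node closed without @id:
        subject = "_:b0"                       # fall back to a blank node
        for pred, obj in buffered:
            yield (subject, pred, obj)

# @type and name must be buffered until @id is seen; jobTitle streams out.
triples = list(emit_triples([
    ("@type", "Person"),
    ("name", "Jane Doe"),
    ("@id", "http://example.org/"),
    ("jobTitle", "Professor"),
]))
```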

@iherman (Member) commented Mar 12, 2019

Thanks @rubensworks. I have two questions, though:

  1. Can you explain why "ordered lexicographically by the unaliased key followed by all other keys in the object ordered lexicographically." is necessary? I.e., what is the importance of lexicography?
  2. Are there special requirements for the content of @context portions?

@rubensworks (Member) commented:

Can you explain why "ordered lexicographically by the unaliased key followed by all other keys in the object ordered lexicographically." is necessary? Ie, what is the importance of lexicography?

Lexicographical ordering may be a bit too strict. I think the point @gkellogg intended to make is that @-keywords (even if they are aliased) should occur before all other keys. I think this restriction can safely be made a bit weaker, as follows:

  1. If there is an @context in a node, it should be the first key.
  2. If there is an @id in a node, it should be the first key if there is no @context, or the second key if there is an @context.
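A check for those two relaxed constraints might look like this (an illustrative sketch operating on a node's keys in document order):

```python
def streaming_friendly(keys):
    """True if @context (when present) is the first key, and @id (when
    present) immediately follows @context, or is first otherwise."""
    if "@context" in keys and keys[0] != "@context":
        return False
    if "@id" in keys:
        expected = 1 if "@context" in keys else 0
        if keys.index("@id") != expected:
            return False
    return True

print(streaming_friendly(["@context", "@id", "name"]))  # True
print(streaming_friendly(["name", "@id"]))              # False
```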

Are there special requirements for the content of @context portions?

I personally don't see any benefits in handling the contents of an @context in a streaming way, because the context does not produce any triples directly. Perhaps some things are possible with remote contexts that need to be fetched, but I think the HTTP overhead will negate any potential benefits from parsing this in a streaming way.

@gkellogg (Member, Author) commented:

Thanks @rubensworks. I have two questions, though:

  1. Can you explain why "ordered lexicographically by the unaliased key followed by all other keys in the object ordered lexicographically." is necessary? Ie, what is the importance of lexicography?

For keywords, you want to see @context first, followed by @id; lexicographical ordering does that, but otherwise ordering isn’t important, as @rubensworks indicated. Non-keyword keys don’t really need to be ordered. Note that the algorithms currently call for object values to be optionally ordered by key.

@simonstey commented:

For keywords, you want to see @context first, followed by @id; lexicographical ordering does that,

as long as we don't introduce any new keywords that would somehow have an effect on this order, right? (not very likely, but still ;) )

but otherwise ordering isn’t important, as @rubensworks indicated.

IMO, @rubensworks' suggested restrictions are also more "stable" and less ambiguous than relying on lexicographical ordering alone.

@rubensworks (Member) commented:

For reference, I just finished implementing a streaming JSON-LD serializer.

While implementing it was significantly easier than implementing the streaming parser,
some things could not be implemented in a streaming way while still fully adhering to the test suite.

Concretely, it has the following restrictions:

  • RDF lists are not converted to @list arrays, as you can only be certain that @list can be used once all triples have been read, which requires keeping the whole stream in-memory.
  • No deduplication of triples, as this would also require keeping the whole stream in-memory.

Next to that, in order to make the resulting JSON-LD stream as compact as possible, the following guidelines regarding triple/quad order can be followed:

  1. Quads with equal graphs should be grouped (achieves grouping of @graph blocks).
  2. Quads whose graph name corresponds to the subject of other triples/quads should be grouped with them (achieves grouping of @graph and @id blocks).
  3. Triples with equal subjects should be grouped (achieves grouping of @id blocks).
  4. Triples with equal predicates should be grouped (achieves grouping of predicate arrays).

Since these findings about JSON-LD serialization (and the previous ones on parsing) may be beneficial for other people as well, I was wondering if the Best Practices Note may be a good place to summarize these findings.

@gkellogg (Member, Author) commented Apr 3, 2019

I have an implementation of a streaming writer that pretty much does the same thing: https://github.com/ruby-rdf/json-ld/blob/develop/lib/json/ld/streaming_writer.rb.

Such list restrictions are a good argument for such structures to be more fundamental to RDF in the future.

@iherman (Member) commented Apr 27, 2019

This issue was discussed in a meeting.

  • RESOLVED: Describe preferred key ordering for serialization over the wire to enable streaming parsers as a best practice
Transcript: Streaming Profiles for JSON-LD to/from RDF
Rob Sanderson: link: https://github.com/w3c/json-ld-api/issues/5
Rob Sanderson: this came from the community group
Ivan Herman: what does a profile mean?
Gregg Kellogg: I reckon in the sense of serializing json-ld in a way that it’s easier for stream processors to deal with it
… or how would you create json-ld from a stream
… the best thing we can do is to provide requirements that should be followed
Ivan Herman: so not like profiles in the http context
Ruben Taelman: basically like gkellogg described
… I’m more than happy to summarize this in the best practice document
Dave Longley: +1 to doing this in best practices
Simon Steyskal: I don’t think it should be normative. You can do what you want. But it’s perfectly fine for a best practices document and should be in there, giving guidelines on this.
Gregg Kellogg: the one thing I’m not sure whether we can move to a bp document is something that allows one to require stream data (?)
Ivan Herman: I would propose to leave it to best practice
Tim Cole: I’m a little concerned that by not following gkellogg’s suggestions people will create json-ld that cannot be used properly by a streaming processor
Adam Soroka: we frequently get questions about streaming json-ld
… I second the concern timCole raised
Benjamin Young: #3
Benjamin Young: a lot of the stuff I’m reading there is about key ordering
… one potential option could be not to require ordering
… but processors outputting a preferred ordering
Ivan Herman: I hear bigbluehat’s argument which is perfectly valid, and maybe a future version of json-ld will have key ordering
Rob Sanderson: +1 - unordered keys are ordered by necessity of a serialization
Gregg Kellogg: serialization vs. data model wrt. ordering
Tim Cole: +1 to ivan since it will provide experience to inform normalization
Ivan Herman: I repeat what I just said, getting into a normative thing in that area is probably premature
… or too much work
Rob Sanderson: what’s the happy middle ground? key ordering for streaming?
Benjamin Young: I would like to get as much as possible into the best practice document
Ruben Taelman: scoped contexts shouldn’t be an issue
… as long as they are the first object
Proposed resolution: Describe preferred key ordering for serialization over the wire to enable streaming parsers as a best practice (Rob Sanderson)
Gregg Kellogg: +1
Adam Soroka: +1
Rob Sanderson: +1
Ivan Herman: +1
Dave Longley: +1
Benjamin Young: +1
Simon Steyskal: +1
David I. Lehn: +1
Ruben Taelman: +1
Gregg Kellogg: Scoped contexts might require that @type come after @id
Pierre-Antoine Champin: +1
Tim Cole: +1 as long as leave defer for future
David Newbury: +1
Resolution #7: Describe preferred key ordering for serialization over the wire to enable streaming parsers as a best practice

@iherman (Member) commented Apr 27, 2019

Transferring this issue (used to be issue no. 5 in json-ld-syntax) to the best practice repo.

@iherman iherman transferred this issue from w3c/json-ld-api Apr 27, 2019
@azaroth42 azaroth42 moved this from Discuss-GH to Non TR Work in JSON-LD Management DEPRECATED Aug 15, 2019
@ajs6f (Member) commented Aug 23, 2019

Hm. Just thinking out loud, but I wonder if this would be even better placed as a Note separate from the BP document. Maybe, maybe not. In favor, I think it's a bit (or maybe a lot, depending on your POV) more advanced than other topics we expect to cover in the BP doc. Against, why multiply documents? We have several already…

@rubensworks (Member) commented:

@BigBlueHat I just went through this issue and #5, and I can confirm that all information in here is summarized in #5, so we can safely close this one here.

@ajs6f ajs6f closed this as completed Feb 14, 2020
@gkellogg gkellogg removed this from Non TR Work in JSON-LD Management DEPRECATED Feb 17, 2020
@rubensworks rubensworks removed their assignment Mar 17, 2020