ndjson-ld format #140

Open
VladimirAlexiev opened this issue Feb 19, 2021 · 14 comments
Labels
results application/sparql-results+rainbows

Comments

@VladimirAlexiev
Contributor

VladimirAlexiev commented Feb 19, 2021

Why?

Newline-delimited JSON (line-oriented JSON) is often used in preference to plain JSON because it is streamable and can be processed with line-oriented tools (e.g. grep).

Previous work

Proposed solution

  • We're considering MIME type application/x-ld+ndjson (derived from the existing MIME type for JSON-LD application/ld+json and the MIME type of Newline Delimited JSON application/x-ndjson)
  • We're considering file extensions .ndjsonld, and maybe .jsonl and .ndjson

Considerations for backward compatibility

None?

@ericprud
Member

@VladimirAlexiev, I found some examples in the specs but didn't find what you were referring to in the "Sample data" link above.

I think streaming JSON would be an excellent tool for long-running SPARQL results and line-oriented is a nice benefit. I guess this is a small step from current JSON results as they already require newlines to be escaped, right?

@TallTed
Member

TallTed commented Feb 20, 2021

NDJSON is apparently also known as all of LDJSON, Line_Delimited_JSON, JSON_Lines, JSON_Streaming, JSONL, ndjson, NDJSON, and Newline_Delimited_JSON -- so this new thing could even be LD-JSON-LD!

Except that JSON-L (or JSONL) is definitely different from NDJSON... And I imagine there are other issues hiding behind the not-quite-synonym list above.


What is the (anticipated?) relationship between ND-JSON-LD (or NDJSON-LD) and JSON-LD (and 1.0, 1.1, etc.)?

Both JSON Lines and Newline Delimited JSON say they're also known by the other name, but as noted above these are different creatures. It's going to be necessary very quickly to clearly define which you're working with (and why not the other), as well as what may happen if the streams are crossed.


How and why is "Newline Delimited JSON-LD" (or is it "Linked Data in ND-JSON"?) related to the 1.2 update of SPARQL, which is the focus of this github project?

It seems to me that ND-JSON-LD should be a distinct project, maybe associated with JSON-LD given their apparent close cousin relationship.


On Media Type...

x- Media Types are generally frowned on these days, for good reason. Which the NDJSON folk know, and haven't done much about (ndjson/ndjson-spec#19, ndjson/ndjson-spec#21).

Media Types with Multiple Suffixes is heading toward RFC status, and application/ld+json already exists, so you might consider application/nd+ld+json, possibly with a synonymous application/ld+nd+json (which would need the apparently stagnant NDJSON project to change from application/x-ndjson to application/nd+json)

If you don't want to pin hopes on Media Types with Multiple Suffixes, you might also consider application/ld+ndjson, and again pushing the NDJSON project to change from application/x-ndjson to application/ndjson ...

Or leave the NDJSON project fallow as it stands, and consider application/ld+x-ndjson, which at least follows the general rules of Media Types, and parallels the existing application/ld+json.


This feels like a lot of frayed ends in search of a knot. That knot may be worthwhile, but I think it should be distinct from SPARQL 1.2.

@afs
Collaborator

afs commented Feb 20, 2021

Won't it be application/sparql-results+x-ndjson for SELECT results and application/ld+x-ndjson for CONSTRUCT/DESCRIBE?

From JSON-LD, application/ld+... is about RDF graphs and datasets, and ...+json the concrete syntax choice. (c.f. rdf+xml).

@gkellogg
Member

It would seem that the appropriate place for this effort would be the JSON-LD CG (AKA the JSON for Linking Data Community Group), although the JSON-LD WG remains as a maintenance group.

Also, note that the WG published the Streaming JSON-LD note, which addresses the need for a streaming serialization format, but in this case by imposing an order on object entries in the line serialization, although it is not a line format, per se.

At first glance, NDJSON-LD would seem to follow well given a context specified out of band, such as via a Link header. That would make it much the same as parsing an outer object containing @context and the values of @graph. Going beyond that, an extension for supporting an @context at the top level, either as a URL or as a one-line object, would be straightforward. Nothing would prevent an individual NDJSON line from including @context, either, unless there is some limitation on line length I didn't notice.
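The equivalence described above can be sketched roughly as follows. This is only an illustration, not part of any spec: the function name `wrap_lines`, the context URL, and the sample node ids are all hypothetical. It buffers every line, so it gives up streaming and only shows what the out-of-band context would mean.

```python
import json


def wrap_lines(lines, context):
    """Combine NDJSON lines and an externally supplied context into one
    JSON-LD document of the form {"@context": ..., "@graph": [...]}.

    `context` stands in for whatever arrives out of band (for example a
    URL taken from a Link header). Blank lines are skipped. A line that
    carries its own @context is kept as-is inside its node object.
    """
    graph = [json.loads(line) for line in lines if line.strip()]
    return {"@context": context, "@graph": graph}


# Hypothetical input: two node objects, one per line, plus a blank line.
lines = ['{"id": "ex:a", "type": "Thing"}', "", '{"id": "ex:b", "type": "Thing"}']
doc = wrap_lines(lines, "https://example.org/context.json")
print(json.dumps(doc, indent=2))
```

A streaming consumer would instead apply the same context to each line as it arrives, without ever building the outer object.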

@VladimirAlexiev
Contributor Author

VladimirAlexiev commented Feb 21, 2021

@ericprud
The sample data we have cited in our jira looks like this

{"@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", "type": "MonetaryGrant", "id": "sg:grant.6616389",...\n 
{"@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", "type": "MonetaryGrant", "id": "sg:grant.6616214",...\n ...

It's probably here http://scigraph.downloads.uberresearch.com/archives/current/grants.tar.gz
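For readers unfamiliar with the format, a minimal sketch of reading such a file line by line might look like this. The two sample records are invented stand-ins (hypothetical context URL and ids), not the elided SciGraph data above, and `iter_ndjson` is an illustrative name.

```python
import io
import json


def iter_ndjson(stream):
    """Yield one parsed JSON value per non-empty line of a text stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines; a strict reader might reject them
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON: {exc}") from exc


# Invented sample: each line is a complete JSON-LD node object that
# repeats its own @context, as in the data described above.
SAMPLE = (
    '{"@context": "https://example.org/context.json", "type": "MonetaryGrant", "id": "ex:grant.1"}\n'
    '{"@context": "https://example.org/context.json", "type": "MonetaryGrant", "id": "ex:grant.2"}\n'
)

records = list(iter_ndjson(io.StringIO(SAMPLE)))
ids = [record["id"] for record in records]
print(ids)
```

Because each line is parsed on its own, memory use stays bounded by the longest line rather than by the size of the file.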

Right now we are considering NDJSON-LD for input, but you make a good point that a streaming sparql-results-json for SELECT output would also be useful.

In fact, CONSTRUCT output as NDJSON-LD is non-trivial: how would the engine know which triples to put on each line? How would it know which is the "main loop" of the query, or the "primary key" so to speak?

@TallTed thanks for the pointers to MIME developments!

@gkellogg thanks for the pointer to Streaming jsonld!

@rubensworks
Member

rubensworks commented Feb 22, 2021

Pinging @wouterbeek here regarding NDJSON-LD, as he suggested it a while back in rubensworks/jsonld-streaming-parser.js#64

@ericprud
Member

There's a longish discussion of media subtypes containing '+' on media-types@ietf.org.
(I don't actually think nd+json is viable because people assume that +json means the resource matches RFC 4627, but folks can always relax their standards if they don't mind breaking some stuff.)

@afs
Collaborator

afs commented Feb 22, 2021

sparql-results+json is streaming if the fields are in the right order ("head" before "results").

  • Streaming a line format without Content-Length: means there can be silent truncation of results.
  • The absence of Content-Length also interacts with connection management, with some DoS potential from badly behaved clients.

These aren't reasons not to do it - they are things that should be noted in any design. Inside the enterprise is a different environment from the open web.
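The remark about field order can be illustrated with a small sketch of a writer that emits standard sparql-results+json incrementally, "head" first and then one binding per line. The function name and the sample bindings are hypothetical, and this is not how any particular engine does it.

```python
import json


def stream_results(variables, bindings):
    """Yield text chunks of a SPARQL JSON results document.

    "head" is emitted before "results", so a consumer learns the variable
    names before the first solution arrives. `bindings` may be any
    iterable, including a generator fed by a running query.
    """
    yield '{"head": ' + json.dumps({"vars": variables}) + ', "results": {"bindings": ['
    first = True
    for binding in bindings:
        # json.dumps without indent never emits a raw newline, so each
        # solution occupies exactly one line of the output.
        yield ("" if first else ",") + "\n" + json.dumps(binding)
        first = False
    yield "\n]}}"


# Hypothetical solutions for a query with a single variable ?s.
solutions = [
    {"s": {"type": "uri", "value": "http://example.org/a"}},
    {"s": {"type": "uri", "value": "http://example.org/b"}},
]
text = "".join(stream_results(["s"], solutions))
parsed = json.loads(text)  # the concatenated chunks form one valid document
print(text)
```

Here a cut-off stream fails to parse because the closing brackets are missing, whereas a pure line format gives no such signal, which is the truncation concern raised above.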

@jaw111
Contributor

jaw111 commented Sep 8, 2021

Just to note a real-world use case for newline delimited JSON-LD. For one application we developed, we index suitably framed JSON-LD documents in Elasticsearch where the documents are imported to Elasticsearch as NDJSON. That process uses a Jena model to gather RDF data from various sources (blackboard design pattern), then extracts and frames a sub-graph for resources of a given type.

Whilst it would be nice to be able to get some NDJSON-LD serialization as the result of a SPARQL query directly, I think it would be necessary to have some way to indicate a JSON-LD frame (rather than just a context as @gkellogg suggested) in order to guarantee consistent nesting/embedding in the JSON object structure.

Arguably for our usage the JSON-LD frame IS the query, a SPARQL query is not even needed.

@TallTed
Member

TallTed commented Sep 8, 2021

  • Streaming a line format without Content-Length: means there can be silent truncation of results.

@afs -- I would think that adding a specific termination marker to the syntax would avoid silent truncation without Content-length: -- and including the net line count in the termination marker (at which point, it should be trivially known) would prevent errors from missing lines, though it wouldn't give any good way to recover from such, other than repeating the request and running a diff on the two streams if the second also had some drop-outs...
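A rough sketch of that idea, assuming a final line that carries the record count. The marker key `@end` and both function names are invented for illustration and come from no spec; a real design would need a marker that cannot collide with ordinary record keys.

```python
import json

END_KEY = "@end"  # hypothetical marker name, not taken from any spec


def write_with_trailer(records):
    """Yield one NDJSON line per record, then a trailer with the line count."""
    count = 0
    for record in records:
        count += 1
        yield json.dumps(record) + "\n"
    yield json.dumps({END_KEY: {"lines": count}}) + "\n"


def read_with_trailer(lines):
    """Return the records, raising ValueError if the trailer is missing
    (truncated stream) or its count disagrees with the lines received."""
    records = []
    for line in lines:
        obj = json.loads(line)
        if isinstance(obj, dict) and END_KEY in obj:
            expected = obj[END_KEY]["lines"]
            if expected != len(records):
                raise ValueError(f"expected {expected} lines, got {len(records)}")
            return records
        records.append(obj)
    raise ValueError("stream ended without a termination marker")


full = list(write_with_trailer([{"id": 1}, {"id": 2}, {"id": 3}]))
print(read_with_trailer(full))
```

As noted in the comment, this detects a missing trailer or a wrong count but offers no way to recover which lines were lost.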

@afs
Collaborator

afs commented Sep 9, 2021

Content-Length is understood by HTTP/1.1 libraries and is used by them to reuse connections.

A trailer as protocol-level termination and including end-transfer information would be a good thing. It does not completely replace Content-Length though.

There is of course HTTP/2 - new protocol work ought to be an abstract design that exploits HTTP/2 features and can also be targeted at other transfer layers, for example, streaming gRPC. HTTP/1.1 may not be able to expose all of that design, though improvements like early termination can be fitted.

@JervenBolleman added the results application/sparql-results+rainbows label Nov 30, 2021
@VladimirAlexiev
Contributor Author

@jaw111 thanks for the input!

necessary to have some way to indicate a JSON-LD frame

Yes, unless you have #39, #48, #73, #128 :-)

the JSON-LD frame IS the query

I think you're talking GraphQL here :-)

@jaw111
Contributor

jaw111 commented Feb 17, 2022

I think you're talking GraphQL here :-)

I was not able to come to terms with GraphQL-LD, still prefer SPARQL.

There is definitely some overlap between JSON-LD frames and GraphQL.

@VladimirAlexiev
Contributor Author

Just a note that @butaloto is working to upgrade our NDJSONLD implementation eclipse-rdf4j/rdf4j#2840 to JSONLD 1.1


8 participants