
Request for guidance for parsing larger JSON-LD data files #366

Closed
wouterbeek opened this issue Jan 17, 2021 · 3 comments

Comments

wouterbeek commented Jan 17, 2021

As a standard-compliant linked data environment provider, we would love to add support for JSON-LD uploads. There is active demand for such a feature: several of our customers have asked us to do so. We have tried to add such support based on existing parsers. Unfortunately, none of the contemporary JSON-LD parsers are able to support the upload of medium-sized linked data files in the JSON-LD format.

We have looked at building a parser ourselves, or fixing an existing parser to be able to process such medium-sized datasets. The problem seems to be that the JSON-LD format does not guarantee that all transformations are defined up-front. IOW, at the end of the stream a transformation may be defined that has to be applied to an element at the beginning of the stream. (See this GitHub issue for a discussion along these lines.)
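For illustration (an example of mine, not taken from the thread): since JSON objects are unordered, a term-defining `@context` may legally appear as the last key, so a single-pass parser cannot interpret the earlier keys until the whole object has been read:

```json
{
  "name": "Alice",
  "knows": {"name": "Bob"},
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "knows": "http://xmlns.com/foaf/0.1/knows"
  }
}
```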

What is the JSON-LD WG's viewpoint on the inability to parse medium-sized linked data files today? Was the issue of parsing larger data files considered at all? If so, what is the guidance provided by the JSON-LD WG for dealing with such files? If the issue of parsing larger data files was not previously considered, is it possible to come up with a practical solution now that such larger files are available and linked data users expect to be able to use them in modern linked data environments?

I have answered the most obvious counter-questions in the following appendices.

Appendix A: Do medium-sized linked data files in JSON-LD really exist?

With "medium-sized linked data files" I mean data files that are currently published by organizations online in the JSON-LD format, that users of a modern linked data environment that purports to support JSON-LD uploads can legitimately be expected to be able to process.

Two examples of such medium-sized linked data files are:

We did not perform an extensive search for larger JSON-LD files, but believe that much larger JSON-LD files likely already exist today, or will at the very least be created in the near future, given the popularity of the JSON-LD format.

Appendix B: What about JSON-LD Streaming Document Format?

Data files that follow the JSON-LD Streaming Document Format can be parsed by single-pass streaming parsers. If all JSON-LD files were guaranteed to be in the Streaming Document Format, then the issue raised here would probably not exist. Unfortunately, whether a JSON-LD file conforms to the Streaming Document Format cannot easily be determined before parsing begins. Also, users who create larger JSON-LD files are currently not guaranteed to use the Streaming Document Format exclusively.

Appendix C: Why not only support JSON-LD Streaming Document Format?

In theory it would be possible to only support JSON-LD uploads in the Streaming Document Format. However, this restriction is difficult to communicate to users, who often do not know whether a data file they want to upload does or does not follow the Streaming Document Format. Determining whether this format is used in the first place is relatively difficult (see Appendix B), and explaining to users that they should go back to their JSON-LD data file and first reformat it to follow the Streaming Document Format is not very user-friendly.

Also, for existing open data files that do not use the JSON-LD Streaming Document Format, somebody has to perform a non-trivial conversion. Which tool should be used to perform such a non-trivial conversion? Building such a conversion tool (from JSON-LD to JSON-LD Streaming Document Format) may not be significantly easier than building a generic, two-pass streaming JSON-LD parser in the first place.

Appendix D: Why not only support JSON-LD uploads for small data files?

In theory it would be possible to only support JSON-LD uploads up to a certain file size that we are confident contemporary parsers can process.

The problem with this approach is that we do not enforce size limits for any of the other RDF serialization formats. Indeed, the ability to upload data of arbitrary size is an important feature of our approach. Also, the existence of larger JSON-LD data files today shows that there is a need to create and use such files.

Explaining to users that we can process arbitrarily large files in Turtle and RDF/XML, but not in JSON-LD, will be difficult. This communication problem is worsened by the fact that the upload limit would probably be relatively low, based on our experience with existing parsers so far.

gkellogg (Member) commented

There's a lot to deal with in one issue...

> As a standard-compliant linked data environment provider, we would love to add support for JSON-LD uploads. There is active demand for such a feature: several of our customers have asked us to do so. We have tried to add such support based on existing parsers. Unfortunately, none of the contemporary JSON-LD parsers are able to support the upload of medium-sized linked data files in the JSON-LD format.

You'll need to qualify "support". There is nothing inherent in the spec or in the various implementations that prevents support for medium-large files, other than processor and memory limitations. JSON-LD leverages JSON, which uses an inherently unordered Object type; that is what typically leads to the need to load the entire document into memory to parse it.

That said, as you've noted, there is a streaming profile that places some restrictions on the serialized representation of JSON-LD in order to manage that memory usage.

> We have looked at building a parser ourselves, or fixing an existing parser to be able to process such medium-sized datasets. The problem seems to be that the JSON-LD format does not guarantee that all transformations are defined up-front. IOW, at the end of the stream a transformation may be defined that has to be applied to an element at the beginning of the stream. (See this GitHub issue for a discussion along these lines.)
>
> What is the JSON-LD WG's viewpoint on the inability to parse medium-sized linked data files today?

It is able to parse such files, subject to in-memory limitations, again due to the underlying JSON representation. Note that there is some discussion of a CBOR-LD (also here), which could be better optimized and would likely draw on some of the streaming principles, coming with a known or mandated key ordering, but work hasn't progressed on that. IMO, a format such as CBOR likely has some inherent advantages for medium-large files.

> Was the issue of parsing larger data files considered at all? If so, what is the guidance provided by the JSON-LD WG for dealing with such files? If the issue of parsing larger data files was not previously considered, is it possible to come up with a practical solution now that such larger files are available and linked data users expect to be able to use them in modern linked data environments?

One change made in JSON-LD 1.1 as a concession to larger files is to make key sorting optional (I don't think jsonld.js takes advantage of this, though); this can help a lot in many cases, but JSON-LD was based on JSON, which is really designed to be an in-memory format. As such, it may not be appropriate for all use cases.

That said, I think the Streaming and CBOR directions would be worth investigating further, as would other profile restrictions that might limit the usage of features such as scoped contexts and arbitrary key order, which are a big part of the problem in streaming.

> I have answered the most obvious counter-questions in the following appendices.
>
> Appendix A: Do medium-sized linked data files in JSON-LD really exist?
>
> By "medium-sized linked data files" I mean data files that organizations currently publish online in the JSON-LD format, and that users of a modern linked data environment that purports to support JSON-LD uploads can legitimately expect to be able to process.
>
> Two examples of such medium-sized linked data files are:
>
> We did not perform an extensive search for larger JSON-LD files, but believe that much larger JSON-LD files likely already exist today, or will at the very least be created in the near future, given the popularity of the JSON-LD format.

I'd call these "large" and not "medium". I suspect that any such large file is generated algorithmically, which could better ensure feature limitation and key ordering to meet the requirements of JSON-LD Streaming. Note that there is a profile defined to identify (and request) a streaming profile: http://www.w3.org/ns/json-ld#streaming. This can be used, as described in IANA Considerations, as part of a MIME type or HTTP request.
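For example, a client can request the streaming profile through content negotiation (header values per the IANA registration referenced above):

```http
GET /dataset.jsonld HTTP/1.1
Accept: application/ld+json;profile="http://www.w3.org/ns/json-ld#streaming"
```

A server announcing conformance would likewise set `Content-Type: application/ld+json;profile="http://www.w3.org/ns/json-ld#streaming"` on the response.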

> Appendix B: What about JSON-LD Streaming Document Format?
>
> Data files that follow the JSON-LD Streaming Document Format can be parsed by single-pass streaming parsers. If all JSON-LD files were guaranteed to be in the Streaming Document Format, then the issue raised here would probably not exist. Unfortunately, whether a JSON-LD file conforms to the Streaming Document Format cannot easily be determined before parsing begins. Also, users who create larger JSON-LD files are currently not guaranteed to use the Streaming Document Format exclusively.

It can be determined if set in the HTTP response. Of course, it may not be configured properly to do so. I also believe that @rubensworks' implementation will do a good job with arbitrary JSON-LD, and take advantage of streaming where it can.

The WG, with sufficient input, could add sections to JSON-LD Best Practices on considerations for creating and consuming medium-large files.

> Appendix C: Why not only support JSON-LD Streaming Document Format?
>
> In theory it would be possible to only support JSON-LD uploads in the Streaming Document Format. However, this restriction is difficult to communicate to users, who often do not know whether a data file they want to upload does or does not follow the Streaming Document Format. Determining whether this format is used in the first place is relatively difficult (see Appendix B), and explaining to users that they should go back to their JSON-LD data file and first reformat it to follow the Streaming Document Format is not very user-friendly.

I could certainly imagine a two-part process where a JSON-LD document is first put into a streaming format, something that at worst involves reading in the whole document, but often can be faster as keywords typically appear first in an object. It could then be sent to a streaming parser.

Is it your experience that such large files heavily use scoped contexts and object embedding? The typical ToRdf process, which might be used to serialize a dataset to JSON-LD, would not use object embedding, and thus the file could be read in a chunked fashion on object boundaries.
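For illustration (my example, not from the thread): such ToRdf-style output is an array of flat node objects in expanded form, which a reader can consume one array element at a time:

```json
[
  {"@id": "http://example.org/alice",
   "http://xmlns.com/foaf/0.1/knows": [{"@id": "http://example.org/bob"}]},
  {"@id": "http://example.org/bob",
   "http://xmlns.com/foaf/0.1/name": [{"@value": "Bob"}]}
]
```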

> Also, for existing open data files that do not use the JSON-LD Streaming Document Format, somebody has to perform a non-trivial conversion. Which tool should be used to perform such a non-trivial conversion? Building such a conversion tool (from JSON-LD to JSON-LD Streaming Document Format) may not be significantly easier than building a generic, two-pass streaming JSON-LD parser in the first place.

Algorithmically, it should not be difficult, as it is a simple recursive process with key ordering. Practically, the storage requirements given sufficient object embedding and pathological native key ordering may limit this, but you may find that the 90% case can be handled easily enough.
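A minimal sketch of that recursive process (my own illustration in TypeScript; the keyword order shown reflects the streaming profile's preference for keywords such as `@context` and `@type` appearing first):

```ts
// Keywords that the streaming profile expects early in each object.
const KEYWORD_ORDER = ["@context", "@type", "@id"];

// Rank keywords ahead of all other keys; non-keywords keep their relative order.
const rank = (key: string): number => {
  const i = KEYWORD_ORDER.indexOf(key);
  return i === -1 ? KEYWORD_ORDER.length : i;
};

// Recursively rebuild objects with keys in streaming-friendly order.
function toStreamingOrder(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(toStreamingOrder);
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(obj).sort((a, b) => rank(a) - rank(b))) {
      out[key] = toStreamingOrder(obj[key]);
    }
    return out;
  }
  return value; // strings, numbers, booleans, null pass through unchanged
}
```

Serializing the result with `JSON.stringify` then yields a document whose keys appear in streaming-friendly order; note the caveat above that deep object embedding still forces the whole document into memory during this conversion.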

> Appendix D: Why not only support JSON-LD uploads for small data files?
>
> In theory it would be possible to only support JSON-LD uploads up to a certain file size that we are confident contemporary parsers can process.
>
> The problem with this approach is that we do not enforce size limits for any of the other RDF serialization formats. Indeed, the ability to upload data of arbitrary size is an important feature of our approach. Also, the existence of larger JSON-LD data files today shows that there is a need to create and use such files.
>
> Explaining to users that we can process arbitrarily large files in Turtle and RDF/XML, but not in JSON-LD, will be difficult. This communication problem is worsened by the fact that the upload limit would probably be relatively low, based on our experience with existing parsers so far.

In summary, I think the use case for JSON-LD was not based on the need for large data-dump formats, but on smaller API-style JSON messages. That said, there are ways to support such large formats if they conform to something like the streaming profile, and tools that create such dumps should take that into consideration when generating their output.

Similarly, processors may place restrictions on the profile of files they will support, either with an explicit profile argument or by runtime detection. Failing, or falling back to a best-effort service, may be the best approach. In the case of a large file composed of an array of flat node objects (and named graphs containing flat node objects), actual key ordering is likely not the issue, but an optimized parser may still be in order.

rubensworks (Member) commented

If I understand correctly, @wouterbeek's use case depends on direct data uploads via browser-based forms. In this case, attaching the streaming profile via an HTTP header is not directly possible. Instead, one would have to rely on the data uploader to "promise" that they have supplied a document adhering to the streaming profile.

Streaming parsers are required to emit an error if a document is detected that does not adhere to the streaming profile. So this would at least allow you to detect such documents.
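For example, with jsonld-streaming-parser.js, a sketch along these lines (I'm assuming the documented `streamingProfile` option; treat the details as illustrative rather than definitive):

```ts
import * as fs from "fs";
import { JsonLdParser } from "jsonld-streaming-parser";

// Assume up-front that the upload follows the streaming profile;
// the parser then errors as soon as that assumption is violated.
const parser = new JsonLdParser({ streamingProfile: true });

fs.createReadStream("upload.jsonld")
  .pipe(parser)
  .on("data", (quad) => {
    // Each emitted quad can be stored immediately, without buffering the file.
    console.log(quad.subject.value, quad.predicate.value);
  })
  .on("error", (err) => {
    // The document violated the streaming profile (or is invalid JSON-LD).
    console.error("Rejecting upload:", err.message);
  })
  .on("end", () => console.log("Upload parsed in a single pass."));
```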

I think the streaming profile (perhaps combined with CBOR-LD in the future) is the proper way forward for parsing large JSON-LD files. This would of course depend on data providers publishing their data following the streaming profile, which should be reasonable considering the minimal set of requirements this profile imposes.

To achieve this, data providers could build a pipeline that first exports to another streaming-friendly RDF format such as N-Quads, and then pipes this into a JSON-LD streaming serializer (taking into account the recommended triple ordering), as sketched below.
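A sketch of such a pipeline (my illustration, pairing the N3.js stream parser with jsonld-streaming-serializer.js; the option names are from their documentation but should be double-checked):

```ts
import * as fs from "fs";
import { StreamParser } from "n3"; // parses N-Quads as a Node stream of RDF/JS quads
import { JsonLdSerializer } from "jsonld-streaming-serializer";

// Export step (not shown): dump the dataset to N-Quads, ideally with quads
// grouped by graph and subject, per the recommended triple ordering.
fs.createReadStream("dump.nq")
  .pipe(new StreamParser({ format: "N-Quads" })) // quad stream in
  .pipe(new JsonLdSerializer({ space: "  " }))   // streaming-friendly JSON-LD out
  .pipe(fs.createWriteStream("dump.jsonld"));
```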

In general, my recommendation would be to accept all JSON-LD files up to a certain file size, and for larger JSON-LD files, only accept them in the streaming profile.

wouterbeek (Author) commented Jan 18, 2021

Thanks @gkellogg and @rubensworks for the good feedback!

If generic JSON-LD is inherently an in-memory format, then we must indeed try to communicate this to our users somehow. For larger files we would need to ask the user to modify their JSON-LD file to use the streaming profile.
