Mapping between Triples and IRIs #23

HolgerKnublauch · 2020-10-30T00:17:23Z

There has been some discussion about "long URIs" to represent embedded triples in a backwards-compatible way. If we go down this road, we need to decide on a syntax for this mapping. The mapping should be bi-directional so that systems can parse URIs back to triples if needed. Ideally, the URIs should be as short as possible and be reasonably human-readable in case someone encounters them through a "leak".

PROPOSAL:
Given a triple S, P, O produce a IRI using the template urn:triple:${encode(S)}:${encode(P)}:${encode(O)} where the encode(N) function is (JavaScript) encodeURIComponent(ttl(N)) and ttl(N) is the Turtle serialization of N, without using prefixes but using absolute IRIs only. Blank nodes would become _:ID where ID is some internal ID that the current system uses (e.g. the Jena blank node label). See the sections including https://www.w3.org/TR/turtle/#sec-iri. For literals, the available short forms need to be used, e.g. "1"^^xsd:integer becomes 1, see https://www.w3.org/TR/turtle/#literals

We might want to use 'a' for rdf:type as there is a large number of triples of this form, but I have no strong opinion on that. Potentially the system could also rely on a number of hard-coded "well-known" prefixes such as rdf, owl, sh, skos. This would further shorten the URIs in case the implementation has them occupy memory.

See http://datashapes.org/reification.html#uriReification for an earlier version that is currently implemented in TopBraid. I have since convinced myself that relying on locally defined prefixes (per file) is not desirable, as prefixes may change and then these identifiers break.

The text was updated successfully, but these errors were encountered:

VladimirAlexiev · 2020-11-03T20:37:27Z

GraphDB and rdf4j use urn:rdf4j:triple:xxx where xxx stands for the Base64 URL-safe encoding of the N-Triples representation of the embedded triple.

HolgerKnublauch · 2020-11-03T23:58:53Z

Ok Base 64 is an option (assuming we agree that the :rdf4j part can simply be removed in a standardized form.

Comments:

N-Triples doesn't use any namespace abbreviations, which would cause quite a bit of bloat, e.g. when xsd:date has to be spelled out each time. I would argue that for brevity we should define standard prefixes and require their use.
Base64 is not human-readable, while URL-encoded strings are at least manageable

Why did you use Base64? Is it producing shorter URIs in average?

Is RDF4J ever storing these long URIs internally or does it use SPO pointers and only produces the URIs when needed (i.e. rarely)?

VladimirAlexiev · 2020-11-06T22:08:48Z

I vote against relying on prefixes because they can be redefined locally and even xsd is not standardized (some people use xs).

Can we use some short hash instead of base64?

HolgerKnublauch · 2020-11-06T23:58:53Z

On prefixes we had similar discussions in the SHACL-SPARQL work and noted that prefix declarations are not really an RDF graph concept, but merely a feature of serializations. They do not necessarily "survive" round-tripping so are generally not reliable, as you also say. However, we need to keep in mind that some implementations of a long-URI policy may in fact store these URIs are real strings, and in that case we should aim at keeping the URIs as short as reasonable. A catalog of prefixes such as

[ rdf, rdfs, owl, sh, xsd, skos ]

would hopefully be quite easy to agree on and would shorten the majority of triples considerably, esp with datatypes and in common cases like rdf:type and rdfs:comment triples. These hard-coded abbreviations improve memory consumption but also human-readability.

With hash number, how would they uniquely identify triples - they cannot be parsed back.

blake-regalia · 2020-11-10T05:58:24Z

The way GraphDB does it is perfect IMO.

N-Triples doesn't use any namespace abbreviations

Exactly 👍

prefix declarations are not really an RDF graph concept, but merely a feature of serializations.

Yes, prefixes should absolutely be avoided.

However, we need to keep in mind that some implementations of a long-URI policy may in fact store these URIs are real strings, and in that case we should aim at keeping the URIs as short as reasonable.

I wouldn't worry about implementation in this regard. We should focus on the serialization, the data model does not change; implementors will choose the appropriate data structures.

The mention of long URLs is interesting. As of today, the de facto maximum URL string length widely supported on the interwebz is about a 2000 characters, which would leave about 2,500 characters worth of content unencoded.

HolgerKnublauch · 2020-11-10T06:01:35Z

Would you help me understand your reason why prefixes should be absolutely avoided?

Are the URL string length restrictions relevant for IRIs?

blake-regalia · 2020-11-10T06:24:35Z

Are the URL string length restrictions relevant for IRIs?

Ah yes, I meant to mention that I could see this becoming a concern for dereferencing long URLs which encode several layers of embedded RDF* triples this way. Although I imagine it would likely never happen in practice.

Prefixes should be avoided mainly because they introduce ambiguity to an otherwise canonical form. If prefixes are allowed, then there can be two IRIs which encode semantically equivalent triples but manifest as different strings. While it may reduce string length and ease readability, it comes at a great cost to implementations since they must first normalize every string before storing or comparing. Also, prefixes are not in any way intrinsic to the specification (e.g., there is no ontology or set of vocabulary terms RDF-star uses other than maybe rdf) so selecting a set of prefixes would be rather arbitrary and preferential.

HolgerKnublauch · 2020-11-10T08:18:53Z

On 11/10/2020 4:24 PM, Blake Regalia wrote: Are the URL string length restrictions relevant for IRIs? Ah yes, I meant to mention that I could see this becoming a concern for dereferencing long URLs which encode several layers of embedded RDF* triples this way. Although I imagine it would likely never happen in practice. Prefixes should be avoided mainly because they introduce ambiguity to an otherwise canonical form.

This would be a valid concern.

If you allow prefixes, then you can have two IRIs which encode semantically equivalent triples but manifest as different strings. While it may reduce string length and ease readability, it comes at a great cost to implementations since they must first normalize every string before storing or comparing. Also, prefixes are also not in any way intrinsic to the specification (e.g., there is no ontology or set of vocabulary terms RDF-star uses other than maybe |rdf|) so selecting a set of prefixes is rather arbitrary and preferential.

To clarify this, the mapping that I propose would always canonical. For example, all valid xsd:integer literals must be serialized in the short form, i.e. just the digits. Whenever a resource from a known prefix is used, then the qname must be used etc. The list of known prefixes would be fixed across all implementations. (I am not religious about this topic and can of course live with N-Triples notation, just wanted to clarify this position). BTW even with N-Triples there is a tiny bit of ambiguity, because there are two ways of stating xsd:string literals. Holger

VladimirAlexiev · 2020-11-12T13:42:22Z

I think readability is not an important requirement, since when you go 2-3 levels of nesting, you'll get an unreadable mess no matter what encoding you use.
Limiting length is a legitimate requirement
Holger points out a requirement of parsability (invertibility). I hadn't thought about it, but I now think it's important, eg to parse and reconstruct RDF* from NTriples*

Using a set of fixed prefixes is a very small step towards limiting length and doesn't solve the problem.
Eg what would be the encoding of this RDF* triple:

<<:Michail_Sholokhov :wrote "<full text of And Quiet Flows the Don, all 5k pages of it>" >>
  :disputedBy :A_Chernov.

I think we need to pick some compression method.
Eg EXI https://en.wikipedia.org/wiki/Efficient_XML_Interchange uses Huffman coding for representing XML efficiently on constrained (IoT) devices.
See https://www.w3.org/TR/exi/, https://www.w3.org/TR/2009/WD-exi-evaluation-20090407/

HolgerKnublauch · 2020-11-13T00:58:36Z

It is quite easy to come up with cases where any algorithm will behave poorly. Going down multiple levels of nesting (i.e. statements about statements about statements) is one of those, but is this really happening in practice? Likewise, if anyone stores a whole book text as an RDF literal then the database will suffer no matter what.

I am open to compression algorithms assuming their trade-off is worth it. Keep in mind that we are talking about URIs, so any compressed binary format may require an extra level of URL-encoding. So you'd end up with quite a layering of algorithms that add up complicating the assessment. Qnames already solve compression in the RDF world, but they only work if we either define a comprehensive catalog of common prefixes or another mechanism to safely reference local prefixes (which I don't think is possible).

A proper scientific approach here would be to collect realistic sample data and then let the conversion algorithms do their work to compare size versus serialization/parsing performance, and then also readability (which I wouldn't want to give up on yet). The problem then becomes a matter of proper engineering.

So: does anyone have some example data?

pchampin · 2021-07-01T16:34:33Z

A long time ago, I flagged this discussion as relevant to semantics, but in retrospect, it seems to me that its is more about implementations. Semantically, this method raises problems as long as blank nodes are involved, because the blank node label that will be put in the IRI is irrelevant for the semantics (actually, it is even irrelevant for the abstract syntax). Of course, implementations can rely on that internally, and "do the right thing" under the hood with blank node labels.

Therefore, refiling this issue as discussion, and removing the semantics label. Shout if you disagree.

pchampin added the semantics About the semantics of RDF-star label Nov 10, 2020

pchampin mentioned this issue Nov 19, 2020

Should RDF* be just syntactic sugar on top of RDF? #37

Closed

pchampin added discussion Open ended discussion that does not call for a specific action and removed semantics About the semantics of RDF-star labels Jul 1, 2021

niklasl mentioned this issue Jul 9, 2023

Why quoted triples, when we already have named graphs? w3c/rdf-concepts#46

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mapping between Triples and IRIs #23

Mapping between Triples and IRIs #23

HolgerKnublauch commented Oct 30, 2020

VladimirAlexiev commented Nov 3, 2020

HolgerKnublauch commented Nov 3, 2020

VladimirAlexiev commented Nov 6, 2020

HolgerKnublauch commented Nov 6, 2020 •

edited

blake-regalia commented Nov 10, 2020

HolgerKnublauch commented Nov 10, 2020

blake-regalia commented Nov 10, 2020 •

edited

HolgerKnublauch commented Nov 10, 2020 via email

VladimirAlexiev commented Nov 12, 2020

HolgerKnublauch commented Nov 13, 2020

pchampin commented Jul 1, 2021

Mapping between Triples and IRIs #23

Mapping between Triples and IRIs #23

Comments

HolgerKnublauch commented Oct 30, 2020

VladimirAlexiev commented Nov 3, 2020

HolgerKnublauch commented Nov 3, 2020

VladimirAlexiev commented Nov 6, 2020

HolgerKnublauch commented Nov 6, 2020 • edited

blake-regalia commented Nov 10, 2020

HolgerKnublauch commented Nov 10, 2020

blake-regalia commented Nov 10, 2020 • edited

HolgerKnublauch commented Nov 10, 2020 via email

VladimirAlexiev commented Nov 12, 2020

HolgerKnublauch commented Nov 13, 2020

pchampin commented Jul 1, 2021

HolgerKnublauch commented Nov 6, 2020 •

edited

blake-regalia commented Nov 10, 2020 •

edited