This repository has been archived by the owner on Feb 15, 2020. It is now read-only.

Proposal: find a transitional solution before updating langString #13

Closed
pchampin opened this issue Jun 8, 2019 · 20 comments

Comments

@pchampin
Collaborator

pchampin commented Jun 8, 2019

I'm trying here to flesh out my argument (made at the end of this week's telco) that we could use the -x-dir-XXX private subtag, not to solve the problem in the long term, but as a means to ensure a smooth transition between the current state of the standards and the future state where base direction is cleanly integrated. Doing this, we could limit the charter of the new WG to a smaller set of specifications (RDF concepts, semantics and concrete syntaxes, basically) and leave it to other WGs to update the rest of the specifications (possibly together with other changes).

I understand that many people are pessimistic about letting the genie out of the bottle, fearing that once this private subtag is in use, it will spread and pollute even clean standards (such as HTML). And honestly, I share those concerns. But I think/hope that we can contain this risk – let the genie out "on parole", if you like. The alternative (updating all the specs at once) seems equally risky.

The core idea is that, in the future RDF model (I'll call it RDF 1.2 for convenience), we update langString as described in langString.html, but in the abstract syntax, we forbid the language tag from containing -x-dir-ltr or -x-dir-rtl. Instead, whenever those subtags are encountered (either in a concrete syntax or programmatically), they MUST be stripped out of the language tag and interpreted as the base direction [1].

This means that a Turtle 1.1 file (or SPARQL 1.1 query) may contain "مرحبا world, how are you?"@en-x-dir-ltr, but any RDF1.2 implementation will convert it automatically to "مرحبا world, how are you?"@en^ltr. The RDF 1.2 family of concrete syntaxes would accept the private subtag during parsing (for backward compatibility), possibly with a warning, but it would be illegal to use it when serializing.
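A minimal sketch of the stripping rule described above, in Python. The function name `split_direction` is hypothetical, the sketch only recognizes the subtag when it ends the language tag, and raising an error when both the subtag and an explicit direction are present is just the option suggested in the footnote, not a settled decision.

```python
import re

# Matches the proposed private-use suffix at the end of a language tag.
_DIR_SUFFIX = re.compile(r"-x-dir-(ltr|rtl)$", re.IGNORECASE)


def split_direction(tag, explicit_direction=None):
    """Return (language_tag, base_direction) with the private subtag stripped.

    `explicit_direction` stands for a direction given by other means
    (e.g. a future "@en^ltr" syntax); it is None when absent.
    """
    match = _DIR_SUFFIX.search(tag)
    if match is None:
        # No private subtag: keep the tag and any explicit direction as-is.
        return tag, explicit_direction
    if explicit_direction is not None:
        # Assumption: a subtag plus an explicit direction is treated as an error.
        raise ValueError("both -x-dir-* subtag and explicit direction given")
    return tag[:match.start()], match.group(1).lower()


print(split_direction("en-x-dir-ltr"))
print(split_direction("ar-AE"))
```

A parser for the 1.1 syntaxes could call this on every language-tagged literal and emit a warning whenever the returned tag differs from the input.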

When serializing from an RDF 1.2 implementation to a language of the 1.1 family (e.g. SPARQL 1.1 results), implementations MAY encode the direction information using the private subtag, in order to preserve that information. Thanks to the principles above, an RDF 1.2 store and an RDF 1.2 client may communicate using SPARQL 1.1, but will not spread the private subtag further.
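The reverse step, serializing down to a 1.1 syntax, could look roughly like this. Both function names are hypothetical, and the string escaping is deliberately minimal (backslash, double quote, newline, carriage return), so this is a sketch rather than a complete Turtle writer.

```python
def encode_direction_for_rdf11(language, direction=None):
    """Return an RDF 1.1 compatible language tag that carries the direction."""
    if direction is None:
        return language
    if direction not in ("ltr", "rtl"):
        raise ValueError(f"unknown base direction: {direction!r}")
    return f"{language}-x-dir-{direction}"


def turtle11_literal(lexical, language, direction=None):
    """Serialize a language-tagged literal in Turtle 1.1 syntax."""
    escaped = (
        lexical.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"@{encode_direction_for_rdf11(language, direction)}'


print(turtle11_literal("hello", "en", "ltr"))
```

Under the proposal, only serializers targeting the 1.1 family would call something like `encode_direction_for_rdf11`; a 1.2 serializer would write the direction in the new syntax instead.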

Once all specs (and corresponding implementations) are updated, only old static files may still contain the private subtag – and raise warnings whenever parsed by RDF 1.2 implementations.


[1] What happens when both the subtag and an explicit base direction are provided needs to be decided... I think it should be an error.

@iherman
Member

iherman commented Jun 10, 2019

Let us say a full RDF environment like RDFLib is updated to 1.2. This means the internal Literal structure is expanded to include a direction, and all equalities are managed. RDFLib has a series of parsers; they may all accept both -x-dir-* and @lang^dir, and they transform both into the new Literal. So far so good.

However, what should an RDF Turtle (or RDF/XML or JSON-LD) serializer do? Should it generate -x-dir-* or @lang^dir? The only answer I see is that… it depends. And this 'it depends' translates into some sort of user-settable option, which is bad. And that is where usage may get stuck with -x-dir-* once and for all. The same holds for all processors whose output is another RDF serialization: what should a JSON-LD or an RDFa distiller outputting, e.g., Turtle, do?

A more radical approach may be (but I am not sure) not to consider -x-dir-* as a temporary solution at all. Instead, using it should become THE serialization syntax for base serializations like Turtle, N-Triples, TriG, etc.✝︎ Conforming RDF 1.2 implementations would use the updated Literal internally to implement the abstract syntax, but Turtle serializers would use -x-dir-* and only this. I am not sure about non-core syntaxes, like RDFa and JSON-LD; I have the impression that the majority of usages for these syntaxes is to import into RDF environments like RDFLib, and less for export, i.e., a strict 1.2 export for these (e.g., producing a JSON-LD with a new syntax using @direction) might be less damaging.

(I'm not sure about RDF/XML. Maybe it should be forgotten…)

Indeed, if we do this, the WG would have to update the RDF family of specs only (and Turtle would only have to deal with editorial errata), and the only extra syntax to be updated is RDFa. SPARQL and SHACL, which are based on the Turtle syntax, might possibly choose to do an update later (or choose not to do it). (I am not sure about R2RML and CSVW.)

I am not sure that our I18N friends, like @r12a or @aphillips would like that, though. Even if it is used as a syntactic sugar for a few RDF serializations, it may leak out to other usages…

(I am not convinced about this approach either at this point, I am just musing…)


✝︎ Turtle is too 'close' to N-Triples, and N-Triples is really the "dump" format for RDF triples, so I do not think that a syntactic sugar in Turtle would be a good idea.

@pchampin
Collaborator Author

I am not sure that our I18N friends, like @r12a or @aphillips would like that, though.

I don't think they will, and let's face it, neither you nor I would like it very much either ;-)

Even if it is used as a syntactic sugar for a few RDF serializations, it may leak out to other usages…

Exactly!

@pchampin
Collaborator Author

pchampin commented Jun 10, 2019

Actually, my proposal above is two-fold, and maybe it was a mistake to merge both aspects. So, forgetting about the controversial -x-dir-* private subtag, my argument was that we could have

  • a durable solution (the consensus being, it seems, to update langString), requiring a change in RDF;
  • a transitional solution that is compatible with RDF 1.1, and that RDF 1.2 would normatively convert to the durable solution.

Since we only have 3 proposals on the table, if we ban -x-dir-* for the transitional solution, we are left with the LocalizableString option, of which I'm not a big fan (see #2), but it might be generally more acceptable. So rephrasing the two abstract points above:

  • we introduce LocalizableString as a new datatype, which is compatible with RDF 1.1, and can be used immediately;
  • then RDF 1.2 deprecates LocalizableStrings (just like RDF 1.1 deprecated plain literals), mandating that they are interpreted as updated langStrings.
    (an alternative would be to deprecate langString in favour of LocalizableString, but I have reservations about that, which I can develop later if needed).

@pchampin pchampin changed the title Proposal: -x-dir-XXX as a transitional hack Proposal: find a transitional solution before updating langString Jun 10, 2019
@iherman
Member

iherman commented Jun 10, 2019

Hm. The combination of LocalizableString first and RDF 1.2 second may be attractive, too. The WG could come out with, say, a WG Note with the new datatype in 2-3 months, so that the community can use it, and then move ahead with RDF 1.2.

However... if RDF 1.2 is defined by this WG, it still needs to update all the others... (although the time pressure is different). In some sense, the WG would become some sort of a maintainer of all things RDF... Not sure whether this is good or bad.

@pchampin
Collaborator Author

However, what should an RDF Turtle (or RDF/XML or JSON-LD) serializer do? Should it generate [the transitional solution] or [the durable solution]? The only answer I see is that… it depends. And this 'it depends' translates into some sort of user-settable option, which is bad.

I agree, unless we change the media types (e.g. text/turtle12 or text/turtle?profile=blabla-v1.2), but I don't think that's a very practical option... And even if we do, people will probably continue using Turtle 1.1, just to be on the safe side.

So I really like your idea of making the 1.2 serializations "compatible" with the 1.1 family of syntaxes – only with a different interpretation. That's what worked for deprecating plain literals. Piggy-backing on the @langTag syntax is controversial. What about piggy-backing on the ^^datatype syntax?

Yet another proposal

Consider the following Turtle/SPARQL literal: "مرحبا world"^^i18n:en-US_ltr (with the appropriate i18n prefix defined). It is valid in RDF 1.1. We would have to publish a note on the i18n namespace -- this would be an infinite namespace, which is weird, but RDF already has this anyway (the membership properties), and that could be only temporary (see below).

In RDF 1.2, we would decide that "hello"@en is syntactic sugar for "hello"^^i18n:en. We could even introduce "hello"@en_ltr (currently invalid) as syntactic sugar for "hello"^^i18n:en_ltr, but serializers would not produce this by default (for backward compatibility).

As for the interpretation of the i18n:* datatype IRIs, I see two options (there may be more) :

  • we accept them as an infinite family of datatypes that replace langString (which would be deprecated);
  • we consider them as "magic" IRIs automatically interpreted by parser to produce updated langStrings (with language and direction metadata).

Both options are slightly "impure", but I believe this is the price RDF has to pay for not having included base direction in the first place...
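To make the second ("magic IRI") option concrete, here is a minimal Python sketch of how a parser might split such a datatype IRI into language and direction. The namespace IRI `https://www.w3.org/ns/i18n#` is an assumption (no namespace has been published), and the function names are hypothetical.

```python
# Assumed namespace IRI for the i18n: prefix; nothing is published there yet.
I18N_NS = "https://www.w3.org/ns/i18n#"


def parse_i18n_datatype(iri):
    """Return (language, direction) for an i18n:* IRI, or None for other IRIs."""
    if not iri.startswith(I18N_NS):
        return None
    local_name = iri[len(I18N_NS):]
    # "_" cannot occur in a BCP47 tag, so the first "_" separates the two parts.
    language, separator, direction = local_name.partition("_")
    if not separator:
        return language, None
    if direction not in ("ltr", "rtl"):
        raise ValueError(f"unknown base direction in {iri!r}")
    return language, direction


def i18n_datatype(language, direction=None):
    """Build the i18n:* IRI that "hello"@en (or @en_ltr) would be sugar for."""
    suffix = f"_{direction}" if direction else ""
    return f"{I18N_NS}{language}{suffix}"


print(parse_i18n_datatype(I18N_NS + "en-US_ltr"))
```

Under the first option the IRI would simply be kept as the literal's datatype; under the second, the parser would use the returned pair to build an updated langString.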

@gkellogg
Member

gkellogg commented Jun 10, 2019

I like this last proposal, or at least the direction it is going in. It helps simplify the RDF Literal definition and better leverages the datatype element. Arguably, this would have been a better way for RDF 1.1 to have gone when introducing langString.

It also allows JSON-LD to make use of type maps and could lead to the deprecation of @language and language maps.

@pchampin
Collaborator Author

The goal is not to deprecate @language or language maps! I think many people will want to keep them, and I respect that. Even if we did unify language and direction into the more general datatype system, they would still have a special place for users...

@iherman
Member

iherman commented Jun 10, 2019

@pchampin,

On the i18n:*: I have no problem with the 'infinite' datatype, although maybe Pat or Peter may have some issues with it vis-à-vis the formal semantics. To be checked. I am still not sure whether I prefer this one over the LocalizableString datatype; the latter may be cleaner indeed.

I would still opt for your second option:

  1. we consider them as "magic" IRIs automatically interpreted by parser to produce updated langStrings (with language and direction metadata).

i.e., to properly update RDF concepts but not to deprecate langString.

However, @gkellogg @pchampin how would that work with indexing? I thought the JSON-LD 1.1 type maps are for object types (i.e., real RDF types) and not for datatypes...

From the current JSON-LD draft, in https://w3c.github.io/json-ld-syntax/#node-type-indexing:

This enables data to be structured based on the @type of specific node objects.

@pchampin
Collaborator Author

The way I see it, JSON-LD would encode those literals as {"@value": ..., "@language": ..., "@direction":...}, and only resort to the i18n namespace when serializing to RDF.

Symmetrically, the RDF to JSON-LD algorithm should convert any i18n:* datatype to the corresponding @language and @direction, thus anticipating the change of semantics in RDF 1.2.

Of course, some other implementations may still produce a value object of the form {"@value": "hello", "@type": "i18n:en-US_ltr"}, in which case JSON-LD processors would not behave correctly. But I would argue that this would be a bug in the source implementation, not in the JSON-LD processor. Just like "hello"^^rdf:langString is valid Turtle, but would probably lead to unexpected problems...
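A rough Python sketch of the two conversions described above, working on plain dicts rather than a real JSON-LD processor. The namespace IRI and both function names are assumptions made for illustration only.

```python
# Assumed namespace IRI for the i18n: prefix, plus the standard xsd:string IRI.
I18N_NS = "https://www.w3.org/ns/i18n#"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def literal_to_value_object(lexical, datatype_iri):
    """RDF to JSON-LD: turn an i18n:* datatype into @language/@direction."""
    if not datatype_iri.startswith(I18N_NS):
        return {"@value": lexical, "@type": datatype_iri}
    language, _, direction = datatype_iri[len(I18N_NS):].partition("_")
    value_object = {"@value": lexical}
    if language:
        value_object["@language"] = language
    if direction:
        value_object["@direction"] = direction
    return value_object


def value_object_to_literal(value_object):
    """JSON-LD to RDF: resort to the i18n namespace only at this point."""
    lexical = value_object["@value"]
    language = value_object.get("@language")
    direction = value_object.get("@direction")
    if language is None and direction is None:
        return lexical, value_object.get("@type", XSD_STRING)
    suffix = f"_{direction}" if direction else ""
    return lexical, f"{I18N_NS}{language or ''}{suffix}"


print(literal_to_value_object("hello", I18N_NS + "en-US_ltr"))
```

With this shape, the i18n:* IRI never appears as an "@type" inside JSON-LD documents; it only exists on the RDF side of the boundary.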

@iherman
Member

iherman commented Jun 10, 2019

B.t.w.... the spec for the i18n URL-s would probably be something like "combine BCP47 with the dir" to yield something like: https://www.w3.org/ns/i18n#en_ltr, but that means again combining BCP47 (an existing standard) with some other terms via some microsyntax. To avoid any problem with a future BCP47 syntax evolution, we may want to avoid that (what happens if BCP47bis decides to use the _ character?).

This means we can do something like:

  • use https://www.w3.org/ns/i18n?lang=us&dir=ltr. Yeah, ugly. I am not sure Turtle could handle something like
@prefix i18n: "https://www.w3.org/ns/i18n?lang=" .
[] my:prop i18n:en&dir=ltr .
  • use something that BCP does not have a problem with. Ehem, ehem, that would mean something like: "abcd"^^i18n:en-x-dir-ltr… Back to square one?

@gkellogg
Member

However, @gkellogg @pchampin how would that work with indexing? I thought the JSON-LD 1.1 type maps are for object types (i.e., real RDF types) and not for datatypes...

From the current JSON-LD draft, in https://w3c.github.io/json-ld-syntax/#node-type-indexing:

This enables data to be structured based on the @type of specific node objects.

Actually, the API implementation of expansion works (compaction seems to require a minor tweak for term selection) with both node objects and data objects. We could change the name of the Node Type Indexing section and add some minor text to include this (we need to do something if we want to exclude value objects in the API). Try, for example this playground link.

In retrospect, I’m not sure why the syntax document restricted type indexing to node objects.

@pchampin
Collaborator Author

@iherman If BCP47 evolves so much as to allow _, this would be a different spec, right? So RDF 1.1 and 1.2 would still reference the original spec, and they should not be impacted. And by the time RDF 2.0 (?) upgrades to BCP47bis, hopefully, it will deprecate the ^^i18n:xx_yy hack in favor of a cleaner syntax.

@aphillips
Collaborator

@iherman @pchampin That's right. The syntax of BCP47, in any of its iterations, has never allowed any characters except a-z, 0-9, and hyphen. Future iterations are not envisioned, but certainly it would be a breaking change (and extremely remarkable) to add more characters to what's permitted and still somehow be BCP47.

However, underscores are used in some systems for locale identifiers and most implementations of language/locale mapping are at least a little permissive about exchanging one for the other (because developers can't remember which one to use).

@pchampin
Collaborator Author

@gkellogg Your example in the playground is strange, but I followed the idea and built another one.

I'm not very comfortable with it though. I consider the i18n:* IRIs as a hack to convey structured metadata in syntaxes (such as Turtle) and models (such as RDF 1.1) that are not well designed for it. In the case of JSON-LD, the clean way to represent this metadata seems to be {"@value": ..., "@language": ..., "@direction": ...}. And that's the structure that the JSON-LD algorithms should handle...

One thing that type maps won't allow me to do is to add direction to some of the values: in my example above, I would like to be able to write:

  "name_map": {
    "fr": "Lyon",
    "ar": { "@value": "ليون", "@direction": "rtl" }
  }

I don't want to have to put the direction in the key (as in "ar_rtl": "ليون") because my users will look for ar in the map.

@gkellogg
Member

The problem with the expanded value object version is that there's no way to use maps, which I think might be important here.

One thing that type maps won't allow me to do is to add direction to some of the values

You'd need to add a term such as "ar-rtl", if the direction is conveyed in a type map.

Language maps may be best, if we adopt the -x-dir-rtl extension, or similar.

@pchampin
Collaborator Author

@aphillips

However, underscores are used in some systems for locale identifiers and most implementations of language/locale mapping are at least a little permissive about exchanging one for the other

Well, fortunately, RDF is not (at least, not normatively). The abstract syntax specifies that "the language tag MUST be well-formed according to section 2.2.9 of [BCP47]", and concrete syntaxes only accept dashes between letters and digits.

I quickly tried a few implementations, none of them let me use underscore in language tags...

@aphillips
Collaborator

@pchampin Agreed.

@gkellogg I agree with @pchampin that the direction doesn't want to participate in the map/language negotiation. I'll point out that string-meta has a section on this. One of the things about appending gunk to the end of the language tags that is not "default ignorable" is that it can interfere with BCP47's prefix matching language negotiation heuristic. Re-separating the direction from the language tag helps prevent ar-x-dir-rtl in the map from not matching a request for ar-AE.
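The matching failure can be illustrated with a small Python sketch of the lookup scheme from RFC 4647 (section 3.4), which progressively truncates the requested range. The function name `lookup` and the sample tag lists are illustrative only, and default values and wildcard ranges are left out.

```python
def lookup(available_tags, requested_range):
    """Simplified RFC 4647 lookup: truncate the range until a tag matches."""
    by_lowercase = {tag.lower(): tag for tag in available_tags}
    subtags = requested_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lowercase:
            return by_lowercase[candidate]
        subtags.pop()
        # A trailing single-character subtag (such as "x") is dropped as well.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return None


# With the direction glued onto the tag, a request for ar-AE finds nothing.
print(lookup(["fr", "ar-x-dir-rtl"], "ar-AE"))
# With the direction kept separate, the same request falls back to "ar".
print(lookup(["fr", "ar"], "ar-AE"))
```

The request is shortened from ar-AE to ar, which is never equal to ar-x-dir-rtl, so the entry is only found when the map key is the bare language tag.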

@gkellogg
Member

@pchampin Agreed.

@gkellogg I agree with @pchampin that the direction doesn't want to participate in the map/language negotiation. I'll point out that string-meta has a section on this. One of the things about appending gunk to the end of the language tags that is not "default ignorable" is that it can interfere with BCP47's prefix matching language negotiation heuristic. Re-separating the direction from the language tag helps prevent ar-x-dir-rtl in the map from not matching a request for ar-AE.

If we go for a discrete direction, we'll need to update the JSON-LD language map, which is currently restricted to having values which are plain strings, to allow either plain strings or value objects with no conflicting @language member.
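A minimal sketch of what such a relaxed language map expansion could look like, in Python on plain dicts. The function name `expand_language_map` is hypothetical, and the sketch only handles plain strings and value objects, nothing else a real processor would need to support.

```python
def expand_language_map(language_map):
    """Expand a language map whose values are strings or value objects."""
    expanded = []
    for language, entry in language_map.items():
        if isinstance(entry, str):
            expanded.append({"@value": entry, "@language": language})
        elif isinstance(entry, dict):
            declared = entry.get("@language", language)
            if declared.lower() != language.lower():
                # The value object must not contradict the map key.
                raise ValueError(
                    f"@language {declared!r} conflicts with key {language!r}"
                )
            expanded.append({**entry, "@language": language})
        else:
            raise TypeError(f"unsupported language map value: {entry!r}")
    return expanded


name_map = {
    "fr": "Lyon",
    "ar": {"@value": "ليون", "@direction": "rtl"},
}
for value_object in expand_language_map(name_map):
    print(value_object)
```

Because the key stays a bare language tag, a consumer looking up "ar" still finds the entry, and the direction travels inside the value object.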

@aphillips
Collaborator

@gkellogg Yes: that's definitely a problem. See my link to "section on this" for thoughts on this (which includes improvements to language maps having to do with language tag handling as well).

@iherman
Member

iherman commented Jun 25, 2019

I have added this to the charter as an alternative.

@iherman iherman closed this as completed Jun 25, 2019