Round tripping language tags case #73

ericprud · 2017-10-07T11:47:48Z

Related: #71

Apart from value set values with language tags, ShExC, ShExJ and ShExR can be exactly round-tripped, cf. the schema tests. Because language-tagged literals are expressed as JSON-LD object literals and RDF parsers are not required to preserve upper/lower case in literal language tags, a ShExC schema:

<vs1> ["flat"@en-GB]

would be translated to ShExR:

[] a sx:Schema ; sx:shapes <http://a.example/vs1> .
<http://a.example/vs1> a sx:NodeConstraint ;
  sx:values ( "flat"@en-GB ) .

An RDF parser is allowed to parse that as en-gb so it would round-trip to ShExC:

<vs1> ["flat"@en-gb]

This doesn't affect semantics of validation but it can be a pain for folks who like to follow ISO language code rules where regions should be upper case, i.e. en-GB. (This has little impact as no one uses ShExR anyways.) Round-tripping between ShExC and ShExJ (as JSON) is unaffected by this.

PROPOSE:

add a note in the spec documenting this as a round-tripping deficiency and stating that if this is a problem for users, future versions of ShExJ will not use JSON-LD object literals for value set values.
adopt ~ address schema LangTag in [shexSpec/shex#71] shexTest#25 which has additional schemas which differ only in language tag case.


ericprud · 2017-10-10T16:36:08Z

Alternate choice: leave no expectation of case-preservation when converting ShExC<->ShExJ

PROPOSE:

add a note in the spec documenting that no round trips preserve case and stating that if this is a problem for users, future versions of ShExJ will not use JSON-LD object literals for value set values.
reject ~ address schema LangTag in [shexSpec/shex#71] shexTest#25 and add pair-wise mixed-case tests which demonstrate that both "ab"@en and "ab"@EN parse to each other.

VladimirAlexiev · 2017-10-20T11:03:13Z

"ISO": indeed, e.g. ru-Cyrl-RU is preferable to ru-cyrl-ru.
Can you say that converting back to ShExC converts tags using "BCP normalization"?
I posted to the RDF mailing list https://lists.w3.org/Archives/Public/public-rdf-comments/2014Jan/0011.html including an overview of what different frameworks do (e.g. Sesame lowercases, Jena preserves)
and an implementation https://rt.cpan.org/Public/Ticket/Attachment/1267147/670949/lang_normalize.pl.

This won't help round-tripping but at least will enforce determinism.

ericprud · 2017-10-22T04:22:11Z

If I understand, the proposal is to:

UPPERCASE any two-letter sequence following a sequence of two or more letters.
Titlecase any four-letter sequence following a sequence of two or more letters.

e.g. mn-Cyrl-MN. I am motivated to improve RDF conformance with BCP47 rather than continue to propagate lazy short-cuts.
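The two rules above can be sketched in Python as follows. This is only an illustration, not code from any ShEx implementation: the function name `canonical_langtag_case` is hypothetical, and lowercasing every other subtag is an assumption taken from the case conventions in BCP47 section 2.1.1 rather than from the two rules as listed. A "sequence of two or more letters" is read here as "any preceding subtag that is not a singleton". The sketch does not consult the language subtag registry, so it does no special handling of irregular tags.

```python
def canonical_langtag_case(tag: str) -> str:
    """Apply the case conventions sketched above to a language tag.

    Assumption: all subtags are lowercased first, then two-letter and
    four-letter subtags that follow a non-singleton subtag are changed to
    uppercase and title case respectively.
    """
    subtags = tag.lower().split("-")
    result = []
    for i, subtag in enumerate(subtags):
        # The first subtag, and any subtag right after a singleton such as
        # "x", stays lowercase.
        follows_non_singleton = i > 0 and len(subtags[i - 1]) >= 2
        if follows_non_singleton and len(subtag) == 2:
            subtag = subtag.upper()
        elif follows_non_singleton and len(subtag) == 4:
            subtag = subtag.capitalize()
        result.append(subtag)
    return "-".join(result)


print(canonical_langtag_case("mn-cyrl-mn"))   # mn-Cyrl-MN
print(canonical_langtag_case("EN-gb"))        # en-GB
print(canonical_langtag_case("x-what-ever"))  # x-what-Ever
```

Because the result depends only on the subtag lengths and positions, applying the function twice gives the same tag as applying it once, which is what makes the output deterministic regardless of what case the RDF parser handed back.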

Does this derive from the BCP47 grammar or some other text in the doc?
Diving deeper into BCP47 than I ever wanted, I see a finite list of irregular tags with the comment "most are deprecated". What would be best here, ignore them or reference them from the spec (and thus stick them in every impl)?

Can you make a PR on the spec value constraints section (and maybe value set parsing) to make this concrete? I'd propose @gkellogg and @ericprud as reviewers.

VladimirAlexiev · 2017-10-23T07:07:02Z

Your description is correct (but there's also a dash between the two sequences).
The script quotes verbatim from http://tools.ietf.org/html/bcp47#section-2.1.1.
Irregular tags will be normalized to what is given in the spec. You don't need to specify them separately.

I don't think this normalization has any bearing on validation, since validation must be case-insensitive.
Cheers!
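To illustrate why tag case has no bearing on validation, here is a minimal Python sketch of a case-insensitive value set check. The names `langtags_match` and `in_value_set`, and the representation of a language-tagged literal as a `(lexical form, language tag)` pair, are hypothetical and chosen only for this example; they are not the API of any ShEx library.

```python
from typing import Iterable, Tuple

# Hypothetical representation: (lexical form, language tag).
LangLiteral = Tuple[str, str]


def langtags_match(a: str, b: str) -> bool:
    """Compare two language tags without regard to case."""
    return a.lower() == b.lower()


def in_value_set(literal: LangLiteral, value_set: Iterable[LangLiteral]) -> bool:
    """True if the literal equals some value, comparing tags case-insensitively.

    The lexical form is still compared exactly; only the tag case is ignored.
    """
    lexical, lang = literal
    return any(
        lexical == v_lexical and langtags_match(lang, v_lang)
        for v_lexical, v_lang in value_set
    )


values = [("flat", "en-GB")]
print(in_value_set(("flat", "en-gb"), values))  # True
print(in_value_set(("flat", "en"), values))     # False
```

Under this reading, a schema written with "flat"@en-GB and its round-tripped form with "flat"@en-gb accept exactly the same data.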

ericprud · 2017-10-23T11:48:07Z

I wasn't worried about the validation, just exactly how to specify the canonical form. I guess you have something in mind like:

When emitting a ShEx schema, language tags in that schema SHOULD be in the canonical language tag form in order to comply with [[!BCP47]] section @@!.
A language tag is in canonical language tag form if a language tag is split on '-' into a set of sequences and the following rules applied before it is joined again on '-':

Each two-letter sequence following a sequence of two or more letters is in uppercase, e.g. ab-CD-EF-ghi

Each four-letter sequence following a sequence of two or more letters is in title case. e.g. ab-Cdef-Ghij

Where in BCP47 do the capitalization rules come from? Can we justify the rules above?

VladimirAlexiev · 2017-10-24T07:23:59Z

before it is joined again

This is wrong. This would be ambiguous for e.g. x-whatever. The rules require capitalization in e.g. x-what-Ever and x-what-EV but require nothing in x-whatever or x-what-everything or x-what-eve.

Just refer to sec 2.1.1. IMHO you don't need to restate the rules, just give some examples

VladimirAlexiev · 2017-10-24T07:25:58Z

I think I misread what "joined" means. I still think you don't need to restate the rules, but if you want to do it, please change "set of sequences" to "sequence of strings"

ericprud added this to the 2.next milestone Nov 23, 2018
