Round tripping language tags case #73

Open
ericprud opened this issue Oct 7, 2017 · 7 comments

@ericprud
Contributor

ericprud commented Oct 7, 2017

Related: #71

Apart from value set values with language tags, ShExC, ShExJ and ShExR can be exactly round-tripped, cf. the schema tests. Because language-tagged literals are expressed as JSON-LD object literals and RDF parsers are not responsible for preserving upper/lower case in literal language tags, a ShExC schema:

<vs1> ["flat"@en-GB]

would be translated to ShExR:

[] a sx:Schema ; sx:shapes <http://a.example/vs1> .
<http://a.example/vs1> a sx:NodeConstraint ;
  sx:values ( "flat"@en-GB ) .

An RDF parser is allowed to parse that as en-gb so it would round-trip to ShExC:

<vs1> ["flat"@en-gb]

This doesn't affect semantics of validation but it can be a pain for folks who like to follow ISO language code rules where regions should be upper case, i.e. en-GB. (This has little impact as no one uses ShExR anyways.) Round-tripping between ShExC and ShExJ (as JSON) is unaffected by this.
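To make the case loss concrete, here is a minimal Python sketch of what a tag-lowercasing parser does to the literal above. The `LITERAL` pattern and the `through_lowercasing_parser` function are hypothetical names invented for illustration; they do not correspond to any real RDF library, and the pattern only handles simple double-quoted strings without escapes.

```python
import re

# Hypothetical pattern for a language-tagged literal such as "flat"@en-GB.
# Assumption: simple double-quoted lexical form, no escape sequences.
LITERAL = re.compile(r'^"([^"\\]*)"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)$')

def through_lowercasing_parser(literal: str) -> str:
    """Re-serialize a literal the way a tag-lowercasing RDF parser might."""
    match = LITERAL.match(literal)
    if match is None:
        raise ValueError(f"not a language-tagged literal: {literal!r}")
    lexical, tag = match.groups()
    # The lexical form is preserved; only the language tag loses its case.
    return f'"{lexical}"@{tag.lower()}'

original = '"flat"@en-GB'
round_tripped = through_lowercasing_parser(original)
print(round_tripped)  # "flat"@en-gb
```

The two strings differ only in the tag's case, which is exactly the round-tripping difference described above.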

PROPOSE:

  1. add a note in the spec documenting this as a round-tripping deficiency and stating that if this is a problem for users, future versions of ShExJ will not use JSON-LD object literals for value set values.
  2. adopt ~ address schema LangTag in [shexSpec/shex#71] shexTest#25 which has additional schemas which differ only in language tag case.
@ericprud
Contributor Author

Alternate choice: leave no expectation of case-preservation when converting ShExC<->ShExJ

PROPOSE:

  1. add a note in the spec documenting that no round trips preserve case and stating that if this is a problem for users, future versions of ShExJ will not use JSON-LD object literals for value set values.
  2. reject ~ address schema LangTag in [shexSpec/shex#71] shexTest#25 and add pair-wise mixed-case tests which demonstrate that both "ab"@en and "ab"@EN parse to each other.

@VladimirAlexiev

"ISO": indeed, e.g. ru-Cyrl-RU is preferable to ru-cyrl-ru.
Can you say that converting back to ShExC converts tags using "BCP normalization"?
I posted to the RDF mailing list https://lists.w3.org/Archives/Public/public-rdf-comments/2014Jan/0011.html including an overview of what different frameworks do (e.g. Sesame lowercases, Jena preserves),
and an implementation https://rt.cpan.org/Public/Ticket/Attachment/1267147/670949/lang_normalize.pl.

This won't help round-tripping but at least will enforce determinism.

@ericprud
Contributor Author

ericprud commented Oct 22, 2017

If I understand, the proposal is to:

  • UPPERCASE any two-letter sequence following a sequence of two or more letters.
  • Titlecase any four-letter sequence following a sequence of two or more letters.

e.g. mn-Cyrl-MN. I am motivated to improve RDF conformance with BCP47 rather than continue to propagate lazy short-cuts.
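The two rules as read here can be sketched in a few lines of Python. This is an illustration, not the linked Perl script: `normalize_lang_tag` is a hypothetical name, and lowercasing every other subtag is an added assumption that the two bullets do not state.

```python
def normalize_lang_tag(tag: str) -> str:
    """Apply the two casing rules; lowercase everything else (assumption)."""
    subtags = tag.lower().split("-")
    result = []
    for index, subtag in enumerate(subtags):
        # "Following a sequence of two or more letters": the previous
        # subtag exists and is not a single-character singleton.
        follows_longer = index > 0 and len(subtags[index - 1]) >= 2
        if follows_longer and len(subtag) == 2:
            subtag = subtag.upper()        # region-style subtag, e.g. MN
        elif follows_longer and len(subtag) == 4:
            subtag = subtag.capitalize()   # script-style subtag, e.g. Cyrl
        result.append(subtag)
    return "-".join(result)

print(normalize_lang_tag("mn-cyrl-mn"))  # mn-Cyrl-MN
print(normalize_lang_tag("EN-gb"))       # en-GB
```

Because the function only looks at subtag lengths and positions, it needs no table of known languages, scripts or regions.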

Does this derive from the BCP47 grammar or some other text in the doc?
Diving deeper into BCP47 than I ever wanted, I see a finite list of irregular tags with the comment "most are deprecated". What would be best here, ignore them or reference them from the spec (and thus stick them in every impl)?

Can you make a PR on the spec value constraints section (and maybe value set parsing) to make this concrete? I'd propose @gkellogg and @ericprud as reviewers.

@VladimirAlexiev

Your description is correct (but there's also a dash between the two sequences).
The script quotes verbatim from http://tools.ietf.org/html/bcp47#section-2.1.1.
Irregular tags will be normalized to what is given in the spec. You don't need to specify them separately.

I don't think this normalization has any bearing on validation, since validation must be case-insensitive.
Cheers!
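A small sketch of why normalization has no bearing on validation, assuming value set matching compares the lexical form exactly and the language tag case-insensitively. The tuple representation and the names `literal_matches` and `in_value_set` are hypothetical, chosen only for this illustration.

```python
def literal_matches(candidate, allowed):
    """Compare (lexical form, language tag) pairs."""
    candidate_lexical, candidate_tag = candidate
    allowed_lexical, allowed_tag = allowed
    # Lexical forms must match exactly; tags are ASCII, so lower() is
    # enough for a case-insensitive comparison.
    return (candidate_lexical == allowed_lexical
            and candidate_tag.lower() == allowed_tag.lower())

def in_value_set(candidate, value_set):
    return any(literal_matches(candidate, allowed) for allowed in value_set)

value_set = [("flat", "en-GB")]
print(in_value_set(("flat", "en-gb"), value_set))  # True
print(in_value_set(("Flat", "en-GB"), value_set))  # False
```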

@ericprud
Contributor Author

I wasn't worried about the validation, just exactly how to specify the canonical form. I guess you have something in mind like:

When emitting a ShEx schema, language tags in that schema SHOULD be in the canonical language tag form in order to comply with [[!BCP47]] section @@!.
A language tag is in canonical language tag form if a language tag is split on '-' into a set of sequences and the following rules applied before it is joined again on '-':

  • Each two-letter sequence following a sequence of two or more letters is in uppercase, e.g. ab-CD-EF-ghi
  • Each four-letter sequence following a sequence of two or more letters is in title case. e.g. ab-Cdef-Ghij
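The draft wording defines a predicate rather than a transformation, so it can be checked directly. The sketch below tests only the two bullets as written; `is_canonical_lang_tag` is a hypothetical name, and it treats any sequence of length two or more as "two or more letters" without checking that the characters are actually letters.

```python
def is_canonical_lang_tag(tag: str) -> bool:
    """Check the two draft rules on a tag split on '-'."""
    parts = tag.split("-")
    for previous, current in zip(parts, parts[1:]):
        if len(previous) < 2:
            continue  # sequences after a one-character sequence are unconstrained
        if len(current) == 2 and current != current.upper():
            return False  # two-letter sequence must be uppercase
        if len(current) == 4 and current != current.capitalize():
            return False  # four-letter sequence must be title case
    return True

print(is_canonical_lang_tag("ab-CD-EF-ghi"))  # True
print(is_canonical_lang_tag("ab-Cdef-Ghij"))  # True
print(is_canonical_lang_tag("en-gb"))         # False
```

Note that the draft says nothing about the case of the first sequence or of sequences of other lengths, so this check accepts them in any case.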

Where in BCP47 do the capitalization rules come from? Can we justify the rules above?

@VladimirAlexiev

before it is joined again

This is wrong. This would be ambiguous for e.g. x-whatever. The rules require capitalization in e.g. x-what-Ever and x-what-EV but require nothing in x-whatever or x-what-everything or x-what-eve.

Just refer to sec 2.1.1. IMHO you don't need to restate the rules, just give some examples

@VladimirAlexiev

I think I misread what "joined" means. I still think you don't need to restate the rules, but if you want to do it, please change "set of sequences" to "sequence of strings"

@ericprud ericprud added this to the 2.next milestone Nov 23, 2018