This repository has been archived by the owner on Jan 25, 2022. It is now read-only.

Update references to match current UTS 35 spec #77

Closed
anba opened this issue Jul 19, 2019 · 11 comments · Fixed by #92

@anba
Contributor

anba commented Jul 19, 2019

The current draft spec was written against UTS 35 version 34, but UTS 35 is now at version 35, which contains many changes to Unicode BCP 47 locale identifiers. I'd suggest checking the complete Intl.Locale spec to verify that it still matches what's currently in UTS 35.

For example:

  • Things like "Use the subtag matching unicode_language_subtag" are now ambiguous, because unicode_language_subtag is used not only in unicode_language_id but also in the tlang production.
  • Step 8 in ApplyUnicodeExtensionToTag:

Let newExtension be the canonicalized Unicode BCP 47 U Extension based on attributes and keywords as defined in UTS #35 section 3.6.

The text refers to what was in http://www.unicode.org/reports/tr35/tr35-53/tr35.html#u_Extension, but that's now part of http://unicode.org/reports/tr35/#Canonical_Unicode_Locale_Identifiers (cf. canonical syntax and canonical form in that section).
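
To make the step quoted above concrete, here is a minimal sketch of building a `u` extension from attributes and keywords. It assumes canonical syntax means lowercase subtags, attributes sorted alphabetically, keywords sorted by key, and a keyword value of "true" omitted. The names `Keyword` and `buildUnicodeExtension` are hypothetical and are not taken from the spec text.

```typescript
// Hypothetical record shape for this sketch (not the spec's record type).
interface Keyword {
  key: string;   // two-character key, e.g. "ca"
  value: string; // type subtags joined by "-", or "" when absent
}

// Builds a `u` extension string from already-parsed parts.
// Assumption: canonical syntax = lowercase, sorted attributes,
// keywords sorted by key, and a "true" value dropped.
function buildUnicodeExtension(attributes: string[], keywords: Keyword[]): string {
  const parts: string[] = ["u"];

  // Attributes come first, lowercased and sorted alphabetically.
  const sortedAttributes = attributes.map((a) => a.toLowerCase()).sort();
  for (const attribute of sortedAttributes) {
    parts.push(attribute);
  }

  // Keywords follow, lowercased and sorted by key.
  const sortedKeywords = keywords
    .map((k) => ({ key: k.key.toLowerCase(), value: k.value.toLowerCase() }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  for (const keyword of sortedKeywords) {
    parts.push(keyword.key);
    // An empty value or a value of "true" contributes no type subtag.
    if (keyword.value !== "" && keyword.value !== "true") {
      parts.push(keyword.value);
    }
  }

  return parts.join("-");
}
```

For instance, the attribute `foo` with keywords `nu-latn`, `ca-gregory` and `kn-true` would be emitted as `u-foo-ca-gregory-kn-nu-latn` under these assumptions.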

@zbraniecki zbraniecki added this to the v1 milestone Oct 24, 2019
@jswalden
Contributor

We did just discuss what "editorial" means the other day, so I should remember quite clearly what it was...but even without remembering the exact meaning, this does not seem editorial.

Perhaps most notably given the call yesterday, newer TR35 (or at least what we referenced in it) removed (or moved) a requirement to do alias/preferred replacement in Unicode locale extensions when canonicalizing. That removal was what led me to no longer have concerns about advancing Intl.Locale. (Although on a closer look, I see that under that understanding Intl.Locale will still deduplicate attributes and keywords, where CanonicalizeLanguageTag will not, so there is a distinction -- but a simpler one to explain than a replacements-based distinction, and one that probably should be fixed by changing CanonicalizeLanguageTag to trim duplicates.)

But if #43 is correct, we specifically chose to do replacements. So if TR35 updating changed that, we would also need to change to do replacements again. And that's a significant change in how Intl.Locale operates. And sadly, it would end up resurrecting my concern about the differentiation between Intl.Locale and CanonicalizeLanguageTag. 😧

Given that this proposal modifies CanonicalizeLocaleList, I think we need also to modify CanonicalizeLanguageTag too here. At least, if we're going to return to the position that was seemingly deliberately adopted previously. If I can figure out how to build a modified proposal myself (and verify it'll merge into the main spec correctly, not just render standalone correctly), I'll see if I can figure out how to do that.
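
The deduplication difference mentioned above can be illustrated with a small sketch. It assumes that two-character subtags are keys, that subtags before the first key are attributes, and that when duplicates are trimmed the first occurrence wins. The names `ParsedExtension` and `parseUnicodeExtension` are hypothetical; this is not the algorithm text of either operation.

```typescript
interface Keyword {
  key: string;
  value: string;
}

interface ParsedExtension {
  attributes: string[];
  keywords: Keyword[];
}

// Parses the subtags that follow the "u" singleton,
// e.g. "ca-gregory-nu-latn-ca-buddhist".
// With dedupe = true, repeated attributes and keys are dropped
// (assumption: the first occurrence wins).
function parseUnicodeExtension(subtags: string, dedupe: boolean): ParsedExtension {
  const attributes: string[] = [];
  const keywords: Keyword[] = [];
  let current: Keyword | null = null; // keyword whose value is being collected
  let inKeywords = false;

  for (const subtag of subtags.toLowerCase().split("-")) {
    if (subtag === "") continue; // ignore empty input
    if (subtag.length === 2) {
      // A two-character subtag starts a new keyword.
      inKeywords = true;
      const duplicate = dedupe && keywords.some((k) => k.key === subtag);
      if (duplicate) {
        current = null; // skip this keyword and its value subtags
      } else {
        current = { key: subtag, value: "" };
        keywords.push(current);
      }
    } else if (!inKeywords) {
      // Subtags before the first key are attributes.
      if (!dedupe || attributes.indexOf(subtag) === -1) {
        attributes.push(subtag);
      }
    } else if (current !== null) {
      // Longer subtags after a key extend that key's value.
      current.value = current.value === "" ? subtag : current.value + "-" + subtag;
    }
  }

  return { attributes, keywords };
}
```

Parsing `ca-gregory-nu-latn-ca-buddhist` with `dedupe` set keeps only `ca-gregory` and `nu-latn`, while parsing without it keeps all three keywords, which is the kind of divergence described in the comment.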

@jswalden
Contributor

So, looking closely at this, it seems like the only way to invoke replacements any more is to invoke TR35's new "canonical form" setup. We could hand-roll our own version of this to perform replacements, of course, but that seems best avoided.

So...probably we want to invoke canonical form in Intl.Locale.

@jswalden
Contributor

Note that "BCP 47 Language Tag to Unicode BCP 47 Locale Identifier", the operation currently performed by CanonicalizeLanguageTag, claims that "The result [of this operation] is a Unicode BCP 47 locale identifier, in canonical form." That...seems false? That algorithm first applies canonical syntax, then does language/region replacement, but canonical form also does variant replacement, replacement within 'u' and 't' extensions, and replacement in 'sd' and 'rg' keys.

@anba Am I dumb, or is this just a TR35 bug? (Which would raise the question of whether this algorithm in TR35 should change, and if it did change -- and ideally deduplicated -- we'd only have duplicates to deal with at all in this spec.)

@anba
Contributor Author

anba commented Nov 19, 2019

@anba Am I dumb, or is this just a TR35 bug?

Yes, this is a TR35 bug. When TR35 was changed to differentiate between "canonical syntax" and "canonical form" (in version 36), that sentence wasn't updated to use "canonical syntax".
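
The difference between the two notions can be sketched as follows, under heavy simplification. The sketch handles only `language[-script][-region]` tags, treats canonical syntax as a casing fix, and shows only language and region alias replacement; the variant, extension, and `sd`/`rg` replacements mentioned earlier in the thread are omitted. The two alias tables are tiny illustrative excerpts, and all function names are hypothetical.

```typescript
// Tiny illustrative alias tables; a real implementation would use CLDR data.
const LANGUAGE_ALIASES: { [subtag: string]: string } = { iw: "he", in: "id" };
const REGION_ALIASES: { [subtag: string]: string } = { BU: "MM", DD: "DE" };

// Casing-only step (sketch of canonical syntax for language[-script][-region]):
// language lowercase, script title case, region uppercase.
function toCanonicalSyntax(tag: string): string {
  return tag
    .split("-")
    .map((subtag, index) => {
      if (index === 0) return subtag.toLowerCase();
      if (/^[A-Za-z]{4}$/.test(subtag)) {
        return subtag.charAt(0).toUpperCase() + subtag.slice(1).toLowerCase();
      }
      if (/^[A-Za-z]{2}$/.test(subtag)) return subtag.toUpperCase();
      return subtag.toLowerCase();
    })
    .join("-");
}

// Returns the alias replacement if the table has one, else the subtag itself.
function replaceAlias(table: { [subtag: string]: string }, subtag: string): string {
  const replacement = Object.prototype.hasOwnProperty.call(table, subtag)
    ? table[subtag]
    : undefined;
  return replacement === undefined ? subtag : replacement;
}

// Replacement step: applies language and region aliases on top of the
// casing fix. Full canonical form would do more than this.
function applyLanguageAndRegionAliases(tag: string): string {
  return toCanonicalSyntax(tag)
    .split("-")
    .map((subtag, index) =>
      index === 0
        ? replaceAlias(LANGUAGE_ALIASES, subtag)
        : replaceAlias(REGION_ALIASES, subtag)
    )
    .join("-");
}
```

With these tables, `IW-hebr-il` becomes `iw-Hebr-IL` after the casing step alone and `he-Hebr-IL` once aliases are replaced, so a result can be in canonical syntax without being in canonical form.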

@jswalden
Contributor

I ended up making us invoke canonical form in CanonicalizeLanguageTag in #83. It happens that every path through Intl.Locale is a result of that function (see #83 (comment) for the details laid out), so just changing that function will get everyone on the same page about canonical form everywhere.

@zbraniecki
Member

@jswalden - does it mean that #83 will solve #82 and this issue?

@jswalden
Contributor

@zbraniecki #83 will solve one aspect of this issue, namely responding to the canonical form/syntax split introduced in UTS35 v35. I can't say with confidence that that is definitely the only problem this issue covers -- merely the most serious one from my point of view. @anba could say more, more quickly, than I could here, I think.

@zbraniecki
Member

@anba - anything left here from your POV since #83 has landed?

@zbraniecki
Member

This indeed was not an editorial issue. I'll wait for @anba to verify if there's anything left now and if what's left is editorial :)

@anba
Contributor Author

anba commented Jan 29, 2020

For example see step 5 in Intl.Locale.prototype.language:

  5. Return the substring of locale corresponding to the unicode_language_subtag production.

The locale id "en-t-en" contains two unicode_language_subtag productions, which makes step 5 ambiguous.
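
The ambiguity can be shown with a short sketch that lists every place where a language subtag can occur in a locale identifier. It assumes unicode_language_subtag is 2 to 3 or 5 to 8 ASCII letters, and it only looks at the first subtag and at the subtag right after a `t` singleton; it is not a full parser (for example, it does not account for private-use sequences). The names `SubtagMatch` and `findLanguageSubtags` are hypothetical.

```typescript
// Assumed shape of unicode_language_subtag: 2-3 or 5-8 ASCII letters.
const LANGUAGE_SUBTAG = /^(?:[a-z]{2,3}|[a-z]{5,8})$/i;

interface SubtagMatch {
  index: number; // position of the subtag within the "-"-separated list
  subtag: string;
  source: "unicode_language_id" | "tlang";
}

// Lists every subtag that can be read as a language subtag, together with
// the production it belongs to.
function findLanguageSubtags(locale: string): SubtagMatch[] {
  const subtags = locale.split("-");
  const matches: SubtagMatch[] = [];

  subtags.forEach((subtag, index) => {
    const previous = index > 0 ? subtags[index - 1] : undefined;
    if (index === 0 && LANGUAGE_SUBTAG.test(subtag)) {
      // The leading subtag belongs to unicode_language_id.
      matches.push({ index, subtag, source: "unicode_language_id" });
    } else if (
      previous !== undefined &&
      previous.toLowerCase() === "t" &&
      LANGUAGE_SUBTAG.test(subtag)
    ) {
      // A language subtag directly after the "t" singleton starts tlang.
      matches.push({ index, subtag, source: "tlang" });
    }
  });

  return matches;
}
```

For `en-t-en` this returns two matches, one per production, whereas `en-US` yields one. Filtering on `source === "unicode_language_id"` is one way to express the intended, unambiguous reading.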

@zbraniecki
Member

Thank you! That's a good catch. I clarified that in the PR. Lmk if there's anything else you noticed.
