Skip to content
This repository has been archived by the owner on Jan 25, 2022. It is now read-only.

Change the CanonicalizeLanguageTag operation so that it removes duplicate attributes/keywords in Unicode locale extension sequences just as Intl.Locale does #83

Merged
merged 3 commits into from
Jan 24, 2020

Conversation

jswalden
Copy link
Contributor

This fixes #82.

@anba
Copy link
Contributor

anba commented Nov 19, 2019

LGTM!


It looks like V8 is already removing duplicate keywords, so there shouldn't be any web-compat problems:

d8> Intl.getCanonicalLocales("de-u-ca-gregory-ca-islamicc")
["de-u-ca-gregory"]

And duplicate attributes don't seem to be supported in V8, which is a V8 bug, but also means we don't need to worry about web-compat issues.

d8> Intl.getCanonicalLocales("de-u-attr-attr")
(d8):1: RangeError: Invalid language tag: de-u-attr-attr

…removes duplicate attributes/keywords in Unicode locale extension sequences just as Intl.Locale does. Fixes tc39#82.
@zbraniecki
Copy link
Member

This doesn't cover several steps in https://www.unicode.org/reports/tr35/tr35.html#Canonical_Unicode_Locale_Identifiers

  • Sorting
  • Removal of true

is that ok?

@jswalden
Copy link
Contributor Author

This is one step of two. This only deals with duplications. Sorting and removal of true I was going to do in a second PR. I'll get that second one going right pronto.

@jswalden
Copy link
Contributor Author

jswalden commented Dec 13, 2019

Okay, I added a second commit here that additionally makes CanonicalizeLanguageTag transform its input to canonical form. Moreover, for greater clarity that Intl.Locale always canonicalizes consistent with CanonicalizeLanguageTag, I made it so that if you follow through the algorithm, every possible result-string shoved into the returned locale.[[Locale]] has just passed through CanonicalizeLanguageTag:

  • ApplyOptionsToTag always returns something passed through CanonicalizeLanguageTag, either step 9 if the options-object doesn't provide language/script/region overrides, or step 13 otherwise;
  • the helpfully-renamed InsertUnicodeExtensionAndCanonicalize now obviously returns a CanonicalizeLanguageTag result;
  • ApplyUnicodeExtensionToTag is passed the result of ApplyOptionsToTag, so its tag argument must be a CanonicalizeLanguageTag result;
  • ApplyUnicodeExtensionToTag returns either tag (a CanonicalizeLanguageTag result) unaltered, or InsertUnicodeExtensionAndCanonicalize(tag) (also a CanonicalizeLanguageTag result);
  • so therefore locale.[[Locale]] in the return of Intl.Locale is a CanonicalizeLanguageTag result.

Copy link
Member

@zbraniecki zbraniecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@zbraniecki
Copy link
Member

@srl295 - can you verify that this change is in line with ICU thinking?

@sffc
Copy link

sffc commented Jan 8, 2020

This PR is blocking stage advancement, and it looks like the PR is waiting for review from a subject matter expert. Is that correct?

CC @markusicu @FrankYFTang

Copy link
Member

@littledan littledan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes sense to me and seems like it should be an improvement vs previous behavior. I wonder if we should consider "upstreaming" this algorithm into the Unicode document.

@srl295
Copy link
Member

srl295 commented Jan 9, 2020

let me see if i can get a review

@yumaoka
Copy link

yumaoka commented Jan 9, 2020

I think the abstract operation name is CanonicalizeLanguageTag is not good if the operation does something beyond BCP 47. Language Tag is a term used in BCP 47 and canonicalization is specified by https://tools.ietf.org/html/bcp47#section-4.5

In CLDR, Unicode BCP 47 locale identifier is defined separately from BCP 47 language tag. And such canonicalization is in a scope of LDML/u-extension, not BCP 47 language tag specification.

I think removal of duplicated key is good and should be clarified in LDML, but this operation is for Unicode locale identifier. If the abstract operation name is CanonicalizeUnicodeLocaleId then, it makes perfect sense.

@littledan
Copy link
Member

@yumaoka Thanks for the review. It sounds like you're saying that there are editorial cleanups that we should do, but that the semantics are appropriate. Is that accurate?

@srl295
Copy link
Member

srl295 commented Jan 9, 2020

@yumaoka Thanks for the review. It sounds like you're saying that there are editorial cleanups that we should do, but that the semantics are appropriate. Is that accurate?

rename the operation to make it clear that it is not generic to BCP47 but specific to Unicode locale IDs (which are a strict subset of BCP47)?

@jswalden
Copy link
Contributor Author

Made the change to rename CanonicalizeLanguageTag to CanonicalizeUnicodeLocaleId. At first I thought maybe this ought happen after this proposal merges, so as not to bloat the "Modified algorithms" section of this proposal with single-word deletion/replacements, but then I looked and there really aren't that many, so it's not terrible to just do it here. Did so, in a further separate commit for readability.

@zbraniecki
Copy link
Member

Thank you!

@zbraniecki zbraniecki merged commit 3cf584f into tc39:master Jan 24, 2020
@jswalden
Copy link
Contributor Author

...wait, that's not it. I did a git and forgot to git add the changes, so that third commit doesn't actually make all the name-changes. :-( Sec, I'll fix.

@jswalden
Copy link
Contributor Author

Well, in theory I can fix. I have force-pushed to the same branch in my fork, but it does not appear to be showing up here. I'll open a new PR for it.

@zbraniecki
Copy link
Member

yeah, since I merged the PR, we need a new PR. Sorry for rushing the merge!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
7 participants