This repository has been archived by the owner on Jan 25, 2022. It is now read-only.

CanonicalizeLanguageTag should remove duplicate attributes/keywords in a Unicode extension, consistent with Intl.Locale #82

Closed
jswalden opened this issue Nov 18, 2019 · 6 comments · Fixed by #83

Comments

@jswalden
Contributor

jswalden commented Nov 18, 2019

Canonicalization performed by CanonicalizeLanguageTag and that performed by Intl.Locale differ in two unintended ways.

  1. CanonicalizeLanguageTag doesn't remove duplicated attributes or keywords, e.g. "en-u-attr-attr" and "en-u-co-dict-co-phonebk" are both considered to be canonical. Intl.Locale does (and almost necessarily must, to integrate keywords in the input tag with keywords specified through the options bag).
  2. CanonicalizeLanguageTag doesn't replace aliased subtags in Unicode locale extension sequences with their preferred forms, e.g. "en-u-ms-imperial" is canonical according to CanonicalizeLanguageTag, but Intl.Locale will transform it to "en-u-ms-uksystem". (This latter behavior doesn't exist in the current spec because of changes to TR35 upstream. See "Update references to match current UTS 35 spec", #77, for dealing with that change.)
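
To see the two paths side by side, something like the following can be run in any engine that ships Intl.Locale. It is only a rough probe: `compareCanonicalization` is a made-up helper name, `Intl.getCanonicalLocales` is used as the observable entry point to CanonicalizeLanguageTag, and the printed results depend on the engine and which spec revision and CLDR data it implements, so none are asserted here.

```javascript
// Hypothetical helper: canonicalize one tag through both paths.
// Intl.getCanonicalLocales goes through CanonicalizeLanguageTag;
// new Intl.Locale(tag).toString() goes through the Intl.Locale constructor.
function compareCanonicalization(tag) {
  return {
    viaGetCanonicalLocales: Intl.getCanonicalLocales(tag)[0],
    viaLocale: new Intl.Locale(tag).toString(),
  };
}

// The three example tags from the list above.
for (const tag of ["en-u-attr-attr", "en-u-co-dict-co-phonebk", "en-u-ms-imperial"]) {
  console.log(tag, compareCanonicalization(tag));
}
```

If the two fields differ for a tag, that engine still has the inconsistency described in this issue.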

On the call last week I had thought the latter TR35 upstream change was something we had accepted, and I didn't understand that the first problem still remained, so I was fine with this proposal moving forward. But the latter change was unintentional (#77 will deal with it), and the first problem is real. We need to fix both of these to move this proposal forward, IMO. :-(

I have a patch that augments this proposal with changes to the existing CanonicalizeLanguageTag algorithm such that duplicate attributes and keywords are removed. I am not sure that this is the most elegant way to implement deduplication. But it gets the job done, and of course implementations will choose whatever approach works best for them in reality. I'll create a PR once I've gotten this issue filed and have an issue number to refer to.
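
For anyone who wants to picture the deduplication outside of spec language, here is a minimal sketch. It is not the patch's algorithm: `dedupeUnicodeExtension` is a hypothetical name, the input is assumed to be a structurally valid tag, and the function only drops repeated attributes and repeated keys (first occurrence wins). It does no validation, sorting, or alias replacement.

```javascript
// Hypothetical helper: remove duplicate attributes and keywords from the
// Unicode (-u-) extension of a structurally valid language tag.
function dedupeUnicodeExtension(tag) {
  const subtags = tag.split("-");

  // Find the "u" singleton; anything after "x" is private use and is skipped.
  let start = -1;
  for (let i = 1; i < subtags.length; i++) {
    const lower = subtags[i].toLowerCase();
    if (lower === "x") break;
    if (lower === "u") { start = i; break; }
  }
  if (start === -1) return tag; // no Unicode extension, nothing to do

  // The extension runs until the next singleton (one-character subtag).
  let end = start + 1;
  while (end < subtags.length && subtags[end].length > 1) end++;

  const seenAttributes = new Set();
  const seenKeys = new Set();
  const kept = [];
  let i = start + 1;

  // Attributes: subtags longer than two characters before the first key.
  while (i < end && subtags[i].length > 2) {
    const attr = subtags[i].toLowerCase();
    if (!seenAttributes.has(attr)) {
      seenAttributes.add(attr);
      kept.push(subtags[i]);
    }
    i++;
  }

  // Keywords: a two-character key followed by zero or more type subtags.
  while (i < end) {
    const key = subtags[i].toLowerCase();
    let next = i + 1;
    while (next < end && subtags[next].length > 2) next++;
    if (!seenKeys.has(key)) {
      seenKeys.add(key);
      kept.push(...subtags.slice(i, next));
    }
    i = next;
  }

  return [...subtags.slice(0, start + 1), ...kept, ...subtags.slice(end)].join("-");
}

console.log(dedupeUnicodeExtension("en-u-attr-attr"));          // en-u-attr
console.log(dedupeUnicodeExtension("en-u-co-dict-co-phonebk")); // en-u-co-dict
```

A real implementation would fold this into the same pass that orders attributes and keywords, which is one reason implementations may well pick a different approach.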

jswalden added a commit to jswalden/proposal-intl-locale that referenced this issue Nov 18, 2019
…removes duplicate attributes/keywords in Unicode locale extension sequences just as Intl.Locale does. Fixes tc39#82.
@jswalden
Contributor Author

I can't find anything in TR35 or RFC 6067 that describes the removal of duplicate attributes/keywords. So it doesn't look like there's anything in TR35 we could invoke to perform this operation, and it has to be hand-rolled -- as this patch does for CanonicalizeLanguageTag, and as Intl.Locale in this specification already does. :-|

@jswalden
Contributor Author

@anba You should probably take a look at and comment on this, seeing as you understand this stuff better than everyone else here. :-)

@aphillips

@jswalden RFC 6067 says something about this:

Only the first occurrence of an attribute or key conveys meaning in a
language tag. When interpreting tags containing the Unicode locale
extension, duplicate attributes or keywords are ignored in the
following way: ignore any attribute that has already appeared in the
tag and ignore any keyword whose key has already occurred in the tag.

This was intended to allow duplicates to be removed without effect (although I don't think we went on to encourage/permit it).

@jswalden
Contributor Author

@aphillips Yeah, the semantics of a tag with duplicates in it are clear enough even if they aren't removed. Just seems to me if we remove them one place -- and we don't really have a choice about it in Intl.Locale unless we wanted to just insert out-of-band options stringwise at the start of any existing Unicode locale extension, but that seems batty -- we should be consistent and do it everywhere.
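
A small illustration of why Intl.Locale can't avoid this: when the options bag names a keyword that the tag already carries, the result has to contain that key exactly once. The variable name `merged` is just for the example.

```javascript
// The tag says co-dict, the options bag says collation "phonebk".
// The option takes precedence and the key appears only once in the result.
const merged = new Intl.Locale("en-u-co-dict", { collation: "phonebk" });

console.log(merged.collation);  // phonebk
console.log(merged.toString()); // en-u-co-phonebk
```

The stringwise alternative would produce something like "en-u-co-phonebk-co-dict", which means the same thing under the first-occurrence rule but carries a dead keyword around.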

@aphillips

The canonical form is in a specific order. I would just go ahead and remove duplicates; extension tags are already long and unwieldy even without useless cruft in them.

jswalden added a commit to jswalden/proposal-intl-locale that referenced this issue Dec 12, 2019
…removes duplicate attributes/keywords in Unicode locale extension sequences just as Intl.Locale does. Fixes tc39#82.
@zbraniecki zbraniecki added this to the v1 milestone Dec 17, 2019
@zbraniecki
Member

#83 fixes this.
