CollationsOfLocale() order #33

Closed
anba opened this issue May 14, 2021 · 27 comments
Labels
help wanted Extra attention is needed

Comments

@anba
Contributor

anba commented May 14, 2021

From https://tc39.es/proposal-intl-locale-info/#sec-collations-of-locale:

Let list be a List of 1 or more collation identifiers, which must be String values conforming to the type sequence from UTS 35 Unicode Locale Identifier, section 3.2, sorted in descending preference of those in common use in the locale for string comparison.

But from https://github.com/unicode-org/icu/blob/4b6e6e1bc9ef90001b4eb169e84ed33d7840b225/icu4c/source/i18n/unicode/ucol.h#L871-L891:

Given a key and a locale, returns an array of string values in a preferred
order that would make a difference. These are all and only those values where
the open (creation) of the service with the locale formed from the input locale
plus input keyword and that value has different behavior than creation with the
input locale alone.

I don't think "preferred order that would make a difference" is equivalent to "sorted in descending preference of those in common use".

For example compare:

js> new Intl.Locale("en").collations
["standard", "emoji", "eor", "search"]
js> new Intl.Locale("de").collations 
["standard", "phonebk", "search", "emoji", "eor"]

I'm not sure "search" is less common in English than in German, is it? And that "eor" is more common than "search" in English.

FWIW ICU returns the order simply based on how the collations are listed in the collation data files. For example for German, the order is constructed as follows:

  1. Take the entries from https://github.com/unicode-org/icu/blob/main/icu4c/source/data/coll/de.txt.
    • Order of appearance: "phonebook" and "search".
    • It appears that the entries are simply ordered alphabetically.
  2. Append the entries from the root locale https://github.com/unicode-org/icu/blob/main/icu4c/source/data/coll/root.txt
    • Order of appearance: "emoji", "eor", "search", "standard"
    • It appears that the entries are simply ordered alphabetically.
  3. Remove duplicates to get the list « "phonebook", "search", "emoji", "eor", "standard" ».
  4. Move the default collation to the top. For German that is the default inherited from the root locale, "standard".
  5. The final result is then « "standard", "phonebook", "search", "emoji", "eor" ».

So the collation order returned by ICU is more like:

  • First entry is the default collation.
  • The remaining entries are sorted alphabetically by each locale in the locale inheritance chain.
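
The construction described above can be sketched roughly as follows. This only illustrates the steps and is not ICU's code: the function name `icuStyleOrder` is hypothetical, and the input lists are hard-coded from the German example rather than read from the data files.

```javascript
// Hypothetical helper illustrating the steps above; not an ICU API.
function icuStyleOrder(localeEntries, rootEntries, defaultCollation) {
  // Steps 1-3: locale entries first, then root entries, dropping duplicates.
  // A Set keeps the first occurrence of each value in insertion order.
  const merged = [...new Set([...localeEntries, ...rootEntries])];
  // Step 4: move the default collation to the front.
  const rest = merged.filter((name) => name !== defaultCollation);
  return [defaultCollation, ...rest];
}

const german = icuStyleOrder(
  ["phonebook", "search"],
  ["emoji", "eor", "search", "standard"],
  "standard"
);
console.log(german);
```

With these inputs the sketch reproduces the final German list from step 5.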
@anba
Contributor Author

anba commented Jun 8, 2021

With "standard" and "search" being removed from the result list, I think the most sensible solution is to simply sort the returned list alphabetically.

So the example from above will then return:

js> new Intl.Locale("en").collations
["emoji", "eor"]
js> new Intl.Locale("de").collations 
["emoji", "eor", "phonebk"]
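
As a rough sketch of that suggestion, assuming the input arrays are the results quoted earlier in this thread (the helper name `sortedCollations` is made up for illustration):

```javascript
// Hypothetical helper; not part of any spec text or engine.
function sortedCollations(values) {
  const excluded = new Set(["standard", "search"]);
  return values
    .filter((v) => !excluded.has(v))
    // Explicit code-unit comparison; for lowercase ASCII identifiers this
    // is plain alphabetical order.
    .sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
}

const enSorted = sortedCollations(["standard", "emoji", "eor", "search"]);
const deSorted = sortedCollations(["standard", "phonebk", "search", "emoji", "eor"]);
console.log(enSorted);
console.log(deSorted);
```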

@FrankYFTang
Collaborator

Or should we just take out the text about the order and allow any order in the return value, instead of explicitly stating it?

FrankYFTang added the "help wanted" label Aug 13, 2021
@anba
Contributor Author

anba commented Aug 30, 2021

I would like to avoid leaving the sort order unspecified, but if we can't agree on a specific order, we should explicitly mention that the sort order is undefined. Are there any concerns about sorting the list in alphabetical order?

@FrankYFTang
Collaborator

Sorting in any order would require computing power and may increase latency. Unless there is a strong reason to believe that returning the values in some specific order is needed for the common use case, I would rather we leave it random (or unspecified) so we can avoid performing an unnecessary sort. If the caller needs the list to be sorted, why don't we let the caller sort it in those cases instead?

@anba
Contributor Author

anba commented Aug 31, 2021

Sorting the list avoids issues like tc39/ecma402#578 and also makes it consistent with TimeZonesOfLocale() and the "Intl Enumeration" proposal. Also, in the worst case there are at most eight values to sort, so the sort operation has negligible performance cost.


More detail on the number of collation values:

Here's a list to compare how many locales use different collations. The common case is two collation values, which can be sorted trivially.

| Number of collations | Locales count | Percent of total locales count |
|---|---|---|
| 0 | 0 | - |
| 1 | 0 | - |
| 2 | 116 | 84% |
| 3 | 13 | 9.4% |
| 4 | 1 | 0.7% |
| 5 | 0 | - |
| 6 | 0 | - |
| 7 | 0 | - |
| 8 | 8 | 5.8% |

The values were computed through:

#include <algorithm>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string_view>

#include "unicode/ucol.h"
#include "unicode/uenum.h"

#define REPORT_IF_FAILURE(status) \
  if (U_FAILURE(status)) { \
    fprintf(stderr, "failure"); return 1; \
  }

int main() {
  constexpr size_t maxCount = 8;

  size_t counts[maxCount + 1] = {};

  for (int32_t i = 0, c = ucol_countAvailable(); i < c; ++i) {
    const char* locale = ucol_getAvailable(i);

    UErrorCode status = U_ZERO_ERROR;
    UEnumeration* values = ucol_getKeywordValuesForLocale("co", locale, true, &status);
    REPORT_IF_FAILURE(status);

    size_t count = 0;
    while (true) {
      UErrorCode status = U_ZERO_ERROR;
      const char* value = uenum_next(values, nullptr, &status);
      REPORT_IF_FAILURE(status);
      if (value == nullptr) {
        break;
      }

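      // "standard" and "search" are excluded from CollationsOfLocale(), so
      // don't count them. Comparing a const char* with a std::string_view
      // compares the string contents, not the pointers.
      constexpr std::string_view search = "search";
      constexpr std::string_view standard = "standard";
      if (value == search || value == standard) {
        continue;
      }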

      count++;
    }
    uenum_close(values);

    assert(count <= maxCount);

    counts[count]++;
  }

  for (size_t i = 0; i < maxCount + 1; ++i) {
    printf("count[%zu]: %zu\n", i, counts[i]);
  }
}

sffc added a commit to tc39/ecma402 that referenced this issue Sep 9, 2021
# 2021-09-09 ECMA-402 Meeting

## Logistics

### Attendees

- Shane Carr - Google i18n (SFC), Co-Moderator
- Corey Roy - Salesforce (CJR)
- Romulo Cintra - Igalia (RCA), MessageFormat Working Group Liaison
- Thomas Steiner - Google (TOM)
- Frank Yung-Fong Tang - Google i18n, V8 (FYT)
- Long Ho - (LHO)
- Zibi Braniecki - Mozilla (ZB)
- Eemeli Aro - Mozilla (EAO)
- Greg Tatum - Mozilla (GPT)
- Yusuke Suzuki - Apple (YSZ)
- Louis-Aimé de Fouquières - Invited Expert (LAF)
- Richard Gibson - OpenJS Foundation (RGN)
- Myles C. Maxfield - Apple (MCM)

### Standing items

- [Discussion Board](https://github.com/tc39/ecma402/projects/2)
- [Status Wiki](https://github.com/tc39/ecma402/wiki/Proposal-and-PR-Progress-Tracking) -- please update!
- [Abbreviations](https://github.com/tc39/notes/blob/master/delegates.txt)
- [MDN Tracking](https://github.com/tc39/ecma402-mdn)
- [Meeting Calendar](https://calendar.google.com/calendar/embed?src=unicode.org_nubvqveeeol570uuu7kri513vc%40group.calendar.google.com)
- [Matrix](https://matrix.to/#/#tc39-ecma402:matrix.org)

## Status Updates

### Editor's Update

RGN: No updates.

### MessageFormat Working Group

RCA: We are working on a middle-ground data model that I hope will unblock the situation.  EAO is focused on it, with Stas, Mihai, etc.  EAO also put together an initial spec proposal.

EAO: I put together a spec outline, not a specific proposal.  I think we will be able to merge it later this week.

### Proposal Status Changes

https://github.com/tc39/ecma402/wiki/Proposal-and-PR-Progress-Tracking

FYT: Some more Test262 coverage is done.  But we still need help.

RCA: I updated browser compat for locale info, documentation for hour cycle, etc.

FYT: Do we have an instruction guide about how to update MDN?

RCA: The process is moving quickly.  It will be easier, though: you can just edit a Markdown file.

## Pull Requests

### Add changes to Annex A Implementation Dependent Behaviour

tc39/proposal-intl-locale-info#43

FYT: We added some changes to Appendix A.  Does this look good?  Do we have consensus to report this to TC39?

SFC: +1

RGN: +1

LAF: +1

#### Conclusion

Approved

### Change weekInfo to express non-continouse [sic] weekend

tc39/proposal-intl-locale-info#44

FYT: Some regions have a non-contiguous weekend.  This PR changes the data model to reflect that.

LAF: I wonder how this should be understood for all countries. In certain countries, the two "out of business" days may be not contiguous.  Should we call it business day and non business day?  Because "weekend" might not be the correct terminology.

SFC: Is there precedent in CLDR for using "business day" instead of "weekend"?

EAO: A quick Google search suggests that Brunei calls these days "weekend".

SFC: LAF, please open an issue on the repository to discuss the option name change.

SFC: Do we have consensus on the change?

LAF: +1

SFC: +1

#### Conclusion

Approved

## Proposals and Discussion Topics

### CollationsOfLocale() order

tc39/proposal-intl-locale-info#33

SFC: I feel that lists should define their sort order.  This is similar to the plural rule strings discussion from a couple of months ago.

ZB: I represent the other side.  I think developers should not be depending on the order.

LAF: (inaudible)

RGN: There is guaranteed to be an observable order.  The question is whether that order is enforced across implementations, and if so, what should that order be?

FYT: Could we return a Set?

RGN: Sets also have observable order.
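
A minimal illustration of RGN's point, using arbitrary collation names as sample data: a Set iterates in insertion order, so returning a Set would still expose some order to callers.

```javascript
const collationSet = new Set(["phonebk", "emoji", "eor"]);
// Iteration follows insertion order, so the order is still observable.
const observed = [...collationSet];
console.log(observed);
```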

SFC: I propose we bring the meta question to TC39-TG1 as a change to the style guide.

LAF: +1 about order issue

FYT: OK

#### Conclusion

SFC to make a presentation to TC39-TG1 to establish a best practice in the style guide.

### Define if "ca" Unicode extensions have an effect on Intl.Locale.prototype.weekInfo

tc39/proposal-intl-locale-info#30

LAF: My opinion about ISO-8601 is that it is not connected to any locale.  Something like Gregorian is connected to a locale, and could carry week info.  But ISO-8601 is international.

SFC: I think we should consult with CLDR.

FYT: This is about the first day of the week and minimal days in the week, not the weekend days.  I personally believe that we shouldn't limit the extension; for example, a subdivision could have legislation to change this info.

LAF: In my opinion, the impact of saying whether Sunday or Monday is the first day of week, or on the minimal days, is to make a "week calendar": a calendar that lays out days in a week, dated by week number.  I can imagine that some countries would like to distribute their own calendar, but I feel that there is a need among people to have the same week numbers.  I don't know for sure where the correct place for this concept is.

ZB: This is inspired by the mozIntl API.  The reason I needed it was for a general calendrical widget, the HTML picker.  I think date pickers in general need this, not just calendar layout.  I think it is a high-importance API.

SFC: I think the calendar subtag, or other subtags like the subdivision, should be taken into account.

FYT: I think we should take the whole locale to influence the result.

FYT: Do we need to make any changes to the proposal, and if so, what changes are needed?

RCA: No strong opinion on that, but I am concerned about a possible conflict with Temporal.

#### Conclusion

SFC, FYT, and LAF agree that the whole locale (including extension subtags) should influence the weekInfo.  FYT to share these notes with Anba and wait for follow-up.

### JS Input Masking 🎭

- Presenter: TOM
- Slides: https://goo.gle/ecma-402-js-input-masking
- Explainer: https://github.com/tomayac/js-input-masking/blob/main/README.md
- JS polyfill: https://github.com/tomayac/js-input-masking-polyfill

FYT: Thanks for the discussion. (1) Some parts of what you proposed… if the formats are the same across different regions, it shouldn't be part of Intl.  For example, if the ISBN format is the same across regions, it shouldn't be in Intl.  (2) Is the name "input masking" correct?  (3) A new item to consider is the postcode.  That differs a lot around the world.  The US has 5-4, India has 6 digits, Canada has special alphabetic rules.  (4) It would be good to validate whether a string is a valid input.  For example, maybe 13 digits is a valid ISBN, but not 14 digits.  (5) A Googler on our team built libphonenumber, and it ended up being their full-time job for a while.

TOM: Postcodes are interesting.  For validation, that's interesting and useful.  Thanks for confirming that it is useful.  I think it would make sense to have it in the proposal.

EAO: (1) Having built a library like this in the past, you start facing the issue of how to report errors on the input.  So it becomes error reporting, but you need to do a best effort at the formatting while also reporting errors in a side channel. (2) Formatting while the string is being edited is just really hard; you should just wait until the field loses focus.

TOM: I agree that live updating the field is challenging.  What you said about error reporting is interesting.  Verification needs a lot of thought.  I think it's something most developers probably want.

EAO: The biggest question is, how does the side-channel error reporting happen?  Because that's an interesting question for a UI component like this.

TOM: It seems like it could hook into the mechanism for email verification that we already have.  And for on-the-fly formatting, hopefully you could write the formatter so that it can listen to whatever event the developer thinks is the right event.

EAO: It's not just about a binary error.  It's about providing more context to the error messages.

TOM: I think many things can be done.  I'm new to this area, so I don't know the precedent.  I'm looking for more experience.

ZB: Thanks TOM for the presentation.  I've worked in this area before.  I'm excited about the space, and I have a lot of questions.  (1) Parsing is hard. There are a lot of questions here.  What happens if they write LTR and RTL?  What happens if they type in Arabic numerals?  What if they use different kinds of separators?  You quickly get into an uncanny valley.  (2) You should also think about address formatting, which is like postcode and phone number.  Where do you stop?  (3) International placeholders is an interesting topic.  How do you present a placeholder for a phone number?  That really depends on the region.  (4) I'm not sure that adding ??? is good for the scope of the spec.  (5) About whether this belongs in a spec.  It seems like a lot of UX teams will want to customize exactly what the output looks like: they agree on most of the format, but want to change a couple things.  There's a good question about how much of this is i18n.  (6) And finally, and this is the strongest point, if we were to specify what you are specifying, we would need to back it with a strong library.  Because speccing it in ECMA-402 doesn't give us everything.  So why not start with writing the foundational library, maybe one that can be used in many different programming languages, and then once you have the library, come back to ECMA-402 and ask whether we should bake it into the browser?  That can then help us answer questions about whether the payload is sufficiently high such that it makes sense to ship it in the browser.  So basically, I think we should start with a library.  I think ECMA-402 is likely not the right place to start.

TOM: We could build a library, but we run the risk of making the "15th way of doing things" (in reference to the XKCD comic).  Temporal started by making a polyfill, and is now integrating it into the browser.  We already have a lot of input masking libraries.

RCA: I think this is really useful. (1) I'm concerned that the scope could be very large. (2) I'm concerned about what ZB said; organizations where I've worked have wanted to have their own way of doing things with slightly different interactions and so on.  That formatter could be a custom thing for that institution.  (3) Another thing is the interoperability with HTML.  You could have an input credit card, the pattern, the validation, etc.  (4) Highly interactive input fields could slow performance on low-resource devices.

TOM: For performance, the obvious tweak would be to do validation on the server.

YSZ: I think this is a super important part of the application. (1) Like FYT said, some of this data is not Intl data. (2) Phone validation is very complicated, like ZB said. We need to care about the UI; for example, inputting the credit card should trigger a numeric keyboard rather than an alphabetic keyboard.  So it seems like we need <input type="phonenumber"/>.  Did you consider starting there?

TOM: I thought about that, and I put it in the explainer as an alternative.  

SFC: In order to avoid the "15th standard" issue, you should approach the industry leader in i18n standards, the Unicode Consortium, about making a working group to establish the industry canonical solution.  ECMA-402 looks for prior art, and Unicode is the place we point to most often.  This is similar in a way to the MessageFormat Working Group, which was chartered to resolve the competing standards for MessageFormat by bringing all the authors together.

TOM: Yeah, reaching out to Unicode and seeing if this has come up before would be a good option.  As I've said, I had this in the String prototype, and then realized that this should maybe be Intl.  Credit card numbers are generally not Intl, but phone numbers are.  So creating that prior art makes sense.

ZB: I had discussed this a few years ago with Unicode.  But with what SFC said, where there are multiple competing libraries, it means that we don't know what the answer is yet.  Once we put it in ECMA-402, we won't be able to change it.  When writing a library, we can make it and discard it with something better later.  It makes sense that we need a place to assemble expertise from the many organizations.  Maybe Unicode is the place.  And only after we have that canonical implementation, we can evaluate whether it fits in ECMA-402.

MCM: The question about new input forms was raised earlier.  Did you list use cases where form input types would NOT be sufficient, where you need the JS APIs?

TOM: If you are on a Node.js server and you have a CSV file of unformatted phone numbers, you might want to format on the server.  So it makes sense to have isomorphic Node and client-side behavior.

MCM: Has Node.js said that they need a standard for this?  Aren't there already Node modules for this?

TOM: Deno is an interesting case.  They've started implementing Web APIs like fetch.  Programmers are used to the way Web APIs work, and they use them in Deno the way they expect them to work.

### LookupMatcher should retain Unicode extension keywords in DefaultLocale

#608

GPT: Seems reasonable to me.

EAO: +1

CJR: +1

#### Conclusion

OK to move forward with this change; review the final spec text when ready.

### ships the entire payload requirement

#588

#### Conclusion

FYT to follow up with Anba's suggestions on the Intl Enumeration API to harden the locale data consistency.

### DateTimeFormat fractionalSecondDigits: conflict between MDN and spec

#590

GPT: It seems reasonable to match the Temporal behavior.

SFC: Do we want to add 4-9 now, or wait until Temporal is more stable?

#### Conclusion

Seems reasonable to move forward with a spec change.  Still some open questions from Anba and SFC.

### Presumptive incompatible change in future edition erroneously listed

#583

RGN: The spec version is immutable.

FYT: Is there a way to publish errata?

RGN: I don't think so… I do see some errata on ECMA International, but I don't see references to those errata.

SFC: The PR in question is #471.  It was merged in January.  I don't know why the change to Annex B made it into the edition, but not the normative change to numberformat.html.

FYT: The other issue is that we have long tables in the PDF that get cut off.

RGN: We're trying to raise funding to generate the PDF by a better mechanism.

#### Conclusion

Ujjwal to investigate.

### Accept plural forms of unit in Intl.NumberFormat

#564

CJR: If we accept the plurals in RelativeTimeFormat, I can see a case for doing that also in NumberFormat.

SFC: There are basically 3 approaches.  (1), we only accept singular units.  (2), we accept plural forms for all units… stripping off the "s"?  (3), only special-case duration units like days and hours.

EAO: Pluralization for all units is challenging.  "inches", "kilometers-per-hour"

CJR: Having listened to your explanation, SFC, I agree with your assessment.  Doing it on an ad-hoc basis is leading away from consistency.

RCA: +1 for not allowing plurals.

RGN: I share this opinion.  Is there already a reference to CLDR, to prevent this from coming up again?

#### Conclusion

Stay consistent with CLDR, and add a normative reference to CLDR if there isn't already one.
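
For reference, a small sketch of the singular-only behavior that this conclusion keeps. The helper name `acceptsUnit` is hypothetical; it relies only on the `Intl.NumberFormat` constructor throwing a `RangeError` for a unit identifier that is not well-formed.

```javascript
// Hypothetical helper: reports whether Intl.NumberFormat accepts a unit identifier.
function acceptsUnit(unit) {
  try {
    new Intl.NumberFormat("en", { style: "unit", unit });
    return true;
  } catch (e) {
    // Unit identifiers that are not well-formed are rejected with a RangeError.
    if (e instanceof RangeError) return false;
    throw e;
  }
}

console.log(acceptsUnit("kilometer-per-hour"));
console.log(acceptsUnit("kilometers-per-hour"));
console.log(acceptsUnit("inches"));
```

Only the first, singular identifier is expected to be accepted.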
@sffc
Contributor

sffc commented Sep 9, 2021

Discussion from 2021-09-09 TC39-TG2: https://github.com/tc39/ecma402/blob/master/meetings/notes-2021-09-09.md#collationsoflocale-order

Conclusion: I am to make a presentation to TC39-TG1 to establish a best practice in the style guide.

@FrankYFTang
Collaborator

My recollection is that the conclusion of TC39-TG1 was that we should sort in a defined order. But should we now change the current spec text to specify the order differently?

@sffc
Contributor

sffc commented Apr 21, 2022

Discussion at TG1: https://github.com/tc39/notes/blob/main/meetings/2021-10/oct-26.md#specifying-order-of-lists-returned-from-intl-apis

There was broad agreement that the order should be specified. Whether the order is lexicographic or semantic or something else had no firm conclusion. I therefore believe that the best path forward is to use lexicographic in the absence of a better order, or semantic when there is one.

@sffc
Contributor

sffc commented Apr 21, 2022

CC @hsivonen @markusicu

For the purposes of Intl Locale Info, I think the order should be in preference order, if that is an order we can figure out. Based on my reading of comments earlier in this thread, it sounds like we cannot always get a consistent order of collation preference based on locale, so the proposal was to sort them alphabetically; is that right?

@hsivonen

  • It's fairly easy to establish that the default collation is the most-preferred one. This is "standard" for all languages other than "zh" and "sv". (For "sv", the default is "reformed" and for "zh", the default is "pinyin" except the region and script subtags can change the default to "stroke".)
  • It's fairly easy to establish that "search" is the least-preferred one for all locales other than "ko".
  • For "ko", it's unclear to me which one of "search" and "searchjl" should be the least-preferred and which one second-least-preferred.
  • "search" and "searchjl" are semantically different from all other collations, and it never makes sense to fall back to them from non-search collations, so depending on what the use cases for the list are, it might make sense not to have these on the list.
  • For aliases of "und", we can decide the relative order of "eor" and "emoji". It can be argued either way: probably more people care about "emoji" but "eor" has more de jure standing. (And I would further argue that perhaps "emoji" itself should be baked into the root anyway in the future.)
  • When the default resolves to "pinyin", it probably makes sense to make the order "pinyin", "stroke", "zhuyin", "unihan", "gb2312", "big5han". (The logic here with the last two is that approximately no one actually wants the last two.)
  • When the default resolves to "stroke", it probably makes sense to make the order "stroke", "zhuyin", "pinyin", "unihan", "big5han", "gb2312".
  • At present in CLDR, a language can have at most one other collation (e.g. one of "phonebk", "dict", "trad") that isn't already addressed by the rules above. I suggest inserting it as the second-most-preferred. (We can make a forward-compatible rule for how to sort these if, in a future CLDR, there can be more than one to insert here.)

@hsivonen

hsivonen commented Apr 22, 2022

(Aside: I'm curious how feasible it would be to make sv follow the same pattern as all other non-zh languages for how its default collation and old collation are labeled.)

@FrankYFTang
Collaborator

@hsivonen, based on your comments, it seems that neither CLDR nor the ICU API gives us a preferred order for all locales. Is that correct? (I have no doubt someone can figure out one way or the other, but then the next person can figure out a different way with their own argument or logic.) My point is that there is currently no standardized data in CLDR, nor a C/C++ API in ICU, that provides such information without some external logic.

@hsivonen

CLDR only gives the most preferred collation (the default). But it mostly doesn't need to provide much else, since 1) search collations aren't like the others and 2) other than und and zh (and the relative order of search and searchjl for ko), default-first, search-last is enough to define an order in current CLDR.
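
A minimal sketch of the "default-first, search-last" rule, under the assumption that the default collation is known to the caller. The helper names `rank` and `defaultFirstSearchLast` are hypothetical, and the sketch deliberately does not resolve the open cases mentioned above (und, zh, and search versus searchjl for ko): values of equal rank simply keep their input order.

```javascript
// Hypothetical ranking: 0 = default, 1 = everything else, 2 = search collations.
function rank(name, defaultCollation) {
  if (name === defaultCollation) return 0;
  if (name.startsWith("search")) return 2;
  return 1;
}

function defaultFirstSearchLast(values, defaultCollation) {
  // Array.prototype.sort is stable, so values with equal rank keep their
  // relative input order. The input array is copied, not mutated.
  return [...values].sort(
    (a, b) => rank(a, defaultCollation) - rank(b, defaultCollation)
  );
}

const deOrder = defaultFirstSearchLast(["search", "phonebk", "standard"], "standard");
console.log(deOrder);
```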

@anba
Contributor Author

anba commented Apr 27, 2022

"standard" and "search" are explicitly excluded (#31, #36), which means the default collation for most languages won't actually appear in the list returned from CollationsOfLocale.

@hsivonen

It seems problematic to exclude "standard" instead of excluding whatever the default is. In particular, by excluding "standard", there is no way to choose the non-default collation for Swedish.

(Or, alternatively, this could be seen as an argument to change CLDR to make Swedish conform to the same naming pattern as all other non-zh locales. The current situation for Swedish in CLDR is obviously bad. The question mainly is whether fixing it has worse ripple effects than not fixing it.)

@hsivonen

hsivonen commented May 2, 2022

@markusicu

If "search" is excluded, then any name that starts with "search" should be excluded.

It seems weird to exclude the default sort order. I imagine that the list of collations could be presented to a user to pick, and not being able to pick the default sort order seems counterproductive.

I personally don't think that the "emoji" sort order is particularly useful. I believe that few people care about the sort order of symbols. I believe that this sort order is mostly for producing Unicode emoji charts, and to suggest an order of emoji for emoji pickers/palettes.

We have an "eor" order but to make it useful it should theoretically be the basis for all of the European tailorings -- except that AFAIK there has been no demand other than from people who created the EOR standard. And doing so would significantly increase the size and complexity of the collation data for dozens of languages.

Don't list inherited collations. For example, for German don't also list "eor" and "emoji".

Alphabetical order seems to make sense. "Undefined order" would be fine, too. Or maybe list the default first, but otherwise don't get fancy. A user seeing a list of two or three sort orders should have no problem picking one that they like.

@sffc
Contributor

sffc commented May 2, 2022

It sounds to me that the best order might be: the locale's default value first, and all remaining values sorted alphabetically afterwards.

@hsivonen

hsivonen commented May 3, 2022

With alphabetical (after the default) there's the risk that it looks very close to preference order for Chrome and Edge but not for Firefox and Safari.

This demo that tests how certain interesting character pairs collate to determine which Chinese collations are actually supported confirms that Chrome and Edge actually exclude the collations that the source code suggests they exclude. Firefox and Safari show all six Chinese collations as supported, but Chrome and Edge show gb2312, big5han, and unihan as unsupported.

So excluding search, putting default first, and then sorting alphabetically would look like this in Chrome and Edge:

Simplified Chinese: pinyin, stroke, zhuyin (arguably the same as preference order)
Traditional Chinese: stroke, pinyin, zhuyin (not all Traditional first, then Simplified, but close enough to looking like a preference order)

It would be bad if Web developers looked at Chrome and Edge and thought it was a preference order, when in Firefox and Safari, the alphabetical (after default) order would very much not be the preference order:

Simplified Chinese: pinyin, big5han, gb2312, stroke, unihan, zhuyin
Traditional Chinese: stroke, big5han, gb2312, pinyin, unihan, zhuyin
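
The lists above can be reproduced with a short sketch. The helper name `defaultThenAlphabetical` and the two input arrays are assumptions made for illustration: they stand in for the sets of Chinese collations that the demo reports as supported, not for anything read from a browser.

```javascript
// Hypothetical helper: default first, remaining values alphabetically,
// with "search" excluded.
function defaultThenAlphabetical(values, defaultCollation) {
  const rest = values
    .filter((v) => v !== defaultCollation && v !== "search")
    .sort();
  return [defaultCollation, ...rest];
}

// Assumed inputs, based on which collations each group of browsers exposes.
const chromeLike = ["pinyin", "stroke", "zhuyin", "search"];
const firefoxLike = ["pinyin", "stroke", "zhuyin", "unihan", "gb2312", "big5han", "search"];

console.log(defaultThenAlphabetical(chromeLike, "pinyin"));
console.log(defaultThenAlphabetical(chromeLike, "stroke"));
console.log(defaultThenAlphabetical(firefoxLike, "pinyin"));
console.log(defaultThenAlphabetical(firefoxLike, "stroke"));
```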

The obvious next question is: If Chrome and Edge don't make big5han, gb2312, and unihan available to the Web, is it useful for Firefox and Safari to make them available to the Web? After all, Web developers can't rely on them being there across browsers when Chrome and Edge exclude them.

The comment in the Chromium source says: "big5han and gb2312han collation do not make any sense and nobody uses them." Blame points to @FrankYFTang. I'm quite willing to believe that the comment is accurate, but as a matter of diligence: What evidence is the comment based on?

Should ECMA-402 prohibit the exposure of the big5han and gb2312 collations? It seems bad to have to discover their practical unavailability from the Chromium source rather than the spec.

What's the rationale for excluding unihan from Chromium? I gather that it's semantically a CJK-specific way of saying dict.

The zh-u-co-unihan and ko-u-co-unihan collations aren't exceptionally large. ja-u-co-unihan is larger than the norm but not excessively so (and would become smaller if the private-kana building block baked into the root, which to me would make sense use-case-wise though I understand the technical reasons why private-kana isn't baked into the root). All these would become smaller if private-unihan was folded into the root. (I gather that if private-unihan was folded into the root, ko-u-co-unihan would become a mere script reordering on top of root.)

@anba
Contributor Author

anba commented May 4, 2022

> It sounds to me that the best order might be: the locale's default value first, and all remaining values sorted alphabetically afterwards.

That implies reverting #36, doesn't it? (Because the default value is "standard" for most locales.)


What are the use cases for Intl.Locale.prototype.collations? If the returned collation values are the ones which can be given to Intl.Collator, then "standard" and "search" should be excluded, because Intl.Collator doesn't allow these two collation values. They aren't allowed because Intl.Collator requires using the usage option instead. For example, instead of new Intl.Collator("en-u-co-search"), it's necessary to use new Intl.Collator("en", {usage: "search"}).
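
A small sketch of that difference, inspecting `resolvedOptions()`. The exact resolved `collation` string may depend on the engine and its locale data, so it is only printed here, not assumed.

```javascript
// Request search behaviour through the `usage` option.
const viaOption = new Intl.Collator("en", { usage: "search" });
console.log(viaOption.resolvedOptions().usage);

// The "co" extension value "search" is not an allowed collation type, so it
// is expected to be ignored, and `usage` keeps its default value.
const viaExtension = new Intl.Collator("en-u-co-search");
console.log(viaExtension.resolvedOptions().usage);
console.log(viaExtension.resolvedOptions().collation);
```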

@sffc
Contributor

sffc commented Jun 16, 2022

According to https://tc39.es/ecma402/#sec-properties-of-intl-collator-instances,

[[Collation]] is a String value with the "type" given in Unicode Technical Standard 35 for the collation, except that the values "standard" and "search" are not allowed, while the value "default" is allowed.

Should we consider using the string "default" as an entry in locale.collations, then, if we don't have "search" and "standard"?

@sffc
Contributor

sffc commented Jun 17, 2022

Another TC39-TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-06-16.md#collationsoflocale-order-33

If I had to summarize the feelings of the group, we are just generally confounded as to why ECMA-402 differs from UTS-35. It leaves no obviously correct option on what to do here.

@FrankYFTang
Collaborator

FrankYFTang commented Jul 7, 2022

Since "standard" and "search" are excluded in this list per https://tc39.es/ecma402/#sec-properties-of-intl-collator-instances
"[[Collation]] is a String value with the "type" given in Unicode Technical Standard 35 for the collation, except that the values "standard" and "search" are not allowed, while the value "default" is allowed."

How about I change step 4 of
https://tc39.es/proposal-intl-locale-info/#sec-collations-of-locale
from
"
Let list be a List of 1 or more unique canonical collation identifiers, which must be lower case String values conforming to the type sequence from UTS 35 Unicode Locale Identifier, section 3.2, sorted in descending preference of those in common use for string comparison in locale. The values "standard" and "search" must be excluded from list."

to
"
Let list be a List of 1 or more unique canonical collation identifiers, which must be lower case String values conforming to the type sequence from UTS 35 Unicode Locale Identifier, section 3.2, sorted using %Array.prototype.sort% using undefined as comparefn. The values "standard" and "search" must be excluded from list."
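
To illustrate what the proposed wording would mean in practice: calling `sort` with `undefined` as the compare function orders strings by their UTF-16 code units. For canonical lowercase collation identifiers that is ordinary alphabetical order, with digits sorting before letters. The sample values below are the six Chinese collation names mentioned earlier in this thread, in arbitrary order.

```javascript
const unsorted = ["zhuyin", "pinyin", "gb2312", "big5han", "unihan", "stroke"];
// Copy first, because Array.prototype.sort sorts in place.
const byCodeUnit = [...unsorted].sort(undefined);
console.log(byCodeUnit);
```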

@sffc
Contributor

sffc commented Jul 7, 2022

Did we determine whether there is a default value we can prepend to the list so that setting collation: loc.collations[0] results in the default collation type?

@FrankYFTang
Collaborator

We discussed this in TG2 on 2022-10-06 and decided to resolve this by sorting them in alphabetical order, since there is no reliable information about the preference.

@sffc
Contributor

sffc commented Oct 6, 2022

@FrankYFTang
Collaborator

I believe this issue is fixed by #63.


5 participants