
Add canonicalize method to LocaleCanonicalizer #747

Merged: 5 commits into unicode-org:main on Jun 7, 2021

Conversation


@dminor dminor commented May 31, 2021

This adds a canonicalize method to locale_canonicalizer that does UTS-35 canonicalization based upon the contents of the CLDR alias.json file.
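As a quick illustration of the new API (a hedged sketch, not code from this PR — the provider setup and crate paths are assumptions; the example locale and its expected result come from the review discussion below):

```rust
use icu_locale_canonicalizer::LocaleCanonicalizer;
use icu_locid::Locale;

fn main() {
    // Hypothetical provider setup; any data provider serving the likely
    // subtags and aliases keys would do.
    let provider = icu_testdata::get_provider();
    let canonicalizer =
        LocaleCanonicalizer::new(&provider).expect("failed to load canonicalization data");

    // Example drawn from the review discussion below: the hepburn-heploc
    // variant pair is replaced by its canonical alalc97 alias.
    let mut locale: Locale = "ja-Latn-fonipa-hepburn-heploc"
        .parse()
        .expect("failed to parse locale");
    canonicalizer.canonicalize(&mut locale);
    assert_eq!(locale.to_string(), "ja-Latn-alalc97-fonipa");
}
```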

I've done some benchmarking to try to assess the cost of canonicalization:

  • The canonicalize benchmark runs on the unit test data. Each one of the locales requires canonicalization.
  • The canonicalize-noop benchmark runs on the existing likely subtags data, where none of the locales require canonicalization.

In each case, there is a create benchmark, where the locales are only created, and a create+canonicalize benchmark, where the locale is created and then canonicalized. The intent is to give an indication of how much more expensive it is to canonicalize a locale rather than just create it.
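A sketch of how one such benchmark pair might be set up with Criterion (the input list, provider, and crate names here are placeholders, not the PR's actual benchmark code):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use icu_locale_canonicalizer::LocaleCanonicalizer;
use icu_locid::Locale;

fn canonicalize_bench(c: &mut Criterion) {
    // Hypothetical inputs; the real benchmarks iterate over the unit-test
    // locales (all requiring canonicalization) and the likely-subtags
    // locales (none requiring canonicalization) respectively.
    let strings = ["ja-Latn-fonipa-hepburn-heploc", "sr-Cyrl-YU"];
    let provider = icu_testdata::get_provider();
    let canonicalizer = LocaleCanonicalizer::new(&provider).expect("data should load");

    // Baseline: only parse the locales.
    c.bench_function("canonicalize/create", |b| {
        b.iter(|| {
            for s in &strings {
                black_box(s.parse::<Locale>().expect("valid locale"));
            }
        })
    });

    // Parse and then canonicalize, to isolate the cost of canonicalization.
    c.bench_function("canonicalize/create+canonicalize", |b| {
        b.iter(|| {
            for s in &strings {
                let mut locale: Locale = s.parse().expect("valid locale");
                canonicalizer.canonicalize(&mut locale);
                black_box(&locale);
            }
        })
    });
}

criterion_group!(benches, canonicalize_bench);
criterion_main!(benches);
```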

The results show that it is definitely 2 to 3x slower to run canonicalize, even after I did some initial performance optimizations. I'm interested in feedback on how important it is to optimize this further. Although canonicalization is definitely slower, the benchmark runs 150-400k iterations on my system with per-iteration times measured in microseconds, so in absolute terms it still seems to be a fast operation.

These are the benchmark results:

canonicalize/create     time:   [11.013 us 11.095 us 11.212 us]                                 
                        change: [-4.4369% -3.3961% -2.4790%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe
canonicalize/create+canonicalize                                                                             
                        time:   [29.580 us 29.604 us 29.628 us]
                        change: [-3.9249% -3.5300% -3.1632%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

canonicalize-noop/create                                                                             
                        time:   [2.2272 us 2.2292 us 2.2316 us]
                        change: [-6.6945% -6.3773% -6.0587%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) low severe
  4 (4.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
canonicalize-noop/create+canonicalize                                                                             
                        time:   [4.6583 us 4.6696 us 4.6842 us]
                        change: [-8.5319% -7.3448% -6.2585%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

I've also added test cases based upon test262 and Firefox internal tests. Some of these are disabled because they require bcp47 data. I've filed #746 as a follow-up to extend the canonicalization once this data is available.

Fixes #218

@dminor dminor requested review from nciric, sffc, zbraniecki and a team as code owners May 31, 2021 17:05
I lost the rename while rebasing.
@codecov-commenter

Codecov Report

Merging #747 (2db06f0) into main (958ee68) will increase coverage by 0.51%.
The diff coverage is 85.64%.


@@            Coverage Diff             @@
##             main     #747      +/-   ##
==========================================
+ Coverage   74.63%   75.15%   +0.51%     
==========================================
  Files         178      179       +1     
  Lines       10715    11167     +452     
==========================================
+ Hits         7997     8392     +395     
- Misses       2718     2775      +57     
| Impacted Files | Coverage Δ |
| --- | --- |
| components/locale_canonicalizer/src/lib.rs | 100.00% <ø> (ø) |
| components/locale_canonicalizer/src/provider.rs | 0.00% <0.00%> (ø) |
| provider/cldr/src/transform/mod.rs | 11.25% <10.00%> (-0.18%) ⬇️ |
| provider/cldr/src/transform/aliases.rs | 83.90% <83.90%> (ø) |
| ...omponents/locid/src/extensions/unicode/keywords.rs | 81.81% <85.71%> (+1.81%) ⬆️ |
| ...s/locale_canonicalizer/src/locale_canonicalizer.rs | 95.02% <96.85%> (+95.02%) ⬆️ |
| components/locid/src/subtags/region.rs | 94.87% <100.00%> (+0.42%) ⬆️ |
| provider/core/src/resource.rs | 81.69% <100.00%> (+0.08%) ⬆️ |
| components/locid/src/subtags/variant.rs | 91.89% <0.00%> (+5.40%) ⬆️ |

... and 1 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 958ee68...2db06f0. Read the comment docs.

@coveralls

coveralls commented May 31, 2021

Pull Request Test Coverage Report for Build 2705d49b52f71b5224f41ff054a57f2e0b3a0f16-PR-747

  • 368 of 429 (85.78%) changed or added relevant lines in 8 files are covered.
  • 9 unchanged lines in 5 files lost coverage.
  • Overall coverage increased (+0.4%) to 76.225%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| components/locid/src/extensions/unicode/keywords.rs | 6 | 7 | 85.71% |
| components/locale_canonicalizer/src/locale_canonicalizer.rs | 180 | 185 | 97.3% |
| provider/cldr/src/transform/mod.rs | 1 | 10 | 10.0% |
| components/locale_canonicalizer/src/provider.rs | 0 | 13 | 0.0% |
| provider/cldr/src/transform/aliases.rs | 172 | 205 | 83.9% |
| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| components/locale_canonicalizer/src/provider.rs | 1 | 0% |
| provider/cldr/src/transform/mod.rs | 1 | 11.25% |
| utils/zerovec/src/map/serde.rs | 1 | 80.33% |
| provider/core/src/resource.rs | 2 | 81.69% |
| components/locale_canonicalizer/src/locale_canonicalizer.rs | 4 | 95.37% |
Totals Coverage Status
Change from base Build 3420be93fb29c6adf7c21e17f1c2eae8beaf51aa: 0.4%
Covered Lines: 9384
Relevant Lines: 12311

💛 - Coveralls

@sffc sffc (Member) left a comment:

The data loading code and CLDR transform look good. One suggestion regarding the constructor. I didn't review the algorithm code.

feature = "provider_serde",
derive(serde::Serialize, serde::Deserialize)
)]
pub struct AliasesV1 {
Member

Thought: These two data structs will need to be migrated to ZeroVec, but that cannot be done until #667 is done. I am working on #667 now, so we should coordinate this.

Comment on lines 129 to 132
let aliases: DataPayload<AliasesV1> = provider
.load_payload(&DataRequest::from(key::ALIASES_V1))?
.take_payload()?;

Member

Suggestion: Make a way to set up a LocaleCanonicalizer with the old LikelySubtags data without the new Aliases data, such as an option in an options bag.

Member

Agree. Since adding aliases carries a non-trivial cost, and there may be users who don't care about aliases in their environment, having the ability to construct a canonicalizer that doesn't pay that cost seems worth it.

Contributor Author

That makes sense to me!

@zbraniecki zbraniecki (Member) left a comment:

question/suggestion (non-blocking): I understand that for our ICU4X data we can't store keys as subtags because they're non-canonical, but could we store values as tuples of subtags to save on parsing them at read time?

|| ruletype
.variants
.iter()
.all(|v| source.id.variants.contains(v)))
Member

question: that leaves out a scenario where source has additional variants not present in ruletype - is that okay?

Contributor Author

Yes, UTS-35 says to match if the ruletype variants are a subset of the source variants, so ja-Latn-fonipa-hepburn-heploc matches against the rule for hepburn-heploc and is canonicalized to ja-Latn-alalc97-fonipa. I'll add a comment explaining this is intentional.
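For illustration, the subset rule in isolation (a hedged sketch over plain string slices rather than the PR's subtag types):

```rust
// A rule matches when every one of its variants appears in the source's
// variants; the source may carry extra variants beyond the rule's.
fn rule_matches(rule_variants: &[&str], source_variants: &[&str]) -> bool {
    rule_variants.iter().all(|v| source_variants.contains(v))
}

fn main() {
    // "hepburn-heploc" is a subset of "fonipa-hepburn-heploc", so the
    // rule fires even though the source also carries "fonipa".
    assert!(rule_matches(
        &["hepburn", "heploc"],
        &["fonipa", "hepburn", "heploc"]
    ));
    // A rule variant missing from the source means no match.
    assert!(!rule_matches(&["hepburn", "heploc"], &["fonipa", "hepburn"]));
}
```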

ruletype_variants: Option<&subtags::Variants>,
replacement: &LanguageIdentifier,
) {
if ruletype_has_language || source.id.language.is_empty() && !replacement.language.is_empty() {
Member

suggestion (non-blocking): would it make sense to assume that if we have ruletype_has_language then replacement.language is not empty?

Contributor Author

Yes, this is a bit of leftover code; I used to use this for scripts and regions as well, and now it is only used for language rules. I can simplify things here a bit.

Contributor Author

Hmm, well I hit some test failures with that, so I think I'd prefer to leave this as-is. The intention is that there are two separate cases where a replacement is made:

A matching rule can be used to transform the source fields as follows

    if type.field ≠ {}
        source.field = (source.field - type.field) ∪ replacement.field
    else if source.field = {} and replacement.field ≠ {}
        source.field = replacement.field

I'll add parentheses to make this intention clearer.
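A hedged sketch of those two cases over plain string sets (the PR operates on its own subtag types, not BTreeSet):

```rust
use std::collections::BTreeSet;

// Illustrative translation of the UTS #35 pseudocode quoted above:
// `rule` plays the role of type.field, `replacement` of replacement.field.
fn apply_rule(
    source: &mut BTreeSet<String>,
    rule: &BTreeSet<String>,
    replacement: &BTreeSet<String>,
) {
    if !rule.is_empty() {
        // source.field = (source.field - type.field) ∪ replacement.field
        let diff: BTreeSet<String> = source.difference(rule).cloned().collect();
        *source = diff.union(replacement).cloned().collect();
    } else if source.is_empty() && !replacement.is_empty() {
        // source.field = replacement.field
        *source = replacement.clone();
    }
}
```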

.collect();
for variant in replacement.variants.iter() {
variants.push(*variant);
}
Member

suggestion (non-blocking): this could potentially be sped up if we optimistically allocated variants for the sum of the lengths and inserted pre-sorted using binary_search, as sketched below. Not sure if it's worth it.
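A rough sketch of this idea (illustrative only, not code from the PR), over plain strings:

```rust
// Reserve capacity for both sets of variants up front, then keep the
// vector sorted while inserting via binary_search, skipping duplicates.
fn merge_variants(variants: &mut Vec<String>, replacement: &[String]) {
    variants.reserve(replacement.len());
    for v in replacement {
        if let Err(idx) = variants.binary_search(v) {
            variants.insert(idx, v.clone());
        }
    }
}
```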

Contributor Author

My feeling is that it is probably not worth adding the complexity unless we start seeing very large numbers of variants.


["rg", "sd"]
.iter()
.filter_map(|key| key.parse::<Key>().ok())
Member

suggestion: could you avoid having to parse these every time by using the tinystr! macro to construct rg and sd and then building a Key unchecked out of them, or even adding a key! macro to parse at build time? (A construction-time variant of this idea is sketched below.)
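A hedged sketch of the cheapest variant of this idea — parsing once at construction time rather than at compile time; the Key import path is an assumption:

```rust
use icu_locid::extensions::unicode::Key;

// Illustrative sketch only: parse "rg" and "sd" once, when the
// canonicalizer is constructed, instead of on every canonicalize call.
// A tinystr!/key! const macro, as suggested above, would move this
// all the way to compile time.
struct RegionKeys {
    rg: Key,
    sd: Key,
}

impl RegionKeys {
    fn new() -> Self {
        Self {
            rg: "rg".parse().expect("valid key"),
            sd: "sd".parse().expect("valid key"),
        }
    }
}
```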

Member

Also, please add a comment explaining rg and sd here.

/// .expect("Failed to parse a Key.");
/// if let Some(value) = keywords.get_mut(key) {
/// *value = "gregory".parse()
/// .expect("Failed to parse a Value.");
Member

nit: indent .expect by one block.


dminor commented Jun 3, 2021

> question/suggestion (non-blocking): I understand that for our ICU4X data we can't store keys as subtags because they're non-canonical, but could we store values as tuples of subtags to save on parsing them at read time?

I tried the tuples of subtags approach with the likely subtags data and it ended up increasing the data file size substantially.

On the key side of things, I'm using into() to extract a TinyStr value for comparison with the TinyStr stored in the alias data. I would expect that would be a fast operation, but I didn't verify that.

On the value side of things, at one point I was storing everything as a parsed LanguageIdentifier. When I changed the script and region values to be TinyStrs instead of LanguageIdentifiers I didn't see any significant change in my benchmarks. I kept the change because it made the data files smaller.

My expectation is that most locales we canonicalize will not require changes, so I've focused on trying to make the searching as fast as possible, and treated actually having to make a change as an edge case. With the experimentation work we're doing, we'll have an opportunity to compare the performance against SpiderMonkey, and I think those results should drive how much more we want to optimize this code. I don't expect our data-driven approach to be as fast as the code generation approach in SpiderMonkey, but I'm hoping we'll be competitive. If not, time to break out the profiler and see what we can do :)

zbraniecki previously approved these changes Jun 3, 2021
@dminor dminor requested a review from sffc June 3, 2021 18:53
@@ -166,228 +205,226 @@ impl LocaleCanonicalizer<'_> {
/// ```
///
pub fn canonicalize(&self, locale: &mut Locale) -> CanonicalizationResult {
Member

Question: What does the typical call site look like here? Is it common that you want to call both maximize for likely subtags as well as canonicalize?

Suggestion 1: If so, perhaps we should make a function that does both.

Suggestion 2: If not (if maximize and canonicalize serve different use cases), then perhaps we should make two entirely separate classes: one that loads likely subtags data and has a maximize function, and the other which loads the canonicalization data and has a canonicalize function. This has the benefit of making a less monolithic ICU4X API and making it a bit easier to do code and data slicing. (You could keep the two classes in the same crate.)

Contributor Author

For the JavaScript use case, it takes a few jumps, but reading through these shows that canonicalization is applied every time we create an Intl.Locale object in JavaScript.

The likely subtags functions maximize and minimize are members of an Intl.Locale instance, so in a sense, yes they are separate. However, canonicalize depends upon being able to call maximize to handle complex region aliases, so every use of canonicalize potentially requires having the likely subtags data available.

We could break this runtime dependency by preprocessing complex regions in the CLDR provider, but that still requires likely subtags data to be available, so we would need some way of specifying a dependency between the alias data transform and the likely subtags data transform. I don't think that is possible right now.

We do use likely subtags operations on their own in Gecko outside of SpiderMonkey, but since we need canonicalization anyway, my thought was to have a singleton LocaleCanonicalizer that loads the data once and serves both use cases.

My preference would be to have a single class with three methods, rather than two classes, one of which will need to access the other anyway. If we could break the runtime dependency, then maybe it would make sense to separate them, but even in that case, I think using options to control which data is loaded would probably be fine.

That said, I might be a bit biased by what will make life easier in Gecko, and maybe there are a lot of other use cases I'm not considering.

Member

> That said, I might be a bit biased by what will make life easier in Gecko, and maybe there are a lot of other use cases I'm not considering.

I don't think it's a bad thing. We have a production environment with a business use case and can link it back to a technical decision ahead of us.
In the absence of more business scenarios, having one is dramatically better than having zero :)

I'd be comfortable following your suggestion, and if we encounter an alternative between 0.3 and 1.0, we can refactor then.


echeran commented Jun 4, 2021

Discussion 2021-06-04:

  • sffc: it's weird to pass in an options bag that disables core functionality for an object. instead split into 2, and have one depend on the other if necessary
  • dminor: use cases (in Gecko) we have will want all 3 pieces of functionality anyways
  • zbraniecki: what optimizations can we enable by moving this data processing dependency during build time instead of runtime?
  • nciric: there is an option that blocks data loading at build time. can we have a feature that enables building data at build time, rather than having it an option in the object?
  • sffc: we could do that in cargo, something to consider. feature would disable the function and any data it depends on.
  • dminor: we can have that as a followup issue


dminor commented Jun 7, 2021

I filed #767 as the follow-up to add a feature to control the canonicalize method.

@sffc sffc (Member) left a comment:

LGTM on the DataMarker stuff, and thanks for filing follow-ups.

@dminor dminor removed the request for review from nciric June 7, 2021 19:12
@dminor dminor merged commit ffd520f into unicode-org:main Jun 7, 2021

Successfully merging this pull request may close these issues.

Implement UTS 35 locale canonicalization
6 participants