
Add canonicalize method to LocaleCanonicalizer #747

Merged: 5 commits into unicode-org:main on Jun 7, 2021

Conversation


@dminor dminor commented May 31, 2021

This adds a canonicalize method to locale_canonicalizer that does UTS-35 canonicalization based upon the contents of the CLDR alias.json file.
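As a quick illustration of the new API (a hedged sketch, not code from this PR — the provider setup and crate paths are assumptions; the example locale and its expected result come from the review discussion below):

```rust
use icu_locale_canonicalizer::LocaleCanonicalizer;
use icu_locid::Locale;

fn main() {
    // Hypothetical provider setup; any data provider serving the likely
    // subtags and aliases keys would do.
    let provider = icu_testdata::get_provider();
    let canonicalizer =
        LocaleCanonicalizer::new(&provider).expect("failed to load canonicalization data");

    // Example drawn from the review discussion below: the hepburn-heploc
    // variant pair is replaced by its canonical alalc97 alias.
    let mut locale: Locale = "ja-Latn-fonipa-hepburn-heploc"
        .parse()
        .expect("failed to parse locale");
    canonicalizer.canonicalize(&mut locale);
    assert_eq!(locale.to_string(), "ja-Latn-alalc97-fonipa");
}
```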

I've done some benchmarking to try to assess the cost of canonicalization:

  • The canonicalize benchmark runs on the unit test data. Each one of the locales requires canonicalization.
  • The canonicalize-noop benchmark runs on the existing likely subtags data, where none of the locales require canonicalization.

In each case, there is a create benchmark, where the locales are only created, and a create+canonicalize benchmark, where the locale is created and then canonicalized. The intent is to give an indication of how much more expensive it is to canonicalize a locale rather than just create it.
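A sketch of how one such benchmark pair might be set up with Criterion (the input list, provider, and crate names here are placeholders, not the PR's actual benchmark code):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use icu_locale_canonicalizer::LocaleCanonicalizer;
use icu_locid::Locale;

fn canonicalize_bench(c: &mut Criterion) {
    // Hypothetical inputs; the real benchmarks iterate over the unit-test
    // locales (all requiring canonicalization) and the likely-subtags
    // locales (none requiring canonicalization) respectively.
    let strings = ["ja-Latn-fonipa-hepburn-heploc", "sr-Cyrl-YU"];
    let provider = icu_testdata::get_provider();
    let canonicalizer = LocaleCanonicalizer::new(&provider).expect("data should load");

    // Baseline: only parse the locales.
    c.bench_function("canonicalize/create", |b| {
        b.iter(|| {
            for s in &strings {
                black_box(s.parse::<Locale>().expect("valid locale"));
            }
        })
    });

    // Parse and then canonicalize, to isolate the cost of canonicalization.
    c.bench_function("canonicalize/create+canonicalize", |b| {
        b.iter(|| {
            for s in &strings {
                let mut locale: Locale = s.parse().expect("valid locale");
                canonicalizer.canonicalize(&mut locale);
                black_box(&locale);
            }
        })
    });
}

criterion_group!(benches, canonicalize_bench);
criterion_main!(benches);
```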

The results show that it is definitely 2 to 3x slower to run canonicalize, even after I did some initial performance optimizations. I'm interested in feedback on how important it is to optimize this further. Although canonicalization is definitely slower, the benchmark runs 150-400k iterations on my system with per-iteration times measured in microseconds, so in absolute terms it still seems to be a fast operation.

These are the benchmark results:

canonicalize/create     time:   [11.013 us 11.095 us 11.212 us]                                 
                        change: [-4.4369% -3.3961% -2.4790%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe
canonicalize/create+canonicalize                                                                             
                        time:   [29.580 us 29.604 us 29.628 us]
                        change: [-3.9249% -3.5300% -3.1632%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

canonicalize-noop/create                                                                             
                        time:   [2.2272 us 2.2292 us 2.2316 us]
                        change: [-6.6945% -6.3773% -6.0587%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) low severe
  4 (4.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
canonicalize-noop/create+canonicalize                                                                             
                        time:   [4.6583 us 4.6696 us 4.6842 us]
                        change: [-8.5319% -7.3448% -6.2585%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

I've also added test cases based upon test262 and Firefox internal tests. Some of these are disabled because they require bcp47 data. I've filed #746 as a follow-up to extend the canonicalization once this data is available.

Fixes #218

@dminor dminor requested review from nciric, sffc, zbraniecki and a team as code owners May 31, 2021 17:05
I lost the rename while rebasing.
@codecov-commenter

Codecov Report

Merging #747 (2db06f0) into main (958ee68) will increase coverage by 0.51%.
The diff coverage is 85.64%.


@@            Coverage Diff             @@
##             main     #747      +/-   ##
==========================================
+ Coverage   74.63%   75.15%   +0.51%     
==========================================
  Files         178      179       +1     
  Lines       10715    11167     +452     
==========================================
+ Hits         7997     8392     +395     
- Misses       2718     2775      +57     
| Impacted Files | Coverage Δ |
| --- | --- |
| components/locale_canonicalizer/src/lib.rs | 100.00% <ø> (ø) |
| components/locale_canonicalizer/src/provider.rs | 0.00% <0.00%> (ø) |
| provider/cldr/src/transform/mod.rs | 11.25% <10.00%> (-0.18%) ⬇️ |
| provider/cldr/src/transform/aliases.rs | 83.90% <83.90%> (ø) |
| ...omponents/locid/src/extensions/unicode/keywords.rs | 81.81% <85.71%> (+1.81%) ⬆️ |
| ...s/locale_canonicalizer/src/locale_canonicalizer.rs | 95.02% <96.85%> (+95.02%) ⬆️ |
| components/locid/src/subtags/region.rs | 94.87% <100.00%> (+0.42%) ⬆️ |
| provider/core/src/resource.rs | 81.69% <100.00%> (+0.08%) ⬆️ |
| components/locid/src/subtags/variant.rs | 91.89% <0.00%> (+5.40%) ⬆️ |

... and 1 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 958ee68...2db06f0. Read the comment docs.

@coveralls

coveralls commented May 31, 2021

Pull Request Test Coverage Report for Build 2705d49b52f71b5224f41ff054a57f2e0b3a0f16-PR-747

  • 368 of 429 (85.78%) changed or added relevant lines in 8 files are covered.
  • 9 unchanged lines in 5 files lost coverage.
  • Overall coverage increased (+0.4%) to 76.225%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| components/locid/src/extensions/unicode/keywords.rs | 6 | 7 | 85.71% |
| components/locale_canonicalizer/src/locale_canonicalizer.rs | 180 | 185 | 97.3% |
| provider/cldr/src/transform/mod.rs | 1 | 10 | 10.0% |
| components/locale_canonicalizer/src/provider.rs | 0 | 13 | 0.0% |
| provider/cldr/src/transform/aliases.rs | 172 | 205 | 83.9% |
| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| components/locale_canonicalizer/src/provider.rs | 1 | 0% |
| provider/cldr/src/transform/mod.rs | 1 | 11.25% |
| utils/zerovec/src/map/serde.rs | 1 | 80.33% |
| provider/core/src/resource.rs | 2 | 81.69% |
| components/locale_canonicalizer/src/locale_canonicalizer.rs | 4 | 95.37% |
Totals Coverage Status
Change from base Build 3420be93fb29c6adf7c21e17f1c2eae8beaf51aa: 0.4%
Covered Lines: 9384
Relevant Lines: 12311

💛 - Coveralls

@sffc sffc (Member) left a comment:

The data loading code and CLDR transform look good. One suggestion regarding the constructor. I didn't review the algorithm code.

feature = "provider_serde",
derive(serde::Serialize, serde::Deserialize)
)]
pub struct AliasesV1 {
Member

Thought: These two data structs will need to be migrated to ZeroVec, but that cannot be done until #667 is done. I am working on #667 now, so we should coordinate this.

Comment on lines 129 to 132
let aliases: DataPayload<AliasesV1> = provider
.load_payload(&DataRequest::from(key::ALIASES_V1))?
.take_payload()?;

Member

Suggestion: Make a way to set up a LocaleCanonicalizer with the old LikelySubtags data without the new Aliases data, such as an option in an options bag.

Member

Agree. Since adding aliases carries a non-trivial cost, and there may be users who don't care about aliases in their environment, having the ability to construct a canonicalizer that doesn't pay that cost seems worth it.

Contributor Author

That makes sense to me!

@zbraniecki zbraniecki (Member) left a comment:

question/suggestion (non-blocking): I understand that for our ICU4X data we can't store keys as subtags because they're non-canonical, but could we store values as tuples of subtags to save on parsing them at read time?

|| ruletype
.variants
.iter()
.all(|v| source.id.variants.contains(v)))
Member

question: that leaves out a scenario where source has additional variants not present in ruletype - is that okay?

Contributor Author

Yes, UTS-35 says to match if the ruletype variants are a subset of the source variants, so ja-Latn-fonipa-hepburn-heploc matches against the rule for hepburn-heploc and is canonicalized to ja-Latn-alalc97-fonipa. I'll add a comment explaining this is intentional.
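For illustration, the subset rule in isolation (a hedged sketch over plain string slices rather than the PR's subtag types):

```rust
// A rule matches when every one of its variants appears in the source's
// variants; the source may carry extra variants beyond the rule's.
fn rule_matches(rule_variants: &[&str], source_variants: &[&str]) -> bool {
    rule_variants.iter().all(|v| source_variants.contains(v))
}

fn main() {
    // "hepburn-heploc" is a subset of "fonipa-hepburn-heploc", so the
    // rule fires even though the source also carries "fonipa".
    assert!(rule_matches(
        &["hepburn", "heploc"],
        &["fonipa", "hepburn", "heploc"]
    ));
    // A rule variant missing from the source means no match.
    assert!(!rule_matches(&["hepburn", "heploc"], &["fonipa", "hepburn"]));
}
```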

ruletype_variants: Option<&subtags::Variants>,
replacement: &LanguageIdentifier,
) {
if ruletype_has_language || source.id.language.is_empty() && !replacement.language.is_empty() {
Member

suggestion (non-blocking): would it make sense to assume that if we have ruletype_has_language then replacement.language is not empty?

Contributor Author

Yes, this is a bit of leftover code; I used to use this for scripts and regions as well, and now it is only used for language rules. I can simplify things here a bit.

Contributor Author

Hmm, well I hit some test failures with that, so I think I'd prefer to leave this as-is. The intention is that there are two separate cases where a replacement is made:

A matching rule can be used to transform the source fields as follows

    if type.field ≠ {}
        source.field = (source.field - type.field) ∪ replacement.field
    else if source.field = {} and replacement.field ≠ {}
        source.field = replacement.field

I'll add parentheses to make this intention clearer.
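A hedged sketch of those two cases over plain string sets (the PR operates on its own subtag types, not BTreeSet):

```rust
use std::collections::BTreeSet;

// Illustrative translation of the UTS #35 pseudocode quoted above:
// `rule` plays the role of type.field, `replacement` of replacement.field.
fn apply_rule(
    source: &mut BTreeSet<String>,
    rule: &BTreeSet<String>,
    replacement: &BTreeSet<String>,
) {
    if !rule.is_empty() {
        // source.field = (source.field - type.field) ∪ replacement.field
        let diff: BTreeSet<String> = source.difference(rule).cloned().collect();
        *source = diff.union(replacement).cloned().collect();
    } else if source.is_empty() && !replacement.is_empty() {
        // source.field = replacement.field
        *source = replacement.clone();
    }
}
```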

.collect();
for variant in replacement.variants.iter() {
variants.push(*variant);
}
Member

suggestion (non-blocking): this could potentially be sped up if we optimistically allocated variants for the sum of the lengths and inserted pre-sorted using binary_search, as sketched below. Not sure if it's worth it.
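A rough sketch of this idea (illustrative only, not code from the PR), over plain strings:

```rust
// Reserve capacity for both sets of variants up front, then keep the
// vector sorted while inserting via binary_search, skipping duplicates.
fn merge_variants(variants: &mut Vec<String>, replacement: &[String]) {
    variants.reserve(replacement.len());
    for v in replacement {
        if let Err(idx) = variants.binary_search(v) {
            variants.insert(idx, v.clone());
        }
    }
}
```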

Contributor Author

My feeling is that it is probably not worth adding the complexity unless we start seeing very large numbers of variants.


["rg", "sd"]
.iter()
.filter_map(|key| key.parse::<Key>().ok())
Member

suggestion: could you avoid having to parse these every time by using the tinystr! macro to construct rg and sd and then building a Key unchecked out of them, or even adding a key! macro to parse at build time? (A construction-time variant of this idea is sketched below.)
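A hedged sketch of the cheapest variant of this idea — parsing once at construction time rather than at compile time; the Key import path is an assumption:

```rust
use icu_locid::extensions::unicode::Key;

// Illustrative sketch only: parse "rg" and "sd" once, when the
// canonicalizer is constructed, instead of on every canonicalize call.
// A tinystr!/key! const macro, as suggested above, would move this
// all the way to compile time.
struct RegionKeys {
    rg: Key,
    sd: Key,
}

impl RegionKeys {
    fn new() -> Self {
        Self {
            rg: "rg".parse().expect("valid key"),
            sd: "sd".parse().expect("valid key"),
        }
    }
}
```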

Member

Also, please add a comment explaining rg and sd here.

/// .expect("Failed to parse a Key.");
/// if let Some(value) = keywords.get_mut(key) {
/// *value = "gregory".parse()
/// .expect("Failed to parse a Value.");
Member

nit: indent .expect by one block.


dminor commented Jun 3, 2021

> question/suggestion (non-blocking): I understand that for our ICU4X data we can't store keys as subtags because they're non-canonical, but could we store values as tuples of subtags to save on parsing them at read time?

I tried the tuples of subtags approach with the likely subtags data and it ended up increasing the data file size substantially.

On the key side of things, I'm using into() to extract a TinyStr value for comparison with the TinyStr stored in the alias data. I would expect that would be a fast operation, but I didn't verify that.

On the value side of things, at one point I was storing everything as a parsed LanguageIdentifier. When I changed the script and region values to be TinyStrs instead of LanguageIdentifiers I didn't see any significant change in my benchmarks. I kept the change because it made the data files smaller.

My expectation is that most locales we canonicalize will not require changes, so I've focused on trying to make the searching as fast as possible, and treated actually having to make a change as an edge case. With the experimentation work we're doing, we'll have an opportunity to compare the performance against SpiderMonkey, and I think those results should drive how much more we want to optimize this code. I don't expect our data-driven approach to be as fast as the code generation approach in SpiderMonkey, but I'm hoping we'll be competitive. If not, time to break out the profiler and see what we can do :)

zbraniecki previously approved these changes Jun 3, 2021
@dminor dminor requested a review from sffc June 3, 2021 18:53
@@ -166,228 +205,226 @@ impl LocaleCanonicalizer<'_> {
/// ```
///
pub fn canonicalize(&self, locale: &mut Locale) -> CanonicalizationResult {
Member

Question: What does the typical call site look like here? Is it common that you want to call both maximize for likely subtags as well as canonicalize?

Suggestion 1: If so, perhaps we should make a function that does both.

Suggestion 2: If not (if maximize and canonicalize serve different use cases), then perhaps we should make two entirely separate classes: one that loads likely subtags data and has a maximize function, and the other which loads the canonicalization data and has a canonicalize function. This has the benefit of making a less monolithic ICU4X API and making it a bit easier to do code and data slicing. (You could keep the two classes in the same crate.)

Contributor Author

For the JavaScript use case, it takes a few jumps, but reading through these shows that canonicalization is applied every time we create an Intl.Locale object in JavaScript.

The likely subtags functions maximize and minimize are members of an Intl.Locale instance, so in a sense, yes they are separate. However, canonicalize depends upon being able to call maximize to handle complex region aliases, so every use of canonicalize potentially requires having the likely subtags data available.

We could break this runtime dependency by preprocessing complex regions in the CLDR provider, but that still requires likely subtags data to be available, so we would need some way of specifying a dependency between the alias data transform and the likely subtags data transform. I don't think that is possible right now.

We do use likely subtags operations on their own in Gecko outside of SpiderMonkey, but since we need canonicalization anyway, my thought was to have a singleton LocaleCanonicalizer that loads the data once and serves both use cases.

My preference would be to have a single class with three methods, rather than two classes, one of which will need to access the other anyway. If we could break the runtime dependency, then maybe it would make sense to separate them, but even in that case, I think using options to control which data is loaded would probably be fine.

That said, I might be a bit biased by what will make life easier in Gecko, and maybe there are a lot of other use cases I'm not considering.

Member

> That said, I might be a bit biased by what will make life easier in Gecko, and maybe there are a lot of other use cases I'm not considering.

I don't think it's a bad thing. We have a production environment with a business use case and can link it back to a technical decision ahead of us.
In the absence of more business scenarios, having one is dramatically better than having zero :)

I'd be comfortable following your suggestion, and if we encounter an alternative between 0.3 and 1.0, we can refactor then.


echeran commented Jun 4, 2021

Discussion 2021-06-04:

  • sffc: it's weird to pass in an options bag that disables core functionality for an object. instead split into 2, and have one depend on the other if necessary
  • dminor: use cases (in Gecko) we have will want all 3 pieces of functionality anyways
  • zbraniecki: what optimizations can we enable by moving this data processing dependency during build time instead of runtime?
  • nciric: there is an option that blocks data loading at build time. can we have a feature that enables building data at build time, rather than having it an option in the object?
  • sffc: we could do that in cargo, something to consider. feature would disable the function and any data it depends on.
  • dminor: we can have that as a followup issue


dminor commented Jun 7, 2021

I filed #767 as the follow-up to add a feature to control the canonicalize method.

@sffc sffc (Member) left a comment:

LGTM on the DataMarker stuff, and thanks for filing follow-ups.

@dminor dminor removed the request for review from nciric June 7, 2021 19:12
@dminor dminor merged commit ffd520f into unicode-org:main Jun 7, 2021

Successfully merging this pull request may close these issues.

Implement UTS 35 locale canonicalization
6 participants