Add canonicalize method to LocaleCanonicalizer #747
Conversation
I lost the rename while rebasing.
Codecov Report
@@ Coverage Diff @@
## main #747 +/- ##
==========================================
+ Coverage 74.63% 75.15% +0.51%
==========================================
Files 178 179 +1
Lines 10715 11167 +452
==========================================
+ Hits 7997 8392 +395
- Misses 2718 2775 +57
Continue to review full report at Codecov.
The data loading code and CLDR transform look good. One suggestion regarding the constructor. I didn't review the algorithm code.
feature = "provider_serde", | ||
derive(serde::Serialize, serde::Deserialize) | ||
)] | ||
pub struct AliasesV1 { |
let aliases: DataPayload<AliasesV1> = provider
    .load_payload(&DataRequest::from(key::ALIASES_V1))?
    .take_payload()?;
Suggestion: Make a way to set up a LocaleCanonicalizer with the old LikelySubtags data without the new Aliases data, such as an option in an options bag.
Agree. Since we seem to have a non-trivial cost for adding aliases, and there may be users who don't care about aliases in their environment, having the ability to construct a canonicalizer that doesn't pay that cost seems worth it.
That makes sense to me!
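For illustration, a minimal sketch of what the options-bag idea could look like. These names (CanonicalizerOptions, new_with_options) are hypothetical, not the actual icu4x API; the real design is left to the follow-up filed later in this thread.

```rust
// Hypothetical sketch only: an options bag controlling whether alias data is
// loaded. None of these names are the actual icu4x API.
pub struct CanonicalizerOptions {
    /// If false, skip loading the CLDR alias data; maximize/minimize still
    /// work from the likely-subtags data, but canonicalize would be unavailable.
    pub load_aliases: bool,
}

impl Default for CanonicalizerOptions {
    fn default() -> Self {
        Self { load_aliases: true }
    }
}

// impl LocaleCanonicalizer<'_> {
//     pub fn new_with_options<P>(provider: &P, options: CanonicalizerOptions) -> Result<Self, DataError> {
//         let likely_subtags = /* always loaded */;
//         let aliases = if options.load_aliases {
//             Some(/* load ALIASES_V1 as in the snippet above */)
//         } else {
//             None
//         };
//         /* ... */
//     }
// }
```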
question/suggestion (non-blocking): I understand that for our ICU4X data we can't store keys as subtags because they're non-canonical, but could we store values as tuples of subtags to save on parsing them at read time?
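For concreteness, a rough sketch of what storing the values as pre-parsed subtag tuples might look like. The field names and string widths below are hypothetical, and as the reply further down notes, a similar layout was tried for the likely subtags data and increased the file size substantially.

```rust
use icu_locid::subtags::{Language, Region, Script};
use tinystr::TinyStr8;

// Illustrative only: values stored as pre-parsed subtags instead of strings.
// Keys stay as TinyStrs because they may be non-canonical.
pub struct AliasesV1Tuples {
    pub region: Vec<(TinyStr8, Region)>,
    pub language: Vec<(TinyStr8, (Language, Option<Script>, Option<Region>))>,
}
```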
    || ruletype
        .variants
        .iter()
        .all(|v| source.id.variants.contains(v)))
question: that leaves out a scenario where source has additional variants not present in ruletype - is that okay?
Yes, UTS-35 says to match if the ruletype variants are a subset of the source variants, so ja-Latn-fonipa-hepburn-heploc matches against the rule for hepburn-heploc and is canonicalized to ja-Latn-alalc97-fonipa. I'll add a comment explaining this is intentional.
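A hedged usage sketch of the behavior described above. Only the canonicalize signature itself comes from this PR; the crate path, constructor, and test-data provider shown here are assumptions.

```rust
use icu_locale_canonicalizer::LocaleCanonicalizer;
use icu_locid::Locale;

fn demo() {
    // Assumed test-data provider and constructor; not verified against this PR.
    let provider = icu_testdata::get_provider();
    let lc = LocaleCanonicalizer::new(&provider).expect("data should load");

    // "hepburn-heploc" is a subset of the source's variants, so the rule applies
    // even though the source carries the extra "fonipa" variant.
    let mut locale: Locale = "ja-Latn-fonipa-hepburn-heploc".parse().unwrap();
    lc.canonicalize(&mut locale);
    assert_eq!(locale.to_string(), "ja-Latn-alalc97-fonipa");
}
```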
    ruletype_variants: Option<&subtags::Variants>,
    replacement: &LanguageIdentifier,
) {
    if ruletype_has_language || source.id.language.is_empty() && !replacement.language.is_empty() {
suggestion (non-blocking): would it make sense to assume that if we have ruletype_has_language then replacement.language is not empty?
Yes, this is a bit of leftover code; I used to use this for scripts and regions as well, and now it's only used for language rules. I can simplify things here a bit.
Hmm, well I hit some test failures with that, so I think I'd prefer to leave this as-is. The intention is that there are two separate cases where a replacement is made:
A matching rule can be used to transform the source fields as follows:

    if type.field ≠ {}
        source.field = (source.field - type.field) ∪ replacement.field
    else if source.field = {} and replacement.field ≠ {}
        source.field = replacement.field
I'll add parentheses to make this intention clearer.
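For reference, a sketch of how those two cases read once the parentheses are added. The helper name and body are illustrative, but the condition matches the quoted line; Rust's && already binds tighter than ||, so only the intent becomes clearer, not the behavior.

```rust
use icu_locid::{LanguageIdentifier, Locale};

// Illustrative helper, not the PR's actual function body.
fn replace_language(
    ruletype_has_language: bool,
    source: &mut Locale,
    replacement: &LanguageIdentifier,
) {
    // Case 1: the rule's type has a language field, so always replace.
    // Case 2: the source language is empty and the replacement provides one.
    if ruletype_has_language
        || (source.id.language.is_empty() && !replacement.language.is_empty())
    {
        source.id.language = replacement.language;
    }
}
```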
        .collect();
    for variant in replacement.variants.iter() {
        variants.push(*variant);
    }
suggestion (non-blocking): this could potentially be sped up if we were to optimistically allocate variants for the sum of the two lengths and insert using binary_search so the result stays pre-sorted. Not sure if it's worth it.
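A rough sketch of that suggestion, assuming the source variants are already sorted and that Variant is Ord and Copy (illustrative; not the PR's code):

```rust
use icu_locid::subtags::Variant;

// Reserve space for both sets up front and insert via binary_search so the
// vector stays sorted, avoiding a separate sort pass afterwards.
fn merge_variants(source: &[Variant], replacement: &[Variant]) -> Vec<Variant> {
    let mut variants = Vec::with_capacity(source.len() + replacement.len());
    variants.extend_from_slice(source); // assumed already sorted and deduplicated
    for variant in replacement {
        if let Err(pos) = variants.binary_search(variant) {
            variants.insert(pos, *variant);
        }
    }
    variants
}
```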
My feeling is that it is probably not worth adding the complexity unless we start seeing very large numbers of variants.
["rg", "sd"] | ||
.iter() | ||
.filter_map(|key| key.parse::<Key>().ok()) |
suggestion: could you avoid having to parse these every time by constructing tinystr! values for rg and sd and building the Key unchecked out of them, or even by adding a key! macro that parses at build time?
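A small sketch of the spirit of that suggestion: parse the two well-known keys once, outside the hot path, rather than on every call. The helper name is hypothetical, and a real tinystr!/key! macro could move this to compile time entirely.

```rust
use icu_locid::extensions::unicode::Key;

// Hypothetical helper: callers invoke this once and reuse the result instead
// of re-parsing "rg" and "sd" inside canonicalize().
fn region_override_keys() -> [Key; 2] {
    [
        "rg".parse().expect("'rg' is a valid Unicode extension key"),
        "sd".parse().expect("'sd' is a valid Unicode extension key"),
    ]
}
```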
Also, please add a comment explaining rg and sd here.
/// .expect("Failed to parse a Key."); | ||
/// if let Some(value) = keywords.get_mut(key) { | ||
/// *value = "gregory".parse() | ||
/// .expect("Failed to parse a Value."); |
nit: indent .expect by one block.
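Presumably something like this, with the chained .expect indented one level further than the expression it belongs to:

```rust
/// .expect("Failed to parse a Key.");
/// if let Some(value) = keywords.get_mut(key) {
///     *value = "gregory".parse()
///         .expect("Failed to parse a Value.");
```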
I tried the tuples of subtags approach with the likely subtags data and it ended up increasing the data file size substantially.

On the key side of things, I'm using into() to extract a TinyStr value for comparison with the TinyStr stored in the alias data. I would expect that to be a fast operation, but I didn't verify it. On the value side of things, at one point I was storing everything as a parsed LanguageIdentifier. When I changed the script and region values to be TinyStrs instead of LanguageIdentifiers, I didn't see any significant change in my benchmarks. I kept the change because it made the data files smaller.

My expectation is that most locales we canonicalize will not require changes, so I've focused on trying to make the searching as fast as possible and treated actually having to make a change as an edge case.

With the experimentation work we're doing, we'll have an opportunity to compare the performance against SpiderMonkey, and I think those results should drive how much more we want to optimize this code. I don't expect our data-driven approach to be as fast as the code generation approach in SpiderMonkey, but I'm hoping we'll be competitive. If not, time to break out the profiler and see what we can do :)
@@ -166,228 +205,226 @@ impl LocaleCanonicalizer<'_> {
/// ```
///
pub fn canonicalize(&self, locale: &mut Locale) -> CanonicalizationResult {
Question: What does the typical call site look like here? Is it common that you want to call both maximize for likely subtags as well as canonicalize?
Suggestion 1: If so, perhaps we should make a function that does both.
Suggestion 2: If not (if maximize and canonicalize serve different use cases), then perhaps we should make two entirely separate classes: one that loads likely subtags data and has a maximize function, and the other which loads the canonicalization data and has a canonicalize function. This has the benefit of making a less monolithic ICU4X API and making it a bit easier to do code and data slicing. (You could keep the two classes in the same crate.)
For the JavaScript use case, it takes a few jumps, but reading through these:
- https://tc39.es/ecma402/#locale-objects
- https://tc39.es/ecma402/#sec-canonicalizeunicodelocaleid
- https://unicode.org/reports/tr35/#Canonical_Unicode_Locale_Identifiers
- https://unicode.org/reports/tr35/#LocaleId_Canonicalization
shows that canonicalization is applied every time we create an Intl.Locale object in JavaScript.
The likely subtags functions maximize and minimize are members of an Intl.Locale instance, so in a sense, yes, they are separate. However, canonicalize depends upon being able to call maximize to handle complex region aliases, so every use of canonicalize potentially requires having the likely subtags data available.
We could break this runtime dependency by preprocessing complex regions in the CLDR provider, but that still requires likely subtags data to be available, so we would need some way of specifying a dependency between the alias data transform and the likely subtags data transform. I don't think that is possible right now.
We do use likely subtags operations on their own in Gecko outside of SpiderMonkey, but since we need canonicalization anyway, my thought was to have a singleton LocaleCanonicalizer that loads the data once and serves both use cases.
That said, I might be a bit biased by what will make life easier in Gecko, and maybe there are a lot of other use cases I'm not considering.
> That said, I might be a bit biased by what will make life easier in Gecko, and maybe there are a lot of other use cases I'm not considering.
I don't think it's a bad thing. We have a production environment with a business use case and can link it back to a technical decision ahead of us.
In the absence of more business scenarios, having one is dramatically better than having zero :)
I'd be comfortable following your suggestion, and if between 0.3 and 1.0 we encounter an alternative, then refactor.
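To make the agreed direction concrete, a hedged sketch of the single-class shape: one LocaleCanonicalizer instance (for example a singleton in Gecko) serving both the canonicalization and likely-subtags use cases. The maximize signature and the CanonicalizationResult re-export are assumptions based on the existing likely-subtags API, not something verified in this PR.

```rust
use icu_locale_canonicalizer::{CanonicalizationResult, LocaleCanonicalizer};
use icu_locid::Locale;

// One instance, created once, used for both operations.
fn canonicalize_and_maximize(
    lc: &LocaleCanonicalizer<'_>,
    loc: &mut Locale,
) -> CanonicalizationResult {
    // ECMA-402 path: canonicalize on every Intl.Locale construction.
    lc.canonicalize(loc);
    // Likely-subtags path: maximize (and minimize) on the same instance.
    lc.maximize(loc)
}
```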
Discussion 2021-06-04:
I filed #767 as the follow-up to add a feature to control the canonicalize method.
LGTM on the DataMarker stuff, and thanks for filing the follow-ups.
This adds a canonicalize method to locale_canonicalizer that does UTS-35 canonicalization based upon the contents of the CLDR alias.json file.

I've done some benchmarking to try to assess the cost of canonicalization:

- The canonicalize benchmark runs on the unit test data. Each one of the locales requires canonicalization.
- The canonicalize-noop benchmark runs on the existing likely subtags data, where none of the locales require canonicalization.

In each case, there is a create benchmark, where the locales are only created, and a create+canonicalize benchmark, where the locale is created and then canonicalized. The intent is to give an indication of how much more expensive it is to canonicalize a locale rather than just create it.

The results show that it is definitely 2 to 3x slower to run canonicalize, even after I did some initial performance optimizations. I'm interested in feedback on how important it is to optimize this further. Although canonicalization is definitely slower, the benchmark runs 150-400k iterations in microseconds on my system, so to me it still seems, in absolute terms, to be a fast operation.
These are the benchmark results:
I've also added test cases based upon test262 and Firefox internal tests. Some of these are disabled because they require bcp47 data. I've filed #746 as a follow up to extend the canonicalization once this data is available.
Fixes #218
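For reference, a minimal sketch of how the create vs. create+canonicalize comparison could be reproduced with Criterion; the provider, constructor, and input list are assumptions, not the PR's actual benchmark code.

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use icu_locale_canonicalizer::LocaleCanonicalizer;
use icu_locid::Locale;

fn bench_canonicalize(c: &mut Criterion) {
    // Assumed test-data provider and constructor.
    let provider = icu_testdata::get_provider();
    let lc = LocaleCanonicalizer::new(&provider).expect("data should load");
    let inputs = ["ja-Latn-fonipa-hepburn-heploc", "sgn-DE", "und-Latn-US"];

    // Baseline: only parse the locales.
    c.bench_function("create", |b| {
        b.iter(|| {
            for s in &inputs {
                let _: Locale = black_box(s).parse().unwrap();
            }
        })
    });

    // Parse and then canonicalize each locale.
    c.bench_function("create+canonicalize", |b| {
        b.iter(|| {
            for s in &inputs {
                let mut loc: Locale = black_box(s).parse().unwrap();
                lc.canonicalize(&mut loc);
            }
        })
    });
}

criterion_group!(benches, bench_canonicalize);
criterion_main!(benches);
```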