Add NaiveFallback #2686

zbraniecki · 2022-09-29T19:03:49Z

In relation to #2683.

The current fallback mechanism is quite costly in binary size and data payload to support. In order to enable customers to use ICU4X without it, we by default do not deduplicate data that would rely on runtime fallbacking.

I believe we can (in a true Rust fashion!) resolve this dichotomy by introducing The Third Way between no fallback and full fallback. I dubbed it naive fallback.

Naive fallback works only one way - minimizing tags, and contains a very short list of exceptions.

The algorithm works like this:

1. Take the requested locale and cut out everything except the language-script-region subtags.
2. Check if there is a match.
3. If not, cut out the region.
4. Check if there is a match.
5. If not, check if the given language-region pair is in the exception list.
   5.1 If it is, use the language-script from that exception.
   5.2 If not, remove the script.
6. Check if there is a match.
7. If not, use und.

This will cater to exceptions in sr and zh, but not much more. For everything else it will just cut off subtags from right to left and eventually fall back to und.
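The steps above can be sketched in a few lines of plain Rust. This is only an illustration of a literal reading of the list, not ICU4X code: the names `parse`, `join`, and `naive_fallback`, the slice-based "available" set, and the `(language, region, replacement)` exception tuples are all assumptions made for the example, and input is assumed to be well-formed and canonically cased.

```rust
/// Step 1: keep only language, script, and region; drop variants and extensions.
/// Assumes well-formed, canonically cased input such as "zh-Hant-TW".
fn parse(locale: &str) -> (String, Option<String>, Option<String>) {
    let mut subtags = locale.split('-');
    let language = match subtags.next() {
        Some(l) if !l.is_empty() => l.to_string(),
        _ => "und".to_string(),
    };
    let (mut script, mut region) = (None, None);
    for s in subtags {
        let alpha = s.chars().all(|c| c.is_ascii_alphabetic());
        let digits = s.chars().all(|c| c.is_ascii_digit());
        if s.len() == 4 && alpha && script.is_none() && region.is_none() {
            script = Some(s.to_string());
        } else if ((s.len() == 2 && alpha) || (s.len() == 3 && digits)) && region.is_none() {
            region = Some(s.to_string());
        } else {
            break; // variants, extensions, private use: all dropped
        }
    }
    (language, script, region)
}

/// Joins the subtags that are present with '-'.
fn join(language: &str, script: Option<&str>, region: Option<&str>) -> String {
    let mut out = language.to_string();
    for subtag in script.into_iter().chain(region) {
        out.push('-');
        out.push_str(subtag);
    }
    out
}

/// Returns the first candidate present in `available`, or "und" (step 7).
/// `exceptions` holds hypothetical (language, region, replacement) entries.
fn naive_fallback(locale: &str, available: &[&str], exceptions: &[(&str, &str, &str)]) -> String {
    let (language, script, region) = parse(locale);
    // Step 5: look up the (language, region) pair in the exception list.
    let exception = region.as_deref().and_then(|r| {
        exceptions.iter().find(|(l, reg, _)| *l == language && *reg == r)
    });
    let candidates = vec![
        // Steps 1-2: language-script-region.
        join(&language, script.as_deref(), region.as_deref()),
        // Steps 3-4: region removed.
        join(&language, script.as_deref(), None),
        // Steps 5-6: exception target (5.1) or bare language (5.2).
        match exception {
            Some((_, _, target)) => target.to_string(),
            None => language.clone(),
        },
    ];
    candidates
        .into_iter()
        .find(|c| available.contains(&c.as_str()))
        .unwrap_or_else(|| "und".to_string())
}
```

With an available set of en, sr, sr-Latn, zh-Hans, zh-Hant and an exception mapping (zh, TW) to zh-Hant, a request for zh-TW resolves to zh-Hant and fr-CA resolves to und. Note that under this literal reading a bare zh entry, if present, would match at step 4 before the exception list is consulted, so the exact ordering of steps 3 to 5 would need to be pinned down in a real design.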

The algorithm is super small, the data is super small (maybe even baked in by default?), and if used in both datagen and runtime it allows us to cut out a huge portion of locales, which in turn reduces the number of keys in the key table in the data payload.
This reduction has two benefits:

1. It makes the data payload smaller, which is especially noticeable in smaller payloads (decimal/symbols is predominantly keys).
2. It makes the runtime locale selection faster, which reduces the cost of the constructor.


zbraniecki · 2022-09-29T19:18:24Z

I created a small script that takes all locale data for a given key, hashes the content, and deduplicates it down to the smallest number of data files needed to cover all keys.

In case of decimal/symbols@1 the ideal deduplication would reduce the number of keys from 683 to 49.
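The script itself is not shown in the thread, so the following is only a guess at its shape: group locales by a hash of their payload and count the groups. The function name `group_by_content` and the string payloads are hypothetical, and a real tool would compare the bytes as well to rule out hash collisions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Groups locales whose payloads hash to the same value.
/// The number of groups is the "ideal" count of distinct data files.
fn group_by_content<'a>(entries: &[(&'a str, &str)]) -> BTreeMap<u64, Vec<&'a str>> {
    let mut groups: BTreeMap<u64, Vec<&'a str>> = BTreeMap::new();
    for (locale, payload) in entries {
        // Hash the payload content only; the locale name is not part of the hash.
        let mut hasher = DefaultHasher::new();
        payload.hash(&mut hasher);
        groups.entry(hasher.finish()).or_default().push(*locale);
    }
    groups
}
```

Calling `group_by_content(&entries).len()` on the payloads of one key gives the number of distinct payloads, which is the kind of figure quoted above.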

zbraniecki · 2022-09-29T19:28:28Z

datetime/gregory/datesymbols@1 - 574 -> 257
datetime/gregory/datelengths@1 - 574 -> 187
datetime/week_data@1 - 155 -> 6
datetime/timesymbols@1 - 574 -> 205
plurals/cardinal@1 - 215 -> 39
plurals/ordinal@1 - 100 -> 24
collator/data@1 - 132 -> 122
collator/meta@1 - 142 -> 9
time_zone/formats@1 - 574 -> 141
time_zone/generic_long@1 - 574 -> 162
list/and@1 - 574 -> 116

I assume that this deduplication is already happening for values.
I also want to say that my script provides ideal optimization, which the naive algorithm is likely not going to achieve.
I do believe, though, that we can develop it so that it provides full conformance (falls back as well as a full fallback would) while getting very close to the ideal key set.

sffc · 2022-09-29T22:28:19Z

#834

sffc · 2022-09-29T22:32:28Z

tl;dr, locale fallback is very, very complicated. I have a plan for how to solve the size and speed issues. It's something we can improve incrementally.

zbraniecki · 2022-09-29T22:34:11Z

I don't think this is the same as #834. That one is about including locales based on allowlist, this is about deduplicating.

sffc · 2022-09-29T23:21:32Z

It's the same in the sense that these problems can be solved by making DataExporter smarter. We can either pre-populate locales or strip them if they have duplicates.

Note that "naive fallback" is already supported; see the no_data constructor on FallbackProvider.

sffc · 2022-09-30T02:02:03Z

Back on my computer so I can write a more complete response…

> The current fallback mechanism is quite costly in binary size and data payload to support.

It has some cost in binary size, but not too much. The data size cost is about 10 kB.

> In order to enable customers to use ICU4X without it, we by default do not deduplicate data that would rely on runtime fallbacking.

We do deduplicate data; we store essentially just a pointer for these extra locales. It does mean that lookup is slower if we use binary search. This can be mitigated by shipping locales as separate language packs or by adopting ZeroHashMap for locale lookup (#2579).
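To make the "just a pointer" point concrete, here is a minimal sketch of a key table in which several locales share one payload slot. The `KeyTable` type and its layout are hypothetical and are not ICU4X's actual blob format; the sketch only shows that values can be deduplicated while the sorted locale list, which the binary search has to walk, stays at full length.

```rust
/// Hypothetical key table: many locales may point at the same payload slot.
struct KeyTable<'a> {
    /// (locale, payload index), sorted by locale so it can be binary searched.
    keys: Vec<(&'a str, usize)>,
    /// Deduplicated payloads.
    payloads: Vec<&'a str>,
}

impl<'a> KeyTable<'a> {
    /// Exact-match lookup; cost grows with the number of locale keys,
    /// not with the number of distinct payloads.
    fn get(&self, locale: &str) -> Option<&'a str> {
        let i = self
            .keys
            .binary_search_by(|(k, _)| (*k).cmp(locale))
            .ok()?;
        Some(self.payloads[self.keys[i].1])
    }
}
```

In this layout, removing a redundant locale shrinks `keys` but leaves `payloads` unchanged, which is why the number of locale entries is what matters for lookup cost.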

> I believe we can (in a true Rust fashion!) resolve this dichotomy by introducing The Third Way between no fallback and full fallback. I dubbed it naive fallback.

> Naive fallback works only one way - minimizing tags, and contains a very short list of exceptions. …

This is exactly what our current algorithm does when you run it in no-data mode. You add data for it to handle scripts and parent locales correctly.

> 5.2 If not, remove script.

We don't remove script without language because that's a well documented footgun. The only exception is for collation data.

> This will cater to exceptions in sr and zh, but not much more. For everything else it will just cut off from right to left and eventually fallback on und. …

Again, this is what we already do. sr and zh are handled in a data-driven pre-processing step. It uses a subset of likely subtags data, and I want to make it such that if your app is already carrying likely subtags, we don't even ship that fallback data at all.

I guess what I'm missing is, what are you suggesting that your "naive fallback" is improving on? Your wish list is basically what I've already implemented. You can see full details in flexible vertical fallback (which I'm pretty sure you've reviewed before). Note that the algorithm itself is deceptively simple. It boils down to naive subtag removal, but with additional support for extension keywords.

sffc · 2022-09-30T02:13:48Z

Another point on this topic. I think there have traditionally been two ways of looking at locales in software, and ICU4X adds a third:

1. Ship all available locales all the time
2. Ship a minimal set of locales based on product needs
3. (New with ICU4X) Ship a core set of locales with more loaded on demand

For case (1), these clients are already eating a large cost by shipping all locales, so shipping a little extra code and data to help them with fallback is a small price percentage-wise. I hear you about the performance hit, but there are ways to solve this without changing the fallback algorithm.

For case (2), fallbacking should be pre-computed in datagen based on the desired set of locales such that we don't need to ship any fallbacking code or data, not even a "naive" version.

For case (3), we would likely employ a hybrid approach where desired locales are pre-computed in their respective language packs, but clients can opt in to the runtime locale fallback in order to get the full behavior of case (1).

zbraniecki · 2022-09-30T02:58:56Z

> just a pointer for these extra locales.

My claim is that storing 500 locale strings when only 100 are needed takes up a substantial portion of the postcard payload, and it adds cost for the component constructor, which has to select from them.

My hope is that we can do better. If you think that the fallback is not costly, I'd love to see the size of the postcard files when we use the full fallback to deduplicate and reduce the number of keys to minimum. Can we do that now? If so, how?

zbraniecki · 2022-09-30T03:14:01Z

> I guess what I'm missing is, what are you suggesting that your "naive fallback" is improving on?

It's improving on no key deduplication which is the default mode of datagen.

In case of decimal/symbols@1 we are currently carrying around 684 keys, while ideally we would only carry 49. We can't do that because we need to annotate fallback, but among those 684 keys, we have only 87 language keys. I assume in almost all cases more specific locales can fall back onto the language itself, and many can fall back onto und, so I assume that with a fallback we should be able to carry only 60-70 keys rather than 684.

That's a substantial payload decrease, a memory cost reduction and, I suspect, a constructor perf win.

sffc · 2022-09-30T03:36:42Z

Yeah, the empty pointers are a significant source of data size and especially lookup speed issues in certain keys including number format and date format. This is a known issue in datagen. The empty pointers should be calculated and removed at datagen time, and a subset may be added back if compile-time fallback resolution is requested.

> It's improving on no key deduplication which is the default mode of datagen. … In case of decimal/symbols@1 we are currently carrying around 684 keys, while ideally we would only carry 49. We can't do that because we need to annotate fallback, but among those 684 keys, we have only 87 language keys.

Nit: Let's be careful on language here. decimal/symbols@1 is a key, there are currently 684 locales for that key, but we should be able to prune that to 87 locales if we prune locales based on fallback at datagen time.
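Pruning locales based on fallback at datagen time could look roughly like the sketch below. It is an assumption-laden illustration, not the datagen implementation: `parent` uses plain right-to-left subtag removal ending in und, payloads are strings, and the helper names `parent`, `inherited`, and `prune` are made up for the example. A locale is dropped when its nearest existing ancestor already carries the same payload, so runtime fallback with the same parent rule would reach identical data.

```rust
use std::collections::BTreeMap;

/// Naive parent rule: strip the last subtag; a bare language falls back to "und".
fn parent(locale: &str) -> Option<&str> {
    if locale == "und" {
        return None;
    }
    Some(match locale.rfind('-') {
        Some(i) => &locale[..i],
        None => "und",
    })
}

/// Payload the locale would resolve to if its own entry were removed.
fn inherited<'a>(data: &BTreeMap<&'a str, &'a str>, locale: &str) -> Option<&'a str> {
    let mut current = parent(locale);
    while let Some(p) = current {
        if let Some(&found) = data.get(p) {
            return Some(found);
        }
        current = parent(p);
    }
    None
}

/// Keeps only the locales whose payload differs from what fallback would find.
fn prune<'a>(data: &BTreeMap<&'a str, &'a str>) -> BTreeMap<&'a str, &'a str> {
    let mut kept = BTreeMap::new();
    for (&locale, &payload) in data {
        if inherited(data, locale) != Some(payload) {
            kept.insert(locale, payload);
        }
    }
    kept
}
```

The pruned set is only valid if the runtime uses the same fallback rule that `prune` assumed, which is why the fallbacker used at datagen time and the one used at runtime have to agree.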

sffc · 2022-09-30T03:43:31Z

I guess what I'm trying to say is, I'm thinking from the angle that the bug in datagen, the one where empty locale pointers are unnecessarily generated, which I've known has existed since basically the inception of datagen, will be fixed. Fixing it requires some care and design, but it is totally fixable in a dot release. My above claims about fallback being a fairly low cost relative to the cost of adding all locales are based on the world in which the bug in datagen is fixed.

zbraniecki · 2022-09-30T03:49:52Z

Gotcha. I think we agree.

I have a slight concern about what you're proposing as naive fallback since it seems to ignore the edge cases of multi-script languages. There are very few of them, and I'd be comfortable baking this data in since I expect it to be ~9 keys. But it seems like you're saying that your naive fallback doesn't recognize those 9 keys and produces wrong results for cases like sr-SR or zh-TW. I think that in my mind the naive fallback is still between full and none but is a bit more informed to accommodate for most common exceptions.

sffc · 2022-09-30T04:01:47Z

> I have a slight concern about what you're proposing as naive fallback since it seems to ignore the edge cases of multi-script languages. There are very few of them, and I'd be comfortable baking this data in since I expect it to be ~9 keys. But it seems like you're saying that your naive fallback doesn't recognize those 9 keys and produces wrong results for cases like sr-SR or zh-TW. I think that in my mind the naive fallback is still between full and none but is a bit more informed to accommodate for most common exceptions.

OK, sure, I could see a third LocaleFallbacker constructor that narrows the likely subtags data to only the multi-script languages. You can get rid of a few K's of data this way. It would cause locales like "de-Latn-LI" to fail (it should be able to find "de-LI"), so it should be used only in cases where you know that the script is only ever specified in multi-script languages like zh and sr. I guess an issue here is that languages may change from single-script to multi-script in future CLDR releases, as has happened in the past.

sffc · 2022-09-30T04:40:05Z

The nice thing is that this is all totally configurable since locale fallback is its own separate component in the data provider. If someone comes up with a better fallback mechanism, we can plug it right in.

sffc · 2022-09-30T05:02:25Z

By the way, I see 18 multi-script languages in CLDR 41. The majority are Cyrl/Latn/Arab hybrids (it's not just Serbian); the rest are scattered in Africa, the Middle East, India, and China.

icu4x/provider/testdata/data/json/fallback/likelysubtags@1/und.json, line 333 in 7f22b46: "lr2s": {

Azerbaijani: Latn (default), Cyrl (Russia), Arab (Iran)
Hausa: Latn (default), Arab (Sudan)
Kazakh: Cyrl (default), Arab (China)
Kurdish: Latn (default), Arab (Lebanon), Yezi (Georgia)
Kyrgyz: Cyrl (default), Arab (China), Latn (Turkey)
Manding: Latn (default), Nkoo (Guinea)
… more African, Central Asian, and Eastern European languages
Mundari: Beng (default), Deva (Nepal)
Chinese and Cantonese

sffc · 2023-06-15T18:19:07Z

Discuss with:

robertbastian · 2024-06-27T12:44:05Z

We have an approximation of this now. With #5114 you can pass a LocaleFallbacker to the exporter, and it will deduplicate according to that. You can pass LocaleFallbacker::new_without_data() to basically get naive fallback (although not quite as outlined by Zibi in the OP).

Manishearth mentioned this issue Sep 29, 2022

Datagen should be fallback-aware #2683

Closed

sffc mentioned this issue Oct 1, 2022

Speed up resource lookup and fallback at runtime #2699

Open

DerekNonGeneric mentioned this issue Oct 3, 2022

[bug: i18n] Serbian translation of VISION file doesn't specify script OpenINF/.github#84

Closed

sffc added the needs-approval (One or more stakeholders need to approve proposal) and discuss (Discuss at a future ICU4X-SC meeting) labels and removed the needs-approval label Oct 17, 2022

sffc added the C-data-infra (Component: provider, datagen, fallback, adapters) label Dec 22, 2022

sffc added the discuss-triaged (The stakeholders for this issue have been identified and it can be discussed out-of-band) label Jun 15, 2023

robertbastian assigned zbraniecki, sffc and robertbastian Feb 28, 2024

robertbastian removed the discuss label Jun 27, 2024

robertbastian added the needs-approval label and removed the discuss-triaged label Jun 27, 2024
