Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datagen should be fallback-aware #2683

Closed
Manishearth opened this issue Sep 29, 2022 · 15 comments
Closed

Datagen should be fallback-aware #2683

Manishearth opened this issue Sep 29, 2022 · 15 comments
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-large Size: A few weeks (larger feature, major refactoring) T-core Type: Required functionality

Comments

@Manishearth
Copy link
Member

Manishearth commented Sep 29, 2022

Currently, datagen will generate data for each key that is requested of it. In --all-locales1 mode, this means that for, e.g. decimal/symbols@1, there will identical entries for all of these locales:

image

even though they all fall back to sr. BlobDataProvider and BakedDataProvider deduplicate, so this doesn't lead to data duplication, but it does mean there are nine entries in the lookup array when there only needs to be one (when fallback is enabled).

The situation is worse for en-* where for some data keys there are probably up to 105 duplicate entries! This will only get worse as CLDR adds locales.

From @zbraniecki's measurements this is actually causing a nontrivial cost in our constructors when run with full-data.

It should be possible to tell datagen which fallback algorithm is used, something like --fallback-algorithm={none, naive, full} (see #2686 about naïve fallback). This would deduplicate the keys based on fallback algorithm; e.g. all the sr number formatting keys would collapse to one sr entry. Something similar would occur for en and we'd probably end up with en-IN and some other locales with unique entries along with the majority collapsed to en.

Furthermore, --all-locales (or --all-cldr-locales) should likely force the user to select a fallback algorithm, as there are footguns pointed at both your feet here if we just pick a default:

  • If you generate data without fallbacking you will get a massive lookup vector
  • If you generate data with fallbacking but haven't enabled fallbacking in your code, stuff won't work at runtime.

To motivate why people wouldn't always want to run with --fallback-algorithm=full, the full fallback algorithm is itself expensive in data, CPU time, and codesize. Embedded platforms are likely not going to be working with a wide set of locales simultaneously, rather they will often just need to hotload data for one locale (and if settings change, they can request data for one different locale), so fallbacking isn't really important.

cc @zbraniecki @sffc

Footnotes

  1. Which should probably be called --all-cldr-locales for clarity

@Manishearth Manishearth added C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting labels Sep 29, 2022
@sffc
Copy link
Member

sffc commented Sep 29, 2022

#834

@Manishearth
Copy link
Member Author

@sffc do you think we should close one of these in favor of the other, or is this more a subset issue of #834? I do think this has a sketch of a more concrete plan to tackle a subpart of this.

@sffc
Copy link
Member

sffc commented Sep 30, 2022

@sffc do you think we should close one of these in favor of the other, or is this more a subset issue of #834? I do think this has a sketch of a more concrete plan to tackle a subpart of this.

The two issues are not exact duplicates of each other, so we should keep them both open, but there will probably be a single PR or sequence of PRs that will fix this and the other related issues.

@sffc sffc added this to the ICU4X 1.2 milestone Dec 16, 2022
@sffc sffc self-assigned this Dec 16, 2022
@sffc sffc added S-large Size: A few weeks (larger feature, major refactoring) T-core Type: Required functionality and removed discuss Discuss at a future ICU4X-SC meeting labels Dec 16, 2022
@sffc
Copy link
Member

sffc commented Dec 16, 2022

I'll start with "none" and "full" fallback modes.

@sffc sffc added discuss Discuss at a future ICU4X-SC meeting discuss-priority Discuss at the next ICU4X meeting and removed discuss Discuss at a future ICU4X-SC meeting labels Apr 18, 2023
@sffc
Copy link
Member

sffc commented Apr 18, 2023

Discuss: whether datagen was run in "none" or "full" fallback mode impacts the runtime data provider requirements. If "none" mode, we need full fallback at runtime. In "full" mode, we only support the exact language identifiers that were present at datagen time, and we still need basic fallback that handles extension keywords. Failure to do this will cause data lookups to fail when they should otherwise succeed. How do we enforce these invariants?

@sffc
Copy link
Member

sffc commented Apr 20, 2023

Discussion:

  • The three runtime providers we support (blob, fs, and unstable) should all add the correct type of fallbacking on their own. It should be driven by metadata in the blob/fs or in the generated code for databake. Could include no_fallback constructors on those three types.
  • We could add a flag in DataRequestMetadata to signal whether fallback is taking place here or elsewhere in the tree.

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Apr 20, 2023
@robertbastian
Copy link
Member

This is blocking baked data, as we should be doing fallback, and tests break if we don't.

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Jun 21, 2023
@sffc
Copy link
Member

sffc commented Jun 21, 2023

Discuss the exact names and behaviors of the datagen fallback modes. See #3487 (comment)

@sffc
Copy link
Member

sffc commented Jun 22, 2023

Discuss with:

@sffc sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label Jun 22, 2023
@sffc
Copy link
Member

sffc commented Jul 7, 2023

Writing down some thoughts on this design.

There are two main categories of fallback-aware datagen: runtime fallback and gentime fallback, which we call Runtime and Expand modes. They have very different semantics.

Runtime Fallback Mode

The regular Runtime mode will ship "batteries included" in databake; it doesn't support Postcard or FS. The Manual Runtime mode works with all providers but requires a LocaleFallbackProvider to be hooked up manually at runtime.

The behavior when using a set-based LocaleInclude is fairly clear. For all languages in the set, we also include regional variants and extension keywords. We deduplicate based on blobs reachable with locale fallback enabled.

If using an explicit list of locales instead of a predefined set, there are a few modes:

  1. Include regional variants and extension keywords that are descendants of a listed entry in the explicit list. For example, if en is in the list, it implies that all locales with en in their inheritance chain will also be included. Open question: does this include cross-language fallback like hi-Latn?
  2. Do not include regional variants unless they are explicitly listed, but do include extension keywords like numbering system preferences
  3. Do not include regional variants or extension keywords unless they are explicitly listed (see below)

Expanded Fallback Mode

This mode is not supported with a predefined locale set; you must use an explicit list of locales.

The main open question here seems to be about how to handle extension keywords. I would suggest the following model:

  1. Do not include extension keywords unless explicitly listed
  2. Add a match-like syntax such as the following:
    • ar-EG-u-nu-latn means to include that specific numbering system entry
    • ar-EG-u-nu-* means to include all numbering system entries for that langid
    • *-u-nu-latn means to include that numbering system for all langids that are otherwise being included
    • *-u-nu-* means to include all numbering system entries for all langids that are otherwise being included
    • ar-EG-u-* means to include all extensions for ar-EG
    • *-u-* means to include all extensions for all langids that are otherwise being included

The *-u-... could instead be mul-u-... to be more like BCP-47.

The extension keyword match things could be in the LocaleInclude::Explicit or in a separate option. I might lean toward a separate option at this point because:

  1. LocaleInclude::Explicit uses LanguageIdentifier which has some benefits such as serializability, and we utilize these types heavily
  2. We can be more flexible with the syntax; don't need to fit the mold of Locale or LanguageIdentifier

For collation, we are still only allowed to select from the set of collations that are made available from the collations option in datagen. For example, in order to include gb2312, it is necessary to enable it in both source.options.collations and the *-u-co-gb2312 locale filter.

CC @robertbastian

@sffc
Copy link
Member

sffc commented Jul 7, 2023

Concrete proposal for the CLI that I think covers everything:

  • --fallback takes the following options
    • runtime (default for baked)
    • runtime-manual
    • expand
    • pass-through (default for FS and blob)
  • --locales needs to resolve to an explicit list in expand mode
  • --extensions takes the following options
    • all (default)
    • none
    • list of pattern matches as described in my previous comment
  • --regions takes the following options
    • all (default for all fallback modes except expand)
    • none (default for expand)
    • list of regions that will be applied to all languages

@Manishearth
Copy link
Member Author

I really like this proposal. May be worth bikeshedding the name of the first flag (--for-fallback-strategy?) but i think it's fine.

@sffc
Copy link
Member

sffc commented Jul 8, 2023

A few examples:

  1. --locales en en-GB de-CH --fallback runtime
    • All regional variants inheriting from en, en-GB, or de-CH are included
    • All extensions are included
    • Entries are fallback-deduplicated
  2. --locales en en-GB de-CH --fallback pass-through
    • Same as above except entries are not fallback-deduplicated
    • Observation: an explicit locale list with fallback pass-through should probably also expand the locales in the explicit list if they aren't already expanded
  3. --locales en en-GB de-CH --regions none --fallback expand
    • Only the three listed locales are included; no other language IDs
    • Extensions for those specific langids are included
  4. --locales modern --regions US GB DE --fallback runtime
    • All modern languages (and language-scripts) are included
    • Only the US, GB, and DE regional variants are included
    • Extensions are included
  5. --locales en de --regions US DE --fallback expand
    • The full permutation en-US, en-DE, de-US, and de-DE are included and expanded
    • Extensions relevant for those locales are included

@sffc
Copy link
Member

sffc commented Jul 8, 2023

Revised proposal for the MVP, to be extended later as client needs arise:

  1. --fallback takes the following options
    • runtime (default for baked)
    • runtime-manual
    • expand
    • hybrid (default for FS/blob)
  2. --extensions takes the following options
    • all (default)
    • none
    • additional options can be added later
  3. --regions is not yet exposed through the CLI or API but takes the following behavior:
    • all (default for all fallback modes except expand)
    • none (default for expand)

With the following interaction between fallback and locales:

Fallback LocaleInclude Behavior
Runtime* Pre-defined Set Include everything in the set with deduplication
Runtime* Explicit Include ancestors and descendants of the explicit locales with deduplication
Expand Pre-defined Set NOT ALLOWED (doesn't make sense)
Expand Explicit Include exactly the locales specified
Hybrid Pre-defined Set Include everything in the set
Hybrid Explicit Include ancestors and descendants of the explicit locales and also ensure that the explicit locales have entries

* or Runtime Manual

Note that I renamed "Pass-Through" to "Hybrid". I think I would like to rename "Expand" to "Computed" or "Pre-Computed" or similar but we can bikeshed that.

@sffc
Copy link
Member

sffc commented Jul 10, 2023

Discussion:

  • Rename "Expand" to "Preresolved" (--fallback preresolved)
  • ExtensionsInclude:
    • ::Recommended (default)
    • ::Custom(vec!["!*-u-co-*", "und-u-co-search*", "!*-u-nu-*"])
      • Include/Exclude list similar to ICU buildtool
      • Note: All can be achieved with "" and None with "!"
    • It would be very nice to land this option in 1.3 but it's not a hard block

LGTM: @robertbastian @sffc @Manishearth @mosuem

@sffc sffc removed discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Jul 10, 2023
@sffc sffc closed this as completed Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-large Size: A few weeks (larger feature, major refactoring) T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants