Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default constructors with full data #2945

Closed
sffc opened this issue Jan 1, 2023 · 22 comments
Closed

Default constructors with full data #2945

sffc opened this issue Jan 1, 2023 · 22 comments
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-epic Size: Major project (create smaller child issues) U-gecko User: Gecko

Comments

@sffc
Copy link
Member

sffc commented Jan 1, 2023

We have built data providers as a first-class feature in ICU4X. We currently tutor clients on how to build their data file and detail all the knobs at their disposal, which is essential to ICU4X's mission. Looking forward to 2.0 and beyond, though, we want to guarantee ICU4X's success when it comes to ease of use and adoption.

Eventually I think we want to work toward something like

#[cfg(feature = "default_data")]
pub fn try_new(locale: &DataLocale, options: Options) -> Result<Self, Error> { ... }

I think we can make this work with an optional dependency icu_provider_default crate that exports declarative macros, one per key, which are then invoked in the context of the component crate. This way we can avoid the cyclic dependency issue of baked data needing to import everyone's provider modules.

There are numerous questions to be answered, including:

  1. Is this a path we want to pursue, or should we double down with the always-pluggable data provider paradigm with no alternative?
  2. What is the update cadence of CLDR data in the icu_provider_default crate?
  3. Should we let users control the set of locales, and if so, how?
  4. Should we make the global provider be pluggable, and if so, how?

Also see:

@sffc sffc added C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting labels Jan 1, 2023
@robertbastian
Copy link
Member

I think we can make this work with an optional dependency icu_provider_default crate that exports declarative macros, one per key

I haven't figured out a way to export the macros yet. Currently they have to be invoked locally, as they need the data's module structure. Using include! instead of mod doesn't seem to work, because Rust gets confused around the relative paths (even though it should work the same as modules). One solution though would be to use a single file, which wouldn't be too bad for single keys (but I'd rather have a general solution).

What is the update cadence of CLDR data in the icu_provider_default crate?

What if we make icu_provider_default use a build script, then we don't have to ship and update data.

Should we let users control the set of locales, and if so, how?

The script could be configurable with a config file for locales, fallback, etc.

@sffc
Copy link
Member Author

sffc commented Jan 5, 2023

If we check-in the built data (rather than build.rs), we could consider icu_provider_default to be the source of truth for pre-built ICU4X data, and use it as a source for icu4x-data-filter as @Manishearth previously proposed.

Still not sure about how to handle CLDR versions. A few choices:

  1. Don't allow CLDR version configuration. We will always ship the latest data. Running cargo update could cause your CLDR version to change. If you care about data stability, don't use icu_provider_default.
  2. Allow a feature to be set, like "cldr-42", which disables the latest data and enables CLDR 42 data.
  3. Use version string shenanigans, like pre-release tags: "1.0.0-cldr42". However, I'm not sure if this will give us the behavior we're looking for. https://doc.rust-lang.org/cargo/reference/resolver.html#pre-releases
  4. Publish a new crate for each CLDR version: icu_provider_default_42

Related: #2715

@Manishearth
Copy link
Member

Currently they have to be invoked locally, as they need the data's module structure

use crate as icu_provider; + databake?

@Manishearth
Copy link
Member

  • Allow a feature to be set, like "cldr-42", which disables the latest data and enables CLDR 42 data.

I like this. We get tons of control over the lifecycle of CLDR data in this crate. E.g. we could say that the crate will support up to 3 CLDR versions, and by default will be the latest (So using one of the cldr-xx features is unstable, but unstable in a very well defined way. Ideally we use deprecation warnings if possible; unsure if it is)

Plus, it makes the CLDR version something that the final application can select: if there are intermediate crates using ICU4X default data, their choices do not artificially constrain the final application

Is this a path we want to pursue, or should we double down with the always-pluggable data provider paradigm with no alternative?

I think this is a good idea, yeah.

Should we let users control the set of locales, and if so, how?

We could use features. Perhaps some major defaults of "top X locales" and "all locales", plus individual features for each locale.

Unsure.

Should we make the global provider be pluggable, and if so, how?

It's ... possible! But I think this may have a problem with multiple crates: if two crates in the same deptree that depend on ICU4X default to this kind of thing, they'll end up stepping over each others' toes. I don't really like that.

I think the way to make it pluggable is to have a feature that makes this use a GlobalPluggableProvider, which looks for a particularly defined C function to get the provider (e.g. your code must define extern "C" fn plug_provider() -> *mut ICU4XPluggableDataProvider or something).

@sffc
Copy link
Member Author

sffc commented Jan 6, 2023

Hmm. So the icu_provider_default crate can implement the extern "C" fn plug_provider function, and if a client wants to override the data, they can enable a feature on icu_provider_default that essentially disables that crate, and then they can implement their own implementation of the plug_provider function?

It still needs to work with DCE.

@sffc
Copy link
Member Author

sffc commented Jan 6, 2023

For ICU4X docs, it would be nice if we could use the try_new constructors but have them point to testdata in tree.

@Manishearth
Copy link
Member

Hmm. So the icu_provider_default crate can implement the extern "C" fn plug_provider function, and if a client wants to override the data, they can enable a feature on icu_provider_default that essentially disables that crate, and then they can implement their own implementation of the plug_provider function?

Yep

It still needs to work with DCE.

It would have to go through Any or Buffer provider, but you'd still get DCE normally, yes? It's no different from using a blob provider normally.

@Manishearth
Copy link
Member

Manishearth commented Jan 6, 2023

For ICU4X docs, it would be nice if we could use the try_new constructors but have them point to testdata in tree.

That feels pretty brittle IMO (with baked data), we could generate testdata into the default crate, but then we won't be testing actual "normal" data loading paths, it would be a weird special thing we do. Basically, we can't pull in real baked testdata due to the circular dependency problem.

We could do plug_provider with buffer/any data I guess?

@sffc
Copy link
Member Author

sffc commented Jan 6, 2023

By DCE, I mean that individual data keys worth of data should be garbage collected.

Actually I don't quite see how plug_provider is compatible with the macro-based approach suggested in the OP?

@Manishearth
Copy link
Member

By DCE, I mean that individual data keys worth of data should be garbage collected.

Oh, you wouldn't get that, no.

Actually I don't quite see how plug_provider is compatible with the macro-based approach suggested in the OP?

Every macro would just wrap the same provider in that case, a provider that calls the C function.

@sffc
Copy link
Member Author

sffc commented Jan 6, 2023

We could have an extern C function for each individual data key. 🤔

@sffc
Copy link
Member Author

sffc commented Jan 14, 2023

Does this work?

#[cfg(feature = "custom_global_data")]
include!(env!("ICU4X_CUSTOM_GLOBAL_DATA_PATH"));

#[cfg(not(feature = "custom_global_data"))]
include!("default_global_data/mod.rs");

If you want custom global data you need to enable the icu_provider_global/custom_global_data feature and then you are responsible for setting the environment variable.

@sffc
Copy link
Member Author

sffc commented Jan 14, 2023

#2992

@Manishearth
Copy link
Member

Yeah I like that

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label May 25, 2023
@robertbastian robertbastian self-assigned this Jun 7, 2023
@robertbastian robertbastian added this to the 1.3 Blocking ⟨P1⟩ milestone Jun 7, 2023
@sffc
Copy link
Member Author

sffc commented Jun 8, 2023

  • @robertbastian - I think the globaldata should be a default feature, and the examples should use globaldata.
  • @sffc - My concern is that the explicit provider arguments are a key value proposition of ICU4X; I don't want the explicit arguments to bitrot and become less discoverable
  • @skius - The expicit data are for power users, and we have lots of links, including the constructor overview link, that make the explicit data provider argument discoverable
  • @sffc - For recursive constructors, if they delegate directly to _unstable, we don't have coverage problems. But if we do the more correct thing of keeping a separate tree for any, unstable, buffer, and globaldata, then we start to have coverage problems.
  • @robertbastian - This only matters for non-leaf components, which is about 3. I think we can cover the rest with manual testing.
  • @younies - The docs should give the constructors that are most useful to the user.
  • @echeran - I feel like the docs should have an example that shows an explicit data provider, since that's a value proposition, but if most users' use cases don't need it, it's okay for the docs to be "easier". "Easy" and "simple" are distinct things. The globaldata constructors make it "easy", but it is also "complex" in the sense that they don't have control over where the data comes from and need to think about the lifecycle.
  • @younies - I think it's important for the user to see the result, but leave a note that the provider can be replaced. When people get to that level, they are professionals, and they can find the feature.
  • @skius - From the perspective of a casual user, I'd be more scared/turned-off by seeing something that's complex to set up for a small project. But it should have a big note about how explicit data is an option with icu4x.
  • @echeran - I think it's important not to get caught up in what's "easy"; often we can think of all-in-one as "easy", but it actually forces you to be stuck with what you get.
  • @sffc - If we named the constructor _with_latest_data or _with_auto_data, then it signals that you are making a decision about the provenance of your data source, rather than keeping it as an implicit choice.
  • @skius - I don't like _with_latest_data because it sounds like it downloads things
  • @echeran - Another way to do this is to put away the globaldata constructors into a new module

Conclusion:

  • Use globaldata constructors in docs tests
  • Name the globaldata constructors with a suffix that clearly signals the provenance of the data, such as _with_auto_data
  • All recursive constructors should be covered with their own docs tests, but these could be auto-generated with macros

LGTM: @echeran @sffc (@skius) @younies

@Manishearth
Copy link
Member

LGTM to me as well. Occurs to me that the long term name for this might be "autodata", since @robertbastian doesn't like "globaldata" as much.

(we're not using "globaldata" publicly so far, this would mostly just be an internal name)

@robertbastian
Copy link
Member

I missed the end of this discussion and don't agree with the conclusion

Name the globaldata constructors with a suffix that clearly signals the provenance of the data, such as _with_auto_data

If we named the constructor _with_latest_data or _with_auto_data, then it signals that you are making a decision about the provenance of your data source, rather than keeping it as an implicit choice.

I think forcing users to make this explicit choice is bad, and we will keep seeing users struggle. Understanding what with_auto_data means requires

  • The user understands that i18n algorithms need data
  • The user understands that i18n data changes and is ideally managed manually
  • The user understands what the options for manual data management in ICU4X are

This is a high barrier of entry. There should be an easy path that doesn't require this level of knowledge, and ICU4X should reach those users. A user should be able to use the latest version of ICU4X to do i18n without needing to learn what the word "provider" means.

Also I really don't like the long names, *_with_auto_data is even longer than *_unstable, which we currently use in docs.

And, extending this to const variables like const EMOJI_COMPONENT: CodePointSetDataBorrowed = ...; would mean const EMOJI_COMPONENT_WITH_AUTO_DATA, which I don't think is great naming.

@sffc
Copy link
Member Author

sffc commented Jun 14, 2023

Bikeshed for constructor names: #3536

@sffc
Copy link
Member Author

sffc commented Jun 14, 2023

I'll note that the three bullet points (especially the first two) go together; I'm more comfortable using the new constructors in docs tests if they signal data provenance.

@sffc
Copy link
Member Author

sffc commented Jun 15, 2023

All recursive constructors should be covered with their own docs tests, but these could be auto-generated with macros

I was thinking that perhaps a more scalable way to handle recursive constructors would be to do what LocaleCanonicalizer is doing: the lowest-level "core" constructor is one that contains data being loaded for this specific class, with all other classes passed in as arguments. All the other constructors simply call the appropriate constructor on child classes and then invoke the core constructor with the results.

@robertbastian
Copy link
Member

FFI API discussion:

  • Cannot do overloading at the moment, as Diplomat (in particular CPP1) doesn't really support it
  • Can add a new CompiledData variant to ICU4XDataProvider, and make constructors behave accordingly
  • With 2.0 we can revisit overloading and nicer function naming

LGTM: @robertbastian, @sffc, @eggrobin, @Manishearth

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-epic Size: Major project (create smaller child issues) U-gecko User: Gecko
Projects
None yet
Development

No branches or pull requests

3 participants