Store Unicode keywords in ResourceOptions #1750

Merged on Apr 19, 2022 (64 commits)

@sffc (Member) commented Mar 29, 2022

Part of #1109

This PR changes ResourceOptions to have private fields and store a Unicode Keywords object instead of a string. It centralizes the ResourceOptions constructors and leaves TODOs pointing to #1109.

@robertbastian robertbastian mentioned this pull request Mar 29, 2022
@sffc (Member, Author) commented Mar 31, 2022

At first I tried using a straight Locale in the ResourceOptions. It increases binary size a bit. Old:

-rwxr-xr-x 1 runner docker    41784 Mar 29 05:17 optim4.elf
-rwxr-xr-x 1 runner docker    31864 Mar 29 05:17 optim5.elf

New:

-rwxr-xr-x 1 sffc primarygroup    46416 Mar 28 22:52 optim4.elf
-rwxr-xr-x 1 sffc primarygroup    35936 Mar 28 22:53 optim5.elf

I got the size regression down to something I am okay with:

-rwxr-xr-x 1 sffc primarygroup    42496 Mar 30 22:37 optim4.elf
-rwxr-xr-x 1 sffc primarygroup    32448 Mar 30 22:37 optim5.elf

I did this by storing LanguageIdentifier alongside Keywords, which are the two sections of Locale that we care about. This is nice in part because we can easily reconstruct the Locale if we need it.
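
A minimal sketch of that layout, using simplified stand-in types rather than the real icu_locid ones. The field names, the `String`-based subtags, and the body of `into_locale` are illustrative assumptions, not the code in this PR:

```rust
// Simplified stand-ins for the icu_locid types; the real types use
// compact subtag representations rather than `String`.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct LanguageIdentifier {
    pub language: String,
    pub region: Option<String>,
}

/// Unicode extension keywords as (key, value) pairs, e.g. ("ca", "buddhist").
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Keywords(pub Vec<(String, String)>);

#[derive(Debug, Clone, PartialEq, Default)]
pub struct Locale {
    pub id: LanguageIdentifier,
    pub keywords: Keywords,
}

/// Stores only the two parts of a locale that data lookup cares about.
/// The fields are private so the layout can change later.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ResourceOptions {
    langid: LanguageIdentifier,
    keywords: Keywords,
}

impl ResourceOptions {
    pub fn new(langid: LanguageIdentifier, keywords: Keywords) -> Self {
        Self { langid, keywords }
    }

    /// Rebuilds a full locale by moving both parts out of `self`.
    pub fn into_locale(self) -> Locale {
        Locale {
            id: self.langid,
            keywords: self.keywords,
        }
    }
}
```

Because both parts are moved rather than copied, reconstructing the locale in this sketch needs no additional allocation.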

Looking at the assembly, I think the sizes will go down further once we're using cmp_bytes (blocked on #1756).


Update 2022-04-12: The size on main is now

-rwxr-xr-x 1 runner docker    42264 Apr 12 22:05 optim4.elf
-rwxr-xr-x 1 runner docker    31640 Apr 12 22:05 optim5.elf

and this PR changes it to

-rwxr-xr-x 1 runner docker    43216 Apr 13 03:44 optim4.elf
-rwxr-xr-x 1 runner docker    32328 Apr 13 03:44 optim5.elf
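
For reference, the deltas implied by these listings can be computed directly from the byte counts. The helper names `size_delta` and `describe` are made up for this illustration:

```rust
/// Size change from `old` to `new`: (bytes, percent of the old size).
pub fn size_delta(old: u64, new: u64) -> (i64, f64) {
    let diff = new as i64 - old as i64;
    (diff, diff as f64 * 100.0 / old as f64)
}

/// Formats one comparison line with explicit signs.
pub fn describe(label: &str, old: u64, new: u64) -> String {
    let (bytes, percent) = size_delta(old, new);
    format!("{label}: {bytes:+} bytes ({percent:+.2}%)")
}
```

With the 2022-04-12 numbers, `describe("optim4", 42264, 43216)` gives `optim4: +952 bytes (+2.25%)` and `describe("optim5", 31640, 32328)` gives `optim5: +688 bytes (+2.17%)`.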

@sffc changed the title from "Store variants in the locale" to "Store Unicode keywords in ResourceOptions" on Mar 31, 2022
@zbraniecki (Member) previously approved these changes Apr 13, 2022 and left a comment:

ah, right. Thank you! lgtm

@Manishearth (Member) commented Apr 13, 2022

@sffc What's the best way to review this? (Also, I haven't started yet, so if you're able to squash some of the smaller related commits to make it easier to review, that would be appreciated.)

@robertbastian (Member) left a comment:

lgtm overall, just some small things

components/datetime/src/calendar.rs (outdated, resolved)
provider/core/src/resource.rs (outdated, resolved)
provider/core/src/resource.rs (resolved)
provider/blob/src/blob_data_provider.rs (resolved)
provider/datagen/src/cldr/transform/datetime/mod.rs (outdated, resolved)
provider/datagen/src/cldr/transform/datetime/mod.rs (outdated, resolved)
///
/// If you have ownership over the `ResourceOptions`, use [`ResourceOptions::into_locale()`]
/// and then access the `id` field.
pub fn langid(&self) -> LanguageIdentifier {
Member:

Why doesn't this borrow?

Member:

yeah we should probably return an &LanguageIdentifier and let the user clone if they need it; usually common in Rust for the user to have control over that

Till now LanguageIdentifier was expected to be a cheap clone as much as possible so cloning was more okay, but that's also changing with our vertical fallback plans, so i do think we should avoid unnecessary clones.
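
The borrowing alternative suggested here might look like the sketch below. The types are simplified stand-ins, and `langid_ref` is a hypothetical name chosen for the example; it is not the signature in this PR.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct LanguageIdentifier {
    pub language: String,
    // Heap-allocated in this stand-in, so cloning is not free.
    pub variants: Vec<String>,
}

pub struct ResourceOptions {
    langid: LanguageIdentifier,
}

impl ResourceOptions {
    pub fn new(langid: LanguageIdentifier) -> Self {
        Self { langid }
    }

    /// Borrowing accessor: nothing is cloned unless the caller asks for it.
    pub fn langid_ref(&self) -> &LanguageIdentifier {
        &self.langid
    }
}

/// A caller that only reads can work with the borrow.
pub fn is_undetermined(options: &ResourceOptions) -> bool {
    options.langid_ref().language == "und"
}

/// A caller that needs ownership clones explicitly.
pub fn owned_langid(options: &ResourceOptions) -> LanguageIdentifier {
    options.langid_ref().clone()
}
```

The trade-off is that a borrowing signature requires a `LanguageIdentifier` to exist somewhere inside `ResourceOptions`.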

Member, Author:

(interpreting this as a Question)

Mainly because I wanted to stop exposing the internals of ResourceOptions. I've introduced the following functions:

  1. This method, which constructs a new LanguageIdentifier, which usually doesn't allocate
  2. Direct accessors for the language, script, and region subtags, which never allocate
  3. locale(self) which moves out of self without allocating memory and without exposing the internals directly

I'm willing to discuss this and explore alternatives if you have suggestions.

Member, Author:

@Manishearth

You made another comment but it's not showing up in this thread.

The main reason I don't want to return &LanguageIdentifier is that I don't want to expose the internals of ResourceOptions. For example, we may consider a change that stores the subtags flat with a single variant, or something else that is hyper-optimized for vertical fallback. I expect that most things users would need to do with ResourceOptions should be methods directly on it. It's not the business of the user of this class that we're able to return a &LanguageIdentifier.
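
To make the encapsulation argument concrete, here is a sketch of the kind of flat layout mentioned above. Everything in it is hypothetical: the field set, the `String`-based subtags, and the method bodies are assumptions for illustration, with `get_langid` borrowing the name this PR later adopts.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct LanguageIdentifier {
    pub language: String,
    pub script: Option<String>,
    pub region: Option<String>,
    pub variants: Vec<String>,
}

/// Hypothetical flat layout: subtags stored inline, at most one variant.
/// No `LanguageIdentifier` lives inside this struct, so a method returning
/// `&LanguageIdentifier` could not be implemented for it.
pub struct ResourceOptions {
    language: String,
    script: Option<String>,
    region: Option<String>,
    variant: Option<String>,
}

impl ResourceOptions {
    pub fn new(
        language: &str,
        script: Option<&str>,
        region: Option<&str>,
        variant: Option<&str>,
    ) -> Self {
        Self {
            language: language.to_string(),
            script: script.map(str::to_string),
            region: region.map(str::to_string),
            variant: variant.map(str::to_string),
        }
    }

    // Direct subtag accessors work with any internal layout.
    pub fn language(&self) -> &str {
        &self.language
    }
    pub fn script(&self) -> Option<&str> {
        self.script.as_deref()
    }
    pub fn region(&self) -> Option<&str> {
        self.region.as_deref()
    }

    /// Builds a fresh `LanguageIdentifier` on request. The `String` stand-ins
    /// allocate here; compact subtag types would usually avoid that.
    pub fn get_langid(&self) -> LanguageIdentifier {
        LanguageIdentifier {
            language: self.language.clone(),
            script: self.script.clone(),
            region: self.region.clone(),
            variants: self.variant.iter().cloned().collect(),
        }
    }
}
```

Callers written against the by-value `get_langid` and the subtag accessors would keep compiling if the struct switched between this layout and one that embeds a `LanguageIdentifier`.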

provider/datagen/src/cldr/transform/datetime/mod.rs (outdated, resolved)
@sffc (Member, Author) commented Apr 13, 2022

Review notes for @Manishearth:

  1. Start with resource.rs; this contains the actual data model changes
  2. You might look next at the changes to BlobDataProvider and FsDataProvider, and note the changes in the filenames to the JSON files
  3. Most of the rest of the changes are minor, with a few things to point out:
    • In DateTimeFormat, I set the Calendar extension into the locale early, and remove it on the data requests that don't need it (I left TODOs for vertical fallback to do that part more automatically)
    • I fixed a bug in locale filtering in datagen where it was filtering out week data since it has language "und"
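
The exact datagen fix is not shown in this thread. As an illustration of that class of bug, the sketch below filters available locales against a requested language list while always keeping `und`, so that data keyed on the undetermined language is not dropped. The functions `language_of`, `keep_locale`, and `filter_locales` and the string-based tags are assumptions for the example, not datagen's real API.

```rust
/// Extracts the language subtag from a simple tag such as "und-US".
pub fn language_of(tag: &str) -> &str {
    tag.split('-').next().unwrap_or(tag)
}

/// Returns true if data for `language` should survive locale filtering.
/// `und` is always kept so language-independent data is not filtered out.
pub fn keep_locale(language: &str, requested: &[&str]) -> bool {
    language == "und" || requested.contains(&language)
}

/// Keeps the available tags whose language passes `keep_locale`.
pub fn filter_locales<'a>(available: &[&'a str], requested: &[&str]) -> Vec<&'a str> {
    available
        .iter()
        .copied()
        .filter(|tag| keep_locale(language_of(tag), requested))
        .collect()
}
```

A filter that only checked `requested.contains(&language)` would discard every `und`-based entry unless the caller happened to request `und` explicitly.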

@Manishearth (Member) previously approved these changes Apr 15, 2022 and left a comment:

praise: i really like this change!

provider/core/src/resource.rs (resolved)
}

/// Returns the [`Locale`] for this [`ResourceOptions`].
///
Member:

nit: doc that this clones the langid and perhaps suggest folks use the proposed-non-cloning .langid() if they just need a langid

Member, Author:

updated the docs.

@sffc (Member, Author), Apr 19, 2022:

Actually, I updated the docs, and in the process removed this function. There were not many call sites.

provider/datagen/src/cldr/transform/datetime/mod.rs (outdated, resolved)
provider/datagen/src/cldr/transform/datetime/mod.rs (outdated, resolved)
provider/datagen/src/cldr/transform/datetime/mod.rs (outdated, resolved)
provider/datagen/src/cldr/transform/datetime/mod.rs (outdated, resolved)
components/datetime/src/calendar.rs (resolved)
@robertbastian previously approved these changes Apr 16, 2022
@sffc sffc dismissed stale reviews from robertbastian and Manishearth via 33c384c April 19, 2022 01:16
@sffc sffc requested a review from nciric as a code owner April 19, 2022 06:22
@sffc sffc removed the request for review from nciric April 19, 2022 06:25
@sffc (Member, Author) commented Apr 19, 2022

Changes in the latest commits:

  • 2943e21 = fix doctest
  • 33c384c = rename langid() to get_langid() to emphasize that there is potentially work going on; also delete the unused locale() method
  • 540d0c9 = fix provider_adapters (doctest and a bug uncovered by the doctest)
  • 92156d5 = add doctests to locid crate for "true" values (this is unrelated to this PR and could be excluded if you prefer)
  • a1fe886 = rename unicode_ext() to get_unicode_ext() and add matches_unicode_ext()
  • ad4dff4 = add coverage for the functions added in the previous commit

If any of these changes are controversial, I may revert those commits in order to get this PR landed.

/// &unicode_ext_value!("coptic"),
/// ));
/// ```
pub fn matches_unicode_ext(&self, key: &unicode_ext::Key, value: &unicode_ext::Value) -> bool {
Member:

nit: do we not usually pass tinystrs/subtags by value?

Member, Author:

I think the accepted convention is to pass Copy types by value; but there are exceptions both ways. At play here:

  1. In this case, Key but not Value is Copy. I could pass one by value and the other by reference, or both by reference.
  2. Many map interfaces pass by reference all the time, even for Copy types.

I don't have a strong opinion.
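
The two signature options can be compared side by side in a small sketch. The `Key` and `Value` types below are simplified stand-ins (a `Copy` two-byte key and a non-`Copy` value), the `BTreeMap` storage is an assumption, and `matches_unicode_ext_by_value` is a hypothetical name for the alternative:

```rust
use std::collections::BTreeMap;

/// Two-letter extension key; small and `Copy`, like a fixed-size subtag.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Key(pub [u8; 2]);

/// Extension value; may hold several subtags, so it is `Clone` but not `Copy`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Value(pub Vec<String>);

pub struct ResourceOptions {
    keywords: BTreeMap<Key, Value>,
}

impl ResourceOptions {
    pub fn new(keywords: BTreeMap<Key, Value>) -> Self {
        Self { keywords }
    }

    /// Both arguments by reference, mirroring map lookups such as `BTreeMap::get`.
    pub fn matches_unicode_ext(&self, key: &Key, value: &Value) -> bool {
        self.keywords.get(key) == Some(value)
    }

    /// Alternative: the `Copy` key by value, the non-`Copy` value by reference.
    pub fn matches_unicode_ext_by_value(&self, key: Key, value: &Value) -> bool {
        self.keywords.get(&key) == Some(value)
    }
}
```

Both forms behave identically; the difference is only whether call sites write `&key` or `key`, and whether the two parameters look uniform.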
