Make LanguageIdentifier and/or Locale compatible with [Var]ZeroVec #831

Closed
sffc opened this issue Jun 28, 2021 · 31 comments · Fixed by #1713
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) T-core Type: Required functionality

Comments

@sffc
Member

sffc commented Jun 28, 2021

We are going to have many situations where we want to make a zerovec of Locale or LanguageIdentifier objects.

The general solution is to serialize Locale or LanguageIdentifier to a string, and then use VarZeroVec as the data store. Alternatively, we could introduce the LSRV struct proposed and then abandoned in #243.

@sffc sffc added C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting T-techdebt Type: ICU4X code health and tech debt S-small Size: One afternoon (small bug fix or enhancement) labels Jun 28, 2021
@sffc
Member Author

sffc commented Jul 29, 2021

2021-07-29:

  • @zbraniecki - I think LSRV can be done on an as-needed basis
  • @Manishearth - It makes sense to do this, but it could be a lot of work

Conclusion: Keep this issue open and fix it when it is needed.

@sffc sffc added backlog help wanted Issue needs an assignee and removed discuss Discuss at a future ICU4X-SC meeting labels Jul 29, 2021
@sffc sffc added needs-approval One or more stakeholders need to approve proposal v1 labels Jan 18, 2022
@sffc
Member Author

sffc commented Jan 18, 2022

I have a proposal for how to do this. I will use Locale as an example, but the exact same principle also applies to LanguageIdentifier.

The VarULE type for Locale can be LocaleStr, an unsized type that wraps a str and guarantees that the string is a syntactically valid BCP-47 identifier. LocaleStr should be a friendly type to use, with helper methods to get the language subtag, region subtag, etc.

All ICU4X APIs that take an immutable Locale should take a &LocaleStr, including DataProvider. This avoids the need to stringify the locale into a ULE when loading data.

An optional extension, for which I would advocate strongly, is that Locale should contain a LocaleStr as a field, such that it can implement AsRef<LocaleStr> or even Deref<Target = LocaleStr>. All Locale mutation methods should also update the string. The data model for the LocaleStr can be one of:

  1. Box<LocaleStr>: simple but always allocated.
  2. Rc<LocaleStr>: enables cheap cloning.
  3. Cow<'static, LocaleStr>: enables zero-alloc for const locales like those returned by langid!()
  4. Cow<'a, LocaleStr>: maximum flexibility and zero-alloc, but adds a lifetime to Locale

To be clear, the whole premise of this post is that we want to store Locale and LanguageIdentifier in a VarZeroVec, and in order to do this, we need a cheap way to get the VarULE. Currently, we stringify the Locale or LanguageIdentifier whenever we need to use it as a VZV key, which is far from ideal. If we can have "pre-computed" strings float along with Locale or LanguageIdentifier objects, we can be a lot* more efficient.

* Based on preliminary observational data that the stringification of the resource path is the most expensive part of the data loading machinery outside of Serde.
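
To make the proposal more concrete, here is a minimal sketch of what an unsized LocaleStr wrapper and an owning Locale that derefs to it might look like. Everything here is an assumption for illustration: the method names (`try_from_str`, `as_str`, `language`, `parse`) are hypothetical, the validation is a deliberately simplified stand-in for real BCP-47 checking, and the owned type uses the "always allocated" storage option from the list above.

```rust
use std::ops::Deref;

/// Hypothetical unsized wrapper: a `str` that passed a syntactic check.
#[repr(transparent)]
pub struct LocaleStr(str);

impl LocaleStr {
    /// Simplified check (NOT full BCP-47): every '-'-separated subtag must be
    /// 1 to 8 ASCII alphanumeric characters.
    pub fn try_from_str(s: &str) -> Option<&LocaleStr> {
        let ok = s.split('-').all(|t| {
            (1..=8).contains(&t.len()) && t.bytes().all(|b| b.is_ascii_alphanumeric())
        });
        if ok {
            // SAFETY: `LocaleStr` is `repr(transparent)` over `str`, so both
            // reference types share the same layout.
            Some(unsafe { &*(s as *const str as *const LocaleStr) })
        } else {
            None
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }

    /// Helper getter: the first subtag is the language.
    pub fn language(&self) -> &str {
        self.0.split('-').next().unwrap_or("")
    }
}

/// Hypothetical owned counterpart that always allocates its string.
pub struct Locale {
    repr: Box<str>,
}

impl Locale {
    pub fn parse(s: &str) -> Option<Locale> {
        LocaleStr::try_from_str(s).map(|v| Locale { repr: v.as_str().into() })
    }
}

impl Deref for Locale {
    type Target = LocaleStr;

    fn deref(&self) -> &LocaleStr {
        // SAFETY: `repr` was validated in `parse` and is never mutated afterwards.
        unsafe { &*(&*self.repr as *const str as *const LocaleStr) }
    }
}
```

With this shape, any function that accepts `&LocaleStr` can be called with a `&Locale` through deref coercion, in the same way that `&String` is accepted where `&str` is expected.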

Needs approval from:

@Manishearth
Member

I think this is a good plan.

Perhaps Locale should contain a String that is interpreted as a LocaleStr for better mutation? As in, it contains a String field that is (unsafely) reffed as &LocaleStr on fetching

@zbraniecki
Member

I don't like this direction.

It seems like a very significant increase in complexity of the very fundamental type of the system. Every Locale will contain stringified version of itself at all times enabling potential inconsistency between fields and strings and requiring updating both at all times.

My read is that the direction you are putting us on is that esoteric internals of data provider machinery are reshaping fundamental types of ICU4x in ways that makes them complicated. That complication is only justified when the type is considered for data provider use.

I would like to make sure that Locale or Language identifier are optimal canonical types akin to String, Vec, Duration etc

I understand that there is a tension between the two and that we are seeking the balance but the proposal feels analogous to me as if stdlib proposed that String contains additional fields that are only necessary for when used in a std::collections which is sufficiently core to stdlib that may justify increased complexity and additional cost for other uses.

I'm wondering if we should accept that data provider is not a good candidate to use such primitives and we don't use String, Vec in it for a reason and we shouldn't use Locale in it.

Instead, Data Provider should use its own type, optimized for its own use, and aim at cheap conversion to "public" type.

Or, maybe we should do the reverse, follow what you suggested, and introduce a new type, meant for public use that is akin to the current Locale?

@Manishearth
Member

Manishearth commented Jan 19, 2022

It seems like a very significant increase in complexity of the very fundamental type of the system. Every Locale will contain stringified version of itself at all times enabling potential inconsistency between fields and strings and requiring updating both at all times.

Actually, my model was that locale would primarily be a wrapper around a string, and the other fields can exist as optimizations but need not be so.

As in, we should be looking at locales as strings first, potentially with parsed offsets stored in. Most use of the locale is just being passed around and tested for equality, which can happen via the string.

@zbraniecki
Member

As in, we should be looking at locales as strings first, potentially with parsed offsets stored in.

Ah, I think this is where we disagree. I see Locale as a data structure that can be stringified, you see it as a string with some guarantees.

I think there's a tradeoff to both approaches. I'm concerned that you're focused on your immediate use case and, as a result, biased in evaluating the optimal solution.

Most use of the locale is just being passed around and tested for equality, which can happen via the string.

I don't think this is accurate for all use cases. Efficient storage, cheap canonicalization, and low memory cost of allocation are others. TinyStr is 10k+ times more efficient on many of those operations than Strings are.

@Manishearth
Member

I'm concerned that you're focused on your immediate use case and in result biased in evaluating the optimal solution.

Eh, I wouldn't really consider this "my" use case, it's not something I've been thinking about that much. Overall I am okay with the approach proposed here but I'm not championing it.


In that case I suspect having a separate LocaleStr type that is used in data (and can be easily converted) is probably the way to go. This does still mean LocaleCanonicalizer would be using string locales in its data.

@zbraniecki
Member

Eh, I wouldn't really consider this "my" use case, it's not something I've been thinking about that much.

I did a mental shortcut here, I'm sorry. I meant "use case you are considering".

This does still mean LocaleCanonicalizer would be using string locales in its data.

I think the canonicalizer may want to handle both with separate pathways.

@sffc
Member Author

sffc commented Jan 20, 2022

Thanks for the replies. I'll reply point-by-point:

It seems like a very significant increase in complexity of the very fundamental type of the system. Every Locale will contain stringified version of itself at all times enabling potential inconsistency between fields and strings and requiring updating both at all times.

It's a new invariant that we need to enforce, yes, but I wouldn't call it a "significant increase in complexity". The simple implementation to enforce the new invariant (consistency between the stack fields and the LocaleStr pointer) is to simply re-generate the string whenever a mutation operation happens, and throw out the old string. It's basically just pre-computing and caching the stringification. I would not advocate for anything more complicated than that unless there is a clear need.
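
A minimal sketch of that "pre-compute and cache" idea, under stated assumptions: `CachedLocale` is a hypothetical, heavily simplified type (language and optional region only, stored as `String` rather than TinyStr), and the only mutator shown is `set_region`. The sketch reuses the string buffer instead of discarding it, which is a small variation on what is described above.

```rust
/// Hypothetical, simplified locale: structured fields plus a cached string.
pub struct CachedLocale {
    language: String,
    region: Option<String>,
    /// Invariant: always equals the serialization of the fields above.
    cached: String,
}

impl CachedLocale {
    pub fn new(language: &str, region: Option<&str>) -> Self {
        let mut loc = CachedLocale {
            language: language.to_string(),
            region: region.map(str::to_string),
            cached: String::new(),
        };
        loc.refresh();
        loc
    }

    /// Rebuild the cached string from the structured fields.
    fn refresh(&mut self) {
        self.cached.clear();
        self.cached.push_str(&self.language);
        if let Some(region) = &self.region {
            self.cached.push('-');
            self.cached.push_str(region);
        }
    }

    /// Every mutator ends by calling `refresh`, which is what keeps the invariant.
    pub fn set_region(&mut self, region: Option<&str>) {
        self.region = region.map(str::to_string);
        self.refresh();
    }

    /// Reading the string form is a borrow with no formatting work.
    pub fn as_str(&self) -> &str {
        &self.cached
    }
}
```

The cost moves from every read of the string form to every mutation, which is the trade being debated in this thread.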

My read is that the direction you are putting us on is that esoteric internals of data provider machinery are reshaping fundamental types of ICU4x in ways that makes them complicated. That complication is only justified when the type is considered for data provider use.

A few things to unpack here.

First, what I'm trying to do here is to solve one of the oldest problems in ICU4X: The problem of how to efficiently represent and marshal locales, such that they support high efficiency and i18n correctness both inside and outside of the data provider. I have never been completely happy with the shape of Locale in ICU4X, and as a result, I've opened at least a half dozen issues pointing out limitations of the Locale data model in the last two years, the first being #52, one of the first 100 issues of the project. I feel more confident now than I have at any point in the course of the last two years that what I'm proposing solves our problems.

Second, the data provider is a primary consumer of locales. Within the ICU4X components, locales serve two primary purposes: (1) to select the best data file based on vertical fallback, and (2) to carry user preferences. I therefore feel that the needs of data provider should, in fact, carry a fairly high level of importance when talking about the shape of the locale classes.

Third, more broadly, I would like to stop looking at zero-copy as being an "esoteric internal of data provider machinery". For better or worse, over the course of the last year, zero-copy has become the lingua franca for basically everything data-related in ICU4X. We should embrace that, rather than treating it as an annoyance.

I would like to make sure that Locale or Language identifier are optimal canonical types akin to String, Vec, Duration etc

I understand that there is a tension between the two and that we are seeking the balance but the proposal feels analogous to me as if stdlib proposed that String contains additional fields that are only necessary for when used in a std::collections which is sufficiently core to stdlib that may justify increased complexity and additional cost for other uses.

A few more things to unpack.

First, I had basically this exact concern about Writeable last Friday, when I was pushing for Writeable to stay as close to fmt::Display as possible. But in the end, our consensus was to extend Writeable with field information, because ICU4X is the primary client of Writeable and ICU4X needs this as a core feature.

Second, most Rust stdlib types do work well with Rust stdlib collections, because they were designed that way. If Locale is built for ICU4X, then it should work well with ICU4X's collections.

Third, many Rust stdlib types have both an Owned version and a zero-copy version. String derefs to &str, Vec derefs to &[T], and PathBuf derefs to &Path. It's therefore not particularly unconventional to think about designing a type to which Locale can deref.

I'm wondering if we should accept that data provider is not a good candidate to use such primitives and we don't use String, Vec in it for a reason and we shouldn't use Locale in it.

Instead, Data Provider should use its own type, optimized for its own use, and aim at cheap conversion to "public" type.

Sure. Let's look at alternatives.

I see two general models for how to represent a locale in a [Var]ZeroVec:

  1. LSRV (tuple of TinyStr), stored in a ZeroVec
  2. LocaleStr, stored in a VarZeroVec

Using LSRV has a few problems:

  • The ZeroVec always allocates space for every subtag (since the type must be fixed-width), even though subtags are generally assumed to be empty in many cases.
  • We are limited to supporting the exact set of LSRV, with a single variant, which works for now, but which, as you and Mark have pointed out, may not work well into the future.

Using LocaleStr solves both of those problems.
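
As a rough illustration of the first point, the sketch below compares the bytes taken by a fixed-width record with the bytes of the string payloads. The `FixedLsrv` layout and its slot widths are assumptions chosen for the example, not the layout proposed in #243, and the string figure ignores the per-element index overhead that a variable-length container also needs.

```rust
use std::mem::size_of;

/// Hypothetical fixed-width record: every slot is present even when the
/// subtag is empty (represented here as all-zero bytes).
#[repr(C)]
pub struct FixedLsrv {
    language: [u8; 3],
    script: [u8; 4],
    region: [u8; 3],
    variant: [u8; 8],
}

/// Bytes used by fixed-width storage for `n` entries.
pub fn fixed_width_bytes(n: usize) -> usize {
    n * size_of::<FixedLsrv>()
}

/// Bytes used by the string payloads alone (index overhead not counted).
pub fn string_payload_bytes(locales: &[&str]) -> usize {
    locales.iter().map(|s| s.len()).sum()
}
```

For the three identifiers "en", "fr", and "sr-Latn", the fixed-width form in this sketch takes 54 bytes while the string payloads take 11.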

The problem it introduces is that the data model of LocaleStr is fundamentally different from Locale (one is a tuple of TinyStr, and the other is a string slice). We need to convert from one model to the other when performing lookup and comparison, for example. We need to be able to check whether a Locale is equal to, greater than, or less than a LocaleStr. This is an additional layer of complexity that could be avoided by having Locale deref to LocaleStr.

Or, maybe we should do the reverse, follow what you suggested, and introduce a new type, meant for public use that is akin to the current Locale?

Yes. This is another approach. However, I would point out that all formatter constructors (like DateTimeFormat::try_new) need the data provider type, because they immediately pass the locale down into the data provider. This is fine, but it means that Locale is essentially downgraded to a "builder" that is not useful on its own when interacting with ICU4X. But maybe that's what we want!

@zbraniecki
Member

Second, the data provider is a primary consumer of locales.

I believe that is a position that I disagree with, as stated above in response to Manish.

Data provider is a primary consumer within icu4x while Locale is a struct useful also outside and independent of it.
It may be unique or at least rare case in icu4x that we are introducing a standalone "quasi-primitive" type that does not require data provider and is useful without it.

I'm afraid that this position is fundamentally different from how you and Manish evaluate it, and hence your optimal design proposition is different from mine.

@sffc
Member Author

sffc commented Jan 20, 2022

Second, the data provider is a primary consumer of locales.

I believe that is a position that I disagree with, as stated above in response to Manish.

To be clear: The line you quoted was not intended as a statement of a position; it was intended as a statement of reality. My position is: "the reality of how Locale is being used in ICU4X should influence the design of Locale".

@zbraniecki
Member

zbraniecki commented Jan 20, 2022

The line you quoted was not intended as a statement of a position; it was intended as a statement of reality.

icu4x::Locale founding code - unic-langid has been used for over 2 years now in production in Gecko. This is meant to be replaced by icu_locid as part of introducing icu4x into Gecko and will be used outside of context of data provider (as well as in that context later on).

I presented here alternative use cases with alternative considerations to the ones that you are considering.
I think it's rarely wise to label one's position as "the reality" when it is but one of multiple positions presented.

@sffc
Member Author

sffc commented Jan 20, 2022

icu4x::Locale founding code - unic-langid has been used for over 2 years now in production in Gecko. This is meant to be replaced by icu_locid as part of introducing icu4x into Gecko and will be used outside of context of data provider (as well as in that context later on).

I presented here alternative use cases with alternative considerations to the ones that you are considering.

Acknowledged.

I think it's rarely wise to label one's position as "the reality" when it is but one of multiple positions presented.

My full statement was:

Second, the data provider is a primary consumer of locales. Within the ICU4X components, locales serve two primary purposes: (1) to select the best data file based on vertical fallback, and (2) to carry user preferences.

I claim that is an indisputable fact, aka "reality". Note that I say "a primary consumer" (not "the"), and then explain that I mean this to be within the ICU4X components.

I'd like if we can agree that data provider is, indeed, a primary consumer of the locale class in ICU4X components, so that we can debate the subjective statement that "the reality of how Locale is being used in ICU4X should influence the design of Locale".

@Manishearth
Member

I'm afraid that this position is fundamentally different from how you and Manish evaluate it, and hence your optimal design proposition is different from mine.

@zbraniecki as I've said before, this is not my position, overall I am rather ambivalent on the final choice being made here. My personal values here are:

  • locales should be easy to use, and using them should not be too expensive
  • locales should be possible to use in data structures in a zero copy way (this does not necessitate using the same type for the zero-copy version)
  • it is okay if parsing zero-copy locales is slow
  • it is not so okay if accessing zero-copy locale data in a useful fashion is slow. "slow" is subjective here and some perf hit might be okay.

Some things which I think are nice to have, but are likely to be in conflict:

  • We often need locales as strings, it would be nice if it would be cheap to obtain those
  • We often need locales as more structured info, it would be nice if that structured info were not expensive to obtain
  • It should be convenient to use these types with ZeroMap

Overall I don't care much about the performance of variants: If there is some parsing cost incurred by them that is fine by me, but I care more about parsing costs imposed on language/country/script. Worth clarifying separately: do others feel this way?

Personally, I see a wide range of solutions here that satisfy my needs and touch on the requirements above. I'll list them below.

For the purposes of the entries below, I shall use LocaleStr to refer to some dynamically sized type that contains at least a str (but potentially other things), and Locale to refer to some stack type that contains locale info in some format. The two types may be aliases, or contain each other, or something else, depending on the design.

A. "LocaleStr everywhere"

This is a class of solution where LocaleStr is used everywhere. The rough model here is "locales are primarily strings in interchange, but they may have other things for convenience"

A1. Vanilla LocaleStr everywhere

This is Shane's original proposal, which essentially sets LocaleStr to be a straightforward str, validated but not further parsed. Locale becomes a Box<LocaleStr> (or similar), and LocaleStr

Pros:

  • Basically a single type, consistency. ZeroMap::get() can accept the LocaleStr dereffed from a Locale when necessary
  • Type is simple
  • Getting a string out of this is cheap, good for lookup

Cons:

  • Everyone repays the cost for parsing again and again. A mitigating factor may be that
  • Mutating a Locale is expensive and annoying

A2. LocaleStr everywhere, some preparsing in Locale

This is Shane's proposed variant, where Locale contains both Box<LocaleStr> (or String, to save on reallocs) and the current preparsed locale data.

Pros:

  • Everything still derefs to LocaleStr, even though the two types are more different. ZeroMap::get() can accept the LocaleStr dereffed from a Locale when necessary
  • Getting a string out of a locale is cheap
  • Locale doesn't pay the cost of parsing on getter operations. LocaleStr still does.

Cons:

  • Complicated Locale type with tricky invariants
  • More baggage for Locale to carry around that may not strictly be necessary
  • Mutating a Locale is still expensive and annoying, though less so

A3. LocaleStr everywhere, but LocaleStr itself is 🧐 fancier

This is a solution where LocaleStr becomes something like:

struct LocaleStr {
   language: Range<usize>,
   script: Range<usize>, // null if empty (same for below)
   region: Range<usize>, 
   variants: Range<usize>,
   n_variants: usize, // maybe?
   data: str
}

where the different elements index into data. Some space optimizations can be made if desired, for example language will always be 0..something so it can be stored as usize, and more complicated things.

Locale can then just be Box<LocaleStr> again

When deserialized, this type will check all internal invariants, after which we don't need to worry about them
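
A minimal sketch of the preparsed-offsets idea, with several assumptions: `IndexedLocale` is a hypothetical owned stand-in (it holds a `String` rather than an unsized `str` tail), an empty range plays the role of "null", variants are omitted, and the parser classifies subtags by length only without validating them.

```rust
use std::ops::Range;

/// Hypothetical owned stand-in for the "fancier" LocaleStr: the string plus
/// byte ranges computed once. An empty range means the subtag is absent.
pub struct IndexedLocale {
    language: Range<usize>,
    script: Range<usize>,
    region: Range<usize>,
    data: String,
}

fn slice<'a>(data: &'a str, range: &Range<usize>) -> Option<&'a str> {
    if range.is_empty() {
        None
    } else {
        Some(&data[range.clone()])
    }
}

impl IndexedLocale {
    /// Simplified parser for language[-script][-region]; no validation.
    pub fn parse(s: &str) -> IndexedLocale {
        let mut out = IndexedLocale {
            language: 0..0,
            script: 0..0,
            region: 0..0,
            data: s.to_string(),
        };
        let mut start = 0;
        for (i, subtag) in s.split('-').enumerate() {
            let range = start..start + subtag.len();
            if i == 0 {
                out.language = range;
            } else if subtag.len() == 4 && out.script.is_empty() && out.region.is_empty() {
                out.script = range;
            } else if (subtag.len() == 2 || subtag.len() == 3) && out.region.is_empty() {
                out.region = range;
            }
            start += subtag.len() + 1; // skip the '-' separator
        }
        out
    }

    /// Getters only slice the stored string; nothing is reparsed.
    pub fn language(&self) -> &str {
        &self.data[self.language.clone()]
    }

    pub fn script(&self) -> Option<&str> {
        slice(&self.data, &self.script)
    }

    pub fn region(&self) -> Option<&str> {
        slice(&self.data, &self.region)
    }

    pub fn as_str(&self) -> &str {
        &self.data
    }
}
```

The getters and the string form are both cheap here; the price is the extra offsets stored next to each string, which is the "larger in the data file" cost listed below.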

Pros:

  • Back to having a single type, ZeroMap::get() is easy to use
  • Harder to mess up internal invariants, everything is sealed
  • Getting a string out of a locale is cheap

Cons:

  • Larger in the data file
  • Mutation is back to being more expensive and annoying
  • Variants still need to be parsed on access. But they are rarely touched (?) so this might be okay.

B. Two separate types

Here, Locale does not attempt to deref to LocaleStr or otherwise contain it, and looks basically like it does today. The model here is that Locale is the primary interchange type, and LocaleStr is a specialized type only for use in data files in a zero-copy manner.

To make ZeroMap::<LocaleStr, _>::get() work with Locale arguments, we need to add a new trait allowing for cross-comparisons, and a ZeroMap::get_cross() method that works with that. I have worked out a design that I won't include here, but it's doable. It may not be as cheap which is unfortunate.
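
The design mentioned above is not included in the thread, so the following is only one possible sketch of a cross-comparison trait and lookup. `CrossCmp`, `LangId`, and this free function `get_cross` are hypothetical names, a sorted slice of `(&str, V)` pairs stands in for the map, and the identifier is reduced to language plus optional region. The comparison walks the bytes the identifier would serialize to, so no string is allocated.

```rust
use std::cmp::Ordering;

/// Hypothetical trait: order `self` relative to a key of a different type.
pub trait CrossCmp<K: ?Sized> {
    fn cross_cmp(&self, key: &K) -> Ordering;
}

/// Simplified structured identifier (language plus optional region only).
pub struct LangId {
    pub language: &'static str,
    pub region: Option<&'static str>,
}

impl CrossCmp<str> for LangId {
    fn cross_cmp(&self, key: &str) -> Ordering {
        // Stream the bytes this identifier would serialize to and compare them
        // lexicographically with the key's bytes, without building a String.
        let separator: &[u8] = if self.region.is_some() { b"-" } else { b"" };
        self.language
            .bytes()
            .chain(separator.iter().copied())
            .chain(self.region.unwrap_or("").bytes())
            .cmp(key.bytes())
    }
}

/// Binary search over entries sorted by their string key.
pub fn get_cross<'a, Q, V>(entries: &'a [(&str, V)], query: &Q) -> Option<&'a V>
where
    Q: CrossCmp<str>,
{
    entries
        // `binary_search_by` wants "element vs. target", hence the `reverse`.
        .binary_search_by(|(key, _)| query.cross_cmp(*key).reverse())
        .ok()
        .map(|index| &entries[index].1)
}
```

This only works if the cross-comparison agrees with the order in which the keys were sorted; here both are plain byte-wise lexicographic order.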

B1. Very basic LocaleStr

Here, LocaleStr just wraps around str, and is validated during deserialization. Getters need to reparse.

Pros:

  • Simple LocaleStr, simple Locale, both suited for their purposes
  • Small in the data file

Cons:

  • Expensive to get Locale as a string
  • LocaleStr getters are not cheap
  • ZeroMap::get_cross() is not that cheap
  • Two entirely separate types

B2. 🧐 Fancier LocaleStr

LocaleStr follows the 🧐 fancy definition above

Pros:

  • Types still suited for their purpose
  • ZeroMap::get_cross() is slightly cheaper
  • LocaleStr getters are mostly cheaper

Cons:

  • Expensive to get Locale as a string
  • Two entirely separate types

B3. 🧐 🧐 Very fancy LocaleStr

LocaleStr is designed to look more like Locale, so it is:

struct LocaleStr {
   lang: TinyAsciiStr<4>,
   script: TinyAsciiStr<4>,
   region: TinyAsciiStr<4>,
   n_variants: usize, // maybe?
   variants: str,
}

variants still needs just in time parsing to fetch, but everything else is just as fast

Pros:

  • LocaleStr is just as fast as Locale for most purposes
  • Types still suited for their purposes
  • ZeroMap::get_cross() will be just as fast as LiteMap<Locale, _>::get()

Cons:

  • Nothing can be obtained as a string cheaply
  • Two separate types
  • Takes up more space in the data, especially when script/regions are unused

As I said, overall I don't have strong opinions on what to pick here, and I've probably not listed all of the options I would be okay with, but here is what I see the design space as being.

@zbraniecki
Member

zbraniecki commented Jan 20, 2022

I'd like if we can agree that data provider is, indeed, a primary consumer of the locale class in ICU4X components

I agree.

We often need locales as strings, it would be nice if it would be cheap to obtain those

I have a quite complicated position on this point. I think that in almost all performance critical runtime code this should not be the case, but it's a strong position weakly held.

Worth clarifying separately: do others feel this way?

"performance of variants" is a vague term here. If Locale contains variants, and variants increase memory cost of an instance, is it performance we care about? I think it is. Same for common operations.

I think the problem is that we identify "common operations" to be very different between Data Provider and a runtime use of locale logic with no data provider scenario.
This makes it really hard to reason about the impact of variants on performance of Locale.

I'll list them below.

Thank you for documenting those options.

I think my current mental model leads me to B1, but I wouldn't be opposed to other B's.
Basically, I start to think that Data Provider is a very peculiar user of Locale in that its critical needs are completely non-overlapping with critical needs of external uses of Locale.

By that I mean that Data Provider needs fast/cheap serialize/deserialize, but completely doesn't care about maximize/minimize/canonicalize or any advanced matches on subtags (that may change with introduction of language matcher heuristics tho).
An external user of Locale that I can see within Mozilla and in my new project will care about minimize/maximize/canonicalize/matching and does not care about To/From string or serialization.

This pushes me further toward the position that the solution is to separate those types for their respective use cases.

@Manishearth
Member

@zbraniecki ah, to clarify "performance of variants" I meant perf of getters, not memory: basically, are we okay with solutions where getters have to just in time parse variants if needed

@zbraniecki
Member

I meant perf of getters

Intuitively - I agree

@sffc
Member Author

sffc commented Jan 21, 2022

Thanks Manish for the summary.

Lots of topics to unpack. I will put them under headings.

Ultra-Fast ULE-to-Stack

If we can hyper-optimize impl From<LocaleStr> for Locale, then most of the problems go away. We would need the new special ZeroMap lookup functions Manish mentioned; all the binary search comparisons would then use this impl.

The other direction, going from Locale to LocaleStr, is a less common operation and can be implemented via EncodeAsVarULE.

Data Provider Needs Should Dictate ICU4X Component APIs

Let me explicitly state an implied assumption from some of my previous proposals. Within the ICU4X components, locales shall be passed around in the form that DataProvider wants when the function is using the locale to load data. This means that DateTimeFormat::try_new should accept a parameter in the DataProvider format. This is a strongly held position backed by a technical, not philosophical, argument. User preferences aside, inside most try_new functions, the only purpose of the locale is to hand off to data provider. We should absolutely not have these functions take an object that they subsequently convert to the data provider form.

What the above paragraph means is that if we choose a string-like approach in data provider, then my position is that the ICU4X components should also take a string-like argument. This is part of why I was pushing for A2, because then the functions can take AsRef<LocaleStr>.

Said another way, although I am fine with Locale existing as it currently does for non-ICU4X use cases, for use cases within ICU4X, we should not be encumbered with design choices made for non-ICU4X use cases.

Obviously, LocaleCanonicalizer and similar classes that use locales for something other than data loading should take the locale in the form that they need.

Perhaps this is the source of some of my strong feelings in this thread. I feel like I am being told that the Locale that was designed for LocaleCanonicalizer is the Locale that we should be using by default through ICU4X. But I feel that the Locale we use by default in ICU4X components should be the Locale optimized for DataProvider.

Maximized Locales

I should also note that this discussion hinges on the resolution to #1462. It sounds like that thread is currently leaning toward "data provider operates on maximized locales". Therefore, the final argument type that data provider needs may actually end up being something more along the lines of &MaximizedLSR or similar.

However, this does not decouple us from the problem of a stack type versus a stringy type.

Variants

I definitely agree that I don't care too much about "get" operations on variants and extensions. I care much more about performance of other operations in the locale lifecycle.

@sffc
Member Author

sffc commented Jan 21, 2022

Class C: Hybrid LSR + Variants approach

I haven't fully thought through the implications, but should we explore something like

// assumes that we use TinyStrNeo such that TinyStr is ULE

struct Locale {
    language: TinyStr,
    script: TinyStr,
    region: TinyStr,
    variants_and_extensions: Box<VariantsAndExtensionsStr>
}

struct LocaleStr {
    language: TinyStr,
    script: TinyStr,
    region: TinyStr,
    variants_and_extensions: VariantsAndExtensionsStr
}

@zbraniecki
Member

I feel like I am being told that the Locale that was designed for LocaleCanonicalizer is the Locale that we should be using by default through ICU4X.

That was not my intent. I believe we need a type for Data Provider that is optimized for use within it. It seems to me that this is not the same type that should be used as a foundational type for Rust external use.

This means that DateTimeFormat::try_new should accept a parameter in the DataProvider format.

My mental model leads me toward try_new<L: Into<DataProviderLocale>>(locale: L) - is that reasonable for you?
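
A minimal sketch of that signature, assuming made-up types throughout: `DataProviderLocale` is modeled as a plain BCP-47 string, `Locale` is a simplified structured type, and this `DateTimeFormat` only stores the converted locale rather than loading any data. None of these mirror the real ICU4X definitions.

```rust
/// Hypothetical locale type optimized for data lookup: here just a string.
pub struct DataProviderLocale(pub String);

/// Hypothetical structured public type.
pub struct Locale {
    pub language: String,
    pub region: Option<String>,
}

impl From<Locale> for DataProviderLocale {
    fn from(locale: Locale) -> Self {
        let mut s = locale.language;
        if let Some(region) = locale.region {
            s.push('-');
            s.push_str(&region);
        }
        DataProviderLocale(s)
    }
}

impl From<&str> for DataProviderLocale {
    fn from(s: &str) -> Self {
        DataProviderLocale(s.to_string())
    }
}

/// Stand-in for a formatter; it only keeps the converted locale.
pub struct DateTimeFormat {
    locale: DataProviderLocale,
}

impl DateTimeFormat {
    /// The conversion happens once, at the API boundary.
    pub fn try_new<L: Into<DataProviderLocale>>(locale: L) -> Result<Self, String> {
        let locale = locale.into();
        if locale.0.is_empty() {
            return Err("empty locale".to_string());
        }
        Ok(DateTimeFormat { locale })
    }

    pub fn locale_str(&self) -> &str {
        &self.locale.0
    }
}
```

Callers that already hold the data-provider form pay nothing extra, while callers holding the structured type pay for one conversion per constructor call.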

@sffc sffc removed backlog blocked A dependency must be resolved before this is actionable labels Feb 12, 2022
@sffc sffc self-assigned this Feb 12, 2022
@sffc
Member Author

sffc commented Feb 12, 2022

Given #1589, here's what needs to be done on this issue.

  1. Create LanguageIdentifierStr and LocaleStr as public types in the icu_locid crate (as described above)
  2. Add impls to convert between those types and the stack types, as well as impls to compare them without converting

These changes will unblock #243.

@sffc
Member Author

sffc commented Mar 3, 2022

We're currently looking at B1/B2/B3. We need a LocaleStr representation that is cheap to compare with Locale (and which is compact in memory).

@Manishearth
Member

Manishearth commented Mar 3, 2022

Things we have decided on:

  • Locale/LI stay mostly the same, with potential changes for optimizations
  • LocaleULE/LIULE will be custom ULE types that can be converted to/from Locale/LI as needed
  • Map lookups occur via binary_search_by, so the important thing is having a cross compatible Ord-like comparison function
  • Overall it's fine if LocaleULE/LIULE can't be converted to a string cheaply

Things we have yet to decide on:

  • What LocaleULE/LIULE should look like

We have the B1/B2/B3 options, and to quickly restate them:

  • B1: thin wrapper around str
  • B2: wrapper around string with preparsed indices
  • B3: A bunch of TinyStrs plus an unparsed str for variants

Cost for comparisons decreases going down this list, but cost for converting to string goes up.

Overall since we do not care too much about string conversions (do we?) I feel like solutions like B3 are more acceptable. However, this is what B3 looks like right now (call this B3α)

struct LocaleStr {
   lang: TinyAsciiStr<4>,
   script: TinyAsciiStr<4>,
   region: TinyAsciiStr<4>,
   n_variants: usize, // maybe?
   variants: str,
}

This has three problems:

  • TinyAsciiStr doesn't allow itself to be null; we may need to introduce a MaybeTinyAsciiStr or something (perhaps a ULE implementation on Option<TinyAsciiStr>)
  • we pay the cost of storing script/region all the time
  • variants needs to be parsed during comparison.

A slightly more efficient approach (B3β) for variants would be to use VarZeroSlice<str> instead; preparsing the list. VarZeroSlice is empty for empty vectors so we pay no cost for the no-variant case, and comparisons are cheap. This still has the same drawbacks for script/region.

An alternate approach (B3γ) might be struct LocaleStr(VarZeroSlice<str>), where the first three strings are lang/script/region. This can be more space-efficient, but needs #1443 .

I'm leaning towards B3β for a first pass.

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Mar 3, 2022
@sffc
Member Author

sffc commented Mar 3, 2022

I lean toward B1 as a first pass, because a function to perform a fast comparison between Locale and its corresponding BCP-47 string seems like a generally useful feature, and if we have that function, B1 is basically free: we don't even need to make LocaleStr its own type. B1 solves the immediate problem (having to allocate a Locale into a String) and still leaves room for us to adopt B3 in the future.

Between the different versions of B3, I lean toward B3γ, which can be implemented after #1443 is done. However, if B1 ends up being sufficiently efficient, maybe we won't even need B3.

@Manishearth
Member

Discussion:

  • B3γ can work well for LanguageIdentifier, not so well for Locale extensions
  • Maybe worth starting with BCP47 str

(discussion incomplete)

@sffc
Member Author

sffc commented Mar 14, 2022

Since no one else is taking this issue, I will take it, since I need it.

@sffc
Member Author

sffc commented Mar 24, 2022

The solution I implemented in #1713 is sufficient for lookup of locales in a ZeroVec. Simply store a VarZeroVec<[u8]> of BCP-47 strings, and then use VZV::get_by with the new Locale::cmp_bytes function.

My solution does not solve loading locales from a ZeroVec. However, the following things can be done:

  1. For language identifiers without variants, store a ZeroVec<(Language, Script, Region)>.
  2. For freeform language identifiers or locales, for now, store it as bytes, and then parse on the fly.

Note that since LanguageIdentifier and Locale are not themselves zero-copy, loading one from a ZeroVec will require the possibility of an allocation, even if we implemented a structured LocaleStr type such as B3α.
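
For the second option, a sketch of what "store as bytes, parse on the fly" could look like. The names `ParsedLangId` and `parse_on_the_fly` are hypothetical, and the parser is a simplification that classifies subtags by length only and rejects anything beyond language, script, and region. It returns a borrowed view, which is a different type from LanguageIdentifier; converting that view into an owned identifier is where the allocation mentioned above could come in.

```rust
/// Hypothetical parsed view that borrows from the stored bytes.
#[derive(Debug, PartialEq)]
pub struct ParsedLangId<'a> {
    pub language: &'a str,
    pub script: Option<&'a str>,
    pub region: Option<&'a str>,
}

/// Simplified parser for language[-script][-region]; returns `None` for
/// invalid UTF-8 or for anything this sketch does not handle (e.g. variants).
pub fn parse_on_the_fly(bytes: &[u8]) -> Option<ParsedLangId<'_>> {
    let s = std::str::from_utf8(bytes).ok()?;
    let mut subtags = s.split('-').peekable();
    let language = subtags.next().filter(|t| (2..=3).contains(&t.len()))?;
    // Consume the next subtag only if it has the expected shape.
    let script = subtags.next_if(|t| t.len() == 4);
    let region = subtags.next_if(|t| t.len() == 2 || t.len() == 3);
    if subtags.next().is_some() {
        return None;
    }
    Some(ParsedLangId { language, script, region })
}
```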

I am therefore going to close this issue, because the immediate problem of lookup is solved, and we have several concrete options for the problem of loading.
