Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finalize name and location of UnvalidatedX #3546

Open
sffc opened this issue Jun 19, 2023 · 55 comments
Open

Finalize name and location of UnvalidatedX #3546

sffc opened this issue Jun 19, 2023 · 55 comments
Assignees
Labels
A-design Area: Architecture or design C-zerovec Component: Yoke, ZeroVec, DataBake needs-approval One or more stakeholders need to approve proposal

Comments

@sffc
Copy link
Member

sffc commented Jun 19, 2023

Currently we have zerovec::ule::UnvalidatedStr and zerovec::ule::UnvalidatedChar. For a while, we've been meaning to discuss a more final home and name for these types.

There's nothing really zerovec-specific about these types other than zerovec putting their use case more front and center. They are almost as useful for serde as they are for zerovec.

I'm not a huge fan of the "unvalidated" prefix; I would rather we avoid negations.

How about schrodinger::SchrödingerStr? (also re-exported without the diacritic)

Discuss with:

Optional:

@sffc sffc added A-design Area: Architecture or design discuss Discuss at a future ICU4X-SC meeting labels Jun 19, 2023
@ajtribick
Copy link

I would suggest reading through the "Personal life" section of the Wikipedia article and the references therein before deciding whether to name more stuff after Erwin Schrödinger.

@robertbastian
Copy link
Member

I would suggest reading through the "Personal life" section of the Wikipedia article and the references therein before deciding whether to name more stuff after Erwin Schrödinger.

Thanks for pointing this out, TIL 🤢

@robertbastian
Copy link
Member

I would have also been opposed to the Schrödinger name on the grounds that it's too clever and requires thinking around three corners or knowing what the crate does already. I'd rather use something straightforward like serde_maybe::MaybeStr, however serde-maybe is already taken unfortunately.

@skius
Copy link
Member

skius commented Jun 19, 2023

Has removing the layer of abstraction and using the raw bytes underneath directly already been discussed? AIUI, we use UnvalidatedX in cases where we just need something comparable to use as a key in a map, or where the validation cost only needs to be paid conditionally - map.get("my_key".to_raw()) instead of .to_unvalidated() seems fine for the former, and the latter could follow Rust APIs like https://doc.rust-lang.org/stable/std/str/fn.from_utf8.html# ?

@robertbastian
Copy link
Member

Previous discussion and background why we want the extra layer: #2489

@sffc
Copy link
Member Author

sffc commented Jun 19, 2023

Two main reasons for the abstraction:

  1. We can use serialize_bytes
  2. We can serialize as a string for readability in JSON

@sffc
Copy link
Member Author

sffc commented Jun 19, 2023

The current semantics of UnvalidatedStr are:

  • Serialization: bytes in binary, string or error in human-readable
  • Deserialization: bytes in binary, string in human-readable

This is somewhat handy because it means we can rely on an UnvalidatedStr to be the key of a map (or else hit a serializer error). But, another useful semantic for data that is expected to be either string or bytes would be to serialize as a byte sequence in human-readable if the bytes are not ASCII.

@sffc
Copy link
Member Author

sffc commented Jun 20, 2023

A name I've been using on my branch is BytesOrStr (could also be StrOrBytes). In that pattern, just prepend "BytesOr" or append "OrBytes" to the validated type name.

@robertbastian
Copy link
Member

BytesOrStr sounds like it's an enum and that having bytes is actually fine, when in fact it's an error/gigo case.

@skius
Copy link
Member

skius commented Jun 20, 2023

I agree with @robertbastian that the "Or" is too inclusive here. But that does make me think, StrAsBytes/BytesStr?

Otherwise maybe some non-negated antonyms of "trusted": QuestionableStr, ShadyStr?

@robertbastian
Copy link
Member

Haha I love ShadyStr. SketchyStr could be another option.

@skius
Copy link
Member

skius commented Jun 20, 2023

My problem with these alternatives is that (imo) they either require more thinking than UnvalidatedStr, e.g., SketchyStr, or they are somewhat ambiguous, e.g., MaybeStr - to me this sounds like the case where the value is not a Str is supported, like the Haskell Maybe (Rust's Option), or is there some precedent for using Maybe to mean Unvalidated?

@Manishearth
Copy link
Member

I kinda like the Unvalidated prefix and don't feel bad about the negation. Could be DeferredStr instead (serdefer makes for a cute crate name in that case). Or LaterStr if you want to use a simpler word.

I do like the zerovec crate being nice and clean, but I do admit that I'm also not too happy about proliferating more util crates. No strong opinion there.

@skius
Copy link
Member

skius commented Jun 20, 2023

I agree with @Manishearth about the naming, UnvalidatedStr perfectly describes what the type is. If we have to avoid the negation, I like LaterStr the most out of all alternatives discussed in this thread.

@sffc
Copy link
Member Author

sffc commented Jun 20, 2023

BytesOrStr sounds like it's an enum and that having bytes is actually fine, when in fact it's an error/gigo case.

This was a progression from my suggestion that it may be useful to rethink the semantics:

The current semantics of UnvalidatedStr are:

  • Serialization: bytes in binary, string or error in human-readable
  • Deserialization: bytes in binary, string in human-readable

This is somewhat handy because it means we can rely on an UnvalidatedStr to be the key of a map (or else hit a serializer error). But, another useful semantic for data that is expected to be either string or bytes would be to serialize as a byte sequence in human-readable if the bytes are not ASCII.

In practice, there's very little difference between the functionality of UnvalidatedStr and BytesOrStr except for some edge cases around serialization. So why not just embrace the other model that is more flexible.

@sffc sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label Jun 22, 2023
@robertbastian
Copy link
Member

The unvalidated name is available. We could use unvalidated::Str and unvalidated::Char

@Manishearth
Copy link
Member

Not opposed.

@sffc
Copy link
Member Author

sffc commented Feb 29, 2024

I still like SchrödingerStr modulo the issues with the namesake historical figure. Can't we come up with some other "clever" term a la elsa::FrozenMap, yoke::Yokeable, ... ?

@sffc
Copy link
Member Author

sffc commented Feb 29, 2024

Should we consider a name like PotentiallyIllFormedUtf8?

Is there a semantic difference?

@robertbastian
Copy link
Member

Can't we come up with some other "clever" term a la elsa::FrozenMap, yoke::Yokeable, ... ?

I really really dislike all the "clever" names.

@Manishearth
Copy link
Member

In this case I can't immediately think of anything and unvalidated is available and accurate, we should just use that imo.

@sffc
Copy link
Member Author

sffc commented Feb 29, 2024

I am not very happy with that proposal because:

  1. In the OP, I said I would prefer to avoid negated terms like un or dis or non. I prefer to start with a positive term.
  2. unvalidated::Str violates our style guide naming. We don't want to import unvalidated::Str and use Str at call sites; that is super confusing. We could need to either import it as unvalidated::Str as UnvalidatedStr or export unvalidated::UnvalidatedStr.

@sffc
Copy link
Member Author

sffc commented Feb 29, 2024

Bikeshed, with some help from gemini:

  • RawStr
  • DeferredStr
  • QuasiStr
  • ChameleonStr
  • ProspectStr
  • OptimisticStr
  • PotentialStr

If we don't like any of those, how about choosing a letter of the alphabet, like

  • WStr
  • XStr
  • YStr
  • ZStr

Most developers don't need to deeply understand why this type exists. They will most often see it as the key of a map. ZeroMap<XStr, PatternULE> seems fairly clear: the key is a string that is potentially a bit weird, and the value is a ULE pattern type.

@Manishearth
Copy link
Member

I'm fine with unvalidated::UnvalidatedStr.

I don't really think any of the other names work well here. I think the un is warranted here.

@sffc
Copy link
Member Author

sffc commented Apr 10, 2024

The crate should also have an optional dependency on writeable as discussed in #4786 (comment)

@sffc
Copy link
Member Author

sffc commented Apr 10, 2024

I really want to start the bikeshed for this. Let's focus just on the dynamically sized type that can be infallibly converted from a byte sequence, fallibly converted to a str, and includes impls for things like serde, zerovec, and writeable.

  1. potentially_ill_formed
    • 1a potentially_ill_formed::PotentiallyIllFormedStr, potentially_ill_formed::PotentiallyIllFormedUtf16
    • 1b potentially_ill_formed::PotentiallyIllFormedUtf8, potentially_ill_formed::PotentiallyIllFormedUtf16
  2. pill
    • 2a pill::PillStr, pill::Pill16
    • 2b pill::PillUtf8, pill::PillUtf16
    • 2c pill::Pill8, pill::Pill16
  3. unvalidated
    • 3a unvalidated::UnvalidatedStr, unvalidated::UnvalidatedUtf16
    • 3b unvalidated::UnvalidatedUtf8, unvalidated::UnvalidatedUtf16
  4. quasistr
    • 4a quasistr::QuasiStr, quasistr::Quasi16
    • 4b quasistr::QuasiUtf8, quasistr::QuasiUtf16
    • 4c quasistr::Quasi8, quasistr::Quasi16
  5. quasi_str
    • 5a quasi_str::QuasiStr, quasi_str::Quasi16
    • 5b quasi_str::QuasiUtf8, quasi_str::QuasiUtf16
    • 5b quasi_str::Quasi8, quasi_str::Quasi16

Any more suggestions before I send out a ballot? I will distribute it after this week's ICU4X-WG call.

@robertbastian
Copy link
Member

We also need to find a name for the UTF-16 type, which I'd say rules out all the (a)s.

@robertbastian
Copy link
Member

robertbastian commented Apr 10, 2024

  1. maybe_utf::MaybeUtf8, maybe_utf::MaybeUtf16
  2. potential_utf::PotentialUtf8, potential_utf::PotentialUtf16

@sffc
Copy link
Member Author

sffc commented Apr 10, 2024

Proposed ballot language:

The ICU4X Working Group would like to introduce a crate containing a type that can be infallibly converted from [u8], fallibly converted to a str, and has impls for serde, zerovec, and writeable that assume the content is UTF-8 but may write replacement characters if the bytes are ill-formed. There will be a sibling type that works on [u16]. These types are intended to be used widely within data structs, FFI, and API as necessary. The names of fields, function parameters, and function names are out of scope of this bikeshed.

@sffc
Copy link
Member Author

sffc commented Apr 10, 2024

We also need to find a name for the UTF-16 type, which I'd say rules out all the (a)s.

I added the UTF-16 naming to the options. I don't necessarily feel the (a)s are automatically ruled out but you can vote them down in the ballot if you feel this way. There are some options I plan to be voting down.

@Manishearth
Copy link
Member

I'm fine with the options so far.

@echeran
Copy link
Contributor

echeran commented Apr 15, 2024

When sitting down to vote and take a closer look at the options, I didn't like the choice of word / prefix in those options to indicate the uncertainty of the well-formedness of the UTF-{8, 16} encoding adhered to by the string.

Comments per option:

  1. PotentiallyIllFormedUtf{8,16} - the substring prefix PotentiallyIllFormed is 100% accurate but long. Since there are options that are shorter and imply the same meaning well enough, I think the length is too long.
  2. Pill - this doesn't make sense without knowing the phrase PotentiallyIllFormed and realizing that this is a shortening. Given that ICU4X has a lot of constituent crates that interact with each other, I think it's good to avoid codenames / puns / etc. if there are common words (or combinations thereof) that can describe the type/crate so that we don't create yet another bespoke language that makes it harder for new users to grok. Especially if it's for a "lower-level" type in the sense that it might occur in multiple places throughout the codebase.
  3. Unvalidated - I do not think that this is a good descriptor because it might imply that the string needs to be validated before it can be used, but this is not the case. The "validity" is in relation to UTF-{8,16} well-formedness, and the lack of preciseness causes enough ambiguity to be undesriable.
  4. Quasi has the same problem as in Option 3, but it's even less clear that "quasi" is in regards to the UTF well-formedness, as opposed to the validity of the string itself not being a true string (character sequence).
  5. MaybeUtf{8,16} is not bad because it is fairly short and it gets across fairly clearly what we're talking about (the uncertainty pertains to UTF)
  6. Potential is similar to Option 5 but a slightly longer and more unfamiliar term than Maybe when it comes to naming an identifier for a type/variable/function in code, AFAICT. I think Option 5 is strictly better than this option for that reason.

I want to offer a few more options to try to capture different angles. For example, when thinking about the type of element that we have a sequence of, we know more about it than u8 but we also can't assert that it has all of the aspects/constraints of char.

  • StrBytes, StrBytes16 - it concisely says "these are bytes for a string". Its existence would imply some extra criterion that couldn't be satisfied by just [u8] or String.
  • Raw - I've seen the adjective "raw" used as a prefix for identifiers, especially types, to indicate some initial form prior to validation or processing. Very concise.
  • CodeUnits - this is the Unicode terminology for u8 and u16 for a character sequence that is attempting an encoding of UTF-8 or UTF-16. The term "code unit" doesn't imply that they're all part of a well-formed UTF-{8,16} string, and the term isn't too long. The prefix Raw could be added to make that latter point especially clear.

@hsivonen
Copy link
Member

@sffc asked me to comment on the naming here:

I think "potentially ill-formed" is the most correct term, but it makes for very long identifiers, so as a matter of type naming, I prefer Unvalidated and put PotentiallyIllFormed second.

CodeUnits seem technically correct, but relative to Unvalidated doesn't emphasize that the point is the potential ill-formedness.

I think we shouldn't use Pill: it's not clear as an abbreviation, and I don't want the naming to be evocative of any slang or memes involving "pill".

Maybe is suggestive of type-wise either-or in the enum sense.

Quasi isn't clear on quasi in what sense.

Raw is suggestive of RawVec and the like. I think Raw in Rust libraries doesn't suggest that the type is allowed to violate invariants but that the implementation itself does not take care of upholding the invariants that are required to be upheld.

StrBytes has the problem that it is not, in fact, necessarily the bytes of str but the whole point is potentially holding bytes that aren't OK as str.

@sffc
Copy link
Member Author

sffc commented Apr 18, 2024

What I hear from the discussions here is that everyone is in general agreement that PotentiallyIllFormedUtf8 correctly conveys the semantics, but people prefer something shorter. We have 15 shorter names on the table (including variants), and we haven't been able to converge on one yet.

If I may try to summarize the pros and cons:

  • Pill*
    • Pros: Short; derived from "Potentially ill-formed"; thematic
    • Cons: Too bespoke; evocative of slang terms
  • Unvalidated*
    • Pros: Clear; suggests that the string is in a specific state; suggests it is some type of string and we have not yet validated whether it is the type of string we are expecting
    • Cons: Implies that the string must first be validated, precluding algorithms designed to operate on strings in this state; primarily based on Serde's way of thinking; focuses on what the string is not rather than what it is
  • Quasi*
    • Pros: Short; nice-sounding; dictionary definition aligns with semantic
    • Cons: Unclear meaning; in science, usually means "fake, almost but not": quasicrystalline
  • Maybe*
    • Pros: Clear; might imply a union type
    • Cons: Might imply an either-or enum like an Option; there is an existing crate maybe_utf8 with different usage semantics
  • Potential*
    • Pros: Clear; might imply statefulness
    • Cons: Slightly long and unfamiliar term; might imply an either-or enum like Option, but not strongly; might imply a time component: "in the future, X has the potential to become Y"
  • StrBytes*
    • Pros: Clear; corresponds to return value of str::as_bytes()
    • Cons: Name does not suggest that the bytes might be ill-formed; UTF-16 is not composed of "bytes"
  • Raw*
    • Pros: Short; suggests a state, similar to "unvalidated"; less opinionated
    • Cons: The term "raw" is overloaded in contexts such as RawVec; might imply the wrong semantics
  • CodeUnits*
    • Pros: Aligned with Unicode terminology
    • Cons: Doesn't emphasize potential ill-formedness; in environments such as JavaScript, "code units" most commonly refers to UTF-16 code units as opposed to UTF-8 code units

I'm still searching for the best compromise with the fewest cons. Doing this writeup makes me think that Potential might be a good compromise. The only downside of "potential" that has been expressed in this thread was Elango's comment that "potential" was strictly worse than "maybe"; however, since that comment was posted, Henri noted that "maybe" implies an enum-like semantic in Rust. I also like that "potential" is a shortening of the phrase "potentially ill-formed", which we all agree on.

Can anyone verbalize any more downsides to the Potential prefix?

Also, if I failed to note any pros or cons of the options, please reply to this issue.

@Manishearth
Copy link
Member

I think the thing is, it still is a string, it's not "potentially not a string". PotentialStr makes me think it's an Option or something.

Unvalidated matches my mental model well because it is some type of string and we have not yet validated whether it is the type of string we like.

@sffc
Copy link
Member Author

sffc commented Apr 18, 2024

I think the thing is, it still is a string, it's not "potentially not a string". PotentialStr makes me think it's an Option or something.

The proposed name is PotentialUtf8; does that change your position?

Do you feel both Potential and Maybe imply the Option semantics equally, or does one of them imply it more strongly than the other?

@Manishearth
Copy link
Member

Manishearth commented Apr 18, 2024

Maybe is stronger (EDIT @sffc: Maybe more strongly suggests the Option semantics)

PotentialUtf8 seems ok to me

@sffc
Copy link
Member Author

sffc commented Apr 18, 2024

Unvalidated matches my mental model well because it is some type of string and we have not yet validated whether it is the type of string we like.

I added this to the "pros" of unvalidated.

@sffc
Copy link
Member Author

sffc commented Apr 19, 2024

As stated above in #3546 (comment), I'd like to propose the following as a compromise.

Crate Name: potential_utf

Primary types:

  • pub struct PotentialUtf8(pub [u8])
  • pub struct PotentialUtf16(pub [u16])

Potential (🤣) expansion types to be added if needed:

  • pub struct PotentialCodePoint(pub [u8; 3]) // a 24-bit type
  • pub struct PotentialUtf16Bytes(pub ZeroSlice<u16>) // a ULE version of PotentialUtf16

Although this was not anyone's first choice, it is the only option for which a fundamental flaw was not verbalized in this thread (except for perhaps quasi, whose objections I would characterize as more a distaste for the less-precise / clever naming convention). To read up on the flaws in every other option, see my post above.

Please check the box if you have no objection to the proposal.

@sffc sffc added needs-approval One or more stakeholders need to approve proposal and removed discuss Discuss at a future ICU4X-SC meeting labels Apr 19, 2024
@echeran
Copy link
Contributor

echeran commented Apr 24, 2024

I still prefer MaybeUtf{8,16} over PotentialUtf{8,16}. The word "maybe" says that it "could be X, it could also not be X", where here, X = valid UTF-{8,16} encoding form. The word "potential" has a similar meaning but also implies a time component: "in the future, X has the potential to become Y". Plus, the word "potential" is longer and less frequently used in programming identifiers in my experience.

@Manishearth
Copy link
Member

Maybe has a strong connotation of being an Option in Rust.

@sffc
Copy link
Member Author

sffc commented Apr 24, 2024

Ok, we'll schedule a follow-up meeting. Alternatively, we could cover this in tomorrow's ICU4X-WG call assuming all the attendees are able to make it (@hsivonen @sffc @Manishearth @echeran).

@sffc
Copy link
Member Author

sffc commented Apr 24, 2024

I think there are basically two options:

  1. We agree as a group that the fully spelled out PotentiallyInvalidUtf8 is fine
  2. We go over the shorter options and choose the one whose downsides are least bad to us as a group

@hsivonen
Copy link
Member

How will this relate to https://github.com/BurntSushi/bstr ?

@Manishearth
Copy link
Member

@echeran is your comment a strong objection to Potential or a weak preference for Maybe?

I'm open to trying bstr instead. However it does have a non optional dependency on memchr.

@sffc
Copy link
Member Author

sffc commented Apr 24, 2024

My take:

  • @echeran's comment raises an additional downside to potential, the temporal component of it. I have added this to the list of downsides in Finalize name and location of UnvalidatedX #3546 (comment).
  • In general, in the context of the current discussion, arguments of the form "X is better than Y so we should choose X and not Y" does not help us reach a conclusion, but verbalizing pros and cons does help.
  • I think a factor we haven't explicitly considered is whether this is a Rust-specific abstraction or something we also use over FFI. If we name the type in FFI bindings, concerns about general intelligibility should be given a greater amount of weight relative to concerns about Rust conventions. If it is a Rust-specific abstraction, then the Rust conventions should take greater weight.
  • In this case, what's being proposed is a Rust crate. I'm not proposing any change to the FFI surface. Over FFI, we have abstractions that allow targets to declare whether their string types are guaranteed valid UTF-8 or not, and then the appropriate Rust API is chosen for them.
  • Regarding bstr: thanks for raising this! I know I had spent a fair bit of time searching crates.io for a crate that did this, but I never found bstr. I've worked with BurntSushi before. If we could collaborate to make sure all of our needs are met (making memchr an optional dependency, adding impls for Writeable, ...), then I think bstr is a really solid option.

@Manishearth
Copy link
Member

In general, in the context of the current discussion, arguments of the form "X is better than Y so we should choose X and not Y"

Yes, strong agree. I'd like us to be talking about pros and cons, not comparing at this stage, because a comparison is not in and of itself a blocking argument.

I think a factor we haven't explicitly considered is whether this is a Rust-specific abstraction or something we also use over FFI.

You address this later, but to explicitly talk about this: We already use DiplomatStr over FFI, and that translates to something natively meaningful on the other side. Which means it's unlikely this abstraction will ever "escape" over FFI.

@hsivonen
Copy link
Member

I think there are basically two options:

1. We agree as a group that the fully spelled out `PotentiallyInvalidUtf8` is fine

After thinking about this more, I'm OK with this. We have IDE autocomplete, etc., to deal with the identifier length. PotentialUtf8 works, too.

After a look at the bstr issues, it's probably better to keep this in ICU4X and not try to change btsr to fit ICU4X.

@sffc
Copy link
Member Author

sffc commented Apr 25, 2024

Summary of discussion with @Manishearth @echeran @sffc:

  • This is a Rust type so Rust concerns should carry additional weight
  • The three adjectives "Maybe", "Potential", and "Quasi" have various different dictionary definitions and usages in science, and those definitions/usages align with varying degrees to the semantic we're aiming for:
    • The statefulness of "Potential" is a pro more than a con
    • "Quasi" implies that something is fake (almost but not), which is not the correct semantic
  • "Maybe" has a Rust ecosystem usage as either a safe enum (MaybeOwned) or an unsafe union (MaybeUninit) type; the union usage is a semantic roughly equal to what we're going for, but the enum usage has a greater chance of being misleading
  • "Maybe" means "Option" in other ecosystems like Haskell
  • "Potential" is a short version of "PotentiallyIllFormed"

@sffc
Copy link
Member Author

sffc commented Apr 25, 2024

Just to post this somewhere:

I think it would not be completely unreasonable or inconsistent with Rust style to introduce the following type

pub struct MaybeUtf8(pub [u8]);

impl MaybeUtf8 {
    pub unsafe fn assume_utf8(&self) -> &str { ... }
    pub fn try_to_utf8(&self) -> Result<&str, Utf8Error> { ... }
}

The parallelism of MaybeUninit::assume_init to MaybeUtf8::assume_utf8 is just not something I could close this thread without addressing first.

@Manishearth
Copy link
Member

I think my main gripe with that is still the usage pattern, where the point of this type is actually that you can often use it without ever having to deal with validating or assuming UTF8, it's quite useful without those two. Because of that it feels very different from the stdlib unsafe helpers, which are more enumlike and the usage pattern is extremely stateful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-design Area: Architecture or design C-zerovec Component: Yoke, ZeroVec, DataBake needs-approval One or more stakeholders need to approve proposal
Projects
None yet
Development

No branches or pull requests

7 participants