Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an "ascii character" type to reduce unsafe needs in conversions #179

Closed
scottmcm opened this issue Feb 12, 2023 · 13 comments
Closed

Add an "ascii character" type to reduce unsafe needs in conversions #179

scottmcm opened this issue Feb 12, 2023 · 13 comments
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@scottmcm
Copy link
Member

scottmcm commented Feb 12, 2023

Proposal

Problem statement

For individual bytes, the backend can plausibly be expected to optimize things when it knows that a char came from an ASCII-ranged value. However, for compound values, there's no realistic way that codegen backends can track enough information to optimize out UTF-8 validity checks. That leads to lots of "look, it's ASCII" comments on unsafe blocks because the safe equivalents have material performance impact.

We should offer a nice way that people can write such "the problem fundamentally only produces ASCII Strings" code without needing unsafe and without needing spurious UTF-8 validation checks.

After all, to quote std::ascii,

However, at times it makes more sense to only consider the ASCII character set for a specific operation.

Motivation, use-cases

I was reminded about this by this comment in rust-lang/rust#105076:

    pub fn as_str(&self) -> &str {
        // SAFETY: self.data[self.alive] is all ASCII characters.
        unsafe { crate::str::from_utf8_unchecked(self.as_bytes()) }
    }

But I've been thinking about this since this Reddit thread: https://www.reddit.com/r/rust/comments/yaft60/zerocost_iterator_abstractionsnot_so_zerocost/. "base85" encoding is an examplar of problems where problem is fundamentally only producing ASCII. But the code is doing a String::from_utf8(outdata).unwrap() at the end because other options aren't great.

One might say "oh, just build a String as you go" instead, but that doesn't work as well as you'd hope. Pushing a byte onto a Vec<u8> generates substantially less complicated code than pushing one to a String (https://rust.godbolt.org/z/xMYxj5WYr) since a 0..=255 USV might still take 2 bytes in UTF-8. That problem could be worked around with a BString instead, going outside the standard library, but that's not a fix for the whole thing because then there's still a check needed later to get back to a &str or String.

There should be a core type for an individual ASCII character so that having proven to LLVM at the individual character level that things are in-range (which it can optimize well, and does in other similar existing cases today), the library can offer safe O(1) conversions taking advantage of that type-level information.

[Edit 2023-02-16] This conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/.E2.9C.94.20Iterate.20ASCII-only.20.26str/near/328343781 made me think about this too -- having a type gives clear ways to get "just the ascii characters from a string" using something like .filter_map(AsciiChar::new).

[Edit 2023-04-27] Another conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/core.3A.3Astr.3A.3Afrom_ascii/near/353452589 about how on embedded the "is ascii" check is much simpler than the "is UTF-8" check, and being able to use that where appropriate can save a bunch of binary size on embedded. cc @kupiakos

Solution sketches

In core::ascii,

/// One of the 128 Unicode characters from U+0000 through U+007F, often known as
/// the [ASCII](https://www.unicode.org/glossary/index.html#ASCII) subset.
///
/// AKA the characters codes from ANSI X3.4-1977, ISO 646-1973,
/// or [NIST FIPS 1-2](https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub1-2-1977.pdf).
///
/// # Layout
///
/// This type is guaranteed to have a size and alignment of 1 byte.
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
#[repr(transparent)]
struct Char(u8 is 0..=127);

impl Debug for Char {}
impl Display for Char {}

impl Char {
    const fn new(c: char) -> Option<Self> {}
    const fn from_u8(x: u8) -> Option<Self> {}
    const unsafe fn from_u8_unchecked(x: u8) -> Self {}
}

impl From<Char> for char {}
impl From<&[Char]> for &str {}

In alloc::string:

impl From<Vec<ascii::Char>> for String {}

^ this From being the main idea of the whole thing

Safe code can Char::new(…).unwrap() since LLVM easily optimizes that for known values (https://rust.godbolt.org/z/haabhb6aq) or they can do it in consts, then use the non-reallocating infallible Froms later if they need Strings or &strs.

Other possibilities

I wouldn't put any of these in an initial PR, but as related things

  • This could be a 128-variant enum with repr(u8). That would allow as casting it, for better or worse.
  • There could be associated constants (or variants) named ACK and DEL and such.
  • Lots of AsRef<str>s are possible, like on ascii::Char itself or arrays/vectors thereof
    • And potentially AsRef<[u8]>s too
  • Additional methods like String::push_ascii
  • More implementations like String: Extend<ascii::Char> or String: FromIterator<ascii::Char>
  • Checked conversions (using the well-known ASCII fast paths) from &str (or &[u8]) back to &[ascii::Char]
  • The base85 example would really like a [u8; N] -> [ascii::Char; N] operation it can use in a const so it can have something like const BYTE_TO_CHAR85: [ascii::Char; 85] = something(b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~").unwrap();. Long-term that's called array::map and unwrapping each value, but without const closures that doesn't work yet -- for now it could open-code it, though.

And of course there's the endless bikeshed on what to call the type in the first place. Would it be worth making it something like ascii::AsciiChar, despite the stuttering name, to avoid Char vs char confusion?

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

@scottmcm scottmcm added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Feb 12, 2023
@pitaj
Copy link

pitaj commented Feb 12, 2023

I like the idea. Could it be added as a private API and used as an implementation detail for now? That would be a perfect demonstration of the usefulness of this type.

One consideration: if we add an "ascii character" type, why not also an "ascii string" type? Or would an "ascii string" just be [ascii::Char] / Vec<ascii:Char>

@scottmcm
Copy link
Member Author

scottmcm commented Feb 12, 2023

It doesn't fit great as a pub(crate) internal thing. To use it in both core and alloc, it needs to be pub and unstable+hidden, so I'd rather just make it an unstable thing than try to hide it. And core has lots of unsafe already, so removing a few makes much less of a different compared to in external crates where it might be the only unsafe. I'm not sure I'd want to, say, churn all the stringizing code to use this if libs-api isn't a fan of the general idea.

There'd be no extra invariants for an AsciiStr or AsciiString, since there's nothing like the "but it's UTF-8" that str and String have over [u8] and Vec<u8>. Whether they should exist anyway to have a separate documentation page, so that Vec<T> isn't cluttered with a bunch of Vec<ascii::Char> methods, I'm poorly equipped to judge. (I didn't want to get into discussions like whether there should be make_ascii_lowercase-style things that work on sequences of ascii chars, which is part of why I focused on using it to build Strings, rather than on operating on them.)

@safinaskar
Copy link

It seems you build on pattern types ( rust-lang/rust#107606 )?

@scottmcm
Copy link
Member Author

@safinaskar I wrote it that way as the simplest way to phrase it in the summary. The field is private, so there are many possible implementation -- including the basic works-on-stable approach of just making it (be or wrap) a big enum -- and this could thus move forward immediately, without needing to be blocked on speculative features like pattern types.

@programmerjake
Copy link
Member

one other thing to add: From<AsciiChar> for u8

@safinaskar
Copy link

This could be a 128-variant enum with repr(u8). That would allow as casting it, for better or worse.

You can wrap 128-variant enum in #[repr(transparent)] struct :)

@kupiakos
Copy link

  • This could be a 128-variant enum with repr(u8). That would allow as casting it, for better or worse.

Worth mentioning this is exactly what the ascii crate does

@BurntSushi
Copy link
Member

BurntSushi commented Apr 28, 2023

I'll second this.

What I like about this is that it feels like a pretty small addition (a single type), but it provides a lot of additional expressive and safe power. I've run into these kinds of situations in various places, and it is in part (small part) one of the nice things about bstr. That is, the reason why an ascii::Char is useful is, IMO, mostly because of the UTF-8 requirement of String/&str. Without the UTF-8 requirement, a lot of the pain that motivates an ascii::Char goes away.

Also, while not part of the proposal, I saw a mention of separate AsciiString/AsciiStr types in the discussion above. I think these would probably be a bad idea. In particular, bstr 0.1 started with distinct BString/BStr types, and it turned out to be a major pain in practice. It's because you were constantly needing to convert to and from Vec<u8>/&[u8] types. It seems like a small thing, but it's a lot of friction.

One other thing that I think is worth bringing up is the Debug impl. Having a distinct ascii::Char does let us make the Debug impl for &[ascii::Char] nicer than what you'd get with &[u8]. But I don't think it can be as nice as a dedicated impl for &AsciiStr. (Although perhaps std can use specialization to fix that. Idk.)

@kupiakos
Copy link

kupiakos commented Apr 28, 2023

Other reproductions of an ASCII enum: icu4x, the Unicode library, has a tinystr::AsciiByte enum. There's even meme that's the top Google result for rustc_layout_scalar_valid_range_end.

Personally, I lean towards using an enum like that over this design:

#[rustc_layout_scalar_valid_range_start(0)]
#[rustc_layout_scalar_valid_range_end(128)]
#[repr(transparent)]
pub struct AsciiChar(u8);

This is a less flexible design. The names or properties of ASCII characters will never change and there aren't many of them, and so we might as well expose them as variants that can participate in pattern matching.

In particular, bstr 0.1 started with distinct BString/BStr types, and it turned out to be a major pain in practice. It's because you were constantly needing to convert to and from Vec<u8>/&[u8] types. It seems like a small thing, but it's a lot of friction.

It looks like that's still bstr's model and hasn't changed. Though BStr: Deref<Target=[u8]> and BString: Deref<Target=Vec<u8>> do mitigate issues. EDIT: it still does provide this model, but it suggests you use extension traits with [u8] and Vec<u8> instead of BStr/BString in APIs as the conversions are free. Worth noting that @BurntSushi is the author of bstr.

@BurntSushi
Copy link
Member

It looks like that's still bstr's model and hasn't changed. Though BStr: Deref<Target=[u8]> and BString: Deref<Target=Vec<u8>> do mitigate issues.

It's not. bstr works by defining extension traits, ByteSliceExt and ByteVecExt, which provide additional methods on [u8] and Vec<u8>, respectively. It still defines BStr and BString types for various reasons that generally revolve around "it's convenient to have a dedicated string type." For example, for Debug and other trait impls (such as Serde). bstr 0.1 didn't have extension traits at all. It just had BStr and BString and a bunch of inherent methods.

@scottmcm
Copy link
Member Author

scottmcm commented May 1, 2023

PR open: rust-lang/rust#111009
Tracking issue: rust-lang/rust#110998

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue May 4, 2023
Add `ascii::Char` (ACP#179)

ACP second: rust-lang/libs-team#179 (comment)
New tracking issue: rust-lang#110998

For now this is an `enum` as `@kupiakos` [suggested](rust-lang/libs-team#179 (comment)), with the variants under a different feature flag.

There's lots more things that could be added here, and place for further doc updates, but this seems like a plausible starting point PR.

I've gone through and put an `as_ascii` next to every `is_ascii`: on `u8`, `char`, `[u8]`, and `str`.

As a demonstration, made a commit updating some formatting code to use this: scottmcm@ascii-char-in-fmt (I don't want to include that in this PR, though, because that brings in perf questions that don't exist if this is just adding new unstable APIs.)
@scottmcm
Copy link
Member Author

This landed on nightly https://doc.rust-lang.org/nightly/std/ascii/enum.Char.html so I'll close this as accepted 🎉

@scottmcm
Copy link
Member Author

@rustbot labels +ACP-accepted

@rustbot rustbot added the ACP-accepted API Change Proposal is accepted (seconded with no objections) label Jun 15, 2023
thomcc pushed a commit to tcdi/postgrestd that referenced this issue Jul 18, 2023
Add `ascii::Char` (ACP#179)

ACP second: rust-lang/libs-team#179 (comment)
New tracking issue: rust-lang/rust#110998

For now this is an `enum` as `@kupiakos` [suggested](rust-lang/libs-team#179 (comment)), with the variants under a different feature flag.

There's lots more things that could be added here, and place for further doc updates, but this seems like a plausible starting point PR.

I've gone through and put an `as_ascii` next to every `is_ascii`: on `u8`, `char`, `[u8]`, and `str`.

As a demonstration, made a commit updating some formatting code to use this: scottmcm/rust@ascii-char-in-fmt (I don't want to include that in this PR, though, because that brings in perf questions that don't exist if this is just adding new unstable APIs.)
mina86 added a commit to mina86/rust that referenced this issue Dec 15, 2023
Since core::ascii::Char::digit returns Some for decimal digits only,
introduce core::ascii::Char::from_digit method which handles values
up to 35.  It mimics char::from_digit though it forgoes the radix
parameter.

Issue: rust-lang/libs-team#179
Issue: rust-lang#110998
mina86 added a commit to mina86/rust that referenced this issue Dec 15, 2023
Since core::ascii::Char::digit returns Some for decimal digits only,
introduce core::ascii::Char::from_digit method which handles values
up to 35.  It mimics char::from_digit though it forgoes the radix
parameter.

Issue: rust-lang/libs-team#179
Issue: rust-lang#110998
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

7 participants