Add an "ascii character" type to reduce `unsafe` needs in conversions #179

scottmcm · 2023-02-12T01:13:56Z

Proposal

Problem statement

For individual bytes, the backend can plausibly be expected to optimize things when it knows that a char came from an ASCII-ranged value. However, for compound values, there's no realistic way that codegen backends can track enough information to optimize out UTF-8 validity checks. That leads to lots of "look, it's ASCII" comments on unsafe blocks because the safe equivalents have material performance impact.

We should offer a nice way that people can write such "the problem fundamentally only produces ASCII Strings" code without needing unsafe and without needing spurious UTF-8 validation checks.

After all, to quote std::ascii,

However, at times it makes more sense to only consider the ASCII character set for a specific operation.

Motivation, use-cases

I was reminded about this by this comment in rust-lang/rust#105076:

    pub fn as_str(&self) -> &str {
        // SAFETY: self.data[self.alive] is all ASCII characters.
        unsafe { crate::str::from_utf8_unchecked(self.as_bytes()) }
    }

But I've been thinking about this since this Reddit thread: https://www.reddit.com/r/rust/comments/yaft60/zerocost_iterator_abstractionsnot_so_zerocost/. "base85" encoding is an examplar of problems where problem is fundamentally only producing ASCII. But the code is doing a String::from_utf8(outdata).unwrap() at the end because other options aren't great.

One might say "oh, just build a String as you go" instead, but that doesn't work as well as you'd hope. Pushing a byte onto a Vec<u8> generates substantially less complicated code than pushing one to a String (https://rust.godbolt.org/z/xMYxj5WYr) since a 0..=255 USV might still take 2 bytes in UTF-8. That problem could be worked around with a BString instead, going outside the standard library, but that's not a fix for the whole thing because then there's still a check needed later to get back to a &str or String.

There should be a core type for an individual ASCII character so that having proven to LLVM at the individual character level that things are in-range (which it can optimize well, and does in other similar existing cases today), the library can offer safe O(1) conversions taking advantage of that type-level information.

[Edit 2023-02-16] This conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/.E2.9C.94.20Iterate.20ASCII-only.20.26str/near/328343781 made me think about this too -- having a type gives clear ways to get "just the ascii characters from a string" using something like .filter_map(AsciiChar::new).

[Edit 2023-04-27] Another conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/core.3A.3Astr.3A.3Afrom_ascii/near/353452589 about how on embedded the "is ascii" check is much simpler than the "is UTF-8" check, and being able to use that where appropriate can save a bunch of binary size on embedded. cc @kupiakos

Solution sketches

In core::ascii,

/// One of the 128 Unicode characters from U+0000 through U+007F, often known as
/// the [ASCII](https://www.unicode.org/glossary/index.html#ASCII) subset.
///
/// AKA the characters codes from ANSI X3.4-1977, ISO 646-1973,
/// or [NIST FIPS 1-2](https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub1-2-1977.pdf).
///
/// # Layout
///
/// This type is guaranteed to have a size and alignment of 1 byte.
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
#[repr(transparent)]
struct Char(u8 is 0..=127);

impl Debug for Char { … }
impl Display for Char { … }

impl Char {
    const fn new(c: char) -> Option<Self> { … }
    const fn from_u8(x: u8) -> Option<Self> { … }
    const unsafe fn from_u8_unchecked(x: u8) -> Self { … }
}

impl From<Char> for char { … }
impl From<&[Char]> for &str { … }

In alloc::string:

impl From<Vec<ascii::Char>> for String { … }

^ this From being the main idea of the whole thing

Safe code can Char::new(…).unwrap() since LLVM easily optimizes that for known values (https://rust.godbolt.org/z/haabhb6aq) or they can do it in consts, then use the non-reallocating infallible Froms later if they need Strings or &strs.

Other possibilities

I wouldn't put any of these in an initial PR, but as related things

This could be a 128-variant enum with repr(u8). That would allow as casting it, for better or worse.
There could be associated constants (or variants) named ACK and DEL and such.
Lots of AsRef<str>s are possible, like on ascii::Char itself or arrays/vectors thereof
- And potentially AsRef<[u8]>s too
Additional methods like String::push_ascii
More implementations like String: Extend<ascii::Char> or String: FromIterator<ascii::Char>
Checked conversions (using the well-known ASCII fast paths) from &str (or &[u8]) back to &[ascii::Char]
The base85 example would really like a [u8; N] -> [ascii::Char; N] operation it can use in a const so it can have something like const BYTE_TO_CHAR85: [ascii::Char; 85] = something(b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~").unwrap();. Long-term that's called array::map and unwrapping each value, but without const closures that doesn't work yet -- for now it could open-code it, though.

And of course there's the endless bikeshed on what to call the type in the first place. Would it be worth making it something like ascii::AsciiChar, despite the stuttering name, to avoid Char vs char confusion?

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

The text was updated successfully, but these errors were encountered:

pitaj · 2023-02-12T01:52:21Z

I like the idea. Could it be added as a private API and used as an implementation detail for now? That would be a perfect demonstration of the usefulness of this type.

One consideration: if we add an "ascii character" type, why not also an "ascii string" type? Or would an "ascii string" just be [ascii::Char] / Vec<ascii:Char>

scottmcm · 2023-02-12T02:07:34Z

It doesn't fit great as a pub(crate) internal thing. To use it in both core and alloc, it needs to be pub and unstable+hidden, so I'd rather just make it an unstable thing than try to hide it. And core has lots of unsafe already, so removing a few makes much less of a different compared to in external crates where it might be the only unsafe. I'm not sure I'd want to, say, churn all the stringizing code to use this if libs-api isn't a fan of the general idea.

There'd be no extra invariants for an AsciiStr or AsciiString, since there's nothing like the "but it's UTF-8" that str and String have over [u8] and Vec<u8>. Whether they should exist anyway to have a separate documentation page, so that Vec<T> isn't cluttered with a bunch of Vec<ascii::Char> methods, I'm poorly equipped to judge. (I didn't want to get into discussions like whether there should be make_ascii_lowercase-style things that work on sequences of ascii chars, which is part of why I focused on using it to build Strings, rather than on operating on them.)

safinaskar · 2023-02-20T08:23:51Z

It seems you build on pattern types ( rust-lang/rust#107606 )?

scottmcm · 2023-02-20T10:23:31Z

@safinaskar I wrote it that way as the simplest way to phrase it in the summary. The field is private, so there are many possible implementation -- including the basic works-on-stable approach of just making it (be or wrap) a big enum -- and this could thus move forward immediately, without needing to be blocked on speculative features like pattern types.

programmerjake · 2023-04-27T22:17:51Z

one other thing to add: From<AsciiChar> for u8

safinaskar · 2023-04-28T07:49:12Z

This could be a 128-variant enum with repr(u8). That would allow as casting it, for better or worse.

You can wrap 128-variant enum in #[repr(transparent)] struct :)

kupiakos · 2023-04-28T17:13:28Z

This could be a 128-variant enum with repr(u8). That would allow as casting it, for better or worse.

Worth mentioning this is exactly what the ascii crate does

BurntSushi · 2023-04-28T17:54:36Z

I'll second this.

What I like about this is that it feels like a pretty small addition (a single type), but it provides a lot of additional expressive and safe power. I've run into these kinds of situations in various places, and it is in part (small part) one of the nice things about bstr. That is, the reason why an ascii::Char is useful is, IMO, mostly because of the UTF-8 requirement of String/&str. Without the UTF-8 requirement, a lot of the pain that motivates an ascii::Char goes away.

Also, while not part of the proposal, I saw a mention of separate AsciiString/AsciiStr types in the discussion above. I think these would probably be a bad idea. In particular, bstr 0.1 started with distinct BString/BStr types, and it turned out to be a major pain in practice. It's because you were constantly needing to convert to and from Vec<u8>/&[u8] types. It seems like a small thing, but it's a lot of friction.

One other thing that I think is worth bringing up is the Debug impl. Having a distinct ascii::Char does let us make the Debug impl for &[ascii::Char] nicer than what you'd get with &[u8]. But I don't think it can be as nice as a dedicated impl for &AsciiStr. (Although perhaps std can use specialization to fix that. Idk.)

kupiakos · 2023-04-28T18:50:04Z

Other reproductions of an ASCII enum: icu4x, the Unicode library, has a tinystr::AsciiByte enum. There's even meme that's the top Google result for rustc_layout_scalar_valid_range_end.

Personally, I lean towards using an enum like that over this design:

#[rustc_layout_scalar_valid_range_start(0)]
#[rustc_layout_scalar_valid_range_end(128)]
#[repr(transparent)]
pub struct AsciiChar(u8);

This is a less flexible design. The names or properties of ASCII characters will never change and there aren't many of them, and so we might as well expose them as variants that can participate in pattern matching.

In particular, bstr 0.1 started with distinct BString/BStr types, and it turned out to be a major pain in practice. It's because you were constantly needing to convert to and from Vec<u8>/&[u8] types. It seems like a small thing, but it's a lot of friction.

~~It looks like that's still bstr's model and hasn't changed. Though BStr: Deref<Target=[u8]> and BString: Deref<Target=Vec<u8>> do mitigate issues.~~ EDIT: it still does provide this model, but it suggests you use extension traits with [u8] and Vec<u8> instead of BStr/BString in APIs as the conversions are free. Worth noting that @BurntSushi is the author of bstr.

BurntSushi · 2023-04-28T18:57:51Z

It looks like that's still bstr's model and hasn't changed. Though BStr: Deref<Target=[u8]> and BString: Deref<Target=Vec<u8>> do mitigate issues.

It's not. bstr works by defining extension traits, ByteSliceExt and ByteVecExt, which provide additional methods on [u8] and Vec<u8>, respectively. It still defines BStr and BString types for various reasons that generally revolve around "it's convenient to have a dedicated string type." For example, for Debug and other trait impls (such as Serde). bstr 0.1 didn't have extension traits at all. It just had BStr and BString and a bunch of inherent methods.

scottmcm · 2023-05-01T18:23:22Z

PR open: rust-lang/rust#111009
Tracking issue: rust-lang/rust#110998

Add `ascii::Char` (ACP#179) ACP second: rust-lang/libs-team#179 (comment) New tracking issue: rust-lang#110998 For now this is an `enum` as `@kupiakos` [suggested](rust-lang/libs-team#179 (comment)), with the variants under a different feature flag. There's lots more things that could be added here, and place for further doc updates, but this seems like a plausible starting point PR. I've gone through and put an `as_ascii` next to every `is_ascii`: on `u8`, `char`, `[u8]`, and `str`. As a demonstration, made a commit updating some formatting code to use this: scottmcm@ascii-char-in-fmt (I don't want to include that in this PR, though, because that brings in perf questions that don't exist if this is just adding new unstable APIs.)

scottmcm · 2023-06-15T10:46:25Z

This landed on nightly https://doc.rust-lang.org/nightly/std/ascii/enum.Char.html so I'll close this as accepted 🎉

scottmcm · 2023-06-15T10:47:18Z

@rustbot labels +ACP-accepted

Add `ascii::Char` (ACP#179) ACP second: rust-lang/libs-team#179 (comment) New tracking issue: rust-lang/rust#110998 For now this is an `enum` as `@kupiakos` [suggested](rust-lang/libs-team#179 (comment)), with the variants under a different feature flag. There's lots more things that could be added here, and place for further doc updates, but this seems like a plausible starting point PR. I've gone through and put an `as_ascii` next to every `is_ascii`: on `u8`, `char`, `[u8]`, and `str`. As a demonstration, made a commit updating some formatting code to use this: scottmcm/rust@ascii-char-in-fmt (I don't want to include that in this PR, though, because that brings in perf questions that don't exist if this is just adding new unstable APIs.)

Since core::ascii::Char::digit returns Some for decimal digits only, introduce core::ascii::Char::from_digit method which handles values up to 35. It mimics char::from_digit though it forgoes the radix parameter. Issue: rust-lang/libs-team#179 Issue: rust-lang#110998

scottmcm added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Feb 12, 2023

This was referenced Apr 29, 2023

Tracking Issue for ascii::Char (ACP 179) rust-lang/rust#110998

Open

Add ascii::Char (ACP#179) rust-lang/rust#111009

Merged

scottmcm closed this as completed Jun 15, 2023

rustbot added the ACP-accepted API Change Proposal is accepted (seconded with no objections) label Jun 15, 2023

tgross35 mentioned this issue Oct 26, 2023

ACP: Add FromByteStr trait with blanket impl FromStr #287

Open

mina86 mentioned this issue Dec 15, 2023

core: add core::ascii::Char::from_digit rust-lang/rust#118963

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add an "ascii character" type to reduce `unsafe` needs in conversions #179

Add an "ascii character" type to reduce `unsafe` needs in conversions #179

scottmcm commented Feb 12, 2023 •

edited

Loading

pitaj commented Feb 12, 2023

scottmcm commented Feb 12, 2023 •

edited

Loading

safinaskar commented Feb 20, 2023

scottmcm commented Feb 20, 2023

programmerjake commented Apr 27, 2023

safinaskar commented Apr 28, 2023

kupiakos commented Apr 28, 2023

BurntSushi commented Apr 28, 2023 •

edited

Loading

kupiakos commented Apr 28, 2023 •

edited

Loading

BurntSushi commented Apr 28, 2023

scottmcm commented May 1, 2023

scottmcm commented Jun 15, 2023

scottmcm commented Jun 15, 2023

Add an "ascii character" type to reduce unsafe needs in conversions #179

Add an "ascii character" type to reduce unsafe needs in conversions #179

Comments

scottmcm commented Feb 12, 2023 • edited Loading

Proposal

Problem statement

Motivation, use-cases

Solution sketches

Other possibilities

What happens now?

pitaj commented Feb 12, 2023

scottmcm commented Feb 12, 2023 • edited Loading

safinaskar commented Feb 20, 2023

scottmcm commented Feb 20, 2023

programmerjake commented Apr 27, 2023

safinaskar commented Apr 28, 2023

kupiakos commented Apr 28, 2023

BurntSushi commented Apr 28, 2023 • edited Loading

kupiakos commented Apr 28, 2023 • edited Loading

BurntSushi commented Apr 28, 2023

scottmcm commented May 1, 2023

scottmcm commented Jun 15, 2023

scottmcm commented Jun 15, 2023

Add an "ascii character" type to reduce `unsafe` needs in conversions #179

Add an "ascii character" type to reduce `unsafe` needs in conversions #179

scottmcm commented Feb 12, 2023 •

edited

Loading

scottmcm commented Feb 12, 2023 •

edited

Loading

BurntSushi commented Apr 28, 2023 •

edited

Loading

kupiakos commented Apr 28, 2023 •

edited

Loading