
Tracking Issue for ascii::Char (ACP 179) #110998

Open
1 of 7 tasks
scottmcm opened this issue Apr 29, 2023 · 92 comments
Labels
A-unicode Area: Unicode C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@scottmcm (Member) commented Apr 29, 2023

Feature gate: #![feature(ascii_char)] #![feature(ascii_char_variants)]

This is a tracking issue for the ascii::Char type from rust-lang/libs-team#179

https://doc.rust-lang.org/nightly/std/ascii/enum.Char.html

Public API

// core::ascii

#[repr(u8)]
enum Char {
    Null = 0,
    // … one variant for every ASCII byte value …
    Delete = 127,
}

impl Debug for Char {}
impl Display for Char {}
impl Default for Char { ... }

impl Step for Char { ... } // so `Range<Char>` is an Iterator

impl Char {
    const fn from_u8(x: u8) -> Option<Self>;
    const unsafe fn from_u8_unchecked(x: u8) -> Self;
    const fn digit(d: u8) -> Option<Self>;
    const unsafe fn digit_unchecked(d: u8) -> Self;
    const fn as_u8(self) -> u8;
    const fn as_char(self) -> char;
    const fn as_str(&self) -> &str;
}

impl [Char] {
    const fn as_str(&self) -> &str;
    const fn as_bytes(&self) -> &[u8];
}

impl From<Char> for u8 {}
impl From<Char> for char {}
impl From<&[Char]> for &str {}

// core::array

impl<const N: usize> [u8; N] {
    const fn as_ascii(&self) -> Option<&[ascii::Char; N]>;
    const unsafe fn as_ascii_unchecked(&self) -> &[ascii::Char; N];
}

// core::char

impl char {
    const fn as_ascii(&self) -> Option<ascii::Char>;
}

// core::num

impl u8 {
    const fn as_ascii(&self) -> Option<ascii::Char>;
}

// core::slice

impl [u8] {
    const fn as_ascii(&self) -> Option<&[ascii::Char]>;
    const unsafe fn as_ascii_unchecked(&self) -> &[ascii::Char];
}

// core::str

impl str {
    const fn as_ascii(&self) -> Option<&[ascii::Char]>;
}
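For illustration, a minimal usage sketch of the API above (hedged: nightly-only, and the method names follow this listing, which may have drifted since it was written):

#![feature(ascii_char)]

use std::ascii;

fn main() {
    // Fallible, checked conversion from a byte.
    let c: ascii::Char = b'H'.as_ascii().unwrap();
    assert_eq!(c.as_char(), 'H');
    assert_eq!(c.as_u8(), 0x48);

    // A slice of ascii::Char is valid UTF-8 by construction,
    // so viewing it as &str needs no runtime check.
    let hi: &[ascii::Char; 2] = b"Hi".as_ascii().unwrap();
    assert_eq!(hi.as_str(), "Hi");
}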

Steps / History

Unresolved Questions

  • What should it be named? Code mixing char and Char might be too confusing.
  • How should its Debug impl work?
  • Is allowing as-casting from it a good or bad feature?
    • FWIW, there's no char::to_u32, just as u32 for it.
  • Some of the as_ascii methods take &self for consistency with is_ascii. Should they take self instead where possible, as the usually-better option, or stick with &self for consistency?

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

@scottmcm scottmcm added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. labels Apr 29, 2023
@BurntSushi (Member)

I guess I'll kick off the discussion about how to actually define the ascii::Char type. Right now, it's an enum like this:

#[repr(u8)]
enum Char {
    Null = 0,
    // … one variant for every ASCII byte value …
    Delete = 127,
}

I'd like to first make sure I understand the pros of using this representation. Do I have them right?

  1. It provides a niche optimization such that Option<ascii::Char> is the same size as ascii::Char.
  2. It provides a way to write ch as u8 where ch has type ascii::Char.
  3. For "free," you can get cute and write the names of ASCII characters instead of their visual or numerical representation. (I don't mean to trivialize this by calling it "cute." I think it's a nice benefit, especially for non-printable characters.)

Are there more? Am I mistaken about the above?
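As a concrete check of the first point, a quick snippet (hedged: the niche is observable on current nightly but is not a documented guarantee):

#![feature(ascii_char)]

use std::ascii;
use std::mem::size_of;

fn main() {
    // The 128-variant enum leaves 128 unused bit patterns,
    // so Option can use one of them for None.
    assert_eq!(size_of::<ascii::Char>(), 1);
    assert_eq!(size_of::<Option<ascii::Char>>(), 1);
}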

And now, the mitigations:

  1. I think the niche optimization can still be obtained in this case because std can use unstable features? And I think there's a way to say, "make this specific value the niche" without any other additional cost.
  2. We can provide a ascii::Char::as_u8() method. (Which is already done in the implementation PR.) Is there anything else that being able to write ch as u8 buys us?
  3. I don't think there is a mitigation here. It feels like a "nice to have" aspect of using an enum that we probably wouldn't go out of our way to re-create via other means.

In terms of negatives, why not do the enum in the first place? Good question. I'm not entirely sure. It feels to me like it stresses my weirdness budget. It's also a generally much bigger enum than most, and I wonder whether that will provoke any surprises in places. Maybe not.

@ChayimFriedman2 (Contributor)

Do we need both u8::as_ascii() and ascii::Char::from_u8()?

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue May 4, 2023
Add `ascii::Char` (ACP#179)

ACP second: rust-lang/libs-team#179 (comment)
New tracking issue: rust-lang#110998

For now this is an `enum` as `@kupiakos` [suggested](rust-lang/libs-team#179 (comment)), with the variants under a different feature flag.

There's lots more things that could be added here, and place for further doc updates, but this seems like a plausible starting point PR.

I've gone through and put an `as_ascii` next to every `is_ascii`: on `u8`, `char`, `[u8]`, and `str`.

As a demonstration, made a commit updating some formatting code to use this: scottmcm@ascii-char-in-fmt (I don't want to include that in this PR, though, because that brings in perf questions that don't exist if this is just adding new unstable APIs.)
@scottmcm (Member Author) commented May 4, 2023

Regarding the enum, @kupiakos said the following in the ACP thread:

The names or properties of ASCII characters will never change and there aren't many of them, and so we might as well expose them as variants that can participate in pattern matching.

And mentioned icu4x's enum,

https://github.com/unicode-org/icu4x/blob/b6c4018a736e79790898c5b91ff3ab25d33192c2/utils/tinystr/src/asciibyte.rs#L8-L11

Though given those names it's possible it's just doing that for the niche and not for direct use.

(Man I wish we had private variants as a possibility.)

To your questions, @BurntSushi ,

  1. Yeah, there are at least three other possibilities here: a) put the enum in a newtype (like ptr::Alignment) to still get the niches without exposing the variants (see the sketch after this comment), b) use the rustc-internal magic, c) just don't get a niche for now, since that's not needed for its core scenario of convertibility to UTF-8 (though it's certainly nice to have)

  2. Given that we're overall not fond of as casts, and it also allows weirder things like as i64, I actually wish the cast could be disabled, and have people use as_u8 as you mention. But that might also just happen by deprecating or warning on these casts, and eventually having ascii::Char: AsRepr<Repr = u8> does seem reasonable.

  3. Now that Refactor core::char::EscapeDefault and co. structures #105076 has landed, I've been thinking of redoing the internals of those using ascii::Char to try out how it feels using the names. That would mean code like

            '\0' => EscapeDebug::backslash(Null),
            '\r' => EscapeDebug::backslash(CarriageReturn),
            '\n' => EscapeDebug::backslash(LineFeed),

because without the variants it ends up being more like

            '\0' => EscapeDebug::backslash(b'\0'.as_ascii().unwrap()),
            '\r' => EscapeDebug::backslash(b'\r'.as_ascii().unwrap()),
            '\n' => EscapeDebug::backslash(b'\n'.as_ascii().unwrap()),

which is nice in that it keeps the escape sequences lining up, but having to copy-paste .as_ascii().unwrap() all over like that doesn't fill me with joy.

I guess this is yet another situation in which I'd like custom literals. Or something like a'\n', I guess, but that's way too far a jump to propose right now.

EDIT a while later: The PR to update EscapeDebug like that is #111524
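A rough sketch of option (a), hiding the variants behind a newtype the way ptr::Alignment wraps its private enum (hedged: the names and shape here are hypothetical, not the current implementation):

mod imp {
    // Private enum: its 128 variants provide the niche, but users never
    // see them. Only two are written out in this sketch.
    #[repr(u8)]
    #[derive(Clone, Copy, PartialEq, Eq)]
    pub(crate) enum AsciiByte {
        Null = 0,
        // … remaining variants elided …
        Delete = 127,
    }
}

// Public face: an opaque newtype, so the variant names stay private and
// could be exposed (or not) later without a breaking change.
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct Char(imp::AsciiByte);

impl Char {
    pub const fn from_u8(x: u8) -> Option<Self> {
        // A real implementation maps every value 0..=127 to its variant;
        // with the variants elided above, only two arms are shown.
        match x {
            0 => Some(Self(imp::AsciiByte::Null)),
            127 => Some(Self(imp::AsciiByte::Delete)),
            _ => None,
        }
    }
}

fn main() {
    assert!(Char::from_u8(0).is_some());
    assert!(Char::from_u8(200).is_none());
}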

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue May 6, 2023
Constify `[u8]::is_ascii` (unstably)

UTF-8 checking in `const fn` was stabilized back in 1.63 (rust-lang#97367), but apparently somehow ASCII checking was never const-ified, despite being simpler.

New constness-tracking issue for `is_ascii`: rust-lang#111090

I noticed this working on `ascii::Char`: rust-lang#110998
bors added a commit to rust-lang-ci/rust that referenced this issue May 7, 2023
Constify `[u8]::is_ascii` (unstably)

@clarfonthey (Contributor)

I wasn't in the discussion when this type was created (which I fully support), but is there a reason to explicitly use an enum instead of a newtype + internal niche attribute like NonZeroU8?

If we decide to go the pattern types route (#107606) then NonZeroU8 could become u8 is 1.. and this could become u8 as ..128. Alternatively, if we go the generic integers route (#2581) then this could become uint<7> or just u7.

IMHO, explicitly going with an enum kind of prevents these kinds of optimisations later down the road, since it forces an enum as the representation.

@BurntSushi (Member)

@clarfonthey I think the discussion above answers your question about why we went with the enum for now. In summary, I think its principal upside is that it gives nice names to each character. Not particularly compelling, but nice.

My general feeling is that nobody is stuck on using an enum here. My best guess is that unless something compelling comes up, I wouldn't be surprised if we changed to an opaque representation such that we could change to an enum later in a compatible manner. (I think that's possible?) Purely because it's the conservative route.

I don't grok the second two paragraphs of your comment. Probably because I don't know what "pattern types" are. I also don't know what uint<7> means. (Don't have time to follow your links at the moment.)

@clarfonthey (Contributor) commented May 13, 2023

I mean, having nice names is still achievable with constants, and this still would make match expressions work as you'd expect. Personally, the benefits I'm seeing here are the ability to have more-performant code without the need for extra unsafe code, since any slice of ASCII characters is automatically valid UTF-8. Otherwise, a type invariant would mostly be unnecessary, and you could just have constants for the names.

Explaining the two proposals:

  • Pattern types just allow you to write T is pat which is a new type that means T, but only values which match pat. So, Option<T> is Some(_) means "options which are some" and u8 is 1.. is effectively NonZeroU8.
  • Generic integers is a separate proposal which "converts" all the existing uN and iN types into aliases for a more generic uint<N> and int<N> type, which has the benefit of allowing non-power-of-two sizes and sizes larger than 128. The idea behind uint<7> is that it would be equivalent to u8 internally with the first bit always zero, although still allow all the normal operations on integers like pattern-matching and be castable to u8.

The pattern type version is very enticing because it very naturally allows exhaustive matching, while still just being a compile-time constraint. And like I said, we could still have constants for the different names for characters without really affecting things.

@BurntSushi (Member)

having nice names is still achievable

Yes, as I said:

I don't think there is a mitigation here. It feels like a "nice to have" aspect of using an enum that we probably wouldn't go out of our way to re-create via other means.

I don't know what the stabilization story is for pattern types or generic integers, but I would prefer we focus on finding a path to stabilization that doesn't require being blocked on hypothetical language features. Or at least, focusing discussion on whether we can utilize them later in a backwards compatible fashion.

I still don't really know what pattern types are (I just don't have time right now to grok it), but generic integer types seem like an implementation detail that's consistent with defining ascii::Char as an opaque type?

@clarfonthey (Contributor)

Right, the point here is that any opaque type would be equivalent to the primary hypothetical future implementations. So, explicitly not blocking on any particular path forward, but making sure that they're compatible.

@scottmcm (Member Author)

I mean, having nice names is still achievable with constants, and this still would make match expressions work as you'd expect.

Note that you can use (import) a variant, but you can't use an associated constant. Thus use ascii::Char::*; works with an enum, but not with constants for the names.

I agree that this could absolutely be done with the magic attributes. But is there anything in particular that you think is better with that approach?

The enum version still gets the niche, notably:

[src/main.rs:3] std::alloc::Layout::new::<Option<Option<Option<Option<Option<std::ascii::Char>>>>>>() = Layout {
    size: 1,
    align: 1 (1 << 0),
}
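And for comparison, the kind of code the glob import enables (hedged: needs both unstable feature gates; the variant names used here are the ones on current nightly):

#![feature(ascii_char, ascii_char_variants)]

use std::ascii::Char::*;

fn describe(c: std::ascii::Char) -> &'static str {
    // Variant names can be imported and matched on directly;
    // associated constants could not be glob-imported like this.
    match c {
        Null => "NUL",
        LineFeed | CarriageReturn => "line ending",
        Space => "space",
        _ => "something else",
    }
}

fn main() {
    assert_eq!(describe(Space), "space");
}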

@clarfonthey (Contributor)

I will admit that the inability to use associated constants is an immediate downgrade, although I do think that will eventually be allowed, which would let us deprecate the std::fN::consts modules.

As far as benefits -- the main ones I see are the ability to unify this type with any one of the potential future features I mentioned. If we stabilise an enum now, we are explicitly locked out of this.

That said, there is precedent for separating out the char type from the other integer types, and I guess that ascii::Char fits right in with that. I'm just not convinced that locking in this as an enum is worth losing the benefits we could get later, especially considering how it could be converted from an opaque type into an enum publicly later if we eventually decide that is better than the alternatives.

@scottmcm (Member Author)

I think it's important that this stay a separate type. To me, it's an advantage that this not just be a u7, but that it have the semantic intent behind it of actually being ASCII.

For example, if we have pattern types, I don't want to impl Display for &[u8 is ..128], but I think it's possible that we'd add impl Display for &[ascii::Char]. (Like how Debug for char is very different from Debug for a hypothetical u32 is ..=1114111.)

Should it be something else to be able to stabilize a subset sooner? Maybe; I don't know. I think I'll leave questions like that more to libs-api.

@clarfonthey (Contributor)

That's fair, and in that case I'll concede that the enum works in this case. I still am not sure if it's the best-performing version (I use the bounded-integer crate for all sorts of things, and it uses enums as its codegen) but I feel convinced enough that we can stabilise this as a separate type.

bors added a commit to rust-lang-ci/rust that referenced this issue May 20, 2023
`ascii::Char`-ify the escaping code in `core`

This means that `EscapeIterInner::as_str` no longer needs unsafe code, because the type system ensures the internal buffer is only ASCII, and thus valid UTF-8.

Come to think of it, this also gives it a (non-guaranteed) niche.

cc `@BurntSushi` as potentially interested
`ascii::Char` tracking issue: rust-lang#110998
@safinaskar (Contributor)

I don't like the name Char. It is too similar to char, even though they are totally different things. I propose ASCII instead.

@safinaskar (Contributor)

If you don't like the name ASCII, then name it Septet, but please, not Char.

@BurntSushi (Member)

I don't think ASCII or Septet are going to work. I can't imagine those being the names we end up with personally.

ascii::Char and char are not totally different things. They both represent abstract characters. One is just a more limited definition than the other, but both are in fact quite limited! ascii::Char is namespaced under the ascii module, which provides pretty adequate context for what kind of character it represents IMO.

@clarfonthey (Contributor)

Honestly, the name Ascii for a type would make sense since it lives in the ascii module; that would mirror other existing Rust types. It doesn't seem out of place to refer to an ASCII character as "an Ascii", especially considering how there are already lots of terms that use "ASCII" as a placeholder for "text," as in stuff like "ASCII art." Nowadays, people even refer to Unicode art as ASCII art.

In terms of precedent for the ascii::Char name, I would point toward the various Result types in modules offered across libstd and the ecosystem, where io::Result and fmt::Result are two common examples. While the base Result is in prelude, these separate Result types are in separate modules and often used exclusively as module-specific imports, although in rare cases code will directly import them. I don't think this is a good analogy, however, since ascii::Char and char actually are completely different types with entirely different properties, and not just shorthands for each other, even though you can convert one into the other.

The only real downside I can see to using the name Ascii is that it's converting an acronym into a word and lowercasing most of the letters, which… didn't stop us for stuff like Arc and Rc and OsStr.

Personally, I've thought that the core::ascii module has been mostly useless (now that AsciiExt is deprecated) until the addition of this character type, and revitalising it to introduce a type called Ascii would make the most sense to me. Plus, it would even open a path toward potentially adding this type to the prelude if it were considered useful there, or maybe the preludes of other crates that rely on it.

@BurntSushi (Member)

Using Ascii as a singular noun sounds very strange to me. "ASCII art" is using "ASCII" as an adjective. I'm not aware of any wide usage of "Ascii" as a singular noun referring to a single character. Its usage as a noun, in my experience, refers to a character set and/or encoding. Not a member of that set/encoding.

@clarfonthey (Contributor) commented May 22, 2023

I mean, it does feel weird, but then again, so does calling a vector a "vec" until you say it enough times. If you think of "ASCII" as shorthand for "ASCII character", I don't think that's too far-fetched.

Oh, and even "char" is an abbreviation.

@kupiakos (Contributor)

2. Given that we're overall not fond of as casts, and it also allows weirder things like as i64, I actually wish the cast could be disabled

I see no issue with this; ascii::Char, or whatever it's called, can be fully represented by every numeric type except bool. The issue with as casts changing the value is irrelevant for this case.

I mean, it does feel weird

This message is composed of 420 ASCIIs.

@bluebear94 (Contributor)

I also support calling this type Ascii because calling it Char is likely to give worse diagnostics if you mix up char and Char. The Char name would also result in confusion to readers unless you qualify it as ascii::Char, which is more verbose than Ascii.

Also, rustdoc currently doesn’t play well with types that are exported under a different name from their declaration; for instance, this method shows the return type as AsciiChar, which is the name used in the declaration of this type.

(Note: I dislike the convention of having multiple types named Error and Result in the standard library and would rather have had FmtError, IoError, and so on, so my opinions might be biased.)

@rben01 commented Feb 6, 2024

Arithmetic should not exist on AsciiChar (or whatever it ends up being called). Any arithmetic common and useful enough to be worth thinking about, like addition with 'a' - 'A' for case conversion, should just be exposed through a method like to_ascii_uppercase (or just to_uppercase since the domain is already limited to ASCII).

I disagree. As I expressed in #120219 (comment), ASCII characters have no holes in the domain (char forbids surrogates) and all 7-bit numbers are valid ASCII characters (char doesn’t fit in 20 bits but doesn’t cover all 21 bits).

There is a difference between char, NonZero and AsciiChar in regards to Add. AsciiChar can be thought of as u7 where addition has an obvious wrapping behaviour of keeping only the seven least significant bits of the result. There is no such clear rule for char or NonZero. char is not quite 21-bit and both have forbidden representations less than their max value.

I certainly would not expect wrapping. If adding 1 to char 127 got me back to '\0' I'd be very very surprised. If anything I'd expect a panic, as with all other default arithmetic operations. (Now, should there be Wrapping<AsciiChar>? I'd still say no, but at least its behavior wouldn't be surprising.)

@mina86 (Contributor) commented Feb 6, 2024

I certainly would not expect wrapping. If adding 1 to char 127 got me back to '\0' I'd be very very surprised. If anything I'd expect a panic, as with all other default arithmetic operations.

And you would get behaviour as with any default arithmetic operations. AsciiChar(127u8) + 1 would produce the same effect as AsciiChar(255u8 + 1). Why would wrapping in AsciiChar be surprising when wrapping with u8 isn't?

@Finomnis (Contributor) commented Feb 6, 2024

I certainly would not expect wrapping. If adding 1 to char 127 got me back to '\0' I'd be very very surprised. If anything I'd expect a panic, as with all other default arithmetic operations.

And you would get behaviour as with any default arithmetic operations. AsciiChar(127u8) + 1 would produce the same effect as AsciiChar(255u8 + 1). Why would wrapping in AsciiChar be surprising when wrapping with u8 isn't?

u8 does not wrap, it panics on wrap (on debug builds)

@mina86 (Contributor) commented Feb 6, 2024

u8 does not wrap, it panics on [overflow] (on debug builds)

Yes, and that’s what Add in my PR does as well.

@sffc commented Feb 7, 2024

If AsciiChar(127u8) + 1 panics in debug, like other overflowing integer types do, what does it do on release? It can't just roll over to 128 because that would be UB since 128 is not a valid AsciiChar, and rolling to \0 is not free (still requires branching, unlike unsigned int overflow which is handled in hardware). So it seems cleaner to not impl Add for AsciiChar.

@clarfonthey (Contributor)

What exactly is the intent behind some kind of Add operation that isn't covered by Step and a to_digit method?

@mina86 (Contributor) commented Feb 7, 2024

If AsciiChar(127u8) + 1 panics in debug, like other overflowing integer types do, what does it do on release? It can't just roll over to 128 because that would be UB since 128 is not a valid AsciiChar, and rolling to \0 is not free (still requires branching, unlike unsigned int overflow which is handled in hardware).

It wraps to zero, which doesn't require branching; it's just an and operation. Do you also argue that Index shouldn't do bounds checking because that introduces a branch?

What exactly is the intent behind some kind of Add operation that isn't covered by Step and a to_digit method?

This is a disingenuous question. You might just as well ask what the intent is behind some kind of AsciiChar which isn't covered by char::is_ascii, str::is_ascii and str::from_utf8. If you really think that Step, a trait designed for iterators, is appropriate for use as arithmetic, then I really don't know what to tell you.

@kupiakos (Contributor) commented Feb 7, 2024

About 1/4 of the ASCII range is not displayable. A valid interpretation of the type is that it utilizes more human-readable rendering when possible, but it is still just a byte.

This doesn't seem particularly relevant - this is just a statistical difference between Unicode and ASCII. U+202E, anyone?

what does it do on release

Overflow from 128 -> 0, with the associated code cost to do that.

still requires branching, unlike unsigned int overflow which is handled in hardware

The overflow is branchless: it's transmute((x as u8 + 128 + y as u8) & 0x7f). It's ~3x as costly as a u8 add, but it's not terrible.

I disagree. As I expressed in my comment in PR implementing Add, ASCII characters have no holes in the domain (char forbids surrogates) and all 7-bit numbers are valid ASCII characters (char doesn’t fit in 20 bits but doesn’t cover all 21 bits).

This makes a solid case for the differences with char/NonZero*, though I'm still skeptical it's the right choice for Rust long-term. It's the only numeric add where release-profile overflow isn't just handled by the hardware. I suspect there are common cases that would be lighter-weight to do the arithmetic as u8, and then convert to AsciiChar, then to perform multiple AsciiChar wrapping operations. (Relatedly, I'm broadly against adding custom-bit-width integers to Rust. Eager masking/wraparound is, IMO, a niche need that libraries should provide for.)

I think we should discourage users from writing code like x + a'0' or c + (a'a' - a'A') and push them towards helper methods that can be more trivially optimized and tend to be much more readable.

It’s easy to convert to u8. It’s hard to convert from u8.

I challenge "hard". It's fallible, sure, but we could provide a fn from_u7(x: u8) -> Self { unsafe { transmute(x & 0x7f) } } for a safe, infallible "I don't really care if that top bit is set, just truncate to convert like an as cast".
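Spelled out, that infallible constructor might look like this (hedged sketch: from_u7 is the name suggested above, not an existing API, and it leans on from_u8_unchecked from the listing at the top):

#![feature(ascii_char)]

use std::ascii;

// Truncating conversion: ignore the top bit, the way an `as` cast would.
fn from_u7(x: u8) -> ascii::Char {
    // SAFETY: masking with 0x7f keeps the value in 0..=127.
    unsafe { ascii::Char::from_u8_unchecked(x & 0x7f) }
}

fn main() {
    assert_eq!(from_u7(b'a' | 0x80), b'a'.as_ascii().unwrap());
}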

@scottmcm (Member Author) commented Feb 7, 2024

This is a disingenuous question. You might just as well ask what the intent is behind some kind of AsciiChar which isn't covered by char::is_ascii, str::is_ascii and str::from_utf8.

This is easily answered with reference to the ACP rust-lang/libs-team#179:

The purpose of the type is to be a type-system proof of ASCII-ness so that slice-to-str and vec-to-String conversions trivially do not require checking, and thus can be O(1) in safe code, reducing the need for "I only put ASCII in the buffer so I'm calling from_utf8_unchecked" unsafe code.

Everything else is extra, and plausibly wouldn't even be part of the first wave of stabilization.

The exemplar for using this is fast & safe base64, where AsciiChar: Add isn't helpful, as it'll most likely use indexing into a constant [AsciiChar; 64]. Hex formatting would do the same, though perhaps via [[AsciiChar; 2]; 256] instead.

The numeric formatting in core currently uses addition for generating decimal digits, but I don't think it'd be willing to use an Add that adds extra bitmasking cost to those paths. And if the masking optimizes away in that code, then so would any checks in AsciiChar::new(b'0' + x).unwrap().
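As a rough illustration of that base64 exemplar (hedged sketch, not a full encoder; it only shows the table lookup and the check-free &str view):

#![feature(ascii_char)]

use std::ascii;

// The standard base64 alphabet, proven ASCII once, up front.
fn base64_alphabet() -> [ascii::Char; 64] {
    *b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
        .as_ascii()
        .unwrap()
}

fn main() {
    let table = base64_alphabet();
    // Encoding a 6-bit group is just a safe table lookup; no Add needed.
    let out = [table[0b011010], table[0b000000]];
    // Because the output is ascii::Char, the &str view needs no UTF-8 check.
    assert_eq!(out.as_str(), "aA");
}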

@mina86 (Contributor) commented Feb 7, 2024

It's a valid question. The Step implementation is O(1), it's just less ergonomic than + usize. The intent and real-world use cases for AsciiChar have been addressed and described. Implementing controversial features requires answers to "Why should we have this when there are workable alternatives?"

It’s not, because it presupposes that if something can be done with Step or to_digit then the alternatives are to be dismissed. Getting the nth letter and an Ascii85 implementation would benefit from Add, but of course it can all be done with AsciiChar::new(...).unwrap(). If you don’t think the ergonomics are better/worth it, then there’s nothing I can say to convince you otherwise.

@clarfonthey (Contributor)

It's a valid question. The Step implementation is O(1), it's just less ergonomic than + usize. The intent and real-world use cases for AsciiChar have been addressed and described. Implementing controversial features requires answers to "Why should we have this when there are workable alternatives?"

It’s not, because it presupposes that if something can be done with Step or to_digit then the alternatives are to be dismissed. Getting the nth letter and an Ascii85 implementation would benefit from Add, but of course it can all be done with AsciiChar::new(...).unwrap(). If you don’t think the ergonomics are better/worth it, then there’s nothing I can say to convince you otherwise.

The point here is that Add fundamentally is a nonsensical operation for ASCII characters, even though the type internally is just a byte whose value can be added. char doesn't support Add for similar reasons.

The reason why I mentioned Step is because things like getting the nth letter are covered by it; ('a'..='z').nth(x) works for char just fine and it would work just as well for AsciiChar too. I would even argue that this form is clearer than using addition, and less error-prone.

I don't see how implementing addition gives you any positive ergonomics, and there's even the additional confusion of what you're adding: u8? other ASCII characters? It just introduces so many problems that feel like they could be easily solved by other methods, which is why I'm asking what problems you'd like to solve that aren't covered by those methods.
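Concretely, the "nth letter" pattern via ranges might look like this (hedged: a nightly sketch relying on the Step impl from the listing, which makes ranges of ascii::Char iterable):

#![feature(ascii_char)]

use std::ascii::Char;

// The n-th lowercase letter, with no arithmetic on Char itself.
fn nth_lowercase(n: usize) -> Option<Char> {
    (Char::from_u8(b'a').unwrap()..=Char::from_u8(b'z').unwrap()).nth(n)
}

fn main() {
    assert_eq!(nth_lowercase(2), Char::from_u8(b'c'));
    assert_eq!(nth_lowercase(26), None);
}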

@joseluis (Contributor)

I just want to point out that AsciiChar should derive Default returning the Null variant, matching char's implementation returning '\x00', since I've not seen it mentioned anywhere and right now it lacks it.

@clarfonthey (Contributor)

I also hate to be a broken record but Step is still not mentioned anywhere in the issue description, and it should be.

@scottmcm (Member Author)

@joseluis That makes sense to me. Want to send a PR and r? @scottmcm me on it?

@clarfonthey Updated.

@programmerjake (Member)

i think we should add [Char; N]::as_bytes() which returns &[u8; N]
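Something like that can already be written with a pointer cast (hedged sketch of the suggested conversion as a free function; as_bytes_array is a hypothetical name, not an existing API):

#![feature(ascii_char)]

use std::ascii;

fn as_bytes_array<const N: usize>(chars: &[ascii::Char; N]) -> &[u8; N] {
    // SAFETY: ascii::Char is a single-byte #[repr(u8)] type, so
    // [ascii::Char; N] and [u8; N] have identical size and alignment.
    unsafe { &*(chars as *const [ascii::Char; N] as *const [u8; N]) }
}

fn main() {
    let hi: &[ascii::Char; 2] = b"Hi".as_ascii().unwrap();
    assert_eq!(as_bytes_array(hi), b"Hi");
}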

@scottmcm (Member Author)

I had a weird idea that's probably a vestige from doing C++ a bunch:

Add a pub struct Ascii<T: private::Innards + ?Sized>(T::ImplType);. Then use that to have Ascii<str>, Ascii<char>, Ascii<[char; 10]>, etc that end up storing [u8], u8, [u8; 10] respectively.

Then it has the customized display, without needing to debate things like whether to need to specialize Debug for [ascii::Char] to show a string.

(That's pretty unprecedented for Rust, though, so I assume libs-api wouldn't be a fan. But I figured I might as well write it down.)

@BurntSushi (Member)

@scottmcm My interest is piqued. It reminds me a bit of the type shenanigans used in rkyv. I really like getting a nice Debug impl. I think that's a huge deal for UX. (It was quite literally one of the original motivating reasons for me creating bstr.)

@programmerjake (Member) commented Feb 14, 2024

If we are going to have Ascii<T>, I would write it Ascii<u8> instead of Ascii<char> since the latter makes me think it's just a 32-bit value range limited to 0..=0x7F, similarly for arrays. For unknown-length, idk if we use [u8] or str...
One benefit of Ascii is that an owned string type easily falls out: Ascii<Vec<u8>>/Ascii<String>

@scottmcm (Member Author)

With NonZero<T> in FCP (#120257 (comment)), I was inspired by it and jake's previous comment to come back to this and propose something that I think might be reasonable.

We're about to stabilize

pub struct NonZero<T>(/* private fields */)
where T: ZeroablePrimitive;

where ZeroablePrimitive is a forever-internal implementation detail, but there's always a get: NonZero<T> -> T.

So here we could do something similar:

pub struct Ascii<T: ?Sized>(/* private fields */)
where T: SupersetOfAscii;

Then for T: Sized we'd have the same get: Ascii<T> -> T, but we'd also have, for everything, an as_str: &Ascii<T> -> &str, as well as allowing it to deref (perhaps indirectly) from &Ascii<T> to &T.

So internally that might be

pub struct Ascii<T: ?Sized + SupersetOfAscii>(<T as SupersetOfAscii>::StorageType);

Then we could implement that trait for various types -- u8 and u32 where the internal StorageType is a private thing with #[rustc_layout_scalar_valid_range_end(0x7F)], but also implement it for [u8; N] and [u8] and such to cover more general things.

(That would allow Ascii<u32>: AsRef<char> + AsRef<&str> for example, since you can get references to either from it. Might not be worth bothering having that, though, since I've never seen anything that cares about AsRef<char>.)

Thoughts, @BurntSushi ? Would you like to see that as a PR or an ACP (or nothing)?


@programmerjake (Member) commented Feb 22, 2024

If we have Ascii<T: ?Sized>: Deref<Target = T>, I think we'll need both Ascii<str> and Ascii<[u8]> as well as the corresponding owned types. We should have Ascii<[u8]>: AsRef<Ascii<str>> and vice versa, and other corresponding conversions for owned/borrowed types.

@scottmcm (Member Author)

On Deref: yeah, I haven't quite thought this one through all the way. I added the "(perhaps indirectly)" parenthetical to try to add some space for it -- like maybe we don't always have Ascii<T: ?Sized>: Deref<Target = T> because we deref the ascii one to something else that then derefs to the original thing.

But thinking about it more as I type, maybe it's just a bad idea. We don't have &String -> &Vec<u8> as a deref coercion -- not even indirectly -- so maybe trying to have it here would be wrong too.

Maybe I should propose Ascii<T: ?Sized>: AsRef<T> as the general thing instead, since that we can definitely do, and we'll be more limited in which things we allow to Deref at all.

@Kimundi (Member) commented Feb 25, 2024

Heh, that actually reminds me of the API I came up with for a toy hex formatting crate of mine: https://docs.rs/easy-hex/1.0.0/easy_hex/struct.Hex.html

Basically, the whole API revolves around being able to cast T to/from Hex<T> just to get changed trait impl semantics. In my case it's just a repr(transparent) wrapper, so I do not change the representation, but the idea still seems similar.

That said, I feel like a basic AsciiChar type is still the best course of action; otherwise it seems like the whole API discussion here has to be started from scratch :D

For the "[AsciiChar] does not implement Debug right" problem, could we maybe provide a struct AsciiSlice([AsciiChar]); and just make it easy to convert to/from the basic slice type? I could imagine that becoming useful for more things than just the debug formatting impl.

@jongiddy (Contributor)

Will it be possible to backwards-compatibly redefine [u8]::escape_ascii to return an Iterator<Item=AsciiChar> instead of the current Iterator<Item=u8>?

Currently the returned iterator provides a to_string() method that collects the characters into a string, but if any other iterators are chained, we lose the info that the bytes are safe to collect into a String.
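To illustrate what gets lost today (this compiles on stable; nothing here is new API): once anything is chained onto escape_ascii, the items are plain u8, so building the String re-checks bytes that are ASCII by construction.

fn quoted_escaped(bytes: &[u8]) -> String {
    let out: Vec<u8> = b"\""
        .iter()
        .copied()
        .chain(bytes.escape_ascii())
        .chain(*b"\"")
        .collect();
    // Redundant in principle, since every byte above is ASCII, but with
    // Item = u8 the type system has no way to know that.
    String::from_utf8(out).unwrap()
}

fn main() {
    assert_eq!(quoted_escaped(b"a\nb"), "\"a\\nb\"");
}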

@programmerjake (Member)

Will it be possible to backwards-compatibly redefine [u8]::escape_ascii to return an Iterator<Item=AsciiChar> instead of the current Iterator<Item=u8>?

yes, but likely only if rust implements something like edition-dependent name resolution where the same name resolves to two different escape_ascii functions depending on the edition.

I previously proposed something like that: https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/Effect.20of.20fma.20disabled/near/274199236

@clarfonthey (Contributor)

This is a random thought I had, but seeing the progress on this makes me wonder if we could make char itself be implemented as a lang item struct instead of a primitive in the language, without breaking anything.

Figuring out what would have to be done to avoid any breakage there could probably be useful here, since the same considerations would also need to be applied to ASCII chars.

@scottmcm (Member Author) commented May 1, 2024

@clarfonthey Practically I think no, because char has exhaustive matching for its particular disjoint range, which isn't something exposable today.

Maybe eventually, with fancy enough matching-transparent pattern types, but I'm not sure it'd ever really be worth it. (For something like str, sure, but char has so much special stuff that I'm skeptical.)

@clarfonthey (Contributor)

Getting back to this feature a bit. Genuine question: instead of the common named variants, now that the precedent has been set for CStr having dedicated c"..." literals, would it be reasonable to add a'x' and a"..." literals? This could also solve the issue with ascii_char_variants being unstable, preferring that you use the literals instead of the variants.

Not sure if implementing this as a POC in unstable would require an RFC or not.
