RFC: Unicode and escape codes in literals #3349
The first part seems like an obviously good thing. The second part "Allow \x… escape codes in regular string literals, as long as they are valid UTF-8." I'm not strongly opposed to but I also can't really think of a use-case that wouldn't be better served by a byte literal? Did you have one in mind? The downsides of the second part are:
|
I don't have serious use cases in mind for \x-encoded UTF-8 in regular string literals, but I also don't think we should disallow it. That is, if we were to design the language from scratch today, I'd argue that forbidding \x in "" just makes things inconsistent and doesn't bring much value. And since it's a backwards-compatible change to make, I think we should make that change to Rust. As mentioned in the RFC, it also helps with macros like |
This RFC talks about byte string literals, e.g. Also, the RFC as written makes it sound like
But something like |
I think the RFC wording is precisely correct here. Today, \x escapes in regular string literals are only allowed if they're ASCII. This RFC expands that to allow everything that is valid UTF-8, which includes ASCII. Edit: or maybe you're saying the RFC should call out what is supported today and phrase it as an expansion. |
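For readers following along, here is a minimal sketch of the status quo described in the comment above, as it behaves on stable Rust. The function names are hypothetical and exist only for illustration; the commented-out line shows the kind of literal that is rejected today.

```rust
/// Bytes of a regular string literal that uses only ASCII `\x` escapes.
fn ascii_hex_escapes() -> &'static [u8] {
    // `\x41` is 'A' and `\x7f` is DEL; both are <= 0x7F, so `""` accepts them.
    "\x41\x7f".as_bytes()
}

/// A byte string literal may contain any byte value via `\x`.
fn byte_string_escapes() -> &'static [u8] {
    b"\xc3\xa9\xff"
}

/// A `\u{...}` escape in a regular string literal is stored as UTF-8.
fn unicode_escape() -> &'static [u8] {
    "\u{e9}".as_bytes()
}

fn main() {
    assert_eq!(ascii_hex_escapes(), &[0x41, 0x7f]);
    assert_eq!(byte_string_escapes(), &[0xc3, 0xa9, 0xff]);
    assert_eq!(unicode_escape(), &[0xc3, 0xa9]);

    // Rejected by the compiler today, even though the two bytes together
    // are the valid UTF-8 encoding of U+00E9:
    // let s = "\xc3\xa9";
}
```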
I think both examples are ambiguously worded, and I read them both as meaning that no \x escapes are allowed at all. Here are possible rewordings that I think would make things clearer.
|
Thanks for all the feedback! I've updated the document. :) |
This seems ready to FCP, apart from incorporating @scottmcm's feedback. |
I've updated the RFC by moving the "validate later" part to the future possibilities section. I'll submit a separate RFC for that. (Edit: started a discussion on Zulip.) @joshtriplett I think this is ready to FCP. :) |
I think this RFC should use the characters and strings table, because it's easy to overlook cases when dealing with prose. Here's the status quo:
Here is what the RFC is proposing, AIUI, with changes in bold, and uncertain changes with
I have numbered some inconsistencies.
Answering all of those questions in the affirmative gives the maximally permissive, maximally consistent table:
One possible drawback is that it becomes more important to understand that vanilla string literals are utf8 encoded while char literals are not. Which means that |
An alternative, more minimal proposal would be this:
Arguments in favour:
Arguments against:
Right now, this is the direction I am leaning in. |
@nnethercote These questions seem to be mixing up a character's codepoint with its UTF-8 representation.
My proposal is basically to just remove the "Characters" and "Escapes" column, and replace it by a requirement that some types of literals must be valid UTF-8. All literals then accept all escape codes, and validation is now about whether the result is valid UTF-8, after the escape codes have been processed. The only open question is what to do with character literals, since multi-character literals are parsed as two lifetimes rather than as an opening and closing quote. But that seems fine to me, because allowing |
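A small sketch of the distinction drawn in the comment above, between a character's code point and its UTF-8 representation, and of what "validate after the escape codes have been processed" could look like at run time. The helpers `code_point` and `is_valid_utf8` are hypothetical names; this is not how the compiler implements the check.

```rust
use std::str;

/// The code point of a `char`, independent of any encoding.
fn code_point(c: char) -> u32 {
    c as u32
}

/// Hypothetical helper: checks bytes as they look *after* escape processing.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

fn main() {
    // U+00E9 has code point 0xE9, but its UTF-8 encoding is two bytes.
    assert_eq!(code_point('\u{e9}'), 0xE9);
    assert_eq!("\u{e9}".as_bytes(), &[0xC3, 0xA9]);

    // The single byte 0xE9 is not valid UTF-8 on its own ...
    assert!(!is_valid_utf8(b"\xe9"));
    // ... while the two-byte sequence 0xC3 0xA9 is, and decodes to U+00E9.
    assert!(is_valid_utf8(b"\xc3\xa9"));
    assert_eq!(str::from_utf8(b"\xc3\xa9"), Ok("\u{e9}"));
}
```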
I have been assuming that UTF-8 encoding is irrelevant for char literals and byte literals, and relevant for the other four kinds of literal. You seem to agree on this for char literals:
What about byte literals? A
This text says "some types of literals", then "all literals", then adds an exception for char literals. Also, "All literals then accept all escape codes" is clearly imprecise because raw string literals and raw byte string literals don't accept any escape codes. The reason I like the table approach is that it forces us to be precise and look at every possibility; it shows there are lots of different possibilities. I find it easier and more precise to think about. |
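To make the point about raw literals concrete, a minimal sketch (function names are hypothetical): in raw string and raw byte string literals a backslash is an ordinary character, so nothing that looks like an escape is processed.

```rust
/// Regular string literal: `\x41` is processed into the single character 'A'.
fn cooked() -> &'static str {
    "\x41"
}

/// Raw string literal: the same four characters are kept verbatim.
fn raw() -> &'static str {
    r"\x41"
}

/// Raw byte string literal: backslash, 'x', 'f', 'f' as four separate bytes.
fn raw_bytes() -> &'static [u8] {
    br"\xff"
}

fn main() {
    assert_eq!(cooked(), "A");
    assert_eq!(raw().len(), 4);
    assert_eq!(raw(), "\\x41");
    assert_eq!(raw_bytes(), b"\\xff");
    // The non-raw byte string processes the escape into one byte.
    assert_eq!(b"\xff".len(), 1);
}
```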
I do think we should permit I'm going to go ahead and propose FCP for this. This does not preclude making further changes to how this information is presented. @rfcbot merge @rfcbot concern raw-byte-strings-with-unicode |
Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members: Concerns:
Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! cc @rust-lang/lang-advisors: FCP proposed for lang, please feel free to register concerns. |
@rfcbot fcp reviewed This makes sense to me. |
Thanks for all the feedback and patience! I finally got around to updating the RFC. I think it's all much clearer now. :) I think both of the concerns registered with the rfcbot are resolved now. |
@rfcbot resolved raw-byte-strings-with-unicode |
@pnkfelix Could you please resolve |
@rfcbot resolved waiting-on-update-re-using-char-and-string-tables |
Are numbers which don't represent a Unicode scalar value excluded from the definition of a Unicode escape (eg The Reference isn't currently very explicit about that (which it can get away with because at present those escapes can only appear in contexts where we promise valid utf-8). I think If they're excluded, I think the "Valid unicode code point" text for validation of character literals is unnecessary (and perhaps misleading), as I think there's no way to write a character literal that would fail that validation rule. |
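As an illustration of the question above, this sketch uses `char::from_u32` to show which numbers are Unicode scalar values. The helper name `is_scalar_value` is hypothetical. Whether the RFC text should define Unicode escapes this way is exactly what the comment asks; the code only shows the run-time analogue of the rule.

```rust
/// Returns true if `n` is a Unicode scalar value, i.e. a valid `char`.
fn is_scalar_value(n: u32) -> bool {
    char::from_u32(n).is_some()
}

fn main() {
    assert!(is_scalar_value(0xE9));
    assert!(is_scalar_value(0xD7FF));
    // Surrogates (U+D800..=U+DFFF) are code points but not scalar values.
    assert!(!is_scalar_value(0xD800));
    assert!(!is_scalar_value(0xDFFF));
    assert!(is_scalar_value(0xE000));
    // U+10FFFF is the largest scalar value; anything above is out of range.
    assert!(is_scalar_value(0x10FFFF));
    assert!(!is_scalar_value(0x110000));

    // The literal form is already rejected at compile time today:
    // let c = '\u{D800}';
}
```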
🔔 This is now entering its final comment period, as per the review above. 🔔 |
The final comment period, with a disposition to merge, as per the review above, is now complete. As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed. This will be merged soon. |
This RFC has been merged, and we've opened a tracking issue: rust-lang/rust#116907 Thanks go out to the authors of this RFC for making Rust better by drafting it and pushing it through to acceptance. |
…, r=<try> Implement RFC 3349, mixed utf8 literals RFC: rust-lang/rfcs#3349 Tracking issue: rust-lang#116907 r? `@ghost`