RFC: Unicode and escape codes in literals #3349

m-ou-se · 2022-11-15T17:47:01Z

Diggsey · 2022-11-15T18:03:28Z

The first part seems like an obviously good thing. I'm not strongly opposed to the second part ("Allow \x… escape codes in regular string literals, as long as they are valid UTF-8."), but I also can't think of a use case that wouldn't be better served by a byte literal. Did you have one in mind?

The downsides of the second part are:

- Possible confusion about why \x** sometimes works in string literals but not at other times (and not consistently for the same byte values).
- Issues with concatenation (e.g. depending on how concat!() is implemented, splitting a Unicode character across two literals may or may not work in practice).
- People accidentally using string literals when they should be using a byte literal, but not realising until they come to write a specific byte value which is not valid UTF-8.

m-ou-se · 2022-11-15T18:08:21Z

I don't have serious use cases in mind for \x-encoded UTF-8 in regular string literals, but I also don't think we should disallow it. That is, if we were to design the language from scratch today, I'd argue that forbidding \x in "" just makes things inconsistent and doesn't bring much value. And since it's a backwards-compatible change to make, I think we should make that change to Rust.

As mentioned in the RFC, it also helps with macros like cstr!("\xff"). (Although I suppose we could still disallow \x escape codes entirely at the point of converting it to a literal AST token.)

nnethercote · 2022-11-15T22:58:39Z

This RFC talks about byte string literals, e.g. b"foo". Does it also need to discuss byte literals, e.g. b'x'?

Also, the RFC as written makes it sound like \x escapes are never allowed in regular string literals.

Allow \x… escape codes in regular string literals, as long as they are valid UTF-8. E.g. "\xf0\x9f\xa6\x80"

Only extend b"", but still don't accept \x in regular string literals ("").

But something like "\x61" is fine, the escape just must be in the range \x00-\x7f. (This isn't an issue with the RFC's intent, just a point of clarification.)

BurntSushi · 2022-11-15T23:21:58Z

Allow \x… escape codes in regular string literals, as long as they are valid UTF-8. E.g. "\xf0\x9f\xa6\x80"

Only extend b"", but still don't accept \x in regular string literals ("").

But something like "\x61" is fine, the escape just must be in the range \x00-\x7f. (This isn't an issue with the RFC's intent, just a point of clarification.)

I think the RFC wording is precisely correct here. Today, x-escapes are only allowed if they're ASCII. This RFC expands it to allow everything that is valid utf8, which includes ASCII.

Edit: or maybe you're saying the rfc should call out what is supported today and phrase it as an expansion.
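
For readers following along, the status quo described here can be checked with a short sketch. This is an illustration written for this thread summary, not code from the RFC; it assumes a stable compiler from the time of the discussion, where \x escapes above \x7f are rejected in "" but accepted in b"".

```rust
fn main() {
    // ASCII-range \x escapes are already accepted in regular string literals.
    let ascii = "\x61\x7f";
    assert_eq!(ascii.as_bytes(), &[0x61u8, 0x7f][..]);

    // Byte string literals accept the full \x00-\xff range.
    let crab_bytes: &[u8] = b"\xf0\x9f\xa6\x80";

    // Those four bytes happen to be the UTF-8 encoding of U+1F980,
    // which is what the RFC's example "\xf0\x9f\xa6\x80" relies on.
    assert_eq!(std::str::from_utf8(crab_bytes), Ok("\u{1F980}"));

    // Writing "\xf0\x9f\xa6\x80" as a regular string literal is the part
    // that is rejected at the time of this discussion, so it is not shown
    // as live code here.
}
```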

nnethercote · 2022-11-15T23:55:10Z

I think both examples are ambiguously worded, and I read them both as meaning that no \x escapes are allowed at all.

Here are possible rewordings that I think would make things clearer.

Allow \x escape codes in the range \x80-\xff in regular string literals, as long as they are valid UTF-8. E.g. "\xf0\x9f\xa6\x80". (Escape codes in the range \x00-\x7f are currently allowed.)

Only extend b"", but still don't accept \x80-\xff in regular string literals ("").

m-ou-se · 2022-11-17T16:39:52Z

Thanks for all the feedback! I've updated the document. :)

joshtriplett · 2022-11-25T02:23:29Z

This seems ready to FCP, apart from incorporating @scottmcm's feedback.

m-ou-se · 2022-11-30T12:55:35Z

Consider splitting the tokenizing part of this out from the rest of the RFC. Given that lang just agreed with this rationale for what should be a tokenizing problem vs a semantic problem (rust-lang/rust#102944 (comment)), we could probably do the "`"\u{D8D8}"` and `"\xFF"` are valid tokens, but not valid values" part of this quickly as it's in-line with existing and recent precedent.

I've updated the RFC by moving the "validate later" part to the future possibilities section. I'll submit a separate RFC for that. (Edit: started a discussion on Zulip.)

@joshtriplett I think this is ready to FCP. :)

nnethercote · 2022-12-02T02:00:12Z

I think this RFC should use the characters and strings table, because it's easy to overlook cases when dealing with prose.

Here's the status quo:

| | Example | # sets* | Characters | Escapes |
|---|---|---|---|---|
| Character | 'H' | 0 | All Unicode | Quote & ASCII & Unicode |
| String | "hello" | 0 | All Unicode | Quote & ASCII & Unicode |
| Raw string | r#"hello"# | <256 | All Unicode | N/A |
| Byte | b'H' | 0 | All ASCII | Quote & Byte |
| Byte string | b"hello" | 0 | All ASCII | Quote & Byte |
| Raw byte string | br#"hello"# | <256 | All ASCII | N/A |

Here is what the RFC is proposing, AIUI, with changes in bold, and uncertain changes with ?:

| | Example | # sets* | Characters | Escapes |
|---|---|---|---|---|
| Character | 'H' | 0 | All Unicode | Quote & ASCII [1] & Unicode |
| String | "hello" | 0 | All Unicode | Quote & **Byte** & Unicode |
| Raw string | r#"hello"# | <256 | All Unicode | N/A |
| Byte | b'H' | 0 | All ASCII [2] | Quote & Byte & **Unicode**? [3] |
| Byte string | b"hello" | 0 | **All Unicode** | Quote & Byte & **Unicode** |
| Raw byte string | br#"hello"# | <256 | All ASCII [4] | N/A |

I have numbered some inconsistencies.

[1] Should this be Byte? '¥' is already allowed. Why not '\xa5', its equivalent?
[2] Should this be All Unicode? b'\xa5' is already allowed. Why not b'¥'?
[3] (Already suggested) b\xa5 is already allowed. Why not b\u{a5}?
[4] Should this be All Unicode? If we are going to allow b"¥¥¥", why not allow br"¥¥¥"?

Answering all of those questions in the affirmative gives the maximally permissive, maximally consistent table:

| | Example | # sets* | Characters | Escapes |
|---|---|---|---|---|
| Character | 'H' | 0 | All Unicode | Quote & Byte & Unicode |
| String | "hello" | 0 | All Unicode | Quote & Byte & Unicode |
| Raw string | r#"hello"# | <256 | All Unicode | N/A |
| Byte | b'H' | 0 | All Unicode | Quote & Byte & Unicode |
| Byte string | b"hello" | 0 | All Unicode | Quote & Byte & Unicode |
| Raw byte string | br#"hello"# | <256 | All Unicode | N/A |

One possible drawback is that it becomes more important to understand that vanilla string literals are utf8 encoded while char literals are not. Which means that '\xa5' would be valid, while "\xa5" would not. But this is just a slight extension of the existing drawback of this RFC as written.

nnethercote · 2022-12-02T02:46:15Z

An alternative, more minimal proposal would be this:

| | Example | # sets* | Characters | Escapes |
|---|---|---|---|---|
| Character | 'H' | 0 | All Unicode | Quote & ASCII & Unicode |
| String | "hello" | 0 | All Unicode | Quote & ASCII & Unicode |
| Raw string | r#"hello"# | <256 | All Unicode | N/A |
| Byte | b'H' | 0 | All ASCII | Quote & Byte |
| Byte string | b"hello" | 0 | All Unicode | Quote & Byte & Unicode |
| Raw byte string | br#"hello"# | <256 | All Unicode | N/A |

Arguments in favour:

- Still solves the primary motivation: "byte string literals are currently not a superset of regular string literals".
- A near-minimal change.
  - The minimal change would leave raw byte string literals alone, but that seems silly. Byte string literals and raw byte string literals should behave the same except for escape handling.
- All other possible changes could cause confusion, without a clear benefit.

Arguments against:

- Byte literals are the odd one out. But then, they already are the odd one out. E.g. they're not a superset of char literals the way byte string literals are a superset of string literals.

Right now, this is the direction I am leaning in.
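
As a rough illustration of the "not a superset" motivation quoted above: at the time of this discussion, non-ASCII characters cannot be written directly in b"" or br"", so the same bytes have to be spelled with escapes or derived from a str. The constant names below are made up for this sketch.

```rust
// Hypothetical constants, named for this example only.
// Deriving the bytes from a regular string literal works today ...
const YEN_FROM_STR: &[u8] = "¥¥¥".as_bytes();
// ... as does spelling out the UTF-8 encoding with \x escapes.
const YEN_ESCAPED: &[u8] = b"\xc2\xa5\xc2\xa5\xc2\xa5";

fn main() {
    // Both spellings denote the same six bytes; the table above would
    // additionally let b"¥¥¥" and br"¥¥¥" mean the same thing.
    assert_eq!(YEN_FROM_STR, YEN_ESCAPED);
    assert_eq!(YEN_FROM_STR.len(), 6);
}
```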

m-ou-se · 2022-12-06T21:47:51Z

[1] Should this be Byte? '¥' is already allowed. Why not '\xa5', its equivalent?

[2] Should this be All Unicode? b'\xa5' is already allowed. Why not b'¥'?

[3] (Already suggested) b\xa5 is already allowed. Why not b\u{a5}?

[4] Should this be All Unicode? If we are going to allow b"¥¥¥", why not allow br"¥¥¥"?

@nnethercote These questions seem to be mixing up a character's codepoint with its UTF-8 representation.

\xa5 is invalid unicode. '¥' has codepoint 0xa5, which in UTF-8 is encoded as two bytes: "\xc2\xa5".

b'¥' doesn't work because that's two bytes, not one. Same for b'\u{a5}'.

b'\u{30}' should be fine though. That's just a single byte as UTF-8 (so, ascii).
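
The distinction between a code point and its UTF-8 encoding can be verified with standard library calls. The snippet below is only a sketch added for illustration; it uses `char::encode_utf8`, `char::len_utf8` and `std::str::from_utf8`, and nothing in it is specific to the RFC.

```rust
fn main() {
    let yen = '¥';

    // The code point of '¥' is 0xa5 ...
    assert_eq!(yen as u32, 0xa5);

    // ... but its UTF-8 encoding is the two bytes 0xc2 0xa5.
    let mut buf = [0u8; 4];
    let encoded = yen.encode_utf8(&mut buf);
    assert_eq!(encoded.as_bytes(), &[0xc2u8, 0xa5][..]);

    // The lone byte 0xa5 is not valid UTF-8 on its own.
    assert!(std::str::from_utf8(&[0xa5]).is_err());

    // U+0030 is ASCII, so it encodes to a single byte and fits in a u8.
    assert_eq!('\u{30}'.len_utf8(), 1);
    assert_eq!('\u{30}' as u32, u32::from(b'0'));
}
```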

I think this RFC should use the characters and strings table

My proposal is basically to just remove the "Characters" and "Escapes" column, and replace it by a requirement that some types of literals must be valid UTF-8. All literals then accept all escape codes, and validation is now about whether the result is valid UTF-8, after the escape codes have been processed.

The only open question is what to do with character literals, since multi-character literals are parsed as two lifetimes rather than as an opening and closing quote. But that seems fine to me, because allowing '\xc2\xa5' would be a bit weird anyway, considering that char doesn't store UTF-8. ('\x30' or '¥' is fine though.)
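
The "process escapes first, then validate the result" idea can be sketched in ordinary Rust. This is not how rustc implements it; `unescape_to_bytes` and `validate_as_str` are hypothetical helpers that handle only \xNN and \u{...} and panic on anything else.

```rust
use std::string::FromUtf8Error;

/// Hypothetical helper: expands `\xNN` and `\u{...}` escapes into raw bytes.
/// Other escapes and graceful error reporting are deliberately left out.
fn unescape_to_bytes(src: &str) -> Vec<u8> {
    let mut out = Vec::new();
    let mut chars = src.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            // Unescaped characters contribute their UTF-8 encoding.
            let mut buf = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            continue;
        }
        match chars.next() {
            Some('x') => {
                // \xNN contributes exactly one byte, whatever its value.
                let hex: String = chars.by_ref().take(2).collect();
                out.push(u8::from_str_radix(&hex, 16).expect("two hex digits"));
            }
            Some('u') => {
                // \u{...} contributes the UTF-8 encoding of a scalar value.
                assert_eq!(chars.next(), Some('{'));
                let hex: String = chars.by_ref().take_while(|&d| d != '}').collect();
                let cp = u32::from_str_radix(&hex, 16).expect("hex code point");
                let ch = char::from_u32(cp).expect("Unicode scalar value");
                let mut buf = [0u8; 4];
                out.extend_from_slice(ch.encode_utf8(&mut buf).as_bytes());
            }
            other => panic!("escape not handled in this sketch: {:?}", other),
        }
    }
    out
}

/// Validation happens only after all escapes have been processed.
fn validate_as_str(src: &str) -> Result<String, FromUtf8Error> {
    String::from_utf8(unescape_to_bytes(src))
}

fn main() {
    // Two \x escapes that together form valid UTF-8 are accepted.
    assert_eq!(validate_as_str(r"\xc2\xa5").unwrap(), "¥");
    // A lone \xa5 yields invalid UTF-8 and is rejected.
    assert!(validate_as_str(r"\xa5").is_err());
    // A byte-string-like literal would simply skip the validation step.
    assert_eq!(unescape_to_bytes(r"\xa5"), vec![0xa5]);
}
```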

nnethercote · 2022-12-07T01:40:41Z

I have been assuming that UTF-8 encoding is irrelevant for char literals and byte literals, and relevant for the other four kinds of literal.

You seem to agree on this for char literals:

The only open question is what to do with character literals... allowing '\xc2\xa5' would be a bit weird anyway, considering that char doesn't store UTF-8.

What about byte literals? A u8 also doesn't store UTF-8. With that in mind, I think my questions above do make sense.

My proposal is basically to just remove the "Characters" and "Escapes" column, and replace it by a requirement that some types of literals must be valid UTF-8. All literals then accept all escape codes, and validation is now about whether the result is valid UTF-8, after the escape codes have been processed.

The only open question is what to do with character literals...

This text says "some types of literals", then "all literals", then adds an exception for char literals. Also, "All literals then accept all escape codes" is clearly imprecise because raw string literals and raw byte string literals don't accept any escape codes.

The reason I like the table approach is that it forces us to be precise and look at every possibility; it shows there are lots of different possibilities. I find it easier and more precise to think about.

joshtriplett · 2023-01-19T10:25:31Z

I do think we should permit br"¥¥¥", but I don't think we should make any of the other changes proposed in that table, for the reasons @m-ou-se stated.

I'm going to go ahead and propose FCP for this. This does not preclude making further changes to how this information is presented.

@rfcbot merge

@rfcbot concern raw-byte-strings-with-unicode

rfcbot · 2023-01-19T10:25:32Z

Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members:

Concerns:

~~raw-byte-strings-with-unicode~~ resolved by RFC: Unicode and escape codes in literals #3349 (comment)
~~waiting-on-update-re-using-char-and-string-tables~~ resolved by RFC: Unicode and escape codes in literals #3349 (comment)

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

cc @rust-lang/lang-advisors: FCP proposed for lang, please feel free to register concerns.
See this document for info about what commands tagged team members can give me.

nikomatsakis · 2023-04-04T18:09:43Z

@rfcbot fcp reviewed

This makes sense to me.

m-ou-se · 2023-08-10T09:26:40Z

Thanks for all the feedback and patience! I finally got around to updating the RFC. I think it's all much clearer now. :)

I think both of the concerns registered with the rfcbot are resolved now.

joshtriplett · 2023-08-22T18:14:41Z

@rfcbot resolved raw-byte-strings-with-unicode

joshtriplett · 2023-08-22T18:20:28Z

@pnkfelix Could you please resolve waiting-on-update-re-using-char-and-string-tables?

pnkfelix · 2023-08-22T18:31:16Z

@rfcbot resolved waiting-on-update-re-using-char-and-string-tables

joshtriplett · 2023-08-22T18:48:38Z

Pinging @pnkfelix, @scottmcm, and @tmandry for checkboxes, now that concerns have been resolved.

mattheww · 2023-08-23T18:39:19Z

Are numbers which don't represent a Unicode scalar value excluded from the definition of a Unicode escape (eg \u{DC00} or \u{FFFFFF})?

The Reference isn't currently very explicit about that (which it can get away with because at present those escapes can only appear in contexts where we promise valid utf-8). I think \u{DC00} does have a natural interpretation in a byte string.

If they're excluded, I think the "Valid unicode code point" text for validation of character literals is unnecessary (and perhaps misleading), as I think there's no way to write a character literal that would fail that validation rule.
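
To illustrate the question, the sketch below checks which numbers are Unicode scalar values using `char::from_u32`, and shows one possible reading of a "natural interpretation" of \u{DC00} in a byte string: applying the usual three-byte UTF-8 bit layout without the surrogate check. That reading is an assumption made for this example, and `encode_three_bytes` is a hypothetical helper.

```rust
/// Hypothetical helper: the three-byte UTF-8 bit layout, applied to any
/// value in 0x800..=0xFFFF without rejecting surrogates.
fn encode_three_bytes(cp: u32) -> [u8; 3] {
    assert!((0x800..=0xFFFF).contains(&cp));
    [
        0xE0 | (cp >> 12) as u8,
        0x80 | ((cp >> 6) & 0x3F) as u8,
        0x80 | (cp & 0x3F) as u8,
    ]
}

fn main() {
    // Surrogates and out-of-range values are not Unicode scalar values.
    assert_eq!(char::from_u32(0xDC00), None);
    assert_eq!(char::from_u32(0xFF_FFFF), None);
    assert_eq!(char::from_u32(0xA5), Some('¥'));

    // The generalized encoding of 0xDC00 is well defined as bytes ...
    let bytes = encode_three_bytes(0xDC00);
    assert_eq!(bytes, [0xED, 0xB0, 0x80]);

    // ... but those bytes are not valid UTF-8, so they could only ever
    // appear in a context that does not promise valid UTF-8.
    assert!(std::str::from_utf8(&bytes).is_err());
}
```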

rfcbot · 2023-09-12T18:57:28Z

🔔 This is now entering its final comment period, as per the review above. 🔔

rfcbot · 2023-09-22T19:13:47Z

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

traviscross · 2023-10-19T02:25:21Z

This RFC has been merged, and we've opened a tracking issue: rust-lang/rust#116907

Thanks go out to the authors of this RFC for making Rust better by drafting it and pushing it through to acceptance.

Add mixed utf8 literals rfc.

392a290

m-ou-se added T-lang Relevant to the language team, which will review and decide on the RFC. A-syntax Syntax related proposals & ideas labels Nov 15, 2022

m-ou-se mentioned this pull request Nov 15, 2022

RFC: c"…" string literals #3348

Merged

Add RFC number.

4038665

scottmcm reviewed Nov 15, 2022

m-ou-se added 2 commits November 17, 2022 19:09

Update.

3409a49

Updateee.

679c165

Move the "validate later" part to future possibilities.

3892774

rfcbot added proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. disposition-merge This RFC is in PFCP or FCP with a disposition to merge it. labels Jan 19, 2023

joshtriplett reviewed Jan 19, 2023

juntyr mentioned this pull request Feb 6, 2023

Rusty byte strings in RON, deprecate base64 (byte) strings ron-rs/ron#438

Merged

10 tasks

scottmcm reviewed Feb 16, 2023

petrochenkov mentioned this pull request Mar 6, 2023

Implement RFC 3348, c"foo" literals rust-lang/rust#108801

Merged

m-ou-se changed the title ~~RFC: UTF-8 characters and escape codes in (byte) string literals~~ RFC: Unicode and and escape codes in literals Aug 10, 2023

Update.

bf9a91e

m-ou-se requested a review from a team August 14, 2023 09:31

m-ou-se added A-string Proposals relating to strings. I-lang-nominated Indicates that an issue has been nominated for prioritizing at the next lang team meeting. labels Aug 14, 2023

joshtriplett approved these changes Sep 12, 2023

rfcbot added the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. label Sep 12, 2023

rfcbot removed the proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. label Sep 12, 2023

shepmaster changed the title ~~RFC: Unicode and and escape codes in literals~~ RFC: Unicode and escape codes in literals Sep 13, 2023

tmandry removed the I-lang-nominated Indicates that an issue has been nominated for prioritizing at the next lang team meeting. label Sep 19, 2023

rfcbot added finished-final-comment-period The final comment period is finished for this RFC. to-announce and removed final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. labels Sep 22, 2023

traviscross mentioned this pull request Oct 18, 2023

Tracking Issue for unicode and escape codes in literals rust-lang/rust#116907

Open

4 tasks

traviscross merged commit bf9a91e into rust-lang:master Oct 18, 2023

polarathene mentioned this pull request Oct 19, 2023

JSONC parser fails to correctly parse non-BMP escape sequences dprint/jsonc-parser#31

Open

m-ou-se deleted the mixed-utf8-literals branch December 7, 2023 12:07

petrochenkov mentioned this pull request Jan 12, 2024

Delay literal unescaping rust-lang/rust#118699

Closed

nnethercote mentioned this pull request Jan 23, 2024

Implement RFC 3349, mixed utf8 literals rust-lang/rust#120286

Draft

bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 25, 2024

Auto merge of rust-lang#120286 - nnethercote:3349-mixed-utf8-literals…

b626f8d

…, r=<try> Implement RFC 3349, mixed utf8 literals RFC: rust-lang/rfcs#3349 Tracking issue: rust-lang#116907 r? `@ghost`

nnethercote mentioned this pull request Jun 2, 2024

Reserve guarded string literals (RFC 3593) rust-lang/rust#123951

Merged
