
Delay literal unescaping #118699

Closed

Conversation

nnethercote
Contributor

r? @ghost

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Dec 7, 2023
@rustbot
Collaborator

rustbot commented Dec 7, 2023

These commits modify the Cargo.lock file. Unintentional changes to Cargo.lock can be introduced when switching branches and rebasing PRs.

If this was unintentional then you should revert the changes before this PR is merged.
Otherwise, you can ignore this comment.

@nnethercote nnethercote marked this pull request as draft December 7, 2023 08:10
@nnethercote nnethercote force-pushed the delay-literal-unescaping branch 2 times, most recently from 536b970 to 8f12cac on December 8, 2023 06:40
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request Dec 9, 2023
…e1-dead

Unescaping cleanups

Minor improvements I found while working on rust-lang#118699.

r? `@fee1-dead`
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Dec 9, 2023
Rollup merge of rust-lang#118734 - nnethercote:literal-cleanups, r=fee1-dead

Unescaping cleanups

Minor improvements I found while working on rust-lang#118699.

r? `@fee1-dead`

@nnethercote
Contributor Author

nnethercote commented Dec 12, 2023

@rust-lang/lang: This PR proposes a backward-compatible language change.

It changes some errors related to literals from syntactic to semantic. In the compiler, these are all the errors in the `EscapeError` type. They all relate to char/byte/string literals, mostly to escape sequences, plus a couple of others (e.g. having zero or more than one char/byte in a char/byte literal, or having a NUL char in a C string literal).

Here is example code showing how things change. This covers some of the relevant errors, but not all; the missing ones would be affected in the same way.

 fn main() {
    '';            //~ error: empty character literal
    b'ab';         //~ error: character literal may only contain one codepoint
    "\a";          //~ error: unknown character escape: `a`
    b"\xzz";       //~ error: invalid character in numeric character escape
    "\u20";        //~ error: incorrect unicode escape sequence
    c"\u{999999}"; //~ error: invalid unicode character escape
}

sink! {
    '';             // was an error, now allowed
    b'ab';          // was an error, now allowed
    "\a";           // was an error, now allowed
    b"\xzz";        // was an error, now allowed
    "\u20";         // was an error, now allowed
    c"\u{999999}";  // was an error, now allowed
};  

#[cfg(FALSE)]
fn configured_out() {
    '';             // was an error, now allowed
    b'ab';          // was an error, now allowed
    "\a";           // was an error, now allowed
    b"\xzz";        // was an error, now allowed
    "\u20";         // was an error, now allowed
    c"\u{999999}";  // was an error, now allowed
}

This means a macro can assign meaning to arbitrary escape sequences, such as foo!("\a\b\c").
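(The `sink!` macro in the example above isn't defined there; a minimal definition, purely illustrative and not part of this PR, is just a macro that accepts arbitrary tokens and expands to nothing:)

macro_rules! sink {
    // Match any sequence of token trees and discard them. Under the behaviour
    // proposed in this PR, literals passed to such a macro are tokenized but
    // never lowered, so their escapes are never checked.
    ($($tt:tt)*) => {};
}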

This change is consistent with a general trend of delayed literal checking:

  • Floating point literals have had this delayed checking behaviour for a long time (forever?)
  • #102944 did a similar thing for some numeric literal checks.
  • The NUL char checking for C string literals is already implemented in this semantic (delayed) fashion.

cc @m-ou-se

@nnethercote nnethercote added the I-lang-nominated The issue / PR has been nominated for discussion during a lang team meeting. label Dec 12, 2023
@petrochenkov petrochenkov self-assigned this Dec 12, 2023
@scottmcm
Member

scottmcm commented Dec 12, 2023

Some parts of this 💯 agreed. For example, "\u{999999}" working lexically seems entirely appropriate, along the lines of the numeric literal things we decided before.

I do wonder, though, whether we want to leave some of the lexical space intentionally reserved here. The above says

This means a macro can assign meaning to arbitrary escape sequences,

But that restricts the possibilities of new lexically-meaningful escape sequences in future.

Imagine, say, that we wanted to one day allow something like "hello \{foo("world")}!". Supporting that is a non-breaking change today, since \{ is a lexical error, but if I understand correctly it'd be a breaking change after doing this, since it'd change 1 token to 3 tokens.

Relatedly, 'ab' working lexically seems fine, I guess, but it's not obvious to me that '' should. That just not being a token seems fine? I don't see how there's an "unescaping" complication with it like there is for some of the other cases here. The regex using + instead of * doesn't seem like a "semantic" check to me.

And we've discussed things like using ''a as a variant of lifetimes before. (IIRC it was to give two different scopes of lifetimes for the in-band lifetimes discussions.) So even if we don't have any plans to actually use that today, it's not clear to me that we should have it start tokenizing now, vs staying a lexical error. Letting it potentially be a single token in future, without needing a breaking change, rather than the two it would be in this PR (if I'm understanding right) could potentially be valuable.

To try to make another example, I'm definitely happy with "\u{111111}" and '\u{111111}' and "\u{11_1111}". But I'd be very skeptical of lexically allowing, say, "\u{11'1111}" or particularly '\u{11'1111}', even though we could.


TL;DR: I agree with delayed semantic checking, but my instinct is that in-the-straightforward-lexing-regex restrictions are different from semantic checks.


@nnethercote
Contributor Author

For completeness, here is a program exhibiting all the relevant literal errors. The comments describe the current behaviour.

fn main() {
    // ---- SYNTACTIC (LEXING) ERRORS ----

    '';             // ZeroChars

    'ab';           // MoreThanOneChar

    "\x\\";         // InvalidCharInHexEscape + LoneSlash (a weird one)

    "\a";           // InvalidEscape

    "a^Mb";         // BareCarriageReturn
    r"a^Mb";        // BareCarriageReturnInRawString

    '   ';          // (TAB) EscapeOnlyChar

    "\x1";          // TooShortHexEscape

    "\xz";          // InvalidCharInHexEscape

    "\xff";         // OutOfRangeHexEscape

    "\u1234";       // NoBraceInUnicodeEscape

    "\u{xyz}";      // InvalidCharInUnicodeEscape

    "\u{}";         // EmptyUnicodeEscape

    "\u{1234";      // UnclosedUnicodeEscape

    "\u{_123}";     // LeadingUnderscoreUnicodeEscape

    "\u{1234567}";  // OverlongUnicodeEscape

    "\u{dfff}";     // LoneSurrogateUnicodeEscape

    "\u{123456}";   // OutOfRangeUnicodeEscape

    b"\u{1234}";    // UnicodeEscapeInByte

    b"🦀";          // NonAsciiCharInByte

    // ---- SEMANTIC (DELAYED) ERRORS ----

    "abc"xyz;       // InvalidSuffix

    1xyz;           // InvalidIntSuffix
    1u20;           // InvalidIntSuffix

    1.0xyz;         // InvalidFloatSuffix
    1.0f20;         // InvalidFloatSuffix

    0b10f32;        // NonDecimalFloat
    0o10f32;        // NonDecimalFloat

    0x100000000000000000000000000000000; // IntTooLarge

    c"a \0 b";      // NulInCStr 
}

This PR would make all of them delayed, i.e. compiling this program with -Zparse-only would produce no errors.

I'm going to argue for one of two outcomes here.

  • Accept this PR. It makes for a consistent story: all literal checking is delayed. But it does preclude (or at least make more difficult) the addition of some hypothetical future escape syntax.
  • Reject this PR, but change NulInCStr to a lexer error. It's a weird exception at the moment, being the only string literal error that is delayed. I'd be happy to implement that change in a separate PR. (In fact, it's already implemented as the "Detect NulInCStr error earlier" commit in this PR). I think there is still time for this because C string literals haven't reached the release channel yet.

I don't want to end up in a situation where some string literal errors are lexer-level and some are delayed (e.g. \u{999999}). I think that's ugly from both a language design POV and an implementation POV. The numeric literals are already like that, which is unfortunate, but let's not make it worse.


@scottmcm
Member

I think that's ugly from both a language design POV and an implementation POV.

I'm not convinced by this everywhere.

If I look at https://spec.ferrocene.dev/lexical-elements.html#character-literals for example, that core of

CharacterLiteral ::=
    ' CharacterContent '

seems entirely reasonable, and changing that to

CharacterLiteral ::=
    ' CharacterContent+ '

(or to *) doesn't seem like that big of a difference to me. If anything, it might be easier to not have the repeat there: that's one fewer thing to specify later, since it's lexically one codepoint automatically.

Whereas there are things like

A UnicodeEscape starts with a \u{ literal, followed by 1 to 6 instances of a HexadecimalDigit, inclusive, followed by a } character. It can represent any Unicode codepoint between U+00000 and U+10FFFF, inclusive, except Unicode surrogate codepoints, which exist between the range of U+D800 and U+DFFF, inclusive.

where I do agree with this PR that having it be just either \\u\{\x{1,6}\} or \\u\{\x+\} from a lexical sense would be a nice simplification, moving the USV requirements out to a semantic check instead.
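(A rough sketch of that split, independent of rustc's actual implementation: the "lexical" side only checks the shape of the escape, and the value check happens later; conveniently, `char::from_u32` already rejects surrogates and values above U+10FFFF.)

// Illustrative only: loose shape check first, then a semantic value check.
fn check_unicode_escape(digits: &str) -> Result<char, String> {
    // "Lexical" part: just hex digits between the braces of `\u{...}`.
    if digits.is_empty() || !digits.chars().all(|c| c.is_ascii_hexdigit()) {
        return Err("malformed unicode escape".to_string());
    }
    // "Semantic" part: range and surrogate checks on the numeric value.
    let value = u32::from_str_radix(digits, 16).map_err(|e| e.to_string())?;
    char::from_u32(value).ok_or_else(|| format!("{value:#x} is not a valid char"))
}

fn main() {
    assert_eq!(check_unicode_escape("1f980"), Ok('🦀'));
    assert!(check_unicode_escape("dfff").is_err());   // lone surrogate
    assert!(check_unicode_escape("999999").is_err()); // out of range
}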

@nnethercote
Contributor Author

nnethercote commented Dec 12, 2023

I do agree with this PR that having it be just either \\u\{\x{1,6}\} or \\u\{\x+\} from a lexical sense would be a nice simplification, moving the USV requirements out to a semantic check instead.

It would simplify the lexical specification, at the cost of requiring an additional semantic check later. Overall that's less simple, IMO.

Likewise, from the implementation POV, it's nice if all the char/byte/string literal checking is in one place, be it the lexer or HIR lowering. Having some checks in the lexer and others in HIR lowering is worse.

@nnethercote nnethercote marked this pull request as ready for review December 14, 2023 00:06
bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 16, 2023
…, r=fee1-dead

More unescaping cleanups

More minor improvements I found while working on rust-lang#118699.

r? `@fee1-dead`

@nnethercote nnethercote force-pushed the delay-literal-unescaping branch 2 times, most recently from 1e49114 to 96e5189 on December 18, 2023 05:04

By making it an `EscapeError` instead of a `LitError`. This makes it
more like the other errors produced during unescaping.

NOTE: this means these errors are issued earlier, before expansion,
which changes behaviour. The next commit will delay issue of this error
and others, reverting the behaviour change for this particular error.

One nice thing about this: the old approach had some code in
`report_lit_error` to calculate the span of the nul char from a range.
This code used a hardwired `+2` to account for the `c"` at the start of
a C string literal, but it should have used `+3` for raw C string
literals to account for the `cr"`, which meant that the caret in `cr"`
nul error messages was one character short of where it should have
been. The new approach doesn't need any of this and avoids the
off-by-one error.
It's a more logical spot for it, and will be a big help for the next
commit.

Doing this creates a new dependency from `rustc_ast_lowering` on
`rustc_parse`, but `rustc_ast_lowering` is clearly higher up the crate
graph, so this isn't a big deal.

One thing in favour of this change is that two fluent labels were
duplicated across `rustc_session` and `rustc_parse`:
`invalid_literal_suffix` and `parse_not_supported`. This duplication is
now gone, which is nice evidence that this is a reasonable change.
Currently string literals are unescaped twice.

- Once during lexing in `cook_quoted`/`cook_c_string`/`cook_common`.
  This one just checks for errors.

- Again in `LitKind::from_token_lit`, which is called when lowering AST
  to HIR, and also in a few other places during expansion. This one
  actually constructs the unescaped string. It also has error checking
  code, but that part of the code is actually dead (and has several
  bugs) because the check during lexing catches all errors!

This commit removes the error-check-only unescaping during lexing, and
fixes up `LitKind::from_token_lit` so it properly does both checking and
construction. This is a backwards-compatible language change: some
programs now compile that previously did not. For example, it is now
possible for macros to consume "invalid" string literals like "\a\b\c".
This is a continuation of a trend of delaying semantic error checking of
literals to after expansion:
- rust-lang#102944 did this for some cases for numeric literals
- The detection of NUL chars in C string literals is already delayed in
  this way.

Notes about test changes:

- `ignore-block-help.rs`: this requires a parse error for the test to
  work. The error used was an unescaping error, which is now delayed to
  after parsing. So the commit changes it to an "unterminated character
  literal" error which still occurs during parsing.

- `tests/ui/lexer/error-stage.rs`: this shows the newly allowed cases,
  due to delayed literal unescaping.

- Several tests had unescaping errors combined with unterminated literal
  errors. The former are now delayed but the latter remain as lexing
  errors, so the unterminated literal part needed to be split into a
  separate test file; otherwise compilation would end before the other
  errors were reported.

- issue-62913.rs: The structure and output changed a bit. Issue rust-lang#62913
  was about an ICE due to an unterminated string literal, so the new
  version should be good enough.

- `literals-are-validated-before-expansion.rs`: this tests exactly the
  behaviour that has been changed, and so was removed.

- A couple of other tests produce the same errors, just in a different
  order.
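(To make the "unescape once, after expansion" structure described above concrete, here is a minimal, self-contained sketch; it is not rustc code and the names are invented for illustration.)

#[derive(Debug, PartialEq)]
enum UnescapeError {
    InvalidEscape(char),
    LoneSlash,
}

// Single unescaping pass over the literal's raw text (the contents between
// the quotes, kept verbatim by the lexer): it both constructs the value and
// reports errors. Literals discarded during expansion (e.g. inside a
// token-discarding macro or `#[cfg(FALSE)]`) are never passed to it.
fn unescape(raw: &str) -> Result<String, UnescapeError> {
    let mut out = String::new();
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('\\') => out.push('\\'),
            Some('"') => out.push('"'),
            Some(other) => return Err(UnescapeError::InvalidEscape(other)),
            None => return Err(UnescapeError::LoneSlash),
        }
    }
    Ok(out)
}

fn main() {
    assert_eq!(unescape(r"a\tb").unwrap(), "a\tb");
    // Reported only if this literal survives expansion and gets lowered.
    assert_eq!(unescape(r"\a"), Err(UnescapeError::InvalidEscape('a')));
}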
@rust-log-analyzer
Collaborator

The job mingw-check failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
GITHUB_ENV=/home/runner/work/_temp/_runner_file_commands/set_env_3a5a031f-968a-478b-b431-02e86fc1bc96
GITHUB_EVENT_NAME=pull_request
GITHUB_EVENT_PATH=/home/runner/work/_temp/_github_workflow/event.json
GITHUB_GRAPHQL_URL=https://api.github.com/graphql
GITHUB_HEAD_REF=delay-literal-unescaping
GITHUB_JOB=pr
GITHUB_PATH=/home/runner/work/_temp/_runner_file_commands/add_path_3a5a031f-968a-478b-b431-02e86fc1bc96
GITHUB_REF=refs/pull/118699/merge
GITHUB_REF_NAME=118699/merge
GITHUB_REF_PROTECTED=false
---
    Checking tracing v0.1.40
    Checking ra-ap-rustc_abi v0.21.0
    Checking rustc-dependencies v0.0.0 (/checkout/src/tools/rust-analyzer/crates/rustc-dependencies)
    Checking parser v0.0.0 (/checkout/src/tools/rust-analyzer/crates/parser)
error[E0004]: non-exhaustive patterns: `EscapeError::NulInCStr` not covered
    |
343 |     match error {
    |           ^^^^^ pattern `EscapeError::NulInCStr` not covered
    |
note: `EscapeError` defined here
   --> /checkout/compiler/rustc_lexer/src/unescape.rs:16:1
16  | pub enum EscapeError {
    | ^^^^^^^^^^^^^^^^^^^^
...
63  |     NulInCStr,
    |     --------- not covered
    = note: the matched value is of type `EscapeError`
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern or an explicit pattern as shown
    |
375 ~         EscapeError::MultipleSkippedLinesWarning => "",
376 ~         EscapeError::NulInCStr => todo!(),

   Compiling chalk-derive v0.95.0
    Checking rust-analyzer-salsa v0.17.0-pre.4
For more information about this error, try `rustc --explain E0004`.

@traviscross
Contributor

@rustbot labels -I-lang-nominated

We discussed this in the T-lang triage meeting on 2023-12-20. There were parts of this that the members liked (e.g. allowing an arbitrary string of 9s), and parts that the members did not (e.g. allowing '').

Our impression was that @nnethercote would prefer to go fully in one direction or the other here. In our discussion, we were somewhat open to and curious about the direction of going the other way and rejecting syntactically more of those things that are semantically invalid. @nnethercote, what are your thoughts on the desirability and feasibility of this?

Alternatively, if we were to go in the direction of relaxing the syntactic rules, we were interested in perhaps breaking these out. @nnethercote, is this something you'd consider doing, and if so, what categories might you propose?

Please renominate this for us with your answers.

@rustbot rustbot removed the I-lang-nominated The issue / PR has been nominated for discussion during a lang team meeting. label Dec 21, 2023
@bors
Contributor

bors commented Dec 22, 2023

☔ The latest upstream changes (presumably #119097) made this pull request unmergeable. Please resolve the merge conflicts.

@nnethercote
Copy link
Contributor Author

nnethercote commented Dec 22, 2023

Our impression was that @nnethercote would prefer to go fully in one direction or the other here. In our discussion, we were somewhat open to and curious about the direction of going the other way and rejecting syntactically more of those things that are semantically invalid. @nnethercote, what are your thoughts on the desirability and feasibility of this?

I'm going to switch terminology from syntactic/semantic to lex-time/post-expansion, because I think the definition of what's syntactic vs. semantic is open to interpretation, and what's important is when the check occurs, not which conceptual category each check belongs to.

In general, you can't move post-expansion checks to lex-time, because that could break code that currently compiles. The one exception is NulInCStr, because that hasn't yet stabilized. I am advocating for moving that earlier in #119172 and this Zulip thread, at least in the short term, for consistency with all the other byte/char/string literal checks. That needs to be done before stabilization occurs.

Moving a check from lex-time to post-expansion can be done at any time, because it's a backwards compatible change.

Alternatively, if we were to go in the direction of relaxing the syntactic rules, we were interested in perhaps breaking these out. @nnethercote, is this something you'd consider doing, and if so, what categories might you propose?

Do you mean break them into categories? I don't particularly want to do that, because it would encourage a conclusion where some checks are done at lex-time and some post-expansion. (The attraction of this PR is that it moves all the checks to post-expansion, for maximum consistency.) But if you put a gun to my head I'd look at this comment above and come up with:

  • Wrong length: ZeroChars, MoreThanOneChar
  • Disallowed char: BareCarriageReturn, BareCarriageReturnInRawString, NonAsciiCharInByte, EscapeOnlyChar
  • Invalid escape: all the others

@traviscross
Contributor

traviscross commented Dec 27, 2023

@nnethercote:

In general, you can't move post-expansion checks to lex-time, because that could break code that currently compiles.

We do have an edition coming up. If we could do it, what things would we want to change to achieve consistency?

@traviscross traviscross added the T-lang Relevant to the language team, which will review and decide on the PR/issue. label Jan 6, 2024
@petrochenkov
Contributor

petrochenkov commented Jan 12, 2024

There's a category of artificial errors, like "this can compile just fine, but we don't want to accept it to avoid confusing human readers".
I think errors like BareCarriageReturn(InRawString) or some of EscapeOnlyChar are in this category.

If we want them reported at all, then we want them reported in the lexer, because human eyes look at code regardless of whether it's in `main`, `sink!`, or `#[cfg(FALSE)]`.

Edit: errors like NonAsciiCharInByte are in the same category as well: we can compile them just fine, except they shouldn't be errors at all, IMO (is rust-lang/rfcs#3349 implemented yet?).

@petrochenkov
Contributor

petrochenkov commented Jan 12, 2024

I don't think moving all errors either to post-expansion or to the lexer is a goal.
Having both points gives us enough flexibility to report human-oriented (#118699 (comment)) or future-proofing errors early, and to report "clearly semantic" errors at a later point.

The NulInCStr is in the "clearly semantic" category because it doesn't depend on how exactly the NUL is written lexically - \0, or \x00, or \u{0} - it's the lowered "semantic string content" that matters here.
It's different from e.g. bare carriage return errors, because those depend on how exactly the symbol is written lexically - \r - ok, \x0D - ok, bare - not ok.
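(A tiny illustration of that distinction: once unescaped, the three spellings below are the same string, so a NUL check on the unescaped contents cannot tell them apart.)

fn main() {
    // Three lexically different spellings of NUL...
    let a = "\0";
    let b = "\x00";
    let c = "\u{0}";
    // ...are identical after unescaping, so a "NUL in C string" check naturally
    // looks at the lowered content rather than the source spelling.
    assert_eq!(a, b);
    assert_eq!(b, c);
    assert!(a.bytes().all(|byte| byte == 0));
}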

@petrochenkov
Contributor

There's also a question of when a "not a literal" turns into a "literal with escaping problems", which is relevant to cases like ''.
This is purely a question of error recovery strategy in the lexer, so perhaps the error reporting should also stay in the lexer for cases like this.

@petrochenkov
Contributor

Basically, my current position is that what we have now is fine, but we should probably remove some artificial errors.

@petrochenkov petrochenkov added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 12, 2024
@nnethercote
Contributor Author

nnethercote commented Jan 12, 2024

is rust-lang/rfcs#3349 implemented yet?

I have a partial implementation of RFC3349. There is no PR yet. It's blocked on two things:

  • The decision on whether to delay literal unescaping checks or not (i.e. this PR).
  • Some issues with the RFC itself, as described here.

I know I wrote this PR, but I think I will close it. Why?

  • It's getting in the way of RFC3349.
  • This eager vs. delayed checking discussion is prone to hair-splitting, e.g. about the exact meaning of "syntactic" vs. "semantic", and what is or isn't "human-oriented". That's the kind of thing that can drag on and block useful progress.
  • I'm motivated by (a) what is useful to users, and (b) what simplifies the implementation. For (a), eager checking is working pretty well right now and I haven't seen any compelling reasons to delay checking so far. For (b), doing all the checks early simplifies the implementation. (It's also easier to explain.)
  • If we later find a compelling reason to delay checking, we can do so.

@apiraino apiraino removed the S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). label Jan 24, 2024
Labels
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-lang Relevant to the language team, which will review and decide on the PR/issue.