Improve diagnostics for non XID characters in identifiers #86102

Manishearth · 2021-06-07T15:17:26Z

fn main() {
    let 🦀 = "Manish";
}

errors with

error: unknown start of token: \u{1f980}
 --> src/main.rs:2:9
  |
2 |     let 🦀 = "Manish";

For non-ASCII characters, we should perhaps error with something better. I don't know what the error text should be (cc @estebank), because it's not just emoji, and there's no easy way to define "XID characters" without just linking to the spec, which seems bad. Maybe we can link to the reference?

Tagging as easy since the implementation isn't tricky, but we will probably need to figure out a good error message.

The text was updated successfully, but these errors were encountered:

estebank · 2021-06-07T20:50:01Z

We probably should rework the parser to accept these as identifiers, and reject them in a later pass.

Manishearth · 2021-06-07T21:12:15Z

@estebank It's not that simple, bear in mind that this includes other punctuation characters and whitespace and such. A lot of these things are not identifier-like. We may need to figure out a heuristic for identifier-like, or we can just show a better error for all cases.

jeanlucthumm · 2021-06-08T00:53:52Z

@rustbot claim

estebank · 2021-06-08T17:08:16Z

It's not that simple, bear in mind that this includes other punctuation characters and whitespace and such. A lot of these things are not identifier-like.

That's fair. I left out that the recovery should be sane and only be done for things that could be an identifier, but completely agree.

We may need to figure out a heuristic for identifier-like, or we can just show a better error for all cases.

That is something that applies to all errors 🌲

Manishearth · 2021-06-08T17:50:37Z

Note that it is possible to use unicode general categories to tell "could be identifier" apart from "probably not" but it's still iffy.

jeanlucthumm · 2021-06-08T21:15:09Z

How does the E-mentor tag work in this case? I'm not sure who my point of contact is.

Manishearth · 2021-06-08T21:59:48Z

@jeanlucthumm just ask questions here, I'll help

jeanlucthumm · 2021-06-17T02:41:22Z

It's not that simple, bear in mind that this includes other punctuation characters and whitespace and such. A lot of these things are not identifier-like. We may need to figure out a heuristic for identifier-like, or we can just show a better error for all cases.

@Manishearth Would it be wrong to say that anything we don't explicitly call punctuation in the lexer is just identifier-like by default? Because we have a well defined set of what we use as punctuation, e.g. ? and {, so if you're trying to use unicode characters that aren't part of that set you're almost certainly trying to use them as identifiers. That's the only part of the program that users have "control" over, everything else is set in stone. In other words, it wouldn't make sense to have a user try to use 𓃄 instead of an open bracket, but it would make more sense for people to experiment with what they can label variables with, including 𓃄. That's exactly what happened here when you tried to use an emoji. And for a lot of characters, like 𓃄, it would be genuinely impossible to determine if they are more identifier like or punctuation like anyways.

Also, the only other case I can think of where you would use non-XID characters that were also not intended to be part of an identifier, is if you accidentally typoed them in a place were identifiers cannot semantically be. But in the off chance this happens, the error message would still be more than enough to let you know you have a random typo, and I don't think having the compiler label the character as a potential identifier would confuse the user enough to prevent them from fixing it.

If we don't consider that wrong, then a satisfactory solution would be to link them to the formal identifier definition.

Manishearth · 2021-06-17T03:11:11Z

@Manishearth Would it be wrong to say that anything we don't explicitly call punctuation in the lexer is just identifier-like by default? Because we have a well defined set of what we use as punctuation, e.g. ? and {, so if you're trying to use unicode characters that aren't part of that set you're almost certainly trying to use them as identifiers

I think that's not a good assumption because often special characters get copied in by accident. Intentional use isn't the only thing. Curly quotes being copied in (because Word transforms them) is common.

We may even at some point expand the lexer to allow numbers in other writing systems!

I'm unsure if we should link to the formal definition here (@estebank, thoughts?) but I think just being clearer that that character isn't allowed in identifiers is enough?

A decent heuristic for "identifierlike" might be filtering by General_Category, basically filtering out punctuation, numbers, and spaces.

jeanlucthumm · 2021-06-21T21:47:16Z

If we filter by General_Category, and say for example, we deem the Letter category as identifier-like, what happens when you copy paste in text (as you were saying) that has an unknown Letter in a place where identifiers cannot semantically be? Then the compiler will give you an alert that you can't use that letter in an identifier, even though you were not trying to use it in an identifier.

Also, what if a user puts a special punctuation character like the curly quotes in the middle of an identifier. Then the correct error would indeed be "You can't use curly quotes in an identifier", but the filtering by General_Category would prevent that error.

But while the filtering will still lead to wrong error messages, I think I agree with you that it will be fewer than treating everything as identifier-like because it's much less likely that e.g. punctuation characters will appear inside of identifiers and letter characters outside of identifiers.

I think the core problem is that you can't use information from the unicode to make a statement like You can't use [character] in X because nothing in the unicode itself tells you what X is. That's dependent on the characters around the problematic one and that semantic information is only available after several passes beyond the lexer. We're basically just assuming that there's a correlation between X and General_Category but I think this is (probably) a safe assumption.

Also on second thought, I agree with not linking the formal definition, in the spirit of keeping things user friendly.

I'll get started by going through General_Category to figure out the heuristic, but maybe the absolute best solution is to go higher up the stack and somehow source that semantic information. Given how much effort Rust puts into compiler messages, this may not be so farfetched.

Manishearth · 2021-06-21T21:55:12Z

Yeah, precisely. This is kinda nuanced.

I think a first pass can just improve the error message to say "only certain kinds of characters are allowed in identifiers", perhaps linking to the RFC. Idk. But yeah if you want to look at figuring out a smarter heuristic that would be great, and it's not a huge deal if we get it wrong since it's a diagnostic and we can iterate on it.

jeanlucthumm · 2021-07-08T23:18:13Z

I ran into a very problematic category So (Symbol Other). The Ferris the Rustacean emoji in the original post is part of So, as are most common emojis you'd try to use in an identifier. So I decided to include that category as identifier-like. But this leads to a contradiction in the code. Take this example:

fn main() {
    let 🦀 = "Manish";
    println!("Result: {}", 2 ➖ 3);
}

The symbol ➖ (char description) is part of So as well, but it's also part of a list of substitute recommendations we have here: code link, so that we actually recommend the - symbol, a clear punctuation character as a diagnostic. What you end up with is somewhat contradicting diagnostic messages:

   Compiling rust-id-test v0.1.0 (/home/jeanluc/Code/rust-id-test)
error: cannot use 🦀 in an identifier
 --> src/main.rs:2:9
  |
2 |     let 🦀 = "Manish";
  |         ^^

error: cannot use ➖ in an identifier
 --> src/main.rs:3:30
  |
3 |     println!("Result: {}", 2 ➖ 3);
  |                              ^^
  |
help: Unicode character '➖' (Heavy Minus Sign) looks like '-' (Minus/Hyphen), but it is not
  |
3 |     println!("Result: {}", 2 - 3);
  |                              ^

error: expected pattern, found `=`
 --> src/main.rs:2:11
  |
2 |     let 🦀 = "Manish";
  |            ^ expected pattern

error: aborting due to 3 previous errors

error: could not compile `rust-id-test`

To learn more, run the command again with --verbose.

To solve the issue of So, I think what I'm gonna experiment with next is considering the Block property and filtering So itself based on that.

In the meantime, I have a proof of concept in the unicode branch of my fork that you can experiment with different characters on, if you'd like.

But I also have a few questions:

How difficult would it be to delay the diagnostic until we have more context from the AST and customize the error message based on that?
When I was trying to commit this I was getting a tidy error:

tidy error: Dependencies not explicitly permitted:
* matches 0.1.8 (registry+https://github.com/rust-lang/crates.io-index)
* unic-char-property 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-char-range 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-common 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-ucd-category 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-ucd-version 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)

I assume this is because I added an external dependency. Is there something special you have to do for rustc everytime you modify the Cargo.toml file?

klensy · 2021-07-08T23:33:28Z

@jeanlucthumm You should that to rust\src\tools\tidy\src\deps.rs PERMITTED_DEPENDENCIES

Manishearth · 2021-07-09T00:51:12Z

@jeanlucthumm Yeah so emoji are kinda scattered everywhere, it's hard to tell with the block/etc property. You're looking for Emoji_Presentation.

I think delaying the diagnostic would be very hard since this stuff has to be rejected at lex time.

I think it's totally fine to treat all So as diagnostic-worthy, the heavy dash is not a "clear punctuation character" to me since it's also an emoji. So I don't actually see a problem with the current diagnostic; it's showing multiple helps, that's fine. Perhaps we should not have the error itself say "not allowed in identifiers" and instead keep the original error but have the help say "not allowed in identifiers"

As far as the tidy check, yes, that's to prevent additional deps from being pulled in.

jeanlucthumm · 2021-07-16T00:55:09Z

A few things:

So I'm basically done with the code changes I wanted to make, but there are quite a few broken UI tests as a result. Before I go through and fix them, I just wanted to get a second opinion on the error messages:

   Compiling rust-id-test v0.1.0 (/home/jeanluc/Code/rust-id-test)
error: unkown token: 🦀
 --> src/main.rs:2:9
  |
2 |     let 🦀 = "Manish";
  |         ^^
  |
  = help: character 🦀 cannot be used in identifiers

error: unkown token: ➖
 --> src/main.rs:3:30
  |
3 |     println!("Result: {}", 2 ➖ 3);
  |                              ^^
  |
help: Unicode character '➖' (Heavy Minus Sign) looks like '-' (Minus/Hyphen), but it is not
  |
3 |     println!("Result: {}", 2 - 3);
  |                              ^

error: expected pattern, found `=`
 --> src/main.rs:2:11
  |
2 |     let 🦀 = "Manish";
  |            ^ expected pattern

error: aborting due to 3 previous errors

error: could not compile `rust-id-test`

To learn more, run the command again with --verbose.

The current heuristic is to print the help message about identifiers if both the following are true:

The first code point of the first grapheme cluster in the token has GeneralCategory as L* (any letter) or So
We have not suggested a substitution character (like the heavy minus sign)

I figured point 2 would be a good addition since we only suggest punctuation characters as substitutions, but given your point that some of these could also be considered emojis, we can change that back if you'd like.

I'm running into issues trying to minimize dependencies. For the unicode segmentation I picked unicode-segmentation and I figured that's fine given we already depend on other packages in the unicode-rs group. But for picking up the general category, I couldn't find anything in the group so I found two alternatives: 1) unic-ucd-category and 2)unicode-general-category. The former has a lot of dependencies:

tidy error: Dependencies not explicitly permitted:
* matches 0.1.8 (registry+https://github.com/rust-lang/crates.io-index)
* unic-char-property 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-char-range 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-common 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-ucd-category 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-ucd-version 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)

And the latter has no dependencies but tidy complained that it has a bad license: Apache-2.0. Maybe a third option is to 3) add this functionality to the unicode-rs group as well. Finally, 4) I can also implement this on my own in the code, but I'm not sure how complicated / bloated that will be. I'm leaning towards 2) and asking the maintainer to potentially add another license.

Manishearth · 2021-07-16T03:42:45Z

I figured point 2 would be a good addition since we only suggest punctuation characters as substitutions, but given your point that some of these could also be considered emojis, we can change that back if you'd like.

Yeah I'd change it back.

I would go for unicode-general-category. @wezm are you open to relicensing it?

@rust-lang/compiler I believe Apache-only crates are allowed if they're dependencies of the compiler but not the stdlib, yes? So we can opt out with

rust/src/tools/tidy/src/deps.rs

Lines 24 to 27 in ac8c3bf

    
           /// These are exceptions to Rust's permissive licensing policy, and 
        
           /// should be considered bugs. Exceptions are only allowed in Rust 
        
           /// tooling. It is _crucial_ that no exception crates be dependencies 
        
           /// of the Rust runtime (std/test).

if we need to?

estebank · 2021-07-16T17:33:24Z

@jeanlucthumm, can you make the lexer give back a new placeholder identifier? I would love it if we could get rid of the "expected pattern, found =" superfluous error.

You could also tweak the output to be closer to

error: unkown token: 🦀
 --> src/main.rs:2:9
  |
2 |     let 🦀 = "Manish";
  |         ^^ this character cannot be used in identifiers

If we have graceful recovery, we might also want to handle things like

fn 🍮() -> i32 { 4}
let my_🦀 = 🍮() ➖ 🍮(); // `: i32 = 0`

where we would hopefully only have the lexer errors and nothing else.

eddyb · 2021-07-27T14:43:55Z

can you make the lexer give back a new placeholder identifier? I would love it if we could get rid of the "expected pattern, found =" superfluous error.

I suspect a more general solution is a one codepoint (or grapheme, if we care about that) "error" token, that would cause the parser to give up half-way during parsing Ident("let"), Error, Punct('='), instead of seeing the same thing it would for let =, i.e. Ident("let"), Punct('='), and attempting to parse it.

jeanlucthumm · 2021-08-05T01:41:46Z

Yeah sorry, had to deep dive rustc_parse (was hard with no rustc background). I agree with eddyb. The relevant current code is this loop in StringReader::next_token (link):

// Skip trivial (whitespace & comments) tokens
loop {
    // ....
    match self.cook_lexer_token(token.kind, start) {
        Some(kind) => {
            let span = self.mk_sp(start, self.pos);
            return (spacing, Token::new(kind, span));
        }
        None => spacing = Spacing::Alone,
    }
}

Which just skips over unkown tokens as if they were whitespace to be ignored. This is where we would instead return a new error rustc_ast::TokenKind which represents an unkown token for which rustc_parse::lexer has already provided diagnostics

Then in rustc_parse::parser, currently the error is generated by Parser::parse_pat_with_range_pat (link) which doesn't recognize the = as a valid pattern, hence the "expected pattern" as eddyb was saying.In here, we can exit early if we detect the error token.

Currently doing these changes.

This only solves the issue for patterns though. This exception would have to be hardcoded in all parsing paths, like function identifiers from @estebank's example. Will go through them.

eddyb · 2021-08-05T08:12:07Z

In here, we can exit early if we detect the error token.
This only solves the issue for patterns though. This exception would have to be hardcoded in all parsing paths, like function identifiers from @estebank's example. Will go through them.

Hmm, what happens when you add a new TokenKind::Err, without any handling in parsing code? Can we special-case it in the "unexpected token" error? (i.e. if the token that caused an "unexpected token" error, happens to be TokenKind::Err, emit that error as a delay_span_bug instead of a normal error)

cc @petrochenkov

jeanlucthumm · 2021-08-27T20:34:25Z

Hit a particularly busy patch of life at the moment, should be free again in a month. Unassigning for now, and if no one has claimed this by then I will work on it again.

estebank · 2021-08-29T09:40:48Z

I checked to see how hard it would be to accept emojis using rust-emoji-char and it wasn't hard. I haven't checked what the perf impact it might have yet, but the changes aren't extensive. Most of the logic would be in making sure we emit a single diagnostic per ident instead of one per use like this code does and to point at specific parts of the ident, potentially proposing an alternate name, but it seems this approach would work out fine:

estebank · 2021-08-29T10:00:05Z

We should also use the confusables table to take whitespace and token like unicode chars (like greep question mark) error on them in the lexer and translate them all to their valid ascii lookalike token. The confusable table we use isn't exhaustive but should be enough to improve the situation with things like smart quotes. We today recover somewhat, but have too many knock down errors afterwards. This and the change above to the lexer would be a net user experience improvement. Now, if someone tries to use a smart quote for example as an identifier... 🤷‍♀️

estebank · 2021-09-09T15:07:20Z

Opened #88781 after cleaning up the code I linked earlier. It only handles identifiers, namely things that look like emojis are treated as XID_Start and emojis+ZWJ are considered XID_Continue and rejected after lexing.

Edit: BTW, the output for the earlier example with this PR:

error: expected one of `,`, `.`, `?`, or an operator, found `➖`
 --> f8.rs:3:30
  |
3 |     println!("Result: {}", 2 ➖ 3);
  |                              ^^ expected one of `,`, `.`, `?`, or an operator

error: identifiers cannot contain emojis: `➖`
 --> f8.rs:3:30
  |
3 |     println!("Result: {}", 2 ➖ 3);
  |                              ^^

error: identifiers cannot contain emojis: `🦀`
 --> f8.rs:2:9
  |
2 |     let 🦀 = "Manish";
  |         ^^

Edit 2: it now handles confusables:

error: unknown start of token: \u{2796}
 --> f8.rs:3:30
  |
3 |     println!("Result: {}", 2 ➖ 3);
  |                              ^^
  |
help: Unicode character '➖' (Heavy Minus Sign) looks like '-' (Minus/Hyphen), but it is not
  |
3 |     println!("Result: {}", 2 - 3);
  |                              ~

error: identifiers cannot contain emojis: `🦀`
 --> f8.rs:2:9
  |
2 |     let 🦀 = "Manish";
  |         ^^

Tokenize emoji as if they were valid identifiers In the lexer, consider emojis to be valid identifiers and reject them later to avoid knock down parse errors. Partially address rust-lang#86102.

estebank · 2021-12-02T18:50:48Z

@Manishearth can you verify what might be the outstanding work to close this ticket?

Manishearth · 2021-12-03T06:08:21Z

@estebank I think we're good!

JohnTitor added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 7, 2021

estebank added A-unicode Area: Unicode D-papercut Diagnostics: An error or lint that needs small tweaks. labels Jun 7, 2021

rustbot assigned jeanlucthumm Jun 8, 2021

jeanlucthumm removed their assignment Aug 27, 2021

estebank mentioned this issue Sep 9, 2021

Tokenize emoji as if they were valid identifiers #88781

Merged

Manishearth closed this as completed Dec 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve diagnostics for non XID characters in identifiers #86102

Improve diagnostics for non XID characters in identifiers #86102

Manishearth commented Jun 7, 2021 •

edited by rustbot

Loading

estebank commented Jun 7, 2021

Manishearth commented Jun 7, 2021

jeanlucthumm commented Jun 8, 2021

estebank commented Jun 8, 2021

Manishearth commented Jun 8, 2021

jeanlucthumm commented Jun 8, 2021

Manishearth commented Jun 8, 2021

jeanlucthumm commented Jun 17, 2021 •

edited

Loading

Manishearth commented Jun 17, 2021

jeanlucthumm commented Jun 21, 2021 •

edited

Loading

Manishearth commented Jun 21, 2021

jeanlucthumm commented Jul 8, 2021 •

edited

Loading

klensy commented Jul 8, 2021

Manishearth commented Jul 9, 2021

jeanlucthumm commented Jul 16, 2021

Manishearth commented Jul 16, 2021

estebank commented Jul 16, 2021

eddyb commented Jul 27, 2021

jeanlucthumm commented Aug 5, 2021 •

edited

Loading

eddyb commented Aug 5, 2021

jeanlucthumm commented Aug 27, 2021

estebank commented Aug 29, 2021

estebank commented Aug 29, 2021

estebank commented Sep 9, 2021 •

edited

Loading

estebank commented Dec 2, 2021

Manishearth commented Dec 3, 2021

Improve diagnostics for non XID characters in identifiers #86102

Improve diagnostics for non XID characters in identifiers #86102

Comments

Manishearth commented Jun 7, 2021 • edited by rustbot Loading

estebank commented Jun 7, 2021

Manishearth commented Jun 7, 2021

jeanlucthumm commented Jun 8, 2021

estebank commented Jun 8, 2021

Manishearth commented Jun 8, 2021

jeanlucthumm commented Jun 8, 2021

Manishearth commented Jun 8, 2021

jeanlucthumm commented Jun 17, 2021 • edited Loading

Manishearth commented Jun 17, 2021

jeanlucthumm commented Jun 21, 2021 • edited Loading

Manishearth commented Jun 21, 2021

jeanlucthumm commented Jul 8, 2021 • edited Loading

klensy commented Jul 8, 2021

Manishearth commented Jul 9, 2021

jeanlucthumm commented Jul 16, 2021

Manishearth commented Jul 16, 2021

estebank commented Jul 16, 2021

eddyb commented Jul 27, 2021

jeanlucthumm commented Aug 5, 2021 • edited Loading

eddyb commented Aug 5, 2021

jeanlucthumm commented Aug 27, 2021

estebank commented Aug 29, 2021

estebank commented Aug 29, 2021

estebank commented Sep 9, 2021 • edited Loading

estebank commented Dec 2, 2021

Manishearth commented Dec 3, 2021

Manishearth commented Jun 7, 2021 •

edited by rustbot

Loading

jeanlucthumm commented Jun 17, 2021 •

edited

Loading

jeanlucthumm commented Jun 21, 2021 •

edited

Loading

jeanlucthumm commented Jul 8, 2021 •

edited

Loading

jeanlucthumm commented Aug 5, 2021 •

edited

Loading

estebank commented Sep 9, 2021 •

edited

Loading