Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve diagnostics for non XID characters in identifiers #86102

Closed
Manishearth opened this issue Jun 7, 2021 · 26 comments
Closed

Improve diagnostics for non XID characters in identifiers #86102

Manishearth opened this issue Jun 7, 2021 · 26 comments
Labels
A-diagnostics Area: Messages for errors, warnings, and lints A-unicode Area: Unicode C-enhancement Category: An issue proposing an enhancement or a PR with one. D-papercut Diagnostics: An error or lint that needs small tweaks. E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Manishearth
Copy link
Member

Manishearth commented Jun 7, 2021

fn main() {
    let 🦀 = "Manish";
}

errors with

error: unknown start of token: \u{1f980}
 --> src/main.rs:2:9
  |
2 |     let 🦀 = "Manish";

For non-ASCII characters, we should perhaps error with something better. I don't know what the error text should be (cc @estebank), because it's not just emoji, and there's no easy way to define "XID characters" without just linking to the spec, which seems bad. Maybe we can link to the reference?

Tagging as easy since the implementation isn't tricky, but we will probably need to figure out a good error message.

@Manishearth Manishearth added E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. A-diagnostics Area: Messages for errors, warnings, and lints E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. labels Jun 7, 2021
@JohnTitor JohnTitor added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 7, 2021
@estebank
Copy link
Contributor

estebank commented Jun 7, 2021

We probably should rework the parser to accept these as identifiers, and reject them in a later pass.

@estebank estebank added A-unicode Area: Unicode D-papercut Diagnostics: An error or lint that needs small tweaks. labels Jun 7, 2021
@Manishearth
Copy link
Member Author

@estebank It's not that simple, bear in mind that this includes other punctuation characters and whitespace and such. A lot of these things are not identifier-like. We may need to figure out a heuristic for identifier-like, or we can just show a better error for all cases.

@jeanlucthumm
Copy link
Contributor

@rustbot claim

@estebank
Copy link
Contributor

estebank commented Jun 8, 2021

It's not that simple, bear in mind that this includes other punctuation characters and whitespace and such. A lot of these things are not identifier-like.

That's fair. I left out that the recovery should be sane and only be done for things that could be an identifier, but completely agree.

We may need to figure out a heuristic for identifier-like, or we can just show a better error for all cases.

That is something that applies to all errors 🌲

@Manishearth
Copy link
Member Author

Note that it is possible to use unicode general categories to tell "could be identifier" apart from "probably not" but it's still iffy.

@jeanlucthumm
Copy link
Contributor

How does the E-mentor tag work in this case? I'm not sure who my point of contact is.

@Manishearth
Copy link
Member Author

@jeanlucthumm just ask questions here, I'll help

@jeanlucthumm
Copy link
Contributor

jeanlucthumm commented Jun 17, 2021

It's not that simple, bear in mind that this includes other punctuation characters and whitespace and such. A lot of these things are not identifier-like. We may need to figure out a heuristic for identifier-like, or we can just show a better error for all cases.

@Manishearth Would it be wrong to say that anything we don't explicitly call punctuation in the lexer is just identifier-like by default? Because we have a well defined set of what we use as punctuation, e.g. ? and {, so if you're trying to use unicode characters that aren't part of that set you're almost certainly trying to use them as identifiers. That's the only part of the program that users have "control" over, everything else is set in stone. In other words, it wouldn't make sense to have a user try to use 𓃄 instead of an open bracket, but it would make more sense for people to experiment with what they can label variables with, including 𓃄. That's exactly what happened here when you tried to use an emoji. And for a lot of characters, like 𓃄, it would be genuinely impossible to determine if they are more identifier like or punctuation like anyways.

Also, the only other case I can think of where you would use non-XID characters that were also not intended to be part of an identifier, is if you accidentally typoed them in a place were identifiers cannot semantically be. But in the off chance this happens, the error message would still be more than enough to let you know you have a random typo, and I don't think having the compiler label the character as a potential identifier would confuse the user enough to prevent them from fixing it.

If we don't consider that wrong, then a satisfactory solution would be to link them to the formal identifier definition.

@Manishearth
Copy link
Member Author

@Manishearth Would it be wrong to say that anything we don't explicitly call punctuation in the lexer is just identifier-like by default? Because we have a well defined set of what we use as punctuation, e.g. ? and {, so if you're trying to use unicode characters that aren't part of that set you're almost certainly trying to use them as identifiers

I think that's not a good assumption because often special characters get copied in by accident. Intentional use isn't the only thing. Curly quotes being copied in (because Word transforms them) is common.

We may even at some point expand the lexer to allow numbers in other writing systems!

I'm unsure if we should link to the formal definition here (@estebank, thoughts?) but I think just being clearer that that character isn't allowed in identifiers is enough?

A decent heuristic for "identifierlike" might be filtering by General_Category, basically filtering out punctuation, numbers, and spaces.

@jeanlucthumm
Copy link
Contributor

jeanlucthumm commented Jun 21, 2021

If we filter by General_Category, and say for example, we deem the Letter category as identifier-like, what happens when you copy paste in text (as you were saying) that has an unknown Letter in a place where identifiers cannot semantically be? Then the compiler will give you an alert that you can't use that letter in an identifier, even though you were not trying to use it in an identifier.

Also, what if a user puts a special punctuation character like the curly quotes in the middle of an identifier. Then the correct error would indeed be "You can't use curly quotes in an identifier", but the filtering by General_Category would prevent that error.

But while the filtering will still lead to wrong error messages, I think I agree with you that it will be fewer than treating everything as identifier-like because it's much less likely that e.g. punctuation characters will appear inside of identifiers and letter characters outside of identifiers.

I think the core problem is that you can't use information from the unicode to make a statement like You can't use [character] in X because nothing in the unicode itself tells you what X is. That's dependent on the characters around the problematic one and that semantic information is only available after several passes beyond the lexer. We're basically just assuming that there's a correlation between X and General_Category but I think this is (probably) a safe assumption.

Also on second thought, I agree with not linking the formal definition, in the spirit of keeping things user friendly.

I'll get started by going through General_Category to figure out the heuristic, but maybe the absolute best solution is to go higher up the stack and somehow source that semantic information. Given how much effort Rust puts into compiler messages, this may not be so farfetched.

@Manishearth
Copy link
Member Author

Yeah, precisely. This is kinda nuanced.

I think a first pass can just improve the error message to say "only certain kinds of characters are allowed in identifiers", perhaps linking to the RFC. Idk. But yeah if you want to look at figuring out a smarter heuristic that would be great, and it's not a huge deal if we get it wrong since it's a diagnostic and we can iterate on it.

@jeanlucthumm
Copy link
Contributor

jeanlucthumm commented Jul 8, 2021

I ran into a very problematic category So (Symbol Other). The Ferris the Rustacean emoji in the original post is part of So, as are most common emojis you'd try to use in an identifier. So I decided to include that category as identifier-like. But this leads to a contradiction in the code. Take this example:

fn main() {
    let 🦀 = "Manish";
    println!("Result: {}", 2 ➖ 3);
}

The symbol (char description) is part of So as well, but it's also part of a list of substitute recommendations we have here: code link, so that we actually recommend the - symbol, a clear punctuation character as a diagnostic. What you end up with is somewhat contradicting diagnostic messages:

   Compiling rust-id-test v0.1.0 (/home/jeanluc/Code/rust-id-test)
error: cannot use 🦀 in an identifier
 --> src/main.rs:2:9
  |
2 |     let 🦀 = "Manish";
  |         ^^

error: cannot use ➖ in an identifier
 --> src/main.rs:3:30
  |
3 |     println!("Result: {}", 2 ➖ 3);
  |                              ^^
  |
help: Unicode character '➖' (Heavy Minus Sign) looks like '-' (Minus/Hyphen), but it is not
  |
3 |     println!("Result: {}", 2 - 3);
  |                              ^

error: expected pattern, found `=`
 --> src/main.rs:2:11
  |
2 |     let 🦀 = "Manish";
  |            ^ expected pattern

error: aborting due to 3 previous errors

error: could not compile `rust-id-test`

To learn more, run the command again with --verbose.

To solve the issue of So, I think what I'm gonna experiment with next is considering the Block property and filtering So itself based on that.

In the meantime, I have a proof of concept in the unicode branch of my fork that you can experiment with different characters on, if you'd like.

But I also have a few questions:

  1. How difficult would it be to delay the diagnostic until we have more context from the AST and customize the error message based on that?
  2. When I was trying to commit this I was getting a tidy error:
tidy error: Dependencies not explicitly permitted:
* matches 0.1.8 (registry+https://github.com/rust-lang/crates.io-index)
* unic-char-property 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-char-range 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-common 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-ucd-category 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-ucd-version 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)

I assume this is because I added an external dependency. Is there something special you have to do for rustc everytime you modify the Cargo.toml file?

@klensy
Copy link
Contributor

klensy commented Jul 8, 2021

@jeanlucthumm You should that to rust\src\tools\tidy\src\deps.rs PERMITTED_DEPENDENCIES

@Manishearth
Copy link
Member Author

@jeanlucthumm Yeah so emoji are kinda scattered everywhere, it's hard to tell with the block/etc property. You're looking for Emoji_Presentation.

I think delaying the diagnostic would be very hard since this stuff has to be rejected at lex time.

I think it's totally fine to treat all So as diagnostic-worthy, the heavy dash is not a "clear punctuation character" to me since it's also an emoji. So I don't actually see a problem with the current diagnostic; it's showing multiple helps, that's fine. Perhaps we should not have the error itself say "not allowed in identifiers" and instead keep the original error but have the help say "not allowed in identifiers"

As far as the tidy check, yes, that's to prevent additional deps from being pulled in.

@jeanlucthumm
Copy link
Contributor

A few things:

So I'm basically done with the code changes I wanted to make, but there are quite a few broken UI tests as a result. Before I go through and fix them, I just wanted to get a second opinion on the error messages:

   Compiling rust-id-test v0.1.0 (/home/jeanluc/Code/rust-id-test)
error: unkown token: 🦀
 --> src/main.rs:2:9
  |
2 |     let 🦀 = "Manish";
  |         ^^
  |
  = help: character 🦀 cannot be used in identifiers

error: unkown token: ➖
 --> src/main.rs:3:30
  |
3 |     println!("Result: {}", 2 ➖ 3);
  |                              ^^
  |
help: Unicode character '➖' (Heavy Minus Sign) looks like '-' (Minus/Hyphen), but it is not
  |
3 |     println!("Result: {}", 2 - 3);
  |                              ^

error: expected pattern, found `=`
 --> src/main.rs:2:11
  |
2 |     let 🦀 = "Manish";
  |            ^ expected pattern

error: aborting due to 3 previous errors

error: could not compile `rust-id-test`

To learn more, run the command again with --verbose.

The current heuristic is to print the help message about identifiers if both the following are true:

  1. The first code point of the first grapheme cluster in the token has GeneralCategory as L* (any letter) or So
  2. We have not suggested a substitution character (like the heavy minus sign)

I figured point 2 would be a good addition since we only suggest punctuation characters as substitutions, but given your point that some of these could also be considered emojis, we can change that back if you'd like.


I'm running into issues trying to minimize dependencies. For the unicode segmentation I picked unicode-segmentation and I figured that's fine given we already depend on other packages in the unicode-rs group. But for picking up the general category, I couldn't find anything in the group so I found two alternatives: 1) unic-ucd-category and 2)unicode-general-category. The former has a lot of dependencies:

tidy error: Dependencies not explicitly permitted:
* matches 0.1.8 (registry+https://github.com/rust-lang/crates.io-index)
* unic-char-property 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-char-range 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-common 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-ucd-category 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)
* unic-ucd-version 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)

And the latter has no dependencies but tidy complained that it has a bad license: Apache-2.0. Maybe a third option is to 3) add this functionality to the unicode-rs group as well. Finally, 4) I can also implement this on my own in the code, but I'm not sure how complicated / bloated that will be. I'm leaning towards 2) and asking the maintainer to potentially add another license.

@Manishearth
Copy link
Member Author

I figured point 2 would be a good addition since we only suggest punctuation characters as substitutions, but given your point that some of these could also be considered emojis, we can change that back if you'd like.

Yeah I'd change it back.


I would go for unicode-general-category. @wezm are you open to relicensing it?

@rust-lang/compiler I believe Apache-only crates are allowed if they're dependencies of the compiler but not the stdlib, yes? So we can opt out with

/// These are exceptions to Rust's permissive licensing policy, and
/// should be considered bugs. Exceptions are only allowed in Rust
/// tooling. It is _crucial_ that no exception crates be dependencies
/// of the Rust runtime (std/test).
if we need to?

@estebank
Copy link
Contributor

@jeanlucthumm, can you make the lexer give back a new placeholder identifier? I would love it if we could get rid of the "expected pattern, found =" superfluous error.

You could also tweak the output to be closer to

error: unkown token: 🦀
 --> src/main.rs:2:9
  |
2 |     let 🦀 = "Manish";
  |         ^^ this character cannot be used in identifiers

If we have graceful recovery, we might also want to handle things like

fn 🍮() -> i32 { 4}
let my_🦀 = 🍮() ➖ 🍮(); // `: i32 = 0`

where we would hopefully only have the lexer errors and nothing else.

@eddyb
Copy link
Member

eddyb commented Jul 27, 2021

can you make the lexer give back a new placeholder identifier? I would love it if we could get rid of the "expected pattern, found =" superfluous error.

I suspect a more general solution is a one codepoint (or grapheme, if we care about that) "error" token, that would cause the parser to give up half-way during parsing Ident("let"), Error, Punct('='), instead of seeing the same thing it would for let =, i.e. Ident("let"), Punct('='), and attempting to parse it.

@jeanlucthumm
Copy link
Contributor

jeanlucthumm commented Aug 5, 2021

Yeah sorry, had to deep dive rustc_parse (was hard with no rustc background). I agree with eddyb. The relevant current code is this loop in StringReader::next_token (link):

// Skip trivial (whitespace & comments) tokens
loop {
    // ....
    match self.cook_lexer_token(token.kind, start) {
        Some(kind) => {
            let span = self.mk_sp(start, self.pos);
            return (spacing, Token::new(kind, span));
        }
        None => spacing = Spacing::Alone,
    }
}

Which just skips over unkown tokens as if they were whitespace to be ignored. This is where we would instead return a new error rustc_ast::TokenKind which represents an unkown token for which rustc_parse::lexer has already provided diagnostics

Then in rustc_parse::parser, currently the error is generated by Parser::parse_pat_with_range_pat (link) which doesn't recognize the = as a valid pattern, hence the "expected pattern" as eddyb was saying.In here, we can exit early if we detect the error token.

Currently doing these changes.

This only solves the issue for patterns though. This exception would have to be hardcoded in all parsing paths, like function identifiers from @estebank's example. Will go through them.

@eddyb
Copy link
Member

eddyb commented Aug 5, 2021

In here, we can exit early if we detect the error token.
This only solves the issue for patterns though. This exception would have to be hardcoded in all parsing paths, like function identifiers from @estebank's example. Will go through them.

Hmm, what happens when you add a new TokenKind::Err, without any handling in parsing code? Can we special-case it in the "unexpected token" error? (i.e. if the token that caused an "unexpected token" error, happens to be TokenKind::Err, emit that error as a delay_span_bug instead of a normal error)

cc @petrochenkov

@jeanlucthumm jeanlucthumm removed their assignment Aug 27, 2021
@jeanlucthumm
Copy link
Contributor

Hit a particularly busy patch of life at the moment, should be free again in a month. Unassigning for now, and if no one has claimed this by then I will work on it again.

@estebank
Copy link
Contributor

I checked to see how hard it would be to accept emojis using rust-emoji-char and it wasn't hard. I haven't checked what the perf impact it might have yet, but the changes aren't extensive. Most of the logic would be in making sure we emit a single diagnostic per ident instead of one per use like this code does and to point at specific parts of the ident, potentially proposing an alternate name, but it seems this approach would work out fine:

Two errors from a test file with the code on the linked commit, showing "identifiers can't be emojis" and a follow up error showing that items named with emojis still cause other errors

@estebank
Copy link
Contributor

We should also use the confusables table to take whitespace and token like unicode chars (like greep question mark) error on them in the lexer and translate them all to their valid ascii lookalike token. The confusable table we use isn't exhaustive but should be enough to improve the situation with things like smart quotes. We today recover somewhat, but have too many knock down errors afterwards. This and the change above to the lexer would be a net user experience improvement. Now, if someone tries to use a smart quote for example as an identifier... 🤷‍♀️

@estebank
Copy link
Contributor

estebank commented Sep 9, 2021

Opened #88781 after cleaning up the code I linked earlier. It only handles identifiers, namely things that look like emojis are treated as XID_Start and emojis+ZWJ are considered XID_Continue and rejected after lexing.


Edit: BTW, the output for the earlier example with this PR:

Screen Shot 2021-09-09 at 9 33 46 AM

error: expected one of `,`, `.`, `?`, or an operator, found `➖`
 --> f8.rs:3:30
  |
3 |     println!("Result: {}", 2 ➖ 3);
  |                              ^^ expected one of `,`, `.`, `?`, or an operator

error: identifiers cannot contain emojis: `➖`
 --> f8.rs:3:30
  |
3 |     println!("Result: {}", 2 ➖ 3);
  |                              ^^

error: identifiers cannot contain emojis: `🦀`
 --> f8.rs:2:9
  |
2 |     let 🦀 = "Manish";
  |         ^^

Edit 2: it now handles confusables:

Screen Shot 2021-09-09 at 10 06 29 AM

error: unknown start of token: \u{2796}
 --> f8.rs:3:30
  |
3 |     println!("Result: {}", 2 ➖ 3);
  |                              ^^
  |
help: Unicode character '➖' (Heavy Minus Sign) looks like '-' (Minus/Hyphen), but it is not
  |
3 |     println!("Result: {}", 2 - 3);
  |                              ~

error: identifiers cannot contain emojis: `🦀`
 --> f8.rs:2:9
  |
2 |     let 🦀 = "Manish";
  |         ^^

GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Sep 13, 2021
Tokenize emoji as if they were valid identifiers

In the lexer, consider emojis to be valid identifiers and reject
them later to avoid knock down parse errors.

Partially address rust-lang#86102.
Manishearth added a commit to Manishearth/rust that referenced this issue Sep 14, 2021
Tokenize emoji as if they were valid identifiers

In the lexer, consider emojis to be valid identifiers and reject
them later to avoid knock down parse errors.

Partially address rust-lang#86102.
Manishearth added a commit to Manishearth/rust that referenced this issue Sep 14, 2021
Tokenize emoji as if they were valid identifiers

In the lexer, consider emojis to be valid identifiers and reject
them later to avoid knock down parse errors.

Partially address rust-lang#86102.
Manishearth added a commit to Manishearth/rust that referenced this issue Sep 14, 2021
Tokenize emoji as if they were valid identifiers

In the lexer, consider emojis to be valid identifiers and reject
them later to avoid knock down parse errors.

Partially address rust-lang#86102.
bors added a commit to rust-lang-ci/rust that referenced this issue Nov 25, 2021
Tokenize emoji as if they were valid identifiers

In the lexer, consider emojis to be valid identifiers and reject
them later to avoid knock down parse errors.

Partially address rust-lang#86102.
@estebank
Copy link
Contributor

estebank commented Dec 2, 2021

@Manishearth can you verify what might be the outstanding work to close this ticket?

@Manishearth
Copy link
Member Author

@estebank I think we're good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-diagnostics Area: Messages for errors, warnings, and lints A-unicode Area: Unicode C-enhancement Category: An issue proposing an enhancement or a PR with one. D-papercut Diagnostics: An error or lint that needs small tweaks. E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants