Inconsistent whitespace definitions in string literals and language itself #60209

Open
matklad opened this issue Apr 23, 2019 · 4 comments
Labels
A-frontend Area: frontend (errors, parsing and HIR) A-parser Area: The parsing of Rust source code to an AST. A-unicode Area: Unicode C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@matklad
Member

matklad commented Apr 23, 2019

The lexer uses the Pattern_White_Space Unicode property when skipping over trivia. However, when it processes string literals with escaped newlines, it skips only ASCII whitespace:

Some(' ') | Some('\n') | Some('\r') | Some('\t') => {
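
To make the mismatch concrete, here is a minimal sketch of the two whitespace definitions side by side. The function names `is_pattern_white_space` and `is_string_continuation_whitespace` are hypothetical helpers written for illustration, not the compiler's actual API; the first lists the eleven code points that carry the Pattern_White_Space property, and the second mirrors the match arm quoted above.

```rust
// Hypothetical helper: the code points with the Unicode
// Pattern_White_Space property.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}' // tab, line feed, vertical tab, form feed, carriage return
            | '\u{0020}' // space
            | '\u{0085}' // next line
            | '\u{200E}' // left-to-right mark
            | '\u{200F}' // right-to-left mark
            | '\u{2028}' // line separator
            | '\u{2029}' // paragraph separator
    )
}

// Hypothetical helper: the ASCII-only set from the match arm above,
// used after an escaped newline inside a string literal.
fn is_string_continuation_whitespace(c: char) -> bool {
    matches!(c, ' ' | '\n' | '\r' | '\t')
}

fn main() {
    // Print how each definition classifies a few sample characters.
    for c in ['\t', ' ', '\u{000B}', '\u{0085}', '\u{200F}', '\u{2028}'] {
        println!(
            "{:?}: pattern_white_space={} string_continuation={}",
            c,
            is_pattern_white_space(c),
            is_string_continuation_whitespace(c)
        );
    }
}
```

Every character in the ASCII-only set is also Pattern_White_Space, but not the other way around: U+200F is whitespace under the first definition and ordinary content under the second.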

Here's an example program showing that U+200F is ignored in program text but not inside the string literal:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=ec59778d31dde69f29f1095aff2c9b66

Here's the text of the program in Debug format, to make the whitespace slightly more visible:

"fn main() {\n\u{200f}\u{200f}\u{200f}\n    let s = \"\\\n\u{200f}\u{200f}\u{200f}hello\n\";\n    println!(\"{:?}\", s);\n}    \n"
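
The effect on this program can be modeled with a short sketch. It assumes a simplified view of line continuation, in which everything after the backslash is scanned and leading characters are dropped while a predicate holds. The helper `after_continuation` and both predicates are hypothetical names for illustration, not compiler internals.

```rust
// Hypothetical predicate: the ASCII-only set skipped after an escaped newline.
fn is_ascii_continuation_ws(c: char) -> bool {
    matches!(c, ' ' | '\n' | '\r' | '\t')
}

// Hypothetical predicate: the Pattern_White_Space set used for program text.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'
            | '\u{0020}'
            | '\u{0085}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{2028}'
            | '\u{2029}'
    )
}

// Simplified model of line continuation: drop leading characters
// for which `skip` returns true and keep the rest of the literal.
fn after_continuation(rest: &str, skip: fn(char) -> bool) -> &str {
    rest.trim_start_matches(skip)
}

fn main() {
    // The literal's contents following the backslash in the example program.
    let rest = "\n\u{200f}\u{200f}\u{200f}hello\n";

    // ASCII-only skipping stops at the first U+200F, so the marks survive.
    println!("{:?}", after_continuation(rest, is_ascii_continuation_ws));

    // Skipping Pattern_White_Space would remove the marks as well.
    println!("{:?}", after_continuation(rest, is_pattern_white_space));
}
```

Under the ASCII-only rule the three right-to-left marks stay in the string, which matches what the issue reports; under Pattern_White_Space they would be stripped along with the newline.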
@Centril Centril added C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. A-parser Area: The parsing of Rust source code to an AST. A-frontend Area: frontend (errors, parsing and HIR) labels Apr 23, 2019
@Centril
Contributor

Centril commented Apr 23, 2019

Tentatively classifying as a bug.

@matklad
Member Author

matklad commented Apr 23, 2019

I am pretty ignorant about Unicode, but I would prefer to fix this the other way around, by restricting the whitespace definition in the reference to ASCII. I opened https://internals.rust-lang.org/t/do-we-need-unicode-whitespace/9876 for that discussion.

@estebank
Contributor

cc @Manishearth

@Manishearth
Member

U+200F is definitely useful in text; we should not be skipping it at all in strings.

As for the lexer: this was one of the questions we deferred in the non-ASCII identifier discussion. RLM (U+200F) helps code that uses RTL scripts render well, so allowing it as whitespace is somewhat useful (if confusing).

@workingjubilee workingjubilee added the A-unicode Area: Unicode label Jul 22, 2023