Member

matklad commented Apr 25, 2019

A WIP PR to gauge early feedback

Currently, we deal with escape sequences twice: once when we [lex](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/lexer/mod.rs#L928-L1065) a string, and a second time when we [unescape](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/mod.rs#L313-L366) literals. Note that we also produce different sets of diagnostics in these two cases.

This PR aims to remove this duplication, by introducing a new unescape module as a single source of truth for character escaping rules.

I think this would be a useful cleanup by itself, but I also need this for #59706.

In the current state, the PR has unescape module which fully (modulo bugs) deals with string and char literals. I am quite happy about the state of this module.

What this PR doesn't have yet are:
* handling of byte and byte string literals (should be simple to add)
* good diagnostics
* actual removal of code from lexer (giant scan_char_or_byte should go away completely)
* performance check
* general cleanup of the new code

Diagnostics will be the most labor-consuming bit here, but they are mostly a question of just correctly adjusting spans to sub-tokens. The current setup for diagnostics is that unescape produces a plain old enum with various problems, and they are rendered into Handler separately. This bit is not actually required (it is possible to just pass the Handler in), but I like the separation between diagnostics and logic this approach imposes, and such separation should again be useful for #59706.

cc @eddyb, @petrochenkov
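To make the "plain old enum, rendered separately" idea concrete, here is a minimal sketch. It is not the code from this PR: the names `EscapeError`, `unescape_str_sketch`, and `render` are hypothetical, and the sketch assumes a reduced escape set (no `\x` escapes, no line continuations, no underscores inside `\u{...}`). The scanning code only produces byte ranges and results; turning an error into a message is a separate step.

```rust
use std::ops::Range;
use std::str::Chars;

/// Hypothetical error type: plain data with no link to any diagnostics handler.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EscapeError {
    LoneSlash,
    InvalidEscape,
    UnclosedUnicodeEscape,
    EmptyUnicodeEscape,
    InvalidCharInUnicodeEscape,
    OverlongUnicodeEscape,
    LoneSurrogateUnicodeEscape,
    OutOfRangeUnicodeEscape,
}

/// Walks the body of a string literal (quotes already stripped) and reports,
/// for every character or escape, its byte range in `src` and the outcome.
pub fn unescape_str_sketch<F>(src: &str, callback: &mut F)
where
    F: FnMut(Range<usize>, Result<char, EscapeError>),
{
    let mut chars = src.chars();
    while let Some(c) = chars.next() {
        let start = src.len() - chars.as_str().len() - c.len_utf8();
        let res = if c == '\\' { scan_escape(&mut chars) } else { Ok(c) };
        let end = src.len() - chars.as_str().len();
        callback(start..end, res);
    }
}

fn scan_escape(chars: &mut Chars<'_>) -> Result<char, EscapeError> {
    match chars.next().ok_or(EscapeError::LoneSlash)? {
        'n' => Ok('\n'),
        't' => Ok('\t'),
        'r' => Ok('\r'),
        '0' => Ok('\0'),
        '\\' => Ok('\\'),
        '\'' => Ok('\''),
        '"' => Ok('"'),
        'u' => scan_unicode(chars),
        _ => Err(EscapeError::InvalidEscape),
    }
}

fn scan_unicode(chars: &mut Chars<'_>) -> Result<char, EscapeError> {
    if chars.next() != Some('{') {
        return Err(EscapeError::InvalidEscape);
    }
    let mut value: u32 = 0;
    let mut n_digits = 0;
    loop {
        match chars.next() {
            None => return Err(EscapeError::UnclosedUnicodeEscape),
            Some('}') => break,
            Some(c) => {
                let digit = c
                    .to_digit(16)
                    .ok_or(EscapeError::InvalidCharInUnicodeEscape)?;
                n_digits += 1;
                // Only accumulate the first six digits so `value` cannot overflow.
                if n_digits <= 6 {
                    value = value * 16 + digit;
                }
            }
        }
    }
    match n_digits {
        0 => Err(EscapeError::EmptyUnicodeEscape),
        1..=6 => char::from_u32(value).ok_or(if (0xD800..=0xDFFF).contains(&value) {
            EscapeError::LoneSurrogateUnicodeEscape
        } else {
            EscapeError::OutOfRangeUnicodeEscape
        }),
        _ => Err(EscapeError::OverlongUnicodeEscape),
    }
}

/// Rendering lives apart from the scanning logic; a compiler would emit a
/// diagnostic into its handler here instead of building a string.
pub fn render(src: &str, range: Range<usize>, err: EscapeError) -> String {
    let text = &src[range.clone()];
    format!("{:?} at {}..{}: `{}`", err, range.start, range.end, text)
}
```

Because the callback receives a byte range for each item, a caller can map an error back to a sub-span of the literal token, which is the span adjustment mentioned above.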

Contributor

petrochenkov commented Apr 28, 2019

 Meta: for reviewing convenience it's better to update UI test outputs and satisfy tidy to make CI green, even if the changes in test results are temporarily wrong / intended to disappear. This way it's clear how exactly they are wrong and what still needs to be fixed.
Contributor

petrochenkov commented Apr 28, 2019

 Question: what happens if a literal is lexed, but never "parsed properly"? For example, if it's passed to a macro that accepts tts and throws them away. The errors for incorrect escapes, etc, should be reported in that case as well. (P.S. I haven't reviewed everything yet, will continue tomorrow.)
Member Author

matklad commented Apr 29, 2019

> Question: what happens if a literal is lexed, but never "parsed properly"?

Good question! Given the `diag: Option<(Span, &Handler)>` argument to the `char_lit` function, I was under the impression that we always parse literals properly. It turns out that even today we don't, so the existing code is buggy. The following compiles, while it shouldn't (the literal with six `F`s is out of range for char):

    macro_rules! erase {
        ($($tt:tt)*) => {}
    }

    fn main() {
        erase! {
            '\u{FFFFFF}'
        }
    }

If we pursue the approach in this PR, then we should run the `unescape_*` family of functions twice: once in the lexer, where we just report errors and disregard the unescaped characters, and once in the parser, where we do the opposite and ignore errors, but collect the unescaped literals. That means that we will be able to remove that `diag: Option` argument (indeed, "optionally" reporting diagnostics seems like a sure way to have bugs).
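A rough sketch of the two-pass idea, under loud assumptions: `unescape_toy`, `validate`, and `cook` are hypothetical names, and the toy core only understands `\n`, `\t`, and `\\`. The point is only that one callback-driven core can serve both callers, with no optional diagnostics argument.

```rust
use std::ops::Range;

/// Hypothetical, deliberately tiny error set for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum EscapeError {
    LoneSlash,
    InvalidEscape,
}

/// Toy core: reports each character or escape with its byte range in `src`.
pub fn unescape_toy<F>(src: &str, callback: &mut F)
where
    F: FnMut(Range<usize>, Result<char, EscapeError>),
{
    let mut chars = src.char_indices().peekable();
    while let Some((start, c)) = chars.next() {
        let res = if c != '\\' {
            Ok(c)
        } else {
            match chars.next() {
                Some((_, 'n')) => Ok('\n'),
                Some((_, 't')) => Ok('\t'),
                Some((_, '\\')) => Ok('\\'),
                Some(_) => Err(EscapeError::InvalidEscape),
                None => Err(EscapeError::LoneSlash),
            }
        };
        // The item ends where the next one starts, or at the end of input.
        let end = chars.peek().map_or(src.len(), |&(i, _)| i);
        callback(start..end, res);
    }
}

/// Lexer-side pass: keep the errors, throw the characters away.
pub fn validate(src: &str) -> Vec<(Range<usize>, EscapeError)> {
    let mut errors = Vec::new();
    unescape_toy(src, &mut |range, res| {
        if let Err(e) = res {
            errors.push((range, e));
        }
    });
    errors
}

/// Parser-side pass: keep the characters, ignore errors (already reported).
pub fn cook(src: &str) -> String {
    let mut out = String::new();
    unescape_toy(src, &mut |_, res| {
        if let Ok(c) = res {
            out.push(c);
        }
    });
    out
}
```

With this split, a literal that is lexed but then discarded by a macro still goes through `validate`, so its errors are reported even though `cook` never runs on it.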
Member Author

matklad commented Apr 29, 2019

Hm, or is the above example expected behavior? We don't check ranges of integer literals, for example:

    macro_rules! erase {
        ($($tt:tt)*) => {}
    }

    fn main() {
        erase!(999u8);
    }

For chars, we do check in the lexer that there are at most six hex digits, but we only do the precise check for range and surrogates in the parser, which seems somewhat arbitrary.
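To illustrate the gap between the two checks, here is a small sketch with hypothetical helper names (`digits_ok` for the shallow digit-count check, `to_char` for the precise one). It relies only on the standard library: `char::from_u32` returns `None` for surrogates and for values above `0x10FFFF`.

```rust
/// Shallow check, in the spirit of the lexer-side rule described above:
/// between one and six hex digits, nothing else.
pub fn digits_ok(hex: &str) -> bool {
    !hex.is_empty() && hex.len() <= 6 && hex.chars().all(|c| c.is_ascii_hexdigit())
}

/// Precise check: the value must also be a valid Unicode scalar value,
/// which excludes surrogates and anything above 0x10FFFF.
pub fn to_char(hex: &str) -> Option<char> {
    if !digits_ok(hex) {
        return None;
    }
    u32::from_str_radix(hex, 16).ok().and_then(char::from_u32)
}
```

`FFFFFF` passes the shallow check but fails the precise one, which is why the `'\u{FFFFFF}'` example slips through when only the first check runs.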

Contributor

petrochenkov commented May 3, 2019

 Ok, let's resolve #60494 separately then. @bors try
Contributor

 Auto merge of #60261 - matklad:one-escape, r=<try> 
introduce unescape module

Member Author

matklad commented May 3, 2019

 This should presumably be tagged with Breaking Change and Waiting on Crater?
Contributor

petrochenkov commented May 3, 2019

 @craterbot run mode=check-only
Contributor

petrochenkov commented May 3, 2019

 @rust-timer build bfdcf6d

Member Author

matklad commented May 3, 2019

 Looks like there are no significant perf differences. Let's wait and see what crater says.
Contributor

petrochenkov commented May 5, 2019

 @bors r+
Contributor

 Auto merge of #60261 - matklad:one-escape, r=petrochenkov 
introduce unescape module

Member Author

matklad commented May 7, 2019

 FWIW, this is now used by rust-analyzer: rust-analyzer/rust-analyzer#1253
