# introduce unescape module #60261

Merged
merged 2 commits into from May 6, 2019
+1,048 −785

## Conversation

8 participants
Member

### matklad commented Apr 25, 2019 • edited

 A WIP PR to gauge early feedback

Currently, we deal with escape sequences twice: once when we lex a string, and a second time when we unescape literals. Note that we also produce different sets of diagnostics in these two cases.

This PR aims to remove this duplication, by introducing a new unescape module as a single source of truth for character escaping rules.

I think this would be a useful cleanup by itself, but I also need this for #59706.

In the current state, the PR has an unescape module which fully (modulo bugs) deals with string and char literals. I am quite happy about the state of this module.

What this PR doesn't have yet are:
* handling of byte and byte string literals (should be simple to add)
* good diagnostics
* actual removal of code from lexer (giant scan_char_or_byte should go away completely)
* performance check
* general cleanup of the new code

Diagnostics will be the most labor-consuming bit here, but they are mostly a question of just correctly adjusting spans to sub-tokens. The current setup for diagnostics is that unescape produces a plain old enum with various problems, and they are rendered into Handler separately. This bit is not actually required (it is possible to just pass the Handler in), but I like the separation between diagnostics and logic this approach imposes, and such separation should again be useful for #59706.
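
As a rough illustration of the design described above (the escaping logic returns a plain enum, and rendering into a diagnostics sink happens elsewhere), here is a minimal sketch for char literals only. The names `EscapeError`, `unescape_char`, `Handler`, and `report`, as well as the message strings and the small escape set, are assumptions made for this example and are not taken from the PR's code.

```rust
/// Hypothetical set of problems; the PR's actual enum may look different.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EscapeError {
    ZeroChars,
    MoreThanOneChar,
    LoneSlash,
    InvalidEscape,
}

/// Unescapes the body of a char literal (the text between the quotes).
/// Pure logic: it knows nothing about how errors get displayed.
fn unescape_char(src: &str) -> Result<char, EscapeError> {
    let mut chars = src.chars();
    let first = chars.next().ok_or(EscapeError::ZeroChars)?;
    let c = if first == '\\' {
        match chars.next() {
            None => return Err(EscapeError::LoneSlash),
            Some('n') => '\n',
            Some('t') => '\t',
            Some('\\') => '\\',
            Some('\'') => '\'',
            Some(_) => return Err(EscapeError::InvalidEscape),
        }
    } else {
        first
    };
    if chars.next().is_some() {
        return Err(EscapeError::MoreThanOneChar);
    }
    Ok(c)
}

/// Stand-in for a diagnostics sink (an assumption, not the compiler's Handler).
struct Handler {
    messages: Vec<String>,
}

/// Rendering lives apart from the escaping rules.
fn report(handler: &mut Handler, lit: &str, err: EscapeError) {
    let msg = match err {
        EscapeError::ZeroChars => "empty character literal",
        EscapeError::MoreThanOneChar => "character literal may only contain one codepoint",
        EscapeError::LoneSlash => "trailing backslash in literal",
        EscapeError::InvalidEscape => "unknown character escape",
    };
    handler.messages.push(format!("error: {}: '{}'", msg, lit));
}

fn main() {
    let mut handler = Handler { messages: Vec::new() };
    for lit in ["a", r"\n", r"\q", "ab"] {
        match unescape_char(lit) {
            Ok(c) => println!("'{}' unescapes to {:?}", lit, c),
            Err(err) => report(&mut handler, lit, err),
        }
    }
    for message in &handler.messages {
        println!("{}", message);
    }
}
```

Because `unescape_char` only returns data, a second consumer (for example an IDE) could reuse it and render the same enum in its own way.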

Collaborator

### rust-highfive commented Apr 25, 2019

 r? @eddyb (rust_highfive has picked a reviewer for you, use r? to override)

Collaborator

### rust-highfive commented Apr 25, 2019

 The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

    [00:03:55] tidy check
    [00:03:55] tidy error: /checkout/src/libsyntax/parse/lexer/mod.rs:1456: line longer than 100 chars
    [00:03:55] tidy error: /checkout/src/libsyntax/parse/unescape_error_reporting.rs:10: line longer than 100 chars
    [00:03:55] tidy error: /checkout/src/libsyntax/parse/unescape_error_reporting.rs:57: TODO is deprecated; use FIXME
    [00:03:57] some tidy checks failed
    [00:03:57] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/tidy" "/checkout/src" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "--no-vendor" "--quiet"
    [00:03:57] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test src/tools/tidy
    [00:03:57] Build completed unsuccessfully in 0:00:48
    [00:03:57] Makefile:67: recipe for target 'tidy' failed
    [00:03:57] make: *** [tidy] Error 1
    The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

Collaborator

### rust-highfive commented Apr 25, 2019

 The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

    [00:03:38] tidy check
    [00:03:38] tidy error: /checkout/src/libsyntax/parse/lexer/mod.rs:1456: line longer than 100 chars
    [00:03:38] tidy error: /checkout/src/libsyntax/parse/unescape_error_reporting.rs:10: line longer than 100 chars
    [00:03:38] tidy error: /checkout/src/libsyntax/parse/unescape_error_reporting.rs:57: TODO is deprecated; use FIXME
    [00:03:40] some tidy checks failed
    [00:03:40] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/tidy" "/checkout/src" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "--no-vendor" "--quiet"
    [00:03:40] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test src/tools/tidy
    [00:03:40] Build completed unsuccessfully in 0:00:44
    [00:03:40] Makefile:67: recipe for target 'tidy' failed
    [00:03:40] make: *** [tidy] Error 1
    The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)
Collaborator

### rust-highfive commented Apr 28, 2019

 The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

    [00:03:32] tidy check
    [00:03:32] tidy error: /checkout/src/libsyntax/parse/lexer/mod.rs:1181: line longer than 100 chars
    [00:03:32] tidy error: /checkout/src/libsyntax/parse/unescape_error_reporting.rs:10: line longer than 100 chars
    [00:03:32] tidy error: /checkout/src/libsyntax/parse/unescape_error_reporting.rs:57: TODO is deprecated; use FIXME
    [00:03:34] some tidy checks failed
    [00:03:34] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/tidy" "/checkout/src" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "--no-vendor" "--quiet"
    [00:03:34] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test src/tools/tidy
    [00:03:34] Build completed unsuccessfully in 0:00:44
    [00:03:34] make: *** [tidy] Error 1
    [00:03:34] Makefile:67: recipe for target 'tidy' failed
    The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)
Contributor

### petrochenkov commented Apr 28, 2019

 Meta: for reviewing convenience it's better to update UI test outputs and satisfy tidy to make CI green, even if the changes in test results are temporarily wrong / intended to disappear. This way it's clear how exactly they are wrong and what still needs to be fixed.
Collaborator

### petrochenkov reviewed Apr 28, 2019

src/libsyntax/parse/lexer/mod.rs

### petrochenkov reviewed Apr 28, 2019

src/libsyntax/parse/lexer/mod.rs

### petrochenkov reviewed Apr 28, 2019

src/libsyntax/parse/mod.rs Outdated

### petrochenkov reviewed Apr 28, 2019

src/libsyntax/parse/unescape.rs

### petrochenkov reviewed Apr 28, 2019

src/libsyntax/parse/mod.rs
Contributor

### petrochenkov commented Apr 28, 2019

 Question: what happens if a literal is lexed, but never "parsed properly"? For example, if it's passed to a macro that accepts tts and throws them away. The errors for incorrect escapes, etc, should be reported in that case as well. (P.S. I haven't reviewed everything yet, will continue tomorrow.)
Contributor

Member Author

### matklad commented Apr 29, 2019

 > Question: what happens if a literal is lexed, but never "parsed properly"?

Good question! Given the diag: Option<(Span, &Handler)> argument to the char_lit function, I was under the impression that we always parse literals properly. It turns out that even today we don't do that, so the existing code is buggy. The following compiles, while it shouldn't (the literal with six F digits is out of range for char):

    macro_rules! erase { ($($tt:tt)*) => {} }
    fn main() { erase! { '\u{FFFFFF}' } }

If we pursue the approach in this PR, then we should run the unescape_* family of functions twice: once in the lexer, where we just report errors and disregard escaped characters, and once in the parser, where we do the opposite and ignore errors, but collect unescaped literals. That means that we will be able to remove that diag: Option argument (indeed, "optionally" reporting diagnostics seems like a sure way to have bugs).
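
The "run it twice" idea can be sketched as follows: one callback-driven routine, used once to collect errors and once to collect characters. This is only an illustration under simplifying assumptions; `BadEscape`, `unescape_str`, `validate`, and `cook` are hypothetical names, and only three escapes are recognized.

```rust
/// Hypothetical error type: byte offset of the offending backslash.
#[derive(Debug, PartialEq)]
struct BadEscape {
    offset: usize,
}

/// Single source of truth for the (tiny, illustrative) escaping rules.
/// Every character or problem is handed to `callback` in source order.
fn unescape_str(src: &str, callback: &mut dyn FnMut(Result<char, BadEscape>)) {
    let mut chars = src.char_indices();
    while let Some((offset, c)) = chars.next() {
        let res = if c != '\\' {
            Ok(c)
        } else {
            match chars.next() {
                Some((_, 'n')) => Ok('\n'),
                Some((_, '\\')) => Ok('\\'),
                Some((_, '"')) => Ok('"'),
                // Unknown escape, or a backslash at the very end.
                _ => Err(BadEscape { offset }),
            }
        };
        callback(res);
    }
}

/// Lexer side: report errors, throw the characters away.
fn validate(src: &str) -> Vec<BadEscape> {
    let mut errors = Vec::new();
    unescape_str(src, &mut |res| {
        if let Err(err) = res {
            errors.push(err);
        }
    });
    errors
}

/// Parser side: keep the characters, skip errors (already reported).
fn cook(src: &str) -> String {
    let mut out = String::new();
    unescape_str(src, &mut |res| {
        if let Ok(c) = res {
            out.push(c);
        }
    });
    out
}

fn main() {
    let src = r"a\qb\n";
    println!("lexer pass reports: {:?}", validate(src));
    println!("parser pass yields: {:?}", cook(src));
}
```

With this shape, a literal that is lexed but later discarded by a macro still goes through `validate`, so its errors are not lost, and neither caller needs an optional diagnostics argument.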
Member Author

### matklad commented Apr 29, 2019

 Hm, or is the above example an expected behavior? We don't check ranges of integer literals, for example:

    macro_rules! erase { ($($tt:tt)*) => {} }
    fn main() { erase!(999u8); }

For chars, we do check that there are at most six hex digits in the lexer, but we only do the precise check for range and surrogates in the parser, which seems somewhat arbitrary.
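
To make the distinction between the coarse digit-count check and the precise range and surrogate checks concrete, here is a small sketch that performs all of them in one place. The `UnicodeEscape` enum and the `classify` function are hypothetical, and the sketch is simplified (for instance, it does not accept `_` separators between digits).

```rust
/// Hypothetical outcome of validating the digits inside `\u{...}`.
#[derive(Debug, PartialEq)]
enum UnicodeEscape {
    Char(char),
    NotHex,
    TooManyDigits,
    LoneSurrogate,
    OutOfRange,
}

/// `digits` is the text between the braces of a `\u{...}` escape.
fn classify(digits: &str) -> UnicodeEscape {
    if digits.is_empty() || !digits.chars().all(|c| c.is_ascii_hexdigit()) {
        return UnicodeEscape::NotHex;
    }
    // Coarse check: at most six hex digits.
    if digits.len() > 6 {
        return UnicodeEscape::TooManyDigits;
    }
    let value = u32::from_str_radix(digits, 16).expect("six hex digits always fit in u32");
    // Precise check: `from_u32` rejects surrogates and values above 0x10FFFF.
    match std::char::from_u32(value) {
        Some(c) => UnicodeEscape::Char(c),
        None if value > 0x10FFFF => UnicodeEscape::OutOfRange,
        None => UnicodeEscape::LoneSurrogate,
    }
}

fn main() {
    for digits in ["41", "D800", "FFFFFF", "1234567"] {
        println!("\\u{{{}}} -> {:?}", digits, classify(digits));
    }
}
```

Here `FFFFFF` passes the six-digit check but fails the range check, which is the gap the `erase!` example above falls into.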

Collaborator

Contributor

### petrochenkov commented May 3, 2019

 Ok, let's resolve #60494 separately then. @bors try
Contributor

### bors commented May 3, 2019

 ⌛️ Trying commit 1835cbe with merge bfdcf6d...

### bors added a commit that referenced this pull request May 3, 2019

 Auto merge of #60261 - matklad:one-escape, r=<try> 
introduce unescape module

A WIP PR to gauge early feedback

Currently, we deal with escape sequences twice: once when we [lex](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/lexer/mod.rs#L928-L1065) a string, and a second time when we [unescape](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/mod.rs#L313-L366) literals. Note that we also produce different sets of diagnostics in these two cases.

This PR aims to remove this duplication, by introducing a new unescape module as a single source of truth for character escaping rules.

I think this would be a useful cleanup by itself, but I also need this for #59706.

In the current state, the PR has unescape module which fully (modulo bugs) deals with string and char literals. I am quite happy about the state of this module

What this PR doesn't have yet are:
* [x] handling of byte and byte string literals (should be simple to add)
* [x] good diagnostics
* [x] actual removal of code from lexer (giant scan_char_or_byte should go away completely)
* [ ] performance check
* [x] general cleanup of the new code

Diagnostics will be the most labor-consuming bit here, but they are mostly a question of just correctly adjusting spans to sub-tokens. The current setup for diagnostics is that unescape produces a plain old enum with various problems, and they are rendered into Handler separately. This bit is not actually required (it is possible to just pass the Handler in), but I like the separation between diagnostics and logic this approach imposes, and such separation should again be useful for #59706

cc @eddyb , @petrochenkov
 bfdcf6d 
Contributor

### bors commented May 3, 2019

 ☀️ Try build successful - checks-travis Build commit: bfdcf6d
Member Author

### matklad commented May 3, 2019

 This should probably be tagged with Breaking Change and Waiting on Crater?
Contributor

### petrochenkov commented May 3, 2019

 @craterbot run mode=check-only
Collaborator

### craterbot commented May 3, 2019

 👌 Experiment pr-60261 created and queued. 🤖 Automatically detected try build bfdcf6d 🔍 You can check out the queue and this experiment's details. ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

Collaborator

### craterbot commented May 3, 2019

 🚧 Experiment pr-60261 is now running on agent aws-2. ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more
Contributor

### petrochenkov commented May 3, 2019

 @rust-timer build bfdcf6d

### rust-timer commented May 3, 2019

 Success: Queued bfdcf6d with parent 1891bfa, comparison URL.

### rust-timer commented May 3, 2019

 Finished benchmarking try commit bfdcf6d
Member Author

### matklad commented May 3, 2019 • edited

 Looks like there are no significant perf differences. Let's wait and see what crater says.
Collaborator

### craterbot commented May 5, 2019

 🎉 Experiment pr-60261 is completed! 📊 0 regressed and 0 fixed (60951 total) 📰 Open the full report. ⚠️ If you notice any spurious failure please add them to the blacklist! ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

Contributor

### petrochenkov commented May 5, 2019

 @bors r+
Contributor

### bors commented May 5, 2019

 📌 Commit 1835cbe has been approved by petrochenkov

Contributor

### bors commented May 6, 2019

 ⌛️ Testing commit 1835cbe with merge 46d0ca0...

### bors added a commit that referenced this pull request May 6, 2019

 Auto merge of #60261 - matklad:one-escape, r=petrochenkov 
introduce unescape module

 46d0ca0 
Contributor

### bors commented May 6, 2019

 ☀️ Test successful - checks-travis, status-appveyor Approved by: petrochenkov Pushing 46d0ca0 to master...


### bors merged commit 1835cbe into rust-lang:master May 6, 2019 2 checks passed

Member Author

### matklad commented May 7, 2019

 FWIW, this is now used by rust-analyzer: rust-analyzer/rust-analyzer#1253
