Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize ident #66670

Merged
merged 3 commits into from Dec 26, 2019
Merged

Normalize ident #66670

merged 3 commits into from Dec 26, 2019

Conversation

@crlf0710
Copy link
Contributor

crlf0710 commented Nov 23, 2019

Perform unicode normalization on identifiers. Resolving the first bullet point in #55467.

@rust-highfive

This comment has been minimized.

Copy link
Collaborator

rust-highfive commented Nov 23, 2019

r? @davidtwco

(rust_highfive has picked a reviewer for you, use r? to override)

@crlf0710

This comment has been minimized.

Copy link
Contributor Author

crlf0710 commented Nov 23, 2019

r? @estebank
And i think will hurt performance a little. Maybe we should measure it before it lands.

@rust-highfive rust-highfive assigned estebank and unassigned davidtwco Nov 23, 2019
@rust-highfive

This comment has been minimized.

Copy link
Collaborator

rust-highfive commented Nov 23, 2019

The job x86_64-gnu-llvm-6.0 of your PR failed (pretty log, raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
2019-11-23T14:44:27.8437353Z ##[command]git remote add origin https://github.com/rust-lang/rust
2019-11-23T14:44:27.8453281Z ##[command]git config gc.auto 0
2019-11-23T14:44:27.8499331Z ##[command]git config --get-all http.https://github.com/rust-lang/rust.extraheader
2019-11-23T14:44:27.8540359Z ##[command]git config --get-all http.proxy
2019-11-23T14:44:27.8663581Z ##[command]git -c http.extraheader="AUTHORIZATION: basic ***" fetch --force --tags --prune --progress --no-recurse-submodules --depth=2 origin +refs/heads/*:refs/remotes/origin/* +refs/pull/66670/merge:refs/remotes/pull/66670/merge
---
2019-11-23T15:33:31.2561630Z .................................................................................................... 1600/9281
2019-11-23T15:33:35.4769065Z .................................................................................................... 1700/9281
2019-11-23T15:33:46.6108886Z ...........................i........................................................................ 1800/9281
2019-11-23T15:33:52.2225717Z .................................................................................................... 1900/9281
2019-11-23T15:34:04.2520439Z ............iiiii................................................................................... 2000/9281
2019-11-23T15:34:12.2595071Z .................................................................................................... 2200/9281
2019-11-23T15:34:14.7292704Z .................................................................................................... 2300/9281
2019-11-23T15:34:19.2892354Z .................................................................................................... 2400/9281
2019-11-23T15:34:36.5565960Z .................................................................................................... 2500/9281
---
2019-11-23T15:36:58.2344464Z ............i...............i....................................................................... 4800/9281
2019-11-23T15:37:06.6957700Z .................................................................................................... 4900/9281
2019-11-23T15:37:11.5742193Z .................................................................................................... 5000/9281
2019-11-23T15:37:19.7502276Z .................................................................................................... 5100/9281
2019-11-23T15:37:25.2078253Z .................ii.ii...........i.................................................................. 5200/9281
2019-11-23T15:37:33.5455009Z .................................................................................................... 5400/9281
2019-11-23T15:37:43.2590696Z ...................................................................................................i 5500/9281
2019-11-23T15:37:50.5966007Z .................................................................................................... 5600/9281
2019-11-23T15:37:55.9819282Z .................................................................................................... 5700/9281
2019-11-23T15:37:55.9819282Z .................................................................................................... 5700/9281
2019-11-23T15:38:05.3697115Z .....................................................................................ii...i..ii..... 5800/9281
2019-11-23T15:38:25.1355922Z .................................................................................................... 6000/9281
2019-11-23T15:38:31.8522220Z .................................................................................................... 6100/9281
2019-11-23T15:38:35.6726851Z .................................................................................................... 6200/9281
2019-11-23T15:38:35.6726851Z .................................................................................................... 6200/9281
2019-11-23T15:38:46.8930340Z ........i..ii....................................................................................... 6300/9281
2019-11-23T15:39:02.3584974Z ............................................................................i....................... 6500/9281
2019-11-23T15:39:04.4406845Z .................................................................................................... 6600/9281
2019-11-23T15:39:06.3849978Z ...................................................................i................................ 6700/9281
2019-11-23T15:39:08.8141806Z .................................................................................................... 6800/9281
---
2019-11-23T15:43:05.8447433Z ---- [ui] ui/codemap_tests/unicode_2.rs stdout ----
2019-11-23T15:43:05.8447480Z diff of stderr:
2019-11-23T15:43:05.8447527Z 
2019-11-23T15:43:05.8447558Z 14    |
2019-11-23T15:43:05.8447593Z 15    = help: valid widths are 8, 16, 32, 64 and 128
2019-11-23T15:43:05.8447831Z - error[E0425]: cannot find value `a̐é` in this scope
2019-11-23T15:43:05.8448008Z + error[E0425]: cannot find value `a̐é` in this scope
2019-11-23T15:43:05.8448184Z 18   --> $DIR/unicode_2.rs:6:13
2019-11-23T15:43:05.8448235Z 19    |
2019-11-23T15:43:05.8448235Z 19    |
2019-11-23T15:43:05.8448382Z 20 LL |     let _ = a̐é;
2019-11-23T15:43:05.8448404Z 
2019-11-23T15:43:05.8448422Z 
2019-11-23T15:43:05.8448471Z The actual stderr differed from the expected stderr.
2019-11-23T15:43:05.8448700Z Actual stderr saved to /checkout/obj/build/x86_64-unknown-linux-gnu/test/ui/codemap_tests/unicode_2/unicode_2.stderr
2019-11-23T15:43:05.8448887Z To update references, rerun the tests and pass the `--bless` flag
2019-11-23T15:43:05.8449099Z To only update this specific test, also pass `--test-args codemap_tests/unicode_2.rs`
2019-11-23T15:43:05.8449157Z error: 1 errors occurred comparing output.
2019-11-23T15:43:05.8449191Z status: exit code: 1
2019-11-23T15:43:05.8449191Z status: exit code: 1
2019-11-23T15:43:05.8449899Z command: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "/checkout/src/test/ui/codemap_tests/unicode_2.rs" "-Zthreads=1" "--target=x86_64-unknown-linux-gnu" "--error-format" "json" "-Zui-testing" "-C" "prefer-dynamic" "--out-dir" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/ui/codemap_tests/unicode_2" "-Crpath" "-O" "-Cdebuginfo=0" "-Zunstable-options" "-Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "-L" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/ui/codemap_tests/unicode_2/auxiliary" "-A" "unused"
2019-11-23T15:43:05.8450207Z ------------------------------------------
2019-11-23T15:43:05.8450233Z 
2019-11-23T15:43:05.8450392Z ------------------------------------------
2019-11-23T15:43:05.8450441Z stderr:
2019-11-23T15:43:05.8450441Z stderr:
2019-11-23T15:43:05.8450782Z ------------------------------------------
2019-11-23T15:43:05.8451221Z error: invalid width `7` for integer literal
2019-11-23T15:43:05.8451586Z    |
2019-11-23T15:43:05.8451586Z    |
2019-11-23T15:43:05.8451829Z LL |     let _ = ("a̐éö̲", 0u7); //~ ERROR invalid width
2019-11-23T15:43:05.8451955Z    |
2019-11-23T15:43:05.8451955Z    |
2019-11-23T15:43:05.8452001Z    = help: valid widths are 8, 16, 32, 64 and 128
2019-11-23T15:43:05.8452034Z 
2019-11-23T15:43:05.8452098Z error: invalid width `42` for integer literal
2019-11-23T15:43:05.8452425Z    |
2019-11-23T15:43:05.8452425Z    |
2019-11-23T15:43:05.8452713Z LL |     let _ = ("아あ", 1i42); //~ ERROR invalid width
2019-11-23T15:43:05.8452809Z    |
2019-11-23T15:43:05.8452809Z    |
2019-11-23T15:43:05.8452876Z    = help: valid widths are 8, 16, 32, 64 and 128
2019-11-23T15:43:05.8453165Z error[E0425]: cannot find value `a̐é` in this scope
2019-11-23T15:43:05.8453424Z   --> /checkout/src/test/ui/codemap_tests/unicode_2.rs:6:13
2019-11-23T15:43:05.8453489Z    |
2019-11-23T15:43:05.8453725Z LL |     let _ = a̐é; //~ ERROR cannot find
---
2019-11-23T15:43:05.8479568Z thread 'main' panicked at 'Some tests failed', src/tools/compiletest/src/main.rs:537:22
2019-11-23T15:43:05.8479634Z note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
2019-11-23T15:43:05.8491102Z 
2019-11-23T15:43:05.8491212Z 
2019-11-23T15:43:05.8494428Z command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/compiletest" "--compile-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib" "--run-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/lib" "--rustc-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "--src-base" "/checkout/src/test/ui" "--build-base" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/ui" "--stage-id" "stage2-x86_64-unknown-linux-gnu" "--mode" "ui" "--target" "x86_64-unknown-linux-gnu" "--host" "x86_64-unknown-linux-gnu" "--llvm-filecheck" "/usr/lib/llvm-6.0/bin/FileCheck" "--host-rustcflags" "-Crpath -O -Cdebuginfo=0 -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--target-rustcflags" "-Crpath -O -Cdebuginfo=0 -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--docck-python" "/usr/bin/python2.7" "--lldb-python" "/usr/bin/python2.7" "--gdb" "/usr/bin/gdb" "--quiet" "--llvm-version" "6.0.0\n" "--system-llvm" "--cc" "" "--cxx" "" "--cflags" "" "--llvm-components" "" "--llvm-cxxflags" "" "--adb-path" "adb" "--adb-test-dir" "/data/tmp/work" "--android-cross-path" "" "--color" "always"
2019-11-23T15:43:05.8494915Z 
2019-11-23T15:43:05.8495070Z 
2019-11-23T15:43:05.8504576Z failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test
2019-11-23T15:43:05.8504636Z Build completed unsuccessfully in 0:53:08
2019-11-23T15:43:05.8504636Z Build completed unsuccessfully in 0:53:08
2019-11-23T15:43:05.8548820Z == clock drift check ==
2019-11-23T15:43:05.8560947Z   local time: Sat Nov 23 15:43:05 UTC 2019
2019-11-23T15:43:06.3766966Z   network time: Sat, 23 Nov 2019 15:43:06 GMT
2019-11-23T15:43:06.3770366Z == end clock drift check ==
2019-11-23T15:43:07.2796943Z 
2019-11-23T15:43:07.2877973Z ##[error]Bash exited with code '1'.
2019-11-23T15:43:07.2907889Z ##[section]Starting: Checkout
2019-11-23T15:43:07.2909365Z ==============================================================================
2019-11-23T15:43:07.2909406Z Task         : Get sources
2019-11-23T15:43:07.2909441Z Description  : Get sources from a repository. Supports Git, TfsVC, and SVN repositories.

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@Centril

This comment has been minimized.

Copy link
Member

Centril commented Nov 23, 2019

Copy link
Member

Manishearth left a comment

lgtm from a unicode standpoint

hopefully there are no other places where we create Symbols?

@Centril

This comment has been minimized.

Copy link
Member

Centril commented Nov 23, 2019

cc @petrochenkov re. Symbols

@crlf0710 crlf0710 force-pushed the crlf0710:normalize_ident branch from 912227a to 0e3036b Nov 23, 2019
@crlf0710

This comment has been minimized.

Copy link
Contributor Author

crlf0710 commented Nov 23, 2019

hopefully there are no other places where we create Symbols?

At least there's one in the proc_macro crate. I looked at the code but not sure how should i change it. I know nothing about the "bridge" thing. I'm leaving that into a further PR.

@estebank

This comment has been minimized.

Copy link
Contributor

estebank commented Nov 25, 2019

@Manishearth shouldn't we have a warning when encountering non-NFC chars? Shouldn't we, for the sake of third party tools, attempt to "soft-enforce" byte comparable tokens? These questions might are possibly unrelated to this PR in particular and I should have brought up this point in the mega-thread.

@estebank

This comment has been minimized.

Copy link
Contributor

estebank commented Nov 25, 2019

The PR LGTM, but I can foresee issues cropping up when involving macros.

@Manishearth

This comment has been minimized.

Copy link
Member

Manishearth commented Nov 25, 2019

@estebank no, non-NFC chars appear for all kinds of reasons

@estebank

This comment has been minimized.

Copy link
Contributor

estebank commented Nov 26, 2019

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

Copy link

rust-timer commented Nov 26, 2019

Awaiting bors try build completion

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Nov 26, 2019

⌛️ Trying commit 0e3036b with merge ce0c4f6...

bors added a commit that referenced this pull request Nov 26, 2019
Normalize ident

Perform unicode normalization on identifiers. Resolving the first bullet point in #55467.
@bors

This comment has been minimized.

Copy link
Contributor

bors commented Nov 26, 2019

☀️ Try build successful - checks-azure
Build commit: ce0c4f6 (ce0c4f69329436f488883ad75290b9e38d921fed)

@rust-timer

This comment has been minimized.

Copy link

rust-timer commented Nov 26, 2019

Queued ce0c4f6 with parent a44774c, future comparison URL.

@crlf0710

This comment has been minimized.

Copy link
Contributor Author

crlf0710 commented Nov 28, 2019

Mmm, is the timer task still queued?

@crlf0710

This comment has been minimized.

Copy link
Contributor Author

crlf0710 commented Nov 29, 2019

@bors try @rust-timer queue

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Nov 29, 2019

@crlf0710: 🔑 Insufficient privileges: not in try users

@rust-timer

This comment has been minimized.

Copy link

rust-timer commented Nov 29, 2019

Insufficient permissions to issue commands to rust-timer.

@Dylan-DPC

This comment has been minimized.

Copy link
Member

Dylan-DPC commented Dec 11, 2019

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

Copy link

rust-timer commented Dec 11, 2019

Awaiting bors try build completion

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Dec 11, 2019

🔒 Merge conflict

This pull request and the master branch diverged in a way that cannot be automatically merged. Please rebase on top of the latest master branch, and let the reviewer approve again.

How do I rebase?

Assuming self is your fork and upstream is this repository, you can resolve the conflict following these steps:

  1. git checkout normalize_ident (switch to your branch)
  2. git fetch upstream master (retrieve the latest master)
  3. git rebase upstream/master -p (rebase on top of it)
  4. Follow the on-screen instruction to resolve conflicts (check git status if you got lost).
  5. git push self normalize_ident --force-with-lease (update this PR)

You may also read Git Rebasing to Resolve Conflicts by Drew Blessing for a short tutorial.

Please avoid the "Resolve conflicts" button on GitHub. It uses git merge instead of git rebase which makes the PR commit history more difficult to read.

Sometimes step 4 will complete without asking for resolution. This is usually due to difference between how Cargo.lock conflict is handled during merge and rebase. This is normal, and you should still perform step 5 to update this PR.

Error message
Auto-merging src/librustc_parse/lexer/mod.rs
Auto-merging src/librustc_parse/Cargo.toml
CONFLICT (content): Merge conflict in src/librustc_parse/Cargo.toml
Auto-merging Cargo.lock
Automatic merge failed; fix conflicts and then commit the result.

@Mark-Simulacrum

This comment has been minimized.

Copy link
Member

Mark-Simulacrum commented Dec 19, 2019

I believe this caused the rollup failure here: #67435 (comment)

You can likely fix it by adding smallvec at version 1.0 (not 0.6!) to https://github.com/rust-lang/rust/blob/master/src/tools/rustc-workspace-hack/Cargo.toml#L65 and use Cargo's package renaming to make sure those two don't conflict as keys in the dependencies section.

@bors r-

@crlf0710 crlf0710 force-pushed the crlf0710:normalize_ident branch from 49f3bc9 to 27e7a1b Dec 26, 2019
@crlf0710

This comment has been minimized.

Copy link
Contributor Author

crlf0710 commented Dec 26, 2019

Rebased and added dependency according to @Mark-Simulacrum 's instruction. @estebank

@Dylan-DPC

This comment has been minimized.

Copy link
Member

Dylan-DPC commented Dec 26, 2019

@bors r+

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Dec 26, 2019

📌 Commit 27e7a1b has been approved by Dylan-DPC

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Dec 26, 2019

⌛️ Testing commit 27e7a1b with merge acb6690...

bors added a commit that referenced this pull request Dec 26, 2019
Normalize ident

Perform unicode normalization on identifiers. Resolving the first bullet point in #55467.
@bors

This comment has been minimized.

Copy link
Contributor

bors commented Dec 26, 2019

☀️ Test successful - checks-azure
Approved by: Dylan-DPC
Pushing acb6690 to master...

@bors bors added the merged-by-bors label Dec 26, 2019
@bors bors merged commit 27e7a1b into rust-lang:master Dec 26, 2019
5 checks passed
5 checks passed
homu Test successful
Details
pr Build #20191226.1 succeeded
Details
pr (Linux mingw-check) Linux mingw-check succeeded
Details
pr (Linux x86_64-gnu-llvm-7) Linux x86_64-gnu-llvm-7 succeeded
Details
pr (Linux x86_64-gnu-tools) Linux x86_64-gnu-tools succeeded
Details
@crlf0710 crlf0710 deleted the crlf0710:normalize_ident branch Dec 26, 2019
@petrochenkov

This comment has been minimized.

Copy link
Contributor

petrochenkov commented Dec 28, 2019

I have two questions / concerns:

  • Is it possible to normalize a non-ASCII identifier into an ASCII identifier?
  • Even without that non-ASCII identifiers are gated at the wrong point (AST after macro expansion) and not entirely correctly (visit_name instead of identifier tokens), rather than at the correct point (during lexing).
    Is there any desire to fix the gating (do it in lexer before the normalization) before the feature is stabilized?
@crlf0710

This comment has been minimized.

Copy link
Contributor Author

crlf0710 commented Dec 28, 2019

Following the RFC, since we're using NFC normalization, I think the answer to the first question is "no it's not possible". NFC normalization is basically reordering and regrouping, so nothing really disappears.

@crlf0710

This comment has been minimized.

Copy link
Contributor Author

crlf0710 commented Dec 28, 2019

For the second question, i think that's a question for a potential "shepherd" for RFC 2457, currently we don't have one, but it would be best if there is one. cc #55467

In my personal opinion though, i think it's not too wrong to gate at the AST after macro expansion, since macros and procedure macros can generate new identifiers, and need to be gated too. I basically know nothing about the (visit_name instead of identifier tokens) part.

@crlf0710

This comment has been minimized.

Copy link
Contributor Author

crlf0710 commented Dec 28, 2019

And by the way i believe there're some remaining work on performing this normalization to the user data sent to proc_macro crate. the RFC mentioned that we need to make sure:

Every API accepting raw identifiers (such as proc_macro::Ident::new normalizes them to NFC and APIs returning them as strings (like proc_macro::Ident::to_string) return the normalized form.

I think I need some mentor here. @petrochenkov could you give some instructions when you have time? I tried to read the code in proc_macro crate, but got total lost in the bridge-related thing. It's also concerning if we need to add the unicode-normalization dependency to the proc_macro crate.

@petrochenkov

This comment has been minimized.

Copy link
Contributor

petrochenkov commented Dec 29, 2019

@crlf0710
The "cannot be a raw identifier" error is an example of catching something at lexer level.

It's emitted twice in librustc_parse\lexer\mod.rs and in libsyntax_expand\proc_macro_server.rs and those are the only places where an identifier token can be created from user input.

So, the normalization (and also gating, arguably) needs to happen in the same places.
(It will catch anything produced by macros as well.)

@crlf0710

This comment has been minimized.

Copy link
Contributor Author

crlf0710 commented Dec 29, 2019

@petrochenkov do you mean https://github.com/rust-lang/rust/blob/master/src/libsyntax_expand/proc_macro_server.rs#L330 ?

I'm a little confused. After normalization, if the identifier changes, should i intern, get a new Symbol and use it in the result identifier? What should i do there with the existing Symbol, leaving it alone?

@petrochenkov

This comment has been minimized.

Copy link
Contributor

petrochenkov commented Dec 29, 2019

@crlf0710
Yes.

@crlf0710

This comment has been minimized.

Copy link
Contributor Author

crlf0710 commented Dec 29, 2019

Great, let me create a branch and PR it.

@petrochenkov

This comment has been minimized.

Copy link
Contributor

petrochenkov commented Dec 29, 2019

By the way, the RFC says the normalization is performed during parsing rather than lexing.
That means proc macros receive the original unnormalized tokens and can perform their own normalization (same as with escaping in literals).
But that's probably not the intent of the RFC.

On the other hand, if the normalization is performed during lexing, that would be the first case in which we do not preserve original tokens by design (except for a big hack with doc comments in macros).

Centril added a commit to Centril/rust that referenced this pull request Dec 31, 2019
…henkov

Add symbol normalization for proc_macro_server.

Follow up for rust-lang#66670, finishing the first bullet point in rust-lang#55467.

r? @petrochenkov
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
You can’t perform that action at this time.