Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for RFC 2457, "Allow non-ASCII identifiers" #55467

Closed
17 of 23 tasks
Centril opened this issue Oct 29, 2018 · 138 comments
Closed
17 of 23 tasks

Tracking issue for RFC 2457, "Allow non-ASCII identifiers" #55467

Centril opened this issue Oct 29, 2018 · 138 comments
Labels
B-RFC-approved Blocker: Approved by a merged RFC but not yet implemented. B-RFC-implemented Blocker: Approved by a merged RFC and implemented. B-unstable Blocker: Implemented in the nightly compiler and unstable. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. F-non_ascii_idents `#![feature(non_ascii_idents)]` finished-final-comment-period The final comment period is finished for this PR / Issue. T-lang Relevant to the language team, which will review and decide on the PR/issue.

Comments

@Centril
Copy link
Contributor

Centril commented Oct 29, 2018

This is a tracking issue for the RFC "Allow non-ASCII identifiers" (rust-lang/rfcs#2457).

Steps:

Unresolved questions:

  • Which context is adequate for confusable detection: file, current scope, crate?
  • Should ZWNJ and ZWJ be allowed in identifiers?
  • How are non-ASCII idents best supported in debuggers?
    Resolved: DWARF and debuggers handle UTF-8 just fine
  • Which name mangling scheme is used by the compiler? (Punycode, see RFC2603)
  • Is there a better name for the less_used_codepoints lint?
    Resolved in favour of uncommon_codepoints
  • Which lint should the global mixed scripts confusables detection trigger?
    Resolved in favor of mixed_script_confusables
  • How badly do non-ASCII idents exacerbate const pattern confusion
    (Statics shadow local variables causing "refutable pattern error", and non-obvious bugs. #7526, We shouldn't even try to resolve irrefutable patterns as constants #49680)?
    Can we improve precision of linting here?
  • In mixed_script_confusables, do we actually need to make an exception for Latin identifiers?
  • Terminal width is a tricky with unicode. Some characters are long, some have lengths dependent on the fonts installed (e.g. emoji sequences), and modifiers are a thing. The concept of monospace font doesn't generalize to other scripts as well. How does rustfmt deal with this when determining line width?
  • right-to-left scripts can lead to weird rendering in mixed contexts (depending on the software used), especially when mixed with operators. This is not something that should block stabilization, however we feel it is important to explicitly call out. Future RFCs (preferably put forth by RTL-using communities) may attempt to improve this situation (e.g. by allowing bidi control characters in specific contexts).
  • Tweak XID_Start / XID_Continue? XID_Start / XID_Continue might not be quite right #4928

    http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm

    The ISO JTC1/SC22/WG14 (C language) think that possibly UTR#31 didn't quite hit the nail on the head in terms of defining identifier syntax. They have a couple tweaks in mind. Consider following their lead.


zulip channel topic for real-time discussion:
https://rust-lang.zulipchat.com/#narrow/stream/213817-t-lang/topic/nonascii.20identifiers(rfc.202457)

@Centril Centril added the T-lang Relevant to the language team, which will review and decide on the PR/issue. label Oct 29, 2018
@Manishearth
Copy link
Member

last unresolved question isn't a real unresolved question, it was included in the RFC for completeness but does not block this issue.

@Centril
Copy link
Contributor Author

Centril commented Oct 29, 2018

@joshtriplett Please check that the list of checkboxes above are satisfactory. :)

@Manishearth alright; leave a note under it to that effect?

@Manishearth
Copy link
Member

The note saying so is already in the unresolved q

@8573
Copy link

8573 commented Oct 29, 2018

Is there a better name for the less_used_codepoints lint?

Substituting "rare" or "unusual" for "less used" seems to me a simple, if not necessarily final, improvement, replacing the somewhat awkward "less used" with a single, shorter, more usual synonym.

(Edit: I note that I personally oppose allowing non-ASCII identifiers, but I recognize that the Rust Team favors it, and I have no problem bowing to their decision and chipping in my cents to help.)

@Manishearth
Copy link
Member

Manishearth commented Oct 29, 2018 via email

@Serentty
Copy link
Contributor

I would prefer “rare” as it sounds more objective to me than “unusual”, and perhaps less judgemental as well.

@eaglgenes101
Copy link

My first thought was "uncommon", but that's not strong enough of an adjective to get the intended meaning across.

@Centril
Copy link
Contributor Author

Centril commented Nov 1, 2018

I'm partial towards "rare" as well; rare_codepoints is pretty short and sweet.

@glaebhoerl
Copy link
Contributor

If we need something even stronger we might try "mythic". 😛

@eddyb
Copy link
Member

eddyb commented Nov 3, 2018

I prefer uncommon_codepoints as being more serious/boring than rare/legendary/mythic/etc.

@8573
Copy link

8573 commented Nov 4, 2018

@eddyb

I prefer uncommon_codepoints as being more serious/boring than rare/legendary/mythic/etc.

As a native and competent English-user who generally is seen as serious/boring [1], while I agree that "legendary" and "mythic" sound rather fantastical [2], I don't think "rare" does. The distinction I would draw between uncommon_codepoints and rare_codepoints is that "rare" may be seen as stronger than "uncommon", at least in US English (Merriam-Webster and Wiktionary say so, whereas Oxford and Cambridge don't say so explicitly).

[1] (I recognize this may not have been a trait I displayed when I was a member of your channel.)
[2] (similarly to the Unicode term "astral plane", which, possibly relevantly, if I understand correctly, was changed to "supplementary plane")

@Kixunil
Copy link
Contributor

Kixunil commented Nov 5, 2018

I'm much against non-ascii identifiers, but I'm not going to repeat what other folks said. I just noticed that no example of malicious code was provided so far, so I'm providing one:

fn list_items_in_category(category: &str) -> io::Result<String> {
    let cate𝚐ory = sanitize_untrusted_input(category);
    debug!("Listing category {}", cate𝚐ory);
    system(format!("grep '^{},' /my/simple/database | awk -F , "{{ print $2 }}", category))
}

Who can spot the problem without looking at character codes?

As a side note, I'd like to provide my experience with attempts to localize everything (feel free to skip the rest of my comment if you want to remain technical). I went to a school where they translate literally every technical term. In an attempt to make everything understandable to everyone, they translated even things that are very difficult to translate reasonably.

You'd expect that it was much easier to learn at that school compared to others schools that don't do that, right? Well, life is weird. It was hard to understand, I felt like Alice in wonderland and it took me a week to realize that "that weird term I didn't hear before" was the actual thing I wanted to study and the very reason I signed up for that specific school!

Of course, this is not directly an argument against non-ascii identifiers. I just wanted to express my concerns to all those wonderful loving people (seriously) who want for everyone to feel great in Rust community, so that they remain vigilant and avoid accidentally going against their own beliefs.

@glaebhoerl
Copy link
Contributor

while I agree that "legendary" and "mythic" sound rather fantastical [2], I don't think "rare" does.

(Agreed. My comment was entirely in jest (as I hope should've been evident?), and I had no intention of tarnishing "rare" by association, a word which is itself ordinary and common.)

@eddyb
Copy link
Member

eddyb commented Nov 5, 2018

Who can spot the problem without looking at character codes?

It's trivial on my font. Also, we want to start with lints against this sort of thing.

@ketsuban
Copy link
Contributor

ketsuban commented Nov 5, 2018

Who can spot the problem without looking at character codes?

Github's monospace font configuration on my machine ended up using a binocular glyph for U+0067 LATIN SMALL LETTER G but a monocular one for U+1D690 MATHEMATICAL MONOSPACE SMALL G, so I spotted it instantly.

That said, the confusable_idents lint will once implemented almost certainly flag this code, since MATHEMATICAL MONOSPACE SMALL G → LATIN SMALL LETTER G is listed in confusables.txt.

@varkor
Copy link
Member

varkor commented Nov 5, 2018

The Oxford English Dictionary has as one definition of "uncommon":

Of an unusual type or character; exceptional in kind or quality

which seems especially appropriate 😉(Emphasis mine.)

I feel it's a more suitable choice of words than "rare", which has significantly more meanings than "uncommon", some associated specifically with being good, e.g. (also OED):

Unusually good, fine, or worthy; of uncommon excellence or merit.

I also think more reserved wording is generally appropriate for compiler naming conventions.

@Manishearth
Copy link
Member

@Kixunil

but I'm not going to repeat what other folks said

That's precisely what you're doing -- your "counterexample" is caught by both the less_used_codepoints lint and the confusable_idents lint.

At this point these counterexample discussions have been done to death (and we're leveraging a unicode spec designed to deal with this!) -- please actually check if your "counterexample" isn't something we or the Unicode Consortium have thought of already.

@Serentty
Copy link
Contributor

Serentty commented Nov 9, 2018

Exactly. People have thought a lot about this, and it's certainly possible to implement a feature that many people will find useful while dealing with complications that might arise. At this point, adding this feature is a given. If you have ideas about how to improve the lints for finding confusable identifiers, by all means share them, but there's no need to simply point out an issue that everyone is already aware of.

@Centril
Copy link
Contributor Author

Centril commented Nov 9, 2018

Re. less_used_codepoints: I propose that we bring the bikeshed to a halt in favor of uncommon_codepoints because "meh".

I was personally in favor of rare_codepoints but uncommon_codepoints is basically the same and also works for me so... 🤷‍♀️

Everyone OK with this?

@Manishearth
Copy link
Member

works for me. Slight preference for rare but very slight. Both work for me, and I don't think it's really worth bikeshedding this too much :)

@Kixunil
Copy link
Contributor

Kixunil commented Nov 15, 2018

@Manishearth Sorry, I didn't mean to argue that the example is unsolved, I just wanted to provide actual code for those who might have difficulties imagining how to turn this feature into something malicious.

I wonder though, why confusables lint is not mandatory (according to RFC). It sounds to me like making borrow checker non-mandatory. My understanding is that Rust should be safe by default, where you can opt-into unsafety. For me, this means denying confusables in every crate I make, if I want to ensure high quality of my crates.

@Manishearth
Copy link
Member

This isn't a matter of safety in the way that Rust describes it.

Making the warnings on by default has been discussed. Please let's not use this tracking issue to relitigate things which have already been decided through a rather long RFC.

@rfcbot
Copy link

rfcbot commented Mar 30, 2021

🔔 This is now entering its final comment period, as per the review above. 🔔

@ehuss
Copy link
Contributor

ehuss commented Apr 6, 2021

A concern has been raised in #83923 that extern blocks are not handled as described in the RFC. I would appreciate considering addressing that before this is stabilized. I suspect a validation check would be easy to add if that indeed should be rejected.

@joshtriplett
Copy link
Member

Thanks for calling attention to that, @ehuss!

@rfcbot concern extern-blocks

@rfcbot rfcbot added proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. and removed final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. labels Apr 6, 2021
@Manishearth
Copy link
Member

@joshtriplett #83936 has merged now

@joshtriplett
Copy link
Member

@rfcbot resolved extern-blocks

@rfcbot
Copy link

rfcbot commented Apr 7, 2021

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot rfcbot added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Apr 7, 2021
@rfcbot rfcbot added the finished-final-comment-period The final comment period is finished for this PR / Issue. label Apr 17, 2021
@rfcbot
Copy link

rfcbot commented Apr 17, 2021

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

The RFC will be merged soon.

@rfcbot rfcbot added to-announce Announce this issue on triage meeting and removed final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. labels Apr 17, 2021
@crlf0710
Copy link
Member

Stablization PR is at #83799 .

GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Apr 18, 2021
… r=Manishearth

Stablize `non-ascii-idents`

This is the stablization PR for RFC 2457. Currently this is waiting on fcp in [tracking issue](rust-lang#55467).

r? `@Manishearth`
bors added a commit to rust-lang-ci/rust that referenced this issue Apr 18, 2021
…=Manishearth

Stablize `non-ascii-idents`

This is the stablization PR for RFC 2457. Currently this is waiting on fcp in [tracking issue](rust-lang#55467).

r? `@Manishearth`
@crlf0710
Copy link
Member

Stablization PR has landed, closing.

@ghost
Copy link

ghost commented Jun 5, 2021

Why not support Chinese names

@Manishearth
Copy link
Member

@Mr-Zzg They are! See https://play.rust-lang.org/?version=beta&mode=debug&edition=2018&gist=296bbb7bc2d69f8d2c4245b9df93992a

This feature hasn't hit stable yet, it will in the next release.

@vyamkovyi
Copy link

@wooster0
Copy link
Contributor

@Hexawolf GHSA-rcv6-wg5m-24v6

@Manishearth
Copy link
Member

We are not affected by the homoglyph attack, please see the mitigations that were implemented as a part of this RFC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-RFC-approved Blocker: Approved by a merged RFC but not yet implemented. B-RFC-implemented Blocker: Approved by a merged RFC and implemented. B-unstable Blocker: Implemented in the nightly compiler and unstable. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. F-non_ascii_idents `#![feature(non_ascii_idents)]` finished-final-comment-period The final comment period is finished for this PR / Issue. T-lang Relevant to the language team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests