Tracking issue for non-ASCII identifiers (feature "non_ascii_idents") #28979

Closed
DemiMarie opened this Issue Oct 12, 2015 · 54 comments

@DemiMarie
Contributor

DemiMarie commented Oct 12, 2015

Non-ASCII identifiers are currently feature gated. Handling of them should be fixed and the feature gate removed.

@steveklabnik steveklabnik added the A-lang label Oct 29, 2015

@steveklabnik

Member

steveklabnik commented Oct 29, 2015

@pnkfelix pnkfelix added the T-lang label Oct 29, 2015

@pnkfelix

Member

pnkfelix commented Oct 29, 2015

nominating

@nrc

Member

nrc commented Oct 30, 2015

cc @SimonSapin

Apparently we implement this: http://www.unicode.org/reports/tr31/ or something like it.

I would like to see this stabilised, but it will take some work to persuade ourselves that we are doing the right thing.

@SimonSapin

Contributor

SimonSapin commented Oct 30, 2015

I have no idea what the right thing is here. In addition to Unicode recommendations, we might want to look at what other languages actually do, and what related bug reports or criticism they get. Or was this already done when the feature was first introduced?

@petrochenkov

Contributor

petrochenkov commented Oct 30, 2015

@SimonSapin
C and C++ use http://unicode.org/reports/tr31/#Alternative_Identifier_Syntax (with some minor restrictions) and I haven't seen any complaints about it on isocpp forums or issue lists :)
Overview of the problem: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm
Implementation in Clang: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Lex/UnicodeCharSets.h?view=markup
cc #4928

There's also a problem with normalization of identifiers and with mapping Unicode mod names to filesystem names (on OS X, IIRC); the relevant issue is #2253. (In the worst case, non-inline mods and extern crates can be forced to be ASCII.)
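To make the normalization concern concrete, here is a minimal Python sketch (Python because its standard library ships `unicodedata`). The module name `café` is a made-up example, not one taken from the linked issue. It only shows that one visible name can have two different encodings, which is how a `mod` name in source and a file name on disk could disagree if the filesystem stores a different normalization form than the source file uses.

```python
import unicodedata

# Hypothetical module name; 'é' is written as one precomposed code point.
mod_name = "caf\u00e9"

nfc = unicodedata.normalize("NFC", mod_name)  # composed form
nfd = unicodedata.normalize("NFD", mod_name)  # decomposed form: 'e' + U+0301

print(nfc == nfd)            # the two spellings compare unequal as strings
print(nfc.encode("utf-8"))   # bytes of the composed name
print(nfd.encode("utf-8"))   # bytes of the decomposed name

# Normalizing both sides to one form before comparing makes them agree.
print(unicodedata.normalize("NFC", nfd) == nfc)
```

Whichever form is chosen, the point is that the comparison has to happen after normalizing both sides.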

@pnkfelix

Member

pnkfelix commented Nov 1, 2015

Yes, #2253 is the big issue I know of that makes me worry about premature stabilization of non-ASCII identifiers.

(The discussion there is more broad and arguably could be forked off into two threads; e.g. we could take one normalization path for identifiers and another for string literal contents.)

@pnkfelix

Member

pnkfelix commented Nov 1, 2015

We may want to migrate this discussion to the RFCs repo, e.g. at rust-lang/rfcs#802

@bstrie

Contributor

bstrie commented Nov 4, 2015

I agree that this is a feature that deserves to be put through the RFC process.

@aturon aturon changed the title from Fix non-ASCII identifiers to Tracking issue for non-ASCII identifiers (feature "non_ascii_idents") Nov 4, 2015

@aturon aturon added the B-unstable label Nov 4, 2015

@aturon

Member

aturon commented Nov 4, 2015

I've repurposed this issue to track stabilization (or deprecation, etc) of the non_ascii_idents feature gate.

@nikomatsakis

Contributor

nikomatsakis commented Nov 5, 2015

After discussion in the lang team meeting, we decided that yes, an RFC would be the proper way forward here. We need something that collects the solutions from other languages, analyzes their pros/cons, and suggests the appropriate choice for Rust. This is controversial and complex enough that it should be brought to the community at large -- especially as many of us hacking on Rust on a daily basis don't have a lot of experience with non-ASCII anyhow.

@nikomatsakis

Contributor

nikomatsakis commented Nov 5, 2015

triage: P-low

Marking as low as there is no RFC at present and hence no actionable content.

@rust-highfive rust-highfive added P-low and removed I-nominated labels Nov 5, 2015

@huonw

Member

huonw commented Nov 5, 2015

cc #7539

@kberov

kberov commented Jan 8, 2017

In JavaScript, Perl 5 and Perl 6 this feature is available.
JavaScript (Firefox 50)

function Слово(стойност) {
  this.стойност = стойност;
}
var здрасти = new Слово("Здравей, свят");
console.log(здрасти.стойност) //Здравей, свят

Perl >=5.12

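use v5.12;                # enables say
use utf8;                 # the source, including identifiers, is UTF-8
use open qw(:std :utf8);  # encode output as UTF-8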
{
  package Слово;
  sub new {
    my $self = bless {}, shift;
    $self->{стойност} = shift;
    $self
  }
};
my $здрасти = Слово->new("здравей, свят");
say ucfirst($здрасти->{стойност}); #Здравей, свят

Perl 6 (this is not just the next version of Perl; it is a new language)

class Слово {
  has $.стойност;
}

my $здрасти = Слово.new(стойност => 'здравей, свят');
say $здрасти.стойност.tc; #Здравей, свят

I would be happy to see it in Rust too.

@SimonSapin

Contributor

SimonSapin commented Jan 8, 2017

For what it’s worth, identifiers in ECMAScript 2015 are based on the Default Identifier Syntax from Unicode Standard Annex #31.

Perl with use utf8; uses the regexp below, with XID_Start and XID_Continue presumably also from UAX #31.

/ (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])
        (?[ ( \p{Word} & \p{XID_Continue} ) ]) *    /x
@kberov

kberov commented Jan 8, 2017

Yes! Thanks @SimonSapin!

@SimonSapin

Contributor

SimonSapin commented Jan 8, 2017

For Python it’s <XID_Start> <XID_Continue>*.

So it looks like many programming languages that allow non-ASCII identifiers are based on the same standard, but in the details they each do something slightly different…
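As a small illustration of the `<XID_Start> <XID_Continue>*` shape, Python's own `str.isidentifier()` can be used to probe which strings its rule accepts. The candidate list below is illustrative and not drawn from any specification, and the result says nothing about what Rust should accept.

```python
# Candidate names; the list is made up for illustration.
candidates = ["стойност", "_tmp", "x2", "2x", "\u2205", "\U0001F600"]

for name in candidates:
    # str.isidentifier() applies Python's identifier rule to the whole string.
    # ascii() keeps the output readable on terminals without Unicode support.
    print(ascii(name), name.isidentifier())
```

Note that `isidentifier()` also returns True for keywords, so it checks the lexical shape only.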

@dstu dstu referenced this issue Jan 18, 2017: Support for Postgres enums #580 (Closed)

@SimonSapin SimonSapin referenced this issue Feb 1, 2017: Tracking issue for RFC 1566: Procedural macros #38356 (Closed)

@mjbshaw

Contributor

mjbshaw commented Feb 8, 2017

I would personally love to see support for math-related identifiers. For example, ∅ (and set operators, like ∩ and ∪). Translating equations from research papers/specifications into code is often a terrible process resulting in verbose and difficult to read code. Being able to use the same identifiers in the code that are in the paper's math equations would simplify implementation and would make the code easier to check and compare against the paper's equations.

@SimonSapin SimonSapin referenced this issue Feb 19, 2017: Tracking issue for 1.0.0 tracking issues #39954 (Open)

@DoumanAsh

DoumanAsh commented Mar 17, 2017

What's the point of this feature exactly? Aside from adding the possibility of creating a truly ugly mix of different languages in your code (English is the only truly international language), it gives no benefit to the language functionality-wise. Or is it support for Unicode for the sake of supporting Unicode?

@SimonSapin

Contributor

SimonSapin commented Apr 12, 2017

@SamWhited, in your first link I find:

identifier = letter { letter | unicode_digit } .
letter        = unicode_letter | "_" .

But as far as I can tell Go currently doesn’t do any normalization and using PRECIS is a proposal. Is that correct?
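The quoted grammar can be sketched as a checker. This is only an approximation in Python: the mapping of `unicode_letter` to the general categories Lu, Ll, Lt, Lm, Lo and of `unicode_digit` to Nd is my reading of the Go specification, not something stated in this thread, and `is_go_style_identifier` is a hypothetical helper name.

```python
import unicodedata

# Assumption: Go's unicode_letter corresponds to these general categories.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def is_letter(ch: str) -> bool:
    """letter = unicode_letter | "_" ."""
    return ch == "_" or unicodedata.category(ch) in LETTER_CATEGORIES


def is_unicode_digit(ch: str) -> bool:
    """Assumption: unicode_digit corresponds to general category Nd."""
    return unicodedata.category(ch) == "Nd"


def is_go_style_identifier(s: str) -> bool:
    """identifier = letter { letter | unicode_digit } ."""
    if not s or not is_letter(s[0]):
        return False
    return all(is_letter(ch) or is_unicode_digit(ch) for ch in s[1:])


# With no normalization step, the two spellings of 'café' behave differently:
print(is_go_style_identifier("caf\u00e9"))   # precomposed 'é' is a letter
print(is_go_style_identifier("cafe\u0301"))  # combining accent is a mark, not a letter
```

The last two lines show why the normalization question matters for a rule like this: a decomposed spelling is rejected outright rather than treated as the same name.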

@SamWhited

Contributor

SamWhited commented Apr 12, 2017

> But as far as I can tell Go currently doesn’t do any normalization and using PRECIS is a proposal. Is that correct?

@SimonSapin that's correct; well, not even really a proposal, just an idea to be thought through like this issue (sorry, reread that sentence and my link and it was poorly worded; didn't mean to suggest that it does use it right now, just that I don't know what anything other than Go actually does to handle non-ASCII identifiers).

@nikomatsakis

Contributor

nikomatsakis commented Apr 13, 2017

@SimonSapin

It may still be worth going through the RFC process with a detailed design, even if it happens to match the current implementation.

👍

@SamWhited

Contributor

SamWhited commented Apr 13, 2017

I was just reading through UAX #31 to see what they did, and another benefit of using a PRECIS profile stood out to me: just like deprecating stringprep and using PRECIS instead, it provides a way to be future compatible and agile across Unicode versions (by operating on derived properties of code points instead of individual code points themselves).

While TR31 does have a concept of "Immutable Identifiers" to help address this, it is effectively a slightly less restrictive version of a PRECIS protocol derived from the freeform class, but without the consideration PRECIS has given to the order in which rules need to be applied (I don't think?). It also doesn't cover edge cases that the PRECIS framework handles, such as the use of Greek final sigma or some of the cases around Hangul Jamo (again, I am no expert in either of these, but that's why PRECIS exists; the experts have done the work already).

@SimonSapin

Contributor

SimonSapin commented Apr 13, 2017

> it provides a way to be future compatible and agile across Unicode versions (by operating on derived properties of code points instead of individual code points themselves).

I don’t understand this point. XID_Start and XID_Continue are derived properties.

@SamWhited

Contributor

SamWhited commented Apr 13, 2017

> I don’t understand this point. XID_Start and XID_Continue are derived properties.

I might have misunderstood UAX 31 then; it looked to me like it required a specific Unicode version. Re-reading I can't see where I got that from though.

@ryankurte

ryankurte commented Jun 27, 2017

Not sure if this is the right place to post this, but some interesting issues are likely to appear with linting of mathematical symbols. They are easily avoided by writing out variable names, but could be important if better correlation with real equations is a goal.

For example, Δ (uppercase) vs. δ (lowercase) in the following screenshot. The linter is not /wrong/, but it also imo doesn't really make sense to apply the snake case requirement here.

[Image: screen shot 2017-06-27 at 2:28:55 pm]

@ghost

ghost commented Jul 26, 2017

would it be possible to allow emoji in variable names even though they aren't XID Start/Continue, like in Swift?

@behnam

Contributor

behnam commented Jul 26, 2017

@fwrs, Emojis are way more complicated now than non-Emoji characters.

Thanks to some vendors, now you can have Emoji joining (ZWJ) sequences that just keep changing their colors and small details, many of which are not necessarily visible to the naked eye.

Also, the definition of Emoji is expanding fast, every single year, which is not something a systems-level programming language that wants to be stable and reliable needs.

So, although it's cute, I don't think it sits well with Rust goals. But, rust-based scripting/educational languages may benefit from allowing Emojis, depending on their goals.
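To show what a ZWJ sequence looks like underneath, here is a small Python sketch. The particular emoji (woman technologist) is my own choice of example and does not come from the thread.

```python
import unicodedata

# "Woman technologist": WOMAN + ZERO WIDTH JOINER + PERSONAL COMPUTER.
emoji = "\U0001F469\u200D\U0001F4BB"

# One visible glyph, but several code points with different categories.
for ch in emoji:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

print(len(emoji))            # number of code points in the sequence
print(emoji.isidentifier())  # whether Python's XID-based rule accepts it
```

An identifier rule that admitted such sequences would have to decide how to treat the invisible joiner in the middle, which is part of the complexity described above.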

@ketsuban

Contributor

ketsuban commented Jul 31, 2017

@ryankurte There's a semantic problem in your example—you're transcribing mathematical formulae, but you used U+0394 GREEK CAPITAL LETTER DELTA rather than U+2206 INCREMENT. The former is a letter of the Greek alphabet, and as such has casemapping; the latter is a mathematical symbol and does not.
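The distinction between the two code points can be checked with Python's `unicodedata` module. This is a sketch of the Unicode properties only; it does not model how Rust's lints decide what counts as snake case.

```python
import unicodedata

for ch in ("\u0394", "\u2206"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),            # 'Lu' = uppercase letter, 'Sm' = math symbol
        ch.isupper(),                        # does the character have an uppercase property?
        f"lower -> U+{ord(ch.lower()):04X}", # result of lowercasing
        ch.isidentifier(),                   # accepted by Python's XID-based rule?
    )
```

One caveat the output makes visible: U+2206 is a symbol rather than a letter, so an identifier rule built on XID_Start would not accept it as a name on its own.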

@steveklabnik

Member

steveklabnik commented Nov 9, 2017

I'd like to cross-link this comment: #4928 (comment)

@gnzlbg

Contributor

gnzlbg commented Jan 17, 2018

I haven't seen the possibility of enabling homoglyph-based attacks mentioned here (if somebody did mention them, please ignore the noise), but I just filed a clippy issue to request a lint that warns on code like this:

#![feature(non_ascii_idents)]
fn main() {
    let a = 2;
    let а = 3;
    assert_eq!(a, 2);  // OK
    assert_eq!(а, 3);  // OK
}

In a nutshell, those two a's are different Unicode characters, so the second let binding does not shadow the first one and both asserts pass (the playground doesn't seem to support Unicode identifiers, though, so the only way to try this is locally; works for me).

This "feature" can be used to introduce exploits in Rust programs that are harder to detect, in particular given that shadowing let bindings are considered idiomatic Rust by many, myself included.

P.S.: this "feature" might be useful in underhanded Rust contests, although that #![feature(non_ascii_idents)] should raise some eyebrows :)
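The same effect can be inspected from Python. This sketch assumes the second identifier in the snippet above is U+0430 CYRILLIC SMALL LETTER A, which the rendered text alone cannot confirm; everything else uses only the standard library.

```python
import unicodedata

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430; assumed to be the look-alike in the snippet

for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Neither NFC nor NFKC maps one onto the other, so normalization alone
# does not remove this kind of confusion.
for form in ("NFC", "NFKC"):
    print(form, unicodedata.normalize(form, cyrillic_a) == latin_a)

# Python, too, treats them as two separate variables.
namespace = {}
exec("a = 2\n\u0430 = 3", namespace)
print(namespace["a"], namespace["\u0430"])
```

So the problem is not specific to Rust's implementation; any language that accepts both scripts needs a separate confusables check if it wants to catch this.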

@ketsuban

Contributor

ketsuban commented Jan 17, 2018

@gnzlbg I believe there's already some support for confusables detection to stop people swapping out your semicolons for Greek question marks and such, but I don't know if it applies to identifiers. If it does, then that solves that problem; if it doesn't, at least we have the tooling to do it ready to go.

I'm a little concerned that this is a candidate for being closed and the code removed from the compiler because it's not had significant movement for a while and requires an RFC. I care a fair amount about Rust being a language of the 21st century, which means Unicode, and about Rust being friendly to non-English-speaking programmers. What I lack is the ability to actually write an RFC.

@gnzlbg

Contributor

gnzlbg commented Jan 18, 2018

@ketsuban

> I believe there's already some support for confusables detection to stop people swapping out your semicolons for Greek question marks and such, but I don't know if it applies to identifiers.

Yes. As suggested by @oli-obk in the clippy issue, the Rust implementation could instead just use the latest official confusables list:

http://www.unicode.org/Public/security/revision-06/confusables.txt

With it, homoglyph-based attacks can be prevented. The list would need to be kept in sync, but that is something that can be automated as part of the build system.
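A rough sketch of how such a list could be applied, in Python: map every identifier to a "skeleton" and report identifiers whose skeletons collide. The `CONFUSABLES` table here is a tiny hand-picked stand-in, not the contents of confusables.txt, and `skeleton` and `confusable_pairs` are hypothetical helper names; a real lint would load the full data file.

```python
import unicodedata

# Hypothetical, hand-picked subset of look-alike mappings (Cyrillic -> Latin).
CONFUSABLES = {
    "\u0430": "a",
    "\u0435": "e",
    "\u043e": "o",
    "\u0440": "p",
    "\u0441": "c",
}


def skeleton(ident: str) -> str:
    """Decompose, replace each character by its prototype, decompose again."""
    decomposed = unicodedata.normalize("NFD", ident)
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable_pairs(idents):
    """Return (first_seen, later) pairs of distinct names with equal skeletons."""
    first_seen = {}
    pairs = []
    for ident in idents:
        key = skeleton(ident)
        if key in first_seen and first_seen[key] != ident:
            pairs.append((first_seen[key], ident))
        else:
            first_seen.setdefault(key, ident)
    return pairs


# Latin 'a', Cyrillic look-alike, and an unrelated name.
print([tuple(map(ascii, pair)) for pair in confusable_pairs(["a", "\u0430", "b"])])
```

Keeping the table as data rather than code is what makes the "keep in sync via the build system" idea practical.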

@gnzlbg

Contributor

gnzlbg commented Jan 18, 2018

@ketsuban

If you care about this, there are other languages that support unicode in their identifiers, and these languages have processes similar to the RFC process. You could start by checking those. Who knows, maybe you can just merge them together with the feedback in this issue, and get a pre-RFC in the internals forum going? From that point on, it is just about incorporating/arguing feedback with others, and before you know it you will have an RFC ready.

Mark-Simulacrum pushed a commit to Mark-Simulacrum/rust that referenced this issue May 17, 2018

Fix grammar documentation wrt Unicode identifiers
The grammar defines identifiers in terms of XID_start and XID_continue,
but this is referring to the unstable non_ascii_idents feature.
The documentation implies that non_ascii_idents is forthcoming, but this
is left over from pre-1.0 documentation; in reality, non_ascii_idents
has been without even an RFC for several years now, and will not be
stabilized anytime soon. Furthermore, according to the tracking issue at
rust-lang#28979 , it's highly
questionable whether or not this feature will use XID_start or
XID_continue even when or if non_ascii_idents is stabilized.
This commit fixes this by respecifying identifiers as the usual
[a-zA-Z_][a-zA-Z0-9_]*
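For reference, the ASCII-only pattern from the commit message can be exercised directly. This Python sketch checks the lexical shape only and ignores keywords and any other rules the Rust grammar applies; `is_ascii_identifier` is a hypothetical helper name.

```python
import re

# The pattern quoted in the commit message above.
ASCII_IDENT = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_ascii_identifier(s: str) -> bool:
    """True if the whole string matches the ASCII identifier pattern."""
    return ASCII_IDENT.fullmatch(s) is not None


for name in ["foo_1", "_private", "1foo", "стойност", ""]:
    print(ascii(name), is_ascii_identifier(name))
```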

Mark-Simulacrum added a commit to Mark-Simulacrum/rust that referenced this issue May 17, 2018

Rollup merge of rust-lang#50790 - bstrie:grammar, r=steveklabnik
Fix grammar documentation wrt Unicode identifiers

@mitsuhiko

Contributor

mitsuhiko commented Jul 21, 2018

In a way I hope we stick with ASCII identifiers forever. Handling Unicode identifiers is such a massive interoperability pain. One of the more bizarre consequences of NFKC mapping is that names like these map to the same identifier:

>>> ℌ = 1
>>> H
1
>>> Ⅸ = 42
>>> IX
42
>>> ℕ = 23
>>> N
23
>>> import math
>>> ℯ = math.e
>>> e
2.718281828459045
>>> ℤ = 2
>>> Z
2

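The folding behind that session can be reproduced with `unicodedata`. The characters below are illustrative compatibility characters that fold to H, IX, N, e and Z; they are not necessarily the exact ones typed in the session above.

```python
import unicodedata

# Illustrative compatibility characters (escaped so the source stays ASCII).
samples = ["\u210c", "\u2168", "\u2115", "\u212f", "\u2124"]

for ch in samples:
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", folded)

# Python applies NFKC to identifiers while parsing, so assigning to the
# double-struck N creates (or overwrites) the plain variable N.
namespace = {}
exec("\u2115 = 23", namespace)
print(namespace["N"])
```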
@Serentty

Serentty commented Jul 24, 2018

@mitsuhiko The real world has that kind of pain. We can't just ignore this problem because it's hard to deal with and involves a feature that you personally have no use for.

@Ixrec

Contributor

Ixrec commented Jul 28, 2018

Also, the current RFC explicitly proposes NFC over NFKC, after a lot of discussion about examples very similar to those.
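The practical difference between the two forms can be sketched in Python. This is not taken from the RFC text; `same_identifier` is a hypothetical helper that simply compares two spellings after normalizing both to a chosen form.

```python
import unicodedata


def same_identifier(a: str, b: str, form: str) -> bool:
    """Compare two spellings after normalizing both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Canonically equivalent spellings of 'é' are unified by both forms.
print(same_identifier("\u00e9", "e\u0301", "NFC"))
print(same_identifier("\u00e9", "e\u0301", "NFKC"))

# A compatibility variant of 'H' (U+210C) is unified only by NFKC.
print(same_identifier("\u210c", "H", "NFC"))
print(same_identifier("\u210c", "H", "NFKC"))
```

Under NFC the stylistic variants stay distinct names, which avoids the surprising collisions shown earlier while still treating composed and decomposed spellings as the same identifier.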

@Centril

Contributor

Centril commented Oct 29, 2018

Closing in favor of #55467.

@Centril Centril closed this Oct 29, 2018
