Tracking issue for non-ASCII identifiers (feature "non_ascii_idents") #28979
Comments
steveklabnik added the A-lang label Oct 29, 2015
/cc @rust-lang/lang
pnkfelix added the T-lang label Oct 29, 2015
nominating
pnkfelix added the I-nominated label Oct 29, 2015
cc @SimonSapin Apparently we implement this: http://www.unicode.org/reports/tr31/ or something like it. I would like to see this stabilised, but it will take some work to persuade ourselves that we are doing the right thing.
I have no idea what the right thing is here. In addition to Unicode recommendations, we might want to look at what other languages actually do, and what related bug reports or criticism they get. Or was this already done when the feature was first introduced?
@SimonSapin There's also a problem with normalization of identifiers and mapping unicode
Yes, #2253 is the big issue I know of that makes me worry about premature stabilization of non-ASCII identifiers. (The discussion there is broader and arguably could be forked off into two threads; e.g. we could take one normalization path for identifiers and another for string literal contents.)
We may want to migrate this discussion to the RFCs repo, e.g. at rust-lang/rfcs#802
I agree that this is a feature that deserves to be put through the RFC process.
aturon changed the title from "Fix non-ASCII identifiers" to "Tracking issue for non-ASCII identifiers (feature "non_ascii_idents")" Nov 4, 2015
aturon added the B-unstable label Nov 4, 2015
I've repurposed this issue to track stabilization (or deprecation, etc) of the
After discussion in the lang team meeting, we decided that yes, an RFC would be the proper way forward here. We need something that collects the solutions from other languages, analyzes their pros/cons, and suggests the appropriate choice for Rust. This is controversial and complex enough that it should be brought to the community at large -- especially as many of us hacking on Rust on a daily basis don't have a lot of experience with non-ASCII anyhow.
triage: P-low Marking as low as there is no RFC at present and hence no actionable content.
rust-highfive added P-low and removed I-nominated labels Nov 5, 2015
cc #7539
8573 referenced this issue Oct 24, 2016: What about named identifiers in local language? #1776 (Closed)
kberov commented Jan 8, 2017
In JavaScript, Perl 5 and Perl 6 this feature is available.

JavaScript:

    function Слово(стойност) {
        this.стойност = стойност;
    }
    var здрасти = new Слово("Здравей, свят");
    console.log(здрасти.стойност) //Здравей, свят

Perl >=5.12:

    use utf8;
    {
        package Слово;
        sub new {
            my $self = bless {}, shift;
            $self->{стойност} = shift;
            $self
        }
    };
    my $здрасти = Слово->new("здравей, свят");
    say ucfirst($здрасти->{стойност}); #Здравей, свят

Perl 6 (this is not just the next version of Perl; it is a new language):

    class Слово {
        has $.стойност;
    }
    my $здрасти = Слово.new(стойност => 'здравей, свят');
    say $здрасти.стойност.tc; #Здравей, свят

I would be happy to see it in Rust too.
For what it’s worth, identifiers in ECMAScript 2015 are based on the Default Identifier Syntax from Unicode Standard Annex #31. Perl with
kberov commented Jan 8, 2017
Yes! Thanks @SimonSapin!
For Python it’s So it looks like many programming languages that allow non-ASCII identifiers are based on the same standard, but in the details they each do something slightly different… |
SimonSapin referenced this issue Feb 1, 2017: Tracking issue for RFC 1566: Procedural macros #38356 (Closed)
I would personally love to see support for math-related identifiers. For example, ∅ (and set operators, like ∩ and ∪). Translating equations from research papers/specifications into code is often a terrible process resulting in verbose and difficult to read code. Being able to use the same identifiers in the code that are in the paper's math equations would simplify implementation and would make the code easier to check and compare against the paper's equations.
DoumanAsh commented Mar 17, 2017
What's the point of this feature exactly? Aside from adding the possibility to create a truly ugly mix of different languages in your code (English is the only truly international language), it gives no benefits to the language functionality-wise. Or is it support of Unicode for the sake of supporting Unicode?
I might have misunderstood UAX 31 then; it looked to me like it required a specific Unicode version. Re-reading I can't see where I got that from though. |
ryankurte commented Jun 27, 2017
Not sure if this is the right place to post this, but some interesting issues are likely to appear with linting of mathematical symbols. They are easily avoided by writing out variable names, but could be important if better correlation with real equations is a goal. For example, Δ (uppercase) vs. δ (lowercase) in the following screenshot. The linter is not /wrong/, but it also imo doesn't really make sense to apply the snake case requirement here.
Mark-Simulacrum added the C-tracking-issue label Jul 22, 2017
ghost commented Jul 26, 2017
Would it be possible to allow emoji in variable names even though they aren't XID_Start/XID_Continue, like in Swift?
@fwrs, Emojis are way more complicated now than non-Emoji characters. Thanks to some vendors, now you can have Emoji joining (ZWJ) sequences that just keep changing their colors and small details, many of which are not necessarily visible to the naked eye. Also, the definition of Emoji is expanding fast, every single year, which is not something a systems programming language that wants to be stable and reliable needs. So, although it's cute, I don't think it sits well with Rust's goals. But Rust-based scripting/educational languages may benefit from allowing Emojis, depending on their goals.
@ryankurte There's a semantic problem in your example—you're transcribing mathematical formulae, but you used U+0394 GREEK CAPITAL LETTER DELTA rather than U+2206 INCREMENT. The former is a letter of the Greek alphabet, and as such has casemapping; the latter is a mathematical symbol and does not. |
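To make the distinction concrete, here is a minimal sketch using only the Rust standard library. The helper name `case_info` is made up for illustration and is not part of any existing lint; it simply reports the properties a snake-case check would plausibly look at.

```rust
/// Hypothetical helper: returns (is alphabetic, is uppercase, lowercase mapping)
/// for a single character, using only `char` methods from the standard library.
fn case_info(c: char) -> (bool, bool, String) {
    (
        c.is_alphabetic(),
        c.is_uppercase(),
        // `to_lowercase` yields an iterator because some mappings expand
        // to more than one character; collect it into a String.
        c.to_lowercase().collect(),
    )
}
```

Called as `case_info('\u{0394}')`, this reports an uppercase letter whose lowercase form is U+03B4 (δ), while `case_info('\u{2206}')` reports a character that is neither alphabetic nor uppercase and maps to itself.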
I'd like to cross-link this comment: #4928 (comment)
I haven't seen the possibility of enabling homoglyph-based attacks mentioned here (if somebody mentioned them, please ignore the noise), but I just filed a clippy issue to request a lint that warns on code like this:

    #![feature(non_ascii_idents)]
    fn main() {
        let a = 2;
        let а = 3;
        assert_eq!(a, 2); // OK
        assert_eq!(а, 3); // OK
    }

In a nutshell, those two
This "feature" can be used to introduce exploits in Rust programs that are harder to detect, in particular given that shadowing let bindings is considered idiomatic Rust by many, myself included. P.S.: this "feature" might be useful in underhanded Rust contests, although that
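For readers who cannot tell the two bindings apart on screen, a small sketch that prints code points may help. It assumes the second identifier in the snippet is CYRILLIC SMALL LETTER A (U+0430); the function name `code_points` is invented here and is not a clippy or rustc API.

```rust
/// Hypothetical helper: list the code points of an identifier as `U+XXXX`
/// strings so that visually identical names can be compared.
fn code_points(ident: &str) -> Vec<String> {
    ident
        .chars()
        .map(|c| format!("U+{:04X}", c as u32))
        .collect()
}
```

`code_points("a")` gives `["U+0061"]` and `code_points("\u{0430}")` gives `["U+0430"]`, so the compiler sees two unrelated names even though most fonts render them identically.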
@gnzlbg I believe there's already some support for confusables detection to stop people swapping out your semicolons for Greek question marks and such, but I don't know if it applies to identifiers. If it does, then that solves that problem; if it doesn't, at least we have the tooling to do it ready to go. I'm a little concerned that this is a candidate for being closed and the code removed from the compiler because it's not had significant movement for a while and requires an RFC. I care a fair amount about Rust being a language of the 21st century, which means Unicode, and about Rust being friendly to non-English-speaking programmers. What I lack is the ability to actually write an RFC.
Yes, I think that if, as suggested by @oli-obk in the clippy issue, the Rust implementation just used the latest official confusables list (http://www.unicode.org/Public/security/revision-06/confusables.txt), homoglyph-based attacks could be prevented. This list would need to be kept in sync, though, but that is something that can be automated as part of the build system.
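As a rough illustration of how a confusables table could be used, here is a sketch of a "skeleton" comparison. Everything in it is an assumption made for the example: the three-entry table is hand-written rather than generated from confusables.txt, the names `prototype`, `skeleton` and `confusable` are invented, and the normalization steps a full implementation would need are left out.

```rust
/// Hand-written stand-in for a table generated from confusables.txt:
/// maps a character to the character it can be mistaken for.
/// The real data can also map one character to several; this sketch does not.
fn prototype(c: char) -> char {
    match c {
        '\u{0430}' => 'a', // CYRILLIC SMALL LETTER A
        '\u{0435}' => 'e', // CYRILLIC SMALL LETTER IE
        '\u{043E}' => 'o', // CYRILLIC SMALL LETTER O
        other => other,
    }
}

/// Replace every character by its prototype (no normalization applied here).
fn skeleton(ident: &str) -> String {
    ident.chars().map(prototype).collect()
}

/// Two identifiers are flagged when they differ as strings
/// but share the same skeleton.
fn confusable(a: &str, b: &str) -> bool {
    a != b && skeleton(a) == skeleton(b)
}
```

With this table, `confusable("a", "\u{0430}")` is true, while `confusable("a", "b")` and `confusable("a", "a")` are false. Keeping the table in sync with the published list is the part that would be automated in the build.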
If you care about this, there are other languages that support unicode in their identifiers, and these languages have processes similar to the RFC process. You could start by checking those. Who knows, maybe you can just merge them together with the feedback in this issue, and get a pre-RFC in the internals forum going? From that point on, it is just about incorporating/arguing feedback with others, and before you know it you will have an RFC ready.
bstrie referenced this issue May 16, 2018: Fix grammar documentation wrt Unicode identifiers #50790 (Merged)
Mark-Simulacrum pushed a commit to Mark-Simulacrum/rust that referenced this issue May 17, 2018
Mark-Simulacrum added a commit to Mark-Simulacrum/rust that referenced this issue May 17, 2018
nibags referenced this issue Jun 25, 2018: Add unicode escapes, allow non-ASCII identifiers & others improvements #136 (Open)
fbstj referenced this issue Jul 16, 2018: update all dates in state-of-rust features table #156 (Closed)
In a way I hope we stick with ASCII identifiers forever. Handling Unicode identifiers is such a massive interoperability pain. One of the more bizarre consequences of NFKC mapping is that things like this map to the same identifier:

    >>> ℌ = 1
    >>> H
    1
    >>> Ⅸ = 42
    >>> IX
    42
    >>> ℕ = 23
    >>> N
    23
    >>> import math
    >>> ℯ = math.e
    >>> e
    2.718281828459045
    >>> ℨ = 2
    >>> Z
    2
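The same collisions can be mimicked in Rust with a toy mapping. The standard library has no normalization support, so the sketch below hard-codes only the five compatibility mappings from the session above; the function name `nfkc_sample` is invented and this is in no way a full NFKC implementation.

```rust
/// Toy stand-in for NFKC: applies the compatibility mappings of just the
/// five characters shown in the Python session and leaves all else untouched.
fn nfkc_sample(ident: &str) -> String {
    let mut out = String::new();
    for c in ident.chars() {
        match c {
            '\u{210C}' => out.push('H'),      // BLACK-LETTER CAPITAL H
            '\u{2168}' => out.push_str("IX"), // ROMAN NUMERAL NINE
            '\u{2115}' => out.push('N'),      // DOUBLE-STRUCK CAPITAL N
            '\u{212F}' => out.push('e'),      // SCRIPT SMALL E
            '\u{2128}' => out.push('Z'),      // BLACK-LETTER CAPITAL Z
            other => out.push(other),
        }
    }
    out
}
```

After this folding, `nfkc_sample("\u{2168}")` and `nfkc_sample("IX")` are the same string, which is exactly the collision shown in the session. Under NFC, which does not apply compatibility mappings, these characters would stay distinct.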
Serentty commented Jul 24, 2018
@mitsuhiko The real world has that kind of pain. We can't just ignore this problem because it's hard to deal with and involves a feature that you personally have no use for.
Also, the current RFC explicitly proposes NFC over NFKC, after a lot of discussion about examples very similar to those.
Closing in favor of #55467.

DemiMarie commented Oct 12, 2015
Non-ASCII identifiers are currently feature gated. Handling of them should be fixed and the feature gate removed.