
Should we allow identifiers with different scripts? #2003

Open · lizmat opened this issue Jul 1, 2018 · 12 comments

@lizmat (Contributor) commented Jul 1, 2018

https://metacpan.org/pod/distribution/perl/pod/perldelta.pod#Mixed-Unicode-scripts-are-now-detectable and https://metacpan.org/pod/perlre#Script-Runs

are the reason this all started. I know that similar issues have been raised in the past. To give an example:

my $after;
my $аfter;

This compiles without any problems. Why? Because the first "a" is different from the second "а", which renders almost exactly the same on my macOS but is a Cyrillic "a" rather than a Latin "a". One could have expected a compilation error.
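For the record, a quick way to confirm that the two declarations really do use different characters (outputs shown as comments):

say 'a'.uniname;        # LATIN SMALL LETTER A
say 'а'.uniname;        # CYRILLIC SMALL LETTER A
say 'a'.ord.base(16);   # 61
say 'а'.ord.base(16);   # 430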

I think we should not allow identifiers, at any level of the Perl 6 ecosystem, that use different scripts (as defined by the uniscript functionality I've just committed: e0e221d). Disallowing these now will make sure that we won't break much more in the future should this turn out to be the right thing to do. Whereas if we don't do it now and disallow it later, that could potentially cause a lot of breakage.

Suggestions? Opinions??

@lizmat (Contributor, Author) commented Jul 1, 2018

Reading https://metacpan.org/pod/perlre#Script-Runs, it looks like this would be most efficiently done at the VM level.

@AlexDaniel (Member) commented Jul 1, 2018

Obligatory ping: @samcv

To be honest, I don't see the problem. So these are the confusables for the $after variable: https://unicode.org/cldr/utility/confusables.jsp?a=after&r=None

Looking at your committed uniscript sub, it simply does uniprop, which returns the General_Category. I'm pretty sure you wanted .uniprop(‘Script’). See this gist from unicodable: https://gist.github.com/52130fb00556452cc884762813031b9b

General_Category, as I understand it, is of no help. According to the docs, we already allow just Ll, Lu, Lt, Lm, and Lo, and differentiating between them is not very useful.

EDIT: this was fixed in 105f4db
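To illustrate the difference between the two properties (outputs shown as comments):

say 'a'.uniprop;            # Ll (General_Category, identical for both letters)
say 'а'.uniprop;            # Ll
say 'a'.uniprop('Script');  # Latin
say 'а'.uniprop('Script');  # Cyrillic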

I'm not sure I see the attack vector here. So someone sends you a PR with an extra variable declared? Has anybody ever attempted this in any language that supports Unicode identifiers? (a lot of languages do)

FWIW see this ticket also: perl6/doc#1444. I haven't had time to work on it yet, but it's about configuring your editor to highlight characters that are outside the ASCII range. The point is, this can also be tackled by making sure the most common text editors for Perl 6 give hints about Unicode stuff (say, a slightly tinted background color).

@ugexe (Member) commented Jul 1, 2018

The types of attacks mentioned generally rely on the diff of the PR to hide the malicious intent. Consider something like:

my $after = "/"; # original $after

# 100 lines later, which puts the declaration above outside the view of the diff

if (my $аfter = "~/.temp") { # malicious $аfter (Cyrillic "а")
    rm_rf($after) # do something with the original $after
}

such that the PR may very well seem reasonable, since you would likely only see:

if (my $аfter = "~/.temp") {
    rm_rf($after)
}
@AlexDaniel (Member) commented Jul 1, 2018

Thank you, @ugexe, that was short and clear!

OK, I would agree then that there is a potential problem. That being said, it has nothing to do with scripts.

See this example:

 $after
+ $𝚊fter

Here is how it's rendered in my browser:
[screenshot of the rendered diff]

Here is what the proposed uniscript sub says:

<AlexDaniel> m: say uniscript ‘after’
<camelia> rakudo-moar 105f4db38: OUTPUT: «Latin␤»
<AlexDaniel> m: say uniscript ‘𝚊fter’
<camelia> rakudo-moar 105f4db38: OUTPUT: «Latin␤»

And here are the actual characters being used:

<AlexDaniel> u: a𝚊
<unicodable6> AlexDaniel, U+0061 LATIN SMALL LETTER A [Ll] (a)
<unicodable6> AlexDaniel, U+1D68A MATHEMATICAL MONOSPACE SMALL A [Ll] (𝚊)

I think it has nothing to do with scripts. It's all about confusables.

@samcv can we include confusables table in rakudo? What do you think?

EDIT: bonus IRC discussion: http://colabti.org/irclogger/irclogger_log/perl6-dev?date=2018-07-01#l130

@zoffixznet (Contributor) commented Jul 1, 2018

rely on the diff of the PR to hide the malicious intent.

FWIW, different scripts aren't the only way something like this can be done:

@jnthn (Member) commented Jul 1, 2018

If we do anything in this space, then it's a parse-time decision, and so the behavior change can be guarded by the language version in effect. So we could do it for 6.e and onward, for example, and it'd not affect existing code (provided that code declared the language version it was written for).

I agree with @AlexDaniel that this is a confusables issue rather than a scripts one. I don't think it's unreasonable to include two scripts in a variable name (for example, Greek letters often show up in math/physics problems, and using those plus an English word would not be an unreasonable thing to do). But it's hard to imagine a legitimate use for two symbols that are identical except for a confusable.

@samcv can we include confusables table in rakudo? What do you think?

I'm not @samcv but I'm +1 to doing this. We include loads of other Unicode data, and this seems like a relatively small data set that helps people deal with a genuine problem.

I don't yet have a position either way on whether we should try to detect and report confusable variable names, sub names, etc. as part of the language, or declare it to be a job for a linter (noting that anyone dealing with code from untrusted sources can integrate that easily into their PR workflow). I'm guessing that there's a name mangling method that lets us at least make it an O(1) lookup, not O(n) in the number of symbols declared?
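One way such a mangling could look (just a sketch; %confusable-to-ascii is a tiny hand-picked stand-in for real confusables data, and all names here are made up): normalize every identifier to a "skeleton" and key a hash of declarations by that skeleton, so checking a new declaration is a single lookup.

my %confusable-to-ascii = 'а' => 'a', '𝚊' => 'a', 'І' => 'I';   # tiny hand-picked sample

sub skeleton(Str $name --> Str) {
    # Map every character to its ASCII look-alike, if it has one.
    $name.comb.map({ %confusable-to-ascii{$_} // $_ }).join;
}

my %declared-by-skeleton;   # skeleton => identifiers already declared with it

sub check-declaration(Str $name) {
    my $key = skeleton($name);
    if %declared-by-skeleton{$key} -> @clashes {
        warn "'$name' is confusable with: @clashes[]" unless $name (elem) @clashes;
    }
    %declared-by-skeleton{$key}.push: $name;
}

check-declaration('$after');   # nothing to report
check-declaration('$аfter');   # warns: '$аfter' is confusable with: $after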

@AlexDaniel (Member) commented Jul 1, 2018

But it's hard to imagine a legitimate use for two symbols that are identical except for a confusable.

Just want to clarify that it's a bit more complicated than that. If this is implemented based on confusables, then variables like ‘hello’ and ‘heIIo’ will also trigger the detection (even though all their characters are in the ASCII range). I think this is good, but the detection gets questionable with shorter identifiers, for example $al vs $a1, or something like that. If that's an issue, then we can apply other heuristics based on variable length and other things.

@samcv (Member) commented Jul 1, 2018

My opinion is that we should not enforce anything regarding different scripts. The only issue is non-ASCII confusables, like the Cyrillic а and the ASCII a.

I propose the following:

  1. We scan variable names for non-ASCII characters.
  2. If a name contains any non-ASCII characters, we replace those characters with their ASCII look-alikes.
  3. We then look up whether the resulting name already exists.
  4. If it already exists, we warn or error, etc., depending on what we determine we should do.

This obviously could be expanded later, but I think it's a good middle ground that protects ASCII variable users from these problems while avoiding an overcomplicated method of trying to compare all variables in their 'confusable-normalized' forms.
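A minimal sketch of those four steps, assuming a hand-picked %ascii-lookalike table as a stand-in for the real confusables data (sub and variable names are made up):

my %ascii-lookalike = 'а' => 'a', 'е' => 'e', 'о' => 'o';   # tiny hand-picked sample

my %seen-ascii-names;   # identifiers that are pure ASCII, as declared

sub check-identifier(Str $name) {
    # Step 1: pure-ASCII identifiers just get remembered.
    if $name.ords.all < 0x80 {
        %seen-ascii-names{$name} = True;
        return;
    }

    # Steps 2 and 3: replace non-ASCII characters with their ASCII look-alikes,
    # then look the folded spelling up among the names we have already seen.
    my $folded = $name.comb.map({ .ord < 0x80 ?? $_ !! (%ascii-lookalike{$_} // $_) }).join;

    # Step 4: warn (or error) if an ASCII identifier with that spelling exists.
    warn "'$name' looks like the existing identifier '$folded'" if %seen-ascii-names{$folded};
}

check-identifier('$after');   # remembered
check-identifier('$аfter');   # warns: '$аfter' looks like the existing identifier '$after'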

@lizmat (Contributor, Author) commented Jul 1, 2018

FWIW, I think the problem extends to any identifier, so also to class names. And from there on out, to the module space and the ecosystem. Do we want two modules with confusable names in the ecosystem at the same time?

@ugexe (Member) commented Jul 1, 2018

The ecosystems will ultimately govern themselves, with each ecosystem determining its own rules for inclusion. CURs themselves very much allow two modules of the exact same name (let alone names that are close), so a robust ecosystem should follow suit. Otherwise you must also add another layer, that of the recommendation manager that aggregates the results from multiple ecosystems.

Modules with confusable names are not nearly as much at risk, because you cannot simply sneak the name/import in -- you must also find a way to sneak the maliciously named module/dependency into the code base (i.e. adding it to a META6.json somewhere in the dependency chain) so that a use a1 doesn't throw a compile-time error while trying to make the reviewer see use al.

@samcv (Member) commented Jul 1, 2018

Here's the list of characters that are confusable with ASCII, for your perusal:

https://gist.github.com/samcv/f501c4d0716ec49454d814bf663ddbe1

@AlexDaniel (Member) commented Jul 1, 2018

So I see it this way:

  1. Data about confusables should be added to rakudo. It can be available to users, for example through .uniprop(‘Confusables’) or something similar. For any character it can return a Set of characters that are related to each other transitively.
  2. We (and users?) need an easy and efficient way to work with confusables. This can be done by adding a mechanism for generating confusable hashes for strings. For example, strings “𝓱𝖊lΙσ” and “hello” should give an identical confusable hash. Note that from №1 we'll get a relatively easy way to get confusable set id for any character, so generating a hash given all that info shouldn't be too problematic.
  3. Once №2 is done, it will be trivial to do efficient lookups. We can then implement a compile-time error and tweak it to our liking.

Does it sound like a plan?
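To make №2 a bit more concrete, a rough sketch (the %confusable-class table is a tiny, hand-picked stand-in for the transitive confusable sets from №1; confusable-hash here just returns the normalized string, which could then be hashed):

my %confusable-class = '𝓱' => 'h', '𝖊' => 'e', 'Ι' => 'l', 'I' => 'l', 'σ' => 'o', '1' => 'l';

sub confusable-hash(Str $s --> Str) {
    # Replace every character with the canonical representative of its confusable
    # set; two strings are mutually confusable when the results are identical.
    $s.comb.map({ %confusable-class{$_} // $_ }).join;
}

say confusable-hash('𝓱𝖊lΙσ') eq confusable-hash('hello');   # True
say confusable-hash('heIIo') eq confusable-hash('hello');   # True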
