Warn against unicode identifier normalization #4906

ickc · 2021-08-23T22:21:18Z

Current problem

Unicode normalization in unicode identifier. c.f.

The issue is that while some characters can be used as identifier, internally it is normalized. So when looked up using __all__ or getattr, it would results in an error. (See the 2nd & 3rd link for respective example.)

Desired solution

pylint can warns about an identifier when normalized is different, and suggested a normalized identifier instead. i.e. it notifies that the identifier will be converted internally silently.

Similarly, pylint can warns those inside __all__ and getattr too, as they can never be looked up due to no identifier would have such an unnormalized string.

Additional context

Edit: keys of globals() & locals() also shares the same problem:

Issue 42680: unicode identifiers not accessible or assignable through globals() - Python tracker

Keys to check

The text was updated successfully, but these errors were encountered:

Pierre-Sassoulas · 2021-08-25T05:23:40Z

Thank you, I did not know about this behavior, it makes a lot of sense to add a check for that in pylint. Would implicit-identifier-normalization work as a name ?

ickc · 2021-08-26T18:20:32Z

That sounds good. Thanks! I added a list of the possible places this happens at the end of the original post above. Feel free to add more if you found some missing.

JulienPalard · 2021-11-17T08:17:03Z

More context: https://lwn.net/ml/python-dev/85d2c121-86a9-e04a-6dfa-da36b073a95d@gmail.com/

In the following case I'd say it's OK to warn about using a not-normalized identifier, it's an easy fix for the dev:

ﬁ_fou = 42  # Using the ligature U+FB01 LATIN SMALL LIGATURE FI
print(fi_fou)  # Using only codepoints < 127

In the following case it's more complicated to fix as the two looks identical, it's probably the case where normalization just plays its role. Bad thing, the human users with keyboards will input the 1st one, while the NFKD is the 2nd one, so it's not (easily?) fixable by using a keyboard:

étage = 42  # Using U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(étage)  # Using U+0065 LATIN SMALL LETTER E then U+0301 COMBINING ACUTE ACCENT

But it can be detected as legitimate as it "rounds-trip" via canonical composition (examples below).

In the following one, warning about normalisation is clearly a good idea, as it leads to a "visual bug":

>>> ϕ = 1
>>> φ = 2
>>> print(ϕ, φ)
2, 2

About the 'it's legit if it rounds trip" here's a demo of what I have in mind:

def is_legit(name):
    """Returns True if the name is already in the expected denormalized form,
    None if it's not but acceptable, and False if not. 
    """
    if unicodedata.normalize("NFKD", name) == name:
        return True
    if unicodedata.normalize("NFKC", unicodedata.normalize("NFKD", name)) == name:
        return None
    return False

Which gives:

>>> is_legit("abc123")  # A normal, plain, old, ASCII identifier
True
>>> print(is_legit("été"))  # That's how a french keyboard would input été, in a composed form
None # means it's not normalized but still acceptable.
>>> print(is_legit("étage")) # This is the denormalized form
True # But I don't know how to input this with a keyboard :D
>>> print(is_legit("étage")) # This is the composed form ("keyboard" / "user" input form)
None # The way user write it is acceptable, youhouuuuuuuu!
>>> print(is_legit("ϕ")) # This one is the one introducing a "visual" bug in my last example
False # Note this is the only one refused by my function.
>>> print(is_legit("φ")) # It's normalized form.
True

Pierre-Sassoulas · 2022-07-04T19:59:05Z

From @JulienPalard example, following the amazing work on the unicode checker by @CarliJoy there's only one false negative left as pylint now warns about non ascii name:

ﬁ_fou = 42  # Using the ligature U+FB01 LATIN SMALL LIGATURE FI
print(fi_fou)  # Using only codepoints < 127
étage = 42  # Using U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(étage)  # Using U+0065 LATIN SMALL LETTER E then U+0301 COMBINING ACUTE ACCENT
ϕ = 1
φ = 2
print(ϕ, φ)

a.py:5:0: C2401: Variable name "étage" contains a non-ASCII character, consider renaming it. (non-ascii-name)
a.py:7:0: C2401: Variable name "φ" contains a non-ASCII character, consider renaming it. (non-ascii-name)
a.py:8:0: C2401: Variable name "φ" contains a non-ASCII character, consider renaming it. (non-ascii-name)

The first example with LATIN SMALL LIGATURE FI is still not detected by pylint. I'm pretty sure the unicode checker will be easy to extends to fix this as it's pretty modular and well designed.

CarliJoy · 2022-07-06T14:14:04Z

If someone want to fix this issue, feel free to contact me.

CarliJoy · 2022-07-08T13:43:13Z

I had a look into it.
I don't think this is solvable with acceptable effort.

The issues is that Python normalizes already when defining the variable:

Therefore it will be impossible for the AST based checker (i.e. non-ascii-names determine that there was an issue at all.
Otherwise it would already detect it.

The only solution would be to check the source code in plain like the unicode checker.
But we can't simply check for the occurrence of this kind of characters but need to extract all variable/name definitions.

So we kind of need to write a simple Python parser, in order that

valid = "ﬁ_invalid = '23'"
# or even worse
still_valid = """
ﬁ_invalid = "23"
"""

don't become false positives.

Also we need to consider function definitions, function argument definition, class definitions and so one.
Here just some examples to get a glimpse of the complexity:

class Invalid_ﬁ:
    def invalid_method_ﬁ(invalid_argument_ﬁ):
        ...
with open("file", "w") as invalid_ﬁ:
    ...
try:
    a = 1/0
except ValueError as invalid_ﬁ:
    ...

So really a kind of customer Python parser.

I would remove the "Good first issue", because I think it isn't ;-)

DanielNoord · 2022-07-08T13:47:24Z

@CarliJoy Thanks for this investigation. Did you check if these characters also get normalised in the tokens produced by tokenize? If not, this specific check is probably better handled by programs such as flake8 which use tokens instead of the ast representation.

CarliJoy · 2022-07-08T13:48:26Z

@DanielNoord I didn't. Wasn't on my mind.

DanielNoord · 2022-07-08T13:50:09Z

Btw, we also have token checkers so we could also implement these ourselves if the tokens do indeed retain these characters.

CarliJoy · 2022-07-08T13:51:06Z

Tokenize should work. Good idea!

python -m tokenize test.py 
0,0-0,0:            ENCODING       'utf-8'        
1,0-1,5:            NAME           'ﬁ_fou'        
1,6-1,7:            OP             '='            
1,8-1,33:           STRING         '"This will be normalized"'
1,33-1,34:          NEWLINE        '\n'           
2,0-2,5:            NAME           'print'        
2,5-2,6:            OP             '('            
2,6-2,7:            OP             '['            
2,7-2,11:           NAME           'name'         
2,12-2,15:          NAME           'for'          
2,16-2,20:          NAME           'name'         
2,21-2,23:          NAME           'in'           
2,24-2,30:          NAME           'locals'       
2,30-2,31:          OP             '('            
2,31-2,32:          OP             ')'            
2,32-2,33:          OP             '.'            
2,33-2,37:          NAME           'keys'         
2,37-2,38:          OP             '('            
2,38-2,39:          OP             ')'            
2,40-2,42:          NAME           'if'           
2,43-2,47:          NAME           'name'         
2,47-2,48:          OP             '.'            
2,48-2,56:          NAME           'endswith'     
2,56-2,57:          OP             '('            
2,57-2,62:          STRING         '"fou"'        
2,62-2,63:          OP             ')'            
2,63-2,64:          OP             ']'            
2,64-2,65:          OP             ')'            
2,65-2,66:          NEWLINE        '\n'           
3,0-3,5:            NAME           'print'        
3,5-3,6:            OP             '('            
3,6-3,12:           NAME           'fi_fou'       
3,12-3,13:          OP             ')'            
3,13-3,14:          NEWLINE        '\n'           
4,0-4,0:            ENDMARKER      ''

ickc added Enhancement ✨ Improvement to a component Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling labels Aug 23, 2021

Pierre-Sassoulas removed the Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling label Aug 25, 2021

Pierre-Sassoulas added Help wanted 🙏 Outside help would be appreciated, good for new contributors Good first issue Friendly and approachable by new contributors Needs PR This issue is accepted, sufficiently specified and now needs an implementation labels Jul 4, 2022

Pierre-Sassoulas added the Hacktoberfest label Sep 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Warn against unicode identifier normalization #4906

Warn against unicode identifier normalization #4906

ickc commented Aug 23, 2021 •

edited

Pierre-Sassoulas commented Aug 25, 2021 •

edited

ickc commented Aug 26, 2021

JulienPalard commented Nov 17, 2021

Pierre-Sassoulas commented Jul 4, 2022

CarliJoy commented Jul 6, 2022

CarliJoy commented Jul 8, 2022

DanielNoord commented Jul 8, 2022

CarliJoy commented Jul 8, 2022

DanielNoord commented Jul 8, 2022

CarliJoy commented Jul 8, 2022

Warn against unicode identifier normalization #4906

Warn against unicode identifier normalization #4906

Comments

ickc commented Aug 23, 2021 • edited

Current problem

Desired solution

Additional context

Keys to check

Pierre-Sassoulas commented Aug 25, 2021 • edited

ickc commented Aug 26, 2021

JulienPalard commented Nov 17, 2021

Pierre-Sassoulas commented Jul 4, 2022

CarliJoy commented Jul 6, 2022

CarliJoy commented Jul 8, 2022

DanielNoord commented Jul 8, 2022

CarliJoy commented Jul 8, 2022

DanielNoord commented Jul 8, 2022

CarliJoy commented Jul 8, 2022

ickc commented Aug 23, 2021 •

edited

Pierre-Sassoulas commented Aug 25, 2021 •

edited