Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn against unicode identifier normalization #4906

Open
5 tasks
ickc opened this issue Aug 23, 2021 · 10 comments
Open
5 tasks

Warn against unicode identifier normalization #4906

ickc opened this issue Aug 23, 2021 · 10 comments
Labels
Enhancement ✨ Improvement to a component Good first issue Friendly and approachable by new contributors Hacktoberfest Help wanted 🙏 Outside help would be appreciated, good for new contributors Needs PR This issue is accepted, sufficiently specified and now needs an implementation

Comments

@ickc
Copy link

ickc commented Aug 23, 2021

Current problem

Unicode normalization in unicode identifier. c.f.

The issue is that while some characters can be used as identifier, internally it is normalized. So when looked up using __all__ or getattr, it would results in an error. (See the 2nd & 3rd link for respective example.)

Desired solution

pylint can warns about an identifier when normalized is different, and suggested a normalized identifier instead. i.e. it notifies that the identifier will be converted internally silently.

Similarly, pylint can warns those inside __all__ and getattr too, as they can never be looked up due to no identifier would have such an unnormalized string.

Additional context

Edit: keys of globals() & locals() also shares the same problem:

Keys to check

  • __all__[key]
  • getattr(..., key)
  • __getattr__(..., key)
  • globals()[key]
  • locals()[key]
@ickc ickc added Enhancement ✨ Improvement to a component Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling labels Aug 23, 2021
@Pierre-Sassoulas Pierre-Sassoulas removed the Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling label Aug 25, 2021
@Pierre-Sassoulas
Copy link
Member

Pierre-Sassoulas commented Aug 25, 2021

Thank you, I did not know about this behavior, it makes a lot of sense to add a check for that in pylint. Would implicit-identifier-normalization work as a name ?

@ickc
Copy link
Author

ickc commented Aug 26, 2021

That sounds good. Thanks! I added a list of the possible places this happens at the end of the original post above. Feel free to add more if you found some missing.

@JulienPalard
Copy link
Contributor

More context: https://lwn.net/ml/python-dev/85d2c121-86a9-e04a-6dfa-da36b073a95d@gmail.com/

In the following case I'd say it's OK to warn about using a not-normalized identifier, it's an easy fix for the dev:

fi_fou = 42  # Using the ligature U+FB01 LATIN SMALL LIGATURE FI
print(fi_fou)  # Using only codepoints < 127

In the following case it's more complicated to fix as the two looks identical, it's probably the case where normalization just plays its role. Bad thing, the human users with keyboards will input the 1st one, while the NFKD is the 2nd one, so it's not (easily?) fixable by using a keyboard:

étage = 42  # Using U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(étage)  # Using U+0065 LATIN SMALL LETTER E then U+0301 COMBINING ACUTE ACCENT

But it can be detected as legitimate as it "rounds-trip" via canonical composition (examples below).

In the following one, warning about normalisation is clearly a good idea, as it leads to a "visual bug":

>>> ϕ = 1
>>> φ = 2
>>> print(ϕ, φ)
2, 2

About the 'it's legit if it rounds trip" here's a demo of what I have in mind:

def is_legit(name):
    """Returns True if the name is already in the expected denormalized form,
    None if it's not but acceptable, and False if not. 
    """
    if unicodedata.normalize("NFKD", name) == name:
        return True
    if unicodedata.normalize("NFKC", unicodedata.normalize("NFKD", name)) == name:
        return None
    return False

Which gives:

>>> is_legit("abc123")  # A normal, plain, old, ASCII identifier
True
>>> print(is_legit("été"))  # That's how a french keyboard would input été, in a composed form
None # means it's not normalized but still acceptable.
>>> print(is_legit("étage")) # This is the denormalized form
True # But I don't know how to input this with a keyboard :D
>>> print(is_legit("étage")) # This is the composed form ("keyboard" / "user" input form)
None # The way user write it is acceptable, youhouuuuuuuu!
>>> print(is_legit("ϕ")) # This one is the one introducing a "visual" bug in my last example
False # Note this is the only one refused by my function.
>>> print(is_legit("φ")) # It's normalized form.
True

@Pierre-Sassoulas
Copy link
Member

From @JulienPalard example, following the amazing work on the unicode checker by @CarliJoy there's only one false negative left as pylint now warns about non ascii name:

fi_fou = 42  # Using the ligature U+FB01 LATIN SMALL LIGATURE FI
print(fi_fou)  # Using only codepoints < 127
étage = 42  # Using U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(étage)  # Using U+0065 LATIN SMALL LETTER E then U+0301 COMBINING ACUTE ACCENT
ϕ = 1
φ = 2
print(ϕ, φ)
a.py:5:0: C2401: Variable name "étage" contains a non-ASCII character, consider renaming it. (non-ascii-name)
a.py:7:0: C2401: Variable name "φ" contains a non-ASCII character, consider renaming it. (non-ascii-name)
a.py:8:0: C2401: Variable name "φ" contains a non-ASCII character, consider renaming it. (non-ascii-name)

The first example with LATIN SMALL LIGATURE FI is still not detected by pylint. I'm pretty sure the unicode checker will be easy to extends to fix this as it's pretty modular and well designed.

@Pierre-Sassoulas Pierre-Sassoulas added Help wanted 🙏 Outside help would be appreciated, good for new contributors Good first issue Friendly and approachable by new contributors Needs PR This issue is accepted, sufficiently specified and now needs an implementation labels Jul 4, 2022
@CarliJoy
Copy link
Contributor

CarliJoy commented Jul 6, 2022

If someone want to fix this issue, feel free to contact me.

@CarliJoy
Copy link
Contributor

CarliJoy commented Jul 8, 2022

I had a look into it.
I don't think this is solvable with acceptable effort.

The issues is that Python normalizes already when defining the variable:
image

Therefore it will be impossible for the AST based checker (i.e. non-ascii-names determine that there was an issue at all.
Otherwise it would already detect it.

The only solution would be to check the source code in plain like the unicode checker.
But we can't simply check for the occurrence of this kind of characters but need to extract all variable/name definitions.

So we kind of need to write a simple Python parser, in order that

valid = "fi_invalid = '23'"
# or even worse
still_valid = """
fi_invalid = "23"
"""

don't become false positives.

Also we need to consider function definitions, function argument definition, class definitions and so one.
Here just some examples to get a glimpse of the complexity:

class Invalid_fi:
    def invalid_method_fi(invalid_argument_fi):
        ...
with open("file", "w") as invalid_fi:
    ...
try:
    a = 1/0
except ValueError as invalid_fi:
    ...

So really a kind of customer Python parser.

I would remove the "Good first issue", because I think it isn't ;-)

@DanielNoord
Copy link
Collaborator

@CarliJoy Thanks for this investigation. Did you check if these characters also get normalised in the tokens produced by tokenize? If not, this specific check is probably better handled by programs such as flake8 which use tokens instead of the ast representation.

@CarliJoy
Copy link
Contributor

CarliJoy commented Jul 8, 2022

@DanielNoord I didn't. Wasn't on my mind.

@DanielNoord
Copy link
Collaborator

Btw, we also have token checkers so we could also implement these ourselves if the tokens do indeed retain these characters.

@CarliJoy
Copy link
Contributor

CarliJoy commented Jul 8, 2022

Tokenize should work. Good idea!

python -m tokenize test.py 
0,0-0,0:            ENCODING       'utf-8'        
1,0-1,5:            NAME           'fi_fou'        
1,6-1,7:            OP             '='            
1,8-1,33:           STRING         '"This will be normalized"'
1,33-1,34:          NEWLINE        '\n'           
2,0-2,5:            NAME           'print'        
2,5-2,6:            OP             '('            
2,6-2,7:            OP             '['            
2,7-2,11:           NAME           'name'         
2,12-2,15:          NAME           'for'          
2,16-2,20:          NAME           'name'         
2,21-2,23:          NAME           'in'           
2,24-2,30:          NAME           'locals'       
2,30-2,31:          OP             '('            
2,31-2,32:          OP             ')'            
2,32-2,33:          OP             '.'            
2,33-2,37:          NAME           'keys'         
2,37-2,38:          OP             '('            
2,38-2,39:          OP             ')'            
2,40-2,42:          NAME           'if'           
2,43-2,47:          NAME           'name'         
2,47-2,48:          OP             '.'            
2,48-2,56:          NAME           'endswith'     
2,56-2,57:          OP             '('            
2,57-2,62:          STRING         '"fou"'        
2,62-2,63:          OP             ')'            
2,63-2,64:          OP             ']'            
2,64-2,65:          OP             ')'            
2,65-2,66:          NEWLINE        '\n'           
3,0-3,5:            NAME           'print'        
3,5-3,6:            OP             '('            
3,6-3,12:           NAME           'fi_fou'       
3,12-3,13:          OP             ')'            
3,13-3,14:          NEWLINE        '\n'           
4,0-4,0:            ENDMARKER      ''             

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ✨ Improvement to a component Good first issue Friendly and approachable by new contributors Hacktoberfest Help wanted 🙏 Outside help would be appreciated, good for new contributors Needs PR This issue is accepted, sufficiently specified and now needs an implementation
Projects
None yet
Development

No branches or pull requests

5 participants