Warn against unicode identifier normalization #4906
Comments
Thank you, I did not know about this behavior; it makes a lot of sense to add a check for that in pylint. Would

That sounds good. Thanks! I added a list of the possible places this happens at the end of the original post above. Feel free to add more if you find any missing.
More context: https://lwn.net/ml/python-dev/85d2c121-86a9-e04a-6dfa-da36b073a95d@gmail.com/

In the following case I'd say it's OK to warn about using a non-normalized identifier; it's an easy fix for the dev:

```python
ﬁ_fou = 42     # Using the ligature U+FB01 LATIN SMALL LIGATURE FI
print(fi_fou)  # Using only codepoints < 127
```

In the following case it's more complicated to fix, as the two look identical; it's probably the case where normalization just plays its role. The bad thing is that human users with keyboards will input the 1st one, while the NFKD form is the 2nd one, so it's not (easily?) fixable from a keyboard:

```python
étage = 42    # Using U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(étage)  # Using U+0065 LATIN SMALL LETTER E then U+0301 COMBINING ACUTE ACCENT
```

But it can be detected as legitimate, as it "round-trips" via canonical composition (examples below). In the following one, warning about normalization is clearly a good idea, as it leads to a "visual bug":

```pycon
>>> ϕ = 1
>>> φ = 2
>>> print(ϕ, φ)
2 2
```

About the "it's legit if it round-trips" idea, here's a demo of what I have in mind:

```python
import unicodedata

def is_legit(name):
    """Return True if the name is already in the expected decomposed form,
    None if it's not but acceptable, and False otherwise.
    """
    if unicodedata.normalize("NFKD", name) == name:
        return True
    if unicodedata.normalize("NFKC", unicodedata.normalize("NFKD", name)) == name:
        return None
    return False
```

Which gives:
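A self-contained run of the round-trip idea on the three examples above (a sketch; the Unicode escapes spell out the exact codepoints so they survive copy-paste):

```python
import unicodedata

def is_legit(name):
    """Return True if the name is already in NFKD form, None if it is not
    but round-trips via canonical composition, and False otherwise."""
    if unicodedata.normalize("NFKD", name) == name:
        return True
    if unicodedata.normalize("NFKC", unicodedata.normalize("NFKD", name)) == name:
        return None
    return False

print(is_legit("fi_fou"))      # True:  plain ASCII is already decomposed
print(is_legit("\u00e9tage"))  # None:  é decomposes, then recomposes to itself
print(is_legit("\ufb01_fou"))  # False: the fi ligature never comes back
print(is_legit("\u03d5"))      # False: GREEK PHI SYMBOL normalizes to U+03C6
```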
From @JulienPalard's example, following the amazing work on the unicode checker by @CarliJoy, there's only one false negative left, as pylint now warns about non-ASCII names:

```python
ﬁ_fou = 42     # Using the ligature U+FB01 LATIN SMALL LIGATURE FI
print(fi_fou)  # Using only codepoints < 127

étage = 42    # Using U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(étage)  # Using U+0065 LATIN SMALL LETTER E then U+0301 COMBINING ACUTE ACCENT

ϕ = 1
φ = 2
print(ϕ, φ)
```

The first example with LATIN SMALL LIGATURE FI is still not detected by pylint. I'm pretty sure the unicode checker will be easy to extend to fix this, as it's pretty modular and well designed.
If someone wants to fix this issue, feel free to contact me.
I had a look into it. The issue is that Python already normalizes the name when defining the variable, so it is impossible for the AST-based checker (i.e. `non-ascii-name`) to determine that there was an issue at all.

The only solution would be to check the source code in plain text, like the unicode checker does. So we kind of need to write a simple Python parser, so that

```python
valid = "ﬁ_invalid = '23'"
# or even worse
still_valid = """
ﬁ_invalid = "23"
"""
```

don't become false positives. We also need to consider function definitions, function argument definitions, class definitions, and so on:

```python
class Invalid_ﬁ:
    def invalid_method_ﬁ(invalid_argument_ﬁ):
        ...

with open("file", "w") as invalid_ﬁ:
    ...

try:
    a = 1 / 0
except ZeroDivisionError as invalid_ﬁ:
    ...
```

So really a kind of custom Python parser. I would remove the "Good first issue" label, because I think it isn't one ;-)
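The parse-time normalization can be observed directly by `exec`-ing into a throwaway namespace (a minimal sketch; the escape `\ufb01` is the fi ligature):

```python
ns = {}
# The source spells the identifier with U+FB01 LATIN SMALL LIGATURE FI...
exec("\ufb01_fou = 42", ns)
# ...but the compiler applies NFKC before binding the name, so by the time
# any AST exists, only the ASCII spelling remains in the namespace.
print("\ufb01_fou" in ns)  # False
print(ns["fi_fou"])        # 42
```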
@CarliJoy Thanks for this investigation. Did you check if these characters also get normalised in the

@DanielNoord I didn't. It wasn't on my mind.
By the way, we also have token checkers, so we could implement this ourselves if the tokens do indeed retain these characters.
Tokenize should work. Good idea!
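A minimal token-based sketch of that idea (`find_unnormalized_names` is a hypothetical helper, not pylint's actual checker). NAME tokens keep the raw source spelling, and string literals come through as STRING tokens, so the string false positives above disappear for free:

```python
import io
import tokenize
import unicodedata

def find_unnormalized_names(source):
    """Yield (row, col, name) for NAME tokens whose raw spelling differs
    from the NFKC form that Python will actually bind."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and unicodedata.normalize("NFKC", tok.string) != tok.string:
            yield tok.start[0], tok.start[1], tok.string

source = '\ufb01_fou = 42\nvalid = "\ufb01_invalid = 23"\n'
print(list(find_unnormalized_names(source)))  # only the line-1 identifier is reported
```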
Current problem

Unicode normalization in unicode identifiers, c.f. `__all__` - Python tracker. The issue is that while some characters can be used in an identifier, internally it is normalized. So when looked up using `__all__` or `getattr`, it results in an error. (See the 2nd & 3rd link for the respective examples.)

Desired solution

pylint can warn about an identifier whose normalized form is different, and suggest the normalized identifier instead; i.e. it notifies the user that the identifier will be silently converted internally.

Similarly, pylint can warn about such strings inside `__all__` and `getattr` too, as they can never be looked up, because no identifier would retain such an unnormalized string.

Additional context

Edit: keys of `globals()` & `locals()` also share the same problem.

Keys to check:
- `__all__[key]`
- `getattr(..., key)`
- `__getattr__(..., key)`
- `globals()[key]`
- `locals()[key]`
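The `getattr` failure mode can be reproduced with a throwaway module (a sketch; `demo` is a hypothetical module name):

```python
import types
import unicodedata

demo = types.ModuleType("demo")
exec("\ufb01_fou = 42", demo.__dict__)  # identifier spelled with U+FB01 (fi ligature)

# getattr() does not normalize its string argument, so the raw spelling fails...
try:
    getattr(demo, "\ufb01_fou")
except AttributeError:
    print("raw key not found")

# ...while the NFKC-normalized key succeeds.
print(getattr(demo, unicodedata.normalize("NFKC", "\ufb01_fou")))  # 42
```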