Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

invalid-name does not recognize identifiers with non-ASCII characters #2725

Closed
dato opened this issue Feb 4, 2019 · 13 comments
Closed

invalid-name does not recognize identifiers with non-ASCII characters #2725

dato opened this issue Feb 4, 2019 · 13 comments
Labels
Enhancement ✨ Improvement to a component Good first issue Friendly and approachable by new contributors

Comments

@dato
Copy link

dato commented Feb 4, 2019

In Python 3, identifiers can include non-ASCII characters. See Identifiers and keywords in the Python Reference.

Roughly speaking, the definition of letters is extended from [A-Za-z] to all Unicode letter categories. The standard re module, however, does not support Unicode character properties like Lu, Ll, etc; and I’m not sure its standard Unicode support will be enough.

Alternatives include the drop-in, compatible regex module (which supports Unicode properties with \p{}, or manually compiling a list of character ranges to use (which gets very ugly very quickly).

Steps to reproduce

  1. Create an UTF-8 file with, for example, the following contents:

    """Python 3 code with non-ASCII identifiers."""
    
    def validar_contraseña(cadena):
        """Validates a password, in Spanish.
        """
        contraseña_válida = len(cadena) >= 8
        return contraseña_válida

    (N.B.: I don’t write code with non-ASCII characters, but my students do.)

  2. Run pylint3 on it.

Current behavior

$ pylint3 test.py
test.py:3:0: C0103: Function name "validar_contraseña" doesn't conform to snake_case naming style (invalid-name)                                                                              
test.py:6:4: C0103: Variable name "contraseña_válida" doesn't conform to snake_case naming style (invalid-name)

(I encounter this when I run pylint on my students’ code.)

Expected behavior

$ pylint3 test.py
Your code has been rated at 10.00/10

(This is what I want their code to look like. 😄)

pylint --version output

Tested with:

$ pylint3 --version
pylint3 2.2.2
astroid 2.1.0
Python 3.7.2 (default, Jan  3 2019, 02:55:40) 
[GCC 8.2.0]

$ ~/.local/bin/pylint --version
pylint 2.3.0-dev1
astroid 2.2.0-dev
Python 3.7.2 (default, Jan  3 2019, 02:55:40) 
[GCC 8.2.0]

Many thanks in advance.

@PCManticore
Copy link
Contributor

Thanks for the report. if I understand right, using the regex library will allow you to customize the regular expressions for variables names, but right now that's not possible due to the fact that we use the re stdlib module? if that's the case, I think that makes sense, but other than that I don't think we should tinker with those regular expression defaults just to support Unicode, that would still need to be set on a case by case basis in your pylintrc file.

@dato
Copy link
Author

dato commented Feb 4, 2019

Hi, and thank you for the quick reply.

if I understand right, using the regex library will allow you to customize the regular expressions for variables names

To be honest I’m not really interested in using my own regular expressions.

What would be great is for pylint’s default verification of snake case (or any other supported style) to accept all valid Python 3 identifiers in snake case. At the moment it fails to do that.

but other than that I don't think we should tinker with those regular expression defaults just to support Unicode

Well, that’s exactly what I’m arguing for: that the current, default regular expressions in checkers/base.py are insufficient, insofar as they don’t cover the totality of Python 3 allowed syntax.

I can see a case for not changing them. But this bug report is not about having regex available (that’d be a separate topic), but about having pylint comprehensively support all valid identifiers by default.

Many thanks for considering.

@dato
Copy link
Author

dato commented Feb 4, 2019

(Just to be clear, I’ll perfectly understand if you prefer to close this issue with “wontfix”, I just wanted for it to exist and for its scope to be clear. I just do not believe users of Unicode identifiers should be left with writing up their own regular expressions as their only option.)

@PCManticore
Copy link
Contributor

@dato This makes sense, I think we should do it if it's possible via the regex library.

@PCManticore PCManticore added the Enhancement ✨ Improvement to a component label Feb 18, 2019
@PCManticore PCManticore changed the title Does not recognize identifiers with non-ASCII characters invalid-name does not recognize identifiers with non-ASCII characters Feb 18, 2019
@gyermolenko
Copy link
Contributor

gyermolenko commented Aug 14, 2019

Maybe I'm out of context here but by default re module handles Unicode.
This behavior can be tuned with re.ASCII.

>>> import re
>>> re.search("\w*_\w*", "validar_contraseña")
<re.Match object; span=(0, 18), match='validar_contraseña'>

>>> re.search("\w*_\w*", "validar_contraseña", re.ASCII)
<re.Match object; span=(0, 16), match='validar_contrase'>

imho, user should be discouraged from using non-ascii names. Some chars are very similar which may cause issues later.

@dato
Copy link
Author

dato commented Aug 15, 2019

That's a very good point, thank you! The current checkers don't use \w and force [a-z]:

https://github.com/PyCQA/pylint/blob/33b8185a455c1686d038258697bb93005f2441c2/pylint/checkers/base.py#L87-L97

@gyermolenko
Copy link
Contributor

  • I like current behavior
  • but also can imagine situation when user wants to change that (not necessary knowing what current regex is, or what is should be)

@PCManticore
Copy link
Contributor

I agree with @gyermolenko that we should discourage users from non-ASCII identifiers. At the same time, I don't believe that should happen with invalid-name. As @dato suggested, those identifiers are valid identifiers anyway and pylint should accept them. Maybe the solution is two-fold:

  • adjust the current regex rules to work with non-ASCII identifiers
  • add a new rule to disallow non-ASCII identifiers, which can be disabled separately of invalid-name

@PCManticore PCManticore added the Good first issue Friendly and approachable by new contributors label Aug 16, 2019
@frederickjeanguerin
Copy link

I am also a teacher and use some accented identifiers in my code with my students. The proposed two-fold solution by @PCManticore makes perfect sense to me. Hope that it will be implemented soon. Thanks.

@PCManticore
Copy link
Contributor

Thanks for the comment @frederickjeanguerin We'll get to this issue eventually but if you or any of your students have any free time to work on a patch, that would speed things up considerably (there are almost 500 issues on this tracker so might take us a while to fix it)

@PCManticore
Copy link
Contributor

This has been implemented by #3409 which adds support for Unicode names and adds a new check non-ascii-name for non-ASCII name identifiers. Let us know if you encounter any issues with this solution. It's going to be released in 2.5 once we're ready for release.

@arisliang
Copy link

How do we enable this new non-ascii-name?

@PCManticore
Copy link
Contributor

@arisliang It will be enabled by default once we'll launch 2.5 to the public (which should happen this week)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ✨ Improvement to a component Good first issue Friendly and approachable by new contributors
Projects
None yet
Development

No branches or pull requests

5 participants