invalid-name does not recognize identifiers with non-ASCII characters #2725

dato · 2019-02-04T15:06:35Z

In Python 3, identifiers can include non-ASCII characters. See Identifiers and keywords in the Python Reference.

Roughly speaking, the definition of letters is extended from [A-Za-z] to all Unicode letter categories. The standard re module, however, does not support Unicode character properties like Lu, Ll, etc; and I’m not sure its standard Unicode support will be enough.

Alternatives include the drop-in, compatible regex module (which supports Unicode properties with \p{}, or manually compiling a list of character ranges to use (which gets very ugly very quickly).

Steps to reproduce

Create an UTF-8 file with, for example, the following contents:

"""Python 3 code with non-ASCII identifiers."""

def validar_contraseña(cadena):
    """Validates a password, in Spanish.
    """
    contraseña_válida = len(cadena) >= 8
    return contraseña_válida

(N.B.: I don’t write code with non-ASCII characters, but my students do.)

Run pylint3 on it.

Current behavior

$ pylint3 test.py
test.py:3:0: C0103: Function name "validar_contraseña" doesn't conform to snake_case naming style (invalid-name)                                                                              
test.py:6:4: C0103: Variable name "contraseña_válida" doesn't conform to snake_case naming style (invalid-name)

(I encounter this when I run pylint on my students’ code.)

Expected behavior

$ pylint3 test.py
Your code has been rated at 10.00/10

(This is what I want their code to look like. 😄)

pylint --version output

Tested with:

$ pylint3 --version
pylint3 2.2.2
astroid 2.1.0
Python 3.7.2 (default, Jan  3 2019, 02:55:40) 
[GCC 8.2.0]

$ ~/.local/bin/pylint --version
pylint 2.3.0-dev1
astroid 2.2.0-dev
Python 3.7.2 (default, Jan  3 2019, 02:55:40) 
[GCC 8.2.0]

Many thanks in advance.

The text was updated successfully, but these errors were encountered:

PCManticore · 2019-02-04T15:17:19Z

Thanks for the report. if I understand right, using the regex library will allow you to customize the regular expressions for variables names, but right now that's not possible due to the fact that we use the re stdlib module? if that's the case, I think that makes sense, but other than that I don't think we should tinker with those regular expression defaults just to support Unicode, that would still need to be set on a case by case basis in your pylintrc file.

dato · 2019-02-04T15:48:09Z

Hi, and thank you for the quick reply.

if I understand right, using the regex library will allow you to customize the regular expressions for variables names

To be honest I’m not really interested in using my own regular expressions.

What would be great is for pylint’s default verification of snake case (or any other supported style) to accept all valid Python 3 identifiers in snake case. At the moment it fails to do that.

but other than that I don't think we should tinker with those regular expression defaults just to support Unicode

Well, that’s exactly what I’m arguing for: that the current, default regular expressions in checkers/base.py are insufficient, insofar as they don’t cover the totality of Python 3 allowed syntax.

I can see a case for not changing them. But this bug report is not about having regex available (that’d be a separate topic), but about having pylint comprehensively support all valid identifiers by default.

Many thanks for considering.

dato · 2019-02-04T15:49:09Z

(Just to be clear, I’ll perfectly understand if you prefer to close this issue with “wontfix”, I just wanted for it to exist and for its scope to be clear. I just do not believe users of Unicode identifiers should be left with writing up their own regular expressions as their only option.)

PCManticore · 2019-02-18T14:36:13Z

@dato This makes sense, I think we should do it if it's possible via the regex library.

gyermolenko · 2019-08-14T12:43:01Z

Maybe I'm out of context here but by default re module handles Unicode.
This behavior can be tuned with re.ASCII.

>>> import re
>>> re.search("\w*_\w*", "validar_contraseña")
<re.Match object; span=(0, 18), match='validar_contraseña'>

>>> re.search("\w*_\w*", "validar_contraseña", re.ASCII)
<re.Match object; span=(0, 16), match='validar_contrase'>

imho, user should be discouraged from using non-ascii names. Some chars are very similar which may cause issues later.

dato · 2019-08-15T02:39:20Z

That's a very good point, thank you! The current checkers don't use \w and force [a-z]:

https://github.com/PyCQA/pylint/blob/33b8185a455c1686d038258697bb93005f2441c2/pylint/checkers/base.py#L87-L97

gyermolenko · 2019-08-15T08:55:58Z

I like current behavior
but also can imagine situation when user wants to change that (not necessary knowing what current regex is, or what is should be)

PCManticore · 2019-08-16T15:02:31Z

I agree with @gyermolenko that we should discourage users from non-ASCII identifiers. At the same time, I don't believe that should happen with invalid-name. As @dato suggested, those identifiers are valid identifiers anyway and pylint should accept them. Maybe the solution is two-fold:

adjust the current regex rules to work with non-ASCII identifiers
add a new rule to disallow non-ASCII identifiers, which can be disabled separately of invalid-name

frederickjeanguerin · 2019-11-01T14:19:03Z

I am also a teacher and use some accented identifiers in my code with my students. The proposed two-fold solution by @PCManticore makes perfect sense to me. Hope that it will be implemented soon. Thanks.

PCManticore · 2019-11-08T15:16:45Z

Thanks for the comment @frederickjeanguerin We'll get to this issue eventually but if you or any of your students have any free time to work on a patch, that would speed things up considerably (there are almost 500 issues on this tracker so might take us a while to fix it)

PCManticore · 2020-03-11T13:34:29Z

This has been implemented by #3409 which adds support for Unicode names and adds a new check non-ascii-name for non-ASCII name identifiers. Let us know if you encounter any issues with this solution. It's going to be released in 2.5 once we're ready for release.

arisliang · 2020-04-25T14:58:19Z

How do we enable this new non-ascii-name?

PCManticore · 2020-04-27T07:20:02Z

@arisliang It will be enabled by default once we'll launch 2.5 to the public (which should happen this week)

PCManticore added the Enhancement ✨ Improvement to a component label Feb 18, 2019

PCManticore changed the title ~~Does not recognize identifiers with non-ASCII characters~~ invalid-name does not recognize identifiers with non-ASCII characters Feb 18, 2019

PCManticore added the Good first issue Friendly and approachable by new contributors label Aug 16, 2019

bfgray3 mentioned this issue Feb 14, 2020

allow non-ASCII characters in identifiers in the invalid-name rule and add non-ascii-name #3409

Merged

4 tasks

PCManticore closed this as completed Mar 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

invalid-name does not recognize identifiers with non-ASCII characters #2725

invalid-name does not recognize identifiers with non-ASCII characters #2725

dato commented Feb 4, 2019

PCManticore commented Feb 4, 2019

dato commented Feb 4, 2019 •

edited

dato commented Feb 4, 2019 •

edited

PCManticore commented Feb 18, 2019

gyermolenko commented Aug 14, 2019 •

edited

dato commented Aug 15, 2019

gyermolenko commented Aug 15, 2019

PCManticore commented Aug 16, 2019

frederickjeanguerin commented Nov 1, 2019

PCManticore commented Nov 8, 2019

PCManticore commented Mar 11, 2020

arisliang commented Apr 25, 2020

PCManticore commented Apr 27, 2020

invalid-name does not recognize identifiers with non-ASCII characters #2725

invalid-name does not recognize identifiers with non-ASCII characters #2725

Comments

dato commented Feb 4, 2019

Steps to reproduce

Current behavior

Expected behavior

pylint --version output

PCManticore commented Feb 4, 2019

dato commented Feb 4, 2019 • edited

dato commented Feb 4, 2019 • edited

PCManticore commented Feb 18, 2019

gyermolenko commented Aug 14, 2019 • edited

dato commented Aug 15, 2019

gyermolenko commented Aug 15, 2019

PCManticore commented Aug 16, 2019

frederickjeanguerin commented Nov 1, 2019

PCManticore commented Nov 8, 2019

PCManticore commented Mar 11, 2020

arisliang commented Apr 25, 2020

PCManticore commented Apr 27, 2020

dato commented Feb 4, 2019 •

edited

dato commented Feb 4, 2019 •

edited

gyermolenko commented Aug 14, 2019 •

edited