Unicode control characters are not allowed as identifiers #49608

baijum · 2009-02-24T11:53:50Z

BPO	5358
Nosy	@loewis, @ezio-melotti
Files	identifier.py: File with Unicode control character in identifier

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2009-02-27.18:32:17.612>
created_at = <Date 2009-02-24.11:53:50.445>
labels = ['type-bug', 'expert-unicode']
title = 'Unicode control characters are not allowed as identifiers'
updated_at = <Date 2009-02-27.18:32:17.537>
user = 'https://bugs.python.org/baijum'

bugs.python.org fields:

activity = <Date 2009-02-27.18:32:17.537>
actor = 'loewis'
assignee = 'none'
closed = True
closed_date = <Date 2009-02-27.18:32:17.612>
closer = 'loewis'
components = ['Unicode']
creation = <Date 2009-02-24.11:53:50.445>
creator = 'baijum'
dependencies = []
files = ['13162']
hgrepos = []
issue_num = 5358
keywords = []
message_count = 7.0
messages = ['82664', '82666', '82820', '82821', '82822', '82842', '82858']
nosy_count = 4.0
nosy_names = ['loewis', 'ezio.melotti', 'mrabarnett', 'baijum']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue5358'
versions = ['Python 3.0', 'Python 3.1']

baijum · 2009-02-24T11:53:49Z

I tried to use the zero-width joiner (U+200D) as part of an identifier.
It produces an exception like this:

SyntaxError: invalid character in identifier

I have attached the Python file that produces this error.

Zero-width joiner (U+200D) is a Unicode control character:
http://en.wikipedia.org/wiki/Unicode_control_characters
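
A minimal sketch for checking this report against a given interpreter. The identifier `a<ZWJ>b` and the helper `compiles` are hypothetical stand-ins, not the contents of the attached identifier.py. Note that the Unicode database classifies U+200D as category Cf (format) rather than Cc (control). Whether the name is accepted depends on the Unicode data bundled with the interpreter, so no expected output is shown for that part.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER
name = "a" + ZWJ + "b"  # hypothetical identifier containing a ZWJ

# General category of the character as recorded in the Unicode database.
print(unicodedata.name(ZWJ), unicodedata.category(ZWJ))

# str.isidentifier() applies the identifier character rules to the raw string.
print("isidentifier:", name.isidentifier())


def compiles(source):
    """Return True if the source compiles, printing the SyntaxError otherwise."""
    try:
        compile(source, "<sketch>", "exec")
    except SyntaxError as exc:
        print("SyntaxError:", exc.msg)
        return False
    return True


accepted = compiles(name + " = 1")
```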

loewis · 2009-02-24T16:21:44Z

Why do you think this is a bug?

baijum · 2009-02-27T06:47:50Z

On a further look at this issue, I understand that Python cannot allow all
Unicode control characters in identifiers. But for many international
languages, without some control characters like ZWJ and ZWNJ [1], it is not
possible to construct all characters with the proper visual
representation. So, if Python really wants to support international
characters in identifiers (for some reason), ZWJ and ZWNJ are unavoidable,
and maybe some other characters as well.

[1] http://en.wikipedia.org/wiki/Zero-width_joiner
http://en.wikipedia.org/wiki/Zero-width_non-joiner

baijum · 2009-02-27T07:24:12Z

I think RFC-3454 [1] can be used as a base for selecting the control
characters which can be used as a valid identifier character.

[1] http://www.rfc-editor.org/rfc/rfc3454.txt

ezio-melotti · 2009-02-27T07:48:19Z

Valid identifiers should begin with a letter or '_' and contain only
letters, numbers and '_'. This probably means that only the Unicode
characters that belong to the categories Ll, Lu (Letter Lower/Upper
case), Nd (Number, Decimal Digit) and Pc (Punctuation, Connector) - and
possibly other categories like Lm, Lt, No and Nl - are valid.

Some examples:
>>> ａ－ｂ = 5 # U+FF0D, Cat: Pd, FULLWIDTH HYPHEN-MINUS
SyntaxError: invalid character in identifier
>>> a＃ = 5 # U+FF03, Cat: Po, FULLWIDTH NUMBER SIGN
SyntaxError: invalid character in identifier
>>> a）b = 5 # U+FF09, Cat: Pe, FULLWIDTH RIGHT PARENTHESIS
SyntaxError: invalid character in identifier
>>> ａ＿ｂ = 5 # U+FF3F, Cat: Pc, FULLWIDTH LOW LINE
>>> ａ＿ｂ
5
>>> a﹍b﹎c﹏d = 5 # U+FE4D, U+FE4E, U+FE4F, Cat: Pc
>>> a﹍b﹎c﹏d
5

mrabarnett · 2009-02-27T16:55:00Z

The definition of a word in the new re module (actually targeted at
Python 2.7) is currently a sequence of L&, N&, M& and Pc.

I suppose ideally we want the definitions of a word and an identifier to
be basically the same, except that an identifier can't start with N&.
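
As a rough illustration of the word/identifier difference, here is a sketch using the standard library `re` module in Python 3, which is not the new module discussed above and may define `\w` differently. The helper `is_word` is hypothetical.

```python
import re


def is_word(s):
    """True if the whole string is a run of \\w characters."""
    return re.fullmatch(r"\w+", s) is not None


for candidate in ["abc1", "1abc", "_x", "a-b"]:
    # A leading digit is fine for a word but not for an identifier.
    print(candidate, is_word(candidate), candidate.isidentifier())
```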

loewis · 2009-02-27T18:32:17Z

See PEP 3131 for the specification of what constitutes an identifier in Python.

Closing this as "won't fix".
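
PEP 3131 bases identifiers on the Unicode XID_Start and XID_Continue properties and has the parser convert every identifier to NFKC. A small sketch of the normalization step, reusing the fullwidth name from the earlier examples (the variable names here are illustrative):

```python
import unicodedata

fullwidth = "\uff41\uff3f\uff42"  # fullwidth a, FULLWIDTH LOW LINE, fullwidth b

# NFKC folds the fullwidth forms to their ASCII compatibility equivalents.
normalized = unicodedata.normalize("NFKC", fullwidth)
print(ascii(fullwidth), "->", normalized)

# The parser normalizes the identifier, so the binding is stored under the
# normalized name.
namespace = {}
exec(fullwidth + " = 5", namespace)
print("a_b" in namespace, namespace.get("a_b"))
```

This is why the fullwidth spelling and the plain ASCII spelling refer to the same variable.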

baijum mannequin added the topic-unicode and type-bug labels Feb 24, 2009

loewis mannequin closed this as completed Feb 27, 2009

ezio-melotti transferred this issue from another repository Apr 10, 2022
