Documentation v/s behaviour mismatch wrt integer literals containing non-ASCII characters #69462

shreevatsa · 2015-09-30T05:19:46Z

BPO	25275
Nosy	@vstinner, @ezio-melotti, @bitdancer, @vadmium

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2015-09-30.05:19:45.865>
labels = ['interpreter-core', 'type-bug', 'expert-unicode', 'docs']
title = 'Documentation v/s behaviour mismatch wrt integer literals containing non-ASCII characters'
updated_at = <Date 2015-10-01.04:07:07.422>
user = 'https://bugs.python.org/shreevatsa'

bugs.python.org fields:

activity = <Date 2015-10-01.04:07:07.422>
actor = 'martin.panter'
assignee = 'docs@python'
closed = False
closed_date = None
closer = None
components = ['Documentation', 'Interpreter Core', 'Unicode']
creation = <Date 2015-09-30.05:19:45.865>
creator = 'shreevatsa'
dependencies = []
files = []
hgrepos = []
issue_num = 25275
keywords = []
message_count = 8.0
messages = ['251915', '251930', '251932', '251965', '251966', '251967', '251968', '251991']
nosy_count = 6.0
nosy_names = ['vstinner', 'ezio.melotti', 'r.david.murray', 'docs@python', 'martin.panter', 'shreevatsa']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue25275'
versions = ['Python 2.7', 'Python 3.4', 'Python 3.5', 'Python 3.6']

shreevatsa · 2015-09-30T05:19:45Z

Summary: This is about int(u'१२३४') == 1234.

At https://docs.python.org/2/library/functions.html and also https://docs.python.org/3/library/functions.html the documentation for

     class int(x=0)
     class int(x, base=10)

says (respectively):

If x is not a number or if base is given, then x must be a string or Unicode object representing an integer literal in radix base.

If x is not a number or if base is given, then x must be a string, bytes, or bytearray instance representing an integer literal in radix base.

If you follow the definition of "integer literal" into the reference (https://docs.python.org/2/reference/lexical_analysis.html#integers and https://docs.python.org/3/reference/lexical_analysis.html#integers respectively), the definitions ultimately involve

 nonzerodigit   ::=  "1"..."9"
 octdigit       ::=  "0"..."7"
 bindigit       ::=  "0" | "1"
 digit          ::=  "0"..."9"

So it looks like whether the behaviour of int() conforms to its documentation hinges on what "representing" means. Apparently it is some definition under which u'१२३४' represents the integer literal 1234, but it would be great to either clarify the documentation of int() or change its behaviour.

bitdancer · 2015-09-30T13:16:34Z

Apparently that documentation is simply wrong. The actual definition of what 'int' handles is *different* from what the parser handles. I think that difference must constitute a bug (not just a doc bug), but I'm not sure if it is something that we want to fix (changing the parser).

I think the *operational* definition of int conversion for both is the same as for isdigit in python3 (https://docs.python.org/3/library/stdtypes.html#str.isdigit). (The python2 docs just say '8 bit strings may be locale dependent', which means the same thing but is less precise).

    >>> १२३४
      File "<stdin>", line 1
        १२३४
           ^
    SyntaxError: invalid character in identifier
    >>> int('१२३४')
    1234
    >>> '१२३४'.isdigit()
    True

The above behavior discrepancy doesn't apply to python2, since in python2 you can't use unicode in integer literals.

So, this is a bit of a mess :(.

The doc fix is simple: just replace the mention of integer literal with a link to isdigit, and fix the python2 isdigit docs to match python3's.

bitdancer · 2015-09-30T13:18:49Z

I mean, in python2 you can't use unicode in python code, only in strings, as opposed to python3 where unicode is valid in identifiers (but not integer literals, obviously).
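
The split between the parser and int() described above can also be checked from code rather than at the prompt. The following is a minimal sketch for Python 3 only; the helper name `parses_as_literal` is made up for illustration and is not part of any library.

```python
def parses_as_literal(text):
    """Return True if the Python 3 compiler accepts `text` as an expression.

    Hypothetical helper for illustration only.
    """
    try:
        compile(text, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# DEVANAGARI DIGIT ONE..FOUR, i.e. the string '१२३४' from the report.
devanagari = "\u0967\u0968\u0969\u096a"

# The parser does not treat these characters as an integer literal ...
literal_ok = parses_as_literal(devanagari)

# ... but int() converts the same characters when given as a string.
converted = int(devanagari)

print(literal_ok, converted)
```

Using compile() keeps the check independent of the exact SyntaxError message, which differs between Python versions.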

shreevatsa · 2015-09-30T20:45:10Z

Minor difference, but the relevant function for int() is not quite isdigit(), e.g.:

    >>> import unicodedata
    >>> s = u'\u2460'
    >>> unicodedata.name(s)
    'CIRCLED DIGIT ONE'
    >>> print s
    ①
    >>> s.isdigit()
    True
    >>> s.isdecimal()
    False
    >>> int(s)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'decimal' codec can't encode character u'\u2460' in position 0: invalid decimal Unicode string

It seems to be isdecimal(), with the addition that leading and trailing space-like characters are allowed around the digits (e.g. 5760 / U+1680 OGHAM SPACE MARK, 8195 / U+2003 EM SPACE, or 12288 / U+3000 IDEOGRAPHIC SPACE):

    >>> 987 == int(u'\u3000\n 987\u1680\t')
    True
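
The rule guessed at above (strip whitespace, then require isdecimal()) can be written down as a rough predicate. This is only a sketch for Python 3 under stated assumptions: the function name `looks_like_base10_int` is hypothetical, the optional sign handling is an addition not discussed in this thread, and it deliberately ignores other details of int() such as underscore separators in newer Python versions.

```python
def looks_like_base10_int(s):
    """Approximate which strings int(s) accepts in base 10.

    Hypothetical helper: strips Unicode whitespace, allows one
    leading ASCII sign, then requires every remaining character
    to be a decimal digit (str.isdecimal).
    """
    body = s.strip()
    if body[:1] in ("+", "-"):
        body = body[1:]
    # isdecimal() is False for the empty string, so "" and " " are rejected.
    return body.isdecimal()


padded = "\u3000\n 987\u1680\t"   # digits surrounded by various spaces
circled_one = "\u2460"            # CIRCLED DIGIT ONE: isdigit() but not isdecimal()

print(looks_like_base10_int(padded), int(padded))
print(looks_like_base10_int(circled_one))
```

In Python 3 the failing case raises ValueError rather than the UnicodeEncodeError shown in the Python 2 transcript above.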

shreevatsa · 2015-09-30T20:48:48Z

About the mismatch: of course it's probably not a good idea to change the parser (so that simply typing १२३४ in Python 3 code is like typing 1234), but how about changing the behaviour of int()? Not sure whether anyone should be relying on int(u'१२३४') being 1234, given that it is not documented as such.

bitdancer · 2015-09-30T20:54:43Z

Good catch. Yes, it is already documented that int() ignores leading and trailing whitespace.

But, even that isn't quite correct:

    >>> 'A'.isdecimal()
    False
    >>> int('A', 16)
    10

I seem to vaguely recall a discussion somewhere in this tracker about what "should" count as digits for radices larger than ten, but I don't remember the outcome.
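
To make the base-16 question concrete, here is a small sketch of what can be probed in Python 3. The choice of test strings is an assumption made for illustration; the results describe CPython's observed behaviour, not a documented guarantee.

```python
def try_int(text, base):
    """Return int(text, base), or None if int() raises ValueError."""
    try:
        return int(text, base)
    except ValueError:
        return None


# ASCII letter digit, as in the transcript above.
ascii_hex = try_int("A", 16)

# DEVANAGARI DIGIT ONE followed by ASCII 'A': a Unicode decimal digit
# mixed with an ASCII letter digit.
mixed = try_int("\u0967A", 16)

# FULLWIDTH LATIN CAPITAL LETTER A: a non-ASCII letter, not a decimal digit.
fullwidth_letter = try_int("\uff21", 16)

print(ascii_hex, mixed, fullwidth_letter)
```

In CPython 3 the Unicode decimal digit is accepted alongside the ASCII letter, while the full-width letter is rejected, so only the decimal digits get the Unicode-aware treatment.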

bitdancer · 2015-09-30T20:56:09Z

No, we can't make it stop working for int, that would be a backward compatibility break. Doing so was discussed at one point and rejected (another issue somewhere in this tracker :)

vadmium · 2015-10-01T04:07:07Z

Related discussion and background in bpo-10581, although that report seems to be geared at extending the Unicode support even further (disallowing mixed scripts, allowing proper minus signs, full-width characters, Roman numerals, etc).

The existing support is actually documented if you know where to look; see the sixth note under the table at <https://docs.python.org/dev/library/stdtypes.html#typesnumeric>. I agree that the reference for each constructor should document this as well.
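
The note referred to above describes the accepted digits in terms of the Unicode Nd (decimal number) property. A short sketch using the standard unicodedata module shows how the characters from this thread are classified; the helper name `is_nd` is made up for illustration.

```python
import unicodedata


def is_nd(ch):
    """Return True if `ch` has the Unicode general category Nd."""
    return unicodedata.category(ch) == "Nd"


devanagari = "\u0967\u0968\u0969\u096a"   # '१२३४'
circled_one = "\u2460"                    # '①'

# Every Devanagari digit is in category Nd and has a decimal value.
all_nd = all(is_nd(ch) for ch in devanagari)
values = [unicodedata.decimal(ch) for ch in devanagari]

# CIRCLED DIGIT ONE has a digit value but is not in category Nd,
# which matches isdigit() being True and isdecimal() being False.
circled_category = unicodedata.category(circled_one)
circled_digit = unicodedata.digit(circled_one)

print(all_nd, values, circled_category, circled_digit)
```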

shreevatsa mannequin assigned docs@python Sep 30, 2015

shreevatsa mannequin added the docs (Documentation in the Doc dir), interpreter-core (Objects, Python, Grammar, and Parser dirs), and topic-unicode labels Sep 30, 2015

bitdancer added the type-bug (An unexpected behavior, bug, or error) label Sep 30, 2015

ezio-melotti transferred this issue from another repository Apr 10, 2022
