Simplify converting non-ASCII strings to int, float and complex #76160

serhiy-storchaka · 2017-11-08T12:21:21Z

BPO	31979
Nosy	@vstinner, @ezio-melotti, @serhiy-storchaka
PRs	bpo-31979: Simplify transforming decimals to ASCII #4336 bpo-31979: Remove unused align_maxchar() function #4527 bpo-34087: Fix buffer overflow in int(s) and similar functions #8274

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2017-11-13.19:25:21.264>
created_at = <Date 2017-11-08.12:21:21.212>
labels = ['type-feature', '3.7', 'expert-unicode']
title = 'Simplify converting non-ASCII strings to int, float and complex'
updated_at = <Date 2018-07-13.13:12:17.771>
user = 'https://github.com/serhiy-storchaka'

bugs.python.org fields:

activity = <Date 2018-07-13.13:12:17.771>
actor = 'methane'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2017-11-13.19:25:21.264>
closer = 'serhiy.storchaka'
components = ['Unicode']
creation = <Date 2017-11-08.12:21:21.212>
creator = 'serhiy.storchaka'
dependencies = []
files = []
hgrepos = []
issue_num = 31979
keywords = ['patch']
message_count = 6.0
messages = ['305824', '306163', '306166', '306167', '306168', '306852']
nosy_count = 3.0
nosy_names = ['vstinner', 'ezio.melotti', 'serhiy.storchaka']
pr_nums = ['4336', '4527', '8274']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue31979'
versions = ['Python 3.7']

serhiy-storchaka · 2017-11-08T12:21:21Z

This issue is inspired by the tweet https://twitter.com/dabeaz/status/925787482515533830

    >>> a = 'n'
    >>> b = 'ñ'
    >>> sys.getsizeof(a)
   50
    >>> sys.getsizeof(b)
   74
    >>> float(b)
   Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
   ValueError: could not convert string to float: 'ñ'
    >>> sys.getsizeof(b)
   77

See also a discussion on Python-list (https://mail.python.org/pipermail/python-list/2017-November/728149.html) and Stack Overflow (https://stackoverflow.com/questions/47062184/why-does-the-size-of-this-python-string-change-on-a-failed-int-conversion).

When convert a non-ASCII string which don't contain non-ASCII decimal digits and spaces, this will fail, but will change the size of the input string due to creating an internal UTF-8 representation.

This can look surprising for beginners, but there is nothing wrong here. There are many cases in which an internal UTF-8 representation is created implicitly.

But looking on the code I have found that it is too complicated. Parsers to int, float and complex call _PyUnicode_TransformDecimalAndSpaceToASCII() which transforms non-ASCII decimal digits and spaces to ASCII. This functions uses the general function fixup() which takes a transformation function, creates a new string object, apply the transformation. It checks if the transformation produces a string with larger maximal character code than the original string and creates a new string object and repeat the transformation in that case. Finally, if the resulting string is equal to the original string, destroy the resulting string and returns the original string. In the past fixup() was used for implementing methods like upper(). But now _PyUnicode_TransformDecimalAndSpaceToASCII() is only the user of fixup(), and it doesn't need all complicated logic of fixup(). For example, this transformation never produces wider strings.

The proposed PR simplifies the code by getting rid of fixup() and fix_decimal_and_space_to_ascii(). The semantic of _PyUnicode_TransformDecimalAndSpaceToASCII() has been changed. It now always produces ASCII string (this also simplifies caller places). If non-ASCII characters which is not a decimal digit and is not a space is encountered, the rest of the string is replaced with '?' which will cause an error in parsers.

The only visible behavior change (except not changing the size of the original string) is changing the exception raised by float() and complex() when the string contains lone surrogates from UnicodeEncodeError to ValueError (the same as for other malformed strings). int() already contained a special case for this and raised a ValueError.

Unpatched:

>>> int('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\ud800'
>>> float('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
>>> complex('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

Patched:

>>> int('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\ud800'
>>> float('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '\ud800'
>>> complex('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: complex() arg is a malformed string

This PR saves around 80 lines of code.

vstinner · 2017-11-13T17:09:43Z

I don't think that the weird behaviour justify to backport this non-trivial change. I propose to only apply the change in Python 3.7.

serhiy-storchaka · 2017-11-13T19:22:01Z

As a side effect it slightly optimizes parsing non-ASCII numbers.

$ ./python -m perf timeit --compare-to=./python0  'int("۱۲۳۴۵۶۷۸۹")' --duplicate 100
python0: ..................... 277 ns +- 3 ns
python: ..................... 225 ns +- 3 ns

Mean +- std dev: [python0] 277 ns +- 3 ns -> [python] 225 ns +- 3 ns: 1.23x faster (-19%)

$ ./python -m perf timeit --compare-to=./python0  'float("۱۲۳۴۵.۶۷۸۹")' --duplicate 100
python0: ..................... 256 ns +- 1 ns
python: ..................... 199 ns +- 2 ns

Mean +- std dev: [python0] 256 ns +- 1 ns -> [python] 199 ns +- 2 ns: 1.29x faster (-22%)

$ ./python -m perf timeit --compare-to=./python0  'complex("۱۲۳۴۵.۶۷۸۹j")' --duplicate 100
python0: ..................... 298 ns +- 4 ns
python: ..................... 235 ns +- 3 ns

Mean +- std dev: [python0] 298 ns +- 4 ns -> [python] 235 ns +- 3 ns: 1.27x faster (-21%)

serhiy-storchaka · 2017-11-13T19:23:52Z

New changeset 9b6c60c by Serhiy Storchaka in branch 'master':
bpo-31979: Simplify transforming decimals to ASCII (bpo-4336)
9b6c60c

serhiy-storchaka · 2017-11-13T19:26:02Z

Thank you for your review Victor.

vstinner · 2017-11-23T18:02:26Z

New changeset 6a54c67 by Victor Stinner in branch 'master':
bpo-31979: Remove unused align_maxchar() function (bpo-4527)
6a54c67

serhiy-storchaka added the 3.7 (EOL) end of life label Nov 8, 2017

serhiy-storchaka self-assigned this Nov 8, 2017

serhiy-storchaka added topic-unicode type-feature A feature request or enhancement labels Nov 8, 2017

serhiy-storchaka closed this as completed Nov 13, 2017

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify converting non-ASCII strings to int, float and complex #76160

Simplify converting non-ASCII strings to int, float and complex #76160

serhiy-storchaka commented Nov 8, 2017

serhiy-storchaka commented Nov 8, 2017

vstinner commented Nov 13, 2017

serhiy-storchaka commented Nov 13, 2017

serhiy-storchaka commented Nov 13, 2017

serhiy-storchaka commented Nov 13, 2017

vstinner commented Nov 23, 2017

Simplify converting non-ASCII strings to int, float and complex #76160

Simplify converting non-ASCII strings to int, float and complex #76160

Comments

serhiy-storchaka commented Nov 8, 2017

serhiy-storchaka commented Nov 8, 2017

vstinner commented Nov 13, 2017

serhiy-storchaka commented Nov 13, 2017

serhiy-storchaka commented Nov 13, 2017

serhiy-storchaka commented Nov 13, 2017

vstinner commented Nov 23, 2017