Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyUnicode_AsEncodedString, PyUnicode_Decode: add fast-path for "us-ascii" encoding #72125

Closed
vstinner opened this issue Sep 2, 2016 · 17 comments
Assignees
Labels
3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage topic-unicode

Comments

@vstinner
Copy link
Member

vstinner commented Sep 2, 2016

BPO 27938
Nosy @vstinner, @scop, @ezio-melotti, @vadmium, @serhiy-storchaka, @koobs, @zooba
PRs
  • bpo-28393: Update encoding lookup docs wrt bpo-27938 #4871
  • [3.6] bpo-28393: Update encoding lookup docs wrt bpo-27938 (GH-4871) #4881
  • Files
  • normalize_encoding.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/vstinner'
    closed_at = <Date 2016-09-10.06:33:25.251>
    created_at = <Date 2016-09-02.10:19:34.089>
    labels = ['interpreter-core', '3.7', 'expert-unicode', 'performance']
    title = 'PyUnicode_AsEncodedString, PyUnicode_Decode: add fast-path for "us-ascii" encoding'
    updated_at = <Date 2018-04-04.17:55:28.488>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2018-04-04.17:55:28.488>
    actor = 'serhiy.storchaka'
    assignee = 'vstinner'
    closed = True
    closed_date = <Date 2016-09-10.06:33:25.251>
    closer = 'vstinner'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2016-09-02.10:19:34.089>
    creator = 'vstinner'
    dependencies = []
    files = ['44345']
    hgrepos = []
    issue_num = 27938
    keywords = ['patch']
    message_count = 17.0
    messages = ['274222', '274232', '274455', '274456', '274512', '274692', '274720', '274732', '274796', '274831', '274834', '275574', '275576', '275577', '275625', '308374', '308392']
    nosy_count = 8.0
    nosy_names = ['vstinner', 'scop', 'ezio.melotti', 'python-dev', 'martin.panter', 'serhiy.storchaka', 'koobs', 'steve.dower']
    pr_nums = ['4871', '4881']
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue27938'
    versions = ['Python 3.6', 'Python 3.7']

    @vstinner
    Copy link
    Member Author

    vstinner commented Sep 2, 2016

    The "us-ascii" encoding is an alias to the Python ASCII encoding. PyUnicode_AsEncodedString() and PyUnicode_Decode() functions have a fast-path for the "ascii" string, but not for "us-ascii".

    Attached patch uses also the fast-path for "us-ascii". It's a more generic change than the issue bpo-27915. The "us-ascii" name is common in the email and xml.etree modules.

    Other changes of the patch:

    • Rewrite _Py_normalize_encoding() as a C implementation of encodings.normalize_encoding(). For example, " utf-8 " is now normalized to "utf_8". So the fast path is now used for more name variants of the same encoding.
    • Avoid strcpy() when encoding is NULL: call directly the UTF-8 codec
    • Reorder encodings: UTF-8, ASCII, MBCS, Latin1, UTF-16
    • Remove fast-path for UTF-32: seriously, nobody uses this codec. Latin9 is much faster but has no fast-path.

    @vstinner vstinner added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode performance Performance or resource usage labels Sep 2, 2016
    @serhiy-storchaka
    Copy link
    Member

    See also get_standard_encoding() in Python/codecs.c. I suppose it is faster.

    UTF-32 is rarely used as external encoding, but it is still used as internal encoding in some programming languages and libraries (e.g. wchar_t* in C and std::wstring in C++ on Linux). The codec itself is very fast. I would add fast path all utf encodings (except utf-7).

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 5, 2016

    New changeset 99818330b4c0 by Victor Stinner in branch 'default':
    Issue bpo-27938: Add a fast-path for us-ascii encoding
    https://hg.python.org/cpython/rev/99818330b4c0

    @vstinner
    Copy link
    Member Author

    vstinner commented Sep 5, 2016

    See also get_standard_encoding() in Python/codecs.c. I suppose it is faster.

    I understand that PyCodec_SurrogatePassErrors() is already called with a normalized encoding name.

    With my enhanced _Py_normalize_encoding(), strange syntaxes like " utf 8 " also take the fast path.

    UTF-32 is rarely used as external encoding, but ...

    Ok, I used the same design than get_standard_encoding() to match the "utf" prefix, so having a fast-path for UTF-16 and UTF-32 doesn't add new strcmp() for "latin9".

    I pushed my change, so I close the issue.

    @vstinner vstinner closed this as completed Sep 5, 2016
    @vadmium
    Copy link
    Member

    vadmium commented Sep 6, 2016

    It seems this change is the cause of the Free BSD buildbot failures. From memory, both failing cases involve sending or receiving non-ASCII bytes in child Python processes.

    http://buildbot.python.org/all/builders/AMD64%20FreeBSD%20CURRENT%20Non-Debug%203.x/builds/110/steps/test/logs/stdio

    ======================================================================
    FAIL: test_non_ascii (test.test_cmd_line_script.CmdLineTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/usr/home/buildbot/python/3.x.koobs-freebsd-current.nondebug/build/Lib/test/test_cmd_line_script.py", line 517, in test_non_ascii
        rc, stdout, stderr = assert_python_ok(script_name)
      File "/usr/home/buildbot/python/3.x.koobs-freebsd-current.nondebug/build/Lib/test/support/script_helper.py", line 139, in assert_python_ok
        return _assert_python(True, *args, **env_vars)
      File "/usr/home/buildbot/python/3.x.koobs-freebsd-current.nondebug/build/Lib/test/support/script_helper.py", line 125, in _assert_python
        err))
    AssertionError: Process return code is 1
    command line: ['/usr/home/buildbot/python/3.x.koobs-freebsd-current.nondebug/build/python', '-X', 'faulthandler', '-I', './@test_60885_tmp\udce7w\udcf0.py']

    stdout:
    ---

    ---

    stderr:
    ---
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe7' in position 17: ordinal not in range(128)
    ---

    ======================================================================
    FAIL: test_nonascii (test.test_readline.TestReadline)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/usr/home/buildbot/python/3.x.koobs-freebsd-current.nondebug/build/Lib/test/test_readline.py", line 203, in test_nonascii
        self.assertIn(b"text 't\\xeb'\r\n", output)
    AssertionError: b"text 't\\xeb'\r\n" not found in bytearray(b"^A^B^B^B^B^B^B^B\t\tx\t\r\n[\x07\r\x07\x07\x07\x07\x07\x07\x07\x07x[\x08\x07\r\nresult \'x[\'\r\nhistory \'x[\'\r\n")

    @koobs
    Copy link

    koobs commented Sep 7, 2016

    Re-open and assign for regressions. Observed in all koobs-freebsd* buildbots (9/10/11) and build types. Issue is in default branch (add version 3.7)

    First failing test run: http://buildbot.python.org/all/builders/AMD64%20FreeBSD%20CURRENT%20Non-Debug%203.x/builds/110

    @koobs koobs added the 3.7 (EOL) end of life label Sep 7, 2016
    @vadmium
    Copy link
    Member

    vadmium commented Sep 7, 2016

    Koobs if you can, it would be good to understand where the failure is. My guess is that Python doesn’t like running a non-ASCII filename. The following is hopefully a simplified version of the test_cmd_line_script test case:

    import os, subprocess, sys
    
    script_name = os.fsdecode(b'./\xE7w\xF0.py')
    script_file = open(script_name, 'w', encoding='utf-8')
    script_file.write('print(ascii(__file__))\n')
    script_file.close()
    
    cmd_line = [sys.executable, '-X', 'faulthandler', '-I', script_name]
    env = os.environ.copy()
    env['TERM'] = ''
    proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=env)
    out, err = proc.communicate()
    print(proc.returncode)  # Should be 0 but Free BSD has 1
    print(repr(err))  # Error is about encoding 0xE7 with ASCII
    print(repr(out))  # If executed, this would be the file name

    Hopefully fixing the above problem will help with the test_readline failure. The readline test case does Readline (tab) completions involving non-ASCII text, and it seems that the Python completion routine is no longer being called.

    @vstinner
    Copy link
    Member Author

    vstinner commented Sep 7, 2016

    Sorry, but I don't have enough information to fix the issue. I don't see how my change can break the two failing tests. Could you please try to collect more information manually?

    @vstinner vstinner reopened this Sep 7, 2016
    @serhiy-storchaka
    Copy link
    Member

    Maybe Windows buildbots failures are related:

    http://buildbot.python.org/all/builders/AMD64%20Windows7%20SP1%203.x/builds/8294/steps/test/logs/stdio

    ======================================================================
    FAIL: test_create_at_shutdown_without_encoding (test.test_io.PyTextIOWrapperTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_io.py", line 3174, in test_create_at_shutdown_without_encoding
        self.assertIn(self.shutdown_error, err.decode())
    AssertionError: 'LookupError: unknown encoding: ascii' not found in 'Exception ignored in: <bound method C.__del__ of <__main__.C object at 0x000000000123BF60>>\r\nTraceback (most recent call last):\r\n  File "<string>", line 12, in __del__\r\n  File "C:\\buildbot.python.org\\3.x.kloth-win64\\build\\lib\\_pyio.py", line 1934, in __init__\r\n  File "C:\\buildbot.python.org\\3.x.kloth-win64\\build\\lib\\encodings\\__init__.py", line 158, in _alias_mbcs\r\nImportError: sys.meta_path is None, Python is likely shutting down'

    @zooba
    Copy link
    Member

    zooba commented Sep 7, 2016

    The Windows buildbot failures are partly my fault and partly Ben's fault (I created a new error message - Ben added it to the wrong test), so I'll go and prevent the error message.

    No idea on the other issue. It doesn't repro for me, but since it seems to be FreeBSD readline related that isn't a surprise.

    @vstinner
    Copy link
    Member Author

    vstinner commented Sep 7, 2016

    FAIL: test_create_at_shutdown_without_encoding (test.test_io.PyTextIOWrapperTest)

    Steve fixed it:
    ---
    changeset: 103229:47b4dbd451f5
    tag: tip
    user: Steve Dower <steve.dower@microsoft.com>
    date: Wed Sep 07 09:31:52 2016 -0700
    files: Lib/encodings/init.py Lib/test/test_io.py
    description:
    Issue bpo-27959: Prevent ImportError from escaping codec search function
    ---

    Its new search function now catchs ImportError as expected.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 10, 2016

    New changeset 3b185df3a3e2 by Victor Stinner in branch 'default':
    Fix check_force_ascii()
    https://hg.python.org/cpython/rev/3b185df3a3e2

    @vstinner
    Copy link
    Member Author

    New changeset 3b185df3a3e2 by Victor Stinner in branch 'default':
    Fix check_force_ascii()
    https://hg.python.org/cpython/rev/3b185df3a3e2

    @koobs: That's my tiny gift for your birthday. Happy Birthday! ;-) (It should fix FreeBSD buildbots.)

    @vstinner
    Copy link
    Member Author

    Sorry for the little breakage of FreeBSD buildbots, it seems to be ok now ;-)

    @koobs
    Copy link

    koobs commented Sep 10, 2016

    @victor I was just checking this issue to copy the test command, to provide results to you both when I saw the lovely surprise. Thank you :)

    @vstinner
    Copy link
    Member Author

    New changeset 297fd87 by Victor Stinner (Ville Skyttä) in branch 'master':
    bpo-28393: Update encoding lookup docs wrt bpo-27938 (bpo-4871)
    297fd87

    @vstinner
    Copy link
    Member Author

    New changeset 77bf6da by Victor Stinner (Miss Islington (bot)) in branch '3.6':
    bpo-28393: Update encoding lookup docs wrt bpo-27938 (GH-4871) (bpo-4881)
    77bf6da

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage topic-unicode
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants