Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create PyUnicode_AsWideCharString() function #54188

Closed
vstinner opened this issue Sep 29, 2010 · 6 comments
Closed

Create PyUnicode_AsWideCharString() function #54188

vstinner opened this issue Sep 29, 2010 · 6 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode

Comments

@vstinner
Copy link
Member

BPO 9979
Nosy @malemburg, @amauryfa, @vstinner, @ezio-melotti
Files
  • pyunicode_aswidecharstring-2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-09-29.10:41:57.092>
    created_at = <Date 2010-09-29.00:20:49.186>
    labels = ['interpreter-core', 'expert-unicode']
    title = 'Create PyUnicode_AsWideCharString() function'
    updated_at = <Date 2010-09-29.10:41:57.091>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2010-09-29.10:41:57.091>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-09-29.10:41:57.092>
    closer = 'vstinner'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2010-09-29.00:20:49.186>
    creator = 'vstinner'
    dependencies = []
    files = ['19055']
    hgrepos = []
    issue_num = 9979
    keywords = ['patch']
    message_count = 6.0
    messages = ['117566', '117570', '117577', '117578', '117586', '117592']
    nosy_count = 4.0
    nosy_names = ['lemburg', 'amaury.forgeotdarc', 'vstinner', 'ezio.melotti']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue9979'
    versions = ['Python 3.2']

    @vstinner
    Copy link
    Member Author

    PyUnicode_AsWideChar() doesn't merge surrogate pairs on a system with 32 bits wchar_t and Python compiled in narrow mode (sizeof(wchar_t) == 4 and sizeof(Py_UNICODE) == 2) => see issue bpo-8670.

    It is not easy to fix this problem because the callers of PyUnicode_AsWideChar() suppose that the output (wide character) string has the same length (in character) than the input (PyUnicode) string (suppose that sizeof(wchar_t) == sizeof(Py_UNICODE)). And PyUnicode_AsWideChar() doesn't write nul character at the end if the output string is truncated.

    To prepare this change, a new PyUnicode_AsWideCharString() function would help because it does compute the size of the output buffer (whereas PyUnicode_AsWideChar() requires the output buffer in an argument).

    Attached patch implements it:
    -------
    /* Convert the Unicode object to a wide character string. The output string
    always ends with a nul character. If size is not NULL, write the number of
    wide characters (including the final nul character) into *size.

    Returns a buffer allocated by PyMem_Alloc() (use PyMem_Free() to free it) on
    success. On error, returns NULL and *size is undefined. */

    PyAPI_FUNC(wchar_t*) PyUnicode_AsWideCharString(
        PyUnicodeObject *unicode,   /* Unicode object */
        Py_ssize_t *size            /* number of characters of the result */
        );

    @vstinner vstinner added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels Sep 29, 2010
    @vstinner
    Copy link
    Member Author

    New version of the patch:

    • fix PyUnicode_AsWideCharString() :-)
    • replace PyUnicode_AsWideChar() by PyUnicode_AsWideCharString() in most functions using PyUnicode_AsWideChar()
    • indicate that PyUnicode_AsWideCharString() raises a MemoryError on error

    Keep the call to PyUnicode_AsWideChar() in:

    • Modules/getpath.c because getpath.c uses a global limitation of MAXPATHLEN+1 characters
    • WCharArray_set_value() and U_set() of ctypes because the output buffer size is fixed

    @malemburg
    Copy link
    Member

    STINNER Victor wrote:

    New submission from STINNER Victor <victor.stinner@haypocalc.com>:

    PyUnicode_AsWideChar() doesn't merge surrogate pairs on a system with 32 bits wchar_t and Python compiled in narrow mode (sizeof(wchar_t) == 4 and sizeof(Py_UNICODE) == 2) => see issue bpo-8670.

    It is not easy to fix this problem because the callers of PyUnicode_AsWideChar() suppose that the output (wide character) string has the same length (in character) than the input (PyUnicode) string (suppose that sizeof(wchar_t) == sizeof(Py_UNICODE)). And PyUnicode_AsWideChar() doesn't write nul character at the end if the output string is truncated.

    To prepare this change, a new PyUnicode_AsWideCharString() function would help because it does compute the size of the output buffer (whereas PyUnicode_AsWideChar() requires the output buffer in an argument).

    Great idea !

    @amauryfa
    Copy link
    Member

    +1 from me as well.
    But shouldn't PyUnicode_AsWideCharString() merge surrogate pairs when it can? The implementation doesn't do this.

    @vstinner
    Copy link
    Member Author

    But shouldn't PyUnicode_AsWideCharString() merge surrogate pairs when it
    can? The implementation doesn't do this.

    I don't want to do two different things at the same time. My plan is:

    • create PyUnicode_AsWideCharString()
    • use PyUnicode_AsWideCharString() everywhere
    • patch unicode_aswidechar() (used by PyUnicode_AsWideChar() and
      PyUnicode_AsWideCharString()) to convert surrogates when needed

    So, you agree with the API (and the documentation)?

    @vstinner
    Copy link
    Member Author

    I fixed in this issue in multiple commits:

    • r85093: create PyUnicode_AsWideCharString()
    • r85094: use it in import.c
    • r85095: use it for _locale.strcoll()
    • r85096: use it for time.strftime()
    • r85097: use it in _ctypes module

    So, you agree with the API (and the documentation)?

    Well, you can now directly patch the documentation. I think that the API is simple and fine :-)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants