Create PyUnicode_AsWideCharString() function #54188

vstinner · 2010-09-29T00:20:49Z

BPO	9979
Nosy	@malemburg, @amauryfa, @vstinner, @ezio-melotti
Files	pyunicode_aswidecharstring-2.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2010-09-29.10:41:57.092>
created_at = <Date 2010-09-29.00:20:49.186>
labels = ['interpreter-core', 'expert-unicode']
title = 'Create PyUnicode_AsWideCharString() function'
updated_at = <Date 2010-09-29.10:41:57.091>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2010-09-29.10:41:57.091>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2010-09-29.10:41:57.092>
closer = 'vstinner'
components = ['Interpreter Core', 'Unicode']
creation = <Date 2010-09-29.00:20:49.186>
creator = 'vstinner'
dependencies = []
files = ['19055']
hgrepos = []
issue_num = 9979
keywords = ['patch']
message_count = 6.0
messages = ['117566', '117570', '117577', '117578', '117586', '117592']
nosy_count = 4.0
nosy_names = ['lemburg', 'amaury.forgeotdarc', 'vstinner', 'ezio.melotti']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue9979'
versions = ['Python 3.2']

vstinner · 2010-09-29T00:20:45Z

PyUnicode_AsWideChar() doesn't merge surrogate pairs on a system with 32 bits wchar_t and Python compiled in narrow mode (sizeof(wchar_t) == 4 and sizeof(Py_UNICODE) == 2) => see issue bpo-8670.

It is not easy to fix this problem because the callers of PyUnicode_AsWideChar() suppose that the output (wide character) string has the same length (in character) than the input (PyUnicode) string (suppose that sizeof(wchar_t) == sizeof(Py_UNICODE)). And PyUnicode_AsWideChar() doesn't write nul character at the end if the output string is truncated.

To prepare this change, a new PyUnicode_AsWideCharString() function would help because it does compute the size of the output buffer (whereas PyUnicode_AsWideChar() requires the output buffer in an argument).

Attached patch implements it:
-------
/* Convert the Unicode object to a wide character string. The output string
always ends with a nul character. If size is not NULL, write the number of
wide characters (including the final nul character) into *size.

Returns a buffer allocated by PyMem_Alloc() (use PyMem_Free() to free it) on
success. On error, returns NULL and *size is undefined. */

PyAPI_FUNC(wchar_t*) PyUnicode_AsWideCharString(
    PyUnicodeObject *unicode,   /* Unicode object */
    Py_ssize_t *size            /* number of characters of the result */
    );

vstinner · 2010-09-29T01:00:55Z

New version of the patch:

fix PyUnicode_AsWideCharString() :-)
replace PyUnicode_AsWideChar() by PyUnicode_AsWideCharString() in most functions using PyUnicode_AsWideChar()
indicate that PyUnicode_AsWideCharString() raises a MemoryError on error

Keep the call to PyUnicode_AsWideChar() in:

Modules/getpath.c because getpath.c uses a global limitation of MAXPATHLEN+1 characters
WCharArray_set_value() and U_set() of ctypes because the output buffer size is fixed

malemburg · 2010-09-29T06:54:19Z

STINNER Victor wrote:

New submission from STINNER Victor <victor.stinner@haypocalc.com>:

PyUnicode_AsWideChar() doesn't merge surrogate pairs on a system with 32 bits wchar_t and Python compiled in narrow mode (sizeof(wchar_t) == 4 and sizeof(Py_UNICODE) == 2) => see issue bpo-8670.

It is not easy to fix this problem because the callers of PyUnicode_AsWideChar() suppose that the output (wide character) string has the same length (in character) than the input (PyUnicode) string (suppose that sizeof(wchar_t) == sizeof(Py_UNICODE)). And PyUnicode_AsWideChar() doesn't write nul character at the end if the output string is truncated.

To prepare this change, a new PyUnicode_AsWideCharString() function would help because it does compute the size of the output buffer (whereas PyUnicode_AsWideChar() requires the output buffer in an argument).

Great idea !

amauryfa · 2010-09-29T07:11:26Z

+1 from me as well.
But shouldn't PyUnicode_AsWideCharString() merge surrogate pairs when it can? The implementation doesn't do this.

vstinner · 2010-09-29T09:14:24Z

But shouldn't PyUnicode_AsWideCharString() merge surrogate pairs when it
can? The implementation doesn't do this.

I don't want to do two different things at the same time. My plan is:

create PyUnicode_AsWideCharString()
use PyUnicode_AsWideCharString() everywhere
patch unicode_aswidechar() (used by PyUnicode_AsWideChar() and
PyUnicode_AsWideCharString()) to convert surrogates when needed

So, you agree with the API (and the documentation)?

vstinner · 2010-09-29T10:41:57Z

I fixed in this issue in multiple commits:

r85093: create PyUnicode_AsWideCharString()
r85094: use it in import.c
r85095: use it for _locale.strcoll()
r85096: use it for time.strftime()
r85097: use it in _ctypes module

So, you agree with the API (and the documentation)?

Well, you can now directly patch the documentation. I think that the API is simple and fine :-)

vstinner added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels Sep 29, 2010

vstinner closed this as completed Sep 29, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create PyUnicode_AsWideCharString() function #54188

Create PyUnicode_AsWideCharString() function #54188

vstinner commented Sep 29, 2010

vstinner commented Sep 29, 2010

vstinner commented Sep 29, 2010

malemburg commented Sep 29, 2010

amauryfa commented Sep 29, 2010

vstinner commented Sep 29, 2010

vstinner commented Sep 29, 2010

Create PyUnicode_AsWideCharString() function #54188

Create PyUnicode_AsWideCharString() function #54188

Comments

vstinner commented Sep 29, 2010

vstinner commented Sep 29, 2010

vstinner commented Sep 29, 2010

malemburg commented Sep 29, 2010

amauryfa commented Sep 29, 2010

vstinner commented Sep 29, 2010

vstinner commented Sep 29, 2010