Make PyUnicode_AsWideCharString() increase temporary #60458

dabeaz · 2012-10-16T20:14:09Z

BPO	16254
Nosy	@loewis, @terryjreedy, @vstinner, @ezio-melotti, @methane

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2019-04-13.15:34:39.679>
created_at = <Date 2012-10-16.20:14:08.572>
labels = ['extension-modules', 'interpreter-core', 'type-feature', '3.8', 'expert-unicode']
title = 'Make PyUnicode_AsWideCharString() increase temporary'
updated_at = <Date 2019-04-13.15:34:39.674>
user = 'https://bugs.python.org/dabeaz'

bugs.python.org fields:

activity = <Date 2019-04-13.15:34:39.674>
actor = 'methane'
assignee = 'none'
closed = True
closed_date = <Date 2019-04-13.15:34:39.679>
closer = 'methane'
components = ['Extension Modules', 'Interpreter Core', 'Unicode']
creation = <Date 2012-10-16.20:14:08.572>
creator = 'dabeaz'
dependencies = []
files = []
hgrepos = []
issue_num = 16254
keywords = []
message_count = 8.0
messages = ['173089', '173090', '173101', '173104', '173106', '228812', '228842', '340165']
nosy_count = 7.0
nosy_names = ['loewis', 'terry.reedy', 'vstinner', 'ezio.melotti', 'Arfrever', 'methane', 'dabeaz']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue16254'
versions = ['Python 3.8']

dabeaz · 2012-10-16T20:14:08Z

The PyUnicode_AsWideCharString() function is described as creating a new buffer of type wchar_t allocated by PyMem_Alloc() (which must be freed by the user). However, if you use this function, it causes the size of the original string object to permanently increase. For example, suppose you had some extension code like this:

static PyObject *py_receive_wchar(PyObject *self, PyObject *args) {
  PyObject *obj;
  wchar_t *s;
  Py_ssize_t len;

  if (!PyArg_ParseTuple(args, "U", &obj)) {
    return NULL;
  }
  if ((s = PyUnicode_AsWideCharString(obj, &len)) == NULL) {
    return NULL;
  }
  /* Do nothing */
  PyMem_Free(s);
  Py_RETURN_NONE;
}

Now, try an experiment (assume that the above extension function is available as 'receive_wchar').

>>> s = "Hell"*1000
>>> len(s)
4000
>>> import sys
>>> sys.getsizeof(s)
4049
>>> receive_wchar(s)
>>> sys.getsizeof(s)
20053
>>>

It seems that PyUnicode_AsWideCharString() may be filling in the wstr field of the associated PyASCIIObject structure from PEP-393 (I haven't verified). Once filled, it never seems to be discarded.

Background: I am trying to figure out how to convert from Unicode to (wchar_t, int *) that doesn't cause a permanent increase in the memory footprint of the original Unicode object. Also, I'm trying to stay away from deprecated Unicode APIs.

dabeaz · 2012-10-16T20:17:01Z

I should quickly add, is there any way to simply have this function not keep the wchar_t buffer around afterwards? That would be great.

loewis · 2012-10-16T21:35:23Z

As stated, this is not a bug: there is no memory leak, nor any deviation from documented behavior.

You are right that it fills the wstr pointer, by calling PyUnicode_AsUnicodeAndSize in unicode_aswidechar, and then copying the data to a fresh buffer.

This is merely the simplest implementation; it's certainly possible to improve it. Contributions are welcome.

A number of things need to be considered:

Computing the wstr size is somewhat expensive if on a 16-bit wchar_t system, since the result may need surrogate pairs.
I would suggest that if possible, the wstr representation should be returned out of the unicode object (resetting wstr to NULL). This should produce the greatest reuse in code, yet avoid unnecessary copying.
It's not possible to do so for strings where wstr is shared with the canonical representation (i.e. a UCS-2 string on 16-bit wchar_t, and a UCS-4 string on 32-bit wchar_t).
I don't think wstr should be cleared if it was already filled when the function got called. Instead, wstr should only be returned if it was originally NULL.

dabeaz · 2012-10-16T22:07:23Z

Maybe it's not a bug, but I still think it's undesirable. Basically, you have a function that allocates a buffer, fills it with data, and allows the buffer to be destroyed. Yet, as a side effect, it allocates a second buffer, fills it, and permanently attaches it to the original string object. Thus it makes the size of the string object blow up to a size substantially larger than it was before with no way to reclaim memory other than to delete the whole string.

Maybe this is some sort of rare event that doesn't matter, but maybe there's some bit of C extension code that is trying to pass a wchar_t array off to some external library. The extension writer is using the PyUnicode_AsWideCharString() function with the understanding that it creates a new array and that you have to destroy it. They understand that it's not super fast to have to make a copy, but it's better than nothing. What's unfortunate is that all of this attention to memory management doesn't reward the programmer as a copy gets left behind on the string object anyways.

For instance, I start with a 10 Megabyte string, I pass it through a C extension function, and now the string is mysteriously using 50 Megabytes of memory.

I think the idea of filling wstr, returning it and clearing it (if originally NULL) would definitely work here. Actually, that's exactly what I want--don't fill in the wstr member if it's not set already. That way, it's possible for C extensions to temporarily get the wstr buffer, do something, and then toss it away without affecting the original string.

Another suggestion: An API function to simply clear wstr and the UTF-8 representation could also work. Again, this is for extension writers who want to pull data out of strings, but don't want to leave these memory side effects behind.

dabeaz · 2012-10-16T22:14:40Z

Another note: the PyUnicode_AsUTF8String() doesn't leave the UTF-8 encoded byte string behind on the original string object. I got into this thinking that PyUnicode_AsWideCharString() might have similar behavior.

terryjreedy · 2014-10-08T19:31:36Z

I tried to re-title to describe the enhancement proposal. There are multiple API proposals in the two messages above.

vstinner · 2014-10-09T09:08:38Z

I proposed to change PyUnicode_AsWideCharString() to not cache the result: see my issue bpo-22323 "Rewrite PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(): don't cache the result".

See also my issues bpo-22271 "Deprecate PyUnicode_AsUnicode(): emit a DeprecationWarning" and bpo-22324 "Use PyUnicode_AsWideCharString() instead of PyUnicode_AsUnicode()".

methane · 2019-04-13T15:34:40Z

fixed by bpo-30863

dabeaz mannequin added extension-modules C modules in the Modules dir interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels Oct 16, 2012

terryjreedy changed the title ~~PyUnicode_AsWideCharString() increases string size~~ Make PyUnicode_AsWideCharString() increase temporary Oct 8, 2014

terryjreedy added the type-feature A feature request or enhancement label Oct 8, 2014

methane added the 3.8 only security fixes label Apr 13, 2019

methane closed this as completed Apr 13, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make PyUnicode_AsWideCharString() increase temporary #60458

Make PyUnicode_AsWideCharString() increase temporary #60458

dabeaz mannequin commented Oct 16, 2012

dabeaz mannequin commented Oct 16, 2012

dabeaz mannequin commented Oct 16, 2012

loewis mannequin commented Oct 16, 2012

dabeaz mannequin commented Oct 16, 2012

dabeaz mannequin commented Oct 16, 2012

terryjreedy commented Oct 8, 2014

vstinner commented Oct 9, 2014

methane commented Apr 13, 2019

Make PyUnicode_AsWideCharString() increase temporary #60458

Make PyUnicode_AsWideCharString() increase temporary #60458

Comments

dabeaz mannequin commented Oct 16, 2012

dabeaz mannequin commented Oct 16, 2012

dabeaz mannequin commented Oct 16, 2012

loewis mannequin commented Oct 16, 2012

dabeaz mannequin commented Oct 16, 2012

dabeaz mannequin commented Oct 16, 2012

terryjreedy commented Oct 8, 2014

vstinner commented Oct 9, 2014

methane commented Apr 13, 2019