Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make PyUnicode_AsWideCharString() increase temporary #60458

Closed
dabeaz mannequin opened this issue Oct 16, 2012 · 8 comments
Closed

Make PyUnicode_AsWideCharString() increase temporary #60458

dabeaz mannequin opened this issue Oct 16, 2012 · 8 comments
Labels
3.8 only security fixes extension-modules C modules in the Modules dir interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-feature A feature request or enhancement

Comments

@dabeaz
Copy link
Mannequin

dabeaz mannequin commented Oct 16, 2012

BPO 16254
Nosy @loewis, @terryjreedy, @vstinner, @ezio-melotti, @methane

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2019-04-13.15:34:39.679>
created_at = <Date 2012-10-16.20:14:08.572>
labels = ['extension-modules', 'interpreter-core', 'type-feature', '3.8', 'expert-unicode']
title = 'Make PyUnicode_AsWideCharString() increase temporary'
updated_at = <Date 2019-04-13.15:34:39.674>
user = 'https://bugs.python.org/dabeaz'

bugs.python.org fields:

activity = <Date 2019-04-13.15:34:39.674>
actor = 'methane'
assignee = 'none'
closed = True
closed_date = <Date 2019-04-13.15:34:39.679>
closer = 'methane'
components = ['Extension Modules', 'Interpreter Core', 'Unicode']
creation = <Date 2012-10-16.20:14:08.572>
creator = 'dabeaz'
dependencies = []
files = []
hgrepos = []
issue_num = 16254
keywords = []
message_count = 8.0
messages = ['173089', '173090', '173101', '173104', '173106', '228812', '228842', '340165']
nosy_count = 7.0
nosy_names = ['loewis', 'terry.reedy', 'vstinner', 'ezio.melotti', 'Arfrever', 'methane', 'dabeaz']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue16254'
versions = ['Python 3.8']

@dabeaz
Copy link
Mannequin Author

dabeaz mannequin commented Oct 16, 2012

The PyUnicode_AsWideCharString() function is described as creating a new buffer of type wchar_t allocated by PyMem_Alloc() (which must be freed by the user). However, if you use this function, it causes the size of the original string object to permanently increase. For example, suppose you had some extension code like this:

static PyObject *py_receive_wchar(PyObject *self, PyObject *args) {
  PyObject *obj;
  wchar_t *s;
  Py_ssize_t len;

  if (!PyArg_ParseTuple(args, "U", &obj)) {
    return NULL;
  }
  if ((s = PyUnicode_AsWideCharString(obj, &len)) == NULL) {
    return NULL;
  }
  /* Do nothing */
  PyMem_Free(s);
  Py_RETURN_NONE;
}

Now, try an experiment (assume that the above extension function is available as 'receive_wchar').

>>> s = "Hell"*1000
>>> len(s)
4000
>>> import sys
>>> sys.getsizeof(s)
4049
>>> receive_wchar(s)
>>> sys.getsizeof(s)
20053
>>>

It seems that PyUnicode_AsWideCharString() may be filling in the wstr field of the associated PyASCIIObject structure from PEP-393 (I haven't verified). Once filled, it never seems to be discarded.

Background: I am trying to figure out how to convert from Unicode to (wchar_t, int *) that doesn't cause a permanent increase in the memory footprint of the original Unicode object. Also, I'm trying to stay away from deprecated Unicode APIs.

@dabeaz dabeaz mannequin added extension-modules C modules in the Modules dir interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels Oct 16, 2012
@dabeaz
Copy link
Mannequin Author

dabeaz mannequin commented Oct 16, 2012

I should quickly add, is there any way to simply have this function not keep the wchar_t buffer around afterwards? That would be great.

@loewis
Copy link
Mannequin

loewis mannequin commented Oct 16, 2012

As stated, this is not a bug: there is no memory leak, nor any deviation from documented behavior.

You are right that it fills the wstr pointer, by calling PyUnicode_AsUnicodeAndSize in unicode_aswidechar, and then copying the data to a fresh buffer.

This is merely the simplest implementation; it's certainly possible to improve it. Contributions are welcome.

A number of things need to be considered:

  • Computing the wstr size is somewhat expensive if on a 16-bit wchar_t system, since the result may need surrogate pairs.
  • I would suggest that if possible, the wstr representation should be returned out of the unicode object (resetting wstr to NULL). This should produce the greatest reuse in code, yet avoid unnecessary copying.
  • It's not possible to do so for strings where wstr is shared with the canonical representation (i.e. a UCS-2 string on 16-bit wchar_t, and a UCS-4 string on 32-bit wchar_t).
  • I don't think wstr should be cleared if it was already filled when the function got called. Instead, wstr should only be returned if it was originally NULL.

@dabeaz
Copy link
Mannequin Author

dabeaz mannequin commented Oct 16, 2012

Maybe it's not a bug, but I still think it's undesirable. Basically, you have a function that allocates a buffer, fills it with data, and allows the buffer to be destroyed. Yet, as a side effect, it allocates a second buffer, fills it, and permanently attaches it to the original string object. Thus it makes the size of the string object blow up to a size substantially larger than it was before with no way to reclaim memory other than to delete the whole string.

Maybe this is some sort of rare event that doesn't matter, but maybe there's some bit of C extension code that is trying to pass a wchar_t array off to some external library. The extension writer is using the PyUnicode_AsWideCharString() function with the understanding that it creates a new array and that you have to destroy it. They understand that it's not super fast to have to make a copy, but it's better than nothing. What's unfortunate is that all of this attention to memory management doesn't reward the programmer as a copy gets left behind on the string object anyways.

For instance, I start with a 10 Megabyte string, I pass it through a C extension function, and now the string is mysteriously using 50 Megabytes of memory.

I think the idea of filling wstr, returning it and clearing it (if originally NULL) would definitely work here. Actually, that's exactly what I want--don't fill in the wstr member if it's not set already. That way, it's possible for C extensions to temporarily get the wstr buffer, do something, and then toss it away without affecting the original string.

Another suggestion: An API function to simply clear wstr and the UTF-8 representation could also work. Again, this is for extension writers who want to pull data out of strings, but don't want to leave these memory side effects behind.

@dabeaz
Copy link
Mannequin Author

dabeaz mannequin commented Oct 16, 2012

Another note: the PyUnicode_AsUTF8String() doesn't leave the UTF-8 encoded byte string behind on the original string object. I got into this thinking that PyUnicode_AsWideCharString() might have similar behavior.

@terryjreedy
Copy link
Member

I tried to re-title to describe the enhancement proposal. There are multiple API proposals in the two messages above.

@terryjreedy terryjreedy changed the title PyUnicode_AsWideCharString() increases string size Make PyUnicode_AsWideCharString() increase temporary Oct 8, 2014
@terryjreedy terryjreedy added the type-feature A feature request or enhancement label Oct 8, 2014
@vstinner
Copy link
Member

vstinner commented Oct 9, 2014

I proposed to change PyUnicode_AsWideCharString() to not cache the result: see my issue bpo-22323 "Rewrite PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(): don't cache the result".

See also my issues bpo-22271 "Deprecate PyUnicode_AsUnicode(): emit a DeprecationWarning" and bpo-22324 "Use PyUnicode_AsWideCharString() instead of PyUnicode_AsUnicode()".

@methane
Copy link
Member

methane commented Apr 13, 2019

fixed by bpo-30863

@methane methane added the 3.8 only security fixes label Apr 13, 2019
@methane methane closed this as completed Apr 13, 2019
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.8 only security fixes extension-modules C modules in the Modules dir interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants