[C API] No efficient C API to get UTF-8 string from unicode object. #83268
Assume you are writing an extension module that reads strings, for example an HTML escaper or a JSON encoder. There are two courses: (a) Support the three KINDs in the flexible unicode representation. (a) will be the fastest on CPython, but there are a few drawbacks:
So I believe (b) should be the preferred way.
For speed and efficiency, I propose a new API:
When the unicode object is ASCII or has a UTF-8 cache, this API increments the refcount of the unicode object and returns it. |
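To make the cost of course (a) concrete, here is a minimal standalone sketch of what "supporting three KINDs" means for something like an HTML escaper: the same loop has to be written once per storage width. The `enum kind`, `escape_extra` and `escaped_length` names are hypothetical stand-ins chosen for illustration; they model the 1-, 2- and 4-byte storage widths and are not the CPython API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the three storage widths (1, 2 or 4 bytes per
 * code point). Illustrative names only, not the CPython API. */
enum kind { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Extra output units needed to HTML-escape one code point. */
size_t escape_extra(uint32_t ch)
{
    switch (ch) {
    case '&': return 4;   /* "&amp;"  is 5 units, 4 more than the input */
    case '<': return 3;   /* "&lt;"   is 4 units */
    case '>': return 3;   /* "&gt;"   is 4 units */
    case '"': return 5;   /* "&quot;" is 6 units */
    default:  return 0;
    }
}

/* Length, in code units, of the escaped text. Note the three near-identical
 * loops: one per storage width. */
size_t escaped_length(enum kind kind, const void *data, size_t n)
{
    assert(data != NULL || n == 0);
    size_t total = n;
    switch (kind) {
    case KIND_1BYTE: {
        const uint8_t *p = data;
        for (size_t i = 0; i < n; i++) total += escape_extra(p[i]);
        break;
    }
    case KIND_2BYTE: {
        const uint16_t *p = data;
        for (size_t i = 0; i < n; i++) total += escape_extra(p[i]);
        break;
    }
    case KIND_4BYTE: {
        const uint32_t *p = data;
        for (size_t i = 0; i < n; i++) total += escape_extra(p[i]);
        break;
    }
    }
    return total;
}
```

With UTF-8 input, as in course (b), only the single byte-oriented loop would be needed, because all four escaped characters are ASCII.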
Do you mean some concrete code?

Several times I have wished for a similar feature: get the UTF-8 cache if it exists, and otherwise encode to UTF-8 without creating a cache.

The private _PyUnicode_UTF8() macro could help

    if ((s = _PyUnicode_UTF8(str))) {
        size = _PyUnicode_UTF8_LENGTH(str);
        tmpbytes = NULL;
    }
    else {
        tmpbytes = _PyUnicode_AsUTF8String(str, "replace");
        s = PyBytes_AS_STRING(tmpbytes);
        size = PyBytes_GET_SIZE(tmpbytes);
    }

but it is not even available outside of unicodeobject.c.

PyUnicode_BorrowUTF8() looks too complex for the public API. I am not sure that it will be easy to implement in PyPy. It also does not cover all use cases: sometimes you want to convert to UTF-8 without any memory allocation at all (either use an existing buffer, or raise an error if there is no cached UTF-8 and the string is not ASCII). |
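The pattern in the snippet above (use the cached UTF-8 when it exists, otherwise encode into a temporary that the caller must release) can be modelled outside CPython. The following is a rough standalone sketch under stated assumptions: `struct ustr`, `struct utf8_view`, `utf8_view_get` and `utf8_view_release` are hypothetical names invented for illustration, the input is assumed to hold valid code points, and none of this is the CPython object layout or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a string: code points plus an optional cached
 * UTF-8 form. Not the CPython layout. */
struct ustr {
    const uint32_t *cp;   /* code points */
    size_t n;             /* number of code points */
    const char *utf8;     /* cached UTF-8 bytes, or NULL if no cache */
    size_t utf8_len;
};

/* Read-only view of UTF-8 bytes. `owned` is non-NULL only when the bytes
 * live in a temporary buffer that the view must free. */
struct utf8_view {
    const char *data;
    size_t len;
    char *owned;
};

/* Encode one code point (assumed valid, at most 0x10FFFF) into out.
 * Returns the number of bytes written (1 to 4). */
size_t encode_utf8(uint32_t c, char *out)
{
    if (c < 0x80) {
        out[0] = (char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (char)(0xC0 | (c >> 6));
        out[1] = (char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (char)(0xE0 | (c >> 12));
        out[1] = (char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (char)(0x80 | (c & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (c >> 18));
    out[1] = (char)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (char)(0x80 | (c & 0x3F));
    return 4;
}

/* Fill `v` with UTF-8 bytes for `s`. Borrows the cache when present;
 * otherwise encodes into a temporary. Returns 0 on success, -1 on failure. */
int utf8_view_get(const struct ustr *s, struct utf8_view *v)
{
    if (s->utf8 != NULL) {              /* fast path: no allocation */
        v->data = s->utf8;
        v->len = s->utf8_len;
        v->owned = NULL;
        return 0;
    }
    char *buf = malloc(4 * s->n + 1);   /* worst case: 4 bytes per code point */
    if (buf == NULL) {
        return -1;
    }
    size_t len = 0;
    for (size_t i = 0; i < s->n; i++) {
        len += encode_utf8(s->cp[i], buf + len);
    }
    buf[len] = '\0';
    v->data = buf;
    v->len = len;
    v->owned = buf;
    return 0;
}

/* Release the view; frees the temporary if there was one. */
void utf8_view_release(struct utf8_view *v)
{
    free(v->owned);
    v->data = NULL;
    v->len = 0;
    v->owned = NULL;
}
```

The caller uses the same get/release pair on both paths and never needs to know whether the bytes were borrowed or freshly encoded; the string's own cache is left untouched either way.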
Would it be possible to use a "container" object like a Py_buffer? Is there a way to customize the code executed when a Py_buffer is "released"? Py_buffer would be nice since it already has a pointer attribute (data) and a length attribute, and there is an API to "release" a Py_buffer. It can be marked as read-only, etc. |
Looks like a good idea.

    int PyUnicode_GetUTF8Buffer(Py_buffer *view, const char *errors) |
It looks like a nice idea! Py_buffer.obj is decref-ed when the buffer is released.

    int PyUnicode_GetUTF8Buffer(PyObject *unicode, Py_buffer *view)
    {
        if (!PyUnicode_Check(unicode)) {
            PyErr_BadArgument();
            return NULL;
        }
        if (PyUnicode_READY(unicode) == -1) {
            return NULL;
        }
        if (PyUnicode_UTF8(unicode) != NULL) {
            return PyBuffer_FillInfo(view, unicode,
                                     PyUnicode_UTF8(unicode),
                                     PyUnicode_UTF8_LENGTH(unicode),
                                     1, PyBUF_CONTIG_RO);
        }
        PyObject *bytes = _PyUnicode_AsUTF8String(unicode, NULL);
        if (bytes == NULL) {
            return NULL;
        }
        return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);
    } |
s/return NULL/return -1/g |
    return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);

Don't you need to DECREF bytes somehow, at least in case of failure? |
Thanks. I will create a pull request with suggested changes. |
I like this idea, but I think that we should at least notify Python-Dev about all additions to the public C API. If somebody has objections or a better idea, it is better to know earlier. |
I created a post about this issue in discuss.python.org. |
I am still not sure whether we should add a new API only to avoid the cache.

With PR-18327, PyUnicode_AsUTF8AndSize becomes 10+% faster than the master branch, and runs at the same speed as PyUnicode_AsUTF8String.

## vs master

    $ ./python -m pyperf timeit --compare-to=../cpython/python --python-names master:patched -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
    master: ..................... 96.6 us +- 3.3 us
    patched: ..................... 83.3 us +- 0.3 us

    Mean +- std dev: [master] 96.6 us +- 3.3 us -> [patched] 83.3 us +- 0.3 us: 1.16x faster (-14%)

## vs AsUTF8String

    $ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
    .....................
    Mean +- std dev: 83.2 us +- 0.2 us

    $ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "こんにちは")'
    .....................
    Mean +- std dev: 81.9 us +- 2.1 us

## vs AsUTF8String (ASCII)

If we cannot accept the cache, PyUnicode_AsUTF8String is slower than PyUnicode_AsUTF8 when the unicode object is an ASCII string. PyUnicode_GetUTF8Buffer helps only in this case.

    $ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "world")'
    .....................
    Mean +- std dev: 37.5 us +- 1.7 us

    $ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "world")'
    .....................
    Mean +- std dev: 46.4 us +- 1.6 us |
The attached patch is the benchmark function I used in my previous post. |
I thought there were at least 3-4 use cases in the core and stdlib. |