Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER #76535
Comments
Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER are marked as deprecated in the docs. https://docs.python.org/3/c-api/unicode.html?highlight=py_unicode_tolower#c.Py_UNICODE_TOLOWER Someone submitted a patch to our project which uses these. What is unclear is whether there is an intention to replace these (with wrappers for '_PyUnicode_ToLowerFull', for example), or whether they will be removed without any alternative. I assume the functionality will be kept somewhere since Python needs str.lower/upper. Will there be a way to perform this in the future? Either way, could the docs be updated to reflect this? |
Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER *API* doesn't respect the latest Unicode standard. For example, it doesn't support this operation: >>> "ß".upper()
'SS' |
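To illustrate the point above, here is a minimal Python sketch of why a one-character-in, one-character-out macro cannot express full case mapping. The helper name `expands_on_upper` is hypothetical and only used for this illustration; the sketch relies solely on the built-in `str.upper()`.

```python
def expands_on_upper(ch: str) -> bool:
    """Return True if upper-casing the single character `ch`
    produces more than one code point (hypothetical helper)."""
    return len(ch.upper()) > 1


def code_points(text: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


if __name__ == "__main__":
    # "a" maps to one code point; "ß" (U+00DF) and the "fi" ligature
    # (U+FB01) each expand to two code points when upper-cased.
    for ch in ("a", "\u00df", "\ufb01"):
        print(f"{code_points(ch)} -> {code_points(ch.upper())}")
```

A macro that returns a single `Py_UNICODE` value has no way to return the two-code-point result for the expanding characters.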
Thanks for the info. In that case, will there be a way to do this from the CPython API that works with raw strings (utf8, wchar_t or similar types) and doesn't require the GIL or alloc/free of PyObjects? |
See also bpo-38604: Schedule Py_UNICODE API removal. |
@vstinner Can we please revisit this? Also, there's now no way to upper/lower a ucs4 character using the C api, which follows the latest Unicode standard and handles all the obscure cases like capital sigma, like |
In C, you can create a sub-string of a single character (ex: using Would you mind elaborating on your use case?
Ah right, we should start to deprecate them. |
If possible, I'd like to skip creating unicode objects and work directly on unicode codepoints. The usecase for this is that we're trying to speed up operations on numpy string arrays, so we're writing numpy ufuncs that operate directly on C buffers, as opposed to the previous approach which was to create
I can start working on a PR to remove these macros if you want. |
@vstinner Quick ping here. |
I suggest calling str.lower(), str.title() and str.upper() methods which need a Python str object, the GIL, and to manage reference count. No, there is no API which fits your use cases directly. |
It would be incredibly helpful to some of our work on NumPy, if there was a way to do |
I propose the API: Py_ssize_t PyUnicode_ToLower(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
Py_ssize_t PyUnicode_ToUpper(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
Py_ssize_t PyUnicode_ToTitle(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
|
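The buffer-based contract of the proposed functions can be modelled in Python as a rough sketch. The thread does not spell out the return-value convention, so this model assumes that the function returns the number of code points written and -1 when the buffer is too small; the name `to_upper_into` is hypothetical and is not part of any CPython API.

```python
def to_upper_into(ch: int, buffer: list, size: int) -> int:
    """Model of a buffer-based upper-casing call (hypothetical).

    Writes the upper-case mapping of code point `ch` into the first
    slots of `buffer`, which must hold at least `size` items.
    Assumed convention: return the number of code points written,
    or -1 if the mapping does not fit in `size` slots.
    """
    mapped = [ord(c) for c in chr(ch).upper()]
    if len(mapped) > size:
        return -1  # assumed error convention, not confirmed by the thread
    buffer[:len(mapped)] = mapped
    return len(mapped)


if __name__ == "__main__":
    buf = [0] * 5
    n = to_upper_into(0x00DF, buf, len(buf))  # U+00DF is "ß"
    print(n, [f"U+{cp:04X}" for cp in buf[:n]])
```

The caller owns the buffer and passes its size, so no Python object has to be allocated for the result; that is the property the NumPy use case asks for.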
I wrote a script to check for the maximum output length in Python 3.13.
To be safe, I would suggest passing a buffer of 5 characters, "just in case", to be future-proof. Script:
def dump(text):
    # Show the string and its code points, e.g. 'SS' (U+0053 U+0053)
    chars = ' '.join(f'U+{ord(ch):04x}' for ch in text)
    return f"{text!r} ({chars})"

max_chr = 0x10_ffff
maxl = 0
maxu = 0
maxt = 0
# Scan every code point and report each new maximum output length
for ch in range(0, max_chr+1):
    ch = chr(ch)
    output = ch.lower()
    x = len(output)
    if x > maxl:
        maxl = x
        print(f"New max lower()={maxl}: {dump(ch)} => {dump(output)}")
    output = ch.upper()
    x = len(output)
    if x > maxu:
        maxu = x
        print(f"New max upper()={maxu}: {dump(ch)} => {dump(output)}")
    output = ch.title()
    x = len(output)
    if x > maxt:
        maxt = x
        print(f"New max title()={maxt}: {dump(ch)} => {dump(output)}")
Output (sorted manually):
|
Wow, I didn't know that upper-casing or lower-casing could lead to more than one character being returned. Either way, the API looks good to me, and would definitely be okay to use in NumPy. Would you like me to work on it? |
The implementation already exists:
We should write a public API on top of it which takes care of the buffer size. Oh, @lysnikolaou: Sure, you can propose a PR to implement it. I'm not sure if we should expose a "title" operation and "case folding" operation. In case of doubt, I propose to start with the minimum: upper and lower. |
The issues with #117117:
I think that the current C API is good as is; we should only make it public. We can also extend it to support NULL as |
We can provide a constant which is the minimum buffer size. |