Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER #76535

ideasman42 · 2017-12-18T02:00:13Z

BPO	32354
Nosy	@vstinner, @ezio-melotti, @ideasman42

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2017-12-18.02:00:12.723>
labels = ['expert-unicode']
title = 'Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER'
updated_at = <Date 2019-10-27.16:17:44.802>
user = 'https://github.com/ideasman42'

bugs.python.org fields:

activity = <Date 2019-10-27.16:17:44.802>
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation = <Date 2017-12-18.02:00:12.723>
creator = 'ideasman42'
dependencies = []
files = []
hgrepos = []
issue_num = 32354
keywords = []
message_count = 4.0
messages = ['308506', '308523', '308607', '355480']
nosy_count = 3.0
nosy_names = ['vstinner', 'ezio.melotti', 'ideasman42']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue32354'
versions = ['Python 3.6']

Linked PRs

gh-76535: Add C API functions for changing case of a single codepoint #117117

ideasman42 · 2017-12-18T02:00:12Z

Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER are marked as deprecated in the docs.

https://docs.python.org/3/c-api/unicode.html?highlight=py_unicode_tolower#c.Py_UNICODE_TOLOWER

Someone submitted a patch to our project which uses these.

What is unclear, is if there is an intention to replace these (wrappers for '_PyUnicode_ToLowerFull' for eg).

Or if this will be removed without any alternative.

I assume the functionality will be kept somewhere since Python needs str.lower/upper.

Will there be a way to perform this in the future? Either way, could docs be updated to reflect this?

vstinner · 2017-12-18T08:56:54Z

Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER *API* doesn't respect the latest Unicode standard. For example, it doesn't support this operation:

>>> "ß".upper()
'SS'

ideasman42 · 2017-12-19T02:11:01Z

Thanks for the info, in that case will there be a way to do this from the CPython API which can work with raw strings? (utf8, wchat_t or similar types), that doesn't require the GIL and alloc/free of PyObjects.

vstinner · 2019-10-27T16:17:45Z

See also bpo-38604: Schedule Py_UNICODE API removal.

lysnikolaou · 2023-12-06T13:14:59Z

@vstinner Can we please revisit this? Py_UNICODE_TOLOWER and Py_UNICODE_TOUPPER have been deprecated since 3.3, but have not been removed.

Also, there's now no way to upper/lower a ucs4 character using the C api, which follows the latest Unicode standard and handles all the obscure cases like capital sigma, like str.lower()/upper() do. What do you think we should do? Remove them for good or maybe add a new macro that uses the new set of functions?

vstinner · 2023-12-06T13:51:40Z

In C, you can create a sub-string of a single character (ex: using PyUnicode_FromOrdinal()) and then call the upper() method on it. I'm not sure that a new API is needed for that.

Would you mind to elaborate your use case?

Py_UNICODE_TOLOWER and Py_UNICODE_TOUPPER have been deprecated since 3.3, but have not been removed.

Ah right, we should start to deprecate them.

lysnikolaou · 2023-12-06T14:36:56Z

Would you mind to elaborate your use case?

If possible, I'd like to skip creating unicode objects and work directly on unicode codepoints.

The usecase for this is that we're trying to speed up operations on numpy string arrays, so we're writing numpy ufuncs that operate directly on C buffers, as opposed to the previous approach which was to create PyObjects and call the methods on them. We've observed a very good speed-up, so we're trying to do it for every operation a numpy string supports. See numpy/numpy#24952 for how I'm currently trying to implement it.

Ah right, we should start to deprecate them.

I can start working on a PR to remove these macros if you want.

lysnikolaou · 2024-02-26T11:59:21Z

@vstinner Quick ping here.

vstinner · 2024-03-13T08:47:10Z

Thanks for the info, in that case will there be a way to do this from the CPython API which can work with raw strings? (utf8, wchat_t or similar types), that doesn't require the GIL and alloc/free of PyObjects.

I suggest calling str.lower(), str.title() and str.upper() methods which need a Python str object, the GIL, and to manage reference count. No, there is no API which fits your use cases directly.

lysnikolaou · 2024-03-13T11:38:43Z

It would be incredibly helpful to some of our work on NumPy, if there was a way to do tolower / toupper for single codepoints, which is exactly what the deprecated macros used to do. Is there any way we can undeprecate them?

vstinner · 2024-03-14T10:58:51Z

I propose the API:

Py_ssize_t PyUnicode_ToLower(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
Py_ssize_t PyUnicode_ToUpper(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
Py_ssize_t PyUnicode_ToTitle(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);

On success, write characters into buffer and return a number >= 1.
If the buffer is too small, return -1.

vstinner · 2024-03-14T11:00:58Z

I wrote a script to check for the maximum output length in Python 3.13.

lower() expands a single character up to 2 characters.
upper() expands a single character up to 3 characters.
title() expands a single character up to 3 characters.

To be safe, I would suggest to pass a buffer of 5 characters, "just in case", to be future proof.

Script:

def dump(text):
    chars = ' '.join(f'U+{ord(ch):04x}' for ch in text)
    return f"{text!r} ({chars})"

max_chr = 0x10_ffff
maxl = 0
maxu = 0
maxt = 0
for ch in range(0, max_chr+1):
    ch = chr(ch)

    output = ch.lower()
    x = len(output)
    if x > maxl:
        maxl = x
        print(f"New max lower()={maxl}: {dump(ch)} => {dump(output)}")

    output = ch.upper()
    x = len(output)
    if x > maxu:
        maxu = x
        print(f"New max upper()={maxu}: {dump(ch)} => {dump(output)}")

    output = ch.title()
    x = len(output)
    if x > maxt:
        maxt = x
        print(f"New max title()={maxt}: {dump(ch)} => {dump(output)}")

Ouptut (sorted manually):

New max lower()=1: '\x00' (U+0000) => '\x00' (U+0000)
New max lower()=2: 'İ' (U+0130) => 'i̇' (U+0069 U+0307)

New max upper()=1: '\x00' (U+0000) => '\x00' (U+0000)
New max upper()=2: 'ß' (U+00df) => 'SS' (U+0053 U+0053)
New max upper()=3: 'ΐ' (U+0390) => 'Ϊ́' (U+0399 U+0308 U+0301)

New max title()=1: '\x00' (U+0000) => '\x00' (U+0000)
New max title()=2: 'ß' (U+00df) => 'Ss' (U+0053 U+0073)
New max title()=3: 'ΐ' (U+0390) => 'Ϊ́' (U+0399 U+0308 U+0301)

lysnikolaou · 2024-03-14T11:42:09Z

Wow, I didn't know that upper-casing or lower-casing could lead to more than one characters being returned. Either way, the API looks good to me, and would definitely be okay to use in NumPy. Would you like me to work on it?

vstinner · 2024-03-14T11:53:01Z

The implementation already exists:

int _PyUnicode_ToLowerFull(Py_UCS4 ch, Py_UCS4 *res)
int _PyUnicode_ToTitleFull(Py_UCS4 ch, Py_UCS4 *res)
int _PyUnicode_ToUpperFull(Py_UCS4 ch, Py_UCS4 *res)

We should write a public API on top of it which takes care of the buffer size.

Oh, case_operation() makes the expectation that a single character can only expand to 3 characters. So maybe 3 is enough.

@lysnikolaou: Sure, you can propose a PR to implement it.

I'm not sure if we should expose a "title" operation and "case folding" operation. In case of doubt, I propose to start with the minimum: upper and lower.

…epoint

serhiy-storchaka · 2024-03-22T11:07:03Z

The issues with #117117:

Buffer overflow if specify too small buffer.
Returning -1 on failure loses information about what size of the buffer is required.

I think that the current C API is good as is, we only should make it public. We can also extend it to support NULL as res -- the result will say us what size of the buffer is needed. But this may be overkill, because the maximal size is not so large, we always can use an output buffer of static size.

vstinner · 2024-03-22T11:17:59Z

We can provide a constant which is the minimum buffer size.

ideasman42 mannequin added the topic-unicode label Dec 18, 2017

ezio-melotti transferred this issue from another repository Apr 10, 2022

lysnikolaou mentioned this issue Dec 6, 2023

ENH: Add lower/upper ufuncs for unicode and bytes dtypes numpy/numpy#24952

Draft

lysnikolaou added a commit to lysnikolaou/cpython that referenced this issue Mar 21, 2024

pythongh-76535: Add C API functions for changing case of a single cod…

08ec28e

…epoint

lysnikolaou linked a pull request Mar 21, 2024 that will close this issue

gh-76535: Add C API functions for changing case of a single codepoint #117117

Open

lysnikolaou added a commit to lysnikolaou/cpython that referenced this issue Mar 21, 2024

pythongh-76535: Add C API functions for changing case of a single cod…

8facf19

…epoint

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER #76535

Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER #76535

ideasman42 mannequin commented Dec 18, 2017 •

edited by bedevere-app bot

ideasman42 mannequin commented Dec 18, 2017

vstinner commented Dec 18, 2017

ideasman42 mannequin commented Dec 19, 2017

vstinner commented Oct 27, 2019

lysnikolaou commented Dec 6, 2023

vstinner commented Dec 6, 2023

lysnikolaou commented Dec 6, 2023

lysnikolaou commented Feb 26, 2024

vstinner commented Mar 13, 2024

lysnikolaou commented Mar 13, 2024

vstinner commented Mar 14, 2024

vstinner commented Mar 14, 2024

lysnikolaou commented Mar 14, 2024 •

edited

vstinner commented Mar 14, 2024

serhiy-storchaka commented Mar 22, 2024

vstinner commented Mar 22, 2024

Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER #76535

Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER #76535

Comments

ideasman42 mannequin commented Dec 18, 2017 • edited by bedevere-app bot

Linked PRs

ideasman42 mannequin commented Dec 18, 2017

vstinner commented Dec 18, 2017

ideasman42 mannequin commented Dec 19, 2017

vstinner commented Oct 27, 2019

lysnikolaou commented Dec 6, 2023

vstinner commented Dec 6, 2023

lysnikolaou commented Dec 6, 2023

lysnikolaou commented Feb 26, 2024

vstinner commented Mar 13, 2024

lysnikolaou commented Mar 13, 2024

vstinner commented Mar 14, 2024

vstinner commented Mar 14, 2024

lysnikolaou commented Mar 14, 2024 • edited

vstinner commented Mar 14, 2024

serhiy-storchaka commented Mar 22, 2024

vstinner commented Mar 22, 2024

ideasman42 mannequin commented Dec 18, 2017 •

edited by bedevere-app bot

lysnikolaou commented Mar 14, 2024 •

edited