Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER #76535

Open
ideasman42 mannequin opened this issue Dec 18, 2017 · 16 comments · May be fixed by #117117
Open

Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER #76535

ideasman42 mannequin opened this issue Dec 18, 2017 · 16 comments · May be fixed by #117117

Comments

@ideasman42
Copy link
Mannequin

ideasman42 mannequin commented Dec 18, 2017

BPO 32354
Nosy @vstinner, @ezio-melotti, @ideasman42

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2017-12-18.02:00:12.723>
labels = ['expert-unicode']
title = 'Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER'
updated_at = <Date 2019-10-27.16:17:44.802>
user = 'https://github.com/ideasman42'

bugs.python.org fields:

activity = <Date 2019-10-27.16:17:44.802>
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation = <Date 2017-12-18.02:00:12.723>
creator = 'ideasman42'
dependencies = []
files = []
hgrepos = []
issue_num = 32354
keywords = []
message_count = 4.0
messages = ['308506', '308523', '308607', '355480']
nosy_count = 3.0
nosy_names = ['vstinner', 'ezio.melotti', 'ideasman42']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue32354'
versions = ['Python 3.6']

Linked PRs

@ideasman42
Copy link
Mannequin Author

ideasman42 mannequin commented Dec 18, 2017

Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER are marked as deprecated in the docs.

https://docs.python.org/3/c-api/unicode.html?highlight=py_unicode_tolower#c.Py_UNICODE_TOLOWER

Someone submitted a patch to our project which uses these.

What is unclear, is if there is an intention to replace these (wrappers for '_PyUnicode_ToLowerFull' for eg).

Or if this will be removed without any alternative.

I assume the functionality will be kept somewhere since Python needs str.lower/upper.

Will there be a way to perform this in the future? Either way, could docs be updated to reflect this?

@ideasman42 ideasman42 mannequin added the topic-unicode label Dec 18, 2017
@vstinner
Copy link
Member

Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER *API* doesn't respect the latest Unicode standard. For example, it doesn't support this operation:

>>> "ß".upper()
'SS'

@ideasman42
Copy link
Mannequin Author

ideasman42 mannequin commented Dec 19, 2017

Thanks for the info, in that case will there be a way to do this from the CPython API which can work with raw strings? (utf8, wchat_t or similar types), that doesn't require the GIL and alloc/free of PyObjects.

@vstinner
Copy link
Member

See also bpo-38604: Schedule Py_UNICODE API removal.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@lysnikolaou
Copy link
Contributor

@vstinner Can we please revisit this? Py_UNICODE_TOLOWER and Py_UNICODE_TOUPPER have been deprecated since 3.3, but have not been removed.

Also, there's now no way to upper/lower a ucs4 character using the C api, which follows the latest Unicode standard and handles all the obscure cases like capital sigma, like str.lower()/upper() do. What do you think we should do? Remove them for good or maybe add a new macro that uses the new set of functions?

@vstinner
Copy link
Member

vstinner commented Dec 6, 2023

In C, you can create a sub-string of a single character (ex: using PyUnicode_FromOrdinal()) and then call the upper() method on it. I'm not sure that a new API is needed for that.

Would you mind to elaborate your use case?

Py_UNICODE_TOLOWER and Py_UNICODE_TOUPPER have been deprecated since 3.3, but have not been removed.

Ah right, we should start to deprecate them.

@lysnikolaou
Copy link
Contributor

Would you mind to elaborate your use case?

If possible, I'd like to skip creating unicode objects and work directly on unicode codepoints.

The usecase for this is that we're trying to speed up operations on numpy string arrays, so we're writing numpy ufuncs that operate directly on C buffers, as opposed to the previous approach which was to create PyObjects and call the methods on them. We've observed a very good speed-up, so we're trying to do it for every operation a numpy string supports. See numpy/numpy#24952 for how I'm currently trying to implement it.

Ah right, we should start to deprecate them.

I can start working on a PR to remove these macros if you want.

@lysnikolaou
Copy link
Contributor

@vstinner Quick ping here.

@vstinner
Copy link
Member

Thanks for the info, in that case will there be a way to do this from the CPython API which can work with raw strings? (utf8, wchat_t or similar types), that doesn't require the GIL and alloc/free of PyObjects.

I suggest calling str.lower(), str.title() and str.upper() methods which need a Python str object, the GIL, and to manage reference count. No, there is no API which fits your use cases directly.

@lysnikolaou
Copy link
Contributor

It would be incredibly helpful to some of our work on NumPy, if there was a way to do tolower / toupper for single codepoints, which is exactly what the deprecated macros used to do. Is there any way we can undeprecate them?

@vstinner
Copy link
Member

I propose the API:

Py_ssize_t PyUnicode_ToLower(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
Py_ssize_t PyUnicode_ToUpper(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
Py_ssize_t PyUnicode_ToTitle(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
  • On success, write characters into buffer and return a number >= 1.
  • If the buffer is too small, return -1.

@vstinner
Copy link
Member

I wrote a script to check for the maximum output length in Python 3.13.

  • lower() expands a single character up to 2 characters.
  • upper() expands a single character up to 3 characters.
  • title() expands a single character up to 3 characters.

To be safe, I would suggest to pass a buffer of 5 characters, "just in case", to be future proof.

Script:

def dump(text):
    chars = ' '.join(f'U+{ord(ch):04x}' for ch in text)
    return f"{text!r} ({chars})"

max_chr = 0x10_ffff
maxl = 0
maxu = 0
maxt = 0
for ch in range(0, max_chr+1):
    ch = chr(ch)

    output = ch.lower()
    x = len(output)
    if x > maxl:
        maxl = x
        print(f"New max lower()={maxl}: {dump(ch)} => {dump(output)}")

    output = ch.upper()
    x = len(output)
    if x > maxu:
        maxu = x
        print(f"New max upper()={maxu}: {dump(ch)} => {dump(output)}")

    output = ch.title()
    x = len(output)
    if x > maxt:
        maxt = x
        print(f"New max title()={maxt}: {dump(ch)} => {dump(output)}")

Ouptut (sorted manually):

New max lower()=1: '\x00' (U+0000) => '\x00' (U+0000)
New max lower()=2: 'İ' (U+0130) => 'i̇' (U+0069 U+0307)

New max upper()=1: '\x00' (U+0000) => '\x00' (U+0000)
New max upper()=2: 'ß' (U+00df) => 'SS' (U+0053 U+0053)
New max upper()=3: 'ΐ' (U+0390) => 'Ϊ́' (U+0399 U+0308 U+0301)

New max title()=1: '\x00' (U+0000) => '\x00' (U+0000)
New max title()=2: 'ß' (U+00df) => 'Ss' (U+0053 U+0073)
New max title()=3: 'ΐ' (U+0390) => 'Ϊ́' (U+0399 U+0308 U+0301)

@lysnikolaou
Copy link
Contributor

lysnikolaou commented Mar 14, 2024

Wow, I didn't know that upper-casing or lower-casing could lead to more than one characters being returned. Either way, the API looks good to me, and would definitely be okay to use in NumPy. Would you like me to work on it?

@vstinner
Copy link
Member

The implementation already exists:

int _PyUnicode_ToLowerFull(Py_UCS4 ch, Py_UCS4 *res)
int _PyUnicode_ToTitleFull(Py_UCS4 ch, Py_UCS4 *res)
int _PyUnicode_ToUpperFull(Py_UCS4 ch, Py_UCS4 *res)

We should write a public API on top of it which takes care of the buffer size.

Oh, case_operation() makes the expectation that a single character can only expand to 3 characters. So maybe 3 is enough.

@lysnikolaou: Sure, you can propose a PR to implement it.

I'm not sure if we should expose a "title" operation and "case folding" operation. In case of doubt, I propose to start with the minimum: upper and lower.

lysnikolaou added a commit to lysnikolaou/cpython that referenced this issue Mar 21, 2024
lysnikolaou added a commit to lysnikolaou/cpython that referenced this issue Mar 21, 2024
@serhiy-storchaka
Copy link
Member

The issues with #117117:

  • Buffer overflow if specify too small buffer.
  • Returning -1 on failure loses information about what size of the buffer is required.

I think that the current C API is good as is, we only should make it public. We can also extend it to support NULL as res -- the result will say us what size of the buffer is needed. But this may be overkill, because the maximal size is not so large, we always can use an output buffer of static size.

@vstinner
Copy link
Member

We can provide a constant which is the minimum buffer size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants