Speed up unicode normalization of ASCII strings #89150
I think there is an opportunity to speed up some unicode normalisations significantly. In 3.9 at least, the time taken by the normalisation appears to scale with the length of the string:

```python
>>> from timeit import Timer
>>> setup = "from unicodedata import normalize; s = 'reverse'"
>>> t1 = Timer('normalize("NFKC", s)', setup=setup)
>>> setup = "from unicodedata import normalize; s = 'reverse'*1000"
>>> t2 = Timer('normalize("NFKC", s)', setup=setup)
>>>
>>> min(t1.repeat(repeat=7))
0.04854234401136637
>>> min(t2.repeat(repeat=7))
9.98313440399943
```

But ASCII strings are always in normalised form, for all four normalisation forms. In CPython, with PEP 393 (Flexible String Representation), it should be a constant-time operation to detect whether a string is pure ASCII, avoiding both the scan of the string and the normalisation attempt.
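The claim that ASCII strings are fixed points of every normalisation form is easy to verify with a small sanity check (not part of the original report):

```python
import unicodedata

s = "reverse" * 1000  # pure-ASCII test string

# An ASCII string is already normalised under all four Unicode
# normalisation forms, so normalize() must return it unchanged.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, s) == s

# In CPython, str.isascii() reads the PEP 393 ASCII flag, so the
# "is this pure ASCII?" test itself is effectively constant time.
assert s.isascii()
assert not "café".isascii()
```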
Well, someone should write a PR for it.
Well, I sent a patch :)
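The proposed fast path can be sketched in pure Python. The `normalize_ascii_fast` wrapper below is a hypothetical illustration of the idea, not the actual patch (a real fix would live in the C implementation of the `unicodedata` module):

```python
import unicodedata

def normalize_ascii_fast(form, s):
    """Hypothetical wrapper sketching the proposed fast path.

    Pure-ASCII strings are already in NFC, NFD, NFKC and NFKD form,
    so they can be returned unchanged.  In CPython, str.isascii()
    reads the PEP 393 ASCII flag, making this check O(1).
    """
    if s.isascii():
        return s
    return unicodedata.normalize(form, s)

# ASCII input: returned as-is, without scanning or normalising.
assert normalize_ascii_fast("NFKC", "reverse" * 1000) == "reverse" * 1000

# Non-ASCII input: falls back to the regular normalisation
# ('e' + combining acute accent composes to the precomposed 'é').
assert normalize_ascii_fast("NFC", "cafe\u0301") == "caf\u00e9"
```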
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.