Fast path for unicodedata.normalize() #45076
Comments
Implements quick checking of already normalized forms as
The patch is against 2.6 SVN trunk.
Normalization test
API affected: unicodedata.normalize('NFC', u'a') is u'a' and similar
Added memory footprint: A new 8-bit field is added to _PyUnicode_DatabaseRecord:
typedef struct {
    const unsigned char category;
    const unsigned char combining;
    const unsigned char bidirectional;
    const unsigned char mirrored;
    const unsigned char east_asian_width;
    const unsigned char normalization_quick_check;
} _PyUnicode_DatabaseRecord;
normalization_quick_check is the added field. |
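The quick-check idea can be sketched in pure Python for the NFD form. This is only an illustration under stated assumptions, not the patch's C code: the helper names `is_nfd_quick` and `normalize_nfd_fast` are hypothetical, and since `unicodedata` does not expose the per-character quick-check value, the sketch derives the NFD answer from the canonical decomposition data instead of reading a stored field. It uses Python 3 string literals rather than the `u'a'` spelling of the 2.6 trunk.

```python
import unicodedata

def _nfd_qc_is_no(ch):
    """Assumed stand-in for the stored quick-check value (NFD only)."""
    decomp = unicodedata.decomposition(ch)
    # A canonical decomposition has no "<tag>" prefix; compatibility ones do.
    if decomp and not decomp.startswith("<"):
        return True
    # Hangul syllables decompose algorithmically and have no table entry.
    return 0xAC00 <= ord(ch) <= 0xD7A3

def is_nfd_quick(text):
    """Return True if text is already in NFD, scanning each character once."""
    last_ccc = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order: not normalized.
        if ccc != 0 and ccc < last_ccc:
            return False
        if _nfd_qc_is_no(ch):
            return False
        last_ccc = ccc
    return True

def normalize_nfd_fast(text):
    """Hypothetical wrapper: skip the full normalization when possible."""
    if is_nfd_quick(text):
        return text  # fast path: hand back the input unchanged
    return unicodedata.normalize("NFD", text)
```

For NFD the quick check never answers MAYBE, so the scan alone decides whether the full algorithm is needed. The composed forms (NFC, NFKC) can answer MAYBE, which this sketch does not cover.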
It's a very interesting patch. I wonder why it fell into oblivion. stuff
Making sure that all Unicode is normalized can be a bottleneck in a lot
The downside is that it makes test_codecs and test_unicode_file fail. |
Here is a new patch against trunk, including the modified data files. Martin, do you think this can be committed? |
Should this be considered for 3.1? |
The patch looks fine to me, please apply. One change is necessary: the
I think it would be possible to fold the NO and MAYBE answers into NO in
With a reduction of the number of bits, it would be possible to reclaim |
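To make the bit-count remark concrete, here is a small sketch of how three-valued answers for four normalization forms fit in an 8-bit field, and how folding MAYBE into NO halves that. The bit layout, the numeric values of `YES`/`MAYBE`/`NO`, the form order, and the function names are all assumptions made for illustration; the thread does not state them.

```python
# Assumed encoding of the three quick-check answers (hypothetical values).
YES, MAYBE, NO = 0, 1, 2

# Assumed order of the four normalization forms within the field.
FORMS = ("NFD", "NFKD", "NFC", "NFKC")

def pack_two_bit(answers):
    """Pack one 2-bit answer per form into a single 8-bit value."""
    field = 0
    for i, form in enumerate(FORMS):
        field |= answers[form] << (2 * i)
    return field

def unpack_two_bit(field, form):
    """Read back the 2-bit answer for one form."""
    return (field >> (2 * FORMS.index(form))) & 0b11

def pack_one_bit(answers):
    """Fold MAYBE into NO: one bit per form, set when the answer is not YES."""
    field = 0
    for i, form in enumerate(FORMS):
        if answers[form] != YES:
            field |= 1 << i
    return field

example = {"NFD": YES, "NFKD": NO, "NFC": MAYBE, "NFKC": NO}
packed = pack_two_bit(example)   # fits in 8 bits
folded = pack_one_bit(example)   # fits in 4 bits
```

Under this layout the folded field only records "take the fast path" versus "run the full algorithm", which is all a fast path strictly needs.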
Committed in r72054, r72055. Thanks for the patch! |