Fast path for unicodedata.normalize() #45076

raulir · 2007-06-09T22:45:42Z

BPO	1734234
Nosy	@loewis, @pitrou, @devdanzin
Files	normalization_fast_path.patch uninorm.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/pitrou'
closed_at = <Date 2009-04-27.22:32:48.821>
created_at = <Date 2007-06-09.22:45:42.000>
labels = ['extension-modules', 'performance']
title = 'Fast path for unicodedata.normalize()'
updated_at = <Date 2009-04-27.22:32:48.820>
user = 'https://bugs.python.org/raulir'

bugs.python.org fields:

activity = <Date 2009-04-27.22:32:48.820>
actor = 'pitrou'
assignee = 'pitrou'
closed = True
closed_date = <Date 2009-04-27.22:32:48.821>
closer = 'pitrou'
components = ['Extension Modules']
creation = <Date 2007-06-09.22:45:42.000>
creator = 'raulir'
dependencies = []
files = ['8042', '12350']
hgrepos = []
issue_num = 1734234
keywords = ['patch']
message_count = 6.0
messages = ['52743', '62437', '77790', '86594', '86600', '86705']
nosy_count = 5.0
nosy_names = ['loewis', 'pitrou', 'ajaksu2', 'raulir', 'vdupras']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'patch review'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue1734234'
versions = ['Python 3.1', 'Python 2.7']

raulir · 2007-06-09T22:45:42Z

Implements quick checking of already normalized forms as
described in http://unicode.org/reports/tr15/#Annex8

The patch is against 2.6 SVN trunk. Normalization test
passes on both UCS2 and UCS4 builds on Ubuntu Edgy.

API affected:

unicodedata.normalize('NFC', u'a') is u'a' and similar
expressions become true, as the unicode object is not
copied when it is found to be already normalized.
The documentation does not specify either way.

Added memory footprint:

A new 8-bit field is added to _PyUnicode_DatabaseRecord,
and the generated _PyUnicode_Database_Records
array grows from 219 records to 304 records. Each
record looks like this:

typedef struct {
   const unsigned char category;
   const unsigned char combining;
   const unsigned char bidirectional;
   const unsigned char mirrored;
   const unsigned char east_asian_width;
   const unsigned char normalization_quick_check;
} _PyUnicode_DatabaseRecord;

normalization_quick_check is the added field.

vdupras · 2008-02-15T20:13:14Z

It's a very interesting patch. I wonder why it fell into oblivion. stuff
like "unicode.normalize('NFC', u'\xe9')" was more than twice as fast for
me.

Making sure that all unicode is normalized can be a bottleneck in a lot
of applications (it somewhat is in my apps).

The downside is that it makes test_codecs and test_unicode_file fail.

pitrou · 2008-12-14T11:57:34Z

Here is a new patch against trunk, including the modified data files.
All tests pass and I confirm a very healthy speed-up (~ 5x) when trying
to a normalize an already normalized string. The slowdown for
non-normalized strings is so small that it cannot be distinguished from
measurement noise.

Martin, do you think this can be committed?

devdanzin · 2009-04-26T02:37:36Z

Should this be considered for 3.1?

loewis · 2009-04-26T05:03:36Z

The patch looks fine to me, please apply. One change is necessary: the
quick check should only be performed if it is the newest version (i.e.
self is NULL); otherwise, we would need to add a delta list for changed
quickcheck values, as well.

I think it would be possible to fold the NO and MAYBE answers into NO in
the database already, reducing the number of necessary bits to 4, and
then allowing to check with a simple bit test (i.e. no shift). OTOH, the
shift can be avoided already, by changing quickcheck_shift into a
bitmask. OTTH, perhaps the compiler does that already, anyway.

With a reduction of the number of bits, it would be possible to reclaim
a byte, by merging the bits into one of the other fields. Whether that's
worth it, I don't know.

pitrou · 2009-04-27T22:32:49Z

Committed in r72054, r72055. Thanks for the patch!

raulir mannequin added extension-modules C modules in the Modules dir labels Jun 9, 2007

pitrou added performance Performance or resource usage labels Dec 14, 2008

loewis mannequin assigned pitrou Apr 26, 2009

pitrou closed this as completed Apr 27, 2009

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fast path for unicodedata.normalize() #45076

Fast path for unicodedata.normalize() #45076

raulir mannequin commented Jun 9, 2007

raulir mannequin commented Jun 9, 2007

vdupras mannequin commented Feb 15, 2008

pitrou commented Dec 14, 2008

devdanzin mannequin commented Apr 26, 2009

loewis mannequin commented Apr 26, 2009

pitrou commented Apr 27, 2009

Fast path for unicodedata.normalize() #45076

Fast path for unicodedata.normalize() #45076

Comments

raulir mannequin commented Jun 9, 2007

raulir mannequin commented Jun 9, 2007

vdupras mannequin commented Feb 15, 2008

pitrou commented Dec 14, 2008

devdanzin mannequin commented Apr 26, 2009

loewis mannequin commented Apr 26, 2009

pitrou commented Apr 27, 2009