Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast path for unicodedata.normalize() #45076

Closed
raulir mannequin opened this issue Jun 9, 2007 · 6 comments
Closed

Fast path for unicodedata.normalize() #45076

raulir mannequin opened this issue Jun 9, 2007 · 6 comments
Assignees
Labels
extension-modules C modules in the Modules dir performance Performance or resource usage

Comments

@raulir
Copy link
Mannequin

raulir mannequin commented Jun 9, 2007

BPO 1734234
Nosy @loewis, @pitrou, @devdanzin
Files
  • normalization_fast_path.patch
  • uninorm.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/pitrou'
    closed_at = <Date 2009-04-27.22:32:48.821>
    created_at = <Date 2007-06-09.22:45:42.000>
    labels = ['extension-modules', 'performance']
    title = 'Fast path for unicodedata.normalize()'
    updated_at = <Date 2009-04-27.22:32:48.820>
    user = 'https://bugs.python.org/raulir'

    bugs.python.org fields:

    activity = <Date 2009-04-27.22:32:48.820>
    actor = 'pitrou'
    assignee = 'pitrou'
    closed = True
    closed_date = <Date 2009-04-27.22:32:48.821>
    closer = 'pitrou'
    components = ['Extension Modules']
    creation = <Date 2007-06-09.22:45:42.000>
    creator = 'raulir'
    dependencies = []
    files = ['8042', '12350']
    hgrepos = []
    issue_num = 1734234
    keywords = ['patch']
    message_count = 6.0
    messages = ['52743', '62437', '77790', '86594', '86600', '86705']
    nosy_count = 5.0
    nosy_names = ['loewis', 'pitrou', 'ajaksu2', 'raulir', 'vdupras']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue1734234'
    versions = ['Python 3.1', 'Python 2.7']

    @raulir
    Copy link
    Mannequin Author

    raulir mannequin commented Jun 9, 2007

    Implements quick checking of already normalized forms as
    described in http://unicode.org/reports/tr15/#Annex8

    The patch is against 2.6 SVN trunk. Normalization test
    passes on both UCS2 and UCS4 builds on Ubuntu Edgy.

    API affected:

    unicodedata.normalize('NFC', u'a') is u'a' and similar
    expressions become true, as the unicode object is not
    copied when it is found to be already normalized.
    The documentation does not specify either way.

    Added memory footprint:

    A new 8-bit field is added to _PyUnicode_DatabaseRecord,
    and the generated _PyUnicode_Database_Records
    array grows from 219 records to 304 records. Each
    record looks like this:

    typedef struct {
       const unsigned char category;
       const unsigned char combining;
       const unsigned char bidirectional;
       const unsigned char mirrored;
       const unsigned char east_asian_width;
       const unsigned char normalization_quick_check;
    } _PyUnicode_DatabaseRecord;

    normalization_quick_check is the added field.

    @raulir raulir mannequin added extension-modules C modules in the Modules dir labels Jun 9, 2007
    @vdupras
    Copy link
    Mannequin

    vdupras mannequin commented Feb 15, 2008

    It's a very interesting patch. I wonder why it fell into oblivion. stuff
    like "unicode.normalize('NFC', u'\xe9')" was more than twice as fast for
    me.

    Making sure that all unicode is normalized can be a bottleneck in a lot
    of applications (it somewhat is in my apps).

    The downside is that it makes test_codecs and test_unicode_file fail.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 14, 2008

    Here is a new patch against trunk, including the modified data files.
    All tests pass and I confirm a very healthy speed-up (~ 5x) when trying
    to a normalize an already normalized string. The slowdown for
    non-normalized strings is so small that it cannot be distinguished from
    measurement noise.

    Martin, do you think this can be committed?

    @pitrou pitrou added performance Performance or resource usage labels Dec 14, 2008
    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Apr 26, 2009

    Should this be considered for 3.1?

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Apr 26, 2009

    The patch looks fine to me, please apply. One change is necessary: the
    quick check should only be performed if it is the newest version (i.e.
    self is NULL); otherwise, we would need to add a delta list for changed
    quickcheck values, as well.

    I think it would be possible to fold the NO and MAYBE answers into NO in
    the database already, reducing the number of necessary bits to 4, and
    then allowing to check with a simple bit test (i.e. no shift). OTOH, the
    shift can be avoided already, by changing quickcheck_shift into a
    bitmask. OTTH, perhaps the compiler does that already, anyway.

    With a reduction of the number of bits, it would be possible to reclaim
    a byte, by merging the bits into one of the other fields. Whether that's
    worth it, I don't know.

    @loewis loewis mannequin assigned pitrou Apr 26, 2009
    @pitrou
    Copy link
    Member

    pitrou commented Apr 27, 2009

    Committed in r72054, r72055. Thanks for the patch!

    @pitrou pitrou closed this as completed Apr 27, 2009
    @pitrou pitrou closed this as completed Apr 27, 2009
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    extension-modules C modules in the Modules dir performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant