Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use macros for surrogates in unicodeobject.c #56960

Closed
vstinner opened this issue Aug 15, 2011 · 7 comments
Closed

Use macros for surrogates in unicodeobject.c #56960

vstinner opened this issue Aug 15, 2011 · 7 comments

Comments

@vstinner
Copy link
Member

BPO 12751
Nosy @malemburg, @loewis, @terryjreedy, @abalkin, @pitrou, @vstinner, @benjaminp, @ezio-melotti
Superseder
  • bpo-10542: Py_UNICODE_NEXT and other macros for surrogates
  • Files
  • unicode_macros.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-08-16.13:22:49.760>
    created_at = <Date 2011-08-15.10:06:23.967>
    labels = []
    title = 'Use macros for surrogates in unicodeobject.c'
    updated_at = <Date 2011-08-16.13:22:49.759>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2011-08-16.13:22:49.759>
    actor = 'benjamin.peterson'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-08-16.13:22:49.760>
    closer = 'benjamin.peterson'
    components = []
    creation = <Date 2011-08-15.10:06:23.967>
    creator = 'vstinner'
    dependencies = []
    files = ['22901']
    hgrepos = []
    issue_num = 12751
    keywords = ['patch']
    message_count = 7.0
    messages = ['142108', '142109', '142111', '142125', '142148', '142170', '142174']
    nosy_count = 9.0
    nosy_names = ['lemburg', 'loewis', 'terry.reedy', 'belopolsky', 'pitrou', 'vstinner', 'benjamin.peterson', 'ezio.melotti', 'tchrist']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = None
    status = 'closed'
    superseder = '10542'
    type = None
    url = 'https://bugs.python.org/issue12751'
    versions = ['Python 3.3']

    @vstinner
    Copy link
    Member Author

    A lot of code is duplicated in unicodeobject.c to manipulate ("encode/decode") surrogates. Each function has from one to three different implementations. The new decode_ucs4() function adds a new implementation. Attached patch replaces this code by macros.

    I think that only the implementations of IS_HIGH_SURROGATE and IS_LOW_SURROGATE are important for speed. ((ch & 0xFFFFFC00UL) == 0xD800) (from decode_ucs4) is *a little bit* faster than (0xD800 <= ch && ch <= 0xDBFF) on my CPU (Atom Z520 @ 1.3 GHz): running test_unicode 4 times takes ~54 sec instead of ~57 sec (-3%).

    These 3 macros have to be checked, I wrote the first one:

    #define IS_SURROGATE(ch) (((ch) & 0xFFFFF800UL) == 0xD800)
    #define IS_HIGH_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xD800)
    #define IS_LOW_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xDC00)

    I added cast to Py_UCS4 in COMBINE_SURROGATES to avoid integer overflow if Py_UNICODE is 16 bits (narrow build). It's maybe useless.

    #define COMBINE_SURROGATES(ch1, ch2) \
     (((((Py_UCS4)(ch1) & 0x3FF) << 10) | ((Py_UCS4)(ch2) & 0x3FF)) + 0x10000)

    HIGH_SURROGATE and LOW_SURROGATE require that their ordinal argument has been preproceed to fit in [0; 0xFFFF]. I added this requirement in the comment of these macros. It would be better to have only one macro to do the two operations, but because "*p++" (dereference and increment) is usually used, I prefer to avoid one unique macro (I don't like passing *p++ in a macro using its argument more than once).

    Or we may add a third macro using HIGH_SURROGATE and LOW_SURROGATE.

    I rewrote the main loop of PyUnicode_EncodeUTF16() to avoid an useless test on ch2 on narrow build.

    I also added a IS_NONBMP macro just because I prefer macro over hardcoded constants.

    @vstinner
    Copy link
    Member Author

    We may use the following unlikely macro for IS_SURROGATE, IS_HIGH_SURROGATE and IS_LOW_SURROGATE:

    #define likely(x)	__builtin_expect(!!(x), 1)
    #define unlikely(x)	__builtin_expect(!!(x), 0)

    I suppose that we should use microbenchmarks to validate these macros?

    Should I open a new issue for this idea?

    @ezio-melotti
    Copy link
    Member

    This has been proposed already in bpo-10542 (the issue also has patches).

    @pitrou
    Copy link
    Member

    pitrou commented Aug 15, 2011

    HIGH_SURROGATE and LOW_SURROGATE require that their ordinal argument
    has been preproceed to fit in [0; 0xFFFF]. I added this requirement in
    the comment of these macros.

    The macros should preprocess the argument themselves. It will make the
    code even simpler.
    Otherwise +1.

    @vstinner
    Copy link
    Member Author

    This has been proposed already in bpo-10542 (the issue also has patches).

    The two issues are different: this issue is only a refactoring, whereas bpo-10542 adds a new "feature" (function/macro: Py_UNICODE_NEXT).

    @ezio-melotti
    Copy link
    Member

    python/cpython#54751 proposes the following macros to factor out common code:
     #define _Py_UNICODE_ISSURROGATE
     #define _Py_UNICODE_ISHIGHSURROGATE
     #define _Py_UNICODE_ISLOWSURROGATE
     #define _Py_UNICODE_JOIN_SURROGATES
    and to avoid checking for narrow/wide builds and recombine surrogates manually (so still refactoring):
     #define _Py_UNICODE_NEXT
     #define _Py_UNICODE_PUT_NEXT
    
    Your patch proposes the same 4 macros:
     #define IS_SURROGATE
     #define IS_HIGH_SURROGATE
     #define IS_LOW_SURROGATE
     #define COMBINE_SURROGATES
    + 3 additional macros:
     #define IS_NONBMP
     #define HIGH_SURROGATE
     #define LOW_SURROGATE

    So the two issue looks quite similar to me.

    @malemburg
    Copy link
    Member

    Ezio Melotti wrote:

    Ezio Melotti <ezio.melotti@gmail.com> added the comment:

    bpo-10542 proposes the following macros to factor out common code:
    #define _Py_UNICODE_ISSURROGATE
    #define _Py_UNICODE_ISHIGHSURROGATE
    #define _Py_UNICODE_ISLOWSURROGATE
    #define _Py_UNICODE_JOIN_SURROGATES
    and to avoid checking for narrow/wide builds and recombine surrogates manually (so still refactoring):
    #define _Py_UNICODE_NEXT
    #define _Py_UNICODE_PUT_NEXT

    Your patch proposes the same 4 macros:
    #define IS_SURROGATE
    #define IS_HIGH_SURROGATE
    #define IS_LOW_SURROGATE
    #define COMBINE_SURROGATES

    • 3 additional macros:
      #define IS_NONBMP
      #define HIGH_SURROGATE
      #define LOW_SURROGATE

    So the two issue looks quite similar to me.

    Can we please fold this issue into bpo-10542. We've already had the
    discussion there.

    Thanks,

    Marc-Andre Lemburg
    eGenix.com


    2011-10-04: PyCon DE 2011, Leipzig, Germany 49 days to go

    ::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

    eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
    Registered at Amtsgericht Duesseldorf: HRB 46611
    http://www.egenix.com/company/contact/

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    None yet
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants