Use macros for surrogates in unicodeobject.c #56960

vstinner · 2011-08-15T10:06:24Z

BPO	12751
Nosy	@malemburg, @loewis, @terryjreedy, @abalkin, @pitrou, @vstinner, @benjaminp, @ezio-melotti
Superseder	bpo-10542: Py_UNICODE_NEXT and other macros for surrogates
Files	unicode_macros.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2011-08-16.13:22:49.760>
created_at = <Date 2011-08-15.10:06:23.967>
labels = []
title = 'Use macros for surrogates in unicodeobject.c'
updated_at = <Date 2011-08-16.13:22:49.759>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2011-08-16.13:22:49.759>
actor = 'benjamin.peterson'
assignee = 'none'
closed = True
closed_date = <Date 2011-08-16.13:22:49.760>
closer = 'benjamin.peterson'
components = []
creation = <Date 2011-08-15.10:06:23.967>
creator = 'vstinner'
dependencies = []
files = ['22901']
hgrepos = []
issue_num = 12751
keywords = ['patch']
message_count = 7.0
messages = ['142108', '142109', '142111', '142125', '142148', '142170', '142174']
nosy_count = 9.0
nosy_names = ['lemburg', 'loewis', 'terry.reedy', 'belopolsky', 'pitrou', 'vstinner', 'benjamin.peterson', 'ezio.melotti', 'tchrist']
pr_nums = []
priority = 'normal'
resolution = 'duplicate'
stage = None
status = 'closed'
superseder = '10542'
type = None
url = 'https://bugs.python.org/issue12751'
versions = ['Python 3.3']

vstinner · 2011-08-15T10:06:23Z

A lot of code is duplicated in unicodeobject.c to manipulate ("encode/decode") surrogates. Each function has from one to three different implementations. The new decode_ucs4() function adds a new implementation. Attached patch replaces this code by macros.

I think that only the implementations of IS_HIGH_SURROGATE and IS_LOW_SURROGATE are important for speed. ((ch & 0xFFFFFC00UL) == 0xD800) (from decode_ucs4) is *a little bit* faster than (0xD800 <= ch && ch <= 0xDBFF) on my CPU (Atom Z520 @ 1.3 GHz): running test_unicode 4 times takes ~54 sec instead of ~57 sec (-3%).

These 3 macros have to be checked, I wrote the first one:

#define IS_SURROGATE(ch) (((ch) & 0xFFFFF800UL) == 0xD800)
#define IS_HIGH_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xD800)
#define IS_LOW_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xDC00)

I added cast to Py_UCS4 in COMBINE_SURROGATES to avoid integer overflow if Py_UNICODE is 16 bits (narrow build). It's maybe useless.

#define COMBINE_SURROGATES(ch1, ch2) \
 (((((Py_UCS4)(ch1) & 0x3FF) << 10) | ((Py_UCS4)(ch2) & 0x3FF)) + 0x10000)

HIGH_SURROGATE and LOW_SURROGATE require that their ordinal argument has been preproceed to fit in [0; 0xFFFF]. I added this requirement in the comment of these macros. It would be better to have only one macro to do the two operations, but because "*p++" (dereference and increment) is usually used, I prefer to avoid one unique macro (I don't like passing *p++ in a macro using its argument more than once).

Or we may add a third macro using HIGH_SURROGATE and LOW_SURROGATE.

I rewrote the main loop of PyUnicode_EncodeUTF16() to avoid an useless test on ch2 on narrow build.

I also added a IS_NONBMP macro just because I prefer macro over hardcoded constants.

vstinner · 2011-08-15T10:10:09Z

We may use the following unlikely macro for IS_SURROGATE, IS_HIGH_SURROGATE and IS_LOW_SURROGATE:

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

I suppose that we should use microbenchmarks to validate these macros?

Should I open a new issue for this idea?

ezio-melotti · 2011-08-15T10:23:32Z

This has been proposed already in bpo-10542 (the issue also has patches).

pitrou · 2011-08-15T12:11:28Z

HIGH_SURROGATE and LOW_SURROGATE require that their ordinal argument
has been preproceed to fit in [0; 0xFFFF]. I added this requirement in
the comment of these macros.

The macros should preprocess the argument themselves. It will make the
code even simpler.
Otherwise +1.

vstinner · 2011-08-15T21:45:00Z

This has been proposed already in bpo-10542 (the issue also has patches).

The two issues are different: this issue is only a refactoring, whereas bpo-10542 adds a new "feature" (function/macro: Py_UNICODE_NEXT).

ezio-melotti · 2011-08-16T08:19:06Z

python/cpython#54751 proposes the following macros to factor out common code:
 #define _Py_UNICODE_ISSURROGATE
 #define _Py_UNICODE_ISHIGHSURROGATE
 #define _Py_UNICODE_ISLOWSURROGATE
 #define _Py_UNICODE_JOIN_SURROGATES
and to avoid checking for narrow/wide builds and recombine surrogates manually (so still refactoring):
 #define _Py_UNICODE_NEXT
 #define _Py_UNICODE_PUT_NEXT

Your patch proposes the same 4 macros:
 #define IS_SURROGATE
 #define IS_HIGH_SURROGATE
 #define IS_LOW_SURROGATE
 #define COMBINE_SURROGATES
+ 3 additional macros:
 #define IS_NONBMP
 #define HIGH_SURROGATE
 #define LOW_SURROGATE

So the two issue looks quite similar to me.

malemburg · 2011-08-16T08:31:38Z

Ezio Melotti wrote:

Ezio Melotti <ezio.melotti@gmail.com> added the comment:

bpo-10542 proposes the following macros to factor out common code:
#define _Py_UNICODE_ISSURROGATE
#define _Py_UNICODE_ISHIGHSURROGATE
#define _Py_UNICODE_ISLOWSURROGATE
#define _Py_UNICODE_JOIN_SURROGATES
and to avoid checking for narrow/wide builds and recombine surrogates manually (so still refactoring):
#define _Py_UNICODE_NEXT
#define _Py_UNICODE_PUT_NEXT

Your patch proposes the same 4 macros:
#define IS_SURROGATE
#define IS_HIGH_SURROGATE
#define IS_LOW_SURROGATE
#define COMBINE_SURROGATES

3 additional macros:
#define IS_NONBMP
#define HIGH_SURROGATE
#define LOW_SURROGATE

So the two issue looks quite similar to me.

Can we please fold this issue into bpo-10542. We've already had the
discussion there.

Thanks,

Marc-Andre Lemburg
eGenix.com

2011-10-04: PyCon DE 2011, Leipzig, Germany 49 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

benjaminp closed this as completed Aug 16, 2011

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use macros for surrogates in unicodeobject.c #56960

Use macros for surrogates in unicodeobject.c #56960

vstinner commented Aug 15, 2011

vstinner commented Aug 15, 2011

vstinner commented Aug 15, 2011

ezio-melotti commented Aug 15, 2011

pitrou commented Aug 15, 2011

vstinner commented Aug 15, 2011

ezio-melotti commented Aug 16, 2011

malemburg commented Aug 16, 2011

Use macros for surrogates in unicodeobject.c #56960

Use macros for surrogates in unicodeobject.c #56960

Comments

vstinner commented Aug 15, 2011

vstinner commented Aug 15, 2011

vstinner commented Aug 15, 2011

ezio-melotti commented Aug 15, 2011

pitrou commented Aug 15, 2011

vstinner commented Aug 15, 2011

ezio-melotti commented Aug 16, 2011

malemburg commented Aug 16, 2011

Thanks,