
32-bit wchar_t doesn't need to be unsigned to be usable (I think) #53027

Closed
stutzbach mannequin opened this issue May 21, 2010 · 19 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage topic-unicode

Comments

stutzbach mannequin commented May 21, 2010

BPO 8781
Nosy @malemburg, @loewis, @pitrou, @tjguk, @briancurtin, @florentx
Dependencies
  • bpo-9684: PC/pyconfig.h should define SIZEOF_WCHAR_T
  • Files
  • issue8781.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-08-25.19:21:24.603>
    created_at = <Date 2010-05-21.12:43:59.271>
    labels = ['interpreter-core', 'expert-unicode', 'performance']
    title = "32-bit wchar_t doesn't need to be unsigned to be usable\t(I think)"
    updated_at = <Date 2010-08-25.19:21:24.601>
    user = 'https://bugs.python.org/stutzbach'

    bugs.python.org fields:

    activity = <Date 2010-08-25.19:21:24.601>
    actor = 'stutzbach'
    assignee = 'stutzbach'
    closed = True
    closed_date = <Date 2010-08-25.19:21:24.603>
    closer = 'stutzbach'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2010-05-21.12:43:59.271>
    creator = 'stutzbach'
    dependencies = ['9684']
    files = ['18569']
    hgrepos = []
    issue_num = 8781
    keywords = ['patch', 'buildbot']
    message_count = 19.0
    messages = ['106235', '106236', '106240', '106241', '106242', '106522', '114238', '114242', '114246', '114251', '114838', '114883', '114891', '114893', '114894', '114895', '114896', '114916', '114928']
    nosy_count = 7.0
    nosy_names = ['lemburg', 'loewis', 'pitrou', 'tim.golden', 'stutzbach', 'brian.curtin', 'flox']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue8781'
    versions = ['Python 3.2']

    stutzbach mannequin commented May 21, 2010

    If ./configure detects that the system's wchar_t type is compatible, it will define "#define PY_UNICODE_TYPE wchar_t" and enable certain optimizations when converting between Py_UNICODE and wchar_t (i.e., it can just do a memcpy).

    Right now, ./configure considers wchar_t to be compatible if it is the same bit-width as Py_UNICODE and if wchar_t is unsigned. In practice, that means Python only uses wchar_t on Windows, which uses an unsigned 16-bit wchar_t. On Linux, wchar_t is 32-bit and signed.

In the original Unicode implementation for Python, Py_UNICODE was always 16-bit. I believe the "unsigned" requirement dates back to that time. A 32-bit wchar_t gives us plenty of room to hold the maximum Unicode code point of 0x10FFFF, whether wchar_t is signed or unsigned.

    I believe the condition could be relaxed to the following:

    • wchar_t must be the same bit-width as Py_UNICODE, and
    • if wchar_t is 16-bit, it must be unsigned

    That would allow a UCS4 Python to use wchar_t on Linux.

    I experimented by manually tweaking my pyconfig.h to treat Linux's signed 32-bit wchar_t as compatible. The unit test suite encountered no problems.

    However, it's quite possible that I'm missing some important detail here. Someone familiar with the guts of Python's Unicode implementation will presumably have a much better idea of whether I have this right or not. ;-)

    @stutzbach stutzbach mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode performance Performance or resource usage labels May 21, 2010
    pitrou commented May 21, 2010

    The problem with a signed Py_UNICODE is implicit sign extension (rather than zero extension) in some conversions, for example from "char" or "unsigned char" to "Py_UNICODE". The effects could go anywhere from incorrect results to plain crashes. Not only in our code, but in C extensions relying on the unsignedness of Py_UNICODE.

    Is there a way to enable those optimizations while keeping an unsigned Py_UNICODE type? It seems Py_UNICODE doesn't have to be typedef'ed to wchar_t, it can be defined to be an unsigned integer of the same width. Or would it break some part of the C standard?

    stutzbach mannequin commented May 21, 2010

    Usually you wouldn't want to cast a char directly to a Py_UNICODE, because you need to take into account the encoding of the char and map it to the appropriate Unicode character. The exception is when you're certain the char is 7-bit ASCII, which is a subset of Unicode; that's safe since 7-bit ASCII never uses the sign bit.

    However, again, I'm not an expert on the internals of Python's Unicode implementation and it's possible that I'm missing something. ;) You also raise a good point about third-party code.

    Your other suggestion is quite workable. ./configure could define HAVE_USABLE_WCHAR_T (which is used to enable the optimizations) when sizeof(wchar_t) == sizeof(Py_UNICODE), yet still define Py_UNICODE as unsigned. Using Google Code I could not find any instances of HAVE_USABLE_WCHAR_T being used outside of CPython itself.

    Another option would be to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T to enable the optimizations, instead of defined(HAVE_USABLE_WCHAR_T). The plus side is that we wouldn't be changing the semantics of anything.

    pitrou commented May 21, 2010

> Another option would be to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T to
> enable the optimizations, instead of defined(HAVE_USABLE_WCHAR_T).
> The plus side is that we wouldn't be changing the semantics of
> anything.

    I guess it's sufficient, indeed.
    Besides, the optimizations don't really seem to be used in any critical
    paths anyway...

    stutzbach mannequin commented May 21, 2010

    Yeah, this is a "I noticed this small optimization while working on something else" kind of thing. ;) ("something else" being bpo-8654)

    I can make a patch to change the #if's to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T, though I'll give others a few days to chime in first.

    @stutzbach stutzbach mannequin self-assigned this May 21, 2010
malemburg commented:
Antoine Pitrou wrote:

> The problem with a signed Py_UNICODE is implicit sign extension (rather than zero extension) in some conversions, for example from "char" or "unsigned char" to "Py_UNICODE". The effects could go anywhere from incorrect results to plain crashes. Not only in our code, but in C extensions relying on the unsignedness of Py_UNICODE.

    Right.

    The Unicode code was written with an unsigned data type in mind (range
    checks, conversions, etc.). We'd have to do some serious code review to
    allow switching to a signed data type.

> Is there a way to enable those optimizations while keeping an unsigned Py_UNICODE type? It seems Py_UNICODE doesn't have to be typedef'ed to wchar_t, it can be defined to be an unsigned integer of the same width. Or would it break some part of the C standard?

    The memcpy optimizations don't rely on the unsignedness of
    wchar_t, so they would work just as well.

    stutzbach mannequin commented Aug 18, 2010

    Attached is a patch implementing the change agreed upon in the earlier discussion. This will allow wchar_t <-> Py_UNICODE conversions to use memcpy on systems where wchar_t and Py_UNICODE have the same size but different signs (e.g., Linux).

malemburg commented:
Daniel Stutzbach wrote:

> Attached is a patch implementing the change agreed upon in the earlier discussion. This will allow wchar_t <-> Py_UNICODE conversions to use memcpy on systems where wchar_t and Py_UNICODE have the same size but different signs (e.g., Linux).

    Please make sure that those places where you replaced HAVE_USABLE_WCHAR_T
    are enclosed in

    #ifdef HAVE_WCHAR_H
    ...
    #endif

    sections. For unicodeobject.c that's true, not sure about the other
    places.

    stutzbach mannequin commented Aug 18, 2010

    Thanks. I dug into that a little just now, and it turns out to happen automatically.

    If ./configure doesn't define HAVE_WCHAR_H then it also will not define SIZEOF_WCHAR_T. If SIZEOF_WCHAR_T is not defined, the preprocessor will treat it as 0 causing it to be unequal to Py_UNICODE_SIZE. In other words, if HAVE_WCHAR_H is not defined then SIZEOF_WCHAR_T == Py_UNICODE_SIZE will be false.

malemburg commented:
Daniel Stutzbach wrote:

> Thanks. I dug into that a little just now, and it turns out to happen automatically.
>
> If ./configure doesn't define HAVE_WCHAR_H then it also will not define SIZEOF_WCHAR_T. If SIZEOF_WCHAR_T is not defined, the preprocessor will treat it as 0 causing it to be unequal to Py_UNICODE_SIZE. In other words, if HAVE_WCHAR_H is not defined then SIZEOF_WCHAR_T == Py_UNICODE_SIZE will be false.

    Ok, thanks for checking.

    Other than that detail, I think the patch looks good.

    stutzbach mannequin commented Aug 24, 2010

Committed in r84307.

    @stutzbach stutzbach mannequin closed this as completed Aug 24, 2010
    florentx mannequin commented Aug 25, 2010

    Hi Daniel,

there's a test failure related to r84307 on a Windows buildbot.

    ======================================================================
    FAIL: test_cw_strings (ctypes.test.test_parameters.SimpleTypesTestCase)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\ctypes\test\test_parameters.py", line 78, in test_cw_strings
        self.assertTrue(c_wchar_p.from_param(s)._obj is s)
    AssertionError: False is not True

    http://www.python.org/dev/buildbot/builders/x86%20XP-4%203.x/builds/2837

    @florentx florentx mannequin reopened this Aug 25, 2010
    stutzbach mannequin commented Aug 25, 2010

    Thanks, I will take a look sometime today.

    stutzbach mannequin commented Aug 25, 2010

    The underlying problem here is that SIZEOF_WCHAR_T is not defined in pyconfig.h on Windows. My patch assumed that it would be defined on all platforms where HAVE_WCHAR_H is defined (I had checked ./configure, but forgotten Windows).

    This has come up before and caused problems for other projects that assume including python.h will define SIZEOF_WCHAR_T on all platforms with HAVE_WCHAR_H:
    http://bugs.python.org/issue4474
    http://trac.wxwidgets.org/ticket/12013

    The problem with my patch can be solved in one of two ways:

    1. In PC/pyconfig.h, #define SIZEOF_WCHAR_T 2, or
    2. Change the #if's to: HAVE_USABLE_WCHAR_T || Py_UNICODE_SIZE == SIZEOF_WCHAR_T

    I prefer option #1, but it's also a more visible change than my original patch and may warrant its own issue. Thoughts?

    stutzbach mannequin commented Aug 25, 2010

    Adding other Windows developers to the nosy list. See msg114893 where your input would be helpful.

malemburg commented:
Daniel Stutzbach wrote:

> The underlying problem here is that SIZEOF_WCHAR_T is not defined in pyconfig.h on Windows. My patch assumed that it would be defined on all platforms where HAVE_WCHAR_H is defined (I had checked ./configure, but forgotten Windows).
>
> This has come up before and caused problems for other projects that assume including python.h will define SIZEOF_WCHAR_T on all platforms with HAVE_WCHAR_H:
> http://bugs.python.org/issue4474
> http://trac.wxwidgets.org/ticket/12013
>
> The problem with my patch can be solved in one of two ways:
>
> 1. In PC/pyconfig.h, #define SIZEOF_WCHAR_T 2, or
> 2. Change the #if's to: HAVE_USABLE_WCHAR_T || Py_UNICODE_SIZE == SIZEOF_WCHAR_T
>
> I prefer option #1, but it's also a more visible change than my original patch and may warrant its own issue. Thoughts?

If possible, we should do the right thing and implement #1.

    One thing I'm not sure about is how other Windows compilers deal
    with wchar_t, e.g. MinGW or the Borland compiler. I suppose
    that they all use the standard Windows C lib, so the parameter
    should be 2 for them as well, but I'm not 100% sure.

    stutzbach mannequin commented Aug 25, 2010

    On Windows, the Python headers define HAVE_USABLE_WCHAR_T and Py_UNICODE_SIZE 2, so we are already relying on sizeof(wchar_t) == 2 on Windows.

    My patch ran into trouble because it inadvertently disabled that assumption in a few places, causing code to take the slow path and convert between wchar_t * and Py_UNICODE *. The test that failed checks that the fast path was taken on Windows.

briancurtin commented:
    +1 on option 1

    stutzbach mannequin commented Aug 25, 2010

    I opened a separate issue for the SIZEOF_WCHAR_T issue so I could refer to that issue number in Misc/NEWS. Fixed in r84317.

    @stutzbach stutzbach mannequin closed this as completed Aug 25, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022