
32-bit wchar_t doesn't need to be unsigned to be usable (I think) #53027

Closed
stutzbach mannequin opened this issue May 21, 2010 · 19 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage topic-unicode

Comments

stutzbach mannequin commented May 21, 2010

BPO 8781
Nosy @malemburg, @loewis, @pitrou, @tjguk, @briancurtin, @florentx
Dependencies
  • bpo-9684: PC/pyconfig.h should define SIZEOF_WCHAR_T
  • Files
  • issue8781.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-08-25.19:21:24.603>
    created_at = <Date 2010-05-21.12:43:59.271>
    labels = ['interpreter-core', 'expert-unicode', 'performance']
    title = "32-bit wchar_t doesn't need to be unsigned to be usable\t(I think)"
    updated_at = <Date 2010-08-25.19:21:24.601>
    user = 'https://bugs.python.org/stutzbach'

    bugs.python.org fields:

    activity = <Date 2010-08-25.19:21:24.601>
    actor = 'stutzbach'
    assignee = 'stutzbach'
    closed = True
    closed_date = <Date 2010-08-25.19:21:24.603>
    closer = 'stutzbach'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2010-05-21.12:43:59.271>
    creator = 'stutzbach'
    dependencies = ['9684']
    files = ['18569']
    hgrepos = []
    issue_num = 8781
    keywords = ['patch', 'buildbot']
    message_count = 19.0
    messages = ['106235', '106236', '106240', '106241', '106242', '106522', '114238', '114242', '114246', '114251', '114838', '114883', '114891', '114893', '114894', '114895', '114896', '114916', '114928']
    nosy_count = 7.0
    nosy_names = ['lemburg', 'loewis', 'pitrou', 'tim.golden', 'stutzbach', 'brian.curtin', 'flox']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue8781'
    versions = ['Python 3.2']

    stutzbach mannequin commented May 21, 2010

    If ./configure detects that the system's wchar_t type is compatible, it will define "#define PY_UNICODE_TYPE wchar_t" and enable certain optimizations when converting between Py_UNICODE and wchar_t (i.e., it can just do a memcpy).

    Right now, ./configure considers wchar_t to be compatible if it is the same bit-width as Py_UNICODE and if wchar_t is unsigned. In practice, that means Python only uses wchar_t on Windows, which uses an unsigned 16-bit wchar_t. On Linux, wchar_t is 32-bit and signed.

In the original Unicode implementation for Python, Py_UNICODE was always 16-bit. I believe the "unsigned" requirement dates back to that time. A 32-bit wchar_t gives us plenty of room to hold the maximum Unicode code point of 0x10FFFF, whether wchar_t is signed or unsigned.

    I believe the condition could be relaxed to the following:

    • wchar_t must be the same bit-width as Py_UNICODE, and
    • if wchar_t is 16-bit, it must be unsigned

    That would allow a UCS4 Python to use wchar_t on Linux.

    I experimented by manually tweaking my pyconfig.h to treat Linux's signed 32-bit wchar_t as compatible. The unit test suite encountered no problems.

    However, it's quite possible that I'm missing some important detail here. Someone familiar with the guts of Python's Unicode implementation will presumably have a much better idea of whether I have this right or not. ;-)

    @stutzbach stutzbach mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode performance Performance or resource usage labels May 21, 2010
    pitrou commented May 21, 2010

    The problem with a signed Py_UNICODE is implicit sign extension (rather than zero extension) in some conversions, for example from "char" or "unsigned char" to "Py_UNICODE". The effects could go anywhere from incorrect results to plain crashes. Not only in our code, but in C extensions relying on the unsignedness of Py_UNICODE.

    Is there a way to enable those optimizations while keeping an unsigned Py_UNICODE type? It seems Py_UNICODE doesn't have to be typedef'ed to wchar_t, it can be defined to be an unsigned integer of the same width. Or would it break some part of the C standard?

    stutzbach mannequin commented May 21, 2010

    Usually you wouldn't want to cast a char directly to a Py_UNICODE, because you need to take into account the encoding of the char and map it to the appropriate Unicode character. The exception is when you're certain the char is 7-bit ASCII, which is a subset of Unicode; that's safe since 7-bit ASCII never uses the sign bit.

    However, again, I'm not an expert on the internals of Python's Unicode implementation and it's possible that I'm missing something. ;) You also raise a good point about third-party code.

    Your other suggestion is quite workable. ./configure could define HAVE_USABLE_WCHAR_T (which is used to enable the optimizations) when sizeof(wchar_t) == sizeof(Py_UNICODE), yet still define Py_UNICODE as unsigned. Using Google Code I could not find any instances of HAVE_USABLE_WCHAR_T being used outside of CPython itself.

    Another option would be to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T to enable the optimizations, instead of defined(HAVE_USABLE_WCHAR_T). The plus side is that we wouldn't be changing the semantics of anything.

    pitrou commented May 21, 2010

> Another option would be to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T to
> enable the optimizations, instead of defined(HAVE_USABLE_WCHAR_T).
> The plus side is that we wouldn't be changing the semantics of
> anything.

    I guess it's sufficient, indeed.
    Besides, the optimizations don't really seem to be used in any critical
    paths anyway...

    stutzbach mannequin commented May 21, 2010

    Yeah, this is a "I noticed this small optimization while working on something else" kind of thing. ;) ("something else" being bpo-8654)

    I can make a patch to change the #if's to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T, though I'll give others a few days to chime in first.

    @stutzbach stutzbach mannequin self-assigned this May 21, 2010
malemburg commented:
Antoine Pitrou wrote:

> The problem with a signed Py_UNICODE is implicit sign extension (rather than zero extension) in some conversions, for example from "char" or "unsigned char" to "Py_UNICODE". The effects could go anywhere from incorrect results to plain crashes. Not only in our code, but in C extensions relying on the unsignedness of Py_UNICODE.

    Right.

    The Unicode code was written with an unsigned data type in mind (range
    checks, conversions, etc.). We'd have to do some serious code review to
    allow switching to a signed data type.

> Is there a way to enable those optimizations while keeping an unsigned Py_UNICODE type? It seems Py_UNICODE doesn't have to be typedef'ed to wchar_t, it can be defined to be an unsigned integer of the same width. Or would it break some part of the C standard?

    The memcpy optimizations don't rely on the unsignedness of
    wchar_t, so they would work just as well.

    stutzbach mannequin commented Aug 18, 2010

    Attached is a patch implementing the change agreed upon in the earlier discussion. This will allow wchar_t <-> Py_UNICODE conversions to use memcpy on systems where wchar_t and Py_UNICODE have the same size but different signs (e.g., Linux).

malemburg commented:
Daniel Stutzbach wrote:

> Attached is a patch implementing the change agreed upon in the earlier discussion. This will allow wchar_t <-> Py_UNICODE conversions to use memcpy on systems where wchar_t and Py_UNICODE have the same size but different signs (e.g., Linux).

    Please make sure that those places where you replaced HAVE_USABLE_WCHAR_T
    are enclosed in

    #ifdef HAVE_WCHAR_H
    ...
    #endif

    sections. For unicodeobject.c that's true, not sure about the other
    places.

    stutzbach mannequin commented Aug 18, 2010

    Thanks. I dug into that a little just now, and it turns out to happen automatically.

    If ./configure doesn't define HAVE_WCHAR_H then it also will not define SIZEOF_WCHAR_T. If SIZEOF_WCHAR_T is not defined, the preprocessor will treat it as 0 causing it to be unequal to Py_UNICODE_SIZE. In other words, if HAVE_WCHAR_H is not defined then SIZEOF_WCHAR_T == Py_UNICODE_SIZE will be false.

malemburg commented:
Daniel Stutzbach wrote:

> Thanks. I dug into that a little just now, and it turns out to happen automatically.
>
> If ./configure doesn't define HAVE_WCHAR_H then it also will not define SIZEOF_WCHAR_T. If SIZEOF_WCHAR_T is not defined, the preprocessor will treat it as 0 causing it to be unequal to Py_UNICODE_SIZE. In other words, if HAVE_WCHAR_H is not defined then SIZEOF_WCHAR_T == Py_UNICODE_SIZE will be false.

    Ok, thanks for checking.

    Other than that detail, I think the patch looks good.

    stutzbach mannequin commented Aug 24, 2010

Committed in r84307.

    @stutzbach stutzbach mannequin closed this as completed Aug 24, 2010
    florentx mannequin commented Aug 25, 2010

    Hi Daniel,

there's a test failure related to r84307 on a Windows buildbot.

    ======================================================================
    FAIL: test_cw_strings (ctypes.test.test_parameters.SimpleTypesTestCase)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\ctypes\test\test_parameters.py", line 78, in test_cw_strings
        self.assertTrue(c_wchar_p.from_param(s)._obj is s)
    AssertionError: False is not True

    http://www.python.org/dev/buildbot/builders/x86%20XP-4%203.x/builds/2837

    @florentx florentx mannequin reopened this Aug 25, 2010
    stutzbach mannequin commented Aug 25, 2010

    Thanks, I will take a look sometime today.

    stutzbach mannequin commented Aug 25, 2010

    The underlying problem here is that SIZEOF_WCHAR_T is not defined in pyconfig.h on Windows. My patch assumed that it would be defined on all platforms where HAVE_WCHAR_H is defined (I had checked ./configure, but forgotten Windows).

    This has come up before and caused problems for other projects that assume including python.h will define SIZEOF_WCHAR_T on all platforms with HAVE_WCHAR_H:
    http://bugs.python.org/issue4474
    http://trac.wxwidgets.org/ticket/12013

    The problem with my patch can be solved in one of two ways:

    1. In PC/pyconfig.h, #define SIZEOF_WCHAR_T 2, or
    2. Change the #if's to: HAVE_USABLE_WCHAR_T || Py_UNICODE_SIZE == SIZEOF_WCHAR_T

    I prefer option #1, but it's also a more visible change than my original patch and may warrant its own issue. Thoughts?

    stutzbach mannequin commented Aug 25, 2010

    Adding other Windows developers to the nosy list. See msg114893 where your input would be helpful.

malemburg commented:
Daniel Stutzbach wrote:

> The underlying problem here is that SIZEOF_WCHAR_T is not defined in pyconfig.h on Windows. My patch assumed that it would be defined on all platforms where HAVE_WCHAR_H is defined (I had checked ./configure, but forgotten Windows).
>
> This has come up before and caused problems for other projects that assume including python.h will define SIZEOF_WCHAR_T on all platforms with HAVE_WCHAR_H:
> http://bugs.python.org/issue4474
> http://trac.wxwidgets.org/ticket/12013
>
> The problem with my patch can be solved in one of two ways:
>
> 1. In PC/pyconfig.h, #define SIZEOF_WCHAR_T 2, or
> 2. Change the #if's to: HAVE_USABLE_WCHAR_T || Py_UNICODE_SIZE == SIZEOF_WCHAR_T
>
> I prefer option #1, but it's also a more visible change than my original patch and may warrant its own issue. Thoughts?

If possible, we should do the right thing and implement #1.

    One thing I'm not sure about is how other Windows compilers deal
    with wchar_t, e.g. MinGW or the Borland compiler. I suppose
    that they all use the standard Windows C lib, so the parameter
    should be 2 for them as well, but I'm not 100% sure.

    stutzbach mannequin commented Aug 25, 2010

    On Windows, the Python headers define HAVE_USABLE_WCHAR_T and Py_UNICODE_SIZE 2, so we are already relying on sizeof(wchar_t) == 2 on Windows.

    My patch ran into trouble because it inadvertently disabled that assumption in a few places, causing code to take the slow path and convert between wchar_t * and Py_UNICODE *. The test that failed checks that the fast path was taken on Windows.

briancurtin commented:
    +1 on option 1

    stutzbach mannequin commented Aug 25, 2010

    I opened a separate issue for the SIZEOF_WCHAR_T issue so I could refer to that issue number in Misc/NEWS. Fixed in r84317.

    @stutzbach stutzbach mannequin closed this as completed Aug 25, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022