bpo-43667: Fix broken Unicode encoding in non-UTF locales on Solaris #25096

kulikjak · 2021-03-30T10:12:22Z

This PR fixes wchar_t issues on Oracle Solaris when non-UTF locales are in effect (see the issue for more info).

This is a work in progress.

https://bugs.python.org/issue43667

Objects/unicodeobject.c


kulikjak · 2021-04-01T09:54:45Z

Since this doesn't affect other Solaris derivatives, I also added a configure detection for Oracle Solaris and changed all #ifdef to reflect that.

configure.ac

Include/unicodeobject.h

Objects/unicodeobject.c

Include/cpython/fileutils.h

Python/fileutils.c

kulikjak · 2021-04-06T12:28:54Z

I addressed some of your suggestions, but some remain, such as wcstombs truncating the input string. I am also still missing Unicode to wchar_t handling.

I am considering using iconv_open() and iconv(); that should make the conversion clearer (no need for wcstombs and mbrtoc32 functions) and better supported because it will always work no matter what encoding wchar_t uses.

It doesn't solve the truncation issue, though. I will have to think this through.
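The conversion idea in this comment can be pictured from Python, which does not expose iconv directly. The sketch below is only an analogue, not the PR's implementation. It assumes ISO-8859-2 as a stand-in for a non-UTF locale encoding, decodes locale-encoded bytes into code points, and lays them out as 4-byte UCS-4 (UTF-32) units. The function name is hypothetical.

```python
import struct


def locale_bytes_to_ucs4_units(data: bytes, encoding: str) -> list:
    """Decode locale-encoded bytes and return one UCS-4 code unit per character."""
    text = data.decode(encoding)
    # UTF-32 stores every code point in exactly four bytes (little-endian here).
    raw = text.encode("utf-32-le")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}I", raw))


# b"\xe8" is a single byte in ISO-8859-2 (assumed locale encoding for this sketch).
units = locale_bytes_to_ucs4_units(b"\xe8aj", "iso8859-2")
print([hex(unit) for unit in units])
```

Each resulting unit equals the Unicode code point of the character, which is the property that the native wchar_t values lack in the locales discussed here.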

vstinner

Thanks. The design now looks better. New review.

I would prefer to include <uchar.h> in Python.h (move the new functions to the internal C API, see my comment), and the issue with NULL characters is not solved yet.

Include/unicodeobject.h

Include/cpython/fileutils.h

Objects/unicodeobject.c

Python/fileutils.c

kulikjak · 2021-04-29T12:06:18Z

Sorry for such a delay - I had to dive deeper into locale stuff and tend to other more urgent things.

I just pushed my latest changes, which incorporate your suggestions, change the conversion a little, and add the reverse conversion for the corresponding functions. They are tested quite extensively on 3.7 (which is the default on Solaris now), but not much on master (which is a little different).

vstinner

Oh thanks, the updated PR looks simpler (thanks to iconv?).

Objects/unicodeobject.c

Python/fileutils.c

vstinner · 2021-04-29T23:15:54Z

Python/fileutils.c

+    wchar_t* result = _Py_ConvertWCharForm(unicode, size, "wchar_t", "UCS-4-INTERNAL");
+    if (!result) {
+        return NULL;
+    }


Maybe add an assertion: assert((wcslen(result) + 1) == size); I understand that result cannot be shorter or longer.

kulikjak · 2021-04-30

Is this always the case? You told me that _Py_ConvertWCharForm cannot truncate at the first null character, so wcslen(result) + 1 doesn't have to equal size. Although I am not sure whether that can happen when converting from UCS-4 to wchar_t.
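The concern about the proposed assertion can be illustrated with a small emulation. This is a sketch in Python rather than C: c_wcslen is a hypothetical helper that mimics how wcslen() stops at the first null character, and size is assumed to count every wchar_t slot including the terminator.

```python
def c_wcslen(buffer: str) -> int:
    """Emulate C wcslen(): number of characters before the first null."""
    end = buffer.find("\0")
    # Real wcslen() would read past the buffer if no terminator existed;
    # this emulation just returns the full length in that case.
    return len(buffer) if end == -1 else end


def assertion_would_hold(buffer: str) -> bool:
    """Check the proposed invariant wcslen(result) + 1 == size."""
    size = len(buffer)  # size counts every slot, terminator included
    return c_wcslen(buffer) + 1 == size


no_embedded_null = "abc\0"
embedded_null = "ab\0c\0"
print(assertion_would_hold(no_embedded_null), assertion_would_hold(embedded_null))
```

With an embedded null character the emulated wcslen() reports a shorter length, so the invariant only holds when the converted buffer contains no null before its terminator.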

vstinner · 2021-04-29T23:18:48Z

Python/fileutils.c

+   The conversion is done in-place. Return a pointer to the wchar_t buffer
+   given as the first argument. Return NULL and raise exception on conversion
+   or memory allocation error. */
+wchar_t *


I suggest returning -1 on error and 0 on success, so the caller can check for an error with _Py_ConvertWCharFormToNative_InPlace() < 0, which is common in Python.

kulikjak · 2021-04-30

I changed it as you suggested. At first, I was able to use that return value nicely in PyUnicode_AsWideCharString, but then additional checks made it unnecessary.

vstinner · 2021-04-29T23:21:11Z

Python/fileutils.c

+   the memory. Return NULL and raise exception on conversion or memory
+   allocation error. */
+wchar_t *
+_Py_ConvertWCharFormToUCS4(const wchar_t *native, Py_ssize_t size)


I don't understand the "Form" in the name "_Py_ConvertWCharFormToUCS4". I suggest "_Py_ConvertWCharToUCS4" or "_Py_DecodeNonUnicodeWchar" instead.

The same remark applies to "Form" in the name "_Py_ConvertWCharFormToNative_InPlace".

kulikjak · 2021-04-30

Renamed to decode/encode variants, following _Py_DecodeNonUnicodeWchar.

Python/fileutils.c

vstinner · 2021-04-30T13:22:25Z

Thanks @kulikjak! I recall many locale-related issues caused by this: Python raised an exception because Unicode characters were outside the [U+0000; U+10ffff] range.
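The range mentioned here is a hard limit of Python's str type, which is why an unconverted wchar_t value can end in an exception. A minimal sketch follows; the helper name is hypothetical and 0x110000 is only an illustrative out-of-range number, not an actual value produced by Solaris.

```python
import sys


def to_code_point_or_none(value: int):
    """Return the character for value, or None if it is not a valid code point."""
    try:
        return chr(value)
    except (ValueError, OverflowError):
        # chr() rejects anything outside the range 0..sys.maxunicode.
        return None


print(hex(sys.maxunicode))
print(to_code_point_or_none(0x10FFFF) is not None)
print(to_code_point_or_none(0x110000) is None)
```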

miss-islington · 2021-04-30T13:29:18Z

Thanks @kulikjak for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.9.
🐍🍒⛏🤖

miss-islington · 2021-04-30T13:29:27Z

Sorry, @kulikjak and @vstinner, I could not cleanly backport this to 3.9 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker 9032cf5cb1e33c0349089cfb0f6bf11ed3c30e86 3.9

kulikjak · 2021-04-30T13:30:49Z

I thank you very much as well @vstinner! This was likely the biggest remaining issue with Python on Solaris (as you said, there were many [U+0000; U+10ffff] related problems reported before, and I hope this finally fixed that).

vstinner · 2021-04-30T13:32:38Z

The automated backport to 3.9 failed. You requested 3.8 and 3.9 backports in https://bugs.python.org/issue43667

Do you want to try to backport the change to 3.9? Maybe your manual 3.9 backport can be automatically backported later to 3.8 (we can try the bot).

kulikjak · 2021-04-30T14:04:43Z

Sure, I will backport this into 3.9.

I have two more questions:

1. Should I add a news entry for this change (via another PR)?
2. I tried changing the raised exceptions to UnicodeDecode/EncodeErrors instead of ValueError, but when I force an exception, I get a very unexpected result:

ERROR: test_strftime (test.datetimetester.TestTimeTZ_Pure)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../cpython-master/Lib/test/datetimetester.py", line 3290, in test_strftime
    t.strftime('%H\ud800%M')
  File "/.../cpython-master/Lib/datetime.py", line 1457, in strftime
    return _wrap_strftime(self, fmt, timetuple)
  File "/.../cpython-master/Lib/datetime.py", line 262, in _wrap_strftime
    return _time.strftime(newformat, timetuple)
TypeError: function takes exactly 5 arguments (1 given)

Is this a known issue, or am I doing something wrong (I just replaced PyExc_ValueError with PyExc_UnicodeDecodeError)?

vstinner · 2021-04-30T14:28:23Z

> Should I add a news entry for this change (via another PR)?

Documenting changes is always a good idea.

UnicodeDecode/EncodeErrors have a complex API. I don't think that it's worth bothering with that.

What is the current exception message for t.strftime('%H\ud800%M')?

kulikjak · 2021-04-30T14:46:37Z

> What is the current exception message for t.strftime('%H\ud800%M')?

With a non-Unicode locale, it's ValueError: iconv() failed.

vstinner · 2021-04-30T16:38:55Z

Implementing UnicodeEncodeError would require very precise information about the error: which characters have been encoded, which characters are causing the encoding error, and info about the error. I'm not sure that iconv provides all the required information. Anyway, it can be enhanced later. This change is already a huge step in the right direction ;-)
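The TypeError reported earlier in the thread is consistent with this: Unicode error objects take five constructor arguments, so building one from a message alone fails. The sketch below shows this from Python, assuming that the C-level replacement raised the exception with only a message string. Both helper names are hypothetical.

```python
def build_with_message_only():
    """Try to create a UnicodeDecodeError from just a message string."""
    try:
        return UnicodeDecodeError("iconv() failed")
    except TypeError as exc:
        # The constructor needs encoding, object, start, end and reason.
        return exc


def encode_error_details(text: str, encoding: str = "utf-8"):
    """Return the five fields a UnicodeEncodeError carries, or None on success."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        return (exc.encoding, exc.object, exc.start, exc.end, exc.reason)
    return None


print(repr(build_with_message_only()))
print(encode_error_details("%H\ud800%M"))
```

The second helper shows the kind of detail a codec supplies for the lone surrogate in the test format string: the encoding name, the input, the start and end offsets of the offending character, and a reason.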


bedevere-bot · 2021-05-03T11:37:17Z

GH-25847 is a backport of this pull request to the 3.9 branch.

kulikjak · 2021-05-03T11:45:19Z

> Implementing UnicodeEncodeError would require very precise information about the error: which characters have been encoded, which characters are causing the encoding error, info about the error. I'm not sure that iconv provides all required information.

I see, I could have thought of that. Well, iconv doesn't provide such info so it's not trivial to do so. I might look into that later ;).

kulikjak · 2021-05-21T10:35:44Z

I wanted to create a PR with a news entry, but I am unsure whether that is a good idea before this gets backported (my reasoning being that the news entry will most likely be backported automatically, and then there will be a news entry before the actual changes).


fix wchar_t conversion on Solaris

59548ca

the-knights-who-say-ni added the CLA signed label Mar 30, 2021

bedevere-bot added the awaiting review label Mar 30, 2021

vstinner reviewed Mar 30, 2021


kulikjak added 2 commits April 1, 2021 10:57

Add configure detection of Oracle Solaris and its uncommon wchar_t handling

cb6452c

Rework wchar_t conversion based on suggestions

c01b792

vstinner reviewed Apr 1, 2021


Incorporate further suggestions

0fc7f54

vstinner reviewed Apr 6, 2021


kulikjak added 6 commits April 26, 2021 13:45

Incorporate additional suggestions and rework wchar_t conversion

5fdd1a2

Minor fix

bad1eba

Add reverse conversion from UCS-4 to native wchar_t

e8dd8d1

Fix one more issue introduced with bpo-35883

4aebe1d

Fix include

afaeaa2

Clean unused includes

627e460

Change return value of _Py_LocaleUsesNonUnicodeWchar to reflect its name

02a37ee

vstinner reviewed Apr 29, 2021


More improvements

00956fe

vstinner added the skip news label Apr 30, 2021

vstinner merged commit 9032cf5 into python:master Apr 30, 2021

bedevere-bot removed the awaiting review label Apr 30, 2021

vstinner added the needs backport to 3.9 only security fixes label Apr 30, 2021

miss-islington assigned vstinner Apr 30, 2021

kreathon pushed a commit to kreathon/cpython that referenced this pull request May 2, 2021

bpo-43667: Fix broken Unicode encoding in non-UTF locales on Solaris (pythonGH-25096)

f5834af

kulikjak added a commit to kulikjak/cpython that referenced this pull request May 3, 2021

[3.9] bpo-43667: Fix broken Unicode encoding in non-UTF locales on Solaris (pythonGH-25096). (cherry picked from commit 9032cf5) Co-authored-by: Jakub Kulík <Kulikjak@gmail.com>

e967f12

kulikjak mentioned this pull request May 3, 2021

[3.9] bpo-43667: Fix broken Unicode encoding in non-UTF locales on So… #25847

Merged

bedevere-bot removed the needs backport to 3.9 only security fixes label May 3, 2021

vstinner pushed a commit that referenced this pull request May 21, 2021

[3.9] bpo-43667: Fix broken Unicode encoding in non-UTF locales on Solaris (GH-25096) (GH-25847) (cherry picked from commit 9032cf5) Co-authored-by: Jakub Kulík <Kulikjak@gmail.com>

d3cc689

kulikjak mentioned this pull request May 27, 2021

bpo-43667: Add news fragment for changes in #25096 #26405

Merged
