Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-43667: Fix broken Unicode encoding in non-UTF locales on Solaris #25096

Merged
merged 12 commits into from Apr 30, 2021

Conversation

kulikjak
Copy link
Contributor

@kulikjak kulikjak commented Mar 30, 2021

This PR fixes wchar_t issues on Oracle Solaris when non-UTF locales are in effect (see the issue for more info).

This is a work in progress.

https://bugs.python.org/issue43667

Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
@kulikjak
Copy link
Contributor Author

kulikjak commented Apr 1, 2021

Since this doesn't affect other Solaris derivatives, I also added a configure detection for Oracle Solaris and changed all #ifdef to reflect that.

configure.ac Outdated Show resolved Hide resolved
Include/unicodeobject.h Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Include/cpython/fileutils.h Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
@kulikjak
Copy link
Contributor Author

kulikjak commented Apr 6, 2021

I fixed some of your suggestions, but there are still some to fix, like wcstombs truncating the input string. And I am also missing unicode to wchar_t handling.

I am considering using iconv_open() and iconv(); that should make the conversion clearer (no need for wcstombs and mbrtoc32 functions) and better supported because it will always work no matter what encoding wchar_t uses.

It doesn't solve the truncating issue though. I will have to think this through.

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. The design now looks better. New review.

I would prefer to include <uchar.h> in Python.h (move the new functions to the internal C API, see my comment), and the issue with NULL characters is not solved yet.

Include/unicodeobject.h Outdated Show resolved Hide resolved
Include/cpython/fileutils.h Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Python/fileutils.c Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
@kulikjak
Copy link
Contributor Author

Sorry for such a delay - I had to dive deeper into locale stuff and tend to other more urgent things.

I just pushed my latest changes that incorporated your suggestions, changed conversion a little, and added conversion back for corresponding functions. It's tested quite extensively in 3.7 (which is the default on Solaris now), but not much in master (which is a little bit different).

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thanks, the updated PR looks simpler (thanks to iconv?).

Objects/unicodeobject.c Show resolved Hide resolved
Python/fileutils.c Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
wchar_t* result = _Py_ConvertWCharForm(unicode, size, "wchar_t", "UCS-4-INTERNAL");
if (!result) {
return NULL;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add an assertion: assert((wcslen(result) + 1) == size);. I understand that result cannot be shorter or longer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this always the case? You told me that _Py_ConvertWCharForm cannot truncate at first zero and then wcslen+1 doesn't have to correspond to size. Although I am not sure if that can happen when converting from UCS-4 to wchar_t.

The conversion is done in-place. Return a pointer to the wchar_t buffer
given as the first argument. Return NULL and raise exception on conversion
or memory allocation error. */
wchar_t *
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to return -1 on error and return 0 on success, so the caller uses _Py_ConvertWCharFormToNative_InPlace() < 0 to check for error which is common in Python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it as you suggested. At first, I was able to use that returned value nicely in PyUnicode_AsWideCharString, but then additional checks made it unnecessary.

the memory. Return NULL and raise exception on conversion or memory
allocation error. */
wchar_t *
_Py_ConvertWCharFormToUCS4(const wchar_t *native, Py_ssize_t size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand "Form" in "_Py_ConvertWCharFormToUCS4" name. I suggest the name "_Py_ConvertWCharToUCS4" or "_Py_DecodeNonUnicodeWchar".

Same remark for "Form" in "_Py_ConvertWCharFormToNative_InPlace" name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to decode/encode variant of _Py_DecodeNonUnicodeWchar.

Python/fileutils.c Show resolved Hide resolved
@vstinner vstinner merged commit 9032cf5 into python:master Apr 30, 2021
@vstinner
Copy link
Member

Thanks @kulikjak! I recall many issues in the locale issue because of that. Python raised an exception since Unicode characters were outside the [U+0000; U+10ffff] range.

@vstinner vstinner added the needs backport to 3.9 only security fixes label Apr 30, 2021
@miss-islington
Copy link
Contributor

Thanks @kulikjak for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.9.
🐍🍒⛏🤖

@miss-islington
Copy link
Contributor

Sorry, @kulikjak and @vstinner, I could not cleanly backport this to 3.9 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker 9032cf5cb1e33c0349089cfb0f6bf11ed3c30e86 3.9

@kulikjak
Copy link
Contributor Author

I thank you very much as well @vstinner! This was likely the biggest remaining issue with Python on Solaris (as you said, there were many [U+0000; U+10ffff] related problems reported before, and I hope this finally fixed that).

@vstinner
Copy link
Member

The automated backport to 3.9 failed. You requested 3.8 and 3.9 backports in https://bugs.python.org/issue43667

Do you want to try to backport the change to 3.9? Maybe your manual 3.9 backport can be automatically backported later to 3.8 (we can try the bot).

@kulikjak
Copy link
Contributor Author

Sure, I will backport this into 3.9.

I have two more questions:

  1. should I add a news entry for this change (via another PR)?
  2. I tried changing raised exceptions to UnicodeDecode/EncodeErrors instead of ValueError but when I force an exception, I am getting a very unexpected result:
ERROR: test_strftime (test.datetimetester.TestTimeTZ_Pure)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../cpython-master/Lib/test/datetimetester.py", line 3290, in test_strftime
    t.strftime('%H\ud800%M')
  File "/.../cpython-master/Lib/datetime.py", line 1457, in strftime
    return _wrap_strftime(self, fmt, timetuple)
  File "/.../cpython-master/Lib/datetime.py", line 262, in _wrap_strftime
    return _time.strftime(newformat, timetuple)
TypeError: function takes exactly 5 arguments (1 given)

Is this a known issue, or am I doing something wrong (I just replaced PyExc_ValueError with PyExc_UnicodeDecodeError)?

@vstinner
Copy link
Member

should I add a news entry for this change (via another PR)?

Documenting changes is always a good idea.

UnicodeDecode/EncodeErrors have a complex API. I don't think that it's worth it to both with that.

What is the current exception message for t.strftime('%H\ud800%M')?

@kulikjak
Copy link
Contributor Author

What is the current exception message for t.strftime('%H\ud800%M')?

With non-unicode locale it's ValueError: iconv() failed.

@vstinner
Copy link
Member

Implementing UnicodeEncodeError would require very precise information about the error: which characters have been encoded, which characters are causing the encoding error, info about the error. I'm not sure that iconv provides all required information. Anyway, it can be enhanced later. This change is already a huge step to the right direction ;-)

kulikjak added a commit to kulikjak/cpython that referenced this pull request May 3, 2021
…laris (pythonGH-25096).

(cherry picked from commit 9032cf5)

Co-authored-by: Jakub Kulík <Kulikjak@gmail.com>
@bedevere-bot
Copy link

GH-25847 is a backport of this pull request to the 3.9 branch.

@bedevere-bot bedevere-bot removed the needs backport to 3.9 only security fixes label May 3, 2021
@kulikjak
Copy link
Contributor Author

kulikjak commented May 3, 2021

Implementing UnicodeEncodeError would require very precise information about the error: which characters have been encoded, which characters are causing the encoding error, info about the error. I'm not sure that iconv provides all required information.

I see, I could have thought of that. Well, iconv doesn't provide such info so it's not trivial to do so. I might look into that later ;).

@kulikjak
Copy link
Contributor Author

I wanted to create a PR with news entry, but I am unsure whether that is a good idea before this gets backported (my reasoning being that news entry will get most likely auto backported, and then there will be news entry before the actual changes).

vstinner pushed a commit that referenced this pull request May 21, 2021
…laris (GH-25096) (GH-25847)

(cherry picked from commit 9032cf5)

Co-authored-by: Jakub Kulík <Kulikjak@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants