bpo-43667: Fix broken Unicode encoding in non-UTF locales on Solaris #25096
Since this doesn't affect other Solaris derivatives, I also added a configure detection for Oracle Solaris and changed all |
I fixed some of your suggestions, but there are still some to fix, like I am considering using It doesn't solve the truncating issue though. I will have to think this through. |
Thanks. The design now looks better. New review.
I would prefer to include <uchar.h> in Python.h (move the new functions to the internal C API, see my comment), and the issue with NULL characters is not solved yet.
Sorry for such a delay - I had to dive deeper into locale stuff and tend to other more urgent things. I just pushed my latest changes that incorporated your suggestions, changed conversion a little, and added conversion back for corresponding functions. It's tested quite extensively in 3.7 (which is the default on Solaris now), but not much in master (which is a little bit different).
Oh thanks, the updated PR looks simpler (thanks to iconv?).
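For readers unfamiliar with iconv, here is a minimal sketch of a byte-counted conversion through the POSIX `iconv` API. It is not the PR's implementation: the function name `convert_bytes` and its parameters are hypothetical, and the encoding names used in the usage note (`"UTF-8"`, `"UCS-4LE"`) are chosen only because they are widely available, not because the PR uses them.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, not part of CPython: convert in_bytes bytes from
   fromcode to tocode. Returns 0 on success and stores the number of bytes
   written in *written; returns -1 on any error. */
int
convert_bytes(const char *tocode, const char *fromcode,
              const char *in, size_t in_bytes,
              char *out, size_t out_bytes, size_t *written)
{
    assert(written != NULL);

    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1) {
        return -1;  /* unsupported encoding pair */
    }

    /* iconv() advances these pointers and decrements the counters. */
    char *inp = (char *)in;
    char *outp = out;
    size_t in_left = in_bytes;
    size_t out_left = out_bytes;

    size_t rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {
        return -1;  /* invalid input or output buffer too small */
    }
    *written = out_bytes - out_left;
    return 0;
}
```

Because iconv works on explicit byte counts rather than on null-terminated strings, a call such as `convert_bytes("UCS-4LE", "UTF-8", in, 3, out, sizeof out, &written)` on the three bytes `'a', '\0', 'b'` converts all three characters instead of stopping at the embedded null.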
wchar_t* result = _Py_ConvertWCharForm(unicode, size, "wchar_t", "UCS-4-INTERNAL");
if (!result) {
    return NULL;
}
Maybe add an assertion: assert((wcslen(result) + 1) == size); I understand that result cannot be shorter or longer.
Is this always the case? You told me that _Py_ConvertWCharForm cannot truncate at first zero and then wcslen+1 doesn't have to correspond to size. Although I am not sure if that can happen when converting from UCS-4 to wchar_t.
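To make the concern about the proposed assertion concrete, here is a small sketch showing when `wcslen(result) + 1 == size` holds and when it does not. The helper name `length_matches_size` is hypothetical and the buffers are illustrative; this is not code from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: size counts wchar_t units including the trailing
   terminator. Returns 1 when wcslen(buf) + 1 == size, which is only true
   if the buffer has no embedded null character; returns 0 otherwise. */
int
length_matches_size(const wchar_t *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    assert(buf[size - 1] == L'\0');

    /* wcslen() stops at the first null, embedded or not. */
    return wcslen(buf) + 1 == size;
}
```

For `L"abc"` with size 4 the check passes, but for the four units `a`, null, `b`, null it fails because `wcslen` returns 1. So the assertion is only safe if the converted string is known to contain no embedded null characters.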
Python/fileutils.c
Outdated
The conversion is done in-place. Return a pointer to the wchar_t buffer
given as the first argument. Return NULL and raise exception on conversion
or memory allocation error. */
wchar_t *
I suggest returning -1 on error and 0 on success, so the caller uses _Py_ConvertWCharFormToNative_InPlace() < 0 to check for an error, which is common in Python.
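A minimal sketch of that return convention, assuming a toy stand-in rather than the real function: `check_code_points_in_place` is a hypothetical name, and it only validates that each unit lies in the Unicode range [U+0000; U+10ffff] instead of performing any actual conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define MAX_CODE_POINT 0x10FFFFUL

/* Hypothetical stand-in for an in-place converter. A real converter would
   rewrite buf; this toy version only checks that every unit is a valid
   Unicode code point. Returns 0 on success and -1 on error. */
int
check_code_points_in_place(wchar_t *buf, size_t size)
{
    assert(buf != NULL);

    for (size_t i = 0; i < size; i++) {
        /* The cast also turns negative values (signed wchar_t) into
           large unsigned ones, so they are rejected as well. */
        if ((unsigned long)buf[i] > MAX_CODE_POINT) {
            return -1;
        }
    }
    return 0;
}
```

A caller then writes `if (check_code_points_in_place(buf, size) < 0) { /* handle error */ }`, and since the buffer is modified in place there is no need to return its pointer.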
I changed it as you suggested. At first, I was able to use that returned value nicely in PyUnicode_AsWideCharString, but then additional checks made it unnecessary.
Python/fileutils.c
Outdated
the memory. Return NULL and raise exception on conversion or memory
allocation error. */
wchar_t *
_Py_ConvertWCharFormToUCS4(const wchar_t *native, Py_ssize_t size)
I don't understand "Form" in "_Py_ConvertWCharFormToUCS4" name. I suggest the name "_Py_ConvertWCharToUCS4" or "_Py_DecodeNonUnicodeWchar".
Same remark for "Form" in "_Py_ConvertWCharFormToNative_InPlace" name.
Changed to decode/encode variant of _Py_DecodeNonUnicodeWchar.
Thanks @kulikjak! I recall many issues in the locale issue because of that. Python raised an exception since Unicode characters were outside the [U+0000; U+10ffff] range. |
Sorry, @kulikjak and @vstinner, I could not cleanly backport this to |
I thank you very much as well @vstinner! This was likely the biggest remaining issue with Python on Solaris (as you said, there were many [U+0000; U+10ffff] related problems reported before, and I hope this finally fixed that).
The automated backport to 3.9 failed. You requested 3.8 and 3.9 backports in https://bugs.python.org/issue43667. Do you want to try to backport the change to 3.9? Maybe your manual 3.9 backport can be automatically backported later to 3.8 (we can try the bot).
Sure, I will backport this into 3.9. I have two more questions:
Is this a known issue, or am I doing something wrong (I just replaced |
Documenting changes is always a good idea. UnicodeDecode/EncodeErrors have a complex API. I don't think that it's worth it to bother with that. What is the current exception message for |
With non-unicode locale it's |
Implementing UnicodeEncodeError would require very precise information about the error: which characters have been encoded, which characters are causing the encoding error, and info about the error. I'm not sure that iconv provides all the required information. Anyway, it can be enhanced later. This change is already a huge step in the right direction ;-)
…laris (pythonGH-25096). (cherry picked from commit 9032cf5) Co-authored-by: Jakub Kulík <Kulikjak@gmail.com>
GH-25847 is a backport of this pull request to the 3.9 branch.
I see, I could have thought of that. Well, |
I wanted to create a PR with a news entry, but I am unsure whether that is a good idea before this gets backported (my reasoning being that the news entry will most likely get auto-backported, and then there will be a news entry before the actual changes).
This PR fixes wchar_t issues on Oracle Solaris when non-UTF locales are in effect (see the issue for more info).
This is a work in progress.
https://bugs.python.org/issue43667