UnicodeDecodeError during load failure in non-UTF-8 locale #86060
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
assignee = None closed_at = <Date 2020-10-15.03:16:09.771> created_at = <Date 2020-09-30.17:26:23.229> labels = ['interpreter-core', 'type-bug', '3.8', '3.9', '3.10'] title = 'UnicodeDecodeError during load failure in non-UTF-8 locale' updated_at = <Date 2020-10-15.03:16:09.770> user = 'https://github.com/kadler'
activity = <Date 2020-10-15.03:16:09.770> actor = 'methane' assignee = 'none' closed = True closed_date = <Date 2020-10-15.03:16:09.771> closer = 'methane' components = ['Interpreter Core'] creation = <Date 2020-09-30.17:26:23.229> creator = 'kadler' dependencies =  files =  hgrepos =  issue_num = 41894 keywords = ['patch'] message_count = 15.0 messages = ['377713', '378033', '378170', '378211', '378220', '378223', '378224', '378226', '378228', '378251', '378298', '378656', '378657', '378658', '378659'] nosy_count = 4.0 nosy_names = ['methane', 'serhiy.storchaka', 'miss-islington', 'kadler'] pr_nums = ['22466', '22704', '22705'] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue41894' versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']
If a native module fails to load, the dynload code will call PyUnicode_FromString on the error message to give back to the user. This can cause a UnicodeDecodeError if the locale is not a UTF-8 locale and the error message contains non-ASCII code points.
While Linux systems almost always use a UTF-8 locale by default nowadays, AIX systems typically use non-UTF-8 locales by default. We encountered an issue where a customer did not have libbz2 installed, causing a load failure when bz2 tried to import _bz2 when running in an Italian locale:
$ LC_ALL=it_IT python3 -c 'import bz2'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/QOpenSys/pkgs/lib/python3.6/bz2.py", line 21, in <module>
    from _bz2 import BZ2Compressor, BZ2Decompressor
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 161: invalid continuation byte
After switching to a UTF-8 locale, the problem goes away:
$ LC_ALL=IT_IT python3 -c 'import bz2'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/QOpenSys/pkgs/lib/python3.6/bz2.py", line 21, in <module>
    from _bz2 import BZ2Compressor, BZ2Decompressor
ImportError: 0509-022 Impossibile caricare il modulo /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so.
    0509-150 Il modulo dipendente libbz2.so non è stato caricato.
    0509-022 Impossibile caricare il modulo libbz2.so.
    0509-026 Errore di sistema: Un file o una directory nel nome percorso non esiste.
    0509-022 Impossibile caricare il modulo /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so.
    0509-150 Il modulo dipendente /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so non è stato caricato.
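As a rough illustration of the two outcomes above, the strict UTF-8 decoding that PyUnicode_FromString performs can be mimicked in pure Python. This is only a sketch: the message text is copied from the AIX output, ISO-8859-1 is an assumed encoding for the it_IT locale, and decode_like_fromstring is a hypothetical helper, not a CPython function.

```python
# Message text copied from the AIX loader output in this report.
message = "0509-150 Il modulo dipendente libbz2.so non è stato caricato."

# Assumption: the non-UTF-8 it_IT locale uses ISO-8859-1.
latin1_bytes = message.encode("iso-8859-1")
utf8_bytes = message.encode("utf-8")


def decode_like_fromstring(raw: bytes) -> str:
    """Hypothetical stand-in for PyUnicode_FromString: strict UTF-8 only."""
    return raw.decode("utf-8", errors="strict")


# Bytes produced under the non-UTF-8 locale cannot be decoded strictly.
try:
    decode_like_fromstring(latin1_bytes)
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# Bytes produced under a UTF-8 locale decode without trouble.
utf8_result = decode_like_fromstring(utf8_bytes)

print(failure)
print(utf8_result)
```

The failing byte is the ISO-8859-1 encoding of "è" (0xe8), which UTF-8 treats as the start of a multi-byte sequence that the following space cannot continue.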
While this conceivably affects any Unix-like platform, the only systems I can recreate it on are AIX and IBM i PASE. As far as I can tell, on Linux you will always get something like "error while loading shared libraries: libbz2.so.1.0: cannot open shared object file: No such file or directory". Even though there seem to be some translations in GLIBC, I have been unable to get them to be used on either Fedora or Ubuntu.
I was able to reproduce it on Ubuntu 20.04.
$ sudo vi /var/lib/locales/supported.d/ja  # add "ja_JP.EUC-JP EUC-JP"
$ sudo locale-gen ja_JP.EUC-JP
Generating locales (this might take a while)...
  ja_JP.EUC-JP... done
Generation complete.
$ chmod -r ./build/lib.linux-x86_64-3.10/_sha3.cpython-310-x86_64-linux-gnu.so
$ LC_ALL=ja_JP.eucjp ./python
Python 3.10.0a0 (heads/master:fbf43f051e, Aug 17 2020, 15:13:52)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'ja_JP.eucjp'
>>> import _sha3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 101: invalid start byte
The error message contains a file path (a byte string, probably encoded with the filesystem encoding) and a translated error message (encoded with the locale encoding).
I want to use the "backslashreplace" error handler, but neither PyUnicode_DecodeLocale() nor PyUnicode_DecodeFSDefault() supports it.
After thinking about this for several minutes, I now prefer PyUnicode_DecodeUTF8(msg, strlen(msg), "backslashreplace").
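For readers who want to see what that call would produce, a Python-level analogue of PyUnicode_DecodeUTF8(msg, strlen(msg), "backslashreplace") is bytes.decode("utf-8", "backslashreplace"). The sketch below assumes a made-up message: the four high bytes are arbitrary non-UTF-8 bytes of the kind a EUC-JP message might contain, and the path is hypothetical.

```python
# Hypothetical loader message: arbitrary non-UTF-8 bytes followed by an ASCII path.
raw = b"\xb5\xf6\xb2\xc4: /tmp/x.so"

# Each byte that is not valid UTF-8 is replaced by a \xNN escape,
# so decoding cannot raise UnicodeDecodeError.
escaped = raw.decode("utf-8", errors="backslashreplace")

print(escaped)
```

The ASCII part of the message survives unchanged, while the undecodable bytes stay visible as escapes rather than aborting the import with an unrelated exception.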
Glad you were able to reproduce on Linux.
I have since changed the PR to use PyUnicode_DecodeFSDefault based on review feedback. I was going to say that you will have to fight it out with @methane on GH, but I see that that's you. :D It would have been nice if you had left the updated feedback there as well, so people who aren't familiar would know it's one person adjusting their recommendation rather than two different people with conflicting recommendations.
The only issue I see with using backslashreplace is that users of non-UTF-8 locales would see message text that contains non-ASCII characters only as escape codes. E.g., the message above would show "Il modulo dipendente libbz2.so non \xe8 stato caricato." instead of "Il modulo dipendente libbz2.so non è stato caricato." By using PyUnicode_DecodeFSDefault instead, the message should be properly decoded, but any encoding errors (such as UTF-8 paths, etc.) would be handled by surrogateescape.
I guess the question comes down to: what's more important to decode, the message text or the path?
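The trade-off can be made concrete with a small Python sketch. Everything here is an assumption for illustration: the path is hypothetical and stored as UTF-8, the message text is in ISO-8859-1, and decoding with "iso-8859-1" stands in for PyUnicode_DecodeFSDefault under a locale whose filesystem encoding is ISO-8859-1.

```python
# Hypothetical mixed-encoding error message: UTF-8 path + ISO-8859-1 text.
path = "/home/josé/lib/_bz2.so".encode("utf-8")
text = ": non è stato caricato.".encode("iso-8859-1")
raw = path + text

# Option 1: UTF-8 with backslashreplace. The path is readable,
# the translated text shows an escape code.
as_utf8 = raw.decode("utf-8", errors="backslashreplace")

# Option 2: locale encoding with surrogateescape. The translated text is
# readable, the UTF-8 path turns into mojibake. ISO-8859-1 accepts every
# byte, so surrogateescape never actually triggers in this case.
as_locale = raw.decode("iso-8859-1", errors="surrogateescape")

print(as_utf8)
print(as_locale)
```

Neither choice decodes both parts correctly, which is why the decision depends on which part matters more to the person reading the error.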
OK, I changed my b.p.o username.
The issue is not caused by backslashreplace, but by decoding as UTF-8 instead of the locale encoding. I noticed that, of course, but:
There is no guarantee that the message is properly decoded with the filesystem encoding.
Additionally, non-UTF-8 locales are quite rare on Unix systems, and users of such systems would be able to handle backslash-escaped messages, because they probably see such messages often.
I'm not against adding "backslashreplace" to PyUnicode_DecodeLocale(). But to backport the bugfix for the UnicodeDecodeError, the change should be minimal.
So the main problem is: should we allow surrogateescape in the error message?
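To show what allowing surrogateescape would mean in practice, here is a minimal Python sketch. It assumes the message fragment from earlier in this thread, encoded as ISO-8859-1 and decoded as UTF-8 with surrogateescape. It is not the code path the PR uses.

```python
# Fragment of the Italian message, with "è" as the single byte 0xe8.
raw = b"non \xe8 stato caricato."

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
msg = raw.decode("utf-8", errors="surrogateescape")

# A string containing lone surrogates cannot be encoded strictly,
# for example when it is written to a UTF-8 log file.
try:
    msg.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The original bytes can still be recovered with the same error handler.
round_trip = msg.encode("utf-8", errors="surrogateescape")

print(ascii(msg), strict_ok, round_trip == raw)
```

So the exception message would keep the original bytes recoverable, but any consumer that encodes it strictly would hit a second error.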
For the record, PyUnicode_DecodeLocale() uses mbstowcs(). I don't know how reliable that function is on various platforms. That is why I had suggested PyUnicode_DecodeFSDefault() at first.