-
-
Notifications
You must be signed in to change notification settings - Fork 30.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UnicodeDecodeError during load failure in non-UTF-8 locale #86060
Comments
If a native module fails to load, the dynload code will call PyUnicode_FromString on the error message to give back to the user. This can cause a UnicodeDecodeError if the locale is not a UTF-8 locale and the error message contains non-ASCII code points. While Linux systems almost always use a UTF-8 locale by default nowadays, AIX systems typically use non-UTF-8 locales by default. We encountered an issue where a customer did not have libbz2 installed, causing a load failure when bz2 tried to import _bz2 when running in an Italian locale: $ LC_ALL=it_IT python3 -c 'import bz2'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/QOpenSys/pkgs/lib/python3.6/bz2.py", line 21, in <module>
from _bz2 import BZ2Compressor, BZ2Decompressor
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 161: invalid continuation byte After switching to a UTF-8 locale, the problem goes away: $ LC_ALL=IT_IT python3 -c 'import bz2'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/QOpenSys/pkgs/lib/python3.6/bz2.py", line 21, in <module>
from _bz2 import BZ2Compressor, BZ2Decompressor
ImportError: 0509-022 Impossibile caricare il modulo /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so.
0509-150 Il modulo dipendente libbz2.so non è stato caricato.
0509-022 Impossibile caricare il modulo libbz2.so.
0509-026 Errore di sistema: Un file o una directory nel nome percorso non esiste.
0509-022 Impossibile caricare il modulo /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so.
0509-150 Il modulo dipendente /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so non è stato caricato. While this conceivably affects any Unix-like platform, the only system I can recreate it on is AIX and IBM i PASE. As far as I can tell, on Linux you will always get something like "error while loading shared libraries: libbz2.so.1.0: cannot open shared object file: No such file or directory". Even though there seems to be some translations in GLIBC, I have been unable to get them to be used on either Fedora or Ubuntu. |
I succeeded to reproduce it on Ubuntu 20.04. $ sudo vi /var/lib/locales/supported.d/ja # add "ja_JP.EUC-JP EUC-JP"
$ sudo locale-gen ja_JP.EUC-JP
Generating locales (this might take a while)...
ja_JP.EUC-JP... done
Generation complete.
$ chmod -r./build/lib.linux-x86_64-3.10/_sha3.cpython-310-x86_64-linux-gnu.so
$ LC_ALL=ja_JP.eucjp ./python
Python 3.10.0a0 (heads/master:fbf43f051e, Aug 17 2020, 15:13:52)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'ja_JP.eucjp'
>>> import _sha3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 101: invalid start byte Error message contains file path (byte string, probably encoded with fs encoding) and translated error message (encoded with locale encoding). I want to use "backslashescape" error handler, but both of PyUnicode_DecodeLocale() and PyUnicode_DecodeFSDefault() don't support it. After thinking about this several minutes, now I prefer PyUnicode_DecodeUTF8(msg, strlen(msg), "backslashreplace"). |
Glad you were able to reproduce on Linux. I have since changed the PR to use PyUnicode_DecodeFSDefault based on review feedback. I was going to say that you will have to fight it out with @methane on GH, but I see that that's you. :D Would have been nice if you would have left the updated feedback there as well so people who aren't familiar would know it's one person adjusting their recommendation vs two different people with conflicting recommendations. The only issue I see with using backslashreplace is that users of non-UTF-8 locales would see message text that contains non-ASCII characters only as escape codes. eg, the message above would show "Il modulo dipendente libbz2.so non \xe8 stato caricato." instead of "Il modulo dipendente libbz2.so non è stato caricato." By using PyUnicode_DecodeFSDefault instead, the message should be properly decoded but any encoding errors (such as utf-8 paths, etc) would be handled by surrogateescape. I guess the question comes to: what's more important to be decoded, the message text or the path? |
OK, I changd my b.p.o username.
The issue is not caused by backslashreplace, but by UTF-8 instead of locale. I notice it of course, but:
There is no guranatee that the message is properly decoded with fsencoding. Additionally, non-UTF-8 locale is quite rare on Unix systems, and users of such systems would be able to handle backslash escaped message, because they might see such message often. |
I think that it is more correct to use the locale encoding. If error messages are translated for readability, we should not ruin this by outputting \xXX. |
I don't against adding "backslashescape" to PyUnicode_DecodeLocale(). But to backport the bugfix for UnicodeDecodeError, change should be minimum. So the main problem is: should we allow surrogateescape in error message? For the record, PyUnicode_DecodeLocale() is using mbstowcs(). I don't know how reliable the function is in various platforms. That is why I had suggested PyUnicode_DecodeFSDefault() at first. |
Note that error message may be written to file, stream, structured log (JSON). They may be UTF-8:strict. We can not write surrogateescape-d string to them. |
In os.strerror() and PyErr_SetFromErrnoWithFilenameObjects() we use PyUnicode_DecodeLocale(s, "surrogateescape") for decoding the result of strerror(). |
OK. Let's use PyUnicode_DecodeLocale() with surrogateescape for consistency. |
Ok, so should I switch the PR back from PyUnicode_DecodeFSDefault? |
Yes, please. |
Thank you for finding/fixing. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: