Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeDecodeError during load failure in non-UTF-8 locale #86060

Closed
kadler mannequin opened this issue Sep 30, 2020 · 15 comments
Closed

UnicodeDecodeError during load failure in non-UTF-8 locale #86060

kadler mannequin opened this issue Sep 30, 2020 · 15 comments
Labels
3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@kadler
Copy link
Mannequin

kadler mannequin commented Sep 30, 2020

BPO 41894
Nosy @methane, @serhiy-storchaka, @miss-islington, @kadler
PRs
  • bpo-41894: Fix UTF-8 errors loading native module #22466
  • [3.9] bpo-41894: Fix UnicodeDecodeError while loading native module (GH-22466) #22704
  • [3.8] bpo-41894: Fix UnicodeDecodeError while loading native module (GH-22466) #22705
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-10-15.03:16:09.771>
    created_at = <Date 2020-09-30.17:26:23.229>
    labels = ['interpreter-core', 'type-bug', '3.8', '3.9', '3.10']
    title = 'UnicodeDecodeError during load failure in non-UTF-8 locale'
    updated_at = <Date 2020-10-15.03:16:09.770>
    user = 'https://github.com/kadler'

    bugs.python.org fields:

    activity = <Date 2020-10-15.03:16:09.770>
    actor = 'methane'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-10-15.03:16:09.771>
    closer = 'methane'
    components = ['Interpreter Core']
    creation = <Date 2020-09-30.17:26:23.229>
    creator = 'kadler'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 41894
    keywords = ['patch']
    message_count = 15.0
    messages = ['377713', '378033', '378170', '378211', '378220', '378223', '378224', '378226', '378228', '378251', '378298', '378656', '378657', '378658', '378659']
    nosy_count = 4.0
    nosy_names = ['methane', 'serhiy.storchaka', 'miss-islington', 'kadler']
    pr_nums = ['22466', '22704', '22705']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue41894'
    versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

    @kadler
    Copy link
    Mannequin Author

    kadler mannequin commented Sep 30, 2020

    If a native module fails to load, the dynload code will call PyUnicode_FromString on the error message to give back to the user. This can cause a UnicodeDecodeError if the locale is not a UTF-8 locale and the error message contains non-ASCII code points.

    While Linux systems almost always use a UTF-8 locale by default nowadays, AIX systems typically use non-UTF-8 locales by default. We encountered an issue where a customer did not have libbz2 installed, causing a load failure when bz2 tried to import _bz2 when running in an Italian locale:

    $ LC_ALL=it_IT python3 -c 'import bz2'        
    Traceback (most recent call last): 
     File "<string>", line 1, in <module> 
     File "/QOpenSys/pkgs/lib/python3.6/bz2.py", line 21, in <module> 
       from _bz2 import BZ2Compressor, BZ2Decompressor 
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 161: invalid continuation byte

    After switching to a UTF-8 locale, the problem goes away:

    $ LC_ALL=IT_IT python3 -c 'import bz2'   
    Traceback (most recent call last): 
     File "<string>", line 1, in <module> 
     File "/QOpenSys/pkgs/lib/python3.6/bz2.py", line 21, in <module> 
       from _bz2 import BZ2Compressor, BZ2Decompressor 
    ImportError:    0509-022 Impossibile caricare il modulo /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so. 
           0509-150   Il modulo dipendente libbz2.so non è stato caricato. 
           0509-022 Impossibile caricare il modulo libbz2.so. 
           0509-026 Errore di sistema: Un file o una directory nel nome percorso non esiste. 
           0509-022 Impossibile caricare il modulo /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so. 
           0509-150   Il modulo dipendente /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so non è stato caricato.

    While this conceivably affects any Unix-like platform, the only system I can recreate it on is AIX and IBM i PASE. As far as I can tell, on Linux you will always get something like "error while loading shared libraries: libbz2.so.1.0: cannot open shared object file: No such file or directory". Even though there seems to be some translations in GLIBC, I have been unable to get them to be used on either Fedora or Ubuntu.

    @kadler kadler mannequin added 3.7 (EOL) end of life 3.10 only security fixes 3.8 (EOL) end of life 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Sep 30, 2020
    @taleinat taleinat removed the 3.7 (EOL) end of life label Oct 2, 2020
    @methane
    Copy link
    Member

    methane commented Oct 5, 2020

    I succeeded to reproduce it on Ubuntu 20.04.

        $ sudo vi /var/lib/locales/supported.d/ja # add "ja_JP.EUC-JP EUC-JP"
        $ sudo locale-gen ja_JP.EUC-JP
        Generating locales (this might take a while)...
        ja_JP.EUC-JP... done
        Generation complete.
        $ chmod -r./build/lib.linux-x86_64-3.10/_sha3.cpython-310-x86_64-linux-gnu.so
        $ LC_ALL=ja_JP.eucjp ./python
        Python 3.10.0a0 (heads/master:fbf43f051e, Aug 17 2020, 15:13:52)
        [GCC 9.3.0] on linux
        Type "help", "copyright", "credits" or "license" for more information.
        >>> import locale
        >>> locale.setlocale(locale.LC_ALL, "")
        'ja_JP.eucjp'
        >>> import _sha3
        Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 101: invalid start byte

    Error message contains file path (byte string, probably encoded with fs encoding) and translated error message (encoded with locale encoding).

    I want to use "backslashescape" error handler, but both of PyUnicode_DecodeLocale() and PyUnicode_DecodeFSDefault() don't support it.

    After thinking about this several minutes, now I prefer PyUnicode_DecodeUTF8(msg, strlen(msg), "backslashreplace").
    It fixes the issue with minimum behavior change, although error message is still backslashescaped.
    It might be the best practice for creating Unicode object from C error message like strerror(3).

    @kadler
    Copy link
    Mannequin Author

    kadler mannequin commented Oct 7, 2020

    Glad you were able to reproduce on Linux.

    I have since changed the PR to use PyUnicode_DecodeFSDefault based on review feedback. I was going to say that you will have to fight it out with @methane on GH, but I see that that's you. :D Would have been nice if you would have left the updated feedback there as well so people who aren't familiar would know it's one person adjusting their recommendation vs two different people with conflicting recommendations.

    The only issue I see with using backslashreplace is that users of non-UTF-8 locales would see message text that contains non-ASCII characters only as escape codes. eg, the message above would show "Il modulo dipendente libbz2.so non \xe8 stato caricato." instead of "Il modulo dipendente libbz2.so non è stato caricato." By using PyUnicode_DecodeFSDefault instead, the message should be properly decoded but any encoding errors (such as utf-8 paths, etc) would be handled by surrogateescape.

    I guess the question comes to: what's more important to be decoded, the message text or the path?

    @methane
    Copy link
    Member

    methane commented Oct 8, 2020

    I have since changed the PR to use PyUnicode_DecodeFSDefault based on review feedback. I was going to say that you will have to fight it out with @methane on GH, but I see that that's you. :D Would have been nice if you would have left the updated feedback there as well so people who aren't familiar would know it's one person adjusting their recommendation vs two different people with conflicting recommendations.

    OK, I changd my b.p.o username.

    The only issue I see with using backslashreplace is that users of non-UTF-8 locales would see message text that contains non-ASCII characters only as escape codes. eg, the message above would show "Il modulo dipendente libbz2.so non \xe8 stato caricato." instead of "Il modulo dipendente libbz2.so non è stato caricato."

    The issue is not caused by backslashreplace, but by UTF-8 instead of locale. I notice it of course, but:

    • Using UTF-8 is status quo. UTF-8:backslashreplace is the simplest fix approach.
    • There is no guarantee that the error message can be decoded by locale encoding. Backslash escape is much better than "ignore" or "surrogateescape".

    By using PyUnicode_DecodeFSDefault instead, the message should be properly decoded but any encoding errors (such as utf-8 paths, etc) would be handled by surrogateescape.

    There is no guranatee that the message is properly decoded with fsencoding.
    And surrogateescape is for round-tripping bytes path, not for human readable.
    Error message should be human readable. So backslashreplace is better than surrogateescape.

    Additionally, non-UTF-8 locale is quite rare on Unix systems, and users of such systems would be able to handle backslash escaped message, because they might see such message often.

    @serhiy-storchaka
    Copy link
    Member

    I think that it is more correct to use the locale encoding. If error messages are translated for readability, we should not ruin this by outputting \xXX.

    @methane
    Copy link
    Member

    methane commented Oct 8, 2020

    I think that it is more correct to use the locale encoding. If error messages are translated for readability, we should not ruin this by outputting \xXX.

    • PyUnicode_DecodeLocale() doesn't support "backslashescape" error handler.
    • Error message is usually encoded in locale encoding, but it is not guaranteed.
    • Error message may contain path, it may be not locale encoding too.
    • \xXX is far better than UnicodeDecodeError, anyway. We need to fix the UnicodeDecodeError first.
    • non-UTF-8 locale is rare. We used this code for long time but we haven't reported this issue until now.

    I don't against adding "backslashescape" to PyUnicode_DecodeLocale(). But to backport the bugfix for UnicodeDecodeError, change should be minimum.

    So the main problem is: should we allow surrogateescape in error message?

    For the record, PyUnicode_DecodeLocale() is using mbstowcs(). I don't know how reliable the function is in various platforms. That is why I had suggested PyUnicode_DecodeFSDefault() at first.

    @methane
    Copy link
    Member

    methane commented Oct 8, 2020

    So the main problem is: should we allow surrogateescape in error message?

    Note that error message may be written to file, stream, structured log (JSON). They may be UTF-8:strict. We can not write surrogateescape-d string to them.

    @serhiy-storchaka
    Copy link
    Member

    In os.strerror() and PyErr_SetFromErrnoWithFilenameObjects() we use PyUnicode_DecodeLocale(s, "surrogateescape") for decoding the result of strerror().

    @methane
    Copy link
    Member

    methane commented Oct 8, 2020

    OK. Let's use PyUnicode_DecodeLocale() with surrogateescape for consistency.

    @kadler
    Copy link
    Mannequin Author

    kadler mannequin commented Oct 8, 2020

    Ok, so should I switch the PR back from PyUnicode_DecodeFSDefault?

    @methane
    Copy link
    Member

    methane commented Oct 9, 2020

    Yes, please.

    @methane
    Copy link
    Member

    methane commented Oct 15, 2020

    New changeset 2d2af32 by Kevin Adler in branch 'master':
    bpo-41894: Fix UnicodeDecodeError while loading native module (GH-22466)
    2d2af32

    @miss-islington
    Copy link
    Contributor

    New changeset 47ca679 by Miss Skeleton (bot) in branch '3.8':
    bpo-41894: Fix UnicodeDecodeError while loading native module (GH-22466)
    47ca679

    @miss-islington
    Copy link
    Contributor

    New changeset f07448b by Miss Skeleton (bot) in branch '3.9':
    bpo-41894: Fix UnicodeDecodeError while loading native module (GH-22466)
    f07448b

    @methane
    Copy link
    Member

    methane commented Oct 15, 2020

    Thank you for finding/fixing.

    @methane methane closed this as completed Oct 15, 2020
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants