Codecs should raise precise UnicodeDecodeError or UnicodeEncodeError #85287

Closed
pitrou opened this issue Jun 25, 2020 · 14 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes easy stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@pitrou
Member

pitrou commented Jun 25, 2020

BPO 41115
Nosy @malemburg, @doerwalter, @pitrou, @benjaminp, @ezio-melotti, @serhiy-storchaka, @srinivasreddy, @eamanu, @utkarsh261
PRs
  • gh-85287: Convert UnicodeError to UnicodeEncodeError| UnicodeDecodeError  #21165
  • bpo-41115: Modified src to raise rather Unicode{Encode, Decode}Error rather than plain UnicodeError #21170
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2020-06-25.12:49:30.420>
    labels = ['easy', 'type-bug', '3.8', '3.9', '3.7', 'library']
    title = 'Codecs should raise precise UnicodeDecodeError or UnicodeEncodeError'
    updated_at = <Date 2020-06-26.16:59:17.544>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2020-06-26.16:59:17.544>
    actor = 'utk'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2020-06-25.12:49:30.420>
    creator = 'pitrou'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 41115
    keywords = ['patch', 'easy']
    message_count = 6.0
    messages = ['372367', '372368', '372369', '372373', '372431', '372433']
    nosy_count = 9.0
    nosy_names = ['lemburg', 'doerwalter', 'pitrou', 'benjamin.peterson', 'ezio.melotti', 'serhiy.storchaka', 'thatiparthy', 'eamanu', 'utk']
    pr_nums = ['21165', '21170']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue41115'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    Linked PRs

    @pitrou
    Member Author

    pitrou commented Jun 25, 2020

    A number of codecs raise bare UnicodeError, rather than Unicode{Decode,Encode}Error. Example:

    File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/encodings/utf_16.py", line 67, in _buffer_decode
    raise UnicodeError("UTF-16 stream does not start with BOM")

    A more complete list can be found here:
    https://gist.github.com/pitrou/60594b28d8e47edcdb97d9b15d5f9866
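To illustrate why the distinction matters to callers, here is a minimal sketch. The two decoder functions are hypothetical stand-ins, not the stdlib codecs; they only mimic the "missing BOM" failure, once with a bare UnicodeError and once with a precise UnicodeDecodeError.

```python
def decode_bare(data: bytes) -> str:
    """Hypothetical decoder that reports a missing BOM with a bare UnicodeError."""
    if not data.startswith((b"\xff\xfe", b"\xfe\xff")):
        raise UnicodeError("UTF-16 stream does not start with BOM")
    return data.decode("utf-16")


def decode_precise(data: bytes) -> str:
    """Hypothetical decoder that reports the same failure as UnicodeDecodeError."""
    if not data.startswith((b"\xff\xfe", b"\xfe\xff")):
        raise UnicodeDecodeError(
            "utf-16", data, 0, 0, "stream does not start with BOM"
        )
    return data.decode("utf-16")


def classify(decoder, data: bytes) -> str:
    """Return the name of the except clause a caller would end up in."""
    try:
        decoder(data)
    except UnicodeDecodeError:
        return "UnicodeDecodeError"
    except UnicodeError:
        return "bare UnicodeError"
    return "ok"


print(classify(decode_bare, b"ab"))
print(classify(decode_precise, b"ab"))
print(classify(decode_precise, "hi".encode("utf-16")))
```

A caller that only handles UnicodeDecodeError would miss the failure raised by the first variant, which is the practical cost of the bare exception.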

    @pitrou pitrou added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir easy type-bug An unexpected behavior, bug, or error labels Jun 25, 2020
    @srinivasreddy
    Mannequin

    srinivasreddy mannequin commented Jun 25, 2020

    This looks like an easy task. Shall I create a PR?

    @eamanu
    Mannequin

    eamanu mannequin commented Jun 25, 2020

    Hi,

    IMO this can be marked as an easy issue.

    @thatiparthy please, go ahead

    @doerwalter
    Contributor

    UnicodeEncodeError and UnicodeDecodeError are used to report un(en|de)codable ranges in the source object, so it wouldn't make sense to use them for errors that have nothing to do with problems in the source object. Their constructor requires 5 arguments (encoding, object, start, end, reason), not just a simple message: e.g. UnicodeEncodeError("utf-8", "foo", 17, 23, "bad string").

    But for reporting e.g. missing BOMs at the start it would be useful to use (0, 0) as the offending range.
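The constructor signature described above can be checked with a short sketch. The sample bytes and the reason string are illustrative assumptions; only the built-in exception classes are used.

```python
# A plain message is rejected: the constructor needs all five arguments.
try:
    UnicodeDecodeError("UTF-16 stream does not start with BOM")
except TypeError as exc:
    constructor_error = exc
else:
    constructor_error = None

# For a missing BOM, (0, 0) marks the very start of the input as the
# offending range, following the suggestion above.
missing_bom = UnicodeDecodeError(
    "utf-16", b"ab", 0, 0, "stream does not start with BOM"
)

print(type(constructor_error).__name__)
print(missing_bom.encoding, missing_bom.start, missing_bom.end)
print(missing_bom.reason)
print(isinstance(missing_bom, UnicodeError))
```

Because UnicodeDecodeError is a subclass of UnicodeError, code that already catches UnicodeError keeps working when a codec switches to the precise class.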

    @srinivasreddy
    Mannequin

    srinivasreddy mannequin commented Jun 26, 2020

    @utkarsh261
    Mannequin

    utkarsh261 mannequin commented Jun 26, 2020

    @thatiparthy These were the most logical changes: standard error messages that were already in the existing code, which I just edited as mentioned here. What part of your "work" do you think I copied?
    I sent this PR mostly to get familiar with the process; I will close it if you feel insecure. No need to be rude.
    Thanks.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @SahillMulani

    Hello @pitrou

    Is this issue still open?

    @arhadthedev
    Member

    @SahillMulani Unfortunately yes, and we need a PR. Your contribution is welcome; no assignment is required.

    @mblahay
    Contributor

    mblahay commented Apr 26, 2023

    @SahillMulani, I am looking to perhaps work on this issue, but will avoid it if you are. Please let me know what your status is.

    @mblahay
    Contributor

    mblahay commented Apr 26, 2023

    @pitrou Regarding the start and end values that are used to communicate where the bad values are, what should be done if the absolute location cannot be determined? There are portions of code, such as in punycode, where the encoded bytes are divided up for processing and the only positional data available at the time of the exception is the relative position within the given segment. Should 0,0 be used in a situation like this?
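One possible way to handle segment-relative positions, sketched under assumptions: the caller that splits the input tracks each segment's base offset and re-raises with absolute positions. Both functions and the "example" codec name are hypothetical; this is not how punycode or idna are implemented, and the thread does not settle on this approach.

```python
def decode_segment(segment: bytes) -> str:
    """Hypothetical per-segment decoder; it only knows segment-relative offsets."""
    for offset, byte in enumerate(segment):
        if byte >= 0x80:
            raise UnicodeDecodeError(
                "example", segment, offset, offset + 1, "non-ASCII byte"
            )
    return segment.decode("ascii")


def decode_labels(data: bytes) -> str:
    """Decode dot-separated segments, reporting errors at absolute offsets."""
    parts = []
    base = 0  # absolute offset of the current segment within data
    for segment in data.split(b"."):
        try:
            parts.append(decode_segment(segment))
        except UnicodeDecodeError as exc:
            # Rebase the relative range onto the whole input before re-raising.
            raise UnicodeDecodeError(
                exc.encoding, data, base + exc.start, base + exc.end, exc.reason
            ) from None
        base += len(segment) + 1  # skip the segment and its "." separator
    return ".".join(parts)


try:
    decode_labels(b"abc.d\xffe")
except UnicodeDecodeError as exc:
    print(exc.start, exc.end, exc.object[exc.start:exc.end])
```

The trade-off is that the outer function has to keep the original input around, which relates to the question of whether codec functions should save their arguments for error reporting.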

    @mblahay
    Contributor

    mblahay commented Apr 26, 2023

    @methane I was told you may be an interested party who could answer questions about Unicode exception handling. Regarding my question above, what are your expectations with regard to the reported start and end positions?

    @methane
    Member

    methane commented May 25, 2023

    It seems this issue is not fixed yet.

    C:

    > rg -t c -w PyExc_UnicodeError
    
    Modules/cjkcodecs/multibytecodec.c
    831:            PyErr_SetString(PyExc_UnicodeError,
    863:            PyErr_SetString(PyExc_UnicodeError, "pending buffer overflow");
    944:            PyErr_SetString(PyExc_UnicodeError, "pending buffer too large");
    984:        PyErr_SetString(PyExc_UnicodeError, "pending buffer too large");
    1271:        PyErr_SetString(PyExc_UnicodeError, "pending buffer too large");
    

    python:

    ~/w/p/cpython (main)> rg -t py 'raise UnicodeError'
    Lib/test/support/os_helper.py
    104:            raise UnicodeError
    
    Lib/test/test_array.py
    1042:                raise UnicodeError
    1047:            raise UnicodeError
    
    Lib/urllib/parse.py
    1047:            raise UnicodeError("URL " + repr(url) +
    
    Lib/encodings/punycode.py
    137:                raise UnicodeError("incomplete punicode string")
    145:            raise UnicodeError("Invalid extended code point '%s'"
    174:                raise UnicodeError("Invalid character U+%x" % char)
    206:            raise UnicodeError("Unsupported error handling "+errors)
    217:            raise UnicodeError("Unsupported error handling "+self.errors)
    
    Lib/encodings/utf_16.py
    67:                raise UnicodeError("UTF-16 stream does not start with BOM")
    141:            raise UnicodeError("UTF-16 stream does not start with BOM")
    
    Lib/encodings/utf_32.py
    62:                raise UnicodeError("UTF-32 stream does not start with BOM")
    136:            raise UnicodeError("UTF-32 stream does not start with BOM")
    
    Lib/encodings/undefined.py
    19:        raise UnicodeError("undefined encoding")
    22:        raise UnicodeError("undefined encoding")
    26:        raise UnicodeError("undefined encoding")
    30:        raise UnicodeError("undefined encoding")
    
    Lib/encodings/idna.py
    38:            raise UnicodeError("Invalid character %r" % c)
    50:            raise UnicodeError("Violation of BIDI requirement 2")
    56:            raise UnicodeError("Violation of BIDI requirement 3")
    71:        raise UnicodeError("label empty or too long")
    86:        raise UnicodeError("label empty or too long")
    90:        raise UnicodeError("Label starts with ACE prefix")
    101:    raise UnicodeError("label empty or too long")
    113:        raise UnicodeError("label way too long")
    130:            raise UnicodeError("Invalid character in IDN label")
    147:        raise UnicodeError("IDNA does not round-trip", label, label2)
    159:            raise UnicodeError("unsupported error handling "+errors)
    173:                    raise UnicodeError("label empty or too long")
    175:                raise UnicodeError("label too long")
    195:            raise UnicodeError("Unsupported error handling "+errors)
    230:            raise UnicodeError("unsupported error handling "+errors)
    264:            raise UnicodeError("Unsupported error handling "+errors)
    

    @jjsloboda
    Contributor

    Decided to give this a shot as I see it's still unresolved on the main branch. Made some decisions about how to do certain things just so I could ship something, but very much open to discussion on other approaches.

    The biggest question is whether it's worth having the codec functions save their original arguments just so they can be shown if an exception happens. It might not even be extra memory overhead since the caller is probably still holding a reference to the argument object in most cases.

    Also this is my first time looking at the CPython codebase and my first time using the C API so please let me know if I'm making any beginner mistakes.

    methane added a commit that referenced this issue Mar 17, 2024
    …deDecodeError (#113674)
    
    Co-authored-by: Inada Naoki <songofacandy@gmail.com>
    vstinner pushed a commit to vstinner/cpython that referenced this issue Mar 20, 2024
    … UnicodeDecodeError (python#113674)
    
    Co-authored-by: Inada Naoki <songofacandy@gmail.com>
    adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024
    … UnicodeDecodeError (python#113674)
    
    Co-authored-by: Inada Naoki <songofacandy@gmail.com>
    diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
    … UnicodeDecodeError (python#113674)
    
    Co-authored-by: Inada Naoki <songofacandy@gmail.com>
    @nineteendo
    Contributor

    Isn't this fixed?

    @methane methane closed this as completed May 19, 2024