Codecs should raise precise UnicodeDecodeError or UnicodeEncodeError #85287

pitrou · 2020-06-25T12:49:30Z

BPO	41115
Nosy	@malemburg, @doerwalter, @pitrou, @benjaminp, @ezio-melotti, @serhiy-storchaka, @srinivasreddy, @eamanu, @utkarsh261
PRs	gh-85287: Convert UnicodeError to UnicodeEncodeError| UnicodeDecodeError #21165; bpo-41115: Modified src to raise rather `Unicode{Encode, Decode}Error` rather than plain `UnicodeError` #21170

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2020-06-25.12:49:30.420>
labels = ['easy', 'type-bug', '3.8', '3.9', '3.7', 'library']
title = 'Codecs should raise precise UnicodeDecodeError or UnicodeEncodeError'
updated_at = <Date 2020-06-26.16:59:17.544>
user = 'https://github.com/pitrou'

bugs.python.org fields:

activity = <Date 2020-06-26.16:59:17.544>
actor = 'utk'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2020-06-25.12:49:30.420>
creator = 'pitrou'
dependencies = []
files = []
hgrepos = []
issue_num = 41115
keywords = ['patch', 'easy']
message_count = 6.0
messages = ['372367', '372368', '372369', '372373', '372431', '372433']
nosy_count = 9.0
nosy_names = ['lemburg', 'doerwalter', 'pitrou', 'benjamin.peterson', 'ezio.melotti', 'serhiy.storchaka', 'thatiparthy', 'eamanu', 'utk']
pr_nums = ['21165', '21170']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue41115'
versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

Linked PRs

gh-85287: Change codecs to raise precise UnicodeEncodeError and UnicodeDecodeError #113674


pitrou · 2020-06-25T12:49:30Z

A number of codecs raise bare UnicodeError, rather than Unicode{Decode,Encode}Error. Example:

File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/encodings/utf_16.py", line 67, in _buffer_decode
raise UnicodeError("UTF-16 stream does not start with BOM")

A more complete list can be found here:
https://gist.github.com/pitrou/60594b28d8e47edcdb97d9b15d5f9866
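For illustration, here is a minimal sketch of how the reported behaviour can be observed from user code. It assumes that the incremental UTF-16 decoder (the `_buffer_decode` path in the traceback above) is the one rejecting BOM-less input; the helper name `decode_bomless_utf16` is made up for this example. Which exact class is raised depends on the Python version, so the sketch only relies on `UnicodeError`, the common base class.

```python
import codecs


def decode_bomless_utf16(data: bytes) -> UnicodeError:
    """Feed BOM-less bytes to the incremental UTF-16 decoder; return the error raised."""
    decoder = codecs.getincrementaldecoder("utf-16")()
    try:
        decoder.decode(data, final=True)
    except UnicodeError as exc:
        # Catches bare UnicodeError as well as its UnicodeDecodeError subclass.
        return exc
    raise AssertionError("expected the decoder to reject BOM-less input")


err = decode_bomless_utf16(b"a\x00b\x00")
# The class name printed here differs between Python versions.
print(type(err).__name__, "-", err)
```

Code that catches `UnicodeError` keeps working either way; only code that catches `UnicodeDecodeError` specifically would miss the bare exception.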

srinivasreddy · 2020-06-25T13:11:35Z

This looks like an easy task. Shall I create a PR?

eamanu · 2020-06-25T14:13:03Z

Hi,

IMO this can be marked as an easy issue.

@thatiparthy please, go ahead

doerwalter · 2020-06-25T15:11:55Z

UnicodeEncodeError and UnicodeDecodeError are used to report un(en|de)codable ranges in the source object, so it wouldn't make sense to use them for errors that have nothing to do with problems in the source object. Their constructors require 5 arguments (encoding, object, start, end, reason), not just a simple message: e.g. UnicodeEncodeError("utf-8", "foo", 17, 23, "bad string").

But for reporting e.g. missing BOMs at the start it would be useful to use (0, 0) as the offending range.
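The constructor difference described above can be sketched as follows. The sample strings, byte values, and reason texts are illustrative assumptions, not taken from any codec; the `(0, 0)` range mirrors the suggestion for missing BOMs.

```python
# Both precise classes take exactly (encoding, object, start, end, reason).
enc_err = UnicodeEncodeError("ascii", "caf\u00e9", 3, 4, "ordinal not in range(128)")
print(enc_err.encoding, enc_err.start, enc_err.end, enc_err.reason)

# A lone message string, which plain UnicodeError accepts, is rejected here.
single_arg_rejected = False
try:
    UnicodeDecodeError("UTF-16 stream does not start with BOM")
except TypeError as exc:
    single_arg_rejected = True
    print("TypeError:", exc)

# Hypothetical missing-BOM report: the problem is at the very start of the
# input, so (0, 0) is passed as the offending range.
data = b"a\x00b\x00"
bom_err = UnicodeDecodeError("utf-16", data, 0, 0, "stream does not start with BOM")
print(bom_err.encoding, bom_err.object, bom_err.reason)
```

Note that `UnicodeEncodeError` expects a `str` as its object and `UnicodeDecodeError` a bytes-like object, so a codec has to keep the original input around to build either one.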

srinivasreddy · 2020-06-26T16:31:42Z

@utk You could have taken some other easy issue from https://bugs.python.org/issue?status=1&@sort=-activity&@columns=id%2Cactivity%2Ctitle%2Ccreator%2Cstatus&@dispname=Easy%20issues&@startwith=0&@group=priority&keywords=6&@action=search&@filter=&@pagesize=50 instead of copy-pasting my work.

utkarsh261 · 2020-06-26T16:59:17Z

@thatiparthy These were the most logical changes: standard error messages that were already there in the existing code. I just edited them as mentioned here. What part of your "work" do you think I copied?
I sent this PR mostly to get familiar with the process; I will close it if you feel insecure. No need to be rude.
Thanks.

SahillMulani · 2023-04-03T03:40:13Z

Hello @pitrou

Is this issue still open?

arhadthedev · 2023-04-03T04:28:59Z

@SahillMulani Unfortunately yes, and we need a PR. Your contribution is welcome; no assignment is required.

mblahay · 2023-04-26T06:38:44Z

@SahillMulani, I am looking to perhaps work on this issue, but will avoid it if you are. Please let me know what your status is.

mblahay · 2023-04-26T16:08:21Z

@pitrou Regarding the start and end values that are used to communicate where the bad values are, what should be done if the absolute location cannot be determined? There are portions of code, such as in punycode, where the encoded bytes are divided up for processing and the only positional data available at the time of the exception is the relative position within the given segment. Should 0,0 be used in a situation like this?
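One possible way to think about the relative-versus-absolute question is sketched below. This is not how punycode or any stdlib codec is implemented: `decode_labels` is a hypothetical helper that splits input on `b"."`, decodes each segment as ASCII, and re-raises with positions shifted by the segment's starting offset. It only works when the caller still knows where each segment began.

```python
def decode_labels(data: bytes, encoding: str = "ascii") -> list:
    """Decode dot-separated segments, reporting errors at absolute positions."""
    labels = []
    offset = 0  # absolute index in `data` where the current segment starts
    for segment in data.split(b"."):
        try:
            labels.append(segment.decode(encoding))
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to `segment`; shift them so
            # they index into the full input that the caller passed in.
            raise UnicodeDecodeError(
                encoding, data, offset + exc.start, offset + exc.end, exc.reason
            ) from None
        offset += len(segment) + 1  # +1 skips the "." separator
    return labels


try:
    decode_labels(b"abc.d\xffe")
except UnicodeDecodeError as exc:
    caught = exc
    print(caught.start, caught.end, caught.object[caught.start:caught.end])
```

If the segment's starting offset is not available at the point of failure, this translation is not possible, which is the situation the question above asks about.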

mblahay · 2023-04-26T16:19:36Z

@methane I was told you may be an interested party who could answer questions about Unicode exception handling. Regarding my question above, what are your expectations with regard to the reported start and end positions?

methane · 2023-05-25T06:54:36Z

It seems this issue is not fixed yet.

C:

> rg -t c -w PyExc_UnicodeError

Modules/cjkcodecs/multibytecodec.c
831:            PyErr_SetString(PyExc_UnicodeError,
863:            PyErr_SetString(PyExc_UnicodeError, "pending buffer overflow");
944:            PyErr_SetString(PyExc_UnicodeError, "pending buffer too large");
984:        PyErr_SetString(PyExc_UnicodeError, "pending buffer too large");
1271:        PyErr_SetString(PyExc_UnicodeError, "pending buffer too large");

python:

~/w/p/cpython (main)> rg -t py 'raise UnicodeError'
Lib/test/support/os_helper.py
104:            raise UnicodeError

Lib/test/test_array.py
1042:                raise UnicodeError
1047:            raise UnicodeError

Lib/urllib/parse.py
1047:            raise UnicodeError("URL " + repr(url) +

Lib/encodings/punycode.py
137:                raise UnicodeError("incomplete punicode string")
145:            raise UnicodeError("Invalid extended code point '%s'"
174:                raise UnicodeError("Invalid character U+%x" % char)
206:            raise UnicodeError("Unsupported error handling "+errors)
217:            raise UnicodeError("Unsupported error handling "+self.errors)

Lib/encodings/utf_16.py
67:                raise UnicodeError("UTF-16 stream does not start with BOM")
141:            raise UnicodeError("UTF-16 stream does not start with BOM")

Lib/encodings/utf_32.py
62:                raise UnicodeError("UTF-32 stream does not start with BOM")
136:            raise UnicodeError("UTF-32 stream does not start with BOM")

Lib/encodings/undefined.py
19:        raise UnicodeError("undefined encoding")
22:        raise UnicodeError("undefined encoding")
26:        raise UnicodeError("undefined encoding")
30:        raise UnicodeError("undefined encoding")

Lib/encodings/idna.py
38:            raise UnicodeError("Invalid character %r" % c)
50:            raise UnicodeError("Violation of BIDI requirement 2")
56:            raise UnicodeError("Violation of BIDI requirement 3")
71:        raise UnicodeError("label empty or too long")
86:        raise UnicodeError("label empty or too long")
90:        raise UnicodeError("Label starts with ACE prefix")
101:    raise UnicodeError("label empty or too long")
113:        raise UnicodeError("label way too long")
130:            raise UnicodeError("Invalid character in IDN label")
147:        raise UnicodeError("IDNA does not round-trip", label, label2)
159:            raise UnicodeError("unsupported error handling "+errors)
173:                    raise UnicodeError("label empty or too long")
175:                raise UnicodeError("label too long")
195:            raise UnicodeError("Unsupported error handling "+errors)
230:            raise UnicodeError("unsupported error handling "+errors)
264:            raise UnicodeError("Unsupported error handling "+errors)

jjsloboda · 2024-01-03T07:15:51Z

Decided to give this a shot as I see it's still unresolved on the main branch. Made some decisions about how to do certain things just so I could ship something, but very much open to discussion on other approaches.

The biggest question is whether it's worth having the codec functions save their original arguments just so they can be shown if an exception happens. It might not even be extra memory overhead since the caller is probably still holding a reference to the argument object in most cases.

Also this is my first time looking at the CPython codebase and my first time using the C API so please let me know if I'm making any beginner mistakes.

nineteendo · 2024-05-19T07:40:56Z

Isn't this fixed?

pitrou added the 3.7 (EOL), 3.8, 3.9, stdlib, easy, and type-bug labels Jun 25, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022

bedevere-bot mentioned this issue Apr 3, 2023

gh-85287: Convert UnicodeError to UnicodeEncodeError| UnicodeDecodeError #21165

Closed

jjsloboda added a commit to jjsloboda/cpython that referenced this issue Jan 3, 2024

fix issue pythongh-85287

7339989

bedevere-app bot mentioned this issue Jan 3, 2024

gh-85287: Change codecs to raise precise UnicodeEncodeError and UnicodeDecodeError #113674

Merged

methane added a commit that referenced this issue Mar 17, 2024

gh-85287: Change codecs to raise precise UnicodeEncodeError and Unico…

649857a

…deDecodeError (#113674) Co-authored-by: Inada Naoki <songofacandy@gmail.com>

vstinner pushed a commit to vstinner/cpython that referenced this issue Mar 20, 2024

pythongh-85287: Change codecs to raise precise UnicodeEncodeError and…

1ccbc3d

… UnicodeDecodeError (python#113674) Co-authored-by: Inada Naoki <songofacandy@gmail.com>

adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024

pythongh-85287: Change codecs to raise precise UnicodeEncodeError and…

8d804eb

… UnicodeDecodeError (python#113674) Co-authored-by: Inada Naoki <songofacandy@gmail.com>

diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024

pythongh-85287: Change codecs to raise precise UnicodeEncodeError and…

7c53561

… UnicodeDecodeError (python#113674) Co-authored-by: Inada Naoki <songofacandy@gmail.com>

methane closed this as completed May 19, 2024
