gh-85287: Change codecs to raise precise UnicodeEncodeError and UnicodeDecodeError #113674

jjsloboda · 2024-01-03T07:05:54Z

Change codecs to raise UnicodeEncodeError and UnicodeDecodeError, depending on what operation the API user is attempting, instead of just UnicodeError. This involved providing more information to the exceptions, including the problematic string and start and end offsets within the string, which are not always relevant depending on the nature of the exception.
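As a rough illustration of the extra information these exceptions carry, the sketch below uses the built-in ascii codec rather than any codec touched by this PR. The helper name `describe` is made up for the example.

```python
def describe(exc: UnicodeError) -> dict:
    """Collect the location fields carried by the precise subclasses."""
    return {
        "type": type(exc).__name__,
        "encoding": exc.encoding,
        "start": exc.start,   # index of the first offending item
        "end": exc.end,       # index one past the last offending item
        "reason": exc.reason,
    }


try:
    "caf\xe9".encode("ascii")
except UnicodeError as exc:  # the base class still catches the subclass
    encode_info = describe(exc)

try:
    b"caf\xe9".decode("ascii")
except UnicodeError as exc:
    decode_info = describe(exc)

print(encode_info)
print(decode_info)
```

Both exceptions are still caught by `except UnicodeError`, so code that already handles the base class should keep working.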

Some of the codecs perform multiple encodes and decodes during an operation as part of their implementation. In these cases, I've opted to raise the exception that matches the API function in which the error occurs.

A tricky thing to watch out for while reviewing is that UnicodeEncodeError takes a str and UnicodeDecodeError takes a bytes object, each of which is supposed to be the original object being encoded or decoded. However, most of these codec functions modify their arguments, so by the time the exception block is reached, the original argument is no longer available and needs to be reconstructed. Alternatively, I could change the PR to save the argument, but I don't think it's worth the extra memory just for the exception case (though that may be more Pythonic).
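The type requirement can be checked directly. This is a small sketch, with `try_build` as a hypothetical helper and "example reason" as a placeholder message:

```python
def try_build(exc_type, obj):
    """Construct the exception, or return the TypeError raised instead."""
    try:
        return exc_type("idna", obj, 0, 1, "example reason")
    except TypeError as err:
        return err


ok_encode = try_build(UnicodeEncodeError, "abc")    # str: accepted
ok_decode = try_build(UnicodeDecodeError, b"abc")   # bytes: accepted
bad_encode = try_build(UnicodeEncodeError, b"abc")  # bytes: rejected
bad_decode = try_build(UnicodeDecodeError, "abc")   # str: rejected

for result in (ok_encode, ok_decode, bad_encode, bad_decode):
    print(type(result).__name__, "|", result)
```

This is why an except block that only has a transformed copy of the argument must rebuild an object of the right type before raising.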

Another thing to watch for in the review is to make sure I'm not doing anything in the exception blocks that could trigger more en/decoding exceptions when I'm reconstructing the argument strings.

Issue: Codecs should raise precise UnicodeDecodeError or UnicodeEncodeError #85287

cpython-cla-bot · 2024-01-03T07:05:57Z

All commit authors signed the Contributor License Agreement.

methane · 2024-01-05T12:29:55Z

Lib/encodings/idna.py

@@ -156,7 +170,7 @@ def encode(self, input, errors='strict'):

        if errors != 'strict':
            # IDNA is quite clear that implementations must be strict
-            raise UnicodeError("unsupported error handling "+errors)
+            raise UnicodeEncodeError("idna", input, 0, 1, f"unsupported error handling {errors}")


Suggested change

- raise UnicodeEncodeError("idna", input, 0, 1, f"unsupported error handling {errors}")
+ raise UnicodeEncodeError("idna", input, 0, 0, f"unsupported error handling {errors}")

UnicodeEncodeError implies that there is an error in the encoded string.

Other codecs usually raise LookupError for unknown error handlers, but here an exception is raised even for known error handlers.

Perhaps ValueError better suits here. UnicodeError is a subclass of ValueError, so we can keep it for now.

Changed the cases where there was not an issue with the input string back to UnicodeError.
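The distinction discussed in this thread can be sketched as follows. The handler name "no-such-handler" is made up, and the exact message of the idna error is not assumed here, only its type:

```python
import codecs

# Unknown error-handler names are normally reported as LookupError.
try:
    codecs.lookup_error("no-such-handler")
except LookupError as err:
    unknown_handler_error = err

# The idna codec rejects even a known handler such as "ignore", because
# it only supports strict handling. Nothing is wrong with the input
# string itself, so a plain UnicodeError (a ValueError subclass) fits.
try:
    "example.com".encode("idna", "ignore")
except ValueError as err:
    idna_error = err

print(type(unknown_handler_error).__name__)
print(type(idna_error).__name__, "|", idna_error)
```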

methane · 2024-01-05T12:32:49Z

Lib/encodings/idna.py

            for label in labels[:-1]:
                if not (0 < len(label) < 64):
-                    raise UnicodeError("label empty or too long")
+                    raise UnicodeEncodeError("idna", input, index, index+len(label), "label empty or too long")


Suggested change

- raise UnicodeEncodeError("idna", input, index, index+len(label), "label empty or too long")
+ raise UnicodeEncodeError(
+     "idna", input, index, index+len(label),
+     "label empty or too long")

I agree with using absolute offsets.

But it still reports a weird range for an empty label. I think that it is better to include an opening or closing dot.

Also, I think that it is better to use different messages for empty and too long labels.

Split out the empty vs too long cases. Note that UnicodeEncodeError takes a python-style [start, end) range, but outputs a (start)-(end-1) inclusive range in the error message, so using something like 0, 0, gives the weird-looking 0--1 error message whereas 0, 1, gives position 0.
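A quick way to see the difference between the half-open constructor arguments and the rendered message is to build the exceptions by hand. The input "a.b" and the reason text are placeholders, and the exact rendering of the empty range may differ between Python versions, so it is only printed here:

```python
single = UnicodeEncodeError("idna", "a.b", 0, 1, "example reason")
multi = UnicodeEncodeError("idna", "a.b", 0, 3, "example reason")
empty = UnicodeEncodeError("idna", "a.b", 0, 0, "example reason")

# start/end are a half-open [start, end) range; the message shows an
# inclusive range, or a single position when exactly one item is covered.
for exc in (single, multi, empty):
    print((exc.start, exc.end), "->", str(exc))
```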

methane · 2024-01-05T12:34:08Z

Lib/encodings/idna.py

            if len(labels[-1]) >= 64:
-                raise UnicodeError("label too long")
+                raise UnicodeEncodeError("idna", input, index, len(input), "label too long")


Suggested change

- raise UnicodeEncodeError("idna", input, index, len(input), "label too long")
+ raise UnicodeEncodeError("idna", input, index, len(input),
+                          "label too long")

Fixed the line breaks of some of the longer exceptions to look like this.

jjsloboda · 2024-01-06T01:21:33Z

Thanks for the reviews @methane and @serhiy-storchaka, comments addressed and ready for another look

serhiy-storchaka

We cannot simply translate offsets from a label to the full input, because the label is transformed several times during encoding: some characters are ignored, others are combined during normalization or transformed by the punycode encoding. It is difficult to track the offsets of original characters that cannot be encoded. So I suggest simply reporting the range of the whole label that cannot be encoded. The main goal is changing the type of the exception to a more specific one, and that should be enough for now. Later we can narrow the range of the error if it is useful and possible.
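Reporting the whole label could look roughly like the sketch below. `check_label_lengths` is a hypothetical helper, not the code in Lib/encodings/idna.py: it assumes labels are separated only by an ASCII dot and only checks the length limit.

```python
def check_label_lengths(domain: str) -> None:
    """Raise with the range of the first label that is too long."""
    offset = 0  # absolute position of the current label in `domain`
    for label in domain.split("."):
        end = offset + len(label)
        if len(label) >= 64:
            # Report the whole label as a half-open [offset, end) range.
            raise UnicodeEncodeError(
                "idna", domain, offset, end, "label too long")
        offset = end + 1  # skip the separating dot


domain = "www." + "a" * 64 + ".com"
try:
    check_label_lengths(domain)
except UnicodeEncodeError as exc:
    span = (exc.start, exc.end)
    bad_label = domain[exc.start:exc.end]

print(span, len(bad_label))
```

Because the range is computed on the untransformed input, it stays valid no matter how the label is later normalized or encoded.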

jjsloboda · 2024-01-07T22:42:14Z

I decided to add some tests for the error offsets, in the hope of getting the offsets figured out properly, or at least documenting the current behavior. They seem to be working now, at least for those cases, but I'm not an expert in IDNA or Punycode, so I'm not sure whether there are important edge cases I'm missing.

My goal was to avoid the need to revisit this in the future to further tighten up the offsets, but @serhiy-storchaka let me know if you'd still prefer I switch it to use the whole range of the problematic label.

Ready for another look @methane @serhiy-storchaka

methane · 2024-02-02T06:25:23Z

Modules/cjkcodecs/multibytecodec.c

+            excobj = PyObject_CallFunction(PyExc_UnicodeEncodeError,
+                                           "ssnns",
+                                           ctx->codec->encoding,
+                                           PyUnicode_AsUTF8(inbuf),


Do we need to copy inbuf? If not, just use inbuf with the O format.

Done 👍 I think I tried this originally, but I must have been doing something different because it seems to work fine now.

methane · 2024-02-02T06:28:55Z

Modules/cjkcodecs/multibytecodec.c

+                       (const char *)buf->inbuf_top, bufsize,
+                       0, bufsize, "pending buffer overflow");
+            PyErr_SetObject(PyExc_UnicodeDecodeError, excobj);
+            goto errorexit;


methane · 2024-02-02T06:29:27Z

Modules/cjkcodecs/multibytecodec.c

+                                           ctx->codec->encoding,
+                                           PyUnicode_AsUTF8(inbuf),
+                                           inpos, datalen,
+                                           "pending buffer overflow");


missing if (excobj == NULL) goto errorexit;

Done, for all new and pre-existing code

methane · 2024-02-02T06:29:50Z

Modules/cjkcodecs/multibytecodec.c

+                                           0, pendingsize,
+                                           "pending buffer too large");
+            PyErr_SetObject(PyExc_UnicodeEncodeError, excobj);
+            goto errorexit;


jjsloboda · 2024-02-16T18:18:45Z

Thanks @methane , ready for another look

methane · 2024-02-21T09:02:14Z

Modules/cjkcodecs/multibytecodec.c

+                                                 0, sizeof(statebytes),
+                                                 "pending buffer too large");
+        if (excobj == NULL) goto errorexit;
+        PyErr_SetObject(PyExc_UnicodeEncodeError, excobj);


I don't feel this is a valid UnicodeEncodeError.
Should we really avoid UnicodeError here?

If UnicodeEncodeError is meant for cases where the user's input directly causes the encoding problem, rather than for any problem that occurs during encoding, then UnicodeError makes sense here. Changed this one back to UnicodeError.

jjsloboda · 2024-02-22T04:42:14Z

Comment addressed @methane , thanks, ready for review

@serhiy-storchaka I think the new IDNA errors work now at least for the main cases, but let me know if you still want me to go back and change the IDNA encoding errors to report the whole label as the problematic range

methane · 2024-02-23T09:19:24Z

punycode looks ugly. Although it doesn't have __all__, I think only punycode_decode, punycode_encode, Codec, IncrementalEncoder, IncrementalDecoder, StreamWriter, StreamReader, and getregentry are public APIs.
(Strictly speaking, getregentry is the only API used externally, other than by tests.)
So I will change where the encoding/decoding happens within them.

methane

LGTM, but I will wait a few weeks to give other core developers a chance to review my changes.

jjsloboda added 4 commits January 2, 2024 22:28

fix issue pythongh-85287

7339989

add news blurb

e92d414

add more lenient unicode error handling within the except blocks

10e7cd0

fix IDNA-specific length issue

0122f90

jjsloboda requested a review from corona10 as a code owner January 3, 2024 07:05

bedevere-app bot added the awaiting review label Jan 3, 2024

bedevere-app bot mentioned this pull request Jan 3, 2024

Codecs should raise precise UnicodeDecodeError or UnicodeEncodeError #85287

Closed

serhiy-storchaka self-requested a review January 3, 2024 07:13

jjsloboda added 3 commits January 3, 2024 02:20

Merge branch 'main' into unicode-errors-fix-85287

a310dd2

fix two issues

63948d2

Merge branch 'main' into unicode-errors-fix-85287

4479ab2

methane reviewed Jan 5, 2024

jjsloboda added 6 commits January 5, 2024 19:02

use plain UnicodeError for problems outside the en/decoded string

81310e3

split label empty vs too long

367de4e

use labels input for finding error offset, not output result

9f57515

update test for undefined encoding

389122d

fixed linebreaks on some of the longer exceptions

fe47caa

Merge branch 'main' into unicode-errors-fix-85287

a4098fa

serhiy-storchaka reviewed Jan 7, 2024

jjsloboda added 2 commits January 7, 2024 16:46

add tests for unicode error offsets, and tighten up the logic for calculating them

95cb5bb

Merge branch 'main' into unicode-errors-fix-85287

10d092f

methane reviewed Feb 2, 2024

update MultibyteIncrementalEncoder.getstate()

93e99ae

methane reviewed Feb 21, 2024

methane and others added 3 commits February 21, 2024 18:12

fixup

87e1f99

change buffer size issue error back to UnicodeError

0728a43

Merge branch 'main' into unicode-errors-fix-85287

0f80786

jjsloboda added 2 commits February 22, 2024 10:29

Merge branch 'main' into unicode-errors-fix-85287

5c8c59e

update test to match changed exception

1cc911d

methane reviewed Feb 23, 2024

methane added 2 commits February 23, 2024 16:14

Update Modules/cjkcodecs/multibytecodec.c

9594bae

improve idna codec errors

ea3ff8a

improve punycode.decode()

8a2bc50

Eclips4 mentioned this pull request Feb 23, 2024

New warning: variable ‘callable’ set but not used [-Wunused-but-set-variable] #115846

Closed

methane approved these changes Feb 23, 2024

bedevere-app bot added awaiting merge and removed awaiting review labels Feb 23, 2024

methane added the awaiting core review label Feb 23, 2024

improve punycode_decode again

a63e17a

methane approved these changes Feb 23, 2024

jjsloboda commented Feb 24, 2024

Merge branch 'main' into unicode-errors-fix-85287

4c329e4

methane enabled auto-merge (squash) March 17, 2024 04:51

methane merged commit 649857a into python:main Mar 17, 2024
35 of 36 checks passed

bedevere-app bot removed awaiting core review awaiting merge labels Mar 17, 2024

vstinner pushed a commit to vstinner/cpython that referenced this pull request Mar 20, 2024

pythongh-85287: Change codecs to raise precise UnicodeEncodeError and UnicodeDecodeError (python#113674) Co-authored-by: Inada Naoki <songofacandy@gmail.com>

1ccbc3d

adorilson pushed a commit to adorilson/cpython that referenced this pull request Mar 25, 2024

pythongh-85287: Change codecs to raise precise UnicodeEncodeError and UnicodeDecodeError (python#113674) Co-authored-by: Inada Naoki <songofacandy@gmail.com>

8d804eb

diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024

pythongh-85287: Change codecs to raise precise UnicodeEncodeError and UnicodeDecodeError (python#113674) Co-authored-by: Inada Naoki <songofacandy@gmail.com>

7c53561
