Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-85287: Change codecs to raise precise UnicodeEncodeError and UnicodeDecodeError #113674

Merged
merged 30 commits into from
Mar 17, 2024

Conversation

jjsloboda
Copy link
Contributor

@jjsloboda jjsloboda commented Jan 3, 2024

Change codecs to raise UnicodeEncodeError and UnicodeDecodeError, depending on what operation the API user is attempting, instead of just UnicodeError. This involved providing more information to the exceptions, including the problematic string and start and end offsets within the string, which are not always relevant depending on the nature of the exception.

Some of the codecs will perform multiple encodes and decodes during an operation as part of their implementation. In these cases, I've opted to throw the exception that matches the API function the error is occurring in.

A tricky thing to watch out for while reviewing is that UnicodeEncodeError takes a str and UnicodeDecodeError takes a bytes, which is supposed to be the original object being en/decoded. However, most of these codec functions modify their arguments, so by the time the exception block is reached, the original argument is no longer available and needs to be reconstructed. Alternatively, I could change the PR to save the argument, but I don't think it's worth the extra memory just for the exception case (may be more pythonic though).

Another thing to watch for in the review is to make sure I'm not doing anything in the exception blocks that could trigger more en/decoding exceptions when I'm reconstructing the argument strings.

Copy link

cpython-cla-bot bot commented Jan 3, 2024

All commit authors signed the Contributor License Agreement.
CLA signed

@davidmerwin

This comment was marked as off-topic.

@davidmerwin

This comment was marked as off-topic.

@jjsloboda

This comment was marked as resolved.

@davidmerwin

This comment was marked as resolved.

@@ -156,7 +170,7 @@ def encode(self, input, errors='strict'):

if errors != 'strict':
# IDNA is quite clear that implementations must be strict
raise UnicodeError("unsupported error handling "+errors)
raise UnicodeEncodeError("idna", input, 0, 1, f"unsupported error handling {errors}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise UnicodeEncodeError("idna", input, 0, 1, f"unsupported error handling {errors}")
raise UnicodeEncodeError("idna", input, 0, 0, f"unsupported error handling {errors}")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnicodeEncodeError implies that there is an error in the encoded string.

Other codecs usually raise LookupError for unknown error handlers, but here an exception is raised even for known error handlers.

Perhaps ValueError better suits here. UnicodeError is a subclass of ValueError, so we can keep it for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the cases where there was not an issue with the input string back to UnicodeError.

for label in labels[:-1]:
if not (0 < len(label) < 64):
raise UnicodeError("label empty or too long")
raise UnicodeEncodeError("idna", input, index, index+len(label), "label empty or too long")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise UnicodeEncodeError("idna", input, index, index+len(label), "label empty or too long")
raise UnicodeEncodeError(
"idna", input, index, index+len(label),
"label empty or too long")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree we using absolute offset.

But it still reports weird range for empty label. I think that it is better to include an opening or closing dot.

Also, I think that it is better to use different messages for empty and too long labels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split out the empty vs too long cases. Note that UnicodeEncodeError takes a python-style [start, end) range, but outputs a (start)-(end-1) inclusive range in the error message, so using something like 0, 0, gives the weird-looking 0--1 error message whereas 0, 1, gives position 0.

if len(labels[-1]) >= 64:
raise UnicodeError("label too long")
raise UnicodeEncodeError("idna", input, index, len(input), "label too long")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise UnicodeEncodeError("idna", input, index, len(input), "label too long")
raise UnicodeEncodeError("idna", input, index, len(input),
"label too long")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the linebreaks of some of the longer exceptions to look like this

Lib/encodings/idna.py Show resolved Hide resolved
@jjsloboda
Copy link
Contributor Author

Thanks for the reviews @methane and @serhiy-storchaka, comments addressed and ready for another look

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot simply translate offsets from a label to the full input, because the label is transformed several times during encoding: some characters are ignored, others are combined in the normalization or transformed in the punycode encoding. It is difficult to track offsets of original characters that cannot be encoded. So I suggest to simply report the range of the whole label that cannot be encoded. The main goal is changing the type of the exception to more specific, and it should be enough for now. Later we can narrow the range of error if it is useful and possible.

@jjsloboda
Copy link
Contributor Author

I decided to add some tests for the error offset in one more hope to get the offsets figured out properly, or at least document the current behavior. They seem to be working now at least for those cases, but I'm not an expert in IDNA or Punycode so I'm not sure if there are important edgecases I'm missing.

My goal was to avoid the need to revisit this in the future to further tighten up the offsets, but @serhiy-storchaka let me know if you'd still prefer I switch it to use the whole range of the problematic label.

Ready for another look @methane @serhiy-storchaka

excobj = PyObject_CallFunction(PyExc_UnicodeEncodeError,
"ssnns",
ctx->codec->encoding,
PyUnicode_AsUTF8(inbuf),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need copy inbuf? If no, use just inbuf with O format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 👍 I think I tried this originally, but I must have been doing something different because it seems to work fine now.

Modules/cjkcodecs/multibytecodec.c Show resolved Hide resolved
(const char *)buf->inbuf_top, bufsize,
0, bufsize, "pending buffer overflow");
PyErr_SetObject(PyExc_UnicodeDecodeError, excobj);
goto errorexit;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

ctx->codec->encoding,
PyUnicode_AsUTF8(inbuf),
inpos, datalen,
"pending buffer overflow");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing if (excobj == NULL) goto errorexit;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, for all new and pre-existing code

0, pendingsize,
"pending buffer too large");
PyErr_SetObject(PyExc_UnicodeEncodeError, excobj);
goto errorexit;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

@nitresor

This comment was marked as off-topic.

@jjsloboda
Copy link
Contributor Author

Thanks @methane , ready for another look

0, sizeof(statebytes),
"pending buffer too large");
if (excobj == NULL) goto errorexit;
PyErr_SetObject(PyExc_UnicodeEncodeError, excobj);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel this is valid UnicodeEncodeError.
Should we really avoid UnicodeError here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If UnicodeEncodeError is because the user passed an input that directly causes the encoding problem, rather than any problem that occurs during encoding, then UnicodeError makes sense here. Changed this one back to UnicodeError.

@jjsloboda
Copy link
Contributor Author

Comment addressed @methane , thanks, ready for review

@serhiy-storchaka I think the new IDNA errors work now at least for the main cases, but let me know if you still want me to go back and change the IDNA encoding errors to report the whole label as the problematic range

@methane
Copy link
Member

methane commented Feb 23, 2024

punycode looks ugly. Although it doesn't have __all__, I think only punycode_decode, punycode_encode, Codec, IncrementalEncoder, IncrementalDecoder, StreamWriter, StreamReader, and getregentry are public APIs.
(Strictly speaking, getentry is the only API used from external other than tests.)
So I will change where to encode/decode in them.

Copy link
Member

@methane methane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. But I will wait for a few weeks to have chance other core developer review my changes.

@methane methane enabled auto-merge (squash) March 17, 2024 04:51
@methane methane merged commit 649857a into python:main Mar 17, 2024
35 of 36 checks passed
vstinner pushed a commit to vstinner/cpython that referenced this pull request Mar 20, 2024
… UnicodeDecodeError (python#113674)

Co-authored-by: Inada Naoki <songofacandy@gmail.com>
adorilson pushed a commit to adorilson/cpython that referenced this pull request Mar 25, 2024
… UnicodeDecodeError (python#113674)

Co-authored-by: Inada Naoki <songofacandy@gmail.com>
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
… UnicodeDecodeError (python#113674)

Co-authored-by: Inada Naoki <songofacandy@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants