email.message get_payload throws UnicodeEncodeError with some surrogate Unicode characters #94606

Closed
sidney opened this issue Jul 6, 2022 · 1 comment · Fixed by #94641
Labels
topic-email type-bug An unexpected behavior, bug, or error

Comments


sidney commented Jul 6, 2022

email.message get_payload raises a UnicodeEncodeError if the message body contains a line that has either:
a Unicode surrogate code point that is valid for surrogateescape encoding (U+DC80 through U+DCFF) together with a non-ASCII character
OR
a Unicode surrogate code point that is not valid for surrogateescape encoding

Here is a minimal code example, with one of the two cases commented out:

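from email import message_from_string

# Case 1: a surrogateescape-range surrogate plus a non-ASCII character on one line
m = message_from_string("surrogate char \udcc3 and 8-bit utf-8 ë on same line")
# Case 2: a surrogate outside the surrogateescape range (U+DC80 through U+DCFF)
# m = message_from_string("surrogate char \udfff does it by itself")
payload = m.get_payload(decode=True)  # raises UnicodeEncodeError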

With Python 3.10.5 on macOS, this produces:

Traceback (most recent call last):
  File "/Users/sidney/tmp/./test5.py", line 8, in <module>
    payload = m.get_payload(decode=True)
  File "/usr/local/Cellar/python@3.10/3.10.5/Frameworks/Python.framework/Versions/3.10/lib/python3.10/email/message.py", line 264, in get_payload
    bpayload = payload.encode('ascii', 'surrogateescape')
UnicodeEncodeError: 'ascii' codec can't encode character '\xeb' in position 33: ordinal not in range(128)
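
The failing call can be reproduced without the email package. The following is a minimal sketch that applies the same encode to the string from the example and reports which character it rejects (the variable names are only for illustration):

```python
# The same line as in the example above; "\xeb" is the character ë.
line = "surrogate char \udcc3 and 8-bit utf-8 \xeb on same line"

try:
    line.encode("ascii", "surrogateescape")
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character the codec could not encode.
    failing_index = exc.start
    failing_char = line[exc.start]
    print(failing_index, repr(failing_char))
```

The surrogate at index 15 is accepted by the error handler. The encode fails later, at the non-ASCII character, which matches the position reported in the traceback.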

This was tested with Python 3.10.5 on macOS. However, I tracked it down from a report in the wild of Python 3.8 on Ubuntu 20.04 processing actual emails.

@sidney sidney added the type-bug An unexpected behavior, bug, or error label Jul 6, 2022

sidney commented Jul 6, 2022

I tracked this down to the following code in email/message.py:

   if utils._has_surrogates(payload):
       bpayload = payload.encode('ascii', 'surrogateescape')

This looks like it is supposed to check whether payload is a string that was created by decoding with surrogateescape and, if so, convert it back to bytes the same way. PEP 383 is clear that surrogateescape is only for round-tripping: decoding, then encoding. Encoding with surrogateescape should never be applied to strings that were not created by decoding with surrogateescape.
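
For reference, a small sketch of the round trip that surrogateescape is meant for. The sample bytes are made up for illustration:

```python
# Bytes that are not valid ASCII: 0xC3 0xAB is the UTF-8 encoding of "ë".
raw = b"caf\xc3\xab"

# Decoding as ASCII with surrogateescape maps each undecodable byte 0xNN
# to the lone surrogate U+DC00 + 0xNN, so no information is lost.
text = raw.decode("ascii", "surrogateescape")
print(ascii(text))  # 'caf\udcc3\udcab'

# Encoding with the same codec and error handler restores the original bytes.
restored = text.encode("ascii", "surrogateescape")
print(restored == raw)  # True
```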

The problem is that utils._has_surrogates(payload) is only a fast heuristic when it is used to guess whether payload is a string that was produced by decoding with surrogateescape. The function actually flags any string that contains a Unicode surrogate code point. However, a string created by decoding with surrogateescape has surrogates only in the U+DC80 through U+DCFF range, and all of its other characters are 7-bit ASCII. If a string has a non-ASCII character in addition to a surrogate, or has a surrogate outside that range, utils._has_surrogates() returns true but the encode raises the exception.
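
A sketch of the mismatch, using two helpers written for this comment. has_any_surrogate is a hypothetical stand-in for what the heuristic effectively detects (it is not the library function), and is_ascii_surrogateescaped is the stricter check that the encode actually requires:

```python
def has_any_surrogate(s: str) -> bool:
    """Hypothetical stand-in for the heuristic: True if s has any surrogate."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def is_ascii_surrogateescaped(s: str) -> bool:
    """True only if s can be encoded back to bytes with ascii/surrogateescape."""
    try:
        s.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        return False
    return True


samples = {
    "round-trippable": "caf\udcc3\udcab",
    "surrogate plus non-ASCII": "surrogate char \udcc3 and 8-bit utf-8 \xeb on same line",
    "surrogate out of range": "surrogate char \udfff does it by itself",
}

# Every sample is flagged by the heuristic, but only the first can be encoded.
results = {
    name: (has_any_surrogate(s), is_ascii_surrogateescaped(s))
    for name, s in samples.items()
}
for name, flags in results.items():
    print(name, flags)
```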

It should be possible to fix this by catching the exception raised by the encode and proceeding to the same result the code would have produced if utils._has_surrogates(payload) had returned false.
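
A rough sketch of that idea, not the actual patch. payload_to_bytes is a hypothetical helper, and the raw-unicode-escape fallback is an assumption about what the non-surrogate path does, so treat it as a placeholder for whatever that branch really produces:

```python
def payload_to_bytes(payload: str) -> bytes:
    """Hypothetical helper: convert a str payload to bytes without raising."""
    try:
        # Correct only for strings produced by decode("ascii", "surrogateescape").
        return payload.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        # Assumed fallback: raw-unicode-escape accepts any str, including
        # lone surrogates and non-ASCII characters.
        return payload.encode("raw-unicode-escape")


good = payload_to_bytes("caf\udcc3\udcab")
bad_mixed = payload_to_bytes("surrogate char \udcc3 and 8-bit utf-8 \xeb on same line")
bad_range = payload_to_bytes("surrogate char \udfff does it by itself")
print(good)  # b'caf\xc3\xab'
```

With this shape, a properly surrogateescaped payload still round-trips to its original bytes, and the two failing cases from the report return bytes instead of raising.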

I'll submit a PR for that after testing.

sidney added a commit to sidney/cpython that referenced this issue Jul 7, 2022
sidney added a commit to sidney/cpython that referenced this issue Jul 7, 2022
serhiy-storchaka added a commit that referenced this issue Dec 11, 2023
…escaped string (GH-94641)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Dec 11, 2023
…rogateescaped string (pythonGH-94641)

(cherry picked from commit 27a5fd8)

Co-authored-by: Sidney Markowitz <sidney@sidney.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue Dec 11, 2023
…rrogateescaped string (GH-94641) (GH-112972)

(cherry picked from commit 27a5fd8)

Co-authored-by: Sidney Markowitz <sidney@sidney.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue Dec 11, 2023
…rrogateescaped string (GH-94641) (GH-112971)

(cherry picked from commit 27a5fd8)

Co-authored-by: Sidney Markowitz <sidney@sidney.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
…rogateescaped string (pythonGH-94641)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>