Email header folded incorrectly #80701
Comments
I encountered a problem when replacing the 'Subject' header of an email: after serializing the message again, the UTF-8 encoding was wrong. The problem seems to occur when the internal header objects are folded. Example:
I'm running Python 3.7.3 on Arch Linux using Linux 5.0. |
Can you demonstrate the problem with an actual email object? header_store_parse is not meant to be called directly. |
Never mind, I was testing with the wrong version of Python. This bug was introduced somewhere after 3.4 :(
>>> from email.message import EmailMessage
>>> m = EmailMessage()
>>> m['Subject'] = 'Hello Wörld! Hello Wörld! Hello Wörld! Hello Wörld!Hello Wörld!'
>>> bytes(m)
b'Subject: Hello =?utf-8?q?W=C3=B6rld!_Hello_W=C3=B6rld!_Hello_W=C3=B6rld!?=\n Hello =?utf-8?=?utf-8?q?q=3FW=3DC3=3DB6rld!Hello=3F=3D_W=C3=B6rld!?=\n\n' |
To aid with debugging the code, the Subject line can be simplified:
>>> from email.message import EmailMessage
>>> m = EmailMessage()
>>> m['Subject'] = 'Hello =?utf-8?q?W=C3=B6rld!_Hello_W=C3=B6rld!_Hello_W=C3=B6rld!?= Hello Wörld!Hello Wörld!'
>>> print(bytes(m))
b'Subject: Hello =?utf-8?q?W=C3=B6rld!_Hello_W=C3=B6rld!_Hello_W=C3=B6rld!?=\n Hello =?utf-8?=?utf-8?q?q=3FW=3DC3=3DB6rld!Hello=3F=3D_W=C3=B6rld!?=\n\n' |
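A round-trip check is one way to detect this kind of corruption without comparing exact bytes. The sketch below serializes a message and parses it back with the default policy; the helper name `subject_round_trips` is made up for illustration and is not part of the email package. On an interpreter that still has the bug, the check would be expected to return False for the subject from this report.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def subject_round_trips(subject):
    """Return True if the Subject survives serialize -> parse unchanged.

    Hypothetical helper for illustration; not part of the email package.
    """
    m = EmailMessage()
    m['Subject'] = subject
    # bytes(m) folds the header and encodes non-ASCII words as RFC 2047
    # encoded words; parsing with policy.default decodes them again.
    parsed = message_from_bytes(bytes(m), policy=policy.default)
    return parsed['Subject'] == subject


if __name__ == '__main__':
    subject = 'Hello Wörld! Hello Wörld! Hello Wörld! Hello Wörld!Hello Wörld!'
    print(subject_round_trips(subject))
```

Comparing the decoded value instead of the raw bytes keeps the check valid even if a later Python version chooses different fold points.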
I uploaded a test script with some test cases: The failure mode occurs when
For example, the first folded and encoded line of 'Hello Wörld! Hello Wörld! Hello Wörld! Hello Wörld!Hello Wörld!' is
b'Subject: Hello =?utf-8?q?W=C3=B6rld!_Hello_W=C3=B6rld!_Hello_W=C3=B6rld!?='
and the second line should be
b' Hello =?utf-8?q?W=C3=B6rld!Hello_W=C3=B6rld!?='
but instead, it is
b' Hello =?utf-8?=?utf-8?q?q=3FW=3DC3=3DB6rld!Hello=3F=3D_W=C3=B6rld!?='

The function at fault is _refold_parse_tree() in Lib/email/_header_value_parser.py. In the first line, it encodes the first UTF-8 word and saves that word's starting offset in the output string (15). When it encounters the second UTF-8 word, it re-encodes the entire string starting at the saved offset. This reduces the bloat added by multiple '=?utf-8?q?' start-of-encoding tokens.

When it encodes the first UTF-8 word on the second line, it tries to store it at the saved offset in the second-line output string, but that offset is past the end of the string, so the word just gets appended. When it encounters the second UTF-8 word on the second line, it re-encodes the entire second-line string starting at the saved offset (15), which is now in the middle of the first encoded UTF-8 word.

The failure mode is not triggered if there is at most one UTF-8 word in each folded line. It is also not triggered when folding occurs in the middle of a word instead of at whitespace, because the code follows a different path.

The solution is to set the saved starting offset to None when starting a new folded line whose fold point is whitespace. I will submit a pull request soon with a fix. |
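The stale-offset mechanism described above can be reproduced in a small toy model. This is a deliberately simplified sketch, not the code in _refold_parse_tree(): the names `encode`, `decode` and `fold` are invented here, `<...>` stands in for `=?utf-8?q?...?=`, and the fold points are supplied by the caller instead of being computed from a maximum line length.

```python
import re


def encode(text):
    # Toy stand-in for an RFC 2047 encoded word.
    return '<' + text + '>'


def decode(text):
    # Decode well-formed toy encoded words; leave any other text untouched.
    return re.sub(r'<([^<>]*)>', lambda m: m.group(1), text)


def fold(lines, reset_on_fold):
    """Encode non-ASCII words, merging them into the previous encoded word.

    lines is a list of already-split lines, each a list of words.
    reset_on_fold=False models the bug: the saved offset outlives the line.
    """
    out = []
    last_ew = None  # offset in the current line where the encoded word starts
    for words in lines:
        line = ''
        if reset_on_fold:
            last_ew = None  # the fix: forget the offset on a new folded line
        for word in words:
            if line:
                line += ' '
            if word.isascii():
                line += word
                continue
            if last_ew is not None:
                # Re-encode everything from the saved offset onwards.
                word = decode(line[last_ew:]) + word
                line = line[:last_ew]
            else:
                last_ew = len(line)
            line += encode(word)
        out.append(line)
    return out


lines = [['Subject:', 'Hello', 'Wörld!', 'Hello', 'Wörld!'],
         ['Hello', 'Wörld!Hello', 'Wörld!']]
print(fold(lines, reset_on_fold=False)[1])
print(fold(lines, reset_on_fold=True)[1])
```

With the offset left at 15 from the first line, the second line is cut in the middle of its first encoded word and the fragment is wrapped again, which mirrors the nested '=?utf-8?=?utf-8?q?' in the real output. Resetting the offset at the fold yields a single well-formed encoded word.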
The pull request has been submitted with both the code fix and tests. |
This seems complete, can it be closed? |