-
-
Notifications
You must be signed in to change notification settings - Fork 29.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
decode_header does not follow RFC 2047 #45420
Comments
email.header.decode_header expect a space or end of line after the end Python 2.5.1 ChangeLog seems to indicate that this bug has been solved. A visible effet of the bad regex used has the consequence found in Issue it seems there are 2 different regex with the same purpose in two |
Can you provide an example of an address that triggers this? Preferably |
The only difference between the two regexps is that the email/header.py (?=[ \t]|$) # whitespace or the end of the string at the end (with re.MULTILINE, so $ also matches '\n'). To expand on "There is nothing about that thing in RFC 2047", it says:: IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's RFC 822 says:: atom = 1*<any CHAR except specials, SPACE and CTLs>
...
specials = "(" / ")" / "<" / ">" / "@" ; Must be in quoted-
/ "," / ";" / ":" / "\" / <"> ; string, to use
/ "." / "[" / "]" ; within a word. So an example of mis-parsing is:: >>> import email.header
>>> h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)'
>>> email.header.decode_header(h)
[('=?utf-8?q?=E2=98=BA?=(unicode white smiling face)', None)] The correct result would be:: >>> email.header.decode_header(h)
[('\xe2\x98\xba', 'utf-8'), ('(unicode white smiling face)', None)] which is what you get if you insert a space before the '(' in h. |
I think the problem is best viewed as headers are not being parsed Patch and test, tested on 2.6.1, 2.7trunk. The test mostly just |
Tony, I don't think I agree with your reading of the RFC. IMO, your The encoded word followed by a 'special' is more subtle. In section 5,
This would apply to encoded words in names in To and From headers. But It would probably be reasonable to fix this case so that a word followed |
The email package does not follow the RFCs in anything to do with header Until email.header actually parses headers into atoms and then decodes In order to interpret the RFC correctly, email.header.decode_header() |
+1 for Tony's patch. This patch reverts fix for bpo-1582282 filed by tkikuchi. |
I got bitten by this too. In addition to not decoding encoded words without whitespace after them, it throws an exception if there is a valid encoded word later in the string and the first encoded word is followed by something that isn't a hex number: >>> decode_header('aaa=?iso-8859-1?q?bbb?=xxx asdf =?iso-8859-1?q?jkl?=')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/email/header.py", line 93, in decode_header
dec = email.quoprimime.header_decode(encoded)
File "/usr/lib/python2.5/email/quoprimime.py", line 336, in header_decode
return re.sub(r'=\w{2}', _unquote_match, s)
File "/usr/lib/python2.5/re.py", line 150, in sub
return _compile(pattern, 0).sub(repl, string, count)
File "/usr/lib/python2.5/email/quoprimime.py", line 324, in _unquote_match
return unquote(s)
File "/usr/lib/python2.5/email/quoprimime.py", line 106, in unquote
return chr(int(s[1:3], 16))
ValueError: invalid literal for int() with base 16: 'xx' I think it should join the encoded words with the surrounding text if there's no whitespace in between. That seems to be consistent with what the non-RFC-compliant MUAs out there mean when they send such things. Reverting the change from bpo-1582282 doesn't seem to be a good idea, since it was introduced in response to problems with mailman (see https://bugs.launchpad.net/mailman/+bug/266370). Instead of leaving "Sm=?ISO-8859-1?B?9g==?=rg=?ISO-8859-1?B?5Q==?=sbord" as it is, my patch converts it to [('Sm\xf6rg\xe5sbord', 'iso-8859-1')]. This shouldn't reintroduce the problem mailman was having while fixing ours. |
Hi, all I am against applying these patches because they will insert space separations in re-composed header (with str() function). Sm=?ISO-8859-1?B?9g==?=rg=?ISO-8859-1?B?5Q==?=sbord Instead, I submit a small recipe for decoding non-compliant RFC2047 header where space separation is not properly inserted between encoded_word and us-ascii characters. Please try! |
maybe it would be a good start to include the examples at the end of RFC2047 into the regression tests? These examples at least support the case that a '?' may immediately follow an encoded string: encoded form displayed as when trying this in python 2.7: >>> decode_header ('(=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)')
[('(', None), ('a', 'iso-8859-1'), ('=?ISO-8859-1?Q?b?=)', None)] this fails. So I consider this a bug. Note that although RFC2047 is vague concerning the interpretation if two encoded strings could follow each other without a whitespace, these *are* seen in the wild and *are* interpreted correctly by the mailers I've tested: mutt, thunderbird, exchange in various versions, even lotus notes seems to get this right. So I guess python should be "liberal in what you accept" and parse something like |
The RFC isn't at all vague about encoded words not separated by white space. That isn't allowed by the BNF. As you say, though, they occur in the wild and should be parsed correctly. In your other point I think you mean "immediately followed by a )", right? Yes, that is allowed and no, we don't currently parse that correctly. Adding the RFC tests would be great (patches gladly accepted). Fixes for ones we fail would be great, too, but at the very least we can mark them as expected failures. I don't usually like adding tests that we expect to fail, but in the case of externally defined tests such as the RFC examples I think it is worthwhile, so that we can check in a complete test set. |
Fine, I see what you mean, this involves very careful reading of the RFC Right. Should have been a ')'
Patch attached (against current tip, 74241:120a79b8bb11). We currently I've made the 5th test (with newline in the string) two cases, one with I plan to look into this a little more, my current plan is to make the Any objections/further infos? RalfDr. Ralf Schlatterbeck Tel: +43/2243/26465-16 |
enclosed please find a fixed patch -- decode_header consolidates
|
Well, a caution that tweaking the regex can have unexpected consequences as past issues have proven (but by all means go for it), and a note that the parsing strategy is going to change completely in email6 (see http://pypi.python.org/email and http://hg.python.org/features/email6). I think your tests should pass on that branch; I'll be interested to try it when I get some time. (Note: I'm removing 3.1 from versions since it doesn't get bug fixes any more.) Also, I'm not sure the (non-essential) change to consolidate like-charset encoded words is appropriate for a bug fix. It's hard to see how it would break anything, but why take the risk if it isn't needed to fix the bug. |
Gah, that's what I get for not reading carefully (or looking at the patch first). Your test change is fine, of course. |
Attached please find a patch that
I've re-read RFC2047 (and parts of 822) and now share your opinion that With the special-casing of ctext characters mentioned above, I still think we should do it like everyone else and *not* automatically I've *not yet* tested this against the email6 branch you mentioned. Note that I didn't have to make the regex non-greedy, it already I've changed all the tests that test for removal of whitespace between The rfc2047 tests now all pass. The patch also fixes bpo-1467619 "Header.decode_header eats up spaces" RalfDr. Ralf Schlatterbeck Tel: +43/2243/26465-16 |
Ralf, thanks very much for this patch. I'm considering applying it. Given that the current code breaks on parsing various legitimate constructs, it seems like the behavior change (preserving whitespace in the non-EW parts...which IMO is correct) should be an acceptable tradeoff. Could you please submit a contributor agreement? (http://www.python.org/psf/contrib/) |
On Mon, May 28, 2012 at 08:15:05PM +0000, R. David Murray wrote:
Thanks for considering my patch. Ralf |
New changeset 8c03fe231877 by R David Murray in branch 'default': |
I've applied this to 3.3. Because the preservation of spaces around the ascii parts is a visible behavior change that could cause working programs to break, I don't think I can backport it. I'm going to leave this open until I can consult with Barry to see if he thinks a backport is justified. Anyone else can feel free to chime in with an opinion as well :) |
On Jun 02, 2012, at 09:59 PM, R. David Murray wrote:
I think a backport is risky. |
OK, I'm closing this, then, and will close the related issues as well. Thanks again for the patch, Ralf. |
New changeset 0808cb8c60fd by R David Murray in branch 'default': |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: