-
-
Notifications
You must be signed in to change notification settings - Fork 31.6k
email.message.Message.get_payload(decode=True) raises AssertionError that "should never happen" #71584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I'm processing Yahoo! Groups backup archives, and came across an email message which causes the
Attached is the file which, when run under Python 3.5.1, causes the exception to be raised. |
See attached another file with more test cases. |
This appears to be a bug in b64decode. It should not be raising that error if validate=False, as it is in the code being executed during the test case. It's not that b64decode is ignoring the validate argument; the error appears to be data-dependent. I don't have time to investigate this further at the moment, perhaps someone else can pick it up. |
I stumbled upon this bug as well while fuzzing with AFL. The curious thing is that email.message_from_string still accepts that garbled message as a valid email. |
I also attached a minimal script containing only the decode call and the garbage payload. |
Thanks for narrowing this down. There are two problems: the first is that the base64 module is raising "Incorrect padding" when the issue is actually that the input string cannot be decoded using the base64 algorithm no matter what padding might be supplied. The second is that the else clause in _encoded_words *can* happen, in this circumstance. The circumstance is that the input string contains n*4+1 characters. No encoding of source text will ever produce that number of output characters. (I also seem to have used an index one greater than needed for the loop; the maximum number of padding characters is 2.) The interesting question is what we want to do for error recovery in the n*4+1 case. Do we declare the part invalid and return an empty string? Do we try chopping off the last character and see if we can decode it? The latter is complicated by the possible presence of invalid alphabet characters, but it is what I lean toward. I'm open to alternate suggestions. And to a patch if someone wants to write one :) |
It might be better to add a character: that will produce garbage as the last byte, but that could serve as an alert that something is wrong...and we don't know if the extra character is at the end or at the beginning of the base64 encoded string, or if, as in this case, it is multiple extra characters which will produce trailing garbage anyway. The example in hand is, I would guess, by far the most likely scenario. |
In my opinion, this is still a clear case of "invalid input, raise an error", but I don't think AssertionError is the right one. Maybe just don't catch the unexpected binascii.Error and let it fly towards the user? I might go even one step further and just let base64 library handle and raise all the necessary errors here, similarly to how your "encode_b" function works. If you want more error resilience (meaning: email lib would not crash with invalid inputs), I would think returning an empty string is okay, since that base64 input is clearly not a valid base64 that is going to decode into anything. |
Also, I'm not sure what is the development status in Python 3.4 but in my case this bug happened using the Debian Jessie Python version 3.4.2-2 |
It might also make sense to return the payload undecoded. The documentation for get_payload() function says: "[...] the payload will be decoded if this header’s value is quoted-printable or base64. If some other encoding is used, or Content-Transfer-Encoding header is missing, the payload is returned as-is (undecoded)." Even though the header's value tries to convince you "base64" is the encoding, it is - in this case - either broken base64 or not. Hence it might fall into the category "some other encoding is used", justifying the "payload is returned as-is (undecoded)". As to the original payload Claudiu posted, in that the mailserver has truncated the email. This already provides the user with non-base64 string that they try to convince you to decode as base64. My argument is still valid in this case. |
Returning the undecoded payload is a good idea. Let's go with that. The email module, unlike most stdlib packages, has a mandate that the parser should never raise an error. Instead we do our best to guess (very unlike everything else in python!) and note 'defects' in the message. The reason this is the case is Postel's Law, which has become one of the guiding principles in dealing with email over the years: "be conservative in what you do, and generous in what you accept". So, the generator will raise errors (if the original input didn't come from the parser), but the parser will not, if at all possible. (For those who want aggressive error checking, python3 the 'raise_on_defect' policy setting.) For the versions, we use that field to indicate which versions the bug will get fixed in, which is why I removed 3.4. |
See bpo-33770, which will make recognizing and handling this case straightforward. |
The fix for bpo-33770 is only in 3.7.0 and later, so does this mean this change should not be backported to 3.6? Updating the the versions accordingly. If it is appropriate for 3.6, we can add that back in. |
I ended up fixing this independently of bpo-33770. AFAIK this should just work with 3.6. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: