-
-
Notifications
You must be signed in to change notification settings - Fork 31.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
smtpd.py should not decode utf-8 #63861
Comments
http://hg.python.org/cpython/file/3.3/Lib/smtpd.py#l289 as of now decodes incoming bytes as UTF-8. An SMTP server must not attempt to interpret characters beyond ASCII, however. Originally mail servers were not 8-bit clean, meaning they would only guarantee the lower 7 bits of each octet to be preserved. I am not aware of the reasons that caused the current state, but to correct this behavior and make it possible to support the 8BITMIME feature I suggest decoding received bytes as latin1, leaving it to the user to reinterpret it as UTF-8 or whatever charset they need. Any other simple extended encoding could be used for this, but latin1 is the default in asynchat. The documentation should also mention charset handling. I'll be happy to submit a patch for both code and docs. |
Patch attached. This also adds some more charset clarification to the docs and corrects a minor spelling issue. It is also conceivable that we add a charset attribute to the class. This should have the safe default of latin1, and some notes in the docs that setting this to utf-8 (and probably other utf-* encodings) is not really standards-compliant. |
This bug was apparently introduced as part of the work from bpo-4184 in python 3.2. My guess, looking at the code, is that the module simply didn't work before that patch, since it would have been attempting to join binary data using a string join (''.join(...)). Richard says in the issue that he wrote tests, so he probably figured out it wasn't working and "fixed" it. It looks like there was no final review of his patch (at least not via the tracker...the patch uploaded to the tracker did not include the decode). Not that a final review would necessarily have caught the bug... The problem here is backward compatibility. In terms of the API, it really ought to be producing binary data, and not decoding at all. But, at the time he wrote the patch the email package couldn't handle binary data (Richard's patch landed in July 2010, binary support in the email package landed in October), so presumably nobody was thinking about binary emails. I'm really not sure what to do here, I'll have to give it some thought. |
Since this is my first contribution I'm not entirely sure about the fine details of backwards compatibility in Python, so please forgive me if I'm totally missing the mark here. There are facilities in smtpd's parent class asynchat that perform the necessary conversions automatically if the user sets an encoding, so smtpd should be adjusted to rely on that and thus give the user the opportunity to choose for themselves. Then it boils down to breaking backwards compatibility by setting a default encoding, which could be none as you suggest or latin1 as I suggest; either will probably be painful for current users. My take here is that whoever is using this code for their SMTP server and hasn't given the encoding issues any thought will need to take a look at their code in that respect anyway, so IMHO a break with compatibility might be a bit painful but necessary. If you agree then I will gladly rework the patch to have smtpd work with an underlying byte stream by default, rejecting anything non-ASCII where necessary. Later patches could bring 8BITMIME support to smtpd, with charset conversion as specified by the MIME metadata. |
I think the only backward compatible solution is to add a switch of *some* sort (exact API TBD), whose default is to continue to decode using utf-8, and document it as wrong. Conversion of an email to unicode should be handled by the email package, not by smtpd, which is why I say smtpd should be emitting binary. As I say, I need to find time to look at the current API in more detail before I'll be comfortable discussing the new API. I've put it on my list, but likely I won't get to it until the weekend. |
Oh, and to clarify: the backward compatibility is that if code works with X.Y.Z, it should work with X.Y.Z+1. So even though correctly handling binary mail would indeed require someone to reexamine their code, if things happen to be working OK for them (eg: their program only needs to handle utf-8 email), we don't want to break their working program. |
Here is another patch for fixing this issue: Sorry for my bad english |
As I said, the decoding needs to be controlled by a switch (presumably a keyword argument to SMTPServer) that defaults to the present (incorrect) behavior. |
Is there a workaround for this as I'd like to just be receiving binary data from SMTPD. I'm new to this system - is this scheduled for fixing in Python 3.4? |
Unfortunately I did not get to this before the 3.4 beta release, so no, it won't be fixed in 3.4. You can work around it by overriding collect_incoming_data in your subclass and doing data.decode('ascii', 'surrogateescape') instead of str(data, 'utf-8'), and then doing mydata.encode('ascii', 'surrogateescape') at the point where you want to turn the data back into binary. |
Hi David, I would like to work on this bug. Can you give some more insights about the main issue? As far as I understood, the smtp server is now decoding the incoming bytes as UTF-8. Why do you say that it is not the right way? Can you give some idea about the right convention? Also, you mention about a solution with a switch statement having default case as utf8. What are the other cases? And you also mention that smtpd should be emitting binary and unicode should be handled by the email package. |
I propose that we add a new keyword argument to SMTP's __init__, 'decode_data'. This would be set to True by default, and would preserve the current behavior of passing utf-8 decoded data to process_message. Setting it to True would mean that process_message would get passed binary (undecoded) data. In 3.5 we add this keyword, but we immediately deprecate 'decode_data=True'. In 3.6 we change the default to decode_data=False, and we deprecate the decode_data keyword. Then in 3.7 we drop the decode_data keyword. Now, as for implementation: what 'push' currently does (encode to ascii) is just fine for now. What we need to change is collect_incoming_data (where the decode happens) and found_terminator (where the data is passed to other parts of the class or its subclasses). When decode_data is False, collect_incoming_data should not decode. received_lines should be binary. Then, in found_terminator the else branch of the if can pass the binary received_lines into process_message (care will be needed to use the correct data types for the various operations). In the first branch of the if, though, when decode_data is False the data will now need to be decoded (still, I think, using utf-8) so that text can still be used to manipulate this part of the API, since unlike the message data it *is* conceptually text, just encoded as ASCII. (I suggest still decoding using utf-8 rather than ASCII because this will be useful when we implement RFC6531.) This will provide for the smallest number of needed changes to subclasses when converting to decode_data=False mode. |
Hi David, |
Sreepriya, are you still working on this issue? If no I'll be happy to take it over, is yes start with fixing following things:
|
Hi Maciej, |
Is this one likely to be included in 3.5? It effectively breaks smtpd so it would be good to see it working again. |
Yes, this will be fixed in 3.5 one way or another. |
I'll try to take care of this issue in the following few days. |
I'm attaching file issue19662_v1.patch. David please have a look at it and let me know if this is it, if not I'm waiting for your suggestions. |
Added review comments. |
I've implemented all your proposed changes, because for most of your changes I was thinking pretty the same way for the whole day today, to make the code more elegant. The current state of work is attached as issue19662_v2.patch |
I've included Leslie's comments in rst file. The 3rd version is attached in issue19662_v3.patch. |
New changeset 4e22213ca275 by R David Murray in branch 'default': |
Thanks, Maciej. I tweaked the patch a bit, you might want to take a look just for your own information. Mostly I fixed the warning stuff, which I didn't explain very well. The idea is that if the default is used (no value is specified), we want there to be a warning. But if a value *is* specified, there should be no warning (the user knows what they want). To accomplish that we make the actual default value None, and check for that. I also had to modify the tests so that warnings aren't issued, as well as test that they actually get issued when the default is used. I also added versionchanged directives and a whatsnew entry, and expanded the decode_data docs a bit. |
New changeset a6c846ec5fd3 by R David Murray in branch 'default': |
New changeset a7d3074fa888 by R David Murray in branch 'default': |
s/keword/keyword/ |
New changeset a3f2b171b765 by R David Murray in branch 'default': |
Thanks, Arfrever. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: