smtpd.py should not decode utf-8 #63861

lpolzer · 2013-11-20T10:51:44Z

BPO	19662
Nosy	@warsaw, @vstinner, @bitdancer, @soltysh, @zvyn
Files
	smtpd_charset_latin1.diff: Make smtpd.py use latin1 instead of utf-8 as default decoding.
	python3.3-lib-smtpd-patch.diff: move utf-8 decode to the end of line rcv process
	switch_while_decode1.patch: Patch to switch between utf8 and binary decode with decode_data variable
	switch_while_decode2.patch: Switch between utf8 and binary decode based on decode_data var
	issue19662_v1.patch: decode_data extension for smptd (patch v1)
	issue19662_v2.patch: decode_data extension for smptd (patch v2)
	issue19662_v3.patch: decode_data extension for smptd (patch v3)

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2014-06-11.15:25:23.243>
created_at = <Date 2013-11-20.10:51:43.730>
labels = ['type-feature', 'library', 'expert-email']
title = 'smtpd.py should not decode utf-8'
updated_at = <Date 2015-05-19.11:19:16.216>
user = 'https://bugs.python.org/lpolzer'

bugs.python.org fields:

activity = <Date 2015-05-19.11:19:16.216>
actor = 'r.david.murray'
assignee = 'none'
closed = True
closed_date = <Date 2014-06-11.15:25:23.243>
closer = 'r.david.murray'
components = ['Library (Lib)', 'email']
creation = <Date 2013-11-20.10:51:43.730>
creator = 'lpolzer'
dependencies = []
files = ['32719', '32861', '34700', '34704', '35390', '35404', '35409']
hgrepos = []
issue_num = 19662
keywords = ['patch']
message_count = 29.0
messages = ['203467', '203473', '203477', '203488', '203496', '203497', '204527', '204540', '210431', '210433', '213897', '214010', '215375', '216843', '217135', '218888', '218899', '218900', '219308', '219353', '219363', '219382', '220278', '220279', '220284', '243348', '243564', '243579', '243580']
nosy_count = 13.0
nosy_names = ['barry', 'richard', 'vstinner', 'Arfrever', 'r.david.murray', 'jesstess', 'python-dev', 'maciej.szulik', 'lpolzer', 'Illirgway', 'Duke.Dougal', 'zvyn', 'sreepriya']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue19662'
versions = ['Python 3.5']

lpolzer · 2013-11-20T10:51:43Z

smtpd.py (http://hg.python.org/cpython/file/3.3/Lib/smtpd.py#l289) currently decodes incoming bytes as UTF-8.

An SMTP server must not attempt to interpret characters beyond ASCII, however. Originally mail servers were not 8-bit clean, meaning they would only guarantee the lower 7 bits of each octet to be preserved.
However even then they were not expected to choke on any input because of attempts to decode it into a specific extended charset. Whenever a mail server does not need to interpret data (like base64-encoded auth information) it is simply left alone and passed through.

I am not aware of the reasons that caused the current state, but to correct this behavior and make it possible to support the 8BITMIME feature I suggest decoding received bytes as latin1, leaving it to the user to reinterpret it as UTF-8 or whatever charset they need. Any other simple extended encoding could be used for this, but latin1 is the default in asynchat.
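The latin1 suggestion relies on the fact that latin1 maps each of the 256 byte values to exactly one code point, so decoding cannot fail and re-encoding restores the original bytes. A minimal sketch of that property follows; the sample payload and variable names are made up for illustration.

```python
# Every byte value 0-255 maps to exactly one code point in latin-1,
# so decoding never fails and encoding restores the original bytes.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
assert as_text.encode("latin-1") == all_bytes

# A payload that is really UTF-8 can be reinterpreted later by the user.
payload = "Grüße".encode("utf-8")          # illustrative sample data
received = payload.decode("latin-1")       # what a latin-1 server would hand over
recovered = received.encode("latin-1").decode("utf-8")
assert recovered == "Grüße"
```

The intermediate `received` string is mojibake until it is re-encoded, which is why the proposal leaves reinterpretation to the user.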

The documentation should also mention charset handling. I'll be happy to submit a patch for both code and docs.

lpolzer · 2013-11-20T12:48:27Z

Patch attached. This also adds some more charset clarification to the docs and corrects a minor spelling issue.

It is also conceivable that we add a charset attribute to the class. This should have the safe default of latin1, and some notes in the docs that setting this to utf-8 (and probably other utf-* encodings) is not really standards-compliant.

bitdancer · 2013-11-20T13:53:53Z

This bug was apparently introduced as part of the work from bpo-4184 in python 3.2. My guess, looking at the code, is that the module simply didn't work before that patch, since it would have been attempting to join binary data using a string join (''.join(...)). Richard says in the issue that he wrote tests, so he probably figured out it wasn't working and "fixed" it. It looks like there was no final review of his patch (at least not via the tracker...the patch uploaded to the tracker did not include the decode). Not that a final review would necessarily have caught the bug...
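The guess about the pre-patch code can be checked in isolation: in Python 3, joining bytes chunks with a str separator raises TypeError, while a bytes separator works. The chunk contents below are invented for illustration and are not taken from smtpd.py.

```python
chunks = [b"MAIL FROM:", b"<a@example.com>"]  # illustrative data

try:
    "".join(chunks)          # str.join rejects bytes items in Python 3
    str_join_failed = False
except TypeError:
    str_join_failed = True

joined = b"".join(chunks)    # a bytes separator works on bytes chunks
```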

The problem here is backward compatibility.

In terms of the API, it really ought to be producing binary data, and not decoding at all. But, at the time he wrote the patch the email package couldn't handle binary data (Richard's patch landed in July 2010, binary support in the email package landed in October), so presumably nobody was thinking about binary emails.

I'm really not sure what to do here, I'll have to give it some thought.

lpolzer · 2013-11-20T15:02:42Z

Since this is my first contribution I'm not entirely sure about the fine details of backwards compatibility in Python, so please forgive me if I'm totally missing the mark here.

There are facilities in smtpd's parent class asynchat that perform the necessary conversions automatically if the user sets an encoding, so smtpd should be adjusted to rely on that and thus give the user the opportunity to choose for themselves.

Then it boils down to breaking backwards compatibility by setting a default encoding, which could be none as you suggest or latin1 as I suggest; either will probably be painful for current users.

My take here is that whoever is using this code for their SMTP server and hasn't given the encoding issues any thought will need to take a look at their code in that respect anyway, so IMHO a break with compatibility might be a bit painful but necessary.

If you agree then I will gladly rework the patch to have smtpd work with an underlying byte stream by default, rejecting anything non-ASCII where necessary.

Later patches could bring 8BITMIME support to smtpd, with charset conversion as specified by the MIME metadata.

bitdancer · 2013-11-20T16:06:33Z

I think the only backward compatible solution is to add a switch of *some* sort (exact API TBD), whose default is to continue to decode using utf-8, and document it as wrong.

Conversion of an email to unicode should be handled by the email package, not by smtpd, which is why I say smtpd should be emitting binary.

As I say, I need to find time to look at the current API in more detail before I'll be comfortable discussing the new API. I've put it on my list, but likely I won't get to it until the weekend.

bitdancer · 2013-11-20T16:10:56Z

Oh, and to clarify: the backward compatibility is that if code works with X.Y.Z, it should work with X.Y.Z+1. So even though correctly handling binary mail would indeed require someone to reexamine their code, if things happen to be working OK for them (eg: their program only needs to handle utf-8 email), we don't want to break their working program.

Illirgway · 2013-11-26T20:29:07Z

Here is another patch for fixing this issue:

Illirgway/cpython@12d7c59

Sorry for my bad English

bitdancer · 2013-11-26T22:22:49Z

As I said, the decoding needs to be controlled by a switch (presumably a keyword argument to SMTPServer) that defaults to the present (incorrect) behavior.

DukeDougal · 2014-02-07T01:58:06Z

Is there a workaround for this? I'd like to just receive binary data from smtpd. I'm new to this system; is this scheduled for fixing in Python 3.4?

bitdancer · 2014-02-07T02:38:31Z

Unfortunately I did not get to this before the 3.4 beta release, so no, it won't be fixed in 3.4.

You can work around it by overriding collect_incoming_data in your subclass and doing data.decode('ascii', 'surrogateescape') instead of str(data, 'utf-8'), and then doing mydata.encode('ascii', 'surrogateescape') at the point where you want to turn the data back into binary.
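The round trip behind this workaround can be sketched without smtpd itself. The sample line below is an assumption chosen to contain a byte that is not valid UTF-8; only the two codec calls come from the comment above.

```python
raw = b"Subject: caf\xe9\r\n"   # 0xE9 followed by CR is not valid UTF-8

# The existing behaviour: decoding as UTF-8 fails on such input.
try:
    str(raw, "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The workaround: non-ASCII bytes become lone surrogates in the str...
text = raw.decode("ascii", "surrogateescape")
# ...and the same error handler turns them back into the original bytes.
restored = text.encode("ascii", "surrogateescape")
```

Because the escaped characters are lone surrogates, `text` should be converted back to bytes before it is written out or handed to code that expects well-formed Unicode.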

sreepriya · 2014-03-17T21:17:35Z

Hi David,

I would like to work on this bug. Can you give some more insight into the main issue? As far as I understand, the SMTP server currently decodes the incoming bytes as UTF-8. Why do you say that this is not the right way? Can you give some idea of the right convention? Also, you mention a solution with a switch whose default case is utf-8. What are the other cases? And you also mention that smtpd should be emitting binary and that unicode should be handled by the email package.
But is it possible to make that change now, given that other functions depending on this might be affected?

bitdancer · 2014-03-18T19:48:57Z

I propose that we add a new keyword argument to SMTP's __init__, 'decode_data'. This would be set to True by default, and would preserve the current behavior of passing utf-8 decoded data to process_message.

Setting it to False would mean that process_message would get passed binary (undecoded) data.

In 3.5 we add this keyword, but we immediately deprecate 'decode_data=True'. In 3.6 we change the default to decode_data=False, and we deprecate the decode_data keyword. Then in 3.7 we drop the decode_data keyword.

Now, as for implementation: what 'push' currently does (encode to ascii) is just fine for now. What we need to change is collect_incoming_data (where the decode happens) and found_terminator (where the data is passed to other parts of the class or its subclasses).

When decode_data is False, collect_incoming_data should not decode. received_lines should be binary. Then, in found_terminator the else branch of the if can pass the binary received_lines into process_message (care will be needed to use the correct data types for the various operations). In the first branch of the if, though, when decode_data is False the data will now need to be decoded (still, I think, using utf-8) so that text can still be used to manipulate this part of the API, since unlike the message data it *is* conceptually text, just encoded as ASCII. (I suggest still decoding using utf-8 rather than ASCII because this will be useful when we implement RFC6531.) This will provide for the smallest number of needed changes to subclasses when converting to decode_data=False mode.
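The split described here can be sketched with a stand-in class. This is not the real smtpd channel: SketchChannel, its state constants, and the commands and messages lists are hypothetical names used only to show where decoding would and would not happen.

```python
class SketchChannel:
    """Hypothetical stand-in for the channel; not the real smtpd class."""

    COMMAND, DATA = range(2)

    def __init__(self, decode_data=True):
        self.decode_data = decode_data
        self.state = self.COMMAND
        self.received_lines = []
        self.commands = []   # command lines seen, always text
        self.messages = []   # what process_message would receive

    def collect_incoming_data(self, data):
        # Decode only in the legacy mode; otherwise keep the raw bytes.
        if self.decode_data:
            data = str(data, "utf-8")
        self.received_lines.append(data)

    def found_terminator(self):
        empty = "" if self.decode_data else b""
        line = empty.join(self.received_lines)
        self.received_lines = []
        if self.state == self.COMMAND:
            # Commands are conceptually text, so decode them here
            # when collect_incoming_data left them as bytes.
            if not self.decode_data:
                line = str(line, "utf-8")
            self.commands.append(line)
            if line.upper() == "DATA":
                self.state = self.DATA
        else:
            # Message body: passed on unchanged (bytes when decode_data=False).
            self.messages.append(line)
            self.state = self.COMMAND


channel = SketchChannel(decode_data=False)
channel.collect_incoming_data(b"DATA")
channel.found_terminator()
channel.collect_incoming_data(b"caf\xe9 body")   # not valid UTF-8
channel.found_terminator()
```

With decode_data=False the command arrives as the str "DATA" while the body stays as bytes, so command-handling code in subclasses would not need to change.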

sreepriya · 2014-04-02T11:39:07Z

Hi David,
The variable decode_data is included to control decoding. But I am not sure what needs to be done when calling process_message inside found_terminator when the data is binary. How should binary data be handled there? Can you tell me which data types are involved for binary data?

soltysh · 2014-04-19T05:07:12Z

Sreepriya, are you still working on this issue? If not, I'll be happy to take it over; if yes, start by fixing the following things:

- Start with tests: it is most important to have each feature tested.
- decode_data, as David mentioned, needs to have the default value True, meaning that __init__ should look like this:
  def __init__(self, server, conn, addr, data_size_limit=DATA_SIZE_DEFAULT, map=None, decode_data=True)
  Assigning True in __init__ will make this value always True, and that's not the point.
- Add a deprecation warning about this parameter using the warnings module:
  warnings.warn('decode_data=True is deprecated, data will not be decoded by default', DeprecationWarning, 2)
- As for the found_terminator method, what David means is to decode data in the first if, where commands are checked, to simplify processing of this part (David, please correct me if I'm wrong), and not what you did.
- Finally, you need to update the docs to include the decode_data parameter, with information about how it works and its deprecation.

sreepriya · 2014-04-24T18:44:59Z

Hi Maciej,
I am travelling now, so there may be some delay before I can work on this! I got to know that you are working on RFC 6532. You might take this up and fix it, as it is related to your work and I don't want to create delays.

DukeDougal · 2014-05-21T22:22:01Z

Is this one likely to be included in 3.5? It effectively breaks smtpd so it would be good to see it working again.

bitdancer · 2014-05-22T15:21:27Z

Yes, this will be fixed in 3.5 one way or another.

soltysh · 2014-05-22T15:23:33Z

I'll try to take care of this issue in the following few days.

soltysh · 2014-05-28T21:53:00Z

I'm attaching file issue19662_v1.patch. David please have a look at it and let me know if this is it, if not I'm waiting for your suggestions.

bitdancer · 2014-05-29T17:31:39Z

Added review comments.

soltysh · 2014-05-29T20:35:59Z

I've implemented all your proposed changes; for most of them I had been thinking along pretty much the same lines all day today, to make the code more elegant. The current state of work is attached as issue19662_v2.patch

soltysh · 2014-05-30T10:21:28Z

I've included Leslie's comments in rst file. The 3rd version is attached in issue19662_v3.patch.

python-dev · 2014-06-11T15:18:34Z

New changeset 4e22213ca275 by R David Murray in branch 'default':
bpo-19662: add decode_data to smtpd so you can get at DATA in bytes form.
http://hg.python.org/cpython/rev/4e22213ca275

bitdancer · 2014-06-11T15:25:23Z

Thanks, Maciej.

I tweaked the patch a bit, you might want to take a look just for your own information. Mostly I fixed the warning stuff, which I didn't explain very well. The idea is that if the default is used (no value is specified), we want there to be a warning. But if a value *is* specified, there should be no warning (the user knows what they want). To accomplish that we make the actual default value None, and check for that. I also had to modify the tests so that warnings aren't issued, as well as test that they actually get issued when the default is used.
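The None-as-default idea can be sketched on its own. SketchServer and the warning text below are hypothetical and do not reproduce the committed code; the sketch only shows that a warning is raised when the caller relies on the default and not when a value is passed.

```python
import warnings


class SketchServer:
    """Hypothetical class illustrating the default-is-None pattern."""

    def __init__(self, decode_data=None):
        if decode_data is None:
            # Only callers who relied on the default see the warning.
            warnings.warn(
                "the default of decode_data will change; "
                "pass decode_data explicitly",      # hypothetical wording
                DeprecationWarning, 2)
            decode_data = True
        self.decode_data = decode_data


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = SketchServer()                   # warns: default was used
    explicit = SketchServer(decode_data=True)   # silent: caller chose a value
```

A plain `decode_data=True` default could not distinguish these two calls, which is why the sentinel is needed.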

I also added versionchanged directives and a whatsnew entry, and expanded the decode_data docs a bit.

python-dev · 2014-06-11T16:27:58Z

New changeset a6c846ec5fd3 by R David Murray in branch 'default':
bpo-19662: Eliminate warnings in other test modules that use smtpd.
http://hg.python.org/cpython/rev/a6c846ec5fd3

python-dev · 2015-05-16T18:18:26Z

New changeset a7d3074fa888 by R David Murray in branch 'default':
bpo-19662: Make requirement to support arbitrary keywords explicit.
https://hg.python.org/cpython/rev/a7d3074fa888

Arfrever · 2015-05-19T08:00:16Z

New changeset a7d3074fa888 by R David Murray in branch 'default':
bpo-19662: Make requirement to support arbitrary keywords explicit.
https://hg.python.org/cpython/rev/a7d3074fa888

s/keword/keyword/

python-dev · 2015-05-19T11:18:54Z

New changeset a3f2b171b765 by R David Murray in branch 'default':
bpo-19662: fix typo
https://hg.python.org/cpython/rev/a3f2b171b765

bitdancer · 2015-05-19T11:19:16Z

Thanks, Arfrever.

lpolzer mannequin added the stdlib and type-feature labels Nov 20, 2013

bitdancer added the topic-email label Nov 20, 2013

bitdancer closed this as completed Jun 11, 2014

ezio-melotti transferred this issue from another repository Apr 10, 2022
