New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow UTF-8 characters (non-ASCII) in From: and To: fields in e-mail headers (RFC 6531) #867
Conversation
See the atext regexp in: https://tools.ietf.org/html/draft-seantek-mail-regexen-00#section-3.1.1 for an example.
See the atext regexp in: https://tools.ietf.org/html/draft-seantek-mail-regexen-00#section-3.1.1 for an example.
This patch was tested by @chrwen-omicron here: who tested sending e-mail with a From: header that contained non-ASCII characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some minor comments. unicode decode looks good
src/twisted/mail/smtp.py
Outdated
|
||
COMMAND, DATA, AUTH = 'COMMAND', 'DATA', 'AUTH' | ||
|
||
|
||
# Character classes for parsing addresses | ||
atom = br"[-A-Za-z0-9!\#$%&'*+/=?^_`{|}~]" | ||
atom = u"[-A-Za-z0-9!\#$%&'*+/=?^_`{|}~\u00a0-\U001000ff]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a reference link to where you did find the magic value \u00a0-\U001000ff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR does not implement RFC 6531, but it can be made to with relatively minor changes.
Please make those changes and resubmit for review.
src/twisted/mail/smtp.py
Outdated
@@ -533,7 +542,8 @@ def do_MAIL(self, rest): | |||
return | |||
# Clear old recipient list | |||
self._to = [] | |||
m = self.mail_re.match(rest) | |||
|
|||
m = self.mail_re.match(rest.decode("utf-8")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't comply with RFC 6531. A client may send UTF-8 encoded sequences if and only if an ESMTP server advertises the SMTPUTF8
extension;
This document applies when an SMTPUTF8-aware SMTP client or server
supports the SMTPUTF8 extension. For all other cases, and for
addresses and messages that do not require an SMTPUTF8 extension,
SMTPUTF8-aware SMTP clients and servers do not change the behavior
specified in RFC 5321 [RFC5321].
Where RFC 5321 obsoletes RFC 2821, the RFC called out in the previous implementation.
In this PR, the SMTP
and ESTMP
server always expects UTF-8 encoded messages without advertising SMTPUTF8
. This contravenes the RFC's requirements. Furthermore, the RFC only pertains to ESMTP, as SMTP does not support extensions.
The good news is this is easy to fix: first, instead of decoding as utf-8
, decode these strings as self._encoding
. SMTP._encoding
should be ascii
, and ESMTP._encoding
can be utf-8
. Then, add SMTPUTF8
to the extensions returned by ESMTP.extensions. This will make both SMTP and ESMTP compliant with my reading of the relevant RFCs.
src/twisted/mail/smtp.py
Outdated
@@ -1888,14 +1898,14 @@ def __init__(self, fromEmail, toEmail, file, deferred, retries=5, | |||
assert isinstance(retries, (int, long)) | |||
|
|||
if isinstance(toEmail, unicode): | |||
toEmail = [toEmail.encode('ascii')] | |||
toEmail = [toEmail.encode('utf-8')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also violates RFC 6531, because a a client may send UTF-8 encoded sequences if and only if an ESMTP server advertises the SMTPUTF8
extension.
This implementation has both SMTPSender
and ESMTPSender
always send UTF-8 encoded sequences, which violates the RFC. SMTPSender
should never send UTF-8, while ESMTPSender
should only send UTF-8 when ESMTP.esmtpState_serverConfig
encounters the SMTPUTF8
extension. ESMTPSenderFactory
can then encode its addresses as UTF-8, and raise an informative exception if the server doesn't advertise SMTPUTF8
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Third parties such as buildbot are creating ESMTPSenderFactory directly, which in turn calls SMTPSenderFactory.init().
SMTPSenderFactory.__init__()
is encoding the addresses. This is all done before the EHLO
message is sent to the server in order query the ESMTP extensions, such as SMTPUTF8. So SMTPSenderFactory is encoding the addresses before it receives SMTPUTF8.
Please clarify in the code how we can get this to work.
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class SMTPSenderFactory(...):
..
_encoding = 'utf-8'
...
def __init__(self, ...):
...
if isinstance(toEmail, unicode):
toEmail = [toEmail.encode(self._encoding)]
elif isinstance(toEmail, bytes):
toEmail = [toEmail]
else:
toEmailFinal = []
for _email in toEmail:
if not isinstance(_email, bytes):
_email = _email.encode(self._encoding)
toEmailFinal.append(_email)
toEmail = toEmailFinal
...
class ESMTPSenderFactory(SMTPSenderFactory):
...
_encoding = 'utf-8'
class ESMTPClient(SMTPClient):
...
def esmtpState_serverConfig(self, code, resp):
...
if b'SMTPUTF8' not in items:
self.sendError(UTF8Required(...))
SMTP clients still encode addresses as ASCII. ESMTP clients encode them preemptively as UTF-8, but if the server can't accept UTF-8 addresses, the Sender Factory's results Deferred errbacks with a new exception UTF8Required
. This gives clients an informative error message and shuts down the connection without violating the RFC and causing potentially undefined behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your example, did you mean to put:
class SMTPSenderFactory(...):
..
_encoding = 'ascii'
...
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the case of ESMTP, if the ESMTPClient encodes addresses as unicode,
but they fall in the ascii range, should the client still throw an exception, or should
it just change the encoding to ascii, and re-encode and send it along?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rodrigc Whoops, yeah, I meant UTF-8 for SMTPSenderFactory
. It might be neat to try to re-encode the address as ASCII, but it seems kind of tricky and would be worth a log message certainly.
It'd be better to defer encoding until the extensions have come in but it will require a fair amount of refactoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if b'SMTPUTF8' is not supported by the server, would the exception be bettter named something like UTF8NotSupported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concern I have is that for people who are now using ESMTPClient with ascii addresses against
older servers which do not support the SMTPUTF8 option, then things go through. Throwing this exception will be new unexpected behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe what I can do is modify the ESMTPClient so that if SMTPUTF8
is not in the options on the server, it inspects the bytes of the addresses,
and if any non-ASCII characters are in the address, it throws an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that In the case of Buildbot we want to anyway send that email, even if we have to replace non-ascii characters by ascii equivalent. Probably most of the client would have same need, but some other want to error out i that case, or at least warn in a UI. So we should probably raise that exception, and document an easy and portable way to translate that email into somthing that can be send via that old smtp server
Codecov Report
@@ Coverage Diff @@
## trunk #867 +/- ##
==========================================
- Coverage 91.91% 91.84% -0.08%
==========================================
Files 843 840 -3
Lines 149214 148664 -550
Branches 13073 12983 -90
==========================================
- Hits 137156 136545 -611
- Misses 9959 10010 +51
- Partials 2099 2109 +10 |
and in ESMTPSenderFactory, encode addresses with utf-8
within the encoding specified by self._encoding, raise an error.
OK, I pushed a patch which works. It does the following:
|
I used the following test program:
and tested against two mail servers on FreeBSD:
With Postfix, when I sent an e-mail with non-ASCII UTF-8 characters in the address, |
src/twisted/mail/_except.py
Outdated
else: | ||
def __str__(self): | ||
return self.__bytes__() | ||
|
||
|
||
def __bytes__(self): | ||
if self.code > 0: | ||
res = [networkString("%.3d %s" % (self.code, self.resp))] | ||
res = [u"{:03d} {}".format(self.code, self.resp).encode("utf-8")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this UTF-8? This is used by SMTPClient
, so it shouldn't be UTF-8. This could using self._encoding
to allow ESMTPClientError
to inherit from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how this can be done easily. These exceptions don't have access to self._encoding, so every single exception inheriting from SMTPClientError needs to be modified to take an encoding argument in its constructor....and there are lots.
I'm relying on this code in SMTPClient.sendLine() to fail if any invalid characters are in the stream:
try:
line.decode(self._encoding)
except UnicodeDecodeError:
self.sendError(
SMTPProtocolError(-1,
"Server supports {} encoding, invalid "
"characters detected in sent line:\n{} .".format(
self._encoding, repr(line)),
self.log.str()))
return
I've used that successfully to test sending invalid ASCII characters to a Sendmail server which does not support SMTPUTF8.
else: | ||
res = [self.resp] | ||
if isinstance(self.resp, unicode): | ||
res = [self.resp.encode("utf-8")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above - this shouldn't be UTF-8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above
src/twisted/mail/smtp.py
Outdated
@@ -175,24 +175,32 @@ def quoteaddr(addr): | |||
return b'<' + bytes(addr) + b'>' | |||
|
|||
if isinstance(addr, bytes): | |||
addr = addr.decode('ascii') | |||
addr = addr.decode('utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can't be universally UTF-8 because it's also used by SMTPClient
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
src/twisted/mail/smtp.py
Outdated
# It didn't parse, use it as-is | ||
return b'<' + bytes(addr) + b'>' | ||
return b'<' + str(addr).encode('utf-8') + b'>' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
src/twisted/mail/smtp.py
Outdated
else: | ||
return b'<' + res[1].encode('ascii') + b'>' | ||
return b'<' + res[1].encode('utf-8') + b'>' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
src/twisted/mail/smtp.py
Outdated
@@ -574,7 +586,7 @@ def do_RCPT(self, rest): | |||
if not self._from: | |||
self.sendCode(503, b"Must have sender before recipient") | |||
return | |||
m = self.rcpt_re.match(rest) | |||
m = self.rcpt_re.match(rest.decode("utf-8")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be self._encoding
, not "utf-8"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
src/twisted/mail/smtp.py
Outdated
@@ -919,6 +931,8 @@ class SMTPClient(basic.LineReceiver, policies.TimeoutMixin): | |||
# None, perform no timeout checking. | |||
timeout = None | |||
|
|||
_encoding = "ascii" | |||
|
|||
def __init__(self, identity, logsize=10): | |||
if isinstance(identity, unicode): | |||
identity = identity.encode('ascii') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be self._encoding
, not "ascii"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -1791,7 +1826,7 @@ class SenderMixin: | |||
def getMailFrom(self): | |||
if not self.done: | |||
self.done = 1 | |||
return str(self.factory.fromEmail) | |||
return unicode(self.factory.fromEmail) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to return the repr of b'some@address.com'
on Python 3, and decode self.factory.fromEmail
as ASCII on Python 2. This should probably be unicode(self.factory.fromEmail, encoding=self._encoding)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't work:
File "/Users/crodrigues/twisted_15/src/twisted/mail/smtp.py", line 1829, in getMailFrom
return unicode(self.factory.fromEmail, encoding=self._encoding)
builtins.TypeError: decoding to str: need a bytes-like object, Address found
src/twisted/mail/test/test_smtp.py
Outdated
if _PY3: | ||
self.assertEqual(str(err), u"\N{SNOWMAN} smiling") | ||
else: | ||
self.assertEqual(str(err), b"\xe2\x98\x83 smiling") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be better:
expected = u"\N{SNOWMAN} smiling"
if _PY3:
errorAsString = str(err, encoding='utf-8')
else:
errorAsString = str(err)
expected.encode('utf-8')
self.assertEqual(stringError, expected)
because it is less redundant, makes it clear that the final result should have been encoded as UTF-8, and doesn't depend on an implicit default of UTF-8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't work:
Traceback (most recent call last):
File "/Users/crodrigues/twisted_15/src/twisted/mail/test/test_smtp.py", line 1523, in test_strWithNegativeCode
errorAsString = str(err, encoding='utf-8')
builtins.TypeError: decoding to str: need a bytes-like object, SMTPClientError found
src/twisted/mail/test/test_smtp.py
Outdated
# Try to send UTF-8 text when client encoding is ascii | ||
# will result in an error which causes the transport to disconnect. | ||
self.clientProtocol.sendLine(u"RCPT TO:<прив@example.com>".encode("utf-8")) | ||
self.assertTrue(self.clientProtocol.transport.disconnecting) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to assert that an exception is raised?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't figure out how to get the exception here, but I found that self.clientProtocol.transport.disconnecting is set to True on failure, due to the exception thrown when trying to decode().
@jlquinn was able to successfully test this branch of Twisted as part of a buildbot install here: |
http://twistedmatrix.com/trac/ticket/9250
Original problem reported in:
buildbot/buildbot#3499