Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow UTF-8 characters (non-ASCII) in From: and To: fields in e-mail headers (RFC 6531) #867

Closed
wants to merge 58 commits into from

Conversation

rodrigc
Copy link
Contributor

@rodrigc rodrigc commented Aug 16, 2017

@rodrigc rodrigc changed the title Allow UTF-8 characters (non-ASCII) in From: and To: fields in e-mail headers Allow UTF-8 characters (non-ASCII) in From: and To: fields in e-mail headers (RFC 6531) Aug 16, 2017
@rodrigc
Copy link
Contributor Author

rodrigc commented Aug 16, 2017

This patch was tested by @chrwen-omicron here:
buildbot/buildbot#3499 (comment)

who tested sending e-mail with a From: header that contained non-ASCII characters.

Copy link
Contributor

@tardyp tardyp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor comments. unicode decode looks good


COMMAND, DATA, AUTH = 'COMMAND', 'DATA', 'AUTH'


# Character classes for parsing addresses
atom = br"[-A-Za-z0-9!\#$%&'*+/=?^_`{|}~]"
atom = u"[-A-Za-z0-9!\#$%&'*+/=?^_`{|}~\u00a0-\U001000ff]"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a reference link to where you did find the magic value \u00a0-\U001000ff

Copy link
Member

@markrwilliams markrwilliams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR does not implement RFC 6531, but it can be made to with relatively minor changes.

Please make those changes and resubmit for review.

@@ -533,7 +542,8 @@ def do_MAIL(self, rest):
return
# Clear old recipient list
self._to = []
m = self.mail_re.match(rest)

m = self.mail_re.match(rest.decode("utf-8"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't comply with RFC 6531. A client may send UTF-8 encoded sequences if and only if an ESMTP server advertises the SMTPUTF8 extension;

This document applies when an SMTPUTF8-aware SMTP client or server
supports the SMTPUTF8 extension. For all other cases, and for
addresses and messages that do not require an SMTPUTF8 extension,
SMTPUTF8-aware SMTP clients and servers do not change the behavior
specified in RFC 5321
[RFC5321].

Where RFC 5321 obsoletes RFC 2821, the RFC called out in the previous implementation.

In this PR, the SMTP and ESTMP server always expects UTF-8 encoded messages without advertising SMTPUTF8. This contravenes the RFC's requirements. Furthermore, the RFC only pertains to ESMTP, as SMTP does not support extensions.

The good news is this is easy to fix: first, instead of decoding as utf-8, decode these strings as self._encoding. SMTP._encoding should be ascii, and ESMTP._encoding can be utf-8. Then, add SMTPUTF8 to the extensions returned by ESMTP.extensions. This will make both SMTP and ESMTP compliant with my reading of the relevant RFCs.

@@ -1888,14 +1898,14 @@ def __init__(self, fromEmail, toEmail, file, deferred, retries=5,
assert isinstance(retries, (int, long))

if isinstance(toEmail, unicode):
toEmail = [toEmail.encode('ascii')]
toEmail = [toEmail.encode('utf-8')]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also violates RFC 6531, because a a client may send UTF-8 encoded sequences if and only if an ESMTP server advertises the SMTPUTF8 extension.

This implementation has both SMTPSender and ESMTPSender always send UTF-8 encoded sequences, which violates the RFC. SMTPSender should never send UTF-8, while ESMTPSender should only send UTF-8 when ESMTP.esmtpState_serverConfig encounters the SMTPUTF8 extension. ESMTPSenderFactory can then encode its addresses as UTF-8, and raise an informative exception if the server doesn't advertise SMTPUTF8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Third parties such as buildbot are creating ESMTPSenderFactory directly, which in turn calls SMTPSenderFactory.init().
SMTPSenderFactory.__init__() is encoding the addresses. This is all done before the EHLO message is sent to the server in order query the ESMTP extensions, such as SMTPUTF8. So SMTPSenderFactory is encoding the addresses before it receives SMTPUTF8.

Please clarify in the code how we can get this to work.
Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rodrigc

class SMTPSenderFactory(...):
     ..
    _encoding = 'utf-8'
    ...
    def __init__(self, ...):
        ...
        if isinstance(toEmail, unicode):
            toEmail = [toEmail.encode(self._encoding)]
        elif isinstance(toEmail, bytes):
            toEmail = [toEmail]
        else:
            toEmailFinal = []
            for _email in toEmail:
                if not isinstance(_email, bytes):
                    _email = _email.encode(self._encoding)

                toEmailFinal.append(_email)
            toEmail = toEmailFinal
        ...
class ESMTPSenderFactory(SMTPSenderFactory):
    ...
    _encoding = 'utf-8'
class ESMTPClient(SMTPClient):
    ...
    def esmtpState_serverConfig(self, code, resp):
        ...
        if b'SMTPUTF8' not in items:
            self.sendError(UTF8Required(...))

SMTP clients still encode addresses as ASCII. ESMTP clients encode them preemptively as UTF-8, but if the server can't accept UTF-8 addresses, the Sender Factory's results Deferred errbacks with a new exception UTF8Required. This gives clients an informative error message and shuts down the connection without violating the RFC and causing potentially undefined behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example, did you mean to put:

class SMTPSenderFactory(...):
     ..
    _encoding = 'ascii'
    ...

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the case of ESMTP, if the ESMTPClient encodes addresses as unicode,
but they fall in the ascii range, should the client still throw an exception, or should
it just change the encoding to ascii, and re-encode and send it along?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rodrigc Whoops, yeah, I meant UTF-8 for SMTPSenderFactory. It might be neat to try to re-encode the address as ASCII, but it seems kind of tricky and would be worth a log message certainly.

It'd be better to defer encoding until the extensions have come in but it will require a fair amount of refactoring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if b'SMTPUTF8' is not supported by the server, would the exception be bettter named something like UTF8NotSupported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern I have is that for people who are now using ESMTPClient with ascii addresses against
older servers which do not support the SMTPUTF8 option, then things go through. Throwing this exception will be new unexpected behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe what I can do is modify the ESMTPClient so that if SMTPUTF8
is not in the options on the server, it inspects the bytes of the addresses,
and if any non-ASCII characters are in the address, it throws an exception.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that In the case of Buildbot we want to anyway send that email, even if we have to replace non-ascii characters by ascii equivalent. Probably most of the client would have same need, but some other want to error out i that case, or at least warn in a UI. So we should probably raise that exception, and document an easy and portable way to translate that email into somthing that can be send via that old smtp server

@codecov
Copy link

codecov bot commented Sep 7, 2017

Codecov Report

Merging #867 into trunk will decrease coverage by 0.07%.
The diff coverage is 96.2%.

@@            Coverage Diff             @@
##            trunk     #867      +/-   ##
==========================================
- Coverage   91.91%   91.84%   -0.08%     
==========================================
  Files         843      840       -3     
  Lines      149214   148664     -550     
  Branches    13073    12983      -90     
==========================================
- Hits       137156   136545     -611     
- Misses       9959    10010      +51     
- Partials     2099     2109      +10

@rodrigc
Copy link
Contributor Author

rodrigc commented Sep 10, 2017

OK, I pushed a patch which works. It does the following:

  1. In SMTPClient, self._encoding is set to "ascii"
  2. In ESMTPClient, self._encoding is set to "utf-8".
  3. In ESMTPClient, after sending the EHLO, it looks for SMTPUTF8 as an advertised option. If the option is not there, self._encoding is set to "ascii", and an message is logged.
  4. In SMTPClient.sendLine(), it tries to verify that the bytes being sent comply with self._encoding. If not, the connection is closed and an error is raised.

@rodrigc
Copy link
Contributor Author

rodrigc commented Sep 10, 2017

I used the following test program:

from __future__ import print_function
from twisted.mail.smtp import sendmail
from twisted.python.log import startLogging
from twisted.internet.task import react

startLogging(open("/tmp/foobar.txt", "w"))

def main(reactor):
    d = sendmail(smtphost="192.168.1.13", from_addr=u"\N{SNOWMAN}foo@twistedmatrix.com", to_addrs=[u"rodrigc@dibbler.crodrigues.org"],
             msg=u"UNGA BUNGA".encode("utf-8"),
             senderDomainName=None,
             port=25,
             reactor=reactor,
             username=None,
             password=None,
             requireAuthentication=False,
             requireTransportSecurity=False)
    return d

react(main)

and tested against two mail servers on FreeBSD:

  1. Postfix 3.2.2 with SMTPUTF8 offered
  2. Sendmail 8.15.2 without SMTPUTF8 offered

With Postfix, when I sent an e-mail with non-ASCII UTF-8 characters in the address,
the email went through. With Sendmail, the client raised an error and closed the connection.

else:
def __str__(self):
return self.__bytes__()


def __bytes__(self):
if self.code > 0:
res = [networkString("%.3d %s" % (self.code, self.resp))]
res = [u"{:03d} {}".format(self.code, self.resp).encode("utf-8")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this UTF-8? This is used by SMTPClient, so it shouldn't be UTF-8. This could using self._encoding to allow ESMTPClientError to inherit from it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this can be done easily. These exceptions don't have access to self._encoding, so every single exception inheriting from SMTPClientError needs to be modified to take an encoding argument in its constructor....and there are lots.

I'm relying on this code in SMTPClient.sendLine() to fail if any invalid characters are in the stream:

        try:
            line.decode(self._encoding)
        except UnicodeDecodeError:
            self.sendError(
                SMTPProtocolError(-1,
                    "Server supports {} encoding, invalid "
                    "characters detected in sent line:\n{} .".format(
                        self._encoding, repr(line)),
                    self.log.str()))
            return

I've used that successfully to test sending invalid ASCII characters to a Sendmail server which does not support SMTPUTF8.

else:
res = [self.resp]
if isinstance(self.resp, unicode):
res = [self.resp.encode("utf-8")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above - this shouldn't be UTF-8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above

@@ -175,24 +175,32 @@ def quoteaddr(addr):
return b'<' + bytes(addr) + b'>'

if isinstance(addr, bytes):
addr = addr.decode('ascii')
addr = addr.decode('utf-8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't be universally UTF-8 because it's also used by SMTPClient, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

# It didn't parse, use it as-is
return b'<' + bytes(addr) + b'>'
return b'<' + str(addr).encode('utf-8') + b'>'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

else:
return b'<' + res[1].encode('ascii') + b'>'
return b'<' + res[1].encode('utf-8') + b'>'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -574,7 +586,7 @@ def do_RCPT(self, rest):
if not self._from:
self.sendCode(503, b"Must have sender before recipient")
return
m = self.rcpt_re.match(rest)
m = self.rcpt_re.match(rest.decode("utf-8"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be self._encoding, not "utf-8".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -919,6 +931,8 @@ class SMTPClient(basic.LineReceiver, policies.TimeoutMixin):
# None, perform no timeout checking.
timeout = None

_encoding = "ascii"

def __init__(self, identity, logsize=10):
if isinstance(identity, unicode):
identity = identity.encode('ascii')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be self._encoding, not "ascii".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -1791,7 +1826,7 @@ class SenderMixin:
def getMailFrom(self):
if not self.done:
self.done = 1
return str(self.factory.fromEmail)
return unicode(self.factory.fromEmail)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to return the repr of b'some@address.com' on Python 3, and decode self.factory.fromEmail as ASCII on Python 2. This should probably be unicode(self.factory.fromEmail, encoding=self._encoding).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't work:

          File "/Users/crodrigues/twisted_15/src/twisted/mail/smtp.py", line 1829, in getMailFrom
            return unicode(self.factory.fromEmail, encoding=self._encoding)
        builtins.TypeError: decoding to str: need a bytes-like object, Address found

if _PY3:
self.assertEqual(str(err), u"\N{SNOWMAN} smiling")
else:
self.assertEqual(str(err), b"\xe2\x98\x83 smiling")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be better:

expected = u"\N{SNOWMAN} smiling"
if _PY3:
    errorAsString = str(err, encoding='utf-8')
else:
    errorAsString = str(err)
    expected.encode('utf-8')
self.assertEqual(stringError, expected)

because it is less redundant, makes it clear that the final result should have been encoded as UTF-8, and doesn't depend on an implicit default of UTF-8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't work:

Traceback (most recent call last):
  File "/Users/crodrigues/twisted_15/src/twisted/mail/test/test_smtp.py", line 1523, in test_strWithNegativeCode
    errorAsString = str(err, encoding='utf-8')
builtins.TypeError: decoding to str: need a bytes-like object, SMTPClientError found

# Try to send UTF-8 text when client encoding is ascii
# will result in an error which causes the transport to disconnect.
self.clientProtocol.sendLine(u"RCPT TO:<прив@example.com>".encode("utf-8"))
self.assertTrue(self.clientProtocol.transport.disconnecting)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to assert that an exception is raised?

Copy link
Contributor Author

@rodrigc rodrigc Sep 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't figure out how to get the exception here, but I found that self.clientProtocol.transport.disconnecting is set to True on failure, due to the exception thrown when trying to decode().

@rodrigc
Copy link
Contributor Author

rodrigc commented Sep 21, 2017

@jlquinn was able to successfully test this branch of Twisted as part of a buildbot install here:

buildbot/buildbot#3633 (comment)

@rodrigc rodrigc closed this Jan 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants